Implementing thread-local storage in custom libc - c

I'm implementing a small subset of libc for very small and statically linked programs, and I figured that adding TLS support would be a good learning experience. I use Ulrich Drepper's TLS document as a reference.
I have two strings set up to try this out:
static __thread const char msg1[] = "TLS (1).\n"; /* 10 bytes */
static __thread const char msg2[] = "TLS (2).\n"; /* 10 bytes */
And the compiler generates the following instructions to access them:
mov rbx, QWORD PTR fs:0x0 ; Load TLS.
lea rsi, [rbx-0x14] ; Get a pointer to 'msg1'. 20 byte offset.
lea rsi, [rbx-0xa] ; Get a pointer to 'msg2'. 10 byte offset.
Let's assume I place the TCB somewhere on the stack:
struct tcb {
void* self; /* Points to self. I read that this was necessary somewhere. */
int errno; /* Per-thread errno variable. */
int padding;
};
And then place the TLS area just next to it at tls = &tcb - tls_size. Then I set the FS register to point at fs = tls + tls_size, and copy the TLS initialization image to tls.
However, this doesn't work. I have verified that I locate the TLS initialization image properly by writing the 20 bytes at tls_image to stdout. This either leads me to believe that I place the TCB and/or TLS area incorrectly, or that I'm otherwise not conforming to the ABI.
I set the FS register using arch_prctl(2). Do I need to use set_thread_area(2) somehow?
I don't have a dtv. I'm assuming this isn't necessary since I am linking statically.
Any ideas as to what I'm doing wrong? Thanks a lot!

I'm implementing a small subset of libc for very small and statically
linked programs, and I figured that adding TLS support would be a good
learning experience.
Awesome idea! I had to implement my own TLS in a project because I could not use any common thread library like pthread. I do not have a completely solution for your problems, but sharing my experience could be useful.
I set the FS register using arch_prctl(2). Do I need to use
set_thread_area(2) somehow?
The answer depends on the architecture, you are actually using. If you are using a x86-64 bit, you should use exclusively arch_prctl to set the FS register to an area of memory that you want to use as TLS (it allows you to address memory areas bigger than 4GB). While for x86-32 you must use set_thread_area as it is the only system call supported by the kernel.
The idea behind my implementation is to allocate a private memory area for each thread and save its address into the %GS register. It is a rather easy method, but in my case, it worked quite well. Each time you want to access the private area of a thread you just need to use as base address the value saved in %GS and an offset which identifies a memory location. I usually allocate a memory page (4096) for each thread and I divide it in 8 bytes blocks. So, I have 512 private memory slots for each thread, which can be accessed like an array whose indexes span from 0 to 511.
This is the code I use :
#define _GNU_SOURCE 1
#include "tls.h"
#include <asm/ldt.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <asm/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
void * install_tls() {
void *addr = mmap(0, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
if (syscall(SYS_arch_prctl,ARCH_SET_GS, addr) < 0)
return NULL;
return addr;
}
void freeTLS() {
void *addr;
syscall(SYS_arch_prctl,ARCH_GET_GS, &addr);
munmap(addr, 4096);
}
bool set_tls_value(int idx, unsigned long val) {
if (idx < 0 || idx >= 4096/8) {
return false;
}
asm volatile(
"movq %0, %%gs:(%1)\n"
:
: "q"((void *)val), "q"(8ll * idx));
return true;
}
unsigned long get_tls_value(int idx) {
long long rc;
if (idx < 0 || idx >= 4096/8) {
return 0;
}
asm volatile(
"movq %%gs:(%1), %0\n"
: "=q"(rc)
: "q"(8ll * idx));
return rc;
}
This is the header with some macros :
#ifndef TLS_H
#define TLS_H
#include <stdbool.h>
void *install_tls();
void freeTLS();
bool set_tls_value (int, unsigned long);
unsigned long get_tls_value(int );
/*
*macros used to set and retrieve the values
from the tls area
*/
#define TLS_TID 0x0
#define TLS_FD 0x8
#define TLS_MONITORED 0x10
#define set_local_tid(_x) \
set_tls_value(TLS_TID, (unsigned long)_x)
#define set_local_fd(_x) \
set_tls_value(TLS_FD, (unsigned long)_x)
#define set_local_monitored(_x) \
set_tls_value(TLS_MONITORED, (unsigned long)_x)
#define get_local_tid() \
get_tls_value(TLS_TID)
#define get_local_fd() \
get_tls_value(TLS_FD)
#define get_local_monitored() \
get_tls_value(TLS_MONITORED)
#endif /* end of include guard: TLS_H */
The first action to be accomplished by each thread is to install the TLS memory area. Once the TLS are has been initialized, each thread can start using this area as private TLS.

Related

Complex casting

After learn C and theorical stuffs of operatives systems, i decided to analyze one kernel rootkit for linux, but i can't understand one line of code, i don't know how to read that line:
*(void **)&((char *)h->original_function)[ASM_HOOK_CODE_OFFSET] = h->modified_function;
Full context:
#if defined __i386__
// push 0x00000000, ret
#define ASM_HOOK_CODE "\x68\x00\x00\x00\x00\xc3"
#define ASM_HOOK_CODE_OFFSET 1
// alternativly we could do `mov eax 0x00000000, jmp eax`, but it's a byte longer
//#define ASM_HOOK_CODE "\xb8\x00\x00\x00\x00\xff\xe0"
#elif defined __x86_64__
// there is no push that pushes a 64-bit immidiate in x86_64,
// so we do things a bit differently:
// mov rax 0x0000000000000000, jmp rax
#define ASM_HOOK_CODE "\x48\xb8\x00\x00\x00\x00\x00\x00\x00\x00\xff\xe0"
#define ASM_HOOK_CODE_OFFSET 2
#else
#error ARCH_ERROR_MESSAGE
#endif
struct asm_hook {
void *original_function;
void *modified_function;
char original_asm[sizeof(ASM_HOOK_CODE)-1];
struct list_head list;
};
/**
* Patches machine code of the original function to call another function.
* This function should not be called directly.
*/
void _asm_hook_patch(struct asm_hook *h)
{
DISABLE_W_PROTECTED_MEMORY
memcpy(h->original_function, ASM_HOOK_CODE, sizeof(ASM_HOOK_CODE)-1);
*(void **)&((char *)h->original_function)[ASM_HOOK_CODE_OFFSET] = h->modified_function;
ENABLE_W_PROTECTED_MEMORY
}
Rootkit link:
https://github.com/nurupo/rootkit/blob/master/rootkit.c
(line 314 of rootkit.c)
I don't want to be explained how the rootkit, only want to understand how to read that line of code, the first part of the line makes me dizzy.
If I am not mistaken: you start at the innermost part, which is
h->original_function
Then we see a brace at the right so now we scan left for the matching brace. But wait a second, we see a cast to (char *) so it is a pointer to a char and now the brace closes.
On the right we now see an array indexing to take element [ASM_HOOK_CODE_OFFSET] and on the left we now see & to take its address. So we now have the address of a char.
Now we can only go the the left and see *(void **), which casts this address to be a pointer-to-a-pointer-to-void and then dereference it and assign the right part to the address.

mbed-OS porting to TivaC TM4123, Trouple with dynamic interrupt handling

Recently I'm trying to port mbed-OS to Tiva-C launchpad TM4C123, I am facing problem with file supplied by mbed which is cmsis_nvic.c and cmsis_nvic.h
This module is supposed to dynamically allocate the interrupt handler of OS timer to addressable function.(Or as far as I understand).
What happen is, The software jumps to "Hard Fault Handler" after executing the following line
vectors[i] = old_vectors[i];
Here's files which I use
#include "cmsis_nvic.h"
#define NVIC_RAM_VECTOR_ADDRESS (0x02000000) // Vectors positioned at start of RAM
#define NVIC_FLASH_VECTOR_ADDRESS (0x0) // Initial vector position in flash
void NVIC_SetVector(IRQn_Type IRQn, uint32_t vector) {
uint32_t *vectors = (uint32_t*)SCB->VTOR;
uint32_t i;
// Copy and switch to dynamic vectors if the first time called
if (SCB->VTOR == NVIC_FLASH_VECTOR_ADDRESS) {
uint32_t *old_vectors = vectors;
vectors = (uint32_t*)NVIC_RAM_VECTOR_ADDRESS;
for (i=0; i<NVIC_NUM_VECTORS; i++) {
vectors[i] = old_vectors[i];
}
SCB->VTOR = (uint32_t)NVIC_RAM_VECTOR_ADDRESS;
}
vectors[IRQn + 16] = vector;
}
uint32_t NVIC_GetVector(IRQn_Type IRQn) {
uint32_t *vectors = (uint32_t*)SCB->VTOR;
return vectors[IRQn + 16];
}
And here is cmsis_nvic.h
#ifndef MBED_CMSIS_NVIC_H
#define MBED_CMSIS_NVIC_H
#define NVIC_NUM_VECTORS (154) // CORE + MCU Peripherals
#define NVIC_USER_IRQ_OFFSET 16
#include "cmsis.h"
#ifdef __cplusplus
extern "C" {
#endif
void NVIC_SetVector(IRQn_Type IRQn, uint32_t vector);
uint32_t NVIC_GetVector(IRQn_Type IRQn);
#ifdef __cplusplus
}
#endif
#endif
and I am calling
NVIC_SetVector(IRQn_Type IRQn, uint32_t vector)
from file us_ticker.c like this
NVIC_SetVector(TIMER0A_IRQn, (uint32_t)us_ticker_irq_handler);
(my compiler is ARM GCC, I am using CDT for building, And GDB openOCD for debugging, and integrated all those tools on Eclipse)
Can anyone please let me know what is going wrong here? or at least where should I debug or read to help me solve this problem???
UPDATE
I figured out part of the problem, The vector is not pointing to the first address of the target SRAM which should be
#define NVIC_RAM_VECTOR_ADDRESS (0x20000000)
instead of
#define NVIC_RAM_VECTOR_ADDRESS (0x02000000)
So now when calling NVIC_SetVector , the function is executed. But then when enabling the interrupt, Software still jumps to Hard Fault, I guess(just guessing or might be part of solution) that the defines in the header file are not configured correctly, Can someone explain to me what do they mean? and how to calculate the number of vector addresses? and what is the USER OFFSET?
I have solved this issue, here's what I have found
1- NVIC_RAM_VECTOR_ADDRESS wasn't the first address of my target RAM which should be `0x20000000'
2- Linker file should be update so that the stack pointer shouldn't write over the new copied vector table. So shift the RAM address by number of bytes that vector table should occupy.
3-(THE MAIN CAUSE) inside function NVIC_SetVector, i was declared as uint32_t and then compared to less than 255 pre-processor value. So compile get confused by comparing uint32_t with uint8_t, by adding UL to the pre-processor value, it solved the whole issue.

How to do computations with addresses at compile/linking time?

I wrote some code for initializing the IDT, which stores 32-bit addresses in two non-adjacent 16-bit halves. The IDT can be stored anywhere, and you tell the CPU where by running the LIDT instruction.
This is the code for initializing the table:
void idt_init(void) {
/* Unfortunately, we can't write this as loops. The first option,
* initializing the IDT with the addresses, here looping over it, and
* reinitializing the descriptors didn't work because assigning a
* a uintptr_t (from (uintptr_t) handler_func) to a descr (a.k.a.
* uint64_t), according to the compiler, "isn't computable at load
* time."
* The second option, storing the addresses as a local array, simply is
* inefficient (took 0.020ms more when profiling with the "time" command
* line program!).
* The third option, storing the addresses as a static local array,
* consumes too much space (the array will probably never be used again
* during the whole kernel runtime).
* But IF my argument against the third option will be invalidated in
* the future, THEN it's the best option I think. */
/* Initialize descriptors of exception handlers. */
idt[EX_DE_VEC] = idt_trap(ex_de);
idt[EX_DB_VEC] = idt_trap(ex_db);
idt[EX_NMI_VEC] = idt_trap(ex_nmi);
idt[EX_BP_VEC] = idt_trap(ex_bp);
idt[EX_OF_VEC] = idt_trap(ex_of);
idt[EX_BR_VEC] = idt_trap(ex_br);
idt[EX_UD_VEC] = idt_trap(ex_ud);
idt[EX_NM_VEC] = idt_trap(ex_nm);
idt[EX_DF_VEC] = idt_trap(ex_df);
idt[9] = idt_trap(ex_res); /* unused Coprocessor Segment Overrun */
idt[EX_TS_VEC] = idt_trap(ex_ts);
idt[EX_NP_VEC] = idt_trap(ex_np);
idt[EX_SS_VEC] = idt_trap(ex_ss);
idt[EX_GP_VEC] = idt_trap(ex_gp);
idt[EX_PF_VEC] = idt_trap(ex_pf);
idt[15] = idt_trap(ex_res);
idt[EX_MF_VEC] = idt_trap(ex_mf);
idt[EX_AC_VEC] = idt_trap(ex_ac);
idt[EX_MC_VEC] = idt_trap(ex_mc);
idt[EX_XM_VEC] = idt_trap(ex_xm);
idt[EX_VE_VEC] = idt_trap(ex_ve);
/* Initialize descriptors of reserved exceptions.
* Thankfully we compile with -std=c11, so declarations within
* for-loops are possible! */
for (size_t i = 21; i < 32; ++i)
idt[i] = idt_trap(ex_res);
/* Initialize descriptors of hardware interrupt handlers (ISRs). */
idt[INT_8253_VEC] = idt_int(int_8253);
idt[INT_8042_VEC] = idt_int(int_8042);
idt[INT_CASC_VEC] = idt_int(int_casc);
idt[INT_SERIAL2_VEC] = idt_int(int_serial2);
idt[INT_SERIAL1_VEC] = idt_int(int_serial1);
idt[INT_PARALL2_VEC] = idt_int(int_parall2);
idt[INT_FLOPPY_VEC] = idt_int(int_floppy);
idt[INT_PARALL1_VEC] = idt_int(int_parall1);
idt[INT_RTC_VEC] = idt_int(int_rtc);
idt[INT_ACPI_VEC] = idt_int(int_acpi);
idt[INT_OPEN2_VEC] = idt_int(int_open2);
idt[INT_OPEN1_VEC] = idt_int(int_open1);
idt[INT_MOUSE_VEC] = idt_int(int_mouse);
idt[INT_FPU_VEC] = idt_int(int_fpu);
idt[INT_PRIM_ATA_VEC] = idt_int(int_prim_ata);
idt[INT_SEC_ATA_VEC] = idt_int(int_sec_ata);
for (size_t i = 0x30; i < IDT_SIZE; ++i)
idt[i] = idt_trap(ex_res);
}
The macros idt_trap and idt_int, and are defined as follows:
#define idt_entry(off, type, priv) \
((descr) (uintptr_t) (off) & 0xffff) | ((descr) (KERN_CODE & 0xff) << \
0x10) | ((descr) ((type) & 0x0f) << 0x28) | ((descr) ((priv) & \
0x03) << 0x2d) | (descr) 0x800000000000 | \
((descr) ((uintptr_t) (off) & 0xffff0000) << 0x30)
#define idt_int(off) idt_entry(off, 0x0e, 0x00)
#define idt_trap(off) idt_entry(off, 0x0f, 0x00)
idt is an array of uint64_t, so these macros are implicitly cast to that type. uintptr_t is the type guaranteed to be capable of holding pointer values as integers and on 32-bit systems usually 32 bits wide. (A 64-bit IDT has 16-byte entries; this code is for 32-bit).
I get the warning that the initializer element is not constant due to the address modification in play.
It is absolutely sure that the address is known at linking time.
Is there anything I can do to make this work? Making the idt array automatic would work but this would require the whole kernel to run in the context of one function and this would be some bad hassle, I think.
I could make this work by some additional work at runtime (as Linux 0.01 also does) but it just annoys me that something technically feasible at linking time is actually infeasible.
Related: Solution needed for building a static IDT and GDT at assemble/compile/link time - a linker script for ld can shift and mask to break up link-time-constant addresses. No earlier step has the final addresses, and relocation entries are limited in what they can represent in a .o.
The main problem is that function addresses are link-time constants, not strictly compile time constants. The compiler can't just get 32b binary integers and stick that into the data segment in two separate pieces. Instead, it has to use the object file format to indicate to the linker where it should fill in the final value (+ offset) of which symbol when linking is done. The common cases are as an immediate operand to an instruction, a displacement in an effective address, or a value in the data section. (But in all those cases it's still just filling in 32-bit absolute address so all 3 use the same ELF relocation type. There's a different relocation for relative displacements for jump / call offsets.)
It would have been possible for ELF to have been designed to store a symbol reference to be substituted at link time with a complex function of an address (or at least high / low halves like on MIPS for lui $t0, %hi(symbol) / ori $t0, $t0, %lo(symbol) to build address constants from two 16-bit immediates). But in fact the only function allowed is addition/subtraction, for use in things like mov eax, [ext_symbol + 16].
It is of course possible for your OS kernel binary to have a static IDT with fully resolved addresses at build time, so all you need to do at runtime is execute a single lidt instruction. However, the standard
build toolchain is an obstacle. You probably can't achieve this without post-processing your executable.
e.g. you could write it this way, to produce a table with the full padding in the final binary, so the data can be shuffled in-place:
#include <stdint.h>
#define PACKED __attribute__((packed))
// Note, this is the 32-bit format. 64-bit is larger
typedef union idt_entry {
// we will postprocess the linker output to have this format
// (or convert at runtime)
struct PACKED runtime { // from OSdev wiki
uint16_t offset_1; // offset bits 0..15
uint16_t selector; // a code segment selector in GDT or LDT
uint8_t zero; // unused, set to 0
uint8_t type_attr; // type and attributes, see below
uint16_t offset_2; // offset bits 16..31
} rt;
// linker output will be in this format
struct PACKED compiletime {
void *ptr; // offset bits 0..31
uint8_t zero;
uint8_t type_attr;
uint16_t selector; // to be swapped with the high16 of ptr
} ct;
} idt_entry;
// #define idt_ct_entry(off, type, priv) { .ptr = off, .type_attr = type, .selector = priv }
#define idt_ct_trap(off) { .ct = { .ptr = off, .type_attr = 0x0f, .selector = 0x00 } }
// generate an entry in compile-time format
extern void ex_de(); // these are the raw interrupt handlers, written in ASM
extern void ex_db(); // they have to save/restore *all* registers, and end with iret, rather than the usual C ABI.
// it might be easier to use asm macros to create this static data,
// just so it can be in the same file and you don't need cross-file prototypes / declarations
// (but all the same limitations about link-time constants apply)
static idt_entry idt[] = {
idt_ct_trap(ex_de),
idt_ct_trap(ex_db),
// ...
};
// having this static probably takes less space than instructions to write it on the fly
// but not much more. It would be easy to make a lidt function that took a struct pointer.
static const struct PACKED idt_ptr {
uint16_t len; // encoded as bytes - 1, so 0xffff means 65536
void *ptr;
} idt_ptr = { sizeof(idt) - 1, idt };
/****** functions *********/
// inline
void load_static_idt(void) {
asm volatile ("lidt %0"
: // no outputs
: "m" (idt_ptr));
// memory operand, instead of writing the addressing mode ourself, allows a RIP-relative addressing mode in 64bit mode
// also allows it to work with -masm=intel or not.
}
// Do this once at at run-time
// **OR** run this to pre-process the binary, after link time, as part of your build
void idt_convert_to_runtime(void) {
#ifdef DEBUG
static char already_done = 0; // make sure this only runs once
if (already_done)
error;
already_done = 1;
#endif
const int count = sizeof idt / sizeof idt[0];
for (int i=0 ; i<count ; i++) {
uint16_t tmp1 = idt[i].rt.selector;
uint16_t tmp2 = idt[i].rt.offset_2;
idt[i].rt.offset_2 = tmp1;
idt[i].rt.selector = tmp2;
// or do this swap in fewer insns with SSE or MMX pshufw, but using vector instructions before setting up the IDT may be insane.
}
}
This does compile. See a diff of the -m32 and -m64 asm output on the Godbolt compiler explorer. Look at the layout in the data section (note that .value is a synonym for .short, and is 16 bits.) (But note that the IDT table format is different for 64-bit mode.)
I think I have the size calculation correct (bytes - 1), as documented in http://wiki.osdev.org/Interrupt_Descriptor_Table. Minimum value 100h bytes long (encoded as 0x99). See also https://en.wikibooks.org/wiki/X86_Assembly/Global_Descriptor_Table. (lgdt size/pointer works the same way, although the table itself has a different format.)
The other option, instead of having the IDT static in the data section, is to have it in the bss section, with the data stored as immediate constants in a function that will initialize it (or in an array read by that function).
Either way, that function (and its data) can be in a .init section whose memory you re-use after it's done. (Linux does this to reclaim memory from code and data that's only needed once, at startup.) This would give you the optimal tradeoff of small binary size (since 32b addresses are smaller than 64b IDT entries), and no runtime memory wasted on code to set up the IDT. A small loop that runs once at startup is negligible CPU time. (The version on Godbolt fully unrolls because I only have 2 entries, and it embeds the address into each instruction as a 32-bit immediate, even with -Os. With a large enough table (just copy/paste to duplicate a line) you get a compact loop even at -O3. The threshold is lower for -Os.)
Without memory-reuse haxx, probably a tight loop to rewrite 64b entries in place is the way to go. Doing it at build time would be even better, but then you'd need a custom tool to run the tranformation on the kernel binary.
Having the data stored in immediates sounds good in theory, but the code for each entry would probably total more than 64b, because it couldn't loop. The code to split an address into two would have to be fully unrolled (or placed in a function and called). Even if you had a loop to store all the same-for-multiple-entries stuff, each pointer would need a mov r32, imm32 to get the address in a register, then mov word [idt+i + 0], ax / shr eax, 16 / mov word [idt+i + 6], ax. That's a lot of machine-code bytes.
One way would be to use an intermediate jump table that is located at a fixed address. You could initialize the idt with addresses of the locations in this table (which will be compile time constant). The locations in the jump table will contain jump instructions to the actual isr routines.
The dispatch to an isr will be indirect as follows:
trap -> jump to intermediate address in the idt -> jump to isr
One way to create a jump table at a fixed address is as follows.
Step 1: Put jump table in a section
// this is a jump table at a fixed address
void jump(void) __attribute__((section(".si.idt")));
void jump(void) {
asm("jmp isr0"); // can also be asm("call ...") depending on need
asm("jmp isr1");
asm("jmp isr2");
}
Step 2: Instruct the linker to locate the section at a fixed address
SECTIONS
{
.so.idt 0x600000 :
{
*(.si.idt)
}
}
Put this in the linker script right after the .text section. This will make sure that the executable code in the table will go into a executable memory region.
You can instruct the linker to use your script as follows using the --script option in the Makefile.
LDFLAGS += -Wl,--script=my_script.lds
The following macro gives you the address of the location which contains the jump (or call) instruction to the corresponding isr.
// initialize the idt at compile time with const values
// you can find a cleaner way to generate offsets
#define JUMP_ADDR(off) ((char*)0x600000 + 4 + (off * 5))
You will then initialize the idt as follows using modified macros.
// your real idt will be initialized as follows
#define idt_entry(addr, type, priv) \
( \
((descr) (uintptr_t) (addr) & 0xffff) | \
((descr) (KERN_CODE & 0xff) << 0x10) | \
((descr) ((type) & 0x0f) << 0x28) | \
((descr) ((priv) & 0x03) << 0x2d) | \
((descr) 0x1 << 0x2F) | \
((descr) ((uintptr_t) (addr) & 0xffff0000) << 0x30) \
)
#define idt_int(off) idt_entry(JUMP_ADDR(off), 0x0e, 0x00)
#define idt_trap(off) idt_entry(JUMP_ADDR(off), 0x0f, 0x00)
descr idt[] =
{
...
idt_trap(ex_de),
...
idt_int(int_casc),
...
};
A demo working example is below, which shows dispatch to a isr with a non-fixed address from a instruction at a fixed address.
#include <stdio.h>
// dummy isrs for demo
void isr0(void) {
printf("==== isr0\n");
}
void isr1(void) {
printf("==== isr1\n");
}
void isr2(void) {
printf("==== isr2\n");
}
// this is a jump table at a fixed address
void jump(void) __attribute__((section(".si.idt")));
void jump(void) {
asm("jmp isr0"); // can be asm("call ...")
asm("jmp isr1");
asm("jmp isr2");
}
// initialize the idt at compile time with const values
// you can find a cleaner way to generate offsets
#define JUMP_ADDR(off) ((char*)0x600000 + 4 + (off * 5))
// dummy idt for demo
// see below for the real idt
char* idt[] =
{
JUMP_ADDR(0),
JUMP_ADDR(1),
JUMP_ADDR(2),
};
int main(int argc, char* argv[]) {
int trap;
char* addr = idt[trap = argc - 1];
printf("==== idt[%d]=%p\n", trap, addr);
asm("jmp *%0\n" : :"m"(addr));
}

Hooking mmap system to provide real-time type conversion?

I'm working on some stuff where I want to memory map some large files containing numeric data. The problem is that the data can be a number of formats, including real byte/short/int/long/float/double and complex byte/short/int/long/float/double. Naturally handling all those types all the time quickly gets unwieldy, so I was thinking of implementing a memory mapping interface that can do real-time type conversion for the user.
I really like the idea of mapping a file so you get a pointer in memory back, doing whatever you need and then unmapping it. No bufferology or anything else needed. So a function that reads the data and does the type conversion for me would take a lot away from that.
I was thinking I could memory map the file being operated on, and then simultaneously mapping an anonymous file, and somehow catching page fetches/stores and doing the type conversion on demand. I'll be working on 64-bit so this would give you a 63-bit address space in these cases, but oh well.
Does anyone know if this sort of mmap hooking would be possible, and if so, how might it be accomplished?
Yes(-ish). You can create inaccessible mmap regions. Whenever anybody tries to touch one, handle the SIGSEGV raised by fixing its permissions, filling it, and resuming.
long *long_view =
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
double *double_view =
mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
static void on_segv(int signum, siginfo_t *info, void *data) {
void *addr = info->si_addr;
if ((uintptr_t)addr - (uintptr_t)long_view < 4096) {
mprotect(long_view, 4096, PROT_READ|PROT_WRITE);
/* translate from double_view to long_view */
mprotect(double_view, 4096, PROT_NONE);
} else if ((uintptr_t)addr - (uintptr_t)double_view < 4096) {
mprotect(double_view, 4096, PROT_READ|PROT_WRITE);
/* translate from long_view to long_view */
mprotect(double_view, 4096, PROT_NONE);
} else {
abort();
}
}
struct sigaction segv_action = {
.sa_sigaction = on_segv,
.sa_flags = SA_RESTART | SA_SIGINFO,
};
sigaction(SIGSEGV, &segv_action, NULL);
long_view[0] = 42;
/* hopefully, this will trigger the code to fixup double_view and resume */
printf("%g\n", double_view[0]);
(Untested, but something along these lines ought to work...)
If you don't want to fill a whole page at once, that's still doable I think... the third argument can be cast to a ucontext_t *, with which you can decode the instruction being executed and fix it up as if it had performed the expected operation, while leaving the memorry PROT_NONE to catch further accesses... but it'll be a lot slower since you're trapping every access rather than just the first.
The reading part sounds somewhat doable to me. I have no experience in that, but in principle having a signal handler fetch your data and translate it as soon as you access a page that is not yet present in your user-presented buffer should be possible. But it could be that such a thing would be quite inefficient, you'd have a context switch on every page.
The other way round would be much harder I guess. Per default writes are asynchronous, so it will be difficult to capture them.
So "half" of what you want could be possible: always write the data in a new file in the format that the user wants, but translate it automatically on the fly when reading such a file.
But what would be much more important for you, I think, that you have a clear semantic on your different storage representation, and that you encapsulate a read or write of a data item properly. If you have such an interface (something like "store element E at position i with type T") you could easily trigger the conversion with respect to the target format.
Is there a reason why you're not using accessor functions?
There are two basic cases: structured data, and plain data. Structured data has mixed data types, plain data only one format of the ones you listed. You could also have support for transparent endianness correction, if you can store a prototype value (with distinct byte components) for each used type (28 or 30 bytes total for all types you listed -- complex types are just pairs, and have the same byte order as the base components). I have used this approach to store and access atom data from molecular dynamics simulations, and is quite efficient in practice -- the fastest portable one I've tested.
I would use a structure to describe the "file" (the backing file, memory map, endianness, and data format if plain data):
struct file {
int descriptor;
size_t size;
unsigned char *data;
unsigned int endian; /* Relative to current architecture */
unsigned int format; /* endian | format, for unstructured files */
};
#define ENDIAN_I16_MASK 0x0001
#define ENDIAN_I16_12 0x0000
#define ENDIAN_I16_21 0x0001
#define ENDIAN_I32_MASK 0x0006
#define ENDIAN_I32_1234 0x0000
#define ENDIAN_I32_4321 0x0002
#define ENDIAN_I32_2143 0x0004
#define ENDIAN_I32_3412 0x0006
#define ENDIAN_I64_MASK 0x0018
#define ENDIAN_I64_1234 0x0000
#define ENDIAN_I64_4321 0x0008
#define ENDIAN_I64_2143 0x0010
#define ENDIAN_I64_3412 0x0018
#define ENDIAN_F16_MASK 0x0020
#define ENDIAN_F16_12 0x0000
#define ENDIAN_F16_21 0x0020
#define ENDIAN_F32_MASK 0x00C0
#define ENDIAN_F32_1234 0x0000
#define ENDIAN_F32_4321 0x0040
#define ENDIAN_F32_2143 0x0080
#define ENDIAN_F32_3412 0x00C0
#define ENDIAN_F64_MASK 0x0300
#define ENDIAN_F64_1234 0x0000
#define ENDIAN_F64_4321 0x0100
#define ENDIAN_F64_2143 0x0200
#define ENDIAN_F64_3412 0x0300
#define FORMAT_MASK 0xF000
#define FORMAT_I8 0x1000
#define FORMAT_I16 0x2000
#define FORMAT_I32 0x3000
#define FORMAT_I64 0x4000
#define FORMAT_P8 0x5000 /* I8 pair ("complex I8") */
#define FORMAT_P16 0x6000 /* I16 pair ("complex I16") */
#define FORMAT_P32 0x7000 /* I32 pair ("complex I32") */
#define FORMAT_P64 0x8000 /* I64 pair ("complex I64") */
#define FORMAT_R16 0x9000 /* BINARY16 IEEE-754 floating-point */
#define FORMAT_R32 0xA000 /* BINARY32 IEEE-754 floating-point */
#define FORMAT_R64 0xB000 /* BINARY64 IEEE-754 floating-point */
#define FORMAT_C16 0xC000 /* BINARY16 IEEE-754 complex */
#define FORMAT_C32 0xD000 /* BINARY32 IEEE-754 complex */
#define FORMAT_C64 0xE000 /* BINARY64 IEEE-754 complex */
The accessor functions can be implemented in various ways. In Linux, functions marked static inline are as fast as macros, too.
Since the double type does not fully cover 64-bit integer types (since it has only 52 bits in the mantissa), I'd define a number structure,
#include <stdint.h>
struct number {
int64_t ireal;
int64_t iimag;
double freal;
double fimag;
};
and have the accessor functions always fill in the four fields. Using GCC, you can also create a macro to define a struct number using automatic type detection:
#define Number(x) \
( __builtin_types_compatible_p(__typeof__ (x), double) ? number_double(x) : \
__builtin_types_compatible_p(__typeof__ (x), _Complex double) ? number_complex_double(x) : \
__builtin_types_compatible_p(__typeof__ (x), _Complex long) ? number_complex_long(x) : \
number_int64(x) )
static inline struct number number_int64(const int64_t x)
{
return (struct number){ .ireal = (int64_t)x,
.iimag = 0,
.freal = (double)x,
.fimag = 0.0 };
}
static inline struct number number_double(const double x)
{
return (struct number){ .ireal = (int64_t)x,
.iimag = 0,
.freal = x,
.fimag = 0.0 };
}
static inline struct number number_complex_long(const _Complex long x)
{
return (struct number){ .ireal = (int64_t)(__real__ (x)),
.iimag = (int64_t)(__imag__ (x)),
.freal = (double)(__real__ (x)),
.fimag = (double)(__imag__ (x)) };
}
static inline struct number number_complex_double(const _Complex double x)
{
return (struct number){ .ireal = (int64_t)(__real__ (x)),
.iimag = (int64_t)(__imag__ (x)),
.freal = __real__ (x),
.fimag = __imag__ (x) };
}
This means that Number(value) constructs a correct struct number as long as value is an integer or floating-point real or complex type.
Note how the integer and floating-point components are set to the same values, as far as type conversions allow. (For very large integers in magnitude, the floating-point value will be an approximation. You could also use (int64_t)round(...) to round instead of truncate the floating-point parameter, when setting the integer component(s).
You'll need four accessor functions: Two for structured data, and two for unstructured data. For unstructured (plain) data:
static inline struct number file_get_number(const struct file *const file,
const size_t offset)
{
...
}
static inline void file_set_number(const struct file *const file,
const size_t offset,
const struct number number)
{
...
}
Note that offset above is not the byte offset, but the index of the number. For a structured file, you'll need to use the byte offset, and add a parameter specifying the number format used in the file:
static inline struct number file_get(const struct file *const file,
const size_t byteoffset,
const unsigned int format)
{
...
}
static inline void file_set(const struct file *const file,
const size_t byteoffset,
const unsigned int format,
const struct number number)
{
...
}
The conversions needed in the function bodies I omitted (...) are quite straightforward. There are a few tricks you can do for optimization, too. For example, I like to adjust the endianness constants so that the low bit is always a byte swap (ab -> ba, abcd -> badc, abcdefgh -> badcfehg), and the high bit is a short swap (abcd -> cdab, abcdefgh ->cdabghef). You might need a third bit for 64-bit values (abcdefgh -> efghabcd), if you want to be completely certain.
The if or case statements within the function body do cause a small access overhead, but it should be small enough to ignore in practice. All ways to avoid it will lead to much more complex code. (For maximum throughput, you'd need to open-code all access variants, and use __builtin_types_compatible_p() in a function or macro to determine the correct one to use. If you consider the endianness conversion, that means quite a few functions. I believe the very small access overhead -- a few clocks at most per access -- is much more preferable. (All my tests have been I/O bound anyway, even at 200 Mb/s, so to me the overhead is completely irrelevant.)
In general, for automatic endianness conversion using prototype values you simply test each possible conversion for the type. As long as each byte component of the prototype values are unique, then only one conversion will produce the correct expected value. On some architectures integers and floating-point values have different endianness; this is why the ENDIAN_ constants are for each type and size separately.
Assuming you have implemented all of the above, in your application code the data access would look something like
struct file f;
/* Set first number to zero. */
file_set_number(&f, 0, Number(0));
/* Set second number according to variable v,
* which can be just about any numeric type. */
file_set_number(&f, 1, Number(v));
I hope you find this useful.

I'm writing my own JIT-interpreter. How do I execute generated instructions?

I intend to write my own JIT-interpreter as part of a course on VMs. I have a lot of knowledge about high-level languages, compilers and interpreters, but little or no knowledge about x86 assembly (or C for that matter).
Actually I don't know how a JIT works, but here is my take on it: Read in the program in some intermediate language. Compile that to x86 instructions. Ensure that last instruction returns to somewhere sane back in the VM code. Store the instructions some where in memory. Do an unconditional jump to the first instruction. Voila!
So, with that in mind, I have the following small C program:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
int main() {
int *m = malloc(sizeof(int));
*m = 0x90; // NOP instruction code
asm("jmp *%0"
: /* outputs: */ /* none */
: /* inputs: */ "d" (m)
: /* clobbers: */ "eax");
return 42;
}
Okay, so my intention is for this program to store the NOP instruction somewhere in memory, jump to that location and then probably crash (because I haven't setup any way for the program to return back to main).
Question: Am I on the right path?
Question: Could you show me a modified program that manages to find its way back to somewhere inside main?
Question: Other issues I should beware of?
PS: My goal is to gain understanding, not necessarily do everything the right way.
Thanks for all the feedback. The following code seems to be the place to start and works on my Linux box:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
unsigned char *m;
int main() {
unsigned int pagesize = getpagesize();
printf("pagesize: %u\n", pagesize);
m = malloc(1023+pagesize+1);
if(m==NULL) return(1);
printf("%p\n", m);
m = (unsigned char *)(((long)m + pagesize-1) & ~(pagesize-1));
printf("%p\n", m);
if(mprotect(m, 1024, PROT_READ|PROT_EXEC|PROT_WRITE)) {
printf("mprotect fail...\n");
return 0;
}
m[0] = 0xc9; //leave
m[1] = 0xc3; //ret
m[2] = 0x90; //nop
printf("%p\n", m);
asm("jmp *%0"
: /* outputs: */ /* none */
: /* inputs: */ "d" (m)
: /* clobbers: */ "ebx");
return 21;
}
Question: Am I on the right path?
I would say yes.
Question: Could you show me a modified program that manages to find its way back to somewhere inside main?
I haven't got any code for you, but a better way to get to the generated code and back is to use a pair of call/ret instructions, as they will manage the return address automatically.
Question: Other issues I should beware of?
Yes - as a security measure, many operating systems would prevent you from executing code on the heap without making special arrangements. Those special arrangements typically amount to you having to mark the relevant memory page(s) as executable.
On Linux this is done using mprotect() with PROT_EXEC.
If your generated code follows the proper calling convention, then you can declare a pointer-to-function type and invoke the function this way:
typedef void (*generated_function)(void);
void *func = malloc(1024);
unsigned char *o = (unsigned char *)func;
generated_function *func_exec = (generated_function *)func;
*o++ = 0x90; // NOP
*o++ = 0xcb; // RET
func_exec();

Resources