"Trapping" a processes' own sysenter calls in userspace on Windows - c

I'm working on a runtime non-native binary translator in Windows, and so far I've been able to "trap" interrupts (i.e. INT 0x99) for the OS binaries I'm trying to emulate by using an ugly hack that relies on Windows SEH to handle the invalid interrupts. This only works because the guest's system call vector is different from the one Windows uses, which lets me catch these "soft" exceptions by doing something like this:
static int __stdcall handler_cb(EXCEPTION_POINTERS* pes, ...)
{
    if (pes->ExceptionRecord->ExceptionCode != EXCEPTION_ACCESS_VIOLATION)
        return EXCEPTION_CONTINUE_SEARCH;

    // unsigned char, so comparisons against opcodes like 0xcd behave as intended
    unsigned char* instruct = (unsigned char*) pes->ContextRecord->Eip;
    if (!instruct)
        handle_invalid_instruction(instruct);

    switch (instruct[0])
    {
        case 0xcd: // INT imm8
        {
            if (instruct[1] != 0x99) // only INT 0x99 is ours
                handle_invalid_instruction(instruct);
            handle_syscall_translation();
            ...
        }
        ...
        default:
            halt_and_catch_fire();
    }
    return EXCEPTION_CONTINUE_EXECUTION;
}
Which works fairly well (but slowly). The problem is that Windows first attempts to handle the instruction/interrupt itself, and for non-native binaries that use sysenter/sysexit instead of int 0x99, some sysenter instructions in the non-native binary are actually valid NT kernel calls when executed, meaning my handler is never called and, worse, the state of the "host" OS is also compromised. Is there any way to "trap" sysenter instructions in Windows? How would I go about doing this?

As far as I know, there is no way (from a user-mode process) to "disable" SYSENTER so that executing it will generate an exception. (I'm assuming your programs don't try to SYSEXIT, because only ring 0 can do that.)
The only option I think you have is to do what VirtualBox does: scan for such instructions, replace them with illegal opcodes (or something similar) that you can trap on, and emulate. See 10.4. Details about software virtualization.
To fix these performance and security issues, VirtualBox contains a Code Scanning and Analysis Manager (CSAM), which disassembles guest code, and the Patch Manager (PATM), which can replace it at runtime.
Before executing ring 0 code, CSAM scans it recursively to discover problematic instructions. PATM then performs in-situ patching, i.e. it replaces the instruction with a jump to hypervisor memory where an integrated code generator has placed a more suitable implementation. In reality, this is a very complex task as there are lots of odd situations to be discovered and handled correctly. So, with its current complexity, one could argue that PATM is an advanced in-situ recompiler.
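A minimal sketch of that scan-and-patch idea for your case, assuming you know where the guest code lives: find SYSENTER (0F 34) and overwrite it with UD2 (0F 0B), which always faults, so your existing handler gets control. Note that a raw byte scan like this can match 0F 34 in the middle of a longer instruction; a real implementation has to disassemble instruction-by-instruction, which is exactly what CSAM does. patch_sysenter_sites and the bookkeeping are placeholders, not an existing API:

#include <windows.h>

static void patch_sysenter_sites(unsigned char* code, SIZE_T len)
{
    DWORD old_protect;
    if (!VirtualProtect(code, len, PAGE_EXECUTE_READWRITE, &old_protect))
        return;

    for (SIZE_T i = 0; i + 1 < len; i++)
    {
        if (code[i] == 0x0f && code[i + 1] == 0x34)   // SYSENTER
        {
            code[i + 1] = 0x0b;                       // 0F 0B = UD2, always raises #UD
            // record i somewhere so the fault handler knows this was a SYSENTER site
        }
    }

    VirtualProtect(code, len, old_protect, &old_protect);
    FlushInstructionCache(GetCurrentProcess(), code, len);
}

In the handler you would then check for EXCEPTION_ILLEGAL_INSTRUCTION at a recorded site, call handle_syscall_translation(), advance Eip past the patched instruction, and return EXCEPTION_CONTINUE_EXECUTION.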

Related

how to single-step code on-target with no jtag, breakpoints, simulator, emulator

Let's say you have a pointer to function whose source you do not have and which is "untrusted" because it might read/write to disallowed memory region.
Before it executes each assembly instruction, you want to verify that it doesn't access disallowed memory regions.
The OS is (almost) bare-metal i.e. a custom RTOS (so no Linux or QNX).
This is for a functionality that needs to be enabled not only during development but during normal runtime.
Ideally, it'd run something like this:
void (*fptr)(int);
fptr = &someFunction; // untrusted, don't have source
// enable interrupts for each assembly instruction
_EN_INT();
// call the function
fptr();
// every time the PC advances, some other code runs which verifies that any load/store executed does not access a disallowed memory range
// disable interrupts for each assembly instruction
_DIS_INT();
QUESTION
Is it possible to call that function and pause execution after every assembly instruction?
The OS is (almost) bare-metal i.e. a custom RTOS (so no Linux or QNX).
My answer assumes that you can modify the "OS" the way you need it...
Cortex MK20DX256VLH7
This seems to be a Cortex M4 CPU.
how to single-step code on-target with no jtag, breakpoints
From the doc, it doesn't say whether you NEED an external debugger to resume execution.
If the CPU is really stopped, you'll definitely need an external signal (e.g. from a debugger).
However most CPUs support software debugging. This means that an interrupt service routine is executed whenever a breakpoint is hit. To continue execution you simply return from the interrupt service routine.
I don't know about the Cortex M4 but for the Cortex M3 you'll have to set some special registers to enable that feature. Whenever a "BKPT" instruction is hit then interrupt #12 (*) is executed.
For code in RAM you simply write a BKPT instruction (0xBExx, e.g. 0xBEBE) to the address where you want to set your breakpoint. (Before writing, you read out the value so you can restore it later on.)
For code in Flash the M3 has a "Flash patching unit" which allows you to specify up to three addresses which shall be read out as 0xBExx (0xBEBE ?) even if other data is stored there. This allows you to set up to 3 breakpoints in Flash.
Interesting for you: The register controlling the debug features in the M3 (named "DEMCR") also has a bit named "MON_STEP":
If you set this bit in interrupt handler #12 then exactly one instruction is executed after returning from the interrupt handler and interrupt #12 is triggered again. The use case for this feature is - of course - single-stepping code!
To stop single-stepping you'll have to clear the MON_STEP bit again...
Important 1:
I don't know if the MK20DX256VLH7 really has all these features. However, because it is a Cortex M4 chip, and the M4 should have nearly all features of the M3, these features should be present...
Important 2:
Implementing single-stepping and debugging is not done quickly. Assembly language knowledge will be very helpful and you'll need a lot of time...
From the doc, ...
You will not only need the documentation for the MK20DX256VLH7 from NXP but you'll also need the Cortex M4 documentation from ARM.
(*) Offset 4*12 in the vector table is meant here (which is named "IRQ(-4)" in some ARM documents); not IRQ12.
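To make the MON_EN / MON_STEP part concrete, here is a rough sketch using CMSIS register names (CoreDebug->DEMCR, DebugMon_Handler). It assumes a CMSIS-based project and that no halting debugger owns the core (the monitor is ignored while an external debugger has C_DEBUGEN set):

#include "core_cm4.h"   // CMSIS-Core header; normally pulled in by your device header

volatile int keep_single_stepping = 1;

void enable_debug_monitor(void)
{
    // route debug events (BKPT, single-step) to the DebugMonitor exception
    CoreDebug->DEMCR |= CoreDebug_DEMCR_MON_EN_Msk;
}

void DebugMon_Handler(void)
{
    // entered on a BKPT hit and, while MON_STEP is set, after every instruction;
    // inspect the stacked PC/registers here and do the load/store checks
    if (keep_single_stepping)
        CoreDebug->DEMCR |= CoreDebug_DEMCR_MON_STEP_Msk;    // execute exactly one more instruction
    else
        CoreDebug->DEMCR &= ~CoreDebug_DEMCR_MON_STEP_Msk;   // stop stepping, resume at full speed
}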
yes the ARM emulator/interpreter sounds exactly like what I want. Is there a free one?
qemu is open-source, most of it is GPLv2. https://wiki.qemu.org/License. You'd probably need to modify it a lot, because it's designed for use as a stand-alone wrapper for a whole Unix process (qemu-user) or whole machine (qemu-system).
I googled, and there's also http://www.unicorn-engine.org/ which is designed to be used as part of a larger program (written in C with bindings for calling from various languages). It's also GPLv2 (not LGPL), so you can use it if the rest of your code is also Free software.
It's actually based on the CPU-emulation code from QEMU; they stripped out all the device / BIOS emulation stuff to make a flexible library for just emulating CPUs.
Presumably you could configure some memory protections for it and set up the starting machine state, and let it run your function (with a return address that leads to some code that hands control back to your main code?)
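To give an idea of the shape of the Unicorn C API, here is a rough sketch that maps an untrusted ARM/Thumb function into the emulator and rejects loads/stores outside an allowed window. The addresses, sizes and the allowed range are made up for illustration:

#include <stddef.h>
#include <stdint.h>
#include <unicorn/unicorn.h>

#define CODE_BASE   0x10000u     // where the untrusted code gets mapped
#define STACK_BASE  0x80000u     // scratch stack for the emulated code

// called on every emulated load/store; stop emulation on a disallowed access
static void hook_mem(uc_engine *uc, uc_mem_type type, uint64_t addr,
                     int size, int64_t value, void *user_data)
{
    if (addr < 0x20000000u || addr >= 0x20010000u) {   // made-up allowed window
        *(int *)user_data = 1;                          // remember the violation
        uc_emu_stop(uc);
    }
}

int run_untrusted(const uint8_t *code, size_t code_len)
{
    uc_engine *uc;
    uc_hook mem_hook;
    int violated = 0;

    if (uc_open(UC_ARCH_ARM, UC_MODE_THUMB, &uc) != UC_ERR_OK)
        return -1;

    uc_mem_map(uc, CODE_BASE, 0x1000, UC_PROT_ALL);                    // assumes code_len <= 4 KiB
    uc_mem_map(uc, STACK_BASE, 0x1000, UC_PROT_READ | UC_PROT_WRITE);
    uc_mem_write(uc, CODE_BASE, code, code_len);

    uint32_t sp = STACK_BASE + 0x1000;
    uc_reg_write(uc, UC_ARM_REG_SP, &sp);

    uc_hook_add(uc, &mem_hook, UC_HOOK_MEM_READ | UC_HOOK_MEM_WRITE,
                hook_mem, &violated, 1, 0);                            // begin > end = hook everywhere

    // bit 0 set = Thumb entry; run until execution reaches the end of the copied code
    uc_err err = uc_emu_start(uc, CODE_BASE | 1, CODE_BASE + code_len, 0, 0);

    uc_close(uc);
    return (err == UC_ERR_OK && !violated) ? 0 : -1;
}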

Safely Exiting to a Particular State in Case of Error

When writing code I often have checks to see if errors occurred. An example would be:
char *x = malloc( some_bytes );
if( x == NULL ){
fprintf( stderr, "Malloc failed.\n" );
exit(EXIT_FAILURE);
}
I've also used strerror( errno ) in the past.
I've only ever written small desktop applications where it doesn't matter if the program exit()ed in case of an error.
Now, however, I'm writing C code for an embedded system (Arduino) and I don't want the system to just exit in case of an error. I want it to go to a particular state/function where it can power down systems, send error reports and idle safely.
I could simply call an error_handler() function, but I could be deep in the stack and very low on memory, leaving error_handler() inoperable.
Instead, I'd like execution to effectively collapse the stack, free up a bunch of memory and start sorting out powering down and error reporting. There is a serious fire risk if the system doesn't power down safely.
Is there a standard way that safe error handling is implemented in low memory embedded systems?
EDIT 1:
I'll limit my use of malloc() in embedded systems. In this particular case, the errors would occur when reading a file, if the file was not of the correct format.
Maybe you're waiting for the Holy and Sacred setjmp/longjmp, the one who came to save all the memory-hungry stacks from their sins?
#include <setjmp.h>

jmp_buf jumpToMeOnAnError;

void someUpperFunctionOnTheStack() {
    if (setjmp(jumpToMeOnAnError) != 0) {
        // Error handling code goes here
        // Return, abort(), while(1) {}, or whatever here...
    }
    // Do routine stuff
}

void someLowerFunctionOnTheStack() {
    if (theWorldIsOver)
        longjmp(jumpToMeOnAnError, -1);
}
Edit: Prefer not to use malloc()/free() on embedded systems, for the same reasons you gave. It's simply unmanageable, unless you use a lot of return codes/setjmp()s to free the memory all the way up the stack...
If your system has a watchdog, you could use:
char *x = malloc( some_bytes );
assert(x != NULL);
The implementation of assert() could be something like:
#define assert(condition) \
    if (!(condition)) { while (1) {} }
In case of a failure the watchdog would trigger and the system would reset. At restart the system would check the reset reason; if the reset reason was "watchdog reset", the system would go to a safe state.
Update
Before entering the while loop, assert could also output an error message, print a stack trace or save some data in non-volatile memory.
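For example, the restart-side check could look roughly like this. reset_was_caused_by_watchdog() and enter_safe_state() are placeholders, since the actual reset-cause register is chip-specific (on AVR-based Arduinos it is MCUSR):

extern int  reset_was_caused_by_watchdog(void);   // placeholder: read the chip's reset-cause register
extern void enter_safe_state(void);               // placeholder: power down outputs, report the error, idle
extern void normal_startup(void);

void main(void)
{
    if (reset_was_caused_by_watchdog())
        enter_safe_state();    // never returns

    normal_startup();
}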
Is there a standard way that safe error handling is implemented in low memory embedded systems?
Yes, there is an industry de facto way of handling it. It is all rather simple:
For every module in your program you need to have a result type, such as a custom enum, which describes every possible thing that could go wrong with the functions inside that module.
You document every function properly, stating what codes it will return upon error and what code it will return upon success.
You leave all error handling to the caller.
If the caller is another module, it too passes the error on to its own caller, possibly renaming it to something more suitable where applicable.
The error handling mechanism is located in main(), at the bottom of the call stack.
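For illustration, one such module interface might look like this (the names are made up):

/* sensor.h -- illustrative module interface */
typedef enum
{
    SENSOR_OK,
    SENSOR_ERR_TIMEOUT,
    SENSOR_ERR_BAD_FORMAT    /* e.g. the file was not of the correct format */
} sensor_result_t;

/* Reads one sample into *value_out.
 * Returns SENSOR_OK on success, SENSOR_ERR_TIMEOUT or SENSOR_ERR_BAD_FORMAT on
 * failure; the caller (ultimately main()) decides how to handle the error. */
sensor_result_t sensor_read(int channel, int *value_out);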
This works well together with classic state machines. A typical main would be:
void main (void)
{
    for (;;)
    {
        serve_watchdog();

        result = state_machine();
        if (result != good)
        {
            error_handler(result);
        }
    }
}
You should not use malloc in bare-metal or RTOS microcontroller applications, not so much for safety reasons, but simply because it doesn't make any sense whatsoever to use it. Apply common sense when programming.
Use setjmp(3) to set a recovery point, and longjmp(3) to jump to it, restoring the stack to what it was at the setjmp point. It won't free malloc'ed memory.
Generally, it is not a good idea to use malloc/free in an embedded program if it can be avoided. For example, a static array may be adequate, or even using alloca() is marginally better.
To minimize stack usage:
Write the program so that calls are made "in parallel" rather than deeply nested (a function calling a sub-function that calls a sub-function that calls a sub-function...). I.e. the top-level function calls a sub-function, which promptly returns with status info; the top-level function then calls the next sub-function, and so on.
The nested method of program architecture (bad when stack-limited):

top level function
    second level function
        third level function
            fourth level function

should be avoided in embedded systems.
The preferred method of program architecture for embedded systems is:

top level function (the reset event handler)
    (variations in the following depending on whether it is a 'warm' or 'cold' start)
    initialize hardware
    initialize peripherals
    initialize communication I/O
    initialize interrupts
    initialize status info
    enable interrupts
    enter background processing

interrupt handler
    re-enable the interrupt
    using the 'scheduler'
        select a foreground function
        trigger dispatch for the selected foreground function
    return from interrupt

background processing
    (this can be, and often is, implemented as a 'state' machine rather than a loop)
    loop:
        if status info indicates need to call second level function 1
            second level function 1, which updates status info
        if status info indicates need to call second level function 2
            second level function 2, which updates status info
        etc
    end loop

Note that, as much as possible, there is no 'third level function x'.
Note that the foreground functions must complete before they are again scheduled.
Note: there are lots of other details that I have omitted in the above, like
kicking the watchdog,
the other interrupt events,
'critical' code sections and use of mutex(),
considerations between 'soft real-time' and 'hard real-time',
context switching
continuous BIT, commanded BIT, and error handling
etc

why does clearing the interrupt flag cause a segmentation fault in C?

I am learning some basics of Assembly and C. For learning purposes I decided to write a simple program that disables interrupts, so that when the user wants to type something in the console he/she can't:
#include <stdio.h>

int main(){
    int a;
    printf("enter your number : ");
    asm ("cli");
    scanf("%d", &a);
    printf("your number is %d\n", a);
    return 0;
}
but when I compile this with GCC and run it, I get a segmentation fault:
Segmentation fault (core dumped)
And when I debug it with gdb, I get this message when the program reaches the asm("cli"); line:
Program received signal SIGSEGV, Segmentation fault.
main () at cli.c:6
6 asm ("cli");
This is happening because you can't disable interrupts from a user-space program. All interrupts are under the control of the kernel; you need to do it from kernel space. Before you do, you need to learn kernel internals first: playing with interrupts is critical and, as far as I know, requires deeper knowledge of the kernel.
You need to write a kernel module that can interact with user space through a /dev/ (or some other) interface. The user-space code should request that the kernel module disable interrupts.
cli is a privileged instruction. It raises a #GP(0) exception "If the CPL is greater (has less privilege) than the IOPL of the current program or procedure". This #GP is what causes Linux to deliver a SIGSEGV to your process.
Under Linux, you could make an iopl(3) system call to raise your IO priv level to match your ring 3 CPL, and then you could disable interrupts from user-space. (But don't do this, it's not supported AFAIK. The intended use-case for iopl is to use in and out instructions from user-space with high port numbers, not cli/sti. x86 just happens to use the same permissions for both.)
You'll probably crash your system if you don't re-enable interrupts right away, or maybe even if you do. Or at least screw up that CPU on a multi-core system. Basically, don't do this unless you're ready to press the reset button, i.e. you've shut down X11, saved your files and run sync. Also remount your filesystems read-only.
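Purely for illustration, that unsupported user-space route would look roughly like this (Linux/x86, must run as root, and all the warnings above apply):

#include <stdio.h>
#include <sys/io.h>     // iopl()

int main(void)
{
    if (iopl(3) != 0) {             // raise IOPL to 3; needs root / CAP_SYS_RAWIO
        perror("iopl");
        return 1;
    }
    asm volatile ("cli");           // now permitted: CPL <= IOPL
    /* keep this window extremely short */
    asm volatile ("sti");
    return 0;
}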
Or try it in a virtual machine or simulator like BOCHS that will let you break in with a debugger even while interrupts are disabled. Or try it while booted from a USB stick.
Note that disabling interrupts only disables external interrupts. Software-generated interrupts like int $0x80 are still taken, but making system calls with interrupts disabled is probably an even worse idea. (It might work, though. The kernel saves/restores EFLAGS, so it probably won't return to user-space with interrupts re-enabled. Still, leaving interrupts disabled for a long time is a Bad Thing for interrupt latency.)
If you want to play around with disabling interrupts as a beginner, you should probably do it from a toy boot-sector program that uses BIOS calls for I/O. Or just look in the Linux kernel source for some places where it disables/enables interrupts if you're curious why it might do that.
IMO, "normal" asm in user-space is plenty interesting. With performance counters, you can see the details of how the CPU decodes and executes instructions. See links in the x86 tag wiki for manuals, guides, and performance tuning info.

How do I ensure my program runs from beginning to end without interruption?

I'm attempting to time code using RDTSC (no other profiling software I've tried is able to time to the resolution I need) on Ubuntu 8.10. However, I keep getting outliers from task switches and interrupts firing, which are causing my statistics to be invalid.
Considering my program runs in a matter of milliseconds, is it possible to disable all interrupts (which would inherently switch off task switches) in my environment? Or do I need to go to an OS which allows me more power? Would I be better off using my own OS kernel to perform this timing code? I am attempting to prove an algorithm's best/worst case performance, so it must be totally solid with timing.
The relevant code I'm using currently is:
inline uint64_t rdtsc()
{
    uint64_t ret;
    // "=A" reads the edx:eax pair as a 64-bit value; this is only correct for 32-bit x86 builds
    asm volatile("rdtsc" : "=A" (ret));
    return ret;
}

void test(int readable_out, uint32_t start, uint32_t end, uint32_t (*fn)(uint32_t, uint32_t))
{
    int i;
    for (i = 0; i <= 100; i++)
    {
        uint64_t clock1 = rdtsc();
        uint32_t ans = fn(start, end);
        uint64_t clock2 = rdtsc();
        uint64_t diff = clock2 - clock1;

        if (readable_out)
            printf("[%3d]\t\t%u [%llu]\n", i, ans, diff);
        else
            printf("%llu\n", diff);
    }
}
Extra points to those who notice I'm not properly handling overflow conditions in this code. At this stage I'm just trying to get a consistent output without sudden jumps due to my program losing the timeslice.
The nice value for my program is -20.
So to recap, is it possible for me to run this code without interruption from the OS? Or am I going to need to run it on bare hardware in ring0, so I can disable IRQs and scheduling? Thanks in advance!
If you call nanosleep() to sleep for a second or so immediately before each iteration of the test, you should get a "fresh" timeslice for each test. If you compile your kernel with 100HZ timer interrupts, and your timed function completes in under 10ms, then you should be able to avoid timer interrupts hitting you that way.
To minimise other interrupts, deconfigure all network devices, configure your system without swap and make sure it's otherwise quiescent.
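The sleep-before-each-iteration idea is only a couple of lines; a sketch (fresh_timeslice is just an illustrative helper name):

#include <time.h>

// sleep ~1 s so the next timed iteration starts on a fresh timeslice
static void fresh_timeslice(void)
{
    struct timespec ts = { 1, 0 };   // 1 second
    nanosleep(&ts, NULL);
}

Call it at the top of the loop in test(), before the first rdtsc().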
Tricky. I don't think you can turn the operating system 'off' and guarantee strict scheduling.
I would turn this upside down: given that it runs so fast, run it many times to collect a distribution of outcomes. Given that standard Ubuntu Linux is not a real-time OS in the narrow sense, all alternative algorithms would run in the same setup --- and you can then compare your distributions (using anything from summary statistics to quantiles to qqplots). You can do that comparison with Python, or R, or Octave, ... whichever suits you best.
You might be able to get away with running FreeDOS, since it's a single process OS.
Here's the relevant text from the second link:
Microsoft's DOS implementation, which is the de facto standard for DOS systems in the x86 world, is a single-user, single-tasking operating system. It provides raw access to hardware, and only a minimal layer for OS APIs for things like the file I/O. This is a good thing when it comes to embedded systems, because you often just need to get something done without an operating system in your way.
DOS has (natively) no concept of threads and no concept of multiple, on-going processes. Application software makes system calls via the use of an interrupt interface, calling various hardware interrupts to handle things like video and audio, and calling software interrupts to handle various things like reading a directory, executing a file, and so forth.
Of course, you'll probably get the best performance actually booting FreeDOS onto actual hardware, not in an emulator.
I haven't actually used FreeDOS, but I assume that since your program seems to be standard C, you'll be able to use whatever the standard compiler is for FreeDOS.
If your program runs in milliseconds, and if you are running on Linux:
Make sure that your timer frequency (on Linux) is set to 100Hz (not 1000Hz).
(cd /usr/src/linux; make menuconfig, and look at "Processor type and features" -> "Timer frequency")
This way your CPU will get interrupted every 10ms.
Furthermore, consider that the default CPU time slice on Linux is 100ms, so with a nice level of -20 you will not get descheduled if you only run for a few milliseconds.
Also, you are looping 101 times on fn(). Consider making fn() a no-op to calibrate your system properly.
Compute statistics (average + stddev) instead of printing too many times (printing would consume your scheduled timeslice, and the terminal will eventually get scheduled, etc.; avoid that).
RDTSC benchmark sample code
You can use chrt -f 99 ./test to run ./test with the maximum realtime priority. Then at least it won't be interrupted by other user-space processes.
Also, installing the linux-rt package will install a real-time kernel, which will give you more control over interrupt handler priority via threaded interrupts.
If you run as root, you can call sched_setscheduler() and give yourself a real-time priority. Check the documentation.
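A minimal sketch of that call (needs root, or CAP_SYS_NICE on newer kernels):

#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 99 };   // maximum SCHED_FIFO priority

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {  // pid 0 = the calling process
        perror("sched_setscheduler");
        return 1;
    }

    // run the timing loop here; normal (SCHED_OTHER) tasks can no longer preempt it
    return 0;
}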
Maybe there is some way to disable preemptive scheduling on linux, but it might not be needed. You could potentially use information from /proc/<pid>/schedstat or some other object in /proc to sense when you have been preempted, and disregard those timing samples.

Exception wrapper for Carbon C app in OSX

How can I efficiently catch and handle segmentation faults from C in an OSX Carbon application?
Background: I am making an OSX Carbon application. I must call a library function from a third party. Because of threading issues, the function can occasionally crash, usually because it's updating itself from one thread, and it's got some internally stale pointer or handle as I query it from another. The function is a black box to me. I want to be able to call the function but be able to "catch" if it has crashed and supply an alternative return.
In Windows, I can use the simple Visual C and Intel C compiler's __try{} and __except.
/* Working Windows Example */
__try { x=DangerousFunction(y);}
__except(EXCEPTION_EXECUTE_HANDLER) {x=0.0;} /* whups, func crashed! */
I am trying to make the same kind of crash-catcher for OSX. I am using pure C on a very large application. I call the function millions of times per second, so efficiency is very important too. (Impressively, the Windows __try() overhead is immeasurably small!)
Here's what I have experimented with:
1) C++ exceptions. I am not sure if C++ exceptions catch the segfault crashes. And my app is currently C. I could try wrappers and #ifdefs to make it C++ but this is a lot of work for the app, and I don't think C++ exceptions will catch the crash.
2) signal + setjmp + longjmp. I thought this would work... it's what it's designed for. But I set up my SEGV error handler [in fact I set it up for every signal!] and it's never called during the crash. I can manually test (and succeed) when calling raise(SEGV). But the crashes don't seem to actually call it. My thoughts are that CFM applications do NOT have access to the full BSD signals, only a subset, and that Mach applications are necessary for the Real Thing.
3) MPSetExceptionHandler. Not well documented. I attempted to set a handler. It compiled and ran, but did not catch the segfault.
Are you sure you're not getting a SIGBUS rather than a SIGSEGV?
The code below catches the SIGBUS caused by trying to write to memory location 0:
cristi:tmp diciu$ cat test.c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void sigac(int sig)
{
    printf("sig action here, signal is %d\n", sig);
    exit(1);
}

int main()
{
    (void)signal(SIGSEGV, sigac);
    (void)signal(SIGBUS, sigac);
    printf("Raising\n");
    strcpy(0, "aaksdjkajskd|");   /* deliberate write through a null pointer */
    return 0;
}
cristi:tmp diciu$ ./a.out
Raising
sig action here, signal is 10
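Once the signal is being delivered, the "supply an alternative return" part can be built with sigsetjmp/siglongjmp. A sketch (strictly speaking, jumping out of a SIGSEGV/SIGBUS handler is undefined behaviour, it is not thread-safe as written, and the crashed library may be left in a broken state):

#include <setjmp.h>
#include <signal.h>

static sigjmp_buf crash_jump;

static void crash_handler(int sig)
{
    siglongjmp(crash_jump, 1);            // unwind back to the call site
}

double call_dangerous(double (*fn)(double), double y)
{
    signal(SIGSEGV, crash_handler);
    signal(SIGBUS,  crash_handler);

    if (sigsetjmp(crash_jump, 1) != 0)
        return 0.0;                       // whups, func crashed! supply a fallback

    return fn(y);                         // normal path
}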
