Saving registers state in COM program - c

I disassembled a simple DOS .COM program and there was some code which saves and restores register values:
PUSH AX ; this is the first instruction
PUSH CX
....
POP CX
POP AX
MOV AX, 0x4C00 ; AH = 0x4C (terminate), AL = 0x00 (exit code)
INT 0x21 ; call DOS interrupt 21h => END
This is very similar to function prologue and epilogue in C programs. But prologues are added automatically by compiler, and the program above was written manually in assembler, so the programmer took full responsibility for saving and restoring values in this code.
My question is what will happen if I unintentionally forgot to save some registers in my program?
And what if I intentionally replace these instructions with NOPs in a hex editor? Will this lead to a program crash? And why is the called function responsible for saving the outer context on the stack? From my point of view this should be done somehow in the calling function, to prevent problems if I use 3rd party libraries and poorly written code which may break my program execution.

One problem of making the calling function save all of its working registers before calling another function is that sometimes a function is interrupted (e.g. by a hardware interrupt) without its knowledge. In DOS, for example, there was that pesky 55 millisecond timer tick. About 18.2 times per second, a hardware interrupt would transfer control from whatever code was executing to the timer tick handler. This happened automatically unless your program specifically disabled interrupts.
The timer tick handler would then save all of the registers it was going to use, do its work, and then restore the registers it saved before returning.
Sure, you could say that interrupt handlers are special, but why? Even with the paucity of registers on the 8086 (AX, BX, CX, DX, SI, DI, Flags -- did I forget anything? I purposely didn't include the segment registers), making a function save its entire state before transferring control means that you'd be using a lot of unnecessary stack space and execution cycles to save things because they might be modified. But if the called function is responsible for saving just the registers it uses, and it only uses AX and CX, then it can save just those two registers. It makes for smaller and faster code, and much less stack space usage.
When you start talking about call hierarchies that are many levels deep, the difference between pushing 8 registers rather than 2 registers adds up pretty quickly.
Consider the x86-64, with its 16 general purpose registers. Do you really think a function should be forced to save all 16 of those registers before calling another function, even when the called function only uses two of them? Saving 16 64-bit registers requires 128 bytes of stack space, as opposed to only 16 bytes for saving two registers.
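The arithmetic behind this comparison is just registers saved, times register width, times call depth. A minimal sketch, assuming every level of the call hierarchy pushes the same number of registers (the function name save_cost_bytes is made up for illustration):

```c
#include <stddef.h>

/* Bytes of stack consumed by register saves alone, assuming every call
   level pushes the same number of registers (a simplification). */
size_t save_cost_bytes(size_t regs_saved, size_t reg_bytes, size_t call_depth)
{
    return regs_saved * reg_bytes * call_depth;
}
```

For instance, on a 16-bit machine a call chain ten levels deep costs save_cost_bytes(8, 2, 10) bytes when every level pushes eight registers, but only save_cost_bytes(2, 2, 10) when each level pushes just the two it uses.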
The primary point of writing things in assembly language these days is to write faster and smaller code than what a compiler can write. A guiding principle is don't do more work than you have to. That means it's up to you to know what registers your assembly language function is using, and to save those registers on entry and restore them on exit.

If you don't want to guard against forgetting what to push or pop I would advise sticking to a higher level language.
In assembler, if the function is your own then you should save and restore all registers you use within the function except those which return an output from the function. If others wrote the function, look up its documentation. If in doubt, save/restore registers before/after calling the function (except those which are supposed to return a value).

Since the DOS Terminate function does not rely on any register settings (other than AX) for its operation (*) both pushes/pops in the code you have posted seem superfluous. You should however be aware that the programmer could have pushed these values for the purpose of using them locally! So replacing both these pushes by NOP in HEX editor is surely a bad idea. You could however replace both pops by NOP because at that point in the program the restoration of AX/CX as well as balancing the stack are unnecessary because of (*).
Since your question is about saving registers on the program level the answer must be that pushing/popping registers for the sake of saving them is useless. Nothing bad will happen if you unintentionally forgot to save some registers in your program.

Related

Segmentation fault when attempting to print int value from x86 external function [duplicate]

I've noticed that a lot of calling conventions insist that [e]bx be preserved for the callee.
Now, I can understand why they'd preserve something like [e]sp or [e]bp, since that can mess up the callee's stack. I can also understand why you might want to preserve [e]si or [e]di since that can break the callee's string instructions if they aren't particularly careful.
But [e]bx? What on earth is so important about [e]bx? What makes [e]bx so special that multiple calling conventions insist that it be preserved throughout function calls?
Is there some sort of subtle bug/gotcha that can arise from messing with [e]bx?
Does modifying [e]bx somehow have a greater impact on the callee than modifying [e]dx or [e]cx for instance?
I just don't understand why so many calling conventions single out [e]bx for preservation.
Not all registers make good candidates for preserving:
Preserve?  Register  Notes
no         (e)ax     implicitly used in some instructions; return value
no         (e)dx     edx:eax is implicitly used in cdq, div, mul and in return values
-          (e)bx     generic register, usable in 16-bit addressing modes (base)
-          (e)cx     shift counts, used in loop, rep
-          (e)si     movs operations, usable in 16-bit addressing modes (index)
-          (e)di     movs operations, usable in 16-bit addressing modes (index)
must       (e)bp     frame pointer, usable in 16-bit addressing modes (base)
must       (e)sp     stack pointer, not addressable in 8086 (other than push/pop)
Looking at the table, two registers have good reason to be preserved and two have a reason not to be preserved. The accumulator (e)ax, for example, is the most often used register due to its short encodings. SI and DI make a logical register pair: on REP MOVS and other string operations, both are trashed.
In a half-and-half callee/caller saving paradigm, the only real question is whether bx/cx or si/di should be the preserved pair. In other calling conventions, it's just EDX, EAX and ECX that can be trashed.
EBX does have a few obscure implicit uses that are still relevant in modern code (e.g. CMPXCHG8B / CMPXCHG16B), but it's the least special register in 32/64-bit code.
EBX makes a good choice for a call-preserved register because it's rare that a function will need to save/restore EBX because they need EBX specifically, and not just any non-volatile register. As Brett Hale's answer points out, it makes EBX a great choice for the global offset table (GOT) pointer in ABIs that need one.
In 16-bit mode, addressing modes were limited to (any subset of) [BP|BX + DI|SI + disp8/disp16], so BX is definitely special there.
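The 16-bit restriction can be written down as a small predicate. This is only an illustrative sketch: the enum tags and the function name valid_addr16 are made up here, and the displacement is ignored because any disp8/disp16 may be combined with an allowed base/index pair.

```c
#include <stdbool.h>

/* Hypothetical register tags, used only by this sketch. */
enum reg16 { R_NONE, R_AX, R_BX, R_CX, R_DX, R_SI, R_DI, R_BP, R_SP };

/* True if [base + index + disp] is encodable in 16-bit mode:
   the base must be BX or BP (or absent) and the index must be
   SI or DI (or absent). */
bool valid_addr16(enum reg16 base, enum reg16 index)
{
    bool base_ok  = (base == R_NONE || base == R_BX || base == R_BP);
    bool index_ok = (index == R_NONE || index == R_SI || index == R_DI);
    return base_ok && index_ok;
}
```

So valid_addr16(R_BX, R_SI) holds, while valid_addr16(R_CX, R_NONE) does not, which is why BX and BP were worth keeping stable in 16-bit code.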
This is a compromise between not saving any of the registers and saving them all. Either saving none, or saving all, could have been proposed, but either extreme leads to inefficiencies caused by copying the contents to memory (the stack). Choosing to allow some registers to be preserved and some not, reduces the average cost of a function call.
One of the main reasons, certainly for the i386 ELF ABI, is that ebx is the register that holds the address of the global offset table (GOT) for position-independent code (PIC). See 3-35 of the specification for the details. It would be disruptive in the extreme if, say, shared library code had to restore the GOT pointer after every function call return.

Randomizing registers

If certain conditions are not met I want to crash my program by jumping to a random location. I also want to randomize the registers by statements like
asm("rdtsc \n");
asm ("movq %rax, %r15 \n");
...
asm ("xor %rbp, %r13 \n");
...
Is there a better/stealthier method to do this? I am concerned, because rdtsc is not a frequent statement in programs. Calling it continually generates similar results too. Beside this, can I somehow clear/randomize the stack content too?
If you just want to crash, your random choice of destination might jump somewhere legal. Just run the ud2 instruction (0F 0B), which is guaranteed to cause an invalid-instruction exception (leading to SIGILL) on every future x86 CPU. i.e. it's reserved, so no future instruction-set extension will ever use that two-byte sequence at the beginning of an instruction.
If you care about high-quality randomness to frustrate any potential backtrace or core dump, then call a random number generator to fill a buffer of random data (or just one 32bit random value which you repeat). Fill all the registers with that garbage data. In 32bit code, you could use a popa instruction to fill all the registers with that garbage data. In 64bit mode, you have to load them manually.
Then scribble over the stack with that data, so your program eventually stops with a segfault when you try to write to an unmapped address (because you've gone outside the stack area).
You could do that scribbling with a rep stosd or something.
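In C, the effect of rep stosd is simply a loop that stores one 32-bit value into consecutive slots. A bounded sketch (the name fill32 is made up; unlike the deliberate crash described above, this version stops at the end of the buffer it is given):

```c
#include <stddef.h>
#include <stdint.h>

/* Store `value` into `count` consecutive 32-bit slots, which is what
   rep stosd does with EAX, ECX and EDI. This version stays in bounds. */
void fill32(uint32_t *dst, size_t count, uint32_t value)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = value;
}
```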
As far as "stealthier", you'll need to be much more elaborate about what your threat model is, and what you're trying to stop anyone from learning / doing. i.e. defend against someone modifying your binary to not crash this way?
In addition to Peter Cordes' suggestions, I would add that the OP wants the code responsible for this obfuscation to stay out of scope (stealthier). The instruction causing the crash needs to be somewhere else, otherwise the obfuscation code will be obvious from a crash dump and the code will be easy to patch to remove the bomb.
A rather easy solution is to locate the RET opcode from a common library function such as read or strlen and JUMP there by pushing the address on the stack and executing a RET statement. This solution is not perfect: advanced debuggers exist that store the execution trace and will be able to backtrack to the obfuscator from the crash location. In order to defeat that, you may prefer to enter an infinite loop instead of crashing, but that loop can be easily found and removed.
You can also embed some complex code in your app that computes for a while by executing many different functions in a random manner and use that as a honey pot to jump to from the obfuscator.

call stack unwinding in ARM cortex m3

I would like to create a debugging tool which will help me debug better my application.
I'm working bare-bones (without an OS). using IAR embedded workbench on Atmel's SAM3.
I have a Watchdog timer, which calls a specific IRQ in case of timeout (This will be replaced with a software reset on release).
In the IRQ handler, I want to print out (UART) the stack trace, of where exactly the Watchdog timeout occurred.
I looked in the web, and I didn't find any implementation of that functionality.
Anyone has an idea on how to approach this kind of thing ?
EDIT: OK, I managed to grab the return address from the stack, so I know exactly where the WDT timeout occurred.
Unwinding the whole stack is not simple as it first appears, because each function pushes different amount of local variables into the stack.
The code I ended up with is this (for others who may find it useful):
void WDT_IrqHandler( void )
{
    uint32_t * WDT_Address;
    Wdt *pWdt = WDT ;
    volatile uint32_t dummy ;

    /* Locate the stacked return address; the +16 word offset depends on
       what this handler itself pushed, so it is specific to this build */
    WDT_Address = (uint32_t *) __get_MSP() + 16 ;
    LogFatal ("Watchdog Timer timeout,The Return Address is %#X", *WDT_Address);

    /* Clear status bit to acknowledge interrupt */
    dummy = pWdt->WDT_SR ;
}
ARM defines a pair of sections, .ARM.exidx and .ARM.extab, that contain enough information to unwind the stack without debug symbols. These sections exist for exception handling but you can use them to perform a backtrace as well. Add -funwind-tables to force GCC to include these sections.
To do this with ARM, you will need to tell your compiler to generate stack frames. For instance with gcc, check the option -mapcs-frame. It may not be the one you need, but this will be a start.
If you do not have this, it will be nearly impossible to "unroll" the stack, because you will need for each function the exact stack usage depending on parameters and local variables.
If you are looking for some example code, you can check dump_stack() in the Linux kernel sources, and trace it back to the related piece of code executed for ARM.
It should be pretty straight forward to follow execution. Not programmatically in your isr...
We know from the ARM ARM that on exception entry a Cortex-M3 pushes xPSR, ReturnAddress, LR (R14), R12, R3, R2, R1, and R0 on the stack. It then mangles the lr so it can detect a return from interrupt, and calls the entry point listed in the vector table. If you implement your isr in asm to control the stack, you can have a simple loop that disables the interrupt source (turns off the wdt, whatever, this is going to take some time) then goes into a loop to dump a portion of the stack.
From that dump you will see the lr/return address, the function/instruction that was interrupted, from a disassembly of your program you can then see what the compiler has placed on the stack for each function, subtract that off at each stage and go as far back as you like or as far back as you have printed the stack contents.
You could also make a copy of the stack in ram and dissect it later rather than doing such things in an isr (the copy still takes too much time but is less intrusive than waiting on the uart).
If all you are after is the address of the instruction that was interrupted, that is the most trivial task, just read that from the stack, it will be at a known place, and print it out.
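Reading the interrupted address from that known place can be sketched as follows. The enum names and the function interrupted_pc are made up for illustration; the word order is the hardware-stacked frame listed above, lowest address first. The pointer passed in must be the stack pointer as it was on handler entry, before the handler pushed anything of its own, which is the hard part in a C handler and the reason for writing the entry stub in asm.

```c
#include <stdint.h>

/* Word offsets within the frame the Cortex-M3 stacks on exception entry,
   counted from the stack pointer at handler entry (lowest address first). */
enum {
    FRAME_R0, FRAME_R1, FRAME_R2, FRAME_R3,
    FRAME_R12, FRAME_LR, FRAME_PC, FRAME_XPSR,
    FRAME_WORDS
};

/* `frame` must point at the hardware-stacked frame itself, not at the
   handler's current stack pointer after its own pushes. */
uint32_t interrupted_pc(const uint32_t *frame)
{
    return frame[FRAME_PC];
}
```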
Did I hear my name? :)
You will probably need a tiny bit of inline assembly. Just figure out the format of the stack frames, and which register holds the ordinary [1] stack pointer, and transfer the relevant values into C variables from which you can format strings for output to the UART.
It shouldn't be too tricky, but of course (being rather low-level) you need to pay attention to the details.
[1] As in "non-exception"; not sure if the ARM has different stacks for ordinary code and exceptions, actually.
Your watchdog timer can fire at any point, even when the stack does not contain enough information to unwind (e.g. stack space has been allocated for register spill, but the registers not copied yet).
For properly optimized code, you need debug info, period. All you can do from a watchdog timer is a register and stack dump in a format that is machine readable enough to allow conversion into a core dump for gdb.

Recognizing stack frames in a stack using saved EBP values

I would like to divide a stack to stack-frames by looking on the raw data on the stack. I thought to do so by finding a "linked list" of saved EBP pointers.
Can I assume that a (standard and commonly used) C compiler (e.g. gcc) will always update and save EBP on a function call in the function prologue?
pushl %ebp
movl %esp, %ebp
Or are there cases where some compilers might skip that part for functions that don't get any parameters and don't have local variables?
The x86 calling conventions and the Wiki article on function prologue don't help much with that.
Is there any better method to divide a stack to stack frames just by looking on its raw data?
Thanks!
Some versions of gcc have a -fomit-frame-pointer optimization option. If memory serves, it can be used even with parameters/local variables (they index directly off of ESP instead of using EBP). Unless I'm badly mistaken, MS VC++ can do roughly the same.
Offhand, I'm not sure of a way that's anywhere close to universally applicable. If you have code with debug info, it's usually pretty easy -- otherwise though...
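When frame pointers are present, the "linked list" of saved EBP values from the question can be walked with a short loop. The sketch below works on a captured stack image rather than live memory, and it simplifies addresses to word indices into that image; walk_frames and its parameters are names invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Walk a saved-frame-pointer chain in a captured stack image.
   Simplification: "addresses" are word indices into `stack`; slot [fp]
   holds the caller's saved frame pointer and slot [fp + 1] the return
   address, as the push ebp / mov ebp, esp prologue lays them out.
   Returns the number of return addresses written to `out`. */
size_t walk_frames(const uint32_t *stack, size_t words, size_t fp,
                   uint32_t *out, size_t max_out)
{
    size_t n = 0;
    while (n < max_out && fp + 1 < words) {
        out[n++] = stack[fp + 1];
        size_t next = stack[fp];
        if (next <= fp)   /* callers live at higher addresses; stop otherwise */
            break;
        fp = next;
    }
    return n;
}
```

The monotonicity check is what keeps the walk from looping forever on garbage, but it cannot tell a real frame from a function compiled without a frame pointer, which is the limitation this answer describes.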
Even with the framepointer optimized out, stackframes are often distinguishable by looking through stack memory for saved return addresses instead. Remember that a function call sequence in x86 always consists of:
call someFunc ; pushes return address (instr. following `call`)
...
someFunc:
push EBP ; if framepointer is used
mov EBP, ESP ; if framepointer is used
push <nonvolatile regs>
...
so your stack will always - even if the framepointers are missing - have return addresses in there.
How do you recognize a return address ?
to start with, on x86, instructions have different lengths. That means return addresses - unlike other pointers (!) - tend to be misaligned values. Statistically, 3/4 of them do not fall on a multiple of four.
Any misaligned pointer is a good candidate for a return address.
then, remember that call instructions on x86 have specific opcode formats; read a few bytes before the return address and check if you find a call opcode there (99% of the time, it's five bytes back for a direct call, and two or three bytes back for an indirect call: two for a call through a register such as call eax, three for a call through [reg + disp8]). If so, you've found a return address.
This is also a way to distinguish C++ vtables from return addresses by the way - vtable entrypoints you'll find on the stack, but looking "back" from those addresses you don't find call instructions.
With that method, you can get candidates for the call sequence out of the stack even without having symbols, framesize debugging information or anything.
The details of how to piece the actual call sequence together from those candidates are less straightforward though, you need a disassembler and some heuristics to trace potential call flows from the lowest-found return address all the way up to the last known program location. Maybe one day I'll blog about it ;-) though at this point I'd rather say that the margin of a stackoverflow posting is too small to contain this ...
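The "look back for a call opcode" test can be sketched as a predicate over a code buffer. It only covers three common 32-bit encodings (E8 rel32, and FF /2 in its register and [reg + disp8] forms), and the function name preceded_by_call is invented for this example; a real tool would need a proper disassembler, as noted above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Heuristic: does a call instruction end exactly at code[off]?
   Checks three common 32-bit encodings:
     E8 rel32          direct near call, 5 bytes
     FF /2, mod = 11   call through a register, 2 bytes
     FF /2, mod = 01   call through [reg + disp8] (no SIB), 3 bytes */
bool preceded_by_call(const uint8_t *code, size_t len, size_t off)
{
    if (off > len)
        return false;
    if (off >= 5 && code[off - 5] == 0xE8)
        return true;
    if (off >= 2 && code[off - 2] == 0xFF && (code[off - 1] & 0xF8) == 0xD0)
        return true;
    if (off >= 3 && code[off - 3] == 0xFF &&
        (code[off - 2] & 0xF8) == 0x50 && (code[off - 2] & 0x07) != 0x04)
        return true;
    return false;
}
```

Because the byte patterns can also occur inside other instructions or data, a match only makes the address a candidate; it does not prove that a call was executed there.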

How can I create a parallel stack and run a coroutine on it?

I decided I should try to implement coroutines (I think that's how I should call them) for fun and profits. I expect to have to use assembler, and probably some C if I want to make this actually useful for anything.
Bear in mind that this is for educational purposes. Using an already built coroutine library is too easy (and really no fun).
You guys know setjmp and longjmp? They allow you to unwind the stack up to a predefined location, and resume execution from there. However, they can't rewind to "later" on the stack. Only come back earlier.
jmp_buf checkpoint;
int retval = setjmp(checkpoint); // returns 0 the first time
/* lots of stuff, lots of calls, ... We're not even in the same frame anymore! */
longjmp(checkpoint, 0xcafebabe); // execution resumes where setjmp is, and now it returns 0xcafebabe instead of 0
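For reference, here is a small compilable version of that pattern using the standard <setjmp.h> names (the type is jmp_buf and it is passed without &). The helper names bail_out and run_with_checkpoint and the value 42 are made up for the example; setjmp is used as the controlling expression of a switch, which is one of the contexts the C standard explicitly allows.

```c
#include <setjmp.h>

static jmp_buf checkpoint;

/* Deep in the call chain: abandon everything and return to the setjmp. */
static void bail_out(int code)
{
    longjmp(checkpoint, code);
}

/* Returns 0 if bail_out was never reached, otherwise the code it passed. */
int run_with_checkpoint(int should_bail)
{
    switch (setjmp(checkpoint)) {
    case 0:                 /* direct return from setjmp */
        if (should_bail)
            bail_out(42);   /* does not return */
        return 0;
    case 42:                /* came back through longjmp(checkpoint, 42) */
        return 42;
    default:
        return -1;
    }
}
```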
What I'd like is a way to run, without threading, two functions on different stacks. (Obviously, only one runs at a time. No threading, I said.) These two functions must be able to resume the other's execution (and halt their own). Somewhat like if they were longjmping to the other. Once it returns to the other function, it must resume where it left (that is, during or after the call that gave control to the other function), a bit like how longjmp returns to setjmp.
This is how I thought it:
Function A creates and zeroes a parallel stack (allocates memory and all that).
Function A pushes all its registers to the current stack.
Function A sets the stack pointer and the base pointer to that new location, and pushes a mysterious data structure indicating where to jump back and where to set the instruction pointer back.
Function A zeroes most of its registers and sets the instruction pointer to the beginning of function B.
That's for the initialization. Now, the following situation will indefinitely loop:
Function B works on that stack, does whatever work it needs to.
Function B comes to a point where it needs to interrupt and give A control again.
Function B pushes all of its registers to its stack, takes the mysterious data structure A gave it at the very beginning, and sets the stack pointer and the instruction pointer to where A told it to. In the process, it hands back A a new, modified data structure that tells where to resume B.
Function A wakes up, popping back all the registers it pushed to its stack, and does work until it comes to a point where it needs to interrupt and give B control again.
All this sounds good to me. However, there is a number of things I'm not exactly at ease with.
Apparently, on good ol' x86, there was this pusha instruction that would send all registers to the stack. However, processor architectures evolve, and now with x86_64 we've got a lot more general-purpose registers, and likely several SSE registers. I couldn't find any evidence that pusha does push them. There are about 40 public registers in a modern x86 CPU. Do I have to do all the pushes myself? Moreover, there is no push for SSE registers (though there's bound to be an equivalent; I'm new to this whole "x86 assembler" thing).
Is changing the instruction pointer as easy as saying it? Can I do, like, mov rip, rax (Intel syntax)? Also, getting the value from it must be somewhat special as it constantly changes. If I do like mov rax, rip (Intel syntax again), will rip be positioned on the mov instruction, to the instruction after it, or somewhere between? It's just jmp foo. Dummy.
I've mentioned a mysterious data structure a few times. Up to now I've assumed it needs to contain at least three things: the base pointer, the stack pointer and the instruction pointer. Is there anything else?
Did I forget anything?
While I'd really like to understand how things work, I'm pretty sure there are a handful of libraries that do just that. Do you know any? Is there any POSIX- or BSD-defined standard way to do it, like pthread for threads?
Thanks for reading my question textwall.
You are correct that PUSHA won't work on x64: it raises a #UD exception there, since PUSHA only pushes the 16-bit or 32-bit general purpose registers. See the Intel manuals for all the info you ever wanted to know.
Setting RIP is simple, jmp rax will set RIP to RAX. To retrieve RIP, you could either get it at compile time if you already know all the coroutine exit origins, or you could get it at run time, you can make a call to the next address after that call. Like this:
a:
call b
b:
pop rax
RAX will now be b. This works because CALL pushes the address of the next instruction. This technique works on IA32 as well (although I'd suppose there's a nicer way to do it on x64, as it supports RIP-relative addressing, but I don't know of one). Of course if you make a function coroutine_yield, it can just intercept the caller address :)
Since you can't push all the registers to the stack in a single instruction, I wouldn't recommend storing the coroutine state on the stack, as that complicates things anyways. I think the nicest thing to do would be to allocate a data structure for every coroutine instance.
Why are you zeroing things in function A? That's probably not necessary.
Here's how I would approach the entire thing, trying to make it as simple as possible:
Create a structure coroutine_state that holds the following:
initarg
arg
registers (also contains the flags)
caller_registers
Create a function:
coroutine_state* coroutine_init(void (*coro_func)(coroutine_state*), void* initarg);
where coro_func is a pointer to the coroutine function body.
This function does the following:
allocate a coroutine_state structure cs
assign initarg to cs.initarg, these will be the initial argument to the coroutine
assign coro_func to cs.registers.rip
copy current flags to cs.registers (not registers, only flags, as we need some sane flags to prevent an apocalypse)
allocate some decent sized area for the coroutine's stack and assign that to cs.registers.rsp
return the pointer to the allocated coroutine_state structure
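The initialization steps above can be sketched in C as follows. Everything beyond the four fields named in the text is an assumption made for the sketch: the reg_file layout, the stack_base field, and the 64 KiB stack size. Capturing the current flags needs assembly and is left out, so rflags simply starts at zero here, and the cast from a function pointer to uintptr_t is not guaranteed by ISO C although common platforms accept it.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical saved-register block; the layout is an assumption. */
struct reg_file {
    uintptr_t rip, rsp, rflags;
    uintptr_t gpr[15];       /* remaining general-purpose registers */
};

typedef struct coroutine_state {
    void *initarg;
    void *arg;
    struct reg_file registers;
    struct reg_file caller_registers;
    void *stack_base;        /* extra field, kept so the stack can be freed */
} coroutine_state;

enum { CORO_STACK_BYTES = 64 * 1024 };   /* arbitrary size for the sketch */

coroutine_state *coroutine_init(void (*coro_func)(coroutine_state *), void *initarg)
{
    coroutine_state *cs = calloc(1, sizeof *cs);
    if (cs == NULL)
        return NULL;
    cs->stack_base = malloc(CORO_STACK_BYTES);
    if (cs->stack_base == NULL) {
        free(cs);
        return NULL;
    }
    cs->initarg = initarg;
    cs->registers.rip = (uintptr_t)coro_func;
    /* The stack grows down, so start at the top, rounded down to 16 bytes. */
    cs->registers.rsp =
        ((uintptr_t)cs->stack_base + CORO_STACK_BYTES) & ~(uintptr_t)15;
    return cs;
}
```

A matching cleanup routine would free stack_base and then the structure itself once the coroutine has terminated.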
Now we have another function:
void* coroutine_next(coroutine_state* cs, void* arg)
where cs is the structure returned from coroutine_init which represents a coroutine instance, and arg will be fed into the coroutine as it resumes execution.
This function is called by the coroutine invoker to pass in some new argument to the coroutine and resume it, the return value of this function is an arbitrary data structure returned (yielded) by the coroutine.
1. store all current flags/registers in cs.caller_registers except for RSP, see step 3.
2. store the arg in cs.arg
3. fix the invoker stack pointer (cs.caller_registers.rsp), adding 2*sizeof(void*) will fix it if you're lucky, you'd have to look this up to confirm it, you probably want this function to be stdcall so no registers are tampered with before calling it
4. mov rax, [rsp], assign RAX to cs.caller_registers.rip; explanation: unless your compiler is on crack, [RSP] will hold the instruction pointer to the instruction that follows the call instruction that called this function (ie: the return address)
5. load the flags and registers from cs.registers
6. jmp cs.registers.rip, effectively resuming execution of the coroutine
Note that we never return from this function, the coroutine we jump to "returns" for us (see coroutine_yield). Also note that inside this function you may run into many complications such as function prologue and epilogue generated by the C compiler, and perhaps register arguments, you have to take care of all this. Like I said, stdcall will save you lots of trouble, I think gcc's -fomit-frame-pointer will remove the epilogue stuff.
The last function is declared as:
void coroutine_yield(void* ret);
This function is called inside the coroutine to "pause" execution of the coroutine and return to the caller of coroutine_next.
1. store flags/registers in cs.registers
2. fix coroutine stack pointer (cs.registers.rsp), once again, add 2*sizeof(void*) to it, and you want this function to be stdcall as well
3. mov rax, ret (let's just pretend all the functions in your compiler return their values in RAX)
4. load flags/registers from cs.caller_registers
5. jmp cs.caller_registers.rip. This essentially returns from the coroutine_next call on the coroutine invoker's stack frame, and since the return value is passed in RAX, we returned ret. Let's just say if ret is NULL, then the coroutine has terminated, otherwise it's an arbitrary data structure.
So to recap, you initialize a coroutine using coroutine_init, then you can repeatedly invoke the instantiated coroutine with coroutine_next.
The coroutine's function itself is declared:
void my_coro(coroutine_state* cs)
cs.initarg holds the initial function argument (think constructor). Each time my_coro is called, cs.arg has a different argument that was specified by coroutine_next. This is how the coroutine invoker communicates with the coroutine. Finally, every time the coroutine wants to pause itself, it calls coroutine_yield, and passes one argument to it, which is the return value to the coroutine invoker.
Okay, you may now think "that's easy!", but I left out all the complications of loading the registers and flags in the correct order while still maintaining a non-corrupt stack frame and somehow keeping the address of your coroutine data structure (you just overwrote all your registers), in a thread-safe manner. For that part you will need to find out how your compiler works internally... good luck :)
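For comparison, the same init / next / yield shape can be tried without hand-written register juggling by using the ucontext functions (getcontext, makecontext, swapcontext). This is a swapped-in technique, not the design described above: these functions were specified by older POSIX editions and later removed from the standard, though glibc on Linux still provides them, which is the assumption here. The names coro_yield, counter_body and run_counter are made up for the sketch.

```c
#include <ucontext.h>

static ucontext_t main_ctx, coro_ctx;
static char coro_stack[64 * 1024];   /* the coroutine's separate stack */
static int yielded;                  /* value handed to the invoker */
static int finished;

static void coro_yield(int value)
{
    yielded = value;
    swapcontext(&coro_ctx, &main_ctx);   /* save coroutine, resume invoker */
}

static void counter_body(void)
{
    for (int i = 1; i <= 3; i++)
        coro_yield(i * 10);
    finished = 1;                        /* returning follows uc_link */
}

/* Runs the coroutine to completion and returns the sum of yielded values. */
int run_counter(void)
{
    int sum = 0;
    finished = 0;
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp = coro_stack;
    coro_ctx.uc_stack.ss_size = sizeof coro_stack;
    coro_ctx.uc_link = &main_ctx;        /* resumed when the body returns */
    makecontext(&coro_ctx, counter_body, 0);

    for (;;) {
        swapcontext(&main_ctx, &coro_ctx);   /* resume the coroutine */
        if (finished)
            break;
        sum += yielded;
    }
    return sum;
}
```

Here swapcontext plays the role of both coroutine_next and coroutine_yield, and the library takes care of which registers need saving.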
Good learning reference: libcoroutine, especially their setjmp/longjmp implementation. I know it's not fun to use an existing library, but you can at least get a general bearing on where you are going.
Simon Tatham has an interesting implementation of coroutines in C that doesn't require any architecture-specific knowledge or stack fiddling. It's not exactly what you're after, but I thought it might nonetheless be of at least academic interest.
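The idea behind that approach is to keep the resume point in a variable and jump back into the middle of the function with a switch. The following is a minimal sketch in that style, not a copy of his macros; next_value is a name made up for the example.

```c
/* Yields 0, 1, ..., limit-1 on successive calls, then -1 forever.
   State lives in static variables, so this sketch is not reentrant. */
int next_value(int limit)
{
    static int state = 0;   /* which resume point to jump to */
    static int i;

    switch (state) {
    case 0:
        for (i = 0; i < limit; i++) {
            state = 1;
            return i;       /* "yield" */
    case 1:;                /* resume here on the next call */
        }
    }
    state = 2;              /* no matching case: later calls skip the switch */
    return -1;
}
```

Because nothing is kept on the stack between calls, no second stack is needed, at the price of turning every local that must survive a yield into static (or heap) state.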
Boost.Coroutine (built on Boost.Context), available at boost.org, does all of this for you.

Resources