I'm doing some experimenting and would like to be able to see what is saved on the stack during a system call (the saved state of the user land process). According to http://lxr.linux.no/#linux+v2.6.30.1/arch/x86/kernel/entry_32.S it shows that the various values of registers are saved at those particular offsets to the stack pointer. Here is the code I have been trying to use to examine what is saved on the stack (this is in a custom system call I have created):
asm("movl 0x1C(%esp), %ecx");
asm("movl %%ecx, %0" : "=r" (value));
where value is an unsigned long.
As of right now, this value is not what is expected (it is showing a 0 is saved for the user value of ds).
Am I correctly accessing the offset of the stack pointer?
Another possibility might be could I use a debugger such as GDB to examine the stack contents while in the kernel? I don't have much extensive use with debugging and am not sure of how to debug code inside the kernel. Any help is much appreciated.
No need for inline assembly. The saved state that entry_32.S pushes onto the stack for a syscall is laid out as a struct pt_regs, and you can get a pointer to it like this (you'll need to include <asm/ptrace.h> and/or <asm/processor.h> either directly or indirectly):
struct pt_regs *regs = task_pt_regs(current);
Inline assembly is trickier than it seems. Trying to shortly cover the concerns for GCC:
If it modifies processor registers, it's necessary to put these registers on the clobber list. It's important to note that the clobber list must contain ALL registers that you changed directly (read explicitly) or indirectly (read implicitly);
To reinforce (1), conditional and mathematical operations also change registers, more known as status flags (zero, carry, overflow, etc), so you have to inform it by adding "cc" to the clobber list;
Add "memory" if it modifies different (read random) memory positions;
Add the volatile keyword if it modifies memory that isn't mentioned on the input/output arguments;
Then, your code becomes:
asm("movl 0x1C(%%esp), %0;"
: "=r" (value)
: /* no inputs :) */
/* no modified registers */
);
The output argument isn't required to be on the clobber list because GCC already knows it will be changed.
Alternatively, since all you want is the value of ESP register, you can avoid all the pain doing this:
register int esp asm("esp");
esp += 0x1C;
It might not solve your problem, but it's the way to go. For reference, check this, this and this.
Keep in mind that x86_64 code will often pass values in registers (since it has so many) so nothing will be on the stack. Check the gcc intermediate output (-S IIRC) and look for push in the assembly.
I'm not familiar with debugging kernel code, but gdb is definitely nicer to examine the stack interactively.
Related
I'm trying to integrate my assembly code into c programs to make it easier to access.
I try to run the following code (I'm on an x64 64 bit architecture)
void push(long address) {
__asm__ __volatile__("movq %0, %%rax;"
"push %%rax"::"r"(address));
}
The value of $rsp doesn't seem to change (neither does esp for that matter). Am I missing something obvious about how constraints work? rax is getting correctly allocated with address, but address never seems to get pushed onto the stack?
You can't do that.
Inline asm must document to the compiler the inputs it takes, the outputs it produces, and any other state it clobbers as part of its execution. Yours fails to do so, but perhaps more to the point, there is no way you could possibly be allowed to clobber the stack pointer like you're doing, since the surrounding code, when it regains control after the asm block, would have no way to find any of its data - even if it had saved it on the stack knowing it would be clobbered, it would have no way to get it back.
I'm not sure what you're trying to do, but whatever it is, this is not the way to do it.
I wanted to understand how arm link register works and how is it helpful in debugging.
I started by writing a simple function.
#define MACRO_TEST() (event_log__add_args(MACRO_TEST,__return_address()))
static void do_print_r14(void) {
printf_all("return address 0x%08X \n",__return_address()); //prints 0x823194BB
MACRO_TEST();
printf_all("return address 0x%08X \n",__return_address()); //prints 0x823194BB
}
The event log prints the following :
Return Address: 0x0000ABAB
My question is why the prints in the do_print_r14 function prints the same value.
Wouldn't it be more helpful if I just login the Line number and function name , that
will point to the exact location of the code. Why do developers use r14 in debugging?
This question might sound very basic to you all but I am not at all sure why we need r14 register.
The line number and function name tell you where you are, but the return address tells you how you got there - that can be very useful in more complex code. When debugging pure assembly, heavily optimised code, or other situations where you might not have an intelligible stack frame, sometimes inspecting the link register is the only way to know that address.
Of course, this is always relative to the current function, thus having "print the return address" wrapped up in its own little function is self-defeating since it will then only tell you where you called that function from (assuming the compiler hasn't decided to inline it), in which case you indeed may as well have used __LINE__ instead of the call.
Now the subtle but important point: the intent of the __return_address() intrinsic is "the return address of the current function", which is by no means the same thing as "the current contents of R14" - if the compiler has saved the return address on entry it's then free to use R14 for whatever it wants. The two lines in do_print_r14() both print 0x823194BB because that is where do_print_r14() was called from, and during that call nothing is going to change that. If you want to look below the abstraction level of the C environment and see what actually happens to R14 during execution, you'll need to use inline assembly tricks, or step through with a debugger.
r14 is the address that the CPU will jump to when returning from the function. Thus, calls to __return_address will yield the same value.
Perhaps a better demonstration of this would be something like:
static void do_print_r14(void) {
printf_all("return address 0x%08X \n",__return_address());
}
static void test_r14(void) {
printf_all("first call...\n");
do_print_r14();
printf_all("second call...\n");
do_print_r14();
}
Here, two different values will be printed (corresponding to the two different places do_print_r14 is called).
Usage of LR is defined by ARM Architecture and ARM AAPCS. Link register's help on is merely a side effect rather than a feature.
From AAPCS:
5.3 Subroutine Calls
Both the ARM and Thumb instruction sets contain a primitive subroutine call instruction, BL, which performs a branch-with-link operation. The effect of executing BL is to transfer the sequentially next value of the program counter—the return address—into the link register (LR) and the destination address into the program counter (PC).
I'm optimizing C code for OpenRISC and I want to manually prereserve some computed values in registers, the pseudocode looks like that:
external loop
compute eight values (heavy calculations)
internal loop
use values computed above
When I looked at GCC ABI for OpenRISC I saw two groups of registers: callee-saved and temporary? Which registers I should use to store these eight values? I mean, which registers I can put on clobbered list in inline asm?
I need to hardoce registers, because we run executables on custom OpenRISC.
The answer is: whatever you like.
If you use callee-save registers then the compiler will save them for you (as long as you do mark them as clobbered).
If you use temporary registers (a.k.a. caller-save) then the compiler will be forced to save them if you make a function call. Beware that the compiler also prefers to use these for other variables, so if you use up the caller-save ones it'll have to use callee-save for other things, so it might end up being much the same difference.
At the end of the day, if you are doing heavy calculations then saving a few registers to stack before you start is not going to be a big deal.
There are some registers that contain important values (such as stack pointer) that you must not overwrite. Others, such as the GOT table pointer are less important, and the compiler will restore the value when you're done (just be sure you don't need it during the process.
Really though, you don't need to work it out for yourself: the compiler can select registers for you:
int a, b, c;
asm volatile ("whatever" : "=&w" (a), "=&w" (b), "=&w" (c));
The variables are not needed, but they must have registers assigned, so they effectively reserve a register for whatever you want. The & indicates an "early-clobber", which means that they can't share the same register as an input register (not that my example shows any).
For some functions, I need to switch the stack so that the original stack remains unmodified. For that purpose, I have written two macros as shown below.
#define SAVE_STACK() __asm__ __volatile__ ( "mov %%rsp, %0; mov %1, %%rsp" : \
"=m" (saved_sp) : "m" (temp_sp) );
#define RESTORE_STACK() __asm__ __volatile__ ( "mov %0, %%rsp" : \
"=m" (saved_sp) );
Here temp_sp and saved_sp are thread local variables. temp_sp points to the makeshift stack that we use. For a function, whose original stack I want unmodified, I place SAVE_STACK at the beginning and RESTORE_STACK at bottom. For example, like this.
int some_func(int param1, int param2)
{
int a, b, r;
SAVE_STACK();
// Function Body here
.....................
RESTORE_STACK();
return r;
}
Now my question is whether this approach is fine. On x86 (64bit), the local variables and parameters are accessed through the rbp register and rsp is accordingly subtracted in function prologue and not touched until in function epilogue where it is added to bring it back to the original value. Therefore, I see no problem here.
I am not sure, if this is correct in the presence of context switches and signals though. (On Linux). Also I'm not sure if this is correct if the function is inlined or if tail call optimization (where jmp instead of call is used) is applied. Do you see any problem or side effects with this approach?
With the code that you've shown above, I can think of the following breakage:
On x86/x64, GCC will "deco" your function with prologues/epilogues if it sees fit, and you can't stop it from doing that (like on ARM, where __attribute__((__naked__)) forces code creation without prologues/epilogues, aka without stackframe setup).
That might end up allocating stack / creating references to stack memory locations before you switch the stack. Even worse if, again, due to the compiler's choice, such an address is put into a nonvolatile register before you switch the stack, it might alias to two locations (the stackpointer-relative one that you changed and the other-reg-relative one that is the same).
Again, on x86/x64, the ABI suggests an optimization for leaf functions (the "red zone") where no stackframe is allocated yet 128 Bytes of stack "below" the end are usable by the function. Unless your memory buffer takes this into account, overruns might occur that you're not expecting.
Signals are handled on alternate stacks (see sigaltstack()) and doing your own stack switching might make your code uncallable from within signal handlers. It'll definitely make it non-reentrant, and depending on where/how you retrieve the "stack location" will also definitely make it non-threadsafe.
In general, if you want to run a specific piece of code on a different stack, why not either:
run it in a different thread (every thread gets a different stack) ?
trigger e.g. SIGUSR1 and run your code in a signal handler (which you can configure to use a different stack) ?
run it via makecontext() / swapcontext() (see the example in the manpage) ?
Edit:
Since you say "you want to compare the memory of two processes", again, there's different methods for that, in particular external process tracing - attach a "debugger" (that can be a process you write yourself that uses ptrace() to control what you want to monitor, and have it handle e.g. breakpoints / checkpoints on behalf of those you trace, to perform the validations you need). That'd be more flexible as well because it doesn't require to change the code you inspect.
-fomit-frame-pointer is on by default. Unless you plan to compile with optimization disabled, the assumption that functions don't touch RSP except in prologue/epilogue is super broken.
Even if you did use -O3 -fno-omit-frame-pointer, compilers will still move RSP around in some cases, although they won't use it to access args and locals. e.g. alloc / C99 VLA, or even calling a function that has more than 6 args (or more precisely, one with args that don't fit in registers), will all move RSP. (Calling a function might just use mov stores, depending on strategy chosen by the compiler.)
Also, "shrink wrap" optimization where a function delays saving call-preserved regs until after a possible early-out could mean your stack-switch happens before the compiler is ready to save/restore. And your restore might happen too late or too early. (This was mentioned in comments by ams.)
I decided I should try to implement coroutines (I think that's how I should call them) for fun and profits. I expect to have to use assembler, and probably some C if I want to make this actually useful for anything.
Bear in mind that this is for educational purposes. Using an already built coroutine library is too easy (and really no fun).
You guys know setjmp and longjmp? They allow you to unwind the stack up to a predefined location, and resumes execution from there. However, it can't rewind to "later" on the stack. Only come back earlier.
jmpbuf_t checkpoint;
int retval = setjmp(&checkpoint); // returns 0 the first time
/* lots of stuff, lots of calls, ... We're not even in the same frame anymore! */
longjmp(checkpoint, 0xcafebabe); // execution resumes where setjmp is, and now it returns 0xcafebabe instead of 0
What I'd like is a way to run, without threading, two functions on different stacks. (Obviously, only one runs at a time. No threading, I said.) These two functions must be able to resume the other's execution (and halt their own). Somewhat like if they were longjmping to the other. Once it returns to the other function, it must resume where it left (that is, during or after the call that gave control to the other function), a bit like how longjmp returns to setjmp.
This is how I thought it:
Function A creates and zeroes a parallel stack (allocates memory and all that).
Function A pushes all its registers to the current stack.
Function A sets the stack pointer and the base pointer to that new location, and pushes a mysterious data structure indicating where to jump back and where to set the instruction pointer back.
Function A zeroes most of its registers and sets the instruction pointer to the beginning of function B.
That's for the initialization. Now, the following situation will indefinitely loop:
Function B works on that stack, does whatever work it needs to.
Function B comes to a point where it needs to interrupt and give A control again.
Function B pushes all of its registers to its stack, takes the mysterious data structure A gave it at the very beginning, and sets the stack pointer and the instruction pointer to where A told it to. In the process, it hands back A a new, modified data structure that tells where to resume B.
Function A wakes up, popping back all the registers it pushed to its stack, and does work until it comes to a point where it needs to interrupt and give B control again.
All this sounds good to me. However, there is a number of things I'm not exactly at ease with.
Apparently, on good ol' x86, there was this pusha instruction that would send all registers to the stack. However, processor architectures evolve, and now with x86_64 we've got a lot more general-purpose registers, and likely several SSE registers. I couldn't find any evidence that pusha does push them. There are about 40 public registers in a mordern x86 CPU. Do I have to do all the pushes myself? Moreover, there is no push for SSE registers (though there's bound to be an equivalent—I'm new to this whole "x86 assembler" thing).
Is changing the instruction pointer as easy as saying it? Can I do, like, mov rip, rax (Intel syntax)? Also, getting the value from it must be somewhat special as it constantly changes. If I do like mov rax, rip (Intel syntax again), will rip be positioned on the mov instruction, to the instruction after it, or somewhere between? It's just jmp foo. Dummy.
I've mentioned a mysterious data structure a few times. Up to now I've assumed it needs to contain at least three things: the base pointer, the stack pointer and the instruction pointer. Is there anything else?
Did I forget anything?
While I'd really like to understand how things work, I'm pretty sure there are a handful of libraries that do just that. Do you know any? Is there any POSIX- or BSD-defined standard way to do it, like pthread for threads?
Thanks for reading my question textwall.
You are correct in that PUSHA wont work on x64 it will raise the exception #UD, as PUSHA only pushes the 16-bit or 32-bit general purpose registers. See the Intel manuals for all the info you ever wanted to know.
Setting RIP is simple, jmp rax will set RIP to RAX. To retrieve RIP, you could either get it at compile time if you already know all the coroutine exit origins, or you could get it at run time, you can make a call to the next address after that call. Like this:
a:
call b
b:
pop rax
RAX will now be b. This works because CALL pushes the address of the next instruction. This technique works on IA32 as well (although I'd suppose there's a nicer way to do it on x64, as it supports RIP-relative addressing, but I don't know of one). Of course if you make a function coroutine_yield, it can just intercept the caller address :)
Since you can't push all the registers to the stack in a single instruction, I wouldn't recommend storing the coroutine state on the stack, as that complicates things anyways. I think the nicest thing to do would be to allocate a data structure for every coroutine instance.
Why are you zeroing things in function A? That's probably not necessary.
Here's how I would approach the entire thing, trying to make it as simple as possible:
Create a structure coroutine_state that holds the following:
initarg
arg
registers (also contains the flags)
caller_registers
Create a function:
coroutine_state* coroutine_init(void (*coro_func)(coroutine_state*), void* initarg);
where coro_func is a pointer to the coroutine function body.
This function does the following:
allocate a coroutine_state structure cs
assign initarg to cs.initarg, these will be the initial argument to the coroutine
assign coro_func to cs.registers.rip
copy current flags to cs.registers (not registers, only flags, as we need some sane flags to prevent an apocalypse)
allocate some decent sized area for the coroutine's stack and assign that to cs.registers.rsp
return the pointer to the allocated coroutine_state structure
Now we have another function:
void* coroutine_next(coroutine_state cs, void* arg)
where cs is the structure returned from coroutine_init which represents a coroutine instance, and arg will be fed into the coroutine as it resumes execution.
This function is called by the coroutine invoker to pass in some new argument to the coroutine and resume it, the return value of this function is an arbitrary data structure returned (yielded) by the coroutine.
store all current flags/registers in cs.caller_registers except for RSP, see step 3.
store the arg in cs.arg
fix the invoker stack pointer (cs.caller_registers.rsp), adding 2*sizeof(void*) will fix it if you're lucky, you'd have to look this up to confirm it, you probably want this function to be stdcall so no registers are tampered with before calling it
mov rax, [rsp], assign RAX to cs.caller_registers.rip; explanation: unless your compiler is on crack, [RSP] will hold the instruction pointer to the instruction that follows the call instruction that called this function (ie: the return address)
load the flags and registers from cs.registers
jmp cs.registers.rip, efectively resuming execution of the coroutine
Note that we never return from this function, the coroutine we jump to "returns" for us (see coroutine_yield). Also note that inside this function you may run into many complications such as function prologue and epilogue generated by the C compiler, and perhaps register arguments, you have to take care of all this. Like I said, stdcall will save you lots of trouble, I think gcc's -fomit-frame_pointer will remove the epilogue stuff.
The last function is declared as:
void coroutine_yield(void* ret);
This function is called inside the coroutine to "pause" execution of the coroutine and return to the caller of coroutine_next.
store flags/registers in cs.registers
fix coroutine stack pointer (cs.registers.rsp), once again, add 2*sizeof(void*) to it, and you want this function to be stdcall as well
mov rax, arg (lets just pretend all the functions in your compiler return their arguments in RAX)
load flags/registers from cs.caller_registers
jmp cs.caller_registers.rip This essentially returns from the coroutine_next call on the coroutine invoker's stack frame, and since the return value is passed in RAX, we returned arg. Let's just say if arg is NULL, then the coroutine has terminated, otherwise it's an arbitrary data structure.
So to recap, you initialize a coroutine using coroutine_init, then you can repeatedly invoke the instantiated coroutine with coroutine_next.
The coroutine's function itself is declared:
void my_coro(coroutine_state cs)
cs.initarg holds the initial function argument (think constructor). Each time my_coro is called, cs.arg has a different argument that was specified by coroutine_next. This is how the coroutine invoker communicates with the coroutine. Finally, every time the coroutine wants to pause itself, it calls coroutine_yield, and passes one argument to it, which is the return value to the coroutine invoker.
Okay, you may now think "thats easy!", but I left out all the complications of loading the registers and flags in the correct order while still maintaining a non corrupt stack frame and somehow keeping the address of your coroutine data structure (you just overwrote all your registers), in a thread-safe manner. For that part you will need to find out how your compiler works internally... good luck :)
Good learning reference: libcoroutine, especially their setjmp/longjmp implementation. I know its not fun to use an existing library, but you can at least get a general bearing on where you are going.
Simon Tatham has an interesting implementation of coroutines in C that doesn't require any architecture-specific knowledge or stack fiddling. It's not exactly what you're after, but I thought it might nonetheless be of at least academic interest.
boost.coroutine (boost.context) at boost.org does all for you