For some functions, I need to switch the stack so that the original stack remains unmodified. For that purpose, I have written two macros as shown below.
#define SAVE_STACK() __asm__ __volatile__ ( "mov %%rsp, %0; mov %1, %%rsp" : \
"=m" (saved_sp) : "m" (temp_sp) );
#define RESTORE_STACK() __asm__ __volatile__ ( "mov %0, %%rsp" : \
: "m" (saved_sp) );  /* saved_sp is read here, so it is an input operand */
Here temp_sp and saved_sp are thread-local variables; temp_sp points to the makeshift stack that we use. For a function whose original stack I want left unmodified, I place SAVE_STACK at the beginning and RESTORE_STACK at the bottom, like this:
int some_func(int param1, int param2)
{
int a, b, r;
SAVE_STACK();
// Function Body here
.....................
RESTORE_STACK();
return r;
}
Now my question is whether this approach is fine. On x86-64, local variables and parameters are accessed through the rbp register; rsp is subtracted in the function prologue and not touched again until the function epilogue, where it is added back to its original value. Therefore, I see no problem here.
I am not sure if this is correct in the presence of context switches and signals, though (on Linux). I'm also not sure it remains correct if the function is inlined, or if tail-call optimization (where jmp instead of call is used) is applied. Do you see any problems or side effects with this approach?
With the code that you've shown above, I can think of the following breakage:
On x86/x64, GCC will "decorate" your function with prologues/epilogues if it sees fit, and you can't stop it from doing that (unlike on ARM, where __attribute__((__naked__)) forces code generation without prologues/epilogues, a.k.a. without stack-frame setup).
That might end up allocating stack or creating references to stack memory locations before you switch the stack. Even worse, if the compiler chooses to put such an address into a nonvolatile register before you switch the stack, the same object can end up aliased through two locations (the stack-pointer-relative one, which your switch changed, and the register-relative one, which still points into the old stack).
Again, on x86/x64, the ABI specifies an optimization for leaf functions (the "red zone"): no stack frame is allocated, yet the 128 bytes of stack "below" the stack pointer are usable by the function. Unless your memory buffer takes this into account, overruns might occur that you're not expecting.
Signals can be handled on alternate stacks (see sigaltstack()), and doing your own stack switching might make your code uncallable from within signal handlers. It'll definitely make it non-reentrant, and depending on where/how you retrieve the "stack location", it will also make it non-threadsafe.
In general, if you want to run a specific piece of code on a different stack, why not either:
run it in a different thread (every thread gets a different stack) ?
trigger e.g. SIGUSR1 and run your code in a signal handler (which you can configure to use a different stack) ?
run it via makecontext() / swapcontext() (see the example in the manpage) ?
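For the last option, a minimal sketch along the lines of the example in the makecontext(3) manpage (error checking omitted; the function and variable names are made up):
#include <ucontext.h>

static ucontext_t main_ctx, func_ctx;
static char alt_stack[64 * 1024];

static void on_other_stack(void) {
    /* work that must not touch the original stack goes here;
       returning resumes main_ctx via uc_link */
}

void run_on_alt_stack(void) {
    getcontext(&func_ctx);
    func_ctx.uc_stack.ss_sp = alt_stack;
    func_ctx.uc_stack.ss_size = sizeof alt_stack;
    func_ctx.uc_link = &main_ctx;          /* where to go when done */
    makecontext(&func_ctx, on_other_stack, 0);
    swapcontext(&main_ctx, &func_ctx);     /* runs on_other_stack */
}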
Edit:
Since you say you want to compare the memory of two processes: again, there are different methods for that, in particular external process tracing. Attach a "debugger" (which can be a process you write yourself that uses ptrace() to control what you want to monitor) and have it handle e.g. breakpoints/checkpoints on behalf of the processes you trace, to perform the validations you need. That's more flexible as well, because it doesn't require changing the code you inspect.
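A rough sketch of that approach: a hypothetical peek_word() attaches to the tracee, reads one word of its memory, and detaches (waitpid status checks and error handling omitted):
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

long peek_word(pid_t pid, void *addr) {
    long word;
    ptrace(PTRACE_ATTACH, pid, NULL, NULL);
    waitpid(pid, NULL, 0);                        /* wait for the stop */
    word = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
    ptrace(PTRACE_DETACH, pid, NULL, NULL);
    return word;
}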
-fomit-frame-pointer is on by default when optimization is enabled. So unless you plan to compile with optimization disabled, the assumption that functions don't touch RSP except in the prologue/epilogue is super broken.
Even if you did use -O3 -fno-omit-frame-pointer, compilers will still move RSP around in some cases, although they won't use it to access args and locals. E.g. alloca or a C99 VLA, or even calling a function that has more than 6 args (or more precisely, one with args that don't fit in registers), will all move RSP. (Calling a function might just use mov stores, depending on the strategy chosen by the compiler.)
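For instance, a C99 VLA forces a mid-function RSP adjustment even when a frame pointer is in use; a minimal illustration, where consume() is an invented stand-in for any external function that keeps the buffer alive:
void consume(char *buf);        /* defined elsewhere, keeps buf alive */

void touches_rsp(int n) {
    char buf[n];                /* C99 VLA: the sub from RSP happens here,
                                   in the body, not in the prologue */
    consume(buf);
}                               /* RSP restored via the frame pointer */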
Also, "shrink wrap" optimization where a function delays saving call-preserved regs until after a possible early-out could mean your stack-switch happens before the compiler is ready to save/restore. And your restore might happen too late or too early. (This was mentioned in comments by ams.)
I'm trying to understand exactly how Google's DoNotOptimize() is supposed to work.
For completeness, here is its definition (for clang, and non-const data):
template <class Tp>
inline BENCHMARK_ALWAYS_INLINE void DoNotOptimize(Tp& value) {
asm volatile("" : "+r,m"(value) : : "memory");
}
As I understand we can use this in code like this:
start_time = time();
bench_output = run_bench(bench_inputs);
result = time() - start_time;
To ensure that the benchmark stays in the critical section:
start_time = time();
DoNotOptimize(bench_inputs);
bench_output = run_bench(bench_inputs);
DoNotOptimize(bench_output);
result = time() - start_time;
Specifically what I don't understand is why this guarantees (does it?) that run_bench() is not moved above start_time = time().
(Someone asked exactly this in this comment, however I don't understand the answer).
As I understand, the above DoNotOptimize() does several things:
It forces value to the stack, as it is passed by C++ reference. You can't have a pointer to a register, so it must be in memory.
Because value is now on the stack, subsequently clobbering memory (as done in the asm constraints) will force the compiler to assume that value is both read and written by the call to DoNotOptimize(value).
(it's not clear to me if the +r,m constraint is relevant. As far as I know this says that the pointer itself may be stored in a register or in memory, but the pointer value itself may be read and/or written.)
And this is where things get fuzzy for me.
If start_time is also stack allocated, the memory clobbering in DoNotOptimize() will mean that the compiler must assume that DoNotOptimize() potentially reads start_time. Therefore the order of the statements can only be:
start_time = time(); // on the stack
DoNotOptimize(bench_inputs); // reads start_time, writes bench_inputs
bench_output = run_bench(bench_inputs)
But if start_time is not stored in memory, but instead in a register, then clobbering memory will not clobber start_time, right? In that case the desired ordering of start_time = time() and DoNotOptimize(bench_inputs) is lost and the compiler is free to do:
DoNotOptimize(bench_inputs); // only writes bench_inputs
bench_output = run_bench(bench_inputs)
start_time = time(); // in a register
Obviously I've misunderstood something. Can anyone help explain? Thanks :)
I'm wondering if this is because reordering optimisations happen prior to register allocation, and thus everything is assumed to be stack allocated at that time. But if that were the case, then DoNotOptimize() would be redundant, as ClobberMemory() would be sufficient.
Summary: DoNotOptimize is ordered wrt. time() by the "memory" clobbers, as if it were another function call to an opaque function that could modify any global state.
DoNotOptimize is ordered wrt. the computation of output from input by the data dependency of the calculation on the input, and the output on the calculation, as Chandler Carruth explained in the Q&A you linked. The "memory" clobber is irrelevant for this part.
"memory" clobber is like a non-inline function call
DoNotOptimize's asm statement contains a "memory" clobber. As far as the optimizer is concerned, that's equivalent to an opaque function call: it has to be assumed to read and write every globally-reachable object [1]. (Even ones this compilation unit might not know about.)
Since time() itself doesn't have an inline definition in any header, it can't reorder with DoNotOptimize at compile time for the same reason that a compiler can't reorder calls to foo() and bar() when it can't see the definitions of those functions. Same reason compilers don't need any special logic to stop them from reordering puts("hi"); puts("mom");.
(A hypothetical time() that could inline and only contained an asm statement would have to use asm volatile to make sure repeated calls didn't just use the first one's output. asm volatile statements can't reorder with each other or accesses to volatile variables, so that would be ok too, for a different reason.)
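A sketch of such a hypothetical inlinable timer, reading the time-stamp counter (the name tick() is made up for the example):
static inline unsigned long long tick(void) {
    unsigned int lo, hi;
    /* volatile: two calls must each run their own rdtsc instead of
       being CSE'd into one */
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));   /* EDX:EAX = TSC */
    return ((unsigned long long)hi << 32) | lo;
}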
Footnote 1: Globally reachable = any object that might be pointed-to by any hypothetical global variable. i.e. anything except local variables within this function, or memory freshly allocated with new, if escape analysis can prove that nothing outside this function could have pointers to them.
How the asm statement works
I think you're pretty seriously misunderstanding how the asm works. "+r,m" tells the compiler to materialize the value in a register (or memory if it wants), and then use the value there at the end of the (empty) asm template as the new value of that C++ object.
So it forces the compiler to actually materialize (produce) the value somewhere, which means it has to be computed. And it means the compiler has to forget what it previously knew about the value (e.g. that it was a compile-time constant 5, or non-negative, or anything), because the "+" modifier declares a read/write operand.
The point of DoNotOptimize on the input is to defeat constant-propagation that would let the benchmark optimize away.
And on the output to make sure a final result is actually materialized in a register (or memory) instead of optimizing away all the computation leading to an unused result. (This is where being asm volatile is relevant; defeating constant-propagation still works with non-volatile asm.)
So the computation you want to benchmark has to happen between the two DoNotOptimize() statements, and separately those two statements can't reorder with time().
The compiler has to assume that the asm statement modifies the value like val ^= random for all it knows, along with changing the value in memory of any/every other object except for private locals that weren't operands; so e.g. the "memory" clobber doesn't stop the compiler from keeping a local loop counter in a register. (It doesn't special-case an empty asm template string here; programs don't contain asm statements like this by accident, so nobody wants them optimized away.)
Misconceptions about the reference arg and picking "m"
I only got part way into the details of your attempt to reason about the "+r,m" operand and the reference function-arg before deciding it would probably be better to just explain from scratch. The correct reason isn't that complicated. But a couple things are worth specifically correcting:
The C++ function containing the asm statement can inline, letting the by-reference function arg optimize away. (It's even declared inline __attribute__((always_inline)) to force inlining even with optimization disabled, although in that case the reference variable won't optimize away.)
The net result is as if the asm statement were used directly on the C++ variable passed to DoNotOptimize. e.g. DoNotOptimize(foo) is like asm volatile("" : "+r,m"(foo) :: "memory")
The compiler can always pick register if it wants to, e.g. choosing to load a variable's value into a register before an asm statement. (And if the C++ semantics demand updating the variable's value in memory, also emitting a store instruction after the asm statement.)
For example, we can see that GCC does choose to do that. (I guess I could have used incl %0 as the example, but I just chose nop as a way to show what the compiler picked for the operand location as an alternative to # %0 pure comment, so the Godbolt compiler explorer wouldn't filter it out.)
void foo(int *p)
{
asm volatile("nop # operand picked %0" : "+r,m" (p[4]) );
}
# GCC 11.2 -O2
foo(int*):
movl 16(%rdi), %eax
nop # operand picked %eax
movl %eax, 16(%rdi)
ret
vs. clang choosing to leave the value in memory, so every instruction in the asm template would be accessing memory instead of a register. (If there were any instructions).
# clang 12.0.1 -O2 -fPIE
foo(int*): # #foo(int*)
nop # operand picked 16(%rdi)
retq
Fun fact: "r,m" is an attempt to work around a clang missed-optimization bug that makes it always pick memory for "rm" constraints, even if the value was already in a register. Spilling it first, even if it has to invent a temporary location for the value of an expression as an input.
I'm trying to integrate my assembly code into C programs to make it easier to access.
I try to run the following code (I'm on a 64-bit x64 architecture):
void push(long address) {
__asm__ __volatile__("movq %0, %%rax;"
"push %%rax"::"r"(address));
}
The value of $rsp doesn't seem to change (neither does esp for that matter). Am I missing something obvious about how constraints work? rax is getting correctly allocated with address, but address never seems to get pushed onto the stack?
You can't do that.
Inline asm must document to the compiler the inputs it takes, the outputs it produces, and any other state it clobbers as part of its execution. Yours fails to do so. But perhaps more to the point, there is no way you could possibly be allowed to clobber the stack pointer the way you're doing: when the surrounding code regains control after the asm block, it would have no way to find any of its data. Even if it had saved that data on the stack knowing it would be clobbered, it would have no way to get it back.
I'm not sure what you're trying to do, but whatever it is, this is not the way to do it.
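For contrast, here is a minimal sketch of an asm statement that does declare everything it touches, so the compiler can keep the surrounding code correct (the helper add2() is made up):
static inline long add2(long a, long b) {
    long sum;
    __asm__("lea (%1,%2), %0"     /* sum = a + b, no flags modified */
            : "=r"(sum)           /* output: any register */
            : "r"(a), "r"(b));    /* inputs: any registers */
    return sum;                   /* RSP and all other state untouched */
}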
I'm optimizing C code for OpenRISC and I want to manually keep eight computed values in registers; the pseudocode looks like this:
outer loop:
    compute eight values (heavy calculations)
    inner loop:
        use the values computed above
When I looked at the GCC ABI for OpenRISC, I saw two groups of registers: callee-saved and temporary. Which registers should I use to store these eight values? I mean, which registers can I put on the clobber list in inline asm?
I need to hardcode registers, because we run the executables on a custom OpenRISC.
The answer is: whatever you like.
If you use callee-save registers then the compiler will save them for you (as long as you do mark them as clobbered).
If you use temporary registers (a.k.a. caller-save) then the compiler will be forced to save them if you make a function call. Beware that the compiler also prefers to use these for other variables, so if you use up the caller-save ones it'll have to use callee-save for other things, so it might end up being much the same difference.
At the end of the day, if you are doing heavy calculations then saving a few registers to stack before you start is not going to be a big deal.
There are some registers that contain important values (such as the stack pointer) that you must not overwrite. Others, such as the GOT table pointer, are less important, and the compiler will restore the value when you're done (just be sure you don't need it during the process).
Really though, you don't need to work it out for yourself: the compiler can select registers for you:
int a, b, c;
asm volatile ("whatever" : "=&w" (a), "=&w" (b), "=&w" (c));
The variables are not needed, but they must have registers assigned, so they effectively reserve a register for whatever you want. The & indicates an "early-clobber", which means that they can't share the same register as an input register (not that my example shows any).
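If you really must hardcode specific registers, GCC's explicit register variables are the supported route. A sketch: r14/r16 are placeholder picks you'd need to validate against the OpenRISC ABI, heavy_calc_a/b are hypothetical, and the binding is only guaranteed while the variables are used as asm operands:
register long v0 asm("r14");   /* placeholder register choices */
register long v1 asm("r16");
v0 = heavy_calc_a();           /* hypothetical precomputations */
v1 = heavy_calc_b();
asm volatile("l.nop" :: "r"(v0), "r"(v1));  /* stand-in for the inner loop */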
I decided I should try to implement coroutines (I think that's how I should call them) for fun and profits. I expect to have to use assembler, and probably some C if I want to make this actually useful for anything.
Bear in mind that this is for educational purposes. Using an already built coroutine library is too easy (and really no fun).
You guys know setjmp and longjmp? They allow you to unwind the stack to a predefined location and resume execution from there. However, they can't rewind to "later" on the stack; they can only come back earlier.
jmp_buf checkpoint;
int retval = setjmp(checkpoint); // returns 0 the first time
/* lots of stuff, lots of calls, ... We're not even in the same frame anymore! */
longjmp(checkpoint, 0xcafebabe); // execution resumes where setjmp is, and now it returns 0xcafebabe instead of 0
What I'd like is a way to run, without threading, two functions on different stacks. (Obviously, only one runs at a time. No threading, I said.) These two functions must be able to resume the other's execution (and halt their own), somewhat as if they were longjmping to the other. Once control returns to the other function, it must resume where it left off (that is, during or after the call that gave control away), a bit like how longjmp returns to setjmp.
This is how I imagined it:
Function A creates and zeroes a parallel stack (allocates memory and all that).
Function A pushes all its registers to the current stack.
Function A sets the stack pointer and the base pointer to that new location, and pushes a mysterious data structure indicating where to jump back and where to set the instruction pointer back.
Function A zeroes most of its registers and sets the instruction pointer to the beginning of function B.
That's for the initialization. Now, the following situation will loop indefinitely:
Function B works on that stack, does whatever work it needs to.
Function B comes to a point where it needs to interrupt and give A control again.
Function B pushes all of its registers to its stack, takes the mysterious data structure A gave it at the very beginning, and sets the stack pointer and the instruction pointer to where A told it to. In the process, it hands back A a new, modified data structure that tells where to resume B.
Function A wakes up, popping back all the registers it pushed to its stack, and does work until it comes to a point where it needs to interrupt and give B control again.
All this sounds good to me. However, there are a number of things I'm not exactly at ease with.
Apparently, on good ol' x86, there was this pusha instruction that would send all registers to the stack. However, processor architectures evolve, and now with x86_64 we've got a lot more general-purpose registers, and likely several SSE registers. I couldn't find any evidence that pusha pushes them. There are about 40 public registers in a modern x86 CPU. Do I have to do all the pushes myself? Moreover, there is no push for SSE registers (though there's bound to be an equivalent; I'm new to this whole "x86 assembler" thing).
Is changing the instruction pointer as easy as saying it? Can I do, like, mov rip, rax (Intel syntax)? Also, getting the value from it must be somewhat special, as it constantly changes. If I do mov rax, rip (Intel syntax again), will rip be the address of the mov instruction, of the instruction after it, or somewhere in between? It's just jmp foo. Dummy.
I've mentioned a mysterious data structure a few times. Up to now I've assumed it needs to contain at least three things: the base pointer, the stack pointer and the instruction pointer. Is there anything else?
Did I forget anything?
While I'd really like to understand how things work, I'm pretty sure there are a handful of libraries that do just that. Do you know any? Is there any POSIX- or BSD-defined standard way to do it, like pthread for threads?
Thanks for reading my question textwall.
You are correct in that PUSHA won't work on x64; it will raise a #UD exception, as PUSHA only pushes the 16-bit or 32-bit general-purpose registers. See the Intel manuals for all the info you ever wanted to know.
Setting RIP is simple: jmp rax will set RIP to RAX. To retrieve RIP, you can either get it at compile time, if you already know all the coroutine exit origins, or get it at run time by making a call to the next address, like this:
a:
call b
b:
pop rax
RAX will now be b. This works because CALL pushes the address of the next instruction. This technique works on IA32 as well (although I'd suppose there's a nicer way to do it on x64, as it supports RIP-relative addressing, but I don't know of one). Of course if you make a function coroutine_yield, it can just intercept the caller address :)
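(As an aside, RIP-relative addressing does give a direct, call-free way to read RIP on x64:)
lea rax, [rip]        ; Intel syntax: RAX = address of the next instruction
leaq 0(%rip), %rax    # AT&T equivalent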
Since you can't push all the registers to the stack in a single instruction, I wouldn't recommend storing the coroutine state on the stack, as that complicates things anyways. I think the nicest thing to do would be to allocate a data structure for every coroutine instance.
Why are you zeroing things in function A? That's probably not necessary.
Here's how I would approach the entire thing, trying to make it as simple as possible:
Create a structure coroutine_state that holds the following (a rough C sketch follows the list):
initarg
arg
registers (also contains the flags)
caller_registers
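In C, that might look roughly like this; reg_state_t is an invented placeholder, and its layout must match whatever your save/restore assembly actually writes:
#include <stdint.h>

typedef struct {              /* arch-specific saved state (x64 here) */
    uint64_t gpr[16];         /* general-purpose registers */
    uint64_t rsp;
    uint64_t rip;
    uint64_t rflags;
} reg_state_t;

typedef struct coroutine_state {
    void *initarg;            /* initial argument (think constructor) */
    void *arg;                /* argument passed on each resume */
    reg_state_t registers;        /* the coroutine's saved state */
    reg_state_t caller_registers; /* the invoker's saved state */
} coroutine_state;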
Create a function:
coroutine_state* coroutine_init(void (*coro_func)(coroutine_state*), void* initarg);
where coro_func is a pointer to the coroutine function body.
This function does the following (again, a C sketch follows the list):
allocate a coroutine_state structure cs
assign initarg to cs.initarg; this will be the initial argument to the coroutine
assign coro_func to cs.registers.rip
copy current flags to cs.registers (not registers, only flags, as we need some sane flags to prevent an apocalypse)
allocate some decent sized area for the coroutine's stack and assign that to cs.registers.rsp
return the pointer to the allocated coroutine_state structure
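Putting those steps into a C sketch (building on the struct above; STACK_SIZE and read_flags() are made up for the example, and real code must also respect the ABI's stack-alignment rules):
#include <stdlib.h>

#define STACK_SIZE (64 * 1024)

coroutine_state *coroutine_init(void (*coro_func)(coroutine_state *),
                                void *initarg) {
    coroutine_state *cs = calloc(1, sizeof *cs);
    cs->initarg = initarg;
    cs->registers.rip = (uint64_t)coro_func;
    cs->registers.rflags = read_flags();   /* hypothetical: pushfq/pop */
    char *stack = malloc(STACK_SIZE);
    cs->registers.rsp = (uint64_t)(stack + STACK_SIZE); /* grows down */
    return cs;
}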
Now we have another function:
void* coroutine_next(coroutine_state* cs, void* arg)
where cs is the structure returned from coroutine_init which represents a coroutine instance, and arg will be fed into the coroutine as it resumes execution.
This function is called by the coroutine invoker to pass in some new argument to the coroutine and resume it, the return value of this function is an arbitrary data structure returned (yielded) by the coroutine.
store all current flags/registers in cs.caller_registers, except for RSP (see step 3)
store the arg in cs.arg
fix the invoker's stack pointer (cs.caller_registers.rsp); adding 2*sizeof(void*) will fix it if you're lucky (you'd have to look this up to confirm it). You probably want this function to be stdcall so no registers are tampered with before calling it
mov rax, [rsp], and assign RAX to cs.caller_registers.rip; explanation: unless your compiler is on crack, [RSP] will hold the address of the instruction that follows the call instruction that called this function (i.e. the return address)
load the flags and registers from cs.registers
jmp cs.registers.rip, effectively resuming execution of the coroutine
Note that we never return from this function; the coroutine we jump to "returns" for us (see coroutine_yield). Also note that inside this function you may run into many complications, such as the function prologue and epilogue generated by the C compiler, and perhaps register arguments; you have to take care of all this. Like I said, stdcall will save you lots of trouble, and I think gcc's -fomit-frame-pointer will remove the epilogue stuff.
The last function is declared as:
void coroutine_yield(void* ret);
This function is called inside the coroutine to "pause" execution of the coroutine and return to the caller of coroutine_next.
store flags/registers in cs.registers
fix the coroutine stack pointer (cs.registers.rsp); once again, add 2*sizeof(void*) to it, and you want this function to be stdcall as well
mov rax, arg (let's just pretend all the functions in your compiler return their arguments in RAX)
load flags/registers from cs.caller_registers
jmp cs.caller_registers.rip; this essentially returns from the coroutine_next call on the coroutine invoker's stack frame, and since the return value is passed in RAX, we returned arg. Let's just say that if arg is NULL, the coroutine has terminated; otherwise it's an arbitrary data structure.
So to recap, you initialize a coroutine using coroutine_init, then you can repeatedly invoke the instantiated coroutine with coroutine_next.
The coroutine's function itself is declared:
void my_coro(coroutine_state* cs)
cs->initarg holds the initial function argument (think constructor). Each time my_coro is called, cs->arg has a different argument that was specified by coroutine_next. This is how the coroutine invoker communicates with the coroutine. Finally, every time the coroutine wants to pause itself, it calls coroutine_yield, and passes one argument to it, which is the return value to the coroutine invoker.
Okay, you may now think "that's easy!", but I left out all the complications of loading the registers and flags in the correct order while still maintaining a non-corrupt stack frame and somehow keeping the address of your coroutine data structure (you just overwrote all your registers), in a thread-safe manner. For that part you will need to find out how your compiler works internally... good luck :)
Good learning reference: libcoroutine, especially their setjmp/longjmp implementation. I know it's not fun to use an existing library, but you can at least get a general bearing on where you are going.
Simon Tatham has an interesting implementation of coroutines in C that doesn't require any architecture-specific knowledge or stack fiddling. It's not exactly what you're after, but I thought it might nonetheless be of at least academic interest.
boost.coroutine (boost.context) at boost.org does it all for you.
I'm doing some experimenting and would like to be able to see what is saved on the stack during a system call (the saved state of the user land process). According to http://lxr.linux.no/#linux+v2.6.30.1/arch/x86/kernel/entry_32.S it shows that the various values of registers are saved at those particular offsets to the stack pointer. Here is the code I have been trying to use to examine what is saved on the stack (this is in a custom system call I have created):
asm("movl 0x1C(%esp), %ecx");
asm("movl %%ecx, %0" : "=r" (value));
where value is an unsigned long.
As of right now, this value is not what is expected (it shows that 0 is saved for the user value of ds).
Am I correctly accessing the offset of the stack pointer?
Another possibility: could I use a debugger such as GDB to examine the stack contents while in the kernel? I don't have much experience with debugging and am not sure how to debug code inside the kernel. Any help is much appreciated.
No need for inline assembly. The saved state that entry_32.S pushes onto the stack for a syscall is laid out as a struct pt_regs, and you can get a pointer to it like this (you'll need to include <asm/ptrace.h> and/or <asm/processor.h> either directly or indirectly):
struct pt_regs *regs = task_pt_regs(current);
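From there you can read the saved user-space registers directly; for example (using the post-2.6.25 x86 field names; older kernels call them esp, eip, xds, and so on):
struct pt_regs *regs = task_pt_regs(current);
printk(KERN_INFO "user ip=%lx sp=%lx ds=%lx\n",
       regs->ip, regs->sp, regs->ds);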
Inline assembly is trickier than it seems. To briefly cover the concerns for GCC:
If it modifies processor registers, it's necessary to put these registers on the clobber list. It's important to note that the clobber list must contain ALL registers that you changed directly (read explicitly) or indirectly (read implicitly);
To reinforce (1), conditional and mathematical operations also change registers, better known as status flags (zero, carry, overflow, etc.), so you have to inform the compiler by adding "cc" to the clobber list;
Add "memory" if it modifies other (read: arbitrary) memory positions;
Add the volatile keyword if it has side effects that aren't mentioned in the input/output arguments.
Then, your code becomes:
asm("movl 0x1C(%%esp), %0;"
: "=r" (value)
: /* no inputs :) */
/* no modified registers */
);
The output argument isn't required to be on the clobber list because GCC already knows it will be changed.
Alternatively, since all you want is the value of the ESP register, you can avoid all the pain by doing this:
register unsigned long sp asm("esp");   /* bound to ESP */
value = *(unsigned long *)(sp + 0x1C);  /* read at ESP+0x1C without modifying ESP */
It might not solve your problem, but it's the way to go. For reference, check this, this and this.
Keep in mind that x86_64 code will often pass values in registers (since it has so many) so nothing will be on the stack. Check the gcc intermediate output (-S IIRC) and look for push in the assembly.
I'm not familiar with debugging kernel code, but gdb is definitely nicer to examine the stack interactively.