Inline Assembly Stack Behavior - c

I'm trying to integrate my assembly code into c programs to make it easier to access.
I try to run the following code (I'm on an x64 64 bit architecture)
void push(long address) {
    __asm__ __volatile__("movq %0, %%rax;"
                         "push %%rax"::"r"(address));
}
The value of $rsp doesn't seem to change (neither does esp for that matter). Am I missing something obvious about how constraints work? rax is getting correctly allocated with address, but address never seems to get pushed onto the stack?

You can't do that.
Inline asm must document to the compiler the inputs it takes, the outputs it produces, and any other state it clobbers as part of its execution. Yours fails to do so, but perhaps more to the point, there is no way you could possibly be allowed to clobber the stack pointer like you're doing, since the surrounding code, when it regains control after the asm block, would have no way to find any of its data - even if it had saved it on the stack knowing it would be clobbered, it would have no way to get it back.
I'm not sure what you're trying to do, but whatever it is, this is not the way to do it.
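For contrast, here is a minimal sketch of an asm statement that does describe itself to the compiler and leaves the stack pointer alone. It only reads RSP into a declared output operand. The helper name read_stack_pointer is hypothetical, and the non-x86-64 branch is an assumed fallback so the sketch still compiles elsewhere.

```c
#include <stdint.h>

/* Hypothetical helper: returns the current stack pointer without changing it. */
static inline uintptr_t read_stack_pointer(void)
{
    uintptr_t sp;
#if defined(__x86_64__)
    /* "=r"(sp) declares an output that may live in any general-purpose
       register; nothing else is modified, so no clobbers are listed. */
    __asm__ __volatile__("movq %%rsp, %0" : "=r"(sp));
#else
    /* Assumed fallback for other targets: approximate with the frame address. */
    sp = (uintptr_t)__builtin_frame_address(0);
#endif
    return sp;
}
```

Calling read_stack_pointer() before and after a block of code is one way to observe the stack pointer without disturbing the state the compiler relies on.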

How the arm link register r14 works

I wanted to understand how the ARM link register works and how it is helpful in debugging.
I started by writing a simple function.
#define MACRO_TEST() (event_log__add_args(MACRO_TEST,__return_address()))
static void do_print_r14(void) {
printf_all("return address 0x%08X \n",__return_address()); //prints 0x823194BB
MACRO_TEST();
printf_all("return address 0x%08X \n",__return_address()); //prints 0x823194BB
}
The event log prints the following :
Return Address: 0x0000ABAB
My question is why the prints in the do_print_r14 function print the same value.
Wouldn't it be more helpful if I just logged the line number and function name, which
would point to the exact location in the code? Why do developers use r14 in debugging?
This question might sound very basic to you all but I am not at all sure why we need r14 register.
The line number and function name tell you where you are, but the return address tells you how you got there - that can be very useful in more complex code. When debugging pure assembly, heavily optimised code, or other situations where you might not have an intelligible stack frame, sometimes inspecting the link register is the only way to know that address.
Of course, this is always relative to the current function, thus having "print the return address" wrapped up in its own little function is self-defeating since it will then only tell you where you called that function from (assuming the compiler hasn't decided to inline it), in which case you indeed may as well have used __LINE__ instead of the call.
Now the subtle but important point: the intent of the __return_address() intrinsic is "the return address of the current function", which is by no means the same thing as "the current contents of R14" - if the compiler has saved the return address on entry it's then free to use R14 for whatever it wants. The two lines in do_print_r14() both print 0x823194BB because that is where do_print_r14() was called from, and during that call nothing is going to change that. If you want to look below the abstraction level of the C environment and see what actually happens to R14 during execution, you'll need to use inline assembly tricks, or step through with a debugger.
r14 is the address that the CPU will jump to when returning from the function. Thus, calls to __return_address will yield the same value.
Perhaps a better demonstration of this would be something like:
static void do_print_r14(void) {
printf_all("return address 0x%08X \n",__return_address());
}
static void test_r14(void) {
printf_all("first call...\n");
do_print_r14();
printf_all("second call...\n");
do_print_r14();
}
Here, two different values will be printed (corresponding to the two different places do_print_r14 is called).
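The same experiment can be sketched with the GCC/Clang builtin __builtin_return_address(0), used here as an assumed stand-in for the __return_address() intrinsic from the question. The names record_return_address and call_sites_differ are hypothetical, and the noinline attribute is there because an inlined helper would have no call site of its own.

```c
#include <stdio.h>

static void *seen[2];  /* return addresses recorded by the two calls */
static int n_seen;

/* noinline keeps this a real call, so it has its own return address. */
__attribute__((noinline))
static void record_return_address(void)
{
    if (n_seen < 2)
        seen[n_seen++] = __builtin_return_address(0);
}

/* Returns 1 if the two call sites produced different return addresses. */
int call_sites_differ(void)
{
    n_seen = 0;
    record_return_address();  /* call site 1 */
    record_return_address();  /* call site 2 */
    printf("first: %p, second: %p\n", seen[0], seen[1]);
    return seen[0] != seen[1];
}
```

Each recorded address points just past the corresponding call instruction, which is why the two values differ even though the callee is the same.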
Usage of LR is defined by the ARM architecture and the ARM AAPCS. The link register's usefulness for debugging is merely a side effect rather than a feature.
From AAPCS:
5.3 Subroutine Calls
Both the ARM and Thumb instruction sets contain a primitive subroutine call instruction, BL, which performs a branch-with-link operation. The effect of executing BL is to transfer the sequentially next value of the program counter—the return address—into the link register (LR) and the destination address into the program counter (PC).

Can't seem to add %ES to the clobberlist (inline assembly, GCC)

I'm going through Michael Abrash's Graphics Programming Black Book (which by the way, I am really enjoying, I strongly recommend it), so the example code I'm working with is quite old. Nonetheless, I don't see what the problem is:
__asm__(
//Some setup code here
"movl %%esi, %%edi;"
"movw %%ds, %%es;"
//A whole bunch more assembly code in between
"divloop:"
"lodsw;"
"divl %%ebx;"
"stosw;"
"loop divloop;"
//And a little more code here
: "=r" (ret)
: "0" (ret) /*Have to do this for some reason*/, "b" (div), "c" (l), "S" (p)
: "%edi", "%es"
);
The l variable is an unsigned int, the p variable is a char*. l
is a byte count for the length of the string pointed at by p. div
is the divisor and is an unsigned int. ret is the return value (an
unsigned int) of the function and is set inside the assembly block to
be the remainder of the division.
The error message I am getting is "error: unknown register name '%es' in 'asm'" (this is the only error message). My best guess is that it goes by another name in GAS syntax. I know I'm working with old code, but as far as I know, on my fairly new Intel i3 there is still an ES register that gets used by stos*.
Secondly, there's a question that's been bugging me. I've basically had no choice but to just assume that DS was already set to the right memory location for use with lods*. Since I am reading from, modifying, and writing to the same memory location (using stos* and lods*) I'm setting ES equal to DS. However, it's really scaring me that my DS could be anything and I don't know what else to set it to. What's more is that ESI and EDI are already 32 bit registers and should be enough on their own to access memory.
In my experience, two strange problems at once are usually related and caused by a more fundamental problem (and usually a PEBKAC). However, I'm stumped at this point. Does anyone know what's going on?
Thanks a bunch
P.S. I'm trying to recreate the code from Chapter 9 (Hints My Readers Gave Me, Listing 9.5, page 182) that divides a large number stored in contiguous memory by EBX. There is no other reason for doing this than my own personal growth and amusement.
If you're running in a flat 32-bit protected mode environment (like a Linux or Windows user-mode process), there's no need to set es.
The segment registers are set for you by the OS, and es and ds both allow you to access a flat 32-bit address space.
GCC won't generate code to save/restore segment registers, so it's not surprising that it won't allow you to add them to the clobber list.
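As a rough illustration of a string instruction used without touching es, the sketch below fills a buffer with rep stosb. The function name fill_bytes is hypothetical, the code assumes the direction flag is clear (as the usual x86 calling conventions require at function boundaries), and the non-x86 branch is an assumed fallback. EDI/RDI and ECX/RCX are declared as read-write operands rather than clobbers, and no segment register appears anywhere.

```c
#include <stddef.h>

/* Hypothetical helper: store `value` into the first n bytes at dst. */
void fill_bytes(void *dst, unsigned char value, size_t n)
{
#if defined(__i386__) || defined(__x86_64__)
    /* "+D" and "+c" tell the compiler that the destination and count
       registers are both read and modified; "memory" covers the stores. */
    __asm__ __volatile__("rep stosb"
                         : "+D"(dst), "+c"(n)
                         : "a"(value)
                         : "memory");
#else
    /* Assumed fallback for non-x86 targets. */
    unsigned char *p = dst;
    while (n--)
        *p++ = value;
#endif
}
```

Registers that the asm both reads and changes are expressed as "+" operands here, since GCC does not accept a register as both an input operand and a clobber.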

How does including assembly inline with C code work?

I've seen code for Arduino and other hardware that have assembly inline with C, something along the lines of:
asm("movl %ecx, %eax"); /* moves the contents of ecx to eax */
__asm__("movb %bh, (%eax)"); /* moves the byte from bh to the memory pointed to by eax */
How does this actually work? I realize every compiler is different, but what are the common reasons this is done, and how could someone take advantage of this?
The inline assembler code goes right into the complete assembled code untouched and in one piece. You do this when you really need absolutely full control over your instruction sequence, or maybe when you can't afford to let an optimizer have its way with your code. Maybe you need every clock tick. Maybe you need every single branch of your code to take the exact same number of clock ticks, and you pad with NOPs to make this happen.
In any case, there are lots of reasons why someone may want to do this, but you really need to know what you're doing. These chunks of code will be pretty opaque to your compiler, and it's likely you won't get any warnings if you're doing something bad.
Usually the compiler will just insert the assembler instructions right into its generated assembler output. And it will do this with no regard for the consequences.
For example, in this code the optimiser is performing copy propagation, whereby it sees that y=x, then z=y. So it replaces z=y with z=x, hoping that this will allow it to perform further optimisations. However, it doesn't spot that I've messed with the value of x in the meantime.
char x=6;
char y,z;
y=x; // y becomes 6
_asm
rrncf x, 1 // x becomes 3. Optimiser doesn't see this happen!
_endasm
z=y; // z should become 6, but actually gets
// the value of x, which is 3
To get around this, you can essentially tell the optimiser not to perform this optimisation for this variable.
volatile char x=6; // Tell the compiler that this variable could change
// all by itself, and any time, and therefore don't
// optimise with it.
char y,z;
y=x; // y becomes 6
_asm
rrncf x, 1 // x becomes 3. Optimiser doesn't see this happen!
_endasm
z=y; // z correctly gets the value of y, which is 6
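With GCC-style extended asm there is another option, sketched below under the assumption of an x86 target: declare x as a read-write operand so the compiler knows the asm changes it. The function name copy_then_shift is hypothetical, a right shift stands in for the PIC rrncf instruction, and the non-x86 branch is an assumed fallback.

```c
/* Returns the final value of z and stores the final value of x in *x_out. */
unsigned int copy_then_shift(unsigned int *x_out)
{
    unsigned int x = 6;
    unsigned int y = x;           /* y becomes 6 */
#if defined(__i386__) || defined(__x86_64__)
    /* "+r"(x): x is both read and written by the asm, so the compiler
       will not assume it still holds 6 afterwards. "cc": flags change. */
    __asm__("shrl $1, %0" : "+r"(x) : : "cc");
#else
    x >>= 1;                      /* assumed fallback for other targets */
#endif
    unsigned int z = y;           /* z gets the value of y, which is 6 */
    *x_out = x;                   /* x is now 3 */
    return z;
}
```

Because the operand list describes the side effect, the variable does not need to be volatile and the compiler remains free to optimise the code around the asm.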
Historically, C compilers generated assembly code, which would then be translated to machine code by an assembler. Inline assembly arises as a simple feature — in the intermediate assembly code, at that point, inject some user-picked code. Some compilers directly generate machine code, in which case they contain an assembler or call an external assembler to generate the machine code for the inline assembly snippets.
The most common use for assembly code is to use specialized processor instructions that the compiler isn't able to generate. For example, disabling interrupts for a critical section, controlling processor features (cache, MMU, MPU, power management, querying CPU capabilities, …), accessing coprocessors and hardware peripherals (e.g. inb/outb instructions on x86), etc. You'll rarely find asm("movl %ecx, %eax"), because that affects general-purpose registers that the C code around it is also using, but something like asm("mcr p15, 0, r0, c7, c10, 5") has its use (data memory barrier on ARM). The OSDev wiki has several examples with code snippets.
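As one illustration of "querying CPU capabilities", the sketch below runs the x86 cpuid instruction with leaf 0 to read the vendor string. The function name cpu_vendor and its return convention are hypothetical, and the code assumes an x86-64 target; on anything else it simply reports that no query was made.

```c
#include <string.h>

/* Hypothetical helper: writes a NUL-terminated 12-character vendor string
   into out and returns 1 on x86-64; returns 0 (empty string) elsewhere. */
int cpu_vendor(char out[13])
{
    memset(out, 0, 13);
#if defined(__x86_64__)
    unsigned int eax = 0, ebx, ecx = 0, edx;  /* leaf 0, subleaf 0 */
    /* cpuid reads EAX/ECX and overwrites EAX, EBX, ECX and EDX,
       so all four registers are declared as operands. */
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
    memcpy(out + 0, &ebx, 4);   /* the vendor string is stored in */
    memcpy(out + 4, &edx, 4);   /* EBX, EDX, ECX order            */
    memcpy(out + 8, &ecx, 4);
    return 1;
#else
    return 0;
#endif
}
```

In real code the <cpuid.h> header shipped with GCC and Clang is usually a more convenient route; the asm version is shown only to make the operand declarations visible.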
Assembly code is also useful to implement features that break C's flow control model. A common example is context switching between threads (whether cooperative or preemptive, whether in the same address space or not) requiring assembly code to save and restore register values.
Assembly code is also useful to hand-optimize small bits of code for memory or speed. As compilers are getting smarter, this is rarely relevant at the application level nowadays, but it's still relevant in much of the embedded world.
There are two ways to combine assembly with C: with inline assembly, or by linking assembly modules with C modules. Linking is arguably cleaner but not always applicable: sometimes you need that one instruction in the middle of a function (e.g. for register saving on a context switch, a function call would clobber the registers), or you don't want to pay the cost of a function call.
Most C compilers support inline assembly, but the syntax varies. It is typically introduced by the keyword asm, _asm, __asm or __asm__. In addition to the assembly code itself, the inline assembly construct may contain additional code that allows you to pass values between assembly and C (for example, requesting that the value of a local variable is copied to a register on entry), or to declare that the assembly code clobbers or preserves certain registers.
Both asm and __asm__ are valid. Basically, you can use __asm__ if the keyword asm conflicts with something in your program. If you have more than one instruction, you can write one per line in double quotes and suffix '\n' and '\t' to each instruction. This is because gcc sends each instruction as a string to as (GAS), and the newline/tab lets you send correctly formatted lines to the assembler. The code snippet in your question is basic inline assembly.
In basic inline assembly, there are only instructions. In extended assembly, you can also specify the operands: the input registers, the output registers, and a list of clobbered registers. It is not mandatory to specify which registers to use; you can leave that to GCC, which probably fits GCC's optimization scheme better. An example of basic asm with several instructions is:
__asm__ ("movl %eax, %ebx\n\t"
         "movl $56, %esi\n\t"
         "movl %ecx, label(%edx,%ebx,4)\n\t"
         "movb %ah, (%ebx)");
Notice the '\n\t' at the end of each line except the last, and that each line is enclosed in quotes. This is because gcc sends each instruction to as as a string, as mentioned before. The newline/tab combination is required so that the lines reach as in the correct format.
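For comparison, a minimal extended asm sketch with operands might look like the following, assuming an x86 target. The function name add_with_asm is hypothetical and the non-x86 branch is an assumed fallback. The output uses the early-clobber modifier "&" because it is written before the last input is read.

```c
/* Hypothetical helper: returns a + b, computed with two instructions. */
int add_with_asm(int a, int b)
{
    int sum;
#if defined(__i386__) || defined(__x86_64__)
    __asm__("movl %1, %0\n\t"     /* sum = a                           */
            "addl %2, %0"         /* sum += b                          */
            : "=&r"(sum)          /* output, written before %2 is read */
            : "r"(a), "r"(b)      /* inputs, registers chosen by GCC   */
            : "cc");              /* addl changes the status flags     */
#else
    sum = a + b;                  /* assumed fallback for other targets */
#endif
    return sum;
}
```

Here %0, %1 and %2 refer to the operands in the order they are listed, and GCC picks the actual registers.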

Is this inline-asm approach for stack switching ok?

For some functions, I need to switch the stack so that the original stack remains unmodified. For that purpose, I have written two macros as shown below.
#define SAVE_STACK() __asm__ __volatile__ ( "mov %%rsp, %0; mov %1, %%rsp" : \
"=m" (saved_sp) : "m" (temp_sp) );
#define RESTORE_STACK() __asm__ __volatile__ ( "mov %0, %%rsp" : \
"=m" (saved_sp) );
Here temp_sp and saved_sp are thread local variables. temp_sp points to the makeshift stack that we use. For a function, whose original stack I want unmodified, I place SAVE_STACK at the beginning and RESTORE_STACK at bottom. For example, like this.
int some_func(int param1, int param2)
{
int a, b, r;
SAVE_STACK();
// Function Body here
.....................
RESTORE_STACK();
return r;
}
Now my question is whether this approach is fine. On x86 (64bit), the local variables and parameters are accessed through the rbp register and rsp is accordingly subtracted in function prologue and not touched until in function epilogue where it is added to bring it back to the original value. Therefore, I see no problem here.
I am not sure, if this is correct in the presence of context switches and signals though. (On Linux). Also I'm not sure if this is correct if the function is inlined or if tail call optimization (where jmp instead of call is used) is applied. Do you see any problem or side effects with this approach?
With the code that you've shown above, I can think of the following breakage:
On x86/x64, GCC will "decorate" your function with prologues/epilogues if it sees fit, and you can't stop it from doing that (as you can on ARM, where __attribute__((__naked__)) forces code creation without prologues/epilogues, i.e. without stack frame setup).
That might end up allocating stack / creating references to stack memory locations before you switch the stack. Even worse if, again, due to the compiler's choice, such an address is put into a nonvolatile register before you switch the stack, it might alias to two locations (the stackpointer-relative one that you changed and the other-reg-relative one that is the same).
Again, on x86/x64, the ABI suggests an optimization for leaf functions (the "red zone") where no stackframe is allocated yet 128 Bytes of stack "below" the end are usable by the function. Unless your memory buffer takes this into account, overruns might occur that you're not expecting.
Signals are handled on alternate stacks (see sigaltstack()) and doing your own stack switching might make your code uncallable from within signal handlers. It'll definitely make it non-reentrant, and depending on where/how you retrieve the "stack location" will also definitely make it non-threadsafe.
In general, if you want to run a specific piece of code on a different stack, why not either:
run it in a different thread (every thread gets a different stack) ?
trigger e.g. SIGUSR1 and run your code in a signal handler (which you can configure to use a different stack) ?
run it via makecontext() / swapcontext() (see the example in the manpage) ?
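The makecontext() / swapcontext() option can be sketched roughly as follows, assuming a POSIX system that still provides <ucontext.h> (glibc on Linux does). The names run_on_alternate_stack and body and the 64 KiB stack size are hypothetical choices for illustration; the function checks that a local variable of body() really lives inside the separately allocated stack.

```c
#include <stdint.h>
#include <stdlib.h>
#include <ucontext.h>

#define ALT_STACK_SIZE (64 * 1024)   /* assumed size, adjust as needed */

static ucontext_t main_ctx, alt_ctx;
static char *alt_stack;
static int ran_on_alt_stack;

/* Runs on the alternate stack; records whether a local lies inside it. */
static void body(void)
{
    volatile char marker = 0;
    uintptr_t p = (uintptr_t)&marker;
    uintptr_t lo = (uintptr_t)alt_stack;
    ran_on_alt_stack = (p >= lo && p < lo + ALT_STACK_SIZE);
}

/* Returns 1 if body() ran on the alternate stack, -1 on setup failure. */
int run_on_alternate_stack(void)
{
    alt_stack = malloc(ALT_STACK_SIZE);
    if (alt_stack == NULL)
        return -1;
    if (getcontext(&alt_ctx) == -1) {
        free(alt_stack);
        return -1;
    }
    alt_ctx.uc_stack.ss_sp = alt_stack;
    alt_ctx.uc_stack.ss_size = ALT_STACK_SIZE;
    alt_ctx.uc_link = &main_ctx;     /* resume main_ctx when body() returns */
    makecontext(&alt_ctx, body, 0);
    if (swapcontext(&main_ctx, &alt_ctx) == -1) {
        free(alt_stack);
        return -1;
    }
    free(alt_stack);
    return ran_on_alt_stack;
}
```

The original stack is left untouched while body() runs, and the compiler never sees RSP change underneath it, because the switch happens inside library calls rather than in the middle of a function.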
Edit:
Since you say "you want to compare the memory of two processes", there are again different methods for that, in particular external process tracing: attach a "debugger" (this can be a process you write yourself that uses ptrace() to control what you want to monitor) and have it handle e.g. breakpoints / checkpoints on behalf of the processes you trace, to perform the validations you need. That would also be more flexible because it doesn't require changing the code you inspect.
-fomit-frame-pointer is on by default when optimization is enabled. Unless you plan to compile with optimization disabled, the assumption that functions don't touch RSP except in prologue/epilogue is super broken.
Even if you did use -O3 -fno-omit-frame-pointer, compilers will still move RSP around in some cases, although they won't use it to access args and locals. e.g. alloca / C99 VLA, or even calling a function that has more than 6 args (or more precisely, one with args that don't fit in registers), will all move RSP. (Calling a function might just use mov stores, depending on strategy chosen by the compiler.)
Also, "shrink wrap" optimization where a function delays saving call-preserved regs until after a possible early-out could mean your stack-switch happens before the compiler is ready to save/restore. And your restore might happen too late or too early. (This was mentioned in comments by ams.)

How to determine values saved on the stack?

I'm doing some experimenting and would like to be able to see what is saved on the stack during a system call (the saved state of the user land process). According to http://lxr.linux.no/#linux+v2.6.30.1/arch/x86/kernel/entry_32.S it shows that the various values of registers are saved at those particular offsets to the stack pointer. Here is the code I have been trying to use to examine what is saved on the stack (this is in a custom system call I have created):
asm("movl 0x1C(%esp), %ecx");
asm("movl %%ecx, %0" : "=r" (value));
where value is an unsigned long.
As of right now, this value is not what is expected (it is showing a 0 is saved for the user value of ds).
Am I correctly accessing the offset of the stack pointer?
Alternatively, could I use a debugger such as GDB to examine the stack contents while in the kernel? I don't have much experience with debugging and am not sure how to debug code inside the kernel. Any help is much appreciated.
No need for inline assembly. The saved state that entry_32.S pushes onto the stack for a syscall is laid out as a struct pt_regs, and you can get a pointer to it like this (you'll need to include <asm/ptrace.h> and/or <asm/processor.h> either directly or indirectly):
struct pt_regs *regs = task_pt_regs(current);
Inline assembly is trickier than it seems. To briefly cover the main concerns for GCC:
1. If it modifies processor registers, those registers must go on the clobber list. The clobber list must contain ALL registers that you change, whether directly (explicitly) or indirectly (implicitly);
2. To reinforce (1), conditional and arithmetic operations also change the status flags (zero, carry, overflow, etc.), so you have to tell the compiler by adding "cc" to the clobber list;
3. Add "memory" if it modifies arbitrary memory locations;
4. Add the volatile keyword if it modifies memory that isn't mentioned in the input/output arguments.
Then, your code becomes:
asm("movl 0x1C(%%esp), %0;"
: "=r" (value)
: /* no inputs :) */
/* no modified registers */
);
The output argument isn't required to be on the clobber list because GCC already knows it will be changed.
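To see the clobber rules above in a statement that actually needs them, here is a small sketch assuming an x86 target: an atomic add through a lock-prefixed instruction. The function name locked_add is hypothetical, and the non-x86 branch falls back to a GCC/Clang atomic builtin.

```c
/* Hypothetical helper: atomically adds amount to *counter. */
void locked_add(int *counter, int amount)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("lock; addl %1, %0"
                         : "+m"(*counter)   /* memory operand, read and written */
                         : "ir"(amount)     /* immediate or register input      */
                         : "cc", "memory"); /* flags change; also acts as a
                                               compiler-level memory barrier    */
#else
    /* Assumed fallback for other targets. */
    __atomic_fetch_add(counter, amount, __ATOMIC_SEQ_CST);
#endif
}
```

No register appears in the clobber list because the asm names no fixed register; the only side effects beyond the operands are the flags and the ordering of memory accesses.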
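Alternatively, since all you want is a value at an offset from the ESP register, you can avoid the asm statement by binding a local register variable to ESP and reading through it (never assign to that variable, since doing so would move the stack pointer itself):
register char *esp asm("esp");
value = *(unsigned long *)(esp + 0x1C);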
It might not solve your problem, but it's the way to go.
Keep in mind that x86_64 code will often pass values in registers (since it has so many) so nothing will be on the stack. Check the gcc intermediate output (-S IIRC) and look for push in the assembly.
I'm not familiar with debugging kernel code, but gdb is definitely nicer to examine the stack interactively.
