Inserting inline assembly code into C function - I/O questions


I am developing an embedded C application for my Cortex M3 microcontroller using the GNU arm-none-eabi toolchain.
I plan to adopt an assembly subroutine that the vendor implemented into my C application. I plan to make a new C function and, within it, write an inline assembly block using the extended inline assembly syntax. In this post I treat the assembly subroutine as a black box and ask how to structure the inputs and clobber list; this routine has no outputs.
The assembly subroutine expects r0, r1, and r2 to be pre-set prior to the call. Further, the subroutine uses registers r4, r5, r6, r7, r8, r9 as scratch registers to do its function. It writes to a range of memory on the device, specified by r0 and r1 which are the start and stop addresses, respectively.
So, I am checking if my assumptions are correct. My questions follow.
My function that I think I should write, is this right?:
void my_asm_ported_func(int reg_r0, int reg_r1, int reg_r2) {
__asm__ __volatile__ (
"ldr r0, %0 \n\t",
"ldr r1, %1 \n\t",
"ldr r2, %2 \n\t",
"<vendor code...> ",
: /* no outputs */
: "r" (reg_r0), "r" (reg_r1), "r" (reg_r2) /* inputs */
: "r0", "r1", "r2", "r4", "r5", "r6",
"r7", "r8", "r9", "memory" /* clobbers */
);
}
Since this asm subroutine writes to a range of other memory on the device, is adding "memory" to the clobber list enough? Seems too simple.
Is there a more elegant way to feed in r0 - r2 from the input parameters in the surrounding C function? I understand from AAPCS that the registers r0-r3 are input parameters 1-4, so it seems redundant to feed r0-r2 inputs manually like I did in the input list. Should I somehow just have this be a pure assembly function in a separate .S file?
Thank you in advance.
I tried the above but with the basic inline assembly protocol, with terrible results: it crashed. I did it that way because I thought the assembly block would naturally take r0-r2 via the function prologue, which it evidently did because it wrote the memory correctly, but it crashed once my breakpoint at the beginning of the asm block was hit (my VS Code extension doesn't have the step-by-step disassembly view, so it just runs it as a black box and it crashed). I haven't tried the extended form yet. I have been doing a lot of reading into this, so I just wanted to make sure my black box approach should work and I'm not missing anything too big.

Yes, a volatile asm with a "memory" clobber is fine for MMIO (or pretty much anything that's supported at all): the compiler will make sure the asm it generates has memory contents in sync with the C abstract machine before the asm statement, and will assume that any globally-reachable memory has changed after. See How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for a more in-depth explanation of why this matters when the pointed-to memory is C variables that you also access outside inline asm, not just MMIO registers.
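As a minimal sketch of what the "memory" clobber buys you, the following stores through a pointer from inside an asm statement and then reads the variable back in C. The names asm_store_u32, shared_word and demo_memory_clobber are made up for illustration, and the plain-C fallback branch is only there so the sketch builds on targets other than x86 and 32-bit ARM.

```c
#include <stdint.h>

uint32_t shared_word;  /* hypothetical variable the asm writes behind the compiler's back */

/* Hypothetical helper: store `value` through `p` from inside an asm statement. */
static inline void asm_store_u32(uint32_t *p, uint32_t value)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("movl %[v], (%[p])"
                         : /* no outputs */
                         : [p] "r"(p), [v] "r"(value)
                         : "memory");  /* the asm writes memory the operands don't describe */
#elif defined(__arm__)
    __asm__ __volatile__("str %[v], [%[p]]"
                         : /* no outputs */
                         : [p] "r"(p), [v] "r"(value)
                         : "memory");
#else
    *p = value;  /* fallback so the sketch still builds elsewhere */
#endif
}

uint32_t demo_memory_clobber(void)
{
    shared_word = 1;
    asm_store_u32(&shared_word, 42);
    /* Because of the "memory" clobber, the compiler should reload
       shared_word here instead of assuming it is still 1. */
    return shared_word;
}
```

Without the clobber, the compiler would be entitled to return the cached value 1 once this is optimized.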
Registers
To avoid wasted instructions, tell the compiler which registers you want inputs in, or better let the compiler pick and change the "vendor code" to use %0 instead of the hard register r0.
ldr r0, r0 from filling in your ldr r0, %0 template string is either invalid or treats the source r0 as a symbol name. Either way, it doesn't get the function arg into r0, since you force the compiler to have it in a different register (by declaring a clobber on "r0"). If you did want to copy between registers, the ARM instruction for that is mov. But if that's the first instruction of an asm template string, usually that means you're doing it wrong and should use better constraints to tell the compiler what you want.
// Worse way, but can use a template string with hard-coded registers unchanged
void my_asm_ported_func(int a, int b, int c)
{
register int reg_r0 asm ("r0") = a; // forces "r" to pick r0 for an asm template
register int reg_r1 asm ("r1") = b; // no other *guaranteed* effects.
register int reg_r2 asm ("r2") = c;
__asm__ __volatile__ (
// no extra mov or load instructions
"<vendor code...> " // still unchanged
: "+r" (reg_r0), "+r" (reg_r1), "+r" (reg_r2) // read-write outputs
: // no pure inputs
: "r4", "r5", "r6",
"r7", "r8", "r9", "memory" // clobbers
);
}
Best way
void my_asm_ported_func(int reg_r0, int reg_r1, int reg_r2) {
__asm__ __volatile__ (
// no extra mov or load instructions.
"<vendor code changed to use %0 instead of r0, etc...> "
: "+r" (reg_r0), "+r" (reg_r1), "+r" (reg_r2) // read-write outputs
: // no pure inputs
: "r4", "r5", "r6",
"r7", "r8", "r9", "memory" // clobbers. Not including r3??
);
// the C variables reg_r0 and so on have modified values here
// but they're local to this function so no effect outside of this
}
Actually, a further improvement would be to replace the register clobbers like "r4" through "r9" with "=r"(dummy1) output operands to let the compiler pick which registers to clobber.
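A small sketch of that idea, with a toy computation standing in for the vendor code: the scratch register is declared as an early-clobber dummy output, so the compiler chooses it. The function name triple_with_scratch and the instruction sequences are assumptions for illustration; the final branch is a plain-C fallback for other targets.

```c
/* Multiply x by 3 using one scratch register that the compiler picks. */
int triple_with_scratch(int x)
{
    int tmp;  /* dummy output: exists only to reserve a scratch register */
#if defined(__x86_64__) || defined(__i386__)
    __asm__("movl %[x], %[tmp]\n\t"
            "addl %[tmp], %[tmp]\n\t"   /* tmp = 2*x */
            "addl %[tmp], %[x]"         /* x  += tmp */
            : [x] "+r"(x), [tmp] "=&r"(tmp)
            : /* no pure inputs */
            : "cc");
#elif defined(__arm__)
    __asm__("add %[tmp], %[x], %[x]\n\t"   /* tmp = 2*x */
            "add %[x], %[x], %[tmp]"       /* x  += tmp */
            : [x] "+r"(x), [tmp] "=&r"(tmp));
#else
    tmp = x + x;  /* fallback so the sketch still builds elsewhere */
    x += tmp;
#endif
    (void)tmp;    /* the value of the dummy output is never used */
    return x;
}
```

The "=&r" early-clobber marks tmp as written before all inputs have been consumed, so the compiler should not let it share a register with any input.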
I'm surprised the template string doesn't use r3. If it does, you forgot to tell the compiler about it, which is undefined behaviour that will bite you when this function inlines. You mentioned crashes; that could be the cause, if your ldr isn't.
Using %0 instead of r0 in the "vendor code" will get the compiler to fill in the register name it picked. Normally it will pick r0 for the C variable whose value was already there, unless the function inlines and the value was in a different register.
I'm assuming the asm template modifies that register, which is why I made it an input/output operand with "+r"(reg_r0), with the output side basically being a dummy to let the compiler know that register changed. You can't declare a clobber on a register that's also an operand, and if you're letting the compiler pick registers you wouldn't even know which one.
If any of the input registers are left unmodified by the asm template, make them pure inputs. You can use [name] "r"(c_var) in the operands and %[name] in the template string to use names instead of numbers, making it easy to move them around without having to renumber and keep track of which operand is which number.
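For illustration, here is a hedged sketch of named operands with one read-write operand and one pure input. The function add_named and the operand names acc and rhs are invented for this example, and the last branch is a plain-C fallback so it builds on other targets.

```c
/* Add b to a: `acc` is modified by the asm ("+r"), `rhs` is only read ("r"). */
int add_named(int a, int b)
{
    int sum = a;
#if defined(__x86_64__) || defined(__i386__)
    __asm__("addl %[rhs], %[acc]"
            : [acc] "+r"(sum)   /* read-write operand */
            : [rhs] "r"(b)      /* pure input, left unmodified */
            : "cc");
#elif defined(__arm__)
    __asm__("add %[acc], %[acc], %[rhs]"
            : [acc] "+r"(sum)
            : [rhs] "r"(b));
#else
    sum += b;  /* fallback so the sketch still builds elsewhere */
#endif
    return sum;
}
```

Reordering or adding operands does not require touching the template, since it never refers to operand numbers.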
See also
ARM inline asm: exit system call with value read from memory re: getting values into specific ARM registers
https://stackoverflow.com/tags/inline-assembly/info
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html register T foo asm("regname") syntax.
Separate .S file:
Should I somehow just have this be a pure assembly function in a separate .S file?
That's 100% a valid option, especially if call/ret overhead is minor compared to how long this takes, or it's not called all the time.
Look at compiler-generated asm (gcc -S) if you're not sure about the syntax for declaring a function (.globl foo ; foo: to define the symbol, put its machine code after it.) And of course push and pop any call-preserved registers your function uses.
(GNU C inline asm requires you to describe the asm precisely to the compiler; the function-calling convention is irrelevant because it's inline asm. You're dancing with the compiler and need to not step on its toes, instead of just following the standard calling convention.)

Related

How to pass function address to Assembler Instructions with C Expression Operands

With gcc/clang for ARM cortex M, is there a way to pass a function address as a constant in Assembler Instructions with C Expression Operands ? More precisely I'd like to load R12 with function address (stored in the memory):
ldr R12, =func
within a C function, for example like this one
// __attribute__((naked))
int loader(int fn)
{
__asm ("ldr R12, =%0"::??? (fn):"r12");
// ... then SVC #0, and the R0 is the return value
}
The question is what exactly I have to put for the Input Operand?
EDIT:
Thanks for the comments!
Actually I need to re-implement the KEIL's __svc_indirect(0) which loads R12 with function address and passes up to four arguments in R0..R3 (see __svc_indirect

Use an i constraint and manually prepend the = character:
__asm ("ldr r12, =%0" :: "i"(fn) : "r12");
Note that the inline assembly statement is still incorrect for other reasons, some of which were outlined in the comments on your question. Also consider using a register-constrained variable for this sort of thing:
register int fn asm("r12");
__asm ("" :: "r"(fn));

x86-64 Zero Flag is clearing between inline calls (and another problem)

I am using the bsf x86-64 instruction, documented on page 210 of Intel's developer's manual found here. Essentially, if a least significant 1 bit is found, its bit index is stored in the destination operand.
Furthermore, the ZF flag is set to 1 if the source operand is 0; otherwise, the ZF flag is cleared.
I am compiling my C code with inline x86-64 assembly instructions. I have defined a C function which invokes the bsf instruction:
uint64_t bitScanForward(T_bitboard b) {
__asm__(
"bsf %rcx,%rax\n"
"leave\n"
"ret\n"
);
}
and also another C function which checks the status of the ZF bit in the flags register:
uint64_t isZFSet() {
printf("\n"); <- This is another problem I am having (see below)...
__asm__(
"jz true\n"
"movq $0,%rax\n"//return false
"jmp end\n"
"true:\n"
"movq $1,%rax\n"//return true
"end:\n"
"leave\n"
"ret\n"
);
}
I have tested these and found that the ZF flag is always cleared even when the bsf command is applied to the number zero, seemingly going against the specification.
//Calling function...
//Do stuff...
bitScanForward(0ULL);//ULL is 64 bit on my machine
if(isZFSet()){//ZF flag *should* be set here but its not
printf("ZF flag is set\n");
}
//More stuff...
I suspect the reason the ZF flag is clearing is due to entering and leaving one set of inline instructions to another.
How can I ensure that the flag in the above code is set as specified in the documentation? (I don't want to change much of my code or design)
My "other problem" is that if I don't include the printf statement in isZFSet, the function seemingly doesn't execute. Totally bizarre. Can anyone explain why?

You are treating an aggressively optimizing C compiler as if it were a macro assembler. That just plain isn't going to work. To get GCC to emit correct code in the presence of assembly inserts, you have to annotate the inserts with complete information about the registers and memory regions that are affected by the assembly code, and you have to use ancillary C statements to mesh them with the surrounding code. Even then, there are things the assembly insert cannot do at all. I urge you to scrap this entire mess and instead use the __builtin_ctzll intrinsic, as suggested in the comments on the question.
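A minimal sketch of the intrinsic route, assuming GCC or Clang: __builtin_ctzll is undefined for an argument of 0, so zero is handled explicitly. The function name ctz64 and the choice of 64 as the result for a zero input are assumptions for illustration.

```c
#include <stdint.h>

/* Index of the least significant set bit, or 64 if no bit is set. */
static inline unsigned ctz64(uint64_t b)
{
    /* __builtin_ctzll(0) is undefined, so test for zero first. */
    return b ? (unsigned)__builtin_ctzll(b) : 64u;
}
```

The compiler is free to lower this to bsf, tzcnt, or whatever suits the target, and it can still optimize across the call.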
Now, to specifics. Your first function is incorrect because GCC does not support use of leave or ret inside an assembly insert. (More generally, assembly inserts may not alter the stack pointer, and may only jump to designated labels within the same function.) The correct way to use bsf from a GCC-style assembly insert is with "extended asm" with input and output operands:
uint64_t bitScanForward(uint64_t b) {
uint64_t ret;
asm ("bsf %1, %0" : "=r" (ret) : "r" (b));
return ret;
}
You must declare a C variable to receive the output of the operation, and explicitly return that variable; having bsf write to %rax would not work (unlike how it was in old MSVC). BSF accepts any two registers as operands, so there is no need to use constraints more specific than r.
Your second function is incorrect because you didn't tell GCC that the condition codes were meaningful after bitScanForward, and because GCC does not support using the condition-code register as an input to an assembly insert. In order to read the ZF output from bsf you must do so within the same assembly insert that invoked bsf:
uint64_t countTrailingZeroes(uint64_t b) {
uint64_t ret;
asm ("bsf %1, %0\n\t"
"cmove %2, %0"
: "=&r" (ret)
: "r" (b), "rm" ((uint64_t)64)); /* 64-bit operand so cmove sees matching sizes */
return ret;
}
This requires special care -- see how the constraint on operand 0 is now =&r instead of just =r? Without that, GCC is liable to think it can put operand 2 in the same register as operand 0.
Alternatively, you can specify that ZF is an output, which is supported (see the "flag output operands" section of the manual) and then supply a default value from C:
uint64_t countTrailingZeroes(uint64_t b) {
uint64_t ret;
int zf;
asm ("bsf %2, %0"
: "=r" (ret), "=@ccz" (zf) : "r" (b));
if (zf) ret = 64;
return ret;
}
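Flag output operands are not available in every compiler version, so a portable variant can test the predefined macro __GCC_ASM_FLAG_OUTPUTS__ and fall back to the builtin otherwise. This is a sketch under that assumption; the name ctz_or_64 is made up for illustration.

```c
#include <stdint.h>

uint64_t ctz_or_64(uint64_t b)
{
#if defined(__x86_64__) && defined(__GCC_ASM_FLAG_OUTPUTS__)
    uint64_t ret;
    int zf;
    /* ZF comes back as a C value; when it is set, ret is not meaningful. */
    __asm__("bsf %[src], %[dst]"
            : [dst] "=r"(ret), [zf] "=@ccz"(zf)
            : [src] "r"(b));
    return zf ? 64 : ret;
#else
    /* Fallback for other targets or compilers without flag outputs. */
    return b ? (uint64_t)__builtin_ctzll(b) : 64;
#endif
}
```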

Why doesn't this compiler barrier enforce ordering?

I was looking at the documentation on the Atmel website and I came across this example where they explain some issues with reordering.
Here's the example code:
#define cli() __asm volatile( "cli" ::: "memory" )
#define sei() __asm volatile( "sei" ::: "memory" )
unsigned int ivar;
void test2( unsigned int val )
{
val = 65535U / val;
cli();
ivar = val;
sei();
}
In this example, they're implementing a critical region-like mechanism. The cli instruction disables interrupts and the sei instruction enables them. Normally, I would save the interrupt state and restore to that state, but I digress...
The problem they note is that, with optimization enabled, the division on the first line actually gets moved to after the cli instruction. This can cause issues when you're trying to stay inside the critical region for the shortest time possible.
How come this is possible if the cli() MACRO expands to inline asm which explicitly clobbers the memory? How is the compiler free to move things before or after this statement?
Also, I modified the code to include memory barriers before every statement in the form of __asm volatile("" ::: "memory"); and it doesn't seem to change anything.
I also removed the memory clobber from the cli() and sei() MACROs, and the generated code was identical.
Of course, if I declare the test2 function argument as volatile, there is no reordering, which I assume to be because volatile statements can't be reordered with respect to other volatile statements (which the inline asm technically is). Is my assumption correct?
Can volatile accesses be reordered with respect to volatile inline asm?
Can non-volatile accesses be reordered with respect to volatile inline asm?
What's weird is that Atmel claims they need the memory clobber just to enforce the ordering of volatile accesses with respect to the asm. That doesn't make any sense to me.
If the compiler barrier isn't the proper solution for this, then how could I go about preventing any outside code from "leaking" into the critical region?
If anyone could shed some light, I'd appreciate it.
Thanks

How come this is possible if the cli() MACRO expands to inline asm which explicitly clobbers the memory? How is the compiler free to move things before or after this statement?
This is due to implementation details of avr-gcc: The compiler's support library, libgcc, provides many functions written in assembly for performance; including functions for integer division like __udivmodhi4. Not all of these functions clobber all of the callee-used registers as specified by the avr-gcc ABI. In particular, __udivmodhi4 does not clobber the Z register.
avr-gcc makes use of this as follows: On machines without a 16-bit division instruction, like AVR, GCC would issue a library call instead of generating code for it inline. avr-gcc however pretends that the architecture does have such a division instruction and models it as having an effect on processor registers just like the library call. Finally, after all code analyses and optimizations, the avr backend prints this instruction as [R]CALL __udivmodhi4. Let's call this a transparent call, i.e. a call which the compiler analysis does not see.
Example
int div (int a, int b, volatile const __flash char *z)
{
int ab;
(void) *z;
asm volatile ("" : "+r" (a));
ab = a / b;
asm volatile ("" : "+r" (ab));
(void) *z;
return ab;
}
Compile this with avr-gcc -S -Os -mmcu=atmega8 ... to get assembly file *.s:
div:
movw r30,r20
lpm r18,Z
rcall __divmodhi4
movw r24,r22
lpm r18,Z
ret
Explanation
(void) *z reads one byte from flash, and in order to use the lpm instruction, the address must be in the Z register, which is accomplished by movw r30,r20. After reading via lpm, the compiler issues rcall __divmodhi4 to perform signed 16-bit division. If this was an ordinary (non-transparent) call, the compiler would know nothing about the internal working of the callee, but as the avr backend models the call by hand, the compiler knows that the instruction sequence does not change Z and hence may use Z again after the call without any further ado. This allows for better code generation due to less register pressure; in particular, z need not be saved / restored around the division.
The asm just serves to order the code: It is volatile and hence must not be reordered against the volatile read *z. And the asm must not be reordered against the division because the asm changes a and ab – at least that's what we are pretending and telling the compiler by means of the constraints. (These variables are not actually changed, but that does not matter here.)
Also, I modified the code to include memory barriers before every statement in the form of __asm volatile("" ::: "memory"); and it doesn't seem to change anything.
The division does not touch memory (it's a transparent call without memory clobber) hence the compiler machinery may reorder it against memory clobber / accesses.
If you need a specific order, then you'll have to introduce artificial dependencies like in my example above.
In order to tell apart ordinary calls from transparent ones, you can dump the generated assembly in the .s file by means of -save-temps -dp where -dp prints insn names:
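Applied to the test2 function from the question, such an artificial dependency might look like the sketch below. The enter_critical and leave_critical macros are hypothetical stand-ins for cli() and sei(): here they are only compiler barriers, so that the sketch builds on any GCC target rather than just AVR.

```c
/* Stand-ins for cli()/sei(): compiler barriers only, no interrupt control. */
#define enter_critical() __asm__ volatile("" ::: "memory")
#define leave_critical() __asm__ volatile("" ::: "memory")

unsigned int ivar;

void test2(unsigned int val)
{
    val = 65535U / val;
    /* Empty asm that claims to read and modify val: the quotient has to
       exist before this statement, and volatile asm statements keep
       their order relative to each other, so the division should stay
       ahead of enter_critical(). */
    __asm__ volatile("" : "+r"(val));
    enter_critical();
    ivar = val;
    leave_critical();
}
```

The empty asm emits no instructions; it only constrains where the compiler may place the division.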
void func0 (void);
int func1 (int a, int b)
{
return a / b;
}
void func2 (void)
{
func0();
}
Every call that's neither call_insn nor call_value_insn is a transparent call, *divmodhi4_call in this case:
func1:
rcall __divmodhi4 ; 17 [c=0 l=1] *divmodhi4_call
movw r24,r22 ; 18 [c=4 l=1] *movhi/0
ret ; 23 [c=0 l=1] return
func2:
rjmp func0 ; 5 [c=0 l=1] call_insn/3

inline assembly in avr

void save_context(uint8_t index) {
context *this_context = contextArray + index;
uint8_t *this_stack = this_context->stack;
asm volatile("st %0 r0": "r"(this_stack));
}
I have something like this.
!!! I would like to store the registers r0 r1 r2... into my stack[] array.
What I am programming is the context switch. The context has the structure like this:
typedef struct context_t {
uint8_t stack[THREAD_STACK_SIZE];
void *pstack;
struct context_t *next;
}context;
My problem is that I am not able to pass the c variable "this_stack" to inline assembly. My aim is to store all the registers, stack pointer and SREG on my stack.
After compiling, it gives error:
Description Resource Path Location Type
`,' required 5_multitasking line 754, external location: C:\Users\Jiadong\AppData\Local\Temp\ccDo7xn3.s C/C++ Problem
I looked up the avr inline assembly tutorial. But I don't quite get a lot.
Could anyone help me?

asm volatile ("st %0 r0": "r"(this_stack));
There are several problems in that line: Wrong % print-modifier, missing , between the operands, incorrect constraint and missing description of side effects.
The memory access is supposed to use indirect addressing, so one way is to use indirect+displacement with "b"ase register Y or Z:
asm volatile ("std %a1+0, r0" "\n\t"
"std %a1+1, r1" "\n\t"
"..."
: "+m" (this_context->stack)
: "b" (this_stack));
Notice print modifier %a which prints R30 as Z and not as r30.
Operand 0 is just used to express that this_context->stack is being changed if you don't want the all-memory-clobber "memory". Moreover, there's no need for an intermediate variable for operand 1 because it's not altered: you can use just as well "b" (this_context->stack) for operand 1.
Alternatively, post-increment addressing on "e"xtended (pointer) registers X, Y or Z can be used:
asm volatile ("st %a1+, r0" "\n\t"
"st %a1+, r1" "\n\t"
"..."
: "=m" (this_context->stack), "+e" (this_stack));

"label" makes no sense, that should be a constraint. It also makes no sense trying to save the stack pointer into an array. It might make sense to load the stack pointer with the address of that array, but that's not the save_context.
Anyway, to get the value of SPL which is the stack pointer you can do something like this:
asm volatile("in %0, %1": "=r" (*this_stack) : "I" (_SFR_IO_ADDR(SPL)));
(There is a q constraint but at least my gcc version doesn't like it.)
To get true registers, for example r26 you can do:
register uint8_t r26_value __asm__("r26");
asm volatile("": "=r" (r26_value));

There is a constraint, "m", documented in the GCC manual, but it doesn't always work on AVR. Here is an example of how it should work from sanguino/bootloaders/atmega644p/ATmegaBOOT
asm volatile("...
...
"sts %0,r16 \n\t"
...
: "=m" (SPMCSR) : ... );
I have found "m" to be fragile though. If a function uses a variable in C code, outside of the inline assembly, the compiler may choose to place it in the Z register and it will try to use Z in assembler too. This causes an assembler error when used with the sts instruction. Looking at the assembler output from the C compiler is the best way to debug this kind of problem.
Rather than using an "m" constraint, you can just put the literal address you want into your assembler code. For an example, see pins_teensy.c, where timer_0_fract_count is not included in the :
asm volatile(
...
"sts timer0_fract_count, r24" "\n\t"

GCC inline - push address, not its value to stack

I'm experimenting with GCC's inline assembler (I use MinGW, my OS is Win7).
Right now I'm only getting some basic C stdlib functions to work. I'm generally familiar with the Intel syntax, but new to AT&T.
The following code works nice:
char localmsg[] = "my local message";
asm("leal %0, %%eax" : "=m" (localmsg));
asm("push %eax");
asm("call %0" : : "m" (puts));
asm("add $4,%esp");
That LEA seems redundant, however, as I can just push the value straight onto the stack. Well, due to what I believe is an AT&T peculiarity, doing this:
asm("push %0" : "=m" (localmsg));
will generate the following assembly code in the final executable:
PUSH DWORD PTR SS:[ESP+1F]
So instead of pushing the address to my string, its contents were pushed because the "pointer" was "dereferenced", in C terms. This obviously leads to a crash.
I believe this is just GAS's normal behavior, but I was unable to find any information on how to overcome this. I'd appreciate any help.
P.S. I know this is a trivial question to those who are experienced in the matter. I expect to be downvoted, but I've just spent 45 minutes looking for a solution and found nothing.
P.P.S. I realize the proper way to do this would be to call puts( ) in the C code. This is for purely educational/experimental reasons.

While inline asm is always a bit tricky, calling functions from it is particularly challenging. Not something I would suggest for a "getting to know inline asm" project. If you haven't already, I suggest looking through the very latest inline asm docs. A lot of work has been done to try to explain how inline asm works.
That said, here are some thoughts:
1) Using multiple asm statements like this is a bad idea. As the docs say: Do not expect a sequence of asm statements to remain perfectly consecutive after compilation. If certain instructions need to remain consecutive in the output, put them in a single multi-instruction asm statement.
2) Directly modifying registers (like you are doing with eax) without letting gcc know you are doing so is also a bad idea. You should either use register constraints (so gcc can pick its own registers) or clobbers to let gcc know you are stomping on them.
3) When a function (like puts) is called, while some registers must have their values restored before returning, some registers can be treated as scratch registers by the called function (ie modified and not restored before returning). As I mentioned in #2, having your asm modify registers without informing gcc is a very bad idea. If you know the ABI for the function you are calling, you can add its scratch registers to the asm's clobber list.
4) While in this specific example you are using a constant string, as a general rule, when passing asm pointers to strings, structs, arrays, etc, you are likely to need the "memory" clobber to ensure that any pending writes to memory are performed before starting to execute your asm.
5) Actually, the lea is doing something very important. The value of esp is not known at compile time, so it's not like you can perform push $12345. Someone needs to compute (esp + the offset of localmsg) before it can be pushed on the stack. Also, see second example below.
6) If you prefer intel format (and what right-thinking person wouldn't?), you can use -masm=intel.
Given all this, my first cut at this code looks like this. Note that this does NOT clobber puts' scratch registers. That's left as an exercise...
#include <stdio.h>
int main()
{
const char localmsg[] = "my local message";
int result;
/* Use 'volatile' since 'result' is usually not going to get used,
which might tempt gcc to discard this asm statement as unneeded. */
asm volatile ("push %[msg] \n\t" /* Push the address of the string. */
"call %[puts] \n \t" /* Call the print function. */
"add $4,%%esp" /* Clean up the stack. */
: "=a" (result) /* The result code from puts. */
: [puts] "m" (puts), [msg] "r" (localmsg)
: "memory", "esp");
printf("%d\n", result);
}
True this doesn't avoid the lea due to #5. However, if that's really important, try this:
#include <stdio.h>
const char localmsg[] = "my local message";
int main()
{
int result;
/* Use 'volatile' since 'result' is usually not going to get used. */
asm volatile ("push %[msg] \n\t" /* Push the address of the string. */
"call %[puts] \n \t" /* Call the print function. */
"add $4,%%esp" /* Clean up the stack. */
: "=a" (result) /* The result code. */
: [puts] "m" (puts), [msg] "i" (localmsg)
: "memory", "esp");
printf("%d\n", result);
}
As a global, the address of localmsg is now knowable at compile time (ok, I'm simplifying a bit), the asm produced looks like this:
push $__ZL8localmsg
call _puts
add $4,%esp
Tada.