RISC-V inline assembly - c

I'm quite new to inline assembly, so I need your help to be sure that I use it correctly.
I need to add assembly code inside my C code, which is compiled with the RISC-V toolchain. Please consider the following code:
int bar = 0xFF00;

int main() {
    volatile int result;
    int k;
    k = funct();
    int* ptr;
    ptr = &bar;
    asm volatile (".insn r 0x33, 0, 0, a4, a5, a3" :
                  "=m"(*ptr), "=r"(result) :
                  [a5] "m"(*ptr), [a3] "r"(k) :
                  );
}
...
What I want to do is bar = bar+k. Actually, I want to change the content of the memory location that bar resides in. But the code that I wrote gets the address of bar and adds it to k. Does anybody know what the problem is?

Unfortunately, you have misunderstood the syntax.
In the assembler string, you can refer to an operand either by number, %0, %1, ..., where the number identifies the n-th operand passed to the asm directive, or by symbolic name, %[myname], which refers to an operand written in the form [myname] "r" (k).
Note that a symbolic name is just an alias for the corresponding number; the name itself doesn't imply anything. In your example, one could get the impression that you are forcing the code to use specific processor registers. (There is another syntax for that, if you really need it.)
For example, if you write something like:
int bar = 0xFF00;

int main() {
    volatile int result;
    int k;
    k = funct();
    int* ptr;
    ptr = &bar;
    asm volatile (".insn r 0x33, 0, 0, %[res], %[res], %[ptr]" :
                  [res] "+r" (result) : [ptr] "r" (ptr));
}
The IAR compiler will emit the following. As you can see, a0 has been assigned to the result variable (using the symbolic name res) and a1 to the variable ptr (here the symbolic name happens to be the same as the variable name).
\ 000014 0001'2503 lw a0, 0x0(sp)
\ 000018 0000'05B7 lui a1, %hi(bar)
\ 00001C 0005'8593 addi a1, a1, %lo(bar)
\ 000020 00B5'0533 .insn r 0x33, 0, 0, a0, a0, a1
\ 000024 00A1'2023 sw a0, 0x0(sp)
You can read more about the IAR inline assembly syntax in the book "IAR C/C++ Development Guide Compiling and linking for RISC-V", in chapter "Assembler Language Interface". The book is provided as a PDF, which you can access from within IAR Embedded Workbench.

Based on the snippet provided in your question, I tried the following code with the IAR C/C++ Compiler for RISC-V:
int funct();
int funct() { return 0xA5; } // stub

int bar = 0xFF00;

int main() {
    int k = funct();
    int* ptr = &bar;
    asm volatile (".insn r 0x33, 0, 0, %[res], %[ptr], %[k]"
                  : [res] "=r" (*ptr)
                  : [ptr] "r" (*ptr), [k] "r" (k));
}
In this case, the .insn directive generates an add rd, rs1, rs2 instruction, which effectively performs *ptr = *ptr + k.
In an earlier version of this answer, it was assumed that there was a requirement to be explicit about which registers to use. For that, explicit register selectors were used, which the IAR compiler simply allows (e.g., "a3", ="a3", "a4", "a5", etc.). As noted by @PeterCordes in the comments, GCC offers a different set of constraints and would require a different solution. However, if there is no need to be explicit about the registers, it is better to let the compiler decide which ones to use; it will generally impose less overhead.
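For completeness, a GCC/Clang-compatible way to force specific registers is local register variables bound to asm operands. The following is an untested sketch for a RISC-V GCC toolchain (the function name is made up for illustration):

int bar = 0xFF00;

int add_with_fixed_regs(int k) {
    register int rs1 __asm__("a5") = bar;  /* pin rs1 to a5 */
    register int rs2 __asm__("a3") = k;    /* pin rs2 to a3 */
    register int rd  __asm__("a4");        /* pin the result to a4 */
    __asm__ volatile(".insn r 0x33, 0, 0, %0, %1, %2"  /* expands to: add a4, a5, a3 */
                     : "=r"(rd)
                     : "r"(rs1), "r"(rs2));
    return rd;
}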

Related

Early-clobbers and named registers

I'm trying to understand the usage of "early-clobber outputs", but I stumbled upon a snippet which confuses me. Consider the following multiply-modulo function:
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t n)
{
    uint64_t d;
    uint64_t unused;
    asm ("mulq %3\n\t"
         "divq %4"
         : "=a"(unused), "=&d"(d)
         : "a"(a), "rm"(b), "rm"(n)
         : "cc");
    return d;
}
Why does RDX have the early-clobber flag (&)? Is it because mulq implicitly modifies RDX? Would the example work without the flag? (I tried, and it seems it does. But would it be correct as well?) On the other hand, isn't the fact that the function outputs RDX enough to tell the compiler RDX was modified?
Also, why is there that unused variable? I assume it's there to denote that RAX was modified, correct? Can I remove it? (I tried, and it seems to work.) I would have expected the correct way of marking the modified RAX to be adding "rax" to the clobbers, along with "cc". But that does not work.
While this doesn't answer the question - I think the comments have it covered - I would simplify this by letting the compiler choose between registers and memory, and by allowing it to schedule mulq and divq independently... The problem is that div has register restrictions:
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t n)
{
    uint64_t ret, q, rh, rl;

    __asm__ ("mulq %3" : "=a,a" (rl), "=d,d" (rh)
             : "%0,0" (a), "r,m" (b) : "cc");

    /* assert(rh < n), otherwise `div` raises a 'divide error' - the quotient
     * is too large to store in `%rax`. */

    /* the "%0,0" notation implies that `(a)` and `(b)` are commutative.
     * the "cc" clobber is implicit in gcc / clang asm (and, I expect, Intel
     * icc) for x86-64 asm statements. */

    __asm__ ("divq %4" : "=a,a" (q), "=d,d" (ret)
             : "0,0" (rl), "1,1" (rh), "r,m" (n) : "cc");

    return ret;
}
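For reference, a hypothetical test harness for the simplified version (assuming the mulmod64 definition above is in the same translation unit, and GCC's __uint128_t extension for the cross-check):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t a = 1ULL << 40, b = 3, n = 1000003;
    uint64_t r = mulmod64(a, b, n);
    /* cross-check against a plain 128-bit computation */
    assert(r == (uint64_t)(((__uint128_t)a * b) % n));
    printf("%llu\n", (unsigned long long)r);
    return 0;
}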

How to understand this GNU C inline assembly macro for PowerPC stwbrx

This is basically to perform a byte swap while transferring a message buffer. This statement left me puzzled (because of my unfamiliarity with embedded assembly code in C). It is a PowerPC instruction:
#define ASMSWAP32(dest_addr,data) __asm__ volatile ("stwbrx %0, 0, %1" : : "r" (data), "r" (dest_addr))
Besides being unsafe because of a bug, this macro is also less efficient than what the compiler will generate for you.
stwbrx = store word byte-reversed. The x stands for indexed.
You don't need inline asm for this in GNU C, where you can use __builtin_bswap32 and let the compiler emit this instruction for you.
void swapstore_asm(int a, int *p) {
    ASMSWAP32(p, a);
}

void swapstore_c(int a, int *p) {
    *p = __builtin_bswap32(a);
}
Compiled with gcc 4.8.5 -O3 -mregnames, we get identical code from both functions (Godbolt compiler explorer):
swapstore_asm:
stwbrx %r3, 0, %r4
blr
swapstore_c:
stwbrx %r3,0,%r4
blr
But with a more complicated address (storing to p[off], where off is an integer function arg), the compiler knows how to use both register inputs, while your macro forces the compiler to have the address in a single register:
void swapstore_offset(int a, int *p, int off) {
    p[off] = __builtin_bswap32(a);
}
void swapstore_offset_asm(int a, int *p, int off) {
    ASMSWAP32(p + off, a);
}
swapstore_offset:
slwi %r5,%r5,2 # *4 = sizeof(int)
stwbrx %r3,%r4,%r5 # use an indexed addressing mode, with both registers non-zero
blr
swapstore_offset_asm:
slwi %r5,%r5,2
add %r4,%r4,%r5 # extra instruction forced by using the macro
stwbrx %r3, 0, %r4
blr
BTW, if you're having trouble understanding GNU C inline asm templates, looking at the compiler's asm output can be a useful way to see what gets substituted in. See How to remove "noise" from GCC/clang assembly output? for more about reading compiler asm output.
Also note that this macro is buggy: it's missing a "memory" clobber for the store. And yes, you still need that with asm volatile. The compiler doesn't assume that *dest_addr is modified unless you tell it, so it could hoist a non-volatile load of *dest_addr ahead of this insn, or more likely to be a real problem, sink a store after it. (e.g. if you zeroed a buffer before storing to it with this, the compiler might actually zero after this instruction.)
Instead of a "memory" clobber (and also leaving out volatile), you could tell the compiler which memory location you modify with a =m" (*dest_addr) operand, either as a dummy operand or with a constraint on the addressing mode so you could use it as reg+reg. (IDK PPC well enough to know what "=m" usually expands to.)
In most cases this bug won't bite you, but it's still a bug. Upgrading your compiler version or using link-time optimization could maybe make your program buggy with no source-level changes.
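A sketch of that operand-based fix (untested; it assumes GCC's PowerPC "Z" memory constraint together with the "%y" operand modifier, which together print an indexed reg+reg or register-indirect address):

#include <stdint.h>

/* No "memory" clobber and no volatile needed: the "=Z" output tells the
 * compiler exactly which object the store modifies. */
#define ASMSWAP32_FIXED(dest_addr, data)            \
    __asm__ ("stwbrx %1, %y0"                       \
             : "=Z" (*(uint32_t *)(dest_addr))      \
             : "r" (data))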
This kind of thing is why https://gcc.gnu.org/wiki/DontUseInlineAsm
See also https://stackoverflow.com/tags/inline-assembly/info.
#define ASMSWAP32(dest_addr,data) ...
This part should be clear
__asm__ volatile ( ... : : "r" (data), "r" (dest_addr))
This is the actual inline assembly:
Two values are passed to the assembly code; no value is returned from it (hence the two colons with an empty output section after the assembly template).
Both parameters are passed in registers ("r"). The expression %0 will be replaced by the register that contains the value of data, while %1 will be replaced by the register that contains the value of dest_addr (which is a pointer in this case).
The volatile here means that the assembly code has to be executed (it may not be optimized away) and may not be moved somewhere else.
So if you use the following code in the C source:
ASMSWAP32(&a, b);
... the following assembler code will be generated:
# write the address of a to register 5 (for example)
...
# write the value of b to register 6
...
stwbrx 6, 0, 5
So the first argument of the stwbrx instruction is the value of b and the last argument is the address of a.
stwbrx x, 0, y
This instruction writes the value in register x to the address stored in register y; however, it writes the value byte-reversed (on a big-endian CPU it stores the value little-endian).
The following code:
uint32_t a;
ASMSWAP32(&a, 0x12345678);
... should therefore result in a = 0x78563412.
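For comparison, a small hypothetical check of the same result using the portable builtin instead of the macro:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t a = __builtin_bswap32(0x12345678u);  /* byte-reverse */
    printf("0x%08x\n", a);                        /* prints 0x78563412 */
    return 0;
}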

Vector Sum using AVX Inline Assembly on XeonPhi

I am new to using the Xeon Phi Intel co-processor. I want to write code for a simple vector sum using AVX 512-bit instructions. I use k1om-mpss-linux-gcc as a compiler and want to write inline assembly. Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <assert.h>
#include <stdint.h>

void* aligned_malloc(size_t size, size_t alignment) {
    uintptr_t r = (uintptr_t)malloc(size + --alignment + sizeof(uintptr_t));
    uintptr_t t = r + sizeof(uintptr_t);
    uintptr_t o = (t + alignment) & ~(uintptr_t)alignment;
    if (!r) return NULL;
    ((uintptr_t*)o)[-1] = r;
    return (void*)o;
}

int main(int argc, char* argv[])
{
    printf("Starting calculation...\n");
    int i;
    const int length = 65536;

    unsigned *A = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
    unsigned *B = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
    unsigned *C = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);

    for (i = 0; i < length; i++) {
        A[i] = 1;
        B[i] = 2;
    }

    const int AVXLength = length / 16;
    unsigned char * pA = (unsigned char *) A;
    unsigned char * pB = (unsigned char *) B;
    unsigned char * pC = (unsigned char *) C;
    for (i = 0; i < AVXLength; i++) {
        __asm__("vmovdqa32 %1,%%zmm0\n"
                "vmovdqa32 %2,%%zmm1\n"
                "vpaddd %0,%%zmm0,%%zmm1;"
                : "=m" (pC) : "m" (pA), "m" (pB));
        pA += 64;
        pB += 64;
        pC += 64;
    }

    // To prove that the program actually worked
    for (i = 0; i < 5; i++)
    {
        printf("C[%d] = %u\n", i, C[i]);
    }
}
However, when I run the program, I get a segmentation fault in the asm part. Can somebody help me with that?
Thanks
Xeon Phi Knights Corner doesn't support AVX. It only supports a special set of vector extensions, called Intel Initial Many Core Instructions (Intel IMCI), with a vector size of 512 bits. So trying to put any sort of AVX-specific assembly into KNC code will lead to crashes.
Just wait for Knights Landing. It will support AVX-512 vector extensions.
Although Knights Corner (KNC) does not have AVX512, it has something very similar, and many of the mnemonics are the same. In fact, in the OP's case the mnemonics vmovdqa32 and vpaddd are the same for AVX512 and KNC.
The opcodes likely differ, but the compiler/assembler takes care of this. In the OP's case he/she is using a special version of GCC, k1om-mpss-linux-gcc, which is part of the many-core software stack for KNC and presumably generates the correct opcodes. One can compile on the host using k1om-mpss-linux-gcc and then scp the binary to the KNC card. I learned about this from a comment in this question.
As to why the OP's code is failing I can only guess, since I don't have a KNC card to test with.
In my limited experience with GCC inline assembly I have learned that it's good to look at the generated assembly in the object file to make sure the compiler did what you expect.
When I compile your code with a normal version of GCC I see that the line "vpaddd %0,%%zmm0,%%zmm1;" produces assembly with the semicolon. I don't think the semicolon should be there. That could be one problem.
But since the OP's mnemonics are the same as AVX512, we can use AVX512 intrinsics to figure out the correct assembly:
#include <x86intrin.h>

void foo(int *A, int *B, int *C) {
    __m512i a16 = _mm512_load_epi32(A);
    __m512i b16 = _mm512_load_epi32(B);
    __m512i s16 = _mm512_add_epi32(a16, b16);
    _mm512_store_epi32(C, s16);
}
and gcc -mavx512f -O3 -S knc.c produces
vmovdqa64 (%rsi), %zmm0
vpaddd (%rdi), %zmm0, %zmm0
vmovdqa64 %zmm0, (%rdx)
GCC chose vmovdqa64 instead of vmovdqa32, even though the Intel documentation says it should be vmovdqa32. I am not sure why. I don't know what the difference is. I could have used the intrinsic _mm512_load_si512, which does exist and which, according to Intel, should map to vmovdqa32, but GCC maps it to vmovdqa64 as well. I am not sure why there are both _mm512_load_epi32 and _mm512_load_epi64 now; SSE and AVX don't have corresponding intrinsics.
Based on GCC's code, here is the inline assembly I would use:
__asm__ ("vmovdqa64 (%1), %%zmm0\n"
         "vpaddd (%2), %%zmm0, %%zmm0\n"
         "vmovdqa64 %%zmm0, (%0)"
         :
         : "r" (pC), "r" (pA), "r" (pB)
         : "memory"
         );
Maybe vmovdqa32 should be used instead of vmovdqa64 but I expect it does not matter.
I used the register constraint r instead of the memory constraint m because, from past experience, the memory constraint did not produce the assembly I expected.
Another possibility to consider is to use a version of GCC that supports AVX512 intrinsics to generate the assembly and then use the special KNC version of GCC to convert the assembly to binary. For example
gcc-5.1 -O3 -S foo.c
k1om-mpss-linux-gcc foo.s
This may be asking for trouble since k1om-mpss-linux-gcc is likely an older version of GCC. I have never done something like this before but it may work.
As explained here, the reason the AVX512 intrinsics
_mm512_load/store(u)_epi32
_mm512_load/store(u)_epi64
_mm512_load/store(u)_si512
exist is that the parameters have been converted to void*. For example, with SSE you have to cast:
int *x;
__m128i v;
_mm_store_si128((__m128i*)x, v);
whereas with AVX512 you no longer need to:
int *x;
__m512i v;
_mm512_store_epi32(x, v);
//_mm512_store_si512(x, v); // this is also fine
It's still not clear to me why there are both vmovdqa32 and vmovdqa64 (GCC only seems to use vmovdqa64 currently), but it's probably similar to movaps and movapd in SSE, which have no real difference and exist only in case they may make a difference in the future.
The purpose of vmovdqa32 and vmovdqa64 is masking, which can be done with these intrinsics:
_mm512_mask_load/store_epi32
_mm512_mask_load/store_epi64
Without masks the instructions are equivalent.
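To illustrate, a minimal sketch (assuming an AVX-512F target, compiled with -mavx512f; the mask value is arbitrary). With a mask the element size matters: vmovdqa32 masks sixteen 32-bit lanes, while vmovdqa64 masks eight 64-bit lanes.

#include <x86intrin.h>

void masked_store(int *C, __m512i v) {
    __mmask16 m = 0x00FF;              /* write only the low 8 dwords */
    _mm512_mask_store_epi32(C, m, v);  /* emits vmovdqa32 with a {k} mask */
}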

gcc optimize variable away before systemcall

I'm using the CodeSourcery arm-linux-eabi cross-compiler and have problems with the compiler optimizing away certain code because it thinks the result is not used, especially ahead of a system call. Is there any way to get around this?
For example, this code does not initialize the variable:
unsigned int temp = 42;
asm volatile("mov R1, %0" :: "r" (temp));
asm volatile("swi 1");
In this case temp never gets initialized to the value 42. However, if I add a printk after the initialization, it gets set to the correct value. I tried
unsigned int temp __attribute__ ((used)) = 42;
but it still doesn't work, and I get a warning message:
'used' attribute ignored [-Wattributes]
This is in Linux kernel code.
Any tips?
This is not the correct way to use inline assembly. As written, the two statements are separate, and there is no reason the compiler has to preserve any register values between the two. You need to either put both assembly instructions in the same inline assembly block, with proper input and output constraints, or you could do something like the following which allows the compiler to be more efficient:
register unsigned int temp __asm__("r1") = 42;
__asm__ volatile("swi 1" : : "r"(temp) : "memory");
(Note that I added memory to the clobber list; I'm not sure which syscall you're making, but if the syscall writes to any object in userspace, "memory" needs to be listed in the clobberlist.)
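Putting that together as a complete wrapper, here is an untested sketch (the SWI number and argument register follow the question; the assumption that the result comes back in r0 is mine and depends on your kernel's ABI):

static inline unsigned int do_swi1(unsigned int arg)
{
    register unsigned int a   __asm__("r1") = arg;  /* argument in r1, per the question */
    register unsigned int ret __asm__("r0");        /* assumed return register */
    __asm__ volatile("swi 1"
                     : "=r"(ret)
                     : "r"(a)
                     : "memory");
    return ret;
}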

GCC: Prohibit use of some registers

This is a strange request, but I have a feeling that it could be possible. What I would like is to insert some pragmas or directives into areas of my code (written in C) so that GCC's register allocator will not use certain registers there.
I understand that I can do something like this, which might set aside these registers for these variables:
register int var1 asm ("EBX") = 1984;
register int var2 asm ("r9") = 101;
The problem is that I'm inserting new instructions (for a hardware simulator) directly and GCC and GAS don't recognise these yet. My new instructions can use the existing general purpose registers and I want to make sure that I have some of them (i.e. r12->r15) reserved.
Right now, I'm working in a mockup environment and I want to do my experiments quickly. In the future I will extend GAS and add intrinsics to GCC, but right now I'm looking for a quick fix.
Thanks!
When writing GCC inline assembler, you can specify a "clobber list" - a list of registers that may be overwritten by your inline assembler code. GCC will then do whatever is needed to save and restore data in those registers (or avoid their use in the first place) over the course of the inline asm segment. You can also bind input or output registers to C variables.
For example:
inline unsigned int addone(unsigned int v)
{
    unsigned int rv;
    asm("mov $1, %%eax;"
        "mov %1, %%ebx;"
        "add %%eax, %%ebx"
        : /* outputs */ "=b" (rv)
        : /* inputs */ "g" (v) /* any general reg, memory, or immediate */
        : /* clobbers */ "eax", "cc"
        );
    return rv;
}
For more information, see the GCC-Inline-Asm-HOWTO.
If you use global explicit register variables, these will be reserved throughout the compilation unit and will not be used by the compiler for anything else (they may still be used by the system's libraries, so choose something that will be restored by those). Local register variables do not guarantee that your value will be in the register at all times, but only when referenced by code or used as an asm operand.
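For example, a sketch of reserving r12 through r15 globally (assuming x86-64; global register variables must be declared at file scope, before any function):

/* GCC will not allocate these registers anywhere in this compilation unit. */
register long reserved_r12 asm("r12");
register long reserved_r13 asm("r13");
register long reserved_r14 asm("r14");
register long reserved_r15 asm("r15");

GCC's -ffixed-reg option (if your target supports it for these registers) achieves a similar reservation from the command line.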
If you write an inline asm block for your new instructions, there are commands that inform GCC what registers are used by that block and how they are used. GCC will then avoid using those registers or will at least save and reload their contents.
Non-hardcoded scratch register in inline assembly
This is not a direct answer to the original question, but since I keep Googling this question in that context, and since https://stackoverflow.com/a/6683183/895245 was accepted, I'm going to try and provide a possible improvement to that answer.
The improvement is the following: you should avoid hard-coding your scratch registers when possible, to give the register allocator more freedom.
Therefore, as an educational example that is useless in practice (it could be done with a single lea (%[in1], %[in2]), %[out]), consider the following hardcoded-scratch-register code:
bad.c
#include <assert.h>
#include <inttypes.h>

int main(void) {
    uint64_t in1 = 0xFFFFFFFF;
    uint64_t in2 = 1;
    uint64_t out;
    __asm__ (
        "mov %[in2], %%rax;" /* scratch = in2 */
        "add %[in1], %%rax;" /* scratch += in1 */
        "mov %%rax, %[out];" /* out = scratch */
        : [out] "=r" (out)
        : [in1] "r" (in1),
          [in2] "r" (in2)
        : "rax"
    );
    assert(out == 0x100000000);
}
could compile to something more efficient if you instead use this non-hardcoded version:
good.c
#include <assert.h>
#include <inttypes.h>

int main(void) {
    uint64_t in1 = 0xFFFFFFFF;
    uint64_t in2 = 1;
    uint64_t out;
    uint64_t scratch;
    __asm__ (
        "mov %[in2], %[scratch];" /* scratch = in2 */
        "add %[in1], %[scratch];" /* scratch += in1 */
        "mov %[scratch], %[out];" /* out = scratch */
        : [scratch] "=&r" (scratch),
          [out] "=r" (out)
        : [in1] "r" (in1),
          [in2] "r" (in2)
        :
    );
    assert(out == 0x100000000);
}
since the compiler is free to choose any register it wants instead of just rax.
Note that in this example we had to mark the scratch register as an early-clobber (&) to prevent it from being placed in the same register as an input; I have explained that in more detail at: When to use earlyclobber constraint in extended GCC inline assembly? This example also happens to fail without & in the implementation I tested on.
Tested on Ubuntu 18.10 amd64 with GCC 8.2.0; compile and run with:
gcc -O3 -std=c99 -ggdb3 -Wall -Werror -pedantic -o good.out good.c
./good.out
Non-hardcoded scratch registers are also mentioned in the GCC manual, section 6.45.2.6 "Clobbers and Scratch Registers", although the example there is a lot to take in at once:
Rather than allocating fixed registers via clobbers to provide scratch registers for an asm statement, an alternative is to define a variable and make it an early-clobber output as with a2 and a3 in the example below. This gives the compiler register allocator more freedom. You can also define a variable and make it an output tied to an input as with a0 and a1, tied respectively to ap and lda. Of course, with tied outputs your asm can’t use the input value after modifying the output register since they are one and the same register. What’s more, if you omit the early-clobber on the output, it is possible that GCC might allocate the same register to another of the inputs if GCC could prove they had the same value on entry to the asm.

This is why a1 has an early-clobber. Its tied input, lda might conceivably be known to have the value 16 and without an early-clobber share the same register as %11. On the other hand, ap can’t be the same as any of the other inputs, so an early-clobber on a0 is not needed. It is also not desirable in this case. An early-clobber on a0 would cause GCC to allocate a separate register for the "m" ((const double (*)[]) ap) input.

Note that tying an input to an output is the way to set up an initialized temporary register modified by an asm statement. An input not tied to an output is assumed by GCC to be unchanged, for example "b" (16) below sets up %11 to 16, and GCC might use that register in following code if the value 16 happened to be needed. You can even use a normal asm output for a scratch if all inputs that might share the same register are consumed before the scratch is used. The VSX registers clobbered by the asm statement could have used this technique except for GCC’s limit on the number of asm parameters.
static void
dgemv_kernel_4x4 (long n, const double *ap, long lda,
                  const double *x, double *y, double alpha)
{
    double *a0;
    double *a1;
    double *a2;
    double *a3;
    __asm__
    (
        /* lots of asm here */
        "#n=%1 ap=%8=%12 lda=%13 x=%7=%10 y=%0=%2 alpha=%9 o16=%11\n"
        "#a0=%3 a1=%4 a2=%5 a3=%6"
        :
        "+m" (*(double (*)[n]) y),
        "+&r" (n),      // 1
        "+b" (y),       // 2
        "=b" (a0),      // 3
        "=&b" (a1),     // 4
        "=&b" (a2),     // 5
        "=&b" (a3)      // 6
        :
        "m" (*(const double (*)[n]) x),
        "m" (*(const double (*)[]) ap),
        "d" (alpha),    // 9
        "r" (x),        // 10
        "b" (16),       // 11
        "3" (ap),       // 12
        "4" (lda)       // 13
        :
        "cr0",
        "vs32","vs33","vs34","vs35","vs36","vs37",
        "vs40","vs41","vs42","vs43","vs44","vs45","vs46","vs47"
    );
}
