Use C variables in ARM NEON assembly

I have a problem using C/C++ variables inside ARM NEON assembly code written with:
__asm__ __volatile__()
I have read about the following two ways to move a value from an ARM core register to a NEON register. Each of them causes a fatal signal in my Android application:
VDUP.32 d0, %[variable]
VMOV.32 d0[0], %[variable]
The input operand list includes:
[variable] "r" (variable)
The only approach that works for me is a load:
int variable = 0;
int *address = &variable;
....
VLD1.32 d0[0], [%[address]]
: [address] "+r" (address)
But I think a load is not the best choice for performance when I don't need to modify the variable, and I also want to understand how to move data from ARM to NEON registers for other purposes.
EDIT: I added an example as requested. Both possibility 1 and possibility 2 result in a "fatal signal". I know that in this example the NEON assembly should simply operate on the first 2 elements of "array4".
int c = 10;
int *array4;
array4 = new int[64];
for(int i = 0; i < 64; i++){
array4[i] = 100*i;
}
__asm__ __volatile ("VLD1.32 d0, [%[array4]] \n\t"
"VMOV.32 d1[0], %[c] \n\t" //this is possibility 1
"VDUP.32 d2, %[c] \n\t" //this is possibility 2
"VMUL.S32 d0, d0, d2 \n\t"
"VST1.32 d0, [%[output_array1]] \n\t"
: [output_array1] "=r" (output_array1)
: [c] "r" (c), [array4] "r" (array4)
: "d0", "d1", "d2");

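The problem is caused by the output operand list. Declaring output_array1 as "=r" tells the compiler that the asm only writes that register, so the pointer value is not guaranteed to be loaded into it before the asm runs, and VST1 then stores to an arbitrary address. Moving the output array's address into an input operand fixes the crashes: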
int c = 10;
int *array4;
array4 = new int[64];
for(int i = 0; i < 64; i++){
array4[i] = 100*i;
}
__asm__ __volatile ("VLD1.32 d0, [%[array4]] \n\t"
"VMOV.32 d1[0], %[c] \n\t" //this is possibility 1
"VDUP.32 d2, %[c] \n\t" //this is possibility 2
"VMUL.S32 d0, d0, d2 \n\t"
"VST1.32 d0, [%[output_array1]] \n\t"
:
: [c] "r" (c), [array4] "r" (array4), [output_array1] "r" (output_array1)
: "d0", "d1", "d2", "memory"); // "memory": the asm reads array4 and stores through output_array1
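For reference, here is a minimal sketch of the same fix wrapped in a function. The name scale2 and the scalar fallback are my own assumptions, not part of the original code. The NEON path is only compiled for 32-bit ARM targets with NEON enabled, and it uses VMUL.I32 and the braced register-list form of VLD1/VST1. It also lists a "memory" clobber, since the asm reads and writes memory that the compiler cannot see through the register operands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post): multiplies the first
   two 32-bit elements of `in` by `c` and stores the products to `out`. */
static void scale2(const int32_t *in, int32_t *out, int32_t c)
{
    assert(in != NULL && out != NULL);
#if defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
    /* Both pointers are INPUT operands: the asm only reads the registers
       that hold the addresses, it never changes them. */
    __asm__ __volatile__(
        "vld1.32  {d0}, [%[in]]   \n\t"  /* load in[0] and in[1] */
        "vdup.32  d2, %[c]        \n\t"  /* broadcast c from an ARM core register */
        "vmul.i32 d0, d0, d2      \n\t"  /* lane-wise 32-bit multiply */
        "vst1.32  {d0}, [%[out]]  \n\t"  /* store to out[0] and out[1] */
        :
        : [in] "r"(in), [out] "r"(out), [c] "r"(c)
        : "d0", "d2", "memory");
#else
    /* Portable fallback so the sketch also builds on non-NEON targets. */
    out[0] = in[0] * c;
    out[1] = in[1] * c;
#endif
}
```

With the data from the question, scale2(array4, output_array1, 10) would write 0 and 1000 to the first two elements of output_array1, assuming output_array1 points to valid int storage.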

Related

How can I call a function in inline assembly from C [duplicate]

I am currently playing around with inline assembly and I've gotten a bit stuck. I have managed to call a function with no parameters, but I get stuck when calling one with two parameters.
My code below should call a function (add) that adds two predefined numbers together, and it should call a second one (add_parameter) with two parameters, which should be added together.
#include <stdio.h>
int c = 4;
int d = 5;
void add() {
int result = 1 + 2;
printf("Result: %d\n", result);
}
void add_parameter(int a, int b) {
int result = a + b;
printf("Result: %d\n", result);
}
int main()
{
__asm__ __volatile__ ( "call add" );
// __asm__ __volatile__(
// "mov eax, offset c"
// "push eax"
// "mov eax, offset d"
// "push eax"
// "call add_parameter"
// "pop ebx"
// "pop ebx"
// );
__asm__ __volatile__ ( "mov eax, offset c" );
__asm__ __volatile__ ( "push eax" );
__asm__ __volatile__ ( "mov eax, offset d" );
__asm__ __volatile__ ( "push eax" );
__asm__ __volatile__ ( "call add_parameter" );
__asm__ __volatile__ ( "pop ebx" );
__asm__ __volatile__ ( "pop ebx" );
return 0;
}
My problem at the moment is that when I try to compile the program I get the following errors:
p_function.c:31: Error: too many memory references for `mov'
p_function.c:33: Error: too many memory references for `mov'
In my program I've tried two approaches: a single asm statement containing all of the assembly code (commented out above), and one asm statement per instruction.
I am not sure which of these approaches is correct, let alone which is more efficient, but I get the same error with both.
How can I fix this problem and call the function add_parameter?
Thanks

INC malfunction in inline assembly

In this code:
int a[2]={5,2},i=0;
asm volatile
(
"incl %1\n"
"incl %0"
:"+r"(a[i]),"+r"(i)
:
:
);
printf("%d\n",a[i]);
I'm trying to increase a[1] by 1 (for a result of 2+1=3) but the output shows 2, which means it hasn't changed. What's the problem and how can I fix it?

How can I determine (preferably at compile-time) whether gcc is using rbp-based offsets or rsp-based offsets?

I want to write something like this:
#include <stdint.h>
inline uint64_t with_rsp(uint64_t x, uint64_t y) {
uint64_t z, w;
uint64_t rsp;
asm ("mov %%rsp, %[rsp]\t\n"
"mov $0x13, %%rsp\t\n"
"mov %[x], %%rdx\t\n"
"mulx %[y], %[z], %[w]\t\n"
"mov %[rsp], %%rsp\t\n"
: [z] "=&r" (z), [w] "=&r" (w)
: [x] "r" (x), [y] "r" (y), [rsp] "m" (rsp)
: "rdx"
);
return z + w;
}
inline uint64_t with_rbp(uint64_t x, uint64_t y) {
uint64_t z, w;
uint64_t rbp;
asm ("mov %%rbp, %[rbp]\t\n"
"mov $0x13, %%rbp\t\n"
"mov %[x], %%rdx\t\n"
"mulx %[y], %[z], %[w]\t\n"
"mov %[rbp], %%rbp\t\n"
: [z] "=&r" (z), [w] "=&r" (w)
: [x] "r" (x), [y] "r" (y), [rbp] "m" (rbp)
: "rdx"
);
return z + w;
}
int main() {
uint64_t x = 15, y = 3, zw;
if (inline_asm_uses_rbp()) {
zw = with_rsp(x, y);
} else {
zw = with_rbp(x, y);
}
return zw;
}
Ideally, the if statement should compile away at compile-time (but I don't think I can do this with preprocessor macros, because those get evaluated before the code is assembled). So I'm fine with needing some sort of jump to get it to work, though I'd prefer to not need that.
The reason I need this is that I have some inline assembly that needs to be able to use 15 registers, plus some memory locations on the stack, and gcc is choosing rsp-based offsets in some locations where the function is inlined, and it's choosing rbp-based offsets in other locations. (A separate assembly module isn't a good match for this because I'd like to avoid the overhead of a function call.)

How to update an array in vectorized assembly (AVX)?

inline void addition(double * x, const double * vx,uint32_t size){
/*for (uint32_t i=0;i<size;++i){
x[i] = x[i] + vx[i];
}*/
__asm__ __volatile__ (
"1: \n\t"
"vmovupd -32(%0), %%ymm1\n\t"
"vmovupd (%0), %%ymm0\n\t"
"vaddpd -32(%1), %%ymm0, %%ymm0\n\t"
"vaddpd (%1), %%ymm1, %%ymm1\n\t"
"vmovupd %%ymm0, -32(%0)\n\t"
"vmovupd %%ymm1, (%0)\n\t"
"addq $128, %0\n\t"
"addq $128, %1\n\t"
"addl $-8, %2\n\t"
"jne 1b"
:
: "r" (x),"r"(vx),"r"(size)
: "ymm0", "ymm1"
);
}
I am practicing assembly (AVX instructions), so I wrote the inline assembly above to replace the C code in the original function (which is commented out). It compiles successfully, but when I run the program I get an error: Bus error: 10
Any thoughts on this bug? I don't know what's wrong here. The compiler version is clang 602.0.53. Thank you!
Inline assembly is a complicated beast. If you just want to practice AVX assembly, use a separate asm file where you don't have to put up with the compiler. In exchange, you will need to observe the calling convention.
You have some issues with the constraints. For example, you change all your input registers without telling the compiler, and that can cause all sorts of weird problems elsewhere in compiler-generated code. You also need to specify a memory clobber, because the asm writes to the array that x points to.
Also, learn to use a debugger so you can find the exact cause of problems and fix your own code.
Failing that, at least comment your code so we can figure out your intentions. In this case, I am particularly puzzled why you use a -32 offset, which addresses memory before the array. I think you wanted +32 there. Since you use two AVX registers at 32 bytes each, you need to advance the pointers by 64, not 128. Also, you have ymm0 and ymm1 swapped in the initial load.
This code seems to work fine for me:
#include <stdio.h>
#include <stdint.h>
static inline void addition(double * x, const double * vx,uint32_t size){ /* static: a plain C99 inline definition may fail to link when the call is not inlined */
/*for (uint32_t i=0;i<size;++i){
x[i] = x[i] + vx[i];
}*/
__asm__ __volatile__ (
"1: \n\t"
"vmovupd 32(%0), %%ymm0\n\t"
"vmovupd (%0), %%ymm1\n\t"
"vaddpd 32(%1), %%ymm0, %%ymm0\n\t"
"vaddpd (%1), %%ymm1, %%ymm1\n\t"
"vmovupd %%ymm0, 32(%0)\n\t"
"vmovupd %%ymm1, (%0)\n\t"
"addq $64, %0\n\t"
"addq $64, %1\n\t"
"addl $-8, %2\n\t"
"jne 1b"
: "+r" (x),"+r"(vx),"+r"(size)
:
: "ymm0", "ymm1", "memory"
);
}
int main()
{
double x[] = { 1, 2, 3, 4, 5, 6, 7, 8 };
double vx[] = { 9, 10, 11, 12, 13, 14, 15, 16 };
int i;
addition(x, vx, 8);
for(i = 0; i < 8; i++) printf("%g ", x[i]);
putchar('\n');
return 0;
}
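To make the pointer-stride point concrete, here is a rough sketch of the same loop using SSE2 instead of AVX. That is a deliberate substitution: SSE2 is part of the x86-64 baseline, so this variant runs without AVX support. The name addition_sse2 and the scalar fallback are assumptions for illustration. Each xmm register holds 16 bytes, so with two registers per array the pointers advance by 32 bytes and the counter drops by 4 doubles per iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical SSE2 variant of addition(): x[i] += vx[i].
   Assumes size is a nonzero multiple of 4. */
static void addition_sse2(double *x, const double *vx, uint32_t size)
{
    assert(x != NULL && vx != NULL && size != 0 && size % 4 == 0);
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ __volatile__(
        "1:                      \n\t"
        "movupd   (%0), %%xmm0   \n\t"  /* x[0..1] */
        "movupd 16(%0), %%xmm1   \n\t"  /* x[2..3] */
        "movupd   (%1), %%xmm2   \n\t"  /* vx[0..1] */
        "movupd 16(%1), %%xmm3   \n\t"  /* vx[2..3] */
        "addpd  %%xmm2, %%xmm0   \n\t"
        "addpd  %%xmm3, %%xmm1   \n\t"
        "movupd %%xmm0,   (%0)   \n\t"
        "movupd %%xmm1, 16(%0)   \n\t"
        "addq   $32, %0          \n\t"  /* 2 registers * 16 bytes each */
        "addq   $32, %1          \n\t"
        "subl   $4, %2           \n\t"  /* 4 doubles handled per iteration */
        "jne    1b               \n\t"
        : "+r"(x), "+r"(vx), "+r"(size)   /* all three are modified: read-write */
        :
        : "xmm0", "xmm1", "xmm2", "xmm3", "memory", "cc");
#else
    /* Portable fallback for targets other than x86-64. */
    for (uint32_t i = 0; i < size; ++i)
        x[i] += vx[i];
#endif
}
```

Like the AVX version, this assumes size is a nonzero multiple of the number of doubles handled per iteration (4 here). Code meant for arbitrary sizes would need a scalar tail loop.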

Double assignment using inline assembly [duplicate]

Following this manual, I wanted to create the simplest possible inline AVR assembly snippet: copy the values of two variables to two other variables.
uint8_t a, b, c, d;
a = 42;
b = 11;
asm(
"mov %0, %2\n\t"
"mov %1, %3\n\t"
: "=r" (c), "=r" (d)
: "r" (a), "r" (b)
);
I would expect it to be equivalent to:
uint8_t a, b, c, d;
a = 42;
b = 11;
c = a;
d = b;
However, after running it, both c and d are equal to 42. If I change the asm snippet to:
asm(
"mov %0, %3\n\t"
"mov %1, %2\n\t"
: "=r" (c), "=r" (d)
: "r" (a), "r" (b)
);
c is equal to 11 and d is equal to 42, as expected. Similarly, changing both source operands to %2 yields 42 twice, and setting both of them to %3 yields 11 twice.
Why does the first version not work as intended?
I would expect it to be equivalent to:
uint8_t a, b, c, d;
a = 42;
b = 11;
c = a;
d = b;
No, it's not [1]. The reason is that in the C code one assignment follows the other, whereas in inline asm the compiler treats the "code" as if it happens all at once. The compiler does not analyze the asm string template in any way; it's just a string on which it performs replacements of %-operands. In
asm ("mov %0, %3" "\n\t"
"mov %1, %2"
: "=r" (c), "=r" (d)
: "r" (a), "r" (b));
the lifetime of a and b ends at the asm, and the lifetime of c and d begins there. Therefore, it's totally fine for the compiler to use the same register for, say, c and a. This means the output of the 1st move overwrites the input of the 2nd move. This is the classic early-clobber situation, and you have to tell the compiler about it by means of the early-clobber modifier &:
asm ("mov %0, %3" "\n\t"
"mov %1, %2"
: "=&r" (c), "=r" (d)
: "r" (a), "r" (b));
However, the code that's generated is sub-optimal because it's actually fine if the compiler uses the same register for c and b, and the same register for d and a. This means you don't need any explicit asm code at all, and everything can be described by means of the constraints:
asm (""
: "=r" (c), "=r" (d)
: "1" (a), "0" (b));
[1] Apart from that, your asm code tries to implement c = b and d = a, not c = a and d = b.
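To experiment with both variants without AVR hardware, here is a small sketch. The function names are hypothetical. The first function uses x86 mov instructions in AT&T syntax (source first, destination second) as a stand-in for the AVR code. The second has an empty template and relies only on matching constraints, so it should build with any GNU-compatible compiler. Both compute c = b and d = a, like the asm in this answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Early-clobber variant: out_c is written before a is read, so "&" keeps
   out_c out of the registers used for the inputs. */
static void copy_swapped_earlyclobber(unsigned a, unsigned b,
                                      unsigned *c, unsigned *d)
{
    assert(c != NULL && d != NULL);
    unsigned out_c, out_d;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__("movl %3, %0\n\t"  /* out_c = b */
            "movl %2, %1"      /* out_d = a; a must still be intact here */
            : "=&r"(out_c), "=r"(out_d)
            : "r"(a), "r"(b));
#else
    /* Fallback for other targets: plain C with the same result. */
    out_c = b;
    out_d = a;
#endif
    *c = out_c;
    *d = out_d;
}

/* Matching-constraint variant: "1"(a) places a in the register of operand 1
   (out_d) and "0"(b) places b in the register of operand 0 (out_c), so no
   instruction is needed in the template. */
static void copy_swapped_tied(uint8_t a, uint8_t b, uint8_t *c, uint8_t *d)
{
    assert(c != NULL && d != NULL);
    uint8_t out_c, out_d;
#if defined(__GNUC__)
    __asm__("" : "=r"(out_c), "=r"(out_d) : "1"(a), "0"(b));
#else
    out_c = b;
    out_d = a;
#endif
    *c = out_c;
    *d = out_d;
}
```

Calling either function with a = 42 and b = 11 should give c = 11 and d = 42.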
