ARM NEON not giving accurate resuls - arm

I am trying to optimize a raytracer code on beagleboard and for that I am using the NEON coprocessor. There is a matrix multiplication function that is called multiple times which I have written in inline assembly. However, for some reason the results are not accurate. Here is my code:
void VecMatMult(float Vt[4], float M[4][4], float V[4])
{
__asm__ volatile(
"vldmia %1, {q1-q4} \n\t" // Load the Matrix in the quad registers
"vldmia %2, {q5} \n\t" //Load the Vector
"vmul.f32 q0, q1, d10[0] \n\t" //Calculate the matrix product
"vmla.f32 q0, q2, d10[1] \n\t"
"vmla.f32 q0, q3, d11[0] \n\t"
"vmla.f32 q0, q4, d11[1] \n\t"
"vstmia %0, {q0} \n\t" //Store the output
:
:"r" (Vt), "r" (M), "r" (V)
:"q0", "q1", "q2", "q3", "q4", "q5"
);
}
The funny thing is when I call this code in a separate program to test if it works, the results are perfect. However, when it's called in my main program several times, the results are not correct. Any help will be appreciated as I am literally clueless at the moment.

I don't know exactly how the inline assembly handles register preserving, but according to ATPCS, d8~d15 have to be preserved prior to usage, so it's not very wise to use them (if not absolutely necessary) to start with.
It will cause performance loss (if the inline assembly does a proper job), or it will do something 'unreasonable' (if the inline assembly fails)
Try to use q8~q13 instead. It would be a safe bet.

Related

ARM inline assembly, registers are read in an incorrect order

I am trying to read ARM registers via C using inline assembly but it doesn't follow the order of execution as it should.
volatile uint32_t var1 = 0;
volatile uint32_t var2 = 0;
volatile uint32_t var3 = 0;
volatile uint32_t var4 = 0;
__asm volatile(
"mov r0, #5\n\t"
"mov r1, #6\n\t"
"mov r2, #7\n\t"
"mov r3, #8\n\t"
"mov %0, r0\n\t"
"mov %0, r1\n\t"
"mov %0, r2\n\t"
"mov %0, r3\n\t"
: "=r" (var1),
"=r" (var2),
"=r" (var3),
"=r" (var4));
What happens is that the output is;
var1 = 8
var2 = 5
var3 = 6
var4 = 7
What I expect is;
var1 = 5
var2 = 6
var3 = 7
var4 = 8
It seems r0 is read last and it starts with r1. Why is that happening?
Note: Ignore the purpose of the code or how variables are defined, it's a copy/paste from a bigger application. It's just this specific register behavior in question.
GCC-style assembly inserts are designed to insert a single instruction per asm("..."), and they require you to give accurate information about all of the registers involved. In this case, you have not notified the compiler that registers r0, r1, r2, and r3 are used internally, so it probably thinks it's OK to reuse some of those for var1 through var4. You have also reused %0 as the destination of all four of the final mov instructions, so all four of them are actually writing to var1.
Also, it may or may not be your immediate problem, but volatile doesn't do what you think it does and probably isn't accomplishing anything useful here.
How to fix it? Well, first off, there needs to be a really strong reason why you can't just write
uint32_t var1 = 5;
uint32_t var2 = 6;
uint32_t var3 = 7;
uint32_t var4 = 8;
Assuming there is such a reason, then you should instead try writing one instruction per asm, and not using any scratch registers at all...
asm ("mov %0, #5" : "=r" (var1));
asm ("mov %0, #6" : "=r" (var2));
asm ("mov %0, #7" : "=r" (var3));
asm ("mov %0, #8" : "=r" (var4));
If you really absolutely have to do a whole bunch of work in a single asm then the first thing you should consider is putting it in a separate .S file and making it conform to the ABI so that you can call it like a normal function:
.text
.globl do_the_thing
.type do_the_thing, #function
_do_the_thing:
; all your actual code here
bx r14
.size do_the_thing, .-do_the_thing
and then in your C
rv = do_the_thing(arg1, arg2, ...);
If there's no way to make that work, then, and only then, should you sit down and read the entire "Extended Asm" chapter of the GCC manual and work out how to wedge the complete register-usage behavior of your insert into the constraints. If you need help with that, post a new question in which you show the real assembly language construct you need to insert, rather than a vague example, because every little detail matters.
The arguments passed into the inline assembly need to be incremented;
"mov %0, r0\n\t"
"mov %1, r1\n\t"
"mov %2, r2\n\t"
"mov %3, r3\n\t"

How to write a short block of inline gnu extended assembly to swap the values of two integer variables?

For entertainment, I am learning gnu extended assembly using AT&T syntax for x86 with a 32bit Linux target. I have just spent the last three hours coding two possible solutions to my challenge of swapping the values of two integer variables a and b, and neither of my solutions completely solved my problem. First, let's look at my TODO obstacle in some more detail:
int main()
{
int a = 2, b = 1;
printf("a is %d, b is %d\n", a, b);
// TODO: swap a and b using extended assembly, and do not modify the program in any other way
printf("a is %d, b is %d\n", a, b);
}
After reading this HOWTO, I wrote the following inline extended assembler code. Here is my first attempt at swapping the integers:
asm volatile("movl %0, %%eax;"
"movl %1, %%ecx;"
"movl %%ecx, %0;"
: "=r" (a)
: "r" (b)
: "%eax", "%ecx");
asm volatile("movl %%eax, %0;"
: "=r" (b)
: "r" (a)
: "%eax", "%ecx");
My reasoning was that to set a = b, I needed an extended assembly call that was separated from the assembly to set b = a. So I wrote the two extended assembly calls, compiled my code, i.e., gcc -m32 asmPractice.c, and ran a.out. The results were as follows:
a is 2, b is 1
a is 1, b is 1
Seeing how that did not work properly, I then decided to combine the two extended assembler calls, and wrote this:
asm volatile("movl %0, %%eax;"
"movl %1, %%ecx;"
"movl %%ecx, %0;"
"movl %%eax, %1;"
: "=r" (a)
: "r" (b));
After recompiling and linking, my code still does not correctly swap both values. See for yourself. Here are my results:
a is 2, b is 1
a is 1, b is 1
Here are some solutions from the comments:
Solution #0 (best option): https://gcc.gnu.org/wiki/DontUseInlineAsm
Even the zero-instruction solution defeats constant-propagation, and any other optimization that involves gcc knowing anything about the value. It also forces the compiler to have both variables in registers at the same time at that point. Always keep these downsides in mind when considering using inline-asm instead of builtins / intrinsics.
Solution #1: x86 xchg, no scratch regs, and works in both AT&T and Intel-syntax modes. Costs about the same as 3 mov instructions on most Intel CPUs, or only 2 uops on some AMD.
asm("xchg %0, %1;" : "+r" (a), "+r" (b));
Solution #2: purely using GNU C inline asm constraints. (Bonus: portable to all architectures)
asm("" : "=r" (a), "=r" (b) : "1" (a), "0" (b));
See all three solutions in action on the Godbolt compiler explorer, including examples of them defeating optimization:
int swap_constraints(int a, int b) {
asm("" : "=r" (a), "=r" (b) : "1" (a), "0" (b));
return a;
}
// Demonstrate the optimization-defeating behaviour:
int swap_constraints_constants(void) {
int a = 10, b = 20;
return swap_constraints(a, b) + 15;
}
swap_constraints_constants:
movl $10, %edx
movl $20, %eax
addl $15, %eax
ret
vs. with a pure C swap:
swap_noasm_constants:
movl $35, %eax # the add is done at compile-time, and `a` is optimized away as unused.
ret

Is it possible to access hardware register through inline assembly

I am trying to access hardware register on a Broadcom ARM processsor through inline assemble. I have accessed hardware regiters through bare metal programming, but now I am trying to incorporate those bare metal programming codes in the C file using asm. Here is my code which toggles GPIO 17 on a Raspberry Pi 2:
void main() {
__asm__(
".section .init\n\t"
".globl _start\n\t"
"_start:"
"ldr r0,=0x3F200000\n\t"
"mov r1, #1\n\t"
"lsl r1, #21\n\t"
"str r1, [r0, #4]\n\t"
"loop$:\n\t"
"mov r1, #1\n\t"
"lsl r1, #17\n\t"
"str r1, [r0, #28]\n\t"
"mov r1, #1\n\t"
"lsl r1, #17\n\t"
"str r1, [r0, #40]\n\t"
"b loop$\n\t"
);
}
but when I compile it by gcc file.c
it throws following error
/tmp/ccrfp9mv.s: Assembler messages:
/tmp/ccrfp9mv.s: Error: .size expression for main does not evaluate to a constant
You get Error: .size expression for main does not evaluate to a constant because you change sections inside a function. As you can see on the godbolt compiler explorer, the compiler will emits asm directives to calculate ELF metadata, with lines like:
.size main, .-main # size_of_main = current_pos - start_of_main
Since you switch sections inside the body of main, the distance between main and the end of main isn't known until link time, and it's not possible to get the linker to fill in this piece of metadata that late. (.size has to be an assemble-time constant, not just a link-time constant).
Like people commented, you should do the whole thing in C, e.g. with a global like
#include <stdint.h>
volatile uint32_t *const GPIO17 = (uint32_t*)0x3F200000; // address is const, contents aren't.
Presumably you need to ask the OS for access to that MMIO register. Part of the OS's job is to stop programs from talking directly to the hardware and messing up other programs that are doing the same thing at the same time.
Even if your code assembled, it won't link because your definition of _start will conflict with the one provided by the libc runtime code.
Don't try to define a function in inline asm inside another function. Write a stand-alone function if you want to do that.

ARM inline assembly multi-precision multiplication [duplicate]

I am trying to implement a function which multiplies 32-bit operand with 256-bit operand in ARM assembly on ARM Cortex-a8. The problem is I am running out of registers and I have no idea how I can reduce the number of used registers here. Here is my function:
typedef struct UN_256fe{
uint32_t uint32[8];
}UN_256fe;
typedef struct UN_288bite{
uint32_t uint32[9];
}UN_288bite;
void multiply32x256(uint32_t A, UN_256fe* B, UN_288bite* res){
asm (
"umull r3, r4, %9, %10;\n\t"
"mov %0, r3; \n\t"/*res->uint32[0] = r3*/
"umull r3, r5, %9, %11;\n\t"
"adds r6, r3, r4; \n\t"/*res->uint32[1] = r3 + r4*/
"mov %1, r6; \n\t"
"umull r3, r4, %9, %12;\n\t"
"adcs r6, r5, r3; \n\t"
"mov %2, r6; \n\t"/*res->uint32[2] = r6*/
"umull r3, r5, %9, %13;\n\t"
"adcs r6, r3, r4; \n\t"
"mov %3, r6; \n\t"/*res->uint32[3] = r6*/
"umull r3, r4, %9, %14;\n\t"
"adcs r6, r3, r5; \n\t"
"mov %4, r6; \n\t"/*res->uint32[4] = r6*/
"umull r3, r5, %9, %15;\n\t"
"adcs r6, r3, r4; \n\t"
"mov %5, r6; \n\t"/*res->uint32[5] = r6*/
"umull r3, r4, %9, %16;\n\t"
"adcs r6, r3, r5; \n\t"
"mov %6, r6; \n\t"/*res->uint32[6] = r6*/
"umull r3, r5, %9, %17;\n\t"
"adcs r6, r3, r4; \n\t"
"mov %7, r6; \n\t"/*res->uint32[7] = r6*/
"adc r6, r5, #0 ; \n\t"
"mov %8, r6; \n\t"/*res->uint32[8] = r6*/
: "=r"(res->uint32[8]), "=r"(res->uint32[7]), "=r"(res->uint32[6]), "=r"(res->uint32[5]), "=r"(res->uint32[4]),
"=r"(res->uint32[3]), "=r"(res->uint32[2]), "=r"(res->uint32[1]), "=r"(res->uint32[0])
: "r"(A), "r"(B->uint32[7]), "r"(B->uint32[6]), "r"(B->uint32[5]),
"r"(B->uint32[4]), "r"(B->uint32[3]), "r"(B->uint32[2]), "r"(B->uint32[1]), "r"(B->uint32[0]), "r"(temp)
: "r3", "r4", "r5", "r6", "cc", "memory");
}
EDIT-1: I updated my clobber list based on the first comment, but I still get the same error
A simple solution is to break this up and don't use 'clobber'. Declare the variables as 'tmp1', etc. Try not to use any mov statements; let the compiler do this if it has to. The compiler will use an algorithm to figure out the best 'flow' of information. If you use 'clobber', it can not reuse registers. They way it is now, you make it load all the memory first before the assembler executes. This is bad as you want memory/CPU ALU to pipeline.
void multiply32x256(uint32_t A, UN_256fe* B, UN_288bite* res)
{
uint32_t mulhi1, mullo1;
uint32_t mulhi2, mullo2;
uint32_t tmp;
asm("umull %0, %1, %2, %3;\n\t"
: "=r" (mullo1), "=r" (mulhi1)
: "r"(A), "r"(B->uint32[7])
);
res->uint32[8] = mullo1; /* was 'mov %0, r3; */
volatile asm("umull %0, %1, %3, %4;\n\t"
"adds %2, %5, %6; \n\t"/*res->uint32[1] = r3 + r4*/
: "=r" (mullo2), "=r" (mulhi2), "=r" (tmp)
: "r"(A), "r"(B->uint32[6]), "r" (mullo1), "r"(mulhi1)
: "cc"
);
res->uint32[7] = tmp; /* was 'mov %1, r6; */
/* ... etc */
}
The whole purpose of the 'gcc inline assembler' is not to code assembler directly in a 'C' file. It is to use the register allocation logic of the compiler AND do something that can not be easily done in 'C'. The use of carry logic in your case.
By not making it one huge 'asm' clause, the compiler can schedule the loads from memory as it needs new registers. It will also pipeline your 'UMULL' ALU activity with the load/store unit.
You should only use clobber if an instruction implicitly clobbers a specific register. You may also use something like,
register int *p1 asm ("r0");
and use that as an output. However, I don't know of any ARM instructions like this besides those that might alter the stack and your code doesn't use these and the carry of course.
GCC knows that memory changes if it is listed as an input/output, so you don't need a memory clobber. In fact it is detrimental as the memory clobber is a compiler memory barrier and this will cause memory to be written when the compiler might be able to schedule that for latter.
The moral is use gcc inline assembler to work with the compiler. If you code in assembler and you have huge routines, the register use can become complex and confusing. Typical assembler coders will keep only one thing in a register per routine, but that is not always the best use of registers. The compiler will shuffle the data around in a fairly smart way that is difficult to beat (and not very satisfying to hand code IMO) when the code size gets larger.
You might want to look at the GMP library which has lots of ways to efficiently tackle some of the same issues it looks like your code has.

SSE2 instruction in C code

I am trying to reverse engineer a c code, but this part of assembly I cant really understand. I know it is part of the SSE extension. However, somethings are really different than what I am used to in x86 instructions.
static int sad16_sse2(void *v, uint8_t *blk2, uint8_t *blk1, int stride, int h)
{
int ret;
__asm__ volatile(
"pxor %%xmm6, %%xmm6 \n\t"
ASMALIGN(4)
"1: \n\t"
"movdqu (%1), %%xmm0 \n\t"
"movdqu (%1, %3), %%xmm1 \n\t"
"psadbw (%2), %%xmm0 \n\t"
"psadbw (%2, %3), %%xmm1 \n\t"
"paddw %%xmm0, %%xmm6 \n\t"
"paddw %%xmm1, %%xmm6 \n\t"
"lea (%1,%3,2), %1 \n\t"
"lea (%2,%3,2), %2 \n\t"
"sub $2, %0 \n\t"
" jg 1b \n\t"
: "+r" (h), "+r" (blk1), "+r" (blk2)
: "r" ((x86_reg)stride)
);
__asm__ volatile(
"movhlps %%xmm6, %%xmm0 \n\t"
"paddw %%xmm0, %%xmm6 \n\t"
"movd %%xmm6, %0 \n\t"
: "=r"(ret)
);
return ret;
}
What are the %1, %2, and %3? what does (%1,%2,%3) mean? Also what does "+r", "-r", "=r" mean?
You'll want to have a look at this GCC Inline Asssembly HOWTO.
The percent sign numbers are the instruction operands.
The inline assembler works similar to a macro preprocessor. Operands with exactly one leading percent are replaced by the the input parameters in the order as they appear in the parameter list, in this case:
%0 h output, register, r/w
%1 blk1 output, register, r/w
%2 blk2 output, register, r/w
%3 (x86_reg)stride input, register, read only
The parameters are normal C expressions. They can be further specified by "constraints", in this case "r" means the value should be in a register, opposed to "m" which is a memory operand. The constraint modifier "=r" makes this a write-only operand, "+r" is a read-write operand and "r" and normal read operand.
After the first colon the output operands appear, after the second the input operands and after the optional third the clobbered registers.
So the instruction sequence calculates the sum of the absolute differences in each byte of blk1 and blk2. This happens in 16 byte blocks, so if stride is 16, the blocks are contiguous, otherwise there are holes. Each instruction appears twice because some minimal loop unrolling is done, the h parameter is the number of 32 byte blocks to process. The second asm block seems to be useless, as the psadbw instruction sums up only in the low 16 bit of the destination register. (Did you omit some code?)

Resources