ARM VLD neon structured load - arm

I am trying to load from a float array into the d registers of the NEON unit, so that I can later use the q registers for SIMD and work on 4 float values at the same time.
Here is what I am doing:
r7 holds the reference to the array
VLD2.32 {d0,d2}, [r7]
add r7, r7, #4
VLD2.32 {d1,d3}, [r7]
add r7, r7, #4
But when I debug it, nothing gets loaded and the registers are still 0.
The array is structured like this:
[A0,
B0,
A1,
B1,
A2,
B2,
.
.
.]
I want them to be in the registers as follows:
q0 = A0, A1, A2, A3
q1 = B0, B1, B2, B3
Can't seem to get it right, some new input might help.
Thank you for your help
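For what it's worth, as far as I understand the ARM documentation, a single VLD2.32 with a four-register list (VLD2.32 {d0-d3}, [r7]) reads eight consecutive 32-bit elements and de-interleaves them, so the even-indexed elements end up in d0-d1 (q0) and the odd-indexed ones in d2-d3 (q1). The sketch below is only a portable C model of that layout, not NEON code; the helper name deinterleave2x4 is made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: portable model of what VLD2.32 {d0-d3}, [src]
   is expected to do. src holds 8 interleaved floats A0,B0,A1,B1,...;
   q0 receives the A elements, q1 receives the B elements. */
static void deinterleave2x4(const float *src, float q0[4], float q1[4])
{
    for (size_t i = 0; i < 4; ++i) {
        q0[i] = src[2 * i];      /* even positions: A0..A3 */
        q1[i] = src[2 * i + 1];  /* odd positions:  B0..B3 */
    }
}
```

Calling it on the array from the question should give q0 = A0, A1, A2, A3 and q1 = B0, B1, B2, B3, which is the layout asked for.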

Related

STM32 Cortex-M4F FPU hardfaults on basic VLDR

Yes, my specific MCU has an FPU.
The code is compiled with the -mfloat-abi=soft flag; otherwise the float variable never gets passed in R0.
The FPU is enabled via SCB->CPACR |= ((3UL << (10 * 2)) | (3UL << (11 * 2)));
The assembly function:
sqrt_func:
VLDR.32 S0, [R0] <-- hardfault
VSQRT.F32 S0, S0
VSTR.32 S0, [R0]
BX LR
C code calling said function:
extern float sqrt_func(float s);
float x = sqrt_func(1000000.0f);
But after stepping through, the MCU hard faults at VLDR.32 S0, [R0] with the CFSR showing
CFSR
->BFARVALID
->PRECISERR
I can see that the float is being passed correctly, because R0 holds its hex value at the moment of the hard fault:
R0
->0x49742400
S0 never gets loaded with anything.
I can't figure out why this is hard faulting, anyone have any ideas? I am trying to manually calculate the square root using the FPU.
Also what's weird is d13-d15 and s0-s31 registers are showing "0xx-2" but that's probably a quirk of the debugger not being able to pull the registers once it hardfaults.
OK, I'm just a dumbo: I thought VLDR and VSTR operated differently for some reason, but they behave like LDR and STR and expect an address in the base register. The value of the float was being passed in R0, but VLDR was trying to load the value at that address (0x49742400, which was my float value in hex), and that's either an invalid address or some sort of memory violation.
Instead, you have to use VMOV.32 to copy the register contents over:
sqrt_func:
VMOV.32 S0, R0
VSQRT.F32 S0, S0
VMOV.32 R0, S0
BX LR
And now it works.
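To double-check that 0x49742400 really is the bit pattern of the float itself (and therefore not an address), here is a minimal C sketch that can be run on any host with IEEE 754 single-precision floats. The helper name float_bits is an assumption of mine, not part of the original code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the raw 32-bit pattern of a float,
   i.e. what ends up in R0 when the value is passed by the soft-float ABI. */
static uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* well-defined way to reinterpret the bytes */
    return bits;
}
```

float_bits(1000000.0f) evaluates to 0x49742400, the same value that was sitting in R0 when VLDR tried to use it as an address.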

Conversion from uint64_t to double

For an STM32F7, which has double-precision floating-point instructions, I want to convert a uint64_t to double.
In order to test that, I used the following code:
volatile static uint64_t m_testU64 = 45uLL * 0xFFFFFFFFuLL;
volatile static double m_testD;
#ifndef DO_NOT_USE_UL2D
m_testD = (double)m_testU64;
#else
double t = (double)(uint32_t)(m_testU64 >> 32u);
t *= 4294967296.0;
t += (double)(uint32_t)(m_testU64 & 0xFFFFFFFFu);
m_testD = t;
#endif
By default (if DO_NOT_USE_UL2D is not defined), the compiler (gcc or clang) calls the function __aeabi_ul2d(), which is fairly complex in terms of the number of executed instructions. See the assembly code here: https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/arm/ieee754-df.S#L537
For my particular example, it takes 20 instructions without entering most of the branches.
If DO_NOT_USE_UL2D is defined, the compiler generates the following assembly code:
movw r0, #1728 ; 0x6c0
vldr d2, [pc, #112] ; 0x303fa0
movt r0, #8192 ; 0x2000
vldr s0, [r0, #4]
ldr r1, [r0, #0]
vcvt.f64.u32 d0, s0
vldr s2, [r0]
vcvt.f64.u32 d1, s2
ldr r1, [r0, #4]
vfma.f64 d1, d0, d2
vstr d1, [r0, #8]
The code is simpler, and it is only 11 instructions.
So here are the questions (if DO_NOT_USE_UL2D is defined):
Is my code (in C) correct?
Is my code slower than the __aeabi_ul2d() function (not really important, but a bit curious)?
I have to do that, since I am not allowed to use functions from libgcc (there are very good reasons for that...).
Be aware that the main purpose of this question is not performance; I am really curious about the implementation in libgcc, and I really want to know if there is something wrong in my code.
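One way to reason about the split conversion: the high word converts to double exactly, multiplying by 2^32 is an exact scaling, and the low word also converts exactly, so only the final addition can round. Under the usual round-to-nearest mode, and assuming double arithmetic is really evaluated in double precision, that single rounding should give the same result as a direct cast. The sketch below wraps the code from the question in a function so it can be compared against the cast on a host machine; the name u64_to_double_split is hypothetical.

```c
#include <stdint.h>

/* Hypothetical wrapper around the split conversion from the question. */
static double u64_to_double_split(uint64_t v)
{
    double t = (double)(uint32_t)(v >> 32);     /* exact: at most 32 significant bits */
    t *= 4294967296.0;                          /* exact: scaling by 2^32 */
    t += (double)(uint32_t)(v & 0xFFFFFFFFu);   /* the only step that can round */
    return t;
}
```

Comparing u64_to_double_split(v) with (double)v for a few edge values (0, 45 * 0xFFFFFFFF, 2^53 + 1, UINT64_MAX) is a quick sanity check, though it is of course not a proof.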

Accessing certain elements of an array in arm assembler

I have a problem which is bugging me for multiple days now...
I call a function from C which is implemented in ARM assembler on a Raspberry Pi using the NEON unit. The signature looks like the following:
void doStuff(const uint32_t key[4])
I can load all the values into d-registers using VLD4.32 {d6-d9}, [r0].
The problem is that I have to use a value at a certain index of the array which is calculated at runtime. So I have to access the array at an index which I only know at runtime.
In C, the code I want to achieve would look like this:
// calculations
int i = ... // 'i' is the index of value in the array
int result = key[i];
In assembler I tried this:
VMOV r8, s22 ;# copy the calculated index into an ARM register
MOV r8, r8, LSL #0x2;# multiply by 4
ADD r8, r5, r8 ;# add offset to base address
VLDR.32 d14, [r8] ;# load from address into d-register
I also tried multiplying with 2 and 32 instead of 4. But I always get the value 3.
I got it working with this stupid and very slow solution:
;# <--- very slow and ugly --->
VLD4.32 {d6-d9}, [r1] ;# load 4x32bit from address *r1
VMOV r6, s22 ;# r6 now contains the offset which is either 0,1,2 or 3
CMP r6, #0x0 ;# 3 - 0 == 0 -> Z set
BEQ equal0
CMP r6, #0x1
BEQ equal1
CMP r6, #0x2
BEQ equal2
VMOV d12, d9 ;# has to be 3
B continue
equal0:
VMOV d12, d6
B continue
equal1:
VMOV d12, d7
B continue
equal2:
VMOV d12, d8
B continue
continue:
;# <--- --->
I basically have an if for every possible number and then select the corresponding register.
Thanks!
Edit:
Okay, it works with VLD1.32 d14, [r8]. I do not quite understand why it won't work with VLDR.32, though.
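For reference, the address computation in the first attempt (index shifted left by 2, then added to the base) is what plain C indexing of 32-bit elements does. Here is a minimal sketch of that arithmetic, assuming the base register really holds the address of key and the index is an integer in the range 0..3. The name load_word_at is made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the C equivalent of
   "offset = index << 2; address = base + offset; load 32 bits". */
static uint32_t load_word_at(const uint32_t *base, uint32_t index)
{
    const unsigned char *addr = (const unsigned char *)base + (index << 2);
    uint32_t value;
    memcpy(&value, addr, sizeof value);  /* 32-bit load from the computed address */
    return value;
}
```

Note that this only matches key[i] if the index is an integer bit pattern. VMOV from an s register to an ARM register copies the raw bits without any float-to-integer conversion.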

STRB works, unless target address gets shifted

I'm trying to use ARM assembly to insert one string into another, however my code would always return an empty string to the C program calling the assembly program. I believe I have narrowed down my issue to the STRB instruction. Below is my code with most irrelevant code removed. The important part to look at is in the "test" block.
.global ins
ins:
stmfd sp!, {v1-v6, lr}
mov v1, a1 # save pointer to 1st string
mov v2, a2 # save pointer to 2nd string
bl strlen # find out length
mov v3, a1 # save string1 length
mov a1, v2 # recover pointer to string 2
bl strlen # length of string 2
add a1, a1, v3 # total length
add a1, a1, #1 # add one for null byte
bl malloc
add a3, a3, #1
test:
ldrb v3, [v1], #1
strb v3, [a1], #1
ldmfd sp!, {v1-v6, pc}
exit:
.end
v1 and v2 hold strings 1 and 2. When I have the test block written as:
test:
ldrb v3, [v1], #1
strb v3, [a1], #1
ldmfd sp!, {v1-v6, pc}
then the program returns an empty string. However, if I have it written as:
test:
ldrb v3, [v1], #1
strb v3, [a1]
ldmfd sp!, {v1-v6, pc}
it successfully returns the first character in string 1. Obviously, this is not sufficient to build a new string, as I'm not performing an offset on a1.
Does anyone know what is causing the string to be returned as empty? I honestly have no idea what the issue may be after several hours of experimenting and researching.
Any help is greatly appreciated!
The value in a1 is returned to the C function calling your assembler routine. You need to return the address of the start of the string, but if you increment a1 while writing the string you will return the address of the end of the string instead.
If you use another register for storing the current address that you are writing to, then the start address will still be in a1 when you return, e.g.:
test:
mov v4, a1 # copy address of new string to v4
ldrb v3, [v1], #1
strb v3, [v4], #1 # increment v4, the start of string
# will still be in a1
ldmfd sp!, {v1-v6, pc}
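The same idea expressed in C, as a rough sketch: keep the pointer returned by malloc untouched and advance a separate cursor while writing. The function name copy_string is hypothetical, and it copies a whole string rather than one character, just to make the effect visible.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical C analogue of the fix: 'start' plays the role of a1
   (the return value), 'cursor' plays the role of v4. */
static char *copy_string(const char *src)
{
    char *start = malloc(strlen(src) + 1);  /* +1 for the null byte */
    if (start == NULL)
        return NULL;

    char *cursor = start;                   /* advance this, not 'start' */
    while ((*cursor++ = *src++) != '\0')
        ;                                   /* copy bytes including the terminator */

    return start;                           /* still points at the first character */
}
```

Returning cursor instead of start would hand the caller a pointer past the copied data, which is the situation described in the question.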

Largest Integer in ARM assembly

This is a homework question, but I'm stuck.
The assignment is to find the largest integer in an array. Here's the C code we're given:
#include <stdio.h>
#include <stdlib.h>
extern int mybig( int array[] ) ;
void main( char * argv[], int argc )
{
int array[] = { 5, 15, 100, 25, 50, -1 } ;
int biggest ;
biggest = mybig( array ) ;
printf( "Biggest integer in array: %d\n", biggest ) ;
}
I've made about a dozen versions of the assembly so far, but this is the closest I've gotten
.global mybig
mybig: stmfd sp!, {v1-v6, lr}
mvn v1, #0
loop: ldrb a4, [a1], #4
MOVLT a4, a1
cmp a1, v1
bne loop
ldmfd sp!, {v1-v6, pc}
.end
Every time I link it together, I hit an infinite loop, and I'm not sure why. Any help would be majorly appreciated. The professor didn't teach us anything in an introductory course; he just told us to do it and gave us a link to a toolchain to compile and assemble.
EDIT: This is where I've gotten to. Program doesn't run, just hits an infinite loop.
.global mybig
mybig: stmfd sp!, {v1-v6, lr}
mvn v1, #0
mov a3, a1
loop: ldr a4, [a1], #4
cmp a4, a1
MOVMI a3, a1
cmp a1, v1
bne loop
mov a1, a4
ldmfd sp!, {v1-v6, pc}
.end
C code hasn't changed
That would be my solution:
.global mybig
mybig:
// a1 = Highest word, defaults to 0x80000000 = −2,147,483,648
// a2 = Pointer to array
// a3 = current word
mov a2, a1
mov a1, #0x80000000
.Lloop:
ldr a3, [a2], #4 // Load word and advance pointer by 4 bytes
cmn a3, #1 // Compare with -1
bxeq lr // Return if endmarker was found
cmp a1, a3 // Compare current highest word and new word
movlt a1, a3 // Replace highest word if it was smaller
b .Lloop // Loop again
.end
While this is not the best possible code in terms of performance, it should be self-explanatory.
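For comparison, here is roughly the same loop written in C, as a sketch that mirrors the assembly above step by step. It assumes a 32-bit int (so INT_MIN corresponds to 0x80000000) and an array terminated by -1, as in the assignment. The name mybig_c is made up so it does not clash with the assembly symbol.

```c
#include <limits.h>

/* Hypothetical C counterpart of the assembly routine above. */
static int mybig_c(const int *array)
{
    int highest = INT_MIN;            /* mov a1, #0x80000000 */
    for (;;) {
        int current = *array++;       /* ldr a3, [a2], #4 */
        if (current == -1)            /* cmn a3, #1 / bxeq lr */
            return highest;
        if (highest < current)        /* cmp a1, a3 / movlt a1, a3 */
            highest = current;
    }
}
```

The comparison is always made on the loaded value, never on the pointer, and the end marker is checked before the value is considered as a candidate.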

Resources