Conversion short to int and sum with NEON - c

I want to convert the next function to NEON:
int dot4_c(unsigned char v0[4], unsigned char v1[4]){
int r=0;
r = v0[0]*v1[0];
r += v0[1]*v1[1];
r += v0[2]*v1[2];
r += v0[3]*v1[3];
return r;
}
I think I almost do it, but there is an error because it is not working well
int dot4_neon_hfp(unsigned char v0[4], unsigned char v1[4])
{
asm volatile (
"vld1.16 {d2, d3}, [%0] \n\t" //d2={x0,y0}, d3={z0, w0}
"vld1.16 {d4, d5}, [%1] \n\t" //d4={x1,y1}, d5={z1, w1}
"vcvt.32.u16 d2, d2 \n\t" //conversion
"vcvt.32.u16 d3, d3 \n\t"
"vcvt.32.u16 d4, d4 \n\t"
"vcvt.32.u16 d5, d5 \n\t"
"vmul.32 d0, d2, d4 \n\t" //d0= d2*d4
"vmla.32 d0, d3, d5 \n\t" //d0 = d0 + d3*d5
"vpadd.32 d0, d0 \n\t" //d0 = d[0] + d[1]
:: "r"(v0), "r"(v1) :
);
}
How can I get this working?

As mentioned, you must load at least 8 bytes at a time with NEON. As long as the load doesn't go past the end of your buffer, you can ignore the extra bytes. Here is how to do it with intrinsics:
uint8x8_t v0_vec, v1_vec;
uint16x8_t vproduct;
uint32x2_t vsum32;
v0_vec = vld1_u8(v0); // extra bytes will be ignored as long as you can safely read them
v1_vec = vld1_u8(v1);
// you didn't specify if the product of your vector fits in 8-bits, so I assume it needs to be widened to 16-bits
vproduct = vmull_u8(v0_vec, v1_vec);
vsum32 = vpaddl_u16(vget_low_u16(vproduct)); // pairwise add lower half (first 4 u16's)
return vsum32.val[0] + vsum32.val[1];
If you absolutely can't load 8 bytes from your source pointers, you can manually load a 32-bit value into a NEON register (the 4 bytes) and then cast it to the proper intrinsic type.

Related

Mov value on floating point register to general register

So I am using the instruction "fmov x0,d3" to move the value from d3 to x0, but for some reason the value on x0 remains unchanged.
Can someone, please, tell me how to properly move a value from a floating point register to a general register?
This is my code:
mov x0,#6 //Moves the decimal 0 value to the x0 register, which is the return register
Fadd_test_64:
fmov d1, #7.0 //Moves floating point value to the d1 register
fmov d2, #2.0
fmov d4, #14.0 //Moves the expected result to d4
fadd d3,d1,d2 //d3=d1+d2
fcmp d3, d4 //Compares w3 to w4
b.ne Fadd_error_64 //if they are different, go to Fadd_error_64
b Fadd_end_64 //if not, go to Fadd_end_64
Fadd_error_64:
fmov x0,d3 //Moves the value on the floating point register to the return register
Fadd_end_64:
ret
//Functions declaration
void mathTest_64bits(void);
UINT16 FAddition_64();
//Main function
int main(void)
{
mathTest_64bits();
return 0;
}
//Functino in C to run the assembly code
void mathTest_64bits(void)
{
int ret;
ret = FAddition_64();
if(ret!=0)
{
Print(L"64-bits Floating Point Addition test: Failed: %d",ret);
}
else
{
Print(L"64-bits Floating Point Addition test: Success ");
}
Print(L"\n");
}

writing a function in ARM assembly language which inserts a string into another string at a specific location

I was going through one of my class's textbooks and I stumbled upon this problem:
Write a function in ARM assembly language which will insert a string into another string at a specific location. The function is:
char * csinsert( char * s1, char * s2, int loc ) ;
The function has a pointer to s1 in a1, a pointer to s2 in a2, and an integer in a3 as to where the insertion takes place. The function returns a pointer in a1 to the new string.
You can use the library functions strlen and malloc.
strlen has as input the pointer to the string in a1 and returns the length in a1.
malloc will allocate space for the new string where a1 on input is the size in bytes of the space requested and on output a1 is a pointer to the requested space.
Remember the registers a1-a4 do not retain their values across function calls.
This is the C language driver for the string insert I created:
#include <stdio.h>
extern char * csinsert( char * s1, char * s2, int loc ) ;
int main( int argc, char * argv[] )
{
char * s1 = "String 1 are combined" ;
char * s2 = " and string 2 " ;
int loc = 8 ;
char * result ;
result = csinsert( s1, s2, loc ) ;
printf( "Result: %s\n", result ) ;
}
My assembly language code so far is:
.global csinsert
csinsert:
stmfd sp!, {v1-v6, lr}
mov v1, a1
bl strlen
add a1, a1, #1
mov v2, a1
add a2, a2
mov v3, a2
add a3, a3
bl malloc
mov v3, #0
loop:
ldrb v4, [v1], #1
subs v2, v2, #1
add v4, v4, a2
strb v4, [a1], #1
bne loop
ldmfd sp!, {v1-v6, pc} #std
.end
I don't think my code works properly. When I link the two finals, there is no result given back. Why does my code not insert the string properly? I believe the issue is in the assembly program, is it not returning anything?
Can anyone explain what my mistake is? I'm not sure how to use the library functions the question hints to.
Thanks!
Caveat: I've been doing asm for 40+, I've looked at arm a bit, but not used it. However, I pulled the arm ABI document.
As the problem stated, a1-a4 are not preserved across a call, which matches the ABI. You saved your a1, but you did not save your a2 or a3.
strlen [or any other function] is permitted to use a1-a4 as scratch regs. So, for efficiency, my guess is that strlen [or malloc] is using a2-a4 as scratch and [from your perspective] corrupting some of the register values.
By the time you get to loop:, a2 is probably a bogus journey :-)
UPDATE
I started to clean up your asm. Style is 10x more important in asm than C. Every asm line should have a sidebar comment. And add a blank line here or there. Because you didn't post your updated code, I had to guess at the changes and after a bit, I realized you only had about 25% or so. Plus, I started to mess things up.
I split the problem into three parts:
- Code in C
- Take C code and generate arm pseudo code in C
- Code in asm
If you take a look at the C code and pseudo code, you'll notice that any misuse of instructions aside, your logic was wrong (e.g. you needed two strlen calls before the malloc)
So, here is your assembler cleaned for style [not much new code]. Notice that I may have broken some of your existing logic, but my version may be easier on the eyes. I used tabs to separate things and got everything to line up. That can help. Also, the comments show intent or note limitations of instructions or architecture.
.global csinsert
csinsert:
stmfd sp!,{v1-v6,lr} // preserve caller registers
// preserve our arguments across calls
mov v1,a1
mov v2,a2
mov v3,a3
// get length of destination string
mov a1,v1 // set dest addr as strlen arg
bl strlen // call strlen
add a1,a1,#1 // increment length
mov v4,a1 // save it
add v3,v3 // src = src + src (what???)
mov v5,v2 // save it
add v3,v3 // double the offset (what???)
bl malloc // get heap memory
mov v4,#0 // set index for loop
loop:
ldrb v7,[v1],#1
subs v2,v2,#1
add v7,v7,a2
strb v7,[a1],#1
bne loop
ldmfd sp!,{v1-v6,pc} #std // restore caller registers
.end
At first, you should prototype in real C:
// csinsert_real -- real C code
char *
csinsert_real(char *s1,char *s2,int loc)
{
int s1len;
int s2len;
char *bp;
int chr;
char *bf;
s1len = strlen(s1);
s2len = strlen(s2);
bf = malloc(s1len + s2len + 1);
bp = bf;
// copy over s1 up to but not including the "insertion" point
for (; loc > 0; --loc, ++s1, ++bp) {
chr = *s1;
if (chr == 0)
break;
*bp = chr;
}
// "insert" the s2 string
for (chr = *s2++; chr != 0; chr = *s2++, ++bp)
*bp = chr;
// copy the remainder of s1 [if any]
for (chr = *s1++; chr != 0; chr = *s1++, ++bp)
*bp = chr;
*bp = 0;
return bf;
}
Then, you can [until you're comfortable with arm], prototype in C "pseudocode":
// csinsert_pseudo -- pseudo arm code
char *
csinsert_pseudo()
{
// save caller registers
v1 = a1;
v2 = a2;
v3 = a3;
a1 = v1;
strlen();
v4 = a1;
a1 = v2;
strlen();
a1 = a1 + v4 + 1;
malloc();
v5 = a1;
// NOTE: load/store may only use r0-r7
// and a1 is r0
#if 0
r0 = a1;
#endif
r1 = v1;
r2 = v2;
// copy over s1 up to but not including the "insertion" point
loop1:
if (v3 == 0) goto eloop1;
r3 = *r1;
if (r3 == 0) goto eloop1;
*r0 = r3;
++r0;
++r1;
--v3;
goto loop1;
eloop1:
// "insert" the s2 string
loop2:
r3 = *r2;
if (r3 == 0) goto eloop2;
*r0 = r3;
++r0;
++r2;
goto loop2;
eloop2:
// copy the remainder of s1 [if any]
loop3:
r3 = *r1;
if (r3 == 0) goto eloop3;
*r0 = r3;
++r0;
++r1;
goto loop3;
eloop3:
*r0 = 0;
a1 = v5;
// restore caller registers
}

Last stretch of rounding function in ASM

What I essentially have to do is make what is in Main work.
I'm on my last stretch of this assignment (which will likely take just as long as it did for me to get here) I'm having trouble figuring out how to pass the roundingMode that is passed to roundD and using it in ASM.
Also, there is a block of just comments, as far as I can tell, that's all I have left to do. does that sound right?
#include <stdio.h>
#include <stdlib.h>
#define PRECISION 3
#define RND_CTL_BIT_SHIFT 10
// floating point rounding modes: IA-32 Manual, Vol. 1, p. 4-20
typedef enum {
ROUND_NEAREST_EVEN = 0 << RND_CTL_BIT_SHIFT,
ROUND_MINUS_INF = 1 << RND_CTL_BIT_SHIFT,
ROUND_PLUS_INF = 2 << RND_CTL_BIT_SHIFT,
ROUND_TOWARD_ZERO = 3 << RND_CTL_BIT_SHIFT
} RoundingMode;
double roundD(double n, RoundingMode roundingMode)
{
// do not change anything above this comment
int oldCW = 0x0000;
int newCW = 0xF3FF;
int mask = 0x0300;
int tempVar = 0x0000;
asm(" push %eax \n"
" push %ebx \n"
" fstcw %[oldCWOut] \n" //store FPU CW into OldCW
" mov %%eax, %[oldCWOut] \n" //store old FPU CW into tempVar
" mov %[tempVarIn], %%eax \n"
" add %%eax, %[maskIn] \n" //isolate rounding bits
" add %%eax, %[roundModeOut] \n" //adding rounding modifier
//shift in old bits to tempFPU
//do rounding calculation
//store result into n
" fldcw %[oldCWIn] \n" //restoring the FPU CW to normal
" pop %ebx \n"
" pop %eax \n"
: [oldCWOut] "=m" (oldCW),
[newCWOut] "=m" (newCW),
[maskOut] "=m" (mask),
[tempVarOut] "=m" (tempVar),
[roundModeOut] "=m" (roundMode)
: [oldCWIn] "m" (oldCW),
[newCWIn] "m" (newCW),
[maskIn] "m" (mask),
[tempVarIn] "m" (tempVar),
[roundModeIn] "m" (roundMode)
:"eax", "ebx"
);
return n;
// do not change anything below this comment, except for printing out your name
}
int main(int argc, char **argv)
{
double n = 0.0;
if (argc > 1)
n = atof(argv[1]);
printf("roundD even %.*f = %.*f\n",
PRECISION, n, PRECISION, roundD(n, ROUND_NEAREST_EVEN));
printf("roundD down %.*f = %.*f\n",
PRECISION, n, PRECISION, roundD(n, ROUND_MINUS_INF));
printf("roundD up %.*f = %.*f\n",
PRECISION, n, PRECISION, roundD(n, ROUND_PLUS_INF));
printf("roundD zero %.*f = %.*f\n",
PRECISION, n, PRECISION, roundD(n, ROUND_TOWARD_ZERO));
return 0;
}
While C might like to pretend that enum is not just an integer, it is just an integer. If you can't use roundingMode directly in the assembly, create an integer local variable and set it equal to the roundingMode parameter.
I'm just offering this as a suggestion to you. I've never used inline assembly before and I've never used x86 assembly before, but if all you need to do is reference the parameter, what I said above should work.

ARM NEON count compare result

I need to make some parallel compare under uint16x8_t vectors, and increment some local variable (counter) according to it, for example +8 increment, if all elements of vector compared as true. I implement this algorithm:
...
register int objects = 0;
uint16x8_t vcmp0,vobj;
uint32x2_t dobj;
register uint32_t temp0;
...
vobj = vreinterpretq_u16_u8(vcntq_u8(vreinterpretq_u8_u16(vcmp0)));
vobj = vpaddlq_u8(vreinterpretq_u8_u16(vobj));
vobj = vreinterpretq_u16_u32(vpaddlq_u16(vobj));
vobj = vreinterpretq_u16_u64(vpaddlq_u32(vreinterpretq_u32_u16(vobj)));
dobj = vmovn_u64(vreinterpretq_u64_u16(vobj));
dobj = vreinterpret_u32_u64(vpaddl_u32(dobj));
__asm__ __volatile__
(
"vmov.u32 %[temp0] , %[dobj][0] \n\t"
"add %[objects] ,%[objects], %[temp0], asr #4 \n\t"
: [dobj]"+w"(dobj), [temp0]"=r"(temp0), [objects]"+r"(objects)
:
: "memory"
);
...
Vector vcmp0 contains results of compare, vobj, dobj used for computation, objects is counter. I am using count of set bits and pairwise add for computation. Is there any faster way to do this work?

Efficient Neon Implementation Of Clipping

Within a loop i have to implement a sort of clipping
if ( isLast )
{
val = ( val < 0 ) ? 0 : val;
val = ( val > 255 ) ? 255 : val;
}
However this "clipping" takes up almost half the time of execution of the loop in Neon .
This is what the whole loop looks like-
for (row = 0; row < height; row++)
{
for (col = 0; col < width; col++)
{
Int sum;
//...Calculate the sum
Short val = ( sum + offset ) >> shift;
if ( isLast )
{
val = ( val < 0 ) ? 0 : val;
val = ( val > 255 ) ? 255 : val;
}
dst[col] = val;
}
}
This is how the clipping has been implemented in Neon
cmp %10,#1 //if(isLast)
bne 3f
vmov.i32 %4, d4[0] //put val in %4
cmp %4,#0 //if( val < 0 )
blt 4f
b 5f
4:
mov %4,#0
vmov.i32 d4[0],%4
5:
cmp %4,%11 //if( val > maxVal )
bgt 6f
b 3f
6:
mov %4,%11
vmov.i32 d4[0],%4
3:
This is the mapping of variables to registers-
isLast- %10
maxVal- %11
Any suggestions to make it faster ?
Thanks
EDIT-
The clipping now looks like-
"cmp %10,#1 \n\t"//if(isLast)
"bne 3f \n\t"
"vmin.s32 d4,d4,d13 \n\t"
"vmax.s32 d4,d4,d12 \n\t"
"3: \n\t"
//d13 contains maxVal(255)
//d12 contains 0
Time consumed by this portion of the code has dropped from 223ms to 18ms
Using normal compares with NEON is almost always a bad idea because it forces the contents of a NEON register into a general purpose ARM register, and this costs lots of cycles.
You can use the vmin and vmax NEON instructions. Here is a little example that clamps an array of integers to any min/max values.
void clampArray (int minimum,
int maximum,
int * input,
int * output,
int numElements)
{
// get two NEON values with your minimum and maximum in each lane:
int32x2_t lower = vdup_n_s32 (minimum);
int32x2_t higher = vdup_n_s32 (maximum);
int i;
for (i=0; i<numElements; i+=2)
{
// load two integers
int32x2_t x = vld1_s32 (&input[i]);
// clamp against maximum:
x = vmin_s32 (x, higher);
// clamp against minimum
x = vmax_s32 (x, lower);
// store two integers
vst1_s32 (&output[i], x);
}
}
Warning: This code assumes the numElements is always a multiple of two, and I haven't tested it.
You may even make it faster if you process four elements at a time using the vminq / vmaxq instructions and load/store four integers per iteration.
If maxVal is UCHAR_MAX, CHAR_MAX, SHORT_MAX or USHORT_MAX, you can simply convert with neon from int to your desired datatype, by casting with saturation.
By example
// Will convert four int32 values to signed short values, with saturation.
int16x4_t vqmovn_s32 (int32x4_t)
// Converts signed short to unsgigned char, with saturation
uint8x8_t vqmovun_s16 (int16x8_t)
If you do not want to use multiple-data capabilities, you can still use those instructions, by simply loading and reading one of the lanes.

Resources