I have the following 32-bit NEON code that extracts a rectangular sub-image from a larger image:
extractY8ImageARM(unsigned char *from, unsigned char *to, int left, int top, int width, int height, int stride)
from: pointer to the original image
to: pointer to the destination extracted image
left, top: position where to extract in the original image
width, height: size of the extracted image
stride: width of the original image
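For reference, here is a plain C sketch of what the routine does (the C function name is just illustrative, not part of the original code):
void extractY8Image_c(unsigned char *from, unsigned char *to, int left, int top, int width, int height, int stride)
{
    //move to the top-left corner of the region to extract
    from += left + top * stride;
    for (int y = 0; y < height; y++) {
        //copy one row of the region
        for (int x = 0; x < width; x++)
            to[x] = from[x];
        to += width;     //destination rows are packed
        from += stride;  //source rows are stride bytes apart
    }
}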
and here is the assembly code:
.text
.arch armv7-a
.fpu neon
.type extractY8ImageARM, STT_FUNC
.global extractY8ImageARM
extractY8ImageARM:
from .req r0
to .req r1
left .req r2
top .req r3
width .req r4
height .req r5
stride .req r6
tmp .req r7
push {r0-r7, lr}
//Let's get back the arguments
ldr width, [sp, #(9 * 4)]
ldr height, [sp, #(10 * 4)]
ldr stride, [sp, #(11 * 4)]
//Update the from pointer. Advance left + stride * top
add from, from, left
mul tmp, top, stride
add from, from, tmp
.loopV:
//We will copy width
mov tmp, width
.loopH:
//Read and store data
pld [from]
vld1.u8 { d0, d1, d2, d3 }, [from]!
pld [to]
vst1.u8 { d0, d1, d2, d3 }, [to]!
subs tmp, tmp, #32
bgt .loopH
//We advance the from pointer for the next line
add from, from, stride
sub from, from, width
subs height, height, #1
bgt .loopV
pop {r0-r7, pc}
.unreq from
.unreq to
.unreq left
.unreq top
.unreq width
.unreq height
.unreq stride
.unreq tmp
I need to port it to 64-bit NEON. Can anyone help me with the translation? I have read this white paper http://malideveloper.arm.com/downloads/Porting%20to%20ARM%2064-bit.pdf so I understand the differences more or less.
My code is simple, so it would make a good example of how to pass arguments and load/store data in a 64-bit NEON assembly file. I would prefer to avoid intrinsics.
The whole code looks like this:
.text
.arch armv8-a
.type extractY8ImageARM, STT_FUNC
.global extractY8ImageARM
extractY8ImageARM:
from .req x0
to .req x1
left .req x2
top .req x3
width .req x4
height .req x5
stride .req x6
tmp .req x9
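//In AArch64 the first eight arguments arrive in registers (x0-x7), so unlike the 32-bit version there is nothing to load from the stack.
//Note: the int parameters (left, top, width, height, stride) arrive as 32-bit values in w2-w6; the AAPCS64 leaves the upper halves of those registers unspecified, so widening them (e.g. with sxtw) before the 64-bit address arithmetic below would be safer.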
add from, from, left
mul tmp, top, stride
add from, from, tmp
.loopV:
mov tmp, width
.loopH:
ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [from], #64
st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [to], #64
subs tmp, tmp, #64
bgt .loopH
add from, from, stride
sub from, from, width
subs height, height, #1
bgt .loopV
ret
.unreq from
.unreq to
.unreq left
.unreq top
.unreq width
.unreq height
.unreq stride
.unreq tmp
I'm struggling to see where I'm going wrong in a function that reverses a list in place. The function takes a pointer of an array of long ints as x0 and a long int for the size of the array as x1.
When the code runs, the first half of the list is correct, but after the halfway point it mirrors the beginning. For example, for the list [48, 307, 12, -18, -805] mine returns [-805, -18, 12, -18, -805].
That makes me believe it stops reversing after it hits the halfway point. I'm not sure why it would do this, and any help would be really appreciated. Thanks in advance!
.global reverse
.global loop
.global done
reverse:
cmp x1, #0 // if list is empty
beq empty // return 0
mov x3, x0 // copy of list
mov x2, x1 // size
sub x2, x2, #1 // size-1
lsl x2, x2, #3 // (size-1) * 8
add x3, x3, x2 // &data[size-1]
loop:
cmp x1, #0 // if we have reached the end of the list
beq done // return the reversed list
ldr x2, [x3] // get the reverse
str x2, [x0] //set x0 with the reverse
add x0, x0, #8 // x0++
sub x3, x3, #8 // x3--
sub x1, x1, #1 // length -1
b loop //loop
empty:
mov x0, #0
ret
done:
ret
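The symptom follows from the loop itself: it walks x0 forward and x3 backward, copying elements from the back over the front, but it never saves the front element before overwriting it, so once it passes the midpoint the values it reads have already been replaced. An in-place reverse has to swap one pair per iteration and stop at the middle. A minimal C sketch of that logic (for comparison only, not the original assembly; the name is just illustrative):
//In-place reverse: swap elements from both ends and stop at the middle.
void reverse_in_place(long *data, long size)
{
    if (size <= 0)
        return;
    long *front = data;             //x0 in the assembly version
    long *back  = data + size - 1;  //x3 in the assembly version
    while (front < back) {          //size/2 iterations, not size
        long tmp = *front;          //save the front element before overwriting it
        *front++ = *back;
        *back--  = tmp;
    }
}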
I am facing a weird issue. I am passing a uint64_t offset to a function on a 32-bit architecture (Cortex-R52). The registers are all 32-bit.
This is the function (sorry, I messed up the function declaration in the original post; this is the corrected one):
void read_func(uint8_t* read_buffer_ptr, uint64_t offset, size_t size);
main(){
// read_buffer : memory where something is read.
// read_buffer_ptr : points to read_buffer structure where value should be stored after reading value.
read_func(read_buffer_ptr, 0, sizeof(read_buffer));
}
In this function, the value stored in offset is not zero but some random value, which I can also see in the registers (r5, r6). When I use offset as a 32-bit value instead, it works perfectly fine: the value is copied from r2, r3 into r5, r6.
Can you please let me know why this could be happening? Are the registers not enough?
The prototype as originally posted was invalid; it should be:
void read_func(uint8_t *read_buffer_ptr, uint64_t offset, size_t size);
Similarly, the definition of main() uses an obsolete style: the implicit int return type has not been supported since C99, and the function call as originally posted had another syntax error with a missing )...
What happens when you pass a 64-bit argument on a 32-bit architecture is implementation defined:
either 8 bytes of stack space are used to pass the value
or 2 32-bit registers are used to pass the least significant part and the most significant part
or a combination of both depending on the number of arguments
or some other scheme appropriate for the target CPU
In your code you pass 0, which has type int and presumably only 32 bits. This is not a problem if the prototype for read_func is correct and has been parsed before the function call; otherwise the behavior is undefined and a C99 compiler should not even compile the code, but many compilers will just issue a warning and generate bogus code.
In your case (Cortex-R52), the 64-bit argument is passed to read_func in registers r2 and r3.
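To illustrate that point, here is a minimal compilable sketch (the array size and the body of read_func are placeholders, only there to show what the callee receives): with the prototype visible before the call, the constant 0 is converted to uint64_t and, per the AAPCS, passed in the register pair r2:r3, while size goes on the stack.
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

//Prototype visible before the call, so the compiler knows the second
//argument is 64 bits wide and passes it in r2:r3.
void read_func(uint8_t *read_buffer_ptr, uint64_t offset, size_t size);

static uint8_t read_buffer[32];   //placeholder buffer

int main(void)
{
    //0 is converted to uint64_t here; size ends up on the stack
    read_func(read_buffer, 0, sizeof(read_buffer));
    return 0;
}

//Placeholder body, only to show what arrives in the callee.
void read_func(uint8_t *read_buffer_ptr, uint64_t offset, size_t size)
{
    (void)read_buffer_ptr;
    printf("offset=%llu size=%u\n", (unsigned long long)offset, (unsigned)size);
}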
The Cortex-R52 has a 32-bit address bus, so the offset cannot usefully be 64 bits: in address calculations only the lower 32 bits are used, as the higher ones have no effect.
Example:
uint64_t foo(void *buff, uint64_t offset, uint64_t size)
{
unsigned char *cbuff = buff;
while(size--)
{
*(cbuff++ + offset) = size & 0xff;
}
return offset + (uint32_t)cbuff;
}
void *z1(void);
uint64_t z2(void);
uint64_t z3(void);
uint64_t bar(void)
{
return foo(z1(), z2(), z3());
}
foo:
push {r4, lr}
ldr lr, [sp, #8] //size (low word)
ldr r1, [sp, #12] //size (high word)
mov ip, lr
add r4, r0, r2 // cbuff + offset: r3 (the high word of offset) is ignored - the processor only has a 32-bit address space.
.L2:
subs ip, ip, #1 //size-- (low word)
sbc r1, r1, #0 //size-- (propagate the borrow into the high word)
cmn r1, #1
cmneq ip, #1
bne .L3
add r0, r0, lr
adds r0, r0, r2
adc r1, r3, #0
pop {r4, pc}
.L3:
strb ip, [r4], #1
b .L2
bar:
push {r0, r1, r4, r5, r6, lr}
bl z1
mov r4, r0 // buff
bl z2
mov r6, r0 // offset
mov r5, r1 // offset
bl z3
mov r2, r6 // offset
strd r0, [sp] // size passed on the stack
mov r3, r5 // offset
mov r0, r4 // buff
bl foo
add sp, sp, #8
pop {r4, r5, r6, pc}
As you can see, registers r2 and r3 contain the offset, r0 contains buff, and size is passed on the stack.
I want to divide a 64-bit number by a 32-bit number on an ARM Cortex-M3 device using the ARM inline assembler.
I tried dividing a 32-bit number by a 32-bit number, and it works fine; I have shared that code below. Please let me know what changes or additions are needed so that I can do 64-bit division.
long res = 0;
long Divide(long i,long j)
{
asm ("sdiv %0,%[input_i], %[input_j];"
: "=r" (res)
: [input_i] "r" (i), [input_j] "r" (j)
);
return res;
}
The Cortex-M ISA currently doesn't support 64-bit integer division in hardware.
You'll have to program it yourself.
The following is an example I just wrote down. It is probably vastly inefficient, and possibly buggy.
.syntax unified
.cpu cortex-m3
.fpu softvfp
.thumb
.global div64
.section .text.div64
.type div64, %function
div64:
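#arguments: r1:r0 = 64-bit dividend (r1 is the high word), r2 = 32-bit divisor; the (32-bit) quotient is returned in r0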
cbz r1, normal_divu
push {r4-r7}
mov r6, #0
mov r7, #32
rot_init:
cbz r7, exit
#evaluate free space on left of higher word
clz r3, r1
#limit to free digits
cmp r7, r3
it pl
bpl no_limit
mov r3, r7
no_limit:
#update free digits
sub r7, r3
#shift upper word r3 times
lsl r1, r3
#evaluate right shift for masking upper bits
rsb r4, r3, #32
#mask higher bits of lower word
mov r4, r0, LSR r4
#add them to higher word
add r1, r4
#shift lower word r3 times
lsl r0, r3
#divide higher word
udiv r5, r1, r2
#put the remainder in higher word
mul r4, r5, r2
sub r1, r4
#add result bits
lsl r6, r3
add r6, r5
b rot_init
exit:
mov r0, r6
pop {r4-r7}
bx lr
normal_divu:
udiv r0, r2
bx lr
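If the inline-assembler requirement can be relaxed, note that the compiler already handles this case: a plain C division of a 64-bit value by a 32-bit value is lowered to a call into the EABI runtime helper (__aeabi_uldivmod for unsigned values), which works fine on Cortex-M3. A minimal sketch (the function name is just illustrative):
#include <stdint.h>

//64-by-32 division in plain C; on ARM EABI targets the compiler emits a
//call to the runtime helper (__aeabi_uldivmod) instead of a divide instruction.
uint64_t divide64(uint64_t dividend, uint32_t divisor)
{
    return dividend / divisor;
}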
I need to convert the following C code into an ARM assembly subroutine:
int power(int x, unsigned int n)
{
int y;
if (n == 0)
return 1;
if (n & 1)
return x * power(x, n - 1);
else
{ y = power(x, n >> 1);
return y * y;
}
}
Here's what I have so far, but I can't figure out how to get the link register updated correctly after each recursive call (it keeps looping back to the same point):
pow CMP r0, #0
MOVEQ r0, #1
BXEQ lr
TST r0, #1
BEQ skip
SUB r0, r0, #1
BL pow
MUL r0, r1, r0
BX lr
skip LSR r0, #1
BL pow
MUL r3, r0, r3
BX lr
The BL instruction does not automatically push or pop anything from the stack; it just writes the return address into lr. This saves a memory access, and it is the way RISC processors generally work (in part because they offer a relatively large set of general-purpose registers, so leaf functions can avoid touching the stack entirely).
STR lr, [sp, #-4]! ; "PUSH lr"
BL pow
LDR lr, [sp], #4 ; "POP lr"
If you repeat a BL call, then you want to STR/LDR on the stack outside of the loop.
I have the following C code that converts interleaved webcam YUYV data to grayscale:
void convert_yuyv_to_y(const void *src, char *dest) {
int x, y;
const char *Y;
char *gray;
//get only Y component for grayscale from (Y1)(U1,2)(Y2)(V1,2)
for (y = 0; y < CAM_HEIGHT; y++) {
Y = (const char *)src + (CAM_WIDTH * 2 * y);
gray = dest + (CAM_WIDTH * y);
for (x=0; x < CAM_WIDTH; x += 2) {
gray[x] = *Y;
Y += 2;
gray[x + 1] = *Y;
Y += 2;
}
}
}
Is there a way to optimize such function with some neon instructions?
Here is a starting point. From here you can do cache preloads, loop unrolling, etc. The best performance will happen when more NEON registers are involved to prevent data stalls.
.equ CAM_HEIGHT, 480 # fill in the correct values
.equ CAM_WIDTH, 640
#
# Call from C as convert_yuyv_to_y(const void *src, char *dest);
#
convert_yuyv_to_y:
mov r2,#CAM_HEIGHT
cvtyuyv_top_y:
mov r3,#CAM_WIDTH
cvtyuyv_top_x:
vld2.8 {d0,d1},[r0]! # assumes source width is a multiple of 8
vst1.8 {d0},[r1]! # work with 8 pixels at a time
subs r3,r3,#8 # x+=8
bgt cvtyuyv_top_x
subs r2,r2,#1 # y++
bgt cvtyuyv_top_y
bx lr
(Promoting my comment to answer)
The smallest number of instructions needed to de-interleave the data on the NEON architecture is achieved with the sequence:
vld2.8 { d0, d1 }, [r0]!
vst1.8 { d0 }, [r1]!
Here r0 is the source pointer, which advances by 16 each time and r1 is the destination pointer, which advances by 8.
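The same pattern can be written with NEON intrinsics if you would rather let the compiler do the register allocation and scheduling. A minimal sketch (the function name and the width/height parameters are illustrative; it assumes the width is a multiple of 16 pixels):
#include <arm_neon.h>
#include <stddef.h>

//De-interleave YUYV and keep only the Y bytes, 16 pixels per iteration.
static void yuyv_to_y_neon(const uint8_t *src, uint8_t *dst, size_t width, size_t height)
{
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x += 16) {
            //load 32 interleaved bytes: val[0] receives the Y bytes, val[1] the U/V bytes
            uint8x16x2_t yuyv = vld2q_u8(src);
            vst1q_u8(dst, yuyv.val[0]);   //store only the Y plane
            src += 32;
            dst += 16;
        }
    }
}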
Loop unrolling, loading up to four registers at a time, and storing only every other register (d0, d2) can give a slightly higher maximum throughput, coupled with aligned addresses:
start:
vld4.8 { d0, d1, d2, d3 }, [r0:256]
subs r3, r3, #1
vld4.8 { d4, d5, d6, d7 }, [r1:256]
add r0, r0, #64
add r1, r1, #64
vst2.8 { d0, d2 }, [r2:256]!
vst2.8 { d4, d6 }, [r2:128]!
bgt start
(I can't remember if the format vstx.y {regs}, [rx], ro exists -- here ro is an offset register that post-increments rx)
While memory-transfer optimizations can be useful, it's still better to think about whether the copy can be skipped altogether or merged with some other calculation. This could also be the place to consider a planar pixel format, which would avoid the copying task completely.