I need to optimize my mixing code in C for faster response time, so I decided to use inline assembly to mix two buffers into a new, bigger buffer. I have the left and right channels in separate buffers and want to interleave them into one buffer: two bytes from the left channel, then two bytes from the right channel, and so on.
For this I decided to pass my three pointers to my assembly code, where I intend to copy the memory pointed to by the left-channel pointer into register R0 and the memory pointed to by the right-channel pointer into R1. Afterwards I intend to mix R0 and R1 into R3 and R4, and later save those registers to memory. (I intend to use other free registers to repeat the same procedure and reduce processing time through pipelining.)
So I have two registers R0 and R1 with data, and I need to mix them into R3 and R4 so that I end up with R3 = R0HI (high part) + R1HI (high part) and R4 = R0LO (low part) + R1LO (low part).
I can think of using bitwise shifts, but my question is whether there is an easier way to do it, like on the Intel x86 architecture where you can move the data into AX and then use AH as the high part and AL as the low part.
Is my thinking right? Is there a faster way to do this?
My actual (not working) code in the NDK:
void mux(short *pLeftBuf, short *pRightBuf, short *pOutBuf, int vecsamps_stereo) {
int iterations = vecsamps_stereo / 4;
int *pIter = &iterations; // pointer passed to the asm below via %[iter]
asm volatile(
"ldr r0, %[outbuf];"
"ldr r1, %[leftbuf];"
"ldr r2, %[rightbuf];"
"ldr r3, %[iter];"
"ldr r4, [r3];"
"mov r8, r4;"
"mov r9, r0;"
"mov r4, #0;"
"mov r10, r4;"
"loop:; "
"ldr r2, [r1];"
"ldr r3, [r2];"
"ldr r7, =0xffff;"
"mov r4, r2;"
"and r4, r4, r7;"
"mov r5, r3;"
"and r5, r5, r7;"
"lsl r5, r5, #16;"
"orr r4, r4, r5;"
"lsl r7, r7, #16;"
"mov r5, r2;"
"and r5, r5, r7;"
"mov r6, r3;"
"and r6, r6, r7;"
"lsr r6, r6, #16;"
"orr r5, r5, r6;"
"mov r6, r9;"
"str r4, [r6];"
"add r6, r6, #1;"
"str r5, [r6];"
"add r6, r6, #1;"
"mov r9, r6;"
"mov r4, r10;"
"add r4, r4, #1;"
"mov r10, r4;"
"cmp r4, r8;"
"blt loop"
:[outbuf] "=m" (pOutBuf)
:[leftbuf] "m" (pLeftBuf), [rightbuf] "m" (pRightBuf), [iter] "m" (pIter)
:"r0","r1","r2","r3","r4","r5","r6","r7","r8","r9","r10","cc","memory"
);
}
I may not be 100% clear on what you are trying to do, but it looks like you want:
R3[31:16] = R0[31:16], R3[15:0] = R1[31:16];
R4[31:16] = R0[15:0], R4[15:0] = R1[15:0];
and not the actual sum.
In this case, you should be able to accomplish this relatively efficiently with a spare register holding a 16-bit mask. ARM assembly offers shifting of the second operand as part of most arithmetic and logical instructions.
LDR R2, =0xffff          ; load 16-bit mask into lower half of R2
AND R3, R0, R2, LSL #16  ; R3 = R0 & (R2 << 16), or R3[31:16] = R0[31:16]
ORR R3, R3, R1, LSR #16  ; R3 = R3 | (R1 >> 16), or R3[15:0] = R1[31:16]
AND R4, R2, R1           ; R4 = R2 & R1, or R4[15:0] = R1[15:0]
ORR R4, R4, R0, LSL #16  ; R4 = R4 | (R0 << 16), or R4[31:16] = R0[15:0]
; repeat to taste
Another option is to load just 16 bits at a time, but this may be slower if your buffers are in slow memory, and it may not work at all if the memory system doesn't support accesses narrower than 32 bits. I'm not certain whether the core will request the full 32 bits and mask out what isn't needed, or whether it relies on the memory to handle the byte lanes.
; assume R2 contains CH1 pointer, R3 contains CH2 pointer,
; and R1 contains output ptr
LDRH R0, [R2] ; load first 16 bits pointed to by CH1 into R0
STRH R0, [R1]      ; store those 16 bits back into *output
LDRH R0, [R3] ; load first 16 bits pointed to by CH2 into R0
STRH R0, [R1, #2]! ; store those 16 bits back into *(output+2),
; write output+2 to R1
; after "priming" we can now run the following for
; auto increment of pointers.
LDRH R0, [R2, #2]! ; R0 = *(CH1+2), CH1 += 2
STRH R0, [R1, #2]! ; *(Out+2) = R0, Out += 2
LDRH R0, [R3, #2]! ; R0 = *(CH2+2), CH2 += 2
STRH R0, [R1, #2]! ; *(Out+2) = R0, Out += 2
; Lather, rinse, repeat.
These two examples make use of some of the handy features of ARM assembly. The first example makes use of the built-in shift available on most instructions, while the second makes use of sized load/store instructions as well as the write-back they offer. These should both be compatible with Cortex-M cores. If you do have a more advanced ARM, #Notlikethat's answer is more suitable.
In terms of code size, when you add the load and store to the first example, you end up executing two load instructions, the four logic instructions, and two stores, for a total of eight instructions for mixing two samples. The second example uses two loads and two stores for a total of four instructions when mixing one sample, or, well, eight instructions for mixing two.
You will probably find the first example works faster, as it has fewer memory accesses, and the number of stores can be reduced by using an STM store-multiple instruction (i.e. STMIA OutputRegister!, {R3, R4}). In fact, the first example can be pipelined a bit by using eight registers: LDMIA can load four 16-bit samples from a channel in one instruction, you perform two sets of the four mixing instructions, and then store the four output registers in one STMIA instruction. This likely wouldn't offer much benefit in performance, since it will interact with the memory in the same manner (STM and LDM just execute multiple LDRs and STRs), but if you are optimizing for minimal instructions, this results in 11 instructions to mix four samples (compared to 16).
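To make that concrete, here is a sketch of such a widened loop body. The register assignment is my own: R0/R1/R2 are assumed to hold the left, right, and output pointers, the output layout mirrors the four-instruction recipe above, and the mask load sits outside the loop, leaving the 11 loop instructions counted above.
LDR R12, =0xffff         ; 16-bit mask, set up once outside the loop
LDMIA R0!, {R8, R9}      ; load four left samples (two per word)
LDMIA R1!, {R10, R11}    ; load four right samples
AND R4, R12, R10         ; R4[15:0]  = right[15:0]
ORR R4, R4, R8, LSL #16  ; R4[31:16] = left[15:0]
AND R5, R8, R12, LSL #16 ; R5[31:16] = left[31:16]
ORR R5, R5, R10, LSR #16 ; R5[15:0]  = right[31:16]
AND R6, R12, R11         ; same recipe for the second word pair
ORR R6, R6, R9, LSL #16
AND R7, R9, R12, LSL #16
ORR R7, R7, R11, LSR #16
STMIA R2!, {R4-R7}       ; store four mixed words in sample order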
ARM registers are strictly 32-bit; however, provided you're on a recent enough core (v6+, but not Thumb-only v7-M) there are a number of suitable instructions for dealing with halfwords (PKHBT, PKHTB) or arbitrary slices of registers (BFI, UBFX), not to mention crazy parallel add/subtract instructions that frighten me (available with saturating arithmetic, which can be useful for audio, too).
However, if your machine implements NEON instructions they would be the route to the optimal implementation since this is exactly the sort of thing they're designed for. Plus, they should be accessible through compiler intrinsics so you can use them directly in C code.
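For what it's worth, a minimal NEON intrinsics sketch of the asker's mux routine could look like the following (function and parameter names are mine; it assumes the per-channel sample count is a multiple of 8). The structured store vst2q_s16 performs exactly this two-way interleave.
#include <arm_neon.h>

/* Interleave two mono channels into one stereo buffer.
   n_mono is the per-channel sample count (assumed a multiple of 8). */
void mux_neon(const short *pLeftBuf, const short *pRightBuf,
              short *pOutBuf, int n_mono) {
    for (int i = 0; i < n_mono; i += 8) {
        int16x8x2_t frame;
        frame.val[0] = vld1q_s16(pLeftBuf + i);   /* 8 left samples  */
        frame.val[1] = vld1q_s16(pRightBuf + i);  /* 8 right samples */
        vst2q_s16(pOutBuf + 2 * i, frame);        /* writes L0,R0,L1,R1,... */
    }
}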
The task is to store 12 in R1 and 27 in R2, then subtract R1 from R2 and store the result at memory address 0x4000. Lastly, store R1 at 0x4004 and R2 at 0x4008, but I got "Invalid immediate Operand Value" on MOV R5, #0x4004 and MOV R6, #0x4008.
MOV R1, #12
MOV R2, #27
SUB R3, R2, R1
MOV R4, #0x4000
STR R3, [R4]
MOV R5, #0x4004
MOV R6, #0x4008
STR R1, [R5]
STR R2, [R6]
One fix is offset addressing from the single base in R4, which avoids the problematic immediates entirely:
MOV R1, #12
MOV R2, #27
SUB R3, R2, R1
MOV R4, #0x4000
STR R3, [R4, #0]
STR R1, [R4, #4]
STR R2, [R4, #8]
According to this tutorial, it looks like immediate values in 32-bit ARM are restricted to "neat" numbers that can be represented as an 8-bit value rotated right by an even number of bits. A quick check of 0x4000 yields 0b100000000000000, which can be represented as 0x1 shifted left by 14. Your values 0x4004 and 0x4008 don't fall under this category: 0x4004 is 0b100000000000100 and 0x4008 is 0b100000000001000.
Since those specific values seem important, you could try adding to the base value in R4 and saving the resulting addresses in R5 and R6.
MOV R1, #12
MOV R2, #27
SUB R3, R2, R1
MOV R4, #0x4000
STR R3, [R4]
ADD R5, R4, #0x4
ADD R6, R4, #0x8
STR R1, [R5]
STR R2, [R6]
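Alternatively, if your core implements ARMv7 / Thumb-2, the MOVW instruction can load any 16-bit constant directly and sidesteps the rotation rule entirely (a sketch; check that your target supports it first):
MOVW R5, #0x4004
MOVW R6, #0x4008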
If you'd like more information, you can check out ARM's own documentation on immediate values here. Just make sure the version of the ISA you're using matches the one in the documentation. In the future, try to give as much detail as possible about your environment, since there are many versions of the ARM ISA (for example, the 16-bit Thumb encoding is very different from the 32-bit and 64-bit ones).
I need a thread-safe idx++ and idx-- operation.
Disabling interrupts, i.e. using critical sections, is one option, but I want to understand why my operations are not atomic, as I expected.
Here is the C code with the inline assembler code shown, using Segger Ozone:
(Also, please notice that the addresses of the variables show that the 32-bit variable is 32-bit-aligned in memory, and the 8- and 16-bit variables are both 16-bit-aligned.)
volatile static U8 dbgIdx8 = 1000U;
volatile static U16 dbgIdx16 = 1000U;
volatile static U32 dbgIdx32 = 1000U;
dbgIdx8 ++;
080058BE LDR R3, [PC, #48]
080058C0 LDRB R3, [R3]
080058C2 UXTB R3, R3
080058C4 ADDS R3, #1
080058C6 UXTB R2, R3
080058C8 LDR R3, [PC, #36]
080058CA STRB R2, [R3]
dbgIdx16 ++;
080058CC LDR R3, [PC, #36]
080058CE LDRH R3, [R3]
080058D0 UXTH R3, R3
080058D2 ADDS R3, #1
080058D4 UXTH R2, R3
080058D6 LDR R3, [PC, #28]
080058D8 STRH R2, [R3]
dbgIdx32 ++;
080058DA LDR R3, [PC, #28]
080058DC LDR R3, [R3]
080058DE ADDS R3, #1
080058E0 LDR R2, [PC, #20]
080058E2 STR R3, [R2]
There is no guarantee that ++ and -- are atomic. If you need guaranteed atomicity, you will have to find some other way.
As #StaceyGirl points out in a comment, you might be able to use the facilities of <stdatomic.h>. For example, I see there's an atomic_fetch_add function defined, which acts like the postfix ++ you're striving for. There's an atomic_fetch_sub, too.
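A minimal sketch of that approach (C11; the variable and function names are mine):
#include <stdatomic.h>

static atomic_uint idx = 1000U;

void increment(void) {
    atomic_fetch_add(&idx, 1); /* atomic idx++ */
}

void decrement(void) {
    atomic_fetch_sub(&idx, 1); /* atomic idx-- */
}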
Alternatively, you might have some compiler intrinsics available to you for performing an atomic increment in some processor-specific way.
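For instance, GCC and Clang expose the __atomic builtins (a sketch, assuming one of those compilers):
__atomic_fetch_add(&dbgIdx32, 1, __ATOMIC_SEQ_CST); /* atomic dbgIdx32++ */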
ARM Cortex cores do not modify memory in place. All memory modifications are performed as RMW (read-modify-write) sequences, which are not atomic by default.
But the Cortex-M3 has special instructions to gain exclusive access to a memory location: LDREX and STREX. https://developer.arm.com/documentation/100235/0004/the-cortex-m33-instruction-set/memory-access-instructions/ldaex-and-stlex
You can use them directly from C code, without touching the assembly, by using intrinsics.
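A minimal sketch of an atomic increment built on those intrinsics (assuming a CMSIS-based project; CMSIS 5 provides __LDREXW/__STREXW via cmsis_compiler.h, older projects get them from core_cm3.h):
#include <stdint.h>
#include "cmsis_compiler.h"

/* Retry until the exclusive store succeeds. */
static void atomic_inc_u32(volatile uint32_t *p) {
    uint32_t v;
    do {
        v = __LDREXW(p) + 1U;       /* load-exclusive, then modify */
    } while (__STREXW(v, p) != 0U); /* store-exclusive returns nonzero if interrupted */
}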
Do not use data types narrower than 32 bits in performance-sensitive code (you want to hold the exclusive access for as short a time as possible); most operations on shorter data types add extra code.
I want to divide a 64-bit number by a 32-bit number on an ARM Cortex-M3 device using the ARM inline assembler.
I tried dividing a 32-bit number by a 32-bit number, and it works fine; I have shared that code below. Please let me know what has to change or be added so that I can do 64-bit division.
long res = 0;
long Divide(long i,long j)
{
asm ("sdiv %0,%[input_i], %[input_j];"
: "=r" (res)
: [input_i] "r" (i), [input_j] "r" (j)
);
return res;
}
The Cortex-M ISA currently doesn't support 64-bit integer division.
You'll have to program it.
The following is an example I just wrote down. It is probably vastly inefficient and buggy.
.syntax unified
.cpu cortex-m3
.fpu softvfp
.thumb
.global div64
.section .text.div64
.type div64, %function
div64:
cbz r1, normal_divu
push {r4-r7}
mov r6, #0
mov r7, #32
rot_init:
cbz r7, exit
#evaluate free space on left of higher word
clz r3, r1
#limit to free digits
cmp r7, r3
it pl
bpl no_limit
mov r3, r7
no_limit:
#update free digits
sub r7, r3
#shift upper word r3 times
lsl r1, r3
#evaluate right shift for masking upper bits
rsb r4, r3, #32
#mask higher bits of lower word
lsr r4, r0, r4
#add them to higher word
add r1, r4
#shift lower word r3 times
lsl r0, r3
#divide higher word
udiv r5, r1, r2
#put the remainder in higher word
mul r4, r5, r2
sub r1, r4
#add result bits
lsl r6, r3
add r6, r5
b rot_init
exit:
mov r0, r6
ldm sp!, {r4-r7}
bx lr
normal_divu:
udiv r0, r0, r2
bx lr
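For comparison, note that plain C already gets you a 64-by-32 division on Cortex-M3: the compiler lowers a 64-bit divide to a call into its runtime library (the AEABI helper __aeabi_uldivmod for GCC/Clang), so a sketch like the following needs no assembly at all:
#include <stdint.h>

/* The compiler emits a runtime-library call for the 64-bit divide. */
uint64_t divide64(uint64_t numerator, uint32_t denominator) {
    return numerator / denominator;
}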
This works, but I have to do it using auto-indexing, and I cannot figure out that part.
writeloop:
cmp r0, #10
beq writedone
ldr r1, =array1
lsl r2, r0, #2
add r2, r1, r2
str r2, [r2]
add r0, r0, #1
b writeloop
and for data I have
.balign 4
array1: skip 40
What I had tried was this; yes, I know it is probably a poor attempt, but I am new to this and do not understand it:
ldr r1, =array1
writeloop:
cmp r0, #10
beq writedone
ldr r2, [r1], #4
str r2, [r2]
add r0, r0, #1
b writeloop
It says segmentation fault when I try this. What is wrong? What I think should happen is: every time through the loop, it sets the current element equal to its own address, then increments to the next element and does the same thing.
The ARM architecture gives several different addressing modes.
From ARM946E-S product overview and many other sources:
Load and store instructions have three primary addressing modes
- offset
- pre-indexed
- post-indexed.
They are formed by adding or subtracting an immediate or register-based offset to or from a base register. Register-based offsets can also be scaled with shift operations. Pre-indexed and post-indexed addressing modes update the base register with the base plus offset calculation. As the PC is a general purpose register, a 32‑bit value can be loaded directly into the PC to perform a jump to any address in the 4GB memory space.
As well, they support write-back, or updating of the base register; this is the point of the pre-indexed and post-indexed forms, and post-indexing doesn't make much sense without write-back.
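To make the three modes concrete (the operands are illustrative):
ldr r0, [r1, #4]  ; offset: load from r1+4, r1 is unchanged
ldr r0, [r1, #4]! ; pre-indexed: r1 += 4 first, then load from r1
ldr r0, [r1], #4  ; post-indexed: load from r1, then r1 += 4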
Now to your issue. I believe that you want to write the values 0-9 to an array of ten words (four bytes each). Assuming this, you can use register-offset indexing and update the index via add. This leads to,
mov r0, #0 ; start value
ldr r1, =array1 ; array pointer
writeloop:
cmp r0, #10
beq writedone
str r0, [r1, r0, lsl #2] ; index with r1 base by r0 scaled by *4
add r0, r0, #1
b writeloop
writedone:
; code to jump somewhere else and not execute data.
.balign 4
array1: skip 40
For interest, a more efficient loop can be had by counting down while writing,
mov r0, #9 ; start value
ldr r1, =array1 ; array pointer
writeloop:
str r0, [r1, r0, lsl #2] ; index with r1 base by r0 scaled by *4
subs r0, r0, #1
bpl writeloop           ; loop while r0 >= 0, so index 0 is written as well
Your original example was writing the pointer into the array, often referred to as 'value equals address'. If this is what you want,
ldr r0, =array_end ; finished?
ldr r1, =array1 ; array pointer
write_loop:
str r1, [r1], #4 ; add four and update after storing
cmp r0, r1
bne write_loop
; code to jump somewhere else and not execute data.
.balign 4
array1: skip 40
array_end:
I need to reduce the code bloat for the Cortex-M0 microprocessor.
At startup the initialized data has to be copied from ROM to RAM once. Therefore I have this piece of code:
void __startup( void ){
    extern unsigned int __data_init_start;
    extern unsigned int __data_start;
    extern unsigned int __data_end;
    // copy .data section from flash to ram
    // (s, d and e are global unsigned int * pointers declared elsewhere)
    s = & __data_init_start;
    d = & __data_start;
    e = & __data_end;
    while( d != e ){
        *d++ = *s++;
    }
}
The assembly code that is generated by the compiler looks like this:
ldr r1, .L10+8
ldr r2, .L10+12
sub r0, r1, r2
lsr r3, r0, #2
add r3, r3, #1
lsl r1, r3, #2
mov r3, #0
.L4:
add r3, r3, #4
cmp r3, r1
beq .L9
.L5:
ldr r4, .L10+16
add r0, r2, r3
add r4, r3, r4
sub r4, r4, #4
ldr r4, [r4]
sub r0, r0, #4
str r4, [r0]
b .L4
How can I optimize this code so the code size is at minimum?
The compiler (or you!) does not realize that the amount to copy is simply end - start. There seems to be some unnecessary shuffling of data going on -- the two adds and the sub in the loop. Also, it seems to me the compiler makes sure that the number of bytes to copy is a multiple of 4. An obvious optimization, then, is to make sure it is in advance! Below I assume it is (if not, the bne will never trigger and the loop will happily keep on copying and trample all over your memory).
Using my decade-old ARM assembler knowledge (yes, that is a major disclaimer) and write-back addressing, I think the following short snippet is what it can be condensed to. From 18 instructions down to 8, not too bad. If it works.
ldr r1, =__data_init_start ; addresses of the linker symbols, not loads from them
ldr r2, =__data_start
ldr r3, =__data_end
sub r4, r3, r2             ; byte count to copy
.L1:
ldmia r1!, {r3}            ; safe to re-use r3 here; write-back adds 4 to r1
stmia r2!, {r3}            ; Cortex-M0 lacks post-indexed ldr/str, but ldm/stm write back
subs r4, r4, #4
bne .L1
It may be that the platform guarantees that writing through an unsigned int * can change an unsigned int * value (i.e. the compiler doesn't take advantage of the type-mismatch aliasing rules).
In that case the code is inefficient because e is a global variable, and the generated logic must take into account that writing to *d may change the value of e.
Making at least e a local should solve this problem (most compilers know that aliasing a local that never had its address taken is not possible from a C point of view).
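A minimal sketch of that change (the same loop, with all three pointers as locals):
void __startup( void ){
    extern unsigned int __data_init_start;
    extern unsigned int __data_start;
    extern unsigned int __data_end;
    // locals: the store through d provably cannot modify them,
    // so the compiler does not have to reload anything inside the loop
    unsigned int *s = & __data_init_start;
    unsigned int *d = & __data_start;
    unsigned int *e = & __data_end;
    while( d != e ){
        *d++ = *s++;
    }
}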