How to do a simple loop in ARM inline assembly with the NDK?

I'm trying to write a simple loop in ARM assembly, but it crashes every time I run it.
This is the log:
01-13 15:34:21.277: A/libc(27296): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1), thread 27312 (Thread-2932)
Here is my code. What am I doing wrong?
void foo(int *pIn, int *pOut) {
//pIn contains the number of iterations the loop will have
asm volatile(
"ldr r3, %[in];"
"ldr r4, %[out];"
"ldr r5, [r3];"
"loop:; "
//here would go the code inside the loop perhaps put something in output, in this case just do nothing
"subs r5, r5, #1;"
"bne loop"
:[out] "=m" (pOut)
:[in] "m" (pIn)
:"r3","r4","r5","memory"
);
}
In the Android.mk file I set the 32-bit ARM mode directive:
LOCAL_ARM_MODE := arm
Any ideas why it is crashing?
The crash only occurs when I add the loop. Before that I tried moving values around, and it worked perfectly fine, giving the output values I expected.

The problem is solved: adding "r5" and "cc" to my clobber list made it work.
Here is the working code:
void foo(int *pIn, int *pOut) {
//pIn contains the number of iterations the loop will have
asm volatile(
"ldr r3, %[in];"
"ldr r4, %[out];"
"ldr r5, [r3];"
"loop:; "
//here would go the code inside the loop perhaps put something in output, in this case just do nothing
"subs r5, r5, #1;"
"bne loop"
:[out] "=m" (pOut)
:[in] "m" (pIn)
:"r3","r4","r5","cc","memory"
);
}
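As a follow-up to the fix above, here is a minimal sketch of the same countdown loop written with register operands instead of hard-coded registers. It assumes GCC or Clang targeting 32-bit ARM (ARM mode or Thumb-2) and a count greater than zero. The name count_down and the plain C fallback for other targets are illustrative and not part of the original post. The counter is declared as a read-write operand ("+r") so the compiler knows the asm changes it, "cc" is listed because subs writes the flags, and the numeric label 1: avoids a duplicate-symbol error if the compiler emits the asm more than once.

```c
#include <stdint.h>

/* Hypothetical helper (not from the post): counts n down to zero and
 * returns how many iterations ran. Assumes n > 0. */
static uint32_t count_down(uint32_t n)
{
    uint32_t iterations = 0;
#if defined(__arm__)
    __asm__ volatile(
        "1:                       \n\t" /* numeric local label */
        "add  %[it], %[it], #1    \n\t" /* loop body: count one iteration */
        "subs %[n], %[n], #1      \n\t" /* decrement and set the flags */
        "bne  1b                  \n\t" /* repeat until n reaches zero */
        : [n] "+r" (n), [it] "+r" (iterations) /* both are read and written */
        :
        : "cc");                        /* subs modifies the condition flags */
#else
    /* Portable fallback so the sketch also builds on non-ARM hosts. */
    while (n != 0) {
        iterations++;
        n--;
    }
#endif
    return iterations;
}
```

Calling count_down(5) should return 5. Because the compiler picks the registers, no register clobber list is needed at all.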

Related

Quick sort using ARM assembly - segmentation error

I'm trying to write a quicksort function in ARM assembly (Raspberry Pi), but it fails with a segmentation fault.
I think the recursion causes the error, when storing to or loading from the stack.
Can you tell me how I can fix it?
I used the ARM assembly code from https://en.wikibooks.org/wiki/Algorithm_Implementation/Sorting/Quicksort#ARM_Assembly
and typed it in unchanged, apart from renaming registers ('r3'->'r2', 'r2'->'r1', 'r1'->'r0', ...).
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define SIZE 32
int main()
{
int arr[SIZE];
int max, min;
int i;
for (i = 0; i < SIZE; i++) {
arr[i] = rand() % 100;
}
asm(
"mov r0, #0\n\t"
"mov r1, #128\n\t"
"Loop3:\n\t"
"stmfd sp!, {r3, r5, lr}\n\t"
"mov r5, r1\n\t"
"Loop4:\n\t"
"sub r6, r5, r0\n\t"
"cmp r6, #4\n\t"
"ldmlefd sp!, {r3, r5, pc}\n\t"
"ldr r6, [%[arr],r0]\n\t"
"add r1, r0, #4\n\t"
"mov r3, r5\n\t"
"Loop5:\n\t"
"ldr r2, [%[arr],r1]\n\t"
"cmp r2, r6\n\t"
"addle r1, r1, #4\n\t"
"ble Loop6\n\t"
"sub r3, r3, #4\n\t"
"ldr r4, [%[arr],r3]\n\t"
"str r4, [%[arr],r1]\n\t"
"str r2, [%[arr],r3]\n\t"
"Loop6:\n\t"
"cmp r1, r3\n\t"
"blt Loop5\n\t"
"Loop7:\n\t"
"sub r1, r1, #4\n\t"
"ldr r2, [%[arr],r1]\n\t"
"str r2, [%[arr],r0]\n\t"
"str r6, [%[arr],r1]\n\t"
"bl Loop3\n\t"
"mov r0, r3\n\t"
"b Loop4\n\t"
:
:
[arr] "r"(arr)
:
"r0", "r1", "r2", "r3", "r4", "r5", "r6"
);
return 0;
}

Your inline asm can never reach the end of the asm template. Presumably you're trying to return out of the C function, not just from the internal recursive calls. That's obviously unsafe because there's zero guarantee about the stack layout or the contents of LR, and both will change with/without optimization.
Don't write a whole recursive function in the middle of a C function.
Use a debugger to single-step the resulting program and see where your code breaks the compiler-generated asm that surrounds it.
Also your inline asm is broken: you dereference arr without specifying it as a memory read/write operand or adding a "memory" clobber. A pointer input does not imply that the pointed-to memory is also an operand.
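One way to follow this advice is to keep the recursion in C and confine the inline asm to a small leaf operation that falls through to the end of its template. The sketch below is an assumption-laden illustration, not the Wikibooks routine: swap_words and quicksort are hypothetical names, the partition scheme is a plain Lomuto partition, and a C fallback is used on non-ARM targets. The asm lists "memory" as a clobber because it reads and writes through the pointers, and the temporaries are early-clobber outputs ("=&r") so they cannot share a register with the pointer inputs.

```c
/* Hypothetical leaf helper: swaps *a and *b. */
static void swap_words(int *a, int *b)
{
#if defined(__arm__)
    int t0, t1;
    __asm__ volatile(
        "ldr %[t0], [%[a]]   \n\t"
        "ldr %[t1], [%[b]]   \n\t"
        "str %[t1], [%[a]]   \n\t"
        "str %[t0], [%[b]]   \n\t"
        : [t0] "=&r" (t0), [t1] "=&r" (t1) /* written before inputs are done */
        : [a] "r" (a), [b] "r" (b)
        : "memory");                        /* the asm touches *a and *b */
#else
    int t = *a;
    *a = *b;
    *b = t;
#endif
}

/* Sorts arr[lo..hi] (inclusive). The recursion stays in C, so the compiler
 * manages the stack and the link register. */
static void quicksort(int *arr, int lo, int hi)
{
    if (lo >= hi)
        return;
    int pivot = arr[hi];
    int i = lo;
    for (int j = lo; j < hi; j++) {
        if (arr[j] < pivot) {
            swap_words(&arr[i], &arr[j]);
            i++;
        }
    }
    swap_words(&arr[i], &arr[hi]);
    quicksort(arr, lo, i - 1);
    quicksort(arr, i + 1, hi);
}
```

For the array in the question, the call would be quicksort(arr, 0, SIZE - 1).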

ARM C Inline assembly - LDR instruction

I am working with inline assembly on an RPi 2 (ARM architecture), using GCC as my compiler.
I want to compile and run the following code, but I get an error. Can anyone help me fix the problem?
Here is the code I need help with:
int main(void)
{
int a;
asm("PUSH {r0}");
asm("PUSH {r1}");
asm("LDR r0, =a");
asm("MOV r1, sp");
asm("STR r1, [r0]");
asm("POP {r1}");
asm("POP {r0}");
}
The error is about the LDR instruction. I tried removing the '=' and using MOV instead of LDR, but it still does not work.

To access specific registers, you can use asm register variables, such as:
register int sp asm("sp");
__asm__ __volatile__("" : "=r" (sp));
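The LDR r0, =a line most likely fails because a is a local variable and so has no assembler symbol that the assembler or linker could resolve. A sketch of an alternative is to let the compiler place the result through an output operand. The function name read_stack_pointer is hypothetical, and the non-ARM branch (which returns the frame address through the GCC/Clang builtin __builtin_frame_address) is only an approximation so the example builds elsewhere.

```c
#include <stdint.h>

/* Hypothetical helper: returns the current stack pointer on 32-bit ARM. */
static inline uintptr_t read_stack_pointer(void)
{
    uintptr_t value;
#if defined(__arm__)
    /* The compiler chooses the destination register and stores it into
     * value, so no PUSH/POP or manual address loading is needed. */
    __asm__ volatile("mov %0, sp" : "=r" (value));
#else
    /* Fallback for other targets: frame address, not the exact SP. */
    value = (uintptr_t)__builtin_frame_address(0);
#endif
    return value;
}
```

In main this would be used as a = (int)read_stack_pointer(); on a 32-bit target.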

gcc incorrectly reusing registers in inline asm

I've implemented a simple delay loop macro in a C program for the Cortex-M4:
#define DELAY_CYCLES (F_CPU / 3000000) //F_CPU is 72000000
#define delayUS(n) __asm__ volatile( \
"1: subs %0, #1 \n" \
"bne 1b \n" \
: /* no outputs */ \
: "r" (n * DELAY_CYCLES) /* input */ \
: "0" /* clobbers */ \
)
This delays for n microseconds (assuming interrupts are disabled). Mostly, it works fine. However, I've found that it doesn't work correctly in a function that uses it twice:
static void test(uint8_t num) {
digitalWrite(12, 1);
delayUS(10);
digitalWrite(13, 1);
delayUS(10);
digitalWrite(12, 0);
digitalWrite(13, 0);
}
(This was a function that actually uses num, but got stripped down to this while debugging this issue. It also gets inlined into main, hence the labels in the disassembly.)
What happens here is the second call to delayUS() never completes. Examining the generated assembly shows the problem:
528: 2701 movs r7, #1
52a: 6037 str r7, [r6, #0] ;digitalWrite(12, 1)
52c: 23f0 movs r3, #240 ;delayUS(10); 10 * DELAY_CYCLES = 240
52e: 3b01 subs r3, #1
530: d1fd bne.n 52e <main+0x4a>
532: 4c0d ldr r4, [pc, #52]
534: 6027 str r7, [r4, #0] ;digitalWrite(13, 1)
536: 3b01 subs r3, #1 ;delayUS(10), but r3 is still 0
538: d1fd bne.n 536 <main+0x52>
53a: 2300 movs r3, #0
53c: 6033 str r3, [r6, #0] ;digitalWrite(12, 0)
For some reason, gcc doesn't re-initialize r3 before using it in the second delay loop, so instead of delaying for 240 iterations (10µs), it delays for 2^32 (about 3 minutes).
With this variation, the issue disappears:
__attribute__((used)) static int dummy;
#define delayUS(n) __asm__ volatile( \
"1: subs %0, #1 \n" \
"bne 1b \n" \
: "=r" (dummy) /* no outputs */ \
: "0" (n * DELAY_CYCLES) /* input */ \
: "0" /* clobbers */ \
)
That generates more correct code:
528: 2701 movs r7, #1
52a: 23f0 movs r3, #240 ;r3 = 10 * DELAY_CYCLES
52c: 6037 str r7, [r6, #0] ;digitalWrite(12, 1)
52e: 461a mov r2, r3 ;r2 = r3
530: 3a01 subs r2, #1 ;delayUS(r2)
532: d1fd bne.n 530 <main+0x4c>
534: 4c0d ldr r4, [pc, #52]
536: 6027 str r7, [r4, #0] ;digitalWrite(13, 1)
538: 3b01 subs r3, #1 ;delayUS(r3)
53a: d1fd bne.n 538 <main+0x54>
53c: 4a0c ldr r2, [pc, #48]
53e: 6013 str r3, [r2, #0] ;digitalWrite(12, 0)
Here, gcc correctly realizes that the delay loop clobbers its input register, and so doesn't re-use r3 without initializing it (it uses r2 for one of the loops instead).
So, why does gcc not recognize that the former version also clobbers its input, when it's listed in the clobber list?

The problem is that the 'clobbers' list is a list of register names, or the special strings "cc" and "memory". Since there is no register called "0", having this in the clobbers list is meaningless. Unfortunately gcc does not give you a warning about this. Instead, as the gcc docs note:
Warning: Do not modify the contents of input-only operands (except for inputs tied to outputs). The compiler assumes that on exit from the asm statement these operands contain the same values as they had before executing the statement. It is not possible to use clobbers to inform the compiler that the values in these inputs are changing. One common work-around is to tie the changing input variable to an output variable that never gets used.
This workaround is what your second example does, and is why it works. For correctness, you should probably also add "cc" to the clobbers list (as you modify the flags), and you might as well remove the "0", because it is meaningless.
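Putting those suggestions together, a minimal sketch of the delay loop could use a single read-write operand ("+r"), which has the same effect as tying an input to a dummy output, plus a "cc" clobber. The function name delay_loop is illustrative, the sketch assumes a count greater than zero, and the volatile C loop is only a stand-in for non-ARM builds with no timing guarantee.

```c
#include <stdint.h>

/* Hypothetical replacement for the delayUS macro body: spins count times
 * and returns the final counter value (0). Assumes count > 0. */
static inline uint32_t delay_loop(uint32_t count)
{
#if defined(__arm__)
    __asm__ volatile(
        "1: subs %0, %0, #1   \n\t"
        "   bne  1b           \n\t"
        : "+r" (count)   /* read-write: the compiler knows count is changed */
        :
        : "cc");         /* subs updates the condition flags */
#else
    volatile uint32_t c = count; /* volatile keeps the loop from being removed */
    while (c != 0)
        c--;
    count = c;
#endif
    return count;
}
```

With this form, two consecutive calls each get a freshly initialized register, because the compiler no longer assumes the input survives the asm statement.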

ARM inline assembly code not working correctly at runtime

This code is meant to open an OpenSL ES audio recorder, capture a mono stream, copy the stream and process the left and right channels separately, then mix both channels into an output stream, which is later played back using OpenSL ES as well. The reason for the assembly code is that I found a bottleneck in the mixing function I had previously written in C, which was a simple for loop joining the left and right buffers into the output buffer.
The problem is quite weird. With the log calls in place, the output stream contains just what I want: the mixing of the left and right buffers works, and I can see it in the logs. But when I try to play the stream, the application crashes. The same happens whenever I comment out the logs. So I'm starting to think it has something to do with the registers I am using or something else in the assembly code. I am new to assembly, so is there something I am missing about ARM assembly?
Any idea why this is happening or how I should fix it?
Here is the code. The first function is the main function, which I use to capture sound and call the processing functions. The second, "mux", is the function with the inline assembly in it.
void start_playing()
{
OPENSL_STREAM *pStream;
int samps, i, j;
short inbuffer[VECSAMPS_MONO], outbuffer[VECSAMPS_STEREO];
pStream = android_OpenAudioDevice(SR,1,2,BUFFERFRAMES);
if(pStream == NULL)
{
return;
}
on = 1;
iLog = 0;
while (on)
{
samps = android_AudioInRaw(pStream,inbuffer,VECSAMPS_MONO); //audio recording
//signal processing process called here for left channel then for right channel (equalizing, etc)
mux(inbuffer, inbuffer, outbuffer,VECSAMPS_MONO); //Assembly mixing of left and right channel into output channel
//android_AudioOutRaw(pStream,outbuffer,samps*2);//audio playing
}
android_CloseAudioDevice(pStream);
}
//assembly function here
void mux(short *pLeftBuf, short *pRightBuf, short *pOutBuf, int vecsamps_mono)
{
int *pIter;
*pIter = vecsamps_mono / 4;
__android_log_print(ANDROID_LOG_INFO, "$$$$$$$$$$$$", "value : %d , %d , %d , %d",pLeftBuf[0],pLeftBuf[1], pRightBuf[0],pRightBuf[1]);
asm volatile(
"ldr r9, %[outbuf];"
"ldr r0, %[leftbuf];"
"ldr r1, %[rightbuf];"
"ldr r2, %[iter];"
"ldr r8, [r2];"
"loop: "
"ldr r2, [r0];"
"ldr r3, [r1];"
"ldr r7, =0xffff;"
"and r4, r2, r7;"
"and r5, r3, r7;"
"lsl r5, r5, #16;"
"orr r4, r4, r5;"
"lsl r7, r7, #16;"
"and r5, r2, r7;"
"and r6, r3, r7;"
"lsr r6, r6, #16;"
"orr r5, r5, r6;"
"str r4, [r9];"
"str r5, [r9, #4];"
"add r0, r0, #4;"
"add r1, r1, #4;"
"add r9, r9, #8;"
"subs r8, r8, #1;"
"bne loop"
:[outbuf] "=m" (pOutBuf)
:[leftbuf] "m" (pLeftBuf) ,[rightbuf] "m" (pRightBuf),[iter] "m" (pIter)
:"r0","r1","r2","r3","r4","r5","r8","r9","memory","cc"
);
__android_log_print(ANDROID_LOG_INFO, "##################", "value : %d , %d , %d , %d" ,*pOutBuf,*(pOutBuf+1),*(pOutBuf+2) ,*(pOutBuf+3));
}
Any suggestions?
This is the error I get in logcat:
01-14 11:41:40.992: A/libc(16161): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1), thread 16178 (Thread-4783)

*pIter = vecsamps_mono / 4;
...
"ldr r2, %[iter];"
"ldr r8, [r2];"
...
...[iter] "m" (pIter)
Maybe, just maybe, vecsamps_mono isn't 4 times a valid memory address.
Not that you even get that far, dereferencing an uninitialised pointer in that first line.
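Besides the uninitialised pIter, the asm in the question also uses r6 and r7 without listing them as clobbers. A hedged sketch of how the function could avoid both problems is to pass the pointers and the count as register operands, so nothing is dereferenced through a dangling pointer and the compiler allocates every register. The output layout here (left, right, left, right) is an assumption about what the mixing is meant to produce, the frames parameter counts mono samples, and the C loop is a fallback for non-ARM targets.

```c
/* Sketch only: interleaves frames samples from left and right into out,
 * which must hold 2 * frames samples. Assumed layout: L0 R0 L1 R1 ... */
static void mux(const short *left, const short *right, short *out, int frames)
{
    if (frames <= 0)
        return;
#if defined(__arm__)
    int l, r;
    __asm__ volatile(
        "1:                          \n\t"
        "ldrh %[l], [%[lp]], #2      \n\t" /* load left sample, advance */
        "ldrh %[r], [%[rp]], #2      \n\t" /* load right sample, advance */
        "strh %[l], [%[op]], #2      \n\t" /* store left, advance */
        "strh %[r], [%[op]], #2      \n\t" /* store right, advance */
        "subs %[n], %[n], #1         \n\t"
        "bne  1b                     \n\t"
        : [lp] "+r" (left), [rp] "+r" (right), [op] "+r" (out),
          [n] "+r" (frames), [l] "=&r" (l), [r] "=&r" (r)
        :
        : "cc", "memory");                 /* flags and buffers are modified */
#else
    for (int i = 0; i < frames; i++) {
        out[2 * i] = left[i];
        out[2 * i + 1] = right[i];
    }
#endif
}
```

The call in start_playing would then read mux(inbuffer, inbuffer, outbuffer, VECSAMPS_MONO), with no separate iteration pointer.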

Working of asm volatile ("" : : : "memory")

What does __asm__ __volatile__ () basically do, and what is the significance of "memory" on the ARM architecture?

asm volatile("" ::: "memory");
creates a compiler-level memory barrier, forcing the optimizer not to reorder memory accesses across the barrier.
For example, if you need to access some addresses in a specific order (probably because that memory area is actually backed by a device rather than by memory), you need to be able to tell the compiler; otherwise it may reorder your steps for the sake of efficiency.
Assume that in this scenario you must increment a value at an address, read something, and then increment another value at an adjacent address.
int c(int *d, int *e) {
int r;
d[0] += 1;
r = e[0];
d[1] += 1;
return r;
}
The problem is that the compiler (gcc in this case) can rearrange your memory accesses to get better performance if you ask for it (-O), probably leading to a sequence of instructions like the one below:
00000000 <c>:
0: 4603 mov r3, r0
2: c805 ldmia r0, {r0, r2}
4: 3001 adds r0, #1
6: 3201 adds r2, #1
8: 6018 str r0, [r3, #0]
a: 6808 ldr r0, [r1, #0]
c: 605a str r2, [r3, #4]
e: 4770 bx lr
Above, the values for d[0] and d[1] are loaded at the same time. Let's assume this is something you want to avoid; then you need to tell the compiler not to reorder memory accesses, which is what asm volatile("" ::: "memory") does.
int c(int *d, int *e) {
int r;
d[0] += 1;
r = e[0];
asm volatile("" ::: "memory");
d[1] += 1;
return r;
}
So you'll get your instruction sequence as you want it to be:
00000000 <c>:
0: 6802 ldr r2, [r0, #0]
2: 4603 mov r3, r0
4: 3201 adds r2, #1
6: 6002 str r2, [r0, #0]
8: 6808 ldr r0, [r1, #0]
a: 685a ldr r2, [r3, #4]
c: 3201 adds r2, #1
e: 605a str r2, [r3, #4]
10: 4770 bx lr
12: bf00 nop
It should be noted that this is only a compile-time memory barrier that stops the compiler from reordering memory accesses; it emits no extra hardware-level instructions to flush memory or to wait for loads or stores to complete. CPUs can still reorder memory accesses if they have the architectural capability and the memory addresses are of Normal type rather than Strongly-ordered or Device (ref).
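To make the compile-time versus hardware distinction concrete, here is a small sketch with both kinds of barrier. The macro and function names are illustrative. __atomic_thread_fence is a GCC/Clang builtin; which instruction it expands to depends on the target, so that detail is left out here.

```c
/* Compiler-only barrier: emits no instruction, only constrains the optimizer. */
#define compiler_barrier() __asm__ volatile("" ::: "memory")

/* Compiler and hardware barrier, via the GCC/Clang atomic builtin. */
#define full_barrier() __atomic_thread_fence(__ATOMIC_SEQ_CST)

static int payload;        /* illustrative shared data */
static volatile int ready; /* illustrative flag read by another observer */

/* Writes the payload, then raises the flag; the fence keeps both the
 * compiler and the CPU from making the flag visible before the payload. */
static void publish(int value)
{
    payload = value;
    full_barrier();
    ready = 1;
}

/* The function from the text, using the compiler-only barrier. */
static int c_ordered(int *d, int *e)
{
    int r;
    d[0] += 1;
    r = e[0];
    compiler_barrier(); /* the d[1] access is not moved above this point */
    d[1] += 1;
    return r;
}
```

If the other observer is a device or another core rather than the same thread, the compiler-only macro is not sufficient and something like full_barrier() is the safer assumption.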

This sequence is a compiler memory access scheduling barrier, as noted in the article referenced by Udo. This one is GCC specific - other compilers have other ways of describing them, some of them with more explicit (and less esoteric) statements.
__asm__ is a gcc extension that permits assembly language statements to be nested within your C code. It is used here for its ability to specify side effects that prevent the compiler from performing certain types of optimisations (which in this case might end up generating incorrect code).
__volatile__ is required to ensure that the asm statement itself is not reordered with respect to any other volatile accesses (a guarantee in the C language).
"memory" is an instruction to GCC that (sort of) says that the inline asm sequence has side effects on global memory, and hence that not just effects on local variables need to be taken into account.

The meaning is explained here:
http://en.wikipedia.org/wiki/Memory_ordering
Basically it implies that the assembly code will be executed where you expect it. It tells the compiler not to reorder memory accesses around it. That is, what is coded before this statement will be executed before it, and what is coded after will be executed after it.

static inline unsigned long arch_local_irq_save(void)
{
unsigned long flags;
asm volatile(
" mrs %0, cpsr # arch_local_irq_save\n"
" cpsid i" //disabled irq
: "=r" (flags) : : "memory", "cc");
return flags;
}