Address of stack pointer - arm

I have an example here. I checked in the debugger and saw that, at startup, SP holds 0x20001000. Why does the compiler use 0x20001000 as the initial value of the stack pointer? I created some data segments, but it is still 0x20001000.
soA     EQU     25                      ; number A
soB     EQU     30                      ; number B
        AREA    RESET, DATA, READONLY
        DCD     0x20001000
        DCD     Reset_Handler
        AREA    MYCODE, CODE, READONLY
        EXPORT  Reset_Handler
        ENTRY
Reset_Handler
        MOV     R0, #soA
        MOV     R1, #soB
KiemTra                                 ; check
        CMP     R0, R1
        BLT     NhoHon
        BEQ     KetThuc
LonHon                                  ; R0 is greater
        SUB     R0, R1
        B       KiemTra
NhoHon                                  ; R0 is smaller
        SUB     R1, R0
        B       KiemTra
KetThuc                                 ; finished
        END
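The two DCD words in the RESET area look like the start of a Cortex-M style vector table. Assuming that is the target (the question does not name the core), the hardware rather than the compiler loads SP from the first word and PC from the second at reset, which would explain why adding data segments does not change the value. A minimal C model of that fetch follows; `reset_state`, `cpu_reset` and the handler address used in the usage note are hypothetical and only for illustration.

```c
#include <stdint.h>

/* Hypothetical model of the core state right after reset. */
struct reset_state {
    uint32_t sp;  /* main stack pointer */
    uint32_t pc;  /* address of the first instruction */
};

/* Assumption: Cortex-M reset behavior. Word 0 of the vector table is the
   initial SP, word 1 is the reset handler address. */
struct reset_state cpu_reset(const uint32_t *vector_table)
{
    struct reset_state s;
    s.sp = vector_table[0];
    s.pc = vector_table[1] & ~1u;  /* bit 0 only marks Thumb state */
    return s;
}
```

With a table of `{ 0x20001000, 0x00000009 }`, the model yields SP = 0x20001000 no matter what other areas the image contains; changing the first DCD is what changes the initial SP.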

Related

ARM assembly program would not store the result in register 0

I need to write an ARM assembly program that iteratively sums, for each integer element of array_D, the element multiplied by 4 plus the next element, looping until the end of the array, which is signaled by 0. In each iteration, the current element x of array_D is replaced with the new value (x * 4 + [next element]). For instance, for array_D = [2020, -97, 2441, -11, 0], the final result stored in r0 should be (2020 * 4 - 97) + (-97 * 4 + 2441) + (2441 * 4 - 11) + (-11 * 4 + 0) = 19745, with array_D updated to [7983, 2053, 9753, -44, 0]. When it reaches the end of the array, the program exits the loop and terminates with the sum stored in register r0. Here is the program I've written:
# File : simple2.s------------------------
.data
array_D: .word 2020, -97, 2441, -11, 0
.text
.global main
main:
LDR r1,=array_D # load base addr. of array_D into r1
MOV r2, #0 # r2 as the array pointer
loop:
LDR r3, [r1,r2] # r3 as the array element
LSL r3, r3, #2 # multiply r3 by 4
MOV r4, r2 # copy r2 to r4
ADD r2, r2, #4 # r2 points to next element
LDR r5, [r1,r2] # r5 as the next element
ADD r3, r3, r5 # add r5 to r3
ADD r0, r0, r3 # sum the new element to r0 (change this to ADD r6, r6, r3 and it worked)
STR r3, [r1,r4] # store the new value to the array_D
CMP r5, #0 # test if the next element is 0
BNE loop # loop if not 0
SWI 0x11
.end
However, this program leaves zero in r0 instead of the final result (register view 1). When I used r6 to hold the result, it worked (register view 2). But the assignment requires the result to be in r0. What is wrong with my program?
register view 1
register view 2
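As a cross-check of the expected numbers, here is a minimal C sketch of the computation described in the question. It is not the asker's program: `sum_and_update` is a hypothetical name, and it uses a plain while loop, so unlike the assembly it also copes with an array whose first element is 0. Note that the accumulator is set to zero explicitly before the loop, whereas the assembly listing never writes r0 before the first ADD.

```c
#include <stddef.h>
#include <stdint.h>

/* Replace each element x with x * 4 + next, up to the 0 terminator,
   and return the sum of the replaced values. */
int32_t sum_and_update(int32_t *array_d)
{
    int32_t total = 0;   /* accumulator starts at zero explicitly */
    size_t i = 0;

    while (array_d[i] != 0) {
        int32_t next = array_d[i + 1];       /* still the original value */
        array_d[i] = array_d[i] * 4 + next;
        total += array_d[i];
        i++;
    }
    return total;
}
```

For the array from the question this gives 19745 and leaves the array as [7983, 2053, 9753, -44, 0], matching the figures quoted above.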

How do values get stored in registers in microprocessors?

I am learning ARM assembly language. I think register r3 should hold the value 55 (10 + 20 + 25), but it does not. Could someone explain why? I am emulating the PXA255 connex board.
.text
entry: b start
arr: .byte 10,20,25
eoa:
.align
start:
ldr r0, =eoa
ldr r1, =arr
mov r3, #0
loop: ldrb r2, [r1], #1
add r3, r2, r3
cmp r1, r0
bne loop
stop: b stop
(qemu) info registers
R00=00000007 R01=00000007 R02=00000019 R03=00000037
R04=00000000 R05=00000000 R06=00000000 R07=00000000
R08=00000000 R09=00000000 R10=00000000 R11=00000000
R12=00000000 R13=00000000 R14=00000000 R15=00000024
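One thing worth checking is the number base: the `info registers` dump prints hexadecimal. Below is a small C sketch of the same byte sum for comparison with the dump; `sum_bytes` is a hypothetical helper, not part of the original program.

```c
#include <stdint.h>

/* Sum the bytes in [start, end), mirroring the ldrb/add/cmp loop. */
uint32_t sum_bytes(const uint8_t *start, const uint8_t *end)
{
    uint32_t total = 0;
    while (start != end) {
        total += *start++;   /* ldrb r2, [r1], #1 followed by add r3, r2, r3 */
    }
    return total;
}
```

For the bytes 10, 20, 25 the result is 55 decimal, which is 0x37, the value shown for R03. Likewise R02=0x19 is 25, the last byte loaded.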

How to understand why an ARM exception happens?

I'm trying to understand the cause of an ARM exception that I am encountering.
It happens randomly during system startup and can show up in a few different ways.
One of the simplest is the following:
0x8004e810 in ti_sysbios_family_arm_a8_intcps_Hwi_vectors ()
#0 0x8004e810 in ti_sysbios_family_arm_a8_intcps_Hwi_vectors ()
#1 0x80002f04 in ti_sysbios_family_arm_exc_Exception_excHandlerDataAsm(int0_t) ()
at /home/rnd_share/sysbios/bios_6_51_00_15/packages/ti/sysbios/family/arm/exc/Exception_asm_gnu.asm:103
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
r0 0x20000197 536871319
r1 0x20000197 536871319
r2 0x20000197 536871319
r3 0x20000197 536871319
r4 0x20000197 536871319
r5 0x6 6
r6 0x80000024 2147483684
r7 0x80007a0c 2147514892
r8 0x8004f0a8 2147807400
r9 0x80041340 2147750720
r10 0x80040a3c 2147748412
r11 0xffffffff 4294967295
r12 0x20000197 536871319
sp 0x7fffff88 0x7fffff88
lr 0x80002f04 2147495684
pc 0x8004e810 0x8004e810 <ti_sysbios_family_arm_a8_intcps_Hwi_vectors+16>
cpsr 0x20000197 536871319
PC = 8004E810, CPSR = 20000197 (ABORT mode, ARM IRQ dis.)
R0 = 20000197, R1 = 20000197, R2 = 20000197, R3 = 20000197
R4 = 20000197, R5 = 00000006, R6 = 80000024, R7 = 80007A0C
USR: R8 =8004F0A8, R9 =80041340, R10=80040A3C, R11 =FFFFFFFF, R12 =20000197
R13=80212590, R14=80040A3C
FIQ: R8 =AEE1D6FA, R9 =C07BA930, R10=1B0B137A, R11 =7EC3F1DF, R12 =2000019F
R13=80065CF8, R14=00000000, SPSR=00000000
SVC: R13=4030CB20, R14=00022071, SPSR=00000000
ABT: R13=7FFFFF88, R14=80002F04, SPSR=20000197
IRQ: R13=F4ADFD8A, R14=80041020, SPSR=8000011F
UND: R13=80085CF8, R14=ED0F7EF1, SPSR=00000000
(gdb) frame
#0 0x8004e810 in ti_sysbios_family_arm_a8_intcps_Hwi_vectors ()
(gdb) frame 1
#1 0x80002f04 in ti_sysbios_family_arm_exc_Exception_excHandlerDataAsm(int0_t) ()
at /home/rnd_share/sysbios/bios_6_51_00_15/packages/ti/sysbios/family/arm/exc/Exception_asm_gnu.asm:103
103 mrc p15, #0, r12, c5, c0, #0 # read DFSR into r12
(gdb) list
98 .func ti_sysbios_family_arm_exc_Exception_excHandlerDataAsm__I
99
100 ti_sysbios_family_arm_exc_Exception_excHandlerDataAsm__I:
101 stmfd sp!, {r0-r12} # save r4-r12 while we're at it
102
103 mrc p15, #0, r12, c5, c0, #0 # read DFSR into r12
104 stmfd sp!, {r12} # save DFSR
105 mrc p15, #0, r12, c5, c0, #1 # read IFSR into r12
106 stmfd sp!, {r12} # save DFSR
107 mrc p15, #0, r12, c6, c0, #0 # read DFAR into r12
(gdb) monitor cp15 6 0 0 0
Reading CP15 register (6,0,0,0 = 0x7FFFFF54)
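The CPSR and SPSR values in the dump above can be decoded by hand. A minimal C sketch, using the standard ARM mode encodings and the I/F/T bit positions; the function and macro names are hypothetical:

```c
#include <stdint.h>

#define CPSR_MODE_MASK 0x1Fu
#define CPSR_T_BIT     (1u << 5)   /* Thumb state */
#define CPSR_F_BIT     (1u << 6)   /* FIQ disabled */
#define CPSR_I_BIT     (1u << 7)   /* IRQ disabled */

/* Map the low five CPSR bits to a processor mode name. */
const char *cpsr_mode_name(uint32_t cpsr)
{
    switch (cpsr & CPSR_MODE_MASK) {
    case 0x10: return "USR";
    case 0x11: return "FIQ";
    case 0x12: return "IRQ";
    case 0x13: return "SVC";
    case 0x17: return "ABT";
    case 0x1B: return "UND";
    case 0x1F: return "SYS";
    default:   return "unknown";
    }
}

int cpsr_irq_disabled(uint32_t cpsr) { return (cpsr & CPSR_I_BIT) != 0; }
int cpsr_fiq_disabled(uint32_t cpsr) { return (cpsr & CPSR_F_BIT) != 0; }
int cpsr_thumb(uint32_t cpsr)        { return (cpsr & CPSR_T_BIT) != 0; }
```

For 0x20000197 this gives ABT mode, ARM state, IRQ disabled and FIQ enabled, which agrees with the "ABORT mode, ARM IRQ dis." line printed by the debugger.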
My understanding is that an exception was already being handled, as can be seen in frame 1.
The handler tries to save registers onto the stack:
101 stmfd sp!, {r0-r12} # save r4-r12 while we're at it
But the stack pointer was invalid:
ABT: R13=7FFFFF88
There are two things I don't understand:
1. What can cause such SP values in the ABT and IRQ contexts?
2. What is actually in frame 0? In other words, how did the Cortex react to a data abort while it was already in an exception handler?
This device usually starts normally; the problem occurs about 3 times in 10 boots. It never happens when starting from the debugger, only with the release build and only when started from the bootloader.
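The numbers in the dump seem consistent with that reading. Storing r0-r12 (13 words) below ABT R13 = 0x7FFFFF88 would start at 0x7FFFFF54, which is the value read back from CP15 register c6 (the DFAR, going by the handler's own comments). A small C sketch of the arithmetic; the function name is hypothetical:

```c
#include <stdint.h>

/* Lowest address written by "stmfd sp!, {...}" with reg_count registers.
   Full descending stack: the block occupies [sp - 4*n, sp - 4]. */
uint32_t stmfd_lowest_address(uint32_t sp, unsigned reg_count)
{
    return sp - 4u * reg_count;
}
```

`stmfd_lowest_address(0x7FFFFF88, 13)` evaluates to 0x7FFFFF54, so the fault address matches the first word the handler would try to push.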
Two weeks later...
The boot procedure is as follows:
1. The 2nd stage bootloader loads the application into memory.
2. The 2nd stage bootloader jumps to the application start.
3. The main function of the application is entered.
It turns out that statically initialized values in the application are sometimes correct after step 1 but corrupted by step 3. In other words, the application image in memory is corrupted.
The caches were not flushed correctly between steps 1 and 2.
Disabling the caches in the 2nd stage bootloader made the problem go away completely.
Now I need to fix it properly.

PMU counters in ARM11

I am programming a Raspberry Pi Model B (ARM1176) bare metal, in assembly and C. I need to count the clock cycles used to execute a piece of assembly code.
I am using the following code for the PMU counter:
mov r0,#1
MCR p15, 0, r0, c15, c12, 0 ; Write Performance Monitor Control Register
/* Reset Cycle Counter */
mov r0,#5
MCR p15, 0, r0, c15, c12, 0 ; Write Performance Monitor Control Register
/* Measure */
MRC p15, 0, r0, c15, c12, 1 # Read Cycle Counter Register
<MY CODES>
MRC p15, 0, r1, c15, c12, 1 # Read Cycle Counter Register
With this, if I have
add r3,#3
in place of my code, I get r1=8 and r0=0, which seems correct, since the ARM11 has 8 pipeline stages and the instruction takes 8 clock cycles to execute.
But when I add more instructions, I get ridiculous results, such as
add r3,#3
add r4,#1
r0=0, r1=97/96/94 (the result in r1 should also be constant!)
I am using the UART to view the register values in minicom.
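For reference, the two values written to the control register and the subtraction of the two counter reads can be modeled in C. This is only a sketch: the bit names are assumptions based on the usual ARM1176 layout of this register (bit 0 enables the counters, bit 2 resets the cycle counter), and `elapsed_cycles` is a hypothetical helper.

```c
#include <stdint.h>

/* Assumed ARM1176 Performance Monitor Control Register bits. */
#define PMCR_ENABLE       (1u << 0)  /* E: enable the counters */
#define PMCR_RESET_CYCLE  (1u << 2)  /* C: reset the cycle counter */

/* Cycle delta between two reads of the 32-bit cycle counter.
   Unsigned arithmetic is modulo 2^32, so one wraparound is harmless. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}
```

Under that assumption, the first write (1) is `PMCR_ENABLE` and the second (5) is `PMCR_ENABLE | PMCR_RESET_CYCLE`. Taking the difference of the two reads, rather than looking at r1 alone, also removes any dependence on the first read being exactly 0.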
Okay, seeing the same thing, that is very interesting.
# nop
.globl test
test:
mov r0,#1
MCR p15, 0, r0, c15, c12, 0
mov r0,#5
MCR p15, 0, r0, c15, c12, 0
MRC p15, 0, r0, c15, c12, 1
add r3,#3
add r2,#1
MRC p15, 0, r1, c15, c12, 1
sub r0,r1,r0
bx lr
I am calling this from C, so if I touched r4 in the code under test I would have to save it on the stack; I used r2 instead. Without the add r2 line the return value was 8; with the add r2 line the return value was 0x68, then 0x65. Note this is on a Pi Zero, so some clocks are a little faster than yours.
Remember this is running from dram and dram is painfully slow. So you may be seeing some of that.
Initial alignment of the code:
00008024 <test>:
8024: e3a00001 mov r0, #1
8028: ee0f0f1c mcr 15, 0, r0, cr15, cr12, {0}
802c: e3a00005 mov r0, #5
8030: ee0f0f1c mcr 15, 0, r0, cr15, cr12, {0}
8034: ee1f0f3c mrc 15, 0, r0, cr15, cr12, {1}
8038: e2833003 add r3, r3, #3
803c: e2822001 add r2, r2, #1
8040: ee1f1f3c mrc 15, 0, r1, cr15, cr12, {1}
8044: e0410000 sub r0, r1, r0
8048: e12fff1e bx lr
Yep, if I uncomment the nop in front of .globl test and comment out the add r2, I only have the add r3 as the code under test, but the nop shifts the alignment of the whole block of code. With the add r3 and no nop I get 8 counts; with the add r3 and the nop I get 0x67 counts.
So I think this is just a case of measuring the fetch. I have not enabled the ARM cache, but there may be a deeper cache or an MMU or something else, since this RAM is shared between the ARM and the GPU.
If I go one step further and uncomment the nop while keeping both the add r3 and the add r2, it is 0x69 counts, basically on par with or barely longer than one instruction, so we forced a fetch in there.
So in my case, if I add more nops so that the initial read of the counter is aligned on an 8-word boundary, and I have the two instructions being measured:
00008030 <test>:
8030: e3a00001 mov r0, #1
8034: ee0f0f1c mcr 15, 0, r0, cr15, cr12, {0}
8038: e3a00005 mov r0, #5
803c: ee0f0f1c mcr 15, 0, r0, cr15, cr12, {0}
8040: ee1f0f3c mrc 15, 0, r0, cr15, cr12, {1}
8044: e2833003 add r3, r3, #3
8048: e2822001 add r2, r2, #1
804c: ee1f1f3c mrc 15, 0, r1, cr15, cr12, {1}
8050: e0410000 sub r0, r1, r0
8054: e12fff1e bx lr
I get a count of 8. I put a third instruction in there (an add r3 and two add r2s): still a count of 8.
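One way to compare the two layouts is to check whether the measured window, from the first counter read to the second, stays inside a single aligned 8-word block. A minimal C sketch; the 32-byte line size is an assumption taken from the "8 word boundary" remark, and `same_fetch_line` is a hypothetical helper:

```c
#include <stdint.h>

/* Assumed line size: 8 words of 4 bytes each. */
#define LINE_BYTES 32u

/* Nonzero when the instructions at byte addresses first and last
   fall in the same aligned line. */
int same_fetch_line(uint32_t first, uint32_t last)
{
    return (first / LINE_BYTES) == (last / LINE_BYTES);
}
```

In the original layout the two mrc instructions sit at 0x8034 and 0x8040, which are in different lines; in the padded layout they sit at 0x8040 and 0x804c, which share one.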
If I go back to this layout, where at least part of it is in a different fetch line:
00008024 <test>:
8024: e3a00001 mov r0, #1
8028: ee0f0f1c mcr 15, 0, r0, cr15, cr12, {0}
802c: e3a00005 mov r0, #5
8030: ee0f0f1c mcr 15, 0, r0, cr15, cr12, {0}
8034: ee1f0f3c mrc 15, 0, r0, cr15, cr12, {1}
8038: e2833003 add r3, r3, #3
803c: e2822001 add r2, r2, #1
8040: ee1f1f3c mrc 15, 0, r1, cr15, cr12, {1}
8044: e0410000 sub r0, r1, r0
8048: e12fff1e bx lr
And if I do three runs without changing anything, then enable the L1 instruction cache and do three more runs, I get:
00000068
0000001D
0000001D
0000001F
00000008
00000008
So I think you are dealing with DRAM, which is slow, fetch lines, cache misses and hits, and the resulting cache line fetches.
If you were expecting to see the number of clocks it took to execute an instruction, you won't. You don't have zero wait state memory unless you can keep the entire code under test in the L1 cache.
I don't think there is on-chip SRAM that you can use for this kind of thing on this chip/board; you are going to end up hitting DRAM, and that DRAM is shared with the GPU. So basically, program execution time is not expected to be deterministic. As with your computer or phone, the CPU is not the bottleneck and has not been for a long time; it is sitting around waiting to be fed data or instructions.

GCC asm inline constraints, conflicting register allocation

I've written some ARM inline assembler code.
Looking in Semaphore.s, I see that GCC is using register r3 for two variables: "success" and "change". I wonder if there is a problem with my constraints?
First, the most relevant code lines:
asm inline:
"1: MVN %[success], #0 # success=TRUE=~FALSE\n\t"
"LDREX %[value], %[signal] # try to get exclusive access\n\t"
"ADDS %[newValue], %[value], %[change] # new value = value + change\n\t"
constraints:
: [signal] "+m" (signal), [success] "=r" (success), [locked] "=r" (locked), [newValue] "=r" (newValue), [value] "=r" (value)
: [borderValue] "r" (borderValue), [change] "r" (change)
: "cc"
symbol file:
1: MVN r3, #0 # success=TRUE=~FALSE
LDREX r0, [r7, #12] # try to get exclusive access
ADDS r1, r0, r3 # new value = value + change
More of the source and the generated symbol file is below.
BOOLEAN Semaphore_exclusiveChange (INT32U * signal, INT32S change, INT32U borderValue)
{
BOOLEAN success;
INT32U locked;// exclusive status
INT32U newValue;
INT32U value;
asm (
"1: MVN %[success], #0 # success=TRUE=~FALSE\n\t"
"LDREX %[value], %[signal] # new to get exclusive access\n\t"
"ADDS %[newValue], %[value], %[change] # new value = value + change\n\t"
"ITE MI # if (new value<0) \n\t"
" SUBSMI %[newValue], %[newValue] # (new value<0): new value=0, set zero flag \n\t"
"# else\n\t"
" CMPPL %[newValue], %[borderValue] # (new value>=0): if new value > border value \n\t"
"\n\t# zero flag is either: new value=0 or =bordervalue\n\t"
"ITE HI # if new signal level > border value \n\t" //
" MOVHI %[success], #0 # fail to raise signal, success=FALSE \n\t"
"\t# else\n\t"
" MOVLS %[value], %[newValue] # use new value \n\t" // ok
"STREX %[locked], %[value], %[signal] # new exclusive store of value\n\t"
"TST %[locked],%[locked] # is locked? \n\t"
"IT NE # if locked \n\t"
"BNE 1b # try again\n\t"
"DMB # memory barrier\n\t" //
: [signal] "+m" (signal), [success] "=r" (success), [locked] "=r" (locked), [newValue] "=r" (newValue), [value] "=r" (value)
: [borderValue] "r" (borderValue), [change] "r" (change)
: "cc" );
return success;
}
Relevant text from symbol file:
Semaphore_exclusiveChange:
.LFB2:
.loc 1 10 0
# args = 0, pretend = 0, frame = 32
# frame_needed = 1, uses_anonymous_args = 0
# link register save eliminated.
push {r7}
.LCFI0:
sub sp, sp, #36
.LCFI1:
add r7, sp, #0
.LCFI2:
str r0, [r7, #12]
str r1, [r7, #8]
str r2, [r7, #4]
.loc 1 16 0
ldr r2, [r7, #4]
ldr r3, [r7, #8]
# 16 "../drivers/Semaphore.c" 1
1: MVN r3, #0 # success=TRUE=~FALSE
LDREX r0, [r7, #12] # new to get exclusive access
ADDS r1, r0, r3 # new value = value + change
ITE MI # if (new value<0)
SUBSMI r1, r1 # (new value<0): new value=0, set zero flag
# else
CMPPL r1, r2 # (new value>=0): if new value > border value
# zero flag is either: new value=0 or =bordervalue
ITE HI # if new signal level > border value
MOVHI r3, #0 # fail to raise signal, success=FALSE
# else
MOVLS r0, r1 # use new value
STREX r2, r0, [r7, #12] # new exclusive store of value
TST r2,r2 # is locked?
IT NE # if locked
BNE 1b # try again
DMB # memory barrier
# 0 "" 2
.thumb
strb r3, [r7, #19]
str r2, [r7, #20]
str r1, [r7, #24]
str r0, [r7, #28]
.loc 1 38 0
ldrb r3, [r7, #19] # zero_extendqisi2
.loc 1 39 0
mov r0, r3
add r7, r7, #36
mov sp, r7
pop {r7}
bx lr
You need to constrain "success" further with '&':
: [signal] "+m" (signal), [success] "=&r" (success), [locked] "=r" (locked), [newValue] "=r" (newValue), [value] "=r" (value)
which marks it as an "early clobber". Otherwise the compiler assumes that all outputs are produced only after all inputs have been consumed, so it is free to assign the same register to an output and an unrelated input.
If you have an input/output value, you need to use the "repeating value" (matching) constraint, such as "0", or the "+" modifier.
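The effect of the shared register can be shown with a toy model in portable C. This is not GCC's allocator, only a sketch that replays the first three instructions of the asm with a chosen register assignment; `regs`, `run_prefix` and the alternative register number 2 for "change" are hypothetical.

```c
#include <stdint.h>

/* Toy register file; indices are register numbers as in the listing. */
struct regs { uint32_t r[4]; };

/* Replay MVN / LDREX / ADDS with the given register assignment and
   return newValue. mem_value stands for the word that LDREX loads. */
uint32_t run_prefix(unsigned success_reg, unsigned change_reg,
                    uint32_t mem_value, int32_t change)
{
    struct regs cpu = { {0, 0, 0, 0} };
    cpu.r[change_reg] = (uint32_t)change;     /* input loaded before the asm */
    cpu.r[success_reg] = ~0u;                 /* MVN success, #0 */
    cpu.r[0] = mem_value;                     /* LDREX value, [signal] */
    cpu.r[1] = cpu.r[0] + cpu.r[change_reg];  /* ADDS newValue, value, change */
    return cpu.r[1];
}
```

With both operands in r3, as in the listing, a loaded value of 10 and a change of 5 give 9 instead of 15, because the MVN has already overwritten "change". With distinct registers, which is what "=&r" requests, the result is 15. The same listing also shows r2 shared by "locked" and "borderValue", so the other outputs written before the retry branch may need the same treatment.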

Resources