Compiling and running ARM assembly binary on Cortex-M4 (simulated in QEMU)

I successfully compiled and ran an ARM binary on the QEMU-emulated connex embedded system using this procedure:
arm-none-eabi-as -o program.o program.s
arm-none-eabi-ld -Ttext=0x0 -o program.elf program.o
arm-none-eabi-objcopy -O binary program.elf program.bin
dd if=/dev/zero of=flash.bin bs=4096 count=4096
dd if=program.bin of=flash.bin bs=4096 conv=notrunc
qemu-system-arm -M connex -pflash flash.bin -nographic -serial /dev/null
In line four I create a zero-filled image that represents the flash, and in line five I copy my binary into it.
This works like a charm, but it simulates an entire embedded system, while I only want to simulate an ARM core, for example a Cortex-M4. This is why I am trying to use qemu-arm instead of qemu-system-arm.
So I first tried to compile and run my program like this (lines 1-3 are the same as above):
arm-none-eabi-as -o program.o program.s
arm-none-eabi-ld -Ttext=0x0 -o program.elf program.o
arm-none-eabi-objcopy -O binary program.elf program.bin
qemu-arm -cpu cortex-m4 program.bin
And this doesn't work - it says:
Error while loading program.bin: Exec format error
So I tried to create a flash image as before (because that worked):
arm-none-eabi-as -o program.o program.s
arm-none-eabi-ld -Ttext=0x0 -o program.elf program.o
arm-none-eabi-objcopy -O binary program.elf program.bin
dd if=/dev/zero of=flash.bin bs=4096 count=4096
dd if=program.bin of=flash.bin bs=4096 conv=notrunc
qemu-arm -cpu cortex-m4 flash.bin
And I get this:
Error while loading flash.bin: Permission denied
Can anyone help me a bit? Using sudo doesn't help.

qemu-arm's purpose is not "simulate just an ARM core". It is "run a single Linux binary", and it expects that the binary file you provide it is a Linux format ELF executable. Trying to feed it something else is not going to work.
Since Linux assumes A-profile cores, not M-profile cores, anything you do with -cpu cortex-m4 on qemu-arm will only be working by luck, not deliberately. (We don't disable those CPU types since there are some GCC test case scenarios that use semihosting which sort-of-work and which we don't want to deliberately break. But those are working as much by luck as anything else.)
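The "Exec format error" for program.bin fits this explanation: objcopy -O binary writes a raw memory image with no ELF header, so there is nothing for an ELF loader to recognize. Below is a minimal sketch of the kind of magic-number check a loader starts with. It is not QEMU's actual code, and the function name looks_like_elf is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the buffer starts with the 4-byte ELF magic, else 0.
 * Hypothetical helper for illustration only. */
int looks_like_elf(const uint8_t *buf, size_t len)
{
    static const uint8_t magic[4] = { 0x7f, 'E', 'L', 'F' };

    return len >= sizeof magic && memcmp(buf, magic, sizeof magic) == 0;
}
```

A file such as program.elf would pass this check, while the output of objcopy -O binary starts directly with instruction bytes and would not.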

Compared to a microcontroller build, you need an entry point (and the code has to be placed in RAM).
start.s
.thumb
.thumb_func
.global _start
_start:
#mov r0,=0x10000
#mov sp,r0
bl notmain
mov r7,#0x1
mov r0,#0
swi #0
.word 0xFFFFFFFF
b .
.thumb_func
.globl PUT32
PUT32:
str r1,[r0]
bx lr
.thumb_func
.globl GET32
GET32:
ldr r0,[r0]
bx lr
.thumb_func
.globl dummy
dummy:
bx lr
.thumb_func
.globl write
write:
push {r7,lr}
mov r7,#0x04
swi 0
pop {r7,pc}
b .
.end
notmain.c
void PUT32 ( unsigned int, unsigned int );
unsigned int GET32 ( unsigned int );
void dummy ( unsigned int );
void write ( unsigned int, char *, unsigned int );
int notmain ( void )
{
//unsigned int ra;
//for(ra=0;ra<1000;ra++) dummy(ra);
write(1,"Hello\n",6);
return(0);
}
hello.ld
ENTRY(_start)
MEMORY
{
ram : ORIGIN = 0x00010000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ram
.rodata : { *(.rodata*) } > ram
.bss : { *(.bss*) } > ram
}
build
arm-none-eabi-as --warn --fatal-warnings start.s -o start.o
arm-none-eabi-gcc -Wall -Werror -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -c notmain.c -o notmain.o
arm-none-eabi-ld -o notmain.elf -T hello.ld start.o notmain.o
arm-none-eabi-objdump -D notmain.elf > notmain.list
arm-none-eabi-objcopy notmain.elf notmain.bin -O binary
run
qemu-arm -d in_asm,cpu,cpu_reset -D hello -cpu cortex-m4 notmain.elf
Hello
dump log
cat hello
CPU Reset (CPU 0)
R00=00000000 R01=00000000 R02=00000000 R03=00000000
R04=00000000 R05=00000000 R06=00000000 R07=00000000
R08=00000000 R09=00000000 R10=00000000 R11=00000000
R12=00000000 R13=00000000 R14=00000000 R15=00000000
PSR=40000000 -Z-- A usr26
CPU Reset (CPU 0)
R00=00000000 R01=00000000 R02=00000000 R03=00000000
R04=00000000 R05=00000000 R06=00000000 R07=00000000
R08=00000000 R09=00000000 R10=00000000 R11=00000000
R12=00000000 R13=00000000 R14=00000000 R15=00000000
PSR=40000010 -Z-- A usr32
Reserved 0xf7000000 bytes of guest address space
host mmap_min_addr=0x10000
guest_base 0x7f4347fb4000
start end size prot
00010000-00011000 00001000 r-x
f67ff000-f6800000 00001000 ---
f6800000-f7000000 00800000 rw-
start_brk 0x00000000
end_code 0x00010044
start_code 0x00010000
start_data 0x00010044
end_data 0x00010044
start_stack 0xf6fff350
brk 0x00010044
entry 0x00010001
----------------
IN:
0x00010000: f000 f810 bl 0x10024
R00=00000000 R01=f6fff4c2 R02=00000000 R03=00000000
R04=00000000 R05=00000000 R06=00000000 R07=00000000
R08=00000000 R09=00000000 R10=00010044 R11=00000000
R12=00000000 R13=f6fff350 R14=00000000 R15=00010000
PSR=00000030 ---- T usr32
----------------
IN: notmain
0x00010024: b508 push {r3, lr}
0x00010026: 2001 movs r0, #1
0x00010028: 4903 ldr r1, [pc, #12] (0x10038)
0x0001002a: 2206 movs r2, #6
0x0001002c: f7ff fff5 bl 0x1001a
R00=00000000 R01=f6fff4c2 R02=00000000 R03=00000000
R04=00000000 R05=00000000 R06=00000000 R07=00000000
R08=00000000 R09=00000000 R10=00010044 R11=00000000
R12=00000000 R13=f6fff350 R14=00010005 R15=00010024
PSR=00000030 ---- T usr32
----------------
IN:
0x0001001a: b580 push {r7, lr}
0x0001001c: 2704 movs r7, #4
0x0001001e: df00 svc 0
R00=00000001 R01=0001003c R02=00000006 R03=00000000
R04=00000000 R05=00000000 R06=00000000 R07=00000000
R08=00000000 R09=00000000 R10=00010044 R11=00000000
R12=00000000 R13=f6fff348 R14=00010031 R15=0001001a
PSR=00000030 ---- T usr32
----------------
IN:
0x00010020: bd80 pop {r7, pc}
R00=00000006 R01=0001003c R02=00000006 R03=00000000
R04=00000000 R05=00000000 R06=00000000 R07=00000004
R08=00000000 R09=00000000 R10=00010044 R11=00000000
R12=00000000 R13=f6fff340 R14=00010031 R15=00010020
PSR=00000030 ---- T usr32
----------------
IN: notmain
0x00010030: 2000 movs r0, #0
0x00010032: bc08 pop {r3}
0x00010034: bc02 pop {r1}
0x00010036: 4708 bx r1
R00=00000006 R01=0001003c R02=00000006 R03=00000000
R04=00000000 R05=00000000 R06=00000000 R07=00000000
R08=00000000 R09=00000000 R10=00010044 R11=00000000
R12=00000000 R13=f6fff348 R14=00010031 R15=00010030
PSR=00000030 ---- T usr32
----------------
IN:
0x00010004: 2701 movs r7, #1
0x00010006: 2000 movs r0, #0
0x00010008: df00 svc 0
R00=00000000 R01=00010005 R02=00000006 R03=00000000
R04=00000000 R05=00000000 R06=00000000 R07=00000000
R08=00000000 R09=00000000 R10=00010044 R11=00000000
R12=00000000 R13=f6fff350 R14=00010031 R15=00010004
PSR=40000030 -Z-- T usr32
It gets unhappy if you touch the stack pointer, so don't...
Thanks for pointing out this program; I wasn't aware of it and am going to have some fun with it...
EDIT
Sorry, you just wanted assembly.
start.s
.thumb
.thumb_func
.global _start
_start:
mov r4,#10
top:
nop
sub r4,#1
bne top
mov r7,#0x1
mov r0,#0
swi #0
.word 0xFFFFFFFF
b .
.end
linker script above
build
arm-none-eabi-as --warn --fatal-warnings start.s -o start.o
arm-none-eabi-ld -o notmain.elf -T hello.ld start.o
arm-none-eabi-objdump -D notmain.elf > notmain.list
arm-none-eabi-objcopy notmain.elf notmain.bin -O binary
run
qemu-arm -d in_asm,cpu,cpu_reset -D hello -cpu cortex-m4 notmain.elf
dump log
cat hello
CPU Reset (CPU 0)
R00=00000000 R01=00000000 R02=00000000 R03=00000000
R04=00000000 R05=00000000 R06=00000000 R07=00000000
R08=00000000 R09=00000000 R10=00000000 R11=00000000
R12=00000000 R13=00000000 R14=00000000 R15=00000000
PSR=40000000 -Z-- A usr26
CPU Reset (CPU 0)
R00=00000000 R01=00000000 R02=00000000 R03=00000000
R04=00000000 R05=00000000 R06=00000000 R07=00000000
R08=00000000 R09=00000000 R10=00000000 R11=00000000
R12=00000000 R13=00000000 R14=00000000 R15=00000000
PSR=40000010 -Z-- A usr32
Reserved 0xf7000000 bytes of guest address space
host mmap_min_addr=0x10000
guest_base 0x7f36110fc000
start end size prot
00010000-00011000 00001000 r-x
f67ff000-f6800000 00001000 ---
f6800000-f7000000 00800000 rw-
start_brk 0x00000000
end_code 0x00010014
start_code 0x00010000
start_data 0x00010014
end_data 0x00010014
start_stack 0xf6fff350
brk 0x00010014
entry 0x00010001
----------------
IN:
0x00010000: 240a movs r4, #10
0x00010002: 46c0 nop (mov r8, r8)
0x00010004: 3c01 subs r4, #1
0x00010006: d1fc bne.n 0x10002
R00=00000000 R01=f6fff4c2 R02=00000000 R03=00000000
R04=00000000 R05=00000000 R06=00000000 R07=00000000
R08=00000000 R09=00000000 R10=00010014 R11=00000000
R12=00000000 R13=f6fff350 R14=00000000 R15=00010000
PSR=00000030 ---- T usr32
----------------
IN:
0x00010002: 46c0 nop (mov r8, r8)
0x00010004: 3c01 subs r4, #1
0x00010006: d1fc bne.n 0x10002
R00=00000000 R01=f6fff4c2 R02=00000000 R03=00000000
R04=00000009 R05=00000000 R06=00000000 R07=00000000
R08=00000000 R09=00000000 R10=00010014 R11=00000000
R12=00000000 R13=f6fff350 R14=00000000 R15=00010002
PSR=20000030 --C- T usr32
R00=00000000 R01=f6fff4c2 R02=00000000 R03=00000000
R04=00000008 R05=00000000 R06=00000000 R07=00000000
R08=00000000 R09=00000000 R10=00010014 R11=00000000
R12=00000000 R13=f6fff350 R14=00000000 R15=00010002
PSR=20000030 --C- T usr32
----------------
IN:
0x00010008: 2701 movs r7, #1
0x0001000a: 2000 movs r0, #0
0x0001000c: df00 svc 0
R00=00000000 R01=f6fff4c2 R02=00000000 R03=00000000
R04=00000000 R05=00000000 R06=00000000 R07=00000000
R08=00000000 R09=00000000 R10=00010014 R11=00000000
R12=00000000 R13=f6fff350 R14=00000000 R15=00010008
PSR=60000030 -ZC- T usr32


How the value get stored in registers in microprocessors

I am learning ARM assembly language. I think register r3 should hold the value 55 (10 + 20 + 25), but it does not. Could someone explain why? I am emulating the PXA255 connex board.
.text
entry: b start
arr: .byte 10,20,25
eoa:
.align
start:
ldr r0, =eoa
ldr r1, =arr
mov r3, #0
loop: ldrb r2, [r1], #1
add r3, r2, r3
cmp r1, r0
bne loop
stop: b stop
(qemu) info registers
R00=00000007 R01=00000007 R02=00000019 R03=00000037
R04=00000000 R05=00000000 R06=00000000 R07=00000000
R08=00000000 R09=00000000 R10=00000000 R11=00000000
R12=00000000 R13=00000000 R14=00000000 R15=00000024
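One thing worth checking in the dump above: info registers prints values in hexadecimal, and 0x37 is 55 in decimal, so R03 appears to hold the expected sum (R02=0x19 is 25, the last byte loaded). A minimal C sketch of the same loop, with sum_bytes as a made-up name, makes the comparison easy:

```c
#include <stdint.h>

/* Sum the bytes in [begin, end), mirroring the ldrb/add/cmp/bne loop.
 * Hypothetical helper for illustration only. */
uint32_t sum_bytes(const uint8_t *begin, const uint8_t *end)
{
    uint32_t sum = 0;

    for (const uint8_t *p = begin; p != end; ++p)
        sum += *p;              /* add r3, r2, r3 */
    return sum;
}
```

Calling it on the array {10, 20, 25} gives 55, which is 0x37 when printed with a %x format.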

ARM-v8 NEON: is there an instruction to split a single normal register across multiple lanes of a NEON register?

I'm new to ARM-v8 (AArch64) and only did a little bit of NEON coding in ARM-v7 (but I'm very comfortable with A32 and ok(*) with normal A64).
Ultimately what I'm trying to do is count the frequency of each set bit [31:0] in a bunch (up to 15) of 32-bit values. I.e in these 15 values, how many times is bit 0 set, how many times is bit 1 set, etc.
So, what I'd like to do is split the 32 bits over 32 nibbles in a 128 bit NEON register and then accumulate the NEON register, like this:
// args(x0: ptr to array of 16 32-bit words) ret(v0: sum of set bits as 32 nibbles)
mov w2, 16 // w2: loop counter
mov v0, 0 // v0: accumulate count
1:
ldr w1, [x0], 4
split v1, w1 // here some magic occurs
add v0.16b, v0.16b, v1.16b
subs w2, w2, 1
bne 1b
I'm not having much luck with the ARM documentation. The ARMv8-ARM just has an alphabetical listing of the 354 NEON instructions, (800 pages of pseudocode). The ARMv8-A Programmer's guide only has 14 pages of introduction and the enticing statement "New lane insert and extract instructions have been added to support the new register packing scheme." And the NEON Programmer's Guide is about ARM-v7.
Assuming there isn't a single instruction to do that, what would be the most efficient way of doing it? -- Not looking for a complete solution, but can NEON help at all? There wouldn't be much point if I have to load each lane separately...
(*) Can't say I like A64 though. :-(
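As a scalar reference for what the vector code is meant to compute, here is a minimal C sketch. The name count_set_bits and the choice of one uint8_t counter per bit are assumptions for illustration (15 samples fit comfortably); it can be used to check the output of any NEON version.

```c
#include <stddef.h>
#include <stdint.h>

/* counts[i] receives the number of words in src[0..n) that have bit i set.
 * Hypothetical reference implementation, not tuned for speed. */
void count_set_bits(const uint32_t *src, size_t n, uint8_t counts[32])
{
    for (int bit = 0; bit < 32; ++bit)
        counts[bit] = 0;

    for (size_t i = 0; i < n; ++i)
        for (int bit = 0; bit < 32; ++bit)
            counts[bit] = (uint8_t)(counts[bit] + ((src[i] >> bit) & 1u));
}
```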
Think outside the box: the source data being 32 bits wide doesn't mean you have to access it 32 bits at a time.
Reading it as 4 x 8 bits simplifies the problem considerably. The code below splits and counts each of the 32 bits across the array:
/*
* alqCountBits.S
*
* Created on: 2020. 5. 26.
* Author: Jake 'Alquimista' LEE
*/
.arch armv8-a
.global alqCountBits
.text
// extern void alqCountBits(uint32_t *pDst, uint32_t *pSrc, uint32_t nLength);
// assert(nLength % 2 == 0);
pDst .req x0
pSrc .req x1
length .req w2
.balign 64
.func
alqCountBits:
adr x3, .LShiftTable
movi v30.16b, #1
ld1r {v31.2d}, [x3]
movi v0.16b, #0
movi v1.16b, #0
movi v2.16b, #0
movi v3.16b, #0
movi v4.16b, #0
movi v5.16b, #0
movi v6.16b, #0
movi v7.16b, #0
.balign 64
1:
ld4r {v16.8b, v17.8b, v18.8b, v19.8b}, [pSrc], #4
ld4r {v20.8b, v21.8b, v22.8b, v23.8b}, [pSrc], #4
subs length, length, #2
trn1 v24.2d, v16.2d, v17.2d
trn1 v25.2d, v18.2d, v19.2d
trn1 v26.2d, v20.2d, v21.2d
trn1 v27.2d, v22.2d, v23.2d
ushl v16.16b, v24.16b, v31.16b
ushl v17.16b, v25.16b, v31.16b
ushl v18.16b, v26.16b, v31.16b
ushl v19.16b, v27.16b, v31.16b
and v16.16b, v16.16b, v30.16b
and v17.16b, v17.16b, v30.16b
and v18.16b, v18.16b, v30.16b
and v19.16b, v19.16b, v30.16b
uaddl v24.8h, v18.8b, v16.8b
uaddl2 v25.8h, v18.16b, v16.16b
uaddl v26.8h, v19.8b, v17.8b
uaddl2 v27.8h, v19.16b, v17.16b
uaddw v0.4s, v0.4s, v24.4h
uaddw2 v1.4s, v1.4s, v24.8h
uaddw v2.4s, v2.4s, v25.4h
uaddw2 v3.4s, v3.4s, v25.8h
uaddw v4.4s, v4.4s, v26.4h
uaddw2 v5.4s, v5.4s, v26.8h
uaddw v6.4s, v6.4s, v27.4h
uaddw2 v7.4s, v7.4s, v27.8h
b.gt 1b
.balign 8
stp q0, q1, [pDst, #0]
stp q2, q3, [pDst, #32]
stp q4, q5, [pDst, #64]
stp q6, q7, [pDst, #96]
ret
.endfunc
.balign 8
.LShiftTable:
.dc.b 0, -1, -2, -3, -4, -5, -6, -7
.end
I don't like the aarch64 mnemonics either. For comparison I put the aarch32 version below:
/*
* alqCountBits.S
*
* Created on: 2020. 5. 26.
* Author: Jake 'Alquimista' LEE
*/
.syntax unified
.arm
.arch armv7-a
.fpu neon
.global alqCountBits
.text
// extern void alqCountBits(uint32_t *pDst, uint32_t *pSrc, uint32_t nLength);
// assert(nLength % 2 == 0);
pDst .req r0
pSrc .req r1
length .req r2
.balign 32
.func
alqCountBits:
adr r12, .LShiftTable
vpush {q4-q7}
vld1.64 {d30}, [r12]
vmov.i8 q14, #1
vmov.i8 q0, #0
vmov.i8 q1, #0
vmov.i8 q2, #0
vmov.i8 q3, #0
vmov.i8 q4, #0
vmov.i8 q5, #0
vmov.i8 q6, #0
vmov.i8 q7, #0
vmov d31, d30
.balign 32
1:
vld4.8 {d16[], d17[], d18[], d19[]}, [pSrc]!
vld4.8 {d20[], d21[], d22[], d23[]}, [pSrc]!
subs length, length, #2
vshl.u8 q8, q8, q15
vshl.u8 q9, q9, q15
vshl.u8 q10, q10, q15
vshl.u8 q11, q11, q15
vand q8, q8, q14
vand q9, q9, q14
vand q10, q10, q14
vand q11, q11, q14
vaddl.u8 q12, d20, d16
vaddl.u8 q13, d21, d17
vaddl.u8 q8, d22, d18
vaddl.u8 q10, d23, d19
vaddw.u16 q0, q0, d24
vaddw.u16 q1, q1, d25
vaddw.u16 q2, q2, d26
vaddw.u16 q3, q3, d27
vaddw.u16 q4, q4, d16
vaddw.u16 q5, q5, d17
vaddw.u16 q6, q6, d20
vaddw.u16 q7, q7, d21
bgt 1b
.balign 8
vst1.32 {q0, q1}, [pDst]!
vst1.32 {q2, q3}, [pDst]!
vst1.32 {q4, q5}, [pDst]!
vst1.32 {q6, q7}, [pDst]
vpop {q4-q7}
bx lr
.endfunc
.balign 8
.LShiftTable:
.dc.b 0, -1, -2, -3, -4, -5, -6, -7
.end
As you can see, no equivalent of trn1 is needed in aarch32.
Still, overall I much prefer aarch64 because of the sheer number of registers.
I don't think it can be done per nibble, but per byte should work.
Load a vector with the relevant source bit set in each byte (you'll need two of these as we probably only can do this per byte and not per nibble). Duplicate each byte of the word into 8 byte sized elements each, in two vectors. Do a cmtst with both masks (which will set all bits, i.e. set it to -1, in an element if the corresponding bit was set), and accumulate.
Something like this, untested:
.section .rodata
mask: .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
.text
mov w2, 16 // w2: loop counter
movi v0.16b, #0 // v0: accumulate count 1
movi v1.16b, #0 // v1: accumulate count 2
adrp x3, mask
add x3, x3, :lo12:mask
ld1 {v2.16b}, [x3] // v2: mask with one bit set in each byte
1:
ld1r {v3.4s}, [x0], #4 // One vector with the full 32 bit word
subs w2, w2, 1
dup v4.8b, v3.b[0] // v4: vector containing the lowest byte of the word
dup v5.8b, v3.b[1] // v5: vector containing the second lowest byte of the word
dup v6.8b, v3.b[2]
dup v7.8b, v3.b[3]
ins v4.d[1], v5.d[0] // v4: elements 0-7: lowest byte, elements 8-15: second byte
ins v6.d[1], v7.d[0] // v6: elements 0-7: third byte, elements 8-15: fourth byte
cmtst v4.16b, v4.16b, v2.16b // v4: each byte -1 if the corresponding bit was set
cmtst v6.16b, v6.16b, v2.16b // v6: each byte -1 if the corresponding bit was set
sub v0.16b, v0.16b, v4.16b // accumulate: if bit was set, subtract -1 i.e. add +1
sub v1.16b, v1.16b, v6.16b
b.ne 1b
// Done, count of individual bits in byte sized elements in v0-v1
EDIT: The ld4r approach suggested by Jake 'Alquimista' LEE is actually better than the loading used here; the ld1r followed by four dup instructions could be replaced by ld4r {v4.8b, v5.8b, v6.8b, v7.8b}, [x0], #4, keeping the rest of the logic the same. Whether cmtst or ushl + and ends up faster would have to be measured. Handling two 32-bit words per iteration, as his solution does, probably also gives better throughput than mine.
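The cmtst-then-sub trick in this answer can be modelled per byte lane in plain C. This is only a sketch of the arithmetic, not a replacement for the NEON code; cmtst_lane and accumulate_byte are made-up names.

```c
#include <stdint.h>

/* Model of one byte lane of CMTST: all ones if (a & b) != 0, else zero. */
uint8_t cmtst_lane(uint8_t a, uint8_t b)
{
    return (a & b) ? 0xFF : 0x00;
}

/* Accumulate per-bit counts for one source byte into acc[0..7],
 * one lane per bit, the way the vector loop does for 16 lanes at once. */
void accumulate_byte(uint8_t acc[8], uint8_t value)
{
    for (int lane = 0; lane < 8; ++lane) {
        uint8_t mask = (uint8_t)(1u << lane);    /* 1, 2, 4, ... 128 */
        uint8_t hit  = cmtst_lane(value, mask);  /* 0xFF (-1) or 0x00 */

        acc[lane] = (uint8_t)(acc[lane] - hit);  /* x - (-1) == x + 1 */
    }
}
```

Subtracting 0xFF in 8-bit arithmetic wraps around to adding 1, which is why a single sub after cmtst is enough to count.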
Combining the above answers, and modifying my requirements ;-) I came up with:
tst:
ldr x0, =test_data
ldr x1, =mask
ld1 {v2.2d}, [x1] // ld1.2d v2, [x1] // load 2 * 64 = 128 bits
movi v0.16b, 0
mov w2, 8
1:
ld1r {v1.8h}, [x0], 2 // ld1r.8h v1, [x0], 2 // repeat one 16-bit word across eight 16-bit lanes
cmtst v1.16b, v1.16b, v2.16b // cmtst.16b v1, v1, v2 // sets -1 in each 8bit word of 16 8-bit lanes if input matches mask
sub v0.16b, v0.16b, v1.16b // sub.16b v0, v0, v1 // sub -1 = add +1
subs w2, w2, 1
bne 1b
// v0 contains 16 bytes, mildly shuffled.
If one wants them unshuffled:
mov v1.d[0], v0.d[1]
uzp1 v2.8b, v0.8b, v1.8b
uzp2 v3.8b, v0.8b, v1.8b
mov v2.d[1], v3.d[0]
// v2 contains 16 bytes, in order.
The following counts up to fifteen samples with 32 bits (accumulating in 32 nibbles):
tst2:
ldr x0, =test_data2
ldr x1, =mask2
ld1 {v2.4s, v3.4s, v4.4s, v5.4s}, [x1] // ld1.4s {v2, v3, v4, v5}, [x1]
movi v0.16b, 0
mov w2, 8
1:
ld1r {v1.4s}, [x0], 4 // ld1r.4s v1, [x0], 4 // repeat one 32-bit word across four 32-bit lanes
cmtst v6.16b, v1.16b, v2.16b // cmtst.16b v6, v1, v2 // upper nibbles
cmtst v1.16b, v1.16b, v3.16b // cmtst.16b v1, v1, v3 // lower nibbles
and v6.16b, v6.16b, v4.16b // and.16b v6, v6, v4 // upper inc 0001.0000 x 16
and v1.16b, v1.16b, v5.16b // and.16b v1, v1, v5 // lower inc 0000.0001 x 16
orr v1.16b, v1.16b, v6.16b // orr.16b v1, v1, v6
add v0.16b, v0.16b, v1.16b // add.16b v0, v0, v1 // accumulate
subs w2, w2, 1
bne 1b
// v0 contains 32 nibbles -- somewhat shuffled, but that's ok.
// fedcba98.76543210.fedcba98.76543210.fedcba98.76543210.fedcba98.76543210 fedcba98.76543210.fedcba98.76543210.fedcba98.76543210.fedcba98.76543210
// 10000000.10000000.01000000.01000000.00100000.00100000.00010000.00010000 00001000.00001000.00000100.00000100.00000010.00000010.00000001.00000001
// f 7 e 6 d 5 c 4 b 3 a 2 9 1 8 0
mask:
.quad 0x0808040402020101
.quad 0x8080404020201010
test_data:
.hword 0x0103
.hword 0x0302
.hword 0x0506
.hword 0x080A
.hword 0x1010
.hword 0x2020
.hword 0xc040
.hword 0x8080
// FEDCBA98.76543210.fedcba⁹⁸.⁷⁶⁵⁴³²¹⁰.FEDCBA98.76543210.fedcba⁹⁸.⁷⁶⁵⁴³²¹⁰.FEDCBA98.76543210.fedcba⁹⁸.⁷⁶⁵⁴³²¹⁰.FEDCBA98.76543210.fedcba⁹⁸.⁷⁶⁵⁴³²¹⁰
// 10001000 10001000 10001000 10001000 01000100 01000100 01000100 01000100 00100010 00100010 00100010 00100010 00010001 00010001 00010001 00010001
// F B 7 3 f b ⁷ ³ E A 6 2 e a ⁶ ² D 9 5 1 d ⁹ ⁵ ¹ C 8 4 0 c ⁸ ⁴ ⁰
mask2:
.quad 0x8080808040404040 // v2
.quad 0x2020202010101010
.quad 0x0808080804040404 // v3
.quad 0x0202020201010101
.quad 0x1010101010101010 // v4
.quad 0x1010101010101010
.quad 0x0101010101010101 // v5
.quad 0x0101010101010101
test_data2:
.word 0xff000103
.word 0xff000302
.word 0xff000506
.word 0xff00080A
.word 0xff001010
.word 0xff002020
.word 0xff00c040
.word 0xff008080
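The limit of fifteen samples in tst2 comes from packing two 4-bit counters into each byte: a sixteenth hit on the lower nibble would carry into the upper one. A small C model of a single byte lane shows this; add_packed_nibbles is a made-up name and the model covers only the and/orr/add step.

```c
#include <stdint.h>

/* One byte lane of the tst2 accumulator: the upper nibble counts one bit,
 * the lower nibble counts another. upper_hit/lower_hit stand for the
 * cmtst results of the two masks (non-zero means the bit was set). */
uint8_t add_packed_nibbles(uint8_t acc, int upper_hit, int lower_hit)
{
    uint8_t inc = (uint8_t)((upper_hit ? 0x10 : 0x00) |
                            (lower_hit ? 0x01 : 0x00));

    return (uint8_t)(acc + inc);
}
```

After fifteen samples with both bits set the lane holds 0xFF; one more hit on the lower nibble of a lane holding 0x0F yields 0x10, which corrupts the upper counter.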

How to understand why an ARM exception happens?

I'm trying to understand the cause of an ARM exception that I'm encountering.
It happens randomly during system startup and can show up in a few different ways.
One of the simplest is the following:
0x8004e810 in ti_sysbios_family_arm_a8_intcps_Hwi_vectors ()
#0 0x8004e810 in ti_sysbios_family_arm_a8_intcps_Hwi_vectors ()
#1 0x80002f04 in ti_sysbios_family_arm_exc_Exception_excHandlerDataAsm(int0_t) ()
at /home/rnd_share/sysbios/bios_6_51_00_15/packages/ti/sysbios/family/arm/exc/Exception_asm_gnu.asm:103
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
r0 0x20000197 536871319
r1 0x20000197 536871319
r2 0x20000197 536871319
r3 0x20000197 536871319
r4 0x20000197 536871319
r5 0x6 6
r6 0x80000024 2147483684
r7 0x80007a0c 2147514892
r8 0x8004f0a8 2147807400
r9 0x80041340 2147750720
r10 0x80040a3c 2147748412
r11 0xffffffff 4294967295
r12 0x20000197 536871319
sp 0x7fffff88 0x7fffff88
lr 0x80002f04 2147495684
pc 0x8004e810 0x8004e810 <ti_sysbios_family_arm_a8_intcps_Hwi_vectors+16>
cpsr 0x20000197 536871319
PC = 8004E810, CPSR = 20000197 (ABORT mode, ARM IRQ dis.)
R0 = 20000197, R1 = 20000197, R2 = 20000197, R3 = 20000197
R4 = 20000197, R5 = 00000006, R6 = 80000024, R7 = 80007A0C
USR: R8 =8004F0A8, R9 =80041340, R10=80040A3C, R11 =FFFFFFFF, R12 =20000197
R13=80212590, R14=80040A3C
FIQ: R8 =AEE1D6FA, R9 =C07BA930, R10=1B0B137A, R11 =7EC3F1DF, R12 =2000019F
R13=80065CF8, R14=00000000, SPSR=00000000
SVC: R13=4030CB20, R14=00022071, SPSR=00000000
ABT: R13=7FFFFF88, R14=80002F04, SPSR=20000197
IRQ: R13=F4ADFD8A, R14=80041020, SPSR=8000011F
UND: R13=80085CF8, R14=ED0F7EF1, SPSR=00000000
(gdb) frame
#0 0x8004e810 in ti_sysbios_family_arm_a8_intcps_Hwi_vectors ()
(gdb) frame 1
#1 0x80002f04 in ti_sysbios_family_arm_exc_Exception_excHandlerDataAsm(int0_t) ()
at /home/rnd_share/sysbios/bios_6_51_00_15/packages/ti/sysbios/family/arm/exc/Exception_asm_gnu.asm:103
103 mrc p15, #0, r12, c5, c0, #0 # read DFSR into r12
(gdb) list
98 .func ti_sysbios_family_arm_exc_Exception_excHandlerDataAsm__I
99
100 ti_sysbios_family_arm_exc_Exception_excHandlerDataAsm__I:
101 stmfd sp!, {r0-r12} # save r4-r12 while we're at it
102
103 mrc p15, #0, r12, c5, c0, #0 # read DFSR into r12
104 stmfd sp!, {r12} # save DFSR
105 mrc p15, #0, r12, c5, c0, #1 # read IFSR into r12
106 stmfd sp!, {r12} # save DFSR
107 mrc p15, #0, r12, c6, c0, #0 # read DFAR into r12
(gdb) monitor cp15 6 0 0 0
Reading CP15 register (6,0,0,0 = 0x7FFFFF54)
My understanding is that an exception was already being handled, which can be seen in frame 1.
It tries to save registers onto the stack:
101 stmfd sp!, {r0-r12} # save r4-r12 while we're at it
But the stack pointer was incorrect in ABT mode:
ABT: R13=7FFFFF88
There are two things I don't understand:
* What can cause such SP values in the ABT and IRQ contexts?
* What is actually in frame 0? In other words, how did the Cortex react to a data abort while already inside an exception handler?
The device usually starts normally; this happens roughly 3 times in 10 boots. It never happens when starting from the debugger, only with release builds and only when started from the bootloader.
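For reading dumps like the one above, it helps to decode the CPSR by hand. The sketch below extracts the mode and mask bits from a CPSR value; the helper names are made up, and only the fields relevant here are covered. For 0x20000197 it gives mode 0x17 (Abort) with IRQs masked, matching the debugger's "ABORT mode, ARM IRQ dis." line.

```c
#include <stdint.h>

/* Mode field values (CPSR bits [4:0]) for the classic ARM modes. */
enum arm_mode {
    MODE_USR = 0x10, MODE_FIQ = 0x11, MODE_IRQ = 0x12, MODE_SVC = 0x13,
    MODE_ABT = 0x17, MODE_UND = 0x1B, MODE_SYS = 0x1F
};

/* Hypothetical helpers for illustration only. */
unsigned cpsr_mode(uint32_t cpsr)     { return cpsr & 0x1Fu; }            /* M[4:0] */
int cpsr_irq_disabled(uint32_t cpsr)  { return (int)((cpsr >> 7) & 1u); } /* I bit  */
int cpsr_fiq_disabled(uint32_t cpsr)  { return (int)((cpsr >> 6) & 1u); } /* F bit  */
int cpsr_thumb(uint32_t cpsr)         { return (int)((cpsr >> 5) & 1u); } /* T bit  */
```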
Two weeks later...
The boot procedure is as follows:
1. The 2nd stage bootloader loads the application into memory.
2. The 2nd stage bootloader jumps to the application start.
3. The main function of the application is entered.
It turns out that statically initialized values of the application are sometimes correct after step 1 but corrupted by step 3; in other words, the application image gets corrupted.
The caches were not flushed correctly between steps 1 and 2.
Disabling the caches in the 2nd stage bootloader made the problem go away completely.
Now I need to fix it properly.

Allwinner a64 - switch from aarch32 to aarch64 by warm reset

I want to deploy simple bare-metal software on the Pine64 board, which hosts an Allwinner A64 SoC. The configuration is as follows: when powered on, boot0 starts u-boot, which loads my hello.bin into RAM (0x40000000) and starts executing it. The problem is that it runs in the aarch32 execution state and I want aarch64.
I found a way to do this in this patch. There is also some background on the wiki.
I have copied the code and the objdump -d hello.o returns identical results as in the link:
Disassembly of section .text:
00000000 <_reset>:
0: e59f0024 ldr r0, [pc, #36] ; 2c <_reset+0x2c>
4: e59f1024 ldr r1, [pc, #36] ; 30 <_reset+0x30>
8: e5801000 str r1, [r0]
c: f57ff04f dsb sy
10: f57ff06f isb sy
14: ee1c0f50 mrc 15, 0, r0, cr12, cr0, {2}
18: e3800003 orr r0, r0, #3
1c: ee0c0f50 mcr 15, 0, r0, cr12, cr0, {2}
20: f57ff06f isb sy
24: e320f003 wfi
28: eafffffe b 28 <_reset+0x28>
2c: 017000a0 .word 0x017000a0
30: 40008000 .word 0x40008000
It is supposed to perform a warm reset and start executing at 0x40008000 in the aarch64 execution state. But when I run it I get an undefined instruction error, and it restarts in the same state and starts from 0x0.
## Starting application at 0x40000000 ...
undefined instruction
pc : [<40000018>] lr : [<7ff1d054>]
sp : 76eb8a90 ip : 00000030 fp : 7ff1d00c
r10: 00000002 r9 : 76ed0ea0 r8 : 7ffb5340
r7 : 77f1bd78 r6 : 40000000 r5 : 00000002 r4 : 77f1bd7c
r3 : 40000000 r2 : 77f1bd7c r1 : 40008000 r0 : 017000a0
Flags: nZCv IRQs on FIQs off Mode SVC_32
Resetting CPU ...
Why is that?
EDIT:
The first problem was noticed by @Frant below: the binary should be linked with a different .text section address, that is, starting at 0x40000000 instead of 0x0.
It also couldn't work when loaded by u-boot, which runs in EL2. Writing to RMR requires being in EL3. This is possible with the FEL method.
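For reference, the orr r0, r0, #3 in the disassembly sets the two low bits of RMR: AA64 (bit 0, reset into AArch64) and RR (bit 1, request the warm reset). A minimal C sketch of that value computation follows; the macro and function names are made up, and the actual register write still needs the mcr instruction at a sufficient exception level.

```c
#include <stdint.h>

#define RMR_AA64 (1u << 0)  /* come out of the warm reset in AArch64 */
#define RMR_RR   (1u << 1)  /* request the warm reset                */

/* Compute the value to write back to RMR, keeping the other bits as read.
 * Hypothetical helper for illustration only. */
uint32_t rmr_request_aarch64_reset(uint32_t rmr)
{
    return rmr | RMR_AA64 | RMR_RR;
}
```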
NOTE:
After facing this problem I asked around for help, and apparently I was using an old way of flashing the board. For some time now the Pine64 has had much better support, and it is possible to boot it in two more convenient ways:
* mainline u-boot with ATF, which directly generates a binary that can be flashed to an SD card, and drops you in EL2,
* the sunxi-fel tool, as described below, which is very convenient if you do not want to re-flash the SD card all the time, and drops you in EL3 (WARNING: the sunxi wiki is a bit misleading about the sunxi-fel command arguments; the ones below worked for me).
My answer is an attempt to answer the following question: does the aarch32 state-switching code you are using work? The good news is that the code you are using works fine. The bad news is that something else may not be working properly in your environment. This would not surprise me much given the terrible state of all Allwinner out-of-the-box BSPs.
Since I did not know which exact versions of boot0 and u-boot you were using, I tested your code using Andre Przywara's FEL-capable SPL binaries for A64/H5 (see the FEL Booting section of the A64 entry for more details) and sunxi-fel. This removes the boot0 and u-boot you are using as potential culprits.
The Minimal, Complete, and Verifiable example I built for testing your code requires:
* Removing the SD card from the Pine64, so that it enters FEL mode at power-up,
* A male-A to male-A USB 2.0 cable for connecting your PC to the upper USB host receptacle of the Pine64,
* A bash script, build.sh, for building sunxi-tools and retrieving the FEL-capable SPL binaries,
* rmr_switch.S, a version of rmr_switch.S minus comments, plus a symbol to be pre-processed so the start address can be set without modifying the file every time,
* rmr_switch2.S, a version of the rmr_switch.S mentioned above, but using r0 and r1 the way they are used in the patch you were referencing,
* uart-aarch32.s, an aarch32 program displaying *** Hello from aarch32! *** on UART0,
* uart-aarch64.s, an aarch64 program displaying *** Hello from aarch64! *** on UART0.
Here is the content for each of the required files:
build.sh:
#!/bin/bash
# usage:
# CROSS_COMPILE_AARCH64=/opt/linaro/gcc-linaro-7.2.1-2017.11-x86_64_aarch64-elf/bin/aarch64-elf- CROSS_COMPILE_AARCH32=/opt/linaro/gcc-linaro-7.2.1-2017.11-x86_64_arm-eabi/bin/arm-eabi- ./build.sh
clear
CROSS_COMPILE_AARCH64=${CROSS_COMPILE_AARCH64:-/opt/linaro/gcc-linaro-7.2.1-2017.11-x86_64_aarch64-elf/bin/aarch64-elf-}
CROSS_COMPILE_AARCH32=${CROSS_COMPILE_AARCH32:-/opt/linaro/gcc-linaro-7.2.1-2017.11-x86_64_arm-eabi/bin/arm-eabi-}
SOC=${SOC:-a64}
#AARCH32_START_ADDRESS=0x42000000
#AARCH64_START_ADDRESS=0x42010000
AARCH32_START_ADDRESS=0x40000000
AARCH64_START_ADDRESS=0x40008000
SUNXI_FEL=sunxi-tools/sunxi-fel
install_sunxi_tools()
{
if [ ! -f ${SUNXI_FEL} ]
then
git clone --branch v1.4.2 https://github.com/linux-sunxi/sunxi-tools
pushd sunxi-tools
make
popd
fi
}
retrieve_spl_aarch32()
{
if [ ! -f sunxi-a64-spl32-ddr3.bin ]
then
wget https://github.com/apritzel/pine64/raw/master/binaries/sunxi-a64-spl32-ddr3.bin
fi
if [ ! -f sunxi-h5-spl32-ddr3.bin ]
then
wget https://github.com/apritzel/pine64/raw/master/binaries/sunxi-h5-spl32-ddr3.bin
fi
}
test_aarch32()
{
# testing aarch32 program
PROGRAM=uart-aarch32.s
BASE=${PROGRAM%%.*}
${CROSS_COMPILE_AARCH32}gcc -O0 -nostdlib -nostartfiles -e ${AARCH32_START_ADDRESS} -Wl,-Ttext=${AARCH32_START_ADDRESS} -o ${BASE}.elf ${BASE}.s
${CROSS_COMPILE_AARCH32}objcopy --remove-section .note.gnu.build-id ${BASE}.elf
${CROSS_COMPILE_AARCH32}objcopy --remove-section .ARM.attributes ${BASE}.elf
${CROSS_COMPILE_AARCH32}objdump -D ${BASE}.elf > ${BASE}.lst
${CROSS_COMPILE_AARCH32}objcopy -O binary ${BASE}.elf ${BASE}.bin
${CROSS_COMPILE_AARCH32}objcopy ${BASE}.elf -O srec ${BASE}.srec
echo "------------------ test uart-aarch32 -----------------------------"
echo sudo ${SUNXI_FEL} spl sunxi-${SOC}-spl32-ddr3.bin
echo sudo ${SUNXI_FEL} write ${AARCH32_START_ADDRESS} uart-aarch32.bin
echo sudo ${SUNXI_FEL} exe ${AARCH32_START_ADDRESS}
echo "------------------------------------------------------------------"
}
test_aarch64()
{
# testing aarch64 program
PROGRAM=uart-aarch64.s
BASE=${PROGRAM%%.*}
${CROSS_COMPILE_AARCH64}gcc -O0 -nostdlib -nostartfiles -e ${AARCH64_START_ADDRESS} -Wl,-Ttext=${AARCH64_START_ADDRESS} -o ${BASE}.elf ${BASE}.s
${CROSS_COMPILE_AARCH64}objcopy --remove-section .note.gnu.build-id ${BASE}.elf
${CROSS_COMPILE_AARCH64}objcopy --remove-section .ARM.attributes ${BASE}.elf
${CROSS_COMPILE_AARCH64}objdump -D ${BASE}.elf > ${BASE}.lst
${CROSS_COMPILE_AARCH64}objcopy -O binary ${BASE}.elf ${BASE}.bin
${CROSS_COMPILE_AARCH64}objcopy ${BASE}.elf -O srec ${BASE}.srec
echo "------------------ test uart-aarch64 -----------------------------"
echo sudo ${SUNXI_FEL} spl sunxi-${SOC}-spl32-ddr3.bin
echo sudo ${SUNXI_FEL} write ${AARCH64_START_ADDRESS} uart-aarch64.bin
echo sudo ${SUNXI_FEL} reset64 ${AARCH64_START_ADDRESS}
echo "------------------------------------------------------------------"
}
test_rmr_switch()
{
# compiling rmr_switch.s
PROGRAM=rmr_switch.s
BASE=${PROGRAM%%.*}
rm -f ${BASE}.s
${CROSS_COMPILE_AARCH64}cpp -DAARCH64_START_ADDRESS=${AARCH64_START_ADDRESS} ${BASE}.S > ${BASE}.s
${CROSS_COMPILE_AARCH32}gcc -O0 -nostdlib -nostartfiles -e ${AARCH32_START_ADDRESS} -Wl,-Ttext=${AARCH32_START_ADDRESS} -o ${BASE}.elf ${BASE}.s
${CROSS_COMPILE_AARCH32}objcopy --remove-section .note.gnu.build-id ${BASE}.elf
${CROSS_COMPILE_AARCH32}objcopy --remove-section .ARM.attributes ${BASE}.elf
${CROSS_COMPILE_AARCH32}objdump -D ${BASE}.elf > ${BASE}.lst
${CROSS_COMPILE_AARCH32}objcopy -O binary ${BASE}.elf ${BASE}.bin
${CROSS_COMPILE_AARCH32}objcopy ${BASE}.elf -O srec ${BASE}.srec
echo "------------------ test rmr_switch uart-aarch64 ------------------"
echo sudo ${SUNXI_FEL} spl sunxi-${SOC}-spl32-ddr3.bin
echo sudo ${SUNXI_FEL} write ${AARCH32_START_ADDRESS} rmr_switch.bin
echo sudo ${SUNXI_FEL} write ${AARCH64_START_ADDRESS} uart-aarch64.bin
echo sudo ${SUNXI_FEL} exe ${AARCH32_START_ADDRESS}
echo "------------------------------------------------------------------"
}
test_rmr_switch2()
{
# compiling rmr_switch2.s
PROGRAM=rmr_switch2.s
BASE=${PROGRAM%%.*}
rm -f ${BASE}.s
${CROSS_COMPILE_AARCH64}cpp -DAARCH64_START_ADDRESS=${AARCH64_START_ADDRESS} ${BASE}.S > ${BASE}.s
${CROSS_COMPILE_AARCH32}gcc -O0 -nostdlib -nostartfiles -e ${AARCH32_START_ADDRESS} -Wl,-Ttext=${AARCH32_START_ADDRESS} -o ${BASE}.elf ${BASE}.s
${CROSS_COMPILE_AARCH32}objcopy --remove-section .note.gnu.build-id ${BASE}.elf
${CROSS_COMPILE_AARCH32}objcopy --remove-section .ARM.attributes ${BASE}.elf
${CROSS_COMPILE_AARCH32}objdump -D ${BASE}.elf > ${BASE}.lst
${CROSS_COMPILE_AARCH32}objcopy -O binary ${BASE}.elf ${BASE}.bin
${CROSS_COMPILE_AARCH32}objcopy ${BASE}.elf -O srec ${BASE}.srec
echo "------------------ test rmr_switch2 uart-aarch64 -----------------"
echo sudo ${SUNXI_FEL} spl sunxi-${SOC}-spl32-ddr3.bin
echo sudo ${SUNXI_FEL} write ${AARCH32_START_ADDRESS} rmr_switch2.bin
echo sudo ${SUNXI_FEL} write ${AARCH64_START_ADDRESS} uart-aarch64.bin
echo sudo ${SUNXI_FEL} exe ${AARCH32_START_ADDRESS}
echo "------------------------------------------------------------------"
}
# prerequisites
install_sunxi_tools
retrieve_spl_aarch32
# test
test_aarch32
test_aarch64
test_rmr_switch
test_rmr_switch2
rmr_switch.S:
.text
ldr r1, =0x017000a0 @ MMIO mapped RVBAR[0] register
ldr r0, =AARCH64_START_ADDRESS @ start address, substituted by cpp -D
str r0, [r1]
dsb sy
isb sy
mrc 15, 0, r0, cr12, cr0, 2 @ read RMR register
orr r0, r0, #3 @ request reset in AArch64
mcr 15, 0, r0, cr12, cr0, 2 @ write RMR register
isb sy
1: wfi
b 1b
rmr_switch2.S:
.text
ldr r0, =0x017000a0 @ MMIO mapped RVBAR[0] register
ldr r1, =AARCH64_START_ADDRESS @ start address, substituted by cpp -D
str r1, [r0]
dsb sy
isb sy
mrc 15, 0, r0, cr12, cr0, 2 @ read RMR register
orr r0, r0, #3 @ request reset in AArch64
mcr 15, 0, r0, cr12, cr0, 2 @ write RMR register
isb sy
1: wfi
b 1b
uart-aarch32.s:
.code 32
.text
ldr r1,=0x01C28000
ldr r2,=message
loop: ldrb r0, [r2]
add r2, r2, #1
cmp r0, #0
beq completed
strb r0, [r1]
b loop
completed: b .
.data
message:
.asciz "*** Hello from aarch32! ***"
.end
uart-aarch64.s:
.text
ldr x1,=0x01C28000
ldr x2,=message
loop: ldrb w0, [x2]
add x2, x2, #1
cmp w0, #0
beq completed
strb w0, [x1]
b loop
completed: b .
.data
message:
.asciz "*** Hello from aarch64! ***"
.end
Once all the files are in the same directory, the test procedure would be:
Execute build.sh. You can specify the SoC you are using, A64 (default) or H5, and the aarch32/aarch64 toolchains on the command line:
CROSS_COMPILE_AARCH64=/opt/linaro/gcc-linaro-7.2.1-2017.11-x86_64_aarch64-elf/bin/aarch64-elf- CROSS_COMPILE_AARCH32=/opt/linaro/gcc-linaro-7.2.1-2017.11-x86_64_arm-eabi/bin/arm-eabi- ./build.sh
The output should look like this (I removed harmless warnings):
------------------ test uart-aarch32 -----------------------------
sudo sunxi-tools/sunxi-fel spl sunxi-a64-spl32-ddr3.bin
sudo sunxi-tools/sunxi-fel write 0x40000000 uart-aarch32.bin
sudo sunxi-tools/sunxi-fel exe 0x40000000
------------------ test uart-aarch64 -----------------------------
sudo sunxi-tools/sunxi-fel spl sunxi-a64-spl32-ddr3.bin
sudo sunxi-tools/sunxi-fel write 0x40008000 uart-aarch64.bin
sudo sunxi-tools/sunxi-fel reset64 0x40008000
------------------ test rmr_switch uart-aarch64 ------------------
sudo sunxi-tools/sunxi-fel spl sunxi-a64-spl32-ddr3.bin
sudo sunxi-tools/sunxi-fel write 0x40000000 rmr_switch.bin
sudo sunxi-tools/sunxi-fel write 0x40008000 uart-aarch64.bin
sudo sunxi-tools/sunxi-fel exe 0x40000000
------------------ test rmr_switch2 uart-aarch64 -----------------
sudo sunxi-tools/sunxi-fel spl sunxi-a64-spl32-ddr3.bin
sudo sunxi-tools/sunxi-fel write 0x40000000 rmr_switch2.bin
sudo sunxi-tools/sunxi-fel write 0x40008000 uart-aarch64.bin
sudo sunxi-tools/sunxi-fel exe 0x40000000
------------------------------------------------------------------
Now, before entering the sunxi-fel commands required for each of the four tests, you need to unplug the Pine64 from its power source and from any USB host receptacle it may be plugged into (USB TTL uart, male-A to male-A USB cable). Reconnect the Pine64 to its power source, then re-plug USB cables.
lsusb should now display:
Bus 001 Device 016: ID 1f3a:efe8 Onda (unverified) V972 tablet in flashing mode
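If you would rather script that check than eyeball the lsusb output, something along these lines might do. It is only a sketch: fel_present is a name I made up (it is not part of sunxi-tools), and it simply looks for the 1f3a:efe8 ID from the line above in whatever text it is given.

```shell
# fel_present: hypothetical helper, not part of sunxi-tools.
# Reads lsusb-style text on stdin and succeeds if a device with the
# USB ID 1f3a:efe8 (the ID shown in the lsusb line above) is listed.
fel_present() {
    grep -qi 'ID 1f3a:efe8'
}

# Sample input copied from the lsusb output above.
sample='Bus 001 Device 016: ID 1f3a:efe8 Onda (unverified) V972 tablet in flashing mode'
if printf '%s\n' "$sample" | fel_present; then
    echo "FEL device found"
else
    echo "FEL device not found"
fi
```

On a real system you would feed it the live listing instead of the sample, i.e. lsusb | fel_present.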
Output on the serial console for the four tests should be:
test uart-aarch32 (verifying an aarch32 program runs from 0x40000000):
U-Boot SPL 2018.01-00007-gdb0ecc9b42 (Feb 23 2018 - 00:50:52)
DRAM: 512 MiB
Trying to boot from FEL
*** Hello from aarch32! ***
test uart-aarch64 (verifying an aarch64 program runs from 0x40008000):
U-Boot SPL 2018.01-00007-gdb0ecc9b42 (Feb 23 2018 - 00:50:52)
DRAM: 512 MiB
Trying to boot from FEL
*** Hello from aarch64! ***
test rmr_switch uart-aarch64 (running rmr_switch from 0x40000000, which will switch into aarch64 state and execute uart-aarch64 from 0x40008000):
U-Boot SPL 2018.01-00007-gdb0ecc9b42 (Feb 23 2018 - 00:50:52)
DRAM: 512 MiB
Trying to boot from FEL
*** Hello from aarch64! ***
test rmr_switch2 uart-aarch64 (running rmr_switch2 from 0x40000000, which will switch into aarch64 state and execute uart-aarch64 from 0x40008000):
U-Boot SPL 2018.01-00007-gdb0ecc9b42 (Feb 23 2018 - 00:50:52)
DRAM: 512 MiB
Trying to boot from FEL
*** Hello from aarch64! ***
It is worth mentioning that those tests can be performed on Windows using Linaro mingw32 toolchains, a Windows version of sunxi-fel, and Zadig.
Bottom line, the code you were using seems to be working well, and the rmr_switch2.s code I assembled is the same (I guess) as the one you are using:
rmr_switch2.elf: file format elf32-littlearm
Disassembly of section .text:
40000000 <.text>:
40000000: e59f0024 ldr r0, [pc, #36] ; 4000002c <.text+0x2c>
40000004: e59f1024 ldr r1, [pc, #36] ; 40000030 <.text+0x30>
40000008: e5801000 str r1, [r0]
4000000c: f57ff04f dsb sy
40000010: f57ff06f isb sy
40000014: ee1c0f50 mrc 15, 0, r0, cr12, cr0, {2}
40000018: e3800003 orr r0, r0, #3
4000001c: ee0c0f50 mcr 15, 0, r0, cr12, cr0, {2}
40000020: f57ff06f isb sy
40000024: e320f003 wfi
40000028: eafffffd b 40000024 <.text+0x24>
4000002c: 017000a0 cmneq r0, r0, lsr #1
40000030: 40008000 andmi r8, r0, r0
The examples were successfully tested on an H5-based OrangePI PC2. The command line for running build.sh should be:
SOC=h5 CROSS_COMPILE_AARCH64=/opt/linaro/gcc-linaro-7.2.1-2017.11-x86_64_aarch64-elf/bin/aarch64-elf- CROSS_COMPILE_AARCH32=/opt/linaro/gcc-linaro-7.2.1-2017.11-x86_64_arm-eabi/bin/arm-eabi- ./build.sh
Output for build.sh, and therefore sunxi-fel commands to be executed, will be slightly different, since a different, H5-specific, FEL-capable SPL will have to be used.
I noticed there is a small difference between the code you are using and the rmr_switch2 code, but since it comes after the state switch (after the wfi), it should not matter, I guess. I am assuming the code you assembled was slightly different itself:
Yours (.o):
28: eafffffe b 28 <_reset+0x28>
Mine (.elf):
40000028: eafffffd b 40000024 <.text+0x24>
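To see what those two encodings actually do, you can decode the branch offset by hand. Below is a small sketch, assuming both words are ARM-state B instructions (a 24-bit signed word offset, relative to the instruction address plus 8). arm_branch_target is a hypothetical helper name, not something from the toolchain.

```shell
# arm_branch_target ADDR INSN: hypothetical helper.
# Prints the target of an ARM-state B instruction, given the address
# of the instruction and its 32-bit encoding (both as hex literals).
arm_branch_target() {
    addr=$(( $1 ))
    insn=$(( $2 ))
    # The low 24 bits hold a signed offset counted in words.
    imm24=$(( insn & 0xffffff ))
    if [ $(( imm24 & 0x800000 )) -ne 0 ]; then
        imm24=$(( imm24 - 0x1000000 ))   # sign-extend
    fi
    # In ARM state the PC reads as the instruction address plus 8.
    printf '0x%x\n' $(( addr + 8 + imm24 * 4 ))
}

arm_branch_target 0x28 0xeafffffe         # yours
arm_branch_target 0x40000028 0xeafffffd   # mine
```

The first call prints 0x28, a branch to itself, and the second prints 0x40000024, a branch back to the wfi. Both match what objdump shows, and either way the core never gets past that point.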
I hope this helps.

PMU counters in ARM11

I am programming a Raspberry Pi Model B (ARM1176) bare metal, in assembly and C. I need to count the clock cycles used to execute a piece of assembly code.
I am using the following code for PMU counter:
mov r0,#1
MCR p15, 0, r0, c15, c12, 0 @ Write Performance Monitor Control Register
/* Reset Cycle Counter */
mov r0,#5
MCR p15, 0, r0, c15, c12, 0 @ Write Performance Monitor Control Register
/* Measure */
MRC p15, 0, r0, c15, c12, 1 @ Read Cycle Counter Register
<MY CODES>
MRC p15, 0, r1, c15, c12, 1 @ Read Cycle Counter Register
From this if I have
add r3,#3
in place of my code, I get r1=8 and r0=0, which seems correct, since the ARM11 has 8 pipeline stages and so takes 8 clock cycles to execute it.
But when I add more instructions I am getting ridiculous results like
add r3,#3
add r4,#1
r0=0,r1=97/96/94 (the result of r1 should also be constant!!!)
I am using uart to see results of registers on minicom.
Okay, seeing the same thing, that is very interesting.
# nop
.globl test
test:
mov r0,#1
MCR p15, 0, r0, c15, c12, 0
mov r0,#5
MCR p15, 0, r0, c15, c12, 0
MRC p15, 0, r0, c15, c12, 1
add r3,#3
add r2,#1
MRC p15, 0, r1, c15, c12, 1
sub r0,r1,r0
bx lr
I am calling this from C, so if I touched r4 in the code under test I would have to save it on the stack; I used r2 instead. Without the add r2 line the return value was 8; with the add r2 line the return value was 0x68, then 0x65. Note this is on a Pi Zero, so some clocks are a little faster than yours.
Remember this is running from dram and dram is painfully slow. So you may be seeing some of that.
Initial alignment of the code:
00008024 <test>:
8024: e3a00001 mov r0, #1
8028: ee0f0f1c mcr 15, 0, r0, cr15, cr12, {0}
802c: e3a00005 mov r0, #5
8030: ee0f0f1c mcr 15, 0, r0, cr15, cr12, {0}
8034: ee1f0f3c mrc 15, 0, r0, cr15, cr12, {1}
8038: e2833003 add r3, r3, #3
803c: e2822001 add r2, r2, #1
8040: ee1f1f3c mrc 15, 0, r1, cr15, cr12, {1}
8044: e0410000 sub r0, r1, r0
8048: e12fff1e bx lr
Yep. If I uncomment the nop in front of .globl test and comment out the add r2, I only have the add r3 as the code under test, but the nop shifts the alignment of the whole block of code. With the add r3 and no nop I get 8 counts; with the add r3 and the nop I get 0x67 counts.
So I think this is just a case of measuring the fetch. I have not enabled the ARM cache, but there may be a deeper cache or an MMU or something else in the way, since this RAM is shared between the ARM and the GPU.
If I go one step further, uncomment the nop and have both the add r3 and the add r2, it is 0x69 counts, basically on par with or barely longer than one instruction, so we forced a fetch in there.
So in my case, if I add more nops so that the initial read of the count is aligned on an 8-word boundary, and I have the two instructions being measured:
00008030 <test>:
8030: e3a00001 mov r0, #1
8034: ee0f0f1c mcr 15, 0, r0, cr15, cr12, {0}
8038: e3a00005 mov r0, #5
803c: ee0f0f1c mcr 15, 0, r0, cr15, cr12, {0}
8040: ee1f0f3c mrc 15, 0, r0, cr15, cr12, {1}
8044: e2833003 add r3, r3, #3
8048: e2822001 add r2, r2, #1
804c: ee1f1f3c mrc 15, 0, r1, cr15, cr12, {1}
8050: e0410000 sub r0, r1, r0
8054: e12fff1e bx lr
I get a count of 8. I put a third instruction in there (an add r3 and two add r2s): still a count of 8.
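A quick way to check which layout keeps the measured code inside one fetch is to compare the 32-byte (8-word) block that the two counter reads fall into. This is only a sketch under the assumption that fetches happen in aligned 8-word blocks, as described above; same_block is a hypothetical helper name.

```shell
# same_block A B: hypothetical helper.
# Prints "same" if addresses A and B fall in the same aligned
# 32-byte (8-word) block, "split" otherwise.
same_block() {
    if [ $(( $1 / 32 )) -eq $(( $2 / 32 )) ]; then
        echo same
    else
        echo split
    fi
}

# First layout: counter reads at 0x8034 and 0x8040.
same_block 0x8034 0x8040
# Padded layout: counter reads at 0x8040 and 0x804c.
same_block 0x8040 0x804c
```

The first layout prints split and the padded one prints same, which lines up with the large count in the first case and the count of 8 in the second.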
If I go back to this layout, where at least part of the code is in a different fetch line:
00008024 <test>:
8024: e3a00001 mov r0, #1
8028: ee0f0f1c mcr 15, 0, r0, cr15, cr12, {0}
802c: e3a00005 mov r0, #5
8030: ee0f0f1c mcr 15, 0, r0, cr15, cr12, {0}
8034: ee1f0f3c mrc 15, 0, r0, cr15, cr12, {1}
8038: e2833003 add r3, r3, #3
803c: e2822001 add r2, r2, #1
8040: ee1f1f3c mrc 15, 0, r1, cr15, cr12, {1}
8044: e0410000 sub r0, r1, r0
8048: e12fff1e bx lr
and do three runs without changing anything, then enable the L1 instruction cache and do three more runs, I get:
00000068
0000001D
0000001D
0000001F
00000008
00000008
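The counts are printed in hex, so it may help to see them in decimal. A minimal sketch (hex_to_dec is a made-up helper name):

```shell
# hex_to_dec HEX: hypothetical helper.
# Prints a hex string (without 0x prefix) as a decimal number.
hex_to_dec() {
    printf '%d\n' $(( 0x$1 ))
}

# The six counts above, in order: three runs before enabling the
# instruction cache, then three runs after.
for c in 00000068 0000001D 0000001D 0000001F 00000008 00000008; do
    echo "$c = $(hex_to_dec "$c") cycles"
done
```

That gives 104, 29, 29 cycles before enabling the cache and 31, 8, 8 after.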
So I think you are dealing with DRAM, which is slow, plus fetch lines, cache misses and hits, and the resulting cache line fetches.
If you were expecting to see the number of clocks it takes to execute an instruction, you won't. You don't have zero-wait-state memory unless you can keep the entire code under test in the L1 cache.
I don't think there is on-chip SRAM that you can use for this kind of thing on this chip/board, so you are going to end up hitting DRAM, and that DRAM is shared with the GPU. So basically program execution time is not expected to be deterministic. As with your computer or phone, the CPU is not the bottleneck and has not been for a long time; it is sitting around waiting to be fed data or instructions.

Resources