objdump produces wrong branch opcode interpretation

Consider the following objdump line for one function (func) in one object file:
3c: e03a b.n 78 <func+0x78>
On the target system (ARMv6-M), the opcode e03a means: branch to the address of this instruction + 0x78, i.e. 0x3c + 0x78 = 0xb4. The correct interpretation would be:
3c: e03a b.n B4 <func+0xB4>
Every other function and file shows correct b.n interpretations with correctly calculated targets in the objdump output. For some reason, only this function "confuses" objdump.
Note: func starts at 0x0.
I could not think of any reason for this. And since I have tools that parse and use the objdump output, this causes a serious problem for me.
Is there a reasonable explanation for this?
toolchain: gcc-arm-none-eabi-4_9-2015q3
platform running this toolchain: Ubuntu 16.04.2 LTS
EDIT: I'm attaching partial dump:
Disassembly of section i.func:
00000000 <func>:
0: b531 push {r0, r4, r5, lr}
2: b088 sub sp, #32
4: 2100 movs r1, #0
6: 9106 str r1, [sp, #24]
8: 482c ldr r0, [pc, #176] ; (bc <func+0xbc>)
a: 6800 ldr r0, [r0, #0]
c: 6840 ldr r0, [r0, #4]
e: 9103 str r1, [sp, #12]
10: 1c40 adds r0, r0, #1
12: 9002 str r0, [sp, #8]
14: 492a ldr r1, [pc, #168] ; (c0 <func+0xc0>)
16: 2000 movs r0, #0
18: 9104 str r1, [sp, #16]
1a: 9005 str r0, [sp, #20]
1c: a802 add r0, sp, #8
1e: f7ff fffe bl 0 <random_func>
22: f7ff fffe bl 0 <random_func2>
26: 4604 mov r4, r0
28: 4d26 ldr r5, [pc, #152] ; (c4 <func+0xc4>)
2a: 42ac cmp r4, r5
2c: d007 beq.n 3e <func+0x3e>
2e: a326 add r3, pc, #152 ; (adr r3, c8 <func+0xc8>)
30: 22ee movs r2, #238 ; 0xee
32: 492c ldr r1, [pc, #176] ; (e4 <func+0xe4>)
34: 2000 movs r0, #0
36: 9400 str r4, [sp, #0]
38: f7ff fffe bl 0 <log_func>
3c: e03a b.n 78 <func+0x78> <---- PROBLEM IS HERE
3e: f7ff fffe bl 0 <func>
42: 9006 str r0, [sp, #24]
44: f3bf 8f5f dmb sy
48: a808 add r0, sp, #32
4a: 7800 ldrb r0, [r0, #0]
4c: 2800 cmp r0, #0
4e: d00f beq.n 70 <func+0x70>
50: 9806 ldr r0, [sp, #24]
52: 2803 cmp r0, #3
54: d016 beq.n 84 <func+0x84>
56: f7ff fffe bl 0 <some_hw_func>
5a: 4604 mov r4, r0
5c: 42ac cmp r4, r5
5e: d01a beq.n 96 <func+0x96>
60: a321 add r3, pc, #132 ; (adr r3, e8 <func+0xe8>)
62: 22fa movs r2, #250 ; 0xfa
64: 491f ldr r1, [pc, #124] ; (e4 <func+0xe4>)
66: 2000 movs r0, #0
68: 9400 str r4, [sp, #0]
6a: f7ff fffe bl 0 <log_func>
6e: e021 b.n 46 <random_delay+0x46> <--- ALSO HERE SAME PROBLEM
70: f7ff fffe bl 0 <random_delay>
74: 2800 cmp r0, #0
76: d003 beq.n 80 <func+0x80>
78: a808 add r0, sp, #32
7a: 7800 ldrb r0, [r0, #0]
7c: 2800 cmp r0, #0
7e: d018 beq.n b2 <func+0xb2>
80: f7ff fffe bl 0 <some_hw_func2>
84: f7ff fffe bl 0 <random_delay>
88: 2800 cmp r0, #0
8a: d002 beq.n 92 <func+0x92>
8c: 9806 ldr r0, [sp, #24]
8e: 2803 cmp r0, #3
90: d00f beq.n b2 <func+0xb2>
92: f7ff fffe bl 0 <some_hw_func2>
96: f7ff fffe bl 0 <func>
9a: 4604 mov r4, r0
9c: 42ac cmp r4, r5
9e: d008 beq.n b2 <func+0xb2>
a0: 22ff movs r2, #255 ; 0xff
a2: a318 add r3, pc, #96 ; (adr r3, 104 <func+0x104>)
a4: 3201 adds r2, #1
a6: 490f ldr r1, [pc, #60] ; (e4 <func+0xe4>)
a8: 2000 movs r0, #0
aa: 9400 str r4, [sp, #0]
ac: f7ff fffe bl 0 <log_func>
b0: e000 b.n b4 <func+0xb4>
b2: 462c mov r4, r5
b4: 4620 mov r0, r4

This looks like a bug: whenever the b.n sits between two bl instructions that are subject to relocation, as here
38: f7ff fffe bl 0 <log_func>
3c: e03a b.n 78 <func+0x78> <---- PROBLEM IS HERE
3e: f7ff fffe bl 0 <func>
or here
6a: f7ff fffe bl 0 <log_func>
6e: e021 b.n 46 <random_delay+0x46>
70: f7ff fffe bl 0 <random_delay>
the calculation is wrong.
There is no legitimate reason for this; a report to the bug tracker at http://www.sourceware.org/bugzilla/ is probably appropriate (after verifying that the latest version also suffers from this bug).
EDIT: I had some time to look deeper into this bug.
The problem is that if the instruction before the b.n is any 32-bit instruction and the instruction after the b.n is subject to relocation, objdump falsely assumes that the b.n instruction has a relocation associated with it and uses 0 as the pc value for the offset calculation.
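To make the two numbers concrete, here is a minimal sketch of how the target of a 16-bit Thumb unconditional branch (opcode pattern 11100 followed by an 11-bit immediate) is computed. The function names thumb_b_offset and thumb_b_target are made up for this illustration and are not part of binutils.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the 11-bit immediate of a 16-bit Thumb "B" instruction
   and scale it from halfwords to bytes. Hypothetical helper. */
static int32_t thumb_b_offset(uint16_t opcode)
{
    /* Only valid for the unconditional 16-bit branch encoding. */
    assert((opcode & 0xF800u) == 0xE000u);

    int32_t imm11 = opcode & 0x7FF;
    if (imm11 & 0x400)      /* bit 10 is the sign bit */
        imm11 -= 0x800;
    return imm11 * 2;       /* halfword offset -> byte offset */
}

/* Target = PC + offset, where PC reads as the instruction address + 4
   in Thumb state. Hypothetical helper. */
static uint32_t thumb_b_target(uint32_t insn_addr, uint16_t opcode)
{
    return insn_addr + 4u + (uint32_t)thumb_b_offset(opcode);
}
```

With the real instruction address, thumb_b_target(0x3c, 0xe03a) gives 0xb4. With the address replaced by 0, the same opcode gives 0x78, which is exactly the value in the faulty dump line. The same holds for e021 at 6e: 0xb4 with the real address, 0x46 with 0.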
This code part from binutils/objdump.c is the culprit:
bfd_signed_vma distance_to_rel;
distance_to_rel = (**relppp)->address
- (rel_offset + addr_offset);
/* Check to see if the current reloc is associated with
the instruction that we are about to disassemble. */
if (distance_to_rel == 0
/* FIXME: This is wrong. We are trying to catch
relocs that are addressed part way through the
current instruction, as might happen with a packed
VLIW instruction. Unfortunately we do not know the
length of the current instruction since we have not
disassembled it yet. Instead we take a guess based
upon the length of the previous instruction. The
proper solution is to have a new target-specific
disassembler function which just returns the length
of an instruction at a given address without trying
to display its disassembly. */
|| (distance_to_rel > 0
&& distance_to_rel < (bfd_signed_vma) (previous_octets/ opb)))
{
inf->flags |= INSN_HAS_RELOC;
aux->reloc = **relppp;
}
The comment says it all: the code guesses from the previous 32-bit instruction that the next instruction is also 32 bits long (which it isn't!). The relocation targets 3e, and the disassembler assumes that the next instruction spans 3c to 3f, so the b.n is flagged with INSN_HAS_RELOC, which in turn leads to the incorrect offset calculation. It looks like this will not be easy to fix properly.
However, you could try and patch your objdump like this:
if (distance_to_rel == 0) {
inf->flags |= INSN_HAS_RELOC;
aux->reloc = **relppp;
}
This might produce inaccuracies the other way round, but those should be rare cases, and that trade-off may be more acceptable for you.
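The difference between the original condition and the reduced one can be reproduced with plain integers. This is only a sketch of the comparison logic, not binutils code: the names guessed_has_reloc and exact_has_reloc are invented, and opb (octets per byte) is assumed to be 1, as it is for ARM.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the original check: a relocation is attributed to the current
   instruction if it sits at its address, or within the length of the
   PREVIOUS instruction (the guess criticised in the FIXME comment).
   All values are byte addresses/lengths. Hypothetical helper. */
static bool guessed_has_reloc(long reloc_addr, long insn_addr,
                              long prev_insn_len)
{
    assert(prev_insn_len >= 0);
    long distance_to_rel = reloc_addr - insn_addr;
    return distance_to_rel == 0
        || (distance_to_rel > 0 && distance_to_rel < prev_insn_len);
}

/* Mirrors the reduced check from the suggested patch. Hypothetical helper. */
static bool exact_has_reloc(long reloc_addr, long insn_addr)
{
    return reloc_addr - insn_addr == 0;
}
```

For the b.n at 3c, the previous instruction (bl at 38) is 4 bytes long and the next relocation is at 3e, so guessed_has_reloc(0x3e, 0x3c, 4) is true while exact_has_reloc(0x3e, 0x3c) is false. If the previous instruction had been 2 bytes long, the guess would have been false as well.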

Related

How to force the arm gcc compiler not to use 32-bit access for unaligned memory

I am working with a memory that cannot handle 32-bit accesses to unaligned addresses. For unaligned addresses, the memory only supports 8-bit accesses.
My code contains a memcpy; when I passed an unaligned address to memcpy, the chip got stuck.
Looking deeper, I found that the generated assembly for memcpy performs 32-bit accesses regardless of whether the given address is 32-bit aligned. When I reduced the optimization level to -O2, the compiler generated code that always performs 8-bit accesses.
[Edit] : Below is the memcpy code which I am using
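#include <stddef.h> /* size_t */
void* memcpy(void * restrict s1, const void * restrict s2, size_t n)
{
    /* Byte-wise copy; the source pointer keeps its const qualifier. */
    char* ll = (char*)s1;
    const char* rr = (const char*)s2;
    for (size_t i = 0; i < n; i++) ll[i] = rr[i];
    return s1;
}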
Below is the disassembly of the code
void* memcpy3(void *s1, void *s2, size_t n)
{
char* ll = (char*)s1;
char* rr = (char*)s2;
for (size_t i = 0; i < n; i++) ll[i] = rr[i];
0: b38a cbz r2, 66 <memcpy3+0x66>
{
2: b4f0 push {r4, r5, r6, r7}
4: 1d03 adds r3, r0, #4
6: 1d0c adds r4, r1, #4
8: 42a0 cmp r0, r4
a: bf38 it cc
c: 4299 cmpcc r1, r3
e: d31e bcc.n 4e <memcpy3+0x4e>
10: 2a08 cmp r2, #8
12: d91c bls.n 4e <memcpy3+0x4e>
14: 460d mov r5, r1
16: 4604 mov r4, r0
for (size_t i = 0; i < n; i++) ll[i] = rr[i];
18: 2300 movs r3, #0
1a: 0897 lsrs r7, r2, #2
1c: f855 6b04 ldr.w r6, [r5], #4
20: 3301 adds r3, #1
22: 429f cmp r7, r3
24: f844 6b04 str.w r6, [r4], #4
28: d8f8 bhi.n 1c <memcpy3+0x1c>
2a: f022 0303 bic.w r3, r2, #3
2e: 429a cmp r2, r3
30: d00b beq.n 4a <memcpy3+0x4a>
32: 56cd ldrsb r5, [r1, r3]
34: 1c5c adds r4, r3, #1
36: 42a2 cmp r2, r4
38: 54c5 strb r5, [r0, r3]
3a: d906 bls.n 4a <memcpy3+0x4a>
3c: 570d ldrsb r5, [r1, r4]
3e: 3302 adds r3, #2
40: 429a cmp r2, r3
42: 5505 strb r5, [r0, r4]
44: d901 bls.n 4a <memcpy3+0x4a>
46: 56ca ldrsb r2, [r1, r3]
48: 54c2 strb r2, [r0, r3]
return s1;
}
4a: bcf0 pop {r4, r5, r6, r7}
4c: 4770 bx lr
4e: 3a01 subs r2, #1
50: 440a add r2, r1
52: 1e43 subs r3, r0, #1
54: 3901 subs r1, #1
for (size_t i = 0; i < n; i++) ll[i] = rr[i];
56: f911 4f01 ldrsb.w r4, [r1, #1]!
5a: 4291 cmp r1, r2
5c: f803 4f01 strb.w r4, [r3, #1]!
60: d1f9 bne.n 56 <memcpy3+0x56>
}
62: bcf0 pop {r4, r5, r6, r7}
64: 4770 bx lr
66: 4770 bx lr
Is it possible to configure the arm-gcc compiler not to use 32-bit accesses on unaligned addresses?

Use the -mno-unaligned-access flag to tell the compiler not to use unaligned accesses. On architectures that support them (ARMv6 and later, except ARMv6-M and ARMv8-M Baseline), GCC enables -munaligned-access by default.
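If changing the compiler flags is not an option, another approach (my own suggestion, not something the question or the flag above requires) is to route the copy through volatile byte pointers. In practice GCC keeps every volatile access as a separate access of the declared width, so the loop is not widened into 32-bit loads and stores. The name copy_bytes is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copies n bytes strictly one byte at a time.
   The volatile qualifiers keep the compiler from merging the byte
   accesses into wider ones, at the cost of speed. */
static void *copy_bytes(void *dst, const void *src, size_t n)
{
    /* Precondition check for the sketch; drop it on targets without
       assert support. */
    assert(n == 0 || (dst != NULL && src != NULL));

    volatile unsigned char *d = dst;
    const volatile unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}
```

This is slower than a word-wise copy, so it is probably best reserved for the memory region that really cannot take 32-bit unaligned accesses.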

Use the "-mcpu=" flag to set the processor type, not "-march=", as it covers more of the options.
The processor determines whether unaligned accesses are allowed to the bus interface. But an unaligned access will be translated into smaller parts before the access to the memory device. It really doesn't make sense to say that a memory does not support unaligned accesses, as it never sees them regardless of what the core does.

Self-written simple memset not working with -O3 eabi gcc on ARMv7

I wrote a very simple memset in C that works fine up to -O2 but not with -O3...
memset:
void * memset(void * blk, int c, size_t n)
{
unsigned char * dst = blk;
while (n-- > 0)
*dst++ = (unsigned char)c;
return blk;
}
...which compiles to this assembly when using -O2:
20000430 <memset>:
20000430: e3520000 cmp r2, #0 # compare param 'n' with zero
20000434: 012fff1e bxeq lr # if equal return to caller
20000438: e6ef1071 uxtb r1, r1 # else zero extend (extract byte from) param 'c'
2000043c: e0802002 add r2, r0, r2 # add pointer 'blk' to 'n'
20000440: e1a03000 mov r3, r0 # move pointer 'blk' to r3
20000444: e4c31001 strb r1, [r3], #1 # store value of 'c' to address of r3, increment r3 for next pass
20000448: e1530002 cmp r3, r2 # compare current store address to calculated max address
2000044c: 1afffffc bne 20000444 <memset+0x14> # if not equal store next byte
20000450: e12fff1e bx lr # else back to caller
This makes sense to me. I annotated what happens here.
When I compile it with -O3, the program crashes. My memset calls itself repeatedly until it has eaten the whole stack:
200005e4 <memset>:
200005e4: e3520000 cmp r2, #0 # compare param 'n' with zero
200005e8: e92d4010 push {r4, lr} # ? (1)
200005ec: e1a04000 mov r4, r0 # move pointer 'blk' to r4 (temp to hold return value)
200005f0: 0a000001 beq 200005fc <memset+0x18> # if equal (first line compare) jump to epilogue
200005f4: e6ef1071 uxtb r1, r1 # zero extend (extract byte from) param 'c'
200005f8: ebfffff9 bl 200005e4 <memset> # call myself ? (2)
200005fc: e1a00004 mov r0, r4 # epilogue start. move return value to r0
20000600: e8bd8010 pop {r4, pc} # restore r4 and back to caller
I can't figure out how this optimised version is supposed to work without any strb or similar. It doesn't matter if I try to set the memory to '0' or something else so the function is not only called on .bss (zero initialised) variables.
(1) This is a problem. This push gets endlessly repeated without a matching pop as it's called by (2) when the function doesn't early-exit because of 'n' being zero. I verified this with uart prints. Also r2 is never touched so why should the compare to zero ever become true?
Please help me understand what's happening here. Is the compiler assuming prerequisites that I may not fulfill?
Background: I'm using external code that requires memset in my baremetal project so I rolled my own. It's only used once on startup and not performance critical.
/edit: The compiler is called with these options:
arm-none-eabi-gcc -O3 -Wall -Wextra -fPIC -nostdlib -nostartfiles -marm -fstrict-volatile-bitfields -march=armv7-a -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon-vfpv3

Regarding (1): this follows from the calling convention. If a function makes a nested call, it must preserve the link register, and the stack must stay 64-bit aligned. The code uses r4, so that is the extra register saved. No magic there.
Regarding (2): the compiler is not deliberately calling your memset. It sees your loop as an inefficient memset and replaces it with a call to memset, which happens to be your own function. Fuz has provided the answer to your question.
Rename the function:
00000000 <xmemset>:
0: e3520000 cmp r2, #0
4: e92d4010 push {r4, lr}
8: e1a04000 mov r4, r0
c: 0a000001 beq 18 <xmemset+0x18>
10: e6ef1071 uxtb r1, r1
14: ebfffffe bl 0 <memset>
18: e1a00004 mov r0, r4
1c: e8bd8010 pop {r4, pc}
and you can see this.
If you were to use -ffreestanding as Fuz recommended then you see this or something like it
00000000 <xmemset>:
0: e3520000 cmp r2, #0
4: 012fff1e bxeq lr
8: e92d41f0 push {r4, r5, r6, r7, r8, lr}
c: e2426001 sub r6, r2, #1
10: e3560002 cmp r6, #2
14: e6efe071 uxtb lr, r1
18: 9a00002a bls c8 <xmemset+0xc8>
1c: e3a0c000 mov r12, #0
20: e3520023 cmp r2, #35 ; 0x23
24: e7c7c01e bfi r12, lr, #0, #8
28: e1a04122 lsr r4, r2, #2
2c: e7cfc41e bfi r12, lr, #8, #8
30: e7d7c81e bfi r12, lr, #16, #8
34: e7dfcc1e bfi r12, lr, #24, #8
38: 9a000024 bls d0 <xmemset+0xd0>
3c: e2445009 sub r5, r4, #9
40: e1a03000 mov r3, r0
44: e3c55007 bic r5, r5, #7
48: e3a07000 mov r7, #0
4c: e2851008 add r1, r5, #8
50: e1570005 cmp r7, r5
54: f5d3f0a0 pld [r3, #160] ; 0xa0
58: e1a08007 mov r8, r7
5c: e583c000 str r12, [r3]
60: e583c004 str r12, [r3, #4]
64: e2877008 add r7, r7, #8
68: e583c008 str r12, [r3, #8]
6c: e2833020 add r3, r3, #32
70: e503c014 str r12, [r3, #-20] ; 0xffffffec
74: e503c010 str r12, [r3, #-16]
78: e503c00c str r12, [r3, #-12]
7c: e503c008 str r12, [r3, #-8]
80: e503c004 str r12, [r3, #-4]
84: 1afffff1 bne 50 <xmemset+0x50>
88: e2811001 add r1, r1, #1
8c: e483c004 str r12, [r3], #4
90: e1540001 cmp r4, r1
94: 8afffffb bhi 88 <xmemset+0x88>
98: e3c23003 bic r3, r2, #3
9c: e1520003 cmp r2, r3
a0: e0466003 sub r6, r6, r3
a4: e0803003 add r3, r0, r3
a8: 08bd81f0 popeq {r4, r5, r6, r7, r8, pc}
ac: e3560000 cmp r6, #0
b0: e5c3e000 strb lr, [r3]
b4: 08bd81f0 popeq {r4, r5, r6, r7, r8, pc}
b8: e3560001 cmp r6, #1
bc: e5c3e001 strb lr, [r3, #1]
c0: 15c3e002 strbne lr, [r3, #2]
c4: e8bd81f0 pop {r4, r5, r6, r7, r8, pc}
c8: e1a03000 mov r3, r0
cc: eafffff6 b ac <xmemset+0xac>
d0: e1a03000 mov r3, r0
d4: e3a01000 mov r1, #0
d8: eaffffea b 88 <xmemset+0x88>
which looks as if it simply inlined memset: the faster one the compiler knows about, not your code.
So if you want it to use your code, stick with -O2. Yours is pretty inefficient, so I am not sure why you need to push the optimization any further.
20000444: e4c31001 strb r1, [r3], #1 # store value of 'c' to address of r3, increment r3 for next pass
20000448: e1530002 cmp r3, r2 # compare current store address to calculated max address
2000044c: 1afffffc bne 20000444 <memset+0x14> # if not equal store next byte
It isn't going to get any better than that without replacing your code with something else.
Fuz already answered the question:
Compile with -fno-builtin-memset. The compiler recognises that the function implements memset and thus replaces it with a call to memset. You should in general compile with -ffreestanding when writing bare-metal code. I believe this fixes this sort of problem, too
It is replacing your code with memset, if you want it not to do that use -ffreestanding.
If you wish to go beyond that and wonder why -fno-builtin-memset didn't work that is a question for the gcc folks, file a ticket, let us know what they say (or just look at the compiler source code).
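As a further option that the answers above do not mention (so treat it as my assumption, not as the confirmed fix for this case): in GCC, turning store loops into memset calls is done by the loop distribution pass, which -fno-tree-loop-distribute-patterns disables. The sketch below applies that switch to a single function via the optimize attribute. The macro NO_LOOP_TO_LIBCALL and the name my_memset are made up; the function is not called memset here so that the example also builds in a hosted environment.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: GCC. Other compilers get an empty macro and may still
   apply their own loop-to-memset transformation. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_LOOP_TO_LIBCALL \
    __attribute__((optimize("-fno-tree-loop-distribute-patterns")))
#else
#define NO_LOOP_TO_LIBCALL
#endif

/* Same byte loop as in the question, with the pattern recognition
   switched off for this one function. */
NO_LOOP_TO_LIBCALL
void *my_memset(void *blk, int c, size_t n)
{
    assert(n == 0 || blk != NULL);

    unsigned char *dst = blk;
    while (n-- > 0)
        *dst++ = (unsigned char)c;
    return blk;
}
```

Checking the disassembly for a bl to memset inside the function is still the only reliable way to confirm that the replacement is gone for a given compiler version.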

Bare metal C Function not working

I have been writing a kernel for the Raspberry Pi 2 in C, following the Valvers and Baking Pi (written in assembly) tutorials. But each time I try to port the function that sets a pin to output from the Baking Pi OK03 tutorial to C, the LED stops blinking (although the code compiles just fine). I have rewritten the function several times but I cannot get it to work.
Here is my code:
main.c:
#include "gpio.h"
int main(void) __attribute__((naked));
int main(void)
{
gpio = (unsigned int*)GPIO_BASE;
/* Write 1 to the GPIO16 init nibble in the Function Select 1 GPIO
peripheral register to enable GPIO16 as an output */
// gpio[LED_GPFSEL] |= (1 << LED_GPFBIT);
pinMode(47, 1);
// Never return from here
while(1)
{
for(tim = 0; tim < 500000; tim++)
;
/* Set the LED GPIO pin low ( Turn OK LED on for original Pi, and off
for plus models )*/
gpio[LED_GPCLR] = (1 << LED_GPIO_BIT);
for(tim = 0; tim < 500000; tim++)
;
/* Set the LED GPIO pin high ( Turn OK LED off for original Pi, and on
for plus models )*/
gpio[LED_GPSET] = (1 << LED_GPIO_BIT);
}
}
gpio.h:
#ifndef GPIO_H
#define GPIO_H
#ifdef RPI2
#define GPIO_BASE 0x3F200000UL
#else
#define GPIO_BASE 0x20200000UL
#endif
#if defined( RPIBPLUS ) || defined( RPI2 )
#define LED_GPFSEL GPIO_GPFSEL4
#define LED_GPFBIT 21
#define LED_GPSET GPIO_GPSET1
#define LED_GPCLR GPIO_GPCLR1
#define LED_GPIO_BIT 15
#else
#define LED_GPFSEL GPIO_GPFSEL1
#define LED_GPFBIT 18
#define LED_GPSET GPIO_GPSET0
#define LED_GPCLR GPIO_GPCLR0
#define LED_GPIO_BIT 16
#endif
#define GPIO_GPFSEL0 0
#define GPIO_GPFSEL1 1
#define GPIO_GPFSEL2 2
#define GPIO_GPFSEL3 3
#define GPIO_GPFSEL4 4
#define GPIO_GPFSEL5 5
#define GPIO_GPSET0 7
#define GPIO_GPSET1 8
#define GPIO_GPCLR0 10
#define GPIO_GPCLR1 11
#define GPIO_GPLEV0 13
#define GPIO_GPLEV1 14
#define GPIO_GPEDS0 16
#define GPIO_GPEDS1 17
#define GPIO_GPREN0 19
#define GPIO_GPREN1 20
#define GPIO_GPFEN0 22
#define GPIO_GPFEN1 23
#define GPIO_GPHEN0 25
#define GPIO_GPHEN1 26
#define GPIO_GPLEN0 28
#define GPIO_GPLEN1 29
#define GPIO_GPAREN0 31
#define GPIO_GPAREN1 32
#define GPIO_GPAFEN0 34
#define GPIO_GPAFEN1 35
#define GPIO_GPPUD 37
#define GPIO_GPPUDCLK0 38
#define GPIO_GPPUDCLK1 39
/** GPIO Register set */
volatile unsigned int* gpio;
/** Simple loop variable */
volatile unsigned int tim;
// Function to change a pin's mode
int pinMode(int pinnum, int mode);
#endif
gpio.c:
#include "gpio.h"
/* Docs: The Pi has 54 GPIO pins and 6 Function Select Registers (FSR). Each FSR
controls 10 GPIO pins and each FSR is 32 bits wide (each pin is controlled by
3 bits of the FSR, so 30 bits are used). The field for a pin starts at bit
3n, where n is the pin number within its FSR, so n must be less than or equal to 9. Therefore, if the pin number is higher than 9, 10 is subtracted from it and the fsr variable is incremented by one. This repeats until the pin number is less than or equal to 9. */
int pinMode(int pinnum, int mode) {
// Variable declaration and initialization
int fsr = 0;
int fsrbit;
// Let's check the pin does exist
if (pinnum < 0 || pinnum > 53) {
// Abort, there is no such pin.
return 1;
}
else if (mode < 0 || mode > 1) {
// Abort, invalid mode (Actually there are 8 modes but we will only use 2)
return 1;
}
// Create a pointer to the GPIO peripheral register so we can speak to it
gpio = (unsigned int*)GPIO_BASE;
// And calculate which FSR we should use
/* do {
if (pinnum > 9) {
pinnum -= 10;
fsr++;
}
} while (pinnum > 9); */
if (pinnum > 9) {
while (pinnum > 9) {
pinnum -= 10;
fsr++;
}
}
// Then we calculate the bit offset within the FSR to use
fsrbit = pinnum * 3;
// Finally let's set the pin to the desired mode
gpio[fsr] |= (mode << fsrbit);
return 0;
}
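The register and bit arithmetic described in the comment of gpio.c can also be written with division and remainder instead of a loop. The sketch below only illustrates that arithmetic and is not a diagnosis of the blinking problem. The names fsel_index, fsel_shift and fsel_update are hypothetical, and clearing the 3-bit field before setting it is an addition that the code above does not do.

```c
#include <assert.h>

/* Index of the function-select register holding the field for `pin`
   (10 pins per register). Hypothetical helper. */
static int fsel_index(int pin)
{
    assert(pin >= 0 && pin <= 53);
    return pin / 10;
}

/* Bit position of the 3-bit field for `pin` inside that register.
   Hypothetical helper. */
static int fsel_shift(int pin)
{
    assert(pin >= 0 && pin <= 53);
    return (pin % 10) * 3;
}

/* New register value: clear the pin's 3-bit field, then insert `mode`.
   Works on a plain value so it can be tried without hardware. */
static unsigned int fsel_update(unsigned int old, int pin, unsigned int mode)
{
    unsigned int shift = (unsigned int)fsel_shift(pin);
    return (old & ~(7u << shift)) | ((mode & 7u) << shift);
}
```

For pin 47 this yields register index 4 and bit 21, which agrees with LED_GPFSEL (GPIO_GPFSEL4) and LED_GPFBIT (21) in gpio.h. For pin 16 it yields index 1 and bit 18, matching the values in the other branch of the header.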
Please help, I have been stuck with this problem for weeks.
PS: In case you want the kernel disassembly:
kernel_disassembly.asm:
./kernel.elf: file format elf32-littlearm
Disassembly of section .text:
00008000 <main>:
8000: e3082200 movw r2, #33280 ; 0x8200
8004: e3402001 movt r2, #1
8008: e3a03000 mov r3, #0
800c: e3433f20 movt r3, #16160 ; 0x3f20
8010: e5823000 str r3, [r2]
8014: e3a01001 mov r1, #1
8018: e3a0002f mov r0, #47 ; 0x2f
801c: eb000032 bl 80ec <pinMode>
8020: e30831fc movw r3, #33276 ; 0x81fc
8024: e3403001 movt r3, #1
8028: e3a02000 mov r2, #0
802c: e5832000 str r2, [r3]
8030: ea000006 b 8050 <main+0x50>
8034: e30831fc movw r3, #33276 ; 0x81fc
8038: e3403001 movt r3, #1
803c: e5933000 ldr r3, [r3]
8040: e2832001 add r2, r3, #1
8044: e30831fc movw r3, #33276 ; 0x81fc
8048: e3403001 movt r3, #1
804c: e5832000 str r2, [r3]
8050: e30831fc movw r3, #33276 ; 0x81fc
8054: e3403001 movt r3, #1
8058: e5932000 ldr r2, [r3]
805c: e30a311f movw r3, #41247 ; 0xa11f
8060: e3403007 movt r3, #7
8064: e1520003 cmp r2, r3
8068: 9afffff1 bls 8034 <main+0x34>
806c: e3083200 movw r3, #33280 ; 0x8200
8070: e3403001 movt r3, #1
8074: e5933000 ldr r3, [r3]
8078: e283302c add r3, r3, #44 ; 0x2c
807c: e3a02902 mov r2, #32768 ; 0x8000
8080: e5832000 str r2, [r3]
8084: e30831fc movw r3, #33276 ; 0x81fc
8088: e3403001 movt r3, #1
808c: e3a02000 mov r2, #0
8090: e5832000 str r2, [r3]
8094: ea000006 b 80b4 <main+0xb4>
8098: e30831fc movw r3, #33276 ; 0x81fc
809c: e3403001 movt r3, #1
80a0: e5933000 ldr r3, [r3]
80a4: e2832001 add r2, r3, #1
80a8: e30831fc movw r3, #33276 ; 0x81fc
80ac: e3403001 movt r3, #1
80b0: e5832000 str r2, [r3]
80b4: e30831fc movw r3, #33276 ; 0x81fc
80b8: e3403001 movt r3, #1
80bc: e5932000 ldr r2, [r3]
80c0: e30a311f movw r3, #41247 ; 0xa11f
80c4: e3403007 movt r3, #7
80c8: e1520003 cmp r2, r3
80cc: 9afffff1 bls 8098 <main+0x98>
80d0: e3083200 movw r3, #33280 ; 0x8200
80d4: e3403001 movt r3, #1
80d8: e5933000 ldr r3, [r3]
80dc: e2833020 add r3, r3, #32
80e0: e3a02902 mov r2, #32768 ; 0x8000
80e4: e5832000 str r2, [r3]
80e8: eaffffcc b 8020 <main+0x20>
000080ec <pinMode>:
80ec: e52db004 push {fp} ; (str fp, [sp, #-4]!)
80f0: e28db000 add fp, sp, #0
80f4: e24dd014 sub sp, sp, #20
80f8: e50b0010 str r0, [fp, #-16]
80fc: e50b1014 str r1, [fp, #-20] ; 0xffffffec
8100: e3a03000 mov r3, #0
8104: e50b3008 str r3, [fp, #-8]
8108: e51b3010 ldr r3, [fp, #-16]
810c: e3530000 cmp r3, #0
8110: ba000002 blt 8120 <pinMode+0x34>
8114: e51b3010 ldr r3, [fp, #-16]
8118: e3530035 cmp r3, #53 ; 0x35
811c: da000001 ble 8128 <pinMode+0x3c>
8120: e3a03001 mov r3, #1
8124: ea000030 b 81ec <pinMode+0x100>
8128: e51b3014 ldr r3, [fp, #-20] ; 0xffffffec
812c: e3530000 cmp r3, #0
8130: ba000002 blt 8140 <pinMode+0x54>
8134: e51b3014 ldr r3, [fp, #-20] ; 0xffffffec
8138: e3530001 cmp r3, #1
813c: da000001 ble 8148 <pinMode+0x5c>
8140: e3a03001 mov r3, #1
8144: ea000028 b 81ec <pinMode+0x100>
8148: e3082200 movw r2, #33280 ; 0x8200
814c: e3402001 movt r2, #1
8150: e3a03000 mov r3, #0
8154: e3433f20 movt r3, #16160 ; 0x3f20
8158: e5823000 str r3, [r2]
815c: e51b3010 ldr r3, [fp, #-16]
8160: e3530009 cmp r3, #9
8164: da000009 ble 8190 <pinMode+0xa4>
8168: ea000005 b 8184 <pinMode+0x98>
816c: e51b3010 ldr r3, [fp, #-16]
8170: e243300a sub r3, r3, #10
8174: e50b3010 str r3, [fp, #-16]
8178: e51b3008 ldr r3, [fp, #-8]
817c: e2833001 add r3, r3, #1
8180: e50b3008 str r3, [fp, #-8]
8184: e51b3010 ldr r3, [fp, #-16]
8188: e3530009 cmp r3, #9
818c: cafffff6 bgt 816c <pinMode+0x80>
8190: e51b3010 ldr r3, [fp, #-16]
8194: e3a02003 mov r2, #3
8198: e0030392 mul r3, r2, r3
819c: e50b300c str r3, [fp, #-12]
81a0: e3083200 movw r3, #33280 ; 0x8200
81a4: e3403001 movt r3, #1
81a8: e5932000 ldr r2, [r3]
81ac: e51b3008 ldr r3, [fp, #-8]
81b0: e1a03103 lsl r3, r3, #2
81b4: e0822003 add r2, r2, r3
81b8: e3083200 movw r3, #33280 ; 0x8200
81bc: e3403001 movt r3, #1
81c0: e5931000 ldr r1, [r3]
81c4: e51b3008 ldr r3, [fp, #-8]
81c8: e1a03103 lsl r3, r3, #2
81cc: e0813003 add r3, r1, r3
81d0: e5933000 ldr r3, [r3]
81d4: e51b0014 ldr r0, [fp, #-20] ; 0xffffffec
81d8: e51b100c ldr r1, [fp, #-12]
81dc: e1a01110 lsl r1, r0, r1
81e0: e1833001 orr r3, r3, r1
81e4: e5823000 str r3, [r2]
81e8: e3a03000 mov r3, #0
81ec: e1a00003 mov r0, r3
81f0: e24bd000 sub sp, fp, #0
81f4: e49db004 pop {fp} ; (ldr fp, [sp], #4)
81f8: e12fff1e bx lr
Disassembly of section .bss:
000181fc <__bss_start>:
181fc: 00000000 andeq r0, r0, r0
00018200 <gpio>:
18200: 00000000 andeq r0, r0, r0
Disassembly of section .comment:
00000000 <.comment>:
0: 3a434347 bcc 10d0d24 <_stack+0x1050d24>
4: 4e472820 cdpmi 8, 4, cr2, cr7, cr0, {1}
8: 6f542055 svcvs 0x00542055
c: 20736c6f rsbscs r6, r3, pc, ror #24
10: 20726f66 rsbscs r6, r2, r6, ror #30
14: 204d5241 subcs r5, sp, r1, asr #4
18: 65626d45 strbvs r6, [r2, #-3397]! ; 0xfffff2bb
1c: 64656464 strbtvs r6, [r5], #-1124 ; 0xfffffb9c
20: 6f725020 svcvs 0x00725020
24: 73736563 cmnvc r3, #415236096 ; 0x18c00000
28: 2973726f ldmdbcs r3!, {r0, r1, r2, r3, r5, r6, r9, ip, sp, lr}^
2c: 342e3520 strtcc r3, [lr], #-1312 ; 0xfffffae0
30: 3220312e eorcc r3, r0, #-2147483637 ; 0x8000000b
34: 30363130 eorscc r3, r6, r0, lsr r1
38: 20393139 eorscs r3, r9, r9, lsr r1
3c: 6c657228 sfmvs f7, 2, [r5], #-160 ; 0xffffff60
40: 65736165 ldrbvs r6, [r3, #-357]! ; 0xfffffe9b
44: 415b2029 cmpmi fp, r9, lsr #32
48: 652f4d52 strvs r4, [pc, #-3410]! ; fffff2fe <_stack+0xfff7f2fe>
4c: 6465626d strbtvs r6, [r5], #-621 ; 0xfffffd93
50: 2d646564 cfstr64cs mvdx6, [r4, #-400]! ; 0xfffffe70
54: 72622d35 rsbvc r2, r2, #3392 ; 0xd40
58: 68636e61 stmdavs r3!, {r0, r5, r6, r9, sl, fp, sp, lr}^
5c: 76657220 strbtvc r7, [r5], -r0, lsr #4
60: 6f697369 svcvs 0x00697369
64: 3432206e ldrtcc r2, [r2], #-110 ; 0xffffff92
68: 36393430 ; <UNDEFINED> instruction: 0x36393430
6c: Address 0x000000000000006c is out of bounds.
Disassembly of section .debug_aranges:
00000000 <.debug_aranges>:
0: 0000001c andeq r0, r0, ip, lsl r0
4: 00000002 andeq r0, r0, r2
8: 00040000 andeq r0, r4, r0
c: 00000000 andeq r0, r0, r0
10: 00008000 andeq r8, r0, r0
14: 000000ec andeq r0, r0, ip, ror #1
...
20: 0000001c andeq r0, r0, ip, lsl r0
24: 00760002 rsbseq r0, r6, r2
28: 00040000 andeq r0, r4, r0
2c: 00000000 andeq r0, r0, r0
30: 000080ec andeq r8, r0, ip, ror #1
34: 00000110 andeq r0, r0, r0, lsl r1
...
Disassembly of section .debug_info:
00000000 <.debug_info>:
0: 00000072 andeq r0, r0, r2, ror r0
4: 00000004 andeq r0, r0, r4
8: 01040000 mrseq r0, (UNDEF: 4)
c: 0000003f andeq r0, r0, pc, lsr r0
10: 0000340c andeq r3, r0, ip, lsl #8
14: 00000d00 andeq r0, r0, r0, lsl #26
18: 00800000 addeq r0, r0, r0
1c: 0000ec00 andeq lr, r0, r0, lsl #24
20: 00000000 andeq r0, r0, r0
24: 00d10200 sbcseq r0, r1, r0, lsl #4
28: 04010000 streq r0, [r1], #-0
2c: 0000003a andeq r0, r0, sl, lsr r0
30: 00008000 andeq r8, r0, r0
34: 000000ec andeq r0, r0, ip, ror #1
38: 04039c01 streq r9, [r3], #-3073 ; 0xfffff3ff
3c: 746e6905 strbtvc r6, [lr], #-2309 ; 0xfffff6fb
40: 002f0400 eoreq r0, pc, r0, lsl #8
44: 42020000 andmi r0, r2, #0
48: 00000052 andeq r0, r0, r2, asr r0
4c: 82000305 andhi r0, r0, #335544320 ; 0x14000000
50: 04050001 streq r0, [r5], #-1
54: 0000005f andeq r0, r0, pc, asr r0
58: 00070406 andeq r0, r7, r6, lsl #8
5c: 07000000 streq r0, [r0, -r0]
60: 00000058 andeq r0, r0, r8, asr r0
64: 6d697408 cfstrdvs mvd7, [r9, #-32]! ; 0xffffffe0
68: 5f450200 svcpl 0x00450200
6c: 05000000 streq r0, [r0, #-0]
70: 0181fc03 orreq pc, r1, r3, lsl #24
74: 00af0000 adceq r0, pc, r0
78: 00040000 andeq r0, r4, r0
7c: 00000076 andeq r0, r0, r6, ror r0
80: 003f0104 eorseq r0, pc, r4, lsl #2
84: dd0c0000 stcle 0, cr0, [ip, #-0]
88: 0d000000 stceq 0, cr0, [r0, #-0]
8c: ec000000 stc 0, cr0, [r0], {-0}
90: 10000080 andne r0, r0, r0, lsl #1
94: 61000001 tstvs r0, r1
98: 02000000 andeq r0, r0, #0
9c: 000000ef andeq r0, r0, pc, ror #1
a0: 00770b01 rsbseq r0, r7, r1, lsl #22
a4: 80ec0000 rschi r0, ip, r0
a8: 01100000 tsteq r0, r0
ac: 9c010000 stcls 0, cr0, [r1], {-0}
b0: 00000077 andeq r0, r0, r7, ror r0
b4: 0000d603 andeq sp, r0, r3, lsl #12
b8: 770b0100 strvc r0, [fp, -r0, lsl #2]
bc: 02000000 andeq r0, r0, #0
c0: f7036c91 ; <UNDEFINED> instruction: 0xf7036c91
c4: 01000000 mrseq r0, (UNDEF: 0)
c8: 0000770b andeq r7, r0, fp, lsl #14
cc: 68910200 ldmvs r1, {r9}
d0: 72736604 rsbsvc r6, r3, #4, 12 ; 0x400000
d4: 770d0100 strvc r0, [sp, -r0, lsl #2]
d8: 02000000 andeq r0, r0, #0
dc: e8057491 stmda r5, {r0, r4, r7, sl, ip, sp, lr}
e0: 01000000 mrseq r0, (UNDEF: 0)
e4: 0000770e andeq r7, r0, lr, lsl #14
e8: 70910200 addsvc r0, r1, r0, lsl #4
ec: 05040600 streq r0, [r4, #-1536] ; 0xfffffa00
f0: 00746e69 rsbseq r6, r4, r9, ror #28
f4: 00002f07 andeq r2, r0, r7, lsl #30
f8: 8f420200 svchi 0x00420200
fc: 05000000 streq r0, [r0, #-0]
100: 01820003 orreq r0, r2, r3
104: 9c040800 stcls 8, cr0, [r4], {-0}
108: 09000000 stmdbeq r0, {} ; <UNPREDICTABLE>
10c: 00000704 andeq r0, r0, r4, lsl #14
110: 950a0000 strls r0, [sl, #-0]
114: 0b000000 bleq 11c <main-0x7ee4>
118: 006d6974 rsbeq r6, sp, r4, ror r9
11c: 009c4502 addseq r4, ip, r2, lsl #10
120: 03050000 movweq r0, #20480 ; 0x5000
124: 000181fc strdeq r8, [r1], -ip
...
Disassembly of section .debug_abbrev:
00000000 <.debug_abbrev>:
0: 25011101 strcs r1, [r1, #-257] ; 0xfffffeff
4: 030b130e movweq r1, #45838 ; 0xb30e
8: 110e1b0e tstne lr, lr, lsl #22
c: 10061201 andne r1, r6, r1, lsl #4
10: 02000017 andeq r0, r0, #23
14: 193f002e ldmdbne pc!, {r1, r2, r3, r5} ; <UNPREDICTABLE>
18: 0b3a0e03 bleq e8382c <_stack+0xe0382c>
1c: 19270b3b stmdbne r7!, {r0, r1, r3, r4, r5, r8, r9, fp}
20: 01111349 tsteq r1, r9, asr #6
24: 18400612 stmdane r0, {r1, r4, r9, sl}^
28: 00194296 mulseq r9, r6, r2
2c: 00240300 eoreq r0, r4, r0, lsl #6
30: 0b3e0b0b bleq f82c64 <_stack+0xf02c64>
34: 00000803 andeq r0, r0, r3, lsl #16
38: 03003404 movweq r3, #1028 ; 0x404
3c: 3b0b3a0e blcc 2ce87c <_stack+0x24e87c>
40: 3f13490b svccc 0x0013490b
44: 00180219 andseq r0, r8, r9, lsl r2
48: 000f0500 andeq r0, pc, r0, lsl #10
4c: 13490b0b movtne r0, #39691 ; 0x9b0b
50: 24060000 strcs r0, [r6], #-0
54: 3e0b0b00 vmlacc.f64 d0, d11, d0
58: 000e030b andeq r0, lr, fp, lsl #6
5c: 00350700 eorseq r0, r5, r0, lsl #14
60: 00001349 andeq r1, r0, r9, asr #6
64: 03003408 movweq r3, #1032 ; 0x408
68: 3b0b3a08 blcc 2ce890 <_stack+0x24e890>
6c: 3f13490b svccc 0x0013490b
70: 00180219 andseq r0, r8, r9, lsl r2
74: 11010000 mrsne r0, (UNDEF: 1)
78: 130e2501 movwne r2, #58625 ; 0xe501
7c: 1b0e030b blne 380cb0 <_stack+0x300cb0>
80: 1201110e andne r1, r1, #-2147483645 ; 0x80000003
84: 00171006 andseq r1, r7, r6
88: 012e0200 ; <UNDEFINED> instruction: 0x012e0200
8c: 0e03193f mcreq 9, 0, r1, cr3, cr15, {1}
90: 0b3b0b3a bleq ec2d80 <_stack+0xe42d80>
94: 13491927 movtne r1, #39207 ; 0x9927
98: 06120111 ; <UNDEFINED> instruction: 0x06120111
9c: 42971840 addsmi r1, r7, #64, 16 ; 0x400000
a0: 00130119 andseq r0, r3, r9, lsl r1
a4: 00050300 andeq r0, r5, r0, lsl #6
a8: 0b3a0e03 bleq e838bc <_stack+0xe038bc>
ac: 13490b3b movtne r0, #39739 ; 0x9b3b
b0: 00001802 andeq r1, r0, r2, lsl #16
b4: 03003404 movweq r3, #1028 ; 0x404
b8: 3b0b3a08 blcc 2ce8e0 <_stack+0x24e8e0>
bc: 0213490b andseq r4, r3, #180224 ; 0x2c000
c0: 05000018 streq r0, [r0, #-24] ; 0xffffffe8
c4: 0e030034 mcreq 0, 0, r0, cr3, cr4, {1}
c8: 0b3b0b3a bleq ec2db8 <_stack+0xe42db8>
cc: 18021349 stmdane r2, {r0, r3, r6, r8, r9, ip}
d0: 24060000 strcs r0, [r6], #-0
d4: 3e0b0b00 vmlacc.f64 d0, d11, d0
d8: 0008030b andeq r0, r8, fp, lsl #6
dc: 00340700 eorseq r0, r4, r0, lsl #14
e0: 0b3a0e03 bleq e838f4 <_stack+0xe038f4>
e4: 13490b3b movtne r0, #39739 ; 0x9b3b
e8: 1802193f stmdane r2, {r0, r1, r2, r3, r4, r5, r8, fp, ip}
ec: 0f080000 svceq 0x00080000
f0: 490b0b00 stmdbmi fp, {r8, r9, fp}
f4: 09000013 stmdbeq r0, {r0, r1, r4}
f8: 0b0b0024 bleq 2c0190 <_stack+0x240190>
fc: 0e030b3e vmoveq.16 d3[0], r0
100: 350a0000 strcc r0, [sl, #-0]
104: 00134900 andseq r4, r3, r0, lsl #18
108: 00340b00 eorseq r0, r4, r0, lsl #22
10c: 0b3a0803 bleq e82120 <_stack+0xe02120>
110: 13490b3b movtne r0, #39739 ; 0x9b3b
114: 1802193f stmdane r2, {r0, r1, r2, r3, r4, r5, r8, fp, ip}
118: Address 0x0000000000000118 is out of bounds.
Disassembly of section .debug_line:
00000000 <.debug_line>:
0: 0000005d andeq r0, r0, sp, asr r0
4: 002b0002 eoreq r0, fp, r2
8: 01020000 mrseq r0, (UNDEF: 2)
c: 000d0efb strdeq r0, [sp], -fp
10: 01010101 tsteq r1, r1, lsl #2
14: 01000000 mrseq r0, (UNDEF: 0)
18: 73010000 movwvc r0, #4096 ; 0x1000
1c: 00006372 andeq r6, r0, r2, ror r3
20: 6e69616d powvsez f6, f1, #5.0
24: 0100632e tsteq r0, lr, lsr #6
28: 70670000 rsbvc r0, r7, r0
2c: 682e6f69 stmdavs lr!, {r0, r3, r5, r6, r8, r9, sl, fp, sp, lr}
30: 00000100 andeq r0, r0, r0, lsl #2
34: 02050000 andeq r0, r5, #0
38: 00008000 andeq r8, r0, r0
3c: 6ba31316 blvs fe8c4c9c <_stack+0xfe844c9c>
40: 03040200 movweq r0, #16896 ; 0x4200
44: 02009e06 andeq r9, r0, #6, 28 ; 0x60
48: 06d60104 ldrbeq r0, [r6], r4, lsl #2
4c: 0200bcdb andeq fp, r0, #56064 ; 0xdb00
50: 9e060304 cdpls 3, 0, cr0, cr6, cr4, {0}
54: 01040200 mrseq r0, R12_usr
58: bbdb06d6 bllt ff6c1bb8 <_stack+0xff641bb8>
5c: 01000202 tsteq r0, r2, lsl #4
60: 00005f01 andeq r5, r0, r1, lsl #30
64: 2b000200 blcs 86c <main-0x7794>
68: 02000000 andeq r0, r0, #0
6c: 0d0efb01 vstreq d15, [lr, #-4]
70: 01010100 mrseq r0, (UNDEF: 17)
74: 00000001 andeq r0, r0, r1
78: 01000001 tsteq r0, r1
7c: 00637273 rsbeq r7, r3, r3, ror r2
80: 69706700 ldmdbvs r0!, {r8, r9, sl, sp, lr}^
84: 00632e6f rsbeq r2, r3, pc, ror #28
88: 67000001 strvs r0, [r0, -r1]
8c: 2e6f6970 mcrcs 9, 3, r6, cr15, cr0, {3}
90: 00010068 andeq r0, r1, r8, rrx
94: 05000000 streq r0, [r0, #-0]
98: 0080ec02 addeq lr, r0, r2, lsl #24
9c: 010a0300 mrseq r0, (UNDEF: 58)
a0: 02004da0 andeq r4, r0, #160, 26 ; 0x2800
a4: 66060104 strvs r0, [r6], -r4, lsl #2
a8: 004c6806 subeq r6, ip, r6, lsl #16
ac: 06010402 streq r0, [r1], -r2, lsl #8
b0: 4d680666 stclmi 6, cr0, [r8, #-408]! ; 0xfffffe68
b4: 672f67a6 strvs r6, [pc, -r6, lsr #15]!
b8: 02846c64 addeq r6, r4, #100, 24 ; 0x6400
bc: 022f1424 eoreq r1, pc, #36, 8 ; 0x24000000
c0: 01010008 tsteq r1, r8
Disassembly of section .debug_frame:
00000000 <.debug_frame>:
0: 0000000c andeq r0, r0, ip
4: ffffffff ; <UNDEFINED> instruction: 0xffffffff
8: 7c020001 stcvc 0, cr0, [r2], {1}
c: 000d0c0e andeq r0, sp, lr, lsl #24
10: 0000000c andeq r0, r0, ip
14: 00000000 andeq r0, r0, r0
18: 00008000 andeq r8, r0, r0
1c: 000000ec andeq r0, r0, ip, ror #1
20: 0000000c andeq r0, r0, ip
24: ffffffff ; <UNDEFINED> instruction: 0xffffffff
28: 7c020001 stcvc 0, cr0, [r2], {1}
2c: 000d0c0e andeq r0, sp, lr, lsl #24
30: 0000001c andeq r0, r0, ip, lsl r0
34: 00000020 andeq r0, r0, r0, lsr #32
38: 000080ec andeq r8, r0, ip, ror #1
3c: 00000110 andeq r0, r0, r0, lsl r1
40: 8b040e42 blhi 103950 <_stack+0x83950>
44: 0b0d4201 bleq 350850 <_stack+0x2d0850>
48: 0d0d8002 stceq 0, cr8, [sp, #-8]
4c: 000ecb42 andeq ip, lr, r2, asr #22
Disassembly of section .debug_str:
00000000 <.debug_str>:
0: 69736e75 ldmdbvs r3!, {r0, r2, r4, r5, r6, r9, sl, fp, sp, lr}^
4: 64656e67 strbtvs r6, [r5], #-3687 ; 0xfffff199
8: 746e6920 strbtvc r6, [lr], #-2336 ; 0xfffff6e0
c: 73552f00 cmpvc r5, #0, 30
10: 2f737265 svccs 0x00737265
14: 6f63614a svcvs 0x0063614a
18: 66666f53 uqsaxvs r6, r6, r3
1c: 7365442f cmnvc r5, #788529152 ; 0x2f000000
20: 706f746b rsbvc r7, pc, fp, ror #8
24: 6964452f stmdbvs r4!, {r0, r1, r2, r3, r5, r8, sl, lr}^
28: 2d6e6f73 stclcs 15, cr6, [lr, #-460]! ; 0xfffffe34
2c: 67005452 smlsdvs r0, r2, r4, r5
30: 006f6970 rsbeq r6, pc, r0, ror r9 ; <UNPREDICTABLE>
34: 2f637273 svccs 0x00637273
38: 6e69616d powvsez f6, f1, #5.0
3c: 4700632e strmi r6, [r0, -lr, lsr #6]
40: 4320554e ; <UNDEFINED> instruction: 0x4320554e
44: 35203131 strcc r3, [r0, #-305]! ; 0xfffffecf
48: 312e342e ; <UNDEFINED> instruction: 0x312e342e
4c: 31303220 teqcc r0, r0, lsr #4
50: 31393036 teqcc r9, r6, lsr r0
54: 72282039 eorvc r2, r8, #57 ; 0x39
58: 61656c65 cmnvs r5, r5, ror #24
5c: 20296573 eorcs r6, r9, r3, ror r5
60: 4d52415b ldfmie f4, [r2, #-364] ; 0xfffffe94
64: 626d652f rsbvs r6, sp, #197132288 ; 0xbc00000
68: 65646465 strbvs r6, [r4, #-1125]! ; 0xfffffb9b
6c: 2d352d64 ldccs 13, cr2, [r5, #-400]! ; 0xfffffe70
70: 6e617262 cdpvs 2, 6, cr7, cr1, cr2, {3}
74: 72206863 eorvc r6, r0, #6488064 ; 0x630000
78: 73697665 cmnvc r9, #105906176 ; 0x6500000
7c: 206e6f69 rsbcs r6, lr, r9, ror #30
80: 34303432 ldrtcc r3, [r0], #-1074 ; 0xfffffbce
84: 205d3639 subscs r3, sp, r9, lsr r6
88: 70666d2d rsbvc r6, r6, sp, lsr #26
8c: 656e3d75 strbvs r3, [lr, #-3445]! ; 0xfffff28b
90: 762d6e6f strtvc r6, [sp], -pc, ror #28
94: 34767066 ldrbtcc r7, [r6], #-102 ; 0xffffff9a
98: 666d2d20 strbtvs r2, [sp], -r0, lsr #26
9c: 74616f6c strbtvc r6, [r1], #-3948 ; 0xfffff094
a0: 6962612d stmdbvs r2!, {r0, r2, r3, r5, r8, sp, lr}^
a4: 7261683d rsbvc r6, r1, #3997696 ; 0x3d0000
a8: 6d2d2064 stcvs 0, cr2, [sp, #-400]! ; 0xfffffe70
ac: 68637261 stmdavs r3!, {r0, r5, r6, r9, ip, sp, lr}^
b0: 6d72613d ldfvse f6, [r2, #-244]! ; 0xffffff0c
b4: 612d3776 ; <UNDEFINED> instruction: 0x612d3776
b8: 746d2d20 strbtvc r2, [sp], #-3360 ; 0xfffff2e0
bc: 3d656e75 stclcc 14, cr6, [r5, #-468]! ; 0xfffffe2c
c0: 74726f63 ldrbtvc r6, [r2], #-3939 ; 0xfffff09d
c4: 612d7865 ; <UNDEFINED> instruction: 0x612d7865
c8: 672d2037 ; <UNDEFINED> instruction: 0x672d2037
cc: 304f2d20 subcc r2, pc, r0, lsr #26
d0: 69616d00 stmdbvs r1!, {r8, sl, fp, sp, lr}^
d4: 6970006e ldmdbvs r0!, {r1, r2, r3, r5, r6}^
d8: 6d756e6e ldclvs 14, cr6, [r5, #-440]! ; 0xfffffe48
dc: 63727300 cmnvs r2, #0, 6
e0: 6970672f ldmdbvs r0!, {r0, r1, r2, r3, r5, r8, r9, sl, sp, lr}^
e4: 00632e6f rsbeq r2, r3, pc, ror #28
e8: 62727366 rsbsvs r7, r2, #-1744830463 ; 0x98000001
ec: 70007469 andvc r7, r0, r9, ror #8
f0: 6f4d6e69 svcvs 0x004d6e69
f4: 6d006564 cfstr32vs mvfx6, [r0, #-400] ; 0xfffffe70
f8: 0065646f rsbeq r6, r5, pc, ror #8
Disassembly of section .ARM.attributes:
00000000 <_stack-0x80000>:
0: 00003441 andeq r3, r0, r1, asr #8
4: 61656100 cmnvs r5, r0, lsl #2
8: 01006962 tsteq r0, r2, ror #18
c: 0000002a andeq r0, r0, sl, lsr #32
10: 412d3705 ; <UNDEFINED> instruction: 0x412d3705
14: 070a0600 streq r0, [sl, -r0, lsl #12]
18: 09010841 stmdbeq r1, {r0, r6, fp}
1c: 0c050a02 stceq 10, cr0, [r5], {2}
20: 14041202 strne r1, [r4], #-514 ; 0xfffffdfe
24: 17011501 strne r1, [r1, -r1, lsl #10]
28: 19011803 stmdbne r1, {r0, r1, fp, ip}
2c: 1c011a01 stcne 10, cr1, [r1], {1}
30: 22061e01 andcs r1, r6, #1, 28
34: Address 0x0000000000000034 is out of bounds.
Thanks in advance

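I think tim=500000 can cause problems because the LED is switched on for too short a time to be visible to the human eye. The RPi's processor runs at around 1 GHz ( https://en.wikipedia.org/wiki/Raspberry_Pi ), which is 1,000,000,000 Hz.
So, assuming roughly one loop iteration per clock cycle, the LED stays on for approx. 0.5 ms. Try a much larger count to get a visible on-time, but note that tim is a 32-bit unsigned int, so the loop bound has to stay below 4294967296 (tim=5000000000 for approx. 5 s would need a wider type). Alternatively, try other timing methods such as delay(), sleep(), ... ( implement time delay in c )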
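The back-of-the-envelope estimate above can be written down as a small helper. This is only a sketch: loop_delay_us is a hypothetical name, and cycles_per_iter is an assumption you would have to measure, since the real cost per iteration depends on memory latency and the generated code.

```c
#include <stdint.h>

/* Rough busy-wait duration in microseconds.
   cycles_per_iter is an assumed value, not something the hardware reports. */
uint64_t loop_delay_us(uint64_t iterations, uint32_t cycles_per_iter,
                       uint32_t clock_hz)
{
    /* total cycles * 1e6 / Hz, done in 64 bits to avoid overflow */
    return iterations * cycles_per_iter * UINT64_C(1000000) / clock_hz;
}
```

With one cycle per iteration at 1 GHz, 500000 iterations come out at 500 microseconds; with an assumed 20 cycles per iteration the same count is about 10 ms, which is still far too short to see as blinking.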

Very slight modifications; the naked attributes are not required (what tools are you using?).
asm(".globl _start; _start: nop\n");
#ifndef GPIO_H
#define GPIO_H
#ifdef RPI2
#define GPIO_BASE 0x3F200000UL
#else
#define GPIO_BASE 0x20200000UL
#endif
#if defined( RPIBPLUS ) || defined( RPI2 )
#define LED_GPFSEL GPIO_GPFSEL4
#define LED_GPFBIT 21
#define LED_GPSET GPIO_GPSET1
#define LED_GPCLR GPIO_GPCLR1
#define LED_GPIO_BIT 15
#else
#define LED_GPFSEL GPIO_GPFSEL1
#define LED_GPFBIT 18
#define LED_GPSET GPIO_GPSET0
#define LED_GPCLR GPIO_GPCLR0
#define LED_GPIO_BIT 16
#endif
#define GPIO_GPFSEL0 0
#define GPIO_GPFSEL1 1
#define GPIO_GPFSEL2 2
#define GPIO_GPFSEL3 3
#define GPIO_GPFSEL4 4
#define GPIO_GPFSEL5 5
#define GPIO_GPSET0 7
#define GPIO_GPSET1 8
#define GPIO_GPCLR0 10
#define GPIO_GPCLR1 11
#define GPIO_GPLEV0 13
#define GPIO_GPLEV1 14
#define GPIO_GPEDS0 16
#define GPIO_GPEDS1 17
#define GPIO_GPREN0 19
#define GPIO_GPREN1 20
#define GPIO_GPFEN0 22
#define GPIO_GPFEN1 23
#define GPIO_GPHEN0 25
#define GPIO_GPHEN1 26
#define GPIO_GPLEN0 28
#define GPIO_GPLEN1 29
#define GPIO_GPAREN0 31
#define GPIO_GPAREN1 32
#define GPIO_GPAFEN0 34
#define GPIO_GPAFEN1 35
#define GPIO_GPPUD 37
#define GPIO_GPPUDCLK0 38
#define GPIO_GPPUDCLK1 39
/** GPIO Register set */
volatile unsigned int* gpio;
/** Simple loop variable */
volatile unsigned int tim;
// Function to change a pin's mode
static int pinMode(int pinnum, int mode);
#endif
static int pinMode(int pinnum, int mode) {
    // Variable declaration and initialization
    int fsr = 0;
    int fsrbit;
    // Let's check the pin does exist
    if (pinnum < 0 || pinnum > 53) {
        // Abort, there is no such pin.
        return 1;
    }
    else if (mode < 0 || mode > 1) {
        // Abort, invalid mode (actually there are 8 function-select values but we will only use 2)
        return 1;
    }
    // Create a pointer to the GPIO peripheral registers so we can speak to them
    gpio = (unsigned int*)GPIO_BASE;
    // And calculate which FSR we should use
    /* do {
        if (pinnum > 9) {
            pinnum -= 10;
            fsr++;
        }
    } while (pinnum > 9); */
    if (pinnum > 9) {
        while (pinnum > 9) {
            pinnum -= 10;
            fsr++;
        }
    }
    // Then we calculate the bit offset within the FSR to use
    fsrbit = pinnum * 3;
    // Finally let's set the pin to the desired mode
    gpio[fsr] |= (mode << fsrbit);
    return 0;
}
int main(void)
{
    gpio = (unsigned int*)GPIO_BASE;
    /* Write 1 to the GPIO16 init nibble in the Function Select 1 GPIO
       peripheral register to enable GPIO16 as an output */
    // gpio[LED_GPFSEL] |= (1 << LED_GPFBIT);
    pinMode(47, 1);
    // Never return from here
    while(1)
    {
        for(tim = 0; tim < 500000; tim++)
            ;
        /* Set the LED GPIO pin low ( Turn OK LED on for original Pi, and off
           for plus models )*/
        gpio[LED_GPCLR] = (1 << LED_GPIO_BIT);
        for(tim = 0; tim < 500000; tim++)
            ;
        /* Set the LED GPIO pin high ( Turn OK LED off for original Pi, and on
           for plus models )*/
        gpio[LED_GPSET] = (1 << LED_GPIO_BIT);
    }
}
This was built along with some hackery to get it to link. It is not something you can use directly, but main won't need to change, just the bootstrap and linking.
00001000 <main>:
1000: e52de004 push {lr} ; (str lr, [sp, #-4]!)
1004: e3a00801 mov r0, #65536 ; 0x10000
1008: e3a0e000 mov lr, #0
100c: e59f3078 ldr r3, [pc, #120] ; 108c <main+0x8c>
1010: e59f2078 ldr r2, [pc, #120] ; 1090 <main+0x90>
1014: e5823000 str r3, [r2]
1018: e5932010 ldr r2, [r3, #16]
101c: e3822602 orr r2, r2, #2097152 ; 0x200000
1020: e1a0c003 mov r12, r3
1024: e5832010 str r2, [r3, #16]
1028: e59f1064 ldr r1, [pc, #100] ; 1094 <main+0x94>
102c: e59f3064 ldr r3, [pc, #100] ; 1098 <main+0x98>
1030: e583e000 str lr, [r3]
1034: e5932000 ldr r2, [r3]
1038: e1520001 cmp r2, r1
103c: 8a000005 bhi 1058 <main+0x58>
1040: e5932000 ldr r2, [r3]
1044: e2822001 add r2, r2, #1
1048: e5832000 str r2, [r3]
104c: e5932000 ldr r2, [r3]
1050: e1520001 cmp r2, r1
1054: 9afffff9 bls 1040 <main+0x40>
1058: e58c0028 str r0, [r12, #40] ; 0x28
105c: e583e000 str lr, [r3]
1060: e5932000 ldr r2, [r3]
1064: e1520001 cmp r2, r1
1068: 8a000005 bhi 1084 <main+0x84>
106c: e5932000 ldr r2, [r3]
1070: e2822001 add r2, r2, #1
1074: e5832000 str r2, [r3]
1078: e5932000 ldr r2, [r3]
107c: e1520001 cmp r2, r1
1080: 9afffff9 bls 106c <main+0x6c>
1084: e58c001c str r0, [r12, #28]
1088: eaffffe8 b 1030 <main+0x30>
108c: 20200000 eorcs r0, r0, r0
1090: 000110a4 andeq r1, r1, r4, lsr #1
1094: 0007a11f andeq r10, r7, pc, lsl r1
1098: 000110a0 andeq r1, r1, r0, lsr #1
0000109c <_start>:
109c: e1a00000 nop ; (mov r0, r0)
Disassembly of section .bss:
000110a0 <tim>:
110a0: 00000000 andeq r0, r0, r0
000110a4 <gpio>:
110a4: 00000000 andeq r0, r0, r0
So the first problem is here, and it comes straight from your code:
fsrbit = pinnum * 3;
gpio[fsr] |= (mode << fsrbit);
100c: e59f3078 ldr r3, [pc, #120] ; 108c <main+0x8c>
1018: e5932010 ldr r2, [r3, #16]
101c: e3822602 orr r2, r2, #2097152 ; 0x200000
1024: e5832010 str r2, [r3, #16]
Doing a read-modify-write is correct, but unless you know the bits are zero to start with, it is best to clear them first:
gpio[fsr] &= (~(7<<fsrbit));
gpio[fsr] |= (mode<<fsrbit);
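As a minimal sketch of that clear-then-set pattern, the update can be expressed as a pure function on the register value. fsel_update is a hypothetical helper name, and the layout (ten pins per function-select register, three bits per pin) is taken from the pinMode() code above.

```c
#include <stdint.h>

/* Return the new GPFSEL register value for one pin:
   clear the pin's 3-bit field, then write the requested mode into it. */
uint32_t fsel_update(uint32_t old_value, unsigned pin, unsigned mode)
{
    unsigned shift = (pin % 10) * 3;              /* three bits per pin */
    uint32_t cleared = old_value & ~(UINT32_C(7) << shift);
    return cleared | ((uint32_t)(mode & 7u) << shift);
}
```

In pinMode() this would be used roughly as gpio[fsr] = fsel_update(gpio[fsr], pinnum, mode); so the field ends up holding exactly the requested mode even if the register held something else beforehand.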
Because tim is declared volatile, the loop, as you can see, really does count and takes a few instructions per iteration. Yes, it might be a 1 GHz processor, but you are not running that fast: you are fetching from DRAM (slow), even with the cache, and you have pipeline hazards, etc.
As pointed out, though, maybe make your loop count larger. Another thing to try is to swap the on/off order around.
First this:
while(1)
{
gpio[LED_GPCLR] = (1 << LED_GPIO_BIT);
for(tim = 0; tim < 500000; tim++) continue;
gpio[LED_GPSET] = (1 << LED_GPIO_BIT);
for(tim = 0; tim < 500000; tim++) continue;
}
Is it on, does it glow? Then try this:
while(1)
{
gpio[LED_GPSET] = (1 << LED_GPIO_BIT);
for(tim = 0; tim < 500000; tim++) continue;
gpio[LED_GPCLR] = (1 << LED_GPIO_BIT);
for(tim = 0; tim < 500000; tim++) continue;
}
If the LED looks the same, then your count may be too small; make it larger. Control the bit first and then do the delay, not the other way around. If it still glows and doesn't blink with either order (set then clear, or clear then set), then it may still be too fast. In that case just do this
gpio[LED_GPSET] = (1 << LED_GPIO_BIT);
with no loop, or
gpio[LED_GPCLR] = (1 << LED_GPIO_BIT);
with no loop.
Can you make it go on and stay on? Can you make it go off and stay off? If not, then there is something wrong in the code you use to turn it on and off; if so, then there may be something wrong with your delay. But the optimized code shown above indicates that volatile is taking care of that and the loop is burning cycles.
Your "I solved it with naked" answer was deleted, so does that mean you didn't solve it? The Baking Pi tutorials are nice, and I'm glad they are there, but the bare-metal forum at Raspberry Pi has over time been littered with "it (Baking Pi) doesn't work" questions, mostly makefile/directory issues, but perhaps others. It looks like you get the gist of it, though. Your gpio pointer/array style is interesting; I wouldn't go that way, but so far it is working for you.
Ahh, wait, another possible bug... you didn't define RPI2 in the source, but clearly it was defined somewhere, on the command line perhaps?
800c: e3433f20 movt r3, #16160 ; 0x3f20
So let's try my smaller build again:
1034: e5804020 str r4, [r0, #32] ; 0x20
1048: e580e02c str lr, [r0, #44] ; 0x2c
Much better: offsets 0x20 and 0x2c are GPSET1 and GPCLR1, which is where pin 47 is found... so no bug there.
Actually, instead of hardcoding 47 you should have finished out your header file and used an LED_GPIO_BIT-style define in the pinMode() call, ensuring the pin number matches the other LED_ definitions.
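One way to keep the pin number and the register defines from drifting apart is to derive everything from the pin number. The helpers below are a sketch with hypothetical names; the word indices (GPFSEL0 at 0, GPSET0 at 7, GPCLR0 at 10) are the ones from the header above.

```c
/* Word index of the function-select register for a pin (ten pins each). */
unsigned gpio_fsel_index(unsigned pin) { return pin / 10; }

/* Bit position of the pin's 3-bit field inside that register. */
unsigned gpio_fsel_shift(unsigned pin) { return (pin % 10) * 3; }

/* Word index of the GPSETn / GPCLRn register for a pin (32 pins each). */
unsigned gpio_set_index(unsigned pin) { return 7 + pin / 32; }
unsigned gpio_clr_index(unsigned pin) { return 10 + pin / 32; }

/* Bit position of the pin inside its GPSETn / GPCLRn register. */
unsigned gpio_pin_bit(unsigned pin) { return pin % 32; }
```

For pin 47 these give GPFSEL4, bit 21, GPSET1, GPCLR1 and bit 15, which match the LED_ defines in the RPIBPLUS/RPI2 branch of the header; for pin 16 they give the values in the other branch.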

Cortex-M4 SIMD slower than Scalar

I have a few places in my code that could really use a speed-up. When I try to use the CM4 SIMD instructions, the result is always slower than the scalar version. For example, this is an alpha blending function I use a lot; it's not really slow, but it serves as an example:
for (int y=0; y<h; y++) {
i=y*w;
for (int x=0; x<w; x++) {
uint spix = *srcp++;
uint dpix = dstp[i+x];
uint r=(alpha*R565(spix)+(256-alpha)*R565(dpix))>>8;
uint g=(alpha*G565(spix)+(256-alpha)*G565(dpix))>>8;
uint b=(alpha*B565(spix)+(256-alpha)*B565(dpix))>>8;
dstp[i+x]= RGB565(r, g, b);
}
}
R565, G565, B565 and RGB565 are macros that extract and pack RGB565 components respectively; please ignore them.
Now I tried using __SMUAD to see if anything changes. The result was slower (or the same speed as the original code). I even tried loop unrolling, with no luck:
uint v0, vr, vg, vb;
v0 = (alpha<<16)|(256-alpha);
for (int y=0; y<h; y++) {
i=y*w;
for (int x=0; x<w; x++) {
spix = *srcp++;
dpix = dstp[i+x];
uint vr = R565(spix)<<16 | R565(dpix);
uint vg = G565(spix)<<16 | G565(dpix);
uint vb = B565(spix)<<16 | B565(dpix);
uint r = __SMUAD(v0, vr)>>8;
uint g = __SMUAD(v0, vg)>>8;
uint b = __SMUAD(v0, vb)>>8;
dstp[i+x]= RGB565(r, g, b);
}
}
I know this has been asked before, but given the architectural differences, and the fact that none of the answers really solve my problem, I'm asking again. Thanks!
Update
Scalar disassembly:
Disassembly of section .text.blend:
00000000 <blend>:
0: e92d 0ff0 stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
4: 6846 ldr r6, [r0, #4]
6: 68c4 ldr r4, [r0, #12]
8: b086 sub sp, #24
a: 199e adds r6, r3, r6
c: 9601 str r6, [sp, #4]
e: 9200 str r2, [sp, #0]
10: 68ca ldr r2, [r1, #12]
12: f89d 5038 ldrb.w r5, [sp, #56] ; 0x38
16: 9204 str r2, [sp, #16]
18: 9a01 ldr r2, [sp, #4]
1a: 426e negs r6, r5
1c: 4293 cmp r3, r2
1e: b2f6 uxtb r6, r6
20: da5b bge.n da <blend+0xda>
22: 8809 ldrh r1, [r1, #0]
24: 6802 ldr r2, [r0, #0]
26: 9102 str r1, [sp, #8]
28: fb03 fb01 mul.w fp, r3, r1
2c: 9900 ldr r1, [sp, #0]
2e: 4411 add r1, r2
30: 9103 str r1, [sp, #12]
32: 0052 lsls r2, r2, #1
34: 9205 str r2, [sp, #20]
36: 9903 ldr r1, [sp, #12]
38: 9a00 ldr r2, [sp, #0]
3a: 428a cmp r2, r1
3c: fa1f fb8b uxth.w fp, fp
40: da49 bge.n d6 <blend+0xd6>
42: 4610 mov r0, r2
44: 4458 add r0, fp
46: f100 4000 add.w r0, r0, #2147483648 ; 0x80000000
4a: 9a04 ldr r2, [sp, #16]
4c: f8dd a014 ldr.w sl, [sp, #20]
50: 3801 subs r0, #1
52: eb02 0040 add.w r0, r2, r0, lsl #1
56: 44a2 add sl, r4
58: f834 1b02 ldrh.w r1, [r4], #2
5c: 8842 ldrh r2, [r0, #2]
5e: f3c1 07c4 ubfx r7, r1, #3, #5
62: f3c2 09c4 ubfx r9, r2, #3, #5
66: f001 0c07 and.w ip, r1, #7
6a: f3c1 2804 ubfx r8, r1, #8, #5
6e: fb07 f705 mul.w r7, r7, r5
72: 0b49 lsrs r1, r1, #13
74: fb06 7709 mla r7, r6, r9, r7
78: ea41 01cc orr.w r1, r1, ip, lsl #3
7c: f3c2 2904 ubfx r9, r2, #8, #5
80: f002 0c07 and.w ip, r2, #7
84: fb08 f805 mul.w r8, r8, r5
88: 0b52 lsrs r2, r2, #13
8a: fb01 f105 mul.w r1, r1, r5
8e: 097f lsrs r7, r7, #5
90: fb06 8809 mla r8, r6, r9, r8
94: ea42 02cc orr.w r2, r2, ip, lsl #3
98: fb06 1202 mla r2, r6, r2, r1
9c: f007 07f8 and.w r7, r7, #248 ; 0xf8
a0: f408 58f8 and.w r8, r8, #7936 ; 0x1f00
a4: 0a12 lsrs r2, r2, #8
a6: ea48 0107 orr.w r1, r8, r7
aa: ea41 3142 orr.w r1, r1, r2, lsl #13
ae: f3c2 02c2 ubfx r2, r2, #3, #3
b2: 430a orrs r2, r1
b4: 4554 cmp r4, sl
b6: f820 2f02 strh.w r2, [r0, #2]!
ba: d1cd bne.n 58 <blend+0x58>
bc: 9902 ldr r1, [sp, #8]
be: 448b add fp, r1
c0: 9901 ldr r1, [sp, #4]
c2: 3301 adds r3, #1
c4: 428b cmp r3, r1
c6: fa1f fb8b uxth.w fp, fp
ca: d006 beq.n da <blend+0xda>
cc: 9a00 ldr r2, [sp, #0]
ce: 9903 ldr r1, [sp, #12]
d0: 428a cmp r2, r1
d2: 4654 mov r4, sl
d4: dbb5 blt.n 42 <blend+0x42>
d6: 46a2 mov sl, r4
d8: e7f0 b.n bc <blend+0xbc>
da: b006 add sp, #24
dc: e8bd 0ff0 ldmia.w sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
e0: 4770 bx lr
e2: bf00 nop
SIMD disassembly:
Disassembly of section .text.blend:
00000000 <blend>:
0: e92d 0ff0 stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
4: 6846 ldr r6, [r0, #4]
6: 68c4 ldr r4, [r0, #12]
8: b086 sub sp, #24
a: 199e adds r6, r3, r6
c: 9601 str r6, [sp, #4]
e: 9200 str r2, [sp, #0]
10: 68ca ldr r2, [r1, #12]
12: f89d 5038 ldrb.w r5, [sp, #56] ; 0x38
16: 9204 str r2, [sp, #16]
18: 9a01 ldr r2, [sp, #4]
1a: f5c5 7680 rsb r6, r5, #256 ; 0x100
1e: 4293 cmp r3, r2
20: ea46 4505 orr.w r5, r6, r5, lsl #16
24: da5d bge.n e2 <blend+0xe2>
26: 8809 ldrh r1, [r1, #0]
28: 6802 ldr r2, [r0, #0]
2a: 9102 str r1, [sp, #8]
2c: fb03 fb01 mul.w fp, r3, r1
30: 9900 ldr r1, [sp, #0]
32: 4411 add r1, r2
34: 9103 str r1, [sp, #12]
36: 0052 lsls r2, r2, #1
38: 9205 str r2, [sp, #20]
3a: 9903 ldr r1, [sp, #12]
3c: 9a00 ldr r2, [sp, #0]
3e: 428a cmp r2, r1
40: fa1f fb8b uxth.w fp, fp
44: da4b bge.n de <blend+0xde>
46: 4610 mov r0, r2
48: 4458 add r0, fp
4a: f100 4000 add.w r0, r0, #2147483648 ; 0x80000000
4e: 9a04 ldr r2, [sp, #16]
50: f8dd a014 ldr.w sl, [sp, #20]
54: 3801 subs r0, #1
56: eb02 0040 add.w r0, r2, r0, lsl #1
5a: 44a2 add sl, r4
5c: f834 2b02 ldrh.w r2, [r4], #2
60: 8841 ldrh r1, [r0, #2]
62: f3c2 07c4 ubfx r7, r2, #3, #5
66: f3c1 06c4 ubfx r6, r1, #3, #5
6a: ea46 4707 orr.w r7, r6, r7, lsl #16
6e: fb25 f707 smuad r7, r5, r7
72: f001 0907 and.w r9, r1, #7
76: ea4f 3c51 mov.w ip, r1, lsr #13
7a: f002 0607 and.w r6, r2, #7
7e: ea4f 3852 mov.w r8, r2, lsr #13
82: ea4c 0cc9 orr.w ip, ip, r9, lsl #3
86: ea48 06c6 orr.w r6, r8, r6, lsl #3
8a: ea4c 4606 orr.w r6, ip, r6, lsl #16
8e: fb25 f606 smuad r6, r5, r6
92: f3c1 2104 ubfx r1, r1, #8, #5
96: f3c2 2204 ubfx r2, r2, #8, #5
9a: ea41 4202 orr.w r2, r1, r2, lsl #16
9e: fb25 f202 smuad r2, r5, r2
a2: f3c6 260f ubfx r6, r6, #8, #16
a6: 097f lsrs r7, r7, #5
a8: f3c6 01c2 ubfx r1, r6, #3, #3
ac: f007 07f8 and.w r7, r7, #248 ; 0xf8
b0: 430f orrs r7, r1
b2: f402 52f8 and.w r2, r2, #7936 ; 0x1f00
b6: ea47 3646 orr.w r6, r7, r6, lsl #13
ba: 4316 orrs r6, r2
bc: 4554 cmp r4, sl
be: f820 6f02 strh.w r6, [r0, #2]!
c2: d1cb bne.n 5c <blend+0x5c>
c4: 9902 ldr r1, [sp, #8]
c6: 448b add fp, r1
c8: 9901 ldr r1, [sp, #4]
ca: 3301 adds r3, #1
cc: 428b cmp r3, r1
ce: fa1f fb8b uxth.w fp, fp
d2: d006 beq.n e2 <blend+0xe2>
d4: 9a00 ldr r2, [sp, #0]
d6: 9903 ldr r1, [sp, #12]
d8: 428a cmp r2, r1
da: 4654 mov r4, sl
dc: dbb3 blt.n 46 <blend+0x46>
de: 46a2 mov sl, r4
e0: e7f0 b.n c4 <blend+0xc4>
e2: b006 add sp, #24
e4: e8bd 0ff0 ldmia.w sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
e8: 4770 bx lr
ea: bf00 nop

If you want to truly optimize a function, you should check the assembler output of the compiler. You'll then learn how it is transforming your code, and then you can learn how to write code to help the compiler produce better output, or write the necessary assembler.
One of the easy wins you'd hit on quickly on your alpha blending loop is that division is slow.
Instead of x / 100, use
x * 65536 / 65536 / 100
-> x * (65536 / 100) / 65536
-> x * 655.36 >> 16
-> x * 656 >> 16
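The derivation above can be checked with a few lines of C. This is only a sketch (div100_approx is a hypothetical name). Because 656 is 655.36 rounded up, the result slightly overestimates, so it is exact only for small inputs.

```c
#include <stdint.h>

/* Approximate x / 100 as (x * 656) >> 16, where 656 = ceil(65536 / 100).
   The multiplication wraps for x above roughly 6.5 million. */
uint32_t div100_approx(uint32_t x)
{
    return (x * UINT32_C(656)) >> 16;
}
```

For example, 100 maps to 1 and 25500 maps to 255 as expected, but 1099 maps to 11 rather than 10, so the input range needs to be checked before relying on this.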
An even better alternative would be to use alpha values in the range 0 to 256, so that you can just bit-shift the result without even needing this trick.
One reason why smuad might not be giving any benefit is that you have to move the data into a format specifically for this instruction.
I'm not sure whether you'll be able to do better in general, but I thought I'd point out a way to avoid the division in your sample routine. Also, if you inspect the assembly, you may discover unexpected code generation that can be eliminated.

Community Wiki Answer
The major change to accommodate SIMD type instruction is to transform the loading. The SMUAD instruction can be looked at as a 'C' instruction like,
/* a,b are register vectors/arrays of 16 bits */
SMUAD = a[0] * b[0] + a[1] * b[1];
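As a rough, portable model of that computation (signed 16-bit halves multiplied pairwise and summed), the following sketch may help when testing on a host machine. smuad_model is a hypothetical name, not the CMSIS intrinsic, and it ignores the Q flag that the real instruction sets when the addition overflows.

```c
#include <stdint.h>

/* Host-side model of SMUAD: lo(a)*lo(b) + hi(a)*hi(b), halves treated as
   signed 16-bit values. The narrowing casts rely on the usual two's
   complement conversion that GCC and Clang implement. */
int32_t smuad_model(uint32_t a, uint32_t b)
{
    int16_t a_lo = (int16_t)(a & 0xFFFFu), a_hi = (int16_t)(a >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFFu), b_hi = (int16_t)(b >> 16);
    int32_t p_lo = (int32_t)a_lo * b_lo;
    int32_t p_hi = (int32_t)a_hi * b_hi;
    /* add in unsigned arithmetic so the overflow case is well defined */
    return (int32_t)((uint32_t)p_lo + (uint32_t)p_hi);
}
```

With a = (alpha << 16) | (256 - alpha) and b = (src << 16) | dst, the result is alpha*src + (256 - alpha)*dst, which is the per-channel sum used in the question's blend loop.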
It is very easy to transform these. Instead of,
u16 *srcp, *dstp;
uint spix = *srcp++;
uint dpix = dstp[i+x];
Use the full bus and get 32bits at a time,
uint *srcp, *dstp; /* These are two 16 bit values. */
uint spix = *srcp++;
uint dpix = dstp[i+x];
/* scale `dpix` and `spix` by alpha */
spix /= (alpha << 16 | alpha); /* Precompute, reduce strength, etc. */
dpix /= (1-alpha << 16 | 1-alpha); /* Precompute, reduce strength, etc. */
/* Hint, you can use SMUAD here? or maybe not.
You could if you scale at the same time*/
It looks like SMUL is a good fit for the alpha scaling; you don't want to add the two halves.
Now, spix and dpix contain two pixels. The vr synthetic is not needed. You may do two operations at one time.
uint rb = (dpix + spix) & ~GMASK; /* GMASK is x6xx6x bits. */
uint g = (dpix + spix) & GMASK;
/* Maybe you don't care about overflow?
A dual 16bit add helps, if the M4 has it? */
dstp[i+x]= rb | g; /* write 32bits or two pixels at a time. */
Mainly, just making better use of the bus by loading 32 bits at a time will definitely speed up your routine. Standard 32-bit integer math may work most of the time if you are careful with the ranges and don't overflow the lower 16-bit value into the upper one.
For blitter code, Bit blog and Bit hacks are useful for extraction and manipulation of RGB565 values, whether in SIMD or straight Thumb2 code.
Mainly, it is never a simple recompile to use SIMD. It can be weeks of work to transform an algorithm. If done properly, SIMD speed-ups are significant when the algorithm is not memory-bandwidth bound and doesn't involve many conditionals.
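To illustrate the point about plain 32-bit math with carefully chosen ranges, here is a sketch of a commonly used RGB565 trick. It is not the code from this answer: it blends a single pixel, and it assumes a reduced alpha range of 0 to 32 so that each channel has enough spare bits. The name blend565 is hypothetical.

```c
#include <stdint.h>

/* Blend one RGB565 pixel using ordinary 32-bit arithmetic.
   The pixel is spread so that green moves to the upper half-word; every
   channel then has at least 5 spare bits above it, enough for a multiply
   by an alpha in the range 0..32 without spilling into a neighbour. */
uint16_t blend565(uint16_t src, uint16_t dst, uint32_t alpha)
{
    const uint32_t mask = 0x07E0F81Fu;     /* G in bits 21-26, R 11-15, B 0-4 */
    if (alpha > 32)
        alpha = 32;                        /* clamp to the supported range */
    uint32_t s = (src | ((uint32_t)src << 16)) & mask;
    uint32_t d = (dst | ((uint32_t)dst << 16)) & mask;
    uint32_t r = ((s * alpha + d * (32 - alpha)) >> 5) & mask;
    return (uint16_t)(r | (r >> 16));      /* fold green back into place */
}
```

All three channels are scaled by the same two multiplies, which is the kind of saving the unpack and repack instructions in the posted disassembly suggest is worth chasing.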

Now with the disassembly posted: you'll see that both the scalar and the SIMD version have 29 instructions in the inner loop, and the SIMD version actually takes more code space (the scalar inner loop is 0x58 -> 0xba, vs. 0x5c -> 0xc2 for SIMD).
You can see that a lot of instructions are used getting the data into the right format in both loops. Maybe you can improve the performance more by working on the RGB bit unpacking/repacking rather than the alpha blend calculation!
Edit: You may also want to consider processing pairs of pixels at a time.

Using ARM NEON intrinsics to add alpha and permute

I'm developing an iOS app that needs to convert images from RGB -> BGRA fairly quickly. I would like to use NEON intrinsics if possible. Is there a faster way than simply assigning the components?
void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix)
{
numPix /= 8; //process 8 pixels at a time
uint8x8_t alpha = vdup_n_u8 (0xff);
for (int i=0; i<numPix; i++)
{
uint8x8x3_t rgb = vld3_u8 (src);
uint8x8x4_t bgra;
bgra.val[0] = rgb.val[2]; //these lines are slow
bgra.val[1] = rgb.val[1]; //these lines are slow
bgra.val[2] = rgb.val[0]; //these lines are slow
bgra.val[3] = alpha;
vst4_u8(dst, bgra);
src += 8*3;
dst += 8*4;
}
}

The ARMCC disassembly isn't that fast either: it isn't using the most appropriate instructions, and it mixes VFP instructions with NEON ones, which causes huge hiccups every time.
Try this:
mov r2, r2, lsr #3
vmov.u8 d3, #0xff
loop:
vld3.8 {d0-d2}, [r0]!
subs r2, r2, #1
vswp d0, d2
vst4.8 {d0-d3}, [r1]!
bgt loop
bx lr
My suggested code isn't fully optimized either, but further "real" optimizations would harm the readability seriously. So I stop here.
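When experimenting with hand-written loops like this, a plain C reference is useful for checking that the output bytes are still correct. The following is a sketch with a hypothetical function name; it makes no attempt to be fast.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference conversion: packed RGB (3 bytes per pixel) to BGRA
   (4 bytes per pixel) with alpha forced to 0xff. */
void permute_rgb_to_bgra_ref(const uint8_t *src, uint8_t *dst, size_t num_pix)
{
    for (size_t i = 0; i < num_pix; i++) {
        dst[4 * i + 0] = src[3 * i + 2];   /* B */
        dst[4 * i + 1] = src[3 * i + 1];   /* G */
        dst[4 * i + 2] = src[3 * i + 0];   /* R */
        dst[4 * i + 3] = 0xff;             /* A */
    }
}
```

Running both versions over the same buffer and comparing the results with memcmp catches register-ordering mistakes quickly.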

This depends on the compiler. For example, when I compile the code above with armcc (5.01) and disassemble it, I get the following (I'm only showing the loop, and I moved the alpha assignment outside of the loop):
18: f420440d vld3.8 {d4-d6}, [r0]!
1c: e2822001 add r2, r2, #1 ; 0x1
20: eeb01b45 fcpyd d1, d5
24: eeb00b46 fcpyd d0, d6
28: eeb02b44 fcpyd d2, d4
2c: f401000d vst4.8 {d0-d3}, [r1]!
30: e1520003 cmp r2, r3
34: bafffff7 blt 18 <neonPermuteRGBtoBGRA_armcc+0x18>
If I compile the code with gcc (4.4.3) and disassemble again, I get:
40: f967 040f vld3.8 {d16-d18}, [r7]
44: 46d6 mov lr, sl
46: ecca 0b06 vstmia sl, {d16-d18}
4a: 9d02 ldr r5, [sp, #8]
4c: ed8d 8b1a vstr d8, [sp, #104]
50: 3718 adds r7, #24
52: e8be 000f ldmia.w lr!, {r0, r1, r2, r3}
56: f108 0801 add.w r8, r8, #1 ; 0x1
5a: c50f stmia r5!, {r0, r1, r2, r3}
5c: eddd 0b24 vldr d16, [sp, #144]
60: e89e 0003 ldmia.w lr, {r0, r1}
64: edcd 0b16 vstr d16, [sp, #88]
68: eddd 0b22 vldr d16, [sp, #136]
6c: edcd 0b18 vstr d16, [sp, #96]
70: e885 0003 stmia.w r5, {r0, r1}
74: ed9d 0b26 vldr d0, [sp, #152]
78: 9d03 ldr r5, [sp, #12]
7a: ed8d 0b14 vstr d0, [sp, #80]
7e: cd0f ldmia r5!, {r0, r1, r2, r3}
80: 46ae mov lr, r5
82: 465d mov r5, fp
84: c50f stmia r5!, {r0, r1, r2, r3}
86: e89e 000f ldmia.w lr, {r0, r1, r2, r3}
8a: e885 000f stmia.w r5, {r0, r1, r2, r3}
8e: 9501 str r5, [sp, #4]
90: 465d mov r5, fp
92: 2100 movs r1, #0
94: 2220 movs r2, #32
96: 4620 mov r0, r4
98: f7ff fffe bl 0 <memset>
9c: cd0f ldmia r5!, {r0, r1, r2, r3}
9e: 4625 mov r5, r4
a0: c50f stmia r5!, {r0, r1, r2, r3}
a2: f8dd c004 ldr.w ip, [sp, #4]
a6: e89c 000f ldmia.w ip, {r0, r1, r2, r3}
aa: e885 000f stmia.w r5, {r0, r1, r2, r3}
ae: ecd4 0b08 vldmia r4, {d16-d19}
b2: f946 000f vst4.8 {d16-d19}, [r6]
b6: 3620 adds r6, #32
b8: 45c8 cmp r8, r9
ba: dbc1 blt.n 40 <memset+0x40>
And the execution time was 10 times faster with armcc.
If I take the assembly that armcc produced for the function (it looks like alpha is now back in the loop :)) and compile it with gcc as inline assembly:
void neonPermuteRGBtoBGRA_gas(unsigned char* src, unsigned char* dst,
int numPix) {
asm(
" ASR r3,r2,#31\n"
" VMOV.I8 d1,#0xff\n"
" ADD r2,r2,r3,LSR #29\n"
" ASR r3,r2,#3\n"
" MOV r2,#0\n"
" CMP r3,#0\n"
" BLE end\n"
"loop:\n"
" VLD3.8 {d4,d5,d6},[r0]!\n"
" ADD r2,r2,#1\n"
" CMP r3,r2\n"
" VMOV.F64 d3,d5\n"
" VMOV.F64 d2,d6\n"
" VMOV.F64 d5,d1\n"
" VMOV.F64 d0,d4\n"
" VST4.8 {d2,d3,d4,d5},[r1]!\n"
" BGT loop\n"
"end:\n"
);
}
I get the same execution time with gcc as well.
In the end, what I suggest is either to disassemble your binary and check that the compiler produces what you want, or to use assembly.
By the way, if you want to improve the execution time of this function even further, I suggest you look into:
ARM's PLD (preload data) instruction;
making full use of the NEON instructions in the loop, for example via loop unrolling (you'll notice that the actual limiting factor will be the time to load data from memory).
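From C, a prefetch hint can be requested with the GCC/Clang builtin __builtin_prefetch, which those compilers may lower to PLD on ARM targets that support it. The sketch below combines that with a simple two-pixel unroll on a scalar loop. The function name and the PREFETCH_AHEAD distance are assumptions that would need tuning and measuring on real hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD 64   /* bytes ahead of the read position; a guess */

/* RGB -> BGRA with a prefetch hint and a two-pixel unroll (GCC/Clang only). */
void permute_rgb_to_bgra_prefetch(const uint8_t *src, uint8_t *dst,
                                  size_t num_pix)
{
    size_t total = 3 * num_pix;
    size_t i = 0;
    for (; i + 2 <= num_pix; i += 2) {
        if (3 * i + PREFETCH_AHEAD < total)        /* stay inside the buffer */
            __builtin_prefetch(src + 3 * i + PREFETCH_AHEAD);
        const uint8_t *s = src + 3 * i;
        uint8_t *d = dst + 4 * i;
        d[0] = s[2]; d[1] = s[1]; d[2] = s[0]; d[3] = 0xff;
        d[4] = s[5]; d[5] = s[4]; d[6] = s[3]; d[7] = 0xff;
    }
    for (; i < num_pix; i++) {                     /* odd pixel at the end */
        dst[4 * i + 0] = src[3 * i + 2];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 0];
        dst[4 * i + 3] = 0xff;
    }
}
```

Whether the hint helps at all depends on the core and the memory system, so it is worth timing with and without it.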
