How do I convert a binary firmware dump to an .elf for assembly language debugging? - arm

I have a binary firmware image for ARM Cortex M that I know should be loaded at 0x20000000. I would like to convert it to a format that I can use for assembly level debugging with gdb, which I assume means converting to an .elf. But I have not been able to figure out how to add enough metadata to the .elf for this to happen. Here is what I've tried so far.
arm-none-eabi-objcopy -I binary -O elf32-littlearm --set-section-flags \
.data=alloc,contents,load,readonly \
--change-section-address .data=0x20000000 efr32.bin efr32.elf
efr32.elf: file format elf32-little
efr32.elf
architecture: UNKNOWN!, flags 0x00000010:
HAS_SYMS
start address 0x00000000
Sections:
Idx Name Size VMA LMA File off Algn
0 .data 00000168 20000000 20000000 00000034 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
SYMBOL TABLE:
20000000 l d .data 00000000 .data
20000000 g .data 00000000 _binary_efr32_bin_start
20000168 g .data 00000000 _binary_efr32_bin_end
00000168 g *ABS* 00000000 _binary_efr32_bin_size
Do I need to start by converting the binary to .o and write a simple linker script? Should I add an architecture option to the objcopy command?

A little experiment...
58: 480a ldr r0, [pc, #40] ; (84 <spi_write_byte+0x38>)
5a: bf08 it eq
5c: 4809 ldreq r0, [pc, #36] ; (84 <spi_write_byte+0x38>)
5e: f04f 01ff mov.w r1, #255 ; 0xff
you dont have that of course, but you can read the binary and do this with it:
.thumb
.globl _start
_start:
.inst.n 0x480a
.inst.n 0xbf08
.inst.n 0x4809
.inst.n 0xf04f
.inst.n 0x01ff
then see what happens.
arm-none-eabi-as test.s -o test.o
arm-none-eabi-ld -Ttext=0x58 test.o -o test.elf
arm-none-eabi-objdump -D test.elf
test.elf: file format elf32-littlearm
Disassembly of section .text:
00000058 <_start>:
58: 480a ldr r0, [pc, #40] ; (84 <_start+0x2c>)
5a: bf08 it eq
5c: 4809 ldreq r0, [pc, #36] ; (84 <_start+0x2c>)
5e: f04f 01ff mov.w r1, #255 ; 0xff
but the reality is it wont work...if this binary has any thumb2 extensions it isnt going to work, you cant disassemble variable length instructions linearly. You have to deal with them in execution order. So to do this correctly you have to write a dissassembler that walks through the code in execution order, determining the instructions you can figure out, mark them as instructions...
80: d1e8 bne.n 54 <spi_write_byte+0x8>
82: bd70 pop {r4, r5, r6, pc}
84: 40005200
88: F7FF4000
8c: e92d 41f0 stmdb sp!, {r4, r5, r6, r7, r8, lr}
90: 4887 ldr r0, [pc, #540] ; (2b0 <notmain+0x224>)
.thumb
.globl _start
_start:
.inst.n 0xd1e8
.inst.n 0xbd70
.inst.n 0x5200
.inst.n 0x4000
.inst.n 0x4000
.inst.n 0xF7FF
.inst.n 0xe92d
.inst.n 0x41f0
.inst.n 0x4887
80: d1e8 bne.n 54 <_start-0x2c>
82: bd70 pop {r4, r5, r6, pc}
84: 5200 strh r0, [r0, r0]
86: 4000 ands r0, r0
88: 4000 ands r0, r0
8a: f7ff e92d ; <UNDEFINED> instruction: 0xf7ffe92d
8e: 41f0 rors r0, r6
90: 4887 ldr r0, [pc, #540] ; (2b0 <_start+0x230>)
it will recover, and break and recover, etc...
instead you have to write a disassembler that walks through the code (doesnt necessarily have to disassemble to assembly language but enough to walk the code and recurse down all possible branch paths). all data not determined to be instructions mark as instructions
.thumb
.globl _start
_start:
.inst.n 0xd1e8
.inst.n 0xbd70
.word 0x40005200
.word 0xF7FF4000
.inst.n 0xe92d
.inst.n 0x41f0
.inst.n 0x4887
00000080 <_start>:
80: d1e8 bne.n 54 <_start-0x2c>
82: bd70 pop {r4, r5, r6, pc}
84: 40005200 andmi r5, r0, r0, lsl #4
88: f7ff4000 ; <UNDEFINED> instruction: 0xf7ff4000
8c: e92d 41f0 stmdb sp!, {r4, r5, r6, r7, r8, lr}
90: 4887 ldr r0, [pc, #540] ; (2b0 <_start+0x230>)
and our stmdb instruction is now correct.
good luck.

Related

How to to build a binary into a image at a fixed address 0x80000?

In our c project, we need to build a binary firmware into a image at a fixed file offset 0x80000.
Then when the image is loaded to memory. We can load firmware from offset 0x80000 to a specified address.
Meanwhile, as the firmware is placed at file offset 0x80000, we can upgrade the firmare independently.
So I'm trying to use GNU linker script to implement that.
What I do now is use incbin to include my binary file in a asm file.
And in linker script, my code is:
.fw_image_start : {
*(.__fw_image_start)
}
.fw_image : {
KEEP(*(.fw_image))
}
.fw_image_end : {
*(.__fw_image_end)
}
Then I can use fw_image_start to load firmware in image code.
But I still can't find a way to put the firmware binary to file offset 0x80000 in the final image.
Could you help me on this?
Thank you in advance!
What did you find when you looked at the documentation and examples from GNU? Some of it is admittedly confusing or misleading, but some is pretty easy. This should give a hit of at least one way to do it (there are multiple ways to solve your problem).
novectors.s
.global _start
_start:
bl notmain
b .
.globl bounce
bounce:
bx lr
.section .hello_world
.word 1,2,3,4
notmain.c
void bounce ( unsigned int );
unsigned int mybss[8];
int notmain ( void )
{
unsigned int ra;
for(ra=0;ra<1000;ra++) bounce(ra);
return(0);
}
memmap.ld
MEMORY
{
ram : ORIGIN = 0x80000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ram
.rodata : { *(.rodata*) } > ram
.hello_world : { *(.hello_world) } > ram
.bss : { *(.bss*) } > ram
}
build
arm-none-eabi-as --warn --fatal-warnings novectors.s -o novectors.o
arm-none-eabi-gcc -Wall -Werror -O2 -nostdlib -nostartfiles -ffreestanding -c notmain.c -o notmain.o
arm-none-eabi-ld -o notmain.elf -T memmap.ld novectors.o notmain.o
arm-none-eabi-objdump -D notmain.elf > notmain.list
arm-none-eabi-objcopy notmain.elf notmain.bin -O binary
examine results
Disassembly of section .text:
00080000 <_start>:
80000: eb000001 bl 8000c <notmain>
80004: eafffffe b 80004 <_start+0x4>
00080008 <bounce>:
80008: e12fff1e bx lr
0008000c <notmain>:
8000c: e92d4010 push {r4, lr}
80010: e3a04000 mov r4, #0
80014: e1a00004 mov r0, r4
80018: e2844001 add r4, r4, #1
8001c: ebfffff9 bl 80008 <bounce>
80020: e3540ffa cmp r4, #1000 ; 0x3e8
80024: 1afffffa bne 80014 <notmain+0x8>
80028: e3a00000 mov r0, #0
8002c: e8bd4010 pop {r4, lr}
80030: e12fff1e bx lr
Disassembly of section .hello_world:
00080034 <.hello_world>:
80034: 00000001 andeq r0, r0, r1
80038: 00000002 andeq r0, r0, r2
8003c: 00000003 andeq r0, r0, r3
80040: 00000004 andeq r0, r0, r4
Disassembly of section .bss:
00080034 <mybss>:
...
Naturally:
MEMORY
{
bob : ORIGIN = 0x80000, LENGTH = 0x1000
ted : ORIGIN = 0xB0000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > bob
.rodata : { *(.rodata*) } > bob
.hello_world : { *(.hello_world) } > ted
.bss : { *(.bss*) } > ted
}
gives
Disassembly of section .text:
00080000 <_start>:
80000: eb000001 bl 8000c <notmain>
80004: eafffffe b 80004 <_start+0x4>
00080008 <bounce>:
80008: e12fff1e bx lr
0008000c <notmain>:
8000c: e92d4010 push {r4, lr}
80010: e3a04000 mov r4, #0
80014: e1a00004 mov r0, r4
80018: e2844001 add r4, r4, #1
8001c: ebfffff9 bl 80008 <bounce>
80020: e3540ffa cmp r4, #1000 ; 0x3e8
80024: 1afffffa bne 80014 <notmain+0x8>
80028: e3a00000 mov r0, #0
8002c: e8bd4010 pop {r4, lr}
80030: e12fff1e bx lr
Disassembly of section .hello_world:
000b0000 <.hello_world>:
b0000: 00000001 andeq r0, r0, r1
b0004: 00000002 andeq r0, r0, r2
b0008: 00000003 andeq r0, r0, r3
b000c: 00000004 andeq r0, r0, r4
Disassembly of section .bss:
000b0000 <mybss>:
...
and as demonstrated the name ram, rom, etc are not special can call it other things like bob, ted, alice...I assume there are some reserved words you cant use.
Again there are numerous solutions, see the GNU documentation, I like this method as it reads better for me, but you will see solutions that skip the MEMORY part.
(no this wasnt intended to be completely correct code, but demonstrates the assembly language bootstrap, the C code and the linker script).

What am I doing wrong when compiling C code to bare metal (raspberry pi)?

I have spent multiple days trying to figure this out and I just can't. I have some C code. I have made the assembly code for this C program, copy pasted the assembly to someone else's project (that only contains a single assembly file) and assembled that. In these case things work. But if I try to compile from C directly to generate the binaries, it doesn't work. Even though everything else should be identical. This is my C code:
#include <stdint.h>
#define REGISTERS_BASE 0x3F000000
#define MAIL_BASE 0xB880 // Base address for the mailbox registers
// This bit is set in the status register if there is no space to write into the mailbox
#define MAIL_FULL 0x80000000
// This bit is set in the status register if there is nothing to read from the mailbox
#define MAIL_EMPTY 0x40000000
struct Message
{
uint32_t messageSize;
uint32_t requestCode;
uint32_t tagID;
uint32_t bufferSize;
uint32_t requestSize;
uint32_t pinNum;
uint32_t on_off_switch;
uint32_t end;
};
struct Message m =
{
.messageSize = sizeof(struct Message),
.requestCode =0,
.tagID = 0x00038041,
.bufferSize = 8,
.requestSize =0,
.pinNum = 130,
.on_off_switch = 1,
.end = 0,
};
/** Main function - we'll never return from here */
int _start(void)
{
uint32_t mailbox = MAIL_BASE + REGISTERS_BASE + 0x18;
volatile uint32_t status;
do
{
status = *(volatile uint32_t *)(mailbox);
}
while((status & 0x80000000));
*(volatile uint32_t *)(MAIL_BASE + REGISTERS_BASE + 0x20) = ((uint32_t)(&m) & 0xfffffff0) | (uint32_t)(8);
while(1);
}
This is a linker file I copied from the successful method:
/*
* Very simple linker script, combing the text and data sections
* and putting them starting at address 0x800.
*/
SECTIONS {
/* Put the code at 0x80000, leaving room for ARM and
* the stack. It also conforms to the standard expecations.
*/
.init 0x8000 : {
*(.init)
}
.text : {
*(.text)
}
/* Put the data after the code */
.data : {
*(.data)
}
}
And these is how I am compiling and linking everything:
arm-none-eabi-gcc -O0 -march=armv8-a PiTest.c -nostartfiles -o kernel.o
arm-none-eabi-ld kernel.o -o kernel.elf -T kernel.ld
arm-none-eabi-objcopy kernel.elf -O binary kernel.img
My target architecture is armv8 since that's what the pi model 3 uses.
I have no idea how the generated assembly works, but the C code directly does not. Please help I am on the verge of madness.
EDIT: The expected behaviour is for the pi's light to turn on. which it does with the first method I described. With the second method the light remains off.
EDIT4: Made some changes to files, deleted previous edits with outdated info to reduce post size
kernel.elf: file format elf32-littlearm
Disassembly of section .init:
00008000 <_start>:
8000: e3a0dd7d mov sp, #8000 ; 0x1f40
8004: eaffffff b 8008 <kernel_main>
Disassembly of section .text:
00008008 <kernel_main>:
8008: e52db004 push {fp} ; (str fp, [sp, #-4]!)
800c: e28db000 add fp, sp, #0
8010: e24dd00c sub sp, sp, #12
8014: e30b3898 movw r3, #47256 ; 0xb898
8018: e3433f00 movt r3, #16128 ; 0x3f00
801c: e50b3008 str r3, [fp, #-8]
8020: e51b3008 ldr r3, [fp, #-8]
8024: e5933000 ldr r3, [r3]
8028: e50b300c str r3, [fp, #-12]
802c: e51b300c ldr r3, [fp, #-12]
8030: e3530000 cmp r3, #0
8034: bafffff9 blt 8020 <kernel_main+0x18>
8038: e30b38a0 movw r3, #47264 ; 0xb8a0
803c: e3433f00 movt r3, #16128 ; 0x3f00
8040: e3082050 movw r2, #32848 ; 0x8050
8044: e3402001 movt r2, #1
8048: e3c2200f bic r2, r2, #15
804c: e3822008 orr r2, r2, #8
8050: e5832000 str r2, [r3]
8054: eafffffe b 8054 <kernel_main+0x4c>
Disassembly of section .data:
00008058 <__data_start>:
8058: 00000020 andeq r0, r0, r0, lsr #32
805c: 00000000 andeq r0, r0, r0
8060: 00038041 andeq r8, r3, r1, asr #32
8064: 00000008 andeq r0, r0, r8
8068: 00000000 andeq r0, r0, r0
806c: 00000082 andeq r0, r0, r2, lsl #1
8070: 00000001 andeq r0, r0, r1
8074: 00000000 andeq r0, r0, r0
Disassembly of section .ARM.attributes:
00000000 <_stack-0x80021>:
0: 00002e41 andeq r2, r0, r1, asr #28
4: 61656100 cmnvs r5, r0, lsl #2
8: 01006962 tsteq r0, r2, ror #18
c: 00000024 andeq r0, r0, r4, lsr #32
10: 412d3805 ; <UNDEFINED> instruction: 0x412d3805
14: 070e0600 streq r0, [lr, -r0, lsl #12]
18: 09010841 stmdbeq r1, {r0, r6, fp}
1c: 14041202 strne r1, [r4], #-514 ; 0xfffffdfe
20: 17011501 strne r1, [r1, -r1, lsl #10]
24: 1a011803 bne 46038 <__bss_end__+0x3dfc0>
28: 2a012201 bcs 48834 <__bss_end__+0x407bc>
2c: Address 0x000000000000002c is out of bounds.
Disassembly of section .comment:
00000000 <.comment>:
0: 3a434347 bcc 10d0d24 <_stack+0x1050d03>
4: 35312820 ldrcc r2, [r1, #-2080]! ; 0xfffff7e0
8: 392e343a stmdbcc lr!, {r1, r3, r4, r5, sl, ip, sp}
c: 732b332e ; <UNDEFINED> instruction: 0x732b332e
10: 33326e76 teqcc r2, #1888 ; 0x760
14: 37373131 ; <UNDEFINED> instruction: 0x37373131
18: 2029312d eorcs r3, r9, sp, lsr #2
1c: 2e392e34 mrccs 14, 1, r2, cr9, cr4, {1}
20: 30322033 eorscc r2, r2, r3, lsr r0
24: 35303531 ldrcc r3, [r0, #-1329]! ; 0xfffffacf
28: 28203932 stmdacs r0!, {r1, r4, r5, r8, fp, ip, sp}
2c: 72657270 rsbvc r7, r5, #112, 4
30: 61656c65 cmnvs r5, r5, ror #24
34: 00296573 eoreq r6, r9, r3, ror r5
kernel8.img
12345678
00000800
00080264
00000000
12345678
kernel8-32.img
12345678
00008320
00008224
200001DA
12345678
kernel7.img
12345678
00000700
00008224
200001DA
12345678
kernel.img
12345678
00000000
00008224
200001DA
12345678
when I wrote and posted this code this is what I got so if you name your file kernel.img then 0x8000 is your entry point the answer I gave in your other SO question is a complete raspberry pi starting point. You can simply add your mailbox stuff, although if you are struggling with this I thing the mailbox and video are not where you should start IMO.
if you name the file kernel8.img then the entry point is 0x80000 change the linker script to match.
I have a serial port based bootloader you can use to save on the sd card dance, can get a long way with that then simply use the binary version of what you are creating to write to the flash once your application is working.
EDIT
Okay this is incredibly disgusting and by posting it here maybe that means you cant use it in your classwork...you should really do this right and not use inline assembly for your bootstrap...
so.c
asm(
".globl _start\n"
"_start:\n"
"mov sp,#0x8000\n"
"bl centry\n"
"b .\n"
);
unsigned int centry ( void )
{
return(5);
}
build
arm-none-eabi-gcc -O2 -c so.c -o so.o
arm-none-eabi-ld -Ttext=0x8000 so.o -o so.elf
arm-none-eabi-objdump -D so.elf > so.list
arm-none-eabi-objcopy so.elf -O binary kernel.img
examine
Disassembly of section .text:
00008000 <_start>:
8000: e3a0d902 mov sp, #32768 ; 0x8000
8004: eb000000 bl 800c <centry>
8008: eafffffe b 8008 <_start+0x8>
0000800c <centry>:
800c: e3a00005 mov r0, #5
8010: e12fff1e bx lr
A complete raspberry pi C with bootstrap example that will work on any of the flavors of pi (so far as I know they might have changed the GPU bootloader in the last few months but assume the didnt).
There are a couple of things I see wrong here. The most obvious ones are:
You aren't leaving anything at address 0, so the CPU is left executing blank memory at startup. You need to put something (like a branch instruction!) at 0x0.
On ARM Cortex-A, the stack pointer is not initialized at startup. You have to initialize it yourself in _start -- which means you will need to write that function in assembly.
First, cudos to old timer for his patience helping me.
The mistakes were:
Wrong entry point for the program, fixed by creating an assembly file with the label _start to set the stack pointer and using the linker to put the init section at address 0x8000
The compilation line itself was also wrong, it was missing a -c argument

How to make bare metal ARM programs and run them on QEMU?

I am trying to get this tutorial to work as intended without success (Something fails after the bl main instruction).
According to the tutorial the command
(qemu) xp /1dw 0xa0000018
should result in the print 33 (But i get 0x00 instead)
a0000018: 33
This is the content of the registers after the main call (see startup.s)
(qemu) info registers
R00=a000001c R01=a000001c R02=00000006 R03=00000000
R04=00000000 R05=00000005 R06=00000006 R07=00000007
R08=00000008 R09=00000009 R10=00000000 R11=a3fffffc
R12=00000000 R13=00000000 R14=0000003c R15=00000004
PSR=800001db N--- A und32
FPSCR: 00000000
I have the following files
main.c
startup.s
lscript.ld
Makefile
And I am using the following toolchain
arm-2013.11-24-arm-none-eabi-i686-pc-linux-gnu
Makefile:
SRCS := main.c startup.s
LINKER_NAME := lscript.ld
ELF_NAME := program.elf
BIN_NAME := program.bin
FLASH_NAME := flash.bin
CC := arm-none-eabi
CFLAGS := -nostdlib
OBJFLAGS ?= -DS
QEMUFLAGS := -M connex -pflash $(FLASH_NAME) -nographic -serial /dev/null
# Allocate 16MB to use as a virtual flash for th qemu
# bs = blocksize -> 4KB
# count = number of block -> 4096
# totalsize = 16MB
setup:
dd if=/dev/zero of=$(FLASH_NAME) bs=4096 count=4096
# Compile srcs and write to virtual flash
all: clean setup
$(CC)-gcc $(CFLAGS) -o $(ELF_NAME) -T $(LINKER_NAME) $(SRCS)
$(CC)-objcopy -O binary $(ELF_NAME) $(BIN_NAME)
dd if=$(BIN_NAME) of=$(FLASH_NAME) bs=4096 conv=notrunc
objdump:
$(CC)-objdump $(OBJFLAGS) $(ELF_NAME)
mem-placement:
$(CC)-nm -n $(ELF_NAME)
qemu:
qemu-system-arm $(QEMUFLAGS)
clean:
rm -rf *.bin
rm -rf *.elf
main.c:
static int arr[] = { 1, 10, 4, 5, 6, 7 };
static int sum;
static const int n = sizeof(arr) / sizeof(arr[0]);
int main()
{
int i;
for (i = 0; i < n; i++){
sum += arr[i];
}
return 0;
}
startup.s:
.section "vectors"
reset: b _start
undef: b undef
swi: b swi
pabt: b pabt
dabt: b dabt
nop
irq: b irq
fiq: b fiq
.text
_start:
init:
## Copy data to RAM.
ldr r0, =flash_sdata
ldr r1, =ram_sdata
ldr r2, =data_size
## Handle data_size == 0
cmp r2, #0
beq init_bss
copy:
ldrb r4, [r0], #1
strb r4, [r1], #1
subs r2, r2, #1
bne copy
init_bss:
## Initialize .bss
ldr r0, =sbss
ldr r1, =ebss
ldr r2, =bss_size
## Handle bss_size == 0
cmp r2, #0
beq init_stack
mov r4, #0
zero:
strb r4, [r0], #1
subs r2, r2, #1
bne zero
init_stack:
## Initialize the stack pointer
ldr sp, =0xA4000000
## **this call dosent work as expected.. (r13/sp contains 0xA4000000)**
bl main
## Dosent return from main
## r0 should now contain 33
stop:
b stop
lscript.ld:
/*
* Linker for testing purposes
* (using 16 MB virtual flash = 0x0100_0000)
*/
MEMORY {
rom (rx) : ORIGIN = 0x00000000, LENGTH = 0x01000000
ram (rwx) : ORIGIN = 0xA0000000, LENGTH = 0x04000000
}
SECTIONS {
.text : {
* (vectors);
* (.text);
} > rom
.rodata : {
* (.rodata);
} > rom
flash_sdata = .;
ram_sdata = ORIGIN(ram);
.data : AT (flash_sdata) {
* (.data);
} > ram
ram_edata = .;
data_size = ram_edata - ram_sdata;
sbss = .;
.bss : {
* (.bss);
} > ram
ebss = .;
bss_size = ebss - sbss;
/DISCARD/ : {
*(.note*)
*(.comment)
*(.ARM*)
/*
*(.debug*)
*/
}
}
Disassembly of the executable (objdump):
program.elf: file format elf32-littlearm
Disassembly of section .text:
00000000 <reset>:
0: ea000023 b 94 <_start>
00000004 <undef>:
4: eafffffe b 4 <undef>
00000008 <swi>:
8: eafffffe b 8 <swi>
0000000c <pabt>:
c: eafffffe b c <pabt>
00000010 <dabt>:
10: eafffffe b 10 <dabt>
14: e320f000 nop {0}
00000018 <irq>:
18: eafffffe b 18 <irq>
0000001c <fiq>:
1c: eafffffe b 1c <fiq>
00000020 <main>:
20: e52db004 push {fp} ; (str fp, [sp, #-4]!)
24: e28db000 add fp, sp, #0
28: e24dd00c sub sp, sp, #12
2c: e3a03000 mov r3, #0
30: e50b3008 str r3, [fp, #-8]
34: ea00000d b 70 <main+0x50>
38: e3003000 movw r3, #0
3c: e34a3000 movt r3, #40960 ; 0xa000
40: e51b2008 ldr r2, [fp, #-8]
44: e7932102 ldr r2, [r3, r2, lsl #2]
48: e3003018 movw r3, #24
4c: e34a3000 movt r3, #40960 ; 0xa000
50: e5933000 ldr r3, [r3]
54: e0822003 add r2, r2, r3
58: e3003018 movw r3, #24
5c: e34a3000 movt r3, #40960 ; 0xa000
60: e5832000 str r2, [r3]
64: e51b3008 ldr r3, [fp, #-8]
68: e2833001 add r3, r3, #1
6c: e50b3008 str r3, [fp, #-8]
70: e3a02006 mov r2, #6
74: e51b3008 ldr r3, [fp, #-8]
78: e1530002 cmp r3, r2
7c: baffffed blt 38 <main+0x18>
80: e3a03000 mov r3, #0
84: e1a00003 mov r0, r3
88: e24bd000 sub sp, fp, #0
8c: e49db004 pop {fp} ; (ldr fp, [sp], #4)
90: e12fff1e bx lr
00000094 <_start>:
94: e59f004c ldr r0, [pc, #76] ; e8 <stop+0x4>
98: e59f104c ldr r1, [pc, #76] ; ec <stop+0x8>
9c: e59f204c ldr r2, [pc, #76] ; f0 <stop+0xc>
a0: e3520000 cmp r2, #0
a4: 0a000003 beq b8 <init_bss>
000000a8 <copy>:
a8: e4d04001 ldrb r4, [r0], #1
ac: e4c14001 strb r4, [r1], #1
b0: e2522001 subs r2, r2, #1
b4: 1afffffb bne a8 <copy>
000000b8 <init_bss>:
b8: e59f0034 ldr r0, [pc, #52] ; f4 <stop+0x10>
bc: e59f1034 ldr r1, [pc, #52] ; f8 <stop+0x14>
c0: e59f2034 ldr r2, [pc, #52] ; fc <stop+0x18>
c4: e3520000 cmp r2, #0
c8: 0a000003 beq dc <init_stack>
cc: e3a04000 mov r4, #0
000000d0 <zero>:
d0: e4c04001 strb r4, [r0], #1
d4: e2522001 subs r2, r2, #1
d8: 1afffffc bne d0 <zero>
000000dc <init_stack>:
dc: e3a0d329 mov sp, #-1543503872 ; 0xa4000000
e0: ebffffce bl 20 <main>
000000e4 <stop>:
e4: eafffffe b e4 <stop>
e8: 00000104 andeq r0, r0, r4, lsl #2
ec: a0000000 andge r0, r0, r0
f0: 00000018 andeq r0, r0, r8, lsl r0
f4: a0000018 andge r0, r0, r8, lsl r0
f8: a000001c andge r0, r0, ip, lsl r0
fc: 00000004 andeq r0, r0, r4
Disassembly of section .rodata:
00000100 <n>:
100: 00000006 andeq r0, r0, r6
Disassembly of section .data:
a0000000 <arr>:
a0000000: 00000001 andeq r0, r0, r1
a0000004: 0000000a andeq r0, r0, sl
a0000008: 00000004 andeq r0, r0, r4
a000000c: 00000005 andeq r0, r0, r5
a0000010: 00000006 andeq r0, r0, r6
a0000014: 00000007 andeq r0, r0, r7
Disassembly of section .bss:
a0000018 <sum>:
a0000018: 00000000 andeq r0, r0, r0
Can someone point me in the right direction to why this isn't working according to my expectations?
Thanks Henrik
Minimal examples that just work
https://github.com/cirosantilli/linux-kernel-module-cheat/tree/54e15e04338c0fecc0be139a0da2d0d972c21419#baremetal-setup-getting-started
The prompt.c example takes input from your host terminal and gives back output all through the simulated UART:
enter a character
got: a
new alloc of 1 bytes at address 0x0x4000a1c0
enter a character
got: b
new alloc of 2 bytes at address 0x0x4000a1c0
enter a character
It uses Newlib to expose a subset of the C standard library. This allows you to run existing programs written in C if the only use that restricted subset of the C standard library.
More details about Newlib at: https://electronics.stackexchange.com/questions/223929/c-standard-libraries-on-bare-metal/400077#400077
https://github.com/freedomtan/aarch64-bare-metal-qemu/tree/2ae937a2b106b43bfca49eec49359b3e30eac1b1 for -M virt, just the hello world on the repo. Compile with:
sudo apt-get install gcc-aarch64-linux-gnu
make CROSS_PREFIX=aarch64-linux-gnu-
Here is the example minimized to printing a single character from assembly: How to run a bare metal ELF file on QEMU?
https://github.com/bztsrc/raspi3-tutorial for -M raspi3. Quick getting started at: https://raspberrypi.stackexchange.com/questions/34733/how-to-do-qemu-emulation-for-bare-metal-raspberry-pi-images/85135#85135 Several other examples on the repo going to more advanced subjects.
Also does display output on 09_framebuffer.
Both write a hello world to the UART.
Tested in Ubuntu 18.04, gcc-aarch64-linux-gnu version 4:7.3.0-3ubuntu2.
Debugging!
First, look at the PC and PSR: You're in Undef mode, in the undefined instruction handler.
OK, in an exception mode, the LR tells you where you took the exception. There are some slightly complicated rules between the PC offset and the preferred return address determining exactly what it points at, but just eyeballing it it's clearly in the vicinity of the movw/movt pair.
The movw instruction effectively only exists in the ARMv7 ISA onwards. A brief investigation tells me the machine you're emulating is some old PXA255 thing, whose CPU only implements the ARMv5 ISA. Thus it's not surprising it faults on an instruction that it predates by many years.
Your compiler is apparently configured to target ARMv7 by default (which is not uncommon), so you need to add at least -march=armv5te to your CFLAGS to target the appropriate architecture version. The 'advanced' challenge would be to switch to a different, newer, machine, but that's going to involve adapting the linker script to a new memory map and rewriting any hardware-touching code for new peripherals, so I'd save that idea for the longer term, once you're comfortable with the basics of bare-metal code and slogging through hardware reference manuals.
for the same code on my ubuntu i got
arm-none-eabi-gcc -nostdlib -o sum.elf sum.lds startup.s -w
/usr/lib/gcc/arm-none-eabi/4.9.3/../../../arm-none-eabi/bin/ld: warning: cannot find entry symbol _start; defaulting to 00000000
/tmp/ccBthV7t.o: In function init_stack':
(.text+0x4c): undefined reference tomain'
collect2: error: ld returned 1 exit status

Would Thumb-2 ARM-Core Micros From Different Manufacturers Have Same Codesize?

Comparing two Thumb-2 micros from two different manufacturers. One's a Cortex M3, one's an A5. Are they guaranteed to compile a particular piece of code to the same codesize?
so here goes
fun.c
unsigned int fun ( unsigned int x )
{
return(x);
}
addimm.c
extern unsigned int fun ( unsigned int );
unsigned int addimm ( unsigned int x )
{
return(fun(x)+0x123);
}
for demonstration purposes building for bare metal, not really a functional program but it compiles clean and demonstrates what I intend to demonstrate.
arm instructions
arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mcpu=cortex-a5 -march=armv7-a -c addimm.c -o addimma.o
disassembly of the object, not linked
00000000 <addimm>:
0: e92d4008 push {r3, lr}
4: ebfffffe bl 0 <fun>
8: e2800e12 add r0, r0, #288 ; 0x120
c: e2800003 add r0, r0, #3
10: e8bd8008 pop {r3, pc}
thumb generic (armv4 or v5 whatever the default was for this compiler build)
arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -c addimm.c -o addimmt.o
00000000 <addimm>:
0: b508 push {r3, lr}
2: f7ff fffe bl 0 <fun>
6: 3024 adds r0, #36 ; 0x24
8: 30ff adds r0, #255 ; 0xff
a: bc08 pop {r3}
c: bc02 pop {r1}
e: 4708 bx r1
cortex-a5 specific
arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -mcpu=cortex-a5 -march=armv7-a -c addimm.c -o addimma5.o
00000000 <addimm>:
0: b508 push {r3, lr}
2: f7ff fffe bl 0 <fun>
6: f200 1023 addw r0, r0, #291 ; 0x123
a: bd08 pop {r3, pc}
cortex-a5 is armv7-a which supports thumb-2 as far as the add immediate itself goes and related to binary size there is no optimization here, 32 bits for thumb and 32 bits for thumb2. But this is but one example there perhaps will be times that thumb2 produces smaller binaries than thumb.
cortex-m3
arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -mcpu=cortex-m3 -march=armv7-m -c addimm.c -o addimmm3.o
00000000 <addimm>:
0: b508 push {r3, lr}
2: f7ff fffe bl 0 <fun>
6: f200 1023 addw r0, r0, #291 ; 0x123
a: bd08 pop {r3, pc}
produced the same result as cortex-a5. for this simple example the machine code for this object is the same, same size, when built for cortex-a5 and cortex-m3
Now if I add a bootstrap, a main, and call this function and fill in the function it calls to create a complete, linked, program
00000000 <_start>:
0: f000 f802 bl 8 <notmain>
4: e7fe b.n 4 <_start+0x4>
...
00000008 <notmain>:
8: 2005 movs r0, #5
a: f000 b801 b.w 10 <addimm>
e: bf00 nop
00000010 <addimm>:
10: b508 push {r3, lr}
12: f000 f803 bl 1c <fun>
16: f200 1023 addw r0, r0, #291 ; 0x123
1a: bd08 pop {r3, pc}
0000001c <fun>:
1c: 4770 bx lr
1e: 46c0 nop ; (mov r8, r8)
We get a result. The addimm function itself did not change in size. with a cortex-a5 you have to have some arm code that then switches to thumb, and likely when linking with libraries, etc you may get a mixture of arm and thumb, so
00000000 <_start>:
0: eb000000 bl 8 <notmain>
4: eafffffe b 4 <_start+0x4>
00000008 <notmain>:
8: e92d4008 push {r3, lr}
c: e3a00005 mov r0, #5
10: fa000001 blx 1c <addimm>
14: e8bd4008 pop {r3, lr}
18: e12fff1e bx lr
0000001c <addimm>:
1c: b508 push {r3, lr}
1e: f000 e804 blx 28 <fun>
22: f200 1023 addw r0, r0, #291 ; 0x123
26: bd08 pop {r3, pc}
00000028 <fun>:
28: e12fff1e bx lr
overall larger binary, the addimm part itself did not change in size though.
as far as linking changing the size of the object, look at this example
bootstrap.s
.thumb
.thumb_func
.globl _start
_start:
bl notmain
hang: b hang
.thumb_func
.globl dummy
dummy:
bx lr
.code 32
.globl bounce
bounce:
bx lr
hello.c
void dummy ( void );
void bounce ( void );
void notmain ( void )
{
dummy();
bounce();
}
looking at an arm build of notmain by itself, the object:
00000000 <notmain>:
0: e92d4800 push {fp, lr}
4: e28db004 add fp, sp, #4
8: ebfffffe bl 0 <dummy>
c: ebfffffe bl 0 <bounce>
10: e24bd004 sub sp, fp, #4
14: e8bd4800 pop {fp, lr}
18: e12fff1e bx lr
depending on what is calling it and what it calls, the linker may have to add more code to deal with items that are defined outside the object, from global variables to external functions
00008000 <_start>:
8000: f000 f818 bl 8034 <__notmain_from_thumb>
00008004 <hang>:
8004: e7fe b.n 8004 <hang>
00008006 <dummy>:
8006: 4770 bx lr
00008008 <bounce>:
8008: e12fff1e bx lr
0000800c <notmain>:
800c: e92d4800 push {fp, lr}
8010: e28db004 add fp, sp, #4
8014: eb000003 bl 8028 <__dummy_from_arm>
8018: ebfffffa bl 8008 <bounce>
801c: e24bd004 sub sp, fp, #4
8020: e8bd4800 pop {fp, lr}
8024: e12fff1e bx lr
00008028 <__dummy_from_arm>:
8028: e59fc000 ldr ip, [pc] ; 8030 <__dummy_from_arm+0x8>
802c: e12fff1c bx ip
8030: 00008007 andeq r8, r0, r7
00008034 <__notmain_from_thumb>:
8034: 4778 bx pc
8036: 46c0 nop ; (mov r8, r8)
8038: eafffff3 b 800c <notmain>
803c: 00000000 andeq r0, r0, r0
dummy_from_arm and notmain_from_thumb were both added, an increase in the size of the binary. each object did not change in size but the overall binary did. bounce() was an arm to arm function, no patching, dummy() arm to thumb and notmain() thumb to main.
so you might have a cortex-m3 object, and a cortex-a5 object that as far as the code in that object goes they are both identical. But dopending on what you link them with, which eventually something is dfferent between a cortex-m3 system and a cortex-a5 system, you may see more or less code added by the linker to account for the system differences, libraries, operating system specific, etc even so much as where in the binary you put the object, if it has to have a further reach than it can with a single instruction, then the linker will add even more code.
This is all gcc specific stuff, each toolchain is going to deal with each of these problems in its own way. It is the nature of the beast when you use an object and linker model, a very good model but the compiler, assembler, and linker have to work together to make sure that global resources can be properly accessed when linked. has nothing to do with ARM, this problem exists with many/most processor architectures and the toolchains deal with those problems per toolchain, per version, per target architecture. When I said change the size of the object what I really meant was the linker may add more code to the final binary in order to deal with that object and how it interacts with others.

ARM-C Inter-working

I am trying out a simple program for ARM-C inter-working. Here is the code:
#include<stdio.h>
#include<stdlib.h>
int Double(int a);
extern int Start(void);
int main(){
int result=0;
printf("in C main\n");
result=Start();
printf("result=%d\n",result);
return 0;
}
int Double(int a)
{
printf("inside double func_argument_value=%d\n",a);
return (a*2);
}
The assembly file goes as-
.syntax unified
.cpu cortex-m3
.thumb
.align
.global Start
.global Double
.thumb_func
Start:
mov r10,lr
mov r0,#42
bl Double
mov lr,r10
mov r2,r0
mov pc,lr
During debugging on LPC1769(embedded artists board), I get an hardfault error on the instruction " result=Start(). " I am trying to do an arm-C internetworking here. the lr value during the execution of the above the statement(result=Start()) is 0x0000029F, where the faulting instruction is,and the pc value is 0x0000029E.
This is how I got the faulting instruction in r1
__asm("mrs r0,MSP\n"
"isb\n"
"ldr r1,[r0,#24]\n");
Can anybody please explain where I am going wrong? Any solution is appreciated.
Thank you in advance.
I am a beginner in cortex-m3 & am using the NXP LPCXpresso IDE powered by Code_Red.
Here is the disassembly of my code.
IntDefaultHandler:
00000269: push {r7}
0000026b: add r7, sp, #0
0000026d: b.n 0x26c <IntDefaultHandler+4>
0000026f: nop
00000271: mov r3, lr
00000273: mov.w r0, #42 ; 0x2a
00000277: bl 0x2c0 <Double>
0000027b: mov lr, r3
0000027d: mov r2, r0
0000027f: mov pc, lr
main:
00000281: push {r7, lr}
00000283: sub sp, #8
00000285: add r7, sp, #0
00000287: mov.w r3, #0
0000028b: str r3, [r7, #4]
0000028d: movw r3, #11212 ; 0x2bcc
00000291: movt r3, #0
00000295: mov r0, r3
00000297: bl 0xd64 <printf>
0000029b: bl 0x270 <Start>
0000029f: mov r3, r0
000002a1: str r3, [r7, #4]
000002a3: movw r3, #11224 ; 0x2bd8
000002a7: movt r3, #0
000002ab: mov r0, r3
000002ad: ldr r1, [r7, #4]
000002af: bl 0xd64 <printf>
000002b3: mov.w r3, #0
000002b7: mov r0, r3
000002b9: add.w r7, r7, #8
000002bd: mov sp, r7
000002bf: pop {r7, pc}
Double:
000002c0: push {r7, lr}
000002c2: sub sp, #8
000002c4: add r7, sp, #0
000002c6: str r0, [r7, #4]
000002c8: movw r3, #11236 ; 0x2be4
000002cc: movt r3, #0
000002d0: mov r0, r3
000002d2: ldr r1, [r7, #4]
000002d4: bl 0xd64 <printf>
000002d8: ldr r3, [r7, #4]
000002da: mov.w r3, r3, lsl #1
000002de: mov r0, r3
000002e0: add.w r7, r7, #8
000002e4: mov sp, r7
000002e6: pop {r7, pc}
As per your advice Dwelch, I have changed the r10 to r3.
I assume you mean interworking not internetworking? The LPC1769 is a cortex-m3 which is thumb/thumb2 only so it doesnt support arm instructions so there is no interworking available for that platform. Nevertheless, playing with the compiler to see what goes on:
Get the compiler to do it for you first, then try it yourself in asm...
start.s
.thumb
.globl _start
_start:
ldr r0,=hello
mov lr,pc
bx r0
hang : b hang
hello.c
extern unsigned int two ( unsigned int );
unsigned int hello ( unsigned int h )
{
return(two(h)+7);
}
two.c
unsigned int two ( unsigned int t )
{
return(t+5);
}
Makefile
hello.list : start.s hello.c two.c
arm-none-eabi-as -mthumb start.s -o start.o
arm-none-eabi-gcc -c -O2 hello.c -o hello.o
arm-none-eabi-gcc -c -O2 -mthumb two.c -o two.o
arm-none-eabi-ld -Ttext=0x1000 start.o hello.o two.o -o hello.elf
arm-none-eabi-objdump -D hello.elf > hello.list
clean :
rm -f *.o
rm -f *.elf
rm -f *.list
produces hello.list
Disassembly of section .text:
00001000 <_start>:
1000: 4801 ldr r0, [pc, #4] ; (1008 <hang+0x2>)
1002: 46fe mov lr, pc
1004: 4700 bx r0
00001006 <hang>:
1006: e7fe b.n 1006 <hang>
1008: 0000100c andeq r1, r0, ip
0000100c <hello>:
100c: e92d4008 push {r3, lr}
1010: eb000004 bl 1028 <__two_from_arm>
1014: e8bd4008 pop {r3, lr}
1018: e2800007 add r0, r0, #7
101c: e12fff1e bx lr
00001020 <two>:
1020: 3005 adds r0, #5
1022: 4770 bx lr
1024: 0000 movs r0, r0
...
00001028 <__two_from_arm>:
1028: e59fc000 ldr ip, [pc] ; 1030 <__two_from_arm+0x8>
102c: e12fff1c bx ip
1030: 00001021 andeq r1, r0, r1, lsr #32
1034: 00000000 andeq r0, r0, r0
hello.o disassembled by itself:
00000000 <hello>:
0: e92d4008 push {r3, lr}
4: ebfffffe bl 0 <two>
8: e8bd4008 pop {r3, lr}
c: e2800007 add r0, r0, #7
10: e12fff1e bx lr
the compiler uses bl assuming/hoping it will be calling arm from arm. but it didnt, so what they did was put a trampoline in there.
0000100c <hello>:
100c: e92d4008 push {r3, lr}
1010: eb000004 bl 1028 <__two_from_arm>
1014: e8bd4008 pop {r3, lr}
1018: e2800007 add r0, r0, #7
101c: e12fff1e bx lr
00001028 <__two_from_arm>:
1028: e59fc000 ldr ip, [pc] ; 1030 <__two_from_arm+0x8>
102c: e12fff1c bx ip
1030: 00001021 andeq r1, r0, r1, lsr #32
1034: 00000000 andeq r0, r0, r0
the bl to __two_from_arm is an arm mode to arm mode branch link. the address of the destination function (two) with the lsbit set, which tells bx to switch to thumb mode, is loaded into the disposable register ip (r12?) then the bx ip happens switching modes. the branch link had setup the return address in lr, which was an arm mode address no doubt (lsbit zero).
00001020 <two>:
1020: 3005 adds r0, #5
1022: 4770 bx lr
1024: 0000 movs r0, r0
the two() function does its thing and returns, note you have to use bx lr not mov pc,lr when interworking. Basically if you are not running an ARMv4 without the T, or an ARMv5 without the T, mov pc,lr is an okay habit. But anything ARMv4T or newer (ARMv5T or newer) use bx lr to return from a function unless you have a special reason not to. (avoid using pop {pc} as well for the same reason unless you really need to save that instruction and are not interworking). Now being on a cortex-m3 which is thumb+thumb2 only, well you cant interwork so you can use mov pc,lr and pop {pc}, but the code is not portable, and it is not a good habit as that habit will bite you when you switch back to arm programming.
So since hello was in arm mode when it used bl which is what set the link register, the bx in two_from_arm does not touch the link register, so when two() returns with a bx lr it is returning to arm mode after the bl __two_from_arm line in the hello() function.
Also note the extra 0x0000 after the thumb function, this was to align the program on a word boundary so that the following arm code was aligned...
to see how the compiler does thumb to arm change two as follows
unsigned int three ( unsigned int );
unsigned int two ( unsigned int t )
{
return(three(t)+5);
}
and put that function in hello.c
extern unsigned int two ( unsigned int );
unsigned int hello ( unsigned int h )
{
return(two(h)+7);
}
unsigned int three ( unsigned int t )
{
return(t+3);
}
and now we get another trampoline
00001028 <two>:
1028: b508 push {r3, lr}
102a: f000 f80b bl 1044 <__three_from_thumb>
102e: 3005 adds r0, #5
1030: bc08 pop {r3}
1032: bc02 pop {r1}
1034: 4708 bx r1
1036: 46c0 nop ; (mov r8, r8)
...
00001044 <__three_from_thumb>:
1044: 4778 bx pc
1046: 46c0 nop ; (mov r8, r8)
1048: eafffff4 b 1020 <three>
104c: 00000000 andeq r0, r0, r0
Now this is a very cool trampoline. the bl to three_from_thumb is in thumb mode and the link register is set to return to the two() function with the lsbit set no doubt to indicate to return to thumb mode.
The trampoline starts with a bx pc, pc is set to two instructions ahead and the pc internally always has the lsbit clear so a bx pc will always take you to arm mode if not already in arm mode, and in either mode two instructions ahead. Two instructions ahead of the bx pc is an arm instruction that branches (not branch link!) to the three function, completing the trampoline.
Notice how I wrote the call to hello() in the first place
_start:
ldr r0,=hello
mov lr,pc
bx r0
hang : b hang
that actually wont work will it? It will get you from arm to thumb but not from thumb to arm. I will leave that as an exercise for the reader.
If you change start.s to this
.thumb
.globl _start
_start:
bl hello
hang : b hang
the linker takes care of us:
00001000 <_start>:
1000: f000 f820 bl 1044 <__hello_from_thumb>
00001004 <hang>:
1004: e7fe b.n 1004 <hang>
...
00001044 <__hello_from_thumb>:
1044: 4778 bx pc
1046: 46c0 nop ; (mov r8, r8)
1048: eaffffee b 1008 <hello>
I would and do always disassemble programs like these to make sure the compiler and linker resolved these issues. Also note that for example __hello_from_thumb can be used from any thumb function, if I call hello from several places, some arm, some thumb, and hello was compiled for arm, then the arm calls would call hello directly (if they can reach it) and all the thumb calls would share the same hello_from_thumb (if they can reach it).
The compiler in these examples was assuming code that stays in the same mode (simple branch link) and the linker added the interworking code...
If you really meant inter-networking and not interworking, then please describe what that is and I will delete this answer.
EDIT:
You were using a register to preserve lr during the call to Double, that will not work, no register will work for that you need to use memory, and the easiest is the stack. See how the compiler does it:
00001008 <hello>:
1008: e92d4008 push {r3, lr}
100c: eb000009 bl 1038 <__two_from_arm>
1010: e8bd4008 pop {r3, lr}
1014: e2800007 add r0, r0, #7
1018: e12fff1e bx lr
r3 is pushed likely to align the stack on a 64 bit boundary (makes it faster). the thing to notice is the link register is preserved on the stack, but the pop does not pop to pc because this is not an ARMv4 build, so a bx is needed to return from the function. Because this is arm mode we can pop to lr and simply bx lr.
For thumb you can only push r0-r7 and lr directly and pop r0-r7 and pc directly you dont want to pop to pc because that only works if you are staying in the same mode (thumb or arm). this is fine for a cortex-m, or fine if you know what all of your callers are, but in general bad. So
00001024 <two>:
1024: b508 push {r3, lr}
1026: f000 f811 bl 104c <__three_from_thumb>
102a: 3005 adds r0, #5
102c: bc08 pop {r3}
102e: bc02 pop {r1}
1030: 4708 bx r1
same deal r3 is used as a dummy register to keep the stack aligned for performance (I used the default build for gcc 4.8.0 which is likely a platform with a 64 bit axi bus, specifying the architecture might remove that extra register). Because we cannot pop pc, I assume because r1 and r3 would be out of order and r3 was chosen (they could have chosen r2 and saved an instruction) there are two pops, one to get rid of the dummy value on the stack and the other to put the return value in a register so that they can bx to it to return.
Your Start function does not conform to the ABI and as a result when you mix it in with such large libraries as a printf call, no doubt you will crash. If you didnt it was dumb luck. Your assembly listing of main shows that neither r4 nor r10 were used and assuming main() is not called other than the bootstrap, then that is why you got away with either r4 or r10.
If this really is an LPC1769 this this whole discussion is irrelevant as it does not support ARM and does not support interworking (interworking = mixing of ARM mode code and thumb mode code). Your problem was unrelated to interworking, you are not interworking (note the pop {pc} at the end of the functions). Your problem was likely related to your assembly code.
EDIT2:
Changing the makefile to specify the cortex-m
00001008 <hello>:
1008: b508 push {r3, lr}
100a: f000 f805 bl 1018 <two>
100e: 3007 adds r0, #7
1010: bd08 pop {r3, pc}
1012: 46c0 nop ; (mov r8, r8)
00001014 <three>:
1014: 3003 adds r0, #3
1016: 4770 bx lr
00001018 <two>:
1018: b508 push {r3, lr}
101a: f7ff fffb bl 1014 <three>
101e: 3005 adds r0, #5
1020: bd08 pop {r3, pc}
1022: 46c0 nop ; (mov r8, r8)
first and foremost it is all thumb since there is no arm mode on a cortex-m, second the bx is not needed for function returns (Because there are no arm/thumb mode changes). So pop {pc} will work.
it is curious that the dummy register is still used on a push, I tried an arm7tdmi/armv4t build and it still did that, so there is some other flag to use to get rid of that behavior.
If your desire was to learn how to make an assembly function that you can call from C, you should have just done that. Make a C function that somewhat resembles the framework of the function you want to create in asm:
extern unsigned int Double ( unsigned int );
unsigned int Start ( void )
{
return(Double(42));
}
assemble then disassemble
00000000 <Start>:
0: b508 push {r3, lr}
2: 202a movs r0, #42 ; 0x2a
4: f7ff fffe bl 0 <Double>
8: bd08 pop {r3, pc}
a: 46c0 nop ; (mov r8, r8)
and start with that as you assembly function.
.globl Start
.thumb_func
Start:
push {lr}
mov r0, #42
bl Double
pop {pc}
That, or read the arm abi for gcc and understand what registers you can and cant use without saving them on the stack, what registers are used for passing and returning parameters.
_start function is the entry point of a C program which makes a call to main(). In order to debug _start function of any C program is in the assembly file. Actually the real entry point of a program on linux is not main(), but rather a function called _start(). The standard libraries normally provide a version of this that runs some initialization code, then calls main().
Try compiling this with gcc -nostdlib:

Resources