I am trying to generate exceptions like Bus Fault, Usage Fault on ARM Cortex-M3. My code for enable exceptions:
void EnableExceptions(void)
{
UINT32 uReg = SCB->SHCSR;
uReg |= 0x00070000;
SCB->SHCSR = uReg;
//Set Configurable Fault Status Register
SCB->CFSR = 0x0367E7C3;
//Set to 1 DIV_0_TRP register
SCB->CCR |= 0x00000010;
//Set priorities of fault handlers
NVIC_SetPriority(MemoryManagement_IRQn, 0x01);
NVIC_SetPriority(BusFault_IRQn, 0x01);
NVIC_SetPriority(UsageFault_IRQn, 0x01);
}
void UsageFault_Handler(void){
//handle
//I've set a breakpoint but system does not hit
}
void BusFault_Handler(void){
//handle
//I've set a breakpoint but system does not hit
}
I tried to generate division by zero exception and saw the variables value as "Infinity". However system does not generate any exception on keeps running. Also tried to generate Bus Fault exception and same thing happens.
Also when I comment out to EnableExceptions function system works correctly. What is wrong with my code? Does ARM handle these kind of errors inside of the microprocessor?
Cortex-M devices use the Thumb-2 instruction set exclusively, ARM uses the least significant bit of the the branch/jump/call address to determine whether the target is Thumb or ARM code, since the Cortex-M cannot run ARM code, you can generate an BusFault exception by creating a jump to an even address.
int dummy(){ volatile x = 0 ; return x ; }
int main()
{
typedef void (*fn_t)();
fn_t foo = (fn_t)(((char*)dummy) - 1) ;
foo() ;
}
The following will also work, since the call will fail before any instructions are executed, so it does not need to point to any valid code.
int main()
{
typedef void (*fn_t)();
fn_t foo = (fn_t)(0x8004000) ;
foo() ;
}
You can generate a usage fault by forcing an integer divide by zero:
int main()
{
volatile int x = 0 ;
volatile int y = 1 / x ;
}
From your comment question: How to generate any exception.
Here is one from the documentation:
Encoding T1 All versions of the Thumb instruction set.
SVC<c> #<imm8>
...
Exceptions
SVCall.
Which I can find by searching for SVCall.
Exceptions are well documented by ARM, there are ways to cause the exceptions you listed without having to break the bus (requiring a sim an fpga or creating your own silicon), you already know the search terms for the document to find busfault and usagefault.
How ARM handles these (internally or not) is documented. Internally in this case means a lockup or not if you look, otherwise they execute the fault handler (unless of course there is a fault fetching the fault handler).
Most you can create in C without resorting to assembly language instructions, but you have to be careful that it is generating what you think it is generating:
void fun ( void )
{
int x = 3;
int y = 0;
int z = x / y;
}
Disassembly of section .text:
00000000 <fun>:
0: 4770 bx lr
Instead you want something that actually generates the instruction that can cause the fault:
int fun0 ( int x, int y )
{
return(x/y);
}
void fun1 ( void )
{
fun0(3,0);
}
00000000 <fun0>:
0: fb90 f0f1 sdiv r0, r0, r1
4: 4770 bx lr
6: bf00 nop
00000008 <fun1>:
8: 4770 bx lr
but as shown you have to be careful about where and how you call it. In this case the call was done in the same file so the optimizer had visibility to see that this is now dead code and optimized it out, so a test like this would fail to generate a fault for multiple reasons.
This is why the OP needs to provide a complete minimal example the reason why faults are not being seen is not the processor. But the software and/or test code.
Edit
A complete minimal example, everything you need but a gnu toolchain (no . This is on a stm32 blue pill an STM32F103...
flash.s
.cpu cortex-m3
.thumb
.thumb_func
.global _start
_start:
stacktop: .word 0x20001000
.word reset /* 1 Reset */
.word hang /* 2 NMI */
.word hang /* 3 HardFault */
.word hang /* 4 MemManage */
.word hang /* 5 BusFault */
.word usagefault /* 6 UsageFault */
.word hang /* 7 Reserved */
.word hang /* 8 Reserved */
.word hang /* 9 Reserved */
.word hang /*10 Reserved */
.word hang /*11 SVCall */
.word hang /*12 DebugMonitor */
.word hang /*13 Reserved */
.word hang /*14 PendSV */
.word hang /*15 SysTick */
.word hang /* External interrupt 1 */
.word hang /* External interrupt 2 */
.thumb_func
reset:
bl notmain
b hang
.thumb_func
hang: b .
.thumb_func
.globl PUT32
PUT32:
str r1,[r0]
bx lr
.thumb_func
.globl GET32
GET32:
ldr r0,[r0]
bx lr
.thumb_func
.globl dummy
dummy:
bx lr
.thumb_func
.globl dosvc
dosvc:
svc 1
.thumb_func
.globl hop
hop:
bx r0
flash.ld
MEMORY
{
rom : ORIGIN = 0x08000000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom
.rodata : { *(.rodata*) } > rom
.bss : { *(.bss*) } > ram
}
fun.c
void PUT32 ( unsigned int, unsigned int );
unsigned int GET32 ( unsigned int );
void dummy ( unsigned int );
void hop ( unsigned int );
#define GPIOCBASE 0x40011000
#define RCCBASE 0x40021000
#define SHCSR 0xE000ED24
void usagefault ( void )
{
unsigned int ra;
while(1)
{
PUT32(GPIOCBASE+0x10,1<<(13+0));
for(ra=0;ra<100000;ra++) dummy(ra);
PUT32(GPIOCBASE+0x10,1<<(13+16));
for(ra=0;ra<100000;ra++) dummy(ra);
}
}
int notmain ( void )
{
unsigned int ra;
ra=GET32(SHCSR);
ra|=1<<18; //usagefault
PUT32(SHCSR,ra);
ra=GET32(RCCBASE+0x18);
ra|=1<<4; //enable port c
PUT32(RCCBASE+0x18,ra);
ra=GET32(GPIOCBASE+0x04);
ra&=(~(3<<20)); //PC13
ra|= (1<<20) ; //PC13
ra&=(~(3<<22)); //PC13
ra|= (0<<22) ; //PC13
PUT32(GPIOCBASE+0x04,ra);
PUT32(GPIOCBASE+0x10,1<<(13+0));
for(ra=0;ra<200000;ra++) dummy(ra);
PUT32(GPIOCBASE+0x10,1<<(13+16));
for(ra=0;ra<200000;ra++) dummy(ra);
ra=GET32(0x08000004);
ra&=(~1);
hop(ra);
return(0);
}
build
arm-none-eabi-as --warn --fatal-warnings -mcpu=cortex-m3 flash.s -o flash.o
arm-none-eabi-gcc -Wall -O2 -ffreestanding -mcpu=cortex-m3 -mthumb -c so.c -o so.o
arm-none-eabi-ld -nostdlib -nostartfiles -T flash.ld flash.o so.o -o so.elf
arm-none-eabi-objdump -D so.elf > so.list
arm-none-eabi-objcopy -O binary so.elf so.bin
All of those command line options are not required arm-linux-gnueabi- and other flavors of gnu toolchains work just fine from several versions back to the present as I use them as a compiler, assembler and linker and don't mess with library or other stuff that varies from one flavor to another.
UsageFault The UsageFault fault handles non-memory related faults
caused by instruction execution.
A number of different situations cause usage faults, including:
• Undefined Instruction.
• Invalid state on instruction execution.
• Error on exception return.
• Attempting to access a disabled or unavailable coprocessor.
The following can cause usage faults when the processor is configured to
report them:
• A word or halfword memory accesses to an unaligned address.
• Division by zero.
Software can disable this fault. If it does, a UsageFault escalates to HardFault. UsageFault has a configurable priority.
...
Instruction execution with EPSR.T set to 0 causes the invalid state UsageFault
So the test here branches to an arm address vs a thumb address and this causes a usagefault. (Can read up about the BX instruction the psr.t bit how and when it gets changed, etc in the documentation as well)
Backing up this is the stm32 blue pill. There is an led on PC13, the code enables usagefault, configures PC13 as an output, blinks it once so we see the program started then if it hits the usagefault handler then it blinks forever.
ra&=(~1);
If you comment this out then it keeps branching to reset which does everything again one slow blink and you see that repeat forever.
Before running naturally you check the build to see that it has a chance of working:
Disassembly of section .text:
08000000 <_start>:
8000000: 20001000
8000004: 08000049
8000008: 0800004f
800000c: 0800004f
8000010: 0800004f
8000014: 0800004f
8000018: 08000061
800001c: 0800004f
8000020: 0800004f
8000024: 0800004f
8000028: 0800004f
800002c: 0800004f
8000030: 0800004f
8000034: 0800004f
8000038: 0800004f
800003c: 0800004f
8000040: 0800004f
8000044: 0800004f
08000048 <reset>:
8000048: f000 f82a bl 80000a0 <notmain>
800004c: e7ff b.n 800004e <hang>
0800004e <hang>:
800004e: e7fe b.n 800004e <hang>
...
08000060 <usagefault>:
8000060: b570 push {r4, r5, r6, lr}
The vector table is correct the right vectors point to the right places.
0xE000ED28 CFSR RW 0x00000000
The HFSR is upper bits of the CFSR
> halt
target halted due to debug-request, current mode: Handler UsageFault
xPSR: 0x81000006 pc: 0x0800008a msp: 0x20000fc0
> mdw 0xE000ED28
0xe000ed28: 00020000
And that bit is
INVSTATE, bit[1]
0 EPSR.T bit and EPSR.IT bits are valid for instruction execution.
1 Instruction executed with invalid EPSR.T or EPSR.IT field.
Now
Using the CCR, see Configuration and Control Register, CCR on page B3-604, software can enable or disable:
• Divide by zero faults, alignment faults and some features of processor operation.
• BusFaults at priority -1 and higher.
The reset value of the CCR is IMPLEMENTATION DEFINED so it might just be enabled for you or not, likely have to look at the Cortex-m3 TRM or just read it:
> mdw 0xE000ED14
0xe000ed14: 00000000
so its zeros on mine.
So add fun.c:
unsigned int fun ( unsigned int x, unsigned int y)
{
return(x/y);
}
Change so.c:
void PUT32 ( unsigned int, unsigned int );
unsigned int GET32 ( unsigned int );
void dummy ( unsigned int );
unsigned int fun ( unsigned int, unsigned int);
#define GPIOCBASE 0x40011000
#define RCCBASE 0x40021000
#define SHCSR 0xE000ED24
#define CCR 0xE000ED14
void usagefault ( void )
{
unsigned int ra;
while(1)
{
PUT32(GPIOCBASE+0x10,1<<(13+0));
for(ra=0;ra<100000;ra++) dummy(ra);
PUT32(GPIOCBASE+0x10,1<<(13+16));
for(ra=0;ra<100000;ra++) dummy(ra);
}
}
int notmain ( void )
{
unsigned int ra;
ra=GET32(SHCSR);
ra|=1<<18; //usagefault
PUT32(SHCSR,ra);
ra=GET32(CCR);
ra|=1<<4; //div by zero
PUT32(CCR,ra);
ra=GET32(RCCBASE+0x18);
ra|=1<<4; //enable port c
PUT32(RCCBASE+0x18,ra);
ra=GET32(GPIOCBASE+0x04);
ra&=(~(3<<20)); //PC13
ra|= (1<<20) ; //PC13
ra&=(~(3<<22)); //PC13
ra|= (0<<22) ; //PC13
PUT32(GPIOCBASE+0x04,ra);
PUT32(GPIOCBASE+0x10,1<<(13+0));
for(ra=0;ra<200000;ra++) dummy(ra);
PUT32(GPIOCBASE+0x10,1<<(13+16));
for(ra=0;ra<200000;ra++) dummy(ra);
fun(3,0);
return(0);
}
build
arm-none-eabi-as --warn --fatal-warnings -mcpu=cortex-m3 flash.s -o flash.o
arm-none-eabi-gcc -Wall -O2 -ffreestanding -mcpu=cortex-m3 -mthumb -c so.c -o so.o
arm-none-eabi-gcc -Wall -O2 -ffreestanding -mcpu=cortex-m3 -mthumb -c fun.c -o fun.o
arm-none-eabi-ld -nostdlib -nostartfiles -T flash.ld flash.o so.o fun.o -o so.elf
arm-none-eabi-objdump -D so.elf > so.list
arm-none-eabi-objcopy -O binary so.elf so.bin
confirm there is actually a divide instruction that we are going to hit
800011e: 2100 movs r1, #0
8000120: 2003 movs r0, #3
8000122: f000 f80f bl 8000144 <fun>
...
08000144 <fun>:
8000144: fbb0 f0f1 udiv r0, r0, r1
8000148: 4770 bx lr
load and run and the handler is called.
target halted due to debug-request, current mode: Handler UsageFault
xPSR: 0x81000006 pc: 0x08000086 msp: 0x20000fc0
> mdw 0xE000ED28
0xe000ed28: 02000000
and that indicates it was a divide by zero.
So everything you needed to know/do really was in the documentation, one document.
99.999% of bare-metal programming is reading or doing experiments to validate what was read, almost none of the work is writing the final application, it is but a tiny part of the job.
Before you can get to bare-metal programming you have to have mastered the toolchain otherwise nothing will work. Mastering the toolchain can be done without any target hardware, using free tools so that is just a matter of sitting down and doing it.
As far as you're trying to do a floating point divide by zero on a core that doesn't have hardware divide by zero you need to look at the soft float, for example libgcc:
ARM_FUNC_START divsf3
ARM_FUNC_ALIAS aeabi_fdiv divsf3
CFI_START_FUNCTION
# Mask out exponents, trap any zero/denormal/INF/NAN.
mov ip, #0xff
ands r2, ip, r0, lsr #23
do_it ne, tt
COND(and,s,ne) r3, ip, r1, lsr #23
teqne r2, ip
teqne r3, ip
beq LSYM(Ldv_s)
LSYM(Ldv_x):
...
# Division by 0x1p*: let''s shortcut a lot of code.
LSYM(Ldv_1):
and ip, ip, #0x80000000
orr r0, ip, r0, lsr #9
adds r2, r2, #127
do_it gt, tt
COND(rsb,s,gt) r3, r2, #255
and so on
which should have been visible in the disassembly, I don't off-hand see a forced exception (intentional undefined instruction, swi/svc or anything like that). This is only one possible library and now that I think about it this looks like arm not thumb, so would have to go looking for that, easier to just look at the disassembly.
Based on your comment and if I read the other question again I assume because it didn't raise an exception the correct result of a divide by zero is a properly signed infinity. but if you switch to a cortex-m4 or m7 then you might be able to trigger a hardware exception, but....read the documentation to find out.
Edit 2
void fun ( void )
{
int a = 3;
int b = 0;
volatile int c = a/b;
}
fun.c:6:18: warning: unused variable ‘c’ [-Wunused-variable]
6 | volatile int c = a/b;
| ^
08000140 <fun>:
8000140: deff udf #255 ; 0xff
8000142: bf00 nop
> halt
target halted due to debug-request, current mode: Handler UsageFault
xPSR: 0x01000006 pc: 0x08000076 msp: 0x20000fc0
> mdw 0xE000ED28
0xe000ed28: 00010000
and that bit means
The processor has attempted to execute an undefined instruction
So volatile failed to produce the desired result using gcc
arm-none-eabi-gcc --version
arm-none-eabi-gcc (GCC) 10.1.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
arm-linux-gnueabi-gcc --version
arm-linux-gnueabi-gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
produces
Disassembly of section .text:
00000000 <fun>:
0: deff udf #255 ; 0xff
2: bf00 nop
as well (and you can godbolt your way through others).
Yes a fault was produced and it was the usage fault but this would have been yet another Stack Overflow question as to why I didn't get a divide by zero. Blindly using volatile to force a divide doesn't work.
Making all three volatile
void fun ( void )
{
volatile int a = 3;
volatile int b = 0;
volatile int c = a/b;
}
Disassembly of section .text:
00000000 <fun>:
0: 2203 movs r2, #3
2: 2300 movs r3, #0
4: b084 sub sp, #16
6: 9201 str r2, [sp, #4]
8: 9302 str r3, [sp, #8]
a: 9b01 ldr r3, [sp, #4]
c: 9a02 ldr r2, [sp, #8]
e: fb93 f3f2 sdiv r3, r3, r2
12: 9303 str r3, [sp, #12]
14: b004 add sp, #16
16: 4770 bx lr
will generate the desired fault.
and with no optimizations
00000000 <fun>:
0: b480 push {r7}
2: b085 sub sp, #20
4: af00 add r7, sp, #0
6: 2303 movs r3, #3
8: 60fb str r3, [r7, #12]
a: 2300 movs r3, #0
c: 60bb str r3, [r7, #8]
e: 68fa ldr r2, [r7, #12]
10: 68bb ldr r3, [r7, #8]
12: fb92 f3f3 sdiv r3, r2, r3
16: 607b str r3, [r7, #4]
18: bf00 nop
1a: 3714 adds r7, #20
1c: 46bd mov sp, r7
1e: bc80 pop {r7}
20: 4770 bx lr
will also generate the desired fault
So master the language first (read read read), then master the toolchain second (read read read) then bare-metal programming (read read read). It is all about reading, not about coding. As shown above even with decades of experience at this level, you can't completely predict what the tools will generate; you have to just try it, but most important because you figured it out for one tool one time one day on one machine no reason to get too broad in your assumptions. Have to try it and examine what the compiler produces, repeat the process until you get the desired effect. Push comes to shove just write a few lines of asm and be done with it.
You weren't seeing faults because you weren't generating any and/or weren't trapping them or both. The list of possible reasons why is long based on the code provided, but these examples, that you should have no problem porting to your platform, should demonstrate your hardware works too, and then you can sort out why your software didn't by connecting the dots between code that does and code that doesn't. All I did was follow the documentation, and examine the output of the compiler, once I had the minimum number of things enabled, the fault handler was called. Without those enabled the usage fault was not triggered.
BusFault, HardFault, MemmanageFault, UsageFault, SVC Call , NMI those are internal exception for arm cortex-M microprocessors.
it depends really from which processor you are using, but let's suppose you are having cortex-m3:
By default all fault are mapped to hardfault handler unless you
enable them explicitly to get mapped to their own handler
Set bits: USGFAULTENA, BUSFAULTENA, MEMFAULTENA in system handler
control and state register => those fault each one will be mapped to
its proper handler
To generate a fault you can try :
Access not mapped memory area => this generate Busfault
Execute unrecognized instruction => UsageFault
Explictly trigger one of those fault by setting one of those bits:
USGFAULTACT, BUSFAULTACT, MEMFAULTACT in systen handler control and
status register => this will generate an exception for sure
Please for more details refer to : https://developer.arm.com/documentation/dui0552/a/cortex-m3-peripherals/system-control-block/system-handler-control-and-state-register?lang=en
Related
I want to call a C function, say:
int foo(int a, int b) {return 2;}
inside an assembly (ARM) code. I read that I need to mention
import foo
in my assembly code, for assembler to search for foo in C file. But, I am stuck at passing arguments a and b from assembly and retrieving an integer (here 2) again back in assembly. Could someone could explain me how to do this, with a mini example?
You have already written the minimal example.
int foo(int a, int b) {return 2;}
compile and disassemble
arm-none-eabi-gcc -O2 -c so.c -o so.o
arm-none-eabi-objdump -d so.o
so.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <foo>:
0: e3a00002 mov r0, #2
4: e12fff1e bx lr
Anything to do with a and b are dead code so optimized out. While using C to learn asm is good/okay to get started you really want to do it with optimizations on which mean you have to work harder on crafting the experimental code.
int foo(int a, int b) {return 2;}
int bar ( void )
{
return(foo(5,4));
}
and we learn nothing new.
Disassembly of section .text:
00000000 <foo>:
0: e3a00002 mov r0, #2
4: e12fff1e bx lr
00000008 <bar>:
8: e3a00002 mov r0, #2
c: e12fff1e bx lr
need to do this for the call:
int foo(int a, int b);
int bar ( void )
{
return(foo(5,4));
}
and now we see
00000000 <bar>:
0: e92d4010 push {r4, lr}
4: e3a01004 mov r1, #4
8: e3a00005 mov r0, #5
c: ebfffffe bl 0 <foo>
10: e8bd4010 pop {r4, lr}
14: e12fff1e bx lr
(yes this is built for the this compilers default target armv4t, should be obvious to some others have no clue how I/we know)(can also tell how new/old the compiler is from this example as well (there was an abi change years ago that is visible here)(the newer versions of gcc are worse than older so older is still good to use for some use cases))
per this compilers convention (now while this compiler does use the arm convention of some version of some document for some version of this compiler, always remember it is the compiler authors choice, they are under no obligation to conform to anyones written standard, they choose)
So we see that the first parameter goes in r0, the second in r1. You can craft functions with more operands or more types of operands to see what nuances there are. How many are in registers and when they start using the stack instead. For example try a 64 bit variable then a 32 in that order as operands then try it in reverse.
To see what is going on on the callee side.
int foo(int a, int b)
{
return((a<<1)+b+0x123);
}
We see that r0 and r1 are the first two operands, the compiler would be grossly broken otherwise.
00000000 <foo>:
0: e0810080 add r0, r1, r0, lsl #1
4: e2800e12 add r0, r0, #288 ; 0x120
8: e2800003 add r0, r0, #3
c: e12fff1e bx lr
What we did not see explicitly in the caller example is that r0 is where the return is stored (at least for this variable type).
The ABI documention is not an easy read, but if you first "just try it" then if you wish refer to the documentation it should help with the documentation. At the end of the day you have a compiler you are going to use, it has a convention and is probably part of a toolchain so you must conform to that compilers convention not some third party document (even if that third party is arm) AND you should probably use that toolchain's assembler which means you should use that assembly language (many incompatible assembly languages for arm, the tool defines the language not the target).
You can see how simple it is to figure this out on your own.
And...so this gets painful but you can look at the assembly output of the compiler, at least some will let you. With gcc you can use -save-temps or -S
int foo(int a, int b)
{
return 2;
}
.cpu arm7tdmi
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 2
.eabi_attribute 34, 0
.eabi_attribute 18, 4
.file "so.c"
.text
.align 2
.global foo
.arch armv4t
.syntax unified
.arm
.fpu softvfp
.type foo, %function
foo:
# Function supports interworking.
# args = 0, pretend = 0, frame = 0
# frame_needed = 0, uses_anonymous_args = 0
# link register save eliminated.
mov r0, #2
bx lr
.size foo, .-foo
.ident "GCC: (15:9-2019-q4-0ubuntu1) 9.2.1 20191025 (release) [ARM/arm-9-branch revision 277599]"
Almost none of this do you "need".
The minimum looks like this
.globl foo
foo:
mov r0,#2
bx lr
.global or .globl are equivalent, somewhat reflects the age or how/when you learned gnu assembler.
Now this will break if you are mixing arm and thumb instructions, this defaults to arm.
arm-none-eabi-as x.s -o x.o
arm-none-eabi-objdump -d x.o
x.o: file format elf32-littlearm
Disassembly of section .text:
00000000 :
0: e3a00002 mov r0, #2
4: e12fff1e bx lr
If we want thumb then we have to tell it
.thumb
.globl foo
foo:
mov r0,#2
bx lr
and we get thumb.
00000000 <foo>:
0: 2002 movs r0, #2
2: 4770 bx lr
With ARM and with the gnu toolchain at least you can mix arm and thumb and the linker will take care of the transition
int foo ( int, int );
int fun ( void )
{
return(foo(1,2));
}
we do not need a bootstrap nor other things to get the linker to link so we can see how that part of it works.
arm-none-eabi-ld so.o x.o -o so.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000008000
arm-none-eabi-objdump -d so.elf
so.elf: file format elf32-littlearm
Disassembly of section .text:
00008000 <fun>:
8000: e92d4010 push {r4, lr}
8004: e3a01002 mov r1, #2
8008: e3a00001 mov r0, #1
800c: eb000001 bl 8018 <foo>
8010: e8bd4010 pop {r4, lr}
8014: e12fff1e bx lr
00008018 <foo>:
8018: 2002 movs r0, #2
801a: 4770 bx lr
Now this is broken not just because we have no bootstrap, etc, but there is a bl to foo but foo is thumb and the caller is arm. So for gnu assembler for arm you can take this shortcut which I think I learned from an older gcc, but whatever
.thumb
.thumb_func
.globl foo
foo:
mov r0,#2
bx lr
.thumb_func says the next label you find is considered a function label not just an address.
00008000 <fun>:
8000: e92d4010 push {r4, lr}
8004: e3a01002 mov r1, #2
8008: e3a00001 mov r0, #1
800c: eb000003 bl 8020 <__foo_from_arm>
8010: e8bd4010 pop {r4, lr}
8014: e12fff1e bx lr
00008018 <foo>:
8018: 2002 movs r0, #2
801a: 4770 bx lr
801c: 0000 movs r0, r0
...
00008020 <__foo_from_arm>:
8020: e59fc000 ldr ip, [pc] ; 8028 <__foo_from_arm+0x8>
8024: e12fff1c bx ip
8028: 00008019 .word 0x00008019
802c: 00000000 .word 0x00000000
The linker adds a trampoline as I call it, I think others call it a vaneer. Either way the toolchain took care of is so long as we write the code right.
Remember and in particular this syntax for the assembler is very much assembler specific other assemblers may have other syntax to make this work. From the gcc generated code we see the generic solution which is more typing but probably a better habit.
.thumb
.type foo, %function
.global foo
foo:
mov r0,#2
bx lr
the .type foo, %function works for both arm and thumb in gnu assembler for arm. And it does not have to be positioned just before the labe (just like .globl or .global does not either. We get the same result from the toolchain with this assembly language.
Just for demonstration...
arm-none-eabi-as x.s -o x.o
arm-none-eabi-gcc -O2 -mthumb -c so.c -o so.o
arm-none-eabi-ld so.o x.o -o so.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000008000
arm-none-eabi-objdump -d so.elf
so.elf: file format elf32-littlearm
Disassembly of section .text:
00008000 <fun>:
8000: b510 push {r4, lr}
8002: 2102 movs r1, #2
8004: 2001 movs r0, #1
8006: f000 f807 bl 8018 <__foo_from_thumb>
800a: bc10 pop {r4}
800c: bc02 pop {r1}
800e: 4708 bx r1
00008010 <foo>:
8010: e3a00002 mov r0, #2
8014: e12fff1e bx lr
00008018 <__foo_from_thumb>:
8018: 4778 bx pc
801a: e7fd b.n 8018 <__foo_from_thumb>
801c: eafffffb b 8010 <foo>
And you can see it works both ways thumb to arm arm to thumb if we write the asm write it does the rest of the work for us.
Now I personally hate the unified syntax, it is one of the major mistakes arm has made along with CMSIS. But, you want to do this for a living you find that you pretty much hate most corporate decisions and worse, have to work/operate with them. Often the time unified syntax generates the wrong instruction and have to fiddle with the syntax to get it to work, but if I have to get a specific instruction then I have to fiddle about to get it to generate the specific instruction I am after. Other than a bootstrap and some other exceptions you do not often write assembly language anyway, usually compile something then take the compiler generated code and tune it or replace it.
I started with the arm gnu tools before unified syntax so I am used to
.thumb
.globl hello
hello:
sub r0,#1
bne hello
instead of
.thumb
.globl hello
hello:
subs r0,#1
bne hello
And fine with bouncing between the two syntaxes (unified and not, yes two assembly languages within one tool).
All of the above is with the 32 bit arm, if you are interested in 64 bit arm, AND using gnu tools, then a percentage of this still applies, you just need to use the aarch64 tools not the arm tools from gnu. ARM's aarch64 is a completely different, and incompatible, instruction set from aarch32. But gnu syntax like .global and .type...function are often used across all gnu supported targets. There are exceptions for some directives, but if you take the same approach of having the tools themselves tell you how they work...by using them...You can figure this out.
so.elf: file format elf64-littleaarch64
Disassembly of section .text:
0000000000400000 <fun>:
400000: 52800041 mov w1, #0x2 // #2
400004: 52800020 mov w0, #0x1 // #1
400008: 14000001 b 40000c <foo>
000000000040000c <foo>:
40000c: 52800040 mov w0, #0x2 // #2
400010: d65f03c0 ret
What you need to do is place the arguments in the correct registers (or on the stack) as required. All the details on how to do this are what is known as the calling convention and forms a very important part of the Application Binary Interface(ABI).
Details on the ARM (Armv7) calling convention can be found at: https://developer.arm.com/documentation/den0013/d/Application-Binary-Interfaces/Procedure-Call-Standard
I have a STM32F103C8 MCU, and I want to control GPIO registers without Cube MX. The MCU has an embedded LED and I want control it. I'm currently using CubeMX and IAR Software, and I make the pin an output (in CubeMX) with this code:
HAL_GPIO_TogglePin(Ld2_GPIO_Port,Ld2_Pin);
HAL_Delay(1000);
This works, but I want to do it without Cube and HAL library; I want to edit the register files directly.
Using GPIO using registers is very easy. You fo not have to write your own startup (as ion the #old_timer answer). Only 2 steps are needed
you will need the STM provided CMSIS headers with datatypes declarations and human readable #defines and the reference manual
Enable GPIO port clock.
ecample: RCC -> APB2ENR |= RCC_APB2ENR_IOPAEN;
Configure the pins using CRL/CRH GPIO registers
#define GPIO_OUTPUT_2MHz (0b10)
#define GPIO_OUTPUT_PUSH_PULL (0 << 2)
GPIOA -> CRL &= ~(GPIO_CRL_MODE0 | GPIO_CRL_CNF0);
GPIOA -> CRL |= GPIO_OUTPUT_2MHz | GPIO_OUTPUT_PUSH_PULL;
Manipulate the output
/* to toggle */
GPIOA -> ODR ^= (1 << pinNummer);
/* to set */
GPIOA -> BSRR = (1 << pinNummer);
/* to reset */
GPIOA -> BRR = (1 << pinNummer);
//or
GPIOA -> BSRR = (1 << (pinNummer + 16));
It is very good to know how to do bare metal without the canned libraries, and or to be able to read through those libraries and understand what you are getting yourself into by using them.
This blinks port C pin 13 that is where you generally find the user led on the stm32 blue pill boards. You can figure it out from here and the documentation for the STM32F103C8.
flash.s
.thumb
.thumb_func
.global _start
_start:
stacktop: .word 0x20001000
.word reset
.word loop
.word loop
.word loop
.word loop
.word loop
.word loop
.word loop
.word loop
.word loop
.word loop
.word loop
.word loop
.word loop
.word loop
.thumb_func
reset:
bl notmain
b loop
.thumb_func
loop: b .
.thumb_func
.globl PUT32
PUT32:
str r1,[r0]
bx lr
.thumb_func
.globl GET32
GET32:
ldr r0,[r0]
bx lr
so.c
void PUT32 ( unsigned int, unsigned int );
unsigned int GET32 ( unsigned int );
#define GPIOCBASE 0x40011000
#define RCCBASE 0x40021000
#define STK_CSR 0xE000E010
#define STK_RVR 0xE000E014
#define STK_CVR 0xE000E018
#define STK_MASK 0x00FFFFFF
static int delay ( unsigned int n )
{
unsigned int ra;
while(n--)
{
while(1)
{
ra=GET32(STK_CSR);
if(ra&(1<<16)) break;
}
}
return(0);
}
int notmain ( void )
{
unsigned int ra;
unsigned int rx;
ra=GET32(RCCBASE+0x18);
ra|=1<<4; //enable port c
PUT32(RCCBASE+0x18,ra);
//config
ra=GET32(GPIOCBASE+0x04);
ra&=~(3<<20); //PC13
ra|=1<<20; //PC13
ra&=~(3<<22); //PC13
ra|=0<<22; //PC13
PUT32(GPIOCBASE+0x04,ra);
PUT32(STK_CSR,4);
PUT32(STK_RVR,1000000-1);
PUT32(STK_CVR,0x00000000);
PUT32(STK_CSR,5);
for(rx=0;;rx++)
{
PUT32(GPIOCBASE+0x10,1<<(13+0));
delay(50);
PUT32(GPIOCBASE+0x10,1<<(13+16));
delay(50);
}
return(0);
}
flash.ld
MEMORY
{
rom : ORIGIN = 0x08000000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom
.rodata : { *(.rodata*) } > rom
.bss : { *(.bss*) } > ram
}
build
arm-none-eabi-as --warn --fatal-warnings flash.s -o flash.o
arm-none-eabi-gcc -Wall -Werror -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -c so.c -o so.o
arm-none-eabi-ld -o so.elf -T flash.ld flash.o so.o
arm-none-eabi-objdump -D so.elf > so.list
arm-none-eabi-objcopy so.elf so.bin -O binary
PUT32/GET32 is IMO a highly recommended style of abstraction, decades of experience and it has many benefits over the volatile pointer or worse the misuse of unions thing that is the current FAD. Not meant to be a library but to show code that does not require any libraries, only the files provided are required.
Most mcus you need to enable clocks to the peripheral before you can talk to it. You can see the read-modify-write of an RCC register.
Most MCUs the GPIO pins reset to inputs so you need to set one to an output to drive/blink an led. Even within the STM32 world but certainly across brands/families the GPIO (and other) peripherals are not expected to be identical nor even compatible so you have to refer to the documentation for that part and it will show how to make a pin an output. very good idea to read-modify-write instead of just write, but since you are in complete control over the chip you can just write if you wish, try that later.
This chip has a nice register that allows us to change the output state of one or more but not necessarily all GPIO outputs in a single write, no read-modify-write required. So I can set or clear pin 13 of GPIOC without affecting the state of the other GPIOC pins.
Some cortex-ms have a systick timer, for example not all cortex-m3s have to have one it is up to the chip folks usually and some cores may not have the option. This chip does so you can use it. In this example the timer is set to roll over every 1 million clocks, the delay function waits for N number of rollovers before returning. so 50,000,000 clocks between led state changes. since this code runs right from reset without messing with the clocking or other systems, the internal HSI 8MHz clock is used 50/8 = 6.25 seconds between led state changes. systick is very easy to use, but remember it is a 24 bit counter not 32 so if you want to do now vs then you must mask it.
I don't remember if it is an up counter
elapsed = (now - then) & 0x00FFFFFF;
or down
elapsed = (then - now) & 0x00FFFFFF;
(now = GET32(systick count register address))
The systick timer is in the arm documentation not the chip documentation necessarily although sometimes ST produces their own version, you want the arm one for sure and maybe then the st one. infocenter.arm.com (you have to give up an email address or you can Google sometimes you get lucky, someone will illegally post them somewhere) this chip will tell you it uses a cortex-m3 so find the technical reference manual for the cortex-m3 in that you will find it is based on architecture armv7-m so under architecture find the armv7-m documentation, between these you see how the vector table works, the systick timer and its addresses, etc.
Examine vector table
Disassembly of section .text:
08000000 <_start>:
8000000: 20001000 andcs r1, r0, r0
8000004: 08000041 stmdaeq r0, {r0, r6}
8000008: 08000047 stmdaeq r0, {r0, r1, r2, r6}
800000c: 08000047 stmdaeq r0, {r0, r1, r2, r6}
8000010: 08000047 stmdaeq r0, {r0, r1, r2, r6}
8000014: 08000047 stmdaeq r0, {r0, r1, r2, r6}
8000018: 08000047 stmdaeq r0, {r0, r1, r2, r6}
800001c: 08000047 stmdaeq r0, {r0, r1, r2, r6}
8000020: 08000047 stmdaeq r0, {r0, r1, r2, r6}
8000024: 08000047 stmdaeq r0, {r0, r1, r2, r6}
8000028: 08000047 stmdaeq r0, {r0, r1, r2, r6}
800002c: 08000047 stmdaeq r0, {r0, r1, r2, r6}
8000030: 08000047 stmdaeq r0, {r0, r1, r2, r6}
8000034: 08000047 stmdaeq r0, {r0, r1, r2, r6}
8000038: 08000047 stmdaeq r0, {r0, r1, r2, r6}
800003c: 08000047 stmdaeq r0, {r0, r1, r2, r6}
08000040 <reset>:
8000040: f000 f806 bl 8000050 <notmain>
8000044: e7ff b.n 8000046 <loop>
08000046 <loop>:
8000046: e7fe b.n 8000046 <loop>
The entry point code with our vector table which starts off with the value we would like to put in the stack pointer on reset should be the first thing, then vector tables which are the address of the handler ORRed with 1 (not as easy to find in the docs sometimes). the disassembly of these addresses is because I used the disassembler to view them those are not actual instructions in the vector table it is a table of vectors. the tool is just doing its best to disassemble everything, if you look at the rest of the output it also disassembles the ascii tables and other things which are also not code.
.data is not supported in this example a bunch more work would be required.
I recommend if/when you get yours working you then examine the HAL library sources to see that when you dig through layers of sometimes bloated or scary code, you will end up with the same core registers, they may choose to always configure all the gpio registers for example, speed and pull up/down, turn off the alternate function, etc. Or not. the above knows it is coming out of reset and the state of the system so doesn't go to those lengths for some peripherals you can pop the reset for that peripheral and put it in a known state rather than try to make a library that anticipates it being left in any condition and trying to configure from that state. YMMV.
It is good professionally to know how to work at this level as well as how to use libraries. An MCU chip vendor will often have two libraries, certainly for older parts like these, the current library product and the legacy library product, when a new library comes out to keep it fresh and competitive (looking) the oldest one will drop off from support and you sometimes have current and prior. depends on the vendor, depends on the part, depends on how they manage their software products (same goes for their IDEs and other tools).
Most of the stm32 parts esp a blue pill and other boards you can get do not require the fancy IDEs to program but external hardware is sometimes required unless you get a NUCLEO or Discovery board then you have at least enough to program the part with free software not attached to ST. with a nucleo it is mbed style where you simply copy the .bin file to the virtual usb drive and the board takes care of programming the development MCU.
In order to combine .c and assembly, I want to pass start address of my .c code, and program microcontroller to know that its program starts at that address. As I am writing my startup file in assembly, I need to pass .c code starting address to assembly, and then to write this address to the specific memory region of microcontroller ( so the microcontroller can start execution on this address after RESET)
Trying to create a project for stm32f103 in Keil with this structure:
Some .c file, for example main.c (for the main part of the program).
Startup file in assembly language. Which gets the adress of entry to the function written in some .c file, to be passed to Reset_Handler
Scatter file, written in this way:
LR_IROM1 0x08000000 0x00010000 { ; load region size_region
ER_IROM1 0x08000000 0x00010000 { ; load address = execution address
*.o (RESET, +First) ; RESET is code section with I.V.T.
* (InRoot$$Sections)
.ANY (+RO)
.ANY (+XO)
}
RW_IRAM1 0x20000000 0x00005000 { ; RW data
.ANY (+RW +ZI)
}
}
The problem is passing the entry point to the .c function. Reset_Handler, which needs .c entry point(starting adress) passed by __main, looks like this:
Reset_Handler PROC
EXPORT Reset_Handler [WEAK]
IMPORT __main
LDR R0, =__main
BX R0
ENDP
bout entry point __main, as a answer for one assembly raleted question was written:
__main() is the compiler supplied entry point for your C code. It is not the main() function you write, but performs initialisation for the
standard library, static data, the heap before calling your `main()'
function.
So, how to get this entry point in my assembly file?
Edit>> If somebody is interested in solution for KEIL, here it is, its all that simple!
Simple assembly startup.s file:
AREA STACK, NOINIT, READWRITE
SPACE 0x400
Stack_top
AREA RESET, DATA, READONLY
dcd Stack_top
dcd Reset_Handler
EXPORT _InitMC
IMPORT notmain
AREA PROGRAM, CODE, READONLY
Reset_Handler PROC
bl notmain
ENDP
_InitMC PROC ;start of the assembly procedure
Loop
b Loop ;infinite loop
ENDP
END
Simple c file:
extern int _InitMC();
int notmain(void) {
_InitMC();
return 0;
}
Linker is the same as the one mentioned above.
Project build was successful.
Using the gnu toolchain for example:
Bootstrap:
.cpu cortex-m0
.thumb
.thumb_func
.global _start
_start:
stacktop: .word 0x20001000
.word reset
.word loop
.word loop
.word loop
.thumb_func
reset:
bl notmain
b loop
.thumb_func
loop: b .
.align
.thumb_func
.globl fun
fun:
bx lr
.end
C entry point (function name is not relevant, sometimes using main() adds garbage, depends on the compiler/toolchain)
void fun ( unsigned int );
int notmain ( void )
{
unsigned int ra;
for(ra=0;ra<1000;ra++) fun(ra);
return(0);
}
Linker script
MEMORY
{
rom : ORIGIN = 0x08000000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom
.rodata : { *(.rodata*) } > rom
.bss : { *(.bss*) } > ram
}
Build
arm-none-eabi-gcc -Wall -Werror -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -mcpu=cortex-m0 -march=armv6-m -c so.c -o so.thumb.o
arm-none-eabi-ld -o so.thumb.elf -T flash.ld flash.o so.thumb.o
arm-none-eabi-objdump -D so.thumb.elf > so.thumb.list
arm-none-eabi-objcopy so.thumb.elf so.thumb.bin -O binary
arm-none-eabi-gcc -Wall -Werror -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -mcpu=cortex-m3 -march=armv7-m -c so.c -o so.thumb2.o
arm-none-eabi-ld -o so.thumb2.elf -T flash.ld flash.o so.thumb2.o
arm-none-eabi-objdump -D so.thumb2.elf > so.thumb2.list
arm-none-eabi-objcopy so.thumb2.elf so.thumb2.bin -O binary
Result (all thumb versions)
Disassembly of section .text:
08000000 <_start>:
8000000: 20001000
8000004: 08000015
8000008: 0800001b
800000c: 0800001b
8000010: 0800001b
08000014 <reset>:
8000014: f000 f804 bl 8000020 <notmain>
8000018: e7ff b.n 800001a <loop>
0800001a <loop>:
800001a: e7fe b.n 800001a <loop>
0800001c <fun>:
800001c: 4770 bx lr
800001e: 46c0 nop ; (mov r8, r8)
08000020 <notmain>:
8000020: b570 push {r4, r5, r6, lr}
8000022: 25fa movs r5, #250 ; 0xfa
8000024: 2400 movs r4, #0
8000026: 00ad lsls r5, r5, #2
8000028: 0020 movs r0, r4
800002a: 3401 adds r4, #1
800002c: f7ff fff6 bl 800001c <fun>
8000030: 42ac cmp r4, r5
8000032: d1f9 bne.n 8000028 <notmain+0x8>
8000034: 2000 movs r0, #0
8000036: bd70 pop {r4, r5, r6, pc}
Of course this has to be placed in flash at the right place with some tool.
The vector table is mapped by logic to 0x00000000 in the stm32 family.
08000000 <_start>:
8000000: 20001000
8000004: 08000015 <---- reset ORR 1
And in this minimal code the reset handler calls the C code the C code messes around and returns. Technically a fully functional program for most stm32s (change the stack init to a smaller value for those with less ram say 0x20000400 and it should work anywhere by using -mthumb by itself (armv4t) or adding the cortex-m0. well okay not the armv8ms they can technically not support all of armv6m but the one in the field I know about does.
I don't have Kiel so don't know how to translate to that, but it shouldn't be much of a stretch, just syntax.
I am new to MCU and trying to figure out how arm (Cortex M3-M4) based MCU boots. Because booting is specific to any SOC, I took an example hardware board of STM for case study.
Board: STMicroelectronics – STM32L476 32-bit.
In this board when booting mode is (x0)"Boot from User Flash", board maps 0x0000000 address to flash memory address. On flash memory I have pasted my binary with first 4 bytes pointing to vector table first entry, which is esp. Now if I press reset button ARM documentation says PC value will be set to 0x00000000.
CPU generally executes stream of instructions based on PC -> PC + 1 loop. In this case if I see PC value points to esp, which is not instruction. How does Arm CPU does the logic of not use this instruction address, but do a jump to value store at address 0x00000004?
Or this is the case:
Reset produces a special hardware interrupt and cause PC value to be value at 0x00000004, if this is the case why Arm documentation says it sets PC value to 0x00000000?
Ref: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka3761.html
What values are in ARM registers after a power-on reset? Applies to:
ARM1020/22E, ARM1026EJ-S, ARM1136, ARM720T, ARM7EJ-S, ARM7TDMI,
ARM7TDMI-S, ARM920/922T, ARM926EJ-S, ARM940T, ARM946E-S, ARM966E-S,
ARM9TDMI
Answer Registers R0 - R14 (including banked registers) and SPSR (in
all modes) are undefined after reset.
The Program Counter (PC/R15) will be set to 0x000000, or 0xFFFF0000 if
the core has a VINITHI or CFGHIVECS input which is set high as the
core leaves reset. This input should be set to reflect where the base
of the vector table in your system is located.
The Current Program Status Register (CPSR) will indicate that the ARM
core has started in ARM state, Supervisor mode with both FIQ and IRQ
mask bits set. The condition code flags will be undefined. Please see
the ARM Architecture Manual for a detailed description of the CPSR.
The cortex-m's do not boot the same way the traditional and full sized cores boot. Those at least for the reset as you pointed out fetch from address 0x00000000 (or the alternate if asserted) the first instructions, not really fair to call it the PC value as at this point the PC is somewhat bugus, there are multiple program counters being produced a fake one in r15, one leading the fetching, one doing prefetch, none are really the program counter. anyway, doesnt matter.
The cortex-m as documented in the armv7-m documentation (for the m3 and m4, for the m0 and m0+ see the armv6-m although they so far all boot the same way). These use a vector TABLE not instructions. The CORE reads address 0x00000000 (or an alternate if a strap is asserted) and that 32 bit value gets loaded into the stack pointer register. it reads address 0x00000004 it checks the lsbit (maybe not all cores do) if set then this is a valid thumb address, strips the lsbit off (makes it a zero) and begins to fetch the first instructions for the reset handler at that address so if your flash starts with
0x00000000 : 0x20001000
0x00000004 : 0x00000101
the cortex-m will put 0x20001000 in the stack pointer and fetch the first instructions from address 0x100. Being thumb instructions are 16 bits with thumb2 extensions being two 16 bit portions, its not an x86 the program counter is aligned for the full sized processors with 32 bit instructions it fetches on aligned addresses 0x0000, 0x0004, 0x0008 it doesnt increment pc <= pc + 1; For thumb mode or thumb processors it is pc = pc + 2. But also the fetches are not necessarily single instruction transactions, for the full sized they may fetch 4 or 8 words per transaction, the cortex-ms as documented in the technical reference manuals some are able to be compiled or strapped to 16 bits at a time or 32 bits at a time. So no need to talk about or think about execution loops fetching pc = pc + 1, that doesnt make sense even in an x86 these days.
to be fair arms documentation is generally good, on the better side compared to a number of others, not the best. Unlike the full sized arm exception table, the vector table in the cortex-m documentation was not done as well as it could have been, could have/should have just done something like the full sized but shown they were vectors not instructions. It is in there though in the architectural reference manual for the armv6-m and armv7-m (and I would assume armv8-m as well but have not looked, got some parts last week but boards are not here yet, will know very soon). Cant look for words like reset have to look for interrupt or undefined or hardfault, etc in that manual.
EDIT
unwrap your mind on this notion of how the processor starts fetching, it can be any arbitrary address they add into the design, and then the execution of the instructions determines the next address and next address, etc.
Also understand unlike say x86 or microchip pic or the avrs, etc, the core and the chips are two different companies. Even in those same company designs, but certainly where there is a clear division between the IP with a known bus, the ARM CORE will read address 0x00000004 on the AMBA/AXI/AHB bus, the chip vendor can mirror that address in as many different places as they want, in this case with the stm32 there probably isnt actually anything at 0x00000000 as their documentation implies based on the boot pins they map it either to an internal bootloader, or they map it to the user application at 0x08000000 (or in most stm32's if there is an exception thats fine I have not yet seen it) so when strapped that way and the logic has those addresses mirrored you will see the same 32 bit values at 0x00000000 and 0x08000000, 0x00000004 and 0x08000004 and so on for some limited amount of address space. This is why even though linking for 0x00000000 will work to some extent (till you hit that limit which is probably smaller than the application flash size), you will see most folks link for 0x08000000 and the hardware takes care of the rest, so your table really wants to look like
0x08000000 : 0x20001000
0x08000004 : 0x08000101
for an stm32, at least the dozens I have seen so far.
The processor reads 0x00000000 which is mirrored to the first item in the application flash, finds 0x20001000, it then reads 0x00000004 which is mirroed to the second word in the application flash and gets 0x08000101 which causes a fetch from 0x08000100 and now we are executing from the proper fully mapped application flash address space. so long as you dont change the mirroring, which I dont know if you can on an stm32 (nxp chips you can and I dont know about ti or other brands off hand). Some of the cortex-m cores the VTOR register is there and changable (others it is fixed at 0x00000000 and you cant change it), you do not need to change it to 0x08000000 for an stm32, at least all the ones I know about. its only if you are actively changing the mirroring of the zero address space yourself if possible or if you say have your own bootloader and maybe YOUR application space is 0x08004000 and that application wants a vector table of its own. then you either use VTOR or you build the bootloaders vector table such that it runs code that reads the vectors at 0x08004000 and branches to those. The NXP and others in the past certainly with the ARMV7TDMI cores, would let you change the mirroring of address zero because those older cores didnt have a programmable vector table offset register, helping you solve that problem in their chip designs. Newer ARM cores with a VTOR eliminate that need and over time the chip vendors might not bother anymore if they do at all...
EDIT
I dont know if you have the discovery board or the nucleo, I assume the latter as the former is not available (wish I knew about that one would like to have one. And/or I already have one and its buried in a drawer and I never got to it).
so here is a somewhat minimal program you can try on your stm32
.cpu cortex-m0
.thumb
.globl _start
_start:
.word 0x20000400
.word reset
.word loop
.word loop
.thumb_func
loop: b loop
.thumb_func
reset:
ldr r0,=0x20000000
mov r2,sp
str r2,[r0]
add r0,r0,#4
mov r2,pc
str r2,[r0]
add r0,r0,#4
mov r1,#0
top:
str r1,[r0]
add r1,r1,#1
b top
build
arm-none-eabi-as so.s -o so.o
arm-none-eabi-ld -Ttext=0x08000000 so.o -o so.elf
arm-none-eabi-objdump -D so.elf > so.list
arm-none-eabi-objcopy so.elf -O binary so.bin
this should build with arm-linux-whatever- or other arm-whatever-whatever tools from a binutils from the last 10 years.
The disassembly is important to examine before using the binary, dont want to brick your chip (with an stm32 there is a way to get unbricked)
08000000 <_start>:
8000000: 20000400 andcs r0, r0, r0, lsl #8
8000004: 08000013 stmdaeq r0, {r0, r1, r4}
8000008: 08000011 stmdaeq r0, {r0, r4}
800000c: 08000011 stmdaeq r0, {r0, r4}
08000010 <loop>:
8000010: e7fe b.n 8000010 <loop>
08000012 <reset>:
8000012: 4805 ldr r0, [pc, #20] ; (8000028 <top+0x6>)
8000014: 466a mov r2, sp
8000016: 6002 str r2, [r0, #0]
8000018: 3004 adds r0, #4
800001a: 467a mov r2, pc
800001c: 6002 str r2, [r0, #0]
800001e: 3004 adds r0, #4
8000020: 2100 movs r1, #0
08000022 <top>:
8000022: 6001 str r1, [r0, #0]
8000024: 3101 adds r1, #1
8000026: e7fc b.n 8000022 <top>
8000028: 20000000 andcs r0, r0, r0
the disassembler doesnt know that the vector table is not instructions so you can ignore those.
08000000 <_start>:
8000000: 20000400
8000004: 08000013
8000008: 08000011
800000c: 08000011
08000010 <loop>:
8000010: e7fe b.n 8000010 <loop>
08000012 <reset>:
Does it start the vector table at 0x08000000, check. Our stack pointer init value is at 0x00000000, yes, the reset vector we had the tools place for us. thumb_func tells them the following label is an address for some code/function/procedure/whatever_not_data so they orr the one on there for us. our reset handler is at address 0x08000012 so we want to see 0x08000013 in the vector table, check. I tossed in a couple more for demonstration purposes, sent them to an infinite loop at address 0x08000010 so the vector table should have 0x08000011, check.
So assuming you have a nucleo board not the discovery then you can copy the so.bin file to the thumb drive that shows up when you plug it in.
If you use openocd to connect through the stlink interface into the board now you can see that it was running (details left to the reader to figure out)
Open On-Chip Debugger
> halt
stm32f0x.cpu: target state: halted
target halted due to debug-request, current mode: Thread
xPSR: 0x01000000 pc: 0x08000022 msp: 0x20000400
> mdw 0x20000000 20
0x20000000: 20000400 0800001e 0048cd01 200002e7 200002e9 200002eb 200002ed 00000000
0x20000020: 00000000 00000000 00000000 200002f1 200002ef 00000000 200002f3 200002f5
0x20000040: 200002f7 200002f9 200002fb 200002fd
> resume
> halt
stm32f0x.cpu: target state: halted
target halted due to debug-request, current mode: Thread
xPSR: 0x01000000 pc: 0x08000022 msp: 0x20000400
> mdw 0x20000000 20
0x20000000: 20000400 0800001e 005e168c 200002e7 200002e9 200002eb 200002ed 00000000
0x20000020: 00000000 00000000 00000000 200002f1 200002ef 00000000 200002f3 200002f5
0x20000040: 200002f7 200002f9 200002fb 200002fd
so we can see that the stack pointer had 0x20000400 as expected
0x20000000: 20000400 0800001e 0048cd01
the program counter which is not some magical thing, they have to somewhat fake it to make the instruction set work.
800001a: 467a mov r2, pc
as defined in the instruction set the pc value used in this instruction is two instructions ahead of the address of this instruction, so 0x0800001A + 4 = 0x0800001E which is what we see in the memory dump.
And the third item is a counter showing we are running, the resume and halt shows that that count kept going
0x20000000: 20000400 0800001e 005e168
So this demonstrates, the vector table, initializing the stack pointer, the reset vector, where code execution starts, what the value of the pc is at some point in the program, and seeing the program run.
the .cpu cortex-m0 makes it build the most compatible program for the cortex-m family and the mov r0,=0x20000000 was cheating, you posted the same feature in your comment it says I want to load the address of blah into the register a label is just an address and they let you put just an address =_estack is the address of a label =0x20000000 is just a number treated as an address (addresses are just numbers as well, nothing magical about them). I could have done a smaller immediate with a shift or explicitly have done the pc relative load. force of habit in this case.
EDIT2
In attempt for a programmer to understand that the chip is logic, only some percentage of it is software/instruction driven, even within that it is just logic that does more things than the software instruction itself indicates. You want to read from memory your instruction asks the processor to do it but in a real chip there are a number of steps involved to actually perform that, microcoded or not (ARMs are not microcoded) there are state machines that walk through the various steps to perform each of these tasks. grab the values from registers, compute the address, do the memory transaction which is a handful of separate steps, take the return value and place it in the register file.
.thumb
.globl _start
_start:
.word 0x20001000
.word reset
.word loop
.word loop
.thumb_func
loop: b loop
.thumb_func
reset:
ldr r0,loop_counts
loop_top:
sub r0,r0,#1
bne loop_top
b reset
.align
loop_counts: .word 0x1234
00000000 <_start>:
0: 20001000 andcs r1, r0, r0
4: 00000013 andeq r0, r0, r3, lsl r0
8: 00000011 andeq r0, r0, r1, lsl r0
c: 00000011 andeq r0, r0, r1, lsl r0
00000010 <loop>:
10: e7fe b.n 10 <loop>
00000012 <reset>:
12: 4802 ldr r0, [pc, #8] ; (1c <loop_counts>)
00000014 <loop_top>:
14: 3801 subs r0, #1
16: d1fd bne.n 14 <loop_top>
18: e7fb b.n 12 <reset>
1a: 46c0 nop ; (mov r8, r8)
0000001c <loop_counts>:
1c: 00001234 andeq r1, r0, r4, lsr r2
Just barely enough of an instruction set simulator to run that program.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define ROMMASK 0xFFFF
#define RAMMASK 0xFFF
unsigned short rom[ROMMASK+1];
unsigned short ram[RAMMASK+1];
unsigned int reg[16];
unsigned int pc;
unsigned int cpsr;
unsigned int inst;
int main ( void )
{
unsigned int ra;
unsigned int rb;
unsigned int rc;
unsigned int rx;
//just putting something there, a real chip might have an MBIST, might not.
memset(reg,0xBA,sizeof(reg));
memset(ram,0xCA,sizeof(ram));
memset(rom,0xFF,sizeof(rom));
//in a real chip the rom/flash would contain the program and not
//need to do anything to it, this sim needs to have the program
//various ways to have done this...
//00000000 <_start>:
rom[0x00>>1]=0x1000; // 0: 20001000 andcs r1, r0, r0
rom[0x02>>1]=0x2000;
rom[0x04>>1]=0x0013; // 4: 00000013 andeq r0, r0, r3, lsl r0
rom[0x06>>1]=0x0000;
rom[0x08>>1]=0x0011; // 8: 00000011 andeq r0, r0, r1, lsl r0
rom[0x0A>>1]=0x0000;
rom[0x0C>>1]=0x0011; // c: 00000011 andeq r0, r0, r1, lsl r0
rom[0x0E>>1]=0x0000;
//
//00000010 <loop>:
rom[0x10>>1]=0xe7fe; // 10: e7fe b.n 10 <loop>
//
//00000012 <reset>:
rom[0x12>>1]=0x4802; // 12: 4802 ldr r0, [pc, #8] ; (1c <loop_counts>)
//
//00000014 <loop_top>:
rom[0x14>>1]=0x3801; // 14: 3801 subs r0, #1
rom[0x16>>1]=0xd1fd; // 16: d1fd bne.n 14 <loop_top>
rom[0x18>>1]=0xe7fb; // 18: e7fb b.n 12 <reset>
rom[0x1A>>1]=0x46c0; // 1a: 46c0 nop ; (mov r8, r8)
//
//0000001c <loop_counts>:
rom[0x1C>>1]=0x0004; // 1c: 00001234 andeq r1, r0, r4, lsr r2
rom[0x1E>>1]=0x0000;
//reset
//THIS IS NOT SOFTWARE DRIVEN LOGIC, IT IS JUST LOGIC
ra=rom[0x00>>1];
rb=rom[0x02>>1];
reg[14]=(rb<<16)|ra;
ra=rom[0x04>>1];
rb=rom[0x06>>1];
rc=(rb<<16)|ra;
if((rc&1)==0) return(1); //normally run a fault handler here
pc=rc&0xFFFFFFFE;
reg[15]=pc+2;
cpsr=0x000000E0;
//run
//THIS PART BELOW IS SOFTWARE DRIVEN LOGIC
//still you can see that each instruction requires some amount of
//non-software driven logic.
//while(1)
for(rx=0;rx<20;rx++)
{
inst=rom[(pc>>1)&ROMMASK];
printf("0x%08X : 0x%04X\n",pc,inst);
reg[15]=pc+4;
pc+=2;
if((inst&0xF800)==0x4800)
{
//LDR
printf("LDR r%02u,[PC+0x%08X]",(inst>>8)&0x7,(inst&0xFF)<<2);
ra=(inst>>0)&0xFF;
rb=reg[15]&0xFFFFFFFC;
ra=rb+(ra<<2);
printf(" {0x%08X}",ra);
rb=rom[((ra>>1)+0)&ROMMASK];
rc=rom[((ra>>1)+1)&ROMMASK];
ra=(inst>>8)&0x07;
reg[ra]=(rc<<16)|rb;
printf(" {0x%08X}\n",reg[ra]);
continue;
}
if((inst&0xF800)==0x3800)
{
//SUB
ra=(inst>>8)&0x07;
rb=(inst>>0)&0xFF;
printf("SUBS r%u,%u ",ra,rb);
rc=reg[ra];
rc-=rb;
reg[ra]=rc;
printf("{0x%08X}\n",rc);
//do flags
if(rc==0) cpsr|=0x80000000; else cpsr&=(~0x80000000); //N flag
//dont need other flags for this example
continue;
}
if((inst&0xF000)==0xD000) //B conditional
{
if(((inst>>8)&0xF)==0x1) //NE
{
ra=(inst>>0)&0xFF;
if(ra&0x80) ra|=0xFFFFFF00;
rb=reg[15]+(ra<<1);
printf("BNE 0x%08X\n",rb);
if((cpsr&0x80000000)==0)
{
pc=rb;
}
continue;
}
}
if((inst&0xF000)==0xE000) //B
{
ra=(inst>>0)&0x7FF;
if(ra&0x400) ra|=0xFFFFF800;
rb=reg[15]+(ra<<1);
printf("B 0x%08X\n",rb);
pc=rb;
continue;
}
printf("UNDEFINED INSTRUCTION 0x%08X: 0x%04X\n",pc-2,inst);
break;
}
return(0);
}
You are welcome to hate my coding style, this is a brute force thrown together for this question thing. No I dont work for ARM, this can all be pulled from public documents/information. I shortened the loop to 4 counts to see it hit the outer loop
0x00000012 : 0x4802
LDR r00,[PC+0x00000008] {0x0000001C} {0x00000004}
0x00000014 : 0x3801
SUBS r0,1 {0x00000003}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000002}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000001}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000000}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000018 : 0xE7FB
B 0x00000012
0x00000012 : 0x4802
LDR r00,[PC+0x00000008] {0x0000001C} {0x00000004}
0x00000014 : 0x3801
SUBS r0,1 {0x00000003}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000002}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000001}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000000}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000018 : 0xE7FB
B 0x00000012
Perhaps this helps perhaps this makes it worse. Most of the logic is not driven by instructions, each instruction, requires some amount of logic not counting the common logic like instruction fetching and things like that.
If you add more code this simulator will break it ONLY supports these handful of instructions and this loop.
The most important thing to check when you're confused about some behaviour of an Arm processor is probably to check the version of the architecture which applies. You will find a huge amount of very old legacy documentation which relates to ARM7 and ARM9 designs. Whilst not all of this is wrong today, it can be very misleading.
ARM v4, ARM v5, ARM v6: These are legacy designs, rarely even used in derivative products now.
ARM v7A: These are the first of the Cortex series. Cortex-A5 is the entry-level for a linux class device in 2018.
ARM v7M, ARM v6M: These are the common microcontroller devices like your STM32, and already these have over 10 years of history
ARM v8A: These introduce the 64 bit instruction set (T32/A32/A64 in one device), already entry level in the R-pi 3 for example.
ARM v8M: The latest iteration of an microcontroller architecture with more advanced security features, just starting to become available 2018Q2
Specifically, ARMv6M/ARMv7M/ARMv8M provide a very different exception model compared with all of the other ARM architectures (remaining similar within the family), whilst many of the other differences are more incremental or focused on specialised area.
I've been programming in C and C++ for quite a long time now, so I'm familiar with the linking process as a user: the preprocessor expands all prototypes and macros in each .c file which is then compiled separately into its own object file, and all object files together with static libraries are linked into an executable.
However I'd like to know more about this process: how does the linker link the object files (what do they contain anyway?)? Matching declared but undefined functions with their definitions in other files (how?)? Translating into the exact content of the program memory (context: microcontrollers)?
Application example
Ideally, I'm looking for a detailed step-by-step description of what the process is doing, based on the following simplistic example. Since it doesn't appear to be said anywhere, fame and glory to whoever answers in this way.
main.c
#include "otherfile.h"
int main(void) {
otherfile_print("Foo");
return 0;
}
otherfile.h
void otherfile_print(char const *);
otherfile.c
#include "otherfile.h"
#include <stdio.h>
void otherfile_print(char const *str) {
printf(str);
}
printf is insanely complicated, very bad for a microcontroller hello world example, blinking leds are better but that gets specific to the microcontroller. this will suffice for linking.
two.c
unsigned int glob;
unsigned int two ( unsigned int a, unsigned int b )
{
glob=5;
return(a+b+7);
}
one.c
extern unsigned int glob;
unsigned int two ( unsigned int, unsigned int );
unsigned int one ( void )
{
return(two(5,6)+glob);
}
start.s
.globl _start
_start:
bl one
b .
build everything.
% arm-none-eabi-gcc -O2 -c one.c -o one.o
% arm-none-eabi-gcc -O2 -c two.c -o two.o
% touch start.s
% arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -c one.c -o one.o
% arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -c two.c -o two.o
% arm-none-eabi-as start.s -o start.o
% arm-none-eabi-ld -Ttext=0x10000000 start.o one.o two.o -o onetwo.elf
now lets look...
arm-none-eabi-objdump -D start.o
...
00000000 <_start>:
0: ebfffffe bl 0 <one>
4: eafffffe b 4 <_start+0x4>
it not is the compiler/assemblers job to deal with external references so the branch link to one is left incomplete, they chose to make it a bl of 0 but they could have simply left it totally unencoded, it is up to the authors of the toolchain as to how to communicate between the compiler, assembler, and linker via object files.
Same here
00000000 <one>:
0: e92d4008 push {r3, lr}
4: e3a00005 mov r0, #5
8: e3a01006 mov r1, #6
c: ebfffffe bl 0 <two>
10: e59f300c ldr r3, [pc, #12] ; 24 <one+0x24>
14: e5933000 ldr r3, [r3]
18: e0800003 add r0, r0, r3
1c: e8bd4008 pop {r3, lr}
20: e12fff1e bx lr
24: 00000000 andeq r0, r0, r0
both the function two and the address for the global variable glob are unknown. Note that for the unknown variable the compiler generates code that requires the explicit address of the global so that the linker simply needs to fill in the address, also glob is .data not .text.
00000000 <two>:
0: e59f3010 ldr r3, [pc, #16] ; 18 <two+0x18>
4: e2811007 add r1, r1, #7
8: e3a02005 mov r2, #5
c: e0810000 add r0, r1, r0
10: e5832000 str r2, [r3]
14: e12fff1e bx lr
18: 00000000 andeq r0, r0, r0
here too the global is in .data not here, so the linker will have to place .data and the things in it and then fill in the addresses.
so here we have linked it all together, the gnu linker requires an entry point label defined _start (main is an extern address required by the standard bootstrap, which I am not using so we dont get a main not found error). Because I am not using a linker script the gnu linker places items in the binary in the order they were defined on the command line, as desired i need start first for a microcontroller since I am controlling the boot. I used a non-zero here for demonstration purposes as well...
10000000 <_start>:
10000000: eb000000 bl 10000008 <one>
10000004: eafffffe b 10000004 <_start+0x4>
10000008 <one>:
10000008: e92d4008 push {r3, lr}
1000000c: e3a00005 mov r0, #5
10000010: e3a01006 mov r1, #6
10000014: eb000005 bl 10000030 <two>
10000018: e59f300c ldr r3, [pc, #12] ; 1000002c <one+0x24>
1000001c: e5933000 ldr r3, [r3]
10000020: e0800003 add r0, r0, r3
10000024: e8bd4008 pop {r3, lr}
10000028: e12fff1e bx lr
1000002c: 1000804c andne r8, r0, ip, asr #32
10000030 <two>:
10000030: e59f3010 ldr r3, [pc, #16] ; 10000048 <two+0x18>
10000034: e2811007 add r1, r1, #7
10000038: e3a02005 mov r2, #5
1000003c: e0810000 add r0, r1, r0
10000040: e5832000 str r2, [r3]
10000044: e12fff1e bx lr
10000048: 1000804c andne r8, r0, ip, asr #32
Disassembly of section .bss:
1000804c <__bss_start>:
1000804c: 00000000 andeq r0, r0, r0
so the linker starts to place the first item start.o, it roughly figures out how big that needs to be by just putting what was there. those two instructions. they take 8 bytes so in theory the second item one.o goes next at 0x10000008. That means the encoding for the bl one in start.s can be completed to use the correct relative address (_start + 8 which is the value of the pc when executing so the offset is zero, pc+0 is the encoding)
the linker has roughly placed one.o into the binary it is building and it has to resolve the address to two and the global so it has to place two.o and then figure out where the end of that is to place in this case .bss not .data since I didnt pre-init the variable.
the label for two is at 0x10000030 so it encodes the bl two in one() for that relative offset, it has also placed glob at 1000804c for some reason (I didnt complete define where ram was so the gnu linker will do things like this). Despite the reason, that is where the linker defined the home for that global variable and where the address to glob is needed is filled in by the linker, both one() and two() needed those filled in.
So the compiler (assembler) and linker have to in the end result in a usable binary, the compiler (assembler) tend to worry about making position independent machine code and leave enough information for the linker so that it has the machine code and a list of unresolved externs that it has to fill in. compilers have improved over time, a simple model would be to have an address location like they did above for the global variables address, where the linker computes the absolute address and just fills it in, clearly above they did not encode the function call in a way that it can use an absolute address to one and two. instead it uses pc relative addressing. This means that the linker has to know the machine code encoding of the bl instruction. the current generation of gnu linker knows quite a bit more and can do some cool things resolving arm to thumb and back, stuff it didnt used to know (you dont need to compile for thumb interwork anymore the linker takes care of it).
So the linker takes binary blobs including data and...links them together into one binary. It first needs to know the actual addresses for the various things in the binary. How you tell the linker this is linker specific and not a global thing for all C/C++ toolchains. Gnu linker scripts are a programming language in and of themselves. These are not necessarily physical nor virtual addresses it is simply the address space of the code in whatever mode it is in (virtual or physical). Once the linker knows the addresses it, based on linker rules (again linker specific) it starts placing these various binary blobs into those address spaces. then it goes through and resolves the external/global addresses. It was not above but can be an iterative process. If for example the function two() was at an address in memory that cannot be accessed with a single pc relative instruction (say we put one near zero and two near 0xF0000000) then those that wrote the linker have two choices, the simple choice is to simply state that it cannot encode/implement that far of a branch and bail out and gnu linker did or still does do that. Or the other solution is the linker fixes the problem. the linker could add a few words of data within the range of the pc relative branch link and those few words of data are a trampoline for example an absolute address that is loaded into a register then a register based branch or perhaps of clever a pc relative branch if the trampoline is within range (in the case of 0x10000000 to 0xF0000000 that wouldnt work). If the linker has to add these few words then that may mean that some of the binary blobs have to move to make room for those few words and now all of the addresses in those binary blobs now have to move as well. So you have to make another pass across all the binary blobs, resolving all of the new addresses filling in the answers and for pc relative determining if you can still reach everything. Adding those few words might have made something that was reachable with a pc-relative now unreachable and now that requires a solution (error or patch).
The assembler itself for a single source file has to go through even more of these gyrations esp for a variable length instruction set like x86 where the addressing is a big vague. I recommend trying for yourself to make a simple assembler that only supports a few instructions but some of those branches. and parse and encode the instructions and compare that to an existing debugged assembler like gnu assembler.
test.s
ldr r1,locdat
nop
nop
nop
nop
nop
b over
locdat: .word 0x12345678
top:
nop
nop
nop
nop
nop
nop
over:
b top
the right answer is
00000000 <locdat-0x1c>:
0: e59f1014 ldr r1, [pc, #20] ; 1c <locdat>
4: e1a00000 nop ; (mov r0, r0)
8: e1a00000 nop ; (mov r0, r0)
c: e1a00000 nop ; (mov r0, r0)
10: e1a00000 nop ; (mov r0, r0)
14: e1a00000 nop ; (mov r0, r0)
18: ea000006 b 38 <over>
0000001c <locdat>:
1c: 12345678 eorsne r5, r4, #120, 12 ; 0x7800000
00000020 <top>:
20: e1a00000 nop ; (mov r0, r0)
24: e1a00000 nop ; (mov r0, r0)
28: e1a00000 nop ; (mov r0, r0)
2c: e1a00000 nop ; (mov r0, r0)
30: e1a00000 nop ; (mov r0, r0)
34: e1a00000 nop ; (mov r0, r0)
00000038 <over>:
38: eafffff8 b 20 <top>
there are parallels to that activity and the job of a linker. also you could fashion a simple linker based on the above files or something similar, extract the binary blobs and other info and start placing them in whatever address space you want.
Either one are fairly simple programming tasks, yet fairly educational. Having an existing toolchain that can produce the answer you can figure out where you are going wrong or how to get at the right answer.