ARM bootloader: Interrupt Vector Table Understanding

ARM bootloader: Interrupt Vector Table Understanding - arm

The code following is the first part of u-boot to define interrupt vector table, and my question is how every line will be used. I understand the first 2 lines which is the starting point and the first instruction to implement: reset, and we define reset below. But when will we use these instructions below? According to System.map, every instruction has a fixed address, so _fiq is at 0x0000001C, when we want to execute fiq, we will copy this address into pc and then execute,right? But in which way can we jump to this instruction: ldr pc, _fiq? It's realised by hardware or software? Hope I make myself understood correctly.
>.globl _start
>_start:b reset
> ldr pc, _undefined_instruction
> ldr pc, _software_interrupt
> ldr pc, _prefetch_abort
> ldr pc, _data_abort
> ldr pc, _not_used
> ldr pc, _irq
> ldr pc, _fiq
>_undefined_instruction: .word undefined_instruction
>_software_interrupt: .word software_interrupt
>_prefetch_abort: .word prefetch_abort
>_data_abort: .word data_abort
>_not_used: .word not_used
>_irq: .word irq
>_fiq: .word fiq

If you understand reset then you understand all of them.
When the processor is reset then hardware sets the pc to 0x0000 and starts executing by fetching the instruction at 0x0000. When an undefined instruction is executed or tries to be executed the hardware responds by setting the pc to 0x0004 and starts executing the instruction at 0x0004. irq interrupt, the hardware finishes the instruction it is executing starts executing the instruction at address 0x0018. and so on.
00000000 <_start>:
0: ea00000d b 3c <reset>
4: e59ff014 ldr pc, [pc, #20] ; 20 <_undefined_instruction>
8: e59ff014 ldr pc, [pc, #20] ; 24 <_software_interrupt>
c: e59ff014 ldr pc, [pc, #20] ; 28 <_prefetch_abort>
10: e59ff014 ldr pc, [pc, #20] ; 2c <_data_abort>
14: e59ff014 ldr pc, [pc, #20] ; 30 <_not_used>
18: e59ff014 ldr pc, [pc, #20] ; 34 <_irq>
1c: e59ff014 ldr pc, [pc, #20] ; 38 <_fiq>
00000020 <_undefined_instruction>:
20: 00000000 andeq r0, r0, r0
00000024 <_software_interrupt>:
24: 00000000 andeq r0, r0, r0
00000028 <_prefetch_abort>:
28: 00000000 andeq r0, r0, r0
0000002c <_data_abort>:
2c: 00000000 andeq r0, r0, r0
00000030 <_not_used>:
30: 00000000 andeq r0, r0, r0
00000034 <_irq>:
34: 00000000 andeq r0, r0, r0
00000038 <_fiq>:
38: 00000000 andeq r0, r0, r0
Now of course in addition to changing the pc and starting execution from these addresses. The hardware will save the state of the machine, switch processor modes if necessary and then start executing at the new address from the vector table.
Our job as programmers is to build the binary such that the instructions we want to be run for each of these instructions is at the right address. The hardware provides one word, one instruction for each location. Now if you never expect to ever have any of these exceptions, you dont have to have a branch at address zero for example you can just have your program start, there is nothing magic about the memory at these addresses. If you expect to have these exceptions, then you have two choices for instructions that are one word and can jump out of the way of the exception that follows. One is a branch the other is a load pc. There are pros and cons to each.

When the hardware takes an exception, the program counter (PC) is automatically set to the address of the relevant exception vector and the processor begins executing instructions from that address. When the processor comes out of reset, the PC is automatically set to base+0. An undefined instruction sets the PC to base+4, etc. The base address of the vector table (base) is either 0x00000000, 0xFFFF0000, or VBAR depending on the processor and configuration. Note that this provides limited flexibility in where the vector table gets placed and you'll need to consult the ARM documentation in conjunction with the reference manual for the device that you are using to get the right value to be used.
The layout of the table (4 bytes per exception) makes it necessary to immediately branch from the vector to the actual exception handler. The reasons for the LDR PC, label approach are twofold - because a PC-relative branch is limited to (24 << 2) bits (+/-32MB) using B would constrain the layout of the code in memory somewhat; by loading an absolute address the handler can be located anywhere in memory. Secondly it makes it very simple to change exception handlers at runtime, by simply writing a different address to that location, rather than having to assemble and hotpatch a branch instruction.
There's little value to having a remappable reset vector in this way, however, which is why you tend to see that one implemented as a simple branch to skip over the rest of the vectors to the real entry point code.

Related

ARM Thumb GCC Disassembled C. Caller-saved registers not saved and loading and storing same register immediately

Context: STM32F469 Cortex-M4 (ARMv7-M Thumb-2), Win 10, GCC, STM32CubeIDE; Learning/Trying out inline assembly & reading disassembly, stack managements etc., writing to core registers, observing contents of registers, examining RAM around stack pointer to understand how things work.
I've noticed that at some point, when I call a function, in the beginning of a called function, which received an argument, the instructions generated for the C function do "store R3 at RAM address X" followed immediately "Read RAM address X and store in RAM". So it's writing and reading the same value back, R3 is not changed. If it only had wanted to save the value of R3 onto the stack, why load it back then?
C code, caller function (main), my code:
asm volatile(" LDR R0,=#0x00000000\n"
" LDR R1,=#0x11111111\n"
" LDR R2,=#0x22222222\n"
" LDR R3,=#0x33333333\n"
" LDR R4,=#0x44444444\n"
" LDR R5,=#0x55555555\n"
" LDR R6,=#0x66666666\n"
" MOV R7,R7\n" //Stack pointer value is here, used for stack data access
" LDR R8,=#0x88888888\n"
" LDR R9,=#0x99999999\n"
" LDR R10,=#0xAAAAAAAA\n"
" LDR R11,=#0xBBBBBBBB\n"
" LDR R12,=#0xCCCCCCCC\n"
);
testInt = addFifteen(testInt); //testInt=0x03; returns uint8_t, argument uint8_t
Function call generates instructions to load function argument into R3, then move it to R0, then branch with link to addFifteen. So by the time I enter addFifteen, R0 and R3 have value 0x03 (testInt). So far so good. Here is what function call looks like:
testInt = addFifteen(testInt);
08000272: ldrb r3, [r7, #11]
08000274: mov r0, r3
08000276: bl 0x80001f0 <addFifteen>
So I go into addFifteen, my C code for addFifteen:
uint8_t addFifteen(uint8_t input){
return (input + 15U);
}
and its disassembly:
addFifteen:
080001f0: push {r7}
080001f2: sub sp, #12
080001f4: add r7, sp, #0
080001f6: mov r3, r0
080001f8: strb r3, [r7, #7]
080001fa: ldrb r3, [r7, #7]
080001fc: adds r3, #15
080001fe: uxtb r3, r3
08000200: mov r0, r3
08000202: adds r7, #12
08000204: mov sp, r7
08000206: ldr.w r7, [sp], #4
0800020a: bx lr
My primary interest is in 1f8 and 1fa lines. It stored R3 on stack and then loads freshly written value back into the register that still holds the value anyway.
Questions are:
What is the purpose of this "store register A into RAM X, next read value from RAM X into register A"? Read instruction doesn't seem to serve any purpose. Make sure RAM write is complete?
Push{r7} instruction makes stack 4-byte aligned instead of 8-byte aligned. But immediately after that instruction we have SP decremented by 12 (bytes), so it becomes 8-byte aligned again. Therefore, this behavior is ok. Is this statement correct? What if an interrupt happens between these two instructions? Will alignment be fixed during ISR stacking for the duration of ISR?
From what I read about caller/callee saved registers (very hard to find any sort of well-organized information on that, if you have good material, please, share a link), at least R0-R3 must be placed on stack when I call a function. However, it's easy to notice in this case that NONE of the registers were pushed on stack, and I verified it by checking memory around stack pointer, it would have been easy to notice 0x11111111 and 0x22222222, but they aren't there, and nothing is pushing them there. The values in R0 and R3 that I had before I called the function are simply gone forever. Why weren't any registers pushed on stack before function call? I would expect to have R3 0x33333333 when addFifteen returns because that's how it was before function call, but that value is casually overwritten even before branch to addFifteen. Why didn't GCC generate instructions to push R0-R3 onto the stack and only after that branch with link to addFifteen?
If you need some compiler settings, please, let me know where to find them in Eclipse (STM32CubeIDE) and what exactly you need there, I will happily provide them and add them to the question here.

uint8_t addFifteen(uint8_t input){
return (input + 15U);
}
What you are looking at here is unoptimized and at least with gnu the input and local variables get a memory location on the stack.
00000000 <addFifteen>:
0: b480 push {r7}
2: b083 sub sp, #12
4: af00 add r7, sp, #0
6: 4603 mov r3, r0
8: 71fb strb r3, [r7, #7]
a: 79fb ldrb r3, [r7, #7]
c: 330f adds r3, #15
e: b2db uxtb r3, r3
10: 4618 mov r0, r3
12: 370c adds r7, #12
14: 46bd mov sp, r7
16: bc80 pop {r7}
18: 4770 bx lr
What you see with r3 is that the input variable, input, comes in r0. For some reason, code is not optimized, it goes into r3, then it is saved in its memory location on the stack.
Setup the stack
00000000 <addFifteen>:
0: b480 push {r7}
2: b083 sub sp, #12
4: af00 add r7, sp, #0
save input to the stack
6: 4603 mov r3, r0
8: 71fb strb r3, [r7, #7]
so now we can start implementing the code in the function which wants to do math on the input function, so do that math
a: 79fb ldrb r3, [r7, #7]
c: 330f adds r3, #15
Convert the result to an unsigned char.
e: b2db uxtb r3, r3
Now prepare the return value
10: 4618 mov r0, r3
and clean up and return
12: 370c adds r7, #12
14: 46bd mov sp, r7
16: bc80 pop {r7}
18: 4770 bx lr
Now if I tell it not to use a frame pointer (just a waste of a register).
00000000 <addFifteen>:
0: b082 sub sp, #8
2: 4603 mov r3, r0
4: f88d 3007 strb.w r3, [sp, #7]
8: f89d 3007 ldrb.w r3, [sp, #7]
c: 330f adds r3, #15
e: b2db uxtb r3, r3
10: 4618 mov r0, r3
12: b002 add sp, #8
14: 4770 bx lr
And you can still see each of the fundamental steps in implementing the function. Unoptimized.
Now if you optimize
00000000 <addFifteen>:
0: 300f adds r0, #15
2: b2c0 uxtb r0, r0
4: 4770 bx lr
It removes all the excess.
number two.
Yes I agree this looks wrong, but gnu certainly does not keep the stack on an alignment at all times, so this looks wrong. But I have not read the details on the arm calling convention. Nor have I read to see what gcc's interpretation is. Granted they may claim a spec, but at the end of the day the compiler authors choose the calling convention for their compiler, they are under no obligation to arm or intel or others to conform to any spec. Their choice, and like the C language itself, there are lots of places where it is implementation defined and gnu implements the C language one way and others another way. Perhaps this is the same. Same goes for this saving of the incoming variable to the stack. We will see that llvm/clang does not.
number three.
r0-r3 and another register or two may be called caller saved, but the better way to think of them is volatile. The callee is free to modify them without saving them. It is not so much a case of saving the r0 register, but instead r0 represents a variable and you are managing that variable in functionally implementing the high level code.
For example
unsigned int fun1 ( void );
unsigned int fun0 ( unsigned int x )
{
return(fun1()+x);
}
00000000 <fun0>:
0: b510 push {r4, lr}
2: 4604 mov r4, r0
4: f7ff fffe bl 0 <fun1>
8: 4420 add r0, r4
a: bd10 pop {r4, pc}
x comes in in r0, and we need to preserve that value until after fun1() is called. r0 can be destroyed/modified by fun1(). So in this case they save r4, not r0, and keep x in r4.
clang does this as well
00000000 <fun0>:
0: b5d0 push {r4, r6, r7, lr}
2: af02 add r7, sp, #8
4: 4604 mov r4, r0
6: f7ff fffe bl 0 <fun1>
a: 1900 adds r0, r0, r4
c: bdd0 pop {r4, r6, r7, pc}
Back to your function.
clang, unoptimized also keeps the input variable in memory (stack).
00000000 <addFifteen>:
0: b081 sub sp, #4
2: f88d 0003 strb.w r0, [sp, #3]
6: f89d 0003 ldrb.w r0, [sp, #3]
a: 300f adds r0, #15
c: b2c0 uxtb r0, r0
e: b001 add sp, #4
10: 4770 bx lr
and you can see the same steps, prep the stack, store the input variable. Take the input variable do the math. Prepare the return value. Clean up, return.
Clang/llvm optimized:
00000000 <addFifteen>:
0: 300f adds r0, #15
2: b2c0 uxtb r0, r0
4: 4770 bx lr
Happens to be the same as gnu. Not expected that any two different compilers generate the same code, nor any expectation that any two versions of the same compiler generate the same code.
unoptimized, the input and local variables (none in this case) get a home on the stack. So what you are seeing is the input variable being put in its home on the stack as part of the setup of the function. Then the function itself wants to operate on that variable so, unoptimized, it needs to fetch that value from memory to create an intermediate variable (that in this case did not get a home on the stack) and so on. You see this with volatile variables as well. They will get written to memory then read back then modified then written to memory and read back, etc...
yes I agree, but I have not read the specs. End of the day it is gcc's calling convention or interpretation of some spec they choose to use. They have been doing this (not being aligned 100% of the time) for a long time and it does not fail. For all called functions they are aligned when the functions are called. Interrupts in arm code generated by gcc is not aligned all the time. Been this way since they adopted that spec.
by definition r0-r3, etc are volatile. The callee can modify them at will. The callee only needs to save/preserve them if IT needs them. In both the unoptimized and optimized cases only r0 matters for your function it is the input variable and it is used for the return value. You saw in the function I created that the input variable was preserved for later, even when optimized. But, by definition, the caller assumes these registers are destroyed by called functions, and called functions can destroy the contents of these registers and no need to save them.
As far as inline assembly goes, which is a different assembly language than "real" assembly language. I think you have a ways to go before being ready for that, but maybe not. After decades of constant bare metal work I have found zero real use cases for inline assembly, the cases I see are laziness avoiding allowing real assembly into the make system or ways to avoid writing real assembly language. I see it as a ghee whiz feature that folks use like unions and bitfields.
Within gnu, for arm, you have at least four incompatible assembly languages for arm. The not unified syntax real assembly, the unified syntax real assembly. The assembly language that you see when you use gcc to assemble instead of as and then inline assembly for gcc. Despite claims of compatibility clang arm assembly language is not 100% compatible with gnu assembly language and llvm/clang does not have a separate assembler you feed it to the compiler. Arms various toolchains over the years have completely incompatible assembly language to gnu for arm. This is all expected and normal. Assembly language is specific to the tool not the target.
Before you can get into inline assembly language learn some of the real assembly language. And to be fair perhaps you do, and perhaps quite well, and this question is about the discover of how compilers generate code, and how strange it looks as you find out that it is not some one to one thing (all tools in all cases generate the same output from the same input).
For inline asm, while you can specify registers, depending on what you are doing, you generally want to let the compiler choose the register, most of the work for inline assembly is not the assembly but the language that specific compiler uses to interface it...which is compiler specific, move to another compiler and the expectation is a whole new language to learn. While moving between assemblers is also a whole new language at least the syntax of the instructions themselves tend to be the same and the language differences are in everything else, labels and directives and such. And if lucky and it is a toolchain not just an assembler, you can look at the output of the compiler to start to understand the language and compare it to any documentation you can find. Gnus documentation is pretty bad in this case, so a lot of reverse engineering is needed. At the same time you are more likely to be successful with gnu tools over any other, not because they are better, in many cases they are not, but because of the sheer user base and the common features across targets and over decades of history.
I would get really good at interfacing asm with C by creating mock C functions to see which registers are used, etc. And/or even better, implement it in C, compile it, then hand modify/improve/whatever the output of the compiler (you do not need to be a guru to beat the compiler, to be as consistent, perhaps, but fairly often you can easily see improvements that can be made on the output of gcc, and gcc has been getting worse over the last several versions it is not getting better, as you can see from time to time on this site). Get strong in the asm for this toolchain and target and how the compiler works, and then perhaps learn the gnu inline assembly language.

I'm not sure there is a specific purpose to do it. it is just one solution that the compiler has found to do it.
For example the code:
unsigned int f(unsigned int a)
{
return sqrt(a + 1);
}
compiles with ARM GCC 9 NONE with optimisation level -O0 to:
push {r7, lr}
sub sp, sp, #8
add r7, sp, #0
str r0, [r7, #4]
ldr r3, [r7, #4]
adds r3, r3, #1
mov r0, r3
bl __aeabi_ui2d
mov r2, r0
mov r3, r1
mov r0, r2
mov r1, r3
bl sqrt
...
and in level -O1 to:
push {r3, lr}
adds r0, r0, #1
bl __aeabi_ui2d
bl sqrt
...
As you can see the asm is much easier to understand in -O1: store parameter in R0, add 1, call functions.
The hardware supports non aligned stack during exception. See here
The "caller saved" registers do not necessarily need to be stored on the stack, it's up to the caller to know whether it needs to store them or not.
Here you are mixing (if I understood correctly) C and assembly: so you have to do the compiler job before switching back to C: either you store values in callee saved registers (and then you know by convention that the compiler will store them during function call) or you store them yourself on the stack.

What is 'veneer' that arm linker uses in function call?

I just read https://www.keil.com/support/man/docs/armlink/armlink_pge1406301797482.htm. but can't understand what a veneer is that arm linker inserts between function calls.
In "Procedure Call Standard for the ARM Architecture" document, it says,
5.3.1.1 Use of IP by the linker Both the ARM- and Thumb-state BL instructions are unable to address the full 32-bit address space, so
it may be necessary for the linker to insert a veneer between the
calling routine and the called subroutine. Veneers may also be needed
to support ARM-Thumb inter-working or dynamic linking. Any veneer
inserted must preserve the contents of all registers except IP (r12)
and the condition code flags; a conforming program must assume that a
veneer that alters IP may be inserted at any branch instruction that
is exposed to a relocation that supports inter-working or long
branches. Note R_ARM_CALL, R_ARM_JUMP24, R_ARM_PC24, R_ARM_THM_CALL,
R_ARM_THM_JUMP24 and R_ARM_THM_JUMP19 are examples of the ELF
relocation types with this property. See [AAELF] for full details
Here is what I guess, is it something like this ? : when function A calls function B, and when those two functions are too far apart for the bl command to express, the linker inserts function C between function A and B in such a way function C is close to function B. Now function A uses b instruction to go to function C(copying all the registers between the function call), and function C uses bl instruction(copying all the registers too). Of course the r12 register is used to keep the remaining long jump address bits. Is this what veneer means? (I don't know why arm doesn't explain what veneer is but only what veneer provides..)

It is just a trampoline. Interworking is the easier one to demonstrate, using gnu here, but the implication is that Kiel has a solution as well.
.globl even_more
.type eve_more,%function
even_more:
bx lr
.thumb
.globl more_fun
.thumb_func
more_fun:
bx lr
extern unsigned int more_fun ( unsigned int x );
extern unsigned int even_more ( unsigned int x );
unsigned int fun ( unsigned int a )
{
return(more_fun(a)+even_more(a));
}
Unlinked object:
Disassembly of section .text:
00000000 <fun>:
0: e92d4070 push {r4, r5, r6, lr}
4: e1a05000 mov r5, r0
8: ebfffffe bl 0 <more_fun>
c: e1a04000 mov r4, r0
10: e1a00005 mov r0, r5
14: ebfffffe bl 0 <even_more>
18: e0840000 add r0, r4, r0
1c: e8bd4070 pop {r4, r5, r6, lr}
20: e12fff1e bx lr
Linked binary (yes completely unusable, but demonstrates what the tool does)
Disassembly of section .text:
00001000 <fun>:
1000: e92d4070 push {r4, r5, r6, lr}
1004: e1a05000 mov r5, r0
1008: eb000008 bl 1030 <__more_fun_from_arm>
100c: e1a04000 mov r4, r0
1010: e1a00005 mov r0, r5
1014: eb000002 bl 1024 <even_more>
1018: e0840000 add r0, r4, r0
101c: e8bd4070 pop {r4, r5, r6, lr}
1020: e12fff1e bx lr
00001024 <even_more>:
1024: e12fff1e bx lr
00001028 <more_fun>:
1028: 4770 bx lr
102a: 46c0 nop ; (mov r8, r8)
102c: 0000 movs r0, r0
...
00001030 <__more_fun_from_arm>:
1030: e59fc000 ldr r12, [pc] ; 1038 <__more_fun_from_arm+0x8>
1034: e12fff1c bx r12
1038: 00001029 .word 0x00001029
103c: 00000000 .word 0x00000000
You cannot use bl to switch modes between arm and thumb so the linker has added a trampoline as I call it or have heard it called that you hop on and off to get to the destination. In this case essentially converting the branch part of bl into a bx, the link part they take advantage of just using the bl. You can see this done for thumb to arm or arm to thumb.
The even_more function is in the same mode (ARM) so no need for the trampoline/veneer.
For the distance limit of bl lemme see. Wow, that was easy, and gnu called it a veneer as well:
.globl more_fun
.type more_fun,%function
more_fun:
bx lr
extern unsigned int more_fun ( unsigned int x );
unsigned int fun ( unsigned int a )
{
return(more_fun(a)+1);
}
MEMORY
{
bob : ORIGIN = 0x00000000, LENGTH = 0x1000
ted : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.some : { so.o(.text*) } > bob
.more : { more.o(.text*) } > ted
}
Disassembly of section .some:
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: eb000003 bl 18 <__more_fun_veneer>
8: e8bd4010 pop {r4, lr}
c: e2800001 add r0, r0, #1
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
00000018 <__more_fun_veneer>:
18: e51ff004 ldr pc, [pc, #-4] ; 1c <__more_fun_veneer+0x4>
1c: 20000000 .word 0x20000000
Disassembly of section .more:
20000000 <more_fun>:
20000000: e12fff1e bx lr
Staying in the same mode it did not need the bx.
The alternative is that you replace every bl instruction at compile time with a more complicated solution just in case you need to do a far call. Or since the bl offset/immediate is computed at link time you can, at link time, put the trampoline/veneer in to change modes or cover the distance.
You should be able to repeat this yourself with Kiel tools, all you needed to do was either switch modes on an external function call or exceed the reach of the bl instruction.
Edit
Understand that toolchains vary and even within a toolchain, gcc 3.x.x was the first to support thumb and I do not know that I saw this back then. Note the linker is part of binutils which is as separate development from gcc. You mention "arm linker", well arm has its own toolchain, then they bought Kiel and perhaps replaced Kiel's with their own or not. Then there is gnu and clang/llvm and others. So it is not a case of "arm linker" doing this or that, it is a case of the toolchains linker doing this or that and each toolchain is first free to use whatever calling convention they want there is no mandate that they have to use ARM's recommendations, second they can choose to implement this or not or simply give you a warning and you have to deal with it (likely in assembly language or through function pointers).
ARM does not need to explain it, or let us say, it is clearly explained in the Architectural Reference Manual (look at the bl instruction, the bx instruction look for the words interworking, etc. All quite clearly explained) for a particular architecture. So there is no reason to explain it again. Especially for a generic statement where the reach of bl varies and each architecture has different interworking features, it would be a long set of paragraphs or a short chapter to explain something that is already clearly documented.
Anyone implementing a compiler and linker would be well versed in the instruction set before hand and understand the bl and conditional branch and other limitations of the instruction set. Some instruction sets offer near and far jumps and some of those the assembly language for the near and far may be the same mnemonic so the assembler will often decide if it does not see the label in the same file to implement a far jump/call rather than a near one so that the objects can be linked.
In any case before linking you have to compile and assembly and the toolchain folks will have fully understood the rules of the architecture. ARM is not special here.

This is Raymond Chen's comment :
The veneer has to be close to A because B is too far away. A does a bl
to the veneer, and the veneer sets r12 to the final destination(B) and
does a bx r12. bx can reach the entire address space.
This answers to my question enough, but he doesn't want to write a full answer (maybe for lack of time..) I put it here as an answer and select it. If someone posts a better, more detailed answer, I'll switch to it.

ARM PC value after Reset

I am new to MCU and trying to figure out how arm (Cortex M3-M4) based MCU boots. Because booting is specific to any SOC, I took an example hardware board of STM for case study.
Board: STMicroelectronics – STM32L476 32-bit.
In this board when booting mode is (x0)"Boot from User Flash", board maps 0x0000000 address to flash memory address. On flash memory I have pasted my binary with first 4 bytes pointing to vector table first entry, which is esp. Now if I press reset button ARM documentation says PC value will be set to 0x00000000.
CPU generally executes stream of instructions based on PC -> PC + 1 loop. In this case if I see PC value points to esp, which is not instruction. How does Arm CPU does the logic of not use this instruction address, but do a jump to value store at address 0x00000004?
Or this is the case:
Reset produces a special hardware interrupt and cause PC value to be value at 0x00000004, if this is the case why Arm documentation says it sets PC value to 0x00000000?
Ref: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka3761.html
What values are in ARM registers after a power-on reset? Applies to:
ARM1020/22E, ARM1026EJ-S, ARM1136, ARM720T, ARM7EJ-S, ARM7TDMI,
ARM7TDMI-S, ARM920/922T, ARM926EJ-S, ARM940T, ARM946E-S, ARM966E-S,
ARM9TDMI
Answer Registers R0 - R14 (including banked registers) and SPSR (in
all modes) are undefined after reset.
The Program Counter (PC/R15) will be set to 0x000000, or 0xFFFF0000 if
the core has a VINITHI or CFGHIVECS input which is set high as the
core leaves reset. This input should be set to reflect where the base
of the vector table in your system is located.
The Current Program Status Register (CPSR) will indicate that the ARM
core has started in ARM state, Supervisor mode with both FIQ and IRQ
mask bits set. The condition code flags will be undefined. Please see
the ARM Architecture Manual for a detailed description of the CPSR.

The cortex-m's do not boot the same way the traditional and full sized cores boot. Those at least for the reset as you pointed out fetch from address 0x00000000 (or the alternate if asserted) the first instructions, not really fair to call it the PC value as at this point the PC is somewhat bugus, there are multiple program counters being produced a fake one in r15, one leading the fetching, one doing prefetch, none are really the program counter. anyway, doesnt matter.
The cortex-m as documented in the armv7-m documentation (for the m3 and m4, for the m0 and m0+ see the armv6-m although they so far all boot the same way). These use a vector TABLE not instructions. The CORE reads address 0x00000000 (or an alternate if a strap is asserted) and that 32 bit value gets loaded into the stack pointer register. it reads address 0x00000004 it checks the lsbit (maybe not all cores do) if set then this is a valid thumb address, strips the lsbit off (makes it a zero) and begins to fetch the first instructions for the reset handler at that address so if your flash starts with
0x00000000 : 0x20001000
0x00000004 : 0x00000101
the cortex-m will put 0x20001000 in the stack pointer and fetch the first instructions from address 0x100. Being thumb instructions are 16 bits with thumb2 extensions being two 16 bit portions, its not an x86 the program counter is aligned for the full sized processors with 32 bit instructions it fetches on aligned addresses 0x0000, 0x0004, 0x0008 it doesnt increment pc <= pc + 1; For thumb mode or thumb processors it is pc = pc + 2. But also the fetches are not necessarily single instruction transactions, for the full sized they may fetch 4 or 8 words per transaction, the cortex-ms as documented in the technical reference manuals some are able to be compiled or strapped to 16 bits at a time or 32 bits at a time. So no need to talk about or think about execution loops fetching pc = pc + 1, that doesnt make sense even in an x86 these days.
to be fair arms documentation is generally good, on the better side compared to a number of others, not the best. Unlike the full sized arm exception table, the vector table in the cortex-m documentation was not done as well as it could have been, could have/should have just done something like the full sized but shown they were vectors not instructions. It is in there though in the architectural reference manual for the armv6-m and armv7-m (and I would assume armv8-m as well but have not looked, got some parts last week but boards are not here yet, will know very soon). Cant look for words like reset have to look for interrupt or undefined or hardfault, etc in that manual.
EDIT
unwrap your mind on this notion of how the processor starts fetching, it can be any arbitrary address they add into the design, and then the execution of the instructions determines the next address and next address, etc.
Also understand unlike say x86 or microchip pic or the avrs, etc, the core and the chips are two different companies. Even in those same company designs, but certainly where there is a clear division between the IP with a known bus, the ARM CORE will read address 0x00000004 on the AMBA/AXI/AHB bus, the chip vendor can mirror that address in as many different places as they want, in this case with the stm32 there probably isnt actually anything at 0x00000000 as their documentation implies based on the boot pins they map it either to an internal bootloader, or they map it to the user application at 0x08000000 (or in most stm32's if there is an exception thats fine I have not yet seen it) so when strapped that way and the logic has those addresses mirrored you will see the same 32 bit values at 0x00000000 and 0x08000000, 0x00000004 and 0x08000004 and so on for some limited amount of address space. This is why even though linking for 0x00000000 will work to some extent (till you hit that limit which is probably smaller than the application flash size), you will see most folks link for 0x08000000 and the hardware takes care of the rest, so your table really wants to look like
0x08000000 : 0x20001000
0x08000004 : 0x08000101
for an stm32, at least the dozens I have seen so far.
The processor reads 0x00000000 which is mirrored to the first item in the application flash, finds 0x20001000, it then reads 0x00000004 which is mirroed to the second word in the application flash and gets 0x08000101 which causes a fetch from 0x08000100 and now we are executing from the proper fully mapped application flash address space. so long as you dont change the mirroring, which I dont know if you can on an stm32 (nxp chips you can and I dont know about ti or other brands off hand). Some of the cortex-m cores the VTOR register is there and changable (others it is fixed at 0x00000000 and you cant change it), you do not need to change it to 0x08000000 for an stm32, at least all the ones I know about. its only if you are actively changing the mirroring of the zero address space yourself if possible or if you say have your own bootloader and maybe YOUR application space is 0x08004000 and that application wants a vector table of its own. then you either use VTOR or you build the bootloaders vector table such that it runs code that reads the vectors at 0x08004000 and branches to those. The NXP and others in the past certainly with the ARMV7TDMI cores, would let you change the mirroring of address zero because those older cores didnt have a programmable vector table offset register, helping you solve that problem in their chip designs. Newer ARM cores with a VTOR eliminate that need and over time the chip vendors might not bother anymore if they do at all...
EDIT
I dont know if you have the discovery board or the nucleo, I assume the latter as the former is not available (wish I knew about that one would like to have one. And/or I already have one and its buried in a drawer and I never got to it).
so here is a somewhat minimal program you can try on your stm32
.cpu cortex-m0
.thumb
.globl _start
_start:
.word 0x20000400
.word reset
.word loop
.word loop
.thumb_func
loop: b loop
.thumb_func
reset:
ldr r0,=0x20000000
mov r2,sp
str r2,[r0]
add r0,r0,#4
mov r2,pc
str r2,[r0]
add r0,r0,#4
mov r1,#0
top:
str r1,[r0]
add r1,r1,#1
b top
build
arm-none-eabi-as so.s -o so.o
arm-none-eabi-ld -Ttext=0x08000000 so.o -o so.elf
arm-none-eabi-objdump -D so.elf > so.list
arm-none-eabi-objcopy so.elf -O binary so.bin
this should build with arm-linux-whatever- or other arm-whatever-whatever tools from a binutils from the last 10 years.
The disassembly is important to examine before using the binary, dont want to brick your chip (with an stm32 there is a way to get unbricked)
08000000 <_start>:
8000000: 20000400 andcs r0, r0, r0, lsl #8
8000004: 08000013 stmdaeq r0, {r0, r1, r4}
8000008: 08000011 stmdaeq r0, {r0, r4}
800000c: 08000011 stmdaeq r0, {r0, r4}
08000010 <loop>:
8000010: e7fe b.n 8000010 <loop>
08000012 <reset>:
8000012: 4805 ldr r0, [pc, #20] ; (8000028 <top+0x6>)
8000014: 466a mov r2, sp
8000016: 6002 str r2, [r0, #0]
8000018: 3004 adds r0, #4
800001a: 467a mov r2, pc
800001c: 6002 str r2, [r0, #0]
800001e: 3004 adds r0, #4
8000020: 2100 movs r1, #0
08000022 <top>:
8000022: 6001 str r1, [r0, #0]
8000024: 3101 adds r1, #1
8000026: e7fc b.n 8000022 <top>
8000028: 20000000 andcs r0, r0, r0
the disassembler doesnt know that the vector table is not instructions so you can ignore those.
08000000 <_start>:
8000000: 20000400
8000004: 08000013
8000008: 08000011
800000c: 08000011
08000010 <loop>:
8000010: e7fe b.n 8000010 <loop>
08000012 <reset>:
Does it start the vector table at 0x08000000, check. Our stack pointer init value is at 0x00000000, yes, the reset vector we had the tools place for us. thumb_func tells them the following label is an address for some code/function/procedure/whatever_not_data so they orr the one on there for us. our reset handler is at address 0x08000012 so we want to see 0x08000013 in the vector table, check. I tossed in a couple more for demonstration purposes, sent them to an infinite loop at address 0x08000010 so the vector table should have 0x08000011, check.
So assuming you have a nucleo board not the discovery then you can copy the so.bin file to the thumb drive that shows up when you plug it in.
If you use openocd to connect through the stlink interface into the board now you can see that it was running (details left to the reader to figure out)
Open On-Chip Debugger
> halt
stm32f0x.cpu: target state: halted
target halted due to debug-request, current mode: Thread
xPSR: 0x01000000 pc: 0x08000022 msp: 0x20000400
> mdw 0x20000000 20
0x20000000: 20000400 0800001e 0048cd01 200002e7 200002e9 200002eb 200002ed 00000000
0x20000020: 00000000 00000000 00000000 200002f1 200002ef 00000000 200002f3 200002f5
0x20000040: 200002f7 200002f9 200002fb 200002fd
> resume
> halt
stm32f0x.cpu: target state: halted
target halted due to debug-request, current mode: Thread
xPSR: 0x01000000 pc: 0x08000022 msp: 0x20000400
> mdw 0x20000000 20
0x20000000: 20000400 0800001e 005e168c 200002e7 200002e9 200002eb 200002ed 00000000
0x20000020: 00000000 00000000 00000000 200002f1 200002ef 00000000 200002f3 200002f5
0x20000040: 200002f7 200002f9 200002fb 200002fd
so we can see that the stack pointer had 0x20000400 as expected
0x20000000: 20000400 0800001e 0048cd01
the program counter which is not some magical thing, they have to somewhat fake it to make the instruction set work.
800001a: 467a mov r2, pc
as defined in the instruction set the pc value used in this instruction is two instructions ahead of the address of this instruction, so 0x0800001A + 4 = 0x0800001E which is what we see in the memory dump.
And the third item is a counter showing we are running, the resume and halt shows that that count kept going
0x20000000: 20000400 0800001e 005e168
So this demonstrates, the vector table, initializing the stack pointer, the reset vector, where code execution starts, what the value of the pc is at some point in the program, and seeing the program run.
the .cpu cortex-m0 makes it build the most compatible program for the cortex-m family and the mov r0,=0x20000000 was cheating, you posted the same feature in your comment it says I want to load the address of blah into the register a label is just an address and they let you put just an address =_estack is the address of a label =0x20000000 is just a number treated as an address (addresses are just numbers as well, nothing magical about them). I could have done a smaller immediate with a shift or explicitly have done the pc relative load. force of habit in this case.
EDIT2
In attempt for a programmer to understand that the chip is logic, only some percentage of it is software/instruction driven, even within that it is just logic that does more things than the software instruction itself indicates. You want to read from memory your instruction asks the processor to do it but in a real chip there are a number of steps involved to actually perform that, microcoded or not (ARMs are not microcoded) there are state machines that walk through the various steps to perform each of these tasks. grab the values from registers, compute the address, do the memory transaction which is a handful of separate steps, take the return value and place it in the register file.
.thumb
.globl _start
_start:
.word 0x20001000
.word reset
.word loop
.word loop
.thumb_func
loop: b loop
.thumb_func
reset:
ldr r0,loop_counts
loop_top:
sub r0,r0,#1
bne loop_top
b reset
.align
loop_counts: .word 0x1234
00000000 <_start>:
0: 20001000 andcs r1, r0, r0
4: 00000013 andeq r0, r0, r3, lsl r0
8: 00000011 andeq r0, r0, r1, lsl r0
c: 00000011 andeq r0, r0, r1, lsl r0
00000010 <loop>:
10: e7fe b.n 10 <loop>
00000012 <reset>:
12: 4802 ldr r0, [pc, #8] ; (1c <loop_counts>)
00000014 <loop_top>:
14: 3801 subs r0, #1
16: d1fd bne.n 14 <loop_top>
18: e7fb b.n 12 <reset>
1a: 46c0 nop ; (mov r8, r8)
0000001c <loop_counts>:
1c: 00001234 andeq r1, r0, r4, lsr r2
Just barely enough of an instruction set simulator to run that program.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define ROMMASK 0xFFFF
#define RAMMASK 0xFFF
unsigned short rom[ROMMASK+1];
unsigned short ram[RAMMASK+1];
unsigned int reg[16];
unsigned int pc;
unsigned int cpsr;
unsigned int inst;
int main ( void )
{
unsigned int ra;
unsigned int rb;
unsigned int rc;
unsigned int rx;
//just putting something there, a real chip might have an MBIST, might not.
memset(reg,0xBA,sizeof(reg));
memset(ram,0xCA,sizeof(ram));
memset(rom,0xFF,sizeof(rom));
//in a real chip the rom/flash would contain the program and not
//need to do anything to it, this sim needs to have the program
//various ways to have done this...
//00000000 <_start>:
rom[0x00>>1]=0x1000; // 0: 20001000 andcs r1, r0, r0
rom[0x02>>1]=0x2000;
rom[0x04>>1]=0x0013; // 4: 00000013 andeq r0, r0, r3, lsl r0
rom[0x06>>1]=0x0000;
rom[0x08>>1]=0x0011; // 8: 00000011 andeq r0, r0, r1, lsl r0
rom[0x0A>>1]=0x0000;
rom[0x0C>>1]=0x0011; // c: 00000011 andeq r0, r0, r1, lsl r0
rom[0x0E>>1]=0x0000;
//
//00000010 <loop>:
rom[0x10>>1]=0xe7fe; // 10: e7fe b.n 10 <loop>
//
//00000012 <reset>:
rom[0x12>>1]=0x4802; // 12: 4802 ldr r0, [pc, #8] ; (1c <loop_counts>)
//
//00000014 <loop_top>:
rom[0x14>>1]=0x3801; // 14: 3801 subs r0, #1
rom[0x16>>1]=0xd1fd; // 16: d1fd bne.n 14 <loop_top>
rom[0x18>>1]=0xe7fb; // 18: e7fb b.n 12 <reset>
rom[0x1A>>1]=0x46c0; // 1a: 46c0 nop ; (mov r8, r8)
//
//0000001c <loop_counts>:
rom[0x1C>>1]=0x0004; // 1c: 00001234 andeq r1, r0, r4, lsr r2
rom[0x1E>>1]=0x0000;
//reset
//THIS IS NOT SOFTWARE DRIVEN LOGIC, IT IS JUST LOGIC
ra=rom[0x00>>1];
rb=rom[0x02>>1];
reg[14]=(rb<<16)|ra;
ra=rom[0x04>>1];
rb=rom[0x06>>1];
rc=(rb<<16)|ra;
if((rc&1)==0) return(1); //normally run a fault handler here
pc=rc&0xFFFFFFFE;
reg[15]=pc+2;
cpsr=0x000000E0;
//run
//THIS PART BELOW IS SOFTWARE DRIVEN LOGIC
//still you can see that each instruction requires some amount of
//non-software driven logic.
//while(1)
for(rx=0;rx<20;rx++)
{
inst=rom[(pc>>1)&ROMMASK];
printf("0x%08X : 0x%04X\n",pc,inst);
reg[15]=pc+4;
pc+=2;
if((inst&0xF800)==0x4800)
{
//LDR
printf("LDR r%02u,[PC+0x%08X]",(inst>>8)&0x7,(inst&0xFF)<<2);
ra=(inst>>0)&0xFF;
rb=reg[15]&0xFFFFFFFC;
ra=rb+(ra<<2);
printf(" {0x%08X}",ra);
rb=rom[((ra>>1)+0)&ROMMASK];
rc=rom[((ra>>1)+1)&ROMMASK];
ra=(inst>>8)&0x07;
reg[ra]=(rc<<16)|rb;
printf(" {0x%08X}\n",reg[ra]);
continue;
}
if((inst&0xF800)==0x3800)
{
//SUB
ra=(inst>>8)&0x07;
rb=(inst>>0)&0xFF;
printf("SUBS r%u,%u ",ra,rb);
rc=reg[ra];
rc-=rb;
reg[ra]=rc;
printf("{0x%08X}\n",rc);
//do flags
if(rc==0) cpsr|=0x80000000; else cpsr&=(~0x80000000); //N flag
//dont need other flags for this example
continue;
}
if((inst&0xF000)==0xD000) //B conditional
{
if(((inst>>8)&0xF)==0x1) //NE
{
ra=(inst>>0)&0xFF;
if(ra&0x80) ra|=0xFFFFFF00;
rb=reg[15]+(ra<<1);
printf("BNE 0x%08X\n",rb);
if((cpsr&0x80000000)==0)
{
pc=rb;
}
continue;
}
}
if((inst&0xF000)==0xE000) //B
{
ra=(inst>>0)&0x7FF;
if(ra&0x400) ra|=0xFFFFF800;
rb=reg[15]+(ra<<1);
printf("B 0x%08X\n",rb);
pc=rb;
continue;
}
printf("UNDEFINED INSTRUCTION 0x%08X: 0x%04X\n",pc-2,inst);
break;
}
return(0);
}
You are welcome to hate my coding style, this is a brute force thrown together for this question thing. No I dont work for ARM, this can all be pulled from public documents/information. I shortened the loop to 4 counts to see it hit the outer loop
0x00000012 : 0x4802
LDR r00,[PC+0x00000008] {0x0000001C} {0x00000004}
0x00000014 : 0x3801
SUBS r0,1 {0x00000003}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000002}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000001}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000000}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000018 : 0xE7FB
B 0x00000012
0x00000012 : 0x4802
LDR r00,[PC+0x00000008] {0x0000001C} {0x00000004}
0x00000014 : 0x3801
SUBS r0,1 {0x00000003}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000002}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000001}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000000}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000018 : 0xE7FB
B 0x00000012
Perhaps this helps perhaps this makes it worse. Most of the logic is not driven by instructions, each instruction, requires some amount of logic not counting the common logic like instruction fetching and things like that.
If you add more code this simulator will break it ONLY supports these handful of instructions and this loop.

The most important thing to check when you're confused about some behaviour of an Arm processor is probably to check the version of the architecture which applies. You will find a huge amount of very old legacy documentation which relates to ARM7 and ARM9 designs. Whilst not all of this is wrong today, it can be very misleading.
ARM v4, ARM v5, ARM v6: These are legacy designs, rarely even used in derivative products now.
ARM v7A: These are the first of the Cortex series. Cortex-A5 is the entry-level for a linux class device in 2018.
ARM v7M, ARM v6M: These are the common microcontroller devices like your STM32, and already these have over 10 years of history
ARM v8A: These introduce the 64 bit instruction set (T32/A32/A64 in one device), already entry level in the R-pi 3 for example.
ARM v8M: The latest iteration of an microcontroller architecture with more advanced security features, just starting to become available 2018Q2
Specifically, ARMv6M/ARMv7M/ARMv8M provide a very different exception model compared with all of the other ARM architectures (remaining similar within the family), whilst many of the other differences are more incremental or focused on specialised area.

In house bootloader ARM cortex M4 NRF52 chip

I am working on making a bootloader for a side project.
I have read in a hex file, verified the checksum and stored everything in flash with a corresponding address with an offset of 0x4000. I am having issues jumping to my application. I have read, searched and tried alot of different things such as the code here.
http://www.keil.com/support/docs/3913.htm
my current code is this;
int binary_exec(void * Address){
int i;
__disable_irq();
// Disable IRQs
for (i = 0; i < 8; i ++) NVIC->ICER[i] = 0xFFFFFFFF;
// Clear pending IRQs
for (i = 0; i < 8; i ++) NVIC->ICPR[i] = 0xFFFFFFFF;
// -- Modify vector table location
// Barriars
__DSB();
__ISB();
// Change the vector table
SCB->VTOR = ((uint32_t)0x4000 & 0x1ffff80);
// Barriars
__DSB();
__ISB();
__enable_irq();
// -- Load Stack & PC
binExec(Address);
return 0;
}
__asm void binexec(uint32_t *address)
{
mov r1, r0
ldr r0, [r1, #4]
ldr sp, [r1]
blx r0"
}
This just jumps to a random location and does not do anything. I have manually added the address to the PC using keil's register window and it jumps straight to my application but I have not found a way to do it using code. Any ideas? Thank you in advance.
Also the second to last line of the hex file there is the start linear address record:
http://www.keil.com/support/docs/1584.htm
does anyone know what to do with this line?
Thank you,
Eric Micallef

This is what I am talking about can you show us some fragments that look like this, this is an entire application just doesnt do much...
20004000 <_start>:
20004000: 20008000
20004004: 20004049
20004008: 2000404f
2000400c: 2000404f
20004010: 2000404f
20004014: 2000404f
20004018: 2000404f
2000401c: 2000404f
20004020: 2000404f
20004024: 2000404f
20004028: 2000404f
2000402c: 2000404f
20004030: 2000404f
20004034: 2000404f
20004038: 2000404f
2000403c: 20004055
20004040: 2000404f
20004044: 2000404f
20004048 <reset>:
20004048: f000 f806 bl 20004058 <notmain>
2000404c: e7ff b.n 2000404e <hang>
2000404e <hang>:
2000404e: e7fe b.n 2000404e <hang>
20004050 <dummy>:
20004050: 4770 bx lr
...
20004054 <systick_handler>:
20004054: 4770 bx lr
20004056: bf00 nop
20004058 <notmain>:
20004058: b510 push {r4, lr}
2000405a: 2400 movs r4, #0
2000405c: 4620 mov r0, r4
2000405e: 3401 adds r4, #1
20004060: f7ff fff6 bl 20004050 <dummy>
20004064: 2c64 cmp r4, #100 ; 0x64
20004066: d1f9 bne.n 2000405c <notmain+0x4>
20004068: 2000 movs r0, #0
2000406a: bd10 pop {r4, pc}
offset 0x00 is the stack pointer
20004000: 20008000
offset 0x04 is the reset vector or the entry point to this program
20004004: 20004049
I filled in the unused ones so they land in an infinite loop
20004008: 2000404f
and tossed in a different one just to show
2000403c: 20004055
In this case the VTOR would be set to 0x2004000 I would read 0x20004049 from 0x20004004 and then BX to that address.
so my binexec would be fed the address 0x20004000 and I would do something like this
ldr r1,[r0]
mov sp,r1
ldr r2,[r0,#4]
bx r2
If I wanted to fake a reset into that code. a thumb approach with thumb2 I assume you can ldr sp,[r0], I dont hand code thumb2 so dont have those memorized, and there are different thumb2 sets of extensions, as well as different syntax options in gas.
Now if you were not going to support interrupts, or for other reasons (might carry some binary code in your flash that you want to perform better and you copy that from flash to ram then use it in ram) you could download to ram an application that simply has its first instruction at the entry point, no vector table:
20004000 <_start>:
20004000: f000 f804 bl 2000400c <notmain>
20004004: e7ff b.n 20004006 <hang>
20004006 <hang>:
20004006: e7fe b.n 20004006 <hang>
20004008 <dummy>:
20004008: 4770 bx lr
...
2000400c <notmain>:
2000400c: b510 push {r4, lr}
2000400e: 2400 movs r4, #0
20004010: 4620 mov r0, r4
20004012: 3401 adds r4, #1
20004014: f7ff fff8 bl 20004008 <dummy>
20004018: 2c64 cmp r4, #100 ; 0x64
2000401a: d1f9 bne.n 20004010 <notmain+0x4>
2000401c: 2000 movs r0, #0
2000401e: bd10 pop {r4, pc}
In this case it would need to be agreed that the downloaded program is built for 0x20004000, you would download the data to that address, but when you want to run it you would instead do this
.globl binexec
binexec:
bx r0
in C
binexec(0x20004000|1);
or
.globl binexec
binexec:
orr r0,#1
bx r0
just to be safe(r).
In both cases you need to build your binaries right if you want them to run, both have to be linked for the target address, in particular the vector table approach, thus the question, can you show us an example vector table from one of your downloaded, programs, even the first few words might suffice...

Beagleboard Qemu baremetal with UEFI

I am trying to boot a freertos app from UEFI on Qemu
When i run the app from uboot, using the below commands it runs without any errors
fatload mmc 0 80300000 rtosdemo.bin
go 0x80300000
An uefi application loads the elf file at 0x80300000 and then I tried two options.
My boot.s file is below
`start:
_start:
_mainCRTStartup:
ldr r0, .LC6
msr CPSR_c, #MODE_UND|I_BIT|F_BIT /* Undefined Instruction */
mov sp, r0
sub r0, r0, #UND_STACK_SIZE
msr CPSR_c, #MODE_ABT|I_BIT|F_BIT /* Abort Mode */
mov sp, r0
...
`
Disassembly file
`
80300000 <_undf-0x20>:
80300000: ea001424 b 80305098 <start>
80300004: e59ff014 ldr pc, [pc, #20] ; 80300020 <_undf>
80300008: e59ff014 ldr pc, [pc, #20] ; 80300024 <_swi>
8030000c: e59ff014 ldr pc, [pc, #20] ; 80300028 <_pabt>
80300010: e59ff014 ldr pc, [pc, #20] ; 8030002c <_dabt>
...........
80305098 <start>:
80305098: e59f00f4 ldr r0, [pc, #244] ; 80305194 <endless_loop+0x18>
8030509c: e321f0db msr CPSR_c, #219 ; 0xdb
803050a0: e1a0d000 mov sp, r0
803050a4: e2400004 sub r0, r0, #4
`
use goto 0x80305098 which is the entry point addr specified in the elf file. Now it jumps to ldr r0, .. instruction but after that it just seems to be jumping some where in the middle of some function rather than stepping into msr instruction.
Since in uboot its jumping to 0x80300000, I tried by jumping to that addr, now it goes to instruction b 80305098 <start>, but after that instruction instead of jumping to 80305098 it just goes to the next instruction ldr pc, [pc, #20].
So any ideas on where I am going wrong?
EDIT:
I updated boot.s to
start:
_start:
_mainCRTStartup:
.thumb
thumb_entry_point:
blx arm_entry_point
.arm
arm_entry_point:
ldr r0, .LC6
msr CPSR_c, #MODE_UND|I_BIT|F_BIT /* Undefined Instruction Mode */
mov sp, r0
Now it works fine.

This is ARM code, but it sounds very much like it's being jumped to in Thumb state. The word e59f00f4 will be interpreted in Thumb as lsls r4, r6, #3; b 0x80304bde (if I've got my address maths right), which seems consistent with "jumping somewhere in the middle of some function". You can verify by checking bit 5 of the CPSR (assuming you're not in user mode) - if it's set, you've come in in Thumb state.
If that is the case, then the 'proper' solution probably involves making the UEFI loader application clever enough to do the right kind of interworking branch, but a quick and easy hack would be to place a shim somewhere just for the initial entry, something like:
.thumb
thumb_entry_point:
blx arm_entry_point
.arm
arm_entry_point:
b start

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight