Strange content when debugging some Armv5 assembly code

Strange content when debugging some Armv5 assembly code - arm

I am trying to learn ARM by debugging a simple piece of ARM assembly.
.global start, stack_top
start:
ldr sp, =stack_top
bl main
b .
The linker script looks like below:
ENTRY(start)
SECTIONS
{
. = 0x10000;
.text : {*(.text)}
.data : {*(.data)}
.bss : {*(.bss)}
. = ALIGN(8);
. = . +0x1000;
stack_top = .;
}
I run this on qemu arm emulator. The binary is loaded at 0x10000. So I put a breakpoint there. As soon as the bp is hit. I checked the pc register. It's value is 0x10000. Then I disassemble the instruction at 0x10000.
I see a strange comment ; 0x1000c <start+12>. What does it mean? Where does it come from?
Breakpoint 1, 0x00010000 in start ()
(gdb) i r pc
pc 0x10000 0x10000 <start>
(gdb) x /i 0x10000
=> 0x10000 <start>: ldr sp, [pc, #4] ; 0x1000c <start+12> <========= HERE
(gdb) x /i 0x10004
0x10004 <start+4>: bl 0x102b0 <main>
Then I continued to debug:
I want to see the effect of the ldr sp, [pc, #4] at 0x10000 on the sp register. So I debug as below.
From the above disassembly, I expected the value of sp to be [pc + 4], which should be the content located at 0x10000 + 4 = 0x10004. But the sp turns out to be 0x11520.
(gdb) i r sp
sp 0x0 0x0
(gdb) si
0x00010004 in start ()
(gdb) x /i $pc
=> 0x10004 <start+4>: bl 0x102b0 <main>
(gdb) i r sp
sp 0x11520 0x11520 <=================== HERE
(gdb) x /x &stack_top
0x11520: 0x00000000
So the 0x11520 value does come from the linker script symbol stack_top. But how is it related to the ldr sp, [pc,#4] instruction at 0x10000?
ADD 1 - 9:29 AM 12/20/2019
Many thanks for the detailed answer by #old_timer.
I was reading the book Embedded and Real-Time Operating Systems by K. C. Wang. I learned about the pipeline thing from this book. Quoted as below:
So, if the pipeline thing is no longer relevant today. What reason makes the pc value 2 ahead of the currently executed instruction?
I just found below thread addressing this issue:
Why does the ARM PC register point to the instruction after the next one to be executed?
Basically, it just another case that people keep making mistakes/flaws/pitfalls for themselves as they advance the technologies.
So back to this question:
In my assembly, it is pc-relative addressing being used.
ARM's PC pointer is 2 ahead of the currently executed instruction. (And deal with that!)

.global start, stack_top
start:
ldr sp, =stack_top
bl main
b .
Assuming arm mode you have three instructions there, the first possible pool for the stack_top value to live is after the .b
_start: ( 0x00000000 )
0x00000000 ldr sp,=stack_top
0x00000004 bl main
0x00000008 b .
0x0000000c stack_top
and from what you have shown this is where the assembler allocated that space.
So at _start + 12 is the location of the stack_top VALUE. The pseudo code ldr sp,=stack_top either gets turned into a mov or a pc relative load. The pc is two ahead for historical reasons which have zero relevance today, some architectures the pc is the current instruction, some it is the address at the next instruction variable length or not, and in the case of arm (aarch32) and thumb it is "two ahead" so 8. So a pc-relative load for an instruction at address 0x00000000 to reach 0x0000000C is 0xC - 8 = 4. so ldr sp,[pc,#4].
Now the CONTENTS at that address is as you asked in the linker script computed by the linker at link time. You put some code in there then padded some stuff didn't show the rest of your code, could have made this a complete example, but either way from your post the linker ended up computing 0x11520.
So reverse engineering your question and comments we see that the binary starts with (once linked)
_start: ( 0x00010000 )
0x00010000 ldr sp,[pc, #4]
0x00010004 bl main
0x00010008 b .
0x0001000c 0x11520
In arm mode, so the first instruction will load the value 0x11520 into the stack pointer as you asked. Nothing strange or wrong here.
The 0x1000C <_start + 12> is simply stating that the address 0x1000C is an offset of 12 away from the nearest label _start. Sometimes that is useful information.
Using the pseudo instruction and not defining a pool the assembler is going to attempt to find a home if you added a nop or some other code
.global start, stack_top
start:
ldr sp, =stack_top
bl main
nop
b .
Then it is likely the assembler would now put that at pc + 8 which after being linked would be 0x10010 and if nothing else changes the stack pointer MIGHT be at the same value or 4 (or more) further along, depends on alignments and padding made by the tool along the way.
The point being the pipe no longer works that way if it ever did in real products so don't think of this as a pipe thing any more than the branch shadow instructions in mips mean anything relevant today (when enabled). For every instruction set that has pc-relative addressing you need to know the rule, is it the address of the instruction (less common), the address of the next instruction (most common) or two ahead, or other. Likewise folks for a while hard-coded in their brain 8 bytes ahead, rather than two ahead, and when they switched to thumb had issues.
Now of course there are the thumb2 extensions which hose thinking about two ahead. I don't off-hand know the aarch64 rule, I would hope it is next instruction and not infected with the two ahead from aarch32. But as with arm (A32) and thumb (T16 and T32) it is easy to find this information in the arm documentation (which as a rule for any architecture you should have handy when writing or analyzing machine/assembly language).

When accessing the pc from an instruction (e.g. ldr or mov), an offset of 8 is added in ARM (A32) mode, and an offset of 4 in Thumb (T32) mode. IIRC this is because of the way function calls worked in old ARM versions. This is documented e.g. in the ARMv7A Architecture Reference Manual in chapter A2.3, on p. A2-45.
The comment ; 0x1000c <start+12> is indeed generated by the disassembler, to indicate the address calculated by PC+4.
Side note: ldr <register>, =<value> is not an actual instruction, but translated by the assembler into 1-2 instructions and optionally a literal value to obtain the desired value in the most efficient way.
If you are interested in that, I wrote a tutorial on learning ARM assembly step-by-step on Cortex-M.

(I think I can explain it now. If I am wrong, please feel free to correct me.)
I tried a slightly different assembly with one more label. Shown as below:
.global start, stack_top, label2 ;<========== HERE I add a new label2
start:
ldr sp, =stack_top // sp = &stack_top, as soon as we have the stack ready, we can call C function
label2:
bl main
b .
The new debug session is like this:
Breakpoint 1, 0x00010000 in start ()
(gdb) i r pc
pc 0x10000 0x10000 <start>
(gdb) x /i $pc <======== (1)
=> 0x10000 <start>: ldr sp, [pc, #4] ; 0x1000c <label2+8> <======= (2)
(gdb) i r sp
sp 0x0 0x0
(gdb) si
0x00010004 in label2 ()
(gdb) x /i $pc
=> 0x10004 <label2>: bl 0x102b0 <main>
(gdb) i r sp
sp 0x11520 0x11520
(gdb) x /x 0x1000c <========== (3)
0x1000c <label2+8>: 0x00011520
(gdb) x /x &stack_top <========== (4)
0x11520: 0x00000000
Though at line (1), I seem to be asking for the pc value, and at line (2) it does gives me a value 0x10000, it is actually NOT the real pc value at that moment.
Because ARM processor has a fetch-decode-execution pipeline. When one instruction is being executed, 2 more instructions ahead are being fetched/decoded.
So pc actually points to the fetched instruction. The currently executed instruction at 0x10000 is actually pc-8 since I am using ARM mode instruction and each instruction takes 4 bytes. So the actual pc value is 0x10008.
So [pc, #4] gives 0x10008 + 4 = 0x1000C which is just what the comment ; 0x1000c <label2+8> says. (This is pc-relative addressing by the way, please read #old_timer's answer for more details about it).
It seems gdb chooses to use the nearest label to represent the address calculation result. So it choose label2. In my original question, it chooses start.
And line (3) and (4) confirm that memory location at 0x1000c does hold the stack_top value.
So to summarize, below 2 things should be noted:
ARM instruction pipeline
GDB convenient display in the form of comment for the address calculation result in an instruction
Last thought...
BTW, I think when I dump the pc value at line (1), it would be much better if the real pc value for the fetched instruction can be displayed, i.e 0x10008. That can avoid much confusion.
More thought...
Please read below thread for why pc is 2 ahead of the currently executed instruction.
Why does the ARM PC register point to the instruction after the next one to be executed?
Though the 3-stage fetch-decode-execute pipeline is no longer relevant (thanks to #old_timer), the calculation in above answer is still mathematically valid. And other parts are valid as well.

Related

Position independent binary for Atmel SAM Cortex-M0+

I am trying to create a position independent binary for a Cortex-M0+ using the ARM GNU toolchain included with Atmel Studio 7 (arm-none-eabi ?). I have looked many places for information on how to do this, but am not successful. This would facilitate creating ping-pong images in low-high Flash memory areas for OTA updates without needing to know or care whether the update was a ping or pong image for that unit.
I have an 8 kB bootloader resident at 0x0000 which I can communicate with over UART and which will jump to 0x6000 (24 kB) after reset if it detects a binary there (i.e. not 0xFFFF erased Flash). This SAM-BA bootloader allows me to dump memory and erase and program Flash with .bin files at a designated address.
In the application project (simple LED blink), doing nothing but adding -section-start=.text=0x6000 to the linker command line results in the LED blink code working after it is programmed at 0x6000 by the bootloader. I see also in the hex file that it starts at 0x6000.
In my attempt to create a position independent binary, I have removed the above linker item, and added the -fPIC flag to the command lines for the compiler, the linker and the assembler. But, I think I still see absolute branch addresses in the disassembly, such as :
28e: d001 beq.n 294
And the result is that the LED blink binary I load at 0x6000 does not execute unless I specifically tell the linker to put it at 0x6000, which defeats the purpose. Note that I do also see what looks like relative branches in other parts of the disassembly :
21c: 4b03 ldr r3, [pc, #12] ; (22c )
21e: 58d3 ldr r3, [r2, r3]
220: 9301 str r3, [sp, #4]
222: 4798 blx r3
The SRAM is always at the same address (0x20000000), I just need to be able to re-position the executable. I have not modified the linker command file, and it does not have section for .got (e.g. (.got) or similar).
Can anyone explain to me the specific changes I need to make to the compiler/assembler/linker flags to create a position independent binary in this setup ? Many thanks in advance.

You need to look more closely at your disassembly. For 0xd001, I get this:
0x00000000: d001 .. BEQ {pc}+0x6 ; 0x6
In your case, the toolchain has tried to be helpful. Clearly, a 16 bit opcode can't encode an absolute address with a 32 bit address space. So you are closer than you think to a solution.

Why do I get the same address every time I build + disassemble a function inside GDB?

Every time when I disassemble a function, why do I always get the same instruction address and constants' address?
For example, after executing the following commands,
gcc -o hello hello.c -ggdb
gdb hello
(gdb) disassemble main
the dump code would be:
When I quit gdb and re-disassemble the main function, I will get the same result as before. The instruction address and even the address of constants are always the same for each disassemble command in gdb. Why is that? Does the compiled file hello contain certain information about the address of each assembly instruction as well as the constants' addresses?

If you made a position-independent executable (e.g. with gcc -fpie -pie, which is the default for gcc in many recent Linux distros), the kernel would randomize the address it mapped your executable at. (Except when running under GDB: GDB disables ASLR by default even for shared libraries, and for PIE executables.)
But you're making a position-dependent executable, which can take advantage of static addresses being link-time constants (by using them as immediates and so on without needing runtime relocation fixups). e.g. you or the compiler can use mov $msg, %edi (like your code) instead of lea msg, %rdi (with -fpie).
Regular (position-dependent) executables have their load-address set in the ELF headers: use readelf -a ./a.out to see the ELF metadata.
A non-PIE executable will load at the same time every time even without running it under GDB, at the address specified in the ELF program headers.
(gcc / ld chooses 0x400000 by default on x86-64-linux-elf; you could change this with a linker script). Relocation information for all the static addresses hard-coded into the code + data is not available, so the loader couldn't fix up the addresses even if it wanted to.
e.g. in a simple executable (with only a text segment, not data or bss) I built with -no-pie (which seems to be the default in your gcc):
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x00000000000000c5 0x00000000000000c5 R E 0x200000
Section to Segment mapping:
Segment Sections...
00 .text
So the ELF headers request that offset 0 in the file be mapped to virtual address 0x0000000000400000. (And the ELF entry point is 0x400080; that's where _start is.) I'm not sure what the relevance of PhysAddr = VirtAddr is; user-space executables don't know and can't easily find out what physical addresses the kernel used for pages of RAM backing their virtual memory, and it can change at any time as pages are swapped in / out.
Note that readelf does line wrapping; note there are two rows of columns headers. The 0x200000 is the Align column for that one LOADed segment.

By default, the GNU toolchain for x86-64 Linux produces position-dependent executables which are mapped at address 0x400000. (position-independent executables will be mapped at 0x55… addresses instead). It is possible to change that by building GCC --enable-default-pie, or by specifying compiler and linker flags.
However, even for a position-independent executable (PIE), the addresses would be constant between GDB runs because GDB disables address space layout randomization by default. GDB does this so that breakpoints at absolute addresses can be re-applied after the program has been started.

There are a variety of executable file formats. Typically, an executable file contains information anout several memory sections or segments. Inside the executable, references to memory addresses may be expressed relative to the beginning of a section. The executable also contains a relocation table. The relocation table is a list of those references, including where each one is in the executable, what section it refers to, and what type of reference it is (what field of an instruction it is used in, etc.).
The loader (software that loads your program into memory) reads the executable and writes the sections to memory. In your case, the loader appears to be using the same base addresses for sections every time it runs. After initially putting the sections in memory, the loader reads the relocation table and uses it to fix up all the references to memory by adjusting them based on where each section was loaded into memory. For example, the compiler may write an instruction as, in effect, “Load register 3 from the start of the data section plus 278 bytes.” If the loader puts the data section at address 2000, it will adjust this instruction to use the sum of 2000 and 278, making “Load register 3 from address 2278.”
Good modern loaders randomize where sections are loaded. They do this because malicious people are sometimes able to exploit bugs in programs to cause them to execute code injected by the attacker. Randomizing section locations prevents the attacker from knowing the address where their code will be injected, which can hinder their ability to prepare the code to be injected. Since your addresses are not changing, it appears your loader does not do this. You may be using an older system.
Some processor architectures and/or loaders support position independent code (PIC). In this case, the form of an instruction may be “Load register 3 from 694 bytes beyond where this instruction is.” In that case, as long as the data is always at the same distance from the instruction, it does not matter where they are in memory. When the process executes the instruction, it will add the address of the instruction to 694, and that will be the address of the data. Another way of implementing PIC-like code is for the loader to provide the addresses of each section to the program, by putting those addresses in registers or fixed locations in memory. Then the program can use those base addresses to do its own address calculations. Since your program has an address built into the code, it does not appear your program is using these methods.

a not intended to be really executed program
bootstrap
.globl _start
_start:
bl one
b .
first c file
extern unsigned int hello;
unsigned int one ( void )
{
return(hello+5);
}
second c file (being extern forces the compiler to compile the first object in a certain way)
unsigned int hello;
linker script
MEMORY
{
ram : ORIGIN = 0x00001000, LENGTH = 0x4000
}
SECTIONS
{
.text : { *(.text*) } > ram
.bss : { *(.bss*) } > ram
}
building position dependent
Disassembly of section .text:
00001000 <_start>:
1000: eb000000 bl 1008 <one>
1004: eafffffe b 1004 <_start+0x4>
00001008 <one>:
1008: e59f3008 ldr r3, [pc, #8] ; 1018 <one+0x10>
100c: e5930000 ldr r0, [r3]
1010: e2800005 add r0, r0, #5
1014: e12fff1e bx lr
1018: 0000101c andeq r1, r0, r12, lsl r0
Disassembly of section .bss:
0000101c <hello>:
101c: 00000000 andeq r0, r0, r0
the key here is at address 0x1018 the compiler had to leave a placeholder for the address to the external item. shown as offset 0x10 below
00000000 <one>:
0: e59f3008 ldr r3, [pc, #8] ; 10 <one+0x10>
4: e5930000 ldr r0, [r3]
8: e2800005 add r0, r0, #5
c: e12fff1e bx lr
10: 00000000 andeq r0, r0, r0
The linker fills this in at link time. You can see in the disassembly above that position dependent it fills in the absolute address of where to find that item. For this code to work the code must be loaded in a way that that item shows up at that address. It has to be loaded at a specific position or address in memory. Position dependent. (loaded at address 0x1000 basically).
If your toolchain supports position independent (gnu does) then this represents a solution.
Disassembly of section .text:
00001000 <_start>:
1000: eb000000 bl 1008 <one>
1004: eafffffe b 1004 <_start+0x4>
00001008 <one>:
1008: e59f3014 ldr r3, [pc, #20] ; 1024 <one+0x1c>
100c: e59f2014 ldr r2, [pc, #20] ; 1028 <one+0x20>
1010: e08f3003 add r3, pc, r3
1014: e7933002 ldr r3, [r3, r2]
1018: e5930000 ldr r0, [r3]
101c: e2800005 add r0, r0, #5
1020: e12fff1e bx lr
1024: 00000014 andeq r0, r0, r4, lsl r0
1028: 00000000 andeq r0, r0, r0
Disassembly of section .got:
0000102c <.got>:
102c: 0000103c andeq r1, r0, r12, lsr r0
Disassembly of section .got.plt:
00001030 <_GLOBAL_OFFSET_TABLE_>:
...
Disassembly of section .bss:
0000103c <hello>:
103c: 00000000 andeq r0, r0, r0
It has a performance hit of course, but instead of the compiler and linker working together by leaving one location, there is now a table, global offset table (for this solution) that is at a known location which is position relative to the code, that contains linker supplied offsets.
The program is not position independent yet, it will certainly not work if you load it anywhere. The loader has to patch up the table/solution based on where it wants to place the items. This is far simpler than having a very long list of each of the locations to patch in the first solution, although that would have been a very possible way to do it. A table in the executable (executables contain more than the program and data they have other items of information as you know if you objdump or readelf an elf file) could contain all of those offsets and the loader could patch those up too.
If your data and bss and other memory sections are fixed relative to .text as I have built here, then a got wasnt necessary the linker could have at link time computed the relative offset to the resource and along with the compiler found the item in an position independent way, and the binary could have been loaded just about anywhere (some minimum alignment may hav been required) and it would work without any patching. With the gnu solution I think you can move the segments relative to each other.
It is incorrect to state that the kernel will or would always randomize your location if built position independent. While possible so long as the toolchain and the loader from the operating system (a completely separate development) work hand in hand, the loader has the opportunity. But that does not in any way mean that every loader does or will. Specific operating systems/distros/versions may have that set as a default yes. If they come across a binary that is position independent (built in a way that loader expects). It is like saying all mechanics on the planet will use a specific brand and type of oil if you show up in their garage with a specific brand of car. A specific mechanic may always use a specific oil brand and type for a specific car, but that doesnt mean all mechanics will or perhaps even can obtain that specific oil brand or type. If that individual business chooses to as a policy then you as a customer can start to form an assumption that that is what you will get (with that assumption then failing when they change their policy).
As far as disassembly you can statically disassemble your project at build time or whenever. If loaded at a different position then there will be an offset to what you are seeing, but the .text code will still be in the same place relative to other code in that segment. If the static disassembly shows a call being 0x104 bytes ahead, then even if loaded somewhere else you should see that relative jump also be 0x104 bytes ahead, the addresses may be different.
Then there is the debugger part of this, for the debugger to work/show the correct information it also has to be part of the toolchain/loader(/os) team for everything to work/look right. It has to know this was position independent and have to know where it was loaded and/or the debugger is doing the loading for you and may not use the standard OS loader in the same way that a command line or gui does. So you might still see the binary in the same place every time when using the debugger.
The main bug here was your expectation. First operating systems like windows, linux, etc desire to use an MMU to allow them to manage memory better. To pick some/many non-linear blocks of physical memory and create a linear area of virtual memory for your program to live, more importantly the virtual address space for each separate program can look the same, I can have every program load at 0x8000 in virtual address space, without interfering with each other, with an MMU designed for this and an operating system that takes advantage of this. Even with this MMU and operating system and position independent loading one would hope they are not using physical addresses, they are still creating a virtual address space, just possibly with different load points for each program or each instance of a program. Expecting all operating systems to do this all the time is an expectation problem. And when using a debugger you are not in a stock environment, the program runs differently, can be loaded differently, etc. It is not the same as running without the debugger, so using a debugger also changes what you should expect to see happen. Two levels of expectation here to deal with.
Use an external component in a very simple program as I made above, see in the disassembly of the object that it has built for position independence as well as in the linking then try Linux as Peter has indicated and see if it loads in a different place each time, if not then you need to be looking at superuser SE or google around about how to use linux (and/or gdb) to get it to change the load location.

Why is SP (apparently) stored on exception entry on Cortex-M3?

I am using a TI LM3S811 (a older Cortex-M3) with the SysTick interrupt to trigger at 10Hz. This is the body of the ISR:
void SysTick_Handler(void)
{
__asm__ volatile("sub r4, r4, #32\r\n");
}
This produces the following assembly with -O0 and -fomit-frame-pointer with gcc-4.9.3. The STKALIGN bit is 0, so stacks are 4-byte aligned.
00000138 <SysTick_Handler>:
138: 4668 mov r0, sp
13a: f020 0107 bic.w r1, r0, #7
13e: 468d mov sp, r1
140: b401 push {r0}
142: f1ad 0420 sub.w r4, r4, #32
146: f85d 0b04 ldr.w r0, [sp], #4
14a: 4685 mov sp, r0
14c: 4770 bx lr
14e: bf00 nop
I don't understand what's going on with r0 in the listing above. Specifically:
1) It seems like we're clearing the lower 3 bits of SP and storing it on the stack. Is that to maintain 8-byte alignment? Or is it something else?
2) Is the exception exit procedure is equally confusing. From my limited understanding of the ARM assembly, it does something like this:
SP = SP + 4; R0 = SP;
Followed by storing it back to SP. Which seems to undo the manipulations until this stage.
3) Why is there a nop instruction after the unconditional branch (at 0x14E)?

The ARM Procedure Calling Standard and C ABI expect an 8 byte (64 bit) alignment of the stack. As an interrupt might occur after pushing/poping a single word, it is not guaranteed the stack is correctly aligned on interrupt entry.
The STKALIGN bit, if set (the default) enforces the hardware to align the stack automatically by conditionally pushing an extra (dummy) word onto the stack.
The interrupt attribute on a function tells gcc, OTOH the stack might be missaligned, so it adds this pre-/postamble which enforces the alignment.
So, both actually do the same; one in hardware, one in software. If you can live with a word-aligned stack only, you should remove the interrupt attribute from the function declarations and clear the STKALIGN bit.
Make sure such a "missaligned" stack is no problem (I would not expect any, as this is a pure 32 bit CPU). OTOH, you should leave it as-is, unless you really need to safe that extra conditional(!) clock and word (very unlikely).
Warning: According to the ARM Architecture Reference Manual, setting STKALIGN == 0 is deprecated. Briefly: do not set this bit to 0!

Since you're using -O0, you should expect lots of redundant and useless code. The general way in which a compiler works is to generate code with the full generality of everything that might be used anywhere in the program, and then rely on the optimizer to get rid of things that are unneeded.
Yes this is doing 8byte alignment. Its also allocating a stack frame to hold local variables even though you have none.
The exit is the reverse, deallocating the stack frame.
The nop at the end is to maintain 4-byte alignment in the code, as you might want to link with non-thumb code at some point.
If you enable optimization, it will eliminate the stack frame (as its unneeded) and the code will become much simpler.

Why Cortex-M requires its first word as initial stack pointer?

Imagine that I am writing an assembly language program which doesn't use the stack at all.
Why does Cortex-M mandates for a stack pointer ? What does the CPU do with this stack pointer even if I don't use it(Does the CPU require the SP to function ? I don't think that SP is mandatory for a CPU to function.) ?
I makes sense for me to place at the first address the first instruction to be executed or an address where the program begins.

The CPU does not technically require the stack pointer to run. It does, however, require the stack pointer in order to properly service interrupts and exceptions. It places certain information on the stack before servicing an interrupt so that system state can be resumed following the interrupt. You could theoretically experience an exception in the first instruction after boot, so the SP needs to be set up before it begins execution.
In addition, are you familiar with the vector table? Generally the first several addresses in a processor are reserved for the vector table anyway. The vector table contains jump addresses which the hardware references when servicing interrupts and exceptions.

You are more than welcome to make your own chips with your own processor, you are also welcome to license ARMs products and perhaps modify them.
The cortex-m vs full sized arm makes a whole lot of sense. arm made the effort in this design to make it such that you dont have to have any assembly language or at least not as much. two things that a bootstrap has to do to get you from first instruction of an event handler (reset is an event) to C functions to handle that event. One is the stack, second is preserve state (save registers that have to be preserved before entering the next layer function on the stack ideally). They do both of these in hardware now. The initial stack pointer is most definitely not required, but that address cannot be repurposed, if you dont want to put a stack address there, then dont, but whatever you put there will get loaded into r13, that is how the hardware works, dont like it find another processor. second unlike the full sized arms the vectors are now vectors and not instructions so they are an address so you cant just put your instructions at address 0x4. Now if you want to get anal about this you can probably put your first instruction at address 0x000 so long as the 32 bit value at address 0x4 is all zeros. so like a full sized arm the first 32 bits has to be a jump in that case.
The more sane thing to do of course is to reserve a little room for the vector table, maybe your assembly actually wants to service some interrupts (well maybe not you need a stack for that since the hardware is going to use the stack pointer to write to whether you like it or not, find another processor if you dont like it). then after this reserved space then put your first instrucitons assembly, machine code, whatever and have the reset vector point to this address just like most other processors families. most have a vector table somewhere not always at address zero, but somewhere and that points to some other place where the code is.
You can do this in as little as 8 bytes in front of your program
.cpu cortex-m4
.thumb
.word 0
.word _start
.thumb_func
.globl _start
_start:
/* code starts here*/
or I am willing to bet this will work too and only cost you 4 bytes not 8. (well technically it costs you 4 or 6 bytes or 8 depending on how you are counting).
.cpu cortex-m4
.thumb
.thumb_func
.globl _start
_start:
b over
.word _start
over:
/* code continues here*/
so long as the branch definitely generates two 16 bit instructions.
Disassembly of section .text:
00000000 <_start>:
0: e001 b.n 6 <over>
2: 00000000 andeq r0, r0, r0
00000006 <over>:
6: e7fe b.n 6 <over>
Nope, didnt work, how about
.cpu cortex-m4
.thumb
.thumb_func
.globl _start
_start:
b over
.align
.word _start
over:
/* code continues here*/
b .
ahh, much better.
00000000 <_start>:
0: e002 b.n 8 <over>
2: bf00 nop
4: 00000001 andeq r0, r0, r1
00000008 <over>:
8: e7fe b.n 8 <over>
a: bf00 nop
If you end up actually wanting to do enough assembly to do something useful then you probably will want the stack and they save you some number of bytes by letting you just load the stack pointer on reset rather have to write code to do it in the reset handler.
assembly means you can do whatever you want, it seems quite obvious here though they were catering to the masses (C) to make their lives just a little bit easier, one stack, most of the time you can now just put the address to your C handler function directly, bootstrap can go away. things you couldnt do with the full sized and prior arm solutions.

C code calling an Assembly routine - ARM

I'm currently working on a bootloader for an ARM Cortex M3.
I have two functions, one in C and one in assembly but when I attempt to call the assembly function my program hangs and generates some sort of fault.
The functions are as follows,
C:
extern void asmJump(void* Address) __attribute__((noreturn));
void load(void* Address)
{
asmJump(Address);
}
Assembly:
.section .text
.global asmJump
asmJump: # Accepts the address of the Vector Table
# as its first parameter (passed in r0)
ldr r2, [r0] # Move the stack pointer addr. to a temp register.
ldr r3, [r0, #4] # Move the reset vector addr. to a temp register.
mov sp, r2 # Set the stack pointer
bx r3 # Jump to the reset vector
And my problem is this:
The code prints "Hello" over serial and then calls load. The code that is loaded prints "Good Bye" and then resets the chip.
If I slowly step through the part where load calls asmJump everything works perfectly. However, when I let the code run my code experiences a 'memory fault'. I know that it is a memory fault because it causes a Hard Fault in some way (the Hard Fault handler's infinite while loop is executing when I pause after 4 or 5 seconds).
Has anyone experienced this issue before? If so, can you please let me know how to resolve it?
As you can see, I've tried to use the function attributes to fix the issue but have not managed to arrive at a solution yet. I'm hoping that someone can help me understand what the problem is in the first place.
Edit:
Thanks #JoeHass for your answer, and #MartinRosenau for your comment, I've since went on to find this SO answer that had a very thorough explanation of why I needed this label. It is a very long read but worth it.

I think you need to tell the assembler to use the unified syntax and explicitly declare your function to be a thumb function. The GNU assembler has directives for that:
.syntax unified
.section .text
.thumb_func
.global asmJump
asmJump:
The .syntax unified directive tells the assembler that you are using the modern syntax for assembly code. I think this is an unfortunate relic of some legacy syntax.
The .thumb_func directive tells the assembler that this function will be executed in thumb mode, so the value that is used for the symbol asmJump has its LSB set to one. When a Cortex-M executes a branch it checks the LSB of the target address to see if it is a one. If it is, then the target code is executed in thumb mode. Since that is the only mode supported by the Cortex-M, it will fault if the LSB of the target address is a zero.

Since you mention you have the debugger working, use it!
Look at the fault status registers to determine the fault source. Maybe it's not asmJump crashing but the code you're invoking.

If that is your all your code.. I suppose your change of SP called the segment error or something like that.
You should save your SP before changing it and restore it after the use of it.
ldr r6, =registerbackup
str sp, [r6]
#your code
...
ldr r6, =registerbackup
ldr sp, [r6]