C code calling an Assembly routine - ARM

I'm currently working on a bootloader for an ARM Cortex M3.
I have two functions, one in C and one in assembly but when I attempt to call the assembly function my program hangs and generates some sort of fault.
The functions are as follows,
C:
extern void asmJump(void* Address) __attribute__((noreturn));
void load(void* Address)
{
asmJump(Address);
}
Assembly:
.section .text
.global asmJump
asmJump: # Accepts the address of the Vector Table
# as its first parameter (passed in r0)
ldr r2, [r0] # Move the stack pointer addr. to a temp register.
ldr r3, [r0, #4] # Move the reset vector addr. to a temp register.
mov sp, r2 # Set the stack pointer
bx r3 # Jump to the reset vector
And my problem is this:
The code prints "Hello" over serial and then calls load. The code that is loaded prints "Good Bye" and then resets the chip.
If I slowly step through the part where load calls asmJump everything works perfectly. However, when I let the code run my code experiences a 'memory fault'. I know that it is a memory fault because it causes a Hard Fault in some way (the Hard Fault handler's infinite while loop is executing when I pause after 4 or 5 seconds).
Has anyone experienced this issue before? If so, can you please let me know how to resolve it?
As you can see, I've tried to use the function attributes to fix the issue but have not managed to arrive at a solution yet. I'm hoping that someone can help me understand what the problem is in the first place.
Edit:
Thanks @JoeHass for your answer, and @MartinRosenau for your comment. I've since gone on to find this SO answer, which has a very thorough explanation of why I needed this label. It is a very long read but worth it.

I think you need to tell the assembler to use the unified syntax and explicitly declare your function to be a thumb function. The GNU assembler has directives for that:
.syntax unified
.section .text
.thumb_func
.global asmJump
asmJump:
The .syntax unified directive tells the assembler that you are using the modern syntax for assembly code. I think this is an unfortunate relic of some legacy syntax.
The .thumb_func directive tells the assembler that this function will be executed in thumb mode, so the value that is used for the symbol asmJump has its LSB set to one. When a Cortex-M executes a branch it checks the LSB of the target address to see if it is a one. If it is, then the target code is executed in thumb mode. Since that is the only mode supported by the Cortex-M, it will fault if the LSB of the target address is a zero.
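To make the LSB rule concrete, here is a minimal host-side sketch in C that models what the core looks at when it branches. The helper names are hypothetical (they are not part of any toolchain or vendor library), and the example address is only an illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* True if a BX/BLX to this value would execute in Thumb state (bit 0 set). */
static bool is_thumb_target(uint32_t target)
{
    return (target & 1u) != 0u;
}

/* Address the core actually fetches from: bit 0 only selects the state. */
static uint32_t fetch_address(uint32_t target)
{
    return target & ~1u;
}

/* What .thumb_func arranges for the symbol value: code address with bit 0 set. */
static uint32_t as_thumb_symbol(uint32_t code_address)
{
    return code_address | 1u;
}
```

With these, as_thumb_symbol(0x08000200) gives 0x08000201, which passes is_thumb_target, while the bare 0x08000200 does not and would fault on a Cortex-M.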

Since you mention you have the debugger working, use it!
Look at the fault status registers to determine the fault source. Maybe it's not asmJump crashing but the code you're invoking.
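As a rough sketch of what "looking at the fault status registers" can mean, the C snippet below decodes a few bits of the Configurable Fault Status Register (CFSR, at 0xE000ED28 on Cortex-M3/M4/M7). The function name is hypothetical, and it takes the register value as a parameter so it can be tried on a host; on the target you would pass the value read in the debugger or from SCB->CFSR.

```c
#include <stdint.h>

/* Selected CFSR bit positions (MemManage: bits 0-7, BusFault: 8-15, UsageFault: 16-31). */
#define CFSR_IACCVIOL    (1u << 0)   /* MPU: instruction access violation */
#define CFSR_DACCVIOL    (1u << 1)   /* MPU: data access violation */
#define CFSR_IBUSERR     (1u << 8)   /* bus error on instruction fetch */
#define CFSR_PRECISERR   (1u << 9)   /* precise data bus error */
#define CFSR_IMPRECISERR (1u << 10)  /* imprecise data bus error */
#define CFSR_UNDEFINSTR  (1u << 16)  /* undefined instruction */
#define CFSR_INVSTATE    (1u << 17)  /* invalid state, e.g. branch with Thumb bit clear */
#define CFSR_INVPC       (1u << 18)  /* invalid PC load on exception return */

/* Hypothetical helper: returns the name of the first recognised fault bit. */
static const char *first_fault_cause(uint32_t cfsr)
{
    if (cfsr & CFSR_INVSTATE)    return "INVSTATE";
    if (cfsr & CFSR_UNDEFINSTR)  return "UNDEFINSTR";
    if (cfsr & CFSR_INVPC)       return "INVPC";
    if (cfsr & CFSR_PRECISERR)   return "PRECISERR";
    if (cfsr & CFSR_IMPRECISERR) return "IMPRECISERR";
    if (cfsr & CFSR_IBUSERR)     return "IBUSERR";
    if (cfsr & CFSR_IACCVIOL)    return "IACCVIOL";
    if (cfsr & CFSR_DACCVIOL)    return "DACCVIOL";
    return "none";
}
```

A CFSR value of 0x00020000 would decode as INVSTATE, which is the kind of fault a branch to an address without the Thumb bit produces.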

If that is all of your code, I suppose your change of SP caused the segmentation error or something like that.
You should save your SP before changing it and restore it after you are done using it.
ldr r6, =registerbackup
str sp, [r6]
#your code
...
ldr r6, =registerbackup
ldr sp, [r6]

STM32 Hardfault when trying to access memory

I am analyzing code written for an STM32H730 microcontroller. The snippet below causes a hard fault when BootHoldRequest(&fnBoot) is called.
#define BOOTBLOCK_ADD 0x08000000L
#define BootHoldRequest (*((BOOTLOAD_PROCEED_TYPE *) (BOOTBLOCK_ADD + 0x200)))
typedef void (* CALLBACK_PTR)(void);
typedef uint16_t BOOTLOAD_PROCEED_TYPE(CALLBACK_PTR *);
typedef void (* VOID_FUN_TYPE)(void);
static VOID_FUN_TYPE fnBoot;
if (BootHoldRequest(&fnBoot)) //<--------- HARDFAULT
{
}
As it is impossible to answer your question without seeing the whole project (including linker scripts, etc.), I will only show how to debug this issue.
What does this code do?
if (BootHoldRequest(&fnBoot))
ldr r0, .L6
ldr r3, .L6+4
bx r3
.L6:
.word .LANCHOR0
.word 0x8000200
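It loads the constant BOOTBLOCK_ADD + 0x200 (0x08000200) from the literal pool into r3 and then branches to the code located at that address. I do not know if you have valid code there, so you need to check it yourself. Note also that bit 0 of the branch target is clear here, and on a Cortex-M a bx to an address with bit 0 clear raises a fault, because the core only executes in Thumb state.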
If you use an IDE (in my example Atollic, which is almost identical to STM32CubeIDE) you can easily check it.
Two methods:
1. Set a breakpoint at this line and use the expression window to see what is at this address.
2. Enter the instruction-stepping debug mode and follow the code one assembly instruction at a time. You will see whether the code does what it is supposed to do.
Note that what I show is not your code; it is code from a project I am working on.

Writing to memory mapped GPIO-registers does not write anything

On my NUCLEO-H7A3ZI-Q, I am trying to make the LED at port PB7 turn on using assembly. According to the STM32H7A3 reference manual, port B is mapped at address 0x58020400 (page 129).
The following code should write the value 0xc0 to the address 0x58020400, pointing into the first byte of GPIOB_MODER, which is rw:
.section .text
reset_handler:
nop
ldr r0, GPIO_ADDR
mov r1, #0xc0
strb r1, [r0]
done:
b done
.align 2
GPIO_ADDR: .word 0x58020400
.section .vectors
.word 0x20001ffe # Initial SP
.word reset_handler # Entrypoint
However, this does not work. Looking at the memory using STM32CubeProgrammer before and after the strb instruction gives the same value 0xFFFFFEBF at 0x58020400 before and after the write instruction.
The value 0xFFFFFEBF is the reset value of GPIOB_MODER, which makes sense. However, all other values in the memory mapped region are also 0xFFFFFEBF, whereas the documentation states the reset value of some other values should not be 0xFFFFFEBF. This might suggest that I have missed some type of initialization step, but I could not find anything in the manual that states something like that should be necessary, but the manual is ~3000 pages, so I might have missed something :)
You need to enable the GPIO peripheral clock first. The RCC registers are used for that.
I would rather discourage you from learning STM32 uCs using assembler. It is the way to nowhere.
Start from the programming manual and the reference manual, where low-level programming of ARM uCs is described: clocks, peripherals, etc.

Strange content when debugging some Armv5 assembly code

I am trying to learn ARM by debugging a simple piece of ARM assembly.
.global start, stack_top
start:
ldr sp, =stack_top
bl main
b .
The linker script looks like below:
ENTRY(start)
SECTIONS
{
. = 0x10000;
.text : {*(.text)}
.data : {*(.data)}
.bss : {*(.bss)}
. = ALIGN(8);
. = . +0x1000;
stack_top = .;
}
I run this on the QEMU ARM emulator. The binary is loaded at 0x10000, so I put a breakpoint there. As soon as the breakpoint is hit, I check the pc register. Its value is 0x10000. Then I disassemble the instruction at 0x10000.
I see a strange comment ; 0x1000c <start+12>. What does it mean? Where does it come from?
Breakpoint 1, 0x00010000 in start ()
(gdb) i r pc
pc 0x10000 0x10000 <start>
(gdb) x /i 0x10000
=> 0x10000 <start>: ldr sp, [pc, #4] ; 0x1000c <start+12> <========= HERE
(gdb) x /i 0x10004
0x10004 <start+4>: bl 0x102b0 <main>
Then I continued to debug:
I want to see the effect of the ldr sp, [pc, #4] at 0x10000 on the sp register. So I debug as below.
From the above disassembly, I expected the value of sp to be [pc + 4], which should be the content located at 0x10000 + 4 = 0x10004. But the sp turns out to be 0x11520.
(gdb) i r sp
sp 0x0 0x0
(gdb) si
0x00010004 in start ()
(gdb) x /i $pc
=> 0x10004 <start+4>: bl 0x102b0 <main>
(gdb) i r sp
sp 0x11520 0x11520 <=================== HERE
(gdb) x /x &stack_top
0x11520: 0x00000000
So the 0x11520 value does come from the linker script symbol stack_top. But how is it related to the ldr sp, [pc,#4] instruction at 0x10000?
ADD 1 - 9:29 AM 12/20/2019
Many thanks for the detailed answer by @old_timer.
I was reading the book Embedded and Real-Time Operating Systems by K. C. Wang. I learned about the pipeline thing from this book. Quoted as below:
So, if the pipeline thing is no longer relevant today, what makes the pc value 2 instructions ahead of the currently executed instruction?
I just found below thread addressing this issue:
Why does the ARM PC register point to the instruction after the next one to be executed?
Basically, it is just another case of people creating mistakes, flaws and pitfalls for themselves as they advance the technology.
So back to this question:
In my assembly, it is pc-relative addressing being used.
ARM's PC pointer is 2 ahead of the currently executed instruction. (And deal with that!)
.global start, stack_top
start:
ldr sp, =stack_top
bl main
b .
Assuming ARM mode, you have three instructions there, so the first possible literal pool location for the stack_top value is after the b .
_start: ( 0x00000000 )
0x00000000 ldr sp,=stack_top
0x00000004 bl main
0x00000008 b .
0x0000000c stack_top
and from what you have shown this is where the assembler allocated that space.
So at _start + 12 is the location of the stack_top VALUE. The pseudo-instruction ldr sp,=stack_top gets turned into either a mov or a pc-relative load. The pc is two instructions ahead for historical reasons which have zero relevance today. In some architectures the pc is the address of the current instruction, in some it is the address of the next instruction (variable length or not), and in the case of arm (aarch32) and thumb it is "two ahead", which in ARM mode means 8 bytes. So a pc-relative load for an instruction at address 0x00000000 to reach 0x0000000C needs an offset of 0xC - 8 = 4, hence ldr sp,[pc,#4].
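The offset arithmetic can be written down as a small host-side C sketch. The function names are hypothetical; the Thumb variant assumes the usual rule for literal loads, where the pc value (instruction address + 4) is word-aligned before the offset is added.

```c
#include <stdint.h>

/* ARM (A32) state: the pc reads as the instruction address + 8. */
static uint32_t arm_pc_relative_target(uint32_t instr_addr, int32_t offset)
{
    return instr_addr + 8u + (uint32_t)offset;
}

/* Thumb state: the pc reads as the instruction address + 4, and a literal
   load aligns that value down to a word boundary before adding the offset. */
static uint32_t thumb_literal_target(uint32_t instr_addr, uint32_t offset)
{
    return ((instr_addr + 4u) & ~3u) + offset;
}
```

For the linked binary in the question, arm_pc_relative_target(0x10000, 4) gives 0x1000c, the address shown in the disassembler comment.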
Now the CONTENTS at that address are, as you asked in the linker script, computed by the linker at link time. You put some code in there, then padded some stuff, and didn't show the rest of your code (this could have been made a complete example), but either way, from your post, the linker ended up computing 0x11520.
So reverse engineering your question and comments we see that the binary starts with (once linked)
_start: ( 0x00010000 )
0x00010000 ldr sp,[pc, #4]
0x00010004 bl main
0x00010008 b .
0x0001000c 0x11520
In arm mode, so the first instruction will load the value 0x11520 into the stack pointer as you asked. Nothing strange or wrong here.
The 0x1000C <_start + 12> is simply stating that the address 0x1000C is an offset of 12 away from the nearest label _start. Sometimes that is useful information.
Using the pseudo-instruction without defining a pool, the assembler is going to attempt to find a home for the value. If you added a nop or some other code:
.global start, stack_top
start:
ldr sp, =stack_top
bl main
nop
b .
Then it is likely the assembler would now use an offset of 8 (ldr sp,[pc,#8]), with the value at an address which after being linked would be 0x10010. If nothing else changes, the stack pointer MIGHT be at the same value or 4 (or more) bytes further along, depending on alignments and padding made by the tool along the way.
The point being the pipe no longer works that way if it ever did in real products so don't think of this as a pipe thing any more than the branch shadow instructions in mips mean anything relevant today (when enabled). For every instruction set that has pc-relative addressing you need to know the rule, is it the address of the instruction (less common), the address of the next instruction (most common) or two ahead, or other. Likewise folks for a while hard-coded in their brain 8 bytes ahead, rather than two ahead, and when they switched to thumb had issues.
Now of course there are the thumb2 extensions which hose thinking about two ahead. I don't off-hand know the aarch64 rule, I would hope it is next instruction and not infected with the two ahead from aarch32. But as with arm (A32) and thumb (T16 and T32) it is easy to find this information in the arm documentation (which as a rule for any architecture you should have handy when writing or analyzing machine/assembly language).
When accessing the pc from an instruction (e.g. ldr or mov), an offset of 8 is added in ARM (A32) mode, and an offset of 4 in Thumb (T32) mode. IIRC this is because of the way function calls worked in old ARM versions. This is documented e.g. in the ARMv7A Architecture Reference Manual in chapter A2.3, on p. A2-45.
The comment ; 0x1000c <start+12> is indeed generated by the disassembler, to indicate the address calculated by PC+4.
Side note: ldr <register>, =<value> is not an actual instruction, but translated by the assembler into 1-2 instructions and optionally a literal value to obtain the desired value in the most efficient way.
If you are interested in that, I wrote a tutorial on learning ARM assembly step-by-step on Cortex-M.
(I think I can explain it now. If I am wrong, please feel free to correct me.)
I tried a slightly different assembly with one more label. Shown as below:
.global start, stack_top, label2 ;<========== HERE I add a new label2
start:
ldr sp, =stack_top // sp = &stack_top, as soon as we have the stack ready, we can call C function
label2:
bl main
b .
The new debug session is like this:
Breakpoint 1, 0x00010000 in start ()
(gdb) i r pc
pc 0x10000 0x10000 <start>
(gdb) x /i $pc <======== (1)
=> 0x10000 <start>: ldr sp, [pc, #4] ; 0x1000c <label2+8> <======= (2)
(gdb) i r sp
sp 0x0 0x0
(gdb) si
0x00010004 in label2 ()
(gdb) x /i $pc
=> 0x10004 <label2>: bl 0x102b0 <main>
(gdb) i r sp
sp 0x11520 0x11520
(gdb) x /x 0x1000c <========== (3)
0x1000c <label2+8>: 0x00011520
(gdb) x /x &stack_top <========== (4)
0x11520: 0x00000000
Though at line (1) I seem to be asking for the pc value, and at line (2) it does give me the value 0x10000, that is actually NOT the pc value the instruction sees at that moment.
The ARM processor has a fetch-decode-execute pipeline. When one instruction is being executed, 2 more instructions ahead are being fetched/decoded.
So pc actually points to the instruction being fetched. The currently executed instruction at 0x10000 is at pc-8, since I am using ARM mode and each instruction takes 4 bytes. So the actual pc value is 0x10008.
So [pc, #4] gives 0x10008 + 4 = 0x1000C, which is just what the comment ; 0x1000c <label2+8> says. (This is pc-relative addressing, by the way; please read @old_timer's answer for more details about it.)
It seems gdb chooses the nearest preceding label to represent the address calculation result, so it chose label2. In my original question, it chose start.
And line (3) and (4) confirm that memory location at 0x1000c does hold the stack_top value.
So to summarize, below 2 things should be noted:
ARM instruction pipeline
GDB convenient display in the form of comment for the address calculation result in an instruction
Last thought...
BTW, I think when I dump the pc value at line (1), it would be much better if the real pc value for the fetched instruction can be displayed, i.e 0x10008. That can avoid much confusion.
More thought...
Please read below thread for why pc is 2 ahead of the currently executed instruction.
Why does the ARM PC register point to the instruction after the next one to be executed?
Though the 3-stage fetch-decode-execute pipeline is no longer relevant (thanks to @old_timer), the calculation in the above answer is still mathematically valid, and the other parts are valid as well.

Sorting ARM Assembly

I am a newbie and have difficulty understanding the ARM memory map.
I have found an example of a simple sorting algorithm:
AREA ARM, CODE, READONLY
CODE32
PRESERVE8
EXPORT __sortc
; r0 = &arr[0]
; r1 = length
__sortc
stmfd sp!, {r2-r9, lr}
sub r1, r1, #1 ; length - 1 adjacent pairs to compare
mov r3, r1 ; inner loop counter reload value (copied to r4 in outer_loop)
mov r9, r1 ; outer loop counter
outer_loop
mov r5, r0
mov r4, r3
inner_loop
ldr r6, [r5], #4
ldr r7, [r5]
cmp r7, r6
; swap without swp
strls r6, [r5]
strls r7, [r5, #-4]
subs r4, r4, #1
bne inner_loop
subs r9, r9, #1
bne outer_loop
ldmfd sp!, {r2-r9, pc} ; restore registers and return (the ^ form is for exception returns)
END
And this assembly should be called this way from C code
#define MAX_ELEMENTS 10
extern void __sortc(int *, int);
int main()
{
int arr[MAX_ELEMENTS] = {5, 4, 1, 3, 2, 12, 55, 64, 77, 10};
__sortc(arr, MAX_ELEMENTS);
return 0;
}
As far as I understand, this code creates an array of integers on the stack and calls the __sortc function, which is implemented in assembly. This function takes the values from the stack, sorts them and puts them back on the stack. Am I right?
I wonder how I can implement this example using only assembly.
For example, by defining an array of integers:
DCD 3, 7, 2, 8, 5, 7, 2, 6
BTW, where are DCD-declared variables stored in memory?
How can I operate on values declared in this way? Please explain how I can implement this using assembly only, without any C code, even without a stack, just with raw data.
I am writing for the ARM7TDMI architecture.
AREA ARM, CODE, READONLY - this marks the start of a code section in the source.
With a similar AREA myData, DATA, READWRITE you can start a section where it's possible to define data like data1 DCD 1,2,3. This will assemble as three words with values 1, 2, 3 in consecutive words of memory, with the label data1 pointing to the first byte of the first word. (some AREA docs from google).
Where these land in physical memory after loading the executable depends on how the executable is linked. The linker uses a script file that helps it decide which AREA to put where and how to create the symbol table for dynamic relocation done by the executable loader. By editing the linker script you can adjust where the code and data land, but normally you don't need to do that.
Also the linker script and assembler directives can affect size of available stack, and where it is mapped in physical memory.
So for your particular platform: google for memory mappings on web and check the linker script (for start just use linker option to produce .map file to see where the code and data are targeted to land).
So you can either declare that array in some data area, then to work with it, you load symbol data1 into register ("load address of data1"), and use that to fetch memory content from that address.
Or you can first put all the numbers into the stack (which is set probably to something reasonable by the OS loader of your executable), and operate in the code with the stack pointer to access the numbers in it.
You can even DCD some values into the CODE area, so those words will end up between the instructions, in memory mapped as read-only by the executable loader. You can read that data, but writing to it will likely cause a crash. And of course you shouldn't execute it as instructions by accident (by forgetting to put a return or branch instruction ahead of the DCD).
"without stack"
Well, this one is tricky. You have to be careful not to use any calls etc. and to have interrupts disabled; basically, avoid anything that needs the stack.
When people code a bootloader, usually they set up some temporary stack ASAP in first few instructions, so they can use basic stack functionality before setting up whole environment properly, or loading OS. A space for that temporary stack is often reserved somewhere in/after the code, or an unused memory space according to defined machine state after reset.
If you are down to the metal, without OS, usually all memory is writeable after reset, so you can then intermix code and data as you wish (just jumping around the data, not executing them by accident), without using AREA definitions.
But you should make up your mind whether you are creating an application in user space of some OS (so you have things like the stack and data areas well defined and you can use them for your convenience), or you are creating boot loader code which has to set it all up for itself. The latter is more difficult, so I would suggest starting in the user land of some OS; having a C wrapper around with clib initialized is often handy too, so you can call things like printf from ASM for convenient output.
"How can I operate with values declared in this way"
It doesn't matter in machine code which way the values were declared. All that matters is that you have the address of the memory and know the structure in which the data are stored there. Then you can work with them in any way you want, using any instruction you want. So the body of that asm example will not change if you allocate the data in ASM; you will just pass the pointer to it as an argument, like the C code does.
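The same point can be shown in C. This is a minimal sketch rather than a translation of the assembly: sort_words and is_sorted are hypothetical names, and the routine is a plain bubble sort over adjacent pairs. It receives only a pointer and a length, so it behaves the same whether the array lives in static storage (comparable to words placed with DCD in a READWRITE area) or on the stack.

```c
#include <stddef.h>

/* Bubble sort: each pass compares every adjacent pair and swaps values that
   are out of order; an array of n words has n - 1 adjacent pairs. */
static void sort_words(int *arr, size_t length)
{
    if (length < 2) {
        return;
    }
    for (size_t pass = 0; pass + 1 < length; ++pass) {
        for (size_t i = 0; i + 1 < length; ++i) {
            if (arr[i + 1] < arr[i]) {
                int tmp = arr[i];
                arr[i] = arr[i + 1];
                arr[i + 1] = tmp;
            }
        }
    }
}

/* Static storage, comparable to data declared with DCD. */
static int static_array[] = {3, 7, 2, 8, 5, 7, 2, 6};

/* Hypothetical helper used to check the result. */
static int is_sorted(const int *arr, size_t length)
{
    for (size_t i = 0; i + 1 < length; ++i) {
        if (arr[i + 1] < arr[i]) {
            return 0;
        }
    }
    return 1;
}
```

Calling sort_words(static_array, 8) and calling it on a local array inside a function go through exactly the same code path; only the address passed in differs.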
edit: some example done blindly without testing, may need further syntax fixing to work for OP (or maybe there's even some bug and it will not work at all, let me know in comments if it did):
AREA myData, DATA, READWRITE
SortArray
DCD 5, 4, 1, 3, 2, 12, 55, 64, 77, 10
SortArrayEnd
AREA ARM, CODE, READONLY
CODE32
PRESERVE8
EXPORT __sortasmarray
__sortasmarray
; "ldr rX, =label" loads any 32-bit address through a literal pool;
; "add rX, pc, #label" is unlikely to assemble as written, and the
; pc-relative ADR form only reaches nearby labels in the same AREA
ldr r0, =SortArray ; address of array
; calculate array size from address of end
; (as I couldn't find now example of thing like "equ $-SortArray")
ldr r1, =SortArrayEnd
sub r1, r1, r0 ; size in bytes
mov r1, r1, lsr #2 ; size in words
; do a direct jump instead of "bl", so __sortc returning
; to lr will actually return to the caller of this
b __sortc
; ... rest of your __sortc assembly without change
You can call it from C code as:
extern void __sortasmarray();
int main()
{
__sortasmarray();
return 0;
}
I used among others this Introducing ARM assembly language to refresh my ARM asm memory, but I'm still worried this may not work as is.
As you can see, I didn't change anything in __sortc, because there's no difference between accessing stack memory and "dcd" memory; it's the same computer memory. Once you have the address of a particular word, you can ldr/str its value with that address. The __sortc routine receives the address of the first word of the array to sort in both cases. From there on it's just memory, without any context about how that memory was defined in the source, allocated, initialized, etc. As long as it's writeable, it's fine for __sortc.
So the only "dcd" related thing from me is loading the array address, and a quick search for ARM examples shows it may be done in several ways. A pc-relative add (add rX, pc, #offset) is the shortest, but it only works within a limited range of the current instruction. There's also the pseudo-instruction ADR rX, label doing the same thing, and maybe switching to another form in case of range problems? For any range it looks like the ldr rX, =label form is used; it is a pseudo-instruction that the assembler turns into a mov or a pc-relative load from a literal pool. Check some tutorials and disassemble the machine code to see how it was assembled.
It's up to you to learn all the ARM assembly peculiarities and how to load addresses of arrays, I don't need ARM ASM at the moment, so I didn't dig into those details.
And there should be some equ way to define the length of the array, instead of calculating it in code from the end address, but I couldn't find any example, and I'm not going to read the full assembler docs to learn about all its directives (in gas I think .equ ArrayLength, (. - SortArray)/4 would work).

ARM GCC generated functions prolog

I noticed that ARM toolchains can generate different function prologs. Specifically, I saw two object files (vmlinux) with completely different function prologs:
The first case looks like:
push {some registers maybe, fp, lr} (lr omitted in leaf functions)
The second case looks like:
push {some registers maybe, fp, sp, lr, pc} (I may have the order wrong)
So, as I see it, the second one additionally pushes pc and sp. I also saw some comments in the crash utility (kdump project) stating that a kernel stack frame should have the format {..., fp, sp, lr, pc}, which confuses me more, because I see that in some cases this is not true.
1.) Am I right that some extra gcc flags are needed for additionally pushing pc and sp in the function prolog? If yes, what are they?
2.) What is this used for? As I understand it, I can unwind the stack with FP and LR only, so why do I need these additional values?
3.) If this has nothing to do with compilation flags, how can I force generation of this extended function prolog, and again, what is its purpose?
Thank you.
1.) Am I right that some extra gcc flags are needed for additionally pushing pc and sp in the function prolog? If yes, what are they?
There are many gcc options that will affect stack frames (-march, -mtune, etc. may affect the instructions used, for instance). In your case, it was -mapcs-frame. Also, -fomit-frame-pointer will remove frames from leaf functions. Several static functions may be merged together into a single generated function, further reducing the number of frames. The APCS can cause slightly slower code but is needed for stack traces.
2.) What is this used for? As I understand it, I can unwind the stack with FP and LR only, so why do I need these additional values?
All registers that are not parameters (r0-r3) need to be saved, as they need to be restored when returning to the caller. The compiler will allocate additional locals on the stack, so sp will almost always change when fp changes. For why the pc is stored, see below.
3.) If this has nothing to do with compilation flags, how can I force generation of this extended function prolog, and again, what is its purpose?
It is compiler flags, as you had guessed.
; Prologue - setup
mov ip, sp ; get a copy of sp.
stmdb sp!, {fp, ip, lr, pc} ; Save the frame on the stack. See Addendum
sub fp, ip, #4 ; Set the new frame pointer.
...
; Epilogue - return
ldm sp, {fp, sp, lr} ; restore stack, frame pointer and old link.
... ; maybe more stuff here.
bx lr ; return.
A typical save is stmdb sp!, {fp, ip, lr, pc} (a push onto the full-descending stack; a plain stm would store upwards) and a typical restore is ldm sp, {fp, sp, lr}. This is correct if you examine the ABI/APCS documents. Note that there is no '!' on the restore to try and fix the stack. The sp is loaded explicitly from the stored ip value.
Also, the saved pc is not used in the epilogue. It is just discarded data on the stack. So why do this? Exception handlers (interrupts, signals or C++ exceptions) and other stack trace mechanisms want to know who saved a frame. A function on ARM always has exactly one prologue (one point of entry). However, there can be multiple exits. In some cases, a return like return function(); may actually turn into a b function in the "maybe more stuff here" part. This is known as a tail call. Also, when a leaf function is called in the middle of a routine and an exception occurs, the handler will see a PC in the range of leaf, but the leaf may have no call frame. By saving the pc, the call frame can be examined when an exception occurs in leaf to know who really saved the stack. Tables of pc versus destructor, etc. may be stored to allow objects to be freed or to figure out how to call a signal handler. The extra pc is just plain nice when tracing a stack, and the operation is almost free due to pipelining.
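To illustrate how a stack tracer can use such frames, here is a host-side sketch in C that walks a chain of simulated frames. It is an assumption-laden model, not kernel or gdb code: the stack is an array of words, frame pointers are word indices into it, and walk_frames is a hypothetical name. The layout follows the prologue above, where fp ends up pointing at the saved pc.

```c
#include <stddef.h>
#include <stdint.h>

/* Simulated frame layout (indices are words, fp points at the saved pc):
     stack[fp]     = saved pc (identifies the function that built the frame)
     stack[fp - 1] = saved lr (return address into the caller)
     stack[fp - 2] = saved sp
     stack[fp - 3] = caller's fp (0 ends the chain in this model)
   Stores the saved pc of each visited frame and returns the frame count. */
static size_t walk_frames(const uint32_t *stack, uint32_t fp,
                          uint32_t *out_pcs, size_t max_frames)
{
    size_t count = 0;
    while (fp >= 3u && count < max_frames) {
        out_pcs[count++] = stack[fp];  /* who saved this frame */
        fp = stack[fp - 3u];           /* follow the link to the caller */
    }
    return count;
}
```

Each saved pc lets the tracer name the function that created the frame even when the current PC belongs to a frameless leaf function; a frame that stored only fp and lr would give return addresses but not that information.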
See also: ARM Link and frame register question for how the compiler uses these registers.
