In house bootloader ARM cortex M4 NRF52 chip - c

I am working on making a bootloader for a side project.
I have read in a hex file, verified the checksum, and stored everything in flash at the corresponding address plus an offset of 0x4000. I am having trouble jumping to my application. I have read, searched, and tried a lot of different things, such as the code here:
http://www.keil.com/support/docs/3913.htm
My current code is this:
int binary_exec(void * Address){
int i;
__disable_irq();
// Disable IRQs
for (i = 0; i < 8; i ++) NVIC->ICER[i] = 0xFFFFFFFF;
// Clear pending IRQs
for (i = 0; i < 8; i ++) NVIC->ICPR[i] = 0xFFFFFFFF;
// -- Modify vector table location
// Barriers
__DSB();
__ISB();
// Change the vector table
SCB->VTOR = ((uint32_t)0x4000 & 0x1ffff80);
// Barriers
__DSB();
__ISB();
__enable_irq();
// -- Load Stack & PC
binexec(Address);
return 0;
}
__asm void binexec(uint32_t *address)
{
mov r1, r0
ldr r0, [r1, #4]
ldr sp, [r1]
blx r0
}
This just jumps to a random location and does not do anything. I have manually written the address into the PC using Keil's register window, and it jumps straight to my application, but I have not found a way to do it in code. Any ideas? Thank you in advance.
Also, the second-to-last line of the hex file is the start linear address record:
http://www.keil.com/support/docs/1584.htm
Does anyone know what to do with this line?
Thank you,
Eric Micallef

This is what I am talking about. Can you show us some fragments that look like this? This is an entire application; it just doesn't do much...
20004000 <_start>:
20004000: 20008000
20004004: 20004049
20004008: 2000404f
2000400c: 2000404f
20004010: 2000404f
20004014: 2000404f
20004018: 2000404f
2000401c: 2000404f
20004020: 2000404f
20004024: 2000404f
20004028: 2000404f
2000402c: 2000404f
20004030: 2000404f
20004034: 2000404f
20004038: 2000404f
2000403c: 20004055
20004040: 2000404f
20004044: 2000404f
20004048 <reset>:
20004048: f000 f806 bl 20004058 <notmain>
2000404c: e7ff b.n 2000404e <hang>
2000404e <hang>:
2000404e: e7fe b.n 2000404e <hang>
20004050 <dummy>:
20004050: 4770 bx lr
...
20004054 <systick_handler>:
20004054: 4770 bx lr
20004056: bf00 nop
20004058 <notmain>:
20004058: b510 push {r4, lr}
2000405a: 2400 movs r4, #0
2000405c: 4620 mov r0, r4
2000405e: 3401 adds r4, #1
20004060: f7ff fff6 bl 20004050 <dummy>
20004064: 2c64 cmp r4, #100 ; 0x64
20004066: d1f9 bne.n 2000405c <notmain+0x4>
20004068: 2000 movs r0, #0
2000406a: bd10 pop {r4, pc}
offset 0x00 is the stack pointer
20004000: 20008000
offset 0x04 is the reset vector or the entry point to this program
20004004: 20004049
I filled in the unused ones so they land in an infinite loop
20004008: 2000404f
and tossed in a different one just to show
2000403c: 20004055
In this case VTOR would be set to 0x20004000; I would read 0x20004049 from 0x20004004 and then BX to that address.
So my binexec would be fed the address 0x20004000, and I would do something like this:
ldr r1,[r0]
mov sp,r1
ldr r2,[r0,#4]
bx r2
That is if I wanted to fake a reset into that code. This is a thumb approach; with thumb2 I assume you can do ldr sp,[r0] directly. I don't hand code thumb2, so I don't have those memorized, and there are different sets of thumb2 extensions as well as different syntax options in gas.
Now if you were not going to support interrupts, or for other reasons (you might carry some binary code in your flash that you want to perform better, so you copy it from flash to ram and run it there), you could download to ram an application that simply has its first instruction at the entry point, with no vector table:
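As a host-side illustration of the two loads above, here is a minimal C sketch that pulls the initial stack pointer and the reset vector out of a downloaded image and checks the Thumb bit. The function names are made up for this example, and it assumes a little-endian image with the vector table at offset 0.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit word from a byte buffer. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Word at offset 0x00: initial stack pointer (0 if the image is too short). */
uint32_t image_initial_sp(const uint8_t *image, size_t len)
{
    return (image != NULL && len >= 4) ? read_le32(image) : 0;
}

/* Word at offset 0x04: reset vector (0 if the image is too short). */
uint32_t image_reset_vector(const uint8_t *image, size_t len)
{
    return (image != NULL && len >= 8) ? read_le32(image + 4) : 0;
}

/* A Cortex-M vector must be odd: the lsbit marks a Thumb address. */
bool is_thumb_address(uint32_t vector)
{
    return (vector & 1u) != 0;
}
```

With the example table above, the first eight bytes give 0x20008000 for the stack pointer and 0x20004049 for the reset vector, which is the reset label at 0x20004048 with the lsbit set.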
20004000 <_start>:
20004000: f000 f804 bl 2000400c <notmain>
20004004: e7ff b.n 20004006 <hang>
20004006 <hang>:
20004006: e7fe b.n 20004006 <hang>
20004008 <dummy>:
20004008: 4770 bx lr
...
2000400c <notmain>:
2000400c: b510 push {r4, lr}
2000400e: 2400 movs r4, #0
20004010: 4620 mov r0, r4
20004012: 3401 adds r4, #1
20004014: f7ff fff8 bl 20004008 <dummy>
20004018: 2c64 cmp r4, #100 ; 0x64
2000401a: d1f9 bne.n 20004010 <notmain+0x4>
2000401c: 2000 movs r0, #0
2000401e: bd10 pop {r4, pc}
In this case it would need to be agreed that the downloaded program is built for 0x20004000. You would download the data to that address, but when you want to run it you would instead do this:
.globl binexec
binexec:
bx r0
in C
binexec(0x20004000|1);
or
.globl binexec
binexec:
orr r0,#1
bx r0
just to be safe(r).
In both cases you need to build your binaries correctly if you want them to run. Both have to be linked for the target address, the vector table approach in particular. Thus the question: can you show us an example vector table from one of your downloaded programs? Even the first few words might suffice...
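On the start linear address record asked about in the question: it is an Intel HEX record of type 05 carrying four data bytes, a big-endian 32-bit address. As far as I understand the Keil note linked above, it names where execution is meant to start, and an odd value indicates Thumb code. A bootloader that enters the application through the vector table can probably just validate the record and otherwise ignore it. A minimal parsing sketch in C follows; the function name and the sample record are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_nibble(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Parse a type 05 record such as ":04000005000040C9EE".
   Returns the 32-bit address, or -1 for any other record type,
   a bad length, a bad digit, or a checksum mismatch. */
int64_t parse_start_linear_address(const char *line)
{
    uint8_t b[9];     /* count, addr hi, addr lo, type, 4 data bytes, checksum */
    uint8_t sum = 0;

    if (line == NULL || line[0] != ':' || strlen(line) < 19)
        return -1;
    for (int i = 0; i < 9; i++) {
        int hi = hex_nibble(line[1 + 2 * i]);
        int lo = hex_nibble(line[2 + 2 * i]);
        if (hi < 0 || lo < 0)
            return -1;
        b[i] = (uint8_t)((hi << 4) | lo);
        sum = (uint8_t)(sum + b[i]);
    }
    /* All bytes including the checksum must sum to 0 modulo 256. */
    if (sum != 0 || b[0] != 4 || b[3] != 5)
        return -1;
    return (int64_t)(((uint32_t)b[4] << 24) | ((uint32_t)b[5] << 16) |
                     ((uint32_t)b[6] << 8)  |  (uint32_t)b[7]);
}
```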

Related

Keil ARMCC int64 comparison for Cortex M3

I noticed that armcc generates this kind of code to compare two int64 values:
0x080001B0 EA840006 EOR r0,r4,r6
0x080001B4 EA850107 EOR r1,r5,r7
0x080001B8 4308 ORRS r0,r0,r1
0x080001BA D101 BNE 0x080001C0
Which can be roughly translated as:
r0 = lower_word_1 ^ lower_word_2
r1 = higher_word_1 ^ higher_word_2
r0 = r1 | r0
jump if r0 is not zero
and something like this, when comparing int64 (int r0,r1) with integral constant (i.e. int, in r3)
0x08000674 4058 EORS r0,r0,r3
0x08000676 4308 ORRS r0,r0,r1
0x08000678 D116 BNE 0x080006A8
with the same idea, just skipping comparing higher words altogether since it just needs to be zero.
but I'm interested - why is it so complicated?
Both cases can be done very straight-forward by comparing lower and higher words and making BNE after both:
for two int64, assuming the same registers
CMP lower words
BNE
CMP higher words
BNE
and for int64 with integral constant:
CMP lower words
BNE
CBNZ if higher word is non-zero
This will take the same number of instructions, each may (or may not, depending on the registers used) be 2 bytes in length.
arm-none-eabi-gcc does something different, but there is no playing around with EORS either.
So why does armcc do this? I can't see any real benefit; both versions require the same number of instructions (each of which may be wide or short, so no real profit there).
The only slight benefit I can see is less branching, which may be somewhat beneficial for a flash prefetch buffer. But since there is no cache or branch prediction, I'm not really buying it.
So my reasoning is that this pattern is simply legacy from the ARM7 architecture, where no CBZ/CBNZ existed and mixing ARM and Thumb instructions was not very easy.
Am I missing something?
P.S. Armcc does this on every optimization level so I presume it is some kind of 'hard-coded' piece
UPD: Sure, there is an execution pipeline that is flushed on every taken branch. However, every solution requires at least one conditional branch that will or will not be taken (depending on the integers being compared), so the pipeline will be flushed anyway with equal probability.
So I can't really see the point in minimizing conditional branches.
Moreover, if the lower and higher words were compared explicitly and the integers were not equal, the branch would be taken sooner.
Avoiding the branch instruction completely is possible with an IT block, but on Cortex-M3 it can only be up to 4 instructions long, so I'm going to ignore this for generality.
The efficiency of generated code is not measured by the number of machine instructions. You need to know the internals of the target machine as well: not only the clocks per instruction but also how the fetch/decode/execute process works.
Every branch instruction on Cortex-M3 devices flushes the pipeline, and the pipeline has to be fed again. If you run from flash memory (which is slow), wait states will also slow this process significantly. The compiler tries to avoid branches as much as possible.
It can be done your way using other instructions:
int foo(int64_t x, int64_t y)
{
return x == y;
}
cmp r1, r3
itte eq
cmpeq r0, r2
moveq r0, #1
movne r0, #0
bx lr
Trust your compiler. The people who write them know their trade :). Until you learn more about the ARM Cortex, you can't judge the compiler in the simple way you do now.
The code from your example is very well optimized and simple. Keil does a very good job.
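For reference, the armcc sequences from the question correspond roughly to the following C, written on the two 32-bit halves. This is only a sketch to show the single test against zero; the function names are made up, and the second function assumes a constant whose upper word is zero, as in the question.

```c
#include <stdbool.h>
#include <stdint.h>

/* int64 == int64: XOR each half, OR the results, test once against zero. */
bool eq64_xor_or(uint32_t lo1, uint32_t hi1, uint32_t lo2, uint32_t hi2)
{
    uint32_t diff_lo = lo1 ^ lo2;     /* EOR  r0,r4,r6 */
    uint32_t diff_hi = hi1 ^ hi2;     /* EOR  r1,r5,r7 */
    return (diff_lo | diff_hi) == 0;  /* ORRS r0,r0,r1 then one conditional branch */
}

/* int64 == 32-bit constant with a zero upper word: the upper half of the
   value must itself be zero, so it is ORed in directly. */
bool eq64_const32(uint32_t lo, uint32_t hi, uint32_t constant)
{
    uint32_t diff_lo = lo ^ constant; /* EORS r0,r0,r3 */
    return (diff_lo | hi) == 0;       /* ORRS r0,r0,r1 then one conditional branch */
}
```

Either way there is exactly one conditional branch per comparison, which is the property the branch-avoidance argument relies on.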
As pointed out the difference is branching vs not branching. If you can avoid branching you want to avoid branching.
While the ARM documentation may be interesting, as with an x86, a full-sized ARM, and many other places, the system plays as much of a role here. High performance cores like the ones from ARM are sensitive to the system implementation. These cortex-m cores are used in microcontrollers, which are quite cost sensitive; while they blow away a PIC, AVR, or msp430 for mips to mhz and mips per dollar, they are still cost sensitive. With newer technology, or perhaps higher cost, you are starting to see flashes that run at the speed of the processor across the full range (no wait states need to be added anywhere in the range of valid clock speeds), but for a long time you saw flash at half the speed of the core even at the slowest core speeds, getting worse as you chose higher core speeds, while sram often matched the core. Either way, flash is a major portion of the cost of the part, and how much there is and how fast it is drives the part price to some extent.
Depending on the core (anything from ARM), the fetch size and therefore the alignment sensitivity varies, so benchmarks can be skewed or manipulated by the alignment of a loop style test and how many fetches it needs (trivial to demonstrate with many cortex-ms). The cortex-ms generally fetch either a halfword or a full word at a time, and for some this is a compile time option for the chip vendor (so you might have two chips with the same core whose performance differs). This can be demonstrated too, just not here unless pushed; I have done that demo too many times at this site now. But we can manage that here in this test.
I do not have a cortex-m3 handy; I would have to dig one out and wire it up if need be. That should not be needed though, as I have a cortex-m4 handy, which is also an armv7-m: a NUCLEO-F411RE.
Test fixture
.thumb_func
.globl HOP
HOP:
bx r2
.balign 0x20
.thumb_func
.globl TEST0
TEST0:
push {r4,r5}
mov r4,#0
mov r5,#0
ldr r2,[r0]
t0:
cmp r4,r5
beq skip
skip:
subs r1,r1,#1
bne t0
ldr r3,[r0]
subs r0,r2,r3
pop {r4,r5}
bx lr
The systick timer generally works just fine for these kinds of tests; there is no need to mess with the debugger's timer, which often just shows the same thing with more work. It is more than enough here.
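The fixture reads the counter before and after and subtracts start minus end because SysTick counts down. The same arithmetic in C might look like the sketch below. The function name is made up, and the mask assumes the reload value is left at the full 24 bits so that a single wrap through zero still gives the right answer.

```c
#include <stdint.h>

#define SYSTICK_MASK 0x00FFFFFFu  /* the SysTick current value is 24 bits wide */

/* SysTick is a down counter, so elapsed ticks are start minus end.
   Masking to 24 bits covers one wrap through zero, assuming the
   reload value is 0x00FFFFFF. */
uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & SYSTICK_MASK;
}
```

The assembly fixture does not mask, which is fine as long as the counter does not wrap during the measured run.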
Called like this with the result printed out in hex
hexstring(TEST0(STK_CVR,0x10000));
hexstring(TEST0(STK_CVR,0x10000));
copy the flash code to ram and execute there
hexstring(HOP(STK_CVR,0x10000,0x20000001));
hexstring(HOP(STK_CVR,0x10000,0x20000001));
Now the stm32's have this cache thing in front of the flash which affects loop based benchmarks like these as well as other benchmarks against these parts, sometimes you cannot get past that and you end up with a bogus benchmark. But not in this case.
To demonstrate fetch effects you want a system delay in fetching, if the fetches are too fast you might not see the fetch effects.
0800002c <t0>:
800002c: 42ac cmp r4, r5
800002e: d1ff bne.n 8000030 <skip>
08000030 <skip>:
00050001 <-- flash time
00050001 <-- flash time
00060004 <-- sram time
00060004 <-- sram time
0800002c <t0>:
800002c: 42ac cmp r4, r5
800002e: d0ff beq.n 8000030 <skip>
08000030 <skip>:
00060001
00060001
00080000
00080000
0800002c <t0>:
800002c: 42ac cmp r4, r5
800002e: bf00 nop
08000030 <skip>:
00050001
00050001
00060000
00060000
So we can see that if the branch is not taken it is the same as a nop. As far as this loop based test goes. So perhaps there is a branch predictor (often a small cache that remembers the last N number of branches and their destinations and can start prefetch a clock or two early). I did not dig into it yet, did not really need to as we can already see that there is a performance cost due to a branch that has to be taken (making your suggested code not equal despite the same number of instructions, this is the same number of instructions but not equal performance).
So the quickest way to remove the loop and avoid the stm32 cache thing is to do something like this in ram
push {r4,r5}
mov r4,#0
mov r5,#0
cmp r4,r5
ldr r2,[r0]
instruction under test repeated many times
ldr r3,[r0]
subs r0,r2,r3
pop {r4,r5}
bx lr
with the instruction under test being a bne to the next, a beq to the next or a nop
// 800002e: d1ff bne.n 8000030 <skip>
00002001
// 800002e: d0ff beq.n 8000030 <skip>
00004000
// 800002e: bf00 nop
00001001
I did not have room for 0x10000 instructions so I used 0x1000, and we can see that there is a hit for both branch types with the one that does branch being more costly.
Note that the loop based benchmark did not show this difference, have to be careful doing benchmarks or judging results. Even the ones I have shown here.
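To see why d1ff and d0ff are branches "to the next" instruction, here is a small host-side C sketch that decodes the target of a 16-bit thumb conditional branch. The function name is made up; it assumes the B<cond> encoding with the 8-bit offset in the low byte, and that the pc reads as the instruction address plus 4 in thumb state.

```c
#include <stdint.h>

/* Target of a 16-bit thumb conditional branch (0xDxxx):
   sign-extend imm8, double it, add to pc = instruction address + 4. */
uint32_t thumb_bcond_target(uint32_t insn_addr, uint16_t insn)
{
    uint32_t imm8 = insn & 0xFFu;
    /* Sign-extend 8 -> 32 bits using unsigned arithmetic only. */
    uint32_t offset = ((imm8 ^ 0x80u) - 0x80u) << 1;
    return insn_addr + 4u + offset;
}
```

For d1ff at 0x800002e the offset is -2, so the target is 0x800002e + 4 - 2 = 0x8000030, the very next instruction.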
I could spend more time tweaking core settings or system settings, but based on experience I think this has already demonstrated the desire not to have cmp, bne, cbnz replace eor, orr, bne. Now to be fair, your other one uses eor.w (thumb2 extensions), which burns more clocks than thumb instructions, so there is another thing to consider (I measured it as well).
Remember for these high performance cores you need to be very sensitive to fetching and fetch alignment, very easy to make a bad benchmark. Not that an x86 is not high performance, but to make the inefficient core run smoother there is a ton of stuff around it to try to keep the core fed, similar to running a semi-truck vs a sports car, the truck can be efficient once up to speed on the highway but city driving, not so much even keeping to the speed limit a Yugo will get across town faster than the semi truck (if it does not break down). Fetch effects, unaligned transfers, etc are difficult to see in an x86, but an ARM somewhat easy, so to get the best performance you want to avoid the easy cycle eaters.
Edit
Note that I jumped to conclusions too early about what GCC produces. Had to work more on trying to craft an equivalent comparison. I started with
unsigned long long fun2 ( unsigned long long a)
{
if(a==0) return(1);
return(0);
}
unsigned long long fun3 ( unsigned long long a)
{
if(a!=0) return(1);
return(0);
}
00000028 <fun2>:
28: 460b mov r3, r1
2a: 2100 movs r1, #0
2c: 4303 orrs r3, r0
2e: bf0c ite eq
30: 2001 moveq r0, #1
32: 4608 movne r0, r1
34: 4770 bx lr
36: bf00 nop
00000038 <fun3>:
38: 460b mov r3, r1
3a: 2100 movs r1, #0
3c: 4303 orrs r3, r0
3e: bf14 ite ne
40: 2001 movne r0, #1
42: 4608 moveq r0, r1
44: 4770 bx lr
46: bf00 nop
Which used an it instruction which is a natural solution here since the if-then-else cases can be a single instruction. Interesting that they chose to use r1 instead of the immediate #0 I wonder if that is a generic optimization, due to complexity with immediates on a fixed length instruction set or perhaps immediates take less space on some architectures. Who knows.
800002e: bf0c ite eq
8000030: bf00 nopeq
8000032: bf00 nopne
00003002
00003002
800002e: bf14 ite ne
8000030: bf00 nopne
8000032: bf00 nopeq
00003002
00003002
Using sram 0x1000 sets of three instructions linearly, so 0x3002 means 1 clock per instruction on average.
Putting a mov in the it block doesn't change performance
ite eq
moveq r0, #1
movne r0, r1
It is still one clock per.
void more_fun ( unsigned int );
unsigned long long fun4 ( unsigned long long a)
{
for(;a!=0;a--)
{
more_fun(5);
}
return(0);
}
48: b538 push {r3, r4, r5, lr}
4a: ea50 0301 orrs.w r3, r0, r1
4e: d00a beq.n 66 <fun4+0x1e>
50: 4604 mov r4, r0
52: 460d mov r5, r1
54: 2005 movs r0, #5
56: f7ff fffe bl 0 <more_fun>
5a: 3c01 subs r4, #1
5c: f165 0500 sbc.w r5, r5, #0
60: ea54 0305 orrs.w r3, r4, r5
64: d1f6 bne.n 54 <fun4+0xc>
66: 2000 movs r0, #0
68: 2100 movs r1, #0
6a: bd38 pop {r3, r4, r5, pc}
This is basically the compare with zero
60: ea54 0305 orrs.w r3, r4, r5
64: d1f6 bne.n 54 <fun4+0xc>
Against another
void more_fun ( unsigned int );
unsigned long long fun4 ( unsigned long long a, unsigned long long b)
{
for(;a!=b;a--)
{
more_fun(5);
}
return(0);
}
00000048 <fun4>:
48: 4299 cmp r1, r3
4a: bf08 it eq
4c: 4290 cmpeq r0, r2
4e: d011 beq.n 74 <fun4+0x2c>
50: b5f8 push {r3, r4, r5, r6, r7, lr}
52: 4604 mov r4, r0
54: 460d mov r5, r1
56: 4617 mov r7, r2
58: 461e mov r6, r3
5a: 2005 movs r0, #5
5c: f7ff fffe bl 0 <more_fun>
60: 3c01 subs r4, #1
62: f165 0500 sbc.w r5, r5, #0
66: 42ae cmp r6, r5
68: bf08 it eq
6a: 42a7 cmpeq r7, r4
6c: d1f5 bne.n 5a <fun4+0x12>
6e: 2000 movs r0, #0
70: 2100 movs r1, #0
72: bdf8 pop {r3, r4, r5, r6, r7, pc}
74: 2000 movs r0, #0
76: 2100 movs r1, #0
78: 4770 bx lr
7a: bf00 nop
And they choose to use an it block here.
66: 42ae cmp r6, r5
68: bf08 it eq
6a: 42a7 cmpeq r7, r4
6c: d1f5 bne.n 5a <fun4+0x12>
It is on par with this for number of instructions.
0x080001B0 EA840006 EOR r0,r4,r6
0x080001B4 EA850107 EOR r1,r5,r7
0x080001B8 4308 ORRS r0,r0,r1
0x080001BA D101 BNE 0x080001C0
But those thumb2 instructions are going to execute longer. So overall I think GCC appears to have made a better sequence, but of course you want to check apples to apples start with the same C code and see what each produced. The gcc one reads easier than the eor/orr stuff, can think less about what it is doing.
8000040: 406c eors r4, r5
00001002
8000042: ea94 0305 eors.w r3, r4, r5
00002001
0x1000 instructions; one is two halfwords (thumb2), the other is one halfword (thumb). The thumb2 one takes two clocks, which does not really surprise me.
0x080001B0 EA840006 EOR r0,r4,r6
0x080001B4 EA850107 EOR r1,r5,r7
0x080001B8 4308 ORRS r0,r0,r1
0x080001BA D101 BNE 0x080001C0
I see six clocks there before adding any other penalties, not four (on this cortex-m4).
Note I made the eors.w aligned and unaligned and it did not change the performance. Still two clocks.

qemu-arm branches to a seemingly abstract instruction

I am trying to build a binary translator for ARM bare-metal compiled code, and I try to verify proper execution flow by comparing it to that of qemu-arm. I use the following command to dump the program flow:
qemu-arm -d in_asm,cpu -singlestep -D a.flow a.out
I noticed something strange, where the program seems to jump to an irrelevant instruction, since 0x000080b4 is not the branch nor the next instruction following 0x000093ec.
0x000093ec: 1afffff9 bne 0x93d8
R00=00000000 R01=00009c44 R02=00000002 R03=00000000
R04=00000001 R05=0001d028 R06=00000002 R07=00000000
R08=00000000 R09=00000000 R10=0001d024 R11=00000000
R12=f6ffed88 R13=f6ffed88 R14=000093e8 R15=000093ec
PSR=20000010 --C- A usr32
R00=00000000 R01=00009c44 R02=00000002 R03=00000000
R04=00000001 R05=0001d028 R06=00000002 R07=00000000
R08=00000000 R09=00000000 R10=0001d024 R11=00000000
R12=f6ffed88 R13=f6ffed88 R14=000093e8 R15=000093d8
PSR=20000010 --C- A usr32
----------------
IN:
0x000080b4: e59f3060 ldr r3, [pc, #96] ; 0x811c
The instruction that actually executes corresponds to the beginning of the <frame_dummy> tag in the disassembly. Can someone explain what actually happens within the emulator, and is this behavior normal in the ARM architecture? The program was compiled with: arm-none-eabi-gcc --specs=rdimon.specs a.c
Here is the same segment of the program flow without the CPU state:
0x0000804c: e59f3018 ldr r3, [pc, #24] ; 0x806c
0x00008050: e3530000 cmp r3, #0 ; 0x0
0x00008054: 01a0f00e moveq pc, lr
----------------
IN: __libc_init_array
0x000093e8: e1560004 cmp r6, r4
0x000093ec: 1afffff9 bne 0x93d8
----------------
IN:
0x000080b4: e59f3060 ldr r3, [pc, #96] ; 0x811c
0x000080b8: e3530000 cmp r3, #0 ; 0x0
0x000080bc: 0a000009 beq 0x80e8
This is the disassembly of this part:
93d4: 0a000005 beq 93f0 <__libc_init_array+0x68>
93d8: e2844001 add r4, r4, #1
93dc: e4953004 ldr r3, [r5], #4
93e0: e1a0e00f mov lr, pc
93e4: e1a0f003 mov pc, r3
93e8: e1560004 cmp r6, r4
93ec: 1afffff9 bne 93d8 <__libc_init_array+0x50>
93f0: e8bd4070 pop {r4, r5, r6, lr}
It is a backward jump to a previously emitted TB (translation block); you don't even have to read that far back:
IN: __libc_init_array
0x000093d8: e2844001 add r4, r4, #1 ; 0x1
0x000093dc: e4953004 ldr r3, [r5], #4
0x000093e0: e1a0e00f mov lr, pc
0x000093e4: e1a0f003 mov pc, r3
----------------
IN: register_fini
0x0000804c: e59f3018 ldr r3, [pc, #24] ; 0x806c
0x00008050: e3530000 cmp r3, #0 ; 0x0
0x00008054: 01a0f00e moveq pc, lr
----------------
IN: __libc_init_array
0x000093e8: e1560004 cmp r6, r4
0x000093ec: 1afffff9 bne 0x93d8
So qemu does not show it again. Notice this loop is iterating over function pointers: the first one points to register_fini and the second one to the magical 0x000080b4 address in question (no symbol for it). When this unnamed function conditionally returns with moveq pc, lr, control is transferred back to __libc_init_array at address 0x000093e8, which then determines that the end of the array has been reached and again just returns to its caller via 0x000093f0.
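The loop being traced is, in spirit, just a walk over an array of function pointers. A simplified C sketch of that shape is below; it is not the actual newlib source, and the two stand-in functions and the demo helper are hypothetical names used only to make the call order visible.

```c
typedef void (*init_fn)(void);

/* Call every function pointer in [begin, end) in order. */
void run_init_array(const init_fn *begin, const init_fn *end)
{
    for (const init_fn *p = begin; p != end; ++p)
        (*p)();
}

/* Hypothetical stand-ins for the two entries seen in the trace. */
static unsigned trace;
static void first_init(void)  { trace = trace * 10u + 1u; }
static void second_init(void) { trace = trace * 10u + 2u; }

/* Runs both stand-ins and returns the recorded call order as digits. */
unsigned demo_call_order(void)
{
    static const init_fn table[] = { first_init, second_init };
    trace = 0;
    run_init_array(table, table + sizeof table / sizeof table[0]);
    return trace;
}
```

Each pass through the loop body is the same code, so once that block has been translated the in_asm log has nothing new to print for it, even though it runs again and makes the next indirect call.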

ARM Disassembly - confused about "LDR r7, [pc, #0x14]"

I am attempting to learn ARM assembly. I decided to disassemble the "read" function, and this is what I get. From the looks of it, it seems to be making a system call (svc #0) using the R7 register as the system call number.
mov ip, r7 # save R7
ldr r7, [pc, #0x14] # get system call number and put it into R7 ??
svc #0 # make system call
mov r7, ip # restore R7
cmn r0, #0x1000
bxls lr
rsb r0, r0, #0 # R0 = 0 - R0 (negate)
b #2976848216
I am a bit confused though on why it is loading the system call number the way it is ("LDR r7, [PC, #0x14]"). Isn't this just doing in C code r7 = *(pc + 0x14)? I looked at other functions that might also use system calls (e.g. kill, wait, etc.) and they use a very similar convention (i.e. LDR R7, [PC, #0x14]).
This is on Android if it helps at all.
Thanks!
mov ip, r7 ## save R7
ldr r7, [pc, #0x14] ## get system call number and put it into R7 ??
svc #0 ## make system call
mov r7, ip ## restore R7
cmn r0, #0x1000 #
bxls lr #
rsb r0, r0, #0 ## R0 = 0 - R0 (negate)
.word 0x1234
.word 0xABCD
You pretty much left out the most important parts, so I had to improvise:
00000000 <.text>:
0: e1a0c007 mov ip, r7
4: e59f7014 ldr r7, [pc, #20] ; 20 <.text+0x20>
8: ef000000 svc 0x00000000
c: e1a0700c mov r7, ip
10: e3700a01 cmn r0, #4096 ; 0x1000
14: 912fff1e bxls lr
18: e2600000 rsb r0, r0, #0
1c: 00001234 andeq r1, r0, r4, lsr r2
20: 0000abcd andeq sl, r0, sp, asr #23
And yes, it is doing what you say it is doing: it is loading some value into r7 before making the system call. Now, what value is it, and why is it using a pc-relative load (likely a constant that won't fit as an immediate, and/or a link-time resolved value rather than a compile-time one)? Are there different values for different system calls, and is r7 a parameter or not? Well, you didn't provide enough information to talk about that. Once you have that information, the answers should be pretty obvious, if any of those is your question.
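One detail that trips people up with r7 = *(pc + 0x14) is what value the pc has when it is read. In ARM state it reads as the address of the instruction plus 8, which is why the disassembler annotates the load at offset 4 with the address 0x20. A small C sketch of the address arithmetic follows; the function names are made up, and the thumb variant is included only for comparison.

```c
#include <stdint.h>

/* ARM state: the pc reads as the instruction's address plus 8. */
uint32_t arm_pc_literal_addr(uint32_t insn_addr, uint32_t imm)
{
    return insn_addr + 8u + imm;
}

/* Thumb state: the pc reads as the address plus 4, and for literal
   loads that value is rounded down to a word boundary first. */
uint32_t thumb_pc_literal_addr(uint32_t insn_addr, uint32_t imm)
{
    return ((insn_addr + 4u) & ~3u) + imm;
}
```

So for the ldr at address 4 with an offset of 0x14, the word loaded sits at 4 + 8 + 0x14 = 0x20, which is the .word 0xABCD in the improvised listing.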

Beagleboard Qemu baremetal with UEFI

I am trying to boot a freertos app from UEFI on Qemu.
When I run the app from uboot using the commands below, it runs without any errors:
fatload mmc 0 80300000 rtosdemo.bin
go 0x80300000
A UEFI application loads the elf file at 0x80300000, and then I tried two options.
My boot.s file is below
start:
_start:
_mainCRTStartup:
ldr r0, .LC6
msr CPSR_c, #MODE_UND|I_BIT|F_BIT /* Undefined Instruction */
mov sp, r0
sub r0, r0, #UND_STACK_SIZE
msr CPSR_c, #MODE_ABT|I_BIT|F_BIT /* Abort Mode */
mov sp, r0
...
Disassembly file
80300000 <_undf-0x20>:
80300000: ea001424 b 80305098 <start>
80300004: e59ff014 ldr pc, [pc, #20] ; 80300020 <_undf>
80300008: e59ff014 ldr pc, [pc, #20] ; 80300024 <_swi>
8030000c: e59ff014 ldr pc, [pc, #20] ; 80300028 <_pabt>
80300010: e59ff014 ldr pc, [pc, #20] ; 8030002c <_dabt>
...........
80305098 <start>:
80305098: e59f00f4 ldr r0, [pc, #244] ; 80305194 <endless_loop+0x18>
8030509c: e321f0db msr CPSR_c, #219 ; 0xdb
803050a0: e1a0d000 mov sp, r0
803050a4: e2400004 sub r0, r0, #4
First, I jumped to 0x80305098, which is the entry point address specified in the elf file. It reaches the ldr r0, .. instruction, but after that it seems to jump somewhere into the middle of some function rather than stepping to the msr instruction.
Second, since uboot jumps to 0x80300000, I tried jumping to that address. Now it reaches the instruction b 80305098 <start>, but instead of jumping to 80305098 it just goes on to the next instruction, ldr pc, [pc, #20].
So any ideas on where I am going wrong?
EDIT:
I updated boot.s to
start:
_start:
_mainCRTStartup:
.thumb
thumb_entry_point:
blx arm_entry_point
.arm
arm_entry_point:
ldr r0, .LC6
msr CPSR_c, #MODE_UND|I_BIT|F_BIT /* Undefined Instruction Mode */
mov sp, r0
Now it works fine.
This is ARM code, but it sounds very much like it's being jumped to in Thumb state. The word e59f00f4 will be interpreted in Thumb as lsls r4, r6, #3; b 0x80304bde (if I've got my address maths right), which seems consistent with "jumping somewhere in the middle of some function". You can verify by checking bit 5 of the CPSR (assuming you're not in user mode) - if it's set, you've come in in Thumb state.
If that is the case, then the 'proper' solution probably involves making the UEFI loader application clever enough to do the right kind of interworking branch, but a quick and easy hack would be to place a shim somewhere just for the initial entry, something like:
.thumb
thumb_entry_point:
blx arm_entry_point
.arm
arm_entry_point:
b start
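For the CPSR check suggested above, here is a minimal C sketch of the bit test, plus the split of an ARM word into the two halfwords a thumb decoder would fetch from little-endian memory. The helper names are made up, and actually reading the CPSR on the target (with mrs, for example) is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPSR_T_BIT (1u << 5)  /* set when the core is in thumb state */

/* True if a captured CPSR value says the core was in thumb state. */
bool cpsr_is_thumb(uint32_t cpsr)
{
    return (cpsr & CPSR_T_BIT) != 0;
}

/* Lower halfword: fetched first when a little-endian ARM word is
   decoded as thumb code. */
uint16_t first_thumb_halfword(uint32_t arm_word)
{
    return (uint16_t)(arm_word & 0xFFFFu);
}

/* Upper halfword: fetched second. */
uint16_t second_thumb_halfword(uint32_t arm_word)
{
    return (uint16_t)(arm_word >> 16);
}
```

For e59f00f4 this gives 0x00f4 followed by 0xe59f, the lsls and the b mentioned above.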

ARM-C Inter-working

I am trying out a simple program for ARM-C inter-working. Here is the code:
#include<stdio.h>
#include<stdlib.h>
int Double(int a);
extern int Start(void);
int main(){
int result=0;
printf("in C main\n");
result=Start();
printf("result=%d\n",result);
return 0;
}
int Double(int a)
{
printf("inside double func_argument_value=%d\n",a);
return (a*2);
}
The assembly file goes as-
.syntax unified
.cpu cortex-m3
.thumb
.align
.global Start
.global Double
.thumb_func
Start:
mov r10,lr
mov r0,#42
bl Double
mov lr,r10
mov r2,r0
mov pc,lr
During debugging on an LPC1769 (Embedded Artists board), I get a hard fault on the statement "result=Start();". I am trying to do ARM-C internetworking here. During execution of that statement (result=Start()), the lr value is 0x0000029F, which is where the faulting instruction is, and the pc value is 0x0000029E.
This is how I got the faulting instruction in r1
__asm("mrs r0,MSP\n"
"isb\n"
"ldr r1,[r0,#24]\n");
Can anybody please explain where I am going wrong? Any solution is appreciated.
Thank you in advance.
I am a beginner in cortex-m3 & am using the NXP LPCXpresso IDE powered by Code_Red.
Here is the disassembly of my code.
IntDefaultHandler:
00000269: push {r7}
0000026b: add r7, sp, #0
0000026d: b.n 0x26c <IntDefaultHandler+4>
0000026f: nop
00000271: mov r3, lr
00000273: mov.w r0, #42 ; 0x2a
00000277: bl 0x2c0 <Double>
0000027b: mov lr, r3
0000027d: mov r2, r0
0000027f: mov pc, lr
main:
00000281: push {r7, lr}
00000283: sub sp, #8
00000285: add r7, sp, #0
00000287: mov.w r3, #0
0000028b: str r3, [r7, #4]
0000028d: movw r3, #11212 ; 0x2bcc
00000291: movt r3, #0
00000295: mov r0, r3
00000297: bl 0xd64 <printf>
0000029b: bl 0x270 <Start>
0000029f: mov r3, r0
000002a1: str r3, [r7, #4]
000002a3: movw r3, #11224 ; 0x2bd8
000002a7: movt r3, #0
000002ab: mov r0, r3
000002ad: ldr r1, [r7, #4]
000002af: bl 0xd64 <printf>
000002b3: mov.w r3, #0
000002b7: mov r0, r3
000002b9: add.w r7, r7, #8
000002bd: mov sp, r7
000002bf: pop {r7, pc}
Double:
000002c0: push {r7, lr}
000002c2: sub sp, #8
000002c4: add r7, sp, #0
000002c6: str r0, [r7, #4]
000002c8: movw r3, #11236 ; 0x2be4
000002cc: movt r3, #0
000002d0: mov r0, r3
000002d2: ldr r1, [r7, #4]
000002d4: bl 0xd64 <printf>
000002d8: ldr r3, [r7, #4]
000002da: mov.w r3, r3, lsl #1
000002de: mov r0, r3
000002e0: add.w r7, r7, #8
000002e4: mov sp, r7
000002e6: pop {r7, pc}
As per your advice Dwelch, I have changed the r10 to r3.
I assume you mean interworking, not internetworking? The LPC1769 is a cortex-m3, which is thumb/thumb2 only; it doesn't support arm instructions, so there is no interworking available on that platform. Nevertheless, playing with the compiler to see what goes on:
Get the compiler to do it for you first, then try it yourself in asm...
start.s
.thumb
.globl _start
_start:
ldr r0,=hello
mov lr,pc
bx r0
hang : b hang
hello.c
extern unsigned int two ( unsigned int );
unsigned int hello ( unsigned int h )
{
return(two(h)+7);
}
two.c
unsigned int two ( unsigned int t )
{
return(t+5);
}
Makefile
hello.list : start.s hello.c two.c
arm-none-eabi-as -mthumb start.s -o start.o
arm-none-eabi-gcc -c -O2 hello.c -o hello.o
arm-none-eabi-gcc -c -O2 -mthumb two.c -o two.o
arm-none-eabi-ld -Ttext=0x1000 start.o hello.o two.o -o hello.elf
arm-none-eabi-objdump -D hello.elf > hello.list
clean :
rm -f *.o
rm -f *.elf
rm -f *.list
produces hello.list
Disassembly of section .text:
00001000 <_start>:
1000: 4801 ldr r0, [pc, #4] ; (1008 <hang+0x2>)
1002: 46fe mov lr, pc
1004: 4700 bx r0
00001006 <hang>:
1006: e7fe b.n 1006 <hang>
1008: 0000100c andeq r1, r0, ip
0000100c <hello>:
100c: e92d4008 push {r3, lr}
1010: eb000004 bl 1028 <__two_from_arm>
1014: e8bd4008 pop {r3, lr}
1018: e2800007 add r0, r0, #7
101c: e12fff1e bx lr
00001020 <two>:
1020: 3005 adds r0, #5
1022: 4770 bx lr
1024: 0000 movs r0, r0
...
00001028 <__two_from_arm>:
1028: e59fc000 ldr ip, [pc] ; 1030 <__two_from_arm+0x8>
102c: e12fff1c bx ip
1030: 00001021 andeq r1, r0, r1, lsr #32
1034: 00000000 andeq r0, r0, r0
hello.o disassembled by itself:
00000000 <hello>:
0: e92d4008 push {r3, lr}
4: ebfffffe bl 0 <two>
8: e8bd4008 pop {r3, lr}
c: e2800007 add r0, r0, #7
10: e12fff1e bx lr
The compiler uses bl, assuming/hoping it will be calling arm from arm. But it wasn't, so what the linker did was put a trampoline in there.
0000100c <hello>:
100c: e92d4008 push {r3, lr}
1010: eb000004 bl 1028 <__two_from_arm>
1014: e8bd4008 pop {r3, lr}
1018: e2800007 add r0, r0, #7
101c: e12fff1e bx lr
00001028 <__two_from_arm>:
1028: e59fc000 ldr ip, [pc] ; 1030 <__two_from_arm+0x8>
102c: e12fff1c bx ip
1030: 00001021 andeq r1, r0, r1, lsr #32
1034: 00000000 andeq r0, r0, r0
The bl to __two_from_arm is an arm mode to arm mode branch link. The address of the destination function (two) with the lsbit set, which tells bx to switch to thumb mode, is loaded into the disposable register ip (r12); then the bx ip happens, switching modes. The branch link had set up the return address in lr, which was an arm mode address (lsbit zero).
00001020 <two>:
1020: 3005 adds r0, #5
1022: 4770 bx lr
1024: 0000 movs r0, r0
The two() function does its thing and returns. Note you have to use bx lr, not mov pc,lr, when interworking. Basically, mov pc,lr is only an okay habit if you are running on an ARMv4 without the T or an ARMv5 without the T. On anything ARMv4T or newer (ARMv5T or newer), use bx lr to return from a function unless you have a special reason not to. (Avoid using pop {pc} as well for the same reason, unless you really need to save that instruction and are not interworking.) Now, being on a cortex-m3, which is thumb+thumb2 only, you can't interwork, so you can use mov pc,lr and pop {pc}, but the code is not portable, and it is not a good habit, as it will bite you when you switch back to arm programming.
So since hello was in arm mode when it used bl which is what set the link register, the bx in two_from_arm does not touch the link register, so when two() returns with a bx lr it is returning to arm mode after the bl __two_from_arm line in the hello() function.
Also note the extra 0x0000 after the thumb function, this was to align the program on a word boundary so that the following arm code was aligned...
to see how the compiler does thumb to arm change two as follows
unsigned int three ( unsigned int );
unsigned int two ( unsigned int t )
{
return(three(t)+5);
}
and put that function in hello.c
extern unsigned int two ( unsigned int );
unsigned int hello ( unsigned int h )
{
return(two(h)+7);
}
unsigned int three ( unsigned int t )
{
return(t+3);
}
and now we get another trampoline
00001028 <two>:
1028: b508 push {r3, lr}
102a: f000 f80b bl 1044 <__three_from_thumb>
102e: 3005 adds r0, #5
1030: bc08 pop {r3}
1032: bc02 pop {r1}
1034: 4708 bx r1
1036: 46c0 nop ; (mov r8, r8)
...
00001044 <__three_from_thumb>:
1044: 4778 bx pc
1046: 46c0 nop ; (mov r8, r8)
1048: eafffff4 b 1020 <three>
104c: 00000000 andeq r0, r0, r0
Now this is a very cool trampoline. The bl to __three_from_thumb is in thumb mode, and the link register is set to return to the two() function with the lsbit set, to indicate a return to thumb mode.
The trampoline starts with a bx pc. The pc reads as two instructions ahead, and internally the pc always has the lsbit clear, so a bx pc will always take you to arm mode if you are not already in arm mode, and in either mode it lands two instructions ahead. Two instructions ahead of the bx pc is an arm instruction that branches (not branch link!) to the three function, completing the trampoline.
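The arithmetic behind that trampoline can be checked on a host with a few lines of C. This is a sketch, not a disassembler: thumb_pc_read and arm_branch_target are hypothetical helper names, and only the arm B/BL word format is handled.

```c
#include <assert.h>
#include <stdint.h>

/* Value a thumb instruction at `addr` reads from pc: its own address plus 4. */
uint32_t thumb_pc_read(uint32_t addr)
{
    return addr + 4u;
}

/* Target of an arm-state B/BL instruction word `insn` located at `addr`. */
uint32_t arm_branch_target(uint32_t addr, uint32_t insn)
{
    /* Bits 27:25 are 0b101 for B and BL. */
    assert((insn & 0x0E000000u) == 0x0A000000u);

    /* Sign-extend the 24-bit immediate, then scale it to a byte offset. */
    int32_t offset = (int32_t)(insn & 0x00FFFFFFu);
    if (offset & 0x00800000)
        offset -= 0x01000000;
    offset *= 4;

    /* In arm state the pc reads as the instruction address plus 8. */
    return addr + 8u + (uint32_t)offset;
}
```

thumb_pc_read(0x1044) gives 0x1048, word aligned with the lsbit clear, so the bx pc lands in arm mode on the eafffff4 word, and arm_branch_target(0x1048, 0xeafffff4) gives 0x1020, the address of three.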
Notice how I wrote the call to hello() in the first place
_start:
ldr r0,=hello
mov lr,pc
bx r0
hang : b hang
That actually won't work, will it? It will get you from arm to thumb but not from thumb to arm. I will leave that as an exercise for the reader.
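For anyone who wants to check their answer to that exercise, here is a rough C model of what mov lr,pc leaves in lr. The function names are made up, and the instruction addresses in the note after the code are hypothetical, since no listing for this version of start.s is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value that "mov lr, pc" leaves in lr for an instruction at `addr`.
 * Reading pc yields addr + 8 in arm state and addr + 4 in thumb state,
 * and a plain mov does not set bit 0. */
uint32_t lr_after_mov_lr_pc(uint32_t addr, bool thumb_caller)
{
    /* arm instructions are word aligned, thumb ones halfword aligned. */
    assert((addr & (thumb_caller ? 1u : 3u)) == 0);
    return addr + (thumb_caller ? 4u : 8u);
}

/* True when a later "bx lr" would resume in the caller's own state. */
bool bx_lr_returns_to_caller_state(uint32_t lr, bool thumb_caller)
{
    bool returns_to_thumb = (lr & 1u) != 0;
    return returns_to_thumb == thumb_caller;
}
```

For an arm caller with mov lr,pc at 0x1004, lr becomes 0x100c with the lsbit clear, so bx lr comes back in arm mode as intended. For a thumb caller with mov lr,pc at 0x1002, lr becomes 0x1006, still with the lsbit clear, so the callee's bx lr would come back in arm mode into thumb code. The return address needs the lsbit set for the thumb case to work.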
If you change start.s to this
.thumb
.globl _start
_start:
bl hello
hang : b hang
the linker takes care of us:
00001000 <_start>:
1000: f000 f820 bl 1044 <__hello_from_thumb>
00001004 <hang>:
1004: e7fe b.n 1004 <hang>
...
00001044 <__hello_from_thumb>:
1044: 4778 bx pc
1046: 46c0 nop ; (mov r8, r8)
1048: eaffffee b 1008 <hello>
I would, and do, always disassemble programs like these to make sure the compiler and linker resolved these issues. Also note that, for example, __hello_from_thumb can be used from any thumb function. If I call hello from several places, some arm, some thumb, and hello was compiled for arm, then the arm calls would call hello directly (if they can reach it) and all the thumb calls would share the same __hello_from_thumb (if they can reach it).
The compiler in these examples was assuming code that stays in the same mode (simple branch link), and the linker added the interworking code.
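When checking a disassembly like this it can help to decode a bl by hand. Below is a sketch in C of the classic two-halfword thumb bl encoding. thumb_bl_target is a made-up name, and the decoder assumes the two bits that thumb2 redefines in the second halfword are both 1, which is the case for every bl in these listings.

```c
#include <assert.h>
#include <stdint.h>

/* Target of a thumb BL made of halfwords `hi` and `lo`, located at `addr`.
 * Assumed encoding: hi = 11110 + offset[22:12], lo = 11111 + offset[11:1]. */
uint32_t thumb_bl_target(uint32_t addr, uint16_t hi, uint16_t lo)
{
    assert((hi & 0xF800u) == 0xF000u);
    assert((lo & 0xF800u) == 0xF800u);

    /* The upper field is a signed 11-bit value. */
    int32_t high = hi & 0x07FF;
    if (high & 0x0400)
        high -= 0x0800;
    int32_t low = lo & 0x07FF;

    /* high supplies offset bits 22:12, low supplies bits 11:1. */
    int32_t offset = high * 4096 + low * 2;

    /* In thumb state the pc reads as the instruction address plus 4. */
    return addr + 4u + (uint32_t)offset;
}
```

thumb_bl_target(0x1000, 0xf000, 0xf820) gives 0x1044, matching the bl to __hello_from_thumb above.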
If you really meant inter-networking and not interworking, then please describe what that is and I will delete this answer.
EDIT:
You were using a register to preserve lr during the call to Double. That will not work; no register will work for that, you need to use memory, and the easiest is the stack. See how the compiler does it:
00001008 <hello>:
1008: e92d4008 push {r3, lr}
100c: eb000009 bl 1038 <__two_from_arm>
1010: e8bd4008 pop {r3, lr}
1014: e2800007 add r0, r0, #7
1018: e12fff1e bx lr
r3 is pushed, likely to align the stack on a 64 bit boundary (makes it faster). The thing to notice is that the link register is preserved on the stack, but the pop does not pop to pc because this is not an ARMv4 build, so a bx is needed to return from the function. Because this is arm mode we can pop to lr and simply bx lr.
For thumb you can only push r0-r7 and lr directly, and pop r0-r7 and pc directly. You don't want to pop to pc because that only works if you are staying in the same mode (thumb or arm). This is fine for a cortex-m, or fine if you know what all of your callers are, but in general it is bad. So:
00001024 <two>:
1024: b508 push {r3, lr}
1026: f000 f811 bl 104c <__three_from_thumb>
102a: 3005 adds r0, #5
102c: bc08 pop {r3}
102e: bc02 pop {r1}
1030: 4708 bx r1
Same deal: r3 is used as a dummy register to keep the stack aligned for performance (I used the default build for gcc 4.8.0, which is likely a platform with a 64 bit axi bus; specifying the architecture might remove that extra register). Because we cannot pop pc, and I assume because r1 and r3 would be out of order in a single pop since r3 was chosen as the dummy (they could have chosen r2 and saved an instruction), there are two pops: one to get rid of the dummy value on the stack and the other to put the return address in a register so that they can bx to it to return.
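The alignment point can be seen with plain arithmetic. A sketch, assuming a full descending stack and 4-byte registers; sp_after_push and is_8byte_aligned are hypothetical names and the starting stack pointer is just an example value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* sp after a push of `nregs` 32-bit registers on a full descending stack. */
uint32_t sp_after_push(uint32_t sp, unsigned nregs)
{
    assert((sp & 3u) == 0); /* the model expects a word-aligned sp */
    return sp - 4u * nregs;
}

/* True when sp sits on a 64 bit (8-byte) boundary. */
bool is_8byte_aligned(uint32_t sp)
{
    return (sp & 7u) == 0;
}
```

Starting from 0x20008000, push {lr} alone leaves sp at 0x20007ffc, which is not on a 64 bit boundary, while push {r3, lr} leaves it at 0x20007ff8, which is.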
Your Start function does not conform to the ABI, and as a result, when you mix it in with such large libraries as a printf call, no doubt you will crash. If you didn't, it was dumb luck. Your assembly listing of main shows that neither r4 nor r10 were used, and assuming main() is not called other than by the bootstrap, that is why you got away with either r4 or r10.
If this really is an LPC1769 then this whole discussion is irrelevant, as it does not support ARM and does not support interworking (interworking = mixing of ARM mode code and thumb mode code). Your problem was unrelated to interworking; you are not interworking (note the pop {pc} at the end of the functions). Your problem was likely related to your assembly code.
EDIT2:
Changing the makefile to specify the cortex-m
00001008 <hello>:
1008: b508 push {r3, lr}
100a: f000 f805 bl 1018 <two>
100e: 3007 adds r0, #7
1010: bd08 pop {r3, pc}
1012: 46c0 nop ; (mov r8, r8)
00001014 <three>:
1014: 3003 adds r0, #3
1016: 4770 bx lr
00001018 <two>:
1018: b508 push {r3, lr}
101a: f7ff fffb bl 1014 <three>
101e: 3005 adds r0, #5
1020: bd08 pop {r3, pc}
1022: 46c0 nop ; (mov r8, r8)
First and foremost, it is all thumb since there is no arm mode on a cortex-m. Second, the bx is not needed for function returns (because there are no arm/thumb mode changes), so pop {pc} will work.
It is curious that the dummy register is still used on a push. I tried an arm7tdmi/armv4t build and it still did that, so there is some other flag to use to get rid of that behavior.
If your desire was to learn how to make an assembly function that you can call from C, you should have just done that. Make a C function that somewhat resembles the framework of the function you want to create in asm:
extern unsigned int Double ( unsigned int );
unsigned int Start ( void )
{
return(Double(42));
}
assemble then disassemble
00000000 <Start>:
0: b508 push {r3, lr}
2: 202a movs r0, #42 ; 0x2a
4: f7ff fffe bl 0 <Double>
8: bd08 pop {r3, pc}
a: 46c0 nop ; (mov r8, r8)
and start with that as your assembly function.
.globl Start
.thumb_func
Start:
push {lr}
mov r0, #42
bl Double
pop {pc}
That, or read the arm abi for gcc and understand what registers you can and can't use without saving them on the stack, and what registers are used for passing and returning parameters.
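Before writing the assembly it can also help to run the framework on a host with a C stand-in for the assembly routine. This is only a sketch: the body of Double here is an assumption based on its name, and check_start is a made-up helper.

```c
#include <assert.h>

/* C stand-in for the assembly routine. Doubling is assumed from the name. */
unsigned int Double(unsigned int x)
{
    return x * 2u;
}

/* Same framework as the C Start above: one argument out, one result back. */
unsigned int Start(void)
{
    return Double(42);
}

/* Host-side sanity check of the call chain (hypothetical helper). */
void check_start(void)
{
    assert(Start() == 84u);
}
```

Once this behaves on the host, Start or Double can be swapped for the assembly version, and the disassembly of the C version shows which registers carry the argument and the result.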
The _start function is the entry point of a C program, and it is what makes the call to main(). To debug it, note that the _start function of any C program is in an assembly file. The real entry point of a program on Linux is not main() but a function called _start(). The standard libraries normally provide a version of this that runs some initialization code, then calls main().
Try compiling this with gcc -nostdlib:

Resources