I am writing some bare metal code for the Raspberry Pi and am getting an unexpected warning from the ARM cross assembler on Windows. The instructions causing the warnings were:
stmdb sp!,{r0-r14}^
and
ldmia sp!,{r0-r14}^
The warning is:
Warning: writeback of base register is UNPREDICTABLE
I can sort of understand this as although the '^' modifier tells the processor to store the user mode copies of the registers, it doesn't know what mode the processor will be in when the instruction is executed and there doesn't appear to be a way to tell it. I was a little more concerned to get the same warning for:
stmdb sp!,{r0-r9,sl,fp,ip,lr}^
and:
ldmia sp!,{r0-r9,sl,fp,ip,lr}^
despite the fact that I am explicitly not storing ANY sp register.
My concern is that, although I used to do a lot of assembler coding about 15 years ago, ARM code is new to me and I may be misunderstanding something! Also, if I can safely ignore the warnings, is there any way to suppress them?
The ARM Architecture Reference Manual says that writeback is not allowed in LDM/SMT of user registers. It is allowed in the exception return case, where pc is in the register list.
LDM (exception return)
LDM{<amode>}<c> <Rn>{!},<registers_with_pc>^
LDM (user registers)
LDM{<amode>}<c> <Rn>,<registers_without_pc>^
The term "writeback" refers not to the presence or absence of SP in the register list, but to the ! symbol which means the instruction is supposed to update the SP value with the end of transfer area address. The base register (SP) value will be used from the current mode, not the User mode, so you can still load or store user-mode SP value into your stack. From the ARM ARM B9.3.6 LDM (User registers):
In a PL1 mode other than System mode, Load Multiple (User registers)
loads multiple User mode registers from consecutive memory locations
using an address from a base register. The registers loaded cannot
include the PC. The processor reads the base register value normally,
using the current mode to determine the correct Banked version of the
register. This instruction cannot writeback to the base register.
The encoding diagram reflects this by specifying the bit 21 (W, writeback) as '(0)' which means that the result is unpredictable if the bit is not 0.
So the solution is just to not specify the ! and decrement or increment SP manually if necessary.
Related
I am currently debugging a hard fault trap which turned out to be a precise data bus error on a STM32F205 Cortex-M3 processor, using Keil uVision. Due to a lengthy debugging and googling process I found the assembly instruction that caused the trap. Now I am looking for a way to avoid this lengthy process next time a trap occurs.
In the application note 209 by Keil it says:
PRECISEERR: Precise data bus error:
0 = no precise data bus error
1 = a data bus error has occurred, and the PC value stacked for the exception return points to the instruction that caused the fault. When the processor sets this bit it writes the faulting address to SCB->BFAR
and also this:
An exception saves the state of registers R0-R3, R12, PC & LR either the Main Stack or the Process Stack (depends on the stack in use when the exception occurred).
The last quote I am interpreting as such that there should be 7 registers plus the respective stack. When I look up my SP address in the memory I see the address that caused the error at an address 10 words higher than the stack pointer address.
My questions are:
Is the address of the instruction that caused the trap always saved 10 words higher than the current stack pointer? And could you please point out a document where I can read up on how and why this is?
Is there another register that would contain this address as well?
As you said, exceptions (or interrupts) on ARM Cortex-M3 will automatically stack some registers, namely :
Program Counter (PC)
Processor Status Register (xPSR)
r0-r3
r12
Link Register (LR).
For a total of 8 registers (reference : Cortex™-M3 Technical Reference Manual, Chapter 5.5.1).
Now, if you write your exception handler in a language other than assembly, the compiler may stack additional registers, and you can't be sure how many.
One simple solution is to add a small code before the real handler, to pass the address of the auto-stacked registers :
void myRealHandler( void * stack );
void handler(void) {
asm volatile("mov r0, sp");
asm volatile("b myRealHandler");
}
The register BFAR is specific to bus faults. It will contain the faulty address, not the address of the faulty instruction. For example, if there was an error reading an integer at address 0x30005632, BFAR will be set to 0x30005632.
The precise stack location of the return address depends on how much stack the interrupt handler requires. If you look at the disassembly of your HardFault_Handler, you should be able to see how much data is stored on the stack / how many registers are pushed in addition to the registers pushed by the hardware interrupt machinery (R0-R3, R12, PC, LR & PSR)
I found this to be a pretty good idea on how to debug Hard Faults, though it requires a bit of inline assembly.
I'm trying to generate AXI bus burst accesses using STM/LDM instructions in inline assembly in .c file compiled with ARM Compiler 5 armcc.
inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
__asm {
STMIA addr!, { w0, w1 }
}
}
But ARM Compiler armcc User Guide, paragraph 7.18 is saying:
"All LDM and STM instructions are expanded into a sequence of LDR and STR instructions with equivalent effect. However, the compiler might subsequently recombine the separate instructions into an LDM or STM during optimization."
And that is what really happens in practice, LDM/STM are expanded into a set of LDR/STR in some cases and order of these instuctions is arbitrary.
This affects performance since HW we use optimized for bursts processing. Also this breaks functional correctness because HW we use takes into consideration sequence of words and ignores offsets (but compiler think that it's safe to change the order of instructions).
To resolve this it's possible to use embedded assembler instead of inline assembler, but this leads to extra function calls-returns what affects performance.
So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc.
Target CPU: Cortex M0+ (ARMv6-M).
Edit:
Slave devices are all on-chip devices, most of them are non-memory devices. For every register of non-memory slave that supports burst access region of address space is reserved (for example [0x10000..0x10100]), I'm not completely sure why, maybe CPU or bus doesn't support fixed (non-incremental) addresses. HW ignores offsets within this region. Full request can be 16 bytes for example and first word of the full request is first word written (even if offset is non-zero).
So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc.
A little bit about compiler optimizations. Register allocation is one of it's toughest jobs. The heart of any compiler's code generation is probably around when it allocates physical CPU registers. Most compilers are using Single static assignment or SSA to rename your 'C' variables into a bunch of pseudo variable (or time order variables).
In order for your STMIA and LDMIA to work you need the loads and stores to be consistent. Ie, if it is stmia [rx], {r3,r7} and a restore like ldmia [rx], {r4,r8} with the 'r3' mapping to the new 'r4' and the stored 'r7' mapping to the restored 'r8'. This is not simple for any compiler to implement generically as 'C' variables will be assigned according to need. Different versions of the same variable maybe in different registers. To make the stm/ldm work those variable must be assigned so that register increments in the right order. Ie, for the ldmia above if the compiler want the stored r7 in r0 (maybe a return value?), there is no way for it to create a good ldm instruction without generating additional code.
You may have gotten gcc to generate this, but it was probably luck. If you proceed with only gcc, you will probably find it doesn't work as well.
See: ldm/stm and gcc for issues with GCC stm/ldm.
Taking your example,
inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
__asm {
STMIA addr!, { w0, w1 }
}
}
The value of inline is that the whole function body may be put right in the code. The caller might have the w0 and w1 in registers R8 and R4. If the function is not inline, then the compile must place them in R1 and R2 but may have generated extra moves. It is difficult for any compiler to fulfil the requirements of the ldm/stm generically.
This affects performance since HW we use optimized for bursts processing. Also this breaks functional correctness because HW we use takes into consideration sequence of words and ignores offsets (but compiler think that it's safe to change the order of instructions).
If the hardware is a particular non-memory slave peripheral on the bus, then you can wrap the functionality to write to this slave in an external wrapper and force the register allocation (see AAPCS) so that ldm/stm will work. This will result in a performance hit which could be mitigated by some custom assembler in the driver for the device.
However, it sounds like the device might be memory? In this case, you have a problem. Normally, memory devices like this will use a cache only? If your CPU has an MPU (memory protection unit) and can enable both data and code cache, then you might resolve this issue. Cache lines will always be burst accesses. Care only needs to be taken in the code to setup the MPU and the data cache. OPs Cortex-M0+ has no cache and the devices are non-memory so this will not be possible (nor needed).
If your device is memory and you have no data cache then your issue is probably unresolvable (without massive effort) and you need different hardware. Or you can wrap it like the peripheral device and take a performance hit; loosing the benefits of the random access of the memory device.
How does the R15 of ARM differ from the general PC of a CPU?
Both of them are program counters only. What is the difference?
ARM's PC is more similar to a regular register with some restrictions than x86's IP is similar to a regular register.
Considering general PC is an Intel x86 based CPU, in x86's case you can't manipulate PC (Instruction pointer) directly but it is updated implicitly by provided control flow instructions.
In ARM's case historically Program Counter (PC), mapped as register at index 15 (16th register) can be manipulated directly via arithmetic instructions. For example you can add 16 to PC which would alter flow of instruction stream similar to a 16-byte forward jump instruction.
The ARM PC maybe more of a general register than most CPUs, but it is still very special. The traditional simple arithmetic instructions can use the PC as an input argument in many cases. Here it functions as a pointer or array base. It can also be used as the output for control transfer with these instructions. As a read-only value, it is useful for calculating return values in a PC-independent way. It is also useful to use as a constant table look-up in near-by code. For these cases, the PC is very much like a regular register. This is probably more common on many RISC CPUs as opposed to a CISC ISA.
However, when the PC is used as a destination (lvalue or updated and written), the behavior is often non-standard. Some examples of special cases (for some ARM architechure versions) for R15/PC are,
adcs - copies SPSR to CPSR
adds - copies SPSR to CPSR
ands - copies SPSR to CPSR
bics - copies SPSR to CPSR
bx r15 - highly discourage or not supported.
clz r15 - not supported.
mcr pXX, xx, r15,... - unpredictable
etc.
In most cases, using the PC as a destination of an instruction will have some special case. Especially, the use of the S (normally to set conditions codes) can be used to return from an exception. This might be used as some sort of veneer when returning from an exception or just a direct return. In some cases, the meaning of the instruction might change completely. For instance, ldm sp, {r0-r15}^ and ldm sp, {r0-r14}^ use different register banks; the first will load the registers according to the mode in the SPSR; whereas the 2nd will load the register to user mode.
For load/store, atomics, mode manipulation, co-processor and complex arithmetic (64 bit multiplies, etc) instructions, the PC is often unsupported or has a different meaning; the different meaning is often a mechanism for handling exceptions for system level code.
I would like to create a debugging tool which will help me debug better my application.
I'm working bare-bones (without an OS). using IAR embedded workbench on Atmel's SAM3.
I have a Watchdog timer, which calls a specific IRQ in case of timeout (This will be replaced with a software reset on release).
In the IRQ handler, I want to print out (UART) the stack trace, of where exactly the Watchdog timeout occurred.
I looked in the web, and I didn't find any implementation of that functionality.
Anyone has an idea on how to approach this kind of thing ?
EDIT: OK, I managed to grab the return address from the stack, so I know exactly where the WDT timeout occurred.
Unwinding the whole stack is not simple as it first appears, because each function pushes different amount of local variables into the stack.
The code I end up with is this (for others, who may find it usefull)
void WDT_IrqHandler( void )
{
uint32_t * WDT_Address;
Wdt *pWdt = WDT ;
volatile uint32_t dummy ;
WDT_Address = (uint32_t *) __get_MSP() + 16 ;
LogFatal ("Watchdog Timer timeout,The Return Address is %#X", *WDT_Address);
/* Clear status bit to acknowledge interrupt */
dummy = pWdt->WDT_SR ;
}
ARM defines a pair of sections, .ARM.exidx and .ARM.extbl, that contain enough information to unwind the stack without debug symbols. These sections exist for exception handling but you can use them to perform a backtrace as well. Add -funwind-tables to force GCC to include these sections.
To do this with ARM, you will need to tell your compiler to generate stack frames. For instance with gcc, check the option -mapcs-frame. It may not be the one you need, but this will be a start.
If you do not have this, it will be nearly impossible to "unroll" the stack, because you will need for each function the exact stack usage depending on parameters and local variables.
If you are looking for some exemple code, you can check dump_stack() in Linux kernel sources, and find back the related piece of code executed for ARM.
It should be pretty straight forward to follow execution. Not programmatically in your isr...
We know from the ARM ARM that on a Cortex-M3 it pushes xPSR,
ReturnAddress, LR (R14), R12, R3, R2, R1, and R0 on the stack. mangles the lr so it can detect a return from interrupt then calls the entry point listed in the vector table. if you implement your isr in asm to control the stack, you can have a simple loop that disables the interrupt source (turns off the wdt, whatever, this is going to take some time) then goes into a loop to dump a portion of the stack.
From that dump you will see the lr/return address, the function/instruction that was interrupted, from a disassembly of your program you can then see what the compiler has placed on the stack for each function, subtract that off at each stage and go as far back as you like or as far back as you have printed the stack contents.
You could also make a copy of the stack in ram and dissect it later rather than doing such things in an isr (the copy still takes too much time but is less intrusive than waiting on the uart).
If all you are after is the address of the instruction that was interrupted, that is the most trivial task, just read that from the stack, it will be at a known place, and print it out.
Did I hear my name? :)
You will probably need a tiny bit of inline assembly. Just figure out the format of the stack frames, and which register holds the ordinary1 stack pointer, and transfer the relevant values into C variables from which you can format strings for output to the UART.
It shouldn't be too tricky, but of course (being rather low-level) you need to pay attention to the details.
1As in "non-exception"; not sure if the ARM has different stacks for ordinary code and exceptions, actually.
Your watchdog timer can fire at any point, even when the stack does not contain enough information to unwind (e.g. stack space has been allocated for register spill, but the registers not copied yet).
For properly optimized code, you need debug info, period. All you can do from a watchdog timer is a register and stack dump in a format that is machine readable enough to allow conversion into a core dump for gdb.
According to the ARM manual, it should be possible to access the banked registers for a specific CPU mode as, for instance, "r13_svc". When I try to do this gcc yells at me with the following error:
immediate expression requires a # prefix -- `mov r2,sp_svc'
What's wrong?
Update. The following text from the ARM Architecture Reference Manual for ARMv5 and ARMv6 led me to believe that it is possible, section A2.4.2:
Registers R13 and R14 have six banked
physical registers each. One is used
in User and System modes, and each of
the remaining five is used in one of
the five exception modes. Where it is
necessary to be specific about which
version is being referred to, you use
names of the form: R13_mode
R14_mode where mode is the
appropriate one of usr, svc (for
Supervisor mode), abt, und, irq and
fiq.
The correct syntax for this is mrs r2,sp_svc or mrs r3, sp_usr. This is a new armv7 extension. The code can be seen in the ARM Linux KVM source file interrupt_head.S. The gas binutils patch for this instruction support by Matthew Gretton-Dann. It requires the virtualization extensions are far as I understand.
According to what I understand, the LPAE (large physical address extension) implies the virtualization extensions. So Cortex-A7, Cortex-A12, Cortex-A15, and Cortex-A17 may be able to use this extension. However, the Cortex-A5, Cortex-A8, and Cortex-A9 can not.
Documentation on the instruction can be found in the ARMv7a TRM revC, under section B9.3.9 MRS (Banked register).
For other Cortex-A (and ARMv6) CPU's you can use the cps instruction to switch modes and transfer the banked register to an un-banked register (R0-R7) and then switch back. The obvious difficulty is with user mode. The correct way to handle this is with ldm rN, {sp,lr}^; user mode has no simple way back to the privileged modes.
For all older CPUs, the information given by old_timer will work. Mainly, use mrs/msr to change modes. mrs/msr works over the full class of ARM cpus but requires multiple instructions and hence may have race issues which require interrupt and exception masking depending on context.
This is an important instruction (sequences) for context switching (which VMs do a lot of).
I don't think that's possible with the mov instruction; at least according to the ARM Architecture Reference Manual I'm reading. What document do you have? There are is a variant of ldm that can load user mode registers from a privileged mode (using ^). Your only other option is to switch to SVC mode, do mov r2, sp, and then switch back to whatever other mode you were using.
The error you're getting is because it doesn't understand sp_svc, so it thinks you're trying to do an immediate mov, which would look like:
mov r2, #0x14
So that's why it says "requires a # prefix".
You use mrs and msr to change modes by changing bits in the cpsr then use r13 normally.
From the arm arm
MRS R0,CPSR
BIC R0,R0,#0x1F
ORR R0,R0,#0x13
MSR CPSR_c,R0
then
mov sp,#0x10000000
or if you need more bits in the immediate
ldr sp,=0x12345600
or if you dont want the assembler placing your data, you can place it yourself.
ldr sp,svc_stack
b 1f
svc_stack: .word 0x12345600
1:
You will see typical arm startup code, where the application is going to support interrupts, aborts and other exceptions, to set all of your stack pointers that you are going to need, change mode, set sp, change mode, set sp, change mode ...