Why PC is loaded with address containing undefined instruction? - STM32H745 - c

I have a problem enabling the MPU on the STM32H745 MCU. I wanted to just disable MPU, set region and then enable it. However, HardFault showed up. I thought it was a matter of wrong region settings. But after commenting, I noticed the problem occurs just by turning on the MPU.
Code:
static syslog_status_t setMPU_sysLog(void)
{
[...]
ARM_MPU_Disable();
/* ARM_MPU_SetRegion(ARM_MPU_RBAR(0, (uint32_t)NON_CACHABLE_RAM4_D3_BASE_ADDR),
ARM_MPU_RASR(0UL, ARM_MPU_AP_FULL, 1UL, 0UL, 0UL, 1UL, 0x00UL, ARM_MPU_REGION_SIZE_8KB)); */
HALT_IF_DEBUGGING();
ARM_MPU_Enable(0);
return SYSLOG_OK;
}
I use just CMSIS API, so I check assembly and woops:
>0x80003ec <setMPU_sysLog+36> bkpt 0x0001
0x80003ee <setMPU_sysLog+38> ldr r3, [pc, #28] ; (0x800040c <setMPU_sysLog+68>)
0x80003f0 <setMPU_sysLog+40> movs r2, #1
0x80003f2 <setMPU_sysLog+42> str.w r2, [r3, #148] ; 0x94
0x80003f6 <setMPU_sysLog+46> ldr r2, [r3, #36] ; 0x24
0x80003f8 <setMPU_sysLog+48> orr.w r2, r2, #65536 ; 0x10000
0x80003fc <setMPU_sysLog+52> str r2, [r3, #36] ; 0x24
0x80003fe <setMPU_sysLog+54> dsb sy
0x8000402 <setMPU_sysLog+58> isb sy
0x8000406 <setMPU_sysLog+62> movs r0, #0
0x8000408 <setMPU_sysLog+64> bx lr
0x800040a <setMPU_sysLog+66> nop
0x800040c <setMPU_sysLog+68> ; <UNDEFINED> instruction: 0xed00e000
0x8000410 <initSysLog> push {r3, lr}
Load UNDEFINED instruction to PC in 0x80003ee? What could cause this compilator(?) error? Has anyone encountered such a problem? How to start of debugging it? Additional debug information below:
0x08000398 in my_fault_handler_c (frame=0x2001ffb0) at CM7/exceptionHandlers.c:29
29 HALT_IF_DEBUGGING();
(gdb) p/a *frame
$1 = {r0 = 0xde684c0e, r1 = 0x6cefc92c, r2 = 0xed5b5cfb, r3 = 0xa3feeed1, r12 = 0xef082047, lr = 0xd7121a9e, return_address = 0xf16a13cf, xpsr = 0xf60e2caf}
Fields in SCB > HFSR:
VECTTBL: 0 Vector table hard fault
FORCED: 1 Forced hard fault
DEBUG_VT: 0 Reserved for Debug use
Fields in SCB > CFSR_UFSR_BFSR_MMFSR:
IACCVIOL: 1
DACCVIOL: 0
MUNSTKERR: 0
MSTKERR: 1
MLSPERR: 0
MMARVALID: 0
IBUSERR: 0 Instruction bus error
PRECISERR: 0 Precise data bus error
IMPRECISERR: 0 Imprecise data bus error
UNSTKERR: 0 Bus fault on unstacking for a return from exception
STKERR: 0 Bus fault on stacking for exception entry
LSPERR: 0 Bus fault on floating-point lazy state preservation
BFARVALID: 0 Bus Fault Address Register (BFAR) valid flag
UNDEFINSTR: 0 Undefined instruction usage fault
INVSTATE: 0 Invalid state usage fault
INVPC: 0 Invalid PC load usage fault
NOCP: 0 No coprocessor usage fault.
UNALIGNED: 0 Unaligned access usage fault
DIVBYZERO: 0 Divide by zero usage fault
arm-none-eabi-gcc -v
cc version 10.2.1 20201103 (release) (GNU Arm Embedded Toolchain 10-2020-q4-major)

The problem was to not set PRIVDEFENA bit. So turning on the MPU as follows helped:
ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk);

It is not the undefined instruction. It is the value (in this case the address of the hardware register block) used by your function. ARM Thumb instructions cannot set the register with 32 bits value, so it has to be stored in the memory and loaded from there.
It is not a bug - it is something very standard.
Example:
typedef struct
{
volatile uint32_t reg1;
volatile uint32_t reg2;
}MYREG_t;
#define MYREG ((MYREG_t *)0xed00e000)
void foo(uint32_t val)
{
MYREG -> reg2 = val;
}
void bar(uint32_t val)
{
MYREG -> reg1 = val;
}
and generated code:
foo:
ldr r3, .L3
str r0, [r3, #4]
bx lr
.L3:
.word -318709760
bar:
ldr r3, .L6
str r0, [r3]
bx lr
.L6:
.word -318709760
The places where this data is stored and never reached by the code. The same is in your code. It returns from the function before getting tere (bc lr)
If you use the disassembly tool (as you did), it will not understand it and show undefined instructions.
BTW are you using arm-none-eabi-gdb? as it shows nonsense values of the registers,

Related

S32K146EVB Read Collision when erasing/writing flash

when i try to erase or write to the program flash on my S32K146 EVB i run into a Fault at the moment the FTFC should execute the command. Also the RDCOLLERR bit in the FTFC_STAT register is set. This is the Error from S32DS:
BusFault: A precise (synchronous) data access error has occurred. Possible location: 0x00000BA0.
The PC stopped at 0xb8a.
This is the disassembly:
11 while ((FTFC->FSTAT & FTFC_FSTAT_CCIF_MASK) == 0);
00000b88: nop
00000b8a: ldr r3, [pc, #20] ; (0xba0 <execute_command+44>)
00000b8c: ldrb r3, [r3, #0]
00000b8e: uxtb r3, r3
00000b90: sxtb r3, r3
00000b92: cmp r3, #0
00000b94: bge.n 0xb8a <execute_command+22>
12 return;
00000b96: nop
13 }
00000b98: mov sp, r7
00000b9a: pop {r7}
00000b9c: bx lr
00000b9e: nop
00000ba0: movs r0, r0
00000ba2: ands r2, r0
Strangely enough this does not happen, when i step through the program line by line. Then the flash gets programmed correctly.
This is my routine for erasing a flash sector:
void flash_erase_section(unsigned int addr)
{
// wrong address
if ((addr > FLASH_END_ADDRESS && addr < FLEXNVM_START_ADDRESS) || addr > FLEXNVM_END_ADDRESS){
return;
}
asm volatile("cpsid i");
// wait if operation in progress
while ((FTFC->FSTAT & FTFC_FSTAT_CCIF_MASK) == 0);
// clear flags
FTFC->FSTAT = FTFC_FSTAT_ACCERR_MASK | FTFC_FSTAT_FPVIOL_MASK;
FTFC->FCCOB[3] = 0x09; // erase flash section command
FTFC->FCCOB[2] = (addr >> 16) & 0xFF; // address[23:16]
FTFC->FCCOB[1] = (addr >> 8) & 0xFF; // address[15:8]
FTFC->FCCOB[0] = addr & 0xF0; // address[7:0] 128 bit aligned
execute_command();
asm volatile("cpsie i");
return;
}
The error happens in execute_command():
void execute_command()
{
FTFC->FSTAT |= FTFC_FSTAT_CCIF_MASK;
while ((FTFC->FSTAT & FTFC_FSTAT_CCIF_MASK) == 0);
}
As mentioned earlier, this only happens when NOT debugging step by step. I suspect this has something to do with the flash being busy, but i did not find anything that would help me understand.
Thank you for your help.
I found a workaround. It seems that the MCU threw a Bus Fault because, by accessing the flash memory, the cached instructions became invalid. Disabling caching by writing LMEM->PCCRMR = 0; resolved the issue.
Nonetheless it would be interesting if there is a solution which doesn't include disabling caching alltogether.

ARM Cortex M7 unaligned access and memcpy

I am compiling this code for a Cortex M7 using GCC:
// copy manually
void write_test_plain(uint8_t * ptr, uint32_t value)
{
*ptr++ = (u8)(value);
*ptr++ = (u8)(value >> 8);
*ptr++ = (u8)(value >> 16);
*ptr++ = (u8)(value >> 24);
}
// copy using memcpy
void write_test_memcpy(uint8_t * ptr, uint32_t value)
{
void *px = (void*)&value;
memcpy(ptr, px, 4);
}
int main(void)
{
extern uint8_t data[];
extern uint32_t value;
// i added some offsets to data to
// make sure the compiler cannot
// assume it's aligned in memory
write_test_plain(data + 2, value);
__asm volatile("": : :"memory"); // just to split inlined calls
write_test_memcpy(data + 5, value);
... do something with data ...
}
And I get the following Thumb2 assembly with -O2:
// write_test_plain(data + 2, value);
800031c: 2478 movs r4, #120 ; 0x78
800031e: 2056 movs r0, #86 ; 0x56
8000320: 2134 movs r1, #52 ; 0x34
8000322: 2212 movs r2, #18 ; 0x12
8000324: 759c strb r4, [r3, #22]
8000326: 75d8 strb r0, [r3, #23]
8000328: 7619 strb r1, [r3, #24]
800032a: 765a strb r2, [r3, #25]
// write_test_memcpy(data + 5, value);
800032c: 4ac4 ldr r2, [pc, #784] ; (8000640 <main+0x3a0>)
800032e: 923b str r2, [sp, #236] ; 0xec
8000330: 983b ldr r0, [sp, #236] ; 0xec
8000332: f8c3 0019 str.w r0, [r3, #25]
Can someone explain how the memcpy version works? This looks like inlined 32-bit store to the destination address, but isn't this a problem since data + 5 is most certainly not aligned to a 4-byte boundary?
Is this perhaps some optimization which happens due to some undefined behavior in my source?
For Cortex-M processors unaligned loads and stores of bytes, half-words, and words are usually allowed and most compilers use this when generating code unless they are instructed not to. If you want to prevent gcc from assuming the unaligned accesses are OK, you can use the -mno-unaligned-access compiler flag.
If you specify this flag gcc will no longer inline the call to memcpy and write_test_memcpy looks like
write_test_memcpy(unsigned char*, unsigned long):
push {lr}
sub sp, sp, #12
movs r2, #4
add r3, sp, #8
str r1, [r3, #-4]!
mov r1, r3
bl memcpy
add sp, sp, #12
ldr pc, [sp], #4
Cortex-M 7 , M4, M3 M33, M23 does support unaligned access
M0, M+ doesn't support unaligned access
however you can disable the support of unaligned access in cortexm7 by setting bit UNALIGN_TRP in configuration and control register and any unaligned access will generate usage fault.
From compiler perspective, default setting is that generated assembly code does unaligned access unless you disable this by using the compile flag -mno-unaligned-access

How does this disassembly correspond to the given C code?

Environment: GCC 4.7.3 (arm-none-eabi-gcc) for ARM Cortex m4f. Bare-metal (actually MQX RTOS, but here that's irrelevant). The CPU is in Thumb state.
Here's a disassembler listing of some code I'm looking at:
//.label flash_command
// ...
while(!(FTFE_FSTAT & FTFE_FSTAT_CCIF_MASK)) {}
// Compiles to:
12: bf00 nop
14: f04f 0300 mov.w r3, #0
18: f2c4 0302 movt r3, #16386 ; 0x4002
1c: 781b ldrb r3, [r3, #0]
1e: b2db uxtb r3, r3
20: b2db uxtb r3, r3
22: b25b sxtb r3, r3
24: 2b00 cmp r3, #0
26: daf5 bge.n 14 <flash_command+0x14>
The constants (after expending macros, etc.) are:
address of FTFE_FSTAT is 0x40020000u
FTFE_FSTAT_CCIF_MASK is 0x80u
This is compiled with NO optimization (-O0), so GCC shouldn't be doing anything fancy... and yet, I don't get this code. Post-answer edit: Never assume this. My problem was getting a false sense of security from turning off optimization.
I've read that "uxtb r3,r3" is a common way of truncating a 32-bit value. Why would you want to truncate it twice and then sign-extend? And how in the world is this equivalent to the bit-masking operation in the C-code?
What am I missing here?
Edit: Types of the thing involved:
So the actual macro expansion of FTFE_FSTAT comes down to
((((FTFE_MemMapPtr)0x40020000u))->FSTAT)
where the struct is defined as
/** FTFE - Peripheral register structure */
typedef struct FTFE_MemMap {
uint8_t FSTAT; /**< Flash Status Register, offset: 0x0 */
uint8_t FCNFG; /**< Flash Configuration Register, offset: 0x1 */
//... a bunch of other uint_8
} volatile *FTFE_MemMapPtr;
The two uxtb instructions are the compiler being stupid, they should be optimized out if you turn on optimization. The sxtb is the compiler being brilliant, using a trick that you wouldn't expect in unoptimized code.
The first uxtb is due to the fact that you loaded a byte from memory. The compiler is zeroing the other 24 bits of register r3, so that the byte value fills the entire register.
The second uxtb is due to the fact that you're ANDing with an 8-bit value. The compiler realizes that the upper 24-bits of the result will always be zero, so it's using uxtb to clear the upper 24-bits.
Neither of the uxtb instructions does anything useful, because the sxtb instruction overwrites the upper 24 bits of r3 anyways. The optimizer should realize that and remove them when you compile with optimizations enabled.
The sxtb instruction takes the one bit you care about 0x80 and moves it into the sign bit of register r3. That way, if bit 0x80 is set, then r3 becomes a negative number. So now the compiler can compare with 0 to determine whether the bit was set. If the bit was not set then the bge instruction branches back to the top of the while loop.

ARM Cortex-M3 startup file

I am modifying a startup file for an ARM Cortex-M3 microcontroller. Everything works fine so far, but I have a question regarding the need of using assembler code to perform the zero-filling of the BSS block.
By default the reset interrupt in the startup file looks as follows:
// Zero fill the bss segment.
__asm( " ldr r0, =_bss\n"
" ldr r1, =_ebss\n"
" mov r2, #0\n"
" .thumb_func\n"
" zero_loop:\n"
" cmp r0, r1\n"
" it lt\n"
" strlt r2, [r0], #4\n"
" blt zero_loop"
);
Using that code everything works as expected. However if I change the previous code for the following it stops working:
// Zero fill the bss segment.
for(pui32Dest = &_bss; pui32Dest < &_ebss; )
{
*pui32Dest++ = 0;
}
In principle both codes should do the same (fill the BSS with zeros), but the second one does not work for some reason that I fail to understand. I belive that the .thumb_func directive must play a role here, but I am not very familiar with ARM assembler. Any ideas or directions to help me understanding? Thanks!
Edit: By the way, the code to initialize the data segment (e.g. copy from Flash to RAM) is as follows and works just fine.
// Copy the data segment initializers from flash to SRAM.
pui32Src = &_etext;
for(pui32Dest = &_data; pui32Dest < &_edata; )
{
*pui32Dest++ = *pui32Src++;
}
Edit: Added the dissasembled code for both functions.
Assembly for the first looks like:
2003bc: 4806 ldr r0, [pc, #24] ; (2003d8 <zero_loop+0x14>)
2003be: 4907 ldr r1, [pc, #28] ; (2003dc <zero_loop+0x18>)
2003c0: f04f 0200 mov.w r2, #0
002003c4 <zero_loop>:
2003c4: 4288 cmp r0, r1
2003c6: bfb8 it lt
2003c8: f840 2b04 strlt.w r2, [r0], #4
2003cc: dbfa blt.n 2003c4 <zero_loop>
Assembly for the second looks like:
2003bc: f645 5318 movw r3, #23832 ; 0x5d18
2003c0: f2c2 0300 movt r3, #8192 ; 0x2000
2003c4: 9300 str r3, [sp, #0]
2003c6: e004 b.n 2003d2 <ResetISR+0x6e>
2003c8: 9b00 ldr r3, [sp, #0]
2003ca: 1d1a adds r2, r3, #4
2003cc: 9200 str r2, [sp, #0]
2003ce: 2200 movs r2, #0
2003d0: 601a str r2, [r3, #0]
2003d2: 9a00 ldr r2, [sp, #0]
2003d4: f644 033c movw r3, #18492 ; 0x483c
2003d8: f2c2 0300 movt r3, #8192 ; 0x2000
2003dc: 429a cmp r2, r3
2003de: d3f3 bcc.n 2003c8 <ResetISR+0x64>
If the initial stack is in the .bss section as suggested, you can see from the disassembly why the C code fails - it's loading the current pointer from the stack, saving the incremented pointer back to the stack, zeroing the location, then reloading the incremented pointer for the next iteration. If you zero the contents of the stack while using them, Bad Things happen.
In this case, turning on optimisation might fix it (a smart compiler should generate pretty much the same as the assembly code if it actually tries). More generally, though, it's probably safer to consider sticking with assembly code when doing things like this that would normally be done at a level below the C runtime environment - bootstrapping a C environment from C code which expects that environment to exist already is risky at best, since you can only hope the code doesn't attempt to use anything that's not yet set up.
After a quick look around (I'm not overly familiar with the specifics of Cortex-M development), it seems an alternative/additional solution might be adjusting the linker script to move the stack somewhere else.

ARM bootloader: Interrupt Vector Table Understanding

The code following is the first part of u-boot to define interrupt vector table, and my question is how every line will be used. I understand the first 2 lines which is the starting point and the first instruction to implement: reset, and we define reset below. But when will we use these instructions below? According to System.map, every instruction has a fixed address, so _fiq is at 0x0000001C, when we want to execute fiq, we will copy this address into pc and then execute,right? But in which way can we jump to this instruction: ldr pc, _fiq? It's realised by hardware or software? Hope I make myself understood correctly.
>.globl _start
>_start:b reset
> ldr pc, _undefined_instruction
> ldr pc, _software_interrupt
> ldr pc, _prefetch_abort
> ldr pc, _data_abort
> ldr pc, _not_used
> ldr pc, _irq
> ldr pc, _fiq
>_undefined_instruction: .word undefined_instruction
>_software_interrupt: .word software_interrupt
>_prefetch_abort: .word prefetch_abort
>_data_abort: .word data_abort
>_not_used: .word not_used
>_irq: .word irq
>_fiq: .word fiq
If you understand reset then you understand all of them.
When the processor is reset then hardware sets the pc to 0x0000 and starts executing by fetching the instruction at 0x0000. When an undefined instruction is executed or tries to be executed the hardware responds by setting the pc to 0x0004 and starts executing the instruction at 0x0004. irq interrupt, the hardware finishes the instruction it is executing starts executing the instruction at address 0x0018. and so on.
00000000 <_start>:
0: ea00000d b 3c <reset>
4: e59ff014 ldr pc, [pc, #20] ; 20 <_undefined_instruction>
8: e59ff014 ldr pc, [pc, #20] ; 24 <_software_interrupt>
c: e59ff014 ldr pc, [pc, #20] ; 28 <_prefetch_abort>
10: e59ff014 ldr pc, [pc, #20] ; 2c <_data_abort>
14: e59ff014 ldr pc, [pc, #20] ; 30 <_not_used>
18: e59ff014 ldr pc, [pc, #20] ; 34 <_irq>
1c: e59ff014 ldr pc, [pc, #20] ; 38 <_fiq>
00000020 <_undefined_instruction>:
20: 00000000 andeq r0, r0, r0
00000024 <_software_interrupt>:
24: 00000000 andeq r0, r0, r0
00000028 <_prefetch_abort>:
28: 00000000 andeq r0, r0, r0
0000002c <_data_abort>:
2c: 00000000 andeq r0, r0, r0
00000030 <_not_used>:
30: 00000000 andeq r0, r0, r0
00000034 <_irq>:
34: 00000000 andeq r0, r0, r0
00000038 <_fiq>:
38: 00000000 andeq r0, r0, r0
Now of course in addition to changing the pc and starting execution from these addresses. The hardware will save the state of the machine, switch processor modes if necessary and then start executing at the new address from the vector table.
Our job as programmers is to build the binary such that the instructions we want to be run for each of these instructions is at the right address. The hardware provides one word, one instruction for each location. Now if you never expect to ever have any of these exceptions, you dont have to have a branch at address zero for example you can just have your program start, there is nothing magic about the memory at these addresses. If you expect to have these exceptions, then you have two choices for instructions that are one word and can jump out of the way of the exception that follows. One is a branch the other is a load pc. There are pros and cons to each.
When the hardware takes an exception, the program counter (PC) is automatically set to the address of the relevant exception vector and the processor begins executing instructions from that address. When the processor comes out of reset, the PC is automatically set to base+0. An undefined instruction sets the PC to base+4, etc. The base address of the vector table (base) is either 0x00000000, 0xFFFF0000, or VBAR depending on the processor and configuration. Note that this provides limited flexibility in where the vector table gets placed and you'll need to consult the ARM documentation in conjunction with the reference manual for the device that you are using to get the right value to be used.
The layout of the table (4 bytes per exception) makes it necessary to immediately branch from the vector to the actual exception handler. The reasons for the LDR PC, label approach are twofold - because a PC-relative branch is limited to (24 << 2) bits (+/-32MB) using B would constrain the layout of the code in memory somewhat; by loading an absolute address the handler can be located anywhere in memory. Secondly it makes it very simple to change exception handlers at runtime, by simply writing a different address to that location, rather than having to assemble and hotpatch a branch instruction.
There's little value to having a remappable reset vector in this way, however, which is why you tend to see that one implemented as a simple branch to skip over the rest of the vectors to the real entry point code.

Resources