When I try to erase or write to the program flash on my S32K146 EVB, I run into a fault at the moment the FTFC should execute the command. The RDCOLERR bit in the FTFC FSTAT register is also set. This is the error from S32DS:
BusFault: A precise (synchronous) data access error has occurred. Possible location: 0x00000BA0.
The PC stopped at 0xb8a.
This is the disassembly:
11 while ((FTFC->FSTAT & FTFC_FSTAT_CCIF_MASK) == 0);
00000b88: nop
00000b8a: ldr r3, [pc, #20] ; (0xba0 <execute_command+44>)
00000b8c: ldrb r3, [r3, #0]
00000b8e: uxtb r3, r3
00000b90: sxtb r3, r3
00000b92: cmp r3, #0
00000b94: bge.n 0xb8a <execute_command+22>
12 return;
00000b96: nop
13 }
00000b98: mov sp, r7
00000b9a: pop {r7}
00000b9c: bx lr
00000b9e: nop
00000ba0: movs r0, r0
00000ba2: ands r2, r0
Strangely enough, this does not happen when I step through the program line by line. In that case the flash gets programmed correctly.
This is my routine for erasing a flash sector:
void flash_erase_section(unsigned int addr)
{
// wrong address
if ((addr > FLASH_END_ADDRESS && addr < FLEXNVM_START_ADDRESS) || addr > FLEXNVM_END_ADDRESS){
return;
}
asm volatile("cpsid i");
// wait if operation in progress
while ((FTFC->FSTAT & FTFC_FSTAT_CCIF_MASK) == 0);
// clear flags
FTFC->FSTAT = FTFC_FSTAT_ACCERR_MASK | FTFC_FSTAT_FPVIOL_MASK;
FTFC->FCCOB[3] = 0x09; // erase flash section command
FTFC->FCCOB[2] = (addr >> 16) & 0xFF; // address[23:16]
FTFC->FCCOB[1] = (addr >> 8) & 0xFF; // address[15:8]
FTFC->FCCOB[0] = addr & 0xF0; // address[7:0] 128 bit aligned
execute_command();
asm volatile("cpsie i");
return;
}
The error happens in execute_command():
void execute_command()
{
FTFC->FSTAT |= FTFC_FSTAT_CCIF_MASK;
while ((FTFC->FSTAT & FTFC_FSTAT_CCIF_MASK) == 0);
}
As mentioned earlier, this only happens when NOT debugging step by step. I suspect this has something to do with the flash being busy, but I did not find anything that would help me understand.
Thank you for your help.
I found a workaround. It seems that the MCU threw a bus fault because, by accessing the flash memory, the cached instructions became invalid. Disabling caching by writing LMEM->PCCRMR = 0; resolved the issue.
Nonetheless, it would be interesting to know whether there is a solution that doesn't involve disabling caching altogether.
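One commonly used alternative is to leave the cache enabled and run the command launch from RAM, so that the core does not fetch from program flash while the FTFC is busy (a read during a running command is what RDCOLERR reports). The following is a minimal sketch, assuming a GCC toolchain and a linker script that places a section named .code_ram in RAM and copies it there at startup. The section name, the RAMFUNC macro and the function name are assumptions and not part of the original code. The status register is passed in as a pointer so that the sketch does not depend on the device header; on the target it would be &FTFC->FSTAT.

```c
#include <stdint.h>

#define FSTAT_CCIF 0x80u  /* command complete flag, same bit as FTFC_FSTAT_CCIF_MASK */

/* Assumption: ".code_ram" is a section the linker script places in RAM.
   On a non-ARM (host) build the attribute is left out so the sketch still compiles. */
#if defined(__arm__) && defined(__GNUC__)
#define RAMFUNC __attribute__((section(".code_ram"), noinline))
#else
#define RAMFUNC
#endif

/* Launch the command already prepared in FCCOB and wait until it has finished.
   fstat is expected to point at the FTFC status register (&FTFC->FSTAT). */
RAMFUNC void flash_launch_command(volatile uint8_t *fstat)
{
    /* Write 1 to CCIF only: this starts the command without touching
       the other write-1-to-clear flags in the same register. */
    *fstat = FSTAT_CCIF;

    /* Busy-wait for completion; this loop is meant to execute from RAM. */
    while ((*fstat & FSTAT_CCIF) == 0u) {
    }
}
```

Interrupts still have to stay disabled around the call, as in flash_erase_section(), unless the vector table and all handlers that can run are located in RAM as well.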
I have a problem enabling the MPU on the STM32H745 MCU. I just wanted to disable the MPU, set a region and then enable it again. However, a HardFault showed up. I thought it was a matter of wrong region settings, but after commenting the region setup out, I noticed that the problem occurs just by turning on the MPU.
Code:
static syslog_status_t setMPU_sysLog(void)
{
[...]
ARM_MPU_Disable();
/* ARM_MPU_SetRegion(ARM_MPU_RBAR(0, (uint32_t)NON_CACHABLE_RAM4_D3_BASE_ADDR),
ARM_MPU_RASR(0UL, ARM_MPU_AP_FULL, 1UL, 0UL, 0UL, 1UL, 0x00UL, ARM_MPU_REGION_SIZE_8KB)); */
HALT_IF_DEBUGGING();
ARM_MPU_Enable(0);
return SYSLOG_OK;
}
I use just the CMSIS API, so I checked the assembly and whoops:
>0x80003ec <setMPU_sysLog+36> bkpt 0x0001
0x80003ee <setMPU_sysLog+38> ldr r3, [pc, #28] ; (0x800040c <setMPU_sysLog+68>)
0x80003f0 <setMPU_sysLog+40> movs r2, #1
0x80003f2 <setMPU_sysLog+42> str.w r2, [r3, #148] ; 0x94
0x80003f6 <setMPU_sysLog+46> ldr r2, [r3, #36] ; 0x24
0x80003f8 <setMPU_sysLog+48> orr.w r2, r2, #65536 ; 0x10000
0x80003fc <setMPU_sysLog+52> str r2, [r3, #36] ; 0x24
0x80003fe <setMPU_sysLog+54> dsb sy
0x8000402 <setMPU_sysLog+58> isb sy
0x8000406 <setMPU_sysLog+62> movs r0, #0
0x8000408 <setMPU_sysLog+64> bx lr
0x800040a <setMPU_sysLog+66> nop
0x800040c <setMPU_sysLog+68> ; <UNDEFINED> instruction: 0xed00e000
0x8000410 <initSysLog> push {r3, lr}
Does the ldr at 0x80003ee load an UNDEFINED instruction? What could cause this compiler(?) error? Has anyone encountered such a problem? How should I start debugging it? Additional debug information below:
0x08000398 in my_fault_handler_c (frame=0x2001ffb0) at CM7/exceptionHandlers.c:29
29 HALT_IF_DEBUGGING();
(gdb) p/a *frame
$1 = {r0 = 0xde684c0e, r1 = 0x6cefc92c, r2 = 0xed5b5cfb, r3 = 0xa3feeed1, r12 = 0xef082047, lr = 0xd7121a9e, return_address = 0xf16a13cf, xpsr = 0xf60e2caf}
Fields in SCB > HFSR:
VECTTBL: 0 Vector table hard fault
FORCED: 1 Forced hard fault
DEBUG_VT: 0 Reserved for Debug use
Fields in SCB > CFSR_UFSR_BFSR_MMFSR:
IACCVIOL: 1
DACCVIOL: 0
MUNSTKERR: 0
MSTKERR: 1
MLSPERR: 0
MMARVALID: 0
IBUSERR: 0 Instruction bus error
PRECISERR: 0 Precise data bus error
IMPRECISERR: 0 Imprecise data bus error
UNSTKERR: 0 Bus fault on unstacking for a return from exception
STKERR: 0 Bus fault on stacking for exception entry
LSPERR: 0 Bus fault on floating-point lazy state preservation
BFARVALID: 0 Bus Fault Address Register (BFAR) valid flag
UNDEFINSTR: 0 Undefined instruction usage fault
INVSTATE: 0 Invalid state usage fault
INVPC: 0 Invalid PC load usage fault
NOCP: 0 No coprocessor usage fault.
UNALIGNED: 0 Unaligned access usage fault
DIVBYZERO: 0 Divide by zero usage fault
arm-none-eabi-gcc -v
gcc version 10.2.1 20201103 (release) (GNU Arm Embedded Toolchain 10-2020-q4-major)
The problem was that the PRIVDEFENA bit was not set. Turning on the MPU as follows helped:
ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk);
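The fault status dump in the question fits this explanation. With the MPU enabled, no region configured and PRIVDEFENA clear, the default memory map is not available as a background region, so the next instruction fetch is rejected (IACCVIOL) and so is the exception stacking (MSTKERR). A failed stacking would probably also explain why the frame printed by gdb contains garbage. Below is a small sketch that decodes the relevant bits of CFSR. The bit positions follow the ARMv7-M architecture; the helper names are made up for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* MemManage Fault Status Register: bits [7:0] of SCB->CFSR (ARMv7-M). */
#define MMFSR_IACCVIOL  (1u << 0)  /* instruction access violation */
#define MMFSR_DACCVIOL  (1u << 1)  /* data access violation */
#define MMFSR_MUNSTKERR (1u << 3)  /* fault on unstacking */
#define MMFSR_MSTKERR   (1u << 4)  /* fault on stacking for exception entry */
#define MMFSR_MLSPERR   (1u << 5)  /* fault on lazy FP state preservation */
#define MMFSR_MMARVALID (1u << 7)  /* MMFAR holds a valid address */

/* BusFault stacking error: bit 12 of SCB->CFSR. */
#define CFSR_STKERR     (1u << 12)

/* Extract the MemManage byte from a CFSR value. */
static inline uint8_t cfsr_to_mmfsr(uint32_t cfsr)
{
    return (uint8_t)(cfsr & 0xFFu);
}

/* The stacked exception frame can only be trusted if stacking itself did not fault. */
bool stacked_frame_is_trustworthy(uint32_t cfsr)
{
    return (cfsr & (MMFSR_MSTKERR | CFSR_STKERR)) == 0u;
}
```

In a fault handler the argument would be the value read from SCB->CFSR. For the dump above (IACCVIOL and MSTKERR set) the MemManage byte is 0x11 and the frame is reported as not trustworthy.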
It is not an undefined instruction. It is a value (in this case the address of the hardware register block) used by your function. A Thumb instruction cannot load an arbitrary 32-bit immediate value into a register, so the value has to be stored in memory (a literal pool) and loaded from there.
It is not a bug; it is completely standard.
Example:
typedef struct
{
volatile uint32_t reg1;
volatile uint32_t reg2;
}MYREG_t;
#define MYREG ((MYREG_t *)0xed00e000)
void foo(uint32_t val)
{
MYREG -> reg2 = val;
}
void bar(uint32_t val)
{
MYREG -> reg1 = val;
}
and generated code:
foo:
ldr r3, .L3
str r0, [r3, #4]
bx lr
.L3:
.word -318709760
bar:
ldr r3, .L6
str r0, [r3]
bx lr
.L6:
.word -318709760
The places where this data is stored are never reached by the code. The same applies to your code: it returns from the function before getting there (bx lr).
If you use a disassembly tool (as you did), it does not know that this is data and shows it as undefined instructions.
BTW, are you using arm-none-eabi-gdb? It shows nonsense values for the registers.
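A side note on reading the listing: the disassembler prints a 32-bit Thumb encoding as two 16-bit halfwords in fetch order, so a literal-pool entry displayed as 0xed00e000 is stored in memory as the little-endian word 0xE000ED00. That is the base address of the System Control Block, which matches the stores at offset 0x94 (MPU->CTRL) and offset 0x24 (SCB->SHCSR) in the question's disassembly. A small helper, with a made-up name, that converts the displayed value back into the data word:

```c
#include <stdint.h>

/* Convert a value printed as "<UNDEFINED> instruction: 0x........" for a
   32-bit Thumb slot back into the 32-bit data word stored at that address.
   The display is (first halfword << 16) | second halfword, while the
   little-endian data word has the first halfword in its low 16 bits. */
uint32_t thumb_display_to_word(uint32_t displayed)
{
    return (displayed << 16) | (displayed >> 16);
}
```

For example, thumb_display_to_word(0xed00e000) gives 0xE000ED00, and adding the offset 0x94 from the str.w instruction gives 0xE000ED94.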
What is the real benefit of using byte access (instead of word access) on a Cortex-M0+?
The LPC845 has one "BYTE" register and one "WORD" register for every package pin (GPIO).
If I want to set a GPIO using the WORD register, I would use something like this:
*((volatile uint32_t*)0xA0001088) = 1;
0000036e: movs r4, #137 ; 0x89
00000370: ldr r0, [pc, #80] ; (0x3c4 <AppInit+144>)
00000372: str r4, [r0, #0]
At the same time, I could use the BYTE register
(with this, the GPIO peripheral automatically ignores everything above 0xFF, writing 1 to the pin):
*((volatile uint8_t*)0xA000102C) = 1;
0000039c: ldr r3, [pc, #48] ; (0x3d0 <AppInit+156>)
0000039e: adds r1, #56 ; 0x38
000003a0: strb r1, [r3, #0]
Both techniques consume the same number of instructions.
Which one is best to use, and why?
I am working with an ARM device produced by Infineon. There seems to be a problem which I can't seem to find a solution to when configuring PLL. When configuring the register holding N, P and K value for a normal PLL mode, the code produces an interrupt and doesn't pause afterwards. Here is the code as shown in the Disassembler (Eclipse):
1333 SCU_PLL->PLLCON1 = (uint32_t)((SCU_PLL->PLLCON1 & ~(SCU_PLL_PLLCON1_NDIV_Msk | SCU_PLL_PLLCON1_K2DIV_Msk |
08000cc8: ldr r1, [pc, #252] ; (0x8000dc8 <XMC_SCU_CLOCK_StartSystemPll+400>)
08000cca: ldr r3, [pc, #252] ; (0x8000dc8 <XMC_SCU_CLOCK_StartSystemPll+400>)
08000ccc: ldr r2, [r3, #8]
08000cce: ldr r3, [pc, #252] ; (0x8000dcc <XMC_SCU_CLOCK_StartSystemPll+404>)
08000cd0: ands r3, r2
1334 SCU_PLL_PLLCON1_PDIV_Msk)) | ((ndiv - 1UL) << SCU_PLL_PLLCON1_NDIV_Pos) |
08000cd2: ldr r2, [r7, #4]
08000cd4: subs r2, #1
08000cd6: lsls r2, r2, #8
08000cd8: orrs r2, r3
1335 ((kdiv_temp - 1UL) << SCU_PLL_PLLCON1_K2DIV_Pos) |
08000cda: ldr r3, [r7, #16]
08000cdc: subs r3, #1
08000cde: lsls r3, r3, #16
1334 SCU_PLL_PLLCON1_PDIV_Msk)) | ((ndiv - 1UL) << SCU_PLL_PLLCON1_NDIV_Pos) |
08000ce0: orrs r2, r3
1336 ((pdiv - 1UL)<< SCU_PLL_PLLCON1_PDIV_Pos));
It seems like the code "breaks" on the following instruction:
08000cce: ldr r3, [pc, #252] ; (0x8000dcc <XMC_SCU_CLOCK_StartSystemPll+404>)
In other words, if I use the 'step into' function, it jumps to the following interrupt right before moving onto the 'ldr' instruction shown above. The following are the configurations of N, P and K values that I have used.
.syspll_config.n_div = 80U,
.syspll_config.p_div = 2U,
.syspll_config.k_div = 4U,
I've been told that the name of the handler doesn't mean much, but here is what Disassembler settles on after the program fails to execute line 08000cce.
08000298: b.n 0x8000298 <VADC0_G3_3_IRQHandler>
Also, here is what is shown in the console.
Starting target CPU...
Debugger requested to halt target...
...Target halted (PC = 0x08000298)
/.../
WARNING: Failed to read memory @ address 0xFFFFFFE8
WARNING: Failed to read memory @ address 0xFFFFFFE8
EDIT: Perhaps for the sake of completeness I would include a code snippet from system.c file that initializes PLL module with its default values, which works fine. It is very similar to the code shown in the first code pane of this question, perhaps with the exception of resetting the affected register values before writing new P, N and K values. I have divided the initialization code into two parts - resetting and setting the values; it appears that the code "breaks" during the reset phase.
SCU_PLL->PLLCON1 = ((PLL_NDIV << SCU_PLL_PLLCON1_NDIV_Pos) |
(PLL_K2DIV_24MHZ << SCU_PLL_PLLCON1_K2DIV_Pos) |
(PLL_PDIV << SCU_PLL_PLLCON1_PDIV_Pos));
The problem ended up being caused by a trap request (promoted to NMI) that was generated upon disconnecting the VCO (voltage-controlled oscillator) from the external oscillator OSC. Disconnecting the two hardware components is an important step in configuring the PLL registers. However, if the loss-of-lock trap request is not cleared and disabled, the following command will generate an interrupt:
/* disconnect Oscillator from PLL */
SCU_PLL->PLLCON0 |= (uint32_t)SCU_PLL_PLLCON0_FINDIS_Msk;
The command precedes the following line, which is what I thought was originally causing the problem, thus posting it in the question:
SCU_PLL->PLLCON1 = (uint32_t)((SCU_PLL->PLLCON1 & ~(SCU_PLL_PLLCON1_NDIV_Msk | SCU_PLL_PLLCON1_K2DIV_Msk | SCU_PLL_PLLCON1_PDIV_Msk))
Note that trap requests can help troubleshoot problems with the PLL module, so they need to be enabled again afterwards. However, a trap request is still generated regardless of whether the uC acts on it (as decided by the enable/disable bit). So, in order to restore the trap functionality, one needs to clear the pending request and then enable the trap again, as follows:
SCU_TRAP->TRAPCLR |= SCU_TRAP_TRAPCLR_SVCOLCKT_Msk;
SCU_TRAP->TRAPDIS &= ~SCU_TRAP_TRAPDIS_SVCOLCKT_Msk;
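Put together, the order of operations described above can be sketched as follows. This is a host-side illustration only: the two structs are stand-ins for the SCU_TRAP and SCU_PLL register blocks, the bit positions are placeholders rather than values from the device header, and the function name is made up. Reconnecting the oscillator and waiting for the PLL to lock are left out because the answer does not show them.

```c
#include <stdint.h>

/* Stand-ins for the device register blocks (assumption: on the target,
   SCU_TRAP and SCU_PLL from the device header would be used instead). */
typedef struct { volatile uint32_t TRAPDIS; volatile uint32_t TRAPCLR; } trap_regs_t;
typedef struct { volatile uint32_t PLLCON0; volatile uint32_t PLLCON1; } pll_regs_t;

/* Placeholder bit masks; take the real ones from the device header
   (SCU_TRAP_TRAPDIS_SVCOLCKT_Msk, SCU_PLL_PLLCON0_FINDIS_Msk). */
#define TRAP_SVCOLCKT_MSK  (1UL << 3)
#define PLLCON0_FINDIS_MSK (1UL << 4)

void pll_write_dividers(trap_regs_t *trap, pll_regs_t *pll, uint32_t pllcon1)
{
    /* 1. Disable the loss-of-lock trap so the disconnect cannot raise an NMI. */
    trap->TRAPDIS |= TRAP_SVCOLCKT_MSK;

    /* 2. Disconnect the oscillator from the PLL. */
    pll->PLLCON0 |= PLLCON0_FINDIS_MSK;

    /* 3. Write the new N, P and K divider values. */
    pll->PLLCON1 = pllcon1;

    /* (Reconnecting the oscillator and waiting for lock would go here.) */

    /* 4. Clear the trap request that was latched in the meantime ... */
    trap->TRAPCLR |= TRAP_SVCOLCKT_MSK;

    /* 5. ... and only then enable the trap again. */
    trap->TRAPDIS &= ~TRAP_SVCOLCKT_MSK;
}
```

The point of the ordering is that the trap is cleared before it is enabled again; enabling it first would let the request latched during the disconnect fire immediately.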
Along the way I have discovered this interesting article that may help anyone working on ARM uCs and facing unexpected interrupts: Debugging and Diagnosing Hard Fault & Other Exceptions.
I'm using a netif struct (similar to http://www.nongnu.org/lwip/structnetif.html) and I have a question related to alignment. I noticed that every int starts at an address that is a multiple of 4 (e.g. 0x20010db0). However, let's take a look at the following:
struct netif {
...
u8_t hwaddr_len (at address 0x20010db8)
u8_t[8] hwaddr (at address 0x20010db9)
u8_t mtu (at address 0x20010dc1)
...
}
From what I understand, hwaddr_len is aligned on 4 bytes, while hwaddr is "aligned" on 1 byte (because it is a u8_t, it is not aligned on 4 bytes, i.e. 32 bits) and mtu is "aligned" on 1 byte as well. After that, all the other members of the struct are aligned on 4 bytes again. So I think this should be fine, even though hwaddr is not at a multiple of 4 bytes, but when I try to do a memcpy from "src" to hwaddr, I get an unaligned access error.
I'm compiling with the ARM gcc compiler. Does anyone have an idea why it is failing?
PS: I don't have much knowledge about ARM alignment issues, sorry if my question seems obvious.
EDIT :
Version of the compiler : gcc-arm-none-eabi-4_9-2015q3
The section where it is failing :
lpwif_get_slla(struct lpwif *lpwif, void *lla, unsigned char lla_len)
{
WpanDeviceP dev = lpwif->dev->wpan;
unsigned char len = 0;
if (lla_len >= 8) {
if (lpwif->eui[0] == 0xff) {
/* Fetch WPAN Device's Long Address. */
uint64_t addr64;
memset(&addr64, 0xff, sizeof(addr64));
WpanGet(dev, WpanPibAttr_macExtendedAddress, &addr64, 8);
/* Always return address in network-byte order */
lpwif->eui[0] = (addr64 >> 56) & 0xff;
lpwif->eui[1] = (addr64 >> 48) & 0xff;
lpwif->eui[2] = (addr64 >> 40) & 0xff;
lpwif->eui[3] = (addr64 >> 32) & 0xff;
lpwif->eui[4] = (addr64 >> 24) & 0xff;
lpwif->eui[5] = (addr64 >> 16) & 0xff;
lpwif->eui[6] = (addr64 >> 8) & 0xff;
lpwif->eui[7] = (addr64 >> 0) & 0xff;
}
if (lpwif->eui[0] == 0xff) return 0; /* Device has no EUI-64 address. */
if (lla) memcpy(lla, lpwif->eui, 8);
}
And this method is called by lpwif_get_slla(&state->lpwif, netif->hwaddr, 8);
Disassembly:
if (lpwif->eui[0] == 0xff) return 0; /* Device has no EUI-64 address. */
10098ac: 6bfb ldr r3, [r7, #60] ; 0x3c
10098ae: f893 3024 ldrb.w r3, [r3, #36] ; 0x24
10098b2: 2bff cmp r3, #255 ; 0xff
10098b4: d101 bne.n 10098ba <lpwif_get_slla+0x10e>
10098b6: 2300 movs r3, #0
10098b8: e07f b.n 10099ba <lpwif_get_slla+0x20e>
if (lla) memcpy(lla, lpwif->eui, 8);
10098ba: 6bbb ldr r3, [r7, #56] ; 0x38
10098bc: 2b00 cmp r3, #0
10098be: d006 beq.n 10098ce <lpwif_get_slla+0x122>
10098c0: 6bfb ldr r3, [r7, #60] ; 0x3c
10098c2: 3324 adds r3, #36 ; 0x24
10098c4: 6bb8 ldr r0, [r7, #56] ; 0x38
10098c6: 4619 mov r1, r3
10098c8: 2208 movs r2, #8
10098ca: 4b3f ldr r3, [pc, #252] ; (10099c8 <lpwif_get_slla+0x21c>)
10098cc: 4798 blx r3
return 8;
10098ce: 2308 movs r3, #8
10098d0: e073 b.n 10099ba <lpwif_get_slla+0x20e>
Try replacing the memcpy with a simple for loop. The compiler is probably optimizing it on the assumption that the memory is aligned.
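A minimal sketch of such a loop, assuming nothing beyond standard C (the function name is made up). The pointers are declared volatile because an optimizing compiler may otherwise recognize the loop and turn it back into a memcpy call, which would defeat the purpose of the experiment:

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes one at a time, so only byte-wide loads and stores are needed
   and neither pointer has to be aligned. */
void copy_bytes(void *dst, const void *src, size_t n)
{
    volatile uint8_t *d = (volatile uint8_t *)dst;
    const volatile uint8_t *s = (const volatile uint8_t *)src;

    while (n-- > 0u) {
        *d++ = *s++;
    }
}
```

In the code from the question, the call would become copy_bytes(lla, lpwif->eui, 8). If the fault disappears, the memcpy implementation or the way it is called is the place to look next.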
Which is faster on ARM?
*p++ = (*p >> 7) * 255;
or
*p++ = ((*p >> 7) << 8) - 1
Essentially what I'm doing here is taking an 8-bit word and setting it to 255 if >= 128, and 0 otherwise.
If p is a char pointer, the statement below is just an assignment of 255:
*p++ = ((*p >> 7) << 8) - 1
If p is int, then of course it is a different story.
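To see why, here is the arithmetic written out on an 8-bit value, together with one well-defined way of getting the intended result. This is only a sketch and the function names are made up. Both helpers work on a value rather than on *p++, because reading p on the right-hand side while p++ modifies it on the left is unsequenced and therefore undefined behavior in C.

```c
#include <stdint.h>

/* The shift-and-subtract expression from the question, applied to a byte.
   x >> 7 is 0 or 1, so the result before truncation is either -1 or 255;
   both become 0xFF when stored into 8 bits. */
uint8_t shift_sub(uint8_t x)
{
    return (uint8_t)(((x >> 7) << 8) - 1);
}

/* Intended behavior: 0xFF if the top bit is set (x >= 128), 0x00 otherwise.
   Negating the 0/1 top bit gives all ones or all zeros. */
uint8_t msb_mask(uint8_t x)
{
    return (uint8_t)(0u - (uint8_t)(x >> 7));
}
```

Applied to a buffer, this would be written as *p = msb_mask(*p); p++; which keeps the read and the increment in separate statements.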
You can use GCC Explorer to see what the assembly output looks like. Below is apparently what you get from Linaro's arm-linux-gnueabi-g++ 4.6.3 with the -O2 -march=armv7-a flags:
void test(char *p) {
*p++ = (*p >> 7) * 255;
}
void test2(char *p) {
*p++ = ((*p >> 7) << 8) - 1 ;
}
void test2_i(int *p) {
*p++ = ((*p >> 7) << 8) - 1 ;
}
void test3(char *p) {
*p++ = *p >= 128 ? ~0 : 0;
}
void test4(char *p) {
*p++ = *p & 0x80 ? ~0 : 0;
}
creates
test(char*):
ldrb r3, [r0, #0] # zero_extendqisi2
sbfx r3, r3, #7, #1
strb r3, [r0, #0]
bx lr
test2(char*):
movs r3, #255
strb r3, [r0, #0]
bx lr
test2_i(int*):
ldr r3, [r0, #0]
asrs r3, r3, #7
lsls r3, r3, #8
subs r3, r3, #1
str r3, [r0, #0]
bx lr
test3(char*):
ldrsb r3, [r0, #0]
cmp r3, #0
ite lt
movlt r3, #255
movge r3, #0
strb r3, [r0, #0]
bx lr
test4(char*):
ldrsb r3, [r0, #0]
cmp r3, #0
ite lt
movlt r3, #255
movge r3, #0
strb r3, [r0, #0]
bx lr
If you care about such details, it is always best to check the assembly of the generated code. People tend to overestimate compilers. I agree that most of the time they do a great job, but it is anyone's right to question the generated code.
You should also be careful when interpreting instructions, since they won't always map onto a cycle-accurate listing due to architectural features of the core such as out-of-order execution and deep superscalar pipelines. So the shortest sequence of instructions might not always win.
Well, to answer the question in your title: on ARM, a SHIFT+SUB can be done in a single instruction with 1 cycle latency, while a MUL usually has a latency of multiple cycles. So the shift will usually be faster.
To answer the implied question of what C code to write for this, generally you are best off with the simplest code that expresses your intent:
*p++ = *p >= 128 ? ~0 : 0; // set byte to all ones iff >= 128
or
*p++ = *p & 0x80 ? ~0 : 0; // set byte to all ones based on the MSB
This will generally get converted by the compiler into the fastest way of doing it, whether that is a shift or a conditional move.
Although your compiler can optimize the line quite well (and reading the assembly will tell you what is really done), you can refer to this page to find out exactly how many cycles a MUL can take.