Debugging STM32 Cortex-M4: running the debugger causes a hard fault

I'm trying to debug some firmware I'm developing on an STM32 Nucleo-64 dev board which has an STM32L412RBT6 Cortex-M4 MCU, and I've been running into a very weird error. Basically, when I try to debug, I end up getting a hard fault on one specific instruction (details below). The odd thing is that this only happens when I'm debugging. If I reset the MCU and let it run without the debugger, it executes the line fine and moves on with the rest of the code.
My setup:
- Using the onboard ST-Link debugger on the Nucleo board
- STM32CubeIDE environment
- GDB debugger
The debugging process seems to be working fine other than this hard fault. I can start the debugger, step through some commands, see changes in variables in my watch list, turn on/off LEDs, etc.
The line that causes the hard fault when debugging is a SPI read-write command:
ret = HAL_SPI_TransmitReceive(&hspi1, (uint8_t *) txBuffer, (uint8_t *) rxBuffer, 4, 1000);
And the line within the HAL_SPI_TransmitReceive() function that causes the fault is the if statement:
if ((((HAL_GetTick() - tickstart) >= Timeout) && ((Timeout != HAL_MAX_DELAY)))
    || (Timeout == 0U))
{
    errorcode = HAL_TIMEOUT;
    goto error;
}
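To reason about that condition in isolation, here is a host-side sketch that restates it as a pure function. The names `hal_style_timed_out` and `HAL_MAX_DELAY_SKETCH` are hypothetical stand-ins; in the real HAL the current tick comes from `HAL_GetTick()` and the constant is `HAL_MAX_DELAY`. The check is plain unsigned 32-bit arithmetic, which stays correct even when the tick counter wraps around.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for HAL_MAX_DELAY (0xFFFFFFFFU means "wait forever"). */
#define HAL_MAX_DELAY_SKETCH 0xFFFFFFFFU

/* Returns true when the HAL-style timeout condition would fire.
 * 'now' stands in for HAL_GetTick(); 'tickstart' is the tick captured
 * when the transfer started. Unsigned subtraction yields the correct
 * elapsed time even if the 32-bit tick counter has wrapped. */
static bool hal_style_timed_out(uint32_t now, uint32_t tickstart, uint32_t timeout)
{
    return (((now - tickstart) >= timeout) && (timeout != HAL_MAX_DELAY_SKETCH))
           || (timeout == 0U);
}
```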
I've just started working on this project and was just trying to run a quick SPI loopback test to make sure the driver was working properly. There is very little else in the project at the moment, just some initialization for the clock, GPIO, and SPI. I'm confident the code is sound, and again it is working fine when I'm not actively debugging.
I don't have any experience debugging these types of errors. After some searching, I arrived at the Cortex fault status registers. The Configurable Fault Status Register (CFSR) has a value of 0x0008 0000, which corresponds to the NOCP bit being set, indicating that a coprocessor instruction was issued but the coprocessor was disabled or not present. I'm not sure how this could be relevant. Is there something I'm missing? The Hard Fault Status Register (HFSR) has a value of 0x4000 0000, with the FORCED bit set (meaning a configurable fault was escalated to a hard fault).
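When reading these registers, it can help to decode them bit by bit instead of by hand. The sketch below is a hypothetical helper (`decode_bits` and the two tables are illustrative names) covering a selection of the ARMv7-M CFSR and HFSR bits; on the target you would pass in `SCB->CFSR` and `SCB->HFSR` from CMSIS. For the values above it reports NOCP for the CFSR and FORCED for the HFSR. On a Cortex-M4 the FPU is accessed as coprocessors CP10 and CP11, so NOCP is what a floating-point instruction raises when access to them is not enabled in CPACR.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t mask; const char *name; } fault_bit_t;

/* Selected CFSR bits (ARMv7-M): MMFSR = bits 0-7, BFSR = 8-15, UFSR = 16-31. */
static const fault_bit_t cfsr_bits[] = {
    {1u << 0,  "IACCVIOL"},    {1u << 1,  "DACCVIOL"},
    {1u << 3,  "MUNSTKERR"},   {1u << 4,  "MSTKERR"},
    {1u << 5,  "MLSPERR"},     {1u << 7,  "MMARVALID"},
    {1u << 8,  "IBUSERR"},     {1u << 9,  "PRECISERR"},
    {1u << 10, "IMPRECISERR"}, {1u << 11, "UNSTKERR"},
    {1u << 12, "STKERR"},      {1u << 13, "LSPERR"},
    {1u << 15, "BFARVALID"},   {1u << 16, "UNDEFINSTR"},
    {1u << 17, "INVSTATE"},    {1u << 18, "INVPC"},
    {1u << 19, "NOCP"},        {1u << 24, "UNALIGNED"},
    {1u << 25, "DIVBYZERO"},
};

/* HFSR bits (ARMv7-M). */
static const fault_bit_t hfsr_bits[] = {
    {1u << 1, "VECTTBL"}, {1u << 30, "FORCED"}, {1u << 31, "DEBUGEVT"},
};

/* Writes the names of all set bits, space-separated, into out (out_len > 0). */
static void decode_bits(uint32_t value, const fault_bit_t *bits, size_t n,
                        char *out, size_t out_len)
{
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        if (value & bits[i].mask) {
            if (out[0] != '\0')
                strncat(out, " ", out_len - strlen(out) - 1);
            strncat(out, bits[i].name, out_len - strlen(out) - 1);
        }
    }
}
```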
When I reset the MCU within the debug environment, it immediately goes to the hard fault handler. In that case, the CFSR has a value of 0x8200, with the BFARVALID and PRECISERR bits set. With BFARVALID set, the Bus Fault Address Register (BFAR) holds the address that triggered the fault, and it has a value of 0x2000 a008. Here's where I get really confused: the memory ranges for this chip are flash program memory from 0x0800 0000 to 0x0802 0000 and SRAM from 0x2000 0000 to 0x2000 a000, so the address 0x2000 a008 is outside both ranges. I'm really stumped by this and not sure where to go from here. Any suggestions are much appreciated.
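The range check itself is simple arithmetic. This sketch uses only the ranges quoted in the question and treats each end address as exclusive (an assumption: 0x2000 a000 is read as the first address past the 40 KB of SRAM). `region_of` is a hypothetical helper. It confirms that 0x2000 a008 falls in neither range, lying 8 bytes past the stated end of SRAM.

```c
#include <stddef.h>
#include <stdint.h>

/* Memory ranges as quoted in the question; end addresses treated as exclusive. */
typedef struct {
    const char *name;
    uint32_t start;
    uint32_t end;
} mem_region_t;

static const mem_region_t regions[] = {
    {"flash", 0x08000000u, 0x08020000u},
    {"sram",  0x20000000u, 0x2000A000u},
};

/* Returns the name of the region containing addr, or NULL if none does. */
static const char *region_of(uint32_t addr)
{
    for (size_t i = 0; i < sizeof regions / sizeof regions[0]; i++) {
        if (addr >= regions[i].start && addr < regions[i].end)
            return regions[i].name;
    }
    return NULL;
}
```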

Related

Hard fault in STM32F101RF due to MRC2 Disassembly?

I have bootloader code in which I send and receive data via USART. I have configured the USART to operate in interrupt mode.
The USART functionality works perfectly fine on its own; I verified this with multiple read/write instances.
When I integrate the USART code with my bootloader code, the bootloader keeps checking whether there is any pending data to read from the USART.
If there is pending data, the bootloader reads the data register (DR) for the data already received through the interrupt (a kind of polling + interrupt).
My problem:
Whenever a USART receive interrupt is triggered, a hard fault occurs inside the receive interrupt service routine.
The PC indicates that the hard fault occurs inside the routine where I read data from DR.
Strangely, starting from the location where the hard fault hits, the disassembly shows only MRC2 instructions.
Is the issue occurring because of this? The hard fault hits at 0x8004802.
Kindly enlighten me on this
Look at the raw hex values: they are 0xFF all the way, starting at the suspicious address 0x8004800, which is certainly a page boundary.
In other words: The flash memory is bad or was erased and not completely written. Verifying the flashed program (bootloader) should fail.
If that was in your bootloader code, it might have tried to overwrite itself - or simply erased the wrong memory page.
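A quick way to test the erased-flash theory is to check whether the faulting address sits on a page boundary and whether the surrounding bytes all read back as 0xFF. This is a sketch with hypothetical helper names; the 2 KB page size is an assumption for this device, so confirm it in the reference manual for the exact part before relying on it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flash page size for this example (2 KB). Check the reference
 * manual for the exact part; not every STM32F1 uses this page size. */
#define FLASH_PAGE_SIZE_SKETCH 0x800u

/* Start address of the flash page that contains addr. */
static uint32_t page_base(uint32_t addr)
{
    return addr & ~(FLASH_PAGE_SIZE_SKETCH - 1u);
}

/* True if every byte in buf has the erased-flash value 0xFF. */
static bool looks_erased(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0xFFu)
            return false;
    }
    return true;
}
```

On the target, `buf` would point at the flash region around the fault address (for example starting at `page_base(0x08004802)`).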

How do I exit from an ARM fault handler?

I am using STM32F746, an ARM cortex-M7 based processor. I am trying to do something hacky, which requires me to return to the program from a MemManage Fault handler.
When entering MemManage Fault handler, the PC before the fault and everything I need is stored on the stack. So I thought I can simply recover those to return to the previous execution point.
However, I cannot properly restore the xPSR.
The CPSR from before the fault handler is saved on the stack, so I tried restoring it using the MSR instruction.
I tried both MSR xpsr, r12 and MSR apsr, r12.
However, this only restores the flags and not the other parts of the CPSR, such as the GE or the system mode bits.
(My mode bits also seem weird: my xPSR shows as 0x61070004, but as far as I understand, the last 5 bits cannot be 0x04.)
How can I go back to the program point before the fault handler? I also tried popping the pc but it does not work and I think the problem is CPSR not getting properly restored.
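When a Cortex-M7 enters an exception handler, the execution context is saved on the stack and, of course, restored when exiting the handler (from the ARM Cortex-M7 Programming Manual): R0, R1, R2, R3, R12, LR, the return address (PC) and xPSR are pushed, followed by the floating-point registers when floating-point context is active.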
As you see, the xPSR is restored after the return from exception.
Furthermore, faults are a subset of the exceptions.
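For reference, the basic exception stack frame that ARMv7-M hardware pushes on exception entry can be written as a C struct, lowest address first. This is a sketch with hypothetical names (`exception_frame_t`, `xpsr_exception_number`); it omits the extended frame stacked when floating-point context is active. It also explains the value in the question: on Cortex-M the low bits of xPSR are not ARM mode bits but the IPSR exception number (bits 8:0), and for 0x61070004 that number is 4, the MemManage exception.

```c
#include <stddef.h>
#include <stdint.h>

/* Basic (non-FP) exception stack frame, lowest address first. When FP
 * context is active, S0-S15, FPSCR and a reserved word follow; they are
 * not modelled here. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* LR of the interrupted code */
    uint32_t pc;    /* return address: where execution resumes */
    uint32_t xpsr;  /* restored by hardware on exception return */
} exception_frame_t;

/* IPSR exception number held in bits 8:0 of a stacked xPSR (0 = Thread mode). */
static uint32_t xpsr_exception_number(uint32_t xpsr)
{
    return xpsr & 0x1FFu;
}
```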
You can do a simple test: deliberately dereference an invalid pointer. It will trigger a HardFault. Modify your HardFault handler to just return and do nothing. You can check that the context is restored. I tried this on an STM32H753 and it works fine: the lowest xPSR bits (ISR_NUMBER) are indeed 0 (Thread mode).
Be careful, though: I don't know about MemManage, but a HardFault returns to the very same instruction that triggered the fault (and not to the following instruction, as with a regular exception). This means the same instruction executes again after the HardFault handler returns.
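If the goal is to resume after the faulting instruction instead of retrying it, one common approach is to advance the stacked PC by the size of that instruction before returning. Whether skipping is safe depends entirely on what that instruction was meant to do. The sketch below (hypothetical helper names) shows only the size calculation: in Thumb-2, a first halfword whose top five bits are 0b11101, 0b11110 or 0b11111 starts a 32-bit instruction, and anything else is a 16-bit instruction. In a real handler the halfword would be read from the stacked PC address.

```c
#include <stdint.h>

/* Size in bytes of the Thumb instruction whose first halfword is hw. */
static uint32_t thumb_insn_size(uint16_t hw)
{
    uint16_t top5 = (uint16_t)(hw >> 11);
    return (top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu) ? 4u : 2u;
}

/* Advances a stacked return address past the faulting instruction.
 * first_halfword must be the halfword stored at address pc. */
static uint32_t skip_faulting_insn(uint32_t pc, uint16_t first_halfword)
{
    return pc + thumb_insn_size(first_halfword);
}
```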

STM32F4 FSMC/FMC SRAM as Heap/Stack results in random hardfaults

We are currently evaluating the use of an external SRAM for C/C++ heap storage on our platform, which uses an STM32F439BI microcontroller.
The problem
Using the SRAM as heap storage results in random hard faults, raised from bus errors/imprecise bus errors.
Without placing the heap in the SRAM, memory tests run successfully on the whole SRAM (8-bit, 16-bit and 32-bit accesses).
With a debugger connected, I can sometimes observe these errors before a hard fault occurs. Most often a word is read from the SRAM and the CPU register is filled with values of the form 0x-1F3-1F3 (where - is most often '0', sometimes 'A' or '6'). The '1F3' pattern persists. If the same address is read again a few lines further down, the correct value is read (some other address in the 0x60000000 space).
If I stop the program on a breakpoint at some point early in the program and step a few lines, I get these errors more frequently.
Further details
The SRAM is connected using the FMC/FSMC peripheral on FMC bank 1 and SRAM bank 1 and is therefore memory-mapped to address 0x60000000.
All settings for GPIO pins and FMC configuration are set from the startup file before main() executes or static objects are created.
The SRAM is the following: CY7C1041GN30
We connect all 16 data pins, all 18 address pins, BHE, BLE, OE, WE and CE to our controller. All pins are configured as push-pull alternate function, pull-up, AF_12 (FMC), very high speed. We enable clocks for all necessary pins and the clock for the FMC. Note: we initially started out without pull-up/pull-down, with the same symptoms.
The controller runs with a clock speed of 168 MHz
As stated above, a memory test runs successfully
We use DMA for SPI, I2C and ADC data transfers
We frequently use interrupts, including external (pin) interrupts
We use the following timing settings:
AddressSetupTime: 2
AddressHoldTime: 4
DataSetupTime: 4
BusTurnAroundDuration: 1
CLKDivision: 2
DataLatency: 2
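To compare these settings with the SRAM datasheet, it helps to convert them to nanoseconds. This sketch assumes that each timing value counts HCLK cycles and that HCLK runs at the 168 MHz stated above; `cycles_to_ns` is a hypothetical helper. One cycle is then about 5.95 ns, so DataSetupTime 4 corresponds to roughly 23.8 ns.

```c
#include <stdint.h>

/* Converts a number of clock cycles to nanoseconds at clk_hz.
 * Assumption: FMC timing fields count HCLK cycles and HCLK = 168 MHz. */
static double cycles_to_ns(uint32_t cycles, double clk_hz)
{
    return (double)cycles * 1e9 / clk_hz;
}
```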
We configure the FMC as follows:
NSBank FMC_NORSRAM_BANK1,
DataAddressMux FMC_DATA_ADDRESS_MUX_DISABLE,
MemoryType FMC_MEMORY_TYPE_SRAM,
MemoryDataWidth FMC_NORSRAM_MEM_BUS_WIDTH_16,
BurstAccessMode FMC_BURST_ACCESS_MODE_DISABLE,
WaitSignalPolarity FMC_WAIT_SIGNAL_POLARITY_LOW,
WrapMode FMC_WRAP_MODE_DISABLE,
WaitSignalActive FMC_WAIT_TIMING_BEFORE_WS,
WriteOperation FMC_WRITE_OPERATION_ENABLE,
WaitSignal FMC_WAIT_SIGNAL_DISABLE,
ExtendedMode FMC_EXTENDED_MODE_DISABLE,
AsynchronousWait FMC_ASYNCHRONOUS_WAIT_DISABLE,
WriteBurst FMC_WRITE_BURST_DISABLE,
ContinuousClock FMC_CONTINUOUS_CLOCK_SYNC_ASYNC,
WriteFifo 0,
PageSize 0
We spent a lot of time experimenting with longer timings and compared all the settings to examples, including this one: Using STM32L476/486 FSMC peripheral to drive external memories (although this one is for the STM32L4, I am fairly certain it applies to this controller as well).
Findings on similar problems
The problem sounds very similar to this errata sheet entry: "2.3.4 Corruption of data read from the FMC", but the errata sheet also says the issue is fixed in our revision of the controller (3).
I hope someone out there has seen this strange behaviour before and can help us. After more than a week of debugging, we suspect some kind of error in the controller when interrupt/DMA accesses occur while the CPU is accessing the SRAM (when we use it as heap, it is accessed very frequently). Hopefully you can shed some light on this topic.
Sorry for not getting back to you, internet.
Yes, we found out what the issue was (at least in our case). The problem was that the J-Link debugger we use causes problems if it hangs over the power electronics on our PCB (it is mounted vertically). If we route the ribbon cable out at the top (where there is only digital electronics), the error disappears. So our guess is that noise from the power electronics was picked up by the cable and injected directly into the JTAG port, which caused failures inside the MCU.
Just got confirmation from ST that there is a bug in the STM32F469 FMC that might cause incorrect values if the write FIFO is disabled. The workaround is to keep the FIFO enabled. It is the same issue as in this F7 processor: https://www.st.com/resource/en/errata_sheet/dm00145382.pdf

How to correctly use a startup-ipi to start an application processor?

My goal is to have my own kernel start an application CPU. It uses the same mechanism as the Linux kernel:
Send an asserting, level-triggered INIT IPI
Wait...
Send a deasserting, level-triggered INIT IPI
Wait...
Send up to two startup IPIs with vector number (0x40000 >> 12) (the entry code for the application processor is located there)
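The ICR values behind these three steps can be sketched as follows, using the xAPIC ICR low-dword fields from the Intel SDM (vector in bits 7:0, delivery mode in bits 10:8, level in bit 14, trigger mode in bit 15). The macro and function names are hypothetical, and the code only computes the values; an actual kernel would write the destination APIC ID to the ICR high dword first and then the low dword, which sends the IPI.

```c
#include <stdint.h>

/* xAPIC ICR low-dword fields (Intel SDM, Vol. 3). */
#define ICR_DM_INIT      (5u << 8)   /* delivery mode 101b: INIT */
#define ICR_DM_STARTUP   (6u << 8)   /* delivery mode 110b: Start-up */
#define ICR_LEVEL_ASSERT (1u << 14)  /* level: assert */
#define ICR_TRIG_LEVEL   (1u << 15)  /* trigger mode: level */

/* SIPI vector: the AP starts at physical address vector << 12, so the
 * entry point must be 4 KiB aligned and below 1 MiB. */
static uint32_t sipi_vector(uint32_t entry_phys)
{
    return (entry_phys >> 12) & 0xFFu;
}

static uint32_t icr_init_assert(void)
{
    return ICR_DM_INIT | ICR_LEVEL_ASSERT | ICR_TRIG_LEVEL;
}

static uint32_t icr_init_deassert(void)
{
    return ICR_DM_INIT | ICR_TRIG_LEVEL;
}

static uint32_t icr_startup(uint32_t entry_phys)
{
    return ICR_DM_STARTUP | sipi_vector(entry_phys);
}
```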
Currently I'm just interested in making it work with QEMU. Unfortunately, instead of jumping to 0x40000, the application CPU jumps to 0x0 with the CS register set to 0x4000 (I checked with gdb).
The Intel MultiProcessor Specification (B.4.2) explains that the behavior I noticed is valid if the target processor is halted immediately after RESET or INIT. But shouldn't this also apply to the Linux kernel code? It sends the startup IPI after the INIT IPI. Or do I misunderstand the specification?
What can I do to make the application processor jump to 0x000VV000 and not to 0x0 with the CS register set to 0xVV00? I really can't see where Linux does something that changes this behavior.
It seems that I really misunderstood the specification: since the application CPU starts in real mode, 0x000VV000 is equivalent to 0xVV00:0x0000. The address cannot be represented in the 16-bit IP register alone, so a code segment base is required.
Additionally, debugging real-mode code with gdb is comparatively complicated, because gdb does not take the segment base into account. To see the disassembled code of the trampoline at the current position, you have to calculate the physical location:
x/20i $eip+0xVV000
This makes gdb print the next 20 instructions at 0xVV00:$eip.
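The address arithmetic from this answer can be written out as a small sketch (hypothetical helper names): a SIPI with vector VV starts the AP in real mode with CS = VV00h and IP = 0, and the physical address is CS * 16 + IP. The same formula is what the `$eip+0xVV000` expression above computes by hand for gdb.

```c
#include <stdint.h>

/* Real-mode physical address: segment * 16 + offset. */
static uint32_t real_mode_phys(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/* After a SIPI with vector VV, the AP starts with CS = VV00h and IP = 0. */
static uint16_t sipi_cs(uint8_t vector)
{
    return (uint16_t)((uint16_t)vector << 8);
}
```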

How do I set a register value to enable an interrupt?

I am handling interrupt for a device in Android. (Android 4.2.2 Kernel 2.6.29, running on Mach-Goldfish virtual device).
So far I have registered my device with the interrupt #17. It hasn't been enabled yet so signals sent to this interrupt are ignored and my interrupt handler is not notified.
The register that enables my device is at offset 0x00, and the memory address, as returned by
(char __iomem *)IO_ADDRESS(resource->start - IO_START)
starts at 0xFE016000.
I tried: (in mydevice_probe())
writel(0x07, 0xFE016000);
//0x07 is a mask to enable three sub-devices at bit 0, bit 1 and bit 2.
But the kernel crashed right away. The following writels also did not work:
writel(0x00, 0xFE016000);
writel(0x01, 0xFE016000);
What did I miss? Could anyone show me how to get this done? In case I got the start address wrong, could you point out how to get it correctly?
Thanks.
P.S. The kernel panic:
qemu: fatal: mydevice_write: Bad offset fea000
R00=c02ef00b R01=00000000 R02=00000007 R03=e0808000
R04=c0340864 R05=c031e3b0 R06=c0173b6c R07=c031e3cc
R08=00000000 R09=00100100 R10=00000000 R11=df827e34
R12=ff016000 R13=df827e18 R14=c002e96c R15=c0030aac
PSR=20000013 --C- A svc32
Aborted (core dumped)
This is to close my question.
It turns out that the emulator is faulty.
Normally, writel(MASK, IO_ADDRESS(resource->start - IO_START)); should work.
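As an illustration of that pattern, here is a host-side sketch with hypothetical names: the enable mask is built from the three sub-device bits (bits 0, 1 and 2, giving 0x07) and written at offset 0x00 from a base. `fake_writel` only stands in for the kernel's `writel()`, which takes the value first and a `void __iomem *` address second; in the driver the base would be the mapped pointer obtained via `IO_ADDRESS(...)`.

```c
#include <stdint.h>

/* Hypothetical register layout from the question: the enable register
 * sits at offset 0x00, with one enable bit per sub-device. */
#define MYDEV_REG_ENABLE  0x00u
#define MYDEV_SUBDEV(n)   (1u << (n))
#define MYDEV_ENABLE_ALL  (MYDEV_SUBDEV(0) | MYDEV_SUBDEV(1) | MYDEV_SUBDEV(2))

/* Host-side stand-in for writel(value, base + offset): stores a 32-bit
 * value into a simulated register bank indexed by byte offset. */
static void fake_writel(uint32_t value, volatile uint32_t *regs, uint32_t offset)
{
    regs[offset / sizeof(uint32_t)] = value;
}
```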
