STM8 ASM safely execute WFE - c

I have C code running from RAM in low-power run mode (so interrupts are not handled). This mode is enabled by the following code sequence:
jump to RAM
SIM
switch off the internal flash and the power regulator, switch to the low-speed clock source (LSE)
do some work in WFE mode (low-power wait mode)
switch the power regulator and flash back on, restore the clock source
RIM
jump to flash
So there is no problem with the WFE instruction as described in the errata sheet. The problem is with this construction, which may cause the CPU to lock in low-power wait mode forever:
while nbit(TIM1_SR1,CC3IF) asm("wfe");
which disassembles to:
000035 720252B602 BTJT TIM1_SR1, #1, 0xB6
00003A 728F WFE
The event from the timer is probabilistic in nature, and this code doesn't guarantee that it will happen only after the WFE instruction has been executed:
the BTJT instruction executes in 2 cycles and is 5 bytes long;
code executed from RAM may not run continuously, because "fetch" states pause execution for a few cycles.
I'm using manual PM0044, and page 26 contains a handy table of pipeline behaviour:
There are 2 cases where code execution stalls for 3 cycles. So I'm not sure that my asynchronous wakeup event won't occur between the BTJT and WFE instructions.
Is there a way to ensure a strict logical sequence (check condition > WFE > wakeup event)?

If your lockup problems are caused by the WFE errata I mentioned then there should be an easier solution than trying to achieve "proper application timing".
The errata provided by STMicroelectronics reads:
Two types of failures can occur:
Case 1: In case WFE instruction is placed in the two MSB of the
32-bit word within the memory, an event which occurs during the WFE
execution cycle or re-execution cycle (when returning from ISR
handler) will cause an incorrect code execution.
Case 2: An interrupt request, which occurs during the WFE
execution cycle will lead to incorrect code execution. This is also
valid for the WFE re-execution cycle, while returning from an ISR
handler
Case 2 shouldn't apply in your case as you say "Interrupts are not handled because I use low power run mode". If interrupts can't occur during the WFE instruction only the failure described in the first case could be causing your lockups.
Case 1 only applies if the WFE instruction is in a certain alignment within a 32-bit word in memory. So if you can ensure the WFE instruction never appears in code aligned this way, then you won't encounter this failure. If your assembler supports an align directive you could use it to achieve this, maybe along with a label and a jump if the assembler doesn't insert NOPs. However, an easier solution is given as a "dedicated workaround" in the errata:
Replace the WFE instruction with
WFE
JRA next
next:
This appears to work around the failure by putting what amounts to a 2-byte NOP after the WFE instruction. My guess is the failure results in the CPU not executing the instruction immediately following the WFE instruction, instead skipping ahead two bytes to the instruction (if any) at the start of the next 32-bit word. Putting a 2-byte NOP in the space skipped over means it doesn't matter whether the failure occurs or not.
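Applied to the original polling loop, the workaround might look like the following sketch. This is untested: it assumes the compiler passes the inline-asm string through verbatim and accepts a local label (toolchains differ here), and nbit/CC3IF are the OP's macro and bit name.

/* Sketch only: the errata workaround folded into the polling loop.
 * Adjust to your toolchain's inline assembly syntax. */
while (nbit(TIM1_SR1, CC3IF)) {
    asm("wfe       \n"    /* wait for the timer event                 */
        " jra next \n"    /* errata workaround: 2-byte NOP-equivalent */
        "next:       ");
}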

Solution found by OP:
I have read the errata (thanks to Ross Ridge) a few more times, more attentively, and this is the main idea:
General solution is to ensure no interrupt request or event occurs during WFE instruction execution or re-execution cycle by proper application timing.

Related

Cortex-M0+ not responding to PendSV

I'm running on a Raspberry Pi Pico (RP2040, Cortex-M0+ core, debugging via VSCode cortex-debug using JLink SWD), and I'm seeing strange behaviour regarding PendSV.
Immediately prior, the SVCall exception handler requested PendSV via the ICSR register. But on exception return, rather than tail-chaining the PendSV, execution instead returns to the calling code and continues non-exception execution.
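For reference, the request amounts to setting PENDSVSET in ICSR, roughly like this sketch (register address and bit position per the ARMv6-M reference manual; not my exact code):

#include <stdint.h>

#define ICSR            (*(volatile uint32_t *)0xE000ED04u)   /* Interrupt Control and State Register */
#define ICSR_PENDSVSET  (1u << 28)                            /* write 1 to set PendSV pending */

static void request_pendsv(void)
{
    ICSR = ICSR_PENDSVSET;                  /* request PendSV; normally tail-chained on exception return */
    __asm volatile ("dsb" ::: "memory");    /* make sure the write has completed */
    __asm volatile ("isb" ::: "memory");
}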
All the while the ICSR register shows the pending PendSV, even while thread code instructions are repeatedly stepped. System handler priorities are all zero, IRQ priorities are lower.
According to the ARMv6-M reference manual, PendSV cannot be disabled.
So, what am I missing that would cause this behaviour?
Edited to add:
Perhaps it's a debugger interaction? The JLink software (v4.95d) is still in Beta...
I see that the debugger can actually disable PendSV and Systick - C1.5.1 Debug Stepping: "Optionally, the debugger can set DHCSR.C_MASKINTS to 1 to prevent PendSV, SysTick, and external configurable interrupts from occurring. This is described as masking these interrupts. Table C1-7 on page C1-326 summarizes instruction stepping control."
It turns out that the problem is caused by single-stepping the instruction that writes to the PENDSVSET bit in the ICSR: the bit is set, and the VECTPENDING field shows 0xe, but the PendSV never fires.
Free-running over that instruction to a later breakpoint sees the PendSV fire correctly.
So it is indeed a debugger interaction.
Whether that's to do with interrupts being inhibited as #cooperised suggests isn't clear - the DHCSR's C_MASKINTS bit reads as zero throughout, but how that bit is manipulated during the actual step operation isn't visible at this level.
Which makes me wonder whether the way the JLink is performing the step induces unpredictable/indeterminate behaviour - e.g. as per the warning in the C_MASKINTS description. Or perhaps this is simply what happens in an M0+ under these circumstances, and I've never single-stepped this instruction before.
In any case, the workaround is simply to not single-step the instruction that sets PENDSVSET.
Edited to add:
Finally, #cooperised was correct.
On taking more care to distinguish exactly between stepping (including stepping over function calls) and running (including running to the very next instruction), it's clear that stepping disables interrupts including PendSV.
The same thing happened to me, but I found that the reason was that I was not closing the previous PendSV interrupt by returning through LR containing 0xFFFFFFF9. Instead I was returning via the PC to a previous routine's return address.
Since I did not return via 0xFFFFFFF9, the previous PendSV was never properly closed and subsequent ones were not recognized.
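In other words, a handler written in assembly (or as a naked C function) must return by branching to the EXC_RETURN value the core placed in LR on entry. A minimal sketch (GCC syntax; the handler body is omitted):

__attribute__((naked)) void PendSV_Handler(void)
{
    __asm volatile(
        /* ... context save / switch / restore would go here ... */
        "bx lr \n"   /* LR still holds EXC_RETURN (e.g. 0xFFFFFFF9): this closes the exception */
    );
}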

DSB on ARM Cortex M4 processors

I have read the ARM documentation and it appears that they say in some places that the Cortex M4 can reorder memory writes, while in other places it indicates that M4 will not.
Specifically I am wondering if the DSB instruction is needed, like:
volatile int flag = 0;
char buffer[10];

void foo(char c)
{
    __ASM volatile ("dsb" : : : "memory");
    __disable_irq(); // disable IRQ as we use flag in ISR
    buffer[0] = c;
    flag = 1;
    __ASM volatile ("dsb" : : : "memory");
    __enable_irq();
}
Uh, it depends on what your flag is, and it also varies from chip to chip.
In case that flag is stored in memory:
DSB is not needed here. An interrupt handler that would access flag would have to load it from memory first. Even if your previous write is still in progress the CPU will make sure that the load following the store will happen in the correct order.
If your flag is stored in peripheral memory:
Now it gets interesting. Let's assume flag is in some hardware peripheral. A write to it may make an interrupt pending or acknowledge an interrupt (aka clear a pending interrupt). Contrary to the memory example above, this effect happens without the CPU having to read the flag first. So the automatic ordering of stores and loads won't help you. Also, writes to flag may take effect with a surprisingly long delay due to different clock domains between the CPU and the peripheral.
So the following scenario can happen:
you write flag=1 to clear a handled interrupt.
you enable interrupts by calling __enable_irq()
interrupts get enabled, write to flag=1 is still pending.
wheee, an interrupt is pending and the CPU jumps to the interrupt handler.
flag=1 takes effect. You're now in an interrupt handler without anything to do.
Executing a DSB in front of __enable_irq() will prevent this problem because whatever is triggered by flag=1 will be in effect before __enable_irq() executes.
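A sketch of that pattern, using CMSIS-style intrinsics (TIMx and TIM_SR_UIF are stand-ins for whatever peripheral register actually backs your flag, not names from the question):

void acknowledge_and_reenable(void)
{
    __disable_irq();
    TIMx->SR = ~TIM_SR_UIF;   /* hypothetical: acknowledge the peripheral interrupt */
    __DSB();                  /* drain the write so it reaches the peripheral first */
    __enable_irq();           /* safe now: the stale pending state is gone */
}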
If you think that this case is purely academic: Nope, it's real.
Just think about a real-time clock. These usually run at 32 kHz. If you write into its peripheral space from a CPU running at 64 MHz, it can take a whopping 2000 cycles before the write takes effect. Now for real-time clocks the data-sheet usually shows specific sequences that make sure you don't run into this problem.
The same thing can however happen with slow peripherals.
My personal anecdote happened when implementing power-saving late in a project. Everything was working fine. Then we reduced the peripheral clock speed of the I²C and SPI peripherals to the lowest possible speed we could get away with. This can save lots of power and extend battery life. What we found out was that suddenly interrupts started to do unexpected things. They seemed to fire twice each time, wreaking havoc. Putting a DSB at the end of each affected interrupt handler fixed this because - you can guess - the lower clock speed caused us to leave the interrupt handlers before the write clearing the interrupt source had taken effect, due to the slow peripheral clock.
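The fix looked roughly like this (a sketch; SPIx_IRQHandler and clear_interrupt_source are placeholders, not the project's real names):

void SPIx_IRQHandler(void)
{
    clear_interrupt_source();   /* hypothetical: writes the peripheral's acknowledge register */
    /* ... handle the data ... */
    __DSB();                    /* don't return until the acknowledge write has reached the slow peripheral */
}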
This section of the Cortex M4 generic device user guide enumerates the factors which can affect reordering.
the processor can reorder some memory accesses to improve efficiency, providing this does not affect the behavior of the instruction sequence.
the processor has multiple bus interfaces
memory or devices in the memory map have different wait states
some memory accesses are buffered or speculative.
You should also bear in mind that both DSB and ISB are often required (in that order), and that C does not make any guarantees about the ordering (except in-thread volatile accesses).
You will often observe that the short pipeline and instruction sequences can combine in such a way that the race conditions seem unreachable with a specific compiled image, but this isn't something you can rely on. Either the timing conditions might be rare (but possible), or subsequent code changes might change the resulting instruction sequence.

ARM WFI won't sleep

I am trying to enter standby mode on a Cortex-M4. The normal behaviour is that the device wakes up about every 2 minutes but on my latest FW release, it seems that the code is "randomly" stuck.
After investigation it seems that the code passes the WFI instruction without going to standby (no standby => no reset => infinite loop => ... => 42).
After much unclear spec reading, my understanding is that WFI may not go to sleep if there are pending interrupts.
Can you confirm the last sentence?
How do I ensure all pending interrupts are cleared before calling WFI?
There are three conditions that cause the processor to wake up from a WFI instruction:
a non-masked interrupt occurs and its priority is greater than the current execution priority (i.e. the interrupt is taken)
an interrupt masked by PRIMASK becomes pending
a Debug Entry request.
If any of the wake up conditions are true when the WFI instruction executes, then it is effectively a NOP (i.e. you don't go to sleep).
As for making sure that no interrupts are pending, that's something your code must do. Usually it means making sure that the interrupt source is satisfied so that it does not assert its interrupt request, and then clearing the necessary pending bit. You can see what is pending by reading the interrupt pending registers, but interrupt handlers are usually tasked with making sure they leave things quiescent.
Note that most systems have to do some work immediately before or after executing WFI. For example, there is often a test that must be done to determine if there is any additional work to be done before deciding to go to sleep with WFI. That test and the execution of WFI are then done in a critical section where PRIMASK is set to 1 (so we are exercising option #2 above). This ensures that no interrupt gets in between the test and the WFI, and that after wakeup no interrupt gets in while there are additional operations (usually involving clocking) that need to get done. After wake up, PRIMASK is set back to 0 (exiting the critical section) and any pending interrupt is taken.
Also ARM recommends executing a DSB instruction immediately before WFI to ensure that any data operations are finished before the processor goes to sleep. It may not be strictly necessary in all situations, but put it in just in case circumstances change and you overlook it.
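Put together, the pattern described above looks roughly like this sketch (CMSIS intrinsics, assuming the device's CMSIS header is included; work_pending() stands in for whatever application-specific test you need):

void sleep_if_idle(void)
{
    __disable_irq();           /* PRIMASK = 1: interrupts pend but are not taken */
    if (!work_pending()) {     /* hypothetical application-specific test */
        __DSB();               /* finish outstanding memory accesses */
        __WFI();               /* wakes when a masked interrupt becomes pending */
        /* post-wakeup housekeeping (e.g. restoring clocks) goes here */
    }
    __enable_irq();            /* PRIMASK = 0: any pending interrupt is taken now */
}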

arm sleep mode entry and exit differences WFE, WFI

I am reasonably new to the ARM architectures and I am trying to wrap my head around the wake up mechanism.
So first of all I am finding it difficult to find good info on this. ARM's documentation seems to be very terse on the topic.
What I'd like to understand is when the Cortex (particularly the M0 as that's what I am working with) will wake up.
For reference, I have also consulted the following:
What is the purpose of WFI and WFE instructions and the event signals?
Why does the processor enter standby when using WFE instruction but not when using WFI instruction?
The docs on the WFE instructions are:
3.7.11. WFE
Wait For Event.
Syntax
WFE
Operation
If the event register is 0, WFE suspends execution until
one of the following events occurs:
an exception, unless masked by the exception mask registers or the current
priority level
an exception enters the Pending state, if SEVONPEND in the
System Control Register is set
a Debug Entry request, if debug is enabled
an event signaled by a peripheral or another processor in a
multiprocessor system using the SEV instruction.
If the event register is 1, WFE clears it to 0 and completes immediately.
For more information see Power management.
Note
WFE is intended for power saving only. When writing software assume
that WFE might behave as NOP.
Restrictions
There are no restrictions.
Condition flags
This instruction does not change the flags.
Examples
WFE ; Wait for event
The docs on the WFI instruction are:
3.7.12. WFI
Wait for Interrupt.
Syntax
WFI
Operation
WFI suspends execution until one of the following events occurs:
an exception
an interrupt becomes pending, which would preempt if PRIMASK was clear
a Debug Entry request, regardless of whether debug is enabled.
Note
WFI is intended for power saving only. When writing software assume
that WFI might behave as a NOP operation.
Restrictions
There are no restrictions.
Condition flags
This instruction does not change the flags.
Examples
WFI ; Wait for interrupt
So, some questions:
1) Firstly, can someone please clarify the difference between:
a) System Handler Priority Registers
b) Interrupt Priority Registers.
Is it just that b) is for interrupts that aren't system-related (unlike, say, PendSV)?
Now for some scenarios. Really I would like to understand how the scenarios governed by the:
NVIC IRQ enable
NVIC pending
PRIMASK
affect the entry and exit of WFE and WFI.
So the various combinations of these bits yields 8 different scenarios
{NVIC_IRQ enable, NVIC pending, PRIMASK}.
I have already added my vague understanding so far. Please help me with this table.
000 - No prevention of WFE or WFI entry but no wake up condition either
001 - as 000
010 - How does pending affect entry into sleep mode for WFE and WFI?
011 - I guess the answer here is as 010 but with possibly different wake up conditions?
100 - I'd guess WFE and WFI both enter low power mode and exit low power mode no problem.
101 - Any difference to WFE and WFI power mode exit here?
110 - No idea!
111 - No idea!
I am excluding the priorities here as I'm not too concerned about the exception handling order just yet.
Excluding SEV and the event signals, does WFE behave the same as WFI if SEVONPEND is 0?
The primary mechanism for wake that you'll see on a Cortex-M is an interrupt, hence WFI (wait for interrupt). On all of the implementations that I've seen that results in clock-gating the core, although deeper sleep/higher latency modes are sometimes available if the design supports it.
WFE is more relevant in multi-processor designs.
With regard to the questions -
1. Interrupts and System Handlers are very similar in the Cortex-M, differing primarily by how they are triggered. The architecture distinguishes between them, but in practice they are the same.
As for your bit tables, they don't really make sense. Each Cortex-M implementation has its own interpretation of what happens during WFI. It can vary from basic clock gating to deep-sleep modes. Consult your microprocessor documentation for the real story.
PRIMASK doesn't affect wake from sleep behavior.
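For what it's worth, on a single core WFE is usually wrapped in a condition-checking loop, because it may complete immediately (or behave as a NOP) if the event register is already set. A sketch using CMSIS intrinsics; data_ready is a made-up flag:

volatile int data_ready = 0;   /* hypothetical flag set from an ISR, or signalled with SEV */

void wait_for_data(void)
{
    while (!data_ready) {
        __WFE();   /* sleeps unless an event is already latched; re-checks the flag on wakeup */
    }
}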
My answer to your question about the difference between WFI and WFE is based on the ARM Cortex-A9 MPCore; please take a look at this link: ARM Cortex-A9 MPCore TRM.
Basically, there are four CPU modes: run mode, standby mode, dormant mode, and shutdown mode.
The differences between WFI and WFE are in the ways the CPU is brought back to run mode.
WFE also wakes on the execution of an SEV instruction on any processor in the multiprocessor system, and on an assertion of the EVENTI input signal.
WFI doesn't have these two wake-up sources.
They also differ in how the wake-up cause is handled:
a WFI wake-up must go through the IRQ handler; a WFE wake-up doesn't have to.

What is the irq latency due to the operating system?

How can I estimate the irq latency on ARM processor?
What is the definition for irq latency?
Interrupt Request (IRQ) latency is the time it takes for an interrupt request to travel from the source of the interrupt to the point where it is serviced.
Because there are different interrupts coming from different sources via different paths, their latency obviously depends on the type of the interrupt. You can find a table with very good explanations of latency (both values and causes) for particular interrupts on the ARM site.
You can find more information about it in ARM9E-S Core Technical Reference Manual:
4.3 Maximum interrupt latency
If the sampled signal is asserted at the same time as a multicycle instruction has started
its second or later cycle of execution, the interrupt exception entry does not start until
the instruction has completed.
The longest LDM instruction is one that loads all of the registers, including the PC.
Counting the first Execute cycle as 1, the LDM takes 16 cycles.
• The last word to be transferred by the LDM is transferred in cycle 17, and the abort
status for the transfer is returned in this cycle.
• If a Data Abort happens, the processor detects this in cycle 18 and prepares for
the Data Abort exception entry in cycle 19.
• Cycles 20 and 21 are the Fetch and Decode stages of the Data Abort entry
respectively.
• During cycle 22, the processor prepares for FIQ entry, issuing Fetch and Decode
cycles in cycles 23 and 24.
• Therefore, the first instruction in the FIQ routine enters the Execute stage of the
pipeline in stage 25, giving a worst-case latency of 24 cycles.
and
Minimum interrupt latency
The minimum latency for FIQ or IRQ is the shortest time the request can be sampled
by the input register (one cycle), plus the exception entry time (three cycles). The first
interrupt instruction enters the Execute pipeline stage four cycles after the interrupt is
asserted
There are three parts to interrupt latency:
The interrupt controller picking up the interrupt itself. Modern processors tend to do this quite quickly, but there is still some time between the device signalling its pin and the interrupt controller picking it up - even if it's only 1 ns, it's time [or whatever the method of signalling interrupts is].
The time until the processor starts executing the interrupt code itself.
The time until the actual code supposed to deal with the interrupt is running - that is, after the processor has figured out which interrupt, and what portion of driver-code or similar should deal with the interrupt.
Normally, the operating system won't have any influence over 1.
The operating system certainly influences 2. For example, an operating system will sometimes disable interrupts [to avoid an interrupt interfering with some critical operation, such as modifying something to do with interrupt handling, or when scheduling a new task, or even when executing in an interrupt handler]. Some operating systems may disable interrupts for several milliseconds, where a good realtime OS will not have interrupts disabled for more than microseconds at the most.
And of course, the time it takes from the first instruction in the interrupt handler runs, until the actual driver code or similar is running can be quite a few instructions, and the operating system is responsible for all of them.
For real-time behaviour, it's often the "worst case" that matters, whereas in non-real-time OSes the overall execution time is much more important. So if it's quicker to leave interrupts disabled for a few hundred instructions, because it saves several instructions of "enable interrupts, then disable interrupts", a Linux or Windows type OS may well choose to do so.
Mats and Nemanja give some good information on interrupt latency. There are two more issues I would add to the three given by Mats.
Other simultaneous/near simultaneous interrupts.
OS latency added due to masking interrupts. Edit: This is in Mats' answer, just not explained as much.
If a single core is processing interrupts, then when multiple interrupts occur at the same time there is usually some priority resolution. However, interrupts are often disabled in the interrupt handler unless priority interrupt handling is enabled. So, for example, if a slow NAND flash IRQ is signaled and running and then an Ethernet interrupt occurs, the Ethernet interrupt may be delayed until the NAND flash IRQ finishes. Of course, if you have priority interrupts and you are concerned about the NAND flash interrupt, then things can actually be worse, if the Ethernet is given priority.
The second issue is when mainline code clears/sets the interrupt flag. Typically this is done with something like,
mrs   r9, cpsr               @ read the current program status register
biceq r9, r9, #PSR_I_BIT     @ (conditionally) clear the I bit that masks IRQs
Check arch/arm/include/asm/irqflags.h in the Linux source for many macros used by main line code. A typical sequence is like this,
lock interrupts;
manipulate some flag in struct;
unlock interrupts;
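A sketch of that sequence using the Linux kernel helpers from <linux/irqflags.h> (the struct and field names are made up for illustration); the interrupts-off window between save and restore is the latency being described:

#include <linux/irqflags.h>

struct my_state { int pending; };             /* hypothetical struct */

static void update_flag(struct my_state *s)
{
    unsigned long flags;

    local_irq_save(flags);     /* mask IRQs on this CPU and remember the old state */
    s->pending = 1;            /* the struct access that may fault */
    local_irq_restore(flags);  /* unmask; any IRQ that arrived meanwhile is taken now */
}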
A very large interrupt latency can be introduced if that struct results in a page fault. The interrupts will be masked for the duration of the page fault handler.
The Cortex-A9 has lots of lock-free instructions that can prevent this by never masking interrupts; this is possible because of better assembler instructions than swp/swpb. This second issue is much like the IRQ latency due to ldm/stm type instructions (these are just the longest instructions to run).
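As a sketch of the lock-free alternative, GCC's __atomic builtins compile to ldrex/strex-based sequences on ARMv7, so the flag can be updated without an interrupts-masked window at all (event_flags is a made-up shared variable):

#include <stdint.h>

static uint32_t event_flags;                   /* shared with an interrupt handler */

static inline void set_event(uint32_t bit)
{
    /* lock-free read-modify-write: no need to disable interrupts around it */
    __atomic_fetch_or(&event_flags, bit, __ATOMIC_SEQ_CST);
}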
Finally, a lot of the technical discussions will assume zero-wait-state RAM. It is likely that the cache will need to be filled, and if you know your memory data rate (maybe 2-4 machine cycles), then the worst-case code path would multiply by this.
Whether you have SMP interrupt handling, priority interrupts, and lock free main line depends on your kernel configuration and version; these are issues for the OS. Other issues are intrinsic to the CPU/SOC interrupt controller, and to the interrupt code itself.
