DSB on ARM Cortex M4 processors

DSB on ARM Cortex M4 processors - arm

I have read the ARM documentation and it appears that they say in some places that the Cortex M4 can reorder memory writes, while in other places it indicates that M4 will not.
Specifically I am wondering if the DBM instruction is needed like:
volatile int flag=0;
char buffer[10];
void foo(char c)
{
__ASM volatile ("dbm" : : : "memory");
__disable_irq(); //disable IRQ as we use flag in ISR
buffer[0]=c;
flag=1;
__ASM volatile ("dbm" : : : "memory");
__enable_irq();
}

Uh, it depends on what your flag is, and it also varies from chip to chip.
In case that flag is stored in memory:
DSB is not needed here. An interrupt handler that would access flag would have to load it from memory first. Even if your previous write is still in progress the CPU will make sure that the load following the store will happen in the correct order.
If your flag is stored in peripheral memory:
Now it gets interesting. Lets assume flag is in some hardware peripheral. A write to it may make an interrupt pending or acknowledge an interrupt (aka clear a pending interrupt). Contrary to the memory example above this effect happens without the CPU having to read the flag first. So the automatic ordering of stores and loads won't help you. Also writes to flag may take effect with a surprisingly long delay due to different clock domains between the CPU and the peripheral.
So the following szenario can happen:
you write flag=1 to clear an handled interrupt.
you enable interrupts by calling __enable_irq()
interrupts get enabled, write to flag=1 is still pending.
wheee, an interrupt is pending and the CPU jumps to the interrupt handler.
flag=1 takes effect. You're now in an interrupt handler without anything to do.
Executing a DSB in front of __enable_irq() will prevent this problem because whatever is triggered by flag=1 will be in effect before __enable_irq() executes.
If you think that this case is purely academic: Nope, it's real.
Just think about a real-time clock. These usually runs at 32khz. If you write into it's peripheral space from a CPU running at 64Mhz it can take a whopping 2000 cycles before the write takes effect. Now for real-time clocks the data-sheet usually shows specific sequences that make sure you don't run into this problem.
The same thing can however happen with slow peripherals.
My personal anecdote happened when implementing power-saving late in a project. Everything was working fine. Then we reduced the peripheral clock speed of I²C and SPI peripherals to the lowest possible speed we could get away with. This can save lots of power and extend battery live. What we found out was that suddenly interrupts started to do unexpected things. They seem to fire twice each time wrecking havoc. Putting a DSB at the end of each affected interrupt handler fixed this because - you can guess - the lower clock speed caused us to leave the interrupt handlers before clearing the interrupt source was in effect due to the slow peripheral clock.

This section of the Cortex M4 generic device user guide enumerates the factors which can affect reordering.
the processor can reorder some memory accesses to improve efficiency, providing this does not affect the behavior of the instruction sequence.
the processor has multiple bus interfaces
memory or devices in the memory map have different wait states
some memory accesses are buffered or speculative.
You should also bear in mind that both DSB and ISB are often required (in that order), and that C does not make any guarantees about the ordering (except in-thread volatile accesses).
You will often observe that the short pipeline and instruction sequences can combine in such a way that the race conditions seem unreachable with a specific compiled image, but this isn't something you can rely on. Either the timing conditions might be rare (but possible), or subsequent code changes might change the resulting instruction sequence.

Related

How to know whether an IRQ was served immediately on ARM Cortex M0+ (or any other MCU)

For my application (running on an STM32L082) I need accurate (relative) timestamping of a few types of interrupts. I do this by running a timer at 1 MHz and taking its count as soon as the ISR is run. They are all given the highest priority so they pre-empt less important interrupts. The problem I'm facing is that they may still be delayed by other interrupts at the same priority and by code that disables interrupts, and there seems to be no easy way to know this happened. It is no problem that the ISR was delayed, as long as I know that the particular timestamp is not accurate because of this.
My current approach is to let each ISR and each block of code with interrupts disabled check whether interrupts are pending using NVIC->ISPR[0] and flagging this for the pending ISR. Each ISR checks this flag and, if needed, flags the timestamp taken as not accurate.
Although this works, it feels like it's the wrong way around. So my question is: is there another way to know whether an IRQ was served immediately?
The IRQs in question are EXTI4-15 for a GPIO pin change and RTC for the wakeup timer. Unfortunately I'm not in the position to change the PCB layout and use TIM input capture on the input pin, nor to change the MCU used.
update
The fundamental limit to accuracy in the current setup is determined by the nature of the internal RTC calibration, which periodically adds/removes 32kHz ticks, leading to ~31 µs jitter. My goal is to eliminate (or at least detect) additional timestamping inaccuracies where possible. Having interrupts blocked incidentally for, say, 50+ µs is hard to avoid and influences measurements, hence the need to at least know when this occurs.
update 2
To clarify, I think this is a software question, asking if a particular feature exists and if so, how to use it. The answer I am looking for is one of: "yes it is possible, just check bit X of register Y", or "no it is not possible, but MCU ... does have such a feature, called ..." or "no, such a feature is generally not available on any platform (but the common workaround is ...)". This information will guide me (and future readers) towards a solution in software, and/or requirements for better hardware design.

In general
The ideal solution for accurate timestamping is to use timer capture hardware (built-in to the microcontroller, or an external implementation). Aside from that, using a CPU with enough priority levels to make your ISR always the highest priority could work, or you might be able to hack something together by making the DMA engine sample the GPIO pins (specifics below).
Some microcontrollers have connections between built-in peripherals that allow one peripheral to trigger another (like a GPIO pin triggering timer capture even though it isn't a dedicated timer capture input pin). Manufacturers have different names for this type of interconnection, but a general overview can be found on Wikipedia, along with a list of the various names. Exact capabilities vary by manufacturer.
I've never come across a feature in a microcontroller for indicating if an ISR was delayed by a higher priority ISR. I don't think it would be a commonly-used feature, because your ISR can be interrupted by a higher priority ISR at any moment, even after you check the hypothetical was_delayed flag. A higher priority ISR can often check if a lower priority interrupt is pending though.
For your specific situation
A possible approach is to use a timer and DMA (similar to audio streaming, double-buffered/circular modes are preferred) to continuously sample your GPIO pins to a buffer, and then you scan the buffer to determine when the pins changed. Note that this means the CPU must scan the buffer before it is overwritten again by DMA, which means the CPU can only sleep in short intervals and must keep the timer and DMA clocks running. ST's AN4666 is a relevant document, and has example code here (account required to download example code). They're using a different microcontroller, but they claim the approach can be adapted to others in their lineup.
Otherwise, with your current setup, I don't think there is a better solution than the one you're using (the flag that's set when you detect a delay). The ARM Cortex-M0+ NVIC does not have a feature to indicate if an ISR was delayed.
A refinement to your current approach might be making the ISRs as short as possible, so they only do the timestamp collection and then put any other work into a queue for processing by the main application at a lower priority (only applicable if the work is more complex than the enqueue operation, and if the work isn't time-sensitive). Eliminating or making the interrupts-disabled regions short should also help.

When is a Cortex write to a device realised

When writing to device registers on a Cortex M0 (in my case, on an STM32L073), a question arises as to how careful one should be in a) ordering accesses to device memory and b) deciding that a change to a peripheral configuration has actually completed to the point that any dependencies become valid.
Taking a specific example to change the internal voltage regulator to a different voltage. You write the change to PWR->CR and read the status from PWR->CSR. I see code that does something like this:
Write to PWR->CR to set the voltage range
Spin until (PWR->CSR & voltage flag) becomes zero
In my mind there are three issues here:
Access ordering. This is Device Memory so transaction order is preserved relative to other Device access transactions. I would assume this means a DSB is not required between the write to CR and the read from CSR. A linked question and the answer to this is: [ARM CortexA]Difference between Strongly-ordered and Device Memory Type
Device memory can be buffered. Is there a possibility that a write to CR could still be in process when the read from CSR occurs. This would mean that the voltage flag would be clear and the code would proceed. In actual fact the flag hasn't gone high yet!
Hardware response time. Is there a latency between the write and the effects becoming final? In actuality this should always be documented - for the STM32 the docs definitively say that the flag is set when the CR register changes.
Are there any race condition possibilities here? It's really the buffering that worries me - that a peripheral write is still in progress when a peripheral read takes place.

Access ordering.
Accesses are strongly ordered and you do not need barrier instructions to read back the same register.
Device memory can be buffered. Is there a possibility that a write to CR
Yes, it is possible. But it is not because of buffering but because of the bus propagation time. It may take several clocks before a particular operation will go through all bridges.
Hardware response time. Is there a latency between the write and the
effects becoming final
Even if there is a latency it is not important from your point of view. If you set bit in the CR register and wait for the result in the status register. Simply wait for the status bit to have the expected value.

The right way to clear an interrupt flag on STM32

I'm developping a bare-metal project on a STM32L4 and I'm starting from an existing code base.
The ISRs have been implemented the following way:
read interrupt status in the peripheral to know what event(s) provoked the interrupt
do something
clear the flags that have read at the beginning.
Is it the right way to clear the flag ? Shouldn't the flags be cleared at the very beginning of the ISR ? My understanding is that, if the same peripheral event is happening a second time during step 2, it will not provoke a second IRQ so it would be lost. On the other hand if you clear the flag as soon as you can, this second event would pulse the interrupt whose state in the CPU would change to "pending and active": a second IRQ would happen.
PS: From STM32 Processor Programming Manual I read: "STM32 interrupts are both level-sensitive and pulse-sensitive".

Definitely at the beginning (unless you have special reasons in the program logic) as some time is needed the for actual write to the flag clear register to propagate through the buses.
If you decide for some reason to put it at the end of the interrupt you should leave some instructions, place the barrier instruction or read back the register before the interrupt routine return to make sure that the clear operation has propagated across the buses. Otherwise you may have a "phantom" duplicate routine calls.

Why NOP/few extra lines of code/optimization of pointer aliasing helps? [Fujitsu MB90F543 MCU C code]

I am trying to fix an bug found in a mature program for Fujitsu MB90F543. The program works for nearly 10 years so far, but it was discovered, that under some special circumstances it fails to do two things at it's very beginning. One of them is crucial.
After low and high level initialization (ports, pins, peripherials, IRQ handlers) configuration data is read over SPI from EEPROM and status LEDs are turned on for a moment (to turn them a data is send over SPI to a LED driver).
When those special circumstances occur first and only first function invoking just a few EEPROM reads fails and additionally a few of the LEDs that should, don't turn on.
The program is written in C and compiled using Softune v30L32.
Surprisingly it is sufficient to add single __asm(" NOP ") in low level hardware init to make the program work as expected under mentioned circumstances. It is sufficient to turn off 'Control optimization of pointer aliasing' in Optimization settings. Adding just a few lines of code in various places helps too.
I have compared (DIFFed) ASM listings of compiled program for a version with and without __asm(" NOP ") and with both aforementioned optimizer settings and they all look just fine.
The only warning Softune compiler has been printing for years during compilation is as follows:
*** W1372L: The section is placed outside the RAM area or the I/O area (IOXTND)
I do realize it's rather general question, but maybe someone who has a bigger picture will be able to point out possible cause.
Have you got an idea what may cause such a weird behaviour? How to locate the bug and fix it?
During the initialization a few long (about 20ms) delay loops are used. They don't help although they were increased from about 2ms, yet single NOP in any line of the hardware initialization function and even before or after the function helps.
Both the wait loops works. I have checked it using an oscilloscope. (I have added LED turn on before and off after).
I have checked timming hypothesis by slowing down SPI clock from 1MHz to 500kHz. It does not change anything. Slowing down to 250kHz makes watchdog resets, as some parts of the code execute too long (>25ms).
One more thing. I have observed that adding local variables in any source file sometimes makes the problem disappear or reappear. The same concerns initializing uninitialized local variables. Adding a few extra lines of a code in any of the files helps or reveals the problem.
void main(void)
{
watchdog_init();
// waiting for power supply to stabilize
wait; // about 45ms
hardware_init();
clear_watchdog();
application_init();
clear_watchdog();
wait; // about 20ms
test_LED();
{...}
}
void hardware_init (void)
{
__asm("NOP"); // how it comes it helps? - it may be in any line of the function
io_init(); // ports initialization
clk_init();
timer_init();
adc_init();
spi_init();
LED_init();
spi_start();
key_driver_init();
can_init();
irq_init(); // set IRQ priorities and global IRQ enable
}

Could be one of many things but two spring to mind.
Timing.
Maybe the wait is not long enough for power to stabilize and not everything is synced to the clock. The NOP gets everything back in sync.
Alignment.
Perhaps the NOP gets your instructions aligned on a 32 or 64 bit boundary expected by the hardware. (we used to do this a lot on mainframe assemblers as IO operations often expected things to be on double word boundarys).

The problem was solved. It was caused by a trivial bug.
EEPROM's nHOLD and nCS signals were not initialized immediately after MCU's reset, but before the first use of the EEPROM. As a result they were 0's, so active.
This means EEPROM was selected, but waiting on hold. Meantime other transfer using SPI started. After 6 out of 8 CLK pulses EEPROM's nHOLD I/O pin was initialized and brought high. EEPROM was no longer on hold so it clocked in last two bits of a data for an other peripheral. Every subsequent operation on the EEPROM found it being having not synchronized CLK and MOSI.
When I have added NOP or anything other the moment of nHOLD 0->1 edge was shifted to happen after the last CLK pulse. Now CLK-MOSI were in sync.
All I have had to do was to initialize all the EEPROM's SPI lines, in
particular nHOLD and nCS right after the MCU reset.

What is the irq latency due to the operating system?

How can I estimate the irq latency on ARM processor?
What is the definition for irq latency?

Interrupt Request (irq) latency is the time that takes for interrupt request to travel from source of the interrupt to the point when it will be serviced.
Because there are different interrupts coming from different sources via different paths, obviously their latency is depending on the type of the interrupt. You can find table with very good explanations about latency (both value and causes) for particular interrupts on ARM site
You can find more information about it in ARM9E-S Core Technical Reference Manual:
4.3 Maximum interrupt latency
If the sampled signal is asserted at the same time as a multicycle instruction has started
its second or later cycle of execution, the interrupt exception entry does not start until
the instruction has completed.
The longest LDM instruction is one that loads all of the registers, including the PC.
Counting the first Execute cycle as 1, the LDM takes 16 cycles.
• The last word to be transferred by the LDM is transferred in cycle 17, and the abort
status for the transfer is returned in this cycle.
• If a Data Abort happens, the processor detects this in cycle 18 and prepares for
the Data Abort exception entry in cycle 19.
• Cycles 20 and 21 are the Fetch and Decode stages of the Data Abort entry
respectively.
• During cycle 22, the processor prepares for FIQ entry, issuing Fetch and Decode
cycles in cycles 23 and 24.
• Therefore, the first instruction in the FIQ routine enters the Execute stage of the
pipeline in stage 25, giving a worst-case latency of 24 cycles.
and
Minimum interrupt latency
The minimum latency for FIQ or IRQ is the shortest time the request can be sampled
by the input register (one cycle), plus the exception entry time (three cycles). The first
interrupt instruction enters the Execute pipeline stage four cycles after the interrupt is
asserted

There are three parts to interrupt latency:
The interrupt controller picking up the interrupt itself. Modern processors tend to do this quite quickly, but there is still some time between the device signalling it's pin and the interrupt controller picking it up - even if it's only 1ns, it's time [or whatever the method of signalling interrupts are].
The time until the processor starts executing the interrupt code itself.
The time until the actual code supposed to deal with the interrupt is running - that is, after the processor has figured out which interrupt, and what portion of driver-code or similar should deal with the interrupt.
Normally, the operating system won't have any influence over 1.
The operating system certainly influences 2. For example, an operating system will sometimes disable interrupts [to avoid an interrupt interfering with some critical operation, such as for example modifying something to do with interrupt handling, or when scheduling a new task, or even when executing in an interrupt handler. Some operating systems may disable interrupts for several milliseconds, where a good realtime OS will not have interrupts disabled for more than microseconds at the most.
And of course, the time it takes from the first instruction in the interrupt handler runs, until the actual driver code or similar is running can be quite a few instructions, and the operating system is responsible for all of them.
For real time behaviour, it's often the "worst case" that matters, where in non-real time OS's, the overall execution time is much more important, so if it's quicker to not enable interrupts for a few hundred instructions, because it saves several instructions of "enable interrupts, then disable interrupts", a Linux or Windows type OS may well choose to do so.

Mats and Nemanja give some good information on interrupt latency. There are two is one more issue I would add, to the three given by Mats.
Other simultaneous/near simultaneous interrupts.
OS latency added due to masking interrupts. Edit: This is in Mats answer, just not explained as much.
If a single core is processing interrupts, then when multiple interrupts occur at the same time, usually there is some resolution priority. However, interrupts are often disabled in the interrupt handler unless priority interrupt handling is enabled. So for example, a slow NAND flash IRQ is signaled and running and then an Ethernet interrupt occurs, it may be delayed until the NAND flash IRQ finishes. Of course, if you have priorty interrupts and you are concerned about the NAND flash interrupt, then things can actually be worse, if the Ethernet is given priority.
The second issue is when mainline code clears/sets the interrupt flag. Typically this is done with something like,
mrs r9, cpsr
biceq r9, r9, #PSR_I_BIT
Check arch/arm/include/asm/irqflags.h in the Linux source for many macros used by main line code. A typical sequence is like this,
lock interrupts;
manipulate some flag in struct;
unlock interrupts;
A very large interrupt latency can be introduced if that struct results in a page fault. The interrupts will be masked for the duration of the page fault handler.
The Cortex-A9 has lots of lock free instructions that can prevent this by never masking interrupts; because of better assembler instructions than swp/swpb. This second issue is much like the IRQ latency due to ldm/stm type instructions (these are just the longest instructions to run).
Finally, a lot of the technical discussions will assume zero-wait state RAM. It is likely that the cache will need to be filled and if you know your memory data rate (maybe 2-4 machine cycles), then the worst case code path would multiply by this.
Whether you have SMP interrupt handling, priority interrupts, and lock free main line depends on your kernel configuration and version; these are issues for the OS. Other issues are intrinsic to the CPU/SOC interrupt controller, and to the interrupt code itself.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight