DMB instructions in an interrupt-safe FIFO

Related to this thread, I have a FIFO which should work across different interrupts on a Cortex M4.
The head index must be:
- atomically written (modified) by multiple interrupts (not threads)
- atomically read by a single (lowest-level) interrupt

The function for moving the FIFO head looks similar to this (the actual code also checks whether the head overflowed, but this is the main idea):
#include <stdatomic.h>
#include <stdint.h>

#define FIFO_LEN 1024
extern _Atomic int32_t _head;

int32_t acquire_head(void)
{
    while (1)
    {
        int32_t old_h = atomic_load(&_head);
        int32_t new_h = (old_h + 1) & (FIFO_LEN - 1);

        if (atomic_compare_exchange_strong(&_head, &old_h, new_h))
        {
            return old_h;
        }
    }
}
GCC will compile this to:
acquire_head:
        ldr     r2, .L8
.L2:
        // int32_t old_h = atomic_load(&_head);
        dmb     ish
        ldr     r1, [r2]
        dmb     ish
        // int32_t new_h = (old_h + 1) & (FIFO_LEN - 1);
        adds    r3, r1, #1
        ubfx    r3, r3, #0, #10
        // if (atomic_compare_exchange_strong(&_head, &old_h, new_h))
        dmb     ish
.L5:
        ldrex   r0, [r2]
        cmp     r0, r1
        bne     .L6
        strex   ip, r3, [r2]
        cmp     ip, #0
        bne     .L5
.L6:
        dmb     ish
        bne     .L2
        bx      lr
.L8:
        .word   _head
This is a bare-metal project without an OS/threads. The code is for a logging FIFO which is not time critical, but I don't want acquiring the head to have an impact on the latency of the rest of my program, so my questions are:
- Do I need all these dmbs?
- Will there be a noticeable performance penalty from these instructions, or can I just ignore them?
- If an interrupt happens during a dmb, how many additional cycles of latency does it create?

TL;DR: yes, LL/SC (STREX/LDREX) can be good for interrupt latency compared to disabling interrupts, by making an atomic RMW interruptible with a retry.
This may come at the cost of throughput, because apparently disabling / re-enabling interrupts on ARMv7 is very cheap (maybe 1 or 2 cycles each for cpsid if / cpsie if), especially if you can unconditionally enable interrupts instead of saving and restoring the old state. (See Temporarily disable interrupts on ARM, later in this document.)
The extra throughput costs are: if LDREX/STREX are any slower than LDR / STR on Cortex-M4, a cmp/bne (not-taken in the successful case), and any time the loop has to retry the whole loop body runs again. (Retry should be very rare; only if an interrupt actually comes in while in the middle of an LL/SC in another interrupt handler.)
C11 compilers like gcc don't have a special-case mode for uniprocessor systems or single-threaded code, unfortunately. So they don't know how to do code-gen that takes advantage of the fact that anything running on the same core will see all our operations in program order up to a certain point, even without any barriers.
(The cardinal rule of out-of-order execution and memory reordering is that it preserves the illusion of a single-thread or single core running instructions in program order.)
The back-to-back dmb instructions separated only by a couple ALU instructions are redundant even on a multi-core system for multi-threaded code. This is a gcc missed-optimization, because current compilers do basically no optimization on atomics. (Better to be safe and slowish than to risk ever being too weak. It's hard enough to reason about, test, and debug lockless code without worrying about possible compiler bugs.)
Atomics on a single-core CPU
You can vastly simplify it in this case by masking after an atomic_fetch_add, instead of simulating an atomic add with earlier rollover using CAS. (Then readers must mask as well, but that's very cheap.)
And you can use memory_order_relaxed. If you want reordering guarantees against an interrupt handler, use atomic_signal_fence to enforce compile-time ordering without asm barriers against runtime reordering. User-space POSIX signals are asynchronous within the same thread in exactly the same way that interrupts are asynchronous within the same core.
// readers must also mask _head & (FIFO_LEN - 1) before use

// Uniprocessor but with an atomic RMW:
int32_t acquire_head_atomicRMW_UP(void)
{
    atomic_signal_fence(memory_order_seq_cst);   // zero asm instructions, just compile-time
    int32_t old_h = atomic_fetch_add_explicit(&_head, 1, memory_order_relaxed);
    atomic_signal_fence(memory_order_seq_cst);

    int32_t new_h = (old_h + 1) & (FIFO_LEN - 1);
    return new_h;
}
On the Godbolt compiler explorer:

## gcc8.2 -O3 with your same options.
acquire_head_atomicRMW:
        ldr     r3, .L4           ## load the static address from a nearby literal pool
.L2:
        ldrex   r0, [r3]
        adds    r2, r0, #1
        strex   r1, r2, [r3]
        cmp     r1, #0
        bne     .L2               ## LL/SC retry loop, not load + inc + CAS-with-LL/SC
        adds    r0, r0, #1        ## add again: missed optimization to not reuse r2
        ubfx    r0, r0, #0, #10
        bx      lr
.L4:
        .word   _head
Unfortunately there's no way I know of in C11 or C++11 to express an LL/SC atomic RMW containing an arbitrary set of operations, like add and mask, that would let us get the ubfx inside the loop as part of what gets stored to _head. There are compiler-specific intrinsics for LDREX/STREX, though: Critical sections in ARM.
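(On Cortex-M, the CMSIS headers expose the exclusives directly, so you can hand-roll the loop with the mask inside it. A sketch assuming CMSIS's __LDREXW/__STREXW and a plain volatile counter instead of _Atomic; this is not something the compiler will emit for you from portable C11:)

#include <stdint.h>
// __LDREXW / __STREXW come from the CMSIS core headers (e.g. core_cm4.h)

#define FIFO_LEN 1024                  // as in the question
extern volatile uint32_t _head;

uint32_t acquire_head_ldrex(void)
{
    uint32_t old_h, new_h;
    do {
        old_h = __LDREXW(&_head);              // LDREX: open the exclusive monitor
        new_h = (old_h + 1) & (FIFO_LEN - 1);  // add *and* mask inside the LL/SC loop
    } while (__STREXW(new_h, &_head) != 0);    // STREX: returns 0 on success, else retry
    return old_h;
}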
This is safe because _Atomic integer types are guaranteed to be 2's complement with well-defined overflow (wraparound) behaviour. (int32_t is already guaranteed to be 2's complement because it's one of the fixed-width types, but the no-UB-on-wraparound guarantee is only for _Atomic.) I'd have used uint32_t, but we get the same asm.
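(Reader side, for the fetch_add version: _head counts freely and wraps only at 2^32, which is a multiple of FIFO_LEN, so masking stays consistent. A trivial sketch with a hypothetical fifo_buf:)

extern uint8_t fifo_buf[FIFO_LEN];     // hypothetical FIFO storage

uint8_t fifo_entry(uint32_t head_snapshot)
{
    return fifo_buf[head_snapshot & (FIFO_LEN - 1)];  // mask before indexing
}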
Safely using STREX/LDREX from inside an interrupt handler:
ARM® Synchronization Primitives (from 2009) has some details about the ISA rules that govern LDREX/STREX. Running an LDREX initializes the "exclusive monitor" to detect modification by other cores (or by other non-CPU things in the system? I don't know). Cortex-M4 is a single-core system.
You can have a global monitor for memory shared between multiple CPUs, and local monitors for memory that's marked non-shareable. That documentation says "If a region configured as Shareable is not associated with a global monitor, Store-Exclusive operations to that region always fail, returning 0 in the destination register." So if STREX seems to always fail (so you get stuck in a retry loop) when you test your code, that might be the problem.
An interrupt does not abort a transaction started by an LDREX. If you were context-switching to another context and resuming something that might have stopped right before a STREX, you could have a problem. ARMv6K introduced clrex for this, otherwise older ARM would use a dummy STREX to a dummy location.
See When is CLREX actually needed on ARM Cortex M7?, which makes the same point I'm about to, that CLREX is often not needed in an interrupt situation, when not context-switching between threads.
(Fun fact: a more recent answer on that linked question points out that Cortex M7 (or Cortex M in general?) automatically clears the monitor on interrupt, meaning clrex is never necessary in interrupt handlers. The reasoning below can still apply to older single-core ARM CPUs with a monitor that doesn't track addresses, unlike in multi-core CPUs.)
But for this problem, the thing you're switching to is always the start of an interrupt handler. You're not doing pre-emptive multi-tasking. So you can never switch from the middle of one LL/SC retry loop to the middle of another. As long as STREX fails the first time in the lower-priority interrupt when you return to it, that's fine.
That will be the case here because a higher-priority interrupt will only return after it does a successful STREX (or didn't do any atomic RMWs at all).
So I think you're ok even without using clrex from inline asm, or from an interrupt handler before dispatching to C functions. The manual says a Data Abort exception leaves the monitors architecturally undefined, so make sure you CLREX in that handler at least.
If an interrupt comes in while you're between an LDREX and STREX, the LL has loaded the old data in a register (and maybe computed a new value), but hasn't stored anything back to memory yet because STREX hadn't run.
The higher-priority code will LDREX, getting the same old_h value, then do a successful STREX of old_h + 1. (Unless it is also interrupted, but this reasoning works recursively). This might possibly fail the first time through the loop, but I don't think so. Even if so, I don't think there can be a correctness problem, based on the ARM doc I linked. The doc mentioned that the local monitor can be as simple as a state-machine that just tracks LDREX and STREX instructions, letting STREX succeed even if the previous instruction was an LDREX for a different address. Assuming Cortex-M4's implementation is simplistic, that's perfect for this.
Running another LDREX for the same address while the CPU is already monitoring from a previous LDREX looks like it should have no effect. Performing an exclusive load to a different address would reset the monitor to open state, but for this it's always going to be the same address (unless you have other atomics in other code?)
Then (after doing some other stuff), the interrupt handler will return, restoring registers and jumping back to the middle of the lower-priority interrupt's LL/SC loop.
Back in the lower-priority interrupt, STREX will fail because the STREX in the higher-priority interrupt reset the monitor state. That's good, we need it to fail because it would have stored the same value as the higher-priority interrupt that took its spot in the FIFO. The cmp / bne detects the failure and runs the whole loop again. This time it succeeds (unless interrupted again), reading the value stored by the higher-priority interrupt and storing & returning that + 1.
So I think we can get away without a CLREX anywhere, because interrupt handlers always run to completion before returning to the middle of something they interrupted. And they always begin at the beginning.
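For the Data Abort caveat mentioned above, the instruction itself is trivial; CMSIS exposes it as __CLREX(). A sketch, assuming a CMSIS environment and a hypothetical handler name:

void fault_handler(void)   // whatever your Data Abort / fault handler is called
{
    __CLREX();   // force the local exclusive monitor back to the open state
    // ... actual fault handling ...
}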
Single-writer version
Or, if nothing else can be modifying that variable, you don't need an atomic RMW at all: just a pure atomic load, then a pure atomic store of the new value. (_Atomic for the benefit of any readers.)
Or if no other thread or interrupt touches that variable at all, it doesn't need to be _Atomic.
// If we're the only writer, and other threads can only observe:
// again using uniprocessor memory order: relaxed + signal_fence
int32_t acquire_head_separate_RW_UP(void)
{
    atomic_signal_fence(memory_order_seq_cst);
    int32_t old_h = atomic_load_explicit(&_head, memory_order_relaxed);
    int32_t new_h = (old_h + 1) & (FIFO_LEN - 1);
    atomic_store_explicit(&_head, new_h, memory_order_relaxed);
    atomic_signal_fence(memory_order_seq_cst);
    return new_h;
}
acquire_head_separate_RW_UP:
        ldr     r3, .L7
        ldr     r0, [r3]          ## plain atomic load
        adds    r0, r0, #1
        ubfx    r0, r0, #0, #10   ## zero-extend low 10 bits
        str     r0, [r3]          ## plain atomic store
        bx      lr
This is the same asm we'd get for a non-atomic head.

Your code is written in a very non-"bare metal" way. Those "general" atomic functions do not know whether the value being read or stored lives in internal memory, or whether it is a hardware register located somewhere far from the core, connected via buses and sometimes write/read buffers.
That is why the generic atomic functions have to place so many DMB instructions. Because you read or write an internal memory location, they are not needed at all (the M4 has no internal cache, so that kind of strong precaution is not needed either).
IMO it is enough to disable interrupts when you want to access the memory location atomically.
PS: stdatomic sees very rare use in bare-metal uC development.
The fastest way to guarantee exclusive access on an M4 uC is to disable and enable interrupts.
__disable_irq();
x++;
__enable_irq();
71 __ASM volatile ("cpsid i" : : : "memory");
080053e8: cpsid i
79 x++;
080053ea: ldr r2, [pc, #160] ; (0x800548c <main+168>)
080053ec: ldrb r3, [r2, #0]
080053ee: adds r3, #1
080053f0: strb r3, [r2, #0]
60 __ASM volatile ("cpsie i" : : : "memory");
which costs only 2 to 4 additional clocks for both instructions together.
It guarantees atomicity and does not add unnecessary overhead.
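One caveat with a bare __disable_irq()/__enable_irq() pair: if the critical section can be entered with interrupts already disabled, unconditionally re-enabling them at the end is wrong. A save/restore sketch using the standard CMSIS PRIMASK intrinsics:

#include <stdint.h>
// __get_PRIMASK / __set_PRIMASK / __disable_irq come from the CMSIS core headers

static inline uint32_t critical_enter(void)
{
    uint32_t primask = __get_PRIMASK();  // save the current interrupt mask state
    __disable_irq();                     // cpsid i
    return primask;
}

static inline void critical_exit(uint32_t primask)
{
    __set_PRIMASK(primask);              // re-enables only if interrupts were enabled before
}

Usage: uint32_t m = critical_enter(); x++; critical_exit(m);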

dmb is required in situations like
p1:
str r5, [r1]
str r0, [r2]
and
p2:
wait([r2] == 0)
ldr r5, [r1]
(from http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf, section 6.2.1 "Weakly-Ordered Message Passing problem").
In-CPU optimizations can reorder the instructions on p1, so you have to insert a dmb between both stores.
In your example there are too many dmbs, which is probably caused by the expansion of atomic_xxx(), which might have a dmb both at the start and at the end.
It should be enough to have
acquire_head:
        ldr     r2, .L8
        dmb     ish
.L2:
        // int32_t old_h = atomic_load(&_head);
        ldr     r1, [r2]
        ...
        bne     .L5
.L6:
        bne     .L2
        dmb     ish
        bx      lr
and no other dmb between.
Performance impact is difficult to estimate (you would have to benchmark the code with and without the dmbs). A dmb does not consume CPU cycles doing work of its own; it stalls the pipeline inside the CPU until outstanding memory accesses complete.
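For comparison, the same message-passing pattern written with C11 atomics lets the compiler place the barriers: on ARMv7, a release store compiles to a dmb before the str and an acquire load to a dmb after the ldr, which matches the two-barrier placement in the litmus test. A sketch, with assumed names msg/flag:

#include <stdatomic.h>
#include <stdint.h>

static uint32_t msg;               // the payload ([r1] in the litmus test)
static _Atomic uint32_t flag;      // the postbox flag ([r2])

void p1_send(uint32_t m)
{
    msg = m;                                                // str [Msg]
    atomic_store_explicit(&flag, 1, memory_order_release);  // dmb; str [Flag]
}

uint32_t p2_receive(void)
{
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                   // ldr [Flag]; dmb
    return msg;                                             // ldr [Msg]
}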

Related

ARM ITCM interface and Flash access

If the access to the Flash memory is done starting from the address 0x0200 0000, it is performed automatically via the ITCM bus. The ART accelerator™ should be enabled to get the equivalent of 0-wait state access to the Flash memory via the ITCM bus. The ART is enabled by setting the bit 9 in the FLASH_ACR register while the ART-Prefetch is enabled by setting the bit 8 in the same register.
If I place my program code starting at 0x0200 0000, what would happen if the ART accelerator is not enabled? Would it be beneficial to use the AXIM bus for startup code, then enable the ART accelerator and point execution at the program region at 0x0200 0000?
I am just a bit confused.
https://www.st.com/content/ccc/resource/technical/document/application_note/0e/53/06/68/ef/2f/4a/cd/DM00169764.pdf/files/DM00169764.pdf/jcr:content/translations/en.DM00169764.pdf
Page 12
So let's just try it. NUCLEO-F767ZI
Cortex-M7s in general, from the TRM (section 1.2.3, Prefetch Unit):

The Prefetch Unit (PFU) provides:
• 64-bit instruction fetch bandwidth.
• 4x64-bit pre-fetch queue to decouple instruction pre-fetch from DPU pipeline operation.
• A Branch Target Address Cache (BTAC) for the single-cycle turn-around of branch predictor state and target address.
• A static branch predictor when no BTAC is specified.
• Forwarding of flags for early resolution of direct branches in the decoder and first execution stages of the processor pipeline.
For this test the branch prediction gets in the way, so turn that off: set ACTLR to 00003000 (hex; most numbers here are hex).
I don't see how to disable the PFU, and wouldn't expect to have that kind of control anyway.
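For reference, the register pokes for these runs might look like this in C; the addresses and bit layout are my assumptions from the Cortex-M7 TRM and the STM32F7 reference manual (ACTLR at 0xE000E008; FLASH_ACR at 0x40023C00 with LATENCY in bits [3:0], PRFTEN at bit 8, ARTEN at bit 9):

#include <stdint.h>

#define ACTLR     (*(volatile uint32_t *)0xE000E008)  // Cortex-M7 Auxiliary Control Register
#define FLASH_ACR (*(volatile uint32_t *)0x40023C00)  // STM32F7 flash access control register

static void setup_run(uint32_t acr)
{
    ACTLR = 0x00003000;   // the value used in this test to turn off branch prediction
    FLASH_ACR = acr;      // e.g. 0x002 = 2 wait states; 0x202 = 2 wait states + ART enable
}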
So we expect the prefetch to read 64 bits at a time, 4 instructions on an aligned boundary.
From ST (the DBANK bit is set, indicating a single bank):

Instruction prefetch: In case of single bank mode (nDBANK option bit is set), 256 bits representing 8 instructions of 32 bits to 16 instructions of 16 bits, according to the program launched. So, in the case of sequential code, at least 8 CPU cycles are needed to execute the previous instruction line read.

So ST is going to turn that into one 256-bit fetch, i.e. up to 16 (16-bit) instructions.
Using the systick timer; running at 16MHz, so flash is at zero wait states.
08000140 <inner>:
8000140: 46c0 nop ; (mov r8, r8)
8000142: 46c0 nop ; (mov r8, r8)
8000144: 46c0 nop ; (mov r8, r8)
8000146: 46c0 nop ; (mov r8, r8)
8000148: 46c0 nop ; (mov r8, r8)
800014a: 46c0 nop ; (mov r8, r8)
800014c: 3901 subs r1, #1
800014e: d1f7 bne.n 8000140 <inner>
00120002
So 12 clocks per loop: two prefetches from ARM, where the first becomes a single ST fetch. Should be zero wait states. Note the address: this is AXIM.
If I reduce the number of nops it stays at 0x1200xx until here:
08000140 <inner>:
8000140: 46c0 nop ; (mov r8, r8)
8000142: 46c0 nop ; (mov r8, r8)
8000144: 3901 subs r1, #1
8000146: d1fb bne.n 8000140 <inner>
00060003
One ARM fetch instead of two; time cut in half, so the prefetch is dominating our performance.
08000140 <inner>:
8000140: 46c0 nop ; (mov r8, r8)
8000142: 46c0 nop ; (mov r8, r8)
8000144: 46c0 nop ; (mov r8, r8)
8000146: 46c0 nop ; (mov r8, r8)
8000148: 3901 subs r1, #1
800014a: d1f9 bne.n 8000140 <inner>
000 (zero wait states):           00120002
001 (1 wait state):               00140002
002 (2 wait states):              00160002
202 (2 wait states, ART enabled): 0015FFF3
Why would that affect AXIM? Each wait state adds 2 clocks per loop, and there are two fetches per loop, so perhaps each fetch causes ST to do one of its 256-bit fetches; that seems broken, though.
Switch to ITCM
00200140 <inner>:
200140: 46c0 nop ; (mov r8, r8)
200142: 46c0 nop ; (mov r8, r8)
200144: 46c0 nop ; (mov r8, r8)
200146: 46c0 nop ; (mov r8, r8)
200148: 3901 subs r1, #1
20014a: d1f9 bne.n 200140 <inner>
000: 00070004
001: 00080003
002: 00090003
202: 00070004
ram: 00070003
So ITCM alone, zero wait states, ART off, is 7 clocks per loop for a 6-instruction loop with a branch; seems reasonable. For this tiny test, turning on ART with 2 wait states puts us back at 7 per loop. Note that from RAM this code runs at 7 per loop as well. Let's try another couple:
00F (15 wait states):              00230007
20F (15 wait states, ART enabled): 00070004
I didn't look for branch predictors other than the BTAC.
First thing to note: you don't want to ever run an MCU faster than you have to. It burns power; often you need to add flash wait states; and often the CPU and peripherals have different max clock speeds, so there is a boundary where scaling becomes non-linear (something takes X clock cycles at a slow clock rate with peripheral clock = CPU clock; up to a point, N times faster still costs NX clocks, but past one or more boundaries it takes more than NX clocks when the CPU clock is N times faster). This particular part has this non-linear issue. If you are using libraries from ST to set the clock, you are possibly getting worst-case flash wait states, whereas if you set it up yourself and read the documentation you might be able to shave one or two/a few.
The Cortex-M7 has optional L1 caches. I didn't mess with them this time around, but ST had this ART thing before those came out, and I believe they defeat/disable at least the I-cache; would it make it better or worse to have both? If the part has the cache, that would make the first pass slow and the remaining passes possibly faster, even in AXIM space. You are welcome to try it. I seem to remember they did something tricky with a strap on the processor core; it wasn't easy to see how it was defeated, and that may not have been this chip/core, but it was definitely ST. The M4 doesn't have a cache, so it would have to have been an M7 that I messed with (this one in particular).
So the short answer is that the performance isn't that horrible if you leave off the ART and/or run out of AXIM. ST has implemented the flash such that the ITCM interface is faster than AXIM. We can see the effects of the ARM's own fetch behaviour, and if you enable branch prediction you can see that as well.
It shouldn't be difficult to create a benchmark that defeats these features, just like you can make one that makes the L1 caches (or any other cache) hurt performance. The ART, like any other cache, makes performance less predictable; as you change your code (add a line, remove a line), the performance can jump anywhere from no change to a lot.
Depending on the processor and fetch sizes and alignments your code performance can vary by adding or removing code above the performance-sensitive part(s) of the project, but that depends on some factors that we rarely have visibility into.
Hard to tell, but it looks like they are claiming that ART reduces power. I would expect it to increase power, with those SRAMs on/clocked. I don't see an obvious statement of how much you save if you turn off the flash and run from RAM. The M7 parts are not really meant to be low-power parts like some STM32L parts, where you can get to ones/tens of microamps (micro, not milli; been there, done that).
The small number of extra clocks (0x70004 instead of 0x70000) has to do with some of the fetching overhead, be it ARM or ST or a combination of the two. To see memory/flash performance you need to disable as many features as you can: branch prediction, any caches that can be disabled, etc. Otherwise it's hard to measure performance and then make assumptions about what the flash/memory/bus is doing. I suspect there are still things I didn't turn off to make a clean measurement, and/or can't turn off. And simple nop loops won't tell you everything (I tried other, non-nop instructions; it didn't change anything). Using the docs as a guide, you can try to cache-thrash the ART or other features and see what kind of hit that takes.
For performance-critical code you can run from RAM and avoid all of these issues, I didn't search for it but assume that these parts SRAM can run as fast as the CPU. The answer isn't jumping out at me, you can figure it out.
Note my test actually looks like
ldr r2,[r0]
inner:
nop
nop
nop
nop
sub r1,#1
bne inner
ldr r3,[r0]
sub r0,r2,r3
bx lr
where the sampling of systick happens just in front of and in back of this code, before the branch. To measure the ART you would want to sample the time before branching into a memory range that has not yet been read; it is not magically possible to read it faster the first time, as the first read into the cache should be slower. If I move the time sampling further away, I can see it go from 0x7000A to 0x70027 for 0 to 15 wait states with ART on. That is a noticeable performance hit for branches into code that has not been run/cached yet. Knowing the size of the ART fetches, it should be easy to make a test that hops around a lot so that the ART feature starts to not matter.
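In C the sampling would look roughly like this (a sketch; SYST_CVR is the SysTick current-value register at 0xE000E018, a 24-bit down-counter, and inner_loop stands for the nop/subs/bne loop above):

#include <stdint.h>

#define SYST_CVR (*(volatile uint32_t *)0xE000E018)  // SysTick current value, counts down

extern void inner_loop(uint32_t count);  // the asm loop under test

uint32_t time_inner(uint32_t count)
{
    uint32_t t0 = SYST_CVR;
    inner_loop(count);
    uint32_t t1 = SYST_CVR;
    return (t0 - t1) & 0x00FFFFFF;       // down-counter: elapsed = start - end, mod 2^24
}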
Short answer: the ITCM is a different bus interface on the ARM core, and ST has implemented their design such that there is a performance gain. So even without ART enabled, using ITCM is faster than AXIM (likely an ARM bus thing, not an ST flash thing). If you are running at clock rates fast enough to have to add wait states to the flash, then ART can mostly erase those.
I THINK that the question is much simpler than what the other answers assume.
If you are thinking about doing stuff like putting your program somewhere other than simply in flash: don't. As ST says, with ART the performance will be very close to "zero wait state". So don't worry about it. Anything else you try is not going to be faster than that.
Q. If I place my program code starting at 0x0200 0000, what would happen if the ART accelerator is not enabled?
A. Program execution (instruction fetch and constant access) will be painfully slow, with a crazy number of wait cycles (15?).
[ UPD. I have to correct that this applies more to configurations with high clock frequency, e.g. 15 wait states are needed for 216 MHz. With lower frequencies, the flash access penalty will be less significant, and minimal at 16 MHz. We do not know what frequency is used by O.P. ]
[ UPD2. At most 9 wait states are needed at 216 MHz, sorry. ]
Q. Which bus is preferable for flash code access, AXI or ITCM?
A. The voluminous document you referred to includes some performance measurements that also compare various code placement options. The results differ somewhat between processor models because cache sizes and bus widths are different, and your code will likely be affected differently again. My takeaway from this paper is that unless your code is performance critical, both options work reasonably well. However, having two parallel buses with caches enables you to do creative things like partitioning your code into pieces and allocating them to separate buses, so that critical but rarely used code is not evicted from the cache. I mean, if you really need that.

How to calculate the LR value when ARMv7-A architecture implementation takes an IRQ exception

I'm researching the Arm Architecture Reference Manual, ARMv7-A and ARMv7-R edition, these days. While reading the exception handling part of the manual, I came across a confusion about how to decide the LR value when an ARMv7-A implementation takes an IRQ exception.
EXAMPLE: Suppose that the processor is executing an instruction at address 0x0000_1000 and an IRQ is taken. First, we have to calculate some parameters to be used to calculate the LR.
Preferred return address, which in this case is the address of the next instruction to execute. So preferred return address = 0x0000_1002 in Thumb state, or 0x0000_1004 in ARM state. (See "preferred return address for the exception".)
PC, the program counter, which holds the current program address. In this case PC = 0x0000_1004 in Thumb state, or PC = 0x0000_1008 in ARM state. (See "how to calculate PC".)
Then there are 2 methods mentioned in the document to decide the LR value when taking this IRQ exception:
By using the preferred return address: LR = preferred return address + an offset that depends on the instruction set state when the exception was taken. In this case LR = 0x0000_1002 + 4 in Thumb state, or LR = 0x0000_1004 + 4 in ARM state. (See "Offsets applied to Link value for exceptions taken to PL1 modes".)
By using the PC: LR = PC - 0 in Thumb state, or LR = PC - 4 in ARM state. In this case LR = 0x0000_1004 - 0 in Thumb state, or LR = 0x0000_1008 - 4 in ARM state. (See the pseudocode description of taking the IRQ exception.)
Problem: the LR results calculated by the 2 methods differ in both Thumb and ARM state (with the first method we get LR = 0x0000_1006 or LR = 0x0000_1008, but with the second we get LR = 0x0000_1004 in both). Which one is correct, or is there something wrong with my understanding?
TL;DR - the IRQ LR will point at the next instruction that must run for the code to work as it normally would without an interrupt. Otherwise, code would not execute the same in the presence of interrupts.
It is confusing as the ARM documents may refer to PC in many different contexts and they are not the same.
EXAMPLE: Suppose that the processor is executing an instruction at address 0x0000_1000 and an IRQ is taken. First, we have to calculate some parameters to be used to calculate the LR.
Preferred return address, which in this case is the address of the next instruction to execute. So preferred return address = 0x0000_1002 in Thumb state, or 0x0000_1004 in ARM state. (See "preferred return address for the exception".)
This is not correct. The ARM CPU has a pipeline, and the next instruction is whatever follows the last instruction the CPU has deemed to have completed; it is not always the sequentially next address. Take for example this sequence,
0: cmp r1, #42
1: bne 4f ; interrupt happens as this completes.
2: add r2, r2, #4
3: b 5f
4: sub r2, r2, #2
5: ; more code.
If the interrupt happens as the instruction at label '1:' completes, the next instruction will be either '2:' or '4:'. If you followed your rule, this would either increase interrupt latency by never allowing an interrupt in such cases, or interrupts would cause incorrect code. Specifically, your link says next instruction to execute.
PC, the program counter, which holds the current program address. In this case PC = 0x0000_1004 in Thumb state, or PC = 0x0000_1008 in ARM state. (See "how to calculate PC".)
Here you are mixing concepts. One is when you use a value like ldr r0, [pc, #42]: when you calculate the offset, you must add two instruction widths to the address of the ldr instruction itself. The actual PC is not necessarily this value. In the original version, the ARM had a short pipeline whose fetch ran two instructions ahead of execute; in order to keep behaviour the same, subsequent ARM CPUs follow the rule of being two ahead when calculating ldr r0, [pc, #42]-type addresses. However, the actual PC inside the CPU may be quite different. The concept above describes the programmer-visible PC for use with addressing.
The CPU will make a decision, sometimes based on configuration, about what work to complete. For instance, ldm sp!, {r0-r12} may take some time to complete. The CPU may decide to abort this instruction to keep interrupt latency low; alternatively, it may perform the 13 memory reads, which could have wait states. The LR_irq will be set to the ldm instruction or to the next instruction, depending on whether it was aborted or not.

Atomic access to ARM peripheral registers

I want to use the overflow, compare match and capture functionality of a general-purpose timer on an STM32F103REY Cortex-M3 at the same time. CC1 is configured as compare match and CC3 is configured as capture. The IRQ handler looks as follows:
void TIM3_IRQHandler(void)
{
    if (TIM3->SR & TIM_SR_UIF) {
        TIM3->SR &= ~TIM_SR_UIF;
        // do something on overflow
    }
    if (TIM3->SR & TIM_SR_CC1IF) {
        TIM3->SR &= ~TIM_SR_CC1IF;
        // do something on compare match
    }
    if (TIM3->SR & TIM_SR_CC3IF) {
        TIM3->SR &= ~TIM_SR_CC3IF;
        // do something on capture
    }
}
In principle it works well, but it sometimes seems to skip a part. My theory is that this happens because the operation that resets the IRQ flags, e.g. TIM3->SR &= ~TIM_SR_UIF, is not atomic*, so a TIM_SR_CC1IF that arrives between the load and the store can be overwritten.
* The disassembly of the instruction is as follows
8012e02: 8a13 ldrh r3, [r2, #16]
8012e06: f023 0301 bic.w r3, r3, #1
8012e0a: 041b lsls r3, r3, #16
8012e0c: 0c1b lsrs r3, r3, #16
8012e0e: 8213 strh r3, [r2, #16]
Is this plausible? Can the content of the TIM3->SR register change during the execution of the IRQ handler?
Is there a possibility to do an atomic read and write to the TIM3->SR register?
Is there another suitable solution?
By the way: there is a similar question, but that one is about protecting access by multiple processes or cores, not about protecting against simultaneous access by software and hardware.
Section 15.4.5 of the reference manual (CD00171190) states that all bits in TIMx->SR work in rc_w0 mode (or are reserved).
According to the programming manual (PM0056):
read/clear (rc_w0): Software can read as well as clear this bit by writing 0. Writing ‘1’ has no effect on the bit value.
This means that you can simplify your code to entirely avoid the read-modify-write cycle and instead just write TIM3->SR = ~TIM_SR_UIF.
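Applied to the handler from the question, each clear then becomes a single store; a sketch (because of rc_w0, the bits written as 1 are left untouched, so only the intended flag is cleared):

void TIM3_IRQHandler(void)
{
    if (TIM3->SR & TIM_SR_UIF) {
        TIM3->SR = ~TIM_SR_UIF;      // write 0 only to UIF; writing 1 elsewhere has no effect
        // do something on overflow
    }
    if (TIM3->SR & TIM_SR_CC1IF) {
        TIM3->SR = ~TIM_SR_CC1IF;
        // do something on compare match
    }
    if (TIM3->SR & TIM_SR_CC3IF) {
        TIM3->SR = ~TIM_SR_CC3IF;
        // do something on capture
    }
}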
Many application notes use a read-modify-write to clear interrupts, such as examples by Keil, but this is unnecessary and potentially dangerous, as you have experienced.
In the ST application note DM00236305 (section 1.3.2), only a write operation is used.
Note, however, that when working with the NVIC, the register used for resetting is rc_w1.

Temporarily disable interrupts on ARM

I am starting to work with the ARM platform (specifically the TI TMS570 family).
I have some code with critical regions where I don't want an exception to occur, so I want to save the IRQ and FIQ enable flags on entering the regions and restore them on exiting.
How do I do that?
To temporarily mask IRQs and FIQs at the CPU, the nicest option for ARMv7 is to use cps:
// assembly code assuming interrupts unmasked on entry
cpsid if // mask IRQ and FIQ
... // do critical stuff
cpsie if // unmask
Some compilers provide a set of __disable_irq() etc. intrinsics usable from C code, but for others (like GCC) it's going to be a case of dropping to assembly.
If you want critical sections to be nested, reentrant, taken in interrupt handlers, or anything else which requires restoring the previous state as opposed to just unconditionally unmasking at the end, then you'll need to copy that state out of the CPSR before masking anything, then restore it on exit. At that point the unmasking probably ends up simpler to handle the old-fashioned way, with a direct read-modify-write of the CPSR. Here's one idea off the top of my head:
// int enter_critical_section(void);
enter_critical_section:
        mrs     r0, cpsr
        cpsid   if
        and     r0, r0, #0xc0    // leave just the I and F flags
        bx      lr

// void leave_critical_section(int flags);
leave_critical_section:
        mrs     r1, cpsr
        bic     r1, r1, #0xc0    // clear the current I and F flags...
        orr     r1, r1, r0       // ...then restore the saved state
        msr     cpsr_c, r1
        bx      lr
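For GCC, the same pair can be wrapped as C functions with inline assembly; a sketch for ARMv7-A/R (the "memory" clobbers double as compiler barriers, so memory accesses don't get hoisted out of the critical section):

static inline unsigned int irq_save(void)
{
    unsigned int flags;
    __asm volatile("mrs %0, cpsr\n\t"
                   "cpsid if"
                   : "=r"(flags) : : "memory");
    return flags & 0xC0;                 // keep just the I and F bits
}

static inline void irq_restore(unsigned int flags)
{
    unsigned int cpsr;
    __asm volatile("mrs %0, cpsr" : "=r"(cpsr));
    cpsr = (cpsr & ~0xC0u) | flags;      // put the saved I/F state back
    __asm volatile("msr cpsr_c, %0" : : "r"(cpsr) : "memory");
}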
You can use _disable_interrupt_(); and _enable_interrupt_(); from HALCoGen-generated code (sys_core.h).

Why do we require two memory barriers in a postbox data communication between two cores?

Here we have postbox code for data communication between two ARM cores (referred directly from the ARM Cortex-A Series Programming Guide).
Core A:
STR R0, [Msg] # write some new data into postbox
STR R1, [Flag] # new data is ready to read
Core B:
Poll_loop:
LDR R1, [Flag]
CMP R1,#0 # is the flag set yet?
BEQ Poll_loop
LDR R0, [Msg] # read new data.
In order to enforce the dependency, the document says that we need to insert not one but two memory barriers, DMB, into the code.
Core A:
STR R0, [Msg] # write some new data into postbox
DMB
STR R1, [Flag] # new data is ready to read
Core B:
Poll_loop:
LDR R1, [Flag]
CMP R1,#0 # is the flag set yet?
BEQ Poll_loop
DMB
LDR R0, [Msg] # read new data.
I understand the first DMB in Core A: it prevents reordering and ensures that the memory access to the [Msg] variable is observed by the system before the access to [Flag]. Below is the definition of the DMB from the same document:
Data Memory Barrier (DMB)
This instruction ensures that all memory accesses in program order before the barrier are observed in the system before any explicit memory accesses that appear in program order after the barrier. It does not affect the ordering of any other instructions executing on the core, or of instruction fetches.
However, I am not sure why the DMB in Core B is used. In the document it says:
Core B requires a DMB before the LDR R0, [Msg] to be sure that the message is not read until the flag is set.
If the DMB in Core A makes the store to [Msg] be observed by the system, then we should not need the DMB in the second core. My guess is that the compiler might reorder the reads of [Flag] and [Msg] in Core B (though I do not understand why it would, since the read of [Msg] depends on [Flag]).
If this is the case, a compiler barrier (asm volatile("" ::: "memory")) instead of a DMB should be enough. Am I missing something here?
Both barriers are necessary, and they do need to be dmbs - this is still about the hardware memory model, and has nothing to do with compiler reordering.
Let's look at the writer on core A first:
STR R0, [Msg] # write some new data into postbox
STR R1, [Flag] # new data is ready to read
Since these are two independent stores to different addresses with no dependency between them, there is nothing to force core A to actually issue the stores in program order. The store to Msg could, say, linger in a part-filled write buffer whilst the store to Flag overtakes it and goes straight out to the memory system. Thus any observer other than core A could see the new value of Flag, without yet seeing the new value of Msg.
STR R0, [Msg] # write some new data into postbox
DMB
STR R1, [Flag] # new data is ready to read
Now, with the barrier, the store to Flag is not permitted to be visible before the store to Msg, because that would necessitate one or other store appearing to cross the barrier. Thus any external observer may either see both old values, the new Msg but the old Flag, or both new values. The previous case of seeing the new Flag but the old Msg can no longer occur.
OK, so the first barrier handles things getting written in the correct order, but there's also the matter of how they are read. Over on core B...
Poll_loop:
LDR R1, [Flag]
CMP R1,#0 # is the flag set yet?
BEQ Poll_loop
LDR R0, [Msg] # read new data.
Note that the branch to Poll_loop does not form a control dependency between the two loads; if you consider program order, the load of Msg is unconditional, and the value of Flag does not affect whether it is executed or not, only whether execution ever progresses to that part of the program at all. Therefore the code could equivalently be written thus:
Poll_loop:
LDR R1, [Flag]
LDR R0, [Msg] # read data, just in case.
CMP R1,#0 # is the flag set yet?
BEQ Poll_loop # no? OK, throw away that data and read everything again.
... # do stuff with R0, because Flag was set so it must be good data, right?
Start to see the problem? Even with the original code, core B is free to speculatively load Msg as soon as it reaches Poll_loop, so even if the stores from core A become visible in program order, things could still play out like this:
core A | core B
-----------+-----------
| load Msg
store Msg |
store Flag |
| load Flag
| conclude that old Msg is valid
Thus you either need a barrier:
...
BEQ Poll_loop
DMB
LDR R0, [Msg] # read new data.
or perhaps a fake address dependency:
...
BEQ Poll_loop
EOR R1, R1, R1
LDR R0, [Msg, R1] # read new data.
to order the two loads against each other.
First, you're kind of mixing up compiler barriers and memory barriers. Compiler barriers prevent the compiler from moving instructions across the barrier in the final assembly. Memory barriers, on the other hand, instruct the hardware to obey a certain ordering. Since you're already presenting assembly code, your question is really about hardware memory barriers, and there is no compiler involved here.
The reason why you need a (read) memory barrier in Core B is that the core may reorder the message-reading instruction wherever it wants, since there is no data dependency between the read of the flag and the read of the message, at least not in the code above: the only information needed for reading Msg is its address, and that is known at every point in time. You might want to argue that there is a control dependency. However, control dependencies do not impose any ordering constraints on memory reads.
