AArch64: Load-Acquire Exclusive vs Load Exclusive - arm

What is the difference between the LDAXR & LDXR instructions in the AArch64 instruction set?
From the reference manual they look totally the same (except for the word 'acquire'):
LDAXR - Load-Acquire Exclusive Register: loads a word from memory addressed by base to Wt. Records the physical address as an exclusive access.
LDXR - Load Exclusive Register: loads a word from memory addressed by base to Wt. Records the physical address as an exclusive access.
Thanks

In the simplest form, LDAXR == LDXR + DMB SY (a barrier after the load). The dedicated acquire semantics are one-directional, though, so LDAXR can be cheaper than a load followed by a full barrier.
This is the description which I find for LDAXR:
C6.2.104 LDAXR
Load-Acquire Exclusive Register derives an address from a base
register value, loads a 32-bit word or 64-bit doubleword from memory,
and writes it to a register. The memory access is atomic. The PE marks
the physical address being accessed as an exclusive access. This
exclusive access mark is checked by Store Exclusive instructions. See
Synchronization and semaphores on page B2-135. The instruction also
has memory ordering semantics as described in Load-Acquire,
Load-AcquirePC, and Store-Release on page B2-108. For information
about memory accesses see Load/Store addressing modes on page C1-157.
From section K11.3 of DDI0487 Da
The ARMv8 architecture adds the acquire and release semantics to
Load-Exclusive and Store-Exclusive instructions, which allows them to
gain ordering acquire and/or release semantics. The Load-Exclusive
instruction can be specified to have acquire semantics, and the
Store-Exclusive instruction can be specified to have release
semantics. These can be arbitrarily combined to allow the atomic
update created by a successful Load-Exclusive and Store-Exclusive pair
to have any of:
No Ordering semantics (using LDREX and STREX).
Acquire only semantics (using LDAEX and STREX).
Release only semantics (using LDREX and STLEX).
Sequentially consistent semantics (using LDAEX and STLEX).
Also (B2.3.5),
The basic principle of a Load-Acquire instruction is to introduce
order between the memory access generated by the Load-Acquire
instruction and the memory accesses appearing in program order after
the Load-Acquire instruction, such that the memory access generated by
the Load-Acquire instruction is Observed-by each PE, to the extent
that that PE is required to observe the access coherently, before any
of the memory accesses appearing in program order after the
Load-Acquire instruction are Observed-by that PE, to the extent that
the PE is required to observe the accesses coherently.

Related

Do I need to use smp_mb() after binding the CPU

Suppose my system is a multicore system. If I bind my program to one CPU core, do I still need smp_mb() to guarantee that the CPU won't reorder the instructions?
I ask because I know that smp_mb() is unnecessary on single-core systems, but I'm not sure that reasoning carries over to this case.
You rarely need a full barrier anyway, usually acquire/release is enough. And usually you want to use C11 atomic_load_explicit(&var, memory_order_acquire), or in Linux kernel code, use one of its functions for an acquire-load, which can be done more efficiently on some ISAs than a plain load and an acquire barrier. (Notably AArch64 or 32-bit ARMv8 with ldar or ldapr)
But yeah, if all threads are sharing the same logical core, run-time memory reordering is impossible, only compile-time. So you just need a compiler memory barrier like asm("" ::: "memory") or C11 atomic_signal_fence(seq_cst), not a CPU run-time barrier like atomic_thread_fence(seq_cst) or the Linux kernel's SMP memory barrier (smp_mb() is x86 mfence or equivalent, or ARM dmb ish, for example).
See Why memory reordering is not a problem on single core/processor machines? for more details about the fact that all instructions on the same core observe memory effects to have happened in program order, regardless of interrupts. e.g. a later load must see the value from an earlier store, otherwise the CPU is not maintaining the illusion of instructions on that core running in program order.
And if you can convince your compiler to emit atomic RMW instructions without the x86 lock prefix, for example, they'll be atomic wrt. context switches (and interrupts in general). Or use gcc -Wa,-momit-lock-prefix=yes to have GAS remove lock prefixes for you, so you can use <stdatomic.h> functions efficiently. At least on x86; for RISC ISAs, there's no way to do a read-modify-write of a memory location in a single instruction.
Or if there is (ARMv8.1), it implies an atomic RMW that's SMP-safe, like x86 lock add [mem], eax. But on a CISC like x86, we have instructions like add [mem], eax or whatever which are just like separate load / ADD / store glued into a single instruction, which either executes fully or not at all before an interrupt. (Note that "executing" a store just means writing into the store buffer, not globally visible cache, but that's sufficient for later code on the same core to see it.)
See also Is x86 CMPXCHG atomic, if so why does it need LOCK? for more about non-locked use-cases.

When is CLREX actually needed on ARM Cortex M7?

I found a couple of places online which state that CLREX "must" be called whenever an interrupt routine is entered, which I don't understand. The docs for CLREX state (added the numbering for easier reference):
(1) Clears the local record of the executing processor that an address has had a request for an exclusive access.
(2) Use the CLREX instruction to return a closely-coupled exclusive access monitor to its open-access state. This removes the requirement for a dummy store to memory.
(3) It is implementation-defined whether CLREX also clears the global record of the executing processor that an address has had a request for an exclusive access.
I don't understand pretty much anything here.
I had the impression that writing something along the lines of the example in the docs was enough to guarantee atomicity:
MOV     r1, #0x1                 ; load the 'lock taken' value
try:
LDREX   r0, [LockAddr]           ; load the lock value
CMP     r0, #0                   ; is the lock free?
STREXEQ r0, r1, [LockAddr]       ; try and claim the lock
CMPEQ   r0, #0                   ; did this succeed?
BNE     try                      ; no - try again
....                             ; yes - we have the lock
Why should the "local record" need to be cleared? I thought that LDREX/STREX are enough to guarantee atomic access to an address from several interrupts? For example, GCC for ARM compiles all C11 atomic functions using LDREX/STREX and I don't see CLREX being called anywhere.
What "requirement for a dummy store" is the second paragraph referring to?
What is the difference between the global record and a local record? Is global record needed for multi-core scenarios?
Taking (and paraphrasing) your three questions separately:
1. Why clear the access record?
When strict nesting of code is enforced, such as when you're working with interrupts, then CLREX is not usually required. However, there are cases where it's important. Imagine you're writing a context switch for a preemptive operating system kernel, which can asynchronously suspend a running task and resume another. Now consider the following pathological situation, involving two tasks of equal priority (A and B) manipulating the same shared resource using LDREX and STREX:
Task A Task B
...
LDREX
-------------------- context switch
LDREX
STREX (succeeds)
...
LDREX
-------------------- context switch
STREX (succeeds, and should not)
...
Therefore the context switch must issue a CLREX to avoid this.
2. What 'requirement for a dummy store' is avoided?
If there wasn't a CLREX instruction then it would be necessary to use a STREX to relinquish the exclusive-access flag, which involves a memory transaction and is therefore slower than it needs to be if all you want to do is clear the flag.
3. Is the 'global record' for multi-core scenarios?
Yes, if you're using a single-core machine, there's only one record because there's only one CPU.
Actually CLREX isn't needed for exceptions/interrupts on the M7; it appears to be included only for compatibility reasons. From the documentation (version c):
CLREX enables compatibility with other ARM Cortex processors that have
to force the failure of the store exclusive if the exception occurs
between a load exclusive instruction and the matching store exclusive
instruction in a synchronization operation. In Cortex-M processors,
the local exclusive access monitor clears automatically on an
exception boundary, so exception handlers using CLREX are optional.
So, since Cortex-M processors clear the local exclusive access flag on exception/interrupt entry/exit, this negates most (all?) of the use cases for CLREX.
With regard to your third question, as others have mentioned you are correct in thinking that the global record is used in multi-core scenarios. There may still be use cases for CLREX on multi-core processors depending on the implementation defined effects on local/global flags.
I can see why there is confusion around this, as the initial version of the M7 documentation doesn't include these sentences (not to mention the various other versions of more generic documentation on the ARM website). Even now, I cannot even link to the latest revision. The page displays 'Version a' by default and you have to manually change the version via a drop down box (hopefully this will change in future).
Update
In response to comments, an additional documentation link for this. This is the part of the manual that describes the usage of these instructions outside of the specific instruction documentation (and also has been there since the first revision):
The processor removes its exclusive access tag if:
It executes a CLREX instruction.
It executes a STREX instruction, regardless of whether the write succeeds.
An exception occurs. This means the processor can resolve semaphore conflicts between different threads.
In a multiprocessor implementation:
Executing a CLREX instruction removes only the local exclusive access tag for the processor.
Executing a STREX instruction, or an exception, removes the local exclusive access tags for the processor.
Executing a STREX instruction to a Shareable memory region can also remove the global exclusive access tags for the processor in the
system.

Override default memory access behaviour in ARM Cortex-M3

According to ARM, the default behaviour of Cortex-M3 is to prevent execution from certain memory regions.
Information here:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0552a/CIHDHAEF.html
According to the above information page: "The optional MPU can override the default memory access behavior".
That is all good, because we would like to execute code from the implementation specific 0xF0000000 region, which by default has the XN "Execute Never" flag set.
We are able to program the MPU to put additional restrictions on a memory region, so clearly the MPU works. But if we set the MPU to allow execution in the 0xF0000000 region, the CPU still enters exception when we try to execute at 0xF0000000.
Does anyone know if the Cortex-M3 MPU is supposed to be able to lift a default restriction, as the ARM page suggests?
Although perhaps not clearly stated in the ARM documentation, it seems that the default MPU configuration is already the least restrictive possible, in order that a device with an MPU behaves identically to one without by default. So it makes sense that you cannot remove these restrictions.
The Memory access behaviour table shows the 0xE0100000-0xFFFFFFFF region as a "Device" region rather than a memory region. The behaviour of the processor for device and normal regions is described at Memory regions, types and attributes. The requirement for a region with the device attribute to preserve access order would require the processor to handle such memory differently when executing code, making the processor more complex. Execution from such memory would also be less efficient.
Essentially, if the intent is to support execution from a memory, then it must be mapped to a memory region rather than a device region.
Note that in the Cortex-R4 documentation the restriction is clearly stated:
Instructions cannot be executed from regions with Device or Strongly-ordered memory type attributes. The processor treats such regions as if they have XN permissions.
I cannot however find a similarly unambiguous statement for M3.

Arm cortex a9 memory access

I want to know the sequence in which an ARM core (Cortex-A series processor) accesses memory, right from the virtual address generated by the core to the instruction/data transferred from memory back to the core. Suppose the core has generated a virtual address for some data/instruction and there is a miss in the TLBs; how does the address reach main memory (DRAM, if I am not wrong), and how does the data come back to the core through the L2 and L1 caches?
What if required data/instruction is already in L1 cache?
What if required data/instruction is already in L2 cache?
I am confused regarding cache and MMU communications.
tl;dr - Whatever you want. The ARM is highly flexible and the SOC vendor and/or the system programmer may make the memory sub-systems do a great many different things depending on the end device features and needs.
First, the MMU has fields that explicitly dictate how the cache is to be used. I recommend reading Chapter 9 Caches and Chapter 10 Memory Management Unit of the Cortex-A Series Programmers Guide.
Some terms are,
PoC - point of coherency.
PoU - point of unification.
Strongly ordered.
Device
Normal
Many MMU properties and caching behaviours can be affected by different CP15 and configuration registers. For instance, an 'exclusive' configuration (where data in the L1 cache is never in the L2) can make it particularly difficult to cleanly write self-modifying code and other dynamic updates. So, even for a particular Cortex-A model, the system configuration may change things (write-back/write-through, write-allocate/no write-allocate, bufferable, non-cacheable, etc).
A typical sequence for general DDR core memory is,
1. Resolve virt -> phys:
   a. Micro TLB hit? Yes: have `phys`.
   b. Main TLB hit? Yes: have `phys`.
   c. Table walk: have `phys` or fault.
2. Access marked cacheable? Yes: go to step 3. No: go to step 5.
3. In L1 cache? If a read, return data. If a write, fill data and mark dirty (write-back).
4. In L2 cache? If a read, return data. If a write, fill data and mark dirty (write-back).
5. Run a physical cycle on the AXI bus (may route to a sub-bus).
What if required data/instruction is already in L1 cache?
What if required data/instruction is already in L2 cache?
For normal cases these are just cache hits. If it is 'write-through' and a write, then the value is updated in cache and written to memory. If it is 'write-back', the value is updated in cache and marked dirty.Note1 If it is a read, then the cache memory is used (in both cases).
The system may be set up completely differently for device memory (i.e., memory-mapped USB registers, world-shareable memory, multi-core/CPU buffers, etc). Often the setup will depend on system cost, performance, and power consumption. E.g., a write-through cache is easier to implement (lower power and less cost) but often lower performance.
I am confused regarding cache and MMU communications.
Mainly, the MMU provides information for the caches to resolve an address. The MMU may say to use or not use the cache. It may tell the cache it can 'gang' writes together (write-bufferable) but should not store them indefinitely, etc. So many of the MMU specifiers can selectively alter the behavior of the cache. As the Cortex-A cache parameters are not defined (they are up to each SOC manufacturer), it is often the case that particular MMU bits may have alternate behavior on different systems.
Note1: The 'dirty cache' may have additional 'broadcasts' of exclusion monitor information for strex and ldrex type accesses.

ARM: Is LDRX/STRX needed if interrupts are disabled?

I am working with a multithreaded bare-metal C/Assembler application on a Cortex-A9.
I have some shared variables, i.e. addresses that are used from more than one thread. To perform an atomic exchange of a variable's value I use LDREX and STREX. Now my question is whether I need LDREX and STREX on every access to one of these variables, even if interrupts are disabled.
Assume the following example:
Thread 1 uses LDRX and STRX to exchange the value of address a.
Thread 2 disables interrupts, uses normal LDR and STR to exchange the value of address a, does something else that should not be interrupted and then enables interrupts again.
What happens if Thread 1 gets interrupted right after the LDRX by Thread 2? Does the STRX in Thread 1 still recognize, that there was an access on address a or do I have to use LDRX and STRX in Thread 2, too?
LDREX/STREX are something that has to be implemented by the chip vendor, hopefully to ARM's specification. You can and should get the ARM documentation on the topic; in this case, in addition to the ARM ARMs and TRMs, you should get the AMBA-AXI documentation.
So if you have
ldrex thread 1
interrupt
ldrex thread 2
strex thread 2
return from interrupt
strex thread 1
Between the thread 2 ldrex and strex there has been no modification of that memory location, so the strex should work. But between the thread 1 strex and the prior ldrex there has been a modification to that location, the thread 2 strex. So in theory that means the thread 1 strex should fail and you have to try your thread 1 ldrex/strex pair again until it works. But that is exactly by design, you keep trying the ldrex/strex pair in a loop until it succeeds.
But this is all implementation defined, so you have to look at the specific chip vendor, model, and revision, and do your own experiments. The bug in Linux, for example, is that the ldrex/strex retry is an infinite loop; apply it to a system/situation where ldrex/strex is not supported and you get an OKAY response instead of an EXOKAY, the strex fails forever, and you are stuck in that infinite loop (ever wonder how I know all of this? I had to debug this problem at the logic level).
First off, ARM documents that exclusive access support is not required for uniprocessor systems, so the ldrex/strex pair CAN fail to work IF you touch vendor-specific logic on single-core systems. Uniprocessor or not, if your ldrex/strex remains within the ARM logic (L1 and optional L2 caches) then the pair is governed by ARM and not the chip vendor, so you fall under one set of rules; if the pair touches system memory outside the ARM core, then you fall under the vendor's rules.
The big problem is that ARM's documentation is unusually incomplete on the topic. Depending on which manual, and where in that manual you read, it says for example that the monitor is cleared if some OTHER master has modified that location; in your case it is the same master, so the location has been modified, but since it was by you, the second strex should succeed. Then the same document says that another exclusive read resets the monitor to a different address; well, what if it is another exclusive read of the same address?
Basically yours is a question of two exclusive writes to the same address without an exclusive read in between: does/should the second succeed? A very good question... I can't see that there is a definitive answer, either within all the ARM cores or in the whole world of ARM-based chips.
The bottom line with ldrex/strex is that it is not completely ARM-core specific but also chip (vendor) specific. You need to do experiments to ensure you can use that instruction pair on that system (uniprocessor or not). You need to know what the ARM core does (the caches) and what happens when that exclusive access goes out past the core to the vendor logic. Repeat for every core and vendor you care to port this code to.
Apologies for just throwing in an "it's wrong" statement to dwelch, but I did not have time to write a proper answer yesterday. dwelch's answer to your question is correct - but pieces of it are at the very least possible to misinterpret.
The short answer is that, yes, you need to either disable interrupts for both threads or use ldrex/strex for both threads.
But to set one thing straight: support for ldrex/strex is mandatory in all ARM processors of v6 or later (with the exception of v6M microcontrollers). Support for SWP however, is optional for certain ARMv7 processors.
The behaviour of ldrex/strex is dependent on whether your MMU is enabled and what memory type and attributes the accessed region is configured with. Certain possible configurations will require additional support to be added to either the interconnect or RAM controllers in order for ldrex/strex to be able to operate correctly.
The entire concept is based around the idea of local and global exclusive monitors. If operating on memory regions marked as non-shareable (in a uniprocessor configuration), the processor needs only be concerned with the local monitor.
In multi-core configurations, coherent regions are managed using what is architecturally considered to be a global monitor, but still resides within the multi-core processor and does not rely on externally implemented logic.
Now, dwelch is correct in that there are way too many "implementation defined" options surrounding this. The sequence you describe is NOT architecturally guaranteed to work. The architecture does not require that an str transitions the local (or global) monitor from exclusive to open state (although in certain implementations, it might).
Hence, the architecturally safe options are:
Use ldrex/strex in both contexts.
Disable interrupts in both contexts.
