Real-life use cases of barriers (DSB, DMB, ISB) in ARM

I understand that DSB, DMB, and ISB are barriers that prevent the reordering of instructions.
I can also find lots of very good explanations for each of them, but it is pretty hard to imagine the cases where I would have to use them.
Also, in open-source code I see those barriers from time to time, but it is quite hard to understand why they are used. Just as an example, in the Linux kernel 3.7 tcp_rcv_synsent_state_process function, there is a passage as follows:
if (unlikely(po->origdev))
        sll->sll_ifindex = orig_dev->ifindex;
else
        sll->sll_ifindex = dev->ifindex;

smp_mb();

if (po->tp_version <= TPACKET_V2)
        __packet_set_status(po, h.raw, status);
where smp_mb() is basically a DMB.
Could you give me some of your real-life examples?
It would help understand more about barriers.

Sorry, not going to give you a straight-out example like you're asking, because as you are already looking through the Linux source code, you have plenty of those to go around, and they don't appear to help. No shame in that - every sane person is at least initially confused by memory access ordering issues :)
If you are mainly an application developer, then there is every chance you won't need to worry too much about it - whatever concurrency frameworks you use will resolve it for you.
If you are mainly a device driver developer, then examples are fairly straightforward to find - whenever there is a dependency in your code on a previous access having had an effect (cleared an interrupt source, written a DMA descriptor) before some other access is performed (re-enabling interrupts, initiating the DMA transaction).
If you are in the process of developing a concurrency framework (or debugging one), you probably need to read up on the topic a bit more - but your question suggests a superficial curiosity rather than an immediate need?
If you are developing your own method for passing data between threads that is not based on primitives provided by a concurrency framework, then that is, for all intents and purposes, a concurrency framework.
Paul McKenney wrote an excellent paper on the need for memory barriers, and what effects they actually have in the processor: Memory Barriers: a Hardware View for Software Hackers
If that's a bit too hardcore, I wrote a 3-part blog series that's a bit more lightweight, and finishes off with an ARM-specific view. First part is Memory access ordering - an introduction.
But if it is specifically lists of examples you are after, especially for the ARM architecture, you could do a lot worse than Barrier Litmus Tests and Cookbook.
The extra-extra light programmer's view and not entirely architecturally correct version is:
DMB - whenever a memory access requires ordering with regards to another memory access.
DSB - whenever a memory access needs to have completed before program execution progresses.
ISB - whenever instruction fetches need to explicitly take place after a certain point in the program, for example after memory map updates or after writing code to be executed. (In practice, this means "throw away any prefetched instructions at this point".)
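To make those one-liners a bit more concrete, here is a minimal sketch of each barrier in use on ARMv7, written as C with GCC inline assembly. All device addresses and function names below are made up for illustration, and real code would also need the cache maintenance that is elided here.

#include <stdint.h>

/* Hypothetical device registers, invented for this sketch. */
#define DMA_DESC ((volatile uint32_t *)0x40001000u)
#define DMA_GO   ((volatile uint32_t *)0x40001010u)

/* DMB: order two memory accesses against each other. The descriptor
 * write must be observable before the doorbell write that starts DMA. */
void start_dma(uint32_t desc)
{
    *DMA_DESC = desc;
    __asm__ __volatile__("dmb" ::: "memory");
    *DMA_GO = 1u;
}

/* DSB: the access must have completed, not merely be ordered, before
 * execution continues - e.g. before WFI puts the core to sleep. */
void ack_and_sleep(volatile uint32_t *int_clear)
{
    *int_clear = 1u;
    __asm__ __volatile__("dsb" ::: "memory");
    __asm__ __volatile__("wfi");
}

/* ISB: discard prefetched instructions, e.g. after writing code that
 * is about to be executed (cache maintenance elided for brevity). */
void run_new_code(void (*entry)(void))
{
    /* ... instructions were just written to memory and caches cleaned ... */
    __asm__ __volatile__("dsb" ::: "memory");
    __asm__ __volatile__("isb" ::: "memory");
    entry();
}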

Usually you need to use a memory barrier in cases where you have to make SURE that memory accesses occur in a specific order. This can be required for a number of reasons; usually it's needed when two or more processes/threads or a hardware component access the same memory structure, which has to be kept consistent.
It's used very often in DMA transfers. A simple DMA control structure might look like this:
struct dma_control {
    u32 owner;
    void *data;
    u32 len;
};
The owner will usually be set to something like OWNER_CPU or OWNER_HARDWARE, to indicate which of the two participants is allowed to work with the structure.
Code which changes this will usually look like this:
dma->data = data;
dma->len = length;
smp_mb();
dma->owner = OWNER_HARDWARE;
So, data and len are always set before the ownership gets transferred to the DMA hardware. Otherwise, the engine might see stale data, such as a pointer or length which was not updated, because the CPU reordered the memory accesses.
The same goes for processes or threads running on different cores. They could communicate in a similar manner.
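As an illustration of that inter-core case, here is a minimal sketch using GCC's __atomic builtins, which play the same role as the barrier in the DMA example above; all names are hypothetical.

/* One core publishes a message, another consumes it. The RELEASE store
 * makes the payload visible before the flag; the ACQUIRE load makes the
 * flag read happen before the payload read. */
static int message;
static int ready; /* 0 = empty, 1 = message valid */

void producer(void)
{
    message = 42;                                  /* payload first     */
    __atomic_store_n(&ready, 1, __ATOMIC_RELEASE); /* then publish flag */
}

int consumer(void)
{
    while (!__atomic_load_n(&ready, __ATOMIC_ACQUIRE))
        ;                                          /* spin on the flag  */
    return message;                                /* sees 42, not junk */
}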

One simple example of a barrier requirement is a spinlock. If you implement a spinlock using compare-and-swap (or LDREX/STREX on ARM) and without a barrier, the processor is allowed to speculatively load values from memory and lazily store computed values to memory, and neither of those is required to happen in the order of the loads/stores in the instruction stream.
The DMB in particular prevents memory access reordering around the DMB. Without DMB, the processor could reorder a store to memory protected by the spinlock after the spinlock is released. Or the processor could read memory protected by the spinlock before the spinlock was actually locked, or while it was locked by a different context.
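For concreteness, here is a minimal spinlock sketch for ARMv7 in the spirit of (but not identical to) the Linux kernel implementation, written with GCC inline assembly; treat it as an illustration, not production code.

typedef struct { unsigned int lock; } spinlock_t;

void spin_lock(spinlock_t *sl)
{
    unsigned int tmp;

    __asm__ __volatile__(
        "1: ldrex   %0, [%2]\n"     /* exclusively load the lock value   */
        "   teq     %0, #0\n"       /* zero means free                   */
        "   bne     1b\n"           /* held by someone else: spin        */
        "   strex   %0, %1, [%2]\n" /* try to store 1; %0 = 0 on success */
        "   teq     %0, #0\n"
        "   bne     1b\n"           /* lost the race: try again          */
        : "=&r" (tmp)
        : "r" (1), "r" (&sl->lock)
        : "cc", "memory");

    /* DMB: accesses to the protected data must not be observed
     * before the lock is seen to be taken. */
    __asm__ __volatile__("dmb" ::: "memory");
}

void spin_unlock(spinlock_t *sl)
{
    /* DMB: everything done while holding the lock must be observable
     * before the lock is seen to be released. */
    __asm__ __volatile__("dmb" ::: "memory");
    sl->lock = 0;
}

Note the placement: the barrier after acquisition stops protected accesses from being hoisted above the lock, and the barrier before release stops them from sinking below the unlock.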
unixsmurf already pointed it out, but I'll also point you toward Barrier Litmus Tests and Cookbook. It has some pretty good examples of where and why you should use barriers.

Related

Are writes on the PCIe bus atomic?

I am a newbie to PCIe, so this might be a dumb question. This seems like fairly basic information to ask about PCIe interfaces, but I am having trouble finding the answer, so I am guessing that I am missing some information which makes the answer obvious.
I have a system in which I have an ARM processor (host) communicating to a Xilinx SoC via PCIe (device). The endpoint within the SoC is an ARM processor as well.
The external ARM processor (host) is going to be writing to the register space of the SoC's ARM processor (device) via PCIe. This will command the SoC to do various things. That register space will be read-only with respect to the SoC (device). The external ARM processor (host) will make a write to this register space, and then signal an interrupt to indicate to the SoC that new parameters have been written and it should process them.
My question is: are the writes made by the external ARM (host) guaranteed to be atomic with respect to the reads by the SoC (device)? In conventional shared-memory situations, a write to a single byte is guaranteed to be an atomic operation (i.e. you can never be in a situation where the reader has read the first 2 bits of the byte, but before it reads the last 6 bits the writer replaces them with a new value, leading to garbage data). Is this the case in PCIe as well? And if so, what is the "unit" of atomicity? Are all bytes in a single transaction atomic with respect to the entire transaction, or is each byte atomic only in relation to itself?
Does this question make sense?
Basically I want to know to what extent memory protection is necessary in my situation. If at all possible, I would like to avoid locking memory regions as both processors are running RTOSes and avoiding memory locks would make design simpler.
On the question of atomicity, the PCIe 3.0 specification (the only one I have) mentions it a few times.
First, there is Section 6.5, Locked Transactions. This is likely not what you need, but I want to document it anyway; basically, it's the worst-case scenario of what you were describing earlier.
Locked Transaction support is required to prevent deadlock in systems that use legacy software which causes the accesses to I/O devices
But if you do use this, you need to check for failure properly anyway, as it notes:
If any read associated with a locked sequence is completed unsuccessfully, the Requester must assume that the atomicity of the lock is no longer assured, and that the path between the Requester and Completer is no longer locked
With that said, Section 6.15, Atomic Operations (AtomicOps), is much more like what you are interested in. There are three types of operations you can perform with AtomicOps:
FetchAdd (Fetch and Add): Request contains a single operand, the “add” value
Swap (Unconditional Swap): Request contains a single operand, the “swap” value
CAS (Compare and Swap): Request contains two operands, a “compare” value and a “swap” value
Reading Section 6.15.1, we see that these instructions are largely intended for cases where multiple producers/consumers exist on a single bus.
AtomicOps enable advanced synchronization mechanisms that are particularly useful when there are
multiple producers and/or multiple consumers that need to be synchronized in a non-blocking fashion. For example, multiple producers can safely enqueue to a common queue without any explicit locking.
Searching the rest of the specification, I find little mention of atomicity outside of the sections pertaining to these AtomicOps. That would imply to me that the spec only ensures such behavior when these operations are used; however, the context around why this was implemented suggests that such questions only arise in a multi-producer/consumer environment, which yours clearly is not.
The last place I would suggest looking to answer your question is Section 2.4, Transaction Ordering. Note that I am fairly sure the idea of transactions "passing" others only makes sense with switches in the middle, as these switches can make such decisions; once you put bits on the bus, in your case, there is no going back. So this likely only applies if you place a switch in there.
Your concern is whether a write can bypass a read, the write being posted and the read being non-posted.
A3, A4 A Posted Request must be able to pass Non-Posted Requests to avoid deadlocks.
So in general the write is allowed to bypass the read to avoid deadlocks.
With that concern raised, I do not believe it is possible for the write to bypass the read on your system, since there is no device on the bus to do this transaction reordering. Since you have RTOSes, I highly doubt they are enqueuing the PCIe transactions and reordering them before sending, although I have not looked into that personally.

Can I use LDREX/STREX to implement a spin lock without enabling SCU in a multicore ARM Cortex-A9 SoC?

I know this might be a strange usage. I just want to know if I can use LDREX/STREX with SCU disabled.
I am using a dual-core Cortex-A9 SoC. The two cores are running in AMP mode: each core has its own OS. Although the memory controller is a shared resource, each core has its own memory space, and one can't access the other's memory space. Because no cache coherency is required, the SCU isn't enabled. At the same time, I also have a shared memory region that both cores can access. The shared memory region is non-cached to avoid cache coherency issues.
I define a spin lock in this shared memory region. This spin lock is used to protect shared resource accessing. Right now, the spin lock is implemented simply like this:
void spin_lock(uint32_t *lock)
{
    while (*lock)
        ;
    *lock = 1;
}

void spin_unlock(uint32_t *lock)
{
    *lock = 0;
}
where lock is a variable in the shared memory region, so both cores can access it.
The problem with this implementation is that the access to lock is not exclusive: both cores can observe the lock as free and claim it at the same time. That's why I want to use LDREX/STREX to implement the spin lock. Please allow me to restate my question:
Can I use LDREX/STREX without SCU enabled?
Thank you!
So ... the direct answer to your question is that, yes, it is possible - so long as something else out in the memory system implements an exclusive monitor for the shared memory region. If it does not, then your STREXs will always return OK (rather than EXOK), observable as a failure in the result register.
However, why would you not enable the SCU?
Clearly, what you are trying to do requires a coherent view of memory between the two operating systems for at least that region. And with PIPT data caches, you are not going to see any aliasing of cache lines depending on how they are mapped in each image.
Overall, the answer is no. There are two issues here:
1) You cannot use load/store exclusive on uncached memory. The exclusive operations operate only on "normal" idempotent memory.
2) The ARM manual doesn't specify how exclusive monitors work in conjunction with memory coherence, but any sane implementation is essentially going to put the monitor in the cache line acquisition mechanism. If you disabled cache line snooping, you have most likely rendered the monitors non-functional on your chip.
Your only (poorly formed) question,
Can I use LDREX/STREX without SCU enabled?
In an ideal ARM universe, yes, it is possible. That is, it is possible that somewhere, some day, you might be able to do this. I think you mean,
Can I use LDREX/STREX without SCU enabled in my system?
Unfortunately, the ARM ARM is a bit of a political/bureaucratic document. You must take extreme care when reading "strongly advised", "UNPREDICTABLE", "UNKNOWN", and "can". All programmers would like ldrex/strex to apply to all memory. In fact, if the bus controller (typically an AXI NIC) implemented a monitor, then there would be no trouble supporting the much-loved swp instruction. There are various posts on StackOverflow where people want to replace swp with ldrex/strex.
After you read and re-read the doublespeak of the ARM ARM (it is written for the programmer, but also for the silicon implementer), it becomes pretty clear that the monitor logic is probably implemented in the cache. A cache controller must implement dirty-line broadcasts. Dirty-line broadcasts are very similar to a 'monitor', and your 'reservation granule' is most likely a cache line in size (what a coincidence).
The ARM ARM is written as a generic document for people who may wish to implement a Cortex-A CPU. It is written so that their hands (creativity) are not tied to implementing the monitor within the cache.
So you need to read the specific documentation on your particular Cortex-A9 SoC. It will probably only support ldrex/strex on cached memory. In fact, it is advisable to issue a pld to ensure the memory is in the cache before doing the ldrex, and this will mean you need to activate the SCU in your system. I guess you are concerned about some additional cycle(s) of latency that the SCU will add?
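To sketch that pld advice (hypothetical ARMv7 code with GCC inline assembly; the barrier needed after a successful acquisition is omitted here, see the spinlock examples earlier on this page):

static inline int try_lock(unsigned int *lock)
{
    unsigned int old, fail;

    __asm__ __volatile__(
        "   pld     [%2]\n"         /* pull the line into the D-cache  */
        "   ldrex   %0, [%2]\n"
        "   strex   %1, %3, [%2]\n" /* %1 == 0 only if still exclusive */
        : "=&r" (old), "=&r" (fail)
        : "r" (lock), "r" (1)
        : "cc", "memory");

    return fail == 0 && old == 0;   /* 1 = we took the lock */
}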
I think some of this information has confused many extremely intelligent people. Beware the difference between possible and is. Everyone on StackOverflow probably desires the case where the monitor is implemented in the bus controller (or core memory chip). However, for most real chips, this is not the case.
For certain, if you want to future-proof your code/OS to port to newer or different Cortex-A CPUs, you should not make this assumption, even if your chipset does support a 'global monitor' outside the cache subsystems.

Usage of Volatile in case of Memory mapped Devices?

The following link says that "Access to device registers is always uncached":
http://techpubs.sgi.com/library/dynaweb_docs/hdwr/SGI_Developer/books/DevDrvrO2_PG/sgi_html/ch01.html
My question is: do we ever need volatile when accessing device registers which are memory-mapped?
The confusion here comes from two mechanisms which have similarities in their goals, but quite distinct mechanisms and levels of implementation.
The link refers to memory mapped I/O regions being configured as ineligible for hardware caching in fast intermediate memory that is used to speed operations compared to accessing slower main memory banks. This is traditionally nearly transparent to software (exceptions being things like modifying code on a machine with distinct instruction and data caches).
In contrast, volatile is used to prohibit an optimizing compiler from performing "software" caching of values by strategically holding them in registers, delaying calculating them until needed, or perhaps never calculating them if un-needed. The basic effect is to inform the compiler that the value may be produced or consumed by a mechanism invisible to its analysis - be that either hardware beyond the present processor core, or a distinct thread or context of execution.
This question is a more processor-specific version of Why is volatile needed in C?
This is one of the two situations where volatile is mandatory (and it would be nice if compilers could know that).
Any memory location which can change either without your code initiating it (i.e. a memory-mapped device register) or without your thread initiating it (i.e. it is changed by another thread or by an interrupt handler) absolutely must be declared volatile to prevent the compiler from optimizing away memory-fetch operations.
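For illustration, here is a minimal sketch of such a declaration; the device address and register layout are invented for this example.

#include <stdint.h>

#define UART_BASE 0x10000000u /* hypothetical device address */

typedef struct {
    volatile uint32_t data;   /* write: transmit a byte  */
    volatile uint32_t status; /* bit 0: transmitter busy */
} uart_regs_t;

#define UART ((uart_regs_t *)UART_BASE)

void uart_putc(char c)
{
    /* Without volatile, the compiler could read status once, see "busy",
     * and spin forever on a stale value held in a register. */
    while (UART->status & 1u)
        ;
    UART->data = (uint32_t)c;
}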

How often does processor cache flush?

Say I have an ordinary single-byte variable. I think on pretty much all systems single-byte operations are atomic, but if not, please let me know. Now, say one thread updates this variable. How long should I expect/prepare for this update to appear in the other threads? I know I can put the update behind mutexes/locks/barriers to make sure it's synchronized everywhere, but I'm curious about this. The wait time probably varies depending on whether the other threads are on separate processors/cores, and maybe on the processor type.
Am I being logical for wondering this or have I greatly misunderstood something?
Memory is synchronized as soon as you call a synchronization primitive/memory barrier such as pthread_mutex_lock. Aside from that, you should not assume any synchronization unless you're using C11 atomic types.
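For illustration, a minimal sketch of that advice with pthreads; the variable names are hypothetical.

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned char value;

void set_value(unsigned char v)
{
    pthread_mutex_lock(&lock);  /* implies the needed memory barriers */
    value = v;
    pthread_mutex_unlock(&lock);
}

unsigned char get_value(void)
{
    unsigned char v;

    pthread_mutex_lock(&lock);
    v = value;
    pthread_mutex_unlock(&lock);
    return v;
}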
In many architectures, the processor won't flush the cache until it has to - to make way for some more-needed data.
However, if the threads are sharing memory space and you only have a single core, they will be able to see the update "immediately" from the cache - provided it has actually been written out by the CPU at all, which it may not have been if the compiler decided to keep the value in a register. In that case, your threads will each have their own "local" and incorrect copy.
As others have said, it's an interesting question - but the right answer for synchronising is to use proper synchronisation primitives!
On the MIPS architecture there is a sync instruction which serves as a load/store barrier across cores, i.e. all loads and stores issued before the sync will happen before any loads and stores after the sync. I am not sure if there is an equivalent instruction on x86 (assuming that is the architecture you are using).

Safety nets in complex multi-threaded code?

As a developer who has just finished writing thousands of lines of complex multi-threaded C code in a project, code which is going to be enhanced and modified by several other developers unfamiliar with it in the future, I wanted to find out what kind of safety nets you try to put into such code. As examples, I could do these:
Define accessor macros for lock-protected structure members, which assert that the corresponding lock is held. This makes it clear to anyone unfamiliar with this code that these members are lock-protected (a sketch of this idea follows below).
Have functions which are supposed to be called with some spinlock held assert that the spinlock is actually held.
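A minimal sketch of the first idea, assuming pthreads; every name here is hypothetical, and lock ownership is tracked with an explicit field because a plain POSIX mutex cannot be asked who holds it.

#include <assert.h>
#include <pthread.h>

struct counter {
    pthread_mutex_t lock;
    pthread_t       owner; /* valid only while owned != 0 */
    int             owned;
    int             count; /* protected by lock           */
};

static void counter_lock(struct counter *c)
{
    pthread_mutex_lock(&c->lock);
    c->owner = pthread_self();
    c->owned = 1;
}

static void counter_unlock(struct counter *c)
{
    c->owned = 0;
    pthread_mutex_unlock(&c->lock);
}

/* The accessor documents - and enforces - that count is lock-protected. */
#define COUNTER_VAL(c) \
    (*(assert((c)->owned && pthread_equal((c)->owner, pthread_self())), \
       &(c)->count))

With this in place, COUNTER_VAL(c)++ aborts at runtime unless the calling thread took the lock through counter_lock().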
What kind of safety nets have you put into multi-threaded code that you have written?
What kind of problems have you faced when other developers modified such code?
What kind of debugging aids have you put into such code?
Thanks for your comments.
There are a number of things we do in our product (a hypervisor designed to help you find concurrency bugs in applications) that are more generally useful. Note that we do these in our code itself (because it's a highly concurrent piece of software) and that some of them are useful whether or not you are writing concurrent code.
Like you, we have the ability to assert(lock_held(...)) and use it.
We also (because we have our own scheduler) can assert(single_threaded()) for those (rare) situations where we count on no other thread being active in the system.
Memory corruption from one thread to another is pretty common (and hard to debug), so we do two things to address it: sprinkled throughout our thread stacks are some magic cookies. We periodically (in our get_thread_id() function) invoke a validate_thread_stack() function that checks these cookies to make sure the stack is not corrupted.
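A hedged sketch of that cookie check; the cookie value, count, and thread structure are all made up.

#include <assert.h>
#include <stdint.h>

#define STACK_COOKIE 0x5afe57acu
#define NUM_COOKIES  4

struct thread {
    uint32_t *stack_limit; /* lowest words of this thread's stack */
};

/* Cookies are planted at thread creation; this check runs periodically
 * to catch a stack overrun early. */
void validate_thread_stack(const struct thread *t)
{
    for (int i = 0; i < NUM_COOKIES; i++)
        assert(t->stack_limit[i] == STACK_COOKIE);
}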
Our malloc sticks magic cookies before and after each allocated block of memory and checks these on free. If anyone overruns their data, these can be used to find the corruption early.
On free(), we blast a well-known pattern (in our case 0xdddd...) over the memory. This nicely corrupts anyone else who has a dangling pointer left over to that memory region.
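A hedged sketch combining the malloc cookies and the free() poisoning; the header layout and constants are invented, and a real allocator would track block sizes itself.

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HEAP_COOKIE 0xc001c0deu
#define POISON      0xdd

struct hdr {
    uint32_t cookie;
    size_t   len;
};

void *dbg_malloc(size_t len)
{
    struct hdr *h = malloc(sizeof *h + len + sizeof(uint32_t));
    uint32_t tail = HEAP_COOKIE;

    if (h == NULL)
        return NULL;
    h->cookie = HEAP_COOKIE;
    h->len = len;
    memcpy((char *)(h + 1) + len, &tail, sizeof tail); /* trailing cookie */
    return h + 1;
}

void dbg_free(void *p)
{
    struct hdr *h = (struct hdr *)p - 1;
    uint32_t tail;

    memcpy(&tail, (char *)p + h->len, sizeof tail);
    assert(h->cookie == HEAP_COOKIE);      /* underrun / bad pointer check */
    assert(tail == HEAP_COOKIE);           /* overrun check                */
    memset(h, POISON, sizeof *h + h->len); /* corrupt dangling pointers    */
    free(h);
}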
We have a guard page (a memory page not mapped into the address space) near the bottom of the thread stack. If the thread overruns its stack, we catch it via page fault and drop into our debugger.
Our locks are witnessed. Check out the FreeBSD lock witness code; ours is like that, but homebrew. Basically, the witness code is a lightweight way of detecting potential deadlocks by looking for cycles in the lock acquisition graph.
Our locks are also wrapped with accessors that record the file/line number of acquisition and release. For double unlocks or double locks, you get pretty debug information on your screwup.
Our locks are also profiled. Once you get your code working, you want it working well. We track the usual things, like how many times a lock was acquired and how long it took to acquire.
In our system, we have an expectation that locks are not contended (we carefully designed the code this way). So if you wait on a spin lock for longer than a second or two in our system, you get dropped into the debugger, because that is most likely not a good thing.
Our variables that are meant to be updated atomically are wrapped inside C structs. The reason for this is to prevent sloppy code where you mix the good use, atomic_increment(&var), with the bad use, var++. We make it very hard to write the latter; a sketch of the idea follows.
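A hedged sketch of that wrapping trick; atomic_t and the accessor names are hypothetical here, implemented with GCC's __atomic builtins.

typedef struct {
    int counter; /* never touched directly outside the accessors */
} atomic_t;

static inline void atomic_increment(atomic_t *v)
{
    __atomic_fetch_add(&v->counter, 1, __ATOMIC_SEQ_CST);
}

static inline int atomic_read(atomic_t *v)
{
    return __atomic_load_n(&v->counter, __ATOMIC_SEQ_CST);
}

/* atomic_t hits;
 * hits++;                  <- compile error: no ++ on a struct
 * atomic_increment(&hits); <- the only supported way              */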
"volatile" is forbidden in our code base because its ambiguously implemented by compilers. Its a bad way to try and cobble together synchronization.
And of course code reviews. If you can't explain your concurrency assumptions and locking discipline to a colleague, then there's definitely issues with the code :-)
Make everything absolutely obvious, so that other developers cannot miss the synchronization scope when they view subsections of the code in isolation.
for example: don't hold a lock in code that spans multiple files.
Seems like you've answered your own question: put lots of assertions into the code. They will tell other developers what invariants and preconditions must hold.
