Are writes on the PCIe bus atomic?

Are writes on the PCIe bus atomic? - arm

I am a newbie to PCIe, so this might be a dumb question. This seems like fairly basic information to ask about PCIe interfaces, but I am having trouble finding the answer so I am guessing that I am missing some information which makes the answer obvious.
I have a system in which I have an ARM processor (host) communicating to a Xilinx SoC via PCIe (device). The endpoint within the SoC is an ARM processor as well.
The external ARM processor (host) is going to be writing to the register space of the SoC's ARM processor (device) via PCIe. This will command the SoC to do various things. That register space will be read-only with respect to the SoC (device). The external ARM processor (host) will make a write to this register space, and then signal an interrupt to indicate to the SoC that new parameters have been written and it should process them.
My question is: are the writes made by the external ARM (host) guaranteed to be atomic with respect to the reads by the SoC (device)? In conventional shared memory situations, a write to a single byte is guaranteed to be an atomic operation (i.e. you can never be in a situation where the reader had read the first 2 bits of the byte, but before it reads the last 6 bits the writer replace them with a new value, leading to garbage data). Is this the case in PCIe as well? And if so, what is the "unit" of atomic-ness? Are all bytes in a single transaction atomic with respect to the entire transaction, or is each byte atomic only in relation to itself?
Does this question make sense?
Basically I want to know to what extent memory protection is necessary in my situation. If at all possible, I would like to avoid locking memory regions as both processors are running RTOSes and avoiding memory locks would make design simpler.

So on the question of atomicity the PCIe 3.0 specification (only one I have) is mentioned a few times.
First you have SECTION 6.5 Locked Transactions this is likely not what you need but I want to document it anyway. Basically it's the worst case scenario of what you were describing earlier.
Locked Transaction support is required to prevent deadlock in systems that use legacy software
which causes the accesses to I/O devices
But you need to properly check using this anyway as it notes.
If any read associated with a locked sequence is completed unsuccessfully, the Requester must
assume that the atomicity of the lock is no longer assured, and that the path between the
Requester and Completer is no longer locked
With that said Section 6.15 Atomic Operations (AtomicOps) is much more like what you are interested in. There are 3 types of operations you can perform with the AtomicOps instruction.
FetchAdd (Fetch and Add): Request contains a single operand, the “add” value
Swap (Unconditional Swap): Request contains a single operand, the “swap” value
CAS (Compare and Swap): Request contains two operands, a “compare” value and a “swap” value
Reading Section 6.15.1 we see mention that these instructions are largely implemented for cases where multiple producers/consumers exist on a singular bus.
AtomicOps enable advanced synchronization mechanisms that are particularly useful when there are
multiple producers and/or multiple consumers that need to be synchronized in a non-blocking fashion. For example, multiple producers can safely enqueue to a common queue without any explicit locking.
Searching the rest of the specification I find little mention of atomicity outside of the sections pertaining to these AtomicOps. That would imply to me that the spec only insures such behavior when these operations are used however the context around why this was implemented suggests that they only expect such questions when a multi producer/consumer environment exists which yours clearly does not.
The last place I would suggest looking to answer your question is Section 2.4 Transaction Ordering To note I am fairly sure the idea of transactions "passing" others only makes sense with switches in the middle as these switches can make such decisions, once your put bits on the bus in your case there is no going back. So this likely only applies if you place a switch in there.
Your concern is can a write bypass a read. Write being posted, read being non-posted.
A3, A4 A Posted Request must be able to pass Non-Posted Requests to avoid deadlocks.
So in general the write is allowed to bypass the read to avoid deadlocks.
With that concern raised I do not believe it is possible for the write to bypass the read on your system since there is no device on the bus to do this transaction reordering. Since you have RTOSes I highly doubt they are enquing the PCIe transactions and reordering them before sending although I have not looked into that personally.

Related

MultiCore Shared Memory Atomic Operations

Would it be necessary to use a mutex for atomic operations on shared memory, in a multicore environment, where one CPU is only ever reading and the other CPU is only ever writing? I am guessing that this may depend on architecture, so if an example is needed then ARM (Cortex) and/or ESP32?
I already know that a mutex is not needed for atomic operations in a single-core environment where one thread is only ever reading and the other thread only ever writing (https://www.freertos.org/FreeRTOS_Support_Forum_Archive/May_2019/freertos_Shared_variable_between_a_write_thred_and_a_read_thread_a0408decbaj.html).

One solution that has been around for decades (I already used this 30 years ago) is the concept of mailboxes.
Simplest mailbox is just a structure or buffer with a flag. This flag should be of the minimum size that can be accessed in an atomic operation from both processors sharing the memory. It should also be located at a memory address that both processors see as "aligned" to ensure single-cycle read/write accesses, e.g. 32 bit word boundaries in the case of 32 bit ARM processors. This might be tricky to implement in non- RISC-alike architectures.
The flag usage is very simple. The processor that writes the data waits for the flag to be signalled as "buffer empty", maybe a simple null value, then write the data to the mailbox's buffer and signal "buffer not empty" by setting a magic number into the flag, maybe a non- null value.
The processor receiving the data just has to wait for the flag to be signalled as "buffer not empty" before reading the data, and setting the flag back to "buffer empty".
Whether you have primitives supporting this mechanism without relying in a constant flag polling, or not, is tightly dependent on your hardware and operating system.
I've used this mechanism in heterogeneous architectures (processor + co-processor of different architectures/capabilities running different applications), but homogeneous multicore processors are well supported by many RTOSes today, including freeRTOS, and other mechanisms as queues and semaphores/mutexes are probably more appropriated for the synchronization part. Some current SoC's support hardware semaphores and memory-access interrupts that can improve performance greatly.
EDIT:
There is one freeRTOS feature that can assist you here, message buffers. There is one example using ST's STM32H745 dual-core SoC [here] that comes with a companion article [here] written by freeRTOS's Richard Barry.

ARM: Is LDRX/STRX needed if interrupts are disabled?

I am working with a multithreaded bare-metal C/Assembler application on a Cortex-A9.
I have some shared variables, i.e. adresses that are used from more than one thread. To perform an atomic exchange of a variables value I use LDRX and STRX. Now my question is if I need LDRX and STRX on every access to one of this variables even if interrupts are disabled.
Assume the following example:
Thread 1 uses LDRX and STRX to exchange the value of address a.
Thread 2 disables interrupts, uses normal LDR and STR to exchange the value of address a, does something else that should not be interrupted and then enables interrupts again.
What happens if Thread 1 gets interrupted right after the LDRX by Thread 2? Does the STRX in Thread 1 still recognize, that there was an access on address a or do I have to use LDRX and STRX in Thread 2, too?

LDREX/STREX are something that have to be implemented by the chip vendor, hopefully to arms specification. You can and should get the arm documentation on the topic, in this case in additional to arm arms and trms you should get the amba-axi documentation.
So if you have
ldrex thread 1
interrupt
ldrex thread 2
strex thread 2
return from interrupt
strex thread 1
Between the thread 2 ldrex and strex there has been no modification of that memory location, so the strex should work. But between the thread 1 strex and the prior ldrex there has been a modification to that location, the thread 2 strex. So in theory that means the thread 1 strex should fail and you have to try your thread 1 ldrex/strex pair again until it works. But that is exactly by design, you keep trying the ldrex/strex pair in a loop until it succeeds.
But this is all implementation defined so you have to look at the specific chip vendor and model and rev and do your own experiments. The bug in linux for example is that ldrex/strex is an infinite loop, apply it to a system/situation where ldrex/strex is not supported you get an OKAY instead of an EXOKAY, and the strex will fail forever you are stuck in that infinite loop forever (ever wonder how I know all of this, had to debug this problem at the logic level).
First off ARM documents that exclusive access support is not required for uniprocessor systems so the ldrex/strex pair CAN fail to work IF you touch vendor specific logic on single core systems. Uniprocessor or not if your ldrex/strex remains within the arm logic (L1 and optional L2 caches) then the ldrex/strex pair are goverened by ARM and not the chip vendor so you fall under one set of rules, if the pair touches system memory outside the arm core, then you fall under the vendors rules.
The big problem is that ARM's documentation is unusually incomplete on the topic. Depending on which manual and where in the manual you read it for example says if some OTHER master has modified that location which in your case it is the same master, so the location has been modified but since it was by you the second strex should succeed. Then the same document says that another exclusive read resets the monitor to a different address, well what if it is another exclusive read of the same address?
Basically yours is a question of what about two exclusive writes to the same address without an exclusive read in between, does/should the second succeed. A very good question...I cant see that there is a definitive answer either within all the arm cores or in the whole world of arm based chips.
The bottom line with ldrex/strex it is not completely ARM core specific but also chip specific (vendor). You need to do experiments to insure you can use that instruction pair on that system (uniprocessor or not). You need to know what the ARM core does (the caches) and what happens when that exclusive access goes out past the core to the vendor logic. Repeat for every core and vendor you care to port this code to.

Apologies for just throwing in an "it's wrong" statement to dwelch, but I did not have time to write a proper answer yesterday. dwelch's answer to your question is correct - but pieces of it are at the very least possible to misinterpret.
The short answer is that, yes, you need to either disable interrupts for both threads or use ldrex/strex for both threads.
But to set one thing straight: support for ldrex/strex is mandatory in all ARM processors of v6 or later (with the exception of v6M microcontrollers). Support for SWP however, is optional for certain ARMv7 processors.
The behaviour of ldrex/strex is dependent on whether your MMU is enabled and what memory type and attributes the accessed region is configured with. Certain possible configurations will require additional support to be added to either the interconnect or RAM controllers in order for ldrex/strex to be able to operate correctly.
The entire concept is based around the idea of local and global exclusive monitors. If operating on memory regions marked as non-shareable (in a uniprocessor configuration), the processor needs only be concerned with the local monitor.
In multi-core configurations, coherent regions are managed using what is architecturally considered to be a global monitor, but still resides within the multi-core processor and does not rely on externally implemented logic.
Now, dwelch is correct in that there are way too many "implementation defined" options surrounding this. The sequence you describe is NOT architecturally guaranteed to work. The architecture does not require that an str transitions the local (or global) monitor from exclusive to open state (although in certain implementations, it might).
Hence, the architecturally safe options are:
Use ldrex/strex in both contexts.
Disable interrupts in both contexts.

Using interrupts during reading a file from disk

Assume that a large file is saved on disk and I want to run a computation on every chunk of data contained in the file.
The C/C++ code that I would write to do so would load part of the file, then do the processing, then load the next part, then do the processing of this next part, and so on.
If I am, however, interested to do so in the shortest possible time, I could actually do the following: First, tell DMA-controller to load first part of the file. When this part is loaded tell the DMA-controller to load the second part (in some other part of the memory) and then immediately start processing the first part.
If I get an interrupt from the DMA during processing the first part, I finish the first part and afterwards tell the DMA to overwrite it with the third part of the file; then I process the second part.
If I do not get an interrupt from the DMA during processing the first part, I finish the first part and wait for the interrupt of the DMA.
Depending of how long the processing takes in relation to the disk-read, this should be up to twice as fast. In reality, of course, one would have to measure. But that is not the question I am asking.
The question is: Is it possible to do this a) in C using some non-standard extension or b) in assembly? Or do operating systems not allow such things in general? The question is meant primarily in a single-thread context, although I also would be interested to know how to do it with two threads. Also, I am interested in specific code; this is more of a theoretical question.

You're right that you will not get the benefit of this by default, because a blocking read stops your thread from doing any processing. Hans is right that modern OSes already take care of all the little details of DMA and interrupt completion routines.
You need to use the architecture you've described, of issuing a request in advance of when you will use the data. Issue asynchronous I/O requests (on Windows these are called OVERLAPPED). Then the flow will go exactly as you envisions, but the DMA and interrupts are handled in the drivers.
On Windows, take a look at FILE_FLAG_OVERLAPPED (to CreateFile) and ReadFile (if you like events) or ReadFileEx (if you like callbacks). If you don't have to process the data in any particular order, then add a completion port to the mix, which queues the completion responses.
On Linux, OSX, and many other Unix-like OSes, look at aio_read. Or fadvise. Or use mmap with madvise.
And you can get these benefits without even writing native code. .NET recently added the ReadAsync method to its FileStream, which can be used with continuation-passing style in the form of Task objects, with async/await syntactic sugar in the C# compiler.

Typically, in a multi-mode (user/system) operating system, you do not have access to direct dma or to interrupts. In systems that extend those features from kernel(system) mode down to user mode, the overhead eliminates the benefit of using them.
Ignoring that what you're asking to do requires a very specialized environment to support it, the idea is sound and common: declaring two (or more) buffers to enable DMA to the next while you process the first. When two buffers are used they're sometimes referred to as ping-pong buffers.

Real-life use cases of barriers (DSB, DMB, ISB) in ARM

I understand that DSB, DMB, and ISB are barriers for prevent reordering of instructions.
I also can find lots of very good explanations for each of them, but it is pretty hard to imagine the case that I have to use them.
Also, from the open source codes, I see those barriers from time to time, but it is quite hard to understand why they are used. Just for an example, in Linux kernel 3.7 tcp_rcv_synsent_state_process function, there is a line as follows:
if (unlikely(po->origdev))
sll->sll_ifindex = orig_dev->ifindex;
else
sll->sll_ifindex = dev->ifindex;
smp_mb();
if (po->tp_version <= TPACKET_V2)
__packet_set_status(po, h.raw, status);
where smp_mb() is basically DMB.
Could you give me some of your real-life examples?
It would help understand more about barriers.

Sorry, not going to give you a straight-out example like you're asking, because as you are already looking through the Linux source code, you have plenty of those to go around, and they don't appear to help. No shame in that - every sane person is at least initially confused by memory access ordering issues :)
If you are mainly an application developer, then there is every chance you won't need to worry too much about it - whatever concurrency frameworks you use will resolve it for you.
If you are mainly a device driver developer, then examples are fairly straightforward to find - whenever there is a dependency in your code on a previous access having had an effect (cleared an interrupt source, written a DMA descriptor) before some other access is performed (re-enabling interrupts, initiating the DMA transaction).
If you are in the process of developing a concurrency framework (, or debugging one), you probably need to read up on the topic a bit more - but your question suggests a superficial curiosity rather than an immediate need?
If you are developing your own method for passing data between threads, not based on primitives provided by a concurrency framework, that is for all intents and purposes a concurrency framework.
Paul McKenney wrote an excellent paper on the need for memory barriers, and what effects they actually have in the processor: Memory Barriers: a Hardware View for Software Hackers
If that's a bit too hardcore, I wrote a 3-part blog series that's a bit more lightweight, and finishes off with an ARM-specific view. First part is Memory access ordering - an introduction.
But if it is specifically lists of examples you are after, especially for the ARM architecture, you could do a lot worse than Barrier Litmus Tests and Cookbook.
The extra-extra light programmer's view and not entirely architecturally correct version is:
DMB - whenever a memory access requires ordering with regards to another memory access.
DSB - whenever a memory access needs to have completed before program execution progresses.
ISB - whenever instruction fetches need to explicitly take place after a certain point in the program, for example after memory map updates or after writing code to be executed. (In practice, this means "throw away any prefetched instructions at this point".)

Usually you need to use a memory barrier in cases where you have to make SURE that memory access occurs in a specific order. This might be required for a number of reasons, usually it's required when two or more processes/threads or a hardware component access the same memory structure, which has to be kept consistent.
It's used very often in DMA-transfers. A simple DMA control structures might look like this:
struct dma_control {
u32 owner;
void * data;
u32 len;
};
The owner will usually be set to something like OWNER_CPU or OWNER_HARDWARE, to indicate who of the two participants is allowed to work with the structure.
Code which changes this will usually like like this
dma->data = data;
dma->len = length;
smp_mb();
dma->owner = OWNER_HARDWARE;
So, data an len are always set before the ownership gets transfered to the DMA hardware. Otherwise the engine might get stale data, like a pointer or length which was not updated, because the CPU reordered the memory access.
The same goes for processes or threads running on different cores. The could communicate in a similar manner.

One simple example of a barrier requirement is a spinlock. If you implement a spinlock using compare-and-swap(or LDREX/STREX on ARM) and without a barrier, the processor is allowed to speculatively load values from memory and lazily store computed values to memory, and neither of those are required to happen in the order of the loads/stores in the instruction stream.
The DMB in particular prevents memory access reordering around the DMB. Without DMB, the processor could reorder a store to memory protected by the spinlock after the spinlock is released. Or the processor could read memory protected by the spinlock before the spinlock was actually locked, or while it was locked by a different context.
unixsmurf already pointed it out, but I'll also point you toward Barrier Litmus Tests and Cookbook. It has some pretty good examples of where and why you should use barriers.

How often does processor cache flush?

Say I have a casual single-byte variable. I think on pretty much all systems single-byte operations are atomic, but if not please let me know. Now, say one thread updates this variable. How long should I expect/prepare for this update to appear in the other threads? I know I can put the update around mutexes/locks/barriers to make sure it's synchronized everywhere, but I'm curious about this. The wait time probably varies depending on whether the other threads are on separate processors/cores, and maybe depending on processor type.
Am I being logical for wondering this or have I greatly misunderstood something?

Memory is synchronized as soon as you call a synchronization primitive/memory barrier such as pthread_mutex_lock. Aside from that, you should not assume any synchronization unless you're using C11 atomic types.

In many architectures, the processor won't flush the cache until it has to - to make way for some more-needed data.
However, if the threads are sharing memory space, and you only have a single core, they will be able to see the update "immediately" from the cache. If it's actually been written from the CPU to memory. Which it may not be if the compiler's decided to keep it in a register, in which case your threads will all have their own "local" and incorrect copy.
As others have said, it's an interesting question - but the right answer for synchronising is to use proper synchronisation primitives!

On MIPS architecture there is a sync instruction which serves as a load store barrier across cores i.e all loads and stores before issuing sync will happen before any load and stores after sync.Not sure about if there is an equivalent instruction in x86(assuming that is the architecture you are using).