I've read similar answers on this site, and elsewhere, but am still confused in a few circumstances.
I'm aware of what the standard actually guarantees us, I understand the intended use of the keyword, and I'm well aware of the difference between compiler caching and L1/L2/etc. hardware caching; it's more for curiosity's sake that I want to understand the other cases.
Say I have a variable declared volatile in C. Four scenarios:
Signal handlers, single threaded (As intended): This is the problem the keyword was meant to solve. My process gets a signal callback from the OS, and I modify some volatile variable out of the normal execution of my process. Since it was declared volatile, the normal process won't store this value in a CPU register, and will always do a load from memory. Even if the signal handler writes to the volatile variable, since the signal handler shares the same address space as the normal process, even if the volatile variable was previously cached in hardware (i.e. L1, L2), we guarantee the main process will load the correct, updated variable. Perfect, everyone is happy.
DMA transfers, single-threaded: Say the volatile variable is mapped to a region of memory into which a DMA write is taking place. As before, the compiler won't keep the volatile variable in a CPU register and will always do a load from memory; however, if that variable exists in the hardware cache, then the load request will never reach main memory. If the DMA controller updates main memory behind our backs, we'll never get the up-to-date value. In a preemptive OS, we are saved by the fact that eventually we'll probably be context-switched out, and the next time our process resumes the cache will be cold, so we'll actually have to reload from main memory and get correct behaviour... eventually (our own process could also evict that cache line, but again, we might waste valuable cycles before that happens). Is there standardized hardware or OS support that notifies the hardware caches when main memory is updated by the DMA controller? Or do we have to explicitly flush the cache to guarantee we aren't reading a stale value? (Is this even possible on the architectures listed?)
Memory-mapped registers, single-threaded: Same as #2, except the volatile variable is mapped to a memory-mapped register (or an explicit I/O port). I would imagine this is a more difficult problem than #2, since at least the DMA controller will signal the CPU when it's done transferring, which gives the OS or hardware a chance to do something.
Multithreaded: If I have a volatile variable, is there any guarantee of cache coherency between multiple threads running on separate physical cores? Sure, again, the compiler is still issuing load instructions from memory, but if the value is cached in one core's cache, is there any guarantee the same value exists in the other cores' caches? (I would imagine it's not an issue at all for hyperthreaded threads on different logical cores of the same physical core, since they share the physical cache.) My overwhelming intuition says no, but I thought I'd list the case here anyway.
If possible, differentiate between x64 and ARMv6/7/8 architectures, and kernel vs user land solutions.
For 2 and 3: no, there's no standardized way this would work.
Normally when doing DMA transfers one would flush the cache in a platform-dependent manner. There are usually fairly straightforward instructions for doing that (since nowadays the caches are integrated into the CPU).
When accessing memory-mapped registers on the other hand, often the behavior is dependent on the order of writes. For example, suppose you have a UART port and write characters to it — you'll need to make sure that there is an actual write to the port each time you write to it from C.
While it might work with flushing the cache between each write, it's not what one normally does. The normal way (for ARM at least) is to set up the MMU so that writes to certain regions of address space happen uncached and in correct sequence.
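A minimal sketch of that kind of ordered MMIO write, assuming the register's address (made up here) has been mapped uncached as described above:

#include <stdint.h>

#define UART_DR (*(volatile uint8_t *)0x10009000u)   /* hypothetical UART data register */

void uart_puts(const char *s)
{
    while (*s)
        UART_DR = (uint8_t)*s++;   /* volatile guarantees one real store per character, in order */
}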
This approach can also be used for memory used for DMA transfers; one could for example set up dedicated regions for use as DMA buffers and set up the MMU so that reads and writes to that region happen uncached.
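For example, in a Linux kernel driver one might use the coherent DMA API for such a buffer. This is a hedged sketch, not from the original answer; 'dev' stands for the peripheral's struct device:

#include <linux/dma-mapping.h>

static void *dma_buf;
static dma_addr_t dma_handle;

static int setup_dma_buffer(struct device *dev)
{
    /* A buffer the CPU and the DMA controller can both access without explicit cache flushes. */
    dma_buf = dma_alloc_coherent(dev, 4096, &dma_handle, GFP_KERNEL);
    if (!dma_buf)
        return -ENOMEM;
    /* Program the DMA controller with dma_handle; the CPU reads/writes dma_buf. */
    return 0;
}

static void teardown_dma_buffer(struct device *dev)
{
    dma_free_coherent(dev, 4096, dma_buf, dma_handle);
}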
On the other hand, the language guarantees that all memory (well, what you get from declaring variables or allocating memory with malloc or new) will behave in certain ways. There should be no difference whether multiple threads or signals are involved. Note that the C90 and C99 standards don't mention threads (C11 does), but they are supposed to work this way. The implementation has to make sure that the CPUs and caches are used in a way consistent with this (as a consequence, the OS might not be able to schedule different threads on different cores if that can't be accomplished). Consequently, you should not need to flush caches in order to share data between threads, but you do need to synchronize the threads and, of course, use volatile-qualified data. The same is true for signal handlers, even if the implementation happens to run them on a different core.
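As a minimal sketch of the signal-handler case (the names here are illustrative, and volatile sig_atomic_t is the type the standard actually blesses for this pattern):

#include <signal.h>

static volatile sig_atomic_t got_signal = 0;

static void handler(int sig)
{
    (void)sig;
    got_signal = 1;                 /* only async-signal-safe work in the handler */
}

int main(void)
{
    signal(SIGINT, handler);
    while (!got_signal)
        ;                           /* volatile forces a fresh load every iteration */
    return 0;
}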
Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions:
x = 0;
f = 0;
Thread #1:
while (f == 0);
print x;
Thread #2:
x = 42;
f = 1;
I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors due to the out-of-order execution.
However I don't understand why this is not a problem on a single core machine, with those two threads running on the same core (through preemption). According to Wikipedia:
When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak), so what makes sure the program order is preserved?
The CPU would not be aware that these are two threads. Threads are a software construct (1).
So the CPU sees these instructions, in this order:
store x = 42
store f = 1
test f == 0
jump if true ; not taken
load x
If the CPU were to re-order the store of x to the end, after the load, it would change the result. While the CPU is allowed out-of-order execution, it only does this when it doesn't change the result. If it were allowed to change results, virtually every sequence of instructions could fail, and it would be impossible to produce a working program.
In this case, a single CPU is not allowed to re-order a store past a load of the same address. At least, as far as the CPU can see, it is not re-ordered. As far as the L1, L2, L3 caches and main memory (and other CPUs!) are concerned, maybe the store has not been committed yet.
(1) Something like HyperThreads, two threads per core, common in modern CPUs, wouldn't count as "single-CPU" w.r.t. your question.
The CPU doesn't know or care about "context switches" or software threads. All it sees is some store and load instructions. (e.g. in the OS's context-switch code where it saves the old register state and loads the new register state)
The cardinal rule of out-of-order execution is that it must not break a single instruction stream. Code must run as if every instruction executed in program order, and all its side effects finished before the next instruction starts. This includes software context-switching between threads on a single core, e.g. a single-core machine or green threads within one process.
(Usually we state this rule as not breaking single-threaded code, with the understanding of what exactly that means; weirdness can only happen when an SMP system loads from memory locations stored by other cores).
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak)
But remember, other threads aren't observing memory directly with a logic analyzer, they're just running load instructions on that same CPU core that's doing and tracking the reordering.
If you're writing a device driver, yes you might have to actually use a memory barrier after a store to make sure it's actually visible to off-chip hardware before doing a load from another MMIO location.
Or when interacting with DMA, making sure data is actually in memory, not in a CPU-private write-back cache, can be a problem. Also, MMIO is usually done in uncacheable memory regions that imply strong memory ordering. (x86 has cache-coherent DMA, so you don't have to actually flush back to DRAM, only make sure the data is globally visible with an instruction like x86 mfence that waits for the store buffer to drain. But some non-x86 systems, which had cache-control instructions designed in from the start, do require the OS to be aware of this: i.e. to make sure the cache is invalidated before reading in new contents from disk, and to make sure the data is at least written back to somewhere DMA can read from before asking a device to read from a page.)
And BTW, even x86's "strong" memory model is only acq/rel, not seq_cst (except for RMW operations, which are full barriers). More specifically, it's a store buffer with store forwarding on top of sequential consistency, so stores can be delayed until after later loads (StoreLoad reordering). See https://preshing.com/20120930/weak-vs-strong-memory-models/
so what makes sure the program order is preserved?
Hardware dependency tracking: loads snoop the store buffer to check for earlier stores to the location being loaded. This makes sure loads take data from the last program-order write to any given memory location (see footnote 1).
Without this, code like
x = 1;
int tmp = x;
might load a stale value for x. That would be insane and unusable (and kill performance) if you had to put memory barriers after every store for your own reloads to reliably see the stored values.
We need all instructions running on a single core to give the illusion of running in program order, according to the ISA rules. Only DMA or other CPU cores can observe reordering.
Footnote 1: If the address for older stores isn't available yet, a CPU may even speculate that it will be to a different address and load from cache instead of waiting for the store-data part of the store instruction to execute. If it guessed wrong, it will have to roll back to a known good state, just like with branch misprediction.
This is called "memory disambiguation". See also Store-to-Load Forwarding and Memory Disambiguation in x86 Processors for a technical look at it, including cases of narrow reload from part of a wider store, including unaligned and maybe spanning a cache-line boundary...
Let's suppose that there are two threads, A and B. There is also a shared array: float X[100].
Thread A writes to the array one element at a time in order, every 10 steps it updates a shared variable index (in a safe way) that indicates the current index, and it also sends a signal to thread B.
As soon as thread B receives the signal, it reads index in a safe way and then proceeds to read the elements of X up to position index.
Is it safe to do this? Thread A really updates the array or just a copy in cache?
Every sane way of one thread sending a signal to another provides the assurance that anything written by a thread before sending a signal is guaranteed to be visible to a thread after it receives that signal. So as long as you sent the signal through some means that provided this guarantee, which they pretty much all do, you are safe.
Note that attempting to use a condition variable without a predicate protected by a mutex is not a sane way of one thread sending a signal to another! Among other things, it doesn't guarantee that the thread that you think received the signal actually received the signal. You do need to make sure the thread that does the reads in fact received the very signal sent by the thread that does the writes.
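A minimal sketch of that "predicate under a mutex" pattern, with illustrative names (lock, cond, ready_index) rather than anything from the question:

#include <pthread.h>

float X[100];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int ready_index = 0;              /* the predicate: how many elements of X are ready */

void producer_publish(int new_index)     /* thread A, after writing more elements of X */
{
    pthread_mutex_lock(&lock);
    ready_index = new_index;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

int consumer_wait(int wanted)            /* thread B: block until at least 'wanted' elements are ready */
{
    pthread_mutex_lock(&lock);
    while (ready_index < wanted)         /* re-check the predicate; wakeups can be spurious */
        pthread_cond_wait(&cond, &lock);
    int n = ready_index;
    pthread_mutex_unlock(&lock);
    return n;                            /* X[0..n-1] is guaranteed visible to the caller */
}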
Is it safe to do this?
Provided your data modification is rendered safe and protected by critical sections, locks or whatever, this kind of access is perfectly safe as far as the hardware is concerned.
Thread A really updates the array or just a copy in cache?
Just a copy in cache. Most caches today are write-back: they only write data back to memory when a modified line is evicted from the cache. This greatly improves memory bandwidth, especially in a multicore context.
BUT everything happens as if the memory had been updated.
For shared memory processors, there are generally cache coherency protocols (except in some processors for real time applications). The basic idea of these protocols is that a state is associated with every cache line.
The state describes information about the line in the caches of the different processors.
These states indicate, for instance, if the line is only present in the current cache, or is shared by several caches, in sync with memory, invalid... See for instance this description of the popular MESI cache coherence protocol.
So what happens, when a cache line is written and is also present in another processor?
Thanks to the state, the cache knows that one or more other processors also have a copy of the line, and it will send an invalidate signal. The line will be invalidated in the other caches, and when they want to read or write it they have to reload its content. Actually, this reload will be served by the cache that holds the valid copy, to limit memory accesses.
This way, whilst data is only written in the cache, the behavior is similar to a situation where data would have been written to memory.
BUT, despite the fact that the hardware functionally ensures the correctness of the transfers, one must take the existence of the caches into account to avoid performance degradation.
Assume cache A is updating a line and cache B is reading it. Whenever cache A writes, the line in cache B is invalidated, and whenever cache B wants to read it again it must fetch it from cache A. This can lead to many transfers of the line between the caches and make the memory system inefficient.
So concerning your example, 10 is probably not a good step size, and you should use information about the caches to improve the exchanges between sender and receiver.
For instance, if you are on a Pentium with 64-byte cache lines, you should declare X as
_Alignas(64) float X[100];
This way the starting address of X will be a multiple of 64 and will fall on a cache-line boundary. The _Alignas specifier has existed since C11, and by including <stdalign.h> you can equivalently write alignas(64). Before C11, most compilers provided extensions for aligned placement.
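For reference, a sketch of the equivalent spellings mentioned above (Y is just a second illustrative array, and the attribute form is a GCC/Clang extension):

#include <stdalign.h>

alignas(64) float X[100];                    /* alignas macro from <stdalign.h>, C11 and later */
float Y[100] __attribute__((aligned(64)));   /* typical pre-C11 GCC/Clang extension */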
And of course, you should tell thread B to read data only when a full 64-byte line (16 floats) has been written.
This way, when thread B accesses the data, the cache line will no longer be modified by thread A, and only one initial transfer between caches A and B will take place. This reduction in the number of transfers between the caches may have a significant impact on performance, depending on your program.
If you're using a variable that tracks readiness to read the index, the variable is protected by a mutex, and the signalling is done via a pthread condition variable that thread B waits on under the mutex, then yes.
If you're using POSIX signals, then I believe you need a synchronization mechanism on top of that. Writing to an atomic variable with memory_order_release in thread A and reading it with memory_order_acquire in thread B is the most lightweight way to guarantee that writes in A preceding the write to the atomic are visible in B after it has read the atomic.
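A minimal sketch of that release/acquire pairing, with illustrative names (X, ready_index) standing in for the question's array and index:

#include <stdatomic.h>

float X[100];
static _Atomic int ready_index = 0;

void thread_a_publish(int n)     /* call after writing X[0..n-1] */
{
    atomic_store_explicit(&ready_index, n, memory_order_release);
}

int thread_b_poll(void)
{
    int n = atomic_load_explicit(&ready_index, memory_order_acquire);
    /* If n > 0, the writes to X[0..n-1] made before the release store are visible here. */
    return n;
}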
For best performance, the array sharing should also be done in such a way that the shared parts of the array do not cross cache-line boundaries (or else your performance might degrade due to false sharing).
When a variable is declared with the volatile keyword, its value may change at any moment from outside the scope of the program. What does that mean? Will it change from outside the scope of the main function, or outside the scope of a globally declared function? And what does this mean for an embedded system, if two or more events are performed simultaneously?
volatile was originally intended for stuff like reading from a memory mapped hardware device; each time you read from something like a memory address mapped to a serial port it might have a new value, even if nothing in your program wrote to it. volatile makes it clear that the data there may change at any time, so it should be reread each time, rather than allowing the compiler to optimize it to a single read when it knows your program never changes it. Similar cases can occur even without hardware interference; asynchronous kernel callbacks may write back into user mode memory in a similar way, so reading the value afresh each time is sometimes necessary.
An optimizing compiler assumes there is only the context of a single thread of execution. Another context means anything the compiler can't see happening at the same time: hardware actions, interrupt handlers, or other threads or processes. Where your code accesses a global (program- or file-level) variable, the optimizer won't assume another context might change or read it unless you tell it to by using the volatile qualifier.
Take the case of a hardware register that is memory mapped and that you read in a while loop, waiting for it to change. Without volatile, the compiler only sees your while loop reading the register, and if you allow it to optimize the code it will optimize away the repeated reads and never see a change in the register. This is what we normally want an optimizing compiler to do with variables that don't change in a loop.
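A minimal sketch of that polling loop; the register name and address are made up, and only the volatile qualifier matters here:

#include <stdint.h>

#define STATUS_REG (*(volatile uint32_t *)0x40001000u)   /* hypothetical status register */

void wait_for_ready(void)
{
    /* Without volatile, the optimizer could hoist the read out of the loop
       and spin forever on a stale value. */
    while ((STATUS_REG & 0x1u) == 0)
        ;
}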
A similar thing happens to memory mapped hardware registers you write to. If your program never reads from them the compiler could optimize away the write. Again this is what you want an optimizing compiler to do when you are not dealing with a memory location that is used by hardware or another context.
Interrupt handlers and forked threads are treated the same way as hardware: the optimizer doesn't assume they run at the same time as your code, and it will only refrain from optimizing away a load or store to a shared memory location if you use volatile.
The following link says that "Access to device registers is always uncached":
http://techpubs.sgi.com/library/dynaweb_docs/hdwr/SGI_Developer/books/DevDrvrO2_PG/sgi_html/ch01.html
My question is: do we ever need volatile when accessing device registers that are memory-mapped?
The confusion here comes from two mechanisms which have similarities in their goals, but quite distinct mechanisms and levels of implementation.
The link refers to memory mapped I/O regions being configured as ineligible for hardware caching in fast intermediate memory that is used to speed operations compared to accessing slower main memory banks. This is traditionally nearly transparent to software (exceptions being things like modifying code on a machine with distinct instruction and data caches).
In contrast, volatile is used to prohibit an optimizing compiler from performing "software" caching of values by strategically holding them in registers, delaying calculating them until needed, or perhaps never calculating them if un-needed. The basic effect is to inform the compiler that the value may be produced or consumed by a mechanism invisible to its analysis - be that either hardware beyond the present processor core, or a distinct thread or context of execution.
This question is a more processor-specific version of Why is volatile needed in C?
This is one of the two situations where volatile is mandatory (and it would be nice if compilers could know that).
Any memory location which can change either without your code initiating it (e.g. a memory-mapped device register) or without your thread initiating it (i.e. it is changed by another thread or by an interrupt handler) absolutely must be declared volatile to prevent the compiler optimizing away memory-fetch operations.
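A hedged example of the device-register case (the name and address are made up): even if the hardware maps the region uncached, the compiler still needs the volatile qualifier to emit a real load on each access:

#include <stdint.h>

void spin_until_done(void)
{
    volatile uint32_t * const device_status = (volatile uint32_t *)0x4000A000u;  /* hypothetical */

    while ((*device_status & 0x1u) == 0)   /* volatile: a fresh device read every iteration */
        ;
}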
I am a little confused as to the real issues between multi-core and multi-cpu environments when it comes to shared memory, with particular reference to mmap in C.
I have an application that utilizes mmap to share multiple segments of memory between 2 processes. Each process has access to:
A Status and Control memory segment
Raw data (up to 8 separate raw data buffers)
The Status and Control segment is used essentially for IPC, i.e. it may convey that buffer 1 is ready to receive data, that buffer 3 is ready for processing, or that the Status and Control memory segment is locked whilst being updated by either parent or child, etc.
My understanding is, and PLEASE correct me if I am wrong, that in a multi-core CPU environment on a single-board PC-type infrastructure, mmap is safe. That is, regardless of the number of cores in the CPU, RAM is only ever accessed by a single core (or process) at any one time.
Does this assumption of single-process RAM access also apply to multi-cpu systems? That is, a single PC style board with multiple CPU's (and I guess, multiple cores within each CPU).
If not, I will need to seriously rethink my logic to allow for multi-cpu'd single-boarded machines!
Any thoughts would be greatly appreciated!
PS - by single boarded I mean a single, standalone PC style system. This excludes mainframes and the like ... just to clarify :)
RAM is only ever accessed by a single core (or process) at any one time.
Take a step back and think about what your assumption means. Theoretically, yes, this statement is true, but I don't think it means what you think it means. There are no practical conclusions you can draw from it other than maybe "the memory will not catch fire if two CPUs write to the same address at the same time". Let me explain.
If one CPU/process writes to a memory location and then a different CPU/process writes to the same location, the memory writes will not happen at the same time; they will happen one at a time. You can't generally reason about which write will happen before the other, you can't reason about whether a read from one CPU will happen before the write from the other CPU, and on some older CPUs you can't even reason about whether multi-byte (multi-word, actually) values will be stored/accessed one byte at a time or multiple bytes at a time (which means that reads and writes of multibyte values can get interleaved between CPUs or processes).
The only thing multiple CPUs change here is the order of memory reads and writes. On a single CPU you can be pretty sure that your reads from memory will see earlier writes to the same memory (if other hardware is reading/writing the memory, all bets are off). On multiple CPUs the order of reads and writes to different memory locations will surprise you (CPU 1 writes to address 1 and then 2, but CPU 2 might see the new value at address 2 and the old value at address 1).
So unless you have specific documentation from your operating system and/or CPU manufacturer, you can't make any assumptions (except that when two writes to the same memory location happen, one will happen before the other). This is why you should use libraries like pthreads or stdatomic.h from C11 for proper locking and synchronization, or really dig deep into the most complex parts of the CPU documentation to actually understand what will happen. The locking primitives in pthreads not only provide locking, they also guarantee that memory is properly synchronized. stdatomic.h is another way to guarantee memory synchronization, but you should carefully read the C11 standard to see what it promises and what it doesn't.
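As an illustration of the stdatomic.h route for this kind of mmap'ed Status and Control segment, here is a hedged sketch; 'struct control' and its fields are made-up names, and it assumes the atomics are lock-free on your platform so they work across processes:

#include <stdatomic.h>
#include <stdbool.h>

struct control {
    atomic_bool buf_ready[8];    /* one flag per raw data buffer */
};

/* Producer process, after it has finished filling buffer i: */
void mark_ready(struct control *ctrl, int i)
{
    atomic_store_explicit(&ctrl->buf_ready[i], true, memory_order_release);
}

/* Consumer process: true once the buffer contents written before mark_ready() are visible. */
bool is_ready(struct control *ctrl, int i)
{
    return atomic_load_explicit(&ctrl->buf_ready[i], memory_order_acquire);
}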
One potential issue is that each core has its own cache (usually just level 1, as the level 2 and level 3 caches are usually shared). Each CPU would also have its own cache. However, most systems ensure cache coherency, so this isn't the issue (except for the performance impact of constantly invalidating caches due to writes by different cores or processors to memory shared in the same cache line).
The real issue is that there is no guarantee against reordering of reads and writes due to optimizations by the compiler and/or the hardware. You need to use a Memory Barrier to flush out any pending memory operations to synchronize the state of the threads or shared memory of processes. The memory barrier will occur if you use one of the synchronization types such as an event, mutex, semaphore, ... . Not all of the shared memory reads and writes need to be atomic, but you need to use synchronization between threads and/or processes before accessing any shared memory possibly updated by another thread and/or process.
This does not sound right to me. Two processes on two different cores can both load and store data to RAM at the same time. In addition to this, caching strategies can result in all kinds of strangeness. So please make sure all access to shared memory is properly synchronized using (interprocess) synchronization objects.
My understanding is, and PLEASE correct me if I am wrong, that in a multi-core CPU environment on a single-board PC-type infrastructure, mmap is safe. That is, regardless of the number of cores in the CPU, RAM is only ever accessed by a single core (or process) at any one time.
Even if this holds true for some particular architecture, such an assumption is entirely wrong in general. You should have proper synchronisation between the processes that modify the shared memory segment, unless atomic intrinsics are used and the algorithm itself is lock-free.
I would advise you to put a pthread_mutex_t in the shared memory segment (shared across all processes). You will have to initialise it with the PTHREAD_PROCESS_SHARED attribute:
#include <pthread.h>

/* 'mutex' is a pthread_mutex_t * pointing into the mmap'ed shared segment */
pthread_mutexattr_t mutex_attr;
pthread_mutexattr_init(&mutex_attr);
pthread_mutexattr_setpshared(&mutex_attr, PTHREAD_PROCESS_SHARED);
pthread_mutex_init(mutex, &mutex_attr);
pthread_mutexattr_destroy(&mutex_attr);
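A hedged usage sketch of how either process would then guard its accesses (update_status and status_word are illustrative names, not from the question):

void update_status(pthread_mutex_t *mutex, int *status_word, int value)
{
    pthread_mutex_lock(mutex);    /* the lock/unlock pair also provides the required memory synchronization */
    *status_word = value;         /* touch the Status and Control data only while holding the lock */
    pthread_mutex_unlock(mutex);
}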