I am reading TLB shootdown code in linux kernel and I saw that shootdown ipi's were sent only to cpu's set in cpu_vm_mask_var in the corresponding mm_struct but I couldn't find where the cpu_vm_mask_var is being updated.
So the questions are:
What does cpu_vm_mask_var field in mm_struct represent?
Where is it being updated?
I think in shootdown case cpu_vm_mask_var should say how many CPU's contains present processes TLB entries but is that what exactly maintained by cpu_vm_mask_var?
The memory descriptor of each process has a bit mask called cpu_vm_mask_var and it is typically used when the process is executing on at least one processor. When a process is scheduled to run on a processor, the corresponding bit of the bit mask is set. Similarly, when the scheduler decides to run something else on the processor, the corresponding but is reset. The field cpu_vm_mask_var is modified in three situations:
When the memory descriptor changes by calling switch_mm. In this case, the bit that corresponds to the current processor is cleared for the previous process and is set for the next process.
When a new processor is added to the system, the clear_tasks_mm_cpumask function gets called, which resets the bit that corresponds to the new processor.
cpu_vm_mask_var is used to support the lazy TLB switching mechanism. If the scheduler decides to run a kernel thread, it will turn on lazy TLB mode by calling enter_lazy_tlb. However, in this case, there is no need to invalidate a TLB entry that refers to a user-mode paging structure entry because kernel threads don't access user mode entries. So performance can be improved by disabling TLB shootdowns requests for the processor on which the kernel thread is running and delay the invalidation until switching back to the process that may use the invalidated entries. When a processor that is running a kernel thread receives for the first time an inter-processor interrupt to invalidate one or more TLB entries, the switch_mm_irqs_off function gets called. This function (in this particular case) will reset the bit that corresponds to the current processor in the bit mask so that it no longer receives any IPIs regarding flushing TLB user-mode entries. When the processor switches to a process that has a different memory descriptor, the write to CR3 will flush all the non-global TLB entries. Otherwise, when the processor switches back to the same process, it knows that one or more has become invalid and so it also flushes all non-global TLB entries. cpu_vm_mask_var is modified in switch_mm_irqs_off. Note that flushing kernel-mode TLB entries don't use this mechanism.
Related
Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions:
x = 0;
f = 0;
Thread #1:
while (f == 0);
print x;
Thread #2:
x = 42;
f = 1;
I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors due to the out-of-order execution.
However I don't understand why this is not a problem on a single core machine, with those two threads running on the same core (through preemption). According to Wikipedia:
When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak), so what makes sure the program order is preserved?
The CPU would not be aware that these are two threads. Threads are a software construct (1).
So the CPU sees these instructions, in this order:
store x = 42
store f = 1
test f == 0
jump if true ; not taken
load x
If the CPU were to re-order the store of x to the end, after the load, it would change the results. While the CPU is allowed out of order execution, it only does this when it doesn't change the result. If it was allowed to do that, virtually every sequence of instructions would possibly fail. It would be impossible to produce a working program.
In this case, a single CPU is not allowed to re-order a store past a load of the same address. At least, as far the CPU can see it is not re-ordered. As far the as the L1, L2, L3 cache and main memory (and other CPUs!) are concerned, maybe the store has not been committed yet.
(1) Something like HyperThreads, two threads per core, common in modern CPUs, wouldn't count as "single-CPU" w.r.t. your question.
The CPU doesn't know or care about "context switches" or software threads. All it sees is some store and load instructions. (e.g. in the OS's context-switch code where it saves the old register state and loads the new register state)
The cardinal rule of out-of-order execution is that it must not break a single instruction stream. Code must run as if every instruction executed in program order, and all its side-effects finished before the next instruction starts. This includes software context-switching between threads on a single core. e.g. a single-core machine or green-threads within on process.
(Usually we state this rule as not breaking single-threaded code, with the understanding of what exactly that means; weirdness can only happen when an SMP system loads from memory locations stored by other cores).
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak)
But remember, other threads aren't observing memory directly with a logic analyzer, they're just running load instructions on that same CPU core that's doing and tracking the reordering.
If you're writing a device driver, yes you might have to actually use a memory barrier after a store to make sure it's actually visible to off-chip hardware before doing a load from another MMIO location.
Or when interacting with DMA, making sure data is actually in memory, not in CPU-private write-back cache, can be a problem. Also, MMIO is usually done in uncacheable memory regions that imply strong memory ordering. (x86 has cache-coherent DMA so you don't have to actually flush back to DRAM, only make sure its globally visible with an instruction like x86 mfence that waits for the store buffer to drain. But some non-x86 OSes that had cache-control instructions designed in from the start do requires OSes to be aware of it. i.e. to make sure cache is invalidated before reading in new contents from disk, and to make sure it's at least written back to somewhere DMA can read from before asking a device to read from a page.)
And BTW, even x86's "strong" memory model is only acq/rel, not seq_cst (except for RMW operations which are full barriers). (Or more specifically, a store buffer with store forwarding on top of sequential consistency). Stores can be delayed until after later loads. (StoreLoad reordering). See https://preshing.com/20120930/weak-vs-strong-memory-models/
so what makes sure the program order is preserved?
Hardware dependency tracking; loads snoop the store buffer to look for loads from locations that have recently been stored to. This makes sure loads take data from the last program-order write to any given memory location1.
Without this, code like
x = 1;
int tmp = x;
might load a stale value for x. That would be insane and unusable (and kill performance) if you had to put memory barriers after every store for your own reloads to reliably see the stored values.
We need all instructions running on a single core to give the illusion of running in program order, according to the ISA rules. Only DMA or other CPU cores can observe reordering.
Footnote 1: If the address for older stores isn't available yet, a CPU may even speculate that it will be to a different address and load from cache instead of waiting for the store-data part of the store instruction to execute. If it guessed wrong, it will have to roll back to a known good state, just like with branch misprediction.
This is called "memory disambiguation". See also Store-to-Load Forwarding and Memory Disambiguation in x86 Processors for a technical look at it, including cases of narrow reload from part of a wider store, including unaligned and maybe spanning a cache-line boundary...
Let's suppose that there are two threads, A and B. There is also a shared array: float X[100].
Thread A writes to the array one element at a time in order, every 10 steps it updates a shared variable index (in a safe way) that indicates the current index, and it also sends a signal to thread B.
As soon as thread B receives the signal, it reads index in a safe way, and then proceed to read the elements of X until position index.
Is it safe to do this? Thread A really updates the array or just a copy in cache?
Every sane way of one thread sending a signal to another provides the assurance that anything written by a thread before sending a signal is guaranteed to be visible to a thread after it receives that signal. So as long as you sent the signal through some means that provided this guarantee, which they pretty much all do, you are safe.
Note that attempting to use a condition variable without a predicate protected by a mutex is not a sane way of one thread sending a signal to another! Among other things, it doesn't guarantee that the thread that you think received the signal actually received the signal. You do need to make sure the thread that does the reads in fact received the very signal sent by the thread that does the writes.
Is it safe to do this?
Provided your data modification is rendered safe and protected by critical sections, locks or whatever, this kind of access is perfectly safe for what concerns hardware access.
Thread A really updates the array or just a copy in cache?
Just a copy in cache. Most caches are presently write-back and just write data back to memory when a line is ejected from the cache if it has been modified. This largely improves memory bandwidth, especially in a multicore context.
BUT all happens as if the memory had been updated.
For shared memory processors, there are generally cache coherency protocols (except in some processors for real time applications). The basic idea of these protocols is that a state is associated with every cache line.
State describes informations concerning the line in the cache of the different processors.
These states indicate, for instance, if the line is only present in the current cache, or is shared by several caches, in sync with memory, invalid... See for instance this description of the popular MESI cache coherence protocol.
So what happens, when a cache line is written and is also present in another processor?
Thanks to the state, the cache knows that one or more other processor also have a copy of the line and it will send an invalidate signal. The line will be invalidated in the other caches and when they want to read or write it, they have to reload its content. Actually, this reload will be served by the cache that has the valid copy to limit memory accesses.
This way, whilst data is only written in the cache, the behavior is similar to a situation where data would have been written to memory.
BUT, despite the fact that functionally the hardware will ensure correctness of the transfer, one must be take into account the cache existence, to avoid performances degradation.
Assume cache A is updating a line and cache B is reading it. Whenever cache A writes, the line in cache B is invalidated. And whenever cache B wants to read it, if the line has been invalidated, it must fetch it from cache A. This can lead to many transfers of the line between the caches and render inefficient the memory system.
So concerning your example, probably 10 is not a good idea, and you should use informations on the caches to improve your exchanges between sender and receiver.
For instance, if you are on a pentium with 64 bytes cache lines, you should declare X as
_Alignas(64) float X[100];
This way the starting address of X will be a multiple of 64 and fit cache lines boundaries. The _Alignas quaiifier exists since C17, and by including stdalign.h, you can also use similarly alignas(64). Before C17, there were several extensions in most compilers in order to have an aligned placement.
And of course, you should indicate process B to read data only when a full 64 bytes line (16 floats) has been written.
This way, when thread B accesses the data, the cache line will not be modified any longer by thread A and only one initial transfer between caches A and B Will take place. This reduction in the number of transfers between the caches may have a significant impact on performances depending on your program.
If you're using a variable to that tracks readiness to read the index, the variable is protected by a mutex and the signalling is done via a pthread condition variable that thread B waits on under the mutex, then yes.
If you're using POSIX signals, then I believe you need a synchronization mechanism on top of that. Writing to an atomic variable with memory_order_release in thread A, and reading it with memory_order_acquire in thread B should guarantee in the most lightweight fashion that writes in A preceding the write to the atomic should be visible in B after it has read the atomic.
For best performance, the array sharing should be also done in such a way that the shared parts of the array do not cross cache-line boundaries (or else you're performance might degrade due to false sharing).
I've read similar answers on this site, and elsewhere, but am still confused in a few circumstances.
I'm aware of what the standard actually guarantees us, I understand the intended use of the keyword, and I'm well aware of the difference between the compiler caching and L1/L2/ect. caching; it's more for curiosity's sake that I understand the other cases.
Say I have a variable declared volatile in C. Four scenarios:
Signal handlers, single threaded (As intended): This is the problem the keyword was meant to solve. My process gets a signal callback from the OS, and I modify some volatile variable out of the normal execution of my process. Since it was declared volatile, the normal process won't store this value in a CPU register, and will always do a load from memory. Even if the signal handler writes to the volatile variable, since the signal handler shares the same address space as the normal process, even if the volatile variable was previously cached in hardware (i.e. L1, L2), we guarantee the main process will load the correct, updated variable. Perfect, everyone is happy.
DMA-transfers, single-threaded: Say the volatile variable is mapped to a region of memory for which a DMA-write is taking place. As before, the compiler won't keep the volatile variable in a CPU register, and will always do a load from memory; however, if that variable exists in hardware cache, then the load request will never reach main memory. If the DMA controller updates MM behind our backs, we'll never get the up-to-date value. In a preemptive OS, we are saved by the fact that eventually, we'll probably be context-switched out, and the next time our process resumes, the cache will be cold and we'll actually have to reload from main memory - so we'll get the correct functionality.. eventually (our own process could potentially swap that cache line out too - but again, we might waste valuable cycles before that happens). Is there standardized HW support or OS support that notifies the hardware caches when main memory is updated via the DMA controller? Or do we have to explicitly flush the cache to guarantee we arm't reading a false value? (Is this even possible in the architectures listed?)
Memory-mapped registers, single-threaded: Same as #2, except the volatile variable is mapped to a memory-mapped register (or an explicit IO-port). I would imagine this is a more difficult problem then #3, since at least the DMA controller will signal the CPU when it's done transferring, which gives the OS or HW a chance to do something.
Mutilthreaded: If I have a volatile variable, is there any guarantee of cache-coherency between multiple threads running on separate physical cores? Like sure, again, the compiler is still issuing load instructions from memory, but if the value is cached in one core's cache, is there any guarantee the same value must exist in the other core's caches? (I would imagine it's not an issue at all for hyperthreading threads on different logical cores on the same physical core, since they share physical cache memory). My overwhelming intuition says no, but thought I'd list the case here anyways.
If possible, differentiate between x64 and ARMv6/7/8 architectures, and kernel vs user land solutions.
For 2 and 3, no there's no standardized way this would work.
Normally when doing DMA transfers one would flush the cache in a platform depending manner. Normally there's quite straight forward instructions for doing that (since now-days the caches are integrated in the CPU).
When accessing memory-mapped registers on the other hand, often the behavior is dependent on the order of writes. For example, suppose you have a UART port and write characters to it — you'll need to make sure that there is an actual write to the port each time you write to it from C.
While it might work with flushing the cache between each write, it's not what one normally does. The normal way (for ARM at least) is to set up the MMU so that writes to certain regions of address space happen uncached and in correct sequence.
This approach can also be used for memory used for DMA transfers; one could for example set up dedicated regions for use as DMA buffers and set up the MMU so that reads and writes to that region happen uncached.
On the other hand the language guarantees that all memory (well what you get from declaring variables or allocating memory using new) will behave in certain ways. It should be no difference between if it's multi-threaded or there's signals involved. Note that the C90 and C99 standards don't mention threads (C11 does), but they are supposed to work this way. The implementation has to make sure that the CPU's and cache are used in a way that is consistent with this (as a consequence, the OS might not be able to schedule different threads on different cores if this can't be accomplished). Consequently you should not need to flush caches in order to share data between threads, but you do need to synchronize threads and of course use volatile qualified data. The same is true for signal handlers even if the implementation happens to schedule them on a different core.
I have a Linux kernel module which calculates network packet statistics among several CPUs (in kernel address space). Periodically I clear the corresponding memory chunk and strongly need this action to take immediate effect for all CPUs, otherwise it will distort the subsequent statistics values. My target CPU is a Power PC, so its cache coherency is very relaxed. Thus I need to manually flush data caches of all CPUs just after zeroing the memory.
So what should I place just after my clearing procedure:
memset(ptr, 0, size);
// what's going here?
After some reflection I realised that the problem here is not really linked to data cache flushing. Actually I try to avoid a banal race condition (the first cpu clears value while the second increments it). In my case it's too expensive to protect data by mutex, so it's worth using atomic flag to notify the owning CPU to clear the values by itself.
How much is the cost of interrupt in x86_64. For example the interrupt due to a page fault? How much cycles are required for the kernel to service the interrupt and then go back to user-space? I am interested in knowning only the cost due to the interrupt and scheduling the interrupted user-level thread back, so we can neglect what is going on inside the interrupt handler here.
For odrinary interrupts (hardware IRQ or ordinary exception like division by zero) it is probably possible to give an upper bound.
Time to process a page fault is especially tricky to assess even when disk IO is not involved because the CPU has to walk the page tables, which introduces many variables. Page faults occur not only because pages are not present, but also because of access violations (e.g., trying to write to a read-only page). In any case, if the page mapping is not already present in the TLB (missing mappings are never cached), the CPU will first have to walk multiple levels of page tables before even invoking the page fault handler. The time to access page table entries (in case the address is not already cached in the TLB) is again dependent on whether some entries are already in data caches.
So the time from accessing a linear address to PF handler being invoked might be anything from ~200 cycles (best case; TLB entry present, exception due to wrong access type -- just ring switch) to ~2000 cycles (no TLB entry present, no page table entries in data cache). This is just the time between 1) executing a user-mode instruction that faults and 2) executing the first instruction of the page fault handler.
[Side-comment: given that, I wonder whether it's possible to build hard real-time systems that use paging.]
This is a complex question and cannot be answered easily.
You have to save all (used) registers (scalar,sse,fpu-state,avx, etc.) that are being used in the interrupt.
Maybe you have to change the virtual address space context.
When you are done, you have to reset the saved context.
And all the while cache/RAM load effects change the cycle count needed.
(NB: Interrupts should not be paged out, but no idea if linux supports this, or if it is at all possible)