How much is the cost of interrupt in x86_64 - c

How much does an interrupt cost on x86_64, for example the interrupt due to a page fault? How many cycles are required for the kernel to service the interrupt and then return to user space? I am interested only in the cost of the interrupt itself and of scheduling the interrupted user-level thread back, so we can neglect what goes on inside the interrupt handler.

For ordinary interrupts (a hardware IRQ or an ordinary exception such as division by zero) it is probably possible to give an upper bound.
Time to process a page fault is especially tricky to assess even when disk IO is not involved because the CPU has to walk the page tables, which introduces many variables. Page faults occur not only because pages are not present, but also because of access violations (e.g., trying to write to a read-only page). In any case, if the page mapping is not already present in the TLB (missing mappings are never cached), the CPU will first have to walk multiple levels of page tables before even invoking the page fault handler. The time to access page table entries (in case the address is not already cached in the TLB) is again dependent on whether some entries are already in data caches.
So the time from accessing a linear address to PF handler being invoked might be anything from ~200 cycles (best case; TLB entry present, exception due to wrong access type -- just ring switch) to ~2000 cycles (no TLB entry present, no page table entries in data cache). This is just the time between 1) executing a user-mode instruction that faults and 2) executing the first instruction of the page fault handler.
[Side-comment: given that, I wonder whether it's possible to build hard real-time systems that use paging.]
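One way to get a rough number on a particular machine is to time the first write to each page of a fresh anonymous mapping. The sketch below assumes Linux (or another POSIX system that provides MAP_ANONYMOUS), and the function name avg_first_touch_ns is made up for illustration. Note that it measures the whole minor fault, including the work the handler does to allocate and zero the page, so it gives an upper bound on the pure entry/exit cost rather than the entry/exit cost itself.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: average wall-clock nanoseconds per first write to a
 * page of a fresh anonymous mapping. Returns -1.0 on error.
 * Each first write is expected to take one minor page fault.
 */
double avg_first_touch_ns(size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || npages == 0)
        return -1.0;

    size_t len = npages * (size_t)page;
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1.0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < npages; i++)
        ((volatile unsigned char *)buf)[i * (size_t)page] = 1; /* touch page */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    munmap(buf, len);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)npages;
}
```

Multiplying the result by the core clock in GHz gives an approximate cycle count. Transparent huge pages, cache state, and frequency scaling can all shift the number noticeably.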

This is a complex question and cannot be answered easily.
You have to save all (used) registers (scalar,sse,fpu-state,avx, etc.) that are being used in the interrupt.
Maybe you have to change the virtual address space context.
When you are done, you have to reset the saved context.
And all the while cache/RAM load effects change the cycle count needed.
(NB: Interrupt handlers should not be paged out, but I have no idea whether Linux supports this, or whether it is possible at all.)

Related

Why Do Page Faults and Unrecoverable Errors Need to be Unmaskable?

Looking for a quick clarification on why unrecoverable errors and page faults must be non-maskable interrupts? What happens when they aren't?
Interrupts and exceptions are very different kinds of events.
An interrupt is an event external to the CPU that arrives at the processor asynchronously (the moment of arrival does not depend on the currently executing program).
An exception is an event internal to the CPU that happens as a side effect of instruction execution.
Consider the processor as an overcomplex, unstoppable automaton with well-defined and strictly specified behavior. It continuously fetches, decodes, and executes instructions, one by one. When it executes an instruction, it applies the result to the state of the automaton (registers and memory) according to the instruction's type. It moves without pauses and interruptions. You can only change the direction of this continuous instruction crunching using function calls and jumps.
Such an automaton-like model, supported by well-defined and strictly specified instruction behavior, makes the processor extremely predictable and convenient to program for compilers and software engineers. When you look at the assembler listing, you can say precisely what the processor will do when it executes the program. However, under some specific circumstances, the execution of an instruction can fall outside this well-defined model. In such cases the CPU literally does not know what to do next or how to react. For example, the program tries to divide by zero. What reaction do you expect? What value should the CPU place into the target register as the result of the division? How can it report to the program that something went wrong? Now imagine another case. The program jumps to some virtual address that has no physical address mapped. How should the CPU proceed with its unstoppable fetch-decode-execute job? From where should it take the next instruction to execute? Which instruction should it execute? Or maybe it should hang in response? There is no way out of such states.
An exception is a tool for the CPU to get out of such situations gracefully and resume its unstoppable movement. At the same time, it is a tool to report the encountered error to the operating system and ask it to help with handling it. If you could turn off exceptions, you would take that tool away from the CPU and put all of the above issues back on the table. CPU designers do not have good answers for them and do not want to see them. For this reason, they make exceptions unmaskable.

What is cpumask in mm_struct

I am reading the TLB shootdown code in the Linux kernel, and I saw that shootdown IPIs are sent only to the CPUs set in cpu_vm_mask_var in the corresponding mm_struct, but I couldn't find where cpu_vm_mask_var is updated.
So the questions are:
What does cpu_vm_mask_var field in mm_struct represent?
Where is it being updated?
I think that in the shootdown case cpu_vm_mask_var should say which CPUs hold TLB entries of the process in question, but is that exactly what cpu_vm_mask_var maintains?
The memory descriptor of each process has a bit mask called cpu_vm_mask_var, and it is typically used when the process is executing on at least one processor. When a process is scheduled to run on a processor, the corresponding bit of the bit mask is set. Similarly, when the scheduler decides to run something else on the processor, the corresponding bit is reset. The field cpu_vm_mask_var is modified in three situations:
When the memory descriptor changes by calling switch_mm. In this case, the bit that corresponds to the current processor is cleared for the previous process and is set for the next process.
When a processor is taken offline (removed from the system), the clear_tasks_mm_cpumask function gets called, which resets the bit that corresponds to that processor.
cpu_vm_mask_var is used to support the lazy TLB switching mechanism. If the scheduler decides to run a kernel thread, it will turn on lazy TLB mode by calling enter_lazy_tlb. In this case, there is no need to invalidate a TLB entry that refers to a user-mode paging structure entry, because kernel threads don't access user-mode entries. So performance can be improved by disabling TLB shootdown requests for the processor on which the kernel thread is running and delaying the invalidation until the processor switches back to a process that may use the invalidated entries. When a processor that is running a kernel thread receives, for the first time, an inter-processor interrupt to invalidate one or more TLB entries, the switch_mm_irqs_off function gets called. This function (in this particular case) will reset the bit that corresponds to the current processor in the bit mask so that the processor no longer receives any IPIs regarding flushing user-mode TLB entries. When the processor switches to a process that has a different memory descriptor, the write to CR3 will flush all the non-global TLB entries. Otherwise, when the processor switches back to the same process, it knows that one or more entries have become invalid, and so it also flushes all non-global TLB entries. cpu_vm_mask_var is modified in switch_mm_irqs_off. Note that flushing kernel-mode TLB entries doesn't use this mechanism.
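The bookkeeping described above can be modeled in user space with a plain 64-bit mask. This is only a sketch under stated assumptions: the struct and the toy_* functions are hypothetical names invented for illustration, they are not kernel APIs, and the model supports at most 64 CPUs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a memory descriptor; not the kernel's mm_struct. */
struct toy_mm {
    uint64_t cpu_mask;  /* bit n set: CPU n may hold user TLB entries for this mm */
};

/* Model of the mask update on a context switch on `cpu` (cpu must be < 64). */
void toy_switch_mm(struct toy_mm *prev, struct toy_mm *next, unsigned cpu)
{
    uint64_t bit = UINT64_C(1) << cpu;

    if (prev == next)
        return;                   /* same address space: nothing to update */
    if (prev)
        prev->cpu_mask &= ~bit;   /* this CPU stops running prev */
    if (next)
        next->cpu_mask |= bit;    /* this CPU starts running next */
}

/* Model of a lazy-TLB CPU dropping out of the mask after its first shootdown IPI. */
void toy_leave_lazy(struct toy_mm *mm, unsigned cpu)
{
    mm->cpu_mask &= ~(UINT64_C(1) << cpu);
}

/* In this model, a shootdown for `mm` only interrupts CPUs whose bit is set. */
bool toy_needs_ipi(const struct toy_mm *mm, unsigned cpu)
{
    return (mm->cpu_mask >> cpu) & 1u;
}
```

In this model, once toy_leave_lazy has cleared a CPU's bit, later shootdowns skip that CPU, which matches the idea of deferring the flush until the CPU switches back to a user process.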

How is load->store reordering possible with in-order commit?

ARM allows loads to be reordered with subsequent stores, so that the following pseudocode:
// CPU 0 | // CPU 1
temp0 = x; | temp1 = y;
y = 1; | x = 1;
can result in temp0 == temp1 == 1 (and, this is observable in practice as well). I'm having trouble understanding how this occurs; it seems like in-order commit would prevent it (which, it was my understanding, is present in pretty much all OOO processors). My reasoning goes "the load must have its value before it commits, it commits before the store, and the store's value can't become visible to other processors until it commits."
I'm guessing that one of my assumptions must be wrong, and something like one of the following must hold:
Instructions don't need to commit all the way in-order. A later store could safely commit and become visible before an earlier load, so long as at the time the store commits the core can guarantee that the previous load (and all intermediate instructions) won't trigger an exception, and that the load's address is guaranteed to be distinct from the store's.
The load can commit before its value is known. I don't have a guess as to how this would be implemented.
Stores can become visible before they are committed. Maybe a memory buffer somewhere is allowed to forward stores to loads to a different thread, even if the load was enqueued earlier?
Something else entirely?
There's a lot of hypothetical microarchitectural features that would explain this behavior, but I'm most curious about the ones that are actually present in modern weakly ordered CPUs.
Your bullet points of assumptions all look correct to me, except that you could build a uarch where loads can retire from the OoO core after merely checking permissions (TLB) on a load to make sure it can definitely happen. There could be OoO exec CPUs that do that (update: apparently there are).
I think x86 CPUs require loads to actually have the data arrive before they can retire, but their strong memory model doesn't allow LoadStore reordering anyway. So ARM certainly could be different.
You're right that stores can't be made visible to any other cores before retirement. That way lies madness. Even on an SMT core (multiple logical threads on one physical core), it would link speculation on two logical threads together, requiring them both to roll back if either one detected mis-speculation. That would defeat the purpose of SMT of having one logical thread take advantage of stalls in others.
(Related: Making retired but not yet committed (to L1d) stores visible to other logical threads on the same core is how some real PowerPC implementations make it possible for threads to disagree on the global order of stores. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?)
CPUs with in-order execution can start a load (check the TLB and write a load-buffer entry) and only stall if an instruction tries to use the result before it's ready. Then later instructions, including stores, can run normally. This is basically required for non-terrible performance in an in-order pipeline; stalling on every cache miss (or even just L1d latency) would be unacceptable. Memory parallelism is a thing even on in-order CPUs; they can have multiple load buffers that track multiple outstanding cache misses. High(ish) performance in-order ARM cores like Cortex-A53 are still widely used in modern smartphones, and scheduling loads well ahead of when the result register is used is a well-known important optimization for looping over an array. (Unrolling or even software pipelining.)
So if the load misses in cache but the store hits (and commits to L1d before earlier cache-miss loads get their data), you can get LoadStore reordering. (Jeff Preshing's intro to memory reordering uses that example for LoadStore, but doesn't get into uarch details at all.)
A load can't fault after you've checked the TLB and / or whatever memory-region stuff for it. That part has to be complete before it retires, or before it reaches the end of an in-order pipeline. Just like a retired store sitting in the store buffer waiting to commit, a retired load sitting in a load buffer is definitely happening at some point.
So the sequence on an in-order pipeline is:
lw r0, [r1] TLB hit, but misses in L1d cache. Load execution unit writes the address (r1) into a load buffer. Any later instruction that tries to read r0 will stall, but we know for sure that the load didn't fault.
With r0 tied to waiting for that load buffer to be ready, the lw instruction itself can leave the pipeline (retire), and so can later instructions.
Any number of other instructions that don't read r0. (Reading r0 would stall an in-order pipeline.)
sw r2, [r3] store execution unit writes address + data to the store buffer / queue. Then this instruction can retire.
Probing the load buffers finds that this store doesn't overlap with the pending load, so it can commit to L1d. (If it had overlapped, you couldn't commit it until a MESI RFO completed anyway, and fast restart would forward the incoming data to the load buffer. So it might not be too complicated to handle that case without even probing on every store, but let's only look at the separate-cache-line case where we can get LoadStore reordering)
Committing to L1d = becoming globally visible. This can happen while the earlier load is still waiting for the cache line to arrive.
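The sequence above can be replayed in a toy model to see how both loads end up observing 1. This is a deterministic single-threaded simulation, not a real litmus test: the names toy_core, commit_store, complete_load, and run_litmus are hypothetical, and the only thing modeled is the order in which stores commit and load data arrives.

```c
/* Toy model of the in-order pipeline sequence above; all names are hypothetical. */
enum { ADDR_X = 0, ADDR_Y = 1 };

struct toy_core {
    int load_addr;   /* address recorded in the load buffer */
    int store_addr;  /* address written by the later store */
    int reg;         /* load destination; -1 while the data is outstanding */
};

/* The store leaves the store buffer and becomes globally visible. */
static void commit_store(int mem[2], const struct toy_core *c)
{
    mem[c->store_addr] = 1;
}

/* The cache line arrives and the load buffer fills the register. */
static void complete_load(const int mem[2], struct toy_core *c)
{
    c->reg = mem[c->load_addr];
}

/*
 * Runs "temp0 = x; y = 1" on CPU 0 and "temp1 = y; x = 1" on CPU 1.
 * loads_miss != 0: both loads miss in cache, so both stores commit
 *                  before either load's data arrives.
 * loads_miss == 0: both loads hit and complete before the stores commit.
 */
void run_litmus(int loads_miss, int *temp0, int *temp1)
{
    int mem[2] = { 0, 0 };
    struct toy_core cpu0 = { ADDR_X, ADDR_Y, -1 };
    struct toy_core cpu1 = { ADDR_Y, ADDR_X, -1 };

    if (loads_miss) {
        commit_store(mem, &cpu0);
        commit_store(mem, &cpu1);
        complete_load(mem, &cpu0);
        complete_load(mem, &cpu1);
    } else {
        complete_load(mem, &cpu0);
        complete_load(mem, &cpu1);
        commit_store(mem, &cpu0);
        commit_store(mem, &cpu1);
    }

    *temp0 = cpu0.reg;
    *temp1 = cpu1.reg;
}
```

With loads_miss set, the model yields temp0 == temp1 == 1, the outcome from the question. With both loads hitting, it yields temp0 == temp1 == 0.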
For OoO CPUs, you'd need some way to tie load completion back into the OoO core for instructions waiting on the load result. I guess that's possible, but it means that the architectural/retirement value of a register might not be stored anywhere in the core. Pipeline flushes and other rollbacks from mis-speculation would have to hang on to that association between an incoming load and a physical and architectural register. (Not flushing store buffers on pipeline rollbacks is already a thing that CPUs have to do, though. Retired but not yet committed stores sitting in the store buffer have no way to be rolled back.)
That could be a good design idea for uarches with a small OoO window that's too small to come close to hiding a cache miss. (Which to be fair, is every high-performance OoO exec CPU: memory latency is usually too high to fully hide.)
We have experimental evidence of LoadStore reordering on an OoO ARM: section 7.1 of https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf shows non-zero counts for "load buffering" on Tegra 2, which is based on the out-of-order Cortex-A9 uarch. I didn't look up all the others, but I did rewrite the answer to suggest that this is the likely mechanism for out-of-order CPUs, too. I don't know for sure if that's the case, though.

How to debug an aarch64 translation fault?

I am writing a simple kernel in armv8 (aarch64).
MMU config:
48 VA bits (T1SZ=64-48=16)
4K page size
All physical RAM flat mapped into kernel virtual memory (on TTBR1_EL1)
(MMU is active with TTBR0_EL1=0, so I'm only using addresses in 0xffff<addr>, all flat-mapped into physical memory)
I'm mapping a new address space (starting at 1<<40) to some free physical region. When I try to access address 1<<40, I get an exception (of type "EL1 using SP1, synchronous"):
ESR_EL1=0x96000044
FAR_EL1=0xffff010000000000
Inspecting other registers, I have:
TTBR1_EL1=0x82000000
TTBR1_EL1[2]=0x0000000082003003
So, based on ARM Architecture Reference Manual for ARMv8 (ARMv8-A profile):
ESR (exception syndrome register) translates into: Exception Class=100101 (Data abort without a change in exception level) on pages D7-1933 sq. ; WnR=1 (faulting instruction is a write) ; DFSC=0b000100 (translation fault at level 0) on page D7-1958 ;
FAR_EL1 is the faulting address ; it indicates TTBR1_EL1 is used (since high bits are all 1). The VA top 9 bits are 0b000000010, which indicate that entry 2 is used in the table ;
Entry 2 in the table indicates a next-level table (low bits 0b11) at physical address 0x82003000.
So, translation fails at level 0, where it should not.
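The manual decoding above can be double-checked with a few bit-field helpers. This is a minimal sketch assuming the configuration stated in the question (4K granule, 48-bit VA); the helper names are made up for illustration.

```c
#include <stdint.h>

/* ESR_ELx fields for a data abort: EC in bits [31:26], WnR in bit 6, DFSC in bits [5:0]. */
static inline unsigned esr_ec(uint64_t esr)   { return (unsigned)((esr >> 26) & 0x3f); }
static inline unsigned esr_wnr(uint64_t esr)  { return (unsigned)((esr >> 6) & 0x1); }
static inline unsigned esr_dfsc(uint64_t esr) { return (unsigned)(esr & 0x3f); }

/* Level-0 table index for a 4K granule and 48-bit VA: VA bits [47:39]. */
static inline unsigned l0_index(uint64_t va)
{
    return (unsigned)((va >> 39) & 0x1ff);
}

/* A descriptor with low bits 0b11 at levels 0-2 points to a next-level table. */
static inline int desc_is_table(uint64_t desc)
{
    return (desc & 0x3) == 0x3;
}

/* Next-level table address: descriptor bits [47:12] for a 4K granule. */
static inline uint64_t desc_next_table(uint64_t desc)
{
    return desc & UINT64_C(0x0000fffffffff000);
}
```

Fed with the values from the question, these helpers give EC 0x25, WnR 1, DFSC 0x04, level-0 index 2, and a next-level table at 0x82003000, which agrees with the decoding by hand.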
My question is: am I doing something wrong? Am I missing some info that could lead to the translation fault? And, more generally, how to debug a translation fault ?
Update:
Everything works when I write to tables before enabling the MMU.
Whenever I write to tables AFTER enabling the MMU (via flat-mapped table region), mapping never works. I wonder why this happens.
I also tried manually writing to the selected tables (to remove any side effect from my mmapping function): same result (when writes are done before MMU is on, it works; after, it fails).
I tried doing tlbi and dsb sy instructions, followed by isb, without effect. Only one CPU is running at this time so caching should not be a problem - write instructions and MMU talk to the same caches (but I will test it next).
I overlooked caching issues within a single core. The problem was that, after turning the MMU on, the CPU and table walk unit didn't have the same view of memory. ARMv8 Cortex-A Programming Guide states that cache has to be cleaned/invalidated to point of unification (same view for a single core) after modifying tables.
Two possibilities can explain this behavior (I don't fully understand how caches work yet):
First possibility: the MMU does not have the required address in its internal walk cache.
In this case, when updating regular data and making it available to other core's L1, the dsb instruction simply waits for all cores to have a synchronized state (thanks to coherency network): other cores will know that the line has to be updated, and when they try to access it, it gets updated to L2 or migrated from the previous core's L1 to their L1.
This does not happen with the MMU (no coherency participation), so it still sees the old value in L2.
However, if this were the case, the same thing should happen before the MMU is turned on (because caching is activated way before), except if all memory is considered L1-non-cacheable before MMU is activated (which is possible, I'll have to double check that).
A minimal way of fixing the problem may be to change caching policies for table pages, but the cache maintenance is still necessary to clear possible old values from the MMU.
Second possibility: in all cases tested, the MMU already has the faulting address in its internal walk cache, which is not coherent with data L1 or L2.
In that case, only an explicit invalidate can eject the old line from the MMU cache. Before the MMU is turned on, the cache contains nothing and never gets the old value (0), only the new one.
I still think that case is unlikely because I tested many cases, and sometimes the offset between previously mapped memory (for example, entry 0 in the level 1 table) and newly mapped memory (for example, entry 128 in the same level 1 table) was greater than the cache line size (in this case, 1024 bytes, which is more than any cache line size).
So, I'm still not sure what exactly causes the problem, but cleaning/invalidating all the updated addresses works.

Cost of a page fault trap

I have an application which periodically (every 1 or 2 seconds) takes checkpoints by forking itself. So a checkpoint is a fork of the original process which just stays idle until it is asked to start when some error occurs in the original process.
Now my question is how costly the copy-on-write mechanism of fork is. How much is the cost of the page fault trap that will occur whenever the original process writes to a memory page (for the first time after taking a checkpoint, that is), given that the copy-on-write mechanism will make sure the original process gets a different physical page than the checkpoint?
In my opinion, the page fault trap overhead could be quite high: an interrupt occurs, we go from user space to kernel space, and then back from the kernel to user space. How many CPU cycles can I lose to such a page fault trap? Assume that the RAM is big enough and we never need to swap to the hard disk.
I know that it's difficult to imagine a checkpointing scheme more efficient than this, so you could ask why I'm worrying about page fault trap overhead, but I'm asking just to have an idea of how much this scheme will cost.
You can do the rough math for an educated guess yourself. Assuming no disk access (~10 billion cycles), you have to account for
160 cycles for the trap and returning (approximately, on x86_64)
validity checks, quota, accounting, and whatnot (unknown, probably a few hundred to a thousand cycles)
aligned memcpy of 4096 bytes, something around 500-800 cycles
TLB invalidation (adds 10-100 cycles on first access)
either eviction of other cached data or one guaranteed cache miss (80-400 cycles) depending on the implementation of the memcpy. It matters a lot on your access pattern whether one or the other is better.
So all in all, we're talking of something around 2000 cycles, with some of the effects (e.g. TLB and cache effects) being spread out and not immediately visible. Omondi and Sedukhin reported 1700 cycles on P-III back in 2003, which is consistent with this estimate.
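The rough math above can be written down as a small calculation. This is only a sketch of the arithmetic: the struct and function names are hypothetical, and "a few hundred" cycles for the validity checks is taken as 300, which is an assumption rather than a figure from the estimate.

```c
#include <stddef.h>

struct cycle_range {
    unsigned lo;
    unsigned hi;
};

/* Per-fault components from the list above, in cycles. */
static const struct cycle_range cow_fault_parts[] = {
    { 160,  160 },  /* trap and return */
    { 300, 1000 },  /* checks, quota, accounting (300 is an assumed lower bound) */
    { 500,  800 },  /* aligned memcpy of 4096 bytes */
    {  10,  100 },  /* TLB invalidation */
    {  80,  400 },  /* cache eviction or one cache miss */
};

/* Sums the lower and upper bounds of all components. */
struct cycle_range cow_fault_cycles(void)
{
    struct cycle_range total = { 0, 0 };
    size_t n = sizeof cow_fault_parts / sizeof cow_fault_parts[0];

    for (size_t i = 0; i < n; i++) {
        total.lo += cow_fault_parts[i].lo;
        total.hi += cow_fault_parts[i].hi;
    }
    return total;
}

/* Seconds spent in copy-on-write faults if `dirty_pages` pages are written
 * for the first time after a checkpoint. */
double cow_overhead_seconds(size_t dirty_pages, double cycles_per_fault, double cpu_hz)
{
    return (double)dirty_pages * cycles_per_fault / cpu_hz;
}
```

Under these assumptions the range is 1050 to 2460 cycles per fault, which brackets the "around 2000" figure. As an illustration, 10,000 pages dirtied between checkpoints at 2000 cycles each on a 3 GHz core would cost about 6.7 ms per checkpoint interval.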
Note that if the page has never been written to before, things are slightly different according to a comment by L. Torvalds back in 2000. A copy-on-write miss on a zero page pulls another zero page from the pool and doesn't copy zeroes. That's pretty much a guaranteed cache miss too, though.
