From Understanding the Linux Kernel, 3rd edition, chapter 10.4.4, which discusses page fault exceptions raised while the kernel is accessing user space memory: the Page Fault handler do_page_fault() executes the following statements:
if ((fixup = search_exception_tables(regs->eip))) {
regs->eip = fixup->fixup;
return 1;
}
The regs->eip field contains the value of the eip register saved on
the Kernel Mode stack when the exception occurred. If the value in the
register (the instruction pointer) is in an exception table,
do_page_fault( ) replaces the saved value with the address found in
the entry returned by search_exception_tables( ). Then the Page Fault
handler terminates and the interrupted program resumes with execution
of the fixup code.
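For context, the entries searched by search_exception_tables() are emitted right next to the instructions that access user space. Below is a simplified sketch modeled on the 2.6-era i386 __get_user helpers (illustrative, not the exact kernel source):

    /* Each entry pairs the address of an instruction that may fault on a
     * user-space access with the address of its fixup code: */
    struct exception_table_entry {
        unsigned long insn;   /* address that may fault (label 1 below)  */
        unsigned long fixup;  /* where do_page_fault() resumes (label 3) */
    };

    /* Simplified from the kernel's __get_user_asm(); -14 is -EFAULT. */
    #define get_user_sketch(x, addr, err)                           \
        asm volatile("1:    movl %2, %1\n"                          \
                     "2:\n"                                         \
                     ".section .fixup, \"ax\"\n"                    \
                     "3:    movl %3, %0\n"    /* err = -EFAULT */   \
                     "      xorl %1, %1\n"    /* x = 0         */   \
                     "      jmp 2b\n"                               \
                     ".previous\n"                                  \
                     ".section __ex_table, \"a\"\n"                 \
                     "      .align 4\n"                             \
                     "      .long 1b, 3b\n"   /* insn -> fixup */   \
                     ".previous"                                    \
                     : "=r" (err), "=r" (x)                         \
                     : "m" (*(addr)), "i" (-14), "0" (err))

If the movl at label 1 faults, search_exception_tables(regs->eip) matches the 1b entry and the saved eip is rewritten to 3b, which sets the error code, zeroes the destination, and skips back past the access.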
Well understood, except for one crucial fact: memory operations can be buffered/cached, which would mean that at the moment of the page fault exception, the instruction pointer could contain the address of another instruction with no relation to the exception, because the instruction that caused it was executed earlier by the CPU and its memory access was held back until this moment (there is a reference to that here, in the last paragraphs).
How can the Linux kernel execute the above code with no side effects? How can it be sure that at the time of the page fault exception, the instruction pointer register contained the address of the memory-operation instruction (whose access could have been buffered and performed later) that accessed the illegal address?
EDIT #1:
I believe the same guide has a hint in chapter 2.4.7-
the cache unit is inserted between the paging unit and the main memory
Maybe it implies that address translation and checking is always done prior to (or during) caching (at least on the x86 architecture, on which the guide is based), which would explain my issue: the address check is done in the MMU circuitry at the moment of instruction execution. Unfortunately, I could not find a definitive 'Yes' in the guide or in my searches online.
EDIT #2:
I found another source on SuperUser that strengthens my speculation from EDIT #1:
Permissions still need to be checked before the access can be
committed
So it seems (without confirmation from a formal source) that upon a memory access, the address goes through the MMU circuitry for translation before or during caching, which means the address check happens at instruction execution time, and this is the moment when a page fault can be raised. The cache latency of a main memory access is therefore irrelevant to page fault timing, and when a page fault occurs, the instruction pointer register indeed contains the address of the faulting instruction that tried to access the invalid address. I'll keep searching for a source that confirms this, or alternatively one that contradicts it and offers a proper explanation.
Related
Honestly, I am really confused by this particular virtual memory concept.
Q1) When a page fault occurs, does the processor first finish executing the current instruction and then push the IP register contents (the address of the next instruction) onto the stack? Or does it abort the instruction being executed and push the contents of the instruction pointer register onto the stack?
Q2) If the second case is true, how does it resume the instruction that was aborted? When it resumes, the stack holds an instruction pointer value that is nothing but the address of the next instruction, so it would never resume the instruction where the page fault occurred.
What I think
I think the second case sounds wrong. The confusion arose while I was reading Operating System Principles by Silberschatz and Galvin, in which they write:
when a page fault occurs, we will have to bring in the desired page, correct page table and restart the instruction.
But the instruction pointer always points to the address of the next instruction, so does this mean, according to what the book is trying to convey, that we decrement the value of IP just to restart the execution of the instruction where the page fault occurred?
In the Intel System Programming guide, chapter 6.5, it says
Faults — A fault is an exception that can generally be corrected and that, once corrected, allows the program
to be restarted with no loss of continuity. When a fault is reported, the processor restores the machine state to
the state prior to the beginning of execution of the faulting instruction. The return address (saved contents of
the CS and EIP registers) for the fault handler points to the faulting instruction, rather than to the instruction
following the faulting instruction.
A page fault is classified as a fault (no surprises there), so when a page fault happens you're in the state "before it ever happened". Well, not really, because you're in the fault handler (so EIP and ESP are definitely different, and CR2 contains the faulting address), but when you return it'll be the state before the fault ever happened, only with the changes made by the handler (so: put the page there, or kill the process).
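This restart behavior can be observed from user space. Here is a minimal sketch (assumes Linux; calling mprotect from a signal handler is not strictly async-signal-safe, but it illustrates the point): the faulting store is transparently re-executed after the handler "brings the page in", just like a kernel page fault handler would.

    #include <signal.h>
    #include <stdio.h>
    #include <sys/mman.h>

    static char *page;

    static void handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)si; (void)ctx;
        /* si->si_addr holds the faulting data address (analogous to CR2).
         * Make the page writable, i.e. "bring in the desired page". */
        mprotect(page, 4096, PROT_READ | PROT_WRITE);
        /* Returning from a fault handler restarts the faulting instruction. */
    }

    int main(void)
    {
        struct sigaction sa = { 0 };
        sa.sa_sigaction = handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        page = mmap(NULL, 4096, PROT_READ,          /* read-only: writes fault */
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        page[0] = 'x';  /* faults once; after the handler returns, the same
                           store executes again and succeeds */
        printf("wrote: %c\n", page[0]);
        return 0;
    }

The print succeeds precisely because the saved instruction pointer pointed at the faulting store, not at the instruction after it.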
I am writing a simple kernel in armv8 (aarch64).
MMU config:
48 VA bits (T1SZ=64-48=16)
4K page size
All physical RAM flat mapped into kernel virtual memory (on TTBR1_EL1)
(MMU is active with TTBR0_EL1=0, so I'm only using addresses of the form 0xffff<addr>, all flat-mapped to physical memory)
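For reference, a rough sketch of how that configuration maps onto TCR_EL1 bits (field offsets taken from the ARMv8 ARM; the cacheability/shareability fields a real setup must also program are omitted here):

    /* T1SZ=16 gives 48 VA bits for TTBR1; TG1=0b10 selects the 4K granule. */
    #define TCR_T0SZ(x)  ((unsigned long)(x) << 0)
    #define TCR_TG0_4K   (0UL << 14)
    #define TCR_T1SZ(x)  ((unsigned long)(x) << 16)
    #define TCR_TG1_4K   (2UL << 30)

    unsigned long tcr = TCR_T0SZ(16) | TCR_TG0_4K | TCR_T1SZ(16) | TCR_TG1_4K;
    /* then: msr tcr_el1, <reg> ; isb */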
I'm mapping a new address space (starting at 1<<40) to some free physical region. When I try to access address 1<<40, I get an exception (of type "EL1 using SP1, synchronous"):
ESR_EL1=0x96000044
FAR_EL1=0xffff010000000000
Inspecting other registers, I have:
TTBR1_EL1=0x82000000
TTBR1_EL1[2]=0x0000000082003003
So, based on ARM Architecture Reference Manual for ARMv8 (ARMv8-A profile):
ESR (exception syndrome register) translates into: Exception Class=0b100101 (data abort without a change in exception level), pages D7-1933 ff.; WnR=1 (the faulting access is a write); DFSC=0b000100 (translation fault at level 0), page D7-1958;
FAR_EL1 is the faulting address; it indicates TTBR1_EL1 is used (since the high bits are all 1). The top 9 VA bits are 0b000000010, which means entry 2 of the level-0 table is used;
Entry 2 in the table indicates a next-level table (low bits 0b11) at physical address 0x82003000.
So, translation fails at level 0, where it should not.
My question is: am I doing something wrong? Am I missing some information that could explain the translation fault? And, more generally, how does one debug a translation fault?
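For repeatable debugging, here is a small host-side decoder sketch of the fields used above (bit positions from the ARMv8 ARM; a hypothetical helper, not part of my kernel):

    #include <stdint.h>
    #include <stdio.h>

    static void decode_fault(uint64_t esr, uint64_t far)
    {
        unsigned ec   = (esr >> 26) & 0x3f;  /* exception class                */
        unsigned wnr  = (esr >> 6)  & 1;     /* 1 = faulting access is a write */
        unsigned dfsc = esr & 0x3f;          /* data fault status code         */
        unsigned idx0 = (far >> 39) & 0x1ff; /* level-0 index (48-bit VA, 4K)  */
        printf("EC=0x%x WnR=%u DFSC=0x%x level-0 index=%u\n",
               ec, wnr, dfsc, idx0);
    }

    int main(void)
    {
        decode_fault(0x96000044ULL, 0xffff010000000000ULL); /* values above */
        return 0;  /* prints EC=0x25 WnR=1 DFSC=0x4 level-0 index=2 */
    }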
Update:
Everything works when I write to the tables before enabling the MMU.
Whenever I write to the tables AFTER enabling the MMU (via the flat-mapped table region), the mapping never works. I wonder why this happens.
I also tried manually writing to the selected tables (to rule out any side effect of my mapping function): same result (when the writes are done before the MMU is on, it works; after, it fails).
I tried issuing tlbi and dsb sy instructions, followed by isb, without effect. Only one CPU is running at this point, so caching should not be a problem - write instructions and the MMU table walker talk to the same caches (but I will test that next).
I had overlooked caching issues within a single core. The problem was that, after turning the MMU on, the CPU and the table walk unit didn't have the same view of memory. The ARMv8 Cortex-A Programming Guide states that the cache has to be cleaned/invalidated to the point of unification (the point at which a single core's views converge) after modifying translation tables.
Two possibilities can explain this behavior (I don't fully understand how caches work yet):
First possibility: the MMU does not have the required address in its internal walk cache.
In this case, when updating regular data and making it available to other cores' L1, the dsb instruction simply waits for all cores to reach a synchronized state (thanks to the coherency network): the other cores know the line has to be updated, and when they try to access it, it is updated in L2 or migrated from the previous core's L1 to theirs.
This does not happen with the MMU (it does not participate in coherency), so it still sees the old value in L2.
However, if this were the case, the same thing should happen before the MMU is turned on (because caching is activated well before), unless all memory is treated as non-cacheable while the MMU is off (which is possible; I'll have to double-check that).
A minimal fix might be to change the caching policy for table pages, but cache maintenance is still necessary to purge possible old values from the MMU.
Second possibility: in all cases tested, the MMU already has the faulting address in its internal walk cache, which is not coherent with data L1 or L2.
In that case, only an explicit invalidate can evict the old line from the MMU's cache. Before the MMU is turned on, that cache contains nothing, so it never holds the old value (0), only the new one.
I still think that case is unlikely, because I tested many configurations, and sometimes the offset between previously mapped memory (for example, entry 0 in the level 1 table) and newly mapped memory (for example, entry 128 in the same table) was 1024 bytes, which is larger than any cache line size.
So, I'm still not sure what exactly causes the problem, but cleaning/invalidating all the updated addresses works.
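For reference, a sketch of the maintenance sequence that worked for me (hypothetical helper, aarch64 inline asm; assumes a 64-byte line size, which should really be read from CTR_EL0; dc cvau cleans to the point of unification as the Cortex-A guide recommends, while a more conservative variant would use dc civac to the point of coherency):

    #include <stdint.h>

    /* Call after writing translation table entries in [start, end). */
    static inline void sync_tables(uint64_t start, uint64_t end)
    {
        for (uint64_t p = start & ~63ULL; p < end; p += 64)
            asm volatile("dc cvau, %0" :: "r"(p) : "memory"); /* clean to PoU */
        asm volatile("dsb ish" ::: "memory");      /* wait for the cleans       */
        asm volatile("tlbi vmalle1" ::: "memory"); /* drop stale TLB/walk cache */
        asm volatile("dsb ish" ::: "memory");      /* wait for invalidation     */
        asm volatile("isb" ::: "memory");          /* resynchronize the core    */
    }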
How can the instruction pointer register recover from a bad read or bad jump?
The kernel calls init code that in turn calls the main() program. If main() causes a stack overflow or similar, and RIP/EIP/IP fills with junk, how can the OS recover the CPU register?
The CPU has only one instruction pointer, right? So recovering from an overflow seems impossible from my point of view.
Yes, if the IP gets trashed and that causes a fault, only the bad value is known. It's unclear what you mean by "recovering from overflow". Of course, the fault handler of the OS has a well-defined address and the CPU goes there, so the IP will be well defined from then on. The OS may decide to terminate the process, or, if the program has installed a signal/exception handler, the OS will make sure that handler is called. The handler can then load the IP with an appropriate value.
When you trash the IP in user mode, a hardware fault eventually occurs, be it a page fault, an illegal opcode, or something like that. The processor then switches to supervisor/kernel mode and starts running a fault handler by setting the instruction pointer to a well-defined value.
The kernel code will then inspect the address at which the exception happened and/or the type of the exception. On finding that it was caused by any of these, the kernel will usually terminate the malfunctioning user-mode process.
If the IP gets loaded with an address from which it cannot execute, it triggers an exception. A CPU usually recognizes a number of different exception types, each identified by a different number.
When the exception occurs, the CPU switches to kernel mode. That in turn causes the CPU to load the IP with the address of a handler defined for that specific type of exception, and to load a kernel-mode stack.
Exceptions broadly divide into faults, traps, and aborts: after a fault, the instruction at the saved IP can be restarted; a trap reports after the instruction completes; an abort is a fatal error. What happens at this point depends on the type of exception.
If it's a page fault, the handler will try to load the page into memory.
For most other exceptions, the handler will try to find a user-mode handler for the specific type of exception. See the signal function in Unix.
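A user-mode handler "loading IP with an appropriate value" can be sketched with sigsetjmp/siglongjmp (hypothetical demo; jumping out of a SIGSEGV handler is only safe back into a frame that is still live):

    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>

    static sigjmp_buf recover;

    static void on_segv(int sig)
    {
        (void)sig;
        siglongjmp(recover, 1);  /* effectively reloads IP with a sane value */
    }

    int main(void)
    {
        signal(SIGSEGV, on_segv);
        if (sigsetjmp(recover, 1) == 0) {
            void (*bad)(void) = (void (*)(void))0xdeadbeef;
            bad();               /* wild jump: IP now holds junk, CPU faults */
        } else {
            puts("recovered: execution resumed at a well-defined address");
        }
        return 0;
    }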
While developing on an ARM-based project, we get data aborts at random; that is, as we exercise the system, we get a data abort exception. But the abort is not always at the same point when we check the registers (r14 or r13), even after checking the function callbacks. Is there any way to get precise information about the root cause of a data abort? I have tried the second reference below but could not pin down the point at which I trap the data abort interrupt.
Related
ARM Data Abort error exception debugging
ARM: HOW TO ANALYZE A DATA ABORT EXCEPTION
Checking the link register (r14) as described in your Keil link above will show you the instruction that triggered the data abort. From there you'll have to figure out why it triggered a data abort and how that could have happened, which is the difficult part.
In my experience what most likely happened is that you accessed an invalid pointer. It can be invalid for many reasons. Here are a few candidates:
You used the pointer before it was initialized
You used the pointer after it, or the containing memory, had been freed (and was subsequently modified when another function allocated it)
The pointer was corrupted by a stack overflow
The pointer was corrupted by other, unrelated, misbehaving code that is trampling on memory
The pointer was allocated on the stack as a local variable and then used after the allocating function had exited
The pointer has incorrect alignment for its type (for example, trying to access 0x4001 as a uint32_t)
As you can see, lots of things can be the root cause of an ARM data abort. Finding the root cause is part of what makes ARM software/firmware development so much fun! Good luck figuring out your puzzle.
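To capture that information at the moment of the abort, here is a sketch of what an ARMv7 data abort handler can extract (CP15 register encodings from the ARM ARM; the assembly stub that saves state and passes LR_abt in is assumed):

    #include <stdint.h>

    /* On a data abort, LR_abt = address of the faulting instruction + 8. */
    void data_abort_c_handler(uint32_t lr_abt)
    {
        uint32_t dfar, dfsr;
        asm volatile("mrc p15, 0, %0, c6, c0, 0" : "=r"(dfar)); /* fault address */
        asm volatile("mrc p15, 0, %0, c5, c0, 0" : "=r"(dfsr)); /* fault status  */

        uint32_t insn = lr_abt - 8;                          /* faulting PC   */
        uint32_t fs   = (dfsr & 0xf) | ((dfsr >> 6) & 0x10); /* FS[4:0]       */
        uint32_t wnr  = (dfsr >> 11) & 1;  /* 1 = faulting access is a write  */

        /* Log these via your platform's UART/printf, e.g.:
         * print("abort at %08x, addr %08x, FS=%02x, %s", insn, dfar, fs,
         *       wnr ? "write" : "read"); */
        (void)insn; (void)fs; (void)wnr;
    }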
What is the cost of an interrupt on x86_64 - for example, the interrupt due to a page fault? How many cycles are required for the kernel to service the interrupt and then go back to user space? I am interested in knowing only the cost due to the interrupt and scheduling the interrupted user-level thread back, so we can neglect what goes on inside the interrupt handler here.
For ordinary interrupts (a hardware IRQ or an ordinary exception like division by zero) it is probably possible to give an upper bound.
The time to process a page fault is especially tricky to assess, even when disk I/O is not involved, because the CPU has to walk the page tables, which introduces many variables. Page faults occur not only because pages are not present, but also because of access violations (e.g., trying to write to a read-only page). In any case, if the mapping is not already present in the TLB (the TLB never caches missing mappings), the CPU will first have to walk multiple levels of page tables before even invoking the page fault handler. The time to access the page table entries in turn depends on whether those entries are already in the data caches.
So the time from accessing a linear address to PF handler being invoked might be anything from ~200 cycles (best case; TLB entry present, exception due to wrong access type -- just ring switch) to ~2000 cycles (no TLB entry present, no page table entries in data cache). This is just the time between 1) executing a user-mode instruction that faults and 2) executing the first instruction of the page fault handler.
[Side-comment: given that, I wonder whether it's possible to build hard real-time systems that use paging.]
This is a complex question and cannot be answered easily.
You have to save all the registers (scalar, SSE, FPU state, AVX, etc.) that the interrupt handler uses.
Maybe you have to switch the virtual address space context.
When you are done, you have to restore the saved context.
And all the while, cache/RAM load effects change the cycle count needed.
(NB: interrupt handlers should not be paged out, but I have no idea whether Linux guarantees this, or whether it is at all possible)
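As a rough sanity check of those numbers, one can time minor page faults directly (hypothetical micro-benchmark, Linux/x86_64, GCC/Clang): the first touch of a fresh anonymous page takes the full fault round trip, so the result is an upper bound that still includes the handler's allocation work.

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <x86intrin.h>

    #define PAGES 4096
    #define PAGE  4096

    int main(void)
    {
        volatile char *buf = mmap(NULL, (size_t)PAGES * PAGE,
                                  PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        uint64_t total = 0;
        for (int i = 0; i < PAGES; i++) {
            uint64_t t0 = __rdtsc();
            buf[(size_t)i * PAGE] = 1;  /* first touch -> minor page fault */
            total += __rdtsc() - t0;
        }
        printf("~%llu cycles per faulting access (incl. handler work)\n",
               (unsigned long long)(total / PAGES));
        return 0;
    }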