Why Do Page Faults and Unrecoverable Errors Need to be Unmaskable?

I'm looking for a quick clarification on why unrecoverable errors and page faults must be non-maskable interrupts. What happens when they aren't?

Interrupts and exceptions are very different kinds of events.
An interrupt is an event external to the CPU that arrives at the processor asynchronously (its moment of arrival does not depend on the currently executing program).
An exception is an event internal to the CPU that occurs as a side effect of executing an instruction.
Think of the processor as a complex, unstoppable automaton with well-defined, strictly specified behavior. It continuously fetches, decodes, and executes instructions, one by one. As it executes each instruction, it applies the result to the automaton's state (registers and memory) according to the instruction's type. It never pauses. The only way to redirect this continuous instruction crunching is with jumps and calls.
This automaton-like model, backed by well-defined and strictly specified instruction behavior, makes the processor extremely predictable and convenient to program, for compilers and software engineers alike. Looking at an assembly listing, you can say precisely what the processor will do when it executes that program. Under some circumstances, however, the execution of an instruction falls outside this well-defined model, and then the CPU literally does not know what to do next or how to react. For example, the program tries to divide by zero. What reaction do you expect? What value should the CPU place into the target register as the result of the division? How can it report to the program that something went wrong? Now imagine another case: the program jumps to a virtual address that has no physical address mapped to it. How should the CPU proceed with its unstoppable fetch-decode-execute job? From where should it take the next instruction? Which instruction should it execute? Or should it simply hang? Within the normal model there is no way out of such states.
An exception is the tool that lets the CPU get out of such situations gracefully and resume its unstoppable movement. At the same time, it is the tool for reporting the encountered error to the operating system and asking for help in handling it. If you could turn exceptions off, you would take that tool away from the CPU and put all of the above questions back on the table. CPU designers have no good answers to them and do not want to face them, so they make exceptions unmaskable.
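A small user-space experiment makes the point concrete (a sketch assuming Linux/POSIX signal semantics, not something taken from the answer above): a process can ask the kernel to block SIGSEGV, but the page-fault exception still fires in hardware, the kernel still has to act on it, and on Linux it force-delivers the signal and kills the process anyway.

/* Sketch (assumes Linux): blocking SIGSEGV does not mask the underlying
 * CPU exception. The page fault still occurs, the kernel still runs its
 * handler, and the process dies despite the "mask". */
#include <signal.h>
#include <stdio.h>

int main(void)
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGSEGV);
    sigprocmask(SIG_BLOCK, &set, NULL);   /* block the resulting signal */

    printf("dereferencing NULL...\n");
    fflush(stdout);

    volatile int *p = NULL;
    *p = 1;   /* page fault -> CPU exception -> kernel -> process dies */

    printf("never reached\n");
    return 0;
}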

How do you intercept the address of an instruction that is writing to a segment of memory?

Imagine we have an ordinary instruction such as this one:
mov [eax], ebx
where eax contains some address that we would like to write to.
The idea is to write a C program that tells you the address of the instruction, given that we already know the address it is going to write to.
The real question:
Write a C program using the free Sony PSPSDK that would accomplish the same thing.
The PSP uses MIPS III / IV and the instruction would look something like
sw $a0, 0($t0)
# which literally spells out: store register a0 at the address t0 + 0 bytes, where t0 would contain something like 0x08800000
Disclaimer: it is still useful to know how to do this on Windows, so if somebody only knows how to do this on Windows or even OS X, that would still be appreciated, as it could provide relevant information on similar techniques for accomplishing this particular task.
Intercepting an instruction that writes to a particular address is not a normal activity in programs.
It is a feature provided by some debuggers. There are at least three ways debuggers may be able to do this:
A debugger can examine the program code and find where a particular instruction writes to a particular address. This is actually a hugely complicated activity that requires interpreting the instructions. Often, a debugger cannot do it completely, as doing so is in general equivalent to completely interpreting and executing the program the same way the computer processor does, and that is very slow to do in software. Instead, the debugger may plan part of program execution and put in a breakpoint at a spot where it is unable to easily continue, such as at a branch instruction that depends on a value the debugger is not prepared to compute. A breakpoint is a special instruction that interrupts program execution and, in this case, results in the operating system transferring control to the debugger. At that time, the debugger removes the breakpoint, requests that the instruction be single-stepped (that the processor execute the single instruction and then interrupt program execution immediately), examines the result, and continues.
A debugger can mark the page of memory containing the desired address as no-access. Then, whenever the program accesses that memory, the hardware will interrupt program execution, and the operating system will transfer control to the debugger. The debugger examines the instruction that caused the interruption. If the instruction is accessing the target address, the debugger acts on that. If it is not, the debugger changes the memory protection to allow access, requests that the instruction be single-stepped, changes the memory protection to disallow access, and resumes the program to wait for the next interruption. (Instead of single-stepping the instruction, the debugger might just emulate it, since that might avoid changing the memory protection twice, which can be expensive.)
Some computer processor models have features to support this sort of debugging feature. The debugger can request that a portion of memory be monitored, so that the hardware interrupts program execution when a particular address is accessed, instead of when any part of a whole memory page is accessed.
I cannot speak to the Sony platform you are using. You would have to check its documentation or ask others regarding the availability of such features. Since this is a feature most often used by debuggers, investigating the documentation regarding debugging could be a way to find out whether the system supports such a feature.
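To make the second (page-protection) approach concrete, here is a hedged Linux/x86-64 sketch, not PSP code: it marks a page read-only, catches the resulting SIGSEGV, and prints both the accessed data address and the address of the instruction that performed the write. REG_RIP is x86-64 specific; on MIPS you would read the EPC from the signal context instead.

#define _GNU_SOURCE       /* for REG_RIP in <sys/ucontext.h> */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <ucontext.h>

#define PAGE 4096         /* assumes 4 KiB pages */

static char *watched;

static void handler(int sig, siginfo_t *info, void *ctx)
{
    ucontext_t *uc = ctx;
    /* si_addr = data address accessed; REG_RIP = faulting instruction
     * (x86-64 only). fprintf is not strictly async-signal-safe, but this
     * is a demo handling a synchronous fault. */
    fprintf(stderr, "write to %p by instruction at %p\n",
            info->si_addr,
            (void *)(uintptr_t)uc->uc_mcontext.gregs[REG_RIP]);
    /* Allow the write to complete when the faulting instruction restarts.
     * A real tool would single-step it and then re-protect the page. */
    mprotect(watched, PAGE, PROT_READ | PROT_WRITE);
}

int main(void)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    watched = mmap(NULL, PAGE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (watched == MAP_FAILED) { perror("mmap"); return 1; }

    mprotect(watched, PAGE, PROT_READ);   /* make writes fault */
    watched[0] = 42;                      /* triggers the handler once */
    printf("value after write: %d\n", watched[0]);
    return 0;
}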

Performance of system()

For the C function system(), would calling it affect the hardware counters if you are trying to see how the command you ran performed?
For example, let's say I'm using the Performance API (PAPI) and the program is a precompiled matrix multiplication application:
PAPI_start_counters();
system("./matmul");
PAPI_read_counters();
//Print out values
PAPI_stop_counters();
I am obviously missing a bit, but what I am trying to find out is whether it is possible, through the use of said counters, to get the performance of a program I'm running.
From my tests I would get wild numbers like the ones below. They are obviously wrong; I just want to find out why.
Total Cycles =========== 140733358872510
Instructions Completed =========== 4203968
Floating Point Instructions =========== 0
Floating Point Operations =========== 4196867
Loads =========== 140733358872804
Stores =========== 4204037
Branches Taken =========== 15774436
system() is a very slow function in general. On Linux, it spawns /bin/sh (forking and executing a full shell process), which parses your command, and spawns the second program. Loading these two programs requires loading the code to memory, initializing all their libraries, executing startup code, etc. Only then will the program code actually start executing.
Because of the unpredictability of disk access and Linux process scheduling, timing system() calls has a very high inherent variability. Therefore, you won't get accurate results even if you use a high-performance counter.
The better solution would be to compile the target program as a library instead. Load it before initializing your counters, then just execute the main function from the library. That way, all the code executes in your process, and you have negligible startup time. Your performance numbers will be much more precise this way.
Do you have access to the code of matmul? If so, it's much more precise to instrument and measure only the code you're interested in. That means you wrap counters only around the instructions (or C statements) that you actually want to measure.
For more information see:
Related discussion here
Intel® Performance Counter Monitor here
Performance measurements with x86 RDTSC instruction here
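As a sketch of the instrumentation approach described above (the matmul() prototype, matrix size, and event pair are assumptions, and your hardware may support a different set of preset events), linking the kernel into your own process lets you wrap the counters around just the multiplication call:

/* Hypothetical sketch: matmul() is assumed to be compiled into (or linked
 * with) this program, so only the multiplication itself is measured. */
#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

void matmul(const double *a, const double *b, double *c, int n);  /* assumed */

int main(void)
{
    int events[2] = { PAPI_TOT_CYC, PAPI_FP_OPS };
    long long values[2];
    int n = 512;
    double *a = calloc((size_t)n * n, sizeof *a);
    double *b = calloc((size_t)n * n, sizeof *b);
    double *c = calloc((size_t)n * n, sizeof *c);

    if (PAPI_start_counters(events, 2) != PAPI_OK) {
        fprintf(stderr, "PAPI_start_counters failed\n");
        return 1;
    }
    matmul(a, b, c, n);                    /* only this call is measured */
    PAPI_stop_counters(values, 2);

    printf("cycles: %lld   fp ops: %lld\n", values[0], values[1]);
    free(a); free(b); free(c);
    return 0;
}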
As stated above, measuring using PAPI to wrap system() invocations carries way too much process overhead to give you any idea of how fast your math code is actually running.
The numbers you are getting are odd, but not necessarily wrong. The huge disparity between the instructions completed and the cycles probably indicates that the executable "matmul" is doing a lot of waiting for external operations (e.g. disk I/O) to complete. I do not know the specifics of the "FP Instructions" and "FP Operations" counters, but if PAPI displays those values differently, it has a reason.
What is interesting is that the loads and cycles figures are obviously related, as are the instructions, FP operations, and stores figures.
I would have to know about the internals of "matmul" in order to give you a better description.

How much is the cost of an interrupt in x86_64

How much is the cost of an interrupt in x86_64, for example the interrupt due to a page fault? How many cycles are required for the kernel to service the interrupt and then go back to user space? I am interested in knowing only the cost of the interrupt itself and of scheduling the interrupted user-level thread back, so we can neglect what is going on inside the interrupt handler here.
For ordinary interrupts (a hardware IRQ or an ordinary exception like division by zero) it is probably possible to give an upper bound.
Time to process a page fault is especially tricky to assess even when disk IO is not involved because the CPU has to walk the page tables, which introduces many variables. Page faults occur not only because pages are not present, but also because of access violations (e.g., trying to write to a read-only page). In any case, if the page mapping is not already present in the TLB (missing mappings are never cached), the CPU will first have to walk multiple levels of page tables before even invoking the page fault handler. The time to access page table entries (in case the address is not already cached in the TLB) is again dependent on whether some entries are already in data caches.
So the time from accessing a linear address to PF handler being invoked might be anything from ~200 cycles (best case; TLB entry present, exception due to wrong access type -- just ring switch) to ~2000 cycles (no TLB entry present, no page table entries in data cache). This is just the time between 1) executing a user-mode instruction that faults and 2) executing the first instruction of the page fault handler.
[Side-comment: given that, I wonder whether it's possible to build hard real-time systems that use paging.]
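If a rough empirical number is more useful than a theoretical bound, one hedged way to observe soft-fault cost from user space on Linux is to touch freshly mapped anonymous pages and divide the elapsed time by the page count. Note that this measures the whole round trip (fault entry, handler, page allocation, return), not the bare entry/exit cost the question asks about.

/* Rough sketch (assumes Linux): time minor page faults by touching freshly
 * mapped anonymous pages. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define PAGES 10000
#define PAGE  4096

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    char *p = mmap(NULL, (size_t)PAGES * PAGE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long long t0 = now_ns();
    for (int i = 0; i < PAGES; i++)
        p[(size_t)i * PAGE] = 1;        /* first touch -> minor page fault */
    long long t1 = now_ns();

    printf("~%lld ns per touched page (fault + handler + return)\n",
           (t1 - t0) / PAGES);
    return 0;
}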
This is a complex question and cannot be answered easily.
You have to save all the (used) registers (scalar, SSE, FPU state, AVX, etc.) that the interrupt handler is going to touch.
You may have to switch the virtual address space context.
When you are done, you have to restore the saved context.
And all the while, cache/RAM load effects change the cycle count needed.
(NB: interrupt handlers should not be paged out, but I have no idea whether Linux supports that, or if it is at all possible.)

Measuring CPU clocks consumed by a process

I have written a program in C. It's a program created as the result of research. I want to compute the exact number of CPU cycles the program consumes. The exact number of cycles.
Any idea how can I find that?
The valgrind tool cachegrind (valgrind --tool=cachegrind) will give you a detailed output including the number of instructions executed, cache misses and branch prediction misses. These can be accounted down to individual lines of assembler, so in principle (with knowledge of your exact architecture) you could derive precise cycle counts from this output.
Know that it will change from execution to execution, due to cache effects.
The documentation for the cachegrind tool is here.
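Typical usage looks something like this (the program name is a placeholder); cg_annotate then breaks the counts down per function and per source line:

valgrind --tool=cachegrind ./your_program
cg_annotate cachegrind.out.<pid>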
No, you can't. The concept of a 'CPU cycle' is not well defined. Modern chips can run at multiple clock rates, and different parts of them can be doing different things at different times.
The question of 'how many total pipeline steps' might in some cases be meaningful, but there is not likely to be a way to get it.
Try OProfile. It uses various hardware counters on the CPU to measure the number of instructions executed and how many cycles have passed. You can see an example of its use in the article, Memory part 7: Memory performance tools.
I am not entirely sure that I know exactly what you're trying to do, but what can be done on modern x86 processors is to read the time stamp counter (TSC) before and after the block of code you're interested in. On the assembly level, this is done using the RDTSC instruction, which gives you the value of the TSC in the edx:eax register pair.
Note however that there are certain caveats to this approach, e.g. if your process starts out on CPU0 and ends up on CPU1, the result you get from RDTSC will refer to the specific processor core that executed the instruction and hence may not be comparable. (There's also the lack of instruction serialisation with RDTSC, but in this context here, I don't think that's so much of an issue.)
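As a hedged illustration (assuming a GCC or Clang toolchain on x86/x86-64), the counter can be read from C with the __rdtsc() intrinsic; the caveats above about core migration and the lack of serialisation still apply.

/* Sketch: read the TSC around a block of code (GCC/Clang, x86/x86-64).
 * Core migration, out-of-order execution, and frequency scaling can all
 * skew the result. */
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    unsigned long long start = __rdtsc();

    volatile double x = 0.0;
    for (int i = 0; i < 1000000; i++)   /* the code under measurement */
        x += i * 0.5;

    unsigned long long end = __rdtsc();
    printf("elapsed: %llu reference cycles\n", end - start);
    return 0;
}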
Sorry, but no, at least not for most practical purposes -- it's simply not possible with most normal OSes. Just for example, quite a few OSes don't do a full context switch to handle an interrupt, so the time spent servicing an interrupt can and often will appear to be time spent in whatever process was executing when the interrupt occurred.
The "not for practical purposes" would indicate the possibility of running your program under a cycle accurate simulator. These are available, but mostly for CPUs used primarily in real-time embedded systems, NOT for anything like a full-blown PC. Worse, they (generally) aren't for running anything like a full-blown OS, but for code that runs on the "bare metal."
In theory, you might be able to do something with a virtual machine running something like Windows or Linux -- but I don't know of any existing virtual machine that attempts to, and it would be decidedly non-trivial and probably have pretty serious consequences in performance as well (to put it mildly).

How does sched_setaffinity() work?

I am trying to understand how the Linux syscall sched_setaffinity() works. This is a follow-on from my question here.
I have this guide, which explains how to use the syscall and has a pretty neat (working!) example.
So I downloaded the Linux 2.6.27.19 kernel sources.
I did a 'grep' for lines containing that syscall, and I got 91 results. Not promising.
Ultimately, I'm trying to understand how the kernel is able to set the instruction pointer for a specific core (or processor.)
I am familiar with how single-core-single-thread programs work. One might issue a 'jmp foo' instruction, and this basically sets the IP to the memory address of the 'foo' label. But when one has multiple cores, one has to say "fetch the next instruction at memory address foo, and set the instruction pointer for core number 2 to begin execution there."
Where, in the assembly code, are we specifying which core performs that operation?
Back to the kernel code: what is important here? The file 'kernel/sched.c' has a function called sched_setaffinity(), but it returns type "long", which is inconsistent with its manual page. So what is important here? Which of these modules shows the assembly instructions issued? What module is reading the 'task_struct', looking at the 'cpus_allowed' member, and then translating that into an instruction? (I've also thumbed through the glibc source, but I think it just makes a call to the kernel code to accomplish this task.)
sched_setaffinity() simply tells the scheduler which CPUs that process/thread is allowed to run on, and then calls for a reschedule.
The scheduler actually runs on each one of the CPUs, so it gets a chance to decide what task to execute next on that particular CPU.
If you're interested in how you can actually call some code on other CPUs, I suggest you take a look at smp_call_function_single(). In case we want to call something on another CPU, this calls generic_exec_single(). The latter simply adds the function to the target CPU's call queue and forces a reschedule through some IPI stuff (if the queue was empty).
Bottom line is: there is no actual SMP variant of the jmp instruction. Instead, code running on other CPUs cooperates in order to accomplish the task.
I think the thing you are not understanding is that the kernel is running on all the CPU cores. At every timer interrupt (~1000 per second), the scheduler runs on each CPU and chooses a process to run. There is no one CPU that somehow tells the others to start running a process. sched_setaffinity() works by just setting flags on the process. The scheduler reads these flags and will not run that process on its CPU if it is set not to.
Where, in the assembly code, are we specifying which core performs that operation?
There is no assembly involved here. Every task (thread) is assigned to a single CPU (or core, in your terms) at a time. To stop running on a given CPU and resume on another, the task has to "migrate". When a task migrates from one CPU to another, the scheduler picks whichever of the CPUs allowed by sched_setaffinity() is more idle.
There are no magic assembly instructions issued. The kernel has a more low-level view of the hardware: each CPU is a separate object, which is very different from how it looks to user-space processes (in user space, CPUs are almost invisible).
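For completeness, here is a minimal user-space sketch (assumes Linux with glibc): the call just updates the allowed-CPU mask for the calling process, and the per-CPU scheduler does the rest, as described above.

/* Minimal sketch (Linux, glibc): restrict the calling process to CPU 0.
 * This only sets the affinity mask; the scheduler running on each CPU
 * decides when and where the task actually executes. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);                      /* allow CPU 0 only */

    if (sched_setaffinity(0, sizeof mask, &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* sched_getcpu() reports which CPU this thread is currently on. */
    printf("now running on CPU %d\n", sched_getcpu());
    return 0;
}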
Check this out: B Operating System Programming Guidelines

Resources