I have been trying to understand how context switching works in the Linux kernel. It appears to me that there is a situation (explained below) in which no IRET instruction is executed after an interrupt (I am sure there is something I am missing!). I am assuming that executing IRET after the interrupt is essential, since you can't receive the same interrupt again until IRET has been executed. I am only concerned with a uniprocessor kernel running on the x86 architecture.
The situation that I think might result in the described behavior is as follows:
Process A running in kernel mode calls schedule() voluntarily (for example while trying to acquire an already locked mutex).
schedule() decides to perform a context switch to process B and hence calls context_switch()
context_switch() switches virtual memory from A to B by calling switch_mm()
context_switch() runs macro switch_to() to switch stacks and actually change the running process from A to B. Note that process A is now stuck inside switch_to() and the stack of process A looks like (stack growing downwards):
...
[mutex_lock()]
[schedule()]
[context_switch()] (Stack Top)
Process B starts running. At some later time, it receives a timer interrupt and the timer interrupt handler decides that process B needs a reschedule.
On return from timer interrupt (but before invoking IRET) preempt_schedule_irq() is invoked.
preempt_schedule_irq() calls schedule().
schedule() decides to context switch to process A and calls context_switch().
context_switch() calls switch_mm() to switch the virtual memory.
context_switch() calls switch_to() to switch stacks. At this point, the stack of process B looks like the following:
...
[IRET return frame]
[ret_from_interrupt()]
[preempt_schedule_irq()]
[schedule()]
[context_switch()] (Stack top)
Now process A is running with its stack resumed. Since context_switch() in A was not invoked because of a timer interrupt, process A does not execute IRET and simply continues executing mutex_lock(). This scenario might block the timer interrupt forever.
What am I missing here?
Economical-with-the-truth time; a non-Linux-specific explanation/example:
Thread A does not have to call IRET - the kernel code calls IRET to return execution to thread A, after all, that's one way it may have lost it in the first place - a hardware interrupt from some peripheral device.
Typically, when thread A lost execution earlier on due to some other hardware interrupt or syscall, thread A's stack pointer is saved in the kernel TCB, pointing to an IRET return frame on the stack of A, before switching to the kernel stack for all the internal scheduler etc. gubbins. If an exact IRET frame does not exist because of the particular syscall mechanism used, one is assembled. When the kernel needs to resume A, the kernel reloads the hardware SP with thread A's stored SP and IRETs to user space. Job done - A resumes running with interrupts etc. enabled.
The kernel has then lost control. When it's entered again by the next hardware interrupt/driver or syscall, it can set its internal SP to the top of its own private stack, since it keeps no state data on it between invocations.
That's just one way in which it can be made to work:) Obviously, the exact mechanism/s are ABI/architecture dependent.
I don't know about Linux, but in many operating systems the context switch is usually performed by a dispatcher, not an interrupt handler. If an interrupt doesn't result in a pending context switch, it just returns. If an interrupt-triggered context switch is needed, the current state is saved and the interrupt exits via the dispatcher (the dispatcher does the IRET). This gets more complicated if nested interrupts are allowed, since the initial interrupt is the one that goes to the dispatcher, regardless of which nested interrupt handler(s) triggered a context switch condition. An interrupt needs to check the saved state to see whether it is a nested interrupt. If it is not, it can disable interrupts, so that no nested interrupt occurs while it checks for a pending context switch and, if one is needed, exits via the dispatcher to perform it. If the interrupt is a nested interrupt, it only has to set a context switch flag if needed, and rely on the initial interrupt to do the check and context switch.
Usually, there's no need for an interrupt to save a thread's state in a kernel TCB unless a context switch is going to occur.
The dispatcher also handles the cases where context switches are triggered by non-interrupt conditions, such as mutexes, semaphores, and so on.
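As a rough illustration of the nesting rule described above, here is a minimal sketch in C. It is only a model: the names (irq_enter, irq_exit, irq_request_switch, dispatcher) are made up for this example and are not taken from Linux or any particular kernel, and the dispatcher is a stub that merely counts how often it runs.

```c
#include <stdbool.h>

/* Hypothetical kernel state for this sketch; not from any real kernel. */
static int  irq_nesting;     /* number of interrupt handlers currently active */
static bool switch_pending;  /* set by any handler that needs a context switch */
static int  dispatch_count;  /* how many times the dispatcher stub has run */

/* Stand-in for the dispatcher. A real one would save the current state,
 * pick the next thread and IRET into it. */
static void dispatcher(void)
{
    switch_pending = false;
    dispatch_count++;
}

/* Called at the start of every interrupt handler. */
void irq_enter(void)
{
    irq_nesting++;
}

/* Called by a handler (nested or not) that triggers a context switch condition. */
void irq_request_switch(void)
{
    switch_pending = true;
}

/* Called at the end of every handler. Only the initial (outermost)
 * interrupt is allowed to leave via the dispatcher. */
void irq_exit(void)
{
    irq_nesting--;
    if (irq_nesting == 0 && switch_pending)
        dispatcher();
}
```

If a nested handler calls irq_request_switch(), its own irq_exit() does nothing further; the dispatcher runs only when the outermost handler reaches irq_exit().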
I'm wondering whether I understand the concept of an RTOS, and more specifically the scheduling process, correctly.
So, I think I understand the process of a timer interrupt (I omitted the interrupt enable/disable commands here for better readability):
1. program runs...
2. A timer tick occurs that triggers a Timer Interrupt
3. The Timer ISR is called
The timer ISR looks like this:
3.1. Kernel saves context (registers etc.)
3.2. Kernel checks if there is a higher priority task
3.3. If so, the Kernel performs the context switch
3.4. Return from Interrupt
4. Program runs with another task executing
But what does the process look like when an interrupt comes from, let's say, an I/O pin?
1. program runs
2. an interrupt is triggered because data is available
3. a general ISR is called?
3.1. Kernel saves context
3.2. Kernel has to call the user-defined ISR, because the kernel doesn't know what to do now
3.2.1 User ISR runs and does whatever it should do (maybe change the priority of a task that should run now, because the data is now available)
3.2.2 return from User ISR
3.3. Kernel checks if there is a higher priority task available
3.4. If so the Kernel performs a context switch
3.5. Return from Interrupt
4. program runs with the different task
In this case the kernel must implement a general ISR, so that all interrupts are mapped to this ISR. For example (as far as I know) the ATmega168p microcontroller has 26 interrupt vectors. So there should be a processor-specific file that maps all the interrupts to a general ISR. The kernel ISR determines what caused the interrupt and calls the specific user ISR (which handles the actual interrupt).
Did I misunderstand something?
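The model described in this question could be sketched roughly as below. This is only an illustration of the idea of a common kernel ISR that looks up a user ISR in a table. All names (register_user_isr, kernel_isr, example_pin_isr) are hypothetical, the processor-specific stubs that would funnel each hardware vector into kernel_isr() are left out, and the context save and reschedule steps are reduced to comments and a counter.

```c
#include <stddef.h>

#define NUM_VECTORS 26  /* vector count quoted in the question */

typedef void (*user_isr_t)(void);

static user_isr_t vector_table[NUM_VECTORS]; /* user ISR per vector, NULL if none */
static int resched_checks;                   /* counts the "check for higher priority task" step */
static int pin_events;                       /* touched by the example user ISR below */

/* Example user ISR: a real one might make a waiting task ready. */
void example_pin_isr(void)
{
    pin_events++;
}

/* Install a user ISR for one vector; returns 0 on success, -1 on a bad vector. */
int register_user_isr(unsigned vector, user_isr_t handler)
{
    if (vector >= NUM_VECTORS)
        return -1;
    vector_table[vector] = handler;
    return 0;
}

/* Common kernel ISR, assumed to be called by a per-vector stub. */
void kernel_isr(unsigned vector)
{
    /* 3.1 save context (omitted in this sketch) */

    /* 3.2 call the user-defined ISR, if one is registered */
    if (vector < NUM_VECTORS && vector_table[vector] != NULL)
        vector_table[vector]();

    /* 3.3/3.4 check for a higher priority task and switch if needed (stubbed) */
    resched_checks++;

    /* 3.5 return from interrupt (done by the stub) */
}
```

After register_user_isr(3, example_pin_isr), a call to kernel_isr(3) runs the user ISR and then the reschedule check, while kernel_isr(4) only performs the check.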
Thank you for your help
There is a clear distinction between the OS tick interrupt and the OS scheduler; you have, however, conflated the two. When the OS tick ISR runs, the tick count is incremented. If that increment causes a timer or delay expiry, that is a scheduling event, and scheduling events cause the scheduler to run on exit from the interrupt context.
Different RTOS may have subtle differences, but in general in any ISR, if a scheduling event occurred, the scheduler runs immediately before exiting the interrupt context, setting up the threading context for whatever thread is due to run by the scheduling policy (normally highest priority ready thread).
Scheduling events include:
OS timer expiry
Task delay expiry
Timeslice expiry (for round-robin scheduling).
Semaphore give
Message queue post
Task event flag set
These last three can occur in any ISR (so long as they are "try semantics" non-blocking/zero timeout), the first three as a result of the tick ISR. So the scheduler will run on exit from the interrupt context when any interrupt has caused at least one scheduling event (there may have been nested or multiple simultaneous interrupts).
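To make the separation between the tick ISR and the scheduler concrete, here is a small sketch. It assumes a fixed task table, a single delay counter per task, and "larger number means higher priority"; these choices and all names (tick_isr, scheduler_on_isr_exit, struct task) are assumptions for the example and are not taken from any specific RTOS.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_TASKS 4

struct task {
    int      priority;     /* larger number = higher priority (assumption) */
    uint32_t delay_ticks;  /* remaining delay; 0 means not delayed */
    bool     ready;
};

static struct task tasks[MAX_TASKS];
static uint32_t tick_count;
static bool scheduling_event;  /* set when something made a task ready */

/* Tick ISR: counts ticks and expires delays. It does not pick a task. */
void tick_isr(void)
{
    tick_count++;
    for (int i = 0; i < MAX_TASKS; i++) {
        if (tasks[i].delay_ticks > 0 && --tasks[i].delay_ticks == 0) {
            tasks[i].ready = true;
            scheduling_event = true;  /* delay expiry is a scheduling event */
        }
    }
}

/* Runs on exit from interrupt context. Returns the index of the task to
 * run next; without a scheduling event the current task simply continues. */
int scheduler_on_isr_exit(int current)
{
    if (!scheduling_event)
        return current;
    scheduling_event = false;

    int best = -1;
    for (int i = 0; i < MAX_TASKS; i++) {
        if (tasks[i].ready &&
            (best < 0 || tasks[i].priority > tasks[best].priority))
            best = i;
    }
    return best < 0 ? current : best;
}
```

A tick that expires no delay leaves the current task running; only the tick that makes a higher priority task ready leads to a different task being selected on exit.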
Scheduling events may also occur in the task context, including on any potentially blocking action, such as:
Semaphore give
Semaphore take
Message queue receive
Message queue post
Task event flag set
Task event flag wait
Task delay start
Timer wait
Explicit "yield"
The scheduler also runs when a thread triggers a scheduling event, so context switches do not occur only as the result of an interrupt.
To summarise, and with respect to your question specifically: neither the tick nor any other interrupt directly causes the scheduler to run. Any interrupt can perform an action that makes the scheduler due to run. Unlike the thread context, where such an action causes the scheduler to run immediately, in the interrupt context the scheduler is deferred until all pending interrupts have been serviced, and it runs on exit from the interrupt context.
For details of a specific RTOS implementation of context switching, see §§3.05, 3.06 and 3.10 of MicroC/OS-II: The Real Time Kernel (the kernel and the book were specifically developed to teach such principles, so it is a useful resource, and the principles apply to other RTOS kernels). See in particular Listings 3.18 to 3.20 and Figure 3.10 and the associated explanation.
I saw this piece of code on disk read in Linux 0.11 kernel:
static inline void lock_buffer(struct buffer_head * bh)
{
    cli();
    while (bh->b_lock)
        sleep_on(&bh->b_wait);
    bh->b_lock=1;
    sti();
}
IIUC, cli() blocks interrupts (not all of them, as explained here: https://c9x.me/x86/html/file_module_x86_id_31.html, but it still blocks some interrupts, which means it changes the default behavior).
And sleep_on will call schedule, which will pass the control flow to another process.
However, what makes me confused is that here we will switch to another process with some of the interrupts blocked, which seems error-prone because the other process should expect the default behavior. So is this a correctly written piece of code (if so, why?) or it is just a wrongly written one which will cause unexpected behaviors?
I presume that the interrupt handler of the disk drive will be the one to wakeup(&bh->b_wait), which could lead to a missed wakeup if interrupts were not disabled in the process waiting for this block.
Remember that condition variables (sleep_on, wakeup) have no memory: sleep_on will suspend until wakeup is called; it doesn't matter if wakeup is called just before sleep_on.
From the moment bh->b_lock is tested, the caller is racing with the interrupt handler; thus cli (or, more typically on Unix, splbio()) blocks the interrupt handler, preventing the race.
Since the kernel saves the interrupt state (mask, priority, ...) with the process state, when sleep_on causes a reschedule it is most likely that interrupts will be re-enabled, or at least eventually will be. The disk interrupt will eventually run, waking up this process.
When this process is rescheduled, its saved interrupt state (disabled) will be restored, so that the test & assignment of b_lock will also prevent interference from the disk interrupt handler.
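A toy model of "the interrupt state is saved with the process state" might look like the following. It assumes a single interrupt-enable flag that is stored per process on every switch, which is a simplification of what this answer describes. The names toy_cli, toy_sti, switch_to and struct proc are invented for this sketch and are not the Linux 0.11 implementation.

```c
#include <stdbool.h>
#include <stddef.h>

struct proc {
    bool saved_if;  /* interrupt-enable flag saved with the process state */
};

static bool cpu_if = true;           /* the one "hardware" interrupt-enable flag */
static struct proc *current = NULL;  /* process currently on the CPU */

void toy_cli(void) { cpu_if = false; }  /* models cli(): disable interrupts */
void toy_sti(void) { cpu_if = true; }   /* models sti(): enable interrupts */

bool interrupts_enabled(void) { return cpu_if; }

/* Context switch: the outgoing process keeps its own flag value, and the
 * incoming process gets back whatever value it had when it was switched out. */
void switch_to(struct proc *next)
{
    if (current != NULL)
        current->saved_if = cpu_if;
    cpu_if = next->saved_if;
    current = next;
}
```

In this model, a process that disables interrupts and then sleeps does not leave them disabled for the next process: the switch restores the next process's own flag, and the sleeper sees interrupts disabled again only once it is switched back in.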
Thought about this again. I think this is the intended behavior. It means that until the disk read finishes (unlock_buffer being called), all subsequent execution is in uninterruptible mode (interrupts blocked). When the buffer is unlocked and the head of the queue is woken up,
    while (bh->b_lock)
        sleep_on(&bh->b_wait);
    bh->b_lock=1;
    sti();
will be executed, and because we are in uninterruptible mode, it will run through to sti() without switching to another process. So other processes waiting on the same signal will sleep again (bh->b_lock is 1) when scheduled, and only one process continues its execution.
I am developing a simple kernel for my upcoming OS. I have developed everything till the scheduler. I am wondering how the scheduler comes into its cycle.
For example,
The TIMER interrupt fires.
The handler calls the scheduler.
The scheduler jumps to the next process in the queue.
The interrupt must return (IRETD)
But if the scheduler has to jump to the next process, then when does the interrupt return? And if it does, wouldn't it go back to the last process?
I want this clarified: how does the timer interrupt return from the scheduler, and how does the scheduler communicate with the timer interrupt (if by a function call, then when does it return)?
Assume - Monolithic Kernel
When an interrupt occurs, the processor switches its context. It does so by updating a flag in the EFLAGS register and pushing some information on the stack (see the Intel manuals). If the interrupt occurs in user mode, a stack switch also occurs, according to the TSS of the current task.
The scheduling sequence is then:
Came from user-process with interrupt state pushed on stack
Pick next process
IRETD on interrupt state of new process
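One common way to express these three steps in a hobby kernel is to let the scheduler take a pointer to the interrupted process's saved frame and return a pointer to the frame that the interrupt stub should IRETD on. The sketch below assumes round-robin selection and a simplified frame layout; struct irq_frame, struct process and timer_schedule are made-up names, and the assembly stub that would load the stack pointer from the returned value is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the state pushed on interrupt entry. */
struct irq_frame {
    uint32_t eip;
    uint32_t cs;
    uint32_t eflags;
};

#define NPROC 3

struct process {
    struct irq_frame *frame;  /* where this process's interrupt state is saved */
};

static struct process procs[NPROC];
static size_t current_proc;

/* Called from the timer handler with the frame of the interrupted process.
 * Returns the frame the stub should switch to before executing IRETD. */
struct irq_frame *timer_schedule(struct irq_frame *interrupted)
{
    procs[current_proc].frame = interrupted;    /* 1. remember where we came from */
    current_proc = (current_proc + 1) % NPROC;  /* 2. pick the next process */
    return procs[current_proc].frame;           /* 3. IRETD on its saved state */
}
```

In this scheme the interrupt does return, just into a different process: the old process's frame stays saved, and the IRETD for that frame happens later, when the scheduler selects the process again.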
I have some confusion when looking at how interrupt handler(ISR) is run. In Wiki http://en.wikipedia.org/wiki/Context_switch, it describes interrupt handling with 2 steps:
1) context switching
When an interrupt occurs, the hardware automatically switches a part of the
context (at least enough to allow the handler to return to the interrupted code).
The handler may save additional context, depending on details of the particular
hardware and software designs.
2) running the handler
The kernel does not spawn or schedule a special process to handle interrupts,
but instead the handler executes in the (often partial) context established at
the beginning of interrupt handling. Once interrupt servicing is complete, the
context in effect before the interrupt occurred is restored so that the
interrupted process can resume execution in its proper state.
Let's say the interrupt handler is the upper half, for a kernel-space device driver (I assume a user-space device driver's interrupt follows the same logic).
when interrupt occurs:
1) current kernel process is suspended. But what is the context situation here? Based on Wiki's description, kernel does not spawn a new process to run ISR, and the context established at the beginning of interrupt handling, sounds so much like another function call within the interrupted process. so is interrupt handler using the interrupted process's stack(context) to run? Or kernel would allocate some other memory space/resource to run it?
2) since here ISR is not a 'process' type that can be put to sleep by scheduler. It has to be finished no matter what? Not even limited by any time-slice bound? What if ISR hang, how does the system deal with it?
Sorry if the question is fundamental. I have not delved into the subject long enough.
Thanks,
so is interrupt handler using the interrupted process's stack(context) to run? Or kernel would allocate some other memory space/resource to run it?
It depends on the CPU and on the kernel. Some CPUs execute ISRs using the current stack. Others automatically switch to a special ISR stack or to a kernel stack. The kernel may switch the stack as well, if needed.
since here ISR is not a 'process' type that can be put to sleep by scheduler. It has to be finished no matter what?
Yep, or you risk hanging your computer. You see, interrupts interrupt processes and threads. In fact, most CPUs have no concept of a thread or a process, and to them it doesn't matter what gets interrupted/preempted (it can even be another ISR!); it's just not going to execute again until the ISR finishes.
Not even limited by any time-slice bound? What if ISR hang, how does the system deal with it?
It hangs, especially if it's a single-CPU system. It may report an error and then hang/reboot. In fact, in Windows (since Vista?) hung or too slowly executing deferred procedures (DPCs), which aren't ISRs but are somewhat like them (they execute between ISRs and threads in terms of priority/preemption) can cause a "bugcheck". The OS monitors execution of DPCs and it can do that concurrently on multiple CPUs.
Anyway, it's not a normal situation and typically there's no way out of it other than a system reset. Look up watchdog timers. They help to discover such bad hangs and perform a reset. Many electronic devices have them.
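The watchdog idea can be modelled in a few lines. This is a simulation only: real watchdogs are hardware counters with device-specific registers, and the names watchdog_kick and watchdog_hw_tick as well as the timeout value are assumptions made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define WATCHDOG_TIMEOUT_TICKS 3u  /* arbitrary timeout for this sketch */

static uint32_t watchdog_counter = WATCHDOG_TIMEOUT_TICKS;
static bool reset_requested;  /* stands in for the hardware reset line */

/* Called regularly by healthy software to restart the countdown. */
void watchdog_kick(void)
{
    watchdog_counter = WATCHDOG_TIMEOUT_TICKS;
}

/* Models the independent hardware countdown, which keeps running
 * even if the CPU is stuck inside an ISR. */
void watchdog_hw_tick(void)
{
    if (watchdog_counter > 0 && --watchdog_counter == 0)
        reset_requested = true;
}
```

As long as the main loop keeps calling watchdog_kick(), the counter never reaches zero. If an ISR hangs and the kicks stop, the countdown expires and a reset is requested.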
Think of an interrupt handler as a function running in its own high-priority thread. When a device raises an interrupt, any other activity with lower priority is suspended and the ISR is executed. This is like a thread context switch.
When an ISR hangs (for example, in an endless loop), the whole computer hangs, assuming we are talking about an ISR in a PC driver. No activity with lower priority than the ISR is allowed to run, so the computer looks dead. However, it still reacts to hardware remote debugger commands, if one is attached.
http://lxr.linux.no/linux+v2.6.35/include/linux/preempt.h#L21
I am just trying to get into the Linux source. I saw this preempt count; how does Linux ensure that the preempt count is atomic? The code just increments the value.
I also have another question: why do interrupt handlers need to maintain mutual exclusion? Only one can execute at a time, right?
Also, when interrupts are disabled, what does the OS do? Ignore interrupts or maintain a queue?
It increments preempt_count() - notice the () - which is a macro defined as:
#define preempt_count() (current_thread_info()->preempt_count)
So it is incrementing a per-thread variable, which doesn't require any locking and is safe.
It's best to ask your multiple questions as separate questions, but briefly:
Interrupt handlers can in general be interrupted by other interrupt handlers;
Interrupt handlers can run on one CPU core while other kernel code is running on another core;
Interrupts are usually disabled using a hardware mechanism. These tend to remember pending interrupts, but only up to a maximum of one per interrupt vector.
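The "one pending interrupt per vector" behaviour from the last point can be modelled with a bitmask. This is a simulation under simplifying assumptions (32 vectors, delivery in vector order, no priorities); the function names are hypothetical and do not correspond to any real interrupt controller API.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_IRQ_VECTORS 32u

static bool     irqs_enabled = true;
static uint32_t pending_mask;                /* one pending bit per vector */
static int      delivered[NUM_IRQ_VECTORS];  /* how often each handler ran */

/* Run the handler once for every vector whose pending bit is set. */
static void deliver_pending(void)
{
    for (unsigned v = 0; v < NUM_IRQ_VECTORS; v++) {
        uint32_t bit = UINT32_C(1) << v;
        if (pending_mask & bit) {
            pending_mask &= ~bit;
            delivered[v]++;  /* stands in for calling the handler */
        }
    }
}

/* A device signals an interrupt on the given vector. */
void raise_irq(unsigned vector)
{
    if (vector >= NUM_IRQ_VECTORS)
        return;
    pending_mask |= UINT32_C(1) << vector;  /* a second raise sets the same bit */
    if (irqs_enabled)
        deliver_pending();
}

void disable_irqs(void) { irqs_enabled = false; }

void enable_irqs(void)
{
    irqs_enabled = true;
    deliver_pending();  /* anything remembered while disabled is handled now */
}
```

Raising the same vector twice while interrupts are disabled results in a single delivery once they are enabled again, because a single bit cannot count.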
The operation on the preempt_count variable is not atomic. The code region between an inc and a dec of a thread's preempt_count is guaranteed not to be switched out by the scheduler. Within this region, control can leave the current thread only through nested exceptions or interrupts. Once the first inc has completed, those nested handlers will see that the variable is non-zero and will not cause a context switch. Before the inc finishes, the thread can still be switched out, but that is fine because the code has not yet reached the guarded region.
Some details: an atomic variable is roughly one on which the read-modify-write operation is performed as a single instruction, without interruption. The read-modify-write operation on a preempt_count can be interrupted by an exception or interrupt handler, but only in a strictly nested manner; that is by kernel design. Since those nested operations come in inc/dec pairs, the value of preempt_count is not corrupted in the end. A read-modify-write operation can be interrupted and the current thread switched out (only if none of the nested incs has completed), but that is fine because the code has not reached the guarded region. Once the thread is switched back in, it finishes the read-modify-write operation, and from that point on the current thread will not be switched out until all the paired decs have finished.
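The pairing of inc and dec described above can be illustrated with a toy model. It is not the kernel's code: the _sketch names, the single global thread structure and the context_switches counter are assumptions for this example, and the model only shows that a reschedule request is honoured once the count is back at zero.

```c
#include <stdbool.h>

/* Toy stand-in for the per-thread structure that holds preempt_count. */
struct thread_info_sketch {
    int preempt_count;
};

static struct thread_info_sketch current_thread;
static bool need_resched;      /* set when some handler wants a reschedule */
static int  context_switches;  /* counts switches the model would perform */

/* Switch only if nothing has preemption disabled and a switch is wanted. */
static void maybe_preempt(void)
{
    if (current_thread.preempt_count == 0 && need_resched) {
        need_resched = false;
        context_switches++;
    }
}

/* Plain increment: no locking, the variable belongs to this thread. */
void preempt_disable_sketch(void)
{
    current_thread.preempt_count++;
}

/* Paired decrement; the outermost one re-checks for a pending reschedule. */
void preempt_enable_sketch(void)
{
    current_thread.preempt_count--;
    maybe_preempt();
}

/* Models the return path of a nested interrupt or exception. */
void irq_return_sketch(void)
{
    maybe_preempt();
}
```

With two nested disables in effect, neither an interrupt return nor the inner enable switches the thread out; only the final enable, which brings the count back to zero, lets the pending reschedule happen.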
Virtually every modern processor has some variant of an atomic test-and-set instruction.
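In portable C, test-and-set is available through the C11 atomic_flag type in <stdatomic.h>. The following is a minimal sketch of a try-lock built on it; the names tas_try_lock and tas_unlock are made up for this example, and how the compiler maps the operation to a machine instruction depends on the target.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

/* Atomically set the flag and report whether it was clear before.
 * Returns true if the caller now holds the lock. */
bool tas_try_lock(void)
{
    return !atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire);
}

/* Clear the flag so that the next test-and-set succeeds. */
void tas_unlock(void)
{
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}
```

The read of the old value and the write of the new one happen as one indivisible step, so two contexts calling tas_try_lock() at the same time cannot both get true.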