How does linux synchronize preempt count - c

http://lxr.linux.no/linux+v2.6.35/include/linux/preempt.h#L21
I am just trying to read the Linux source. I saw this preempt count; how does Linux ensure that updating the preempt count is atomic? The code just increments the value.
Also, I have another question: why do interrupt handlers need to maintain mutual exclusion? Only one can execute at a time, right?
Also, when interrupts are disabled, what does the OS do? Does it ignore interrupts or maintain a queue?

It increments preempt_count() - notice the () - which is a macro defined as:
#define preempt_count() (current_thread_info()->preempt_count)
So it is incrementing a per-thread variable, which doesn't require any locking and is safe.
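To make the per-thread point concrete, here is a minimal user-space sketch, not kernel code. The struct, the current_on_cpu array and the this_cpu variable are hypothetical stand-ins for the kernel's real bookkeeping; only the shape of the preempt_count() macro is taken from the definition quoted above.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's per-thread structure. */
struct thread_info {
    int preempt_count;          /* 0 means this thread may be preempted */
};

#define NR_CPUS 2               /* assumption: two simulated CPUs */

/* Assumption: one "current" thread per simulated CPU. In the kernel,
 * current_thread_info() locates the running thread's own structure. */
static struct thread_info *current_on_cpu[NR_CPUS];
static int this_cpu;            /* which simulated CPU is executing */

#define current_thread_info() (current_on_cpu[this_cpu])
#define preempt_count()       (current_thread_info()->preempt_count)
#define inc_preempt_count()   do { preempt_count() += 1; } while (0)

/* Interleave increments from two CPUs: each one touches only the
 * counter of the thread it is running, so nothing is shared. */
static int demo_per_thread_counts(void)
{
    struct thread_info a = { 0 }, b = { 0 };

    current_on_cpu[0] = &a;     /* thread A runs on CPU 0 */
    current_on_cpu[1] = &b;     /* thread B runs on CPU 1 */

    this_cpu = 0; inc_preempt_count();
    this_cpu = 1; inc_preempt_count();
    this_cpu = 0; inc_preempt_count();

    assert(a.preempt_count == 2);
    assert(b.preempt_count == 1);
    return a.preempt_count * 10 + b.preempt_count;
}
```

Calling demo_per_thread_counts() returns 21 (two increments landed on A's counter, one on B's), which shows that the same macro resolves to a different variable depending on which thread is current.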
It's best to ask your multiple questions as separate questions, but briefly:
Interrupt handlers can in general be interrupted by other interrupt handlers;
Interrupt handlers can run on one CPU core while other kernel code is running on another core;
Interrupts are usually disabled using a hardware mechanism. These tend to remember pending interrupts, but only up to a maximum of one per interrupt vector.

The operation on the preempt_count variable is not atomic, and it does not need to be. The scheduler is guaranteed not to switch a thread out in the code region between an inc and the matching dec of that thread's preempt_count. Within that region, a context switch away from the current thread could only be requested by a nested exception or interrupt handler. Once the first inc has completed, those handlers see that the variable is non-zero and therefore do not cause a context switch. Before the inc finishes, the thread can still be switched out, but that is fine because the code has not yet reached the guarded region.
Some details: an atomic variable is usually defined as one on which the read-modify-write operation is performed as a single instruction, without any interruption. The read-modify-write on preempt_count can be interrupted by an exception handler or interrupt handler, but only in a strictly nested manner; that is by kernel design. Because those nested operations come in inc/dec pairs, each handler leaves the value as it found it, so preempt_count is never left corrupted. A read-modify-write can be interrupted and the current thread can even be switched out (only if none of the nested incs has completed), but that is fine because the code has not reached the guarded region. Once the thread is switched back in, it finishes the read-modify-write, and from that point on it will not be switched out until all the paired decs have finished.
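The pairing argument can be sketched as a small single-threaded simulation. Everything here is a simplified assumption for illustration: fake_interrupt(), maybe_schedule() and the context_switches counter are hypothetical names, and the real kernel entry/exit paths do considerably more.

```c
#include <assert.h>
#include <stdbool.h>

static int  preempt_count;     /* per-thread in the kernel; one thread here */
static bool need_resched;      /* set when another task should run */
static int  context_switches;  /* how many times the fake scheduler ran */

/* Stand-in for the scheduler: it may only switch when the count is zero. */
static void maybe_schedule(void)
{
    if (preempt_count == 0 && need_resched) {
        need_resched = false;
        context_switches++;
    }
}

/* Hypothetical interrupt: a nested, paired inc/dec around the handler. */
static void fake_interrupt(void)
{
    int seen = preempt_count;

    preempt_count++;                  /* handler entry */
    need_resched = true;              /* pretend the handler woke a task */
    preempt_count--;                  /* handler exit, paired with entry */
    assert(preempt_count == seen);    /* the pair restores the old value */
    maybe_schedule();
}

static int demo_guarded_region(void)
{
    preempt_count = 0;
    need_resched = false;
    context_switches = 0;

    fake_interrupt();                 /* outside any guarded region */
    assert(context_switches == 1);    /* switch was allowed */

    preempt_count++;                  /* inc: guarded region starts */
    fake_interrupt();                 /* nested handler sees non-zero */
    assert(context_switches == 1);    /* so no switch happened */
    preempt_count--;                  /* dec: guarded region ends */
    maybe_schedule();                 /* the deferred switch happens now */

    return context_switches;
}
```

demo_guarded_region() returns 2: one switch from the interrupt taken outside the region, and one deferred until the matching dec brought the count back to zero.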

Every modern processor has some variant of the atomic test-and-set instruction.
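For comparison with the non-atomic per-thread counter, this is roughly what a test-and-set primitive looks like from portable C. It is a sketch using the C11 atomic_flag type rather than any kernel API; try_lock() and unlock() are names made up for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

/* Test-and-set returns the previous value of the flag, so the caller
 * took the lock only if the flag was clear before the call. */
static bool try_lock(void)
{
    return !atomic_flag_test_and_set_explicit(&lock_flag,
                                              memory_order_acquire);
}

static void unlock(void)
{
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}

static int demo_test_and_set(void)
{
    int successes = 0;

    bool first = try_lock();    /* flag was clear: lock taken */
    assert(first);
    successes += first;

    successes += try_lock();    /* flag already set: attempt fails */

    unlock();
    successes += try_lock();    /* clear again: lock taken */
    unlock();

    return successes;
}
```

demo_test_and_set() returns 2, since the middle attempt fails while the flag is still set. The compiler maps atomic_flag_test_and_set_explicit() onto whatever atomic instruction the target provides.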

Related

How a context switch works in a RTOS, need clarity

I'm wondering whether I understand the concept of an RTOS, and more specifically the scheduling process, correctly.
So, I think I understand the process of a timer interrupt (I omitted the interrupt enable/disable commands for better readability here)
1. program runs...
2. A timer tick occurs that triggers a Timer Interrupt
3. The Timer ISR is called
The timer ISR looks like this:
3.1. Kernel saves context (registers etc.)
3.2. Kernel checks if there is a higher priority task
3.3. If so, the Kernel performs the context switch
3.4. Return from Interrupt
4. Program runs with another task executing
But what does the process look like when an interrupt occurs from, let's say, an I/O pin?
1. program runs
2. an interrupt is triggered because data is available
3. a general ISR is called?
3.1. Kernel saves context
3.2. Kernel has to call the user-defined ISR, because the kernel doesn't know what to do now
3.2.1 User ISR runs and does whatever it should do (maybe change the priority of a task that should run now, because the data is now available)
3.2.2 return from User ISR
3.3. Kernel checks if there is a higher priority task available
3.4. If so the Kernel performs a context switch
3.5. Return from Interrupt
4. program runs with the different task
In this case the kernel must implement a general ISR, so that all interrupts are mapped to this ISR. For example (as far as i know) the ATmega168p microcontroller has 26 interrupt vectors. So there should be a processor specific file, that maps all the Interrupts to a general ISR. The Kernel-ISR determines what caused the interrupt and calls the specific User-ISR (that handles the actual interrupt).
Did I misunderstand something?
Thank you for your help
There is a clear distinction between the OS tick interrupt and the OS scheduler; you have, however, conflated the two. When the OS tick ISR runs, the tick count is incremented. If that increment causes a timer or delay to expire, that is a scheduling event, and scheduling events cause the scheduler to run on exit from the interrupt context.
Different RTOS may have subtle differences, but in general in any ISR, if a scheduling event occurred, the scheduler runs immediately before exiting the interrupt context, setting up the threading context for whatever thread is due to run by the scheduling policy (normally highest priority ready thread).
Scheduling events include:
OS timer expiry
Task delay expiry
Timeslice expiry (for round-robin scheduling).
Semaphore give
Message queue post
Task event flag set
These last three can occur in any ISR (so long as they are "try semantics" non-blocking/zero timeout), the first three as a result of the tick ISR. So the scheduler will run on exit from the interrupt context when any interrupt has caused at least one scheduling event (there may have been nested or multiple simultaneous interrupts).
Scheduling events may occur in the task context also including on any potentially blocking action such as:
Semaphore give
Semaphore take
Message queue receive
Message queue post
Task event flag set
Task event flag wait
Task delay start
Timer wait
Explicit "yield"
The scheduler runs also when a thread triggers a scheduling event, so context switches do not only occur as the result of an interrupt.
To summarise, and with respect to your question specifically: the tick or any other interrupt does not directly cause the scheduler to run. Any interrupt can perform an action that makes the scheduler due to run. Unlike the thread context, where such an action causes the scheduler to run immediately, in the interrupt context the scheduler is deferred until all pending interrupts have been serviced and runs on exit from the interrupt context.
For details of a specific RTOS implementation of context switching see §§3.05, 3.06 and 3.10 of MicroC/OS-II: The Real Time Kernel (the kernel and the book were specifically developed to teach such principles, so it is a useful resource and the principles apply to other RTOS kernels). In particular Listings 3.18 to 3.20 and Figure 3.10 and the associated explanation.
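The deferral described in this answer can be sketched as a tiny simulation. It is not taken from any particular RTOS: isr_enter(), isr_exit(), sem_give_from_isr() and sem_give_from_task() are hypothetical names, and the "scheduler" is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>

static int  isr_nesting;      /* depth of nested interrupts */
static bool sched_event;      /* a scheduling event occurred in an ISR */
static int  scheduler_runs;   /* how often the fake scheduler ran */

static void isr_enter(void) { isr_nesting++; }

/* From an ISR: only record that the scheduler is due to run. */
static void sem_give_from_isr(void) { sched_event = true; }

/* The scheduler runs only when leaving the outermost interrupt, and
 * only if some ISR produced a scheduling event. */
static void isr_exit(void)
{
    isr_nesting--;
    if (isr_nesting == 0 && sched_event) {
        sched_event = false;
        scheduler_runs++;     /* stands in for picking the next thread */
    }
}

/* From task context: the scheduler runs immediately. */
static void sem_give_from_task(void) { scheduler_runs++; }

static int demo_deferred_scheduling(void)
{
    isr_nesting = 0;
    sched_event = false;
    scheduler_runs = 0;

    isr_enter();                    /* outer ISR, e.g. the tick */
    isr_enter();                    /* nested ISR, e.g. an I/O pin */
    sem_give_from_isr();
    isr_exit();                     /* still inside the outer ISR */
    assert(scheduler_runs == 0);
    isr_exit();                     /* outermost exit: scheduler runs */
    assert(scheduler_runs == 1);

    isr_enter();                    /* ISR with no scheduling event */
    isr_exit();
    assert(scheduler_runs == 1);    /* plain return, no scheduling */

    sem_give_from_task();           /* task context: immediate */
    return scheduler_runs;
}
```

demo_deferred_scheduling() returns 2: one run deferred to the outermost interrupt exit and one triggered directly from task context. The interrupt that raised no scheduling event simply returns.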

Who calls IRET after context switch in Linux Kernel?

I have been trying to understand how context switching works in Linux Kernel. It appears to me that there is a situation (explained later) which results in no invocation of IRET instruction after the interrupt (I am sure that there is something that I am missing!). I am assuming that invocation of IRET after the interrupt is extremely necessary, since you can't get the same interrupt until you invoke IRET. I am only worried about uni-processor kernel running on x86 arch.
The situation that I think might result in the described behavior is as follows:
Process A running in kernel mode calls schedule() voluntarily (for example while trying to acquire an already locked mutex).
schedule() decides to perform a context switch to process B and hence calls context_switch()
context_switch() switches virtual memory from A to B by calling switch_mm()
context_switch() runs macro switch_to() to switch stacks and actually change the running process from A to B. Note that process A is now stuck inside switch_to() and the stack of process A looks like (stack growing downwards):
...
[mutex_lock()]
[schedule()]
[context_switch()] (Stack Top)
Process B starts running. At some later time, it receives a timer interrupt and the timer interrupt handler decides that process B needs a reschedule.
On return from timer interrupt (but before invoking IRET) preempt_schedule_irq() is invoked.
preempt_schedule_irq() calls schedule().
schedule() decides to context switch to process A and calls context_switch().
context_switch() calls switch_mm() to switch the virtual memory.
context_switch() calls switch_to() to switch stacks. At this point, stack of process B looks like following:
...
[IRET return frame]
[ret_from_interrupt()]
[preempt_schedule_irq()]
[schedule()]
[context_switch()] (Stack top)
Now process A is running with its stack resumed. Since, context_switch() function in A was not invoked due to a timer interrupt, process A does not call IRET and it continues execution of mutex_lock(). This scenario may lead to blocking of timer interrupt forever.
What am I missing here?
Economical-with-the-truth time: a non-Linux-specific explanation/example:
Thread A does not have to call IRET - the kernel code calls IRET to return execution to thread A, after all, that's one way it may have lost it in the first place - a hardware interrupt from some peripheral device.
Typically, when thread A lost execution earlier on due to some other hardware interrupt or syscall, thread A's stack pointer is saved in the kernel TCB pointing to an IRET return frame on the stack of A before switching to the kernel stack for all the internal scheduler etc. gubbins. If an exact IRET frame does not exist because of the particular syscall mechanism used, one is assembled. When the kernel needs to resume A, the kernel reloads the hardware SP with thread A's stored SP and IRETs to user space. Job done - A resumes running with interrupts etc. enabled.
The kernel has then lost control. When it's entered again by the next hardware interrupt/driver or syscall, it can set its internal SP to the top of its own private stack, since it keeps no state data on it between invocations.
That's just one way in which it can be made to work:) Obviously, the exact mechanism/s are ABI/architecture dependent.
I don't know about Linux, but in many operating systems, the context switch is usually performed by a dispatcher, not an interrupt handler. If an interrupt doesn't result in a pending context switch, it just returns. If an interrupt-triggered context switch is needed, the current state is saved and the interrupt exits via the dispatcher (the dispatcher does the IRET). This gets more complicated if nested interrupts are allowed, since the initial interrupt is the one that goes to the dispatcher, regardless of which nested interrupt handler(s) triggered a context switch condition. An interrupt needs to check the saved state to see if it's a nested interrupt. If it is not, it can disable interrupts to prevent nested interrupts from occurring while it checks whether a context switch is needed and, if so, exits via the dispatcher to perform it. If the interrupt is a nested interrupt, it only has to set a context switch flag if needed, and rely on the initial interrupt to do the check and context switch.
Usually, there's no need for an interrupt to save a thread's state in a kernel TCB unless a context switch is going to occur.
The dispatcher also handles the cases where context switches are triggered by non-interrupt conditions, such as mutex, semaphore, ... .

Is interrupt handler running like this, and for how long?

I have some confusion when looking at how interrupt handler(ISR) is run. In Wiki http://en.wikipedia.org/wiki/Context_switch, it describes interrupt handling with 2 steps:
1) context switching
When an interrupt occurs, the hardware automatically switches a part of the
context (at least enough to allow the handler to return to the interrupted code).
The handler may save additional context, depending on details of the particular
hardware and software designs.
2) running the handler
The kernel does not spawn or schedule a special process to handle interrupts,
but instead the handler executes in the (often partial) context established at
the beginning of interrupt handling. Once interrupt servicing is complete, the
context in effect before the interrupt occurred is restored so that the
interrupted process can resume execution in its proper state.
Let's say the interrupt handler is the upper half for a kernel-space device driver (I assume interrupts for user-space device drivers follow the same logic).
when interrupt occurs:
1) current kernel process is suspended. But what is the context situation here? Based on Wiki's description, kernel does not spawn a new process to run ISR, and the context established at the beginning of interrupt handling, sounds so much like another function call within the interrupted process. so is interrupt handler using the interrupted process's stack(context) to run? Or kernel would allocate some other memory space/resource to run it?
2) since here ISR is not a 'process' type that can be put to sleep by scheduler. It has to be finished no matter what? Not even limited by any time-slice bound? What if ISR hang, how does the system deal with it?
Sorry if the question is fundamental. I have not delved into the subject long enough.
Thanks,
so is interrupt handler using the interrupted process's stack(context) to run? Or kernel would allocate some other memory space/resource to run it?
It depends on the CPU and on the kernel. Some CPUs execute ISRs using the current stack. Others automatically switch to a special ISR stack or to a kernel stack. The kernel may switch the stack as well, if needed.
since here ISR is not a 'process' type that can be put to sleep by scheduler. It has to be finished no matter what?
Yep, or you risk hanging your computer. You see, interrupts interrupt processes and threads. In fact, most CPUs have no concept of a thread or a process and to them it doesn't matter what gets interrupted/preempted (it can even be another ISR!), it's just not going to execute again until the ISR finishes.
Not even limited by any time-slice bound? What if ISR hang, how does the system deal with it?
It hangs, especially if it's a single-CPU system. It may report an error and then hang/reboot. In fact, in Windows (since Vista?) hung or too slowly executing deferred procedures (DPCs), which aren't ISRs but are somewhat like them (they execute between ISRs and threads in terms of priority/preemption) can cause a "bugcheck". The OS monitors execution of DPCs and it can do that concurrently on multiple CPUs.
Anyway, it's not a normal situation and typically there's no way out of it other than a system reset. Look up watchdog timers. They help to discover such bad hangs and perform a reset. Many electronic devices have them.
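The watchdog idea can be sketched in a few lines. This is a software simulation only: real watchdogs are hardware counters with device-specific registers, and WDT_TIMEOUT_TICKS, wdt_kick(), wdt_tick() and ticks_until_reset() are names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define WDT_TIMEOUT_TICKS 3   /* assumption: reset after 3 ticks unkicked */

static int  wdt_remaining;
static bool system_reset;

/* Healthy code calls this regularly to restart the countdown. */
static void wdt_kick(void) { wdt_remaining = WDT_TIMEOUT_TICKS; }

/* Runs independently of the main code (in hardware on a real device). */
static void wdt_tick(void)
{
    if (wdt_remaining > 0 && --wdt_remaining == 0)
        system_reset = true;
}

/* The code kicks the watchdog for the first healthy_ticks ticks, then
 * "hangs" and stops kicking. Returns the tick at which the watchdog
 * fires, or -1 if it never fires within total_ticks. */
static int ticks_until_reset(int healthy_ticks, int total_ticks)
{
    assert(healthy_ticks >= 0 && total_ticks >= 0);

    wdt_kick();
    system_reset = false;

    for (int t = 1; t <= total_ticks; t++) {
        if (t <= healthy_ticks)
            wdt_kick();       /* still alive: restart the countdown */
        wdt_tick();
        if (system_reset)
            return t;
    }
    return -1;
}
```

With these numbers, ticks_until_reset(5, 20) returns 7: the last kick happens on tick 5, and the countdown then runs out two ticks later. ticks_until_reset(5, 5) returns -1 because the code never stops kicking.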
Think of an interrupt handler as a function running in its own thread with high priority. When an interrupt is raised by a device, any other activity with lower priority is suspended, and the ISR is executed. This is like a thread context switch.
When an ISR hangs (for example, in an endless loop), the whole computer hangs, assuming that we are talking about an ISR in a PC driver. No activity with lower priority than the ISR is allowed to run, so the computer looks dead. However, it still reacts to hardware remote debugger commands, if a debugger is attached.

Why is "sleeping" not allowed while holding a spinlock? [duplicate]

As far as I know, spinlocks should be used for short durations, and are the only choice in code such as interrupt handlers, where sleeping (preemption) is not allowed.
However, I do not know why there is such a "rule" that there SHOULD BE no sleeping at all while holding a spinlock. I know that it is not a recommended practice (since it is detrimental in performance), but I see no reason why sleeps SHOULD NOT be allowed in spinlocks.
You cannot hold a spin lock while you acquire a semaphore, because you might have to sleep while waiting for the semaphore, and you cannot sleep while holding a spin lock (from "Linux Kernel Development" by Robert Love).
The only reason I can see is for portability reasons, because in uniprocessors, spinlocks are implemented as disabling interrupts, and by disabling interrupts, sleeping is of course not allowed (but sleeping will not break code in SMP systems).
But I am wondering if my reasoning is correct or if there are any other reasons.
There are several reasons why, at least in Linux, sleeping in spinlocks is not allowed:
If thread A sleeps in a spinlock, and thread B then tries to acquire the same spinlock, a uniprocessor system will deadlock. Thread B will never go to sleep (because spinlocks don't have the waitlist necessary to awaken B when A is done), and thread A will never get a chance to wake up.
Spinlocks are used over semaphores precisely because they're more efficient - provided you do not contend for long. Allowing sleeping means that you will have long contention periods, erasing all the benefit of using a spinlock. Your system would be faster just using a semaphore in this case.
Spinlocks are often used to synchronize with interrupt handlers, by additionally disabling interrupts. This use case is not possible if you sleep (once you enter the interrupt handler, you cannot switch back to the thread to let it wake up and finish its spinlock critical section).
Use the right tool for the right job - if you need to sleep, semaphores and mutexes are your friends.
Actually, you can sleep with interrupts disabled or some other sort of exclusion active. If you don't, the condition for which you are sleeping could change state due to an interrupt and then you would never wake up. The sleep code would normally never be entered without an elevated priority or some other critical section that encloses the execution path between the decision to sleep and the context switch.
But for spinlocks, sleep is a disaster, as the lock stays set. Other threads will spin when they hit it, and they won't stop spinning until you wake up from the sleep. That could be an eternity compared to the handful of spins expected in the worst case at a spinlock, because spinlocks exist just to synchronize access to memory locations, they aren't supposed to interact with the context-switching mechanism. (For that matter, every other thread might eventually hit the spinlock and then you would have wedged every thread of every core of the entire system.)
You cannot when you use a spin lock as it is meant to be used. Spin locks are used where really necessary to protect critical regions and shared data structures. If you wait on a semaphore while holding one, you lock access to whichever critical region (say) your lock is attached to (it is typically a member of a specific larger data structure), while allowing this process to possibly be put to sleep. If, say, an IRQ is raised while this process sleeps, and the IRQ handler needs access to the critical region still locked away, it is blocked, which must never happen with IRQs. Obviously, you could make up examples where your spin lock isn't used the way it should be (a hypothetical spin lock attached to a nop loop, say); but that's simply not a real spin lock found in Linux kernels.

Gracefully (i.e eventually cooperatively) suspend thread execution

I have to develop an application that tries to emulate the executing flow of an embedded target. This target has 2 levels of priority : the highest one being preemptive on the lowest one. The low priority level is managed with a round-robin scheduler which gives 1ms of execution to each thread in turn.
My goal is to write a library that provides thread_create, thread_start, and all the system calls that are available on my target, and uses POSIX functions to reproduce the behavior natively on a standard PC.
Thus, when a high-priority thread executes, low-priority threads should be suspended whatever they are doing at that very moment. It is the responsibility of the low-priority thread's implementation to ensure that it won't be perturbed.
I know it is usually unsafe to suspend a thread, which explains why I didn't find any "suspend(pid)" function.
I basically imagine two solutions to the problem :
-find a way to suspend the low-priority threads when a high-priority thread starts (and resume them when there is no more high-priority activity)
-periodically call a very small "suspend_if_necessary" function everywhere in my low-priority code, and whenever a high-priority thread must start, wait for all low-priority threads to call that function and be suspended, execute the single high-priority thread, then resume them all.
Even if it is not so clean, I quite like the second solution, but I still have one problem: how do I call the function everywhere without changing all my code?
I wonder if there is an easy way of doing that, somewhat like debugging code does: add a hook call at every line executed that checks for a flag and runs some specific code when that flag changes?
I'd be very happy if there is an easy solution to that problem, since I really need to be representative with the behavior of the target execution flow...
Thanks in advance,
Goulou.
Unfortunately, it's not really possible to implement what you want with true threads - even if the high prio thread is restarted, it can take arbitrarily long before the high prio thread is scheduled back in and goes to suspend all the low priority threads. Moreover, there is no reliable way to determine whether the high priority thread is blocked or not using only POSIX threads; you could try tracking things manually, but this runs the risk of both false positives (the thread's blocked on something, but the low prio threads think it's running and suspend themselves) and false negatives (you miss a resumed annotation, or there's lag between when the thread's actually resumed and when it marks itself as running).
If you want to implement a thread priority system with pure POSIX, one option is to not use threads, but rather use setcontext for cooperative multitasking. This would allow you to swap between threads at a user level. However you must explicitly yield the CPU in this case. It also doesn't help with blocking syscalls, which would then block all threads in your app; but since you're writing an emulator this might not be an issue.
You may also be able to swap threads using setcontext within a signal handler; I've not tested this case myself, but it could be worth a try scheduling using setcontext in a SIGALRM handler.
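As a rough illustration of the cooperative approach, here is a sketch that switches between two tasks with the ucontext family (getcontext, makecontext, swapcontext). It assumes a platform that still ships <ucontext.h>, such as glibc on Linux; the task names, the stack size and the trace buffer are made up for the example, and it does not attempt the signal-handler variant.

```c
#include <assert.h>
#include <string.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)   /* assumption: ample for these tiny tasks */

static ucontext_t main_ctx, ctx_a, ctx_b;
static char stack_a[STACK_SIZE], stack_b[STACK_SIZE];
static char trace[16];           /* records the order in which code ran */
static int  trace_len;

static void record(char c)
{
    if (trace_len < (int)sizeof trace - 1)
        trace[trace_len++] = c;
}

static void task_a(void)
{
    record('A');
    swapcontext(&ctx_a, &ctx_b);   /* explicit yield to task B */
    record('a');
}                                  /* returning resumes uc_link (B) */

static void task_b(void)
{
    record('B');
    swapcontext(&ctx_b, &ctx_a);   /* explicit yield back to task A */
    record('b');
}                                  /* returning resumes uc_link (main) */

static void make_task(ucontext_t *ctx, char *stack, size_t size,
                      ucontext_t *next, void (*fn)(void))
{
    int rc = getcontext(ctx);
    assert(rc == 0);
    ctx->uc_stack.ss_sp = stack;
    ctx->uc_stack.ss_size = size;
    ctx->uc_link = next;           /* where to go when fn returns */
    makecontext(ctx, fn, 0);
}

static const char *demo_cooperative(void)
{
    memset(trace, 0, sizeof trace);
    trace_len = 0;

    make_task(&ctx_a, stack_a, sizeof stack_a, &ctx_b, task_a);
    make_task(&ctx_b, stack_b, sizeof stack_b, &main_ctx, task_b);

    swapcontext(&main_ctx, &ctx_a);   /* start A; returns when B ends */
    return trace;
}
```

demo_cooperative() returns the string "ABab": A runs until it yields, B runs until it yields, A finishes, then B finishes and control returns to the caller. Nothing is preempted here, so each task keeps the CPU until it calls swapcontext() or returns.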
To suspend a thread, you sleep it. If you want to be able to wake it on command, sleep it using sigwait, which puts the thread to sleep until it gets a signal. You can send a specific thread a signal with pthread_kill (crazy name, but it actually just sends signals to a thread). This is a very fast way to sleep and wake up threads: 40x faster than condition variables, and very easy.
