Can I set up the priority of a workqueue?

Can I set up the priority of a workqueue? - c

Can I set up the priority of a workqueue?
I am modifying the SPI kernel module "spidev" so it can communicate faster with my hardware.
The external hardware is a CAN controller with a very small buffer, so I must read any incoming data quickly to avoid loosing data.
I have configured a GPIO interrupt to inform me of the new data, but I cannot read the SPI hardware in the interrupt handler.
My interrupt handler basically sets up a workqueue that will read the SPI data.
It works fine when there is only one active process in the kernel.
As soon as I open any other process (even the process viewer top) at the same time, I start loosing data in bunches, i.e., I might receive 1000 packects of data with no problems and then loose 15 packets in a row and so on.
I suspect that the cause of my problem is that when the other process (top, in this case) has control over the cpu the interrupt handler runs, but the work in the workqueue doesn't until the scheduler is called again.
I tried to increase the priority of my process with no success.
I wonder if there is a way to tell the kernel to execute the work in workqueue immediatelly after the interrupt handling function.
Suggestions are welcome.

As an alternative you could consider using a tasklet, which will tell the kernel execute more immediate, but be aware you are unable to sleep in tasklets
A good ibm article on deffering work in the kernel
http://www.ibm.com/developerworks/linux/library/l-tasklets/
http://www.makelinux.net/ldd3/chp-7-sect-5
A tasklet is run at the next timer tick as long as the CPU is busy running a process, but it is run immediately when the CPU is otherwise idle. The kernel provides a set of ksoftirqd kernel threads, one per CPU, just to run "soft interrupt" handlers, such as the tasklet_action function. Thus, the final three runs of the tasklet take place in the context of the ksoftirqd kernel thread associated to CPU 0. The jitasklethi implementation uses a high-priority tasklet, explained in an upcoming list of functions.

Related

How could we sleep when we are executing a syscall that execute in interrupt mode

When I am executing a system call to do write or something else, the ISR corresponded to the exception is executing in interrupt mode (on cortex-m3 the IPSR register is having a non-zero value, 0xb). And what I have learned is that when we execute a code in an interrupt mode we can not sleep, we can not use functions that might block ...
My question is that: is there any kind of a mechanism with which the ISR could still executing in interrupt mode and in the same time it could use functions that might block, or is there any kind of trick is implemented.

Caveat: This is more of a comment than an answer but is too big to fit in a comment or series of comments.
TL;DR: Needing to sleep or execute a blocking operation from an ISR is a fundamental misdesign. This seems like an XY problem: https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
Doing sleep [even in application code] is generally a code smell. Why do you feel you need to sleep [vs. some event/completion driven mechanism]?
Please edit your question and add clarification [i.e. don't just add comments]
When I am executing a system call to do write or something else
What is your application doing? A write to what device? What "something else"?
What is the architecture/board, kernel/distro? (e.g. Raspberry Pi running Raspian? nvidia Jetson? Beaglebone? Xilinx FPGA with petalinux?)
What is the H/W device and where did the device driver come from? Did you write the device driver yourself or is it a standard one that comes with the kernel/distro? If you wrote it, please post it in your question.
Is the device configured properly? (e.g.) Are the DTB entries correct?
Is the device a block device, such as a disk controller? Or, is it a character device, such as a UART? Does the device transfer data via DMA? Or, does it transfer data by reading/writing to/from an IO port?
What do you mean by "exception"? Generally, exception is an abnormal condition (e.g. segfault, bus error, etc.). Please describe the exact context/scenario for which this occurs.
Generally, an ISR does little things. (e.g.) Grab [and save] status from the device. Clear/rearm the interrupt in the interrupt controller. Start the next queued transfer request. Wake up the sleeping base level task (usually the task that executed the syscall [waiting on a completion event in kernel mode]).
More elaborate actions are generally deferred and handled in the interrupt's "bottom half" handler and/or tasklet. Or, the base level is woken up and it handles the remaining processing.
What kernel subsystems are involved? Are you using platform drivers? Are you interfacing from within the DMA device driver framework? Are message buses involved (e.g. I2C, SPI, etc.)?
Interrupt and device handling in the linux kernel is somewhat different than what one might do in a "bare metal" system or RTOS (e.g. FreeRTOS). So, if you're coming from those environments, you'll need to think about restructuring your driver code [and/or application code].
What are your requirements for device throughput and latency?
You may wish to consult a good book on linux device driver design. And, you may wish to consult the kernel's Documentation subdirectory.
If you're able to provide more information, I may be able to help you further.
UPDATE:
A system call is not really in the same class as a hardware interrupt as far as the kernel is concerned, even if the CPU hardware uses the same sort of exception vector mechanisms for handling both hardware and software interrupts. The kernel treats the system call as a transition from user mode to kernel mode. – Ian Abbott
This is a succinct/great explanation. The "mode" or "context" has little to do with how we got/get there from a H/W mechanism.
The CPU doesn't really "understand" interrupt mode [as defined by the kernel]. It understands "supervisor" vs "user" privilege level [sometimes called "mode"].
When executing at user privilege level, an interrupt/exception will notice the transition from "user" level to "supvervisor" level. It may have a special register that specifies the address of the [initial] supervisor stack pointer. Atomically, it swaps in the value, pushing the user SP onto the new kernel stack.
If the interrupt is interrupting a CPU that is already at supervisor level, the existing [supervisor] SP will be used unchanged.
Note that x86 has privilege "ring" levels. User mode is ring 3 and the highest [most privileged] level is ring 0. For arm, some arches can have a "hypervisor" privilege level [which is higher privilege than "supervisor" privilege].
The setup of the mode/context is handled in arch/arm/kernel/entry-*.S code.
An svc is a synchronous interrupt [generated by a special CPU instruction]. The resulting context is the context of the currently executing thread. It is analogous to "call function in kernel mode". The resulting context is "kernel thread mode". At that point, it's not terribly useful to think of it as an "interrupt" anymore.
In fact, on some arches, the syscall instruction/mechanism doesn't use the interrupt vector table. It may have a fixed address or use a "call gate" mechanism (e.g. x86).
Each thread has its own stack which is different than the initial/interrupt stack.
Thus, once the entry code has established the context/mode, it is not executing in "interrupt mode". So, the full range of kernel functions is available to it.
An interrupt from a H/W device is asynchronous [may occur at any time the CPU's internal interrupt enable flag is set]. It may interrupt a userspace application [executing in application mode] OR kernel thread mode OR an existing ISR executing in interrupt mode [from another interrupt]. The resulting ISR is executing in "interrupt mode" or "ISR mode".
Because the ISR can interrupt a kernel thread, it may not do certain things. For example, if the CPU were in [kernel] thread mode, and it was in the middle of a kmalloc call [GFP_KERNEL], the ISR would see partial state and any action that tried to adjust the kernel's heap would result in corruption of the heap.
This is a design choice by linux for speed.
Kernel ISRs may be classified as "fast interrupts". The ISR executes with the CPU interrupt enable [IE] flag cleared. No normal H/W interrupt may interrupt the ISR.
So, if another H/W device asserts its interrupt request line [in the external interrupt controller], that request will be "pending". That is, the request line has been asserted but the CPU has not acknowledged it [and the CPU has not jumped via the interrupt table].
The request will remain pending until the CPU allows further interrupts by asserting IE. Or, the CPU may clear the pending interrupt without taking action by clearing the pending interrupt in the interrupt controller.
For a "slow" interrupt ISR, the interrupt entry code will clear the interrupt in the external interrupt controller. It will then rearm interrupts by setting IE and call the ISR. This ISR can be interrupted by other [higher priority] interrupts. The result is a "stacked" interrupt.
I have been searching all over the places, I come to the conclusion is that the interrupts have a higher priority than some of exceptions in the Linux kernel.
An exception [synchronous interrupt] can be interrupted if the IE flag is enabled. An svc is treated differently but after the entry code is executed, the IE flag is set, so the actual syscall code [executing in kernel thread mode] can be interrupted by a H/W interrupt.
Or, in limited circumstances, the kernel code can generate an exception (e.g. a page fault caused by a kernel action [which is usually deemed fatal]).
but I am still looking on how exactly the context switching happen when executing an exception and letting the processor to execute in a thread mode while the SVCall exception is pending (was preempted and have not returned yet)... I think when I understand that, it would be more clear to me.
I think you have to be very careful with the terminology. In particular, when combining terms from disparate sources. Although user mode, kernel thread mode, or interrupt mode can be considered a context [in the dictionary sense of the word], context switching usually means that the current thread is suspended, the scheduler selects a new thread to run and resumes it. That is separate from the user-to-kernel transition.
And if there is any recommended resources about that for ARM-Cortex-M3/4, it would be nice
Here is something: https://interrupt.memfault.com/blog/arm-cortex-m-exceptions-and-nvic But, be very careful in applying the terminology therein. What it considers "pending" only exists in the kernel during the entry code. What is more relevant is what the kernel does to set up mode/context and the terms are not equivalent.
So, from the kernel's standpoint, it's probably better to not consider an svc as "pending".

what's wrong in using interrupt handlers as event listeners

My system is simple enough that it runs without an OS, I simply use interrupt handlers like I would use event listener in a desktop program. In everything I read online, people try to spend as little time as they can in interrupt handlers, and give the control back to the tasks. But I don't have an OS or real task system, and I can't really find design information on OS-less targets.
I have basically one interrupt handler that reads a chunk of data from the USB and write the data to memory, and one interrupt handler that reads the data, sends the data on GPIO and schedule itself on an hardware timer again.
What's wrong with using the interrupts the way I do, and using the NVIC (I use a cortex-M3) to manage the work hierarchy ?

First of all, in the context of this question, let's refer to the OS as a scheduler.
Now, unlike threads, interrupt service routines are "above" the scheduling scheme.
In other words, the scheduler has no "control" over them.
An ISR enters execution as a result of a HW interrupt, which sets the PC to a different address in the code-section (more precisely, to the interrupt-vector, where you "do a few things" before calling the ISR).
Hence, essentially, the priority of any ISR is higher than the priority of the thread with the highest priority.
So one obvious reason to spend as little time as possible in an ISR, is the "side effect" that ISRs have on the scheduling scheme that you design for your system.
Since your system is purely interrupt-driven (i.e., no scheduler and no threads), this is not an issue.
However, if nested ISRs are not allowed, then interrupts must be disabled from the moment an interrupt occurs and until the corresponding ISR has completed. In that case, if any interrupt occurs while an ISR is in execution, then your program will effectively ignore it.
So the longer you spend inside an ISR, the higher the chances are that you'll "miss out" on an interrupt.

In many desktop programs, events are send to queue and there is some "event loop" that handle this queue. This event loop handles event by event so it is not possible to interrupt one event by other one. It also is good practise in event driven programming to have all event handlers as short as possible because they are not interruptable.
In bare metal programming, interrupts are similar to events but they are not send to queue.
execution of interrupt handlers is not sequential, they can be interrupted by interrupt with higher priority (numerically lower number in Cortex-M3)
there is no queue of same interrupts - e.g. you can't detect multiple GPIO interrupts while you are in that interrupt - this is the reason you should have all routines as short as possible.
It is possible to implement queues by yourself, feed these queues by interrupts and consume these queues in your super loop (consume while disabling all interrupts). By this approach, you can get sequential processing of interrupts. If you keep your handlers short, this is mostly not needed and you can do the work in handlers directly.
It is also good practise in OS based systems that they are using queues, semaphores and "interrupt handler tasks" to handle interrupts.

With bare metal it is perfectly fine to design for application bound or interrupt/event bound so long as you do your analysis. So if you know what events/interrupts are coming at what rate and you can insure that you will handle all of them in the desired/designed amount of time, you can certainly take your time in the event/interrupt handler rather than be quick and send a flag to the foreground task.
The common approach of course is to get in and out fast, saving just enough info to handle the thing in the foreground task. The foreground task has to spin its wheels of course looking for event flags, prioritizing, etc.
You could of course make it more complicated and when the interrupt/event comes, save state, and return to the forground handler in the forground mode rather than interrupt mode.
Now that is all general but specific to the cortex-m3 I dont think there are really modes like big brother ARMs. So long as you take a real-time approach and make sure your handlers are deterministic, and you do your system engineering and insure that no situation happens where the events/interrupts stack up such that the response is not deterministic, not too late or too long or loses stuff it is okay

What you have to ask yourself is whether all events can be services in time in all circumstances:
For example;
If your interrupt system were run-to-completion, will the servicing of one interrupt cause unacceptable delay in the servicing of another?
On the other hand, if the interrupt system is priority-based and preemptive, will the servicing of a high priority interrupt unacceptably delay a lower one?
In the latter case, you could use Rate Monotonic Analysis to assign priorities to assure the greatest responsiveness (the shortest execution-time handlers get the highest priority). In the first case your system may lack a degree of determinism, and performance will be variable under both event load, and code changes.
One approach is to divide the handler into real-time critical and non-critical sections, the time-critical code can be done in the handler, then a flag set to prompt the non-critical action to be performed in the "background" non-interrupt context in a "big-loop" system that simply polls event flags or shared data for work to complete. Often all that might be necessary in the interrupt handler is to copy some data to timestamp some event - making data available for background processing without holding up processing of new events.
For more sophisticated scheduling, there are a number of simple, low-cost or free RTOS schedulers that provide multi-tasking, synchronisation, IPC and timing services with very small footprints and can run on very low-end hardware. If you have a hardware timer and 10K of code space (sometimes less), you can deploy an RTOS.

I am taking your described problem first
As I interpret it your goal is to create a device which by receiving commands from the USB, outputs some GPIO, such as LEDs, relays etc. For this simple task, your approach seems to be fine (if the USB layer can work with it adequately).
A prioritizing problem exists though, in this case it may be that if you overload the USB side (with data from the other end of the cable), and the interrupt handling it is higher priority than that triggered by the timer, handling the GPIO, the GPIO side may miss ticks (like others explained, interrupts can't queue).
In your case this is about what could be considered.
Some general guidance
For the "spend as little time in the interrupt handler as possible" the rationale is just what others told: an OS may realize a queue, etc., however hardware interrupts offer no such concepts. If the event causing the interrupt happens, the CPU enters your handler. Then until you handle it's source (such as reading a receive holding register in the case of a UART), you lose any further occurrences of that event. After this point, until exiting the handler, you may receive whether the event happened, but not how many times (if the event happened again while the CPU was still processing the handler, the associated interrupt line goes active again, so after you return from the handler, the CPU immediately re-enters it provided nothing higher priority is waiting).
Above I described the general concept observable on 8 bit processors and the AVR 32bit (I have experience with these).
When designing such low-level systems (no OS, one "background" task, and some interrupts) it is fundamental to understand what goes on on each priority level (if you utilize such). In general, you would make the most real-time critical tasks the highest priority, taking the most care of serving those fast, while being more relaxed with the lower priority levels.
From an other aspect usually at design phase it can be planned how the system should react to missed interrupts, since where there are interrupts, missing one will eventually happen anyway. Critical data going across communication lines should have adequate checksums, an especially critical timer should be sourced from a count register, not from event counting, and the likes.
An other nasty part of interrupts is their asynchronous nature. If you fail to design the related locks properly, they will eventually corrupt something giving nightmares to that poor soul who will have to debug it. The "spend as little time in the interrupt handler as possible" statement also encourages you to keep the interrupt code reasonably short which means less code to consider for this problem as well. If you also worked with multitasking assisted by an RTOS you should know this part (there are some differences though: a higher priority interrupt handler's code does not need protection against a lower priority handler's).
If you can properly design your architecture regarding the necessary asynchronous tasks, getting around without an OS (from the no multitasking aspect) may even prove to be a nicer solution. It needs way more thinking to design it properly, however later there are much less locking related problems. I got through some mid-sized safety critical projects designed over a single background "task" with very few and little interrupts, and the experience and maintenance demands regarding those (especially the tracing of bugs) were quite satisfactory compared to some others in the company built over multitasking concepts.

How to finish lower priority interrupt?

I have one high priority interrupt which sends USB data, and one lower priority task which already fetches the next data to be send.
Sometimes the high priority interrupt requires some data that is still being fetched, and in that case I need to instruct the MCU to finish the lower priority task before continuing execution in the high priority interrupt.
I can't figure out how to make this work. Is it possible to raise the priority of the background task higher using NVIC_SetPriority, and immediately call NVIC_SetPendingIRQ from the USB task, and after that lower it again? Or what would be the simplest way to make this work?

How much time do you have to answer the data request, and how long does it take to prefetch for the next one? If the prefetch time is short, I would reverse your priority on your interrupts - this keeps the buffer filled for the data request interrupt.
Otherwise there isn't a clean way to do what you want bare-metal - this is what Operating Systems are for. If you are in an OS, the data request interrupt routine can request a signal from the prefetch interrupt routine and return from interrupt and wait for the data request interrupt routine to send a signal that it has completed a block.
Bare-metal, you could try having the prefetch routine call the data request interrupt after each buffer is ready. The DRIR then does a series of checks when it wakes up
Was I woken by a Data Request?
yes: Do I have data to send?
yes: Send data, clear interrupt request, return from interrupt
no: Increment "blocks needed" counter by 1, clear interrupt request, return from interrupt
no: must have been woken by a Prefetch complete, is "blocks needed" zero?
yes: buffer has data, but not needed yet, return
no: Send 1 block of data, decrement "blocks needed" until it hits zero or buffer is empty, return
There isn't any guarantee you will get the data out in time, but at least this way there's a chance for the lower priority interrupt to finish.
BTW, I don't think the NVIC can force a currently-executing interrupt to stop for a different higher-priority one. The priorities really matter when the interrupts occur at the same time (or occur when interrupts are already masked, ie when servicing another interrupt).
Many operating systems provide a two-step interrupt process where the direct interrupt routine is minimal as possible to clear the interrupt, and it notifies a separate interrupt thread to handle the longer, detailed parts of the request. See http://en.wikipedia.org/wiki/Interrupt_handler
Since the direct interrupt routine is small&fast, it allows for priorities to be assigned to the respective interrupt threads to control execution order.

I don't know if you have a good reason to run RX and TX in two different contexts, but usually this could be very simply implemented in only one context. However, if you really want to follow your original design, you need to introduce some kind of a mechanism to synchronize operations of these two contexts. Normally, if you're running the RTOS, you would use event flag, binary semaphore or some other similar service which is provided by the given RTOS.
You would not want (and could not) send any data before you receive some, right? That's why you would need to wait for a notification (in TX context) which is to be set from RX context after the data was received.
Having done this kind of tunneling without employing this technique would go out of sync very soon.

Is spinlock required for every interrupt handler?

In Chapter 5 of ULK the author states as follows:
"...each interrupt handler is serialized with respect to itself-that is, it cannot execute more than one concurrently. Thus, accessing the data struct does not require synchronization primitives"
I don't quite understand why interrupt handlers is "serialized" on modern CPUs with multiple cores. I'm thinking it could be possible that a same ISR can be run on different cores simultaneously, right? If that's the case, if you don't use spinlock to protect your data it can come to a race condition.
So my question is, on a modern system with multi-cpus, for every interrupt handler you are going to write that will read & write some data, is spinlock always needed?

While executing interrupt handlers, the kernel explicitly disables that particular interrupt line at the interrupt controller, so one interrupt handler cannot be executed more than once concurrently. (The handlers of other interrupts can run concurrently, though.)

Clarification: as per CL. remark below - the kernel makes sure not to fire the interrupt handler for the same interrupt but if you have multiple registrations of the same interrupt handler for multiple interrupts than the below answer is, I believe, correct.
You are right that the same interrupt handler can run concurrently on multiple cores and that shared data needs to be protected. However, a spinlock is not the only and certainly not always the recommended way to achieve this.
A multitude of other synchronization methods, from per-CPU data, accessing shared data only using atomic operations and even Read-Copy-Update variants may be used to protect the shared data.

No spinlock is not always needed in interrupt handler.
Please note one thing first -
When an interrupt of a particular device occur on interrupt controller, that interrupt is disabled at interrupt controller and hence on all the cores for that particular device. So, the interrupt of same device cannot come on all the CPU simultaneously.
So in normal case there wont be any spin lock required as the code would not be re-entrant.
Though there are 2 cases below in which spinlock is needed in interrupt handler.
Please note, when an interrupt comes from a device and IRQ line, that cores disables all other interrupt on that core and also for that device interrupt on other core also. Interrupt from other devices can comes on other core.
So there can be a case in which, same interrupt handler is registered for different devices.
for eg:-
request_irq(A,func,..);
reqest_irq(B,func,..);
for device A interrupt handler func is called.
for device B same interrupt handler func is called.
So, a spinlock should be used in this case to prevent raise condition.
When same resource is being used in interrupt handler and also some other code which runs in process context.
For eg:- there is resource A
So there can be a case in which one cores runs in interrupt mode, the interrupt handler and is modifying the resource A and other core runs in process context and is also modifying the same resource in some other place.
So to present raise condition for that resource we should use spin lock.

section 4.6 of Understanding the Linux Kernel, 3rd Edition by Marco Cesati, Daniel P. Bovet told you the answer.
Actual interrupt handler is process by handle_IRQ_event. irq_desc[irq].lock prevent concurrently access to handle_IRQ_event by any other CPU.

If the critical data is shared b/w the interrupt handler and your process (may be a kernel thread) then you need to protect your data and hence spinlock is required.A common Kernel api for spinlock is : spin_lock().
There are also variants of these api e.g. spin_lock_irqsave() which can help avoiding the deadlock problems which one can face while acquiring/holding the spin locks.Please go through the below link to find details of the subject:
http://www.linuxjournal.com/article/5833

What's the algorithm behind sleep()?

Now there's something I always wondered: how is sleep() implemented ?
If it is all about using an API from the OS, then how is the API made ?
Does it all boil down to using special machine-code on the CPU ? Does that CPU need a special co-processor or other gizmo without which you can't have sleep() ?
The best known incarnation of sleep() is in C (to be more accurate, in the libraries that come with C compilers, such as GNU's libc), although almost every language today has its equivalent, but the implementation of sleep in some languages (think Bash) is not what we're looking at in this question...
EDIT: After reading some of the answers, I see that the process is placed in a wait queue. From there, I can guess two alternatives, either
a timer is set so that the kernel wakes the process at the due time, or
whenever the kernel is allowed a time slice, it polls the clock to check whether it's time to wake a process.
The answers only mention alternative 1. Therefore, I ask: how does this timer behave ? If it's a simple interrupt to make the kernel wake the process, how can the kernel ask the timer to "wake me up in 140 milliseconds so I can put the process in running state" ?

The "update" to question shows some misunderstanding of how modern OSs work.
The kernel is not "allowed" a time slice. The kernel is the thing that gives out time slices to user processes. The "timer" is not set to wake the sleeping process up - it is set to stop the currently running process.
In essence, the kernel attempts to fairly distribute the CPU time by stopping processes that are on CPU too long. For a simplified picture, let's say that no process is allowed to use the CPU more than 2 milliseconds. So, the kernel would set timer to 2 milliseconds, and let the process run. When the timer fires an interrupt, the kernel gets control. It saves the running process' current state (registers, instruction pointer and so on), and the control is not returned to it. Instead, another process is picked from the list of processes waiting to be given CPU, and the process that was interrupted goes to the back of the queue.
The sleeping process is simply not in the queue of things waiting for CPU. Instead, it's stored in the sleeping queue. Whenever kernel gets timer interrupt, the sleep queue is checked, and the processes whose time have come get transferred to "waiting for CPU" queue.
This is, of course, a gross simplification. It takes very sophisticated algorithms to ensure security, fairness, balance, prioritize, prevent starvation, do it all fast and with minimum amount of memory used for kernel data.

There's a kernel data structure called the sleep queue. It's a priority queue. Whenever a process is added to the sleep queue, the expiration time of the most-soon-to-be-awakened process is calculated, and a timer is set. At that time, the expired job is taken off the queue and the process resumes execution.
(amusing trivia: in older unix implementations, there was a queue for processes for which fork() had been called, but for which the child process had not been created. It was of course called the fork queue.)
HTH!

Perhaps the major job of an operating system is to hide the complexity of a real piece of hardware from the application writer. Hence, any description of how the OS works runs the risk of getting really complicated, really fast. Accordingly, I am not going to deal with all the "what ifs" and yeah buts" that a real operating system needs to deal with. I'm just going to describe, at a high conceptual level, what a process is, what the scheduler does, how the timer queue works. Hopefully this is helpful.
What's a process:
Think of a process--let's just talk about processes, and get to threads later--as "the thing the operating system schedules". A process has an ID--think an integer--and you can think of that integer as an index into a table containing all the context of that process.
Context is the hardware information--registers, memory management unit contents, other hardware state--that, when loaded into the machine, will allow the process to "go". There are other components of context--lists of open files, state of signal handlers, and, most importantly here, things the process is waiting for.
Processes spend a lot of time sleeping (a.k.a. waiting)
A process spends much of its time waiting. For example, a process that reads or writes to disk will spend a lot of time waiting for the data to arrive or be acknowledged to be out on disk. OS folks use the terms "waiting" and "sleeping" (and "blocked") somewhat interchangeably--all meaning that the process is awaiting something to happen before it can continue on its merry way. It is just confusing that the OS API sleep() happens to use underlying OS mechanisms for sleeping processes.
Processes can be waiting for other things: network packets to arrive, window selection events, or a timer to expire, for example.
Processes and Scheduling
Processes that are waiting are said to be non-runnable. They don't go onto the run queue of the operating system. But when the event occurs which the process is waiting for, it causes the operating system to move the process from the non-runnable to the runnable state. At the same time, the operating system puts the process on the run queue, which is really not a queue--it's more of a pile of all the processes which, should the operating system decide to do so, could run.
Scheduling:
the operating system decides, at regular intervals, which processes should run. The algorithm by which the operating system decides to do so is called, somewhat unsurprisingly, the scheduling algorithm. Scheduling algorithms range from dead-simple ("everybody gets to run for 10 ms, and then the next guy on the queue gets to run") to far more complicated (taking into account process priority, frequency of execution, run-time deadlines, inter-process dependencies, chained locks and all sorts of other complicated subject matter).
The Timer Queue
A computer has a timer inside it. There are many ways this can be implemented, but the classic manner is called a periodic timer. A periodic timer ticks at a regular interval--in most operating systems today, I believe this rate is 100 times per second--100 Hz--every 10 milliseconds. I'll use that value in what follows as a concrete rate, but know that most operating systems worth their salt can be configured with different ticks--and many don't use this mechanism and can provide much better timer precision. But I digress.
Each tick results in an interrupt to the operating system.
When the OS handles this timer interrupt, it increments its idea of system time by another 10 ms. Then, it looks at the timer queue and decides what events on that queue need to be dealt with.
The timer queue really is a queue of "things which need to be dealt with", which we will call events. This queue is ordered by time of expiration, soonest events first.
An "event" can be something like, "wake up process X", or "go kick disk I/O over there, because it may have gotten stuck", or "send out a keepalive packet on that fibrechannel link over there". Whatever the operating system needs to have done.
When you have a queue ordered in this way, it's easy to manage the dequeuing. The OS simply looks at the head of the queue, and decrements the "time to expiration" of the event by 10 ms every tick. When the expiration time goes to zero, the OS dequeues that event, and does whatever is called for.
In the case of a sleeping process, it simply makes the process runnable again.
Simple, huh?

there's at least two different levels to answer this question. (and a lot of other things that get confused with it, i won't touch them)
an application level, this is what the C library does. It's a simple OS call, it simply tells the OS not to give CPU time to this process until the time has passed. The OS has a queue of suspended applications, and some info about what are they waiting for (usually either time, or some data to appear somewhere).
kernel level. when the OS doesn't have anything to do right now, it executes a 'hlt' instruction. this instruction doesn't do anything, but it never finishes by itself. Of course, a hardware interrupt is serviced normally. Put simply, the main loop of an OS looks like this (from very very far away):
allow_interrupts ();
while (true) {
hlt;
check_todo_queues ();
}
the interrupt handlers simpy add things to the todo queues. The real time clock is programmed to generate interrupts either periodically (at a fixed rate), or to some fixed time in the future when the next process wants to be awaken.

A multitasking operating system has a component called a scheduler, this component is responsible for giving CPU time to threads, calling sleep tells the OS not to give CPU time to this thread for some time.
see http://en.wikipedia.org/wiki/Process_states for complete details.

I don't know anything about Linux, but I can tell you what happens on Windows.
Sleep() causes the process' time-slice to end immediately to return control to the OS. The OS then sets up a timer kernel object that gets signaled after the time elapses. The OS will then not give that process any more time until the kernel object gets signaled. Even then, if other processes have higher or equal priority, it may still wait a little while before letting the process continue.
Special CPU machine code is used by the OS to do process switching. Those functions cannot be accessed by user-mode code, so they are accessed strictly by API calls into the OS.

Essentially, yes, there is a "special gizmo" - and it's important for a lot more than just sleep().
Classically, on x86 this was an Intel 8253 or 8254 "Programmable Interval Timer". In the early PCs, this was a seperate chip on the motherboard that could be programmed by the CPU to assert an interrupt (via the "Programmable Interrupt Controller", another discrete chip) after a preset time interval. The functionality still exists, although it is now a tiny part of a much larger chunk of motherboard circuitry.
The OS today still programs the PIT to wake it up regularly (in recent versions of Linux, once every millisecond by default), and this is how the Kernel is able to implement pre-emptive multitasking.

glibc 2.21 Linux
Forwards to the nanosleep system call.
glibc is the default implementation for the C stdlib on most Linux desktop distros.
How to find it: the first reflex is:
git ls-files | grep sleep
This contains:
sysdeps/unix/sysv/linux/sleep.c
and we know that:
sysdeps/unix/sysv/linux/
contains the Linux specifics.
On the top of that file we see:
/* We are going to use the `nanosleep' syscall of the kernel. But the
kernel does not implement the stupid SysV SIGCHLD vs. SIG_IGN
behaviour for this syscall. Therefore we have to emulate it here. */
unsigned int
__sleep (unsigned int seconds)
So if you trust comments, we are done basically.
At the bottom:
weak_alias (__sleep, sleep)
which basically says __sleep == sleep. The function uses nanosleep through:
result = __nanosleep (&ts, &ts);
After greppingg:
git grep nanosleep | grep -v abilist
we get a small list of interesting occurrences, and I think __nanosleep is defined in:
sysdeps/unix/sysv/linux/syscalls.list
on the line:
nanosleep - nanosleep Ci:pp __nanosleep nanosleep
which is some super DRY magic format parsed by:
sysdeps/unix/make-syscalls.sh
Then from the build directory:
grep -r __nanosleep
Leads us to: /sysd-syscalls which is what make-syscalls.sh generates and contains:
#### CALL=nanosleep NUMBER=35 ARGS=i:pp SOURCE=-
ifeq (,$(filter nanosleep,$(unix-syscalls)))
unix-syscalls += nanosleep
$(foreach p,$(sysd-rules-targets),$(foreach o,$(object-suffixes),$(objpfx)$(patsubst %,$p,nanosleep)$o)): \
$(..)sysdeps/unix/make-syscalls.sh
$(make-target-directory)
(echo '#define SYSCALL_NAME nanosleep'; \
echo '#define SYSCALL_NARGS 2'; \
echo '#define SYSCALL_SYMBOL __nanosleep'; \
echo '#define SYSCALL_CANCELLABLE 1'; \
echo '#include <syscall-template.S>'; \
echo 'weak_alias (__nanosleep, nanosleep)'; \
echo 'libc_hidden_weak (nanosleep)'; \
) | $(compile-syscall) $(foreach p,$(patsubst %nanosleep,%,$(basename $(#F))),$($(p)CPPFLAGS))
endif
It looks like part of a Makefile. git grep sysd-syscalls shows that it is included at:
sysdeps/unix/Makefile:23:-include $(common-objpfx)sysd-syscalls
compile-syscall looks like the key part, so we find:
# This is the end of the pipeline for compiling the syscall stubs.
# The stdin is assembler with cpp using sysdep.h macros.
compile-syscall = $(COMPILE.S) -o $# -x assembler-with-cpp - \
$(compile-mkdep-flags)
Note that -x assembler-with-cpp is a gcc option.
This #defines parameters like:
#define SYSCALL_NAME nanosleep
and then use them at:
#include <syscall-template.S>
OK, this is as far as I will go on the macro expansion game for now.
I think then this generates the posix/nanosleep.o file which must be linked together with everything.
Linux 4.2 x86_64 nanosleep syscall
Uses the scheduler: it's not a busy sleep.
Search ctags:
sys_nanosleep
Leads us to kernel/time/hrtimer.c:
SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
hrtimer stands for High Resolution Timer. From there the main line looks like:
hrtimer_nanosleep
do_nanosleep
set_current_state(TASK_INTERRUPTIBLE); which is interruptible sleep
freezable_schedule(); which calls schedule() and allows other processes to run
hrtimer_start_expires
hrtimer_start_range_ns
TODO: reach the arch/x86 timing level
TODO: are the above steps done directly in the syscal call interrupt handler, or in a regular kernel thread?
A few articles about it:
https://geeki.wordpress.com/2010/10/30/ways-of-sleeping-in-linux-kernel/
http://www.linuxjournal.com/article/8144

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight