Now there's something I always wondered: how is sleep() implemented?
If it is all about using an API from the OS, then how is that API made?
Does it all boil down to using special machine code on the CPU? Does the CPU need a special co-processor or other gizmo without which you can't have sleep()?
The best-known incarnation of sleep() is in C (more accurately, in the libraries that come with C compilers, such as GNU's libc), although almost every language today has its equivalent. The implementation of sleep in some languages (think Bash) is not what we're looking at in this question...
EDIT: After reading some of the answers, I see that the process is placed in a wait queue. From there, I can guess two alternatives, either
a timer is set so that the kernel wakes the process at the due time, or
whenever the kernel is allowed a time slice, it polls the clock to check whether it's time to wake a process.
The answers only mention alternative 1. Therefore, I ask: how does this timer behave? If it's a simple interrupt to make the kernel wake the process, how can the kernel ask the timer to "wake me up in 140 milliseconds so I can put the process in the running state"?
The "update" to question shows some misunderstanding of how modern OSs work.
The kernel is not "allowed" a time slice. The kernel is the thing that gives out time slices to user processes. The "timer" is not set to wake the sleeping process up - it is set to stop the currently running process.
In essence, the kernel attempts to fairly distribute CPU time by stopping processes that have been on the CPU too long. For a simplified picture, let's say that no process is allowed to use the CPU for more than 2 milliseconds. So the kernel sets the timer to 2 milliseconds and lets the process run. When the timer fires an interrupt, the kernel gets control. It saves the running process's current state (registers, instruction pointer and so on), and control is not returned to it. Instead, another process is picked from the list of processes waiting to be given the CPU, and the process that was interrupted goes to the back of the queue.
The sleeping process is simply not in the queue of things waiting for the CPU. Instead, it's stored in the sleep queue. Whenever the kernel gets a timer interrupt, the sleep queue is checked, and the processes whose time has come are moved to the "waiting for CPU" queue.
This is, of course, a gross simplification. It takes very sophisticated algorithms to ensure security and fairness, balance load, handle priorities, prevent starvation, and do it all fast with a minimal amount of memory used for kernel data.
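To make that concrete, here is a minimal sketch of the preemption path described above; every name in it is invented for illustration, not taken from any real kernel:

/* Hypothetical tick handler: stop the running process, wake any sleepers
 * whose time has come, and hand the CPU to the next process in line. */
void on_tick_interrupt(void)
{
    save_context(current);           /* registers, instruction pointer, ... */
    enqueue(&run_queue, current);    /* interrupted process goes to the back */

    wake_expired_sleepers();         /* move due sleepers onto the run queue */

    current = dequeue(&run_queue);   /* pick the next runnable process */
    restore_context(current);        /* returning from the interrupt resumes it */
}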
There's a kernel data structure called the sleep queue. It's a priority queue. Whenever a process is added to the sleep queue, the expiration time of the most-soon-to-be-awakened process is calculated, and a timer is set. At that time, the expired job is taken off the queue and the process resumes execution.
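A hedged sketch of that mechanism, with invented names (real kernels differ in the details): insert the process into a queue ordered by wake-up time, and keep a one-shot timer armed for the earliest deadline.

/* Put a process to sleep until 'wakeup_time', using a priority queue
 * ordered by deadline and a one-shot hardware timer. Hypothetical names. */
void sleep_until(struct process *p, uint64_t wakeup_time)
{
    p->wakeup_time = wakeup_time;
    pqueue_insert(&sleep_queue, p, wakeup_time);

    /* The timer only ever needs to fire for the earliest sleeper. */
    arm_oneshot_timer(pqueue_peek(&sleep_queue)->wakeup_time);
}

/* Timer interrupt: wake everything whose deadline has passed, then re-arm. */
void on_sleep_timer_expired(void)
{
    uint64_t now = current_time();
    struct process *p;

    while ((p = pqueue_peek(&sleep_queue)) && p->wakeup_time <= now)
        make_runnable(pqueue_pop(&sleep_queue));

    if ((p = pqueue_peek(&sleep_queue)))
        arm_oneshot_timer(p->wakeup_time);
}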
(amusing trivia: in older unix implementations, there was a queue for processes for which fork() had been called, but for which the child process had not been created. It was of course called the fork queue.)
HTH!
Perhaps the major job of an operating system is to hide the complexity of a real piece of hardware from the application writer. Hence, any description of how the OS works runs the risk of getting really complicated, really fast. Accordingly, I am not going to deal with all the "what ifs" and "yeah buts" that a real operating system needs to deal with. I'm just going to describe, at a high conceptual level, what a process is, what the scheduler does, and how the timer queue works. Hopefully this is helpful.
What's a process:
Think of a process--let's just talk about processes, and get to threads later--as "the thing the operating system schedules". A process has an ID--think an integer--and you can think of that integer as an index into a table containing all the context of that process.
Context is the hardware information--registers, memory management unit contents, other hardware state--that, when loaded into the machine, will allow the process to "go". There are other components of context--lists of open files, state of signal handlers, and, most importantly here, things the process is waiting for.
Processes spend a lot of time sleeping (a.k.a. waiting)
A process spends much of its time waiting. For example, a process that reads or writes to disk will spend a lot of time waiting for the data to arrive or be acknowledged to be out on disk. OS folks use the terms "waiting" and "sleeping" (and "blocked") somewhat interchangeably--all meaning that the process is awaiting something to happen before it can continue on its merry way. It is just confusing that the OS API sleep() happens to use underlying OS mechanisms for sleeping processes.
Processes can be waiting for other things: network packets to arrive, window selection events, or a timer to expire, for example.
Processes and Scheduling
Processes that are waiting are said to be non-runnable. They don't go onto the run queue of the operating system. But when the event the process is waiting for occurs, the operating system moves the process from the non-runnable to the runnable state. At the same time, the operating system puts the process on the run queue, which is really not a queue--it's more of a pile of all the processes which, should the operating system decide to do so, could run.
Scheduling:
The operating system decides, at regular intervals, which processes should run. The algorithm by which the operating system decides is called, somewhat unsurprisingly, the scheduling algorithm. Scheduling algorithms range from dead-simple ("everybody gets to run for 10 ms, and then the next guy on the queue gets to run") to far more complicated (taking into account process priority, frequency of execution, run-time deadlines, inter-process dependencies, chained locks and all sorts of other complicated subject matter).
The Timer Queue
A computer has a timer inside it. There are many ways this can be implemented, but the classic manner is called a periodic timer. A periodic timer ticks at a regular interval--in most operating systems today, I believe this rate is 100 times per second--100 Hz--every 10 milliseconds. I'll use that value in what follows as a concrete rate, but know that most operating systems worth their salt can be configured with different ticks--and many don't use this mechanism and can provide much better timer precision. But I digress.
Each tick results in an interrupt to the operating system.
When the OS handles this timer interrupt, it increments its idea of system time by another 10 ms. Then, it looks at the timer queue and decides what events on that queue need to be dealt with.
The timer queue really is a queue of "things which need to be dealt with", which we will call events. This queue is ordered by time of expiration, soonest events first.
An "event" can be something like, "wake up process X", or "go kick disk I/O over there, because it may have gotten stuck", or "send out a keepalive packet on that fibrechannel link over there". Whatever the operating system needs to have done.
When you have a queue ordered in this way, it's easy to manage the dequeuing. The OS simply looks at the head of the queue, and decrements the "time to expiration" of the event by 10 ms every tick. When the expiration time goes to zero, the OS dequeues that event, and does whatever is called for.
In the case of a sleeping process, it simply makes the process runnable again.
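Here is a sketch of that per-tick processing, written as a "delta list" in which each event stores its delay relative to the event ahead of it, so only the head of the queue ever needs decrementing. All names are invented for illustration:

#define TICK_MS 10

void on_timer_tick(void)
{
    system_time_ms += TICK_MS;             /* advance the OS's idea of time */

    struct timer_event *head = timer_queue_head();
    if (head == NULL)
        return;

    head->ms_to_expiration -= TICK_MS;     /* only the head needs updating */

    /* Several events may come due on the same tick. */
    while (head != NULL && head->ms_to_expiration <= 0) {
        timer_queue_pop();
        head->handler(head);               /* e.g. make a sleeping process runnable */
        head = timer_queue_head();
    }
}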
Simple, huh?
There are at least two different levels at which to answer this question. (And a lot of other things that get confused with it; I won't touch them.)
At the application level, this is what the C library does. It's a simple OS call: it tells the OS not to give CPU time to this process until the time has passed. The OS keeps a queue of suspended processes, along with some info about what they are waiting for (usually either a time, or some data to appear somewhere).
At the kernel level: when the OS doesn't have anything to do right now, it executes a hlt instruction. This instruction does nothing except stop the CPU until the next hardware interrupt arrives, and that interrupt is serviced normally. Put simply, the main loop of an OS looks like this (from very, very far away):
allow_interrupts();
while (true) {
    hlt;                   /* stop the CPU until the next hardware interrupt */
    check_todo_queues();   /* run whatever the interrupt handlers queued up */
}
The interrupt handlers simply add things to the to-do queues. The real-time clock is programmed to generate interrupts either periodically (at a fixed rate), or at the fixed time in the future when the next process wants to be woken.
A multitasking operating system has a component called the scheduler, which is responsible for giving CPU time to threads. Calling sleep tells the OS not to give CPU time to this thread for some amount of time.
See http://en.wikipedia.org/wiki/Process_states for complete details.
I don't know anything about Linux, but I can tell you what happens on Windows.
Sleep() causes the process's current time slice to end immediately, returning control to the OS. The OS then sets up a timer kernel object that gets signaled after the time elapses. The OS will then not give that process any more time until the kernel object gets signaled. Even then, if other processes have higher or equal priority, it may still wait a little while before letting the process continue.
Special CPU machine code is used by the OS to do process switching. Those functions cannot be accessed by user-mode code, so they are accessed strictly by API calls into the OS.
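This is not literally how Sleep() is implemented internally, but the same timer-kernel-object mechanism is exposed through the documented waitable-timer API, so a 140 ms wait can be sketched like this:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Manual-reset waitable timer: a kernel object that gets signaled
       when the requested interval elapses. */
    HANDLE timer = CreateWaitableTimer(NULL, TRUE, NULL);
    if (timer == NULL)
        return 1;

    LARGE_INTEGER due;
    due.QuadPart = -1400000LL;             /* negative = relative; 100-ns units = 140 ms */

    SetWaitableTimer(timer, &due, 0, NULL, NULL, FALSE);
    WaitForSingleObject(timer, INFINITE);  /* we are not runnable until it signals */

    puts("woke up");
    CloseHandle(timer);
    return 0;
}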
Essentially, yes, there is a "special gizmo" - and it's important for a lot more than just sleep().
Classically, on x86 this was an Intel 8253 or 8254 "Programmable Interval Timer". In the early PCs, this was a separate chip on the motherboard that could be programmed by the CPU to assert an interrupt (via the "Programmable Interrupt Controller", another discrete chip) after a preset time interval. The functionality still exists, although it is now a tiny part of a much larger chunk of motherboard circuitry.
The OS today still programs the PIT to wake it up regularly (in recent versions of Linux, once every millisecond by default), and this is how the Kernel is able to implement pre-emptive multitasking.
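For flavour, programming the classic 8253/8254 for a periodic tick looks roughly like this bare-metal sketch. It has to run in ring 0, and outb here is the usual hand-rolled port-I/O helper, not a standard library call:

#include <stdint.h>

static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
}

#define PIT_CHANNEL0  0x40        /* channel 0 data port */
#define PIT_COMMAND   0x43        /* mode/command register */
#define PIT_INPUT_HZ  1193182u    /* the PIT's fixed ~1.19318 MHz input clock */

/* Ask the PIT to raise IRQ0 'hz' times per second. */
void pit_set_periodic(unsigned int hz)
{
    uint16_t divisor = (uint16_t)(PIT_INPUT_HZ / hz);

    outb(PIT_COMMAND, 0x36);             /* channel 0, lobyte/hibyte, mode 3 */
    outb(PIT_CHANNEL0, divisor & 0xFF);  /* low byte of the divisor */
    outb(PIT_CHANNEL0, divisor >> 8);    /* high byte of the divisor */
}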
glibc 2.21 Linux
Forwards to the nanosleep system call.
glibc is the default implementation for the C stdlib on most Linux desktop distros.
How to find it: the first reflex is:
git ls-files | grep sleep
This contains:
sysdeps/unix/sysv/linux/sleep.c
and we know that:
sysdeps/unix/sysv/linux/
contains the Linux specifics.
On the top of that file we see:
/* We are going to use the `nanosleep' syscall of the kernel. But the
kernel does not implement the stupid SysV SIGCHLD vs. SIG_IGN
behaviour for this syscall. Therefore we have to emulate it here. */
unsigned int
__sleep (unsigned int seconds)
So if you trust comments, we are basically done.
At the bottom:
weak_alias (__sleep, sleep)
which basically says __sleep == sleep. The function uses nanosleep through:
result = __nanosleep (&ts, &ts);
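In other words, stripped of the SIGCHLD emulation mentioned in the comment, the wrapper is roughly equivalent to this simplified sketch (not glibc's actual code):

#include <time.h>

unsigned int my_sleep(unsigned int seconds)
{
    struct timespec ts = { .tv_sec = seconds, .tv_nsec = 0 };

    if (nanosleep(&ts, &ts) == -1)
        return (unsigned int)ts.tv_sec;   /* interrupted: seconds still remaining */
    return 0;                             /* the full interval elapsed */
}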
After grepping:
git grep nanosleep | grep -v abilist
we get a small list of interesting occurrences, and I think __nanosleep is defined in:
sysdeps/unix/sysv/linux/syscalls.list
on the line:
nanosleep - nanosleep Ci:pp __nanosleep nanosleep
which is some super DRY magic format parsed by:
sysdeps/unix/make-syscalls.sh
Then from the build directory:
grep -r __nanosleep
Leads us to: /sysd-syscalls which is what make-syscalls.sh generates and contains:
#### CALL=nanosleep NUMBER=35 ARGS=i:pp SOURCE=-
ifeq (,$(filter nanosleep,$(unix-syscalls)))
unix-syscalls += nanosleep
$(foreach p,$(sysd-rules-targets),$(foreach o,$(object-suffixes),$(objpfx)$(patsubst %,$p,nanosleep)$o)): \
$(..)sysdeps/unix/make-syscalls.sh
$(make-target-directory)
(echo '#define SYSCALL_NAME nanosleep'; \
echo '#define SYSCALL_NARGS 2'; \
echo '#define SYSCALL_SYMBOL __nanosleep'; \
echo '#define SYSCALL_CANCELLABLE 1'; \
echo '#include <syscall-template.S>'; \
echo 'weak_alias (__nanosleep, nanosleep)'; \
echo 'libc_hidden_weak (nanosleep)'; \
) | $(compile-syscall) $(foreach p,$(patsubst %nanosleep,%,$(basename $(@F))),$($(p)CPPFLAGS))
endif
It looks like part of a Makefile. git grep sysd-syscalls shows that it is included at:
sysdeps/unix/Makefile:23:-include $(common-objpfx)sysd-syscalls
compile-syscall looks like the key part, so we find:
# This is the end of the pipeline for compiling the syscall stubs.
# The stdin is assembler with cpp using sysdep.h macros.
compile-syscall = $(COMPILE.S) -o $@ -x assembler-with-cpp - \
$(compile-mkdep-flags)
Note that -x assembler-with-cpp is a gcc option.
This #defines parameters like:
#define SYSCALL_NAME nanosleep
and then uses them at:
#include <syscall-template.S>
OK, this is as far as I will go on the macro expansion game for now.
I think then this generates the posix/nanosleep.o file which must be linked together with everything.
Linux 4.2 x86_64 nanosleep syscall
Uses the scheduler: it's not a busy sleep.
Search ctags:
sys_nanosleep
Leads us to kernel/time/hrtimer.c:
SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
hrtimer stands for High Resolution Timer. From there the main line looks like:
hrtimer_nanosleep
do_nanosleep
set_current_state(TASK_INTERRUPTIBLE); which is interruptible sleep
freezable_schedule(); which calls schedule() and allows other processes to run
hrtimer_start_expires
hrtimer_start_range_ns
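Condensed into code, that pattern (arm an hrtimer whose callback wakes the task, mark the task interruptible, then call schedule()) looks roughly like the sketch below. It is built on the public hrtimer API but is not the actual do_nanosleep source:

#include <linux/hrtimer.h>
#include <linux/kernel.h>
#include <linux/sched.h>

struct sleep_ctx {
    struct hrtimer timer;
    struct task_struct *task;
};

static enum hrtimer_restart wakeup_fn(struct hrtimer *t)
{
    struct sleep_ctx *ctx = container_of(t, struct sleep_ctx, timer);

    wake_up_process(ctx->task);            /* make the sleeper runnable again */
    return HRTIMER_NORESTART;
}

static void sleep_sketch(ktime_t delay)
{
    struct sleep_ctx ctx = { .task = current };

    hrtimer_init(&ctx.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    ctx.timer.function = wakeup_fn;

    set_current_state(TASK_INTERRUPTIBLE); /* interruptible sleep */
    hrtimer_start(&ctx.timer, delay, HRTIMER_MODE_REL);

    schedule();                            /* give the CPU to someone else */

    hrtimer_cancel(&ctx.timer);            /* in case a signal woke us early */
    __set_current_state(TASK_RUNNING);
}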
TODO: reach the arch/x86 timing level
TODO: are the above steps done directly in the syscall interrupt handler, or in a regular kernel thread?
A few articles about it:
https://geeki.wordpress.com/2010/10/30/ways-of-sleeping-in-linux-kernel/
http://www.linuxjournal.com/article/8144
Related
I am tracing Linux 0.11
https://mirrors.edge.kernel.org/pub/linux/kernel/Historic/old-versions/
I see there are many schedule() calls in different places, not just the one inside do_timer().
A few questions here:
Will do_timer() (in sched.c) be called every time the timer times out? Is this timer based on an x86 interrupt?
Since there are many schedule() calls outside of do_timer(), can I say that is a kind of preemption? Or what is their purpose?
Any operation that blocks calls schedule() to yield control.
Some task's state has changed and needs to be acted on, which happens in schedule().
Some tasks still have a lot of work to do, so schedule() is called to keep the load balanced.
Since there are many schedule() calls outside of do_timer(), can I say that is a kind of preemption? Or what is their purpose?
For a real OS, most task switches occur because a task blocks waiting for something (user input, a network packet, disk I/O, ...) or a task unblocks because something it was waiting for happened (and the unblocked task has higher priority and preempts the currently running lower-priority task).
The whole "task switch caused by timer IRQ" thing is mostly just a fallback to guard against malicious CPU hogs (denial of service attacks); and for normal software under normal conditions you could disable it (delete the schedule() from the timer IRQ handler) and nobody would notice or care. Note: Some people will say it's also for "non-malicious" CPU bound tasks, but CPU bound tasks are relatively rare, and (ignoring the fact that the Linux scheduler has never been good for task priorities) for CPU bound tasks it's better to rely on an effective system of task priorities (e.g. give the CPU bound tasks a low priority so that almost everything will preempt them).
Also note that various courses on OS theory start with "so simple it never actually happens in practice" concepts, which is almost always a pure round-robin scheduler with tasks that never block (often with "Hey, we can accurately predict the future and know exactly how long each task will run for" nonsense), which is mostly fine as a first step (in a "learn to walk before you run" way) but sucks big salty dog balls if it's not followed by more realistic and more complex concepts (better scheduling algorithms, task priorities, multiple simultaneous scheduling algorithms/"scheduler policies", multi-CPU, interactive/latency sensitive tasks, ..) because it leaves the student/victim with little more than misinformation (e.g. the ever re-occurring "all tasks switches are caused by timer IRQ" misconception).
Will do_timer() (in sched.c) be called every time the timer times out? Is this timer based on an x86 interrupt?
I'm guessing that the timer was the raw PIT chip's IRQ (given that Linux version 0.11 was "absolute beginner developer with no intention of making it portable" historical memorabilia from before thousands of volunteers fixed half of the worst parts).
Also don't forget that the scheduler uses time for two different things - the "current task has used too much CPU time" thing that almost never matters, and figuring out when tasks that are blocked/sleeping (e.g. because they called sleep()) should unblock/wake up. The do_timer() might be for either of these things and might be for both (I don't know without looking at it).
I have a small program running on Linux (on an embedded PC, dual-core Intel Atom 1.6GHz with Debian 6 running Linux 2.6.32-5) which communicates with external hardware via an FTDI USB-to-serial converter (using the ftdi_sio kernel module and a /dev/ttyUSB* device). Essentially, in my main loop I run
clock_gettime() using CLOCK_MONOTONIC
select() with a timeout of 8 ms
clock_gettime() as before
Output the time difference of the two clock_gettime() calls
To have some level of "soft" real-time guarantees, this thread runs as SCHED_FIFO with maximum priority (showing up as "RT" in top). It is the only thread in the system running at this priority, no other process has such priorities. My process has one other SCHED_FIFO thread with a lower priority, while everything else is at SCHED_OTHER. The two "real-time" threads are not CPU bound and do very little apart from waiting for I/O and passing on data.
The kernel I am using has no RT_PREEMPT patches (I might switch to that patch in the future). I know that if I want "proper" realtime, I need to switch to RT_PREEMPT or, better, Xenomai or the like. But nevertheless I would like to know what is behind the following timing anomalies on a "vanilla" kernel:
Roughly 0.03% of all select() calls are timed at over 10 ms (remember, the timeout was 8 ms).
The three worst cases (out of over 12 million calls) were 31.7 ms, 46.8 ms and 64.4 ms.
All of the above happened within 20 seconds of each other, and I think some cron job may have been interfering (although the system logs are low on information apart from the fact that cron.daily was being executed at the time).
So, my question is: What factors can be involved in such extreme cases? Is this just something that can happen inside the Linux kernel itself, i.e. would I have to switch to RT_PREEMPT, or even a non-USB interface and Xenomai, to get more reliable guarantees? Could /proc/sys/kernel/sched_rt_runtime_us be biting me? Are there any other factors I may have missed?
Another way to put this question is, what else can I do to reduce these latency anomalies without switching to a "harder" realtime environment?
Update: I have observed a new, "worse worst case" of about 118.4 ms (once over a total of around 25 million select() calls). Even when I am not using a kernel with any sort of realtime extension, I am somewhat worried by the fact that a deadline can apparently be missed by over a tenth of a second.
Without more information it is difficult to point to something specific, so I am just guessing here:
Interrupts and code that is triggered by interrupts take so much time in the kernel that your real time thread is significantly delayed. This depends on the frequency of interrupts, which interrupt handlers are involved, etc.
A thread with lower priority will not be interrupted inside the kernel until it yields the cpu or leaves the kernel.
As pointed out in this SO answer, CPU System Management Interrupts and Thermal Management can also cause significant time delays (up to 300ms were observed by the poster).
118ms seems quite a lot for a 1.6GHz CPU. But one driver that accidentally locks the CPU for some time would be enough. If you can, try to disable some drivers or use different driver/hardware combinations.
sched_rt_period_us and sched_rt_runtime_us should not be a problem if they are set to reasonable values and your code behaves as you expect. Still, I would remove the limit for RT threads and see what happens.
What else can you do? Write a device driver! It's not that difficult and interrupt handlers get a higher priority than realtime threads. It may be easier to switch to a real time kernel but YMMV.
I'm looking to create a state of uninterruptible sleep for a program I'm writing. Any tips or ideas about how to create this state would be helpful.
So far I've looked into the wait_event() function defined in wait.h, but was having little luck implementing it. When trying to initialize my wait queue the compiler complained
warning: parameter names (without types) in function declaration
static DECLARE_WAIT_QUEUE_HEAD(wq);
Has anyone had any experience with the wait_event() function or creating an uninterruptible sleep?
The functions that you're looking at in include/linux/wait.h are internal to the Linux kernel. They are not available to userspace.
Generally speaking, uninterruptible sleep states are considered undesirable. Under normal circumstances, they cannot be triggered by user applications except by accident (e.g, by attempting to read from a storage device that is not responding correctly, or by causing the system to swap).
You can make sleep 'signal-aware'.
sleep can be interrupted by a signal, in which case the pause stops and sleep returns with the amount of time still left. The application can choose to handle the signal it was notified of and, if needed, resume sleeping for the time left.
Actually, you should use the synchronization objects provided by the operating system you're working on, or simply check the return value of the sleep function. If it returns a value bigger than zero, your sleep was interrupted. Based on this return value, call sleep again, passing the delta (T - returnVal) as the argument (probably in a loop, in case further interrupts occur in that interval).
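That retry loop can be as small as this; sleep() itself reports the unslept seconds, so no extra bookkeeping is needed:

#include <unistd.h>

/* Keep sleeping until the whole interval has elapsed, even if signals
   keep interrupting us; sleep() returns the number of seconds left. */
void sleep_full(unsigned int seconds)
{
    unsigned int remaining = seconds;

    while (remaining > 0)
        remaining = sleep(remaining);   /* 0 once the interval completed */
}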
On the other hand, if you really want a real-uninterruptible custom sleep function, I may suggest something like the following:
void uninterruptible_sleep(long time, long factor)
{
    long i, j;

    __asm__("cli"); // disable interrupts (privileged instruction)
    for (i = 0; i < time; ++i)
        for (j = 0; j < factor; ++j)
            ; // custom timer loop
    __asm__("sti"); // re-enable interrupts
}
cli and sti are x86 assembly instructions that manipulate the CPU's IF (interrupt flag). With them it is possible to clear (cli) or set (sti) interrupt handling. Note that they are privileged instructions: an ordinary user-mode program will fault on them, so this only works in kernel mode (or with the I/O privilege level raised). If you're working on a multi-processor system, additional synchronization precautions are needed, because these instructions only affect the current processor. Moreover, this type of function is very system (CPU) dependent, because the inner loop needs a calibrated iteration count (instructions executed per second) to measure an exact time interval for the given CPU frequency. So if you really want to get rid of every possible interrupt, you may use a function like the one above. But be careful: if your program deadlocks while interrupts are disabled, you will need to restart your system.
(The inline assembly syntax I have written is for gcc compiler)
Can I set up the priority of a workqueue?
I am modifying the SPI kernel module "spidev" so it can communicate faster with my hardware.
The external hardware is a CAN controller with a very small buffer, so I must read any incoming data quickly to avoid losing data.
I have configured a GPIO interrupt to inform me of the new data, but I cannot read the SPI hardware in the interrupt handler.
My interrupt handler basically sets up a workqueue that will read the SPI data.
It works fine when there is only one active process in the kernel.
As soon as I open any other process (even the process viewer top) at the same time, I start losing data in bunches, i.e., I might receive 1000 packets of data with no problems and then lose 15 packets in a row, and so on.
I suspect that the cause of my problem is that when the other process (top, in this case) has control over the CPU, the interrupt handler runs, but the work in the workqueue doesn't run until the scheduler is called again.
I tried to increase the priority of my process with no success.
I wonder if there is a way to tell the kernel to execute the work in the workqueue immediately after the interrupt handling function.
Suggestions are welcome.
As an alternative you could consider using a tasklet, which the kernel will execute more immediately, but be aware that you are unable to sleep in tasklets.
A good IBM article on deferring work in the kernel:
http://www.ibm.com/developerworks/linux/library/l-tasklets/
http://www.makelinux.net/ldd3/chp-7-sect-5
A tasklet is run at the next timer tick as long as the CPU is busy running a process, but it is run immediately when the CPU is otherwise idle. The kernel provides a set of ksoftirqd kernel threads, one per CPU, just to run "soft interrupt" handlers, such as the tasklet_action function. Thus, the final three runs of the tasklet take place in the context of the ksoftirqd kernel thread associated to CPU 0. The jitasklethi implementation uses a high-priority tasklet, explained in an upcoming list of functions.
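A minimal sketch of that approach with the classic (pre-5.9) tasklet API, assuming the same hypothetical GPIO-interrupt/SPI-read scenario as in the question:

#include <linux/interrupt.h>

static void read_spi_tasklet(unsigned long data)
{
    /* Read the CAN controller over SPI here: we are in softirq context,
       so we may not sleep, but we run much sooner than a workqueue item. */
}

static DECLARE_TASKLET(spi_tasklet, read_spi_tasklet, 0);

static irqreturn_t gpio_irq_handler(int irq, void *dev_id)
{
    tasklet_schedule(&spi_tasklet);   /* defer the heavy work out of the hard IRQ */
    return IRQ_HANDLED;
}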
If I write a thread and run it on the Round Robin real-time scheduler, in Ubuntu 11.04 using either the shipped 2.6.38 generic kernel or the available 3.0.0-9-lowlatency kernel from this PPA: https://launchpad.net/~abogani/+archive/ppa, it seems to lock out the command:
apt-key get
It appears to lock that command out when gpg, under the hood, tries to use mlock(), which as I understand it requires the mmap_sem. However, my test thread is literally "doing nothing", in that it is just an empty for loop. I am not also proactively using the mmap_sem, for example.
On a SMP machine (4 cores, 8 logical cores), a single thread on the RR scheduler at a priority of 50 or more seems to always lock out apt-key. A lower priority returns roughly 50% or less of the time, sometimes taking minutes to return.
What is the connection between my empty-for-loop thread at this real-time priority and the mmap_sem?
From this site: http://www.icir.org/gregor/tools/pthread-scheduling.html
The thread will contend with all the other threads and processes for the CPU. So if your thread is higher priority and is in real time scheduling, it will take the CPU and never give it back.
One way to test if this is the problem is to block on any system call, so your thread will sleep and let the other process run. Using a select with a timeout should be a good test.
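For example, replacing the empty for loop's body with a short timed select() makes the thread actually sleep each iteration, so lower-priority processes such as apt-key can get the CPU (the timeout value here is chosen arbitrarily):

#include <sys/select.h>

/* One iteration of the previously empty loop: block in the kernel for 1 ms. */
void loop_iteration(void)
{
    struct timeval tv = { .tv_sec = 0, .tv_usec = 1000 };   /* 1 ms */

    select(0, NULL, NULL, NULL, &tv);   /* sleeps; other processes can run */
}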