Erlang: blocking C NIF call behavior

I have observed blocking behavior of C NIFs when they are called concurrently by many Erlang processes. Can it be made non-blocking? Is there a mutex at work here that I'm not able to comprehend?
P.S. A basic "Hello world" NIF can be tested by making it sleep for a hundred microseconds when a particular PID calls it. The other PIDs calling the NIF can then be observed waiting for that sleep to finish before their own calls execute.
Non-blocking behavior would be beneficial in cases where concurrency does not pose an issue (e.g. array push, counter increment).
I am sharing links to four gists comprising the spawner, conc_nif_caller and niftest modules plus the NIF source. I have tried to tinker with the value of Val and I have indeed observed non-blocking behavior; this is confirmed by passing a large integer to the spawn_multiple_nif_callers function.
Links
spawner.erl, conc_nif_caller.erl, niftest.erl and finally niftest.c.
The line below is printed by the Erlang REPL on my Mac.
Erlang/OTP 17 [erts-6.0] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

NIFs themselves don't have any mutex. You could implement one in C, and a lock is involved when the NIF object is loaded, but that happens only once, when the module is loaded.
One thing that might be happening (and I would bet that's what is going on) is that your C code messes up the Erlang scheduler(s). The erl_nif documentation warns:
A native function that does lengthy work before returning degrades responsiveness of the VM, and can cause miscellaneous strange behaviors. Such strange behaviors include, but are not limited to, extreme memory usage and bad load balancing between schedulers. Strange behaviors that can occur because of lengthy work can also vary between OTP releases.
It also describes what lengthy work means and how you can deal with it.
In very few words (with quite a few simplifications):
One scheduler is created per core. Each scheduler has a list of processes it can run. If one scheduler's list is empty, it will try to steal work from another one. This can fail if there is nothing (or not enough) to steal.
An Erlang scheduler spends some amount of work in one process, then moves to another, spends some amount of work there, and moves on. And so on, and so on. It is very similar to the scheduling of operating-system processes.
One thing that is very important here is how the amount of work is measured. By default, each function call is assigned some number of reductions. An addition might cost two, calling a function in your module costs one, sending a message also one, and some built-ins cost more (like list_to_binary). After collecting roughly 2,000 reductions, the scheduler moves to another process.
So what is the cost of your C function? It's only one reduction.
Code like
loop() ->
    call_nif_function(),
    loop().
could run for a whole hour, but the scheduler would stay stuck in this one process because it still hasn't counted up to 2,000 reductions. To put it another way, it can be stuck inside the NIF with no possibility of moving forward (at least not any time soon).
There are a few ways around this, but the general rule is that NIFs should not take a long time. So if you have long-running C code, maybe you should use a driver instead. Drivers should be much easier to implement and manage than tinkering with NIFs.
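Another option on recent OTP releases is to split the work into chunks and report consumed timeslice back to the VM with enif_consume_timeslice, yielding with enif_schedule_nif (which appeared around OTP 17.3) when the budget runs out. A rough sketch, with do_one_chunk as a made-up placeholder and the module name borrowed from the question's niftest:

#include "erl_nif.h"

/* Made-up placeholder for the real job: does a bounded slice of work
   and returns how much is left. */
static long do_one_chunk(long remaining)
{
    volatile long i;
    for (i = 0; i < 100000; i++)
        ;                                   /* pretend to work */
    return remaining > 1000 ? remaining - 1000 : 0;
}

static ERL_NIF_TERM chunked_nif(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[])
{
    long remaining;
    if (argc != 1 || !enif_get_long(env, argv[0], &remaining))
        return enif_make_badarg(env);

    while (remaining > 0) {
        remaining = do_one_chunk(remaining);
        /* Report roughly 1% of a timeslice per chunk; a nonzero return
           means the budget is used up and we should yield. */
        if (enif_consume_timeslice(env, 1) && remaining > 0) {
            ERL_NIF_TERM rest = enif_make_long(env, remaining);
            /* Yield: the VM will call chunked_nif again later with the
               leftover work, letting other processes run in between. */
            return enif_schedule_nif(env, "chunked_nif", 0, chunked_nif, 1, &rest);
        }
    }
    return enif_make_atom(env, "done");
}

static ErlNifFunc nif_funcs[] = {{"chunked_nif", 1, chunked_nif}};
ERL_NIF_INIT(niftest, nif_funcs, NULL, NULL, NULL, NULL)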

I think the responses about long-running NIFs are off the mark, since your question says you're running some simple "hello world" code and are sleeping for just 100 us. It's true that ideally a NIF call shouldn't take more than a millisecond, but your NIFs likely won't cause scheduler issues unless they run consistently for tens of milliseconds at a time or more.
I have a simple NIF called rev/1 that takes a string argument, reverses it, and returns the reversed string. I stuck a usleep call in the middle of it, then spawned 100 concurrent Erlang processes to invoke it. The two thread stacktraces shown below, based on Erlang/OTP 17.3.2, show two Erlang scheduler threads both inside the rev/1 NIF simultaneously, one at a breakpoint I set on the NIF C function itself, the other blocked on the usleep inside the NIF:
Thread 18 (process 26016):
#0 rev (env=0x1050d0a50, argc=1, argv=0x102ecc340) at nt2.c:9
#1 0x000000010020f13d in process_main () at beam/beam_emu.c:3525
#2 0x00000001000d5b2f in sched_thread_func (vesdp=0x102829040) at beam/erl_process.c:7719
#3 0x0000000100301e94 in thr_wrapper (vtwd=0x7fff5fbff068) at pthread/ethread.c:106
#4 0x00007fff8a106899 in _pthread_body ()
#5 0x00007fff8a10672a in _pthread_start ()
#6 0x00007fff8a10afc9 in thread_start ()
Thread 17 (process 26016):
#0 0x00007fff8a0fda3a in __semwait_signal ()
#1 0x00007fff8d205dc0 in nanosleep ()
#2 0x00007fff8d205cb2 in usleep ()
#3 0x000000010062ee65 in rev (env=0x104fcba50, argc=1, argv=0x102ec8280) at nt2.c:21
#4 0x000000010020f13d in process_main () at beam/beam_emu.c:3525
#5 0x00000001000d5b2f in sched_thread_func (vesdp=0x10281ed80) at beam/erl_process.c:7719
#6 0x0000000100301e94 in thr_wrapper (vtwd=0x7fff5fbff068) at pthread/ethread.c:106
#7 0x00007fff8a106899 in _pthread_body ()
#8 0x00007fff8a10672a in _pthread_start ()
#9 0x00007fff8a10afc9 in thread_start ()
If there were any mutexes within the Erlang emulator preventing concurrent NIF access, the stacktraces would not show both threads inside the C NIF.
It would be nice if you were to post your code so those willing to help resolve this issue could see what you're doing and perhaps help you find any bottlenecks. It would also be helpful if you were to tell us what version(s) of Erlang/OTP you're using.

NIF calls block the scheduler on which the calling process is running. So, in your example, if those other processes are on the same scheduler, they cannot run (and therefore cannot call into the NIF) until the first process's NIF call finishes.
You cannot make an NIF call non-blocking in this regard. You can, however, spawn your own threads and offload the brunt of your work to them.
Such threads can send messages to local Erlang processes (processes on the same machine), and as such you can still get the response you desire by waiting for your spawned thread to send back a message.
A bad example:
#include <unistd.h>   // for sleep()
#include "erl_nif.h"

// Holds whatever the worker thread needs; here, just the caller's pid.
struct MyStruct {
    ErlNifPid caller;
};

static void* my_worker_function(void* arg);

static ERL_NIF_TERM my_function(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[]) {
    MyStruct* args = new MyStruct(); // I like C++; so sue me
    enif_self(env, &args->caller);   // remember which process to reply to
    ErlNifTid thread_id;
    // Please remember, you must at some point rejoin the thread
    // (enif_thread_join), so keep track of the thread_id.
    enif_thread_create("my_function_thread", &thread_id, my_worker_function, (void*)args, NULL);
    return enif_make_atom(env, "ok");
}

static void* my_worker_function(void* arg) {
    MyStruct* args = (MyStruct*)arg;
    sleep(100); // the lengthy work happens off the scheduler thread
    ErlNifEnv* msg_env = enif_alloc_env();
    ERL_NIF_TERM msg = enif_make_atom(msg_env, "ok");
    enif_send(NULL, &args->caller, msg_env, msg);
    enif_free_env(msg_env);
    delete args;
    return NULL;
}
And in your erlang source:
test_nif() ->
    my_nif:my_function(),
    receive
        ok -> ok
    end.
Something to that effect, anyway.

Related

Monitoring Thread performance of server

I have developed a C server using gcc and pthreads that receives UDP packets and depending on the configuration either drops or forwards them to specific targets. In some cases these packets are untouched and just redirected, in some cases headers in the packet are modified, in other cases there is another module of the server that modifies every byte of the packet.
To configure this server, there is a GUI written in Java that connects to the C Server using TCP (to exchange configuration commands). There can be multiple connected GUIs at the same time.
In order to measure utilization of the server I have written kind of a module that starts two separate threads (#2 & #3). The main thread (#1) that does the whole forwarding work essentially works like the following:
struct monitoring_struct data; // contains 2 * uint64_t for start and end time among other fields
for (;;) {
    recvfrom();
    data.start = current_time();
    modifyPacket();
    sendPacket(); // sometimes to multiple destinations
    data.end = current_time();
    writeDataToPipe();
}
The current_time function:
//give a timestamp in microsecond precision
// give a timestamp with microsecond precision
uint64_t current_time(void) {
    struct timespec spec;
    clock_gettime(CLOCK_REALTIME, &spec);
    uint64_t ts = (uint64_t) ((((double) spec.tv_sec) * 1.0e6) +
                              (((double) spec.tv_nsec) / 1.0e3));
    return ts;
}
As indicated in the main thread, the data struct is written into a pipe from which thread #2 reads. Every time there is data to be read from the pipe, thread #2 uses a given aggregation function that stores the data in another place in memory. Thread #3 is a loop that sleeps for ~1 second, then sends out the aggregated values (median, average, min, max, lower quartile, upper quartile, ...) and resets the aggregated data. Threads #2 and #3 are synchronized by mutexes.
The GUI listens to this data (if the monitoring window is open) which is sent out via UDP to listeners (there can be more) and the GUI then converts the numbers into diagrams, graphs and "pressure" indicators.
I came up with this because, in my mind, it is the solution that interferes least with thread #1 (assuming it runs on a multicore system, which it always does, and exclusively apart from the OS and maybe SSH).
As performance is critical for my server (version "1.0", with a simpler configuration, was able to handle the maximum number of streams possible over gigabit Ethernet), I would like to ask whether my solution is really as good as I think it is at keeping the performance hit on thread #1 to a minimum, and whether you think there would be better designs for that. At least I am unable to think of another solution that does not either take locks on the data itself (avoiding the pipe, but potentially blocking thread #1) or use a shared list with an rwlock, with possible reader starvation.
There are scenarios where packets are larger, but we currently use the performance-measuring mode in which one stream sends exactly 1000 packets per second. We want to ensure that version 2.0 can handle at least 12 streams (hence 12,000 packets per second); the previous version was able to manage 84 streams.
In the future I would like to add other milestone timestamps to thread #1, e.g. inside modifyPacket() (there are multiple steps) and before sendPacket().
I have tried tinkering with the current_time() function, mostly trying to remove it and just store the raw clock_gettime() value, but in my simple test program the current_time() function always beat plain clock_gettime().
Thanks in advance for any input.
if you think there would be better designs for that?
The short answer is to use Data Plane Development Kit (DPDK) with its design patterns and libraries. It might be quite a learning curve, but in terms of performance it is the best solution at the moment. It is free and open source (BSD license).
A bit more detailed answer:
the data struct is written into a pipe
Since threads #1 and #2 are threads of the same process, it would be much faster to pass data using shared memory, not pipes; just like you already do between threads #2 and #3.
thread #2 uses a given aggregation function that stores the data in another place in memory
Those two threads seem unnecessary. Couldn't thread #2 read the data passed by thread #1, aggregate it, and send it out itself?
I am unable to think of another solution that is not using locks on the data itself
Have a look at lockless queues, which are called "rings" in DPDK. The idea is to have a shared circular buffer between threads and use lockless algorithms to enqueue/dequeue to/from the buffer, as in the sketch below.
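For illustration, a minimal single-producer/single-consumer ring of that kind in C11, with no locks and just two atomic indices. The struct monitoring_struct here is a stand-in based on the question's description; the names and ring size are made up:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the question's struct: start/end timestamps in microseconds. */
struct monitoring_struct {
    uint64_t start;
    uint64_t end;
};

#define RING_SIZE 1024  /* must be a power of two */

struct ring {
    struct monitoring_struct buf[RING_SIZE];
    _Atomic uint32_t head;  /* only ever written by the producer (thread #1) */
    _Atomic uint32_t tail;  /* only ever written by the consumer (thread #2) */
};

/* Producer side: returns false if the ring is full (drop or count the sample). */
static bool ring_push(struct ring *r, const struct monitoring_struct *item)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;                           /* full */
    r->buf[head & (RING_SIZE - 1)] = *item;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool ring_pop(struct ring *r, struct monitoring_struct *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;                           /* empty */
    *out = r->buf[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

Thread #1 would call ring_push() right after taking the end timestamp; thread #2 polls ring_pop() (or sleeps briefly when it returns false) and aggregates.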
We currently want to ensure version 2.0 at least is possible to work with 12 Streams (hence 12000 packets per second), however previously the server was able to manage 84 Streams.
Measure the performance and find the bottlenecks (it seems you are still not 100% sure what the bottleneck in the code is).
Just for reference, Intel publishes performance reports for DPDK. The reference numbers for L3 forwarding (i.e. routing) are up to 30 million packets per second.
Sure, you might have a less powerful processor and NIC, but a few million packets per second are reachable quite easily using the right techniques.

How to prevent linux soft lockup/unresponsiveness in C without sleep

What would be the correct way to prevent a soft lockup/unresponsiveness in a long-running while loop in a C program?
(dmesg is reporting a soft lockup)
Pseudo code is like this:
while (worktodo) {
    worktodo = doWork();
}
My code is of course way more complex, and also includes a printf statement which gets executed once a second to report progress, but the problem is, the program ceases to respond to ctrl+c at this point.
Things I've tried which do work (but I want an alternative):
doing printf every loop iteration (don't know why, but the program becomes responsive again that way (???)) - wastes a lot of performance due to unneeded printf calls (each doWork() call does not take very long)
using sleep/usleep/... - also seems like a waste of (processing-)time to me, as the whole program will already be running several hours at full speed
What I'm thinking about is some kind of process_waiting_events() function or the like, and normal signals seem to be working fine as I can use kill on a different shell to stop the program.
Additional background info: I'm using GWAN and my code is running inside the main.c "maintenance script", which seems to be running in the main thread as far as I can tell.
Thank you very much.
P.S.: Yes I did check all other threads I found regarding soft lockups, but they all seem to ask about why soft lockups occur, while I know the why and want to have a way of preventing them.
P.P.S.: Optimizing the program (making it run shorter) is not really a solution, as I'm processing a 29GB bz2 file which extracts to about 400GB xml, at the speed of about 10-40MB per second on a single thread, so even at max speed I would be bound by I/O and still have it running for several hours.
While the proposed answer using threads might possibly be an option, in reality it would just shift the problem to a different thread. My solution after all was using
sleep(0)
I also tested sched_yield / pthread_yield, neither of which really helped. Unfortunately I've been unable to find a good resource documenting sleep(0) on Linux, but for Windows the documentation states that using a value of 0 lets the thread yield the remaining part of its current CPU slice.
It turns out that sleep(0) most probably relies on what is called timer slack in Linux; an article about this can be found here: http://lwn.net/Articles/463357/
Another possibility is using nanosleep(&(struct timespec){0}, NULL), which does not seem to rely on timer slack: the Linux man pages for nanosleep state that if the requested interval is below clock granularity, it will be rounded up to clock granularity, which on Linux depends on CLOCK_MONOTONIC according to the man pages. Thus a value of 0 nanoseconds is perfectly valid and should always work, as clock granularity can never be 0.
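For illustration, the loop from the question with that zero-length nanosleep added as a scheduling point (doWork() is the question's placeholder, declared here just so the sketch compiles):

#include <time.h>

/* The question's placeholder work function; returns 0 when the job is done. */
extern int doWork(void);

void run_job(void)
{
    int worktodo = 1;
    while (worktodo) {
        worktodo = doWork();
        /* A 0 ns request is rounded up to the clock granularity, so this is
           effectively just a scheduling point, not a measurable delay. */
        nanosleep(&(struct timespec){0, 0}, NULL);
    }
}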
Hope this helps someone else as well ;)
Your scenario is not really a soft lockup; it is a process that is busy doing something.
How about this pseudo code:
void workerThread()
{
    while (workToDo)
    {
        if (threadSignalled)
            break;
        workToDo = DoWork()
    }
}

void sighandler()
{
    signal worker thread to finish
    waitForWorkerThreadFinished;
}

void main()
{
    InstallSignalHandler;
    CreateSemaphore
    StartThread;
    waitForWorkerThreadFinished;
}
Clearly a timing issue. Using a signalling mechanism should remove the problem; a runnable sketch follows below.
The use of printf solves the problem because printf accesses the console, which is an expensive and time-consuming operation and in your case gives enough time for the worker to complete its work.
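A hedged, runnable version of that pseudo code for POSIX systems (doWork() stands in for the question's work function; compile with -pthread):

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the question's doWork(); returns 0 when done. */
extern int doWork(void);

static volatile sig_atomic_t stop_requested = 0;

static void sigint_handler(int signo) {
    (void)signo;
    stop_requested = 1;            /* async-signal-safe: only set a flag */
}

static void *worker_thread(void *arg) {
    (void)arg;
    int worktodo = 1;
    while (worktodo && !stop_requested)
        worktodo = doWork();       /* loop stays tight; no sleeps needed */
    return NULL;
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sigint_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);  /* Ctrl+C now just requests a stop */

    pthread_t tid;
    pthread_create(&tid, NULL, worker_thread, NULL);
    pthread_join(tid, NULL);       /* main thread waits; signals still handled */

    puts(stop_requested ? "interrupted" : "finished");
    return 0;
}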

Purpose of wake_up_sync/wake_up_interruptible_sync in the Linux kernel

I'm following an example in the Linux Device Drivers 3rd Edition book:
if (temp == 0)
    wake_up_interruptible_sync(&scull_w_wait); /* awake other uid's */
return 0;
The author states:
Here is an example of where calling wake_up_interruptible_sync makes sense. When we do
the wakeup, we are just about to return to user space, which is a natural scheduling
point for the system. Rather than potentially reschedule when we do the wakeup, it is
better to just call the "sync" version and finish our job.
I don't understand why using wake_up_interruptible_sync is better in this situation. The author implies that this call will prevent a reschedule -- which it does prevent within the call -- but after wake_up_interruptible_sync returns, couldn't another thread just take control of the CPU anyway before the return 0 line?
So what is the difference between calling wake_up_interruptible_sync over the typical wake_up_interruptible if a thread can take control of the CPU regardless after each call?
The reason for using _sync is that we know that the scheduler will run within a short time, so we do not need to run it a second time.
However, this is just an optimization; if the scheduler did run again, nothing bad would happen.
A timer interrupt can indeed happen at any time, but it would be needed only if the scheduler did not already run recently for some other reason.

Use printk in kernel

I am trying to implement my own new schedule(). I want to debug my code.
Can I use printk function in sched.c?
I used printk but it doesn't work. What did I miss?
Do you know how often schedule() is called? It's probably called faster than your computer can flush the print buffer to the log. I would suggest using another method of debugging: for instance, run your kernel in QEMU and use remote GDB, loading the kernel.syms file as a symbol table and setting a breakpoint. Other virtualization software offers similar features. Or do it the manual way and walk through your code. Using printk in interrupt handlers is typically a bad idea (unless you're about to panic or stall).
If the error you are seeing doesn't happen often, think about using BUG() or BUG_ON(cond) instead. These produce conditional error messages and shouldn't fire as often as an unconditional printk.
Editing the schedule() function itself is typically a bad idea (unless you want to support multiple run queues, etc.). It's much better and easier to modify a scheduler class instead; look at the code of the CFS scheduler to see how. If you want to accomplish something else, I can give better advice.
It's not safe to call printk while holding the runqueue lock. A special function, printk_sched, was introduced to provide a mechanism for using printk while holding the runqueue lock (https://lkml.org/lkml/2012/3/13/13). Unfortunately it can only print one message per tick (and there cannot be more than one tick while holding the runqueue lock, because interrupts are disabled), since an internal buffer is used to save the message.
You can either use lttng2 for logging to user space or patch the implementation of printk_sched to use a statically allocated pool of buffers that can be used within a tick.
Try trace_printk().
printk() has too much of an overhead and schedule() gets called again before previous printk() calls finish. This creates a live lock.
Here is a good article about it: https://lwn.net/Articles/365835/
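For example (a fragment; prev and next stand for whatever task pointers your scheduler code has in scope), the output lands in the ftrace buffer at /sys/kernel/debug/tracing/trace rather than on the console:

/* Cheap enough to call from scheduling code: writes to the ftrace
   ring buffer instead of the console. */
trace_printk("schedule: prev=%d next=%d\n", prev->pid, next->pid);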
It depends; basically it should work fine.
Try using dmesg in a shell to find your printk output; if it is not there, you apparently didn't invoke it. For example, sched.c itself already contains printk calls like:
    if (p->mm && printk_ratelimit()) {
        printk(KERN_INFO "process %d (%s) no longer affine to cpu%d\n",
               task_pid_nr(p), p->comm, cpu);
    }

    return dest_cpu;
}
There are also sections in sched.c where printk doesn't work, e.g.:
static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
{
    if (unlikely(!irqs_disabled())) {
        /* printk() doesn't work good under rq->lock */
        raw_spin_unlock(&this_rq->lock);
        BUG_ON(1);
    }

    return _double_lock_balance(this_rq, busiest);
}
EDIT
You may try calling printk once in every 1000 invocations instead of every time; see the fragment below.
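For illustration, that throttling could look roughly like this inside your scheduling code (the counter name is invented; this is a kernel-context fragment, not a standalone module):

/* Emit at most one message per 1000 calls, and still respect the
   global printk rate limit. */
static unsigned long my_sched_dbg_calls;

if ((++my_sched_dbg_calls % 1000) == 0 && printk_ratelimit())
    printk(KERN_DEBUG "schedule() sampled: %lu calls on cpu%d\n",
           my_sched_dbg_calls, smp_processor_id());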

What's the algorithm behind sleep()?

Now there's something I always wondered: how is sleep() implemented?
If it is all about using an API from the OS, then how is the API made?
Does it all boil down to using special machine code on the CPU? Does that CPU need a special co-processor or other gizmo without which you can't have sleep()?
The best known incarnation of sleep() is in C (to be more accurate, in the libraries that come with C compilers, such as GNU's libc), although almost every language today has its equivalent, but the implementation of sleep in some languages (think Bash) is not what we're looking at in this question...
EDIT: After reading some of the answers, I see that the process is placed in a wait queue. From there, I can guess two alternatives, either
a timer is set so that the kernel wakes the process at the due time, or
whenever the kernel is allowed a time slice, it polls the clock to check whether it's time to wake a process.
The answers only mention alternative 1. Therefore, I ask: how does this timer behave ? If it's a simple interrupt to make the kernel wake the process, how can the kernel ask the timer to "wake me up in 140 milliseconds so I can put the process in running state" ?
The "update" to question shows some misunderstanding of how modern OSs work.
The kernel is not "allowed" a time slice. The kernel is the thing that gives out time slices to user processes. The "timer" is not set to wake the sleeping process up - it is set to stop the currently running process.
In essence, the kernel attempts to fairly distribute the CPU time by stopping processes that are on CPU too long. For a simplified picture, let's say that no process is allowed to use the CPU more than 2 milliseconds. So, the kernel would set timer to 2 milliseconds, and let the process run. When the timer fires an interrupt, the kernel gets control. It saves the running process' current state (registers, instruction pointer and so on), and the control is not returned to it. Instead, another process is picked from the list of processes waiting to be given CPU, and the process that was interrupted goes to the back of the queue.
The sleeping process is simply not in the queue of things waiting for the CPU. Instead, it's stored in the sleep queue. Whenever the kernel gets a timer interrupt, the sleep queue is checked, and the processes whose time has come are transferred to the "waiting for CPU" queue.
This is, of course, a gross simplification. It takes very sophisticated algorithms to ensure security, fairness and balance, to prioritize, to prevent starvation, and to do it all fast and with a minimum amount of memory used for kernel data.
There's a kernel data structure called the sleep queue. It's a priority queue. Whenever a process is added to the sleep queue, the expiration time of the most-soon-to-be-awakened process is calculated, and a timer is set. At that time, the expired job is taken off the queue and the process resumes execution.
(amusing trivia: in older unix implementations, there was a queue for processes for which fork() had been called, but for which the child process had not been created. It was of course called the fork queue.)
HTH!
Perhaps the major job of an operating system is to hide the complexity of a real piece of hardware from the application writer. Hence, any description of how the OS works runs the risk of getting really complicated, really fast. Accordingly, I am not going to deal with all the "what ifs" and "yeah buts" that a real operating system needs to deal with. I'm just going to describe, at a high conceptual level, what a process is, what the scheduler does, and how the timer queue works. Hopefully this is helpful.
What's a process:
Think of a process--let's just talk about processes, and get to threads later--as "the thing the operating system schedules". A process has an ID--think an integer--and you can think of that integer as an index into a table containing all the context of that process.
Context is the hardware information--registers, memory management unit contents, other hardware state--that, when loaded into the machine, will allow the process to "go". There are other components of context--lists of open files, state of signal handlers, and, most importantly here, things the process is waiting for.
Processes spend a lot of time sleeping (a.k.a. waiting)
A process spends much of its time waiting. For example, a process that reads or writes to disk will spend a lot of time waiting for the data to arrive or be acknowledged to be out on disk. OS folks use the terms "waiting" and "sleeping" (and "blocked") somewhat interchangeably--all meaning that the process is awaiting something to happen before it can continue on its merry way. It is just confusing that the OS API sleep() happens to use underlying OS mechanisms for sleeping processes.
Processes can be waiting for other things: network packets to arrive, window selection events, or a timer to expire, for example.
Processes and Scheduling
Processes that are waiting are said to be non-runnable. They don't go onto the run queue of the operating system. But when the event occurs which the process is waiting for, it causes the operating system to move the process from the non-runnable to the runnable state. At the same time, the operating system puts the process on the run queue, which is really not a queue--it's more of a pile of all the processes which, should the operating system decide to do so, could run.
Scheduling:
the operating system decides, at regular intervals, which processes should run. The algorithm by which the operating system decides to do so is called, somewhat unsurprisingly, the scheduling algorithm. Scheduling algorithms range from dead-simple ("everybody gets to run for 10 ms, and then the next guy on the queue gets to run") to far more complicated (taking into account process priority, frequency of execution, run-time deadlines, inter-process dependencies, chained locks and all sorts of other complicated subject matter).
The Timer Queue
A computer has a timer inside it. There are many ways this can be implemented, but the classic manner is called a periodic timer. A periodic timer ticks at a regular interval--in most operating systems today, I believe this rate is 100 times per second--100 Hz--every 10 milliseconds. I'll use that value in what follows as a concrete rate, but know that most operating systems worth their salt can be configured with different ticks--and many don't use this mechanism and can provide much better timer precision. But I digress.
Each tick results in an interrupt to the operating system.
When the OS handles this timer interrupt, it increments its idea of system time by another 10 ms. Then, it looks at the timer queue and decides what events on that queue need to be dealt with.
The timer queue really is a queue of "things which need to be dealt with", which we will call events. This queue is ordered by time of expiration, soonest events first.
An "event" can be something like, "wake up process X", or "go kick disk I/O over there, because it may have gotten stuck", or "send out a keepalive packet on that fibrechannel link over there". Whatever the operating system needs to have done.
When you have a queue ordered in this way, it's easy to manage the dequeuing. The OS simply looks at the head of the queue, and decrements the "time to expiration" of the event by 10 ms every tick. When the expiration time goes to zero, the OS dequeues that event, and does whatever is called for.
In the case of a sleeping process, it simply makes the process runnable again.
Simple, huh?
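To make that bookkeeping concrete, here is a toy version of the tick handler in C, using the classic "delta list" trick (each queued event stores the ticks remaining after the event in front of it, so only the head needs updating). All names are invented for illustration; real kernels use far more refined structures such as timer wheels:

#include <stddef.h>

/* Toy timer-queue event, kept sorted soonest-first. */
struct timer_event {
    unsigned long delta_ticks;          /* ticks after the previous event */
    void (*action)(void *arg);          /* e.g. make process X runnable   */
    void *arg;
    struct timer_event *next;
};

static struct timer_event *timer_queue; /* head = soonest event */

/* Called from the periodic timer interrupt, e.g. every 10 ms. */
void timer_tick(void)
{
    if (timer_queue == NULL)
        return;

    if (timer_queue->delta_ticks > 0)
        timer_queue->delta_ticks--;

    /* Fire every event whose time has come (deltas of 0 mean "same time"). */
    while (timer_queue && timer_queue->delta_ticks == 0) {
        struct timer_event *ev = timer_queue;
        timer_queue = ev->next;
        ev->action(ev->arg);            /* for sleep(): wake the process */
    }
}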
There are at least two different levels at which to answer this question (and a lot of other things that get confused with it, which I won't touch):
The application level: this is what the C library does. It's a simple OS call; it simply tells the OS not to give CPU time to this process until the time has passed. The OS has a queue of suspended applications, and some info about what they are waiting for (usually either time, or some data to appear somewhere).
The kernel level: when the OS doesn't have anything to do right now, it executes a 'hlt' instruction. This instruction doesn't do anything, but it never finishes by itself. Of course, a hardware interrupt is serviced normally. Put simply, the main loop of an OS looks like this (from very, very far away):
allow_interrupts();
while (true) {
    hlt;
    check_todo_queues();
}
The interrupt handlers simply add things to the to-do queues. The real-time clock is programmed to generate interrupts either periodically (at a fixed rate) or at some fixed time in the future when the next process wants to be awakened.
A multitasking operating system has a component called a scheduler, which is responsible for giving CPU time to threads; calling sleep tells the OS not to give CPU time to this thread for some time.
see http://en.wikipedia.org/wiki/Process_states for complete details.
I don't know anything about Linux, but I can tell you what happens on Windows.
Sleep() causes the process' time-slice to end immediately to return control to the OS. The OS then sets up a timer kernel object that gets signaled after the time elapses. The OS will then not give that process any more time until the kernel object gets signaled. Even then, if other processes have higher or equal priority, it may still wait a little while before letting the process continue.
Special CPU machine code is used by the OS to do process switching. Those functions cannot be accessed by user-mode code, so they are accessed strictly by API calls into the OS.
Essentially, yes, there is a "special gizmo" - and it's important for a lot more than just sleep().
Classically, on x86 this was an Intel 8253 or 8254 "Programmable Interval Timer". In the early PCs, this was a separate chip on the motherboard that could be programmed by the CPU to assert an interrupt (via the "Programmable Interrupt Controller", another discrete chip) after a preset time interval. The functionality still exists, although it is now a tiny part of a much larger chunk of motherboard circuitry.
The OS today still programs the PIT to wake it up regularly (in recent versions of Linux, once every millisecond by default), and this is how the Kernel is able to implement pre-emptive multitasking.
glibc 2.21 Linux
Forwards to the nanosleep system call.
glibc is the default implementation for the C stdlib on most Linux desktop distros.
How to find it: the first reflex is:
git ls-files | grep sleep
This contains:
sysdeps/unix/sysv/linux/sleep.c
and we know that:
sysdeps/unix/sysv/linux/
contains the Linux specifics.
On the top of that file we see:
/* We are going to use the `nanosleep' syscall of the kernel. But the
kernel does not implement the stupid SysV SIGCHLD vs. SIG_IGN
behaviour for this syscall. Therefore we have to emulate it here. */
unsigned int
__sleep (unsigned int seconds)
So if you trust comments, we are done basically.
At the bottom:
weak_alias (__sleep, sleep)
which basically says __sleep == sleep. The function uses nanosleep through:
result = __nanosleep (&ts, &ts);
After grepping:
git grep nanosleep | grep -v abilist
we get a small list of interesting occurrences, and I think __nanosleep is defined in:
sysdeps/unix/sysv/linux/syscalls.list
on the line:
nanosleep - nanosleep Ci:pp __nanosleep nanosleep
which is some super DRY magic format parsed by:
sysdeps/unix/make-syscalls.sh
Then from the build directory:
grep -r __nanosleep
Leads us to: /sysd-syscalls which is what make-syscalls.sh generates and contains:
#### CALL=nanosleep NUMBER=35 ARGS=i:pp SOURCE=-
ifeq (,$(filter nanosleep,$(unix-syscalls)))
unix-syscalls += nanosleep
$(foreach p,$(sysd-rules-targets),$(foreach o,$(object-suffixes),$(objpfx)$(patsubst %,$p,nanosleep)$o)): \
$(..)sysdeps/unix/make-syscalls.sh
$(make-target-directory)
(echo '#define SYSCALL_NAME nanosleep'; \
echo '#define SYSCALL_NARGS 2'; \
echo '#define SYSCALL_SYMBOL __nanosleep'; \
echo '#define SYSCALL_CANCELLABLE 1'; \
echo '#include <syscall-template.S>'; \
echo 'weak_alias (__nanosleep, nanosleep)'; \
echo 'libc_hidden_weak (nanosleep)'; \
) | $(compile-syscall) $(foreach p,$(patsubst %nanosleep,%,$(basename $(@F))),$($(p)CPPFLAGS))
endif
It looks like part of a Makefile. git grep sysd-syscalls shows that it is included at:
sysdeps/unix/Makefile:23:-include $(common-objpfx)sysd-syscalls
compile-syscall looks like the key part, so we find:
# This is the end of the pipeline for compiling the syscall stubs.
# The stdin is assembler with cpp using sysdep.h macros.
compile-syscall = $(COMPILE.S) -o $@ -x assembler-with-cpp - \
$(compile-mkdep-flags)
Note that -x assembler-with-cpp is a gcc option.
This #defines parameters like:
#define SYSCALL_NAME nanosleep
and then use them at:
#include <syscall-template.S>
OK, this is as far as I will go on the macro expansion game for now.
I think then this generates the posix/nanosleep.o file which must be linked together with everything.
Linux 4.2 x86_64 nanosleep syscall
Uses the scheduler: it's not a busy sleep.
Search ctags:
sys_nanosleep
Leads us to kernel/time/hrtimer.c:
SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
hrtimer stands for High Resolution Timer. From there the main line looks like:
hrtimer_nanosleep
do_nanosleep
set_current_state(TASK_INTERRUPTIBLE); which is interruptible sleep
freezable_schedule(); which calls schedule() and allows other processes to run
hrtimer_start_expires
hrtimer_start_range_ns
TODO: reach the arch/x86 timing level
TODO: are the above steps done directly in the syscall interrupt handler, or in a regular kernel thread?
A few articles about it:
https://geeki.wordpress.com/2010/10/30/ways-of-sleeping-in-linux-kernel/
http://www.linuxjournal.com/article/8144
