I have developed a C server using gcc and pthreads that receives UDP packets and, depending on the configuration, either drops or forwards them to specific targets. In some cases the packets are untouched and just redirected, in some cases headers in the packet are modified, and in other cases another module of the server modifies every byte of the packet.
To configure this server, there is a GUI written in Java that connects to the C server over TCP (to exchange configuration commands). There can be multiple connected GUIs at the same time.
In order to measure the utilization of the server I have written a kind of monitoring module that starts two separate threads (#2 & #3). The main thread (#1), which does the whole forwarding work, essentially works like this:
struct monitoring_struct data; // contains 2 * uint64_t for start and end time, among other fields

for (;;) {
    recvfrom();
    data.start = current_time();
    modifyPacket();
    sendPacket();              // sometimes to multiple destinations
    data.end = current_time();
    writeDataToPipe();
}
The current_time function:
// give a timestamp with microsecond precision
uint64_t current_time(void)
{
    struct timespec spec;
    clock_gettime(CLOCK_REALTIME, &spec);
    uint64_t ts = (uint64_t) ((((double) spec.tv_sec) * 1.0e6) +
                              (((double) spec.tv_nsec) / 1.0e3));
    return ts;
}
As indicated in the main thread, the data struct is written into a pipe, from which thread #2 reads. Every time there is data to be read from the pipe, thread #2 uses a given aggregation function that stores the data in another place in memory. Thread #3 is a loop that sleeps for ~1 second, then sends out the aggregated values (median, average, min, max, lower quartile, upper quartile, ...) and resets the aggregated data. Threads #2 and #3 are synchronized by mutexes.
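For reference, the handoff is essentially the following (simplified sketch; pipe_fd, agg_lock and aggregate() stand in for the real names, and error handling/headers are omitted):

int pipe_fd[2];           /* created once with pipe() at startup */

/* thread #1, after each packet: one fixed-size record per write */
write(pipe_fd[1], &data, sizeof data);

/* thread #2: blocks on the pipe until thread #1 has written a record */
struct monitoring_struct rec;
while (read(pipe_fd[0], &rec, sizeof rec) == (ssize_t) sizeof rec) {
    pthread_mutex_lock(&agg_lock);
    aggregate(&rec);      /* store into the buffer that thread #3 reads */
    pthread_mutex_unlock(&agg_lock);
}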
The GUI listens to this data (if the monitoring window is open), which is sent out via UDP to listeners (there can be more than one), and the GUI then converts the numbers into diagrams, graphs and "pressure" indicators.
I came up with this because, in my mind, it is the solution that interferes least with thread #1 (assuming it runs on a multicore system, which it always does, and exclusively apart from the OS and maybe SSH).
As performance is critical for my server (version "1.0", with a simpler configuration, was able to manage the maximum number of streams possible over gigabit Ethernet), I would like to ask whether my solution may not be as good as I think it is at keeping the performance hit on thread #1 to a minimum, and whether you think there are better designs for this. At least I am unable to think of another solution that does not either use locks on the data itself (avoiding the pipe, but potentially blocking thread #1) or a shared list implementation using rwlocks, with possible reader starvation.
There are scenarios where packets are larger, but for performance measuring we currently use a mode where 1 stream sends exactly 1000 packets per second. For version 2.0 we currently want to ensure that it can at least handle 12 streams (hence 12,000 packets per second); previously the server was able to manage 84 streams.
In the future I would like to add other milestone timestamps to thread #1, e.g. inside modifyPacket() (there are multiple steps) and before sendPacket().
I have tried tinkering with the current_time() function, mostly trying to remove it and just store the raw value from clock_gettime() to save time, but in my simple test program the current_time() function always beat the plain clock_gettime() variant.
Thanks in advance for any input.
whether you think there are better designs for this
The short answer is to use the Data Plane Development Kit (DPDK) with its design patterns and libraries. It might be quite a learning curve, but in terms of performance it is the best solution at the moment. It is free and open source (BSD license).
A bit more detailed answer:
the data struct is written into a pipe
Since threads #1 and #2 are threads of the same process, it would be much faster to pass data using shared memory, not pipes - just like you already do between threads #2 and #3.
thread #2 uses a given aggregation function that stores the data in another place in memory
Those two threads seem unnecessary. Couldn't thread #2 read the data passed by thread #1, aggregate it, and send it out itself?
I am unable to think of another solution that does not use locks on the data itself
Have a look at the lockless queues which are called "rings" in DPDK. The idea is to have a common circular buffer between threads and use lockless algorithms to enqueue/dequeue to/from the buffer.
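Roughly, a minimal single-producer/single-consumer ring between thread #1 and thread #2 could look like this (a sketch with C11 atomics using the monitoring_struct from the question; this is not DPDK's actual rte_ring code):

#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 1024              /* power of two */

struct ring {
    struct monitoring_struct buf[RING_SIZE];
    _Atomic unsigned head;          /* only written by the producer (thread #1) */
    _Atomic unsigned tail;          /* only written by the consumer (thread #2) */
};

/* producer side: returns false if the ring is full (drop or count the sample) */
static bool ring_enqueue(struct ring *r, const struct monitoring_struct *d)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;               /* full */
    r->buf[head & (RING_SIZE - 1)] = *d;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* consumer side: returns false if the ring is empty */
static bool ring_dequeue(struct ring *r, struct monitoring_struct *d)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;               /* empty */
    *d = r->buf[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

Thread #1 only ever writes head, thread #2 only ever writes tail, so neither side ever blocks on the other; the acquire/release pairs are what make it safe without a mutex.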
For version 2.0 we currently want to ensure that it can at least handle 12 streams (hence 12,000 packets per second); previously the server was able to manage 84 streams.
Measure the performance and find the bottlenecks (it seems you are still not 100% sure what the bottleneck in your code is).
Just for reference, Intel publishes performance reports for DPDK. The reference numbers for L3 forwarding (i.e. routing) are up to 30 million packets per second.
Sure, you might have a less powerful processor and NIC, but a few million packets per second are quite easily reachable using the right techniques.
What would be the correct way to prevent a soft lockup/unresponsiveness in a long-running while loop in a C program?
(dmesg is reporting a soft lockup)
Pseudo code is like this:
while (worktodo) {
    worktodo = doWork();
}
My code is of course way more complex, and also includes a printf statement that is executed once a second to report progress, but the problem is that the program stops responding to Ctrl+C at this point.
Things I've tried which do work (but I want an alternative):
doing printf on every loop iteration (I don't know why, but the program becomes responsive again that way) - this wastes a lot of performance due to unneeded printf calls (each doWork() call does not take very long)
using sleep/usleep/... - also seems like a waste of (processing) time to me, as the whole program will already be running for several hours at full speed
What I'm thinking of is some kind of process_waiting_events() function or the like; normal signals seem to be working fine, as I can use kill from a different shell to stop the program.
Additional background info: I'm using GWAN and my code is running inside the main.c "maintenance script", which seems to be running in the main thread as far as I can tell.
Thank you very much.
P.S.: Yes, I did check all the other threads I found regarding soft lockups, but they all seem to ask why soft lockups occur, while I know the why and want a way of preventing them.
P.P.S.: Optimizing the program (making it run for a shorter time) is not really a solution, as I'm processing a 29 GB bz2 file which extracts to about 400 GB of XML at about 10-40 MB per second on a single thread, so even at maximum speed I would be bound by I/O and it would still run for several hours.
While the posted answer using threads might possibly be an option, it would in reality just shift the problem to a different thread. My solution in the end was to use
sleep(0)
I also tested sched_yield / pthread_yield, neither of which really helped. Unfortunately I've been unable to find a good resource that documents sleep(0) on Linux, but for Windows the documentation states that using a value of 0 lets the thread yield its remaining part of the current CPU slice.
It turns out that sleep(0) most probably relies on what is called timer slack in Linux - an article about this can be found here: http://lwn.net/Articles/463357/
Another possibility is using nanosleep(&(struct timespec){0}, NULL), which does not seem to necessarily rely on timer slack - the Linux man pages for nanosleep state that if the requested interval is below clock granularity, it will be rounded up to clock granularity, which on Linux is based on CLOCK_MONOTONIC according to the man pages. Thus, a value of 0 nanoseconds is perfectly valid and should always work, as clock granularity can never be 0.
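Applied to the loop from the question, that looks like this (sketch; needs <time.h>):

while (worktodo) {
    worktodo = doWork();
    /* give up the CPU briefly; a 0 ns request is valid and is rounded up
       to the clock granularity at most */
    nanosleep(&(struct timespec){0}, NULL);
}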
Hope this helps someone else as well ;)
Your scenario is not really a soft lockup; it is a process that is busy doing something.
How about this approach (the original pseudo code, sketched concretely with POSIX signals and pthreads):
#include <pthread.h>
#include <signal.h>
#include <string.h>

extern int doWork(void);                 /* the questioner's work function */

static volatile sig_atomic_t stopRequested = 0;

static void sighandler(int sig)
{
    (void)sig;
    stopRequested = 1;                   /* signal worker thread to finish */
}

static void *workerThread(void *arg)
{
    (void)arg;
    int workToDo = 1;
    while (workToDo && !stopRequested)
        workToDo = doWork();
    return NULL;
}

int main(void)
{
    struct sigaction sa;                 /* install signal handler */
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sighandler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    pthread_t worker;                    /* start worker thread */
    pthread_create(&worker, NULL, workerThread, NULL);

    pthread_join(worker, NULL);          /* wait for worker thread to finish */
    return 0;
}
Clearly a timing issue; using a signalling mechanism should remove the problem.
The use of printf solves the problem because printf accesses the console, which is an expensive and time-consuming operation; in your case it gives the worker enough time to complete its work.
I'm following an example in the Linux Device Drivers 3rd Edition book:
if (temp == 0)
    wake_up_interruptible_sync(&scull_w_wait); /* awake other uid's */
return 0;
The author states:
Here is an example of where calling wake_up_interruptible_sync makes sense. When we do
the wakeup, we are just about to return to user space, which is a natural scheduling
point for the system. Rather than potentially reschedule when we do the wakeup, it is
better to just call the "sync" version and finish our job.
I don't understand why using wake_up_interruptible_sync is better in this situation. The author implies that this call will prevent a reschedule -- which it does prevent within the call -- but after wake_up_interruptible_sync returns, couldn't another thread just take control of the CPU anyway before the return 0 line?
So what is the difference between calling wake_up_interruptible_sync over the typical wake_up_interruptible if a thread can take control of the CPU regardless after each call?
The reason for using _sync is that we know that the scheduler will run within a short time, so we do not need to run it a second time.
However, this is just an optimization; if the scheduler did run again, nothing bad would happen.
A timer interrupt can indeed happen at any time, but it would be needed only if the scheduler did not already run recently for some other reason.
I am trying to implement my own new schedule(). I want to debug my code.
Can I use the printk function in sched.c?
I used printk but it doesn't work. What am I missing?
Do you know how often schedule() is called? It's probably called faster than your computer can flush the print buffer to the log. I would suggest using another method of debugging, for instance running your kernel in QEMU and using remote GDB by loading the kernel.syms file as a symbol table and setting a breakpoint. Other virtualization software offers similar features. Or do it the manual way and walk through your code. Using printk in interrupt handlers is typically a bad idea (unless you're about to panic or stall).
If the error you are seeing doesn't happen often, think about using BUG() or BUG_ON(cond) instead. These produce conditional error messages and shouldn't fire as often as an unconditional printk.
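For example (purely illustrative; 'next' here is whatever task pointer your scheduling code just picked):

/* Oopses with a full stack trace only when the condition is true,
   instead of printing on every invocation. */
BUG_ON(next == NULL);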
Editing the schedule() function itself is typically a bad idea (unless you want to support multiple run queues etc.). It's much better and easier to modify a scheduler class instead. Look at the code of the CFS scheduler to see how this is done. If you want to accomplish something else, I can give better advice.
It's not safe to call printk while holding the runqueue lock. A special function, printk_sched, was introduced in order to have a mechanism to use printk while holding the runqueue lock (https://lkml.org/lkml/2012/3/13/13). Unfortunately it can only print one message per tick (and there cannot be more than one tick while holding the runqueue lock, because interrupts are disabled). This is because an internal buffer is used to save the message.
You can either use lttng2 for logging to user space or patch the implementation of printk_sched to use a statically allocated pool of buffers that can be used within a tick.
Try trace_printk().
printk() has too much overhead, and schedule() gets called again before previous printk() calls finish. This creates a livelock.
Here is a good article about it: https://lwn.net/Articles/365835/
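trace_printk() takes the same format arguments as printk() but writes into the ftrace ring buffer instead of the console; for example (assuming prev/next are the task_struct pointers in your schedule() code):

trace_printk("switching: prev=%d next=%d\n", prev->pid, next->pid);

The output can then be read back with cat /sys/kernel/debug/tracing/trace.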
It depends; basically it should work fine.
Try using dmesg in a shell to trace your printk; if the output is not there, you apparently didn't invoke it.
For example, sched.c itself calls printk, rate-limited, in places like this:
    if (p->mm && printk_ratelimit()) {
        printk(KERN_INFO "process %d (%s) no longer affine to cpu%d\n",
               task_pid_nr(p), p->comm, cpu);
    }

    return dest_cpu;
}
On the other hand, there are sections in sched.c where printk doesn't work, e.g.:
static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
{
    if (unlikely(!irqs_disabled())) {
        /* printk() doesn't work good under rq->lock */
        raw_spin_unlock(&this_rq->lock);
        BUG_ON(1);
    }

    return _double_lock_balance(this_rq, busiest);
}
EDIT
You may also try calling printk only once every 1000 invocations instead of every time.
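For example, with a static counter (sketch):

static unsigned long count;

if (++count % 1000 == 0)
    printk(KERN_DEBUG "schedule() called %lu times\n", count);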
Now there's something I always wondered: how is sleep() implemented ?
If it is all about using an API from the OS, then how is the API made?
Does it all boil down to using special machine code on the CPU? Does that CPU need a special co-processor or other gizmo without which you can't have sleep()?
The best-known incarnation of sleep() is in C (to be more accurate, in the libraries that come with C compilers, such as GNU's libc), although almost every language today has its equivalent; but the implementation of sleep in some languages (think Bash) is not what we're looking at in this question...
EDIT: After reading some of the answers, I see that the process is placed in a wait queue. From there, I can guess two alternatives, either
a timer is set so that the kernel wakes the process at the due time, or
whenever the kernel is allowed a time slice, it polls the clock to check whether it's time to wake a process.
The answers only mention alternative 1. Therefore I ask: how does this timer behave? If it's a simple interrupt that makes the kernel wake the process, how can the kernel ask the timer to "wake me up in 140 milliseconds so I can put the process in the running state"?
The "update" to question shows some misunderstanding of how modern OSs work.
The kernel is not "allowed" a time slice. The kernel is the thing that gives out time slices to user processes. The "timer" is not set to wake the sleeping process up - it is set to stop the currently running process.
In essence, the kernel attempts to fairly distribute the CPU time by stopping processes that are on CPU too long. For a simplified picture, let's say that no process is allowed to use the CPU more than 2 milliseconds. So, the kernel would set timer to 2 milliseconds, and let the process run. When the timer fires an interrupt, the kernel gets control. It saves the running process' current state (registers, instruction pointer and so on), and the control is not returned to it. Instead, another process is picked from the list of processes waiting to be given CPU, and the process that was interrupted goes to the back of the queue.
The sleeping process is simply not in the queue of things waiting for the CPU. Instead, it's stored in the sleep queue. Whenever the kernel gets a timer interrupt, the sleep queue is checked, and the processes whose time has come get transferred to the "waiting for CPU" queue.
This is, of course, a gross simplification. It takes very sophisticated algorithms to ensure security, fairness, balance and prioritization, to prevent starvation, and to do it all fast and with a minimum amount of memory used for kernel data.
There's a kernel data structure called the sleep queue. It's a priority queue. Whenever a process is added to the sleep queue, the expiration time of the most-soon-to-be-awakened process is calculated, and a timer is set. At that time, the expired job is taken off the queue and the process resumes execution.
(amusing trivia: in older unix implementations, there was a queue for processes for which fork() had been called, but for which the child process had not been created. It was of course called the fork queue.)
HTH!
Perhaps the major job of an operating system is to hide the complexity of a real piece of hardware from the application writer. Hence, any description of how the OS works runs the risk of getting really complicated, really fast. Accordingly, I am not going to deal with all the "what ifs" and "yeah buts" that a real operating system needs to deal with. I'm just going to describe, at a high conceptual level, what a process is, what the scheduler does, and how the timer queue works. Hopefully this is helpful.
What's a process:
Think of a process--let's just talk about processes, and get to threads later--as "the thing the operating system schedules". A process has an ID--think an integer--and you can think of that integer as an index into a table containing all the context of that process.
Context is the hardware information--registers, memory management unit contents, other hardware state--that, when loaded into the machine, will allow the process to "go". There are other components of context--lists of open files, state of signal handlers, and, most importantly here, things the process is waiting for.
Processes spend a lot of time sleeping (a.k.a. waiting)
A process spends much of its time waiting. For example, a process that reads or writes to disk will spend a lot of time waiting for the data to arrive or be acknowledged to be out on disk. OS folks use the terms "waiting" and "sleeping" (and "blocked") somewhat interchangeably--all meaning that the process is awaiting something to happen before it can continue on its merry way. It is just confusing that the OS API sleep() happens to use underlying OS mechanisms for sleeping processes.
Processes can be waiting for other things: network packets to arrive, window selection events, or a timer to expire, for example.
Processes and Scheduling
Processes that are waiting are said to be non-runnable. They don't go onto the run queue of the operating system. But when the event the process is waiting for occurs, the operating system moves the process from the non-runnable to the runnable state. At the same time, the operating system puts the process on the run queue, which is really not a queue--it's more of a pile of all the processes which, should the operating system decide to do so, could run.
Scheduling:
the operating system decides, at regular intervals, which processes should run. The algorithm by which the operating system decides to do so is called, somewhat unsurprisingly, the scheduling algorithm. Scheduling algorithms range from dead-simple ("everybody gets to run for 10 ms, and then the next guy on the queue gets to run") to far more complicated (taking into account process priority, frequency of execution, run-time deadlines, inter-process dependencies, chained locks and all sorts of other complicated subject matter).
The Timer Queue
A computer has a timer inside it. There are many ways this can be implemented, but the classic manner is called a periodic timer. A periodic timer ticks at a regular interval--in most operating systems today, I believe this rate is 100 times per second--100 Hz--every 10 milliseconds. I'll use that value in what follows as a concrete rate, but know that most operating systems worth their salt can be configured with different ticks--and many don't use this mechanism and can provide much better timer precision. But I digress.
Each tick results in an interrupt to the operating system.
When the OS handles this timer interrupt, it increments its idea of system time by another 10 ms. Then, it looks at the timer queue and decides what events on that queue need to be dealt with.
The timer queue really is a queue of "things which need to be dealt with", which we will call events. This queue is ordered by time of expiration, soonest events first.
An "event" can be something like, "wake up process X", or "go kick disk I/O over there, because it may have gotten stuck", or "send out a keepalive packet on that fibrechannel link over there". Whatever the operating system needs to have done.
When you have a queue ordered in this way, it's easy to manage the dequeuing. The OS simply looks at the head of the queue, and decrements the "time to expiration" of the event by 10 ms every tick. When the expiration time goes to zero, the OS dequeues that event, and does whatever is called for.
In the case of a sleeping process, it simply makes the process runnable again.
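A sketch of that dequeuing logic (illustrative data structures only, not real kernel code):

struct timer_event {
    int ms_to_expiry;                 /* time left after the entry in front (a "delta" queue) */
    void (*handler)(struct timer_event *);
    struct timer_event *next;
};

static struct timer_event *timer_queue_head;  /* sorted: soonest first */
static unsigned long system_time_ms;

/* Called from the periodic timer interrupt, once every 10 ms tick. */
void on_timer_tick(void)
{
    system_time_ms += 10;                     /* advance the OS's idea of time */

    if (timer_queue_head != NULL)
        timer_queue_head->ms_to_expiry -= 10; /* only the head needs updating */

    /* Pop every event whose time has now come (several can expire together). */
    while (timer_queue_head && timer_queue_head->ms_to_expiry <= 0) {
        struct timer_event *ev = timer_queue_head;
        timer_queue_head = ev->next;          /* dequeue */
        ev->handler(ev);                      /* e.g. make process X runnable again */
    }
}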
Simple, huh?
There are at least two different levels at which to answer this question (and a lot of other things that get confused with it, which I won't touch).
At the application level, this is what the C library does. It's a simple OS call: it simply tells the OS not to give CPU time to this process until the time has passed. The OS keeps a queue of suspended applications, and some info about what they are waiting for (usually either a time, or some data to appear somewhere).
At the kernel level, when the OS doesn't have anything to do right now, it executes a 'hlt' instruction. This instruction doesn't do anything, but it never finishes by itself; a hardware interrupt, of course, is serviced normally and ends it. Put simply, the main loop of an OS looks like this (from very, very far away):
allow_interrupts();
while (true) {
    hlt;
    check_todo_queues();
}
The interrupt handlers simply add things to the todo queues. The real-time clock is programmed to generate interrupts either periodically (at a fixed rate) or at some fixed time in the future, when the next process wants to be woken.
A multitasking operating system has a component called a scheduler. This component is responsible for giving CPU time to threads; calling sleep tells the OS not to give CPU time to this thread for some amount of time.
see http://en.wikipedia.org/wiki/Process_states for complete details.
I don't know anything about Linux, but I can tell you what happens on Windows.
Sleep() causes the process's time slice to end immediately and returns control to the OS. The OS then sets up a timer kernel object that gets signaled after the time elapses. The OS will not give that process any more time until the kernel object is signaled. Even then, if other processes have higher or equal priority, it may still wait a little while before letting the process continue.
Special CPU machine code is used by the OS to do process switching. Those functions cannot be accessed by user-mode code, so they are accessed strictly by API calls into the OS.
Essentially, yes, there is a "special gizmo" - and it's important for a lot more than just sleep().
Classically, on x86 this was an Intel 8253 or 8254 "Programmable Interval Timer". In the early PCs, this was a separate chip on the motherboard that could be programmed by the CPU to assert an interrupt (via the "Programmable Interrupt Controller", another discrete chip) after a preset time interval. The functionality still exists, although it is now a tiny part of a much larger chunk of motherboard circuitry.
The OS today still programs the PIT to wake it up regularly (in recent versions of Linux, once every millisecond by default), and this is how the kernel is able to implement pre-emptive multitasking.
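For the curious, programming the classic PIT is only a handful of port writes. This is the textbook OSDev-style sequence as a sketch, not what any particular kernel actually ships:

/* x86 port output; GCC/Clang inline assembly */
static inline void outb(unsigned short port, unsigned char val)
{
    __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
}

/* Channel 0 of the 8253/8254 is clocked at 1,193,182 Hz; loading a divisor
   makes it raise IRQ0 every divisor/1193182 seconds. */
static void pit_set_frequency(unsigned int hz)
{
    unsigned int divisor = 1193182 / hz;   /* e.g. hz = 1000 -> ~1 ms ticks */

    outb(0x43, 0x36);                      /* channel 0, lobyte/hibyte, mode 3 */
    outb(0x40, divisor & 0xFF);            /* low byte of divisor */
    outb(0x40, (divisor >> 8) & 0xFF);     /* high byte of divisor */
}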
glibc 2.21 Linux
Forwards to the nanosleep system call.
glibc is the default implementation for the C stdlib on most Linux desktop distros.
How to find it: the first reflex is:
git ls-files | grep sleep
This contains:
sysdeps/unix/sysv/linux/sleep.c
and we know that:
sysdeps/unix/sysv/linux/
contains the Linux specifics.
On the top of that file we see:
/* We are going to use the `nanosleep' syscall of the kernel.  But the
   kernel does not implement the stupid SysV SIGCHLD vs. SIG_IGN
   behaviour for this syscall.  Therefore we have to emulate it here.  */
unsigned int
__sleep (unsigned int seconds)
So if you trust comments, we are basically done.
At the bottom:
weak_alias (__sleep, sleep)
which basically says __sleep == sleep. The function uses nanosleep through:
result = __nanosleep (&ts, &ts);
After grepping:
git grep nanosleep | grep -v abilist
we get a small list of interesting occurrences, and I think __nanosleep is defined in:
sysdeps/unix/sysv/linux/syscalls.list
on the line:
nanosleep - nanosleep Ci:pp __nanosleep nanosleep
which is some super DRY magic format parsed by:
sysdeps/unix/make-syscalls.sh
Then from the build directory:
grep -r __nanosleep
This leads us to sysd-syscalls, which is what make-syscalls.sh generates; it contains:
#### CALL=nanosleep NUMBER=35 ARGS=i:pp SOURCE=-
ifeq (,$(filter nanosleep,$(unix-syscalls)))
unix-syscalls += nanosleep
$(foreach p,$(sysd-rules-targets),$(foreach o,$(object-suffixes),$(objpfx)$(patsubst %,$p,nanosleep)$o)): \
$(..)sysdeps/unix/make-syscalls.sh
$(make-target-directory)
(echo '#define SYSCALL_NAME nanosleep'; \
echo '#define SYSCALL_NARGS 2'; \
echo '#define SYSCALL_SYMBOL __nanosleep'; \
echo '#define SYSCALL_CANCELLABLE 1'; \
echo '#include <syscall-template.S>'; \
echo 'weak_alias (__nanosleep, nanosleep)'; \
echo 'libc_hidden_weak (nanosleep)'; \
) | $(compile-syscall) $(foreach p,$(patsubst %nanosleep,%,$(basename $(@F))),$($(p)CPPFLAGS))
endif
It looks like part of a Makefile. git grep sysd-syscalls shows that it is included at:
sysdeps/unix/Makefile:23:-include $(common-objpfx)sysd-syscalls
compile-syscall looks like the key part, so we find:
# This is the end of the pipeline for compiling the syscall stubs.
# The stdin is assembler with cpp using sysdep.h macros.
compile-syscall = $(COMPILE.S) -o $@ -x assembler-with-cpp - \
$(compile-mkdep-flags)
Note that -x assembler-with-cpp is a gcc option.
This #defines parameters like:
#define SYSCALL_NAME nanosleep
and then uses them in:
#include <syscall-template.S>
OK, this is as far as I will go down the macro expansion rabbit hole for now.
I think this then generates the posix/nanosleep.o file, which must be linked in together with everything else.
Linux 4.2 x86_64 nanosleep syscall
Uses the scheduler: it's not a busy sleep.
Search ctags:
sys_nanosleep
Leads us to kernel/time/hrtimer.c:
SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
                struct timespec __user *, rmtp)
hrtimer stands for High Resolution Timer. From there the main call chain looks like:
hrtimer_nanosleep
do_nanosleep
set_current_state(TASK_INTERRUPTIBLE); which is interruptible sleep
freezable_schedule(); which calls schedule() and allows other processes to run
hrtimer_start_expires
hrtimer_start_range_ns
TODO: reach the arch/x86 timing level
TODO: are the above steps done directly in the syscall interrupt handler, or in a regular kernel thread?
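A quick way to watch this whole path from userland (assuming a stock glibc; depending on the version, strace will show either nanosleep or clock_nanosleep):

/* sleep_demo.c - build with: gcc sleep_demo.c -o sleep_demo
   then run: strace ./sleep_demo   to watch sleep() turn into the syscall above */
#include <unistd.h>

int main(void)
{
    sleep(1);
    return 0;
}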
A few articles about it:
https://geeki.wordpress.com/2010/10/30/ways-of-sleeping-in-linux-kernel/
http://www.linuxjournal.com/article/8144