I have a very basic question.
When a process is waiting on a semaphore, it goes into a sleep state, so there is no way for it to poll the semaphore value itself.
Does the kernel poll the semaphore value and, when it becomes available, send a signal to all processes waiting for it? If so, wouldn't that be too much overhead for the kernel?
Or does the signal() call internally notify all the processes waiting for the semaphore?
Please let me know.
The operating system schedules the process once more when another process tells the operating system that it is done with the semaphore.
Semaphores are just one of the ways of interacting with the OS scheduler.
The kernel doesn't poll the semaphore; it doesn't need to. Every time a process calls sem_post() (or equivalent), that involves interaction with the kernel. What the kernel does during the sem_post() is look up whatever processes have previously called sem_wait() on the same semaphore. If one or more processes have called sem_wait(), it picks the one with the highest priority and schedules it. This shows up as that process's sem_wait() finally returning, and that process carries on executing.
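As a rough sketch of that interaction using a POSIX unnamed semaphore (names and the "work" involved are illustrative only; error handling omitted):

    #include <semaphore.h>
    #include <pthread.h>
    #include <stdio.h>

    static sem_t sem;                      /* counting semaphore, initial value 0 */

    static void *waiter(void *arg)
    {
        (void)arg;
        sem_wait(&sem);                    /* blocks: the kernel sleeps this thread, no polling */
        printf("woken up, doing work\n");
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        sem_init(&sem, 0, 0);              /* pshared = 0: shared between threads of this process */
        pthread_create(&t, NULL, waiter, NULL);

        /* ... produce something for the waiter ... */
        sem_post(&sem);                    /* kernel finds the sleeping waiter and schedules it */

        pthread_join(t, NULL);
        sem_destroy(&sem);
        return 0;
    }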
How This is Implemented Under the Hood
Fundamentally the kernel needs to implement something called an "atomic test and set". That is an operation whereby the value of some variable can be tested and, if a certain condition is met (such as value == 0), the variable's value is altered (e.g. value = 1). If this succeeds, the kernel will do one thing (like schedule a process); if it does not (because the condition value == 0 was false), the kernel will do something different (like put the process on the do-not-schedule list). The 'atomic' part is that this decision is made without anything else being able to look at and change the same variable at the same time.
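A minimal user-space sketch of the same idea using C11 atomics (the kernel's internal machinery is different; this only shows the shape of an atomic test-and-set):

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_int lock_word = 0;   /* 0 = free, 1 = taken */

    /* Try to take the lock: atomically test "is it 0?" and, if so, set it to 1.
     * Nothing else can observe or modify lock_word between the test and the set. */
    static bool try_acquire(void)
    {
        int expected = 0;
        return atomic_compare_exchange_strong(&lock_word, &expected, 1);
    }

    static void release(void)
    {
        atomic_store(&lock_word, 0);
    }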
There are several ways of doing this. One is to suspend all processes (or at least all activity within the kernel) so that nothing else is testing the value of the variable at the same time. That's not very fast.
For example, the Linux kernel once had something called the Big Kernel Lock. I don't know if this was used to process semaphore interactions, but that's the kind of thing that OSes used to have for atomic test & sets.
These days CPUs have atomic test & set op codes, which is a lot faster. The good ole' Motorola 68000 had one of these a long time ago; it took CPUs like the PowerPC and the x86 many, many years to get the same kind of instruction.
If you root around inside Linux you'll find mention of futexes. A futex is a "fast userspace mutex"; it relies on a CPU's atomic test/set instruction to implement a fast mutex semaphore.
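For the curious, here is a heavily stripped-down sketch of what a futex-based wait/post can look like from user space (glibc provides no futex() wrapper, so the raw syscall is used; error handling and memory-ordering subtleties are glossed over, so treat this as an illustration rather than production code):

    #define _GNU_SOURCE
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdatomic.h>

    static atomic_int futex_word = 0;      /* 0 = nothing posted, 1 = posted */

    static void futex_wait_until_posted(void)
    {
        int expected = 1;
        /* Fast path: consume a pending post entirely in user space (CAS 1 -> 0). */
        while (!atomic_compare_exchange_strong(&futex_word, &expected, 0)) {
            /* Slow path: ask the kernel to sleep us, but only if the word is still 0. */
            syscall(SYS_futex, &futex_word, FUTEX_WAIT, 0, NULL, NULL, 0);
            expected = 1;
        }
    }

    static void futex_post(void)
    {
        atomic_store(&futex_word, 1);
        syscall(SYS_futex, &futex_word, FUTEX_WAKE, 1, NULL, NULL, 0);   /* wake one waiter */
    }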
Post a Semaphore in Hardware
A variation is a mailbox semaphore. This is a special variation on a semaphore that is extremely useful in some types of system where hardware needs to wake up a process at the end of a DMA transfer. A mailbox is a special location in memory which, when written to, will cause an interrupt to be raised. This can be turned into a semaphore by the kernel because, when that interrupt is raised, the kernel goes through the same motions as it would had something called sem_post().
This is incredibly handy; a device can DMA a large amount of data into some pre-arranged buffer, and top that off with a small DMA transfer to the mailbox. The kernel handles the interrupt, and if a process has previously called sem_wait() on the mailbox semaphore, the kernel schedules it. The process, which also knows about the pre-arranged buffer, can then process the data.
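In an RTOS this typically amounts to a few lines in the interrupt handler. The sketch below uses VxWorks-style semLib names (semGive() / semTake()) purely as an illustration of the pattern; the actual wiring of the mailbox interrupt is hardware-specific:

    #include <semLib.h>

    /* Created elsewhere, e.g. semBCreate(SEM_Q_PRIORITY, SEM_EMPTY). */
    SEM_ID dma_done_sem;

    /* Interrupt handler for the DMA-complete / mailbox interrupt.
     * semGive() is safe to call from interrupt context and readies the pended task. */
    void dma_complete_isr(void)
    {
        semGive(dma_done_sem);
    }

    void processing_task(void)
    {
        for (;;) {
            semTake(dma_done_sem, WAIT_FOREVER);   /* sleeps until the ISR gives the semaphore */
            /* ...process the pre-arranged DMA buffer... */
        }
    }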
On real-time DSP systems this is very useful, because it's very fast and very low latency; it allows a process to receive data from some device with very little delay. The alternative, a full-up device driver stack that uses read() / write() to transfer data from the device to the process, is incredibly slow by comparison.
Speed
The speed of semaphore interactions depends entirely on the OS.
For OSes like Windows and Linux, the context switch time is fairly slow (in the order of several microseconds, if not tens of microseconds). Basically this means that when a process calls something like sem_post(), the kernel is doing a lot of different things whilst it has the opportunity before finally returning control to the process(es). What it's doing during this time could be, well, almost anything!
If a program has made use of a lot of threads, and they're all rapidly interacting between themselves using semaphores, quite a lot of time is lost to the sem_post() and sem_wait() calls. This places an emphasis on doing a decent amount of work once a process has returned from sem_wait() before calling the next sem_post().
However, on OSes like VxWorks, the context switch time is lightning fast. That is, there's very little code in the kernel that gets run when sem_post() is called. The result is that a semaphore interaction is a lot more efficient. Moreover, an OS like VxWorks is written in such a way as to guarantee that the time taken to do all this sem_post() / sem_wait() work is constant.
This influences the architecture of one's software on these systems. On VxWorks, where a context switch is cheap, there's very little penalty in having a large number of threads all doing quite small tasks. On Windows / Linux there's more of an emphasis on the opposite.
This is why OSes like VxWorks are excellent for hard real time applications, and Windows / Linux are not.
The Linux PREEMPT_RT patch set in part aims to improve the latency of the Linux kernel during operations like this. For example, it pushes a lot of device interrupt handlers (device drivers) up into kernel threads, which are scheduled almost like any other thread. The idea is to reduce the amount of work done directly by the kernel (and have more done by kernel threads), so that the work it still has to do itself (such as handling sem_post() / sem_wait()) takes less time and is more consistent in how long it takes. It's still not a hard guarantee of latency, but it's a pretty good improvement. This is what we call a soft-realtime kernel. The impact, though, is that the overall throughput of the machine can be lower.
Signals
Signals are nasty, horrible things that really get in the way of using things like sem_post() and sem_wait(). I avoid them like the plague.
If you are on a Linux platform and you do have to use signals, take a serious, long look at signalfd (man page). This is a far better way of dealing with signals because you can choose to accept them at a convenient time (simply by calling read()), instead of having to handle them as soon as they occur. Certainly if you're using epoll() or select() anywhere at all in a program, then signalfd is the way to go.
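A minimal sketch of the signalfd pattern (handling just SIGINT, with error checks omitted):

    #include <sys/signalfd.h>
    #include <signal.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        sigset_t mask;
        sigemptyset(&mask);
        sigaddset(&mask, SIGINT);

        /* Block normal delivery so the signal is queued on the fd instead. */
        sigprocmask(SIG_BLOCK, &mask, NULL);

        int sfd = signalfd(-1, &mask, 0);

        struct signalfd_siginfo si;
        /* The signal is consumed here, at a time of our choosing; the fd can also
         * be handed to epoll()/select() alongside sockets, pipes, etc. */
        read(sfd, &si, sizeof(si));
        printf("got signal %u\n", si.ssi_signo);

        close(sfd);
        return 0;
    }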
Related
pthread_yield is documented as "causes the calling thread to relinquish the CPU", but on a modern OS/scheduler, the relinquishing of the CPU happens automatically at the appropriate times (i.e. whenever the thread calls a blocking operation, and/or when the thread's quantum has expired). Is pthread_yield() therefore vestigial/useless except in the special case of running under a co-operative-only task scheduler? Or are there some use-cases where calling it would still be correct/useful even under a modern pre-emptive scheduler?
pthread_yield() gives you a chance to do a short sleep -- not a timed sleep. You relinquish the remainder of your time slice to some other thread or process, but you don't put the thread in a wait queue.
Also, a while ago I read about how schedulers prioritize interactive processes. These are the processes the user interacts with directly, and they are the ones whose sluggishness you feel most (you have less of a feeling of your system being slow if your UI is responsive). One of the properties of interactive processes is that they have little to do and mostly don't use their entire time slice. So if a process keeps yielding before its time slice is up, the scheduler assumes it is interactive and boosts its priority. There were exploits that used this trick to effectively use 99% of the CPU while showing the offending process as being at 0%.
In concurrent code in my workplace, there are several occurrences of nanosleep() or usleep() with a non-zero constant to free up the CPU without relying on futex(), or a sleeping synchronization primitive to put the thread to sleep (for instance, when waiting for an element from a concurrent queue). The code claims to prevent pathological cases where threads consume CPU without doing any actual work when other threads are available to get scheduled on that CPU. This sounds reasonable by itself assuming the cooperation between the sleep functions and the kernel thread scheduler is correct.
Is there a concept in Linux where a minimum duration passed to nanosleep(), usleep(), et al. is known to put the calling thread to sleep and run another thread in its place on the same core when cores are oversubscribed? And if the duration is smaller than that, does the thread not actually yield the CPU but continue spinning instead? This forms the basis of the constant passed to the sleep() functions in order to make them behave like a coarse yield.
I realize that sched_yield() is probably better suited to what the code is doing; I just wanted to educate myself on the behavior of the Linux sleep() functions before benchmarking a replacement for, or improvement on, the existing code.
Thanks!
The man page makes it clear that it no longer busy-waits.
In order to support applications requiring much more precise pauses (e.g., in order to control some time-critical hardware), nanosleep() would handle pauses of up to 2 milliseconds by busy waiting with microsecond precision when called from a thread scheduled under a real-time policy like SCHED_FIFO or SCHED_RR. This special extension was removed in kernel 2.5.39, and is thus not available in Linux 2.6.0 and later kernels.
#stark has answered your question as written, but to elaborate, don't do that. If you're waiting for an event to happen, perform an operation that waits for the event, like pthread_cond_wait, sem_wait, poll, read, etc., rather than sleeping and retrying. This will avoid wasting lots of CPU time, and it also discourages erroneous programming models full of data races (because normally the same primitive that waits also ensures exclusive access/synchronization).
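For example, waiting for items on a shared queue with a condition variable rather than a sleep-and-retry loop might look roughly like this (the queue itself is reduced to a counter purely for illustration):

    #include <pthread.h>

    static pthread_mutex_t lock     = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;
    static int queue_len = 0;              /* stand-in for a real queue */

    void producer_push(void)
    {
        pthread_mutex_lock(&lock);
        queue_len++;                       /* ...enqueue an item... */
        pthread_cond_signal(&nonempty);    /* wake one waiting consumer */
        pthread_mutex_unlock(&lock);
    }

    void consumer_pop(void)
    {
        pthread_mutex_lock(&lock);
        while (queue_len == 0)             /* loop guards against spurious wakeups */
            pthread_cond_wait(&nonempty, &lock);
        queue_len--;                       /* ...dequeue an item... */
        pthread_mutex_unlock(&lock);
    }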
I have a little paging problem on my realtime system, and wanted to know how exactly Linux should behave in my particular case.
Among various other things, my application spawns 2 threads using pthread_create(), which operate on a set of shared buffers.
The first thread, let's call it A, reads data from a device, performs some calculations on it, and writes the results into one of the buffers.
Once that buffer is full, thread B will read all the results and send them to a PC via ethernet, while thread A writes into the next buffer.
I have noticed that each time thread A starts writing into a previously unused buffer, I miss some interrupts and lose data (there is an ID in the header of each packet, and if it increments by more than one, I have missed interrupts).
So if I use n buffers, I get exactly n bursts of missed interrupts at the start of my data acquisition (therefore the problem is definitely caused by paging).
To fix this, I used mlock() and memset() on all of the buffers to make sure they are actually paged in.
This fixed my problem, but I was wondering where in my code the correct place to do this would be: in my main application, or in one/both of the threads? (Currently I do it in both threads.)
According to the libc documentation (section 3.4.2 "Locked Memory Details"), memory locks are not inherited by child processes created using fork().
So what about pthreads? Do they behave the same way, or would they inherit those locks from my main process?
Some background information about my system, even though I don't think it matters in this particular case:
It is an embedded system powered by a SoC with a dual-core Cortex-A9 running Linux 4.1.22 with PREEMPT_RT.
The interrupt frequency is 4kHz
The thread priorities (as shown in htop) are -99 for the interrupt, -98 for thread A (both of which are higher than the standard priority of -51 for all other interrupts) and -2 for thread B
EDIT:
I have done some additional tests, calling my page locking function from different threads (and in main).
If I lock the pages in main(), and then try to lock them again in one of the threads, I would expect to see a large number of page faults for main() but no page faults for the thread itself (because the pages should already be locked). However, htop tells a different story: I see a large number of page faults (MINFLT column) for each and every thread that locks those pages.
To me, that would suggest that pthreads actually do have the same limitation as child processes spawned using fork(). And if this is the case, locking them in both threads (but not in main) would be the correct procedure.
Threads share the same memory management context. If a page is resident for one thread, it's resident for all threads in the same process.
The implication of this is that memory locking is per-process, not per-thread.
You are probably still seeing minor faults on the first write because a fault is used to mark the page dirty. You can avoid this by also writing to each page after locking.
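Something along these lines, run once in main() before the threads are spawned, should therefore be sufficient (buffer sizes and names here are placeholders, and error handling is omitted):

    #include <sys/mman.h>
    #include <string.h>

    #define NUM_BUFFERS 4                 /* hypothetical values */
    #define BUFFER_SIZE (1024 * 1024)

    static char buffers[NUM_BUFFERS][BUFFER_SIZE];

    void lock_and_touch_buffers(void)
    {
        /* Lock all current and future mappings of the process into RAM... */
        mlockall(MCL_CURRENT | MCL_FUTURE);

        /* ...and write to every page so the first real write in thread A
         * doesn't take a minor fault just to mark the page dirty. */
        memset(buffers, 0, sizeof(buffers));
    }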
When a program is doing I/O, my understanding is that the thread will briefly sleep and then resume (e.g. when writing to a file). My question is: when we print using printf(), does a C program's thread sleep in any way?
Since you've specifically asked for printf(), I'm going to assume that you mean in the most generic way where it will fill a reasonably sized buffer and invoke the system call write(2) to stdout and that the stdout happens to point to your terminal.
In most operating systems, when you invoke certain system calls the calling thread/process is removed from CPU runnable list and placed in a separate waiting list. This is true for all I/O calls like read/write/etc. Being temporarily removed from processing due to I/O is not the same as being put to sleep via a timer.
For example, in Linux there's an uninterruptible sleep state of a thread/process specifically meant for I/O waiting, while the interruptible sleep state is for threads/processes waiting on timers and events. Though from a dumb user's perspective they both seem to be the same, their implementations behind the scenes are significantly different.
To answer your question, a call to printf() isn't exactly sleeping; it is waiting for the buffer to be flushed to the device rather than actually being in a timed sleep. Even then there are a few more quirks, which you can read about in signal(7), and you can read even more about the various process/thread states on Marek's blog.
Hope this helps.
Much of the point of stdio.h is that it buffers I/O: a call to printf will often simply put text into a memory buffer (owned by the library by default) and perform zero system calls, thus offering no opportunity to yield the CPU. Even when something like write(2) is called, the thread may continue running: the kernel can copy the data into kernel memory (from which it will be transferred to the disk later, e.g. by DMA) and return immediately.
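You can see this by picking the buffering mode yourself; with full buffering, nothing reaches write(2) until the buffer fills or is explicitly flushed (easy to confirm by running the program under strace, for instance). A small sketch:

    #include <stdio.h>

    int main(void)
    {
        static char buf[8192];
        setvbuf(stdout, buf, _IOFBF, sizeof(buf));  /* fully buffered, even on a terminal */

        for (int i = 0; i < 100; i++)
            printf("line %d\n", i);   /* no system call here, just a copy into buf */

        fflush(stdout);               /* a single write(2) happens roughly here */
        return 0;
    }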
Of course, even on a single-core system, most operating systems frequently interrupt the running thread in order to share it. So another thread can still run at any time, even if no blocking calls are made.
I'm implementing a boss/worker design pattern using pthreads on Linux. I want to have a boss thread that constantly checks for work, and if there is work, then wakes up a sleeping worker to do the work. My question is: what type of IPC synchronization/mechanism should I use to achieve the least latency between my boss thread handing off to my worker, and my worker waking up?
The easy solution is to use pthread condition variables and call pthread_cond_signal in the boss thread and pthread_cond_wait in each of the worker threads, but I'm wondering:
Is there something faster that I can use to implement the blocking and signaling? For example, how would using pipes between the boss and worker threads fare?
How can I measure the performance of one type of IPC versus another? For example, I see benchmarks for pipe()'s and fork()'s, but nothing for using pipe()'s as an interthread communication mechanism.
Let me know if I can clarify anything in my questions!
EDIT
As an example of how I would use pipe()'s to implement blocking between my worker and boss threads, the worker thread would read() a pipe, and since it's empty would then block on that read call until the boss calls write() on it.
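For what it's worth, that pipe-based handoff might look something like this (one byte per wakeup; error handling and setup details omitted):

    #include <unistd.h>
    #include <pthread.h>

    static int wake_pipe[2];          /* wake_pipe[0] = read end, wake_pipe[1] = write end */

    static void *worker(void *arg)
    {
        (void)arg;
        char token;
        for (;;) {
            read(wake_pipe[0], &token, 1);   /* blocks in the kernel until the boss writes */
            /* ...do the unit of work... */
        }
        return NULL;
    }

    static void boss_hand_off_work(void)
    {
        char token = 'w';
        write(wake_pipe[1], &token, 1);      /* wakes one blocked reader */
    }

    /* somewhere in setup: pipe(wake_pipe); pthread_create(..., worker, NULL); */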
The glibc implementation of pthreads uses the low-level "futex" locks to implement pthread_cond_wait() / pthread_cond_signal(). Futexes were designed to be a fast synchronisation primitive, so these are likely to outperform pipes or similar methods (at the very least, using pipes requires copying a byte to and from kernel space that isn't needed for futexes).
If pthread_cond_wait() / pthread_cond_signal() map well onto your problem (and it sounds like they do), then the only way to outperform them is likely to be to implement something on futexes yourself (for example, you could eliminate the handling of thread cancellation if you do not use that).
It is probably worthwhile benchmarking your application - unless your work units are very small indeed, then the condition variable wakeup latency is unlikely to dominate.
What you should do first is be sure you need something faster. Since pthread signaling is implemented using futexes, where "futex" stands for fast userspace mutex, I don't think you can outperform them.
If you have waiting threads, by definition you will have to wake them up, and this round trip through the kernel will be the source of your unwanted latency.
But what you should do is really think about your problem:
If you constantly have work to do, then your worker thread is always busy. Work will be done when the previous work is finished, and you don't care about the latency.
If what matters is the latency between the boss detecting an event and the worker starting to work, then why do you use a boss -> worker pattern?
My advice would be to look for something faster when you really need it; at that point you will probably have a much more detailed question to ask. Maybe I am wrong, but it looks like you are trying to optimize prematurely, which as you perhaps know is the root of all evil. Of course, bad design can lead to massive rework, but here you are dealing with a very small detail of your real design decision, which is using a boss/worker pattern.
Implement your design with pthread_cond_signal(), or perhaps sem_post() / sem_wait(), and then look at where your latency really is, and whether it is really a problem.
I would guess signal and wait would be best. Most OSes recognize threads and can have them just idle until the interrupt comes. With pipes the worker would have to keep waking up and checking the pipe for output. The best testing I've found for efficiency has usually been using the Unix time command to get the running time from start to finish (assuming the program isn't meant to keep running in the background), then setting up a script to run it a few times and compare.