Can I have realtime scheduling within my process (but without affecting others)? - c

According to my question here I would like to use SCHED_RR with pthread_setschedparam for my threads in a Linux application. However, this has effects even on kernel modules which I currently cannot solve.
I have found http://www.icir.org/gregor/tools/pthread-scheduling.html which says that I could create my threads with PTHREAD_SCOPE_PROCESS attribute, but I haven't found further information on this.
Will this work with (Angstrom) Linux, kernel version 2.6.32? (How) will this affect the way my process competes with other processes? Would this let my threads compete with each other under real-time scheduling without affecting other processes?
(As I am using boost threads I cannot simply try this...)

Threads created with PTHREAD_SCOPE_PROCESS will share the same kernel thread (http://lists.freebsd.org/pipermail/freebsd-threads/2006-August/003674.html).
However, SCHED_RR must be run under a root-privileged process.
Round-Robin; threads whose contention scope is system
(PTHREAD_SCOPE_SYSTEM) are in real-time (RT) scheduling class if the
calling process has an effective user id of 0. These threads, if not
preempted by a higher priority thread, and if they do not yield or
block, will execute for a time period determined by the system.
SCHED_RR for threads that have a contention scope of process
(PTHREAD_SCOPE_PROCESS) or whose calling process does not have an
effective user id of 0 is based on the TS scheduling class.
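Note that the text quoted above does not describe Linux. As far as I know, the Linux man page for pthread_attr_setscope() states that only PTHREAD_SCOPE_SYSTEM is supported and that a request for PTHREAD_SCOPE_PROCESS fails with ENOTSUP. A minimal sketch to check what a given target accepts (the helper names try_scope and report_scopes are made up for this example):

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Ask for a contention scope on a fresh attribute object.
 * Returns 0 on success, otherwise the error number reported by
 * pthread_attr_init() or pthread_attr_setscope(). */
int try_scope(int scope)
{
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_attr_setscope(&attr, scope);
    pthread_attr_destroy(&attr);
    return rc;
}

/* Print which scopes the running system accepts. */
void report_scopes(void)
{
    int rc = try_scope(PTHREAD_SCOPE_SYSTEM);
    printf("PTHREAD_SCOPE_SYSTEM:  %s\n", rc == 0 ? "supported" : strerror(rc));
    rc = try_scope(PTHREAD_SCOPE_PROCESS);
    printf("PTHREAD_SCOPE_PROCESS: %s\n", rc == 0 ? "supported" : strerror(rc));
}
```

Calling report_scopes() once at start-up on the target board shows whether process scope is an option there at all.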
However, based on your linked problem, I think you are facing a deeper issue. Have you tried configuring your kernel to be more "preemptive"? Preemption allows the kernel to forcibly switch away from your process, which makes some parts of the kernel more responsive. This shouldn't affect IRQs, though; maybe something disabled your IRQs?
Another possibility is that you are not fetching your SPI data fast enough, so the kernel buffer for your data fills up and data is lost. Try increasing those buffers as well.

Related

Are locked pages inherited by pthreads?

I have a little paging problem on my realtime system, and wanted to know how exactly linux should behave in my particular case.
Among various other things, my application spawns 2 threads using pthread_create(), which operate on a set of shared buffers.
The first thread, let's call it A, reads data from a device, performs some calculations on it, and writes the results into one of the buffers.
Once that buffer is full, thread B will read all the results and send them to a PC via ethernet, while thread A writes into the next buffer.
I have noticed that each time thread A starts writing into a previously unused buffer, I miss some interrupts and lose data (there is an id in the header of each packet, and if that increments by more than one, I have missed interrupts).
So if I use n buffers, I get exactly n bursts of missed interrupts at the start of my data acquisition (therefore the problem is definitely caused by paging).
To fix this, I used mlock() and memset() on all of the buffers to make sure they are actually paged in.
This fixed my problem, but I was wondering where in my code would be the correct place to do this. In my main application, or in one/both of the threads? (Currently I do it in both threads.)
According to the libc documentation (section 3.4.2 "Locked Memory Details"), memory locks are not inherited by child processes created using fork().
So what about pthreads? Do they behave the same way, or would they inherit those locks from my main process?
Some background information about my system, even though I don't think it matters in this particular case:
It is an embedded system powered by a SoC with a dual-core Cortex-A9 running Linux 4.1.22 with PREEMPT_RT.
The interrupt frequency is 4kHz
The thread priorities (as shown in htop) are -99 for the interrupt, -98 for thread A (both of which are higher than the standard priority of -51 for all other interrupts) and -2 for thread B
EDIT:
I have done some additional tests, calling my page locking function from different threads (and in main).
If I lock the pages in main(), and then try to lock them again in one of the threads, I would expect to see a large number of page faults for main() but no page faults for the thread itself (because the pages should already be locked). However, htop tells a different story: I see a large number of page faults (MINFLT column) for each and every thread that locks those pages.
To me, that would suggest that pthreads actually do have the same limitation as child processes spawned using fork(). And if this is the case, locking them in both threads (but not in main) would be the correct procedure.
Threads share the same memory management context. If a page is resident for one thread, it's resident for all threads in the same process.
The implication of this is that memory locking is per-process, not per-thread.
You are probably still seeing minor faults on the first write because a fault is used to mark the page dirty. You can avoid this by also writing to each page after locking.
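A minimal sketch of the "lock, then write to every page" idea. The helper name lock_and_prefault is made up for this example, and the error handling assumes that a failed mlock() (for instance because RLIMIT_MEMLOCK is too small) should be reported rather than treated as fatal:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Lock a buffer into RAM and write to every page of it, so that the
 * first real write from the acquisition thread does not have to fault.
 * Returns 0 on success, or the errno value from mlock() if locking
 * failed. The buffer is zeroed in both cases. */
int lock_and_prefault(void *buf, size_t len)
{
    int err = 0;

    if (mlock(buf, len) != 0)
        err = errno;

    /* Writing to each page populates it and marks it dirty up front. */
    memset(buf, 0, len);
    return err;
}
```

Because locking is per-process, calling this once for each buffer (from main() before the threads start, for example) should be enough. mlockall(MCL_CURRENT | MCL_FUTURE) is an alternative that covers the whole address space, at the cost of locking more memory than the buffers alone.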

How does a process know that semaphore is available

I have a very basic doubt.
When a process is waiting on a semaphore, it goes into a sleep state.
So there is no way it can poll the semaphore value.
Does the kernel poll the semaphore value and, if it is available, send a signal to all processes waiting for it? If so, won't that be too much overhead for the kernel?
Or does the signal() call internally notify all the processes waiting for the semaphore?
Please let me know on this.
The operating system schedules the waiting process again once another process tells it that it is done with the semaphore.
Semaphores are just one of the ways of interacting with the OS scheduler.
The kernel doesn't poll the semaphore; it doesn't need to. Every time a process calls sem_post() (or equivalent), that involves interaction with the kernel. What the kernel does during the sem_post() is look up whatever processes have previously called sem_wait() on the same semaphore. If one or more processes have called sem_wait(), it picks the process with the highest priority and schedules it. This shows up as that sem_wait() finally returning and that process carries on executing.
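A minimal sketch of that hand-off using POSIX unnamed semaphores and two threads of one process. The names data_ready, shared_value and run_handoff are made up for this example:

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t data_ready;   /* posted once shared_value is valid */
static int shared_value;

static void *waiter(void *arg)
{
    int *out = arg;

    /* Blocks without polling until the count is positive; retry if a
     * signal handler interrupted the wait. */
    while (sem_wait(&data_ready) != 0 && errno == EINTR)
        ;
    *out = shared_value;
    return NULL;
}

/* Hands `value` to a second thread through the semaphore and returns
 * what that thread saw, or -1 if setup failed. */
int run_handoff(int value)
{
    pthread_t t;
    int received = -1;

    if (sem_init(&data_ready, 0, 0) != 0)
        return -1;
    if (pthread_create(&t, NULL, waiter, &received) != 0) {
        sem_destroy(&data_ready);
        return -1;
    }

    shared_value = value;
    sem_post(&data_ready);   /* wakes the waiter if it is already asleep */

    pthread_join(t, NULL);
    sem_destroy(&data_ready);
    return received;
}
```

The waiter spends no CPU time while blocked; it only runs again after sem_post() has been called.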
How This is Implemented Under the Hood
Fundamentally the kernel needs to implement something called an "atomic test and set". That is an operation whereby the value of some variable can be tested and, if a certain condition is met (such as the value == 0), the variable is altered (e.g. value = 1). If this succeeds, the kernel will do one thing (like schedule a process); if it does not (because the condition value == 0 was false), the kernel will do something different (like put a process on the do-not-schedule list). The 'atomic' part is that this decision is made without anything else being able to look at and change the same variable at the same time.
There are several ways of doing this. One is to suspend all processes (or at least all activity within the kernel) so that nothing else is testing the value of the variable at the same time. That's not very fast.
For example, the Linux kernel once had something called the Big Kernel Lock. I don't know if this was used to process semaphore interactions, but that's the kind of thing that OSes used to have for atomic test & sets.
These days CPUs have atomic test & set op codes, which is a lot faster. The good ole' Motorola 68000 had one of these (TAS) a long time ago; x86 and PowerPC offer comparable primitives (LOCK-prefixed instructions such as XCHG and CMPXCHG on x86, and the lwarx / stwcx. pair on PowerPC).
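In C11 the same primitive is exposed as atomic_flag_test_and_set(). A minimal sketch of a spinlock built on it, with two threads incrementing a shared counter (the names lock_flag, counter and count_with_two_threads are made up for this example; a real program would normally use a mutex or semaphore rather than spin):

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static long counter;   /* protected by lock_flag */

static void spin_lock(void)
{
    /* Sets the flag and returns its previous value in one indivisible
     * step; a previous value of "set" means someone else holds the lock. */
    while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
        ;   /* busy-wait */
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}

static void *add_many(void *arg)
{
    long n = *(long *)arg;
    for (long i = 0; i < n; i++) {
        spin_lock();
        counter++;
        spin_unlock();
    }
    return NULL;
}

/* Runs two threads that each add n to the counter; returns the total. */
long count_with_two_threads(long n)
{
    pthread_t a, b;

    counter = 0;
    if (pthread_create(&a, NULL, add_many, &n) != 0)
        return -1;
    if (pthread_create(&b, NULL, add_many, &n) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;
}
```

Without the test-and-set around counter++, the two threads could overwrite each other's increments and the total would come up short.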
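If you root around inside Linux you'll find mention of futexes. A futex is a "fast userspace mutex": it relies on the CPU's atomic instructions so that the uncontended case is handled without entering the kernel, and only calls into the kernel when a thread has to sleep or be woken up.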
Post a Semaphore in Hardware
A variation is a mailbox semaphore. This is a special variation on a semaphore that is extremely useful in some system types where hardware needs to wake up a process at the end of a DMA transfer. A mailbox is a special location in memory which when written to will cause an interrupt to be raised. This can be turned into a semaphore by the kernel because when that interrupt is raised, it goes through the same motions as it would had something called sem_post().
This is incredibly handy; a device can DMA a large amount of data to some pre-arranged buffer, and top that off with a small DMA transfer to the mail box. The kernel handles the interrupt, and if a process has previously called sem_wait() on the mailbox semaphore the kernel schedules it. The process, which also knows about this pre-arranged buffer, can then process the data.
On real-time DSP systems this is very useful, because it's very fast and very low latency; it allows a process to receive data from some device with very little delay. The alternative, a full device driver stack that uses read() / write() to transfer data from the device to the process, is incredibly slow by comparison.
Speed
The speed of semaphore interactions depends entirely on the OS.
For OSes like Windows and Linux, the context switch time is fairly slow (in the order of several microseconds, if not tens of microseconds). Basically this means that when a process calls something like sem_post(), the kernel is doing a lot of different things whilst it has the opportunity before finally returning control to the process(es). What it's doing during this time could be, well, almost anything!
If a program has made use of a lot of threads, and they're all rapidly interacting between themselves using semaphores, quite a lot of time is lost to the sem_post() and sem_wait(). This places an emphasis on doing a decent amount of work once a process has returned from sem_wait() before calling the next sem_post().
However, on OSes like VxWorks, the context switch time is lightning fast. That is, there's very little code in the kernel that gets run when sem_post() is called. The result is that a semaphore interaction is a lot more efficient. Moreover, an OS like VxWorks is written in such a way as to guarantee that the time taken to do all this sem_post() / sem_wait() work is constant.
This influences the architecture of one's software on these systems. On VxWorks, where a context switch is cheap, there's very little penalty in having a large number of threads all doing quite small tasks. On Windows / Linux there's more of an emphasis on the opposite.
This is why OSes like VxWorks are excellent for hard real time applications, and Windows / Linux are not.
The Linux PREEMPT_RT patch set in part aims to improve the latency of the Linux kernel during operations like this. For example, it pushes a lot of device interrupt handlers (device drivers) up into kernel threads; these are scheduled almost like any other thread. The idea is to reduce the amount of work that is done by the kernel itself (and have more done by kernel threads), so that the work it still has to do itself (such as handling sem_post() / sem_wait()) takes less time and is more consistent in how long it takes. It is still not a hard guarantee of latency, but it's a pretty good improvement. This is what we call a soft-realtime kernel. The impact, though, is that overall throughput of the machine can be lower.
Signals
Signals are nasty, horrible things that really get in the way of using things like sem_post() and sem_wait(). I avoid them like the plague.
If you are on a Linux platform and you do have to use signals, take a serious long look at signalfd (man page). This is a far better way of dealing with signals because you can choose to accept them at a convenient time (simply by calling read()), instead of having to handle them as soon as they occur. Certainly if you're using epoll() or select() anywhere at all in a program then signalfd is the way to go.

Soft Real Time Linux Scheduling

I have a project with some soft real-time requirements. I have two processes (programs that I've written) that do some data acquisition. In either case, I need to continuously read in data that's coming in and process it.
The first program is heavily threaded, and the second one uses a library which should be threaded, but I have no clue what's going on under the hood. Each program is executed by the user and (by default) I see each with a priority of 20 and a nice value of 0. Each program uses roughly 30% of the CPU.
As it stands, both processes have to contend with a few background processes, and I want to give my two programs the best possible shot at the CPU. My main issue is that I talk to a device that has a 64 byte hardware buffer, and if I don't read from it in time, I get an overflow. I have noted this condition occurring once every 2-3 hours of run time.
Based on my research (http://oreilly.com/catalog/linuxkernel/chapter/ch10.html) there appear to be three ways of playing around with the priority:
1. Set the nice value to a lower number, and therefore give each process more priority. I can do this without any modification to my code (or use the system call) using the nice command.
2. Use sched_setscheduler() for the entire process to a particular scheduling policy.
3. Use pthread_setschedparam() to individually set each pthread.
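A minimal sketch of the second and third options (the helper names set_self_rr and set_thread_rr are made up for this example). Both calls normally need root or CAP_SYS_NICE for SCHED_RR and fail with EPERM otherwise. As far as I know, on Linux sched_setscheduler() with a pid of 0 changes only the calling thread, not every thread of the process:

```c
#include <errno.h>
#include <pthread.h>
#include <sched.h>

/* Option 2: sched_setscheduler() on "pid 0" (the caller).
 * Returns 0 on success or an errno value such as EPERM. */
int set_self_rr(int prio)
{
    struct sched_param sp = { .sched_priority = prio };

    if (sched_setscheduler(0, SCHED_RR, &sp) != 0)
        return errno;
    return 0;
}

/* Option 3: set one specific thread to SCHED_RR.
 * pthread_setschedparam() returns the error number directly. */
int set_thread_rr(pthread_t thread, int prio)
{
    struct sched_param sp = { .sched_priority = prio };

    return pthread_setschedparam(thread, SCHED_RR, &sp);
}
```

Valid priorities for SCHED_RR lie between sched_get_priority_min(SCHED_RR) and sched_get_priority_max(SCHED_RR).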
I have run into the following roadblocks:
1. Say I go with choice 3: how do I prevent lower-priority threads from being starved? Is there also a way to ensure that shared locks cause lower-priority threads to be promoted to a higher priority? Say I have a real-time SCHED_RR thread that shares a lock with a default SCHED_OTHER thread. When the SCHED_OTHER thread gets the lock, I want it to execute at a higher priority so that it frees the lock. How do I ensure this?
2. If a SCHED_RR thread creates another thread, is the new thread automatically SCHED_RR, or do I need to specify this? What if I have a process that I have set to SCHED_RR: do all its threads automatically follow this policy? What if a SCHED_RR process spawns a child process: is it automatically SCHED_RR too?
3. Does any of this matter given that the code only uses up 60% of the CPU? Or are there still issues with the CPU being shared with background processes that I should be concerned about and that could be causing my buffer overflows?
Sorry for the long winded question, but I felt it needed some background info. Thanks in advance for the help.
(1) pthread_mutex_setprioceiling
(2) A newly created thread inherits the scheduling policy and priority of its creating thread unless its thread attributes (e.g. pthread_attr_setschedparam / pthread_attr_setschedpolicy) are directed to do otherwise when you call pthread_create.
(3) Since you don't know what causes it now, it is in fairness hard for anyone to say with assurance.

Linux Scheduling: OS vs "virtual"

How does one implement a multithreaded, single-process model in C on Fedora Linux in which a single scheduler thread runs on a "main" core and watches for I/O availability (e.g. TCP/IP, UDP), while one "execution thread" per core (started at init) parses the data and then writes a small update to shared memory? (It is my understanding that pthreads share data within a single process.)
I believe my options are:
Pthreads or the Linux OS scheduler
I have a naive model in mind: start a certain number of these execution threads and a single scheduler thread.
What is the best solution, given that I can use this sort of model?
Completing Benoit's answer, in order to communicate between your master and your worker threads, you could use condition variables. The workers do something like:
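/* workQueueMutex / workQueueCond: pointers to an initialised
   pthread_mutex_t / pthread_cond_t shared by master and workers */
while (true)
{
    pthread_mutex_lock(workQueueMutex);
    /* loop, because pthread_cond_wait() may wake up spuriously */
    while (workQueue.empty())
        pthread_cond_wait(workQueueCond, workQueueMutex);
    /* if we get here then (a) we have work and (b) we hold workQueueMutex */
    work = pop(workQueue);
    pthread_mutex_unlock(workQueueMutex);
    /* do work */
}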
and the master:
/* I/O received */
pthread_mutex_lock(workQueueMutex);
push(workQueue, work);
pthread_cond_signal(workQueueCond);
pthread_mutex_unlock(workQueueMutex);
This would wake up one idle worker to immediately process the request. If no worker is available, the work stays in the queue and is processed later.
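A fuller, self-contained sketch of the same pattern in plain C. The fixed-size queue, the q_closed shutdown flag and all the names are assumptions made for this example; the "work" is just adding the item to a running total:

```c
#include <pthread.h>
#include <stdbool.h>

#define QUEUE_CAP   64
#define MAX_WORKERS 8

static pthread_mutex_t q_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond  = PTHREAD_COND_INITIALIZER;
static int  q_items[QUEUE_CAP];
static int  q_head, q_count;
static bool q_closed;   /* set by the master when no more work will come */
static long total;      /* result of the "work", guarded by q_mutex */

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_mutex);
        while (q_count == 0 && !q_closed)
            pthread_cond_wait(&q_cond, &q_mutex);
        if (q_count == 0) {              /* closed and drained: exit */
            pthread_mutex_unlock(&q_mutex);
            return NULL;
        }
        int item = q_items[q_head];
        q_head = (q_head + 1) % QUEUE_CAP;
        q_count--;
        pthread_mutex_unlock(&q_mutex);

        /* "do work" would happen here, outside the lock */
        pthread_mutex_lock(&q_mutex);
        total += item;
        pthread_mutex_unlock(&q_mutex);
    }
}

/* Master: queue the items 1..n_items, then return the sum computed by
 * the workers, or -1 on bad arguments or if no worker could start. */
long run_pool(int n_workers, int n_items)
{
    pthread_t tids[MAX_WORKERS];
    int started = 0;

    if (n_workers < 1 || n_workers > MAX_WORKERS ||
        n_items < 0 || n_items > QUEUE_CAP)
        return -1;

    q_head = q_count = 0;
    q_closed = false;
    total = 0;

    while (started < n_workers &&
           pthread_create(&tids[started], NULL, worker, NULL) == 0)
        started++;
    if (started == 0)
        return -1;

    for (int v = 1; v <= n_items; v++) {
        pthread_mutex_lock(&q_mutex);
        q_items[(q_head + q_count) % QUEUE_CAP] = v;
        q_count++;
        pthread_cond_signal(&q_cond);    /* wake one idle worker */
        pthread_mutex_unlock(&q_mutex);
    }

    pthread_mutex_lock(&q_mutex);
    q_closed = true;
    pthread_cond_broadcast(&q_cond);     /* wake everyone so they can exit */
    pthread_mutex_unlock(&q_mutex);

    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    return total;
}
```

The sketch limits n_items to the queue capacity so the master never has to wait for free space; a real queue would need a second condition variable (or a growable container) for that case.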
Modifying the Linux scheduler is quite tough work. I would just forget about it. Pthreads are usually preferred. If I understand correctly, you want one core dedicated to the control plane and a pool of other cores dedicated to data-plane processing? Then create a pool of threads from your master thread and set up core affinity for these worker threads with pthread_setaffinity_np(...).
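A minimal sketch of pinning a thread with pthread_setaffinity_np(), which is a GNU extension and needs _GNU_SOURCE defined before any header is included. The helper name is made up for this example; it pins the calling thread to the first CPU it is currently allowed to run on, whereas a worker pool would pick a different CPU for each worker:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to the first CPU in its current affinity mask.
 * Returns the CPU index, or -1 on error. */
int pin_self_to_first_allowed_cpu(void)
{
    cpu_set_t allowed, one;

    if (pthread_getaffinity_np(pthread_self(), sizeof allowed, &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (pthread_setaffinity_np(pthread_self(), sizeof one, &one) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;
}
```

Starting from the current mask rather than hard-coding CPU numbers keeps the sketch working inside containers or cpusets where not every core is available.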
Indeed threads of a process share the same address-space, and global variables are accessible by any threads of that process.
It looks to me that you have a version of the producer-consumer problem with a single consumer aggregating the results of n producers. This is a pretty standard problem, so I definitely think that pthread is more than enough for you. You don't need to go and mess around with the scheduler.
As one of the answers states, a thread-safe queue like the one described here works nicely for this sort of issue. Your original idea of spawning a bunch of threads is a good one. You seem to be worried that the ability of the threads to share global state will cause you problems. I don't think that this is an issue if you keep shared state to a minimum and use sane locking discipline. Sharing state is fine as long as you do so responsibly.
Finally, unless you really know what you're doing, I would advise against manually messing with thread affinity. Just spawn the threads and let the scheduler handle when and on what core a thread runs. The thing to optimize is the number of threads you use. One for each core may not actually be the fastest approach if other threads are running.
Generally speaking, this is more or less exactly what the POSIX select() and Linux-specific epoll functions are for.
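A minimal sketch of waiting for I/O availability with epoll (the helper name wait_readable is made up for this example). For brevity it creates and destroys an epoll instance on every call; a real scheduler thread would create one instance at start-up, register all of its sockets, and loop on epoll_wait():

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    int rc = -1;
    int ep = epoll_create1(EPOLL_CLOEXEC);

    if (ep < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == 0) {
        struct epoll_event out;
        int n = epoll_wait(ep, &out, 1, timeout_ms);
        if (n >= 0)
            rc = (n == 1 && (out.events & EPOLLIN)) ? 1 : 0;
    }
    close(ep);
    return rc;
}
```

The thread sleeps inside epoll_wait() until data arrives, at which point it can hand the ready descriptor (or the data read from it) to a worker.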

Pthreads don't seem to be using more than one processor

I have an application that spawns multiple child processes, which then go on to spawn multiple threads. I can control the number of processes and threads that are spawned. The threads do a specific read/write operation to a NAS, and I record how long this takes.
What's odd is that the time it takes to perform the read/write operation is longer with multiple threads. I read /proc/stat before starting the application and when finished, and got this (after some math):
cpu0: 1.0050% usrtime, 2.5126% systime, 95.4774% idle, 0.5025% softirq
cpu1: 0.0000% usrtime, 0.0000% systime, 100.0000% idle, 0.0000% softirq
I also checked sched_getaffinity, and both CPUs are enabled for the child processes. Is there something that I must do, besides spawning multiple threads, to make use of the multiple cores?
You're hardly using your CPU at all. Going out to Network Attached Storage, your bottleneck is most likely your network connection. How much data are you pushing and how much bandwidth can your pipeline (and your NAS) tolerate?

Resources