How does one implement a multithreaded, single-process model in C on Linux (Fedora), where a single scheduler thread on a "main" core watches for I/O availability (e.g. TCP/IP, UDP) and hands the data to one "execution thread" per core (started at init), which parses it and then writes a small update to shared memory? (It is my understanding that pthreads share data within a single process.)
I believe my options are:
Pthreads or the linux OS scheduler
I have a naive model in mind that consists of starting a certain number of these execution threads and a single scheduler thread.
What is the best solution, given that I can use this sort of model?
To complete Benoit's answer: in order to communicate between your master and your worker threads, you could use a condition variable. The workers do something like:
while (true)
{
    pthread_mutex_lock(&workQueueMutex);
    while (workQueue.empty())
        pthread_cond_wait(&workQueueCond, &workQueueMutex);
    /* if we get here then (a) we have work and (b) we hold workQueueMutex */
    work = pop(workQueue);
    pthread_mutex_unlock(&workQueueMutex);
    /* do work */
}
and the master:
/* I/O received */
pthread_mutex_lock(&workQueueMutex);
push(workQueue, work);
pthread_cond_signal(&workQueueCond);
pthread_mutex_unlock(&workQueueMutex);
This wakes up one idle worker to process the request immediately. If no worker is available, the work stays in the queue and is dequeued and processed later.
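The two fragments above can be put together into something that compiles. This is only a minimal sketch under my own assumptions: the fixed-size ring buffer, the `shutting_down` flag, and the names `worker_main`, `submit_work`, `run_pool_demo`, `QUEUE_CAP` and `NUM_WORKERS` are all hypothetical, and the "work" is just an integer that each worker adds to a shared total.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

#define QUEUE_CAP 64    /* hypothetical capacity for this sketch */
#define NUM_WORKERS 4   /* hypothetical; e.g. one per data-plane core */

static int queue_items[QUEUE_CAP];      /* ring buffer of pending work */
static int queue_head, queue_count;
static bool shutting_down;
static long total_processed;            /* small piece of shared state */
static pthread_mutex_t work_queue_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_queue_cond = PTHREAD_COND_INITIALIZER;

static void *worker_main(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&work_queue_mutex);
        while (queue_count == 0 && !shutting_down)
            pthread_cond_wait(&work_queue_cond, &work_queue_mutex);
        if (queue_count == 0) {         /* shutting down and queue drained */
            pthread_mutex_unlock(&work_queue_mutex);
            return NULL;
        }
        int work = queue_items[queue_head];
        queue_head = (queue_head + 1) % QUEUE_CAP;
        queue_count--;
        pthread_mutex_unlock(&work_queue_mutex);

        /* "Do work" outside the lock, then publish a small update. */
        long parsed = work;
        pthread_mutex_lock(&work_queue_mutex);
        total_processed += parsed;
        pthread_mutex_unlock(&work_queue_mutex);
    }
}

/* Called by the master thread; returns false if the queue is full. */
static bool submit_work(int work)
{
    bool queued = false;
    pthread_mutex_lock(&work_queue_mutex);
    if (queue_count < QUEUE_CAP) {
        queue_items[(queue_head + queue_count) % QUEUE_CAP] = work;
        queue_count++;
        queued = true;
        pthread_cond_signal(&work_queue_cond);  /* wake one idle worker */
    }
    pthread_mutex_unlock(&work_queue_mutex);
    return queued;
}

/* Starts the pool, submits 1..n_items, shuts down, returns the shared total. */
long run_pool_demo(int n_items)
{
    pthread_t workers[NUM_WORKERS];

    queue_head = queue_count = 0;
    shutting_down = false;
    total_processed = 0;
    for (int i = 0; i < NUM_WORKERS; i++) {
        int rc = pthread_create(&workers[i], NULL, worker_main, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 1; i <= n_items; i++)
        while (!submit_work(i))
            sched_yield();              /* queue full: let workers catch up */

    pthread_mutex_lock(&work_queue_mutex);
    shutting_down = true;
    pthread_cond_broadcast(&work_queue_cond);   /* wake every worker to exit */
    pthread_mutex_unlock(&work_queue_mutex);

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    return total_processed;
}
```

Calling `run_pool_demo(100)` from `main` (compiled with `-pthread`) should return the sum of 1 through 100. In a real program the master loop would be driven by I/O readiness rather than a counter.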
Modifying the Linux scheduler is quite tough work; I would just forget about it. Pthreads are usually preferred. If I understand correctly, you want one core dedicated to the control plane and a pool of other cores dedicated to data-plane processing? Then create a pool of threads from your master thread and set the core affinity of these slave threads with pthread_setaffinity_np(...).
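As a rough illustration of the affinity call mentioned above, here is a sketch in which a thread pins itself to one core and then reports where it is running. The helper names (`first_allowed_cpu`, `run_pinned_thread`, `struct pin_request`) are my own, and picking the first CPU from the current affinity mask is just an assumption to keep the example portable across machines.

```c
#define _GNU_SOURCE   /* pthread_setaffinity_np and sched_getcpu are GNU extensions */
#include <assert.h>
#include <pthread.h>
#include <sched.h>

struct pin_request {
    int cpu;        /* core the thread should be pinned to */
    int rc;         /* result of pthread_setaffinity_np */
    int cpu_seen;   /* core the thread observed after pinning */
};

static void *pinned_worker(void *arg)
{
    struct pin_request *req = arg;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(req->cpu, &set);
    req->rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    req->cpu_seen = sched_getcpu();
    return NULL;
}

/* Returns the first CPU the calling thread may run on, or -1 on error. */
int first_allowed_cpu(void)
{
    cpu_set_t allowed;

    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &allowed))
            return cpu;
    return -1;
}

/* Runs one thread pinned to `cpu`; returns the CPU it ran on, or -1. */
int run_pinned_thread(int cpu)
{
    struct pin_request req = { .cpu = cpu, .rc = -1, .cpu_seen = -1 };
    pthread_t tid;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    if (pthread_create(&tid, NULL, pinned_worker, &req) != 0)
        return -1;
    pthread_join(tid, NULL);
    return req.rc == 0 ? req.cpu_seen : -1;
}
```

For a one-thread-per-core pool, the master would call something like `run_pinned_thread` once per data-plane core, with a long-running worker function in place of `pinned_worker`.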
Indeed threads of a process share the same address-space, and global variables are accessible by any threads of that process.
It looks to me that you have a version of the producer-consumer problem with a single consumer aggregating the results of n producers. This is a pretty standard problem, so I definitely think that pthread is more than enough for you. You don't need to go and mess around with the scheduler.
As one of the answers states, a thread-safe queue like the one described here works nicely for this sort of issue. Your original idea of spawning a bunch of threads is a good one. You seem to be worried that the ability of the threads to share global state will cause you problems. I don't think that this is an issue if you keep shared state to a minimum and use sane locking discipline. Sharing state is fine as long as you do so responsibly.
Finally, unless you really know what you're doing, I would advise against manually messing with thread affinity. Just spawn the threads and let the scheduler handle when and on what core a thread runs. The thing to optimize is the number of threads you use. One for each core may not actually be the fastest approach if other threads are running.
Generally speaking, this is more or less exactly what the POSIX select and Linux-specific epoll functions are for.
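To make the epoll suggestion concrete, here is a small sketch of a scheduler-side wait. A pipe stands in for the TCP or UDP socket so the example needs no network setup; that substitution and the function names (`watch_fd`, `read_when_ready`, `epoll_pipe_demo`) are my assumptions. A real loop would also retry on EINTR and register several descriptors.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Creates an epoll instance watching `fd` for readability; -1 on error. */
int watch_fd(int fd)
{
    assert(fd >= 0);
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) != 0) {
        close(epfd);
        return -1;
    }
    return epfd;
}

/* Waits up to timeout_ms; returns bytes read, 0 on timeout, -1 on error. */
ssize_t read_when_ready(int epfd, char *buf, size_t len, int timeout_ms)
{
    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, timeout_ms);

    if (n <= 0)
        return n;
    return read(ev.data.fd, buf, len);
}

/* Returns 0 if an empty pipe times out and a written pipe becomes readable. */
int epoll_pipe_demo(void)
{
    int fds[2];
    char buf[16];

    if (pipe(fds) != 0)
        return -1;
    int epfd = watch_fd(fds[0]);
    if (epfd < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    ssize_t idle = read_when_ready(epfd, buf, sizeof(buf), 0);   /* nothing yet */
    ssize_t sent = write(fds[1], "ping", 4);
    ssize_t got = read_when_ready(epfd, buf, sizeof(buf), 1000); /* data ready */

    close(epfd);
    close(fds[0]);
    close(fds[1]);
    return (idle == 0 && sent == 4 && got == 4) ? 0 : -1;
}
```

In the model from the question, the scheduler thread would sit in `epoll_wait` and push each ready descriptor (or the bytes read from it) onto the work queue for the execution threads.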
Related
I have a project with some soft real-time requirements. I have two processes (programs that I've written) that do some data acquisition. In either case, I need to continuously read in data that's coming in and process it.
The first program is heavily threaded, and the second one uses a library which should be threaded, but I have no clue what's going on under the hood. Each program is executed by the user and (by default) I see each with a priority of 20 and a nice value of 0. Each program uses roughly 30% of the CPU.
As it stands, both processes have to contend with a few background processes, and I want to give my two programs the best possible shot at the CPU. My main issue is that I talk to a device that has a 64-byte hardware buffer, and if I don't read from it in time, I get an overflow. I have noted this condition occurring once every 2-3 hours of run time.
Based on my research (http://oreilly.com/catalog/linuxkernel/chapter/ch10.html) there appear to be three ways of playing around with the priority:
Set the nice value to a lower number, and therefore give each process more priority. I can do this without any modification to my code (or use the system call) using the nice command.
Use sched_setscheduler() for the entire process to a particular scheduling policy.
Use pthread_setschedparam() to individually set each pthread.
I have run into the following roadblocks:
Say I go with choice 3: how do I prevent lower-priority threads from being starved? Is there also a way to ensure that shared locks cause lower-priority threads to be promoted to a higher priority? Say I have a real-time SCHED_RR thread that shares a lock with a default SCHED_OTHER thread. When the SCHED_OTHER thread gets the lock, I want it to execute at a higher priority so that it frees the lock. How do I ensure this?
If a thread of SCHED_RR creates another thread, is the new thread automatically SCHED_RR, or do I need to specify this? What if I have a process that I have set to SCHED_RR, do all its threads automatically follow this policy? What if a process of SCHED_RR spawns a child process, is it too automatically SCHED_RR?
Does any of this matter given that the code only uses up 60% of the CPU? Or are there still issues with the CPU being shared with background processes that I should be concerned about and that could be causing my buffer overflows?
Sorry for the long winded question, but I felt it needed some background info. Thanks in advance for the help.
(1) pthread_mutex_setprioceiling
(2) A newly created thread inherits the scheduling policy and priority of its creating thread unless its thread attributes (e.g. pthread_attr_setschedparam / pthread_attr_setschedpolicy) direct otherwise when you call pthread_create.
(3) Since you don't know what causes it now, it is, in fairness, hard for anyone to say with assurance.
Suppose I create threads with pthreads, is it possible to send them new things to work on after they have been initialized, so I don't waste resources in creating new threads? For instance, I create 3 threads, thread 2 signals completion and I send it another "task" without killing it and starting a new one. Thanks.
The usual, simple form is an ordinary (work) queue. In principle, you maintain a queue structure, perhaps as a linked list, protected by a mutex. Typically, condition variables are used by the main/producer threads to notify worker threads that new work is available, so they don't have to poll.
Some previous SO questions that may also be useful are:
How To Use Condition Variable
One producer, Two consumers and usage of pthread_cond_signal & pthread_mutex_lock
pthread conditional variable
Yes, and that is what servers like Apache do to increase their performance. The design pattern is called the Thread pool pattern and there are various implementations (this one for example) using pthreads.
Of course, you might want to keep your implementation as simple as possible, depending on what your goals are.
Of course. For example, you can use producer-consumer pattern. Here is an example in C#, but it can be easily implemented in pthreads as well.
The search keyword for your question is "thread pooling" or "thread pool". Using these terms you will find plenty of information on this site and also on Google.
I'm implementing a boss/worker design pattern using pthreads on Linux. I want to have a boss thread that constantly checks for work, and if there is work, then wakes up a sleeping worker to do the work. My question is: what type of IPC synchronization/mechanism should I use to achieve the least latency between my boss thread handing off to my worker, and my worker waking up?
The easy solution is to use pthread condition variables, calling pthread_cond_signal in the boss thread and pthread_cond_wait in each of the worker threads, but I'm wondering:
is there something faster that I can use to implement the blocking and signaling? For example, how would using pipes between the boss and worker threads fare?
how can I measure the performance of one type of IPC versus another? For example, I see benchmarks for pipe()'s and fork()'s, but nothing for using pipe()'s as an interthread communication.
Let me know if I can clarify anything in my questions!
EDIT
As an example of how I would use pipe()'s to implement blocking between my worker and boss threads, the worker thread would read() a pipe, and since it's empty would then block on that read call until the boss calls write() on it.
The glibc implementation of pthreads uses the low-level "futex" locks to implement pthread_cond_wait() / pthread_cond_signal(). Futexes were designed to be a fast synchronisation primitive, so these are likely to outperform pipes or similar methods (at the very least, using pipes requires copying a byte to and from kernel space that isn't needed for futexes).
If pthread_cond_wait() / pthread_cond_signal() map well onto your problem (and it sounds like they do), then the only way to outperform them is likely to be to implement something on futexes yourself (for example, you could eliminate the handling of thread cancellation if you do not use that).
It is probably worthwhile benchmarking your application - unless your work units are very small indeed, then the condition variable wakeup latency is unlikely to dominate.
What you should do first is make sure you need something faster. Since pthread signaling is implemented using futexes, where futex stands for fast userspace mutex, I don't think you can outperform them.
If you have waiting threads, by definition you will have to wake them up, and this round trip through the kernel will be the source of your unwanted latency.
But what you should really do is think about your problem:
If you constantly have work to do, then your worker thread is always busy. Work will be done when previous work is finished, and you don't care about the latency.
If what matters is the latency between the boss detecting an event and the worker starting to work, then why do you use a boss -> worker pattern?
My advice would be to look for something faster only when you really need it; by then you will probably have a much more detailed question to ask. Maybe I am wrong, but it looks like you are trying to optimize prematurely, which as you perhaps know is the root of all evil. Of course, bad design can lead to massive rework, but here you are dealing with a very small detail of your real design decision, which is using a boss / worker pattern.
Implement your design with pthread_cond_signal(), or perhaps sem_post() / sem_wait(), and then look at where your latency really is, and whether it is really a problem.
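For comparison with the condition-variable version, a semaphore-based handoff might look like the sketch below. The names (`work_available`, `sem_worker`, `run_sem_handoff`) and the "work" itself (doubling an integer) are made up for illustration, and only one item is handed over.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t work_available;    /* counts work items the boss has posted */
static int work_item;           /* written by the boss before sem_post */
static int work_result;         /* written by the worker, read after join */

static void *sem_worker(void *arg)
{
    (void)arg;
    /* Block until the boss posts; retry if a signal interrupts the wait. */
    while (sem_wait(&work_available) == -1 && errno == EINTR)
        continue;
    work_result = work_item * 2;    /* stand-in for real processing */
    return NULL;
}

/* Hands one item to a sleeping worker and returns the worker's result. */
int run_sem_handoff(int item)
{
    pthread_t tid;

    if (sem_init(&work_available, 0, 0) != 0)
        return -1;
    int rc = pthread_create(&tid, NULL, sem_worker, NULL);
    assert(rc == 0);
    (void)rc;

    work_item = item;               /* publish the work ... */
    sem_post(&work_available);      /* ... then wake the worker */

    pthread_join(tid, NULL);
    sem_destroy(&work_available);
    return work_result;
}
```

A semaphore remembers posts that arrive before the worker starts waiting, so no separate "is there work" flag is needed for this simple case.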
I would guess signal and wait would be the best. Most OSes recognize threads and can leave them idle until the interrupt comes. With pipes the worker would have to keep waking up and checking the pipe for output. The best test of efficiency I've found has usually been the Unix time command, which gives the running time from start to finish (assuming the program isn't meant to keep running in the background); set up a script to run it a few times and compare.
According to my question here I would like to use SCHED_RR with pthread_setschedparam for my threads in a Linux application. However, this has effects even on kernel modules which I currently cannot solve.
I have found http://www.icir.org/gregor/tools/pthread-scheduling.html which says that I could create my threads with PTHREAD_SCOPE_PROCESS attribute, but I haven't found further information on this.
Will this work with (Angstrom) Linux, kernel version 2.6.32? (How) will this affect the way my process competes with other processes? Would it be a way to have my threads compete with each other under real-time scheduling without affecting other processes?
(As I am using boost threads I cannot simply try this...)
Threads created with PTHREAD_SCOPE_PROCESS will share the same kernel thread (http://lists.freebsd.org/pipermail/freebsd-threads/2006-August/003674.html).
However, SCHED_RR must be run under a root-privileged process.
Round-Robin; threads whose contention scope is system
(PTHREAD_SCOPE_SYSTEM) are in real-time (RT) scheduling class if the
calling process has an effective user id of 0. These threads, if not
preempted by a higher priority thread, and if they do not yield or
block, will execute for a time period determined by the system.
SCHED_RR for threads that have a contention scope of process
(PTHREAD_SCOPE_PROCESS) or whose calling process does not have an
effective user id of 0 is based on the TS scheduling class.
However, based on your linked problem, I think you are facing a deeper issue. Have you tried setting your kernel to be more "preemptive"? Preemption allows the kernel to forcibly schedule your process out, making some kernel parts more responsive. This shouldn't affect IRQs, though; maybe something disabled your IRQs?
Another possibility is that you are not fetching your SPI data fast enough, so the kernel buffer for your data fills up, hence the data loss. Try increasing those buffers as well.
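On the PTHREAD_SCOPE_PROCESS part of the question, it is easy to check what a given system accepts. As far as I know, the Linux man page for pthread_attr_setscope states that Linux supports PTHREAD_SCOPE_SYSTEM but not PTHREAD_SCOPE_PROCESS, in which case the call below is expected to return ENOTSUP. The function name `try_process_scope` is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Returns 0 if PTHREAD_SCOPE_PROCESS is accepted, otherwise an error number
 * (ENOTSUP where only system contention scope is implemented). */
int try_process_scope(void)
{
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_attr_setscope(&attr, PTHREAD_SCOPE_PROCESS);
    pthread_attr_destroy(&attr);
    return rc;
}
```

With boost threads you may not get direct access to the attribute object, but this check can be run as a separate small program on the target board to see whether the option exists at all.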
A thread is "lightweight" because most of the overhead has already been accomplished through the creation of its process.
I found this in one of the tutorials.
Can somebody elaborate what it exactly means?
The claim that threads are "lightweight" is - depending on the platform - not necessarily reliable.
An operating system thread has to support the execution of native code, e.g. written in C. So it has to provide a decent-sized stack, usually measured in megabytes. So if you started 1000 threads (perhaps in an attempt to support 1000 simultaneous connections to your server) you would have a memory requirement of 1 GB in your process before you even start to do any real work.
This is a real problem in highly scalable servers, so they don't use threads as if they were lightweight at all. They treat them as heavyweight resources. They might instead create a limited number of threads in a pool, and let them take work items from a queue.
As this means that the threads are long-lived and small in number, it might be better to use processes instead. That way you get address space isolation and there isn't really an issue with running out of resources.
In summary: be wary of "marketing" claims made on behalf of threads. Parallel processing is great (increasingly it's going to be essential), but threads are only one way of achieving it.
Process creation is "expensive" because it has to set up a complete new virtual memory space for the process, with its own address space. "Expensive" here means that it takes a lot of CPU time.
Threads don't need to do this, just change a few pointers around, so it's much "cheaper" than creating a process. The reason threads don't need this is because they run in the address space, and virtual memory of the parent process.
Every process must have at least one thread. So if you think about it, creating a process means creating the process AND creating a thread. Obviously, creating only a thread will take less time and work by the computer.
In addition, threads are "lightweight" because threads can interact without the need of inter-process communication. Switching between threads is "cheaper" than switching between processes (again, just moving some pointers around). And inter-process communication requires more expensive communication than threads.
Threads within a process share the same virtual memory space, but each has a separate stack, and possibly "thread-local storage" if implemented. They are lightweight because a context switch is simply a case of switching the stack pointer and program counter and restoring other registers, whereas a process context switch involves switching the MMU context as well.
Moreover, communication between threads within a process is lightweight because they share an address space.
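The shared address space is easy to see in a small experiment: a write made by a thread is visible to its creator, while a write made by a forked child lands in the child's own copy of memory. This is only an illustrative sketch and the names (`shared_value`, `value_after_thread`, `value_after_fork`) are made up.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_value;    /* one global; who can see writes to it? */

static void *set_to_one(void *arg)
{
    (void)arg;
    shared_value = 1;       /* same address space as the creating thread */
    return NULL;
}

/* A thread's write is visible after join: returns 1. */
int value_after_thread(void)
{
    pthread_t tid;

    shared_value = 0;
    if (pthread_create(&tid, NULL, set_to_one, NULL) != 0)
        return -1;
    pthread_join(tid, NULL);
    return shared_value;
}

/* A child process writes to its own copy of the page: returns 0. */
int value_after_fork(void)
{
    shared_value = 0;
    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        shared_value = 1;   /* only changes the child's private copy */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return shared_value;
}
```

To get the thread-like behaviour between processes, the parent would have to set up explicit shared memory (for example with mmap and MAP_SHARED) before forking, which is the extra cost the answers above refer to.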
process:
process id
environment
folder
registers
stack
heap
file descriptor
shared libraries
instruments of interprocess communications (pipes, semaphores, queues, shared memory, etc.)
OS-specific resources
thread:
stack
registers
attributes (for the scheduler, like priority, policy, etc.)
thread-specific data
OS-specific resources
A process contains one or more threads, and a thread can do anything a process can do. Threads within a process also share the same address space, so the cost of communication between them is low: they use the same code section, data section and OS resources. These features are what make a thread a "lightweight process".
Simply because the threads share a common memory space: the memory allocated to the main thread is shared by all of its child threads.
A child process, by contrast, needs its own separate memory space to be allocated.