I want to write a high performance synchronized generator in C. I want to be able to feed events to it and have multiple threads be able to poll/read asynchronously, such that threads never receive duplicates.
I don't really know that much about how synchronization is typically done. Can someone give me a high level explanation of one or more techniques that I might be able to use?
Thanks!
You need a thread implementation; C itself has no built-in support for multithreading (the optional <threads.h> only arrived in C11). Threads are therefore usually provided by libraries. Such a library will typically give you ways to synchronize the execution of multiple threads, ways to protect data, and so on.
The main concept in thread safety is the mutex (though there are different kinds of locks).
It is used to protect your memory from concurrent accesses and race conditions.
A good example of its use would be a linked list: you can't allow two different threads to modify it at the same time. In your case, you could use a linked list to build a queue, and each thread would consume some data from it.
Obviously there are other synchronization mechanisms, but this one is (arguably by far) the most important.
You could have a look at this page (and referenced pages at the bottom) for more implementation details.
Thread safety is only a problem when there are variables shared between threads. If you don't have any shared variables, it's not a problem: every event can be read-only and dispatched to any listener.
Thread safety is achieved by using whatever synchronisation primitives the multithreading implementation provides.
Your starting point would probably be a linked list of events and a lock that protects it: every thread takes the lock, consumes one event by advancing the pointer to the first event, and then releases the lock; appending events also locks the entire list. When the list is empty, the workers exit.
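For concreteness, here is a minimal sketch of that baseline design (the event payload and the function names are illustrative):

#include <pthread.h>

struct event {
    struct event *next;
    /* ... event payload ... */
};

static struct event *head = NULL;
static pthread_mutex_t list_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Consumer: take one event, or NULL if the list is empty.  Because the
   head pointer is only advanced under the lock, no two threads can
   receive the same event. */
struct event *take_event(void)
{
    pthread_mutex_lock(&list_mutex);
    struct event *ev = head;
    if (ev)
        head = ev->next;
    pthread_mutex_unlock(&list_mutex);
    return ev;
}

/* Producer: appending walks to the tail, so it locks the whole list. */
void append_event(struct event *ev)
{
    ev->next = NULL;
    pthread_mutex_lock(&list_mutex);
    struct event **p = &head;
    while (*p)
        p = &(*p)->next;
    *p = ev;
    pthread_mutex_unlock(&list_mutex);
}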
From there, various optimisations are possible:
Caching the pointer to the last event, so appending an event to the list becomes cheaper.
Adding a notification mechanism so worker threads can sleep while the list is empty. Typically, this is achieved with something called a condition variable.
Using multiple lists, so if the first list is locked, the worker can retrieve an event from another list without having to wait for the thread that has currently locked the list.
I have an object (a kind of queue) which is accessed across threads. The queue object can be mutex-locked before use by either thread.
A simpler way to manage this is to bring the lock inside the queue object itself: every API call locks the queue and releases it when the work is done. This way, threads don't have to manage an additional mutex variable alongside each queue.
Now my question is, sometimes only one thread is accessing the queue (say it is a local variable). But since the queue now inherently locks its internal data structure and unlocks it before returning, will this be a costly affair?
How costly are the redundant mutex_lock and mutex_unlock operations when there is no actual need for thread synchronization?
PS:
My question is slightly related to this one: How efficient is locking an unlocked mutex? What is the cost of a mutex?
But I am looking for a specific answer in my design, and an understanding of why.
I am using C and the pthread library.
One way to handle this is to have your queue initialization take a parameter that indicates whether a lock should be acquired during queue operations. If a queue is used by a single thread, it gets initialized such that it won't acquire/release the lock (or it uses a lock object whose acquire/release operations are no-ops).
See this answer for an example of how boost::pool does something along these lines (although in C++ and as a compile time configuration): https://stackoverflow.com/a/10188784/12711
A similar concept can be applied to C code at runtime, too.
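A minimal sketch of that idea in C (names like tsqueue_* are illustrative, not a real library):

#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct node { struct node *next; void *data; } node_t;

typedef struct {
    node_t *head, *tail;
    bool use_lock;            /* decided once at init time */
    pthread_mutex_t mutex;    /* only touched when use_lock is true */
} tsqueue_t;

void tsqueue_init(tsqueue_t *q, bool use_lock)
{
    q->head = q->tail = NULL;
    q->use_lock = use_lock;
    if (use_lock)
        pthread_mutex_init(&q->mutex, NULL);
}

/* Every public queue operation brackets its work with these helpers,
   so a single-threaded queue skips the lock entirely. */
static void tsqueue_acquire(tsqueue_t *q) { if (q->use_lock) pthread_mutex_lock(&q->mutex); }
static void tsqueue_release(tsqueue_t *q) { if (q->use_lock) pthread_mutex_unlock(&q->mutex); }

void *tsqueue_pop(tsqueue_t *q)
{
    tsqueue_acquire(q);
    node_t *n = q->head;
    void *data = NULL;
    if (n) {
        q->head = n->next;
        if (!q->head)
            q->tail = NULL;
        data = n->data;
        free(n);
    }
    tsqueue_release(q);
    return data;
}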
First of all: neither the C library nor pthreads implements mutex locking from scratch; on most systems an uncontended lock is a cheap userspace atomic operation, and only contended locks call into the kernel's primitives (futexes, on Linux). This implies that the performance of mutexes will vary widely with the underlying OS.
If you can narrow your portability requirements to hardware that supports atomic compare-exchange or atomic increase-and-read (i.e. fetch-and-add, available on any x86 from this millennium), you can use atomic increase-and-read to create a thread-safe queue that does not need locking.
For the .Net platform I have such a beast at http://sourceforge.net/projects/dotnetlockless - it should be quite easy to port it to C.
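If porting a full queue is overkill, the core trick can be shown with C11 atomics: consumers claim unique indices from a pre-filled array with a single fetch-and-add, so no thread can ever receive a duplicate and no mutex is involved. A sketch (array size and names are illustrative):

#include <stdatomic.h>
#include <stddef.h>

#define N_EVENTS 1024
static void *events[N_EVENTS];    /* filled by the producer beforehand */
static atomic_size_t next_index;  /* zero-initialized */

/* Each call returns a distinct event, or NULL once all are consumed. */
void *claim_event(void)
{
    size_t i = atomic_fetch_add(&next_index, 1);  /* atomic increase-and-read */
    return (i < N_EVENTS) ? events[i] : NULL;
}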
How does one implement a multithreaded, single-process model in C on Linux (Fedora)? A single scheduler thread runs on a "main" core and watches for I/O availability (e.g. TCP/IP, UDP); a single "execution thread" per core (started at init) then parses the data and writes a small amount of updated information to a shared memory space (it is my understanding that pthreads share data within a single process).
I believe my options are:
pthreads, or the Linux OS scheduler.
I have a naive model in mind consisting of starting a certain number of these execution threads plus a single scheduler thread.
What is the best solution one can think of, given that I can use this sort of model?
To complete Benoit's answer: in order to communicate between your master and your worker threads, you could use a condition variable. The workers do something like this:
for (;;)
{
    pthread_mutex_lock(workQueueMutex);
    while (queue_empty(workQueue))
        pthread_cond_wait(workQueueCond, workQueueMutex);
    /* if we get here then (a) we have work and (b) we hold workQueueMutex */
    work = pop(workQueue);
    pthread_mutex_unlock(workQueueMutex);
    /* do work */
}
and the master:
/* I/O received */
pthread_mutex_lock(workQueueMutex);
push(workQueue, work);
pthread_cond_signal(workQueueCond);
pthread_mutex_unlock(workQueueMutex);
This wakes up one idle worker to process the request immediately. If no worker is available, the work simply stays in the queue and is processed later.
Modifying the Linux scheduler is quite a tough job; I would just forget about it. pthreads are usually preferred. If I understand correctly, you want one core dedicated to the control plane, and a pool of other cores dedicated to data-plane processing? Then create a pool of threads from your master thread and set the core affinity of these slave threads with pthread_setaffinity_np(...).
Indeed, the threads of a process share the same address space, so global variables are accessible by any thread of that process.
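A sketch of that setup (pthread_setaffinity_np is a GNU extension and needs _GNU_SOURCE; error handling omitted, and the worker body is a placeholder):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

void *worker(void *arg)
{
    /* data-plane processing loop goes here */
    return NULL;
}

void spawn_pinned_workers(pthread_t *tids, int n_cores)
{
    /* core 0 is left for the master/control thread */
    for (int core = 1; core < n_cores; core++) {
        pthread_create(&tids[core], NULL, worker, NULL);
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(tids[core], sizeof(set), &set);
    }
}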
It looks to me that you have a version of the producer-consumer problem with a single consumer aggregating the results of n producers. This is a pretty standard problem, so I definitely think that pthread is more than enough for you. You don't need to go and mess around with the scheduler.
As one of the answers states, a thread-safe queue like the one described here works nicely for this sort of issue. Your original idea of spawning a bunch of threads is a good one. You seem worried that the threads' ability to share global state will cause you problems; I don't think that's an issue if you keep shared state to a minimum and use a sane locking discipline. Sharing state is fine as long as you do so responsibly.
Finally, unless you really know what you're doing, I would advise against manually messing with thread affinity. Just spawn the threads and let the scheduler handle when and on what core a thread runs. The thing to optimize is the number of threads you use. One for each core may not actually be the fastest approach if other threads are running.
Generally speaking, this is more or less exactly what the POSIX select and Linux-specific epoll functions are for.
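For example, a minimal Linux epoll loop for the master thread might look like this (error handling omitted; the dispatch step would hand the fd to a worker):

#include <sys/epoll.h>

void event_loop(int listen_fd)
{
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event ready[64];
    for (;;) {
        /* block until at least one registered fd has I/O available */
        int n = epoll_wait(ep, ready, 64, -1);
        for (int i = 0; i < n; i++) {
            /* dispatch ready[i].data.fd to a worker thread */
        }
    }
}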
Suppose I create threads with pthreads, is it possible to send them new things to work on after they have been initialized, so I don't waste resources in creating new threads? For instance, I create 3 threads, thread 2 signals completion and I send it another "task" without killing it and starting a new one. Thanks.
The usual, simple form is an ordinary (work) queue. In principle, you maintain a queue structure, perhaps as a linked list, protected by a mutex. Typically, condition variables are used by the main/producer threads to notify worker threads that new work is available, so they don't have to poll.
Some previous SO questions that may also be useful are:
How To Use Condition Variable
One producer, Two consumers and usage of pthread_cond_signal & pthread_mutex_lock
pthread conditional variable
Yes, and that is what servers like Apache do to increase their performance. The design pattern is called the Thread pool pattern and there are various implementations (this one for example) using pthreads.
Of course, you might want to keep your implementation as simple as possible, depending on what your goals are.
Of course. For example, you can use producer-consumer pattern. Here is an example in C#, but it can be easily implemented in pthreads as well.
The search keywords for your question are "thread pooling" or "thread pool". Using these terms you will find plenty of information on this site and also via Google.
I am developing a user-level thread library as part of a project. I came up with an approach to implementing mutexes, and I would like to hear your views before going ahead with it. Basically, I need to implement just 3 functions in my library:
mutex_init, mutex_lock and mutex_unlock
I thought my mutex_t structure would look something like
typedef struct
{
    int available;               //indicates whether the mutex is locked or unlocked
    queue listofwaitingthreads;  //threads blocked waiting for this mutex
    gtthread_t owningthread;     //thread currently holding the mutex
} mutex_t;
In my mutex_lock function, I will first check in a while loop whether the mutex is available. If it is not, I will yield the processor so the next thread can execute.
In my mutex_unlock function, I will check whether the owning thread is the current thread. If it is, I will set available to 0.
Is this the way to go about it? Also, what about deadlock? Should I take care of those conditions in my user-level library, or should I leave it to the application programmers to write correct code?
This won't work, because you have a race condition: if two threads try to grab the lock at the same time, both will see available == 0, and both will think they succeeded in taking the mutex.
If you want to do this properly, without using an already-existing lock, you must use hardware operations like TAS (test-and-set), CAS (compare-and-swap), etc.
There are algorithms that give you mutual exclusion without such hardware support, but they make assumptions that are often false. For more details, I highly recommend reading Herlihy and Shavit's The Art of Multiprocessor Programming, chapter 7.
You shouldn't worry about deadlocks at this level - mutex locks should be simple enough, and there is an assumption that the programmer using them will take care not to cause deadlocks (advanced mutexes can detect self-deadlock, meaning a thread that calls lock twice without calling unlock in between).
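To illustrate the TAS route, here is a sketch using GCC's __sync builtins; gtthread_yield() and gtthread_self() are assumed to be your library's yield and self-identification calls:

typedef struct
{
    volatile int locked;        /* 0 = free, 1 = held */
    gtthread_t owningthread;
} tas_mutex_t;

void tas_mutex_lock(tas_mutex_t *m)
{
    /* Atomically set locked to 1 and get the previous value; if the
       previous value was already 1, another thread holds the lock. */
    while (__sync_lock_test_and_set(&m->locked, 1))
        gtthread_yield();
    m->owningthread = gtthread_self();
}

void tas_mutex_unlock(tas_mutex_t *m)
{
    __sync_lock_release(&m->locked);    /* atomically stores 0 */
}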
Not only do you have to use atomic operations to read and modify the flag (as Eran pointed out), you also have to make sure that your queue itself supports concurrent access. This is not completely trivial - a sort of chicken-and-egg problem.
But if you really implemented this by spinning, you wouldn't even need such a queue. The access order to the lock would then be essentially random, though.
Plain yielding would probably not be enough either; it can become quite costly if threads hold the lock for more than a few processor cycles. Consider using nanosleep with a small time value for the wait.
In general, a mutex implementation should look like:
Lock:
    while (trylock() == failed) {
        atomic_inc(waiter_cnt);
        atomic_sleep_if_locked();
        atomic_dec(waiter_cnt);
    }

Trylock:
    return atomic_swap(&lock, 1);

Unlock:
    atomic_store(&lock, 0);
    if (waiter_cnt) wakeup_sleepers();
Things get more complex if you want recursive mutexes, mutexes that can synchronize their own destruction (i.e. freeing the mutex is safe as soon as you get the lock), etc.
Note that atomic_sleep_if_locked and wakeup_sleepers correspond to FUTEX_WAIT and FUTEX_WAKE ops on Linux. The other atomics are probably CPU instructions, but could be system calls or kernel-assisted userspace function code, as in the case of Linux/ARM and the 0xffff0fc0 atomic compare-and-swap call.
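On Linux, a minimal runnable version of that sketch might look like the following (error handling omitted; glibc provides no futex wrapper, hence the raw syscall):

#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lock_word;    /* 0 = unlocked, 1 = locked */
static atomic_int waiter_cnt;

/* FUTEX_WAIT sleeps only if *addr still equals val, which is exactly
   the atomic_sleep_if_locked from the pseudocode above. */
static void futex_wait(atomic_int *addr, int val)
{
    syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void futex_wake(atomic_int *addr, int n)
{
    syscall(SYS_futex, addr, FUTEX_WAKE, n, NULL, NULL, 0);
}

void my_lock(void)
{
    while (atomic_exchange(&lock_word, 1) != 0) {   /* trylock */
        atomic_fetch_add(&waiter_cnt, 1);
        futex_wait(&lock_word, 1);
        atomic_fetch_sub(&waiter_cnt, 1);
    }
}

void my_unlock(void)
{
    atomic_store(&lock_word, 0);
    if (atomic_load(&waiter_cnt) > 0)
        futex_wake(&lock_word, 1);
}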
You do not need atomic instructions for a user-level thread library, because all the threads are user-level threads of the same process: when your process is given a time slice, you run multiple threads during that slice but on the same processor, and a purely cooperative scheduler never preempts a thread in the middle of a library call. So no two threads are inside the library functions at the same time, and since the mutex functions live in the library, mutual exclusion is guaranteed. (Note this breaks down if your library preempts threads, e.g. with a timer signal.)
I’m buried in multithreading/parallelism documents, trying to figure out how to implement threading in a programming language I’ve been designing.
I’m trying to map a mental model to the pthreads.h library, but I’m having trouble with one thing: I need my interpreter instances to continue to exist after they complete interpretation of a routine (the language’s closure/function data type), because I want to later assign other routines to them for interpretation, thus saving me the thread and interpreter setup/teardown time.
This would be fine, except that pthread_join(3) requires that I call pthread_exit(3) to ‘unblock’ the original thread. How can I block the original thread (when it needs the result of executing the routine), and then unblock it when interpretation of the child routine is complete?
Use a pthread_cond_t; wait on it on one thread and signal or broadcast it in the other.
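A sketch of that handshake (routine_done and the function names are illustrative):

#include <pthread.h>

static pthread_mutex_t done_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done_cond  = PTHREAD_COND_INITIALIZER;
static int routine_done;           /* protected by done_mutex */

/* Original thread: block until the interpreter finishes the routine. */
void wait_for_routine(void)
{
    pthread_mutex_lock(&done_mutex);
    while (!routine_done)          /* loop guards against spurious wakeups */
        pthread_cond_wait(&done_cond, &done_mutex);
    routine_done = 0;              /* reset for the next routine */
    pthread_mutex_unlock(&done_mutex);
}

/* Interpreter thread: call this when interpretation completes. */
void signal_routine_done(void)
{
    pthread_mutex_lock(&done_mutex);
    routine_done = 1;
    pthread_cond_signal(&done_cond);
    pthread_mutex_unlock(&done_mutex);
}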
Sounds like you actually want an implementation of the thread pool pattern. It makes for a fairly simple conceptual model, without repeated thread creation and teardown costs. Some OSes support it directly; on others it should be reasonably simple to implement using a queue and a semaphore.