What is the concept of implementing Thread-pool (in C with help from pthreads)?
How can a thread be assigned to execute work from the thread pool?
A thread-pool is a collection of a fixed number of threads which are created on application startup. The threads then sit waiting for requests to come to them, typically via a queue controlled by a semaphore. When a request is made, and there is at least one thread waiting, the thread is woken up, services the request, and goes back to waiting on the semaphore. If no threads are available, requests queue up until one is.
Thread-pools are a generally more efficient way of managing resources than simply starting a new thread for every request. However, some architectures allow new threads to be created and added to the pool as the application runs, depending on request loading.
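To make this concrete, here is a minimal sketch in C with pthreads, using a mutex and condition variables in place of a semaphore. The names (tpool_t, tpool_submit, tpool_create) and the fixed queue capacity are illustrative choices rather than a standard API, and error handling and shutdown are omitted:

#include <pthread.h>
#include <stdlib.h>

#define QUEUE_CAP 64                      /* illustrative fixed capacity */

typedef void (*job_fn)(void *);

typedef struct { job_fn fn; void *arg; } job_t;

typedef struct {
    job_t           queue[QUEUE_CAP];
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty, not_full;
    pthread_t      *threads;
    int             nthreads;
} tpool_t;

/* Each worker sits waiting for a request, services it, then waits again. */
static void *worker(void *p)
{
    tpool_t *tp = p;
    for (;;) {
        pthread_mutex_lock(&tp->lock);
        while (tp->count == 0)
            pthread_cond_wait(&tp->not_empty, &tp->lock);
        job_t job = tp->queue[tp->head];
        tp->head = (tp->head + 1) % QUEUE_CAP;
        tp->count--;
        pthread_cond_signal(&tp->not_full);
        pthread_mutex_unlock(&tp->lock);

        job.fn(job.arg);                  /* service the request outside the lock */
    }
    return NULL;
}

/* Queue a request; if the queue is full, the caller blocks until a worker drains it. */
void tpool_submit(tpool_t *tp, job_fn fn, void *arg)
{
    pthread_mutex_lock(&tp->lock);
    while (tp->count == QUEUE_CAP)
        pthread_cond_wait(&tp->not_full, &tp->lock);
    tp->queue[tp->tail] = (job_t){ fn, arg };
    tp->tail = (tp->tail + 1) % QUEUE_CAP;
    tp->count++;
    pthread_cond_signal(&tp->not_empty);
    pthread_mutex_unlock(&tp->lock);
}

/* Create the fixed set of threads at application startup. */
tpool_t *tpool_create(int nthreads)
{
    tpool_t *tp = calloc(1, sizeof *tp);
    pthread_mutex_init(&tp->lock, NULL);
    pthread_cond_init(&tp->not_empty, NULL);
    pthread_cond_init(&tp->not_full, NULL);
    tp->threads = calloc(nthreads, sizeof *tp->threads);
    tp->nthreads = nthreads;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tp->threads[i], NULL, worker, tp);
    return tp;
}

A counting semaphore (sem_t) would work equally well here; the mutex/condvar pair just makes the queue-full case easy to express.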
To clarify something in previous answers:
The reason that instantiating more and more threads leads to inefficiency is context switching time. The OS periodically switches one thread for another on the processor. This involves saving one thread's state and loading another thread's state from memory, so it takes non-negligible time, N ms, per context switch.
For example, if you have 10 threads, the context switching takes 10*N ms. If you have 1000 threads, it's 1000*N ms. As the number of concurrent threads increases, eventually the context switching begins to overwhelm any efficiencies derived from multithreading. Your application has a sweet spot in terms of the best number of threads. Once you determine this number by experimentation, you can set your thread pool's maximum size to that many threads, thereby obtaining maximum efficiency from multithreading.
Adding to anon's answer, I'd like to mention that there are fixed thread pools, which have a fixed number of threads running in them; cached thread pools, which can dynamically grow and then shrink when no work is available; and dynamic thread pools, which can also be bounded by a maximum number of threads and/or a maximum length of the work queue. I don't think there is actually a set terminology for this kind of stuff, and one rarely encounters non-fixed TPs written in C, but at least one should know that the fixed TP is not the only kind out there.
I'm working on a standard x86 six core SMP machine, 3.6GHz clock speed, plain C code.
I have a threaded producer/consumer scheme in which my "producer" thread is reading from file at roughly 1,000,000 lines/second, and handing the data it reads off to either two or four "consumer" threads which do a bit of work on it and then stick it into a database. While they are consuming it is busy reading the next line.
So both producer and consumers have to have some means of synchronisation which works at sub-microsecond frequency, for which I use a "busy spin wait" loop, because all the normal synchronisation mechanisms I can find are just too slow. In pseudo code terms:
Producer thread
while (something in file)
{
    read a line
    populate 1/2 of data double buffer
    wait for consumers to idle
    set some key data
    set memory fence
    swap buffers
}
And the consumer threads likewise
while (not told to die)
{
    wait for key data change event
    consume data
}
On both sides, the "wait" loop is coded as:
while (waiting)
{
    _mm_pause();              /* Intel says this is a good hint to the processor that this is a spin wait */
    if (++iterations > 1000)
        yield_thread();       /* Sleep(0) on Windows, pthread_yield() on Linux */
}
This all works, and I get some quite nice speed-ups compared to the equivalent serial code, but my profiler (Intel's VTune Amplifier) shows that I am spending a horrendous amount of time in my busy-wait loops, and the ratio of "spin" to "useful work done" is depressingly high. Given the way the profiler concentrates its feedback on the busiest sections, this also means that the lines of code doing useful work tend not to be reported, since (relatively speaking) their percentage of total CPU is down at the noise level... or at least that is what the profiler is saying. They must be doing something, otherwise I wouldn't see any speed-up!
I can and do time things, but it is hard to distinguish between delays imposed by disk latency in the producer thread, and delays spent while the threads synchronise.
So is there a better way to measure what is actually going on? By which I mean just how much time are these threads really spending waiting for one another? Measuring time accurately is really hard at sub-microsecond resolution, the profiler doesn't seem to give me much help, and I am struggling to optimise the scheme.
Or maybe my spin wait scheme is rubbish, but I can't seem to find a better solution for sub-microsecond synchronisation.
Any hints would be really welcome :-)
Even better than fast locks is not locking at all. Try switching to a lock-free queue. Producers and consumers wouldn't need to wait at all.
Lock-free data structures are process, thread and interrupt safe (i.e. the same data structure instance can be safely used concurrently and simultaneously across cores, processes, threads, and both inside and outside of interrupt handlers). They never sleep (and so are safe for kernel use when sleeping is not permitted), operate without context switches, and cannot fail (there are no error cases to handle). They perform and scale literally orders of magnitude better than locking data structures. liblfds itself (as of release 7.0.0) is implemented such that it performs no allocations (and so works with NUMA, stack, heap and shared memory) and compiles not just on a freestanding C89 implementation, but on a bare C89 implementation.
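For the single-producer/single-consumer setup described in the question, a lock-free queue can be as small as a ring buffer with two atomic indices. The sketch below uses C11 atomics; the type names and the power-of-two capacity are illustrative, and it is only safe for exactly one producer and one consumer:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAP 1024                 /* must be a power of two */

typedef struct {
    void          *slots[RING_CAP];
    _Atomic size_t head;              /* advanced only by the consumer */
    _Atomic size_t tail;              /* advanced only by the producer */
} spsc_ring;

/* Producer: returns false when full, so the caller can yield or retry
   instead of blocking. */
bool ring_push(spsc_ring *r, void *item)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_CAP)
        return false;                 /* full */
    r->slots[tail % RING_CAP] = item;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer: returns false when empty. */
bool ring_pop(spsc_ring *r, void **item)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;                 /* empty */
    *item = r->slots[head % RING_CAP];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

The release/acquire pairing plays the role of the explicit memory fence in the original double-buffer scheme.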
Thank you to all who commented above; the suggestion of making the quantum of work bigger was the key. I have now implemented a queue (a 1000-entry rotating buffer) for my consumer threads, so the producer only has to wait if that queue is full, rather than waiting for its half of the double buffer in my previous scheme. So its synchronisation time is now sub-millisecond instead of sub-microsecond - well, that's a surmise, but it's definitely 1000x longer than before!
If the producer hits "queue full" I can now yield its thread immediately, instead of spin waiting, safe in the knowledge that any time slice it loses will be used gainfully by the consumer threads. This does indeed show up as a small amount of sleep/spin time in the profiler. The consumer threads benefit too since they have a more even workload.
Net outcome is a 10% reduction in the overall time to read a file, and given that only part of the file can be processed in a threaded manner, that suggests the threaded part of the process is around 15% or more faster.
I'm using Windows 7 with VC++ 2010, and this is a 32-bit application.
I am trying to get my renderer to work multithreaded but as it turns out I made it slower than without using multiple threads.
I want it to have the main thread adding rendering commands to a list, and a worker thread that does the rendering of these commands.
This all does happen, and it draws to the screen fine, but I get less fps when doing so...
I used the benchmark tool in Fraps to get this data:
Time is the time it was benchmarked for, in this case 30 seconds.
Min, max, avg are all FPS values.
With Multithreading:
Frames, Time (ms), Min, Max, Avg
28100, 30000, 861, 1025, 936.667
Without multithreading:
Frames, Time (ms), Min, Max, Avg
21483, 30000, 565, 755, 716.100
Here is some pseudocode (with the relevant event function calls):
Main Thread:
    Add render commands to queue
    ResetEvent(renderCompletedEvent);
    SetEvent(renderCommandsEvent);
    WaitForSingleObject(renderCompletedEvent, INFINITE);
Render Thread:
    WaitForSingleObject(renderCommandsEvent, INFINITE);
    Process commands
    SetEvent(renderCompletedEvent);
    ResetEvent(renderCommandsEvent);
Why would you expect this to be faster?
Only one thread is ever doing anything: you create the commands in one thread, then signal the other and wait for it to finish, which takes just as long as doing it in the first thread, only with more overhead.
To take advantage of multithreading you need to ensure that both threads are doing something at the same time.
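One common way to get that overlap, sketched below, is to double-buffer the command list: the main thread records frame N+1 into one buffer while the render thread drains frame N from the other. Only the event calls (CreateEvent, SetEvent, WaitForSingleObject) are the real Win32 API here; CommandBuffer, BuildCommands and ProcessCommands are placeholders standing in for your own code:

#include <windows.h>

typedef struct { void *cmds; int count; } CommandBuffer;   /* placeholder type */

void BuildCommands(CommandBuffer *b);     /* application-specific, assumed */
void ProcessCommands(CommandBuffer *b);   /* application-specific, assumed */

static CommandBuffer buffers[2];
static HANDLE bufferReady[2];   /* auto-reset, initially unsignalled */
static HANDLE bufferDone[2];    /* auto-reset, initially signalled   */

void InitEvents(void)
{
    int i;
    for (i = 0; i < 2; i++) {
        bufferReady[i] = CreateEvent(NULL, FALSE, FALSE, NULL);
        bufferDone[i]  = CreateEvent(NULL, FALSE, TRUE,  NULL);
    }
}

DWORD WINAPI RenderThread(LPVOID unused)
{
    int i = 0;
    (void)unused;
    for (;;) {
        WaitForSingleObject(bufferReady[i], INFINITE);
        ProcessCommands(&buffers[i]);
        SetEvent(bufferDone[i]);
        i ^= 1;                 /* move on to the other buffer */
    }
}

/* Called once per frame by the main thread. */
void MainThreadFrame(void)
{
    static int i = 0;
    WaitForSingleObject(bufferDone[i], INFINITE);  /* render thread finished with it */
    BuildCommands(&buffers[i]);                    /* record this frame's commands   */
    SetEvent(bufferReady[i]);                      /* render thread starts on it...  */
    i ^= 1;                                        /* ...while we fill the other one */
}

This way the main thread only waits when it gets a full frame ahead, instead of stalling after every frame.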
I am no OpenGL expert, but in general it is important to realize that threads are actually not used to speed things up; they are used to guarantee that some subsystem is responsive, at the cost of overall speed. That is, one might keep a GUI thread and a networking thread to ensure that the GUI and networking are responsive, and that is actually done at a performance cost to the main thread. The CPU is going to give 1/3 of its time to the main thread, 1/3 to the networking thread and 1/3 to the GUI thread, even if there are no GUI events to handle and nothing going in or out of the network. Thus whatever the main thread is doing gets only 1/3 of the CPU time it would get in a non-multithreaded situation. The upside is that if a lot of data starts arriving over the network, there is always CPU time reserved to handle it (which can be bad if there isn't, as the networking buffer can fill up and additional data starts being dropped or overwritten).
The possible exception is when multiple threads are running on different cores. However, even then be careful: cores can share the same caches, so if two cores are invalidating each other's caches, performance could drop dramatically, not improve. If the cores share some resource to move data to and from the GPU, or have some other shared limiting resource, this again could cause performance losses, not gains.
In short, threading on a single-CPU system is always about responsiveness of a subsystem, not performance. There are possible performance gains when different threads run on multiple cores (which Windows doesn't seem to usually do by default, but it can be forced). However, there are potential issues with doing this when those cores share some resource, e.g. shared cache space or some shared GPU-related resource in your context, which could hurt rather than help performance.
I have more of a conceptual question.
Assume one program running two threads.
Both threads are running loops all the time.
One thread is responsible for streaming data and the other thread is responsible for receiving the file that the first thread has to stream.
So the file transfer thread loops to receive the data, which it writes to a file, and the streaming thread reads that data from the file as it needs it and streams it.
The problem I see here is how to avoid starvation when the file transfer is taking too many CPU cycles for its own good and thus making the streaming thread lag.
How would I be able to share the CPU effectively between those two threads, knowing that the streamer streams data far slower than the file transfer receives it?
I thank you for your advice.
Quite often this kind of problem is solved by using some kind of flow control: block the sender when the receiver is busy.
This also causes problems: if your program must be able to fast-forward (seek forward), then this is not a good idea.
In your case, you could block the file transfer thread when there is more than 2 MB of unstreamed data in the file, and resume it when there is less than 1 MB of unstreamed data.
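A rough sketch of that watermark scheme using a pthread mutex and condition variable; the byte counts mirror the 2 MB / 1 MB figures above, and the function names are only placeholders for wherever your two threads touch the file:

#include <pthread.h>

#define HIGH_WATER (2L * 1024 * 1024)   /* pause the transfer above 2 MB of backlog */
#define LOW_WATER  (1L * 1024 * 1024)   /* let it resume below 1 MB */

static pthread_mutex_t lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  drained = PTHREAD_COND_INITIALIZER;
static long unstreamed = 0;             /* bytes written to the file but not yet streamed */

/* File-transfer thread: call after writing a chunk; blocks while the backlog is high. */
void transfer_wrote(long nbytes)
{
    pthread_mutex_lock(&lock);
    unstreamed += nbytes;
    while (unstreamed > HIGH_WATER)
        pthread_cond_wait(&drained, &lock);
    pthread_mutex_unlock(&lock);
}

/* Streaming thread: call after streaming a chunk; wakes the transfer thread
   once the backlog has fallen below the low watermark. */
void streamer_consumed(long nbytes)
{
    pthread_mutex_lock(&lock);
    unstreamed -= nbytes;
    if (unstreamed < LOW_WATER)
        pthread_cond_signal(&drained);
    pthread_mutex_unlock(&lock);
}

Blocking on the condition variable also hands the CPU back to the streamer, which addresses the starvation concern without touching thread priorities.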
See if pthread_setschedparam() helps you balance out the threads' usage of the CPU
From the man page of pthread_setschedparam(), you can change the thread priorities:

int pthread_setschedparam(pthread_t thread, int policy,
                          const struct sched_param *param);

struct sched_param {
    int sched_priority;     /* Scheduling priority */
};

As can be seen, only one scheduling parameter is supported. For details of the permitted ranges for scheduling priorities in each scheduling policy, see sched_setscheduler(2).
Also,
the file transfer is taking too much CPU cycles for it's own
If you read this SO post, it seems to suggest that changing thread priorities may not help, because the reason the file transfer thread is consuming more CPU cycles is that it needs them. But in your case, you are fine with the file transfer being slowed down, since the streamer thread cannot keep up with it anyway! Hence I suggest you change the priorities and deprive the file transfer thread of some cycles even if it needs them.
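For illustration, a hedged sketch of doing that with pthread_setschedparam(); note that the real-time SCHED_FIFO policy used here typically requires elevated privileges on Linux, and under the default SCHED_OTHER policy the only permitted priority is 0, so treat this as a starting point rather than a drop-in fix:

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Give the streaming thread a higher priority than the file-transfer thread,
   so the scheduler favours it whenever both are runnable. */
void prioritise(pthread_t streamer, pthread_t transfer)
{
    struct sched_param hi = { .sched_priority = sched_get_priority_max(SCHED_FIFO) };
    struct sched_param lo = { .sched_priority = sched_get_priority_min(SCHED_FIFO) };
    int err;

    if ((err = pthread_setschedparam(streamer, SCHED_FIFO, &hi)) != 0)
        fprintf(stderr, "streamer: %s\n", strerror(err));
    if ((err = pthread_setschedparam(transfer, SCHED_FIFO, &lo)) != 0)
        fprintf(stderr, "transfer: %s\n", strerror(err));
}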
Say a thread on one core is spinning on a variable which will be updated by a thread running on another core. My question is: what is the overhead at the cache level? Will the waiting thread cache the variable and therefore not cause any traffic on the bus until the writing thread writes to that variable?
How can this overhead be reduced? Does the x86 pause instruction help?
I believe all modern x86 CPUs use the MESI protocol. So the spinning "reader" thread will likely have a cached copy of the data in either "exclusive" or "shared" mode, generating no memory bus traffic while you spin.
It is only when the other core writes to the location that it will have to perform cross-core communication.
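In code, that means the spin should be a plain read so the line stays in the reader's cache, with the pause hint inside the loop; a sketch using C11 atomics and the SSE2 intrinsic:

#include <stdatomic.h>
#include <emmintrin.h>                 /* _mm_pause */

static _Atomic int flag = 0;

/* Reader core: spins on its cached copy of the line; no bus traffic is
   generated until the writer invalidates it. */
void wait_for_flag(void)
{
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        _mm_pause();                   /* tell the CPU this is a spin-wait loop */
}

/* Writer core: this store is what finally forces the cross-core traffic. */
void set_flag(void)
{
    atomic_store_explicit(&flag, 1, memory_order_release);
}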
[update]
A "spinlock" like this is only a good idea if you will not be spinning for very long. If it may be a while before the variable gets updated, use a mutex + condition variable instead, which will put your thread to sleep so that it adds no overhead while it waits.
(Incidentally, I suspect a lot of people -- including me -- are wondering "what are you actually trying to do?")
If you spin lock for short intervals you are usually fine. However there is a timer interrupt on Linux (and I assume similar on other OSes) so if you spin lock for 10 ms or close to it you will see a cache disturbance.
I have heard it's possible to modify the Linux kernel to prevent all interrupts on specific cores so that this disturbance goes away, but I don't know what is involved in doing this.
In the case of two threads the overhead can probably be ignored; anyway, it could be a good idea to make a simple benchmark: for instance, if you implement spinlocks, measure how much time the thread spends in the spin.
This effect on the cache is called cache line bouncing.
I tested this extensively in this post. The overhead in general is incurred by the bus-locking component of the spinlock, usually the instruction "xchg reg,mem" or some variant of it. Since that particular overhead cannot be avoided you have the options of economizing on the frequency with which you invoke the spinlock and performing the absolute minimum amount of work necessary - once the lock is in place - before releasing it.
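As a concrete illustration of economizing on that bus-locked instruction, a test-and-test-and-set spinlock first spins on a plain cached read and only issues the atomic exchange (which compiles down to a locked xchg/cmpxchg) once the lock looks free. This is just a sketch, not the benchmark code from the linked post:

#include <stdatomic.h>
#include <emmintrin.h>                 /* _mm_pause */

typedef struct { _Atomic int locked; } spinlock_t;

void spin_lock(spinlock_t *l)
{
    for (;;) {
        /* First spin on a plain read: served from the local cache, no bus locking. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            _mm_pause();
        /* Only now pay for the locked exchange. */
        if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            return;
    }
}

void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}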
I'm using Pthreads to create a new thread for each partition after the list is split into the right and left halves (less than and greater than the pivot). I do this recursively until I reach the maximum number of allowed threads.
When I use printfs to follow what goes on in the program, I clearly see that each thread is doing its delegated work in parallel. However using a single process is always the fastest. As soon as I try to use more threads, the time it takes to finish almost doubles, and keeps increasing with number of threads.
I am allowed to use up to 16 processors on the server I am running it on.
The algorithm goes like this:
Split array into right and left by comparing the elements to the pivot.
Start a new thread for the right and left, and wait until the threads join back.
If there are more available threads, they can create more recursively.
Each thread waits for its children to join.
Everything makes sense to me, and sorting works perfectly well, but more threads makes it slow down immensely.
I tried setting a minimum number of elements per partition for a thread to be started (e.g. 50000).
I tried an approach where when a thread is done, it allows another thread to be started, which leads to hundreds of threads starting and finishing throughout. I think the overhead was way too much. So I got rid of that, and if a thread was done executing, no new thread was created. I got a little more speedup but still a lot slower than a single process.
The code I used is below.
http://pastebin.com/UaGsjcq2
Does anybody have any clue as to what I could be doing wrong?
Starting a thread has a fair amount of overhead. You'd probably be better off creating a threadpool with some fixed number of threads, along with a thread-safe queue to queue up jobs for the threads to do. The threads wait for an item in the queue, process that item, then wait for another item. If you want to do things really correctly, this should be a priority queue, with the ordering based on the size of the partition (so you always sort the smallest partitions first, to help keep the queue size from getting excessive).
This at least reduces the overhead of starting the threads quite a bit -- but that still doesn't guarantee you'll get better performance than a single-threaded version. In particular, a quick-sort involves little enough work on the CPU itself that it's probably almost completely bound by the bandwidth to memory. Processing more than one partition at a time may hurt cache locality to the point that you lose speed in any case.
First guess would be that creating, destroying, and especially syncing your threads is going to eat up any possible gain you might receive, depending on just how many elements you are sorting. I'd actually guess that it would take quite a long while to make up the overhead, and that it probably won't ever be made up.
Because of the way you have your sort, you have one thread waiting for another waiting for another... you aren't really getting all that much parallelism to begin with. You'd be better off using a more linear sort, perhaps something like a radix sort, that splits the data up further among the threads. That still has one thread waiting for others a lot, but at least the threads get to do more work in the meantime. But still, I don't think threads are going to help much even with this.
I just had a quick look at your code, and I have a remark: why are you using a lock?
If I understand correctly, what you are doing is something like:
quickSort(array)
{
    left, right = partition(array);
    newThread(quickSort(left));
    newThread(quickSort(right));
}
You shouldn't need a lock: normally each call to quicksort does not access the other part of the array, so no sharing is involved.
Unless each thread is running on a separate processor or core they will not truly run concurrently and the context switch time will be significant. The number of threads should be restricted to the number of available execution units, and even then you have to trust the OS will distribute them to separate processors/cores, which it may not do if they are also being used for other processes.
Also you should use a static thread pool rather than creating and destroying threads dynamically. Creating/destroying a thread includes allocating/releasing a stack from the heap, which is non-deterministic and potentially time-consuming.
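Here is a sketch of the first suggestion (capping the number of live threads at the number of execution units, and falling back to sequential recursion below a size cutoff); a proper static thread pool as suggested above would go further by reusing the threads. The cutoff values and helper names are illustrative:

#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS   16          /* number of available execution units */
#define SERIAL_CUTOFF 50000       /* below this size, threading isn't worth the overhead */

static _Atomic int active_threads = 1;        /* the initial thread counts as one */

typedef struct { int *a; int lo; int hi; } span_t;

static void quicksort_par(int *a, int lo, int hi);

static void *qsort_worker(void *arg)
{
    span_t *s = arg;
    quicksort_par(s->a, s->lo, s->hi);
    return NULL;
}

static int partition(int *a, int lo, int hi)  /* Lomuto partition */
{
    int pivot = a[hi], i = lo, j, t;
    for (j = lo; j < hi; j++)
        if (a[j] < pivot) { t = a[i]; a[i] = a[j]; a[j] = t; i++; }
    t = a[i]; a[i] = a[hi]; a[hi] = t;
    return i;
}

static void quicksort_par(int *a, int lo, int hi)
{
    if (lo >= hi)
        return;
    int p = partition(a, lo, hi);
    int want = atomic_load(&active_threads);

    /* Spawn a thread for the left half only if a core is free and the
       partition is big enough to pay for the thread-start overhead. */
    if (p - lo > SERIAL_CUTOFF && want < MAX_THREADS &&
        atomic_compare_exchange_strong(&active_threads, &want, want + 1)) {
        pthread_t tid;
        span_t left = { a, lo, p - 1 };
        pthread_create(&tid, NULL, qsort_worker, &left);
        quicksort_par(a, p + 1, hi);          /* this thread sorts the right half */
        pthread_join(tid, NULL);
        atomic_fetch_sub(&active_threads, 1);
    } else {
        quicksort_par(a, lo, p - 1);          /* no free slot: stay sequential */
        quicksort_par(a, p + 1, hi);
    }
}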
Finally are the 16 processors on the server real or VMs? And are they exclusively allocated to your process?