I have an int array[100] and I want 5 threads to calculate the sum of all array elements.
Each thread iterates through 20 elements within its dedicated range and writes the sum into a global sum variable.
Is a mutex necessary here? My assumption is that no synchronization is needed, since all threads are reading from independent parts of the array.
for (i = offset; i < offset + range; i++) {
    // not used: pthread_mutex_lock(&mutex);
    sum += array[i];
    // not used: pthread_mutex_unlock(&mutex);
}
Can this lead to unpredictable behavior or does the OS actually handle this?
Is it advisable to leave out the mutex in this case? I've noticed that those algorithms run a lot faster without it.
Yes, you need synchronization, because all threads are modifying sum at the same time. Here's an example:
You have an array of 4 elements [a1, a2, a3, a4], 2 threads t1 and t2, and a shared sum. Say t1 reads a1 and adds it to sum. That is not an atomic operation: t1 copies the current value of sum (which is 0) into its local space, call it t1_s, adds a1 to it, and then writes sum = t1_s. But at the same time t2 does the same thing: it reads the value of sum (still 0, because t1 has not completed its operation) into t2_s, adds a3, and writes it back to sum. So we end up with a3 in sum instead of a1 + a3. This is called a data race.
There are multiple solutions to this:
You can use a mutex, as you already did in your code, but as you mentioned it can be slow, since mutex locks are expensive and all the other threads have to wait on it.
Create an array (with one slot per thread) of local sums, let each thread accumulate into its own slot, and then do a final reduction over this array in a single thread. No synchronization needed.
Without the array, calculate a local sum_local in each thread and, at the very end, add each of these partial sums to the shared sum under a mutex. I would guess this is faster (though it needs to be measured).
However, as @gavinb mentioned, all of this only makes sense for a larger amount of data.
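A minimal sketch of the per-thread partial sum approach (the second option above), assuming the array[100] and 5 threads from the question; the names partial and worker are just illustrative:

#include <pthread.h>
#include <stdio.h>

#define N        100
#define NTHREADS 5

static int  array[N];
static long partial[NTHREADS];     /* one slot per thread, never shared */

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    int offset = id * (N / NTHREADS);
    long s = 0;

    for (int i = offset; i < offset + N / NTHREADS; i++)
        s += array[i];

    partial[id] = s;               /* each thread writes only its own slot */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    long sum = 0;

    for (int i = 0; i < N; i++)
        array[i] = i;

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(long)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    for (int i = 0; i < NTHREADS; i++) /* final reduction, single thread */
        sum += partial[i];

    printf("sum = %ld\n", sum);
    return 0;
}

No mutex is needed because each thread only ever writes its own element of partial, and the join guarantees all of those writes are visible before the reduction runs.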
I have an int array[100] and I want 5 threads to calculate the sum of all array elements. Each thread iterates through 20 elements within its dedicated range and writes the sum into a global sum variable.
First of all, it's worth pointing out that the overhead of this many threads processing this small amount of data would probably not be an advantage. There is a cost to creating threads, serialising access, and waiting for them to finish. With a dataset this small, a well-optimised sequential algorithm is probably faster. It would be an interesting exercise to measure the speedup with a varying number of threads.
Is a mutex necessary here? My assumption is that no synchronization is needed, since all threads are reading from independent parts of the array.
Yes - the reading of the array variable is independent, however updating the sum variable is not, so you would need a mutex to serialise access to sum, according to your description above.
However, this is a very inefficient way of calculating the sum, as each thread will be competing (and waiting, hence wasting time) for access to increment sum. If you calculate intermediate sums for each subset (as @Werkov also mentioned), then wait for them to complete and add the intermediate sums to create the final sum, there will be no contention reading or writing, so you wouldn't need a mutex and each thread could run as quickly as possible. The limiting factor on performance would then likely be memory access pattern and cache behaviour.
Can this lead to unpredictable behavior or does the OS actually handle this?
Yes, definitely. The OS will not handle this for you as it cannot predict how/when you will access different parts of memory, and for what reason. Shared data must be protected between threads whenever any one of them may be writing to the data. So you would almost certainly get the wrong result as threads trip over each other updating sum.
Is it advisable to leave out the mutex in this case? I've noticed that those algorithms run a lot faster without it.
No, definitely not. It might run faster, but it will almost certainly not give you the correct result!
A mutex can be avoided when it is possible to partition the data in such a way that there are no dependencies (i.e. reads/writes) across partitions. In your example there is a dependency on the sum variable, so a mutex is necessary. However, you can keep a partial-sum accumulator for each thread and then only add up these sub-results at the end, without needing a mutex.
Of course, you needn't do this by hand. There are various implementations of this; for instance, see OpenMP's parallel for and reduction clauses.
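For reference, a hedged sketch of the same sum written with OpenMP's reduction clause (the function name parallel_sum is just illustrative):

#include <omp.h>

long parallel_sum(const int *array, int n)
{
    long sum = 0;
    /* each thread accumulates into a private copy of sum;
       OpenMP combines the copies at the end of the loop */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += array[i];
    return sum;
}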
Let's say I have 'n' number of threads. All of these threads are accessing the same matrix, and they are doing some operation.
When a thread does its job (the job is moving to an adjacent location in the 2-D array), I can either lock the entire matrix, do the work, then unlock it and let the other threads do their jobs too; or I can lock just the adjacent locations, in this case the 8 cells including diagonals; or I can lock only the target cell the thread wants to move to.
I have implemented locking the entire matrix by calling pthread_mutex_lock(), doing the job, then unlocking. In this case I use only one mutex. It works, but I don't think I get the full benefit of multi-threading with this method.
For the second method, I don't know how to implement locking the 8 adjacent locations, or locking the target location a thread wants to move to. Should I use more than one mutex, for example an array of mutexes covering my whole grid? I.e. if my array is 10*10, I would need 100 mutexes, and each thread would lock 8 of them and release 8 of them whenever it wants to do its job. Or should I use another method? Also, I'm not sure that locking 8 mutexes will be atomic. Maybe I can use another mutex to guard the act of locking these 8 mutexes, and release it once the 8 are locked. But again, I'm not sure whether that will cause a deadlock.
The programming language is C.
Thanks in advance.
If you want a lock for each entry in a 2D array, you have 2 choices:
have a second 2D array containing the locks, so that myLocks[x][y] is the lock for the entry myArray[x][y].
create a structure containing a lock and a value and create a 2D array of those structures, so that myArray[x][y].lock is the lock for the value at myArray[x][y].value.
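As a rough sketch of the second choice (a structure per cell; the names cell_t and grid are illustrative, not from the question):

#include <pthread.h>

#define ROWS 10
#define COLS 10

typedef struct {
    pthread_mutex_t lock;    /* protects value below */
    int             value;
} cell_t;

static cell_t grid[ROWS][COLS];

static void grid_init(void)
{
    for (int y = 0; y < ROWS; y++)
        for (int x = 0; x < COLS; x++)
            pthread_mutex_init(&grid[y][x].lock, NULL);
}

static void cell_set(int y, int x, int v)
{
    pthread_mutex_lock(&grid[y][x].lock);
    grid[y][x].value = v;
    pthread_mutex_unlock(&grid[y][x].lock);
}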
To avoid deadlocks you'll need to acquire locks in a specific order and release the locks in the reverse order. The most logical order (at least for people that use the English language) would be "left to right, top to bottom" but any order is fine.
The problem is that it's very likely that you'll spend so much time acquiring and releasing locks (and so little time doing actual work in comparison) that it'd probably be faster to just use a single thread (and avoid the cost of acquiring and releasing locks).
You'd want to find a better compromise between the benefits of more threads and the cost of acquiring/releasing locks; like only having a lock for each row of the array (so you need to use 3 locks instead of 8 or 9), or only having a lock for each pair of rows (so you need to use 2 locks instead of 8 or 9).
Note that the design of the locks can (and should) depend on the order you do operations, and the order you do operations can (and should) depend on the design of the locks. For example; if you do have a lock for each row of the array, then it might make sense for each thread to do an entire row of the array (e.g. so that a thread would acquire 3 locks, then do the entire row, then release 3 locks).
Also note that it may be possible to do this without any locks at all. For example, if the array is 1000*1000 entries and you have 10 threads, then you can split the array into ten 1000*100 sub-arrays (one sub-array per thread) and let each thread do the top half of its sub-array; then make all threads wait until all other threads have finished the top halves of their sub-arrays before continuing; then let each thread do the bottom half of its sub-array.
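A hedged sketch of that phased approach using a pthread barrier; the split into halves and the work function are placeholders for whatever the real per-cell operation is:

#include <pthread.h>

#define NTHREADS 10

static pthread_barrier_t phase_barrier;

/* placeholder for the real per-cell work on a range of rows */
static void process_rows(int thread_id, int first_row, int last_row)
{
    (void)thread_id; (void)first_row; (void)last_row;
}

static void *phase_worker(void *arg)
{
    int id = (int)(long)arg;
    int rows_per_thread = 100;                 /* 1000 rows / 10 threads */
    int top = id * rows_per_thread;

    /* phase 1: top half of this thread's sub-array */
    process_rows(id, top, top + rows_per_thread / 2);

    /* wait until every thread has finished its top half */
    pthread_barrier_wait(&phase_barrier);

    /* phase 2: bottom half, safe now that all neighbours finished phase 1 */
    process_rows(id, top + rows_per_thread / 2, top + rows_per_thread);
    return NULL;
}

/* in main: pthread_barrier_init(&phase_barrier, NULL, NTHREADS);
   then create and join the NTHREADS worker threads as usual */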
When considering performance as the only factor, for extremely fast addition in a multithreaded context, is it better to use the GCC builtin sync / atomic operations to add to a single variable, or is it more performant to add to a single counter per thread?
For example, if I have 8 threads, where a total count of processed items must be incremented (at an extremely high rate), would it be better to have a single variable and increment it from each thread using the atomic operations, or would it be better to have 8 separate variables, one for each thread, and then aggregate the data from the 8 variables at some interval?
It would most likely be much faster for each thread to do its work separately and then aggregate it at the end. ADD instructions are some of the simplest in the instruction set and run very quickly (~1 clock cycle). The overhead to lock a mutex or similar would be larger than the actual computation. Perhaps more importantly, if it's not shared the counter can reside in a register instead of in main memory which is also significantly faster.
In general, it's both faster and easier to avoid sharing state unless you have to.
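As a minimal sketch of the two approaches being compared, assuming GCC's __sync builtins mentioned in the question (the counter names are illustrative):

#include <pthread.h>

#define NTHREADS 8

/* shared-counter version: every increment contends on one cache line */
static unsigned long shared_count;

static void count_shared(void)
{
    __sync_fetch_and_add(&shared_count, 1);
}

/* per-thread version: each worker owns its own counter, no contention */
static unsigned long per_thread_count[NTHREADS];

static void count_local(int thread_id)
{
    per_thread_count[thread_id]++;
}

static unsigned long aggregate(void)
{
    unsigned long total = 0;
    for (int i = 0; i < NTHREADS; i++)
        total += per_thread_count[i];
    return total;
}

In a real measurement you would probably also pad each per-thread counter to its own cache line (or keep it in a local variable inside the thread, as the answer suggests) so that neighbouring counters don't cause false sharing.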
Suppose I have an array of 1,000,000 elements, and a number of worker threads each manipulating data in this array. The worker threads might be updating already populated elements with new data, but each operation is limited to a single array element, and is independent of the values of any other element.
Using a single mutex to protect the entire array would clearly result in high contention. On the other extreme, I could create an array of mutexes that is the same length as the original array, and for each element array[i] I would lock mutex[i] while operating on it. Assuming an even distribution of data, this would mostly eliminate lock contention, at the cost of a lot of memory.
I think a more reasonable solution would be to have an array of n mutexes (where 1 < n < 1000000). Then for each element array[i] I would lock mutex[i % n] while operating on it. If n is sufficiently large, I can still minimize contention.
So my question is, is there a performance penalty to using a large (e.g. >= 1000000) number of mutexes in this manner, beyond increased memory usage? If so, how many mutexes can you reasonably use before you start to see degradation?
I'm sure the answer to this is somewhat platform specific; I'm using pthreads on Linux. I'm also working on setting up my own benchmarks, but the scale of data that I'm working on makes that time consuming, so some initial guidance would be appreciated.
That was the initial question. For those asking for more detailed information regarding the problem, I have four multi-GB binary data files describing somewhere in the neighborhood of half a billion events that are being analyzed. The array in question is actually the array of pointers backing a very large chained hash table. We read the four data files into the hash table, possibly aggregating them together if they share certain characteristics. The existing implementation has 4 threads, each reading one file and inserting records from that file into the hash table. The hash table has 997 locks and 997*9973 = ~10,000,000 pointers. When inserting an element with hash h, I first lock mutex[h % 997] before inserting or modifying the element in bucket[h % 9943081]. This works all right, and as far as I can tell, we haven't had too many issues with contention, but there is a performance bottleneck in that we're only using 4 cores of a 16 core machine. (And even fewer as we go along, since the files generally aren't all the same size.) Once all of the data has been read into memory, we analyze it, which uses new threads and a new locking strategy tuned to the different workload.
I'm attempting to improve the performance of the data load stage by switching to a thread pool. In the new model, I still have one thread for each file which simply reads the file in ~1MB chunks and passes each chunk to a worker thread in the pool to parse and insert. The performance gain so far has been minimal, and the profiling that I did seemed to indicate that the time spent locking and unlocking the array was the likely culprit. The locking is built into the hash table implementation we are using, but it does allow specifying the number of locks to use independently of the size of the table. I'm hoping to speed things up without changing the hash table implementation itself.
(A very partial & possibly indirect answer to your question.)
I once took a huge performance hit trying this (on CentOS), raising the number of locks from a prime of ~1K to a prime of ~1M. While I never fully understood the reason, I eventually figured out (or just convinced myself) that it's the wrong question.
Suppose you have an array of length M, with n workers. Furthermore, you use a hash function to protect the M elements with m < M locks (e.g., by some random grouping). Then, using the Square Approximation to the Birthday Paradox, the chance of a collision between two workers - p - is given by:
p ≈ n² / (2m)
It follows that the number of mutexes you need, m, does not depend on M at all - it is a function of p and n only.
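As an illustrative calculation (the figures here are assumptions, not from the question): solving for m gives m ≈ n² / (2p), so with n = 16 worker threads and a target collision probability of p = 1%, you would want roughly m ≈ 256 / 0.02 = 12800 locks, whether the array holds a million elements or a billion.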
Under Linux there is no cost other than the memory associated with more mutexes.
However, remember that the memory used by your mutexes must be included in your working set - and if your working set size exceeds the relevant cache size, you'll see a significant performance drop. This means that you don't want an excessively sized mutex array.
As Ami Tavory points out, the contention depends on the number of mutexes and number of threads, not the number of data elements protected - so there's no reason to link the number of mutexes to the number of data elements (with the obvious proviso that it never makes sense to have more mutexes than elements).
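A hedged sketch of the striping described in the question (mutex[i % n]); the element count and stripe count below are placeholder values:

#include <pthread.h>

#define NELEMENTS 1000000
#define NSTRIPES  1024          /* number of mutexes, chosen independently of NELEMENTS */

static int             array[NELEMENTS];
static pthread_mutex_t stripe[NSTRIPES];

static void stripes_init(void)
{
    for (int i = 0; i < NSTRIPES; i++)
        pthread_mutex_init(&stripe[i], NULL);
}

static void update_element(long i, int new_value)
{
    /* all elements with the same i % NSTRIPES share one lock */
    pthread_mutex_t *m = &stripe[i % NSTRIPES];
    pthread_mutex_lock(m);
    array[i] = new_value;
    pthread_mutex_unlock(m);
}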
In the general scenario, I would advise
Simply locking the whole array (simple, very often "good enough" if your application is mostly doing "other stuff" besides accessing the array)
... or ...
Implementing a read/write lock on the entire array (assuming reads equal or exceed writes)
Apparently your scenario doesn't match either case.
Q: Have you considered implementing some kind of a "write queue"?
Worst case, you'd only need one mutex. Best case, you might even be able to use a lock-less mechanism to manage your queue. Look here for some ideas that might be applicable: https://msdn.microsoft.com/en-us/library/windows/desktop/ee418650%28v=vs.85%29.aspx
I am working on a project with OpenMP. In this project I need to do a computation, in particular:
gap = (DEFAULT_HEIGHT / nthreads);
where DEFAULT_HEIGHT is a constant and nthreads is the number of threads inside my parallel region. My problem is that I can't compute the variable gap outside the parallel region, because I need to be inside it to know nthreads. On the other hand, I don't want to compute gap in every thread. Moreover, I can't write code like this:
if (tid == 0) {
    gap = (DEFAULT_HEIGHT / nthreads);
}
because I don't know the order of execution of the threads, so it could be that thread 0 runs last and all my other computations that need gap will be wrong (because it will not have been set yet). So, is there a way to do this computation only once without this problem?
Thanks
Ensure that gap is a shared variable and enclose it in an OpenMP single directive, something like
#pragma omp single
{
    gap = (DEFAULT_HEIGHT / nthreads);
}
Only one thread will execute the code enclosed in the single directive, the other threads will wait at the end of the enclosed block of code.
An alternative would be to make gap private and let all threads compute their own value. This might be faster; the single option requires some synchronisation, which always takes time. If you are concerned, try both and compare results. (I think this is what ComicSansMS is suggesting.)
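Put together, a minimal sketch of the single variant; DEFAULT_HEIGHT and its value here are illustrative:

#include <omp.h>
#include <stdio.h>

#define DEFAULT_HEIGHT 1080

int main(void)
{
    int gap = 0;   /* declared outside the parallel region, so shared by default */

    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();

        #pragma omp single
        {
            gap = DEFAULT_HEIGHT / nthreads;   /* executed by exactly one thread */
        }
        /* implicit barrier at the end of single: gap is set before any thread continues */

        printf("thread %d sees gap = %d\n", omp_get_thread_num(), gap);
    }
    return 0;
}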
Here's the tradeoff: If only one thread does the computation you have to synchronize access to that value. In particular also threads that only want to read the value do have to synchronize, as they usually have no other means of determining whether the write is already finished. If initialization of the variable is so expensive that it can compensate for this, you should go for it. But it's probably not.
Keep in mind that you can do a lot of computation on the CPU in the time it takes to fetch data from memory. Synchronizing this access properly will eat away additional cycles and can lead to undesired stalling effects. Even worse, the impact of such effects usually increases drastically with the number of threads sharing the resource.
It's not uncommon to accept some redundancy in parallel computation as the synchronization overhead easily nullifies any benefits from saved computation time for redundant data.
http://pastebin.com/YMS4ehRj
^ This is my implementation of parallel merge sort. Basically what I do is: for every split, the first half is handled by a new thread whereas the second half is done sequentially. I.e., say we have an array of 9 elements: [0..4] is handled by thread 1, [0..1] is handled by thread 2, [5..6] is handled by thread 3 (look at the source code for clarification).
Everything else stays the same, like merging. But the problem is, this runs much slower than sequential merge sort, even slower than normal bubble sort! And I mean for an array of 25000 ints. I'm not sure where the bottleneck is: is it the mutex locking? Is it the merging?
Any ideas on how to make this faster?
You are creating a large number of threads, each of which then only does very little work. To sort 25000 ints you create about 12500 threads that spawn other threads and merge their results, and about 12500 threads that only sort two ints each.
The overhead from creating all those threads far outweighs the gains you get from parallel processing.
To avoid this, make sure that each thread has a reasonable amount of work to do. For example, if one thread finds that it only has to sort <10000 numbers it can simply sort them itself with a normal merge sort, instead of spawning new threads.
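A hedged sketch of that cutoff idea with pthreads; the threshold of 10000 and the helper names are illustrative, and sequential_merge_sort / merge stand in for the routines already present in the pastebin code:

#include <pthread.h>

#define THRESHOLD 10000

/* assumed to exist already: an ordinary sequential merge sort and a merge step,
   both working on the inclusive range [lo, hi] */
void sequential_merge_sort(int *a, int lo, int hi);
void merge(int *a, int lo, int mid, int hi);

static void parallel_merge_sort(int *a, int lo, int hi);

struct job { int *a; int lo; int hi; };

static void *sort_thread(void *arg)
{
    struct job *j = arg;
    parallel_merge_sort(j->a, j->lo, j->hi);
    return NULL;
}

static void parallel_merge_sort(int *a, int lo, int hi)
{
    if (hi - lo < THRESHOLD) {       /* small range: not worth spawning a thread */
        sequential_merge_sort(a, lo, hi);
        return;
    }

    int mid = lo + (hi - lo) / 2;
    struct job left = { a, lo, mid };
    pthread_t t;

    pthread_create(&t, NULL, sort_thread, &left);  /* first half in a new thread */
    parallel_merge_sort(a, mid + 1, hi);           /* second half in this thread */
    pthread_join(t, NULL);                         /* wait before merging */

    merge(a, lo, mid, hi);
}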
Given you have a finite number of cores on your system, why would you want to create more threads than cores?
Also, it isn't clear why you need to have a mutex at all. As far as I can tell from a quick scan, the program doesn't need to share the threads[lthreadcnt] outside the local function. Just use a local variable and you should be golden.
Your parallelism is too fine-grained: there are too many threads, each doing only a small amount of work. You can define a threshold so that arrays smaller than the threshold are sorted sequentially. Be careful about the number of spawned threads; a good rule of thumb is that the number of threads should usually not be much larger than the number of cores.
Because much of your computation is in the merge function, another suggestion is to use a divide-and-conquer merge instead of a simple merge. The advantage is two-fold: the running time is smaller, and it is easy to spawn threads for running the merge in parallel. You can get an idea of how to implement a parallel merge here: http://drdobbs.com/high-performance-computing/229204454. They also have an article about Parallel Merge Sort which might be helpful for you: http://drdobbs.com/high-performance-computing/229400239