I am trying to parallelize radix sort using POSIX threads in C. The twist is that the radix sort needs to work on floating-point numbers. Currently the code runs sequentially, but I have no idea how to parallelize it. Can anyone help me with this? Any help is appreciated.
Radix sorts are pretty hard to parallelize efficiently on CPUs. There are two parts to a radix sort: the creation of the histogram and the bucket filling.
To create a histogram in parallel, you can fill local histograms in each thread and then perform a (tree-based) reduction of the local histograms to build a global one. This strategy scales well as long as the histograms are small relative to the data chunks processed by each thread. An alternative way to parallelize this step is to use atomic adds to fill a shared histogram directly. This last method scales poorly when threads' write accesses conflict (which often happens with small histograms and many threads). Note that in both solutions, the input array is evenly distributed between threads.
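As a rough illustration of the first strategy, here is a minimal pthreads sketch in C, assuming 8-bit digits (256 buckets) and assuming your float keys have already been mapped to order-preserving unsigned integers (the usual IEEE-754 trick: flip all bits of a negative value, flip only the sign bit of a positive one). The thread count and the linear reduction are simplifications:

    /* Per-thread local histograms, then a reduction into a global one.
       Keys are assumed to be floats already remapped to uint32_t. */
    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define RADIX 256
    #define NTHREADS 4

    typedef struct {
        const uint32_t *keys;   /* this thread's slice of the input */
        size_t n;
        int shift;              /* which 8-bit digit this pass examines */
        size_t local[RADIX];    /* private histogram: no synchronization */
    } histo_task_t;

    static void *histo_worker(void *arg)
    {
        histo_task_t *t = arg;
        for (size_t i = 0; i < t->n; ++i)
            t->local[(t->keys[i] >> t->shift) & 0xFF]++;
        return NULL;
    }

    /* Builds the global histogram for one digit of `keys`. The final
       merge is linear here; a tree-based reduction pays off with many
       threads. */
    static void parallel_histogram(const uint32_t *keys, size_t n,
                                   int shift, size_t global[RADIX])
    {
        pthread_t tid[NTHREADS];
        histo_task_t tasks[NTHREADS];
        size_t chunk = n / NTHREADS;

        for (int i = 0; i < NTHREADS; ++i) {
            tasks[i].keys = keys + (size_t)i * chunk;
            tasks[i].n = (i == NTHREADS - 1) ? n - (size_t)i * chunk : chunk;
            tasks[i].shift = shift;
            memset(tasks[i].local, 0, sizeof tasks[i].local);
            pthread_create(&tid[i], NULL, histo_worker, &tasks[i]);
        }
        memset(global, 0, RADIX * sizeof *global);
        for (int i = 0; i < NTHREADS; ++i) {
            pthread_join(tid[i], NULL);
            for (int b = 0; b < RADIX; ++b)
                global[b] += tasks[i].local[b];
        }
    }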
Regarding the bucket-filling part, one solution is to make use of atomic adds to fill the buckets: one atomic counter per bucket is needed so that each thread can push back items safely. This solution only scales when threads do not often access the same bucket (bucket conflicts). It is not great because the scalability of the algorithm is strongly dependent on the content of the input array (sequential in the worst case). There are solutions that reduce conflicts between threads (better scalability) at the expense of more work (slower with few threads). One is to fill the buckets from both sides: threads with an even ID fill the buckets in ascending order while threads with an odd ID fill them in descending order. Note that it is important to take false sharing into account to maximize performance.
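To make the atomic-counter idea concrete, here is a C11 sketch of the scatter step. It assumes each cursor[b] has been set beforehand (e.g. with atomic_store) to bucket b's start offset, taken from an exclusive prefix sum of the global histogram, so each fetch-and-add claims exactly one output slot:

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    #define RADIX 256

    /* One atomic cursor per bucket; initialize each cursor[b] to the
       bucket's start offset before running the pass. */
    atomic_size_t cursor[RADIX];

    static void scatter_slice(const uint32_t *keys, size_t n,
                              int shift, uint32_t *out)
    {
        for (size_t i = 0; i < n; ++i) {
            unsigned d = (keys[i] >> shift) & 0xFF;
            size_t slot = atomic_fetch_add(&cursor[d], 1); /* claim a slot */
            out[slot] = keys[i];
        }
    }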
A simple way to parallelize radix sort for all but the first pass is to use a most significant digit (MSD) pass to split the array into bins, each of which can then be sorted concurrently. This approach relies on a somewhat uniform distribution of values, at least in terms of the most significant digit, so that the bins are reasonably equal in size.
For example, using a digit size of 8 bits (base 256), use an MSD pass to split the array into 256 bins. Assuming there are t threads, sort t bins at a time, each using least-significant-digit-first (LSD) radix sort.
For larger arrays, it may help to use a larger initial digit size to split up the array into a larger number of bins, with the goal of getting t bins to fit in cache.
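Here is a rough C sketch of that strategy for 32-bit keys: a sequential MSD pass over the top 8 bits builds the bins, then the bins are sorted concurrently. lsd_radix_sort is a hypothetical placeholder for your sequential LSD sort of the low 24 bits, and one thread per bin is spawned purely for brevity (a real implementation would use a pool of t threads):

    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>

    #define RADIX 256

    extern void lsd_radix_sort(uint32_t *a, size_t n); /* hypothetical helper */

    typedef struct { uint32_t *bin; size_t n; } bin_task_t;

    static void *sort_bin(void *arg)
    {
        bin_task_t *t = arg;
        lsd_radix_sort(t->bin, t->n);   /* assumed to handle n == 0 */
        return NULL;
    }

    void msd_then_parallel_lsd(const uint32_t *keys, uint32_t *out, size_t n)
    {
        size_t count[RADIX] = {0}, start[RADIX], cur[RADIX];

        /* MSD histogram over the top 8 bits, then exclusive prefix sum. */
        for (size_t i = 0; i < n; ++i)
            count[keys[i] >> 24]++;
        for (size_t b = 0, sum = 0; b < RADIX; ++b) {
            start[b] = cur[b] = sum;
            sum += count[b];
        }

        /* Stable scatter into the 256 bins. */
        for (size_t i = 0; i < n; ++i)
            out[cur[keys[i] >> 24]++] = keys[i];

        /* Sort each bin concurrently. */
        pthread_t tid[RADIX];
        bin_task_t tasks[RADIX];
        for (int b = 0; b < RADIX; ++b) {
            tasks[b].bin = out + start[b];
            tasks[b].n = count[b];
            pthread_create(&tid[b], NULL, sort_bin, &tasks[b]);
        }
        for (int b = 0; b < RADIX; ++b)
            pthread_join(tid[b], NULL);
    }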
Link to a non-parallelized radix sort that uses MSD for first pass, then LSD for next 3 passes. The loop at the end of RadixSort() to sort the 256 bins could be parallelized:
Radix Sort Optimization
For the first pass, you could use the parallel method in Jerome Richard's answer, but depending on the data pattern, it may not help much, due to cache and memory conflicts.
How can I implement version 7 of the code given in the following link:
http://www.cuvilib.com/Reduction.pdf
for an input array whose size is an arbitrary number, in other words, not a power of 2?
Version 7 already handles an arbitrary number of elements.
Perhaps instead of referring to the cuvilib link, you should look at the relevant NVIDIA CUDA reduction sample. It includes essentially the PDF file you are using, but also sample codes that implement reductions 1 through 7 (labelled reduce0 through reduce6).
If you study the description of reduction #7 in the document, you'll see that the initial reduction steps are handled via a while loop that causes the grid to loop through memory. As it loops through memory, each thread accumulates multiple reduction elements.
This initial while loop is not limited to a particular size of problem (e.g. power of 2).
Due to the initial handling of the reduction via this while loop, later steps can be done as a super-efficient power of 2 at the threadblock level, as has been previously discussed in that document. But the initial input set size is not limited to a power of 2.
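To illustrate the idea (this is a simplified sketch, not the NVIDIA sample code itself), the kernel below uses the same grid-stride while loop to handle an arbitrary n, then finishes with a power-of-2 tree reduction per threadblock. It assumes a power-of-2 block size and a launch with blockDim.x * sizeof(float) bytes of dynamic shared memory:

    __global__ void reduce_sketch(const float *in, float *out, unsigned int n)
    {
        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int gridSize = blockDim.x * gridDim.x;

        /* Grid-stride loop: each thread accumulates as many elements
           as needed, so n can be any size, not just a power of 2. */
        float sum = 0.0f;
        while (i < n) {
            sum += in[i];
            i += gridSize;
        }
        sdata[tid] = sum;
        __syncthreads();

        /* The remaining tree reduction operates on blockDim.x partial
           sums, which we choose to be a power of 2. */
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = sdata[0];  /* one partial result per block */
    }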
Please study the code given in the CUDA sample (reduce6).
I have a parallelized algorithm that can output a random number from 1 to 1000.
My objective is to compute, for N executions of the algorithm, how many times each number is chosen.
So, for instance, each of 100 threads performs N/100 executions of the algorithm, and the final result is an array of 1000 ints holding the occurrences of each number.
Is there a way to parallelize this intelligently? For instance, if I only use one global array I will have to lock it every time I want to write to it, which will make my algorithm run almost as if there were no parallelization. On the other hand, I can't just make one array of 1000 numbers per thread, only to have each be 1% filled and merge them at the end.
This appears to be histogramming. If you want to do it quickly, use a library such as CUB or Thrust.
For cases where there are a small number of bins, one approach is to have each thread operate on its own set of bins, for a segment of the input. Then do a parallel reduction on each bin. If you are clever about the storage organization of your bins, the parallel reduction amounts to summation of matrix columns:
Bins:        1   2   3   4   ...   1000
Thread   1
Thread   2
Thread   3
       ...
Thread 100
In the above example, each thread takes a segment of the input, and operates on one row of the partial sums matrix.
When all threads are finished with their segments, then sum the columns of the matrix, which can be done very efficiently and quickly with a simple for-loop kernel.
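A minimal CUDA sketch of that scheme follows (names and launch parameters are illustrative). The first kernel gives each thread its own row of a num_threads x num_bins partial-count matrix, assuming the matrix is zero-initialized (e.g. with cudaMemset) and the input length is exactly num_threads * n_per_thread; the second is the simple column-summing for-loop kernel:

    /* Each thread tallies its own segment into a private row: no atomics. */
    __global__ void histo_partial(const int *data, int n_per_thread,
                                  int num_threads, int *partial, int num_bins)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= num_threads) return;
        const int *seg = data + (size_t)t * n_per_thread;
        int *row = partial + (size_t)t * num_bins;
        for (int i = 0; i < n_per_thread; ++i)
            row[seg[i] - 1] += 1;           /* values are 1..num_bins */
    }

    /* One thread per bin sums a column of the partial-count matrix. */
    __global__ void histo_sum_columns(const int *partial, int num_threads,
                                      int *totals, int num_bins)
    {
        int b = blockIdx.x * blockDim.x + threadIdx.x;
        if (b >= num_bins) return;
        int sum = 0;
        for (int t = 0; t < num_threads; ++t)
            sum += partial[(size_t)t * num_bins + b];
        totals[b] = sum;
    }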
There are a couple of things you can do.
If you want to be as portable as possible, you could have one lock for each index.
If this is being run on a Windows system, I would suggest InterlockedIncrement.
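For example (a sketch assuming values in 1..1000), each counter can be bumped from any thread without a lock:

    #include <windows.h>

    volatile LONG counts[1000];          /* one counter per possible value */

    void record(int value)               /* value in 1..1000 */
    {
        InterlockedIncrement(&counts[value - 1]);
    }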
There is a quite large file (>10 GB) on disk, and each line inside the file is composed of a line number and a person's name, like this:
1 Jane
2 Perk
3 Sime
4 Perk
.. ..
I have to read this large file, and find the frequency of each name, finally output the results in descending order of each name's frequency, like this:
Perk 2
Jane 1
Sime 1
As the interviewer requested, the above job should be done as efficiently as possible, and multithreading is allowed. And my solution is something like this:
1. Because the file is too large, I partition it into several small chunks, each about 100 MB; via lseek I can locate the beginning and the end of each chunk (beg, end).
2. For these chunks, there is a shared hash map using a person's name as key and how many times it has appeared so far as value.
3. For each chunk, a single thread goes through it; every time the thread encounters a person's name, it increments the corresponding value in the shared hash map.
4. When all threads finish, I think it's time to sort the hash map according to the value field.
But because there might be too many names in that file, the sorting would be slow. I haven't come up with a good idea about how to output the names in descending order.
I hope someone can help me with the above problem and give me a better solution for how to do the job via multithreading, including the sorting.
Using a map-reduce approach could be a good idea for your problem. That approach would consist of two steps:
Map: read chunks of data from the file and create a thread to process that data
Reduce: the main thread waits for all other threads to finish and then it combines the results from each individual thread.
The advantage of this solution is that you would not need locking between the threads, since each one of them would operate on a different chunk of data. Using a shared data structure, as you are proposing, could be a solution too, but you may have some overhead due to contention for locking.
You need to do the sorting part at the reduce step, when the data from all the threads is available. But you might want to do some work during the map step, so that it is easier (quicker) to finish the complete sort at the reduce step.
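Here is a minimal pthreads sketch of that structure. For brevity the per-thread "hash map" is just 26 buckets keyed by a name's first letter; a real solution would use a proper string hash map per thread, but the threading pattern (thread-local tallies in the map step, lock-free merge in the reduce step) is the same:

    #include <ctype.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NBUCKETS 26

    typedef struct {
        const char **names;    /* this thread's slice of the input */
        int count;
        long local[NBUCKETS];  /* thread-local tallies: no locking */
    } task_t;

    static void *map_worker(void *arg)
    {
        task_t *t = arg;
        for (int i = 0; i < t->count; ++i) {
            int b = tolower((unsigned char)t->names[i][0]) - 'a';
            if (b >= 0 && b < NBUCKETS)
                t->local[b]++;
        }
        return NULL;
    }

    int main(void)
    {
        const char *names[] = {"Jane", "Perk", "Sime", "Perk"};
        int n = 4;
        pthread_t tid[NTHREADS];
        task_t tasks[NTHREADS] = {{0}};

        /* Map: give each thread a chunk and let it tally locally. */
        int per = (n + NTHREADS - 1) / NTHREADS;
        for (int i = 0; i < NTHREADS; ++i) {
            int left = n - i * per;
            tasks[i].names = names + i * per;
            tasks[i].count = left > 0 ? (left < per ? left : per) : 0;
            pthread_create(&tid[i], NULL, map_worker, &tasks[i]);
        }

        /* Reduce: merge the local tallies once each thread is done. */
        long total[NBUCKETS] = {0};
        for (int i = 0; i < NTHREADS; ++i) {
            pthread_join(tid[i], NULL);
            for (int b = 0; b < NBUCKETS; ++b)
                total[b] += tasks[i].local[b];
        }
        for (int b = 0; b < NBUCKETS; ++b)
            if (total[b])
                printf("%c: %ld\n", 'a' + b, total[b]);
        return 0;
    }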
If you prefer to avoid the sequential sorting at the end, you could use some custom data structure. I would use a map (something like a red-black tree or a hash table) for quickly finding a name. Moreover, I would use a heap in order to keep the order of frequencies among names. Of course, you would need to have parallel versions of those data structures. Depending on how coarse the parallelization is, you may have locking contention problems or not.
If I asked that as an interview question using the word "efficiently", I would expect an answer something like cut -f 2 -d ' ' < file | sort | uniq -c | sort -rn, because efficiency is most often about not wasting time solving an already-solved problem. Actually, this is a good idea; I'll add something like this to our interview questions.
Your bottleneck will be the disk, so any kind of multithreading is over-designing the solution (which would also count against "efficiency"). Splitting your reads like this will either make things slower on rotating disks, or at least make the buffer cache more confused and less likely to kick in a drop-behind algorithm. Bad idea, don't do it.
I don't think multithreading is a good idea. The "slow" part of the program is reading from disk, and multithreading the disk reads won't make them faster. It will only make the program much more complex: for each chunk you have to find the first "full" line, for example, you have to coordinate the various threads, and you have to lock the shared hash map on each access. You could instead work with "local" hash maps and merge them at the end: when all the threads finish (at the end of the 10 GB), the partial hash maps are merged. That way you don't need to synchronize access to a shared map.
I think sorting the resulting hash map will be the easiest part, if the full hash map can be kept in memory :-) You simply copy it into a malloc'd block of memory and qsort it by the counter.
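For instance, here is a self-contained sketch of that final step, sorting (name, count) pairs by count in descending order with qsort:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct { const char *name; long count; } entry_t;

    static int by_count_desc(const void *a, const void *b)
    {
        const entry_t *x = a, *y = b;
        if (x->count != y->count)         /* bigger counts first */
            return (y->count > x->count) - (y->count < x->count);
        return strcmp(x->name, y->name);  /* tie-break alphabetically */
    }

    int main(void)
    {
        entry_t e[] = {{"Jane", 1}, {"Perk", 2}, {"Sime", 1}};
        qsort(e, sizeof e / sizeof e[0], sizeof e[0], by_count_desc);
        for (size_t i = 0; i < sizeof e / sizeof e[0]; ++i)
            printf("%s %ld\n", e[i].name, e[i].count);
        return 0;
    }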
Your steps (2) and (4) make the solution essentially sequential (the second introduces locking to keep the hash map consistent, and the last one attempts to sort all the data in one place).
One-shot sorting of the hash map at the end is a little strange: you should use an incremental sorting technique, like heapsort (locking of the data structure required) or mergesort (sort parts of the "histogram" file, but avoid merging everything "in one main thread at the end"; try to create a sorting network and mix the contents of the output file at each step of the sorting).
Multi-threaded reads might be an issue, but with modern SSD drives and aggressive read caching multi-threading is not the main slowdown factor. It's all about synchronizing the results sorting process.
Here's a sample of mergesort's parallelization: http://dzmitryhuba.blogspot.com/2010/10/parallel-merge-sort.html
Once again, as I said, a sorting network might help to allow an efficient parallel sort, rather than the straightforward "wait for all subthreads and sort their results" approach. Maybe a bitonic sort, in case you have a lot of processors.
The interviewer's original question states "...and multithreading is allowed". The phrasing of this question might be a little ambiguous, however the spirit of the question is obvious: the interviewer is asking the candidate to write a program to solve the problem, and to analyse/justify the use (or not) of multithreading within the proposed solution. It is a simple question to test the candidate's ability to think around a large-scale problem and explain algorithmic choices they make, making sure the candidate hasn't just regurgitated something from an internet website without understanding it.
Given this, this particular interview question can be efficiently solved in O(n log n) (asymptotically speaking) whether multithreading is used or not, and multi-threading can additionally be used to logarithmically accelerate the actual execution time.
Solution Overview
If you were asked the OP's question by a top-flight company, the following approach would show that you really understood the problem and the issues involved. Here we propose a two stage approach:
The file is first partitioned and read into memory.
A special version of Merge Sort is used on the partitions that simultaneously tallies the frequency of each name as the file is being sorted.
As an example, let us consider a file with 32 names, each one letter long, and each with an initial frequency count of one. The above strategy can be visualised as follows:
1. File: ARBIKJLOSNUITDBSCPBNJDTLGMGHQMRH 32 Names
2. A|R|B|I|K|J|L|O|S|N|U|I|T|D|B|S|C|P|B|N|J|D|T|L|G|M|G|H|Q|M|R|H 32 Partitions
1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1 with counts
3. AR BI JK LO NS IU DT BS CP BN DJ LT GM GH MQ HR Merge #1
11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 and tally
4. ABRI JKLO INSU BDST BCNP DJLT GHM HMQR Merge #2
1111 1111 1111 1111 1111 1111 211 1111 and tally
5. ABIJKLOR BDINSTU BCDJLNPT GHMQR Merge #3
11111111 1111211 11111111 22211 and tally
6. ABDIJKLNORSTU BCDGHJLMNPQRT Merge #4
1212111111211 1112211211111 and tally
7. ABCDGHIJKLMNOPQRSTU Merge #5
1312222212221112221 and tally
So, if we read the final list in memory from start to finish, it yields the sorted list:
A|B|C|D|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U
-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
1|3|1|2|2|2|2|2|1|2|2|2|1|1|1|2|2|2|1 = 32 Name instances (== original file).
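The heart of this approach is a merge step that tallies as it merges: when the same name sits at the front of both runs, output it once and sum its counts. Here is a self-contained C sketch of that step (the names and the entry_t type are illustrative):

    #include <string.h>

    typedef struct { const char *name; long count; } entry_t;

    /* Merge two runs that are each sorted by name, summing counts when
       a name occurs in both. Returns the merged length, which shrinks
       as duplicate names are tallied together. */
    static int merge_tally(const entry_t *a, int na,
                           const entry_t *b, int nb, entry_t *out)
    {
        int i = 0, j = 0, k = 0;
        while (i < na && j < nb) {
            int c = strcmp(a[i].name, b[j].name);
            if (c < 0)       out[k++] = a[i++];
            else if (c > 0)  out[k++] = b[j++];
            else {                     /* same name: tally, don't duplicate */
                out[k] = a[i++];
                out[k++].count += b[j++].count;
            }
        }
        while (i < na) out[k++] = a[i++];
        while (j < nb) out[k++] = b[j++];
        return k;
    }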
Why the Solution is Efficient
Whether a hash table is used (as the original poster suggested) or not, and whether multithreading is used or not, this problem cannot be solved more efficiently than O(n log n), because a sort must be performed. Given this restriction, there are two strategies that can be employed:
Read data from disk, use hash table to manage name/frequency totals, then sort the hash table contents (original poster's suggested method)
Read data from disk, initialise each name with its frequency total from the file, then merge sort the names simultaneously summing all the totals for each name (this solution).
Solution (1) requires the hash table to be sorted after all data has been read in. Solution (2) performs its frequency tallying as it is sorting, thus the overhead of the hash table has been removed. Without considering multithreading at all, we can already see that even with the most efficient hash table implementation for Solution (1), Solution (2) is already more efficient as it doesn't have the overhead of the hash table at all.
Constraints on Multithreading
In both Solution (1) and Solution (2), assuming the most efficient hash table implementation ever devised is being used for Solution (1), both algorithms perform the same asymptotically in O(n log n); it's simply that the ordering of their operations is slightly different. However, while multithreading Solution (1) actually slows its execution down, multithreading Solution (2) will gain substantial improvements in speed. How is this possible?
If we multithread Solution (1), either in the reading from disk or in the sort afterwards, we hit a problem of contention on the hash table as all threads try to access the hash table simultaneously. Especially for writing to the table, this contention could cripple the execution time of Solution (1) so much so that running it without multithreading would actually give a faster execution time.
For multithreading to give execution-time speedups, it is necessary to make sure that each block of work that each thread performs is independent of every other thread. This allows all threads to run at maximum speed with no contention on shared resources and to complete the job much faster. Solution (2) does exactly this: it removes the hash table altogether and employs Merge Sort, a divide-and-conquer algorithm that allows a problem to be broken into sub-problems that are independent of each other.
Multithreading and Partitioning to Further Improve Execution Times
In order to multithread the merge sort, the file can be divided into partitions and a new thread created to merge each consecutive pair of partitions. As names in the file are variable length, the file must be scanned serially from start to finish in order to be able to do the partitioning; random access on the file cannot be used. However, as any solution must scan the file contents at least once anyway, allowing only serial access to the file still yields an optimal solution.
What kind of speed-up in execution times can be expected from multithreading Solution (2)? The analysis of this algorithm is quite tricky given its simplicity, and has been the subject of various white papers. However, splitting the file into n partitions will allow the program to execute (n / log(n)) times quicker than on a single CPU with no partitioning of the file. Simply put, if a single processor takes 1 hour to process a 640GB file, then splitting the file into 64 10GB chunks and executing on a machine with 32 CPUs will allow the program to complete in around 6 minutes, a 10-fold increase (ignoring disk overheads).
I've implemented both a sequential version and a parallel version of quicksort.
To verify the speedup, I used quicksort's worst case for my implementation: the source array is already sorted, and in my case the pivot is always the first element of the array.
So, the partition generates two sets: one containing the elements less than the pivot, and another with the elements greater than the pivot, the latter having n - 1 elements, where n is the number of elements of the array passed as the argument to the quicksort function. The recursion depth is N - 1, where N is the number of elements of the original array passed to the first quicksort call.
Note: the sets are actually represented by two variables containing the initial and the final positions of the array part that corresponds to the elements smaller than the pivot and to the elements greater than the pivot. The whole division happens in place, meaning no new array is created in the process. The difference in the parallel version is that more than one array is created, and the elements are divided equally between them (each sorted as in the sequential case). To join the elements in the parallel case, a merge algorithm was used.
The speedup obtained was higher than the theoretical maximum: with two threads the speedup achieved was more than 2x compared to the sequential version (3x, to be more precise), and with 4 threads the speedup was 10x.
The computer where I ran the threads is a 4-core machine (Phenom II X4) running Ubuntu Linux 10.04, 64-bit if I am not wrong. The compiler is gcc 4.4, and no flags were passed to the compiler except for linking the pthread library for the parallel implementation.
So, does someone know the reason for the superlinear speedup achieved? Can someone give me some pointers, please?
It would really be best to use a performance analyzer to dig into this in more detail, but my first guess is that this kind of superlinear speedup is caused by the fact that you get more cache space when you add threads. That way, more data is read from cache. Since the cost of a memory transfer is really high, this can easily improve performance.
Are you using Amdahl's law to evaluate your maximum speedup ?
Hope this helps.
If you see a 3x speedup with two threads versus one, and a 10x speedup with four threads versus one, something fishy is going on.
Amdahl's law states that speedup is 1/(1 - P + P/S), where P is the portion of the algorithm which is parallel and S is the speedup factor of the parallel portion. Assuming that S = 4 for four cores (the best possible result), a 10x speedup means 1 - P + P/4 = 1/10, which solves to P = 1.2. That is impossible (P has to be between 0 and 1, inclusive).
Put another way, if you could get a 10x improvement with 4 cores, then you could simply use one core to simulate four cores and still get a 2.5x improvement (or thereabouts).
Put yet another way, four cores over one second perform fewer operations than one core over ten seconds. So the parallel version is actually performing fewer operations, and if that is the case there is no reason why the serial version couldn't also perform fewer operations.
Possible conclusions:
Something could be wrong with your sequential version. Perhaps it is optimized poorly.
Something could be wrong with your parallel versions. They could be incorrect.
The measurement could be performed incorrectly. This is quite common.
Impossible conclusions:
This algorithm scales superlinearly.