MPI QuickSort Program - c

I am a newbie trying to edit a program. I have an MPI program that divides an array into subsets; the master sends the subsets to the slaves, they do a quicksort, and then they return the sorted numbers to the master so it can write them to a file.
What I am trying to do is make the quicksort happen even quicker. My idea is to have the master divide the array and send subsets to the slaves while keeping one for himself. Then each of them divides its subset again into new subsets (for example, if we have the numbers from 1 to 100 in the array, the new subsets should be 1 to 25, 26 to 50, 51 to 75 and 76 to 100), keeps the first subset (1 to 25) for himself, sends the second (26 to 50) to the first slave, the third one (51 to 75) to the second slave, and so on. The slaves should do the same. Then each process performs a quicksort and the slaves return the sorted numbers to the master. I am hoping that this way the sort will be faster. The problem is that, as I said, I am a newbie and I need help with ideas, advice and even code so I can achieve my goal.

For this answer I am going to stick with the assumption that this should be done with Quicksort, and that the data is read on a single process. Just keep in mind that there are many sophisticated parallel sorting techniques.
Your idea of separating the numbers by value ranges is problematic, because it makes assumptions about the distribution of the data. For non-uniformly distributed data sets it won't even help to know the minimum and maximum. It is better to simply send an equal number of elements to each process, let them sort, and afterwards merge the data.
For the merge you start with ntasks sorted sub-lists and want to end up with a single one. A naive merge would repeatedly look for the minimal element in each sub-list, remove it and append it to the final list. This needs ntasks * N comparisons, N swaps and 2 * N memory. You can reduce the comparisons to log2(ntasks) * N by doing an actual merge sort, but that also needs log2(ntasks) * N swaps. You can further refine that by keeping the sub-lists (or pointers to their first elements) in a priority queue, which should give you log2(ntasks) * N comparisons and N swaps.
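To make the naive variant concrete, here is a minimal sketch in plain C (no MPI; the three sub-lists in main are just example data):

#include <stdio.h>
#include <stdlib.h>

/* Naive k-way merge: repeatedly pick the smallest head element among the
   ntasks sorted sub-lists, i.e. roughly ntasks * N comparisons in total. */
void kway_merge(double **lists, int *lens, int ntasks, double *out, int total)
{
    int *pos = calloc(ntasks, sizeof(int));   /* read position in each sub-list */
    for (int k = 0; k < total; k++) {
        int best = -1;
        for (int t = 0; t < ntasks; t++) {
            if (pos[t] < lens[t] &&
                (best < 0 || lists[t][pos[t]] < lists[best][pos[best]]))
                best = t;
        }
        out[k] = lists[best][pos[best]++];
    }
    free(pos);
}

int main(void)
{
    double a[] = {1, 4, 9}, b[] = {2, 3, 8}, c[] = {5, 6, 7};
    double *lists[] = {a, b, c};
    int lens[] = {3, 3, 3};
    double out[9];
    kway_merge(lists, lens, 3, out, 9);
    for (int i = 0; i < 9; i++)
        printf("%g ", out[i]);
    printf("\n");
    return 0;
}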
About the usage of MPI:
Do not use MPI_Isend & MPI_Wait right after each other; in that case use MPI_Send instead. Use the immediate variants only if you can actually do something useful between the MPI_Isend and the MPI_Wait.
Use collective operations whenever possible. To distribute data from the root to all slaves, use MPI_Scatter or MPI_Scatterv. The first requires all ranks to receive the same number of elements, which can also be achieved by padding. To collect data from the slaves on the master, use MPI_Gather or MPI_Gatherv.[1] Collectives are easier to get right, because they describe the high-level operation, and their implementation is usually highly optimized.
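As a minimal sketch of that scatter/sort/gather pattern (the element count N, the random input and the use of qsort for the local sort are all illustrative):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, ntasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    const int N = 1000;                               /* total element count */
    int *counts  = malloc(ntasks * sizeof(int));
    int *offsets = malloc(ntasks * sizeof(int));
    for (int i = 0, off = 0; i < ntasks; i++) {       /* nearly equal chunks */
        counts[i]  = N / ntasks + (i < N % ntasks ? 1 : 0);
        offsets[i] = off;
        off += counts[i];
    }

    int *data = NULL;
    if (rank == 0) {                                  /* only the root holds all data */
        data = malloc(N * sizeof(int));
        for (int i = 0; i < N; i++)
            data[i] = rand();
    }

    int *local = malloc(counts[rank] * sizeof(int));
    MPI_Scatterv(data, counts, offsets, MPI_INT,
                 local, counts[rank], MPI_INT, 0, MPI_COMM_WORLD);

    qsort(local, counts[rank], sizeof(int), cmp_int); /* local sort */

    MPI_Gatherv(local, counts[rank], MPI_INT,
                data, counts, offsets, MPI_INT, 0, MPI_COMM_WORLD);
    /* rank 0 now holds ntasks sorted runs in data[] and still has to merge them */

    MPI_Finalize();
    return 0;
}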
To receive an unknown-size message, you can also send the message directly and use MPI_Probe at the receiver side to determine the size. You are even allowed to MPI_Recv with a buffer that is larger than the sent buffer, if you know an upper bound.
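A minimal sketch of that probe-then-receive pattern (the tag, the double payload and the random message size are illustrative):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        int n = 1 + rand() % 100;                     /* size unknown to the receiver */
        double *buf = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++)
            buf[i] = i;
        MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Status status;
        int count;
        MPI_Probe(1, 0, MPI_COMM_WORLD, &status);     /* wait for the message */
        MPI_Get_count(&status, MPI_DOUBLE, &count);   /* ask how big it is */
        double *buf = malloc(count * sizeof(double));
        MPI_Recv(buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d doubles\n", count);
    }

    MPI_Finalize();
    return 0;
}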
[1] You could also consider the merge step as a reduction and parallelize the necessary computation for that.

In principle your solution looks very good. I don't completely understand whether you intend to process the larger files in chunks or as a whole. From my experience I suggest that you assign blocks that are as large as possible to the slaves. This way the rather expensive message-passing operations are executed as rarely as possible.
What I cannot understand in your question is what the overall goal of your program is. Is it your intention to sort the complete input files in parallel? If this is the case you will need some sort of merge sort to be applied to the results you receive from the individual processes.

Performance implications of a large number of mutexes

Suppose I have an array of 1,000,000 elements, and a number of worker threads each manipulating data in this array. The worker threads might be updating already populated elements with new data, but each operation is limited to a single array element, and is independent of the values of any other element.
Using a single mutex to protect the entire array would clearly result in high contention. On the other extreme, I could create an array of mutexes that is the same length as the original array, and for each element array[i] I would lock mutex[i] while operating on it. Assuming an even distribution of data, this would mostly eliminate lock contention, at the cost of a lot of memory.
I think a more reasonable solution would be to have an array of n mutexes (where 1 < n < 1000000). Then for each element array[i] I would lock mutex[i % n] while operating on it. If n is sufficiently large, I can still minimize contention.
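For illustration, this is roughly what I have in mind (a minimal pthreads sketch; the array type, sizes and names are made up):

#include <pthread.h>
#include <stdlib.h>

#define ARRAY_LEN   1000000
#define NUM_MUTEXES 4096                 /* n: much smaller than the array */

static double array[ARRAY_LEN];
static pthread_mutex_t locks[NUM_MUTEXES];

void init_locks(void)
{
    for (int i = 0; i < NUM_MUTEXES; i++)
        pthread_mutex_init(&locks[i], NULL);
}

/* element i is protected by stripe lock i % n */
void update_element(size_t i, double value)
{
    pthread_mutex_lock(&locks[i % NUM_MUTEXES]);
    array[i] = value;
    pthread_mutex_unlock(&locks[i % NUM_MUTEXES]);
}

int main(void)
{
    init_locks();
    update_element(12345, 3.14);         /* worker threads would call this concurrently */
    return 0;
}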
So my question is, is there a performance penalty to using a large (e.g. >= 1000000) number of mutexes in this manner, beyond increased memory usage? If so, how many mutexes can you reasonably use before you start to see degradation?
I'm sure the answer to this is somewhat platform specific; I'm using pthreads on Linux. I'm also working on setting up my own benchmarks, but the scale of data that I'm working on makes that time consuming, so some initial guidance would be appreciated.
That was the initial question. For those asking for more detailed information regarding the problem: I have four multi-gigabyte binary data files describing somewhere in the neighborhood of half a billion events that are being analyzed. The array in question is actually the array of pointers backing a very large chained hash table. We read the four data files into the hash table, possibly aggregating them together if they share certain characteristics. The existing implementation has 4 threads, each reading one file and inserting records from that file into the hash table. The hash table has 997 locks and 997*9973 = ~10,000,000 pointers. When inserting an element with hash h, I first lock mutex[h % 997] before inserting or modifying the element in bucket[h % 9943081]. This works all right, and as far as I can tell, we haven't had too many issues with contention, but there is a performance bottleneck in that we're only using 4 cores of a 16-core machine. (And even fewer as we go along, since the files generally aren't all the same size.) Once all of the data has been read into memory, we analyze it, which uses new threads and a new locking strategy tuned to the different workload.
I'm attempting to improve the performance of the data load stage by switching to a thread pool. In the new model, I still have one thread for each file which simply reads the file in ~1MB chunks and passes each chunk to a worker thread in the pool to parse and insert. The performance gain so far has been minimal, and the profiling that I did seemed to indicate that the time spent locking and unlocking the array was the likely culprit. The locking is built into the hash table implementation we are using, but it does allow specifying the number of locks to use independently of the size of the table. I'm hoping to speed things up without changing the hash table implementation itself.
(A very partial & possibly indirect answer to your question.)
I once took a huge performance hit trying this (on CentOS), raising the number of locks from a prime of ~1K to a prime of ~1M. While I never fully understood the reason, I eventually figured out (or just convinced myself) that it's the wrong question.
Suppose you have an array of length M, with n workers. Furthermore, you use a hash function to protect the M elements with m < M locks (e.g., by some random grouping). Then, using the Square Approximation to the Birthday Paradox, the chance of a collision between two workers - p - is given by:
p ≈ n² / (2m)
It follows that the number of mutexes you need, m, does not depend on M at all - it is a function of p and n only.
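To make that concrete with illustrative numbers: with n = 16 worker threads and a target collision probability of p = 1%, you'd want roughly m ≈ n² / (2p) = 256 / 0.02 = 12,800 mutexes, regardless of whether the array holds one million or one billion elements.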
Under Linux there is no cost other than the memory associated with more mutexes.
However, remember that the memory used by your mutexes must be included in your working set - and if your working set size exceeds the relevant cache size, you'll see a significant performance drop. This means that you don't want an excessively sized mutex array.
As Ami Tavory points out, the contention depends on the number of mutexes and number of threads, not the number of data elements protected - so there's no reason to link the number of mutexes to the number of data elements (with the obvious proviso that it never makes sense to have more mutexes than elements).
In the general scenario, I would advise
Simply locking the whole array (simple, very often "good enough" if your application is mostly doing "other stuff" besides accessing the array)
... or ...
Implementing a read/write lock on the entire array (assuming reads equal or exceed writes)
Apparently your scenario doesn't match either case.
Q: Have you considered implementing some kind of a "write queue"?
Worst case, you'd only need one mutex. Best case, you might even be able to use a lock-less mechanism to manage your queue. Look here for some ideas that might be applicable: https://msdn.microsoft.com/en-us/library/windows/desktop/ee418650%28v=vs.85%29.aspx
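As a minimal sketch of the one-mutex write queue idea (pthreads; the queue capacity and the update struct are made up), with a single consumer thread applying the queued updates so the data array itself needs no locks:

#include <pthread.h>
#include <stdlib.h>

#define QUEUE_CAP 4096

struct update { size_t index; double value; };

static struct update queue[QUEUE_CAP];
static size_t q_head, q_tail;            /* consumer / producer positions */
static pthread_mutex_t q_lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  q_nonfull  = PTHREAD_COND_INITIALIZER;

void enqueue_update(size_t index, double value)
{
    pthread_mutex_lock(&q_lock);
    while (q_tail - q_head == QUEUE_CAP)              /* wait while the queue is full */
        pthread_cond_wait(&q_nonfull, &q_lock);
    queue[q_tail % QUEUE_CAP].index = index;
    queue[q_tail % QUEUE_CAP].value = value;
    q_tail++;
    pthread_cond_signal(&q_nonempty);
    pthread_mutex_unlock(&q_lock);
}

struct update dequeue_update(void)
{
    pthread_mutex_lock(&q_lock);
    while (q_head == q_tail)                          /* wait while the queue is empty */
        pthread_cond_wait(&q_nonempty, &q_lock);
    struct update u = queue[q_head % QUEUE_CAP];
    q_head++;
    pthread_cond_signal(&q_nonfull);
    pthread_mutex_unlock(&q_lock);
    return u;
}

The consumer thread would simply loop on dequeue_update() and perform array[u.index] = u.value, so only the queue itself is ever contended.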

CUDA threads appending variable amounts of data to common array

My application takes millions of input records, each 8 bytes, and hashes each one into two or more output bins. That is, each input key K creates a small number of pairs (B1,K), (B2,K), ... The number of output bins per key is not known until the key is processed. It's usually 2 but could occasionally be 10 or more.
All those output pairs need to be eventually stored in one array since all the keys in each bin will later be processed together. How to do this efficiently?
Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow. Another obvious method would be to init a hash table as an array of pointers to some sort of storage per bin. That looks slower.
I'm thinking of pre-reserving 2 pairs per input record in a block shared array, then grabbing more space as needed (i.e., a reimplementation of the STL vector reserve operation), then having the last thread in each block copying the block shared array to global memory.
However I'm not looking forward to implementing that. Help? Thanks.
"Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow."
You could reserve bins of a global array instead of one entry at a time. In other words, you could have a large array where each thread starts with 10 possible output entries. If the thread overflows, it requests the next available bin from the global array. If you're worried about slow speed with the single atomic counter, you could use 10 atomic counters for 10 portions of the array and distribute the accesses. If one gets full, find another one.
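A rough device-side sketch of that bin-grabbing idea (CUDA; BIN_SIZE, the pair struct and the dummy two-outputs-per-key loop are illustrative, and the host-side allocation of out and next_bin is omitted):

#define BIN_SIZE 10                               /* output slots reserved per grab */

struct pair { unsigned int bin; unsigned long long key; };

/* Each thread reserves BIN_SIZE output slots at a time from the global counter
   and only touches the atomic again when its current bin fills up. */
__global__ void hash_keys(const unsigned long long *keys, int n,
                          struct pair *out, unsigned int *next_bin)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    unsigned int my_bin = atomicAdd(next_bin, 1u);
    int used = 0;

    for (int p = 0; p < 2; p++) {                 /* stand-in: two output pairs per key */
        if (used == BIN_SIZE) {                   /* bin full: grab a fresh one */
            my_bin = atomicAdd(next_bin, 1u);
            used = 0;
        }
        struct pair pr;
        pr.bin = (unsigned int)p;
        pr.key = keys[i];
        out[my_bin * BIN_SIZE + used] = pr;
        used++;
    }
}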
"I'm also considering processing the data twice: the 1st time just to determine the number of output records for each input record. Then allocate just enough space and finally process all the data again."
This is another valid method. The bottleneck is calculating the offset of each thread into the global array once you have the total number of results for each thread. I haven't figured out a reasonable parallel way to do that.
The last option I can think of would be to allocate a large array, distribute it based on blocks, and use a shared atomic int per block (which would help with slow global atomics). If you run out of space, mark that the block didn't finish, and mark where it left off. On your next iteration, complete the work that hasn't been finished.
The downside, of course, of the distributed portions of global memory is, like talonmies said, that you need a gather or compaction step to make the results dense.
Good luck!

Parallel Demonstration Program

An assignment that I've just now completed requires me to create a set of scripts that can configure random Ubuntu machines as nodes in an MPI computing cluster. This has all been done and the nodes can communicate with one another properly. However, I would now like to demonstrate the efficiency of said MPI cluster by throwing a parallel program at it. I'm just looking for a straight brute force calculation that can divide up work among the number of processes (=nodes) available: if one node takes 10 seconds to run the program, 4 nodes should only take about 2.5.
With that in mind I looked for prime calculation programs written in C. For any purists, the program is not actually part of my assignment, as the course I'm taking is purely systems management. I just need anything that will show that my cluster is working. I have some programming experience but little in C and none with MPI. I've found quite a few sample programs but none of them seem to actually run in parallel. They do distribute all the steps among my nodes, so if one node has a faster processor the overall time will go down, but adding additional nodes does nothing to speed up the calculation.
Am I doing something wrong? Are the programs that I've found simply not parallel? Do I need to learn C programming for MPI to write my own program? Are there any other parallel MPI programs that I can use to demonstrate my cluster at work?
EDIT
Thanks to the answers below I've managed to get several MPI scripts working, including the sum of the first N natural numbers (which isn't very useful as it runs into data type limits), the counting and generating of prime numbers, and the Monte Carlo calculation of Pi. Interestingly, only the prime number programs realise a (sometimes dramatic) performance gain with multiple nodes/processes.
The issue that caused most of my initial problems with getting scripts working was rather obscure and apparently due to issues with hosts files on the nodes. Running mpiexec with the -disable-hostname-propagation parameter solved this problem, which may manifest itself in a variety of ways: MPI(R) barrier errors, TCP connect errors and other generic connection failures. I believe it may be necessary for all nodes in the cluster to know one another by hostname, which is not really an issue in classic Beowulf clusters that have DHCP/DNS running on the server node.
The usual proof of concept application in parallel programming is simple raytracing.
That being said, I don't think that raytracing is a good example to show off the power of OpenMPI. I'd put the emphasis on scatter/gather or even better scatter/reduce, because that's where MPI gets the true power :)
The most basic example of that would be calculating the sum over the first N integers. You'll need a master thread that puts the value ranges to sum over into an array and scatters these ranges over the workers.
Then you'll need to do a reduction and check your result against the explicit formula, to get a free validation test.
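A minimal sketch of that sum example (N is illustrative; the master fills an array of [start, end] ranges, scatters them, and a reduction collects the partial sums):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, ntasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    const long long N = 1000000;                 /* sum 1..N */
    long long *ranges = NULL;                    /* one [start, end] pair per rank */
    if (rank == 0) {
        ranges = malloc(2 * ntasks * sizeof(long long));
        long long chunk = N / ntasks, start = 1;
        for (int i = 0; i < ntasks; i++) {
            ranges[2 * i]     = start;
            ranges[2 * i + 1] = (i == ntasks - 1) ? N : start + chunk - 1;
            start = ranges[2 * i + 1] + 1;
        }
    }

    long long my_range[2];
    MPI_Scatter(ranges, 2, MPI_LONG_LONG, my_range, 2, MPI_LONG_LONG,
                0, MPI_COMM_WORLD);

    long long partial = 0;
    for (long long k = my_range[0]; k <= my_range[1]; k++)
        partial += k;

    long long total = 0;
    MPI_Reduce(&partial, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %lld, formula says %lld\n", total, N * (N + 1) / 2);

    MPI_Finalize();
    return 0;
}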
If you're looking for a weaker spot of MPI, a parallel grep might work, where IO is the bottleneck.
EDIT
You'll have to keep in mind that MPI is based on a shared-nothing architecture where the nodes communicate using messages, and that the number of nodes is fixed. These two factors set a very tight frame for the programs that run on it. To make a long story short, this kind of parallelism is great for data-parallel applications, but sucks for task-parallel applications, because you can usually distribute data better than tasks if the number of nodes changes.
Also, MPI has no concept of implicit work-stealing. If a node is finished working, it just sits around waiting for the other nodes to finish. That means you'll have to figure out weakest-link handling yourself.
MPI is very customizable when it comes to performance details; there are numerous different variants of MPI_SEND, for example. That leaves much room for performance tweaking, which is important for high performance computing, for which MPI was designed, but mostly confuses "ordinary" programmers, leading to programs that actually get slower when run in parallel. Maybe your examples just suck :)
And on the scaleup / speedup problem, well...
I suggest that you read into Amdahl's Law, and you'll see that it's impossible to get linear speedup by just adding more nodes :)
I hope that helped. If you still have questions, feel free to drop a comment :)
EDIT2
Maybe the best scaling problem that integrates perfectly with MPI is the empirical estimation of Pi.
Imagine a quarter circle with radius 1 inside a square with sides of length 1; you can then estimate Pi by firing random points into the square and counting how many land inside the quarter circle.
Note: this is equal to generating tuples (x, y) with x, y in [0, 1] and measuring how many of these have x² + y² <= 1.
Pi is then roughly equal to
4 * Points in Circle / total Points
In MPI you'd just have to gather the ratios generated from all threads, which is very little overhead and thus gives a perfect proof of concept problem for your cluster.
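A minimal MPI sketch of that Pi estimate (the sample count and the use of rand() are illustrative; each rank counts its hits and one reduction combines them):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, ntasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    const long samples_per_rank = 1000000;
    srand(42 + rank);                            /* different random stream per rank */

    long hits = 0;
    for (long i = 0; i < samples_per_rank; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)                /* point falls inside the quarter circle */
            hits++;
    }

    long total_hits = 0;
    MPI_Reduce(&hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi is roughly %f\n",
               4.0 * total_hits / ((double)samples_per_rank * ntasks));

    MPI_Finalize();
    return 0;
}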
Like with any other computing paradigm, there are certain well established patterns in use with distributed memory programming. One such pattern is the "bag of jobs" or "controller/worker" (previously known as "master/slave", but now the name is considered politically incorrect). It is best suited for your case because:
under the right conditions it scales with the number of workers;
it is easy to implement;
it has built-in load balancing.
The basic premises are very simple. The "controller" process has a big table/queue of jobs and practically executes one big loop (possibly an infinite one). It listens for messages from "worker" processes and responds back. In the simplest case workers send only two types of messages: job requests or computed results. Consequently, the controller process sends two types of messages: job descriptions or termination requests.
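A bare-bones skeleton of that loop (the job payload is just an integer here; the tags and the printf standing in for real work are illustrative):

#include <mpi.h>
#include <stdio.h>

#define TAG_REQUEST 1
#define TAG_JOB     2
#define TAG_STOP    3

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, ntasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    const int njobs = 100;                       /* illustrative job count */

    if (rank == 0) {                             /* controller */
        int next_job = 0, stopped = 0;
        while (stopped < ntasks - 1) {
            int dummy;
            MPI_Status st;
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            if (next_job < njobs) {              /* hand out the next job description */
                MPI_Send(&next_job, 1, MPI_INT, st.MPI_SOURCE, TAG_JOB, MPI_COMM_WORLD);
                next_job++;
            } else {                             /* no work left: send a termination request */
                MPI_Send(&next_job, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                stopped++;
            }
        }
    } else {                                     /* worker */
        for (;;) {
            int request = 0, job;
            MPI_Status st;
            MPI_Send(&request, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&job, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            printf("rank %d processed job %d\n", rank, job);  /* stand-in for real work */
        }
    }

    MPI_Finalize();
    return 0;
}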
And the canonical non-trivial example of this pattern is colouring the Mandelbrot set. Computing each pixel of the final image is done completely independently of the other pixels, so it scales very well even on clusters with slow, high-latency network interconnects (e.g. GigE). In the extreme case each worker could compute a single pixel, but that would result in very high communication overhead, so it is better to split the image into small rectangles. One can find many ready-made MPI codes that colour the Mandelbrot set. For example this code uses row decomposition, i.e. a single job item is to fill one row of the final image. If the number of MPI processes is big, one would have to have fairly large image dimensions, otherwise the load won't balance well enough.
MPI also has mechanisms that allow spawning additional processes or attaching externally started jobs in client/server fashion. Implementing them is not rocket science, but still requires some understanding of advanced MPI concepts like intercommunicators, so I would skip that for now.

Efficiently update an identical array on all tasks with MPI

I'd like to improve the efficiency of a code which includes updates to every value of an array which is identical on all processors run with MPI. The basic structure I have now is to memcpy chunks of the data into a local array on each processor, operate on those, and Allgatherv (have to use "v" because the size of local blocks isn't strictly identical).
In C this would look something like:
/* counts gives the parallelization, counts[RANK] is the local memory size */
/* offsets gives the index in the global array to the local processors */
/* copy this rank's own chunk (starting at offsets[RANK]) into local memory */
memcpy(&local_memory[0], &total_vector[offsets[RANK]], counts[RANK] * sizeof(double));
/* operate on the local chunk */
for (i = 0; i < counts[RANK]; i++)
    local_memory[i] = new_value;
/* gather every rank's chunk back into the full vector on all ranks */
MPI_Allgatherv(&local_memory[0], counts[RANK], MPI_DOUBLE,
               &total_vector[0], counts, offsets, MPI_DOUBLE, MPI_COMM_WORLD);
As it turns out, this isn't very efficient. In fact, it's really freaking slow, so slow that for most system sizes I'm interested in, the parallelization doesn't lead to any increase in speed.
I suppose an alternative to this would be to update just the local chunks of the global vector on each processor and then broadcast the correct chunk of memory from the correct task to all other tasks. While this avoids the explicit memory handling, the communication cost of the broadcast has to be pretty high. It's effectively all-to-all.
EDIT: I just went and tried this solution, where you have to loop over the number of tasks and execute that number of broadcast statements. This method is even worse.
Anyone have a better solution?
The algorithm you describe is "all to all." Each rank updates part of a larger array, and all ranks must sync that array from time to time.
If the updates happen at controlled points in the program flow, a Gather/Scatter pattern might be beneficial. All ranks send their update to "rank 0", and rank 0 sends the updated array to everyone else. Depending on the array size, number of ranks, interconnect between each rank, etc....this pattern may offer less overhead than the Allgatherv.
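A sketch of that gather-then-broadcast variant, reusing the names from the question (total_size, the length of the full vector, is assumed here):

/* collect every rank's chunk on rank 0 ... */
MPI_Gatherv(&local_memory[0], counts[RANK], MPI_DOUBLE,
            &total_vector[0], counts, offsets, MPI_DOUBLE, 0, MPI_COMM_WORLD);
/* ... then ship the assembled vector back out to everyone */
MPI_Bcast(&total_vector[0], total_size, MPI_DOUBLE, 0, MPI_COMM_WORLD);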

Parallel Merge Sort with threads /much/ slower than Seq. Merge Sort. Help

http://pastebin.com/YMS4ehRj
^ This is my implementation of parallel merge sort. Basically what I do is: for every split, the first half is handled by a new thread whereas the second half is handled sequentially. For example, say we have an array of 9 elements: [0..4] is handled by thread 1, [0..1] is handled by thread 2, [5..6] is handled by thread 3 (look at the source code for clarification).
Everything else, like the merging, stays the same. But the problem is, this runs much slower than sequential merge sort, even slower than normal bubble sort! And I mean for an array of 25000 ints. I'm not sure where the bottleneck is: is it the mutex locking? Is it the merging?
Any ideas on how to make this faster?
You are creating a large number of threads, each of which then only does very little work. To sort 25000 ints you create about 12500 threads that spawn other threads and merge their results, and about 12500 threads that only sort two ints each.
The overhead from creating all those threads far outweighs the gains you get from parallel processing.
To avoid this, make sure that each thread has a reasonable amount of work to do. For example, if one thread finds that it only has to sort <10000 numbers it can simply sort them itself with a normal merge sort, instead of spawning new threads.
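A minimal sketch of that cutoff (pthreads; the threshold value is illustrative and the sequential fall-back simply uses qsort for brevity):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define THRESHOLD 10000                    /* below this size, don't spawn threads */

struct job { int *a; int n; };

static int cmp_int(const void *x, const void *y)
{
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

static void merge(int *a, int n, int mid)  /* merge the sorted halves a[0..mid) and a[mid..n) */
{
    int *tmp = malloc(n * sizeof(int));
    int i = 0, j = mid, k = 0;
    while (i < mid && j < n) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid)          tmp[k++] = a[i++];
    while (j < n)            tmp[k++] = a[j++];
    memcpy(a, tmp, n * sizeof(int));
    free(tmp);
}

static void *parallel_sort(void *arg)
{
    struct job *job = arg;
    if (job->n <= THRESHOLD) {             /* small chunk: sort it sequentially */
        qsort(job->a, job->n, sizeof(int), cmp_int);
        return NULL;
    }
    int mid = job->n / 2;
    struct job left  = { job->a, mid };
    struct job right = { job->a + mid, job->n - mid };
    pthread_t t;
    pthread_create(&t, NULL, parallel_sort, &left);   /* first half in a new thread */
    parallel_sort(&right);                            /* second half in this thread */
    pthread_join(t, NULL);
    merge(job->a, job->n, mid);
    return NULL;
}

int main(void)
{
    int n = 25000;
    int *a = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++)
        a[i] = rand();
    struct job root = { a, n };
    parallel_sort(&root);
    for (int i = 1; i < n; i++)
        if (a[i - 1] > a[i]) { printf("not sorted!\n"); return 1; }
    printf("sorted %d ints\n", n);
    free(a);
    return 0;
}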
Given you have a finite number of cores on your system, why would you want to create more threads than cores?
Also, it isn't clear why you need to have a mutex at all. As far as I can tell from a quick scan, the program doesn't need to share the threads[lthreadcnt] outside the local function. Just use a local variable and you should be golden.
Your parallelism is too fine-grained; there are too many threads doing only a small amount of work each. You can define a threshold so that arrays smaller than the threshold are sorted sequentially. Be careful about the number of spawned threads; a good indication is that the number of threads is usually not much bigger than the number of cores.
Because much of your computation is in merge function, another suggestion is using Divide-and-Conquer Merge instead of simple merge. The advantage is two-fold: the running time is smaller and it is easy to spawn threads for running parallel merging. You can get the idea of how to implement parallel merge here: http://drdobbs.com/high-performance-computing/229204454. They also have an article about Parallel Merge Sort which might be helpful for you: http://drdobbs.com/high-performance-computing/229400239
