MPI_Allreduce Source - C

I am writing code that involves a for loop that makes calculations at each index.
The smallest of these calculations is stored in a variable and I use MPI_Allreduce at the end of the program to determine the global minimum across all processes.
However, I need a way of knowing which process has the smallest value, i.e. can MPI_Allreduce tell me which process the result comes from (the process with the smallest value)? There is some additional data I need to get from that process.
Thanks in advance for any help!

You can use the MPI_MINLOC operator in the reduce operation to receive the rank of the process with the minimal value (more specifically, the lowest ranked process that has a minimal value).
See http://www.netlib.org/utk/papers/mpi-book/node114.html#SECTION005103000000000000000
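For illustration, here is a minimal sketch of that pattern in C. The value/rank struct layout is the standard pairing for MPI_DOUBLE_INT; the function name and the my_smallest parameter are placeholders for your loop's result:

#include <mpi.h>

/* `my_smallest` stands in for the minimum your loop computed locally. */
double find_global_min(double my_smallest, MPI_Comm comm, int *owner_rank)
{
    struct { double value; int rank; } local, global;
    local.value = my_smallest;
    MPI_Comm_rank(comm, &local.rank);
    /* MPI_DOUBLE_INT matches this value/rank pair; MPI_MINLOC keeps the
       smallest value and the (lowest) rank that holds it. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MINLOC, comm);
    *owner_rank = global.rank;   /* every process now knows the owner */
    return global.value;
}

Since every rank learns global.rank, the additional data can then be broadcast from that rank with MPI_Bcast.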

MPI QuickSort Program

I am a newbie trying to edit a program. I have an MPI program that divides an array into subsets; the master sends the subsets to the slaves, they do a quicksort, and then they return the sorted numbers to the master so it can write them to a file.
What I am trying to do is make the quicksort happen even quicker. My idea is to have the master divide the array and send subsets to the slaves while keeping one for himself, then divide them again into new subsets (for example, if we have numbers from 1 to 100 in the array, the new subsets should be from 1 to 25, 26 to 50, 51 to 75 and 76 to 100), keep the first subset (1 to 25) for himself, send the second (26 to 50) to the first slave, the third one (51 to 75) to the second slave, and so on. The slaves should do the same. Then each should perform a quicksort and the slaves should return the sorted numbers to the master. I am hoping that this way the sort will be faster. The problem is that, as I said, I am a newbie and I need help with ideas, advice, and even code so I can achieve my goal.
For this answer I am going to stick with the assumption that this should be done with Quicksort, and that the data is read on a single process. Just keep in mind that there are many sophisticated parallel sorting techniques.
Your idea of separating the numbers by value ranges is problematic, because it makes assumptions about the shape of the data. For non-uniformly distributed data sets it won't even help to know the minimum and maximum. It is better to simply send an equal number of elements to each process, let them sort, and afterwards merge the data.
For the merge you start with ntasks sorted sub-lists and want to end up with a single one. A naive merge would repeatedly look for the minimal element in each sub-list, remove that and append it to the final list. This needs ntasks * N comparisons, N swaps and N * 2 memory. You can optimize the comparisons to log2(ntasks) * N by doing an actual merge sort, but that also needs log2(ntasks) * N swaps. You can further refine that by keeping the sub-lists (or pointers to their first element) in a priority queue, which should give you log2(ntasks) * N comparisons and N swaps.
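For reference, here is a minimal sketch of the naive merge described above; the contiguous sub-list layout and all names are assumptions for illustration:

#include <stdlib.h>

/* Naive merge: repeatedly scan every sub-list for the smallest head
   element, costing ntasks comparisons per output element. */
void naive_merge(int **sublist, const int *len, int ntasks, int *out)
{
    int *pos = calloc(ntasks, sizeof *pos);   /* read cursor per sub-list */
    long total = 0;
    for (int i = 0; i < ntasks; i++)
        total += len[i];
    for (long k = 0; k < total; k++) {
        int best = -1;
        for (int i = 0; i < ntasks; i++)
            if (pos[i] < len[i] &&
                (best < 0 || sublist[i][pos[i]] < sublist[best][pos[best]]))
                best = i;
        out[k] = sublist[best][pos[best]++];
    }
    free(pos);
}

Swapping the inner scan for a binary heap over the sub-list heads is what brings the comparisons down to log2(ntasks) * N.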
About the usage of MPI:
Do not use MPI_Isend & MPI_Wait right after each other. In this case use MPI_Send instead. Use the immediate variants only if you can actually do something useful between the MPI_Isend and MPI_Wait.
Use collective operations whenever possible. To distribute data from the root to all slaves, use MPI_Scatter or MPI_Scatterv. The first requires all ranks to receive the same number of elements, which can also be achieved by padding. To collect data from the slaves on the master, use MPI_Gather or MPI_Gatherv.1 Collectives are easier to get right, because they describe the high-level operation, and their implementation is usually highly optimized.
To receive an unknown-size message, you can also send the message directly and use MPI_Probe at the receiver side to determine the size. You are even allowed to MPI_Recv with a buffer that is larger than the sent buffer, if you know an upper bound.
1 You could also consider the merge step as a reduction and parallelize the necessary computation for that.
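Putting these pieces together, a distribute/sort/collect skeleton might look like the following sketch; it assumes n is divisible by the number of ranks (the padding case mentioned above), and the function name is just for illustration:

#include <mpi.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

/* root scatters `data` (length n, only significant on root), every rank
   sorts its chunk locally, and root gathers the sorted runs back. */
void scatter_sort_gather(int *data, int n, int root, MPI_Comm comm)
{
    int rank, ntasks;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &ntasks);

    int chunk = n / ntasks;          /* assumes n % ntasks == 0 (pad otherwise) */
    int *local = malloc(chunk * sizeof(int));

    MPI_Scatter(data, chunk, MPI_INT, local, chunk, MPI_INT, root, comm);
    qsort(local, chunk, sizeof(int), cmp_int);
    MPI_Gather(local, chunk, MPI_INT, data, chunk, MPI_INT, root, comm);
    /* `data` on root now holds ntasks sorted runs of length `chunk`,
       ready for the merge step discussed above. */
    free(local);
}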
In principle your solution looks very good. I don't completely understand whether you intend to process the larger files in chunks or as a whole. From my experience I suggest that you assign blocks to the slaves that are as large as possible; this way the rather expensive message-passing operations are executed only rarely.
What I cannot understand in your question is what the overall goal of your program is. Is it your intention to sort the complete input files in parallel? If so, you will need some sort of merge sort applied to the results you receive from the individual processes.

Usage of an atomic integer in shared data

I was studying OS and synchronization, and I got an idea about dealing with this shared data without synchronizing, but I am not sure if it will work. Here is the code.
Now, the race condition is obviously the increment and decrement of shared data. But what if the integer variable was atomic? I think I read something about this when I was just a beginner in CS, so the question might not be perfect. As far as I remember, it was blocking something to prevent the increment and decrement from happening at the same time. Now, I am a bit confused about this, because if atomic variables really worked, there would not be any need for synchronization methods for simple code like this.
Note: The code is removed since it just changed the focus of people, and the answer provides enough info.
As it stands, the code is indeed not safe to call concurrently, so there must be some kind of synchronization that prevents this.
Now, concerning the idea to make num_processes atomic: that could work. It wouldn't be a simple substitution though; in particular, comparing against the max and incrementing must be done atomically, not in two steps, otherwise you still have a race condition. Specifically, the following interleaving must be prevented:
Thread A checks if the limit is reached, which it isn't.
Thread B checks if the limit is reached, which it isn't.
Thread B increments the PID counter.
Thread A increments the PID counter.
Each step in and of itself is atomic, but obviously that didn't help prevent a PID overflow. Instead, the code must check that the counter is not at the limit and increment it atomically. This is a common task (compare and increment), so you should easily find existing code examples.
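A minimal sketch of such a compare-and-increment using C11 atomics; MAX_PROCESSES and the function name are hypothetical:

#include <stdatomic.h>

#define MAX_PROCESSES 4096   /* hypothetical PID limit */

atomic_int num_processes;

/* Atomically "check limit, then increment" as one step.
   Returns 1 if a slot was acquired, 0 if the limit was reached. */
int try_acquire_pid_slot(void)
{
    int cur = atomic_load(&num_processes);
    while (cur < MAX_PROCESSES) {
        /* Succeeds only if num_processes still equals `cur`; otherwise
           `cur` is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&num_processes, &cur, cur + 1))
            return 1;
    }
    return 0;
}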
However, I'm pretty sure this isn't all the code that is involved, and some other code (e.g. in get_processID() or the code that releases a PID) could still require a lock around the whole thing.
For your code, synchronization is not necessary at all, because here num_processes is incremented and decremented by only one process, i.e. the parent process. Also, num_processes is not a shared variable here. To create a shared variable you first have to learn about the shmget() and shmat() functions in UNIX.
A race condition arises when two or more processes want to access shared memory. An operation is atomic if it is executed entirely (i.e. with no switching) or not at all. For example:
Consider the increment operator on shared data. This operator is not atomic, because if you go down to the lower-level instructions, the increment is performed in several steps:
1. First, load the value of the variable into some register.
2. Add one to that loaded value; the result goes into some temporary register.
3. Store this result in the memory location pointed to by the variable being incremented.
As you can see, this operation is done in three steps, so if there is a switch to another process before these three steps complete, it leads to undesired results. You can read more about race conditions at this link: http://tutorials.jenkov.com/java-concurrency/race-conditions-and-critical-sections.html. The individual load, add, and store instructions are atomic, because each is performed entirely or not at all (assuming no power or system failure). So to make the increment operation atomic we need some synchronization, using either semaphores or monitors; these are all software synchronization techniques. I think the topic is clear now.
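Written out as C-level pseudocode (purely illustrative; `counter` is a hypothetical shared variable), the three steps look like this:

int counter;          /* hypothetical shared variable                */
int reg;              /* stands in for a CPU register                */
reg = counter;        /* 1. load the shared value into the register  */
reg = reg + 1;        /* 2. add one; the result sits in a temporary  */
counter = reg;        /* 3. store the result back to memory          */
/* A context switch between any two of these steps can make two
   concurrent increments produce only one net increment. */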

Necessity of pthread mutex

I have an int array[100] and I want 5 threads to calculate the sum of all array elements.
Each thread iterates through 20 elements within its dedicated range and writes the sum into a global sum variable.
Is a mutex necessary here? There is no synchronization needed since all threads are reading from independent sources.
for(i=offset; i<offset+range; i++){
// not used pthread_mutex_lock(&mutex);
sum += array[i];
// not used pthread_mutex_unlock(&mutex);
}
Can this lead to unpredictable behavior or does the OS actually handle this?
Is it advisable to leave out the mutex in this case? I've noticed that those algorithms run a lot faster without it.
Yes, you need synchronization, because all threads are modifying the sum at the same time. Here's an example:
You have an array of 4 elements [a1, a2, a3, a4], 2 threads t1 and t2, and sum. To begin, let's say t1 gets the value a1 and adds it to sum. But that's not an atomic operation: it copies the current value of sum (which is 0) into its local space, let's call it t1_s, adds a1 to it, and then writes sum = t1_s. But at the same time t2 does the same: it reads the value of sum (which is still 0, because t1 has not completed its operation) into t2_s, adds a3, and writes it to sum. So sum ends up containing a3 instead of a1 + a3. This is called a data race.
There are multiple solutions to this:
You can use a mutex as you already did in your code, but as you mentioned it can be slow, since mutex locks are expensive and all the other threads have to wait for them.
Create an array (with one slot per thread) in which each thread calculates its local sum, then do a final reduction over this array in a single thread. No synchronization is needed (see the sketch after this list).
Without an array, calculate a local sum_local in each thread and at the end add all these sums to the shared variable sum under a mutex. I guess this will be faster (though it needs to be measured).
However, as #gavinb mentioned, all of this only makes sense for a larger amount of data.
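Here is a minimal sketch of the second option in the list above; the thread count, array size, and names are just for illustration:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 5
#define N 100

int array[N];
long partial[NTHREADS];    /* one slot per thread: no sharing, no mutex */

void *worker(void *arg)
{
    int id = (int)(long)arg;
    int offset = id * (N / NTHREADS);
    long s = 0;
    for (int i = offset; i < offset + N / NTHREADS; i++)
        s += array[i];
    partial[id] = s;       /* each thread writes only its own slot */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < N; i++) array[i] = i + 1;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    long sum = 0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        sum += partial[i];  /* reduction happens after join: no race */
    }
    printf("sum = %ld\n", sum);   /* 5050 */
    return 0;
}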
I have an int array[100] and I want 5 threads to calculate the sum of all array elements. Each thread iterates through 20 elements within its dedicated range and writes the sum into a global sum variable.
First of all, it's worth pointing out that the overhead of this many threads processing this small amount of data would probably not be an advantage. There is a cost to creating threads, serialising access, and waiting for them to finish. With a dataset this small, a well-optimised sequential algorithm is probably faster. It would be an interesting exercise to measure the speedup with a varying number of threads.
Is a mutex necessary here? There is no synchronization needed since all threads are reading from independent sources.
Yes - the reading of the array variable is independent, however updating the sum variable is not, so you would need a mutex to serialise access to sum, according to your description above.
However, this is a very inefficient way of calculating the sum, as each thread will be competing (and waiting, hence wasting time) for access to increment sum. If you calculate intermediate sums for each subset (as #Werkov also mentioned), then wait for them to complete and add the intermediate sums to create the final sum, there will be no contention reading or writing, so you wouldn't need a mutex and each thread could run as quickly as possible. The limiting factor on performance would then likely be memory access pattern and cache behaviour.
Can this lead to unpredictable behavior or does the OS actually handle this?
Yes, definitely. The OS will not handle this for you as it cannot predict how/when you will access different parts of memory, and for what reason. Shared data must be protected between threads whenever any one of them may be writing to the data. So you would almost certainly get the wrong result as threads trip over each other updating sum.
Is it advisable to leave out the mutex in this case? I've noticed that those algorithms run a lot faster without it.
No, definitely not. It might run faster, but it will almost certainly not give you the correct result!
This works in cases where it is possible to partition the data such that there aren't dependencies (i.e. reads/writes) across partitions. In your example, there is a dependency on the sum variable, so a mutex is necessary. However, you can keep a partial-sum accumulator per thread and then sum these sub-results without needing a mutex.
Of course, you needn't do this by hand. There are various implementations of this; for instance, see OpenMP's parallel for and reduction.
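For illustration, the whole exercise collapses to a few lines with OpenMP's reduction clause (a sketch; compile with -fopenmp or your compiler's equivalent):

#include <stdio.h>

int main(void)
{
    int array[100];
    long sum = 0;
    for (int i = 0; i < 100; i++) array[i] = i + 1;

    /* Each thread accumulates a private copy of `sum`;
       OpenMP combines the copies with + when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 100; i++)
        sum += array[i];

    printf("sum = %ld\n", sum);   /* 5050 */
    return 0;
}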

MPI sending messages with MPI_Send and MPI_Reduce

So I am learning about parallel programming and am writing a program to calculate a global sum of a list of numbers. The list is split up into several sublists (depending on how many cores I have), and the sublists are individually summed in parallel. After each core has its own sum, I use MPI_Reduce to send the values back to other cores, until they eventually make it back to root. Rather than just sending their values back to root directly (O(n)), we send them to other cores in parallel (O(log(n))), like this image illustrates: http://imgur.com/rL2O3Tr
So, everything is working fine until line 54. I think I may be misunderstanding MPI_Reduce. I was under the impression MPI_Reduce simply took a value in one thread and a value in another thread (the destination thread), executed an operation on the values, and then stored the result in the same spot in the second thread. This is what I want, at least. I want to take my_sum from the sending thread and add it to the my_sum in the receiving thread. Can you use MPI_Reduce on the same addresses in different threads? They both have the same name.
Furthermore, I want to generate a binary tree representation like this: http://imgur.com/cz6iFxl
Where S02 means that the sum was sent to thread 2, and R03 means that the sum was received by thread 3. For this I am creating an array of structs for each step in the sums (log(n) steps). Each step occurs on lines 59-95; each iteration of the while loop is one step. Lines 64-74 are where the thread is sending its sum to the destination thread and recording the information in the array of structs.
I think I may be using MPI_Send the wrong way. I am using it like this:
MPI_Send(srInfo, 1, MPI_INT, root, 0, MPI_COMM_WORLD);
Where srInfo is an array of structs, so just a pointer to the first struct (right?). Will this not work because the memory is not shared?
Sorry I am very new to parallel programming, and just need help understanding this, thanks.
You might be misunderstanding what MPI_REDUCE is supposed to do at a higher level. Is there a reason that you really need to divide up your reduction manually? Usually, the MPI collectives are going to be better at optimizing for large-scale communicators than anything you can do on your own. I'd suggest just using the MPI_REDUCE function to do the reduction for all ranks.
So your code will do something like this:
Divide up the work among all of your ranks somehow (could be reading from a file, being sent from some "root" process to all of the others, etc.).
Each rank sums up its own values.
Each rank enters into an MPI_REDUCE with its own value. This would look something like:
MPI_Reduce(&myval, &sum, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
That should automatically do all of the summation for you in what is usually some sort of tree fashion.
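A minimal self-contained sketch of those three steps; the work division here is a toy stand-in for however your data actually arrives, and it assumes the number of ranks divides 100 evenly:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Step 1 and 2: each rank sums its own disjoint slice of 1..100. */
    int chunk = 100 / size;
    int myval = 0;
    for (int i = rank * chunk + 1; i <= (rank + 1) * chunk; i++)
        myval += i;

    /* Step 3: MPI_Reduce combines all local sums, typically tree-style. */
    int sum = 0;
    MPI_Reduce(&myval, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %d\n", sum);   /* 5050 */
    MPI_Finalize();
    return 0;
}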

Calculate the Fibonacci numbers in a multiprocess way?

I am writing a multi-process Fibonacci number calculator. I have a file that keeps track of the Fibonacci numbers. The first process opens the file and writes the first Fibonacci numbers (0 and 1), then forks; its child process reads the last two numbers, adds them up, writes the next one into the file, closes the file, and forks again. The process continues like that, with each child adding up the last two numbers and writing the calculated number into the file. Using fork inside the for loop is not a good solution, and neither is a recursive call. Are there any suggestions for this problem?
Here is the link to the problem; we are talking about the multi-process part, which is part 2:
http://cse.yeditepe.edu.tr/~sbaydere/fall2010/cse331/files/assignments/F10A1.pdf
Assuming you're calculating them in a "simple" way (i.e. without using a cunning formula), I don't think it's a good candidate for parallel processing at all.
It's easy to come up with an O(n) solution, but each result depends on the previous one, so it's inherently tricky to parallelize. I can't see any benefit in your current parallel approach, as after each process has finished its own job and forked a child to get the next number, it's basically done... so you might as well do the work of the forked child in the existing process.
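For reference, the straightforward O(n) sequential version looks like the sketch below; the chain of dependencies is exactly what resists parallelisation:

/* Iterative O(n) Fibonacci: F(0) = 0, F(1) = 1. Each step needs the
   previous two values, so no step can start before the last finishes. */
unsigned long fib(int n)
{
    unsigned long a = 0, b = 1;
    for (int i = 0; i < n; i++) {
        unsigned long next = a + b;
        a = b;
        b = next;
    }
    return a;
}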
Fibonacci number calculation is a really strange candidate for multiprocessing. Indeed, to calculate a number, you need to know the previous two. Multiple processes cannot calculate any numbers except the next one, and only the next one: they would all be calculating the same next Fibonacci number. At best, they would just double-check each other.
You might want to look at this article:
http://cgi.cse.unsw.edu.au/~dons/blog/2007/11/29
There are more ideas here:
http://www.haskell.org/haskellwiki/The_Fibonacci_sequence
Probably this isn't solving your problem, but here is a trivial way to calculate the nth Fibonacci number (with F(0) = 0 and F(1) = 1):
int fibo(int n) { return (n <= 1) ? n : fibo(n - 1) + fibo(n - 2); }
