MPI_Recv() freezing program, not receiving value from MPI_Send() in C - c

I'm trying to write an implementation of a hyper quicksort in MPI, but I'm having an issue where a process gets stuck on MPI_Recv().
While testing with 2 processes, it seems that inside the else branch of if (rank % comm_sz == 0), process 1 never receives the pivot from process 0. Process 0 successfully sends its pivot and recurses through the method correctly. I put in some debug print statements and got the following output:
(arr, 0, 2, 0, 9)
Rank 0 sending pivot 7 to 1
(arr, 1, 2, 0, 9)
Rank 1 pre-recv from 0
After which, the post-recv message from rank 1 never prints. Rank 0 prints its post-send message and continues through its section of the array. Is there something wrong with my implementation of MPI_Send() or MPI_Recv() that may be causing this?
Here is my code for the quicksort:
(For reference, comm_sz in the parameters for the method refers to the number of processes looking at that section of the array.)
void hyper_quick(int *array, int rank, int comm_sz, int s, int e) {
    printf("(arr, %d, %d, %d, %d)\n", rank, comm_sz, s, e);
    // Keeps recursing until there is only one element
    if (s < e) {
        int pivot;
        if (comm_sz > 1) {
            // One process gets a random pivot within its range and sends that to every process looking at that range
            if (rank % comm_sz == 0) {
                pivot = rand() % (e - s) + s;
                for (int i = rank + 1; i < comm_sz; i++) {
                    int partner = rank + i;
                    printf("Rank %d sending pivot %d to %d\n", rank, pivot, partner);
                    MPI_Send(&pivot, 1, MPI_INT, partner, rank, MPI_COMM_WORLD);
                    printf("Rank %d successfully sent %d to %d\n", rank, pivot, partner);
                }
            }
            else {
                int partner = rank - (rank % comm_sz);
                printf("Rank %d pre-recv from %d\n", rank, partner);
                MPI_Recv(&pivot, 1, MPI_INT, partner, rank, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("Rank %d received pivot %d from %d\n", rank, pivot, partner);
            }
        }
        else {
            pivot = rand() % (e - s) + s;
        }
        int tmp = array[pivot];
        array[pivot] = array[e];
        array[e] = tmp;
        // Here is where the actual quick sort happens
        int i = s;
        int j = e - 1;
        while (i < j) {
            while (array[e] >= array[i] && i < j) {
                i++;
            }
            while (array[e] < array[j] && i < j) {
                j--;
            }
            if (i < j) {
                tmp = array[i];
                array[i] = array[j];
                array[j] = tmp;
            }
        }
        if (array[e] < array[i]) {
            tmp = array[i];
            array[i] = array[e];
            array[e] = tmp;
            pivot = i;
        }
        else {
            pivot = e;
        }
        // Split remaining elements between remaining processes
        if (comm_sz > 1) {
            // Elements greater than pivot
            if (rank % comm_sz >= comm_sz/2) {
                hyper_quick(array, rank, comm_sz/2, pivot + 1, e);
            }
            // Elements lesser than pivot
            else {
                hyper_quick(array, rank, comm_sz/2, s, pivot - 1);
            }
        }
        // Recurse remaining elements in current process
        else {
            hyper_quick(array, rank, 1, s, pivot - 1);
            hyper_quick(array, rank, 1, pivot + 1, e);
        }
    }
}

Rank 0 sending pivot 7 to 1
MPI_Send(&pivot, 1, MPI_INT, partner, rank, MPI_COMM_WORLD);
                                      ^^^^
So the sender's tag is zero (rank 0 passes its own rank as the tag).
Rank 1 pre-recv from 0
MPI_Recv(&pivot, 1, MPI_INT, partner, rank, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                                      ^^^^
And the receiver's tag is one (rank 1 also passes its own rank as the tag).
A receive that asks for a specific tag only matches messages sent with that tag (unless it uses MPI_ANY_TAG), so rank 1's MPI_Recv with tag 1 never matches the message rank 0 sent with tag 0, and it blocks forever.
Sometimes there are cases when A might have to send many different types of messages to B. Instead of B having to go through extra measures to differentiate all these messages, MPI allows senders and receivers to also specify message IDs with the message (known as tags). When process B only requests a message with a certain tag number, messages with different tags will be buffered by the network until B is ready for them. [MPI Tutorial -- Send and Receive]
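A minimal sketch of one way to make the tags agree, keeping the structure of the question's code (only the tag arguments change; a fixed constant tag, or MPI_ANY_TAG on the receive, would work just as well):

// Sender side (unchanged): the tag is the sender's own rank.
MPI_Send(&pivot, 1, MPI_INT, partner, rank, MPI_COMM_WORLD);

// Receiver side: ask for the sender's rank as the tag, not our own.
// Here partner == rank - (rank % comm_sz), i.e. the rank that sent the pivot.
MPI_Recv(&pivot, 1, MPI_INT, partner, partner, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

Either way, the send and the receive must name the same tag value (or the receive must use MPI_ANY_TAG) for the receive to complete.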

Related

Implementing MPI_Allreduce in C - why does my code hang indefinitely?

I am trying to implement my own version of MPI_Allreduce in C. The logic is that odd ranks send their data to the even rank at rank-1, i.e. rank 1 sends to rank 0. Rank 0 then receives the data from rank 1 and operates on it depending on the MPI_Op operator passed as an argument. I am unsure if my way of operating on the data is correct; for example, if MPI_SUM is passed then for each count my code does recvbuf[count] += sendbuf[count]. Once an even rank has received data from the odd rank+1, it does rank /= 2 and the contents of the while loop repeat: rank 0 is unchanged, rank 2 becomes rank 1 and now contains the original ranks 2 and 3's data, and so on. num is also halved on each iteration of the while loop until it becomes 1, at which point only rank 0 should remain.
My code thus far is below:
int tree_reduction(const int rank_in, const int np, const int *sendbuf, int *recvbuf,
                   int count, MPI_Op op, MPI_Comm comm){
    // Create variables for rank and size
    int rank = rank_in;
    int num = np;
    int tag = 0;
    int depth = 1; // Depth of the tree

    // While size is greater than 1 there is 2 or more ranks to operate on
    while(num > 1){
        if(rank < num){
            if( (rank % 2) != 0 ){ // If rank is odd
                MPI_Ssend(sendbuf, count, MPI_INT, (rank-1)*depth, tag, comm);
                rank *= num; // any ranks above 0 will be filtered out
                break;
            }
            else{ // If rank is even
                MPI_Recv(recvbuf, count, MPI_INT, (rank+1)*depth, tag, comm,
                         MPI_STATUS_IGNORE);
                /* START OF OPERATORS */
                if(op == MPI_SUM){
                    for(int c=0; c<count; c++){
                        recvbuf[c] += sendbuf[c];
                    }
                }
                if(op == MPI_PROD){
                    for(int c=0; c<count; c++){
                        recvbuf[count] *= sendbuf[count];
                    }
                }
                if(op == MPI_MIN){
                    for(int c=0; c<count; c++){
                        if(sendbuf[count] < recvbuf[count]){
                            recvbuf[count] = sendbuf[count];
                        }
                    }
                }
                if(op == MPI_MAX){
                    for(int c=0; c<count; c++){
                        if(sendbuf[count] > recvbuf[count]){
                            recvbuf[count] = sendbuf[count];
                        }
                    }
                }
                /* END OF OPERATORS */
            }
            depth *= 2;
        }
        num = num/2;
    }
    // NEED TO BROADCAST BACK TO ALL RANKS
    MPI_Bcast(recvbuf, count, MPI_INT, 0, comm);
    return 0;
}
When I run this with more than 2 processes, it hangs indefinitely without printing anything to the terminal. Why is this?
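For comparison only (this is not the code above, and the names tree_sum and tmp are purely illustrative), a minimal sketch of a binomial-tree MPI_SUM reduction that keeps using the real MPI ranks throughout, so every MPI_Send has a matching MPI_Recv:

#include <stdlib.h>
#include <mpi.h>

static void tree_sum(const int *sendbuf, int *recvbuf, int count, MPI_Comm comm)
{
    int rank, np;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &np);

    int *tmp = malloc(count * sizeof *tmp);
    for (int c = 0; c < count; c++)
        recvbuf[c] = sendbuf[c];                  // start from this rank's own data

    for (int d = 1; d < np; d *= 2) {
        if ((rank / d) % 2 == 1) {                // odd multiple of d: send and stop
            MPI_Send(recvbuf, count, MPI_INT, rank - d, 0, comm);
            break;
        } else if (rank + d < np) {               // even multiple of d: receive and add
            MPI_Recv(tmp, count, MPI_INT, rank + d, 0, comm, MPI_STATUS_IGNORE);
            for (int c = 0; c < count; c++)
                recvbuf[c] += tmp[c];
        }
    }
    free(tmp);
    MPI_Bcast(recvbuf, count, MPI_INT, 0, comm);  // share the result with all ranks
}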

How to convert MPI_Reduce into MPI_Send and MPI_Recv?

I am working on a parallel processing program that uses MPI_Send() and MPI_Recv() instead of MPI_Reduce(). I understand that MPI_Send() needs to send a value from each processor to the root processor (rank 0) and that MPI_Recv() needs to receive all of those values there.
The problem is that the value passed to Send never seems to arrive on the receiving side, so the final value stays 0. The MPI_Reduce() call is still in the code but commented out to show what needs to be replaced. Can anyone help?
#include "mpi.h"
#include <stdio.h>
#include <math.h>
int main( int argc, char *argv[])
{
int n, i;
double PI25DT = 3.141592653589793238462643;
double pi, h, sum, x;
int numprocs, myid;
double startTime, endTime;
/* Initialize MPI and get number of processes and my number or rank*/
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
/* Processor zero sets the number of intervals and starts its clock*/
if (myid==0) {
n=600000000;
startTime=MPI_Wtime();
for (int i = 0; i < numprocs; i++) {
if (i != myid) {
MPI_Send(&n, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
}
}
}
else {
MPI_Recv(&n, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
/* Calculate the width of intervals */
h = 1.0 / (double) n;
/* Initialize sum */
sum = 0.0;
/* Step over each inteval I own */
for (i = myid+1; i <= n; i += numprocs) {
/* Calculate midpoint of interval */
x = h * ((double)i - 0.5);
/* Add rectangle's area = height*width = f(x)*h */
sum += (4.0/(1.0+x*x))*h;
}
/* Get sum total on processor zero */
//MPI_Reduce(&sum,&pi,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
double value = 0;
if (myid != 0) {
MPI_Send(&sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
}
else {
for (int i = 1; i < numprocs; i++) {
MPI_Recv(&value, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
pi += value;
}
}
/* Print approximate value of pi and runtime*/
if (myid==0) {
printf("pi is approximately %.16f, Error is %e\n",
pi, fabs(pi - PI25DT));
endTime=MPI_Wtime();
printf("runtime is=%.16f",endTime-startTime);
}
MPI_Finalize();
return 0;
}
You are using MPI_INT to send a value of type double:
if (myid != 0) {
    MPI_Send(&sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    //                ^^^^^^^
}
An int is 4 bytes long; a double is 8 bytes long. Although the receive operation succeeds, it cannot construct a value of type MPI_DOUBLE from only 4 bytes of message data, so it doesn't write anything into value and value remains 0.0. Indeed, if you replace:
MPI_Recv(&value, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
with
MPI_Status status;
int count;
MPI_Recv(&value, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_DOUBLE, &count);
if (count == MPI_UNDEFINED) {
    printf("Short message received\n");
    MPI_Abort(MPI_COMM_WORLD, 0);
}
your program will abort, because MPI_Get_count() returns MPI_UNDEFINED in count, which signals that the length of the received message was not an integer multiple of the size of MPI_DOUBLE.
Also, pi must be explicitly initialised to sum before the receive loop; otherwise you will get the wrong value of pi for two reasons:
pi is left uninitialised and has an arbitrary initial value, and
the contribution of rank 0 is never added to the final result.
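Putting both fixes together, a minimal sketch of the corrected reduction step might look like this (the rest of the program unchanged):

double value = 0;
if (myid != 0) {
    MPI_Send(&sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);   /* send the double as MPI_DOUBLE */
}
else {
    pi = sum;                                               /* rank 0's own contribution */
    for (int i = 1; i < numprocs; i++) {
        MPI_Recv(&value, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        pi += value;
    }
}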

MPI_send and MPI_receive function getting stuck

I have a page_rank code where I have to compute the page rank in parallel. My code hangs in the following MPI_Send and MPI_Recv section. What could be the problem?
int **sendto_list = (int **)malloc(comm_size*sizeof(int*));
for(i=0; i < comm_size; i++) {
    sendto_list[i] = (int *)malloc(18*sizeof(int));
    sendto_list[i][0] = 16;
    sendto_list[i][1] = 0;
}

int temp_data = 1;
for(i=0; i < comm_size; i++) {
    if(request_list[i][1] > 0) {
        for(k=0; k < request_list[i][1]; ) {
            for(j=0; j < 200; j++) {
                if( k >= request_list[i][1] )
                    break;
                sendrecv_buffer_int[j] = request_list[i][k+2];
                k++;
            }
            // Request appropriate process for pagerank.
            if(i != my_rank)
                MPI_Send(&temp_data, 1, MPI_INT, i, TAG_PR_REQ, MPI_COMM_WORLD);
        }
    }
    if( i != my_rank )
        MPI_Send(&temp_data, 1, MPI_INT, i, TAG_PR_DONE, MPI_COMM_WORLD);
}

int expected_requests = 0, done = 0, temp, s;
s = 0;
while( (done == 0) && (comm_size > 1) ) {
    if(expected_requests == (comm_size - 1))
        break;
    int count;
    // Receive pagerank requests or messages with TAG_PR_DONE (can be from any process).
    MPI_Recv(&temp, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_INT, &count);
    switch(status.MPI_TAG) {
        case TAG_PR_REQ: {
            for(i = 0 ; i < count; i++)
                insert_into_adj_list(&sendto_list[status.MPI_SOURCE], sendrecv_buffer_int[i], num_nodes);
            break;
        }
        case TAG_PR_DONE: {
            expected_requests++;
            break;
        }
        default:
            break;
    }
}
A cursory glance over your code suggests that the issue is that your MPI_Send() calls block, and nothing on the other side posts the receives that would let them complete.
If (request_list[i][1] > 0) and (i != my_rank) both evaluate to true, you perform 2 MPI_Send() operations to process rank i, but each process, i.e. process rank i, only posts 1 matching MPI_Recv() operation.
You may want to try changing
if(request_list[i][1] > 0) {
    ...
}
if( i != my_rank )
    MPI_Send(&temp_data, 1, MPI_INT, i, TAG_PR_DONE, MPI_COMM_WORLD);
to
if(request_list[i][1] > 0) {
    ...
} else if( i != my_rank ) {
    MPI_Send(&temp_data, 1, MPI_INT, i, TAG_PR_DONE, MPI_COMM_WORLD);
}
Note the addition of else, turning the second if into an else if. This ensures at most 1 MPI_Send() per destination in that loop; it does not look like both MPI_Send() operations are meant to execute when both of the above conditions are true.
Alternatively, if you need them, you could look into MPI_Isend() and MPI_Irecv(), although I don't think they will solve your problem entirely in this case; I still think you need the else if clause.
I'd also like to point out that in C there is no need to cast the return value of malloc(). This topic has been covered extensively on Stack Overflow, so I won't dwell on it.
You should also check that the result of malloc() is a valid pointer; it returns NULL if an error occurred.
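For example, a minimal sketch of the allocation with those two points applied (a fragment only; it assumes <stdio.h> and <mpi.h> are already included):

int **sendto_list = malloc(comm_size * sizeof *sendto_list);  // no cast needed in C
if (sendto_list == NULL) {
    fprintf(stderr, "malloc failed\n");
    MPI_Abort(MPI_COMM_WORLD, 1);                             // or handle the error another way
}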

How does MPI Odd-Even sort work?

Just looking over some notes prior to an interview, and I am struggling to understand how odd-even sort works on parallel architectures.
int MPI_OddEven_Sort(int n, double *a, int root, MPI_Comm comm)
{
    int rank, size, i, sorted_result;
    double *local_a;

    // get rank and size of comm
    MPI_Comm_rank(comm, &rank); //&rank = address of rank
    MPI_Comm_size(comm, &size);

    local_a = (double *) calloc(n / size, sizeof(double));
    // scatter the array a to local_a
    MPI_Scatter(a, n / size, MPI_DOUBLE, local_a, n / size, MPI_DOUBLE,
                root, comm);
    // sort local_a
    merge_sort(n / size, local_a);

    //odd-even part
    for (i = 0; i < size; i++) {
        if ((i + rank) % 2 == 0) { // means i and rank have same nature
            if (rank < size - 1) {
                MPI_Compare(n / size, local_a, rank, rank + 1, comm);
            }
        } else if (rank > 0) {
            MPI_Compare(n / size, local_a, rank - 1, rank, comm);
        }
        MPI_Barrier(comm);
        // test if array is sorted
        MPI_Is_Sorted(n / size, local_a, root, comm, &sorted_result);
        // is sorted gives integer 0 or 1, if 0 => array is sorted
        if (sorted_result == 0) {
            break;
        } // check for iterations
    }

    // gather local_a to a
    MPI_Gather(local_a, n / size, MPI_DOUBLE, a, n / size, MPI_DOUBLE,
               root, comm);
    return MPI_SUCCESS;
}
is some code I wrote for this function (not today nor yesterday!). Can someone please break down how it is working?
I'm scattering my array a to each processor, and each one gets a copy in local_a (which is of size n/size).
Merge sort is being called on each local_a.
What is going on after this? (Assuming I am correct so far!)
It's sort of fun to see these PRAM-type sorting networks popping up again after all these years. The original mental model of parallel computing for these things was massively parallel arrays of tiny processors as "comparators", eg the Connection Machines - back in the day when networking was cheap compared to CPU/RAM. Of course that ended up looking very different from the supercomputers of the mid to late 80s and on, and even more so than the x86 clusters of the late 90s on; but now they're starting to come back in vogue with GPUs and other accelerators which actually do look a bit like that future past if you squint.
It looks like what you have above is something more like a Baudet-Stevenson odd-even sort, which was already starting to move in the direction of assuming that the processors would have multiple items stored locally and you could make good use of the processors by sorting those local lists in between communication steps.
Fleshing out your code and simplifying it a bit, we have something like this:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int merge(double *ina, int lena, double *inb, int lenb, double *out) {
    int i,j;
    int outcount=0;

    for (i=0,j=0; i<lena; i++) {
        while (j < lenb && inb[j] < ina[i]) {
            out[outcount++] = inb[j++];
        }
        out[outcount++] = ina[i];
    }
    while (j<lenb)
        out[outcount++] = inb[j++];

    return 0;
}
int domerge_sort(double *a, int start, int end, double *b) {
    if ((end - start) <= 1) return 0;

    int mid = (end+start)/2;
    domerge_sort(a, start, mid, b);
    domerge_sort(a, mid, end, b);
    merge(&(a[start]), mid-start, &(a[mid]), end-mid, &(b[start]));
    for (int i=start; i<end; i++)
        a[i] = b[i];

    return 0;
}

int merge_sort(int n, double *a) {
    double b[n];
    domerge_sort(a, 0, n, b);
    return 0;
}

void printstat(int rank, int iter, char *txt, double *la, int n) {
    printf("[%d] %s iter %d: <", rank, txt, iter);
    for (int j=0; j<n-1; j++)
        printf("%6.3lf,",la[j]);
    printf("%6.3lf>\n", la[n-1]);
}

void MPI_Pairwise_Exchange(int localn, double *locala, int sendrank, int recvrank,
                           MPI_Comm comm) {

    /*
     * the sending rank just sends the data and waits for the results;
     * the receiving rank receives it, sorts the combined data, and returns
     * the correct half of the data.
     */
    int rank;
    double remote[localn];
    double all[2*localn];
    const int mergetag = 1;
    const int sortedtag = 2;

    MPI_Comm_rank(comm, &rank);
    if (rank == sendrank) {
        MPI_Send(locala, localn, MPI_DOUBLE, recvrank, mergetag, MPI_COMM_WORLD);
        MPI_Recv(locala, localn, MPI_DOUBLE, recvrank, sortedtag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(remote, localn, MPI_DOUBLE, sendrank, mergetag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        merge(locala, localn, remote, localn, all);

        int theirstart = 0, mystart = localn;
        if (sendrank > rank) {
            theirstart = localn;
            mystart = 0;
        }
        MPI_Send(&(all[theirstart]), localn, MPI_DOUBLE, sendrank, sortedtag, MPI_COMM_WORLD);
        for (int i=mystart; i<mystart+localn; i++)
            locala[i-mystart] = all[i];
    }
}

int MPI_OddEven_Sort(int n, double *a, int root, MPI_Comm comm)
{
    int rank, size, i;
    double *local_a;

    // get rank and size of comm
    MPI_Comm_rank(comm, &rank); //&rank = address of rank
    MPI_Comm_size(comm, &size);

    local_a = (double *) calloc(n / size, sizeof(double));
    // scatter the array a to local_a
    MPI_Scatter(a, n / size, MPI_DOUBLE, local_a, n / size, MPI_DOUBLE,
                root, comm);
    // sort local_a
    merge_sort(n / size, local_a);

    //odd-even part
    for (i = 1; i <= size; i++) {
        printstat(rank, i, "before", local_a, n/size);

        if ((i + rank) % 2 == 0) { // means i and rank have same nature
            if (rank < size - 1) {
                MPI_Pairwise_Exchange(n / size, local_a, rank, rank + 1, comm);
            }
        } else if (rank > 0) {
            MPI_Pairwise_Exchange(n / size, local_a, rank - 1, rank, comm);
        }
    }

    printstat(rank, i-1, "after", local_a, n/size);

    // gather local_a to a
    MPI_Gather(local_a, n / size, MPI_DOUBLE, a, n / size, MPI_DOUBLE,
               root, comm);

    if (rank == root)
        printstat(rank, i, " all done ", a, n);

    return MPI_SUCCESS;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int n = argc-1;
    double a[n];
    for (int i=0; i<n; i++)
        a[i] = atof(argv[i+1]);

    MPI_OddEven_Sort(n, a, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
So the way this works is that the list is evenly split up between processors (non-equal distributions are easily handled too, but it's a lot of extra bookkeeping which doesn't add much to this discussion).
We first sort our local lists (which is O(n/P ln n/P)). There's no reason it has to be a merge sort, of course, except that here we can re-use that merge code in the following steps. Then we do P neighbour-exchange steps, half in each direction. The model here is a linear network where we can communicate directly and quickly with immediate neighbours, and perhaps not at all with neighbours further away.
The original odd-even sorting network is the case where each processor has exactly one key, in which case the communication is easy: you compare your item with your neighbour's and swap if necessary (so this is basically a parallel bubble sort). Here we instead do a simple parallel sort between pairs of processes: each pair sends all of its data to one member of the pair, that member merges the two already locally sorted lists in O(n/P), and then gives the appropriate half of the data back to the other processor. I took out your check-if-done; it can be shown that the sort completes in P neighbour exchanges. You can certainly add it back in to allow early termination; however, all the processors have to agree when everything's done, which requires something like an all-reduce, and that breaks the original model somewhat.
So we have O(n) data transfer per link (sending and receiving n/P items P times each), and each processor does (n/P ln n/P) + (2 n/P - 1)*P/2 = O(n/P ln(n/P) + n) comparisons; in this case there's a scatter and a gather to be considered as well, but in general this sort is done with the data in place.
Running the above with, for clarity, the same example as in the linked document gives (with the output re-ordered to make it easier to read):
$ mpirun -np 4 ./baudet-stevenson 43 54 63 28 79 81 32 47 84 17 25 49
[0] before iter 1: <43.000,54.000,63.000>
[1] before iter 1: <28.000,79.000,81.000>
[2] before iter 1: <32.000,47.000,84.000>
[3] before iter 1: <17.000,25.000,49.000>
[0] before iter 2: <43.000,54.000,63.000>
[1] before iter 2: <28.000,32.000,47.000>
[2] before iter 2: <79.000,81.000,84.000>
[3] before iter 2: <17.000,25.000,49.000>
[0] before iter 3: <28.000,32.000,43.000>
[1] before iter 3: <47.000,54.000,63.000>
[2] before iter 3: <17.000,25.000,49.000>
[3] before iter 3: <79.000,81.000,84.000>
[0] before iter 4: <28.000,32.000,43.000>
[1] before iter 4: <17.000,25.000,47.000>
[2] before iter 4: <49.000,54.000,63.000>
[3] before iter 4: <79.000,81.000,84.000>
[0] after iter 4: <17.000,25.000,28.000>
[1] after iter 4: <32.000,43.000,47.000>
[2] after iter 4: <49.000,54.000,63.000>
[3] after iter 4: <79.000,81.000,84.000>
[0] all done iter 5: <17.000,25.000,28.000,32.000,43.000,47.000,49.000,54.000,63.000,79.000,81.000,84.000>

MPI joining vectors

I'm trying to do some parallel calculations and then reduce them to one vector.
I try it by dividing the for loop into parts which are calculated separately by each process, each producing part of the vector. Later I'd like to join all those subvectors into one main vector by replacing parts of it with the values obtained from the processes. Needless to say, I have no idea how to do it and my attempts were in vain.
Any help will be appreciated.
MPI_Barrier(MPI_COMM_WORLD);
MPI_Bcast(A, n*n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(b, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(x0, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
printf("My id: %d, mySize: %d, myStart: %d, myEnd: %d", rank, size, mystart, myend);

while(delta > granica)
{
    ii++;
    delta = 0;

    //if(rank > 0)
    //{
    for(i = mystart; i < myend; i++)
    {
        xNowe[i] = b[i];
        for(j = 0; j < n; j++)
        {
            if(i != j)
            {
                xNowe[i] -= A[i][j] * x0[j];
            }
        }
        xNowe[i] = xNowe[i] / A[i][i];
        printf("Result in iteration %d: %d", i, xNowe[i]);
    }
    MPI_Reduce(xNowe, xNowe, n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
I'm going to ignore your calculations and assume they're all doing whatever it is you want them to do and at the end, you have an array called xNowe that has the results for your rank somewhere within it (in some subarray).
You have two options.
The first way uses an MPI_REDUCE in the way you're currently doing it.
What needs to happen is that you should probably set all of the values that do not pertain to your rank to 0, then you can just do a big MPI_REDUCE (as you're already doing), where each process contributes its xNowe array which will look something like this (depending on the input/rank/etc.):
rank:  0 1 2 3 4 5 6 7
value: 0 0 1 2 0 0 0 0
When you do the reduction (with MPI_SUM as the op), you'll get an array (on rank 0) that has each value filled in with the value contributed by each rank.
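A minimal sketch of that approach (illustrative only; here result is a separate receive buffer on rank 0, since MPI_Reduce does not allow the same array to be passed as both the send and the receive buffer the way the code above passes xNowe twice; MPI_IN_PLACE on the root is another option):

// Zero out the entries this rank did not compute, then sum everything onto rank 0.
for (i = 0; i < n; i++)
    if (i < mystart || i >= myend)
        xNowe[i] = 0.0;
MPI_Reduce(xNowe, result, n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);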
The second way uses an MPI_GATHER. Some might consider this to be the "more proper" way.
For this version, instead of using MPI_REDUCE to get the result, you only send the data that was calculated on your rank. You wouldn't have one large array. So your code would look something like this:
MPI_Barrier(MPI_COMM_WORLD);
MPI_Bcast(A, n*n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(b, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(x0, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
printf("My id: %d, mySize: %d, myStart: %d, myEnd: %d", rank, size, mystart, myend);

while(delta > granica)
{
    ii++;
    delta = 0;

    for(i = mystart; i < myend; i++)
    {
        xNowe[i-mystart] = b[i];
        for(j = 0; j < n; j++)
        {
            if(i != j)
            {
                xNowe[i-mystart] -= A[i][j] * x0[j];
            }
        }
        xNowe[i-mystart] = xNowe[i-mystart] / A[i][i];
        printf("Result in iteration %d: %d", i, xNowe[i-mystart]);
    }
}

MPI_Gather(xNowe, myend-mystart, MPI_DOUBLE, result, myend-mystart, MPI_DOUBLE, 0, MPI_COMM_WORLD);
You would obviously need to create a new array on rank 0 that is called result to hold the resulting values.
UPDATE:
As pointed out by Hristo in the comments below, MPI_GATHER might not work here if myend - mystart is not the same on all ranks. If that's the case, you'd need to use MPI_GATHERV which allows you to specify a different size for each rank.
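A minimal sketch of what that could look like (illustrative fragment only; rank, size, mystart, myend, xNowe and result are the variables from the code above, <stdlib.h> is assumed for malloc, and the per-rank counts and displacements are first collected on the root):

int local_n = myend - mystart;
int *counts = NULL, *displs = NULL;
if (rank == 0) {
    counts = malloc(size * sizeof *counts);
    displs = malloc(size * sizeof *displs);
}
// The root learns how many elements each rank will contribute.
MPI_Gather(&local_n, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (rank == 0) {
    displs[0] = 0;
    for (int r = 1; r < size; r++)
        displs[r] = displs[r-1] + counts[r-1];
}
// Each rank sends its piece; the root places the pieces at the right offsets.
MPI_Gatherv(xNowe, local_n, MPI_DOUBLE,
            result, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);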
