I am working on a parallel processing program that uses MPI_Send() and MPI_Recv() instead of MPI_Reduce(). I understand that MPI_Send() needs to send a value from each process to the root process (rank 0), and MPI_Recv() needs to receive the values from all of the other processes.
I keep running into the problem that the value passed to MPI_Send() never seems to arrive on the receiving side, so the final value stays 0. The MPI_Reduce() call is still in the code but commented out, to show what needs to be replaced. Can anyone help?
#include "mpi.h"
#include <stdio.h>
#include <math.h>
int main( int argc, char *argv[])
{
int n, i;
double PI25DT = 3.141592653589793238462643;
double pi, h, sum, x;
int numprocs, myid;
double startTime, endTime;
/* Initialize MPI and get number of processes and my number or rank*/
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
/* Processor zero sets the number of intervals and starts its clock*/
if (myid==0) {
n=600000000;
startTime=MPI_Wtime();
for (int i = 0; i < numprocs; i++) {
if (i != myid) {
MPI_Send(&n, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
}
}
}
else {
MPI_Recv(&n, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
/* Calculate the width of intervals */
h = 1.0 / (double) n;
/* Initialize sum */
sum = 0.0;
/* Step over each inteval I own */
for (i = myid+1; i <= n; i += numprocs) {
/* Calculate midpoint of interval */
x = h * ((double)i - 0.5);
/* Add rectangle's area = height*width = f(x)*h */
sum += (4.0/(1.0+x*x))*h;
}
/* Get sum total on processor zero */
//MPI_Reduce(&sum,&pi,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
double value = 0;
if (myid != 0) {
MPI_Send(&sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
}
else {
for (int i = 1; i < numprocs; i++) {
MPI_Recv(&value, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
pi += value;
}
}
/* Print approximate value of pi and runtime*/
if (myid==0) {
printf("pi is approximately %.16f, Error is %e\n",
pi, fabs(pi - PI25DT));
endTime=MPI_Wtime();
printf("runtime is=%.16f",endTime-startTime);
}
MPI_Finalize();
return 0;
}
You are using MPI_INT to send a value of type double:
if (myid != 0) {
MPI_Send(&sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
// ^^^^^^^
}
An int is typically 4 bytes long while a double is 8 bytes long. Although the receive operation completes, it cannot construct a value of type MPI_DOUBLE from only 4 bytes of message data, so it writes nothing into value and value stays 0.0. Indeed, if you replace:
MPI_Recv(&value, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
with
MPI_Status status;
int count;
MPI_Recv(&value, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_DOUBLE, &count);
if (count == MPI_UNDEFINED) {
printf("Short message received\n");
MPI_Abort(MPI_COMM_WORLD, 0);
}
your program will abort: MPI_Get_count() returns MPI_UNDEFINED in count, which signals that the length of the received message was not an integer multiple of the size of MPI_DOUBLE, so the body of the conditional is executed.
Also, pi must be explicitly initialised to sum before the receive loop, otherwise you will get the wrong value of pi for two reasons:
pi is left uninitialised and holds an arbitrary initial value, and
the contribution of rank 0 is never added to the final result.
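Putting both fixes together, a minimal corrected sketch of the manual reduction (reusing the question's variable names) would be:
double value = 0.0;
if (myid != 0) {
    /* the buffer holds a double, so the datatype must be MPI_DOUBLE */
    MPI_Send(&sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
}
else {
    pi = sum;  /* start from rank 0's own partial sum */
    for (int i = 1; i < numprocs; i++) {
        MPI_Recv(&value, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        pi += value;
    }
}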
Related
if I have this code:
int main(void) {
int result=0;
int num[6] = {1, 2, 4, 3, 7, 1};
if (my_rank != 0) {
MPI_Reduce(num, &result, 6, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
} else {
MPI_Reduce(num, &result, 6, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
printf("result = %d\n", result);
}
}
the printed result is 1;
but if num[0] = 9, then the result is 9.
I read that to solve this problem I must define the variable result as an array.
I can't understand how MPI_Reduce works with MPI_MIN. Why, if num[0] is not the smallest number, must I define result as an array?
MPI_Reduce performs a reduction over the members of the communicator - not the members of the local array. sendbuf and recvbuf must both be of the same size.
I think the standard says it best:
Thus, all processes provide input buffers and output buffers of the same length, with elements of the same type. Each process can provide one element, or a sequence of elements, in which case the combine operation is executed element-wise on each entry of the sequence.
MPI does not compute the minimum of all elements in the array; you have to do that manually.
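For example, a minimal sketch of getting the global minimum over all six entries (my own illustration, assuming every rank holds num[6] as in the question) would first reduce the local array manually and then reduce the single value across ranks:
int local_min = num[0];
for (int i = 1; i < 6; i++)        /* manual reduction over the local array */
    if (num[i] < local_min)
        local_min = num[i];

int global_min = 0;
MPI_Reduce(&local_min, &global_min, 1, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
if (my_rank == 0)
    printf("global minimum = %d\n", global_min);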
You can use MPI_MIN to obtain the min value among those passed via reduction.
Let's examine the function declaration:
int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
               MPI_Op op, int root, MPI_Comm comm)
Each process sends its value (or array of values) through the buffer sendbuf.
The process identified by the root id receives the buffers, combines them element-wise, and stores the result in the buffer recvbuf. The number of elements contributed by each process is given by count, so recvbuf must be allocated to hold sizeof(datatype)*count bytes.
If each process has only one integer to send (count = 1), then recvbuf is also a single integer; if each process has two integers, then recvbuf is an array of two integers. See this nice post for further explanations and nice pictures.
Now it should be clear that your code is wrong: sendbuf and recvbuf must be of the same size, and there is no need for the condition if(myrank==0). Simply, recvbuf is only significant on the root process, and sendbuf on the others.
In your example you could assign one or more elements of the array to each process and then compute the minimum value (if there are as many processes as values in the array) or an array of minimum values (if there are more values than processes).
Here is a working example that illustrates the usage of MPI_MIN, MPI_MAX and MPI_SUM (slightly modified from this), for simple values (not arrays).
Each process does some work, depending on its rank, and sends the time spent doing the work to the root process. The root process collects the times and outputs their min, max and average values.
#include <stdio.h>
#include <mpi.h>
int myrank, numprocs;
/* just a function to waste some time */
float work()
{
float x, y = 0.0f;  /* initialise the accumulator to avoid returning an indeterminate value */
if (myrank%2) {
for (int i = 0; i < 100000000; ++i) {
x = i/0.001;
y += x;
}
} else {
for (int i = 0; i < 100000; ++i) {
x = i/0.001;
y += x;
}
}
return y;
}
int main(int argc, char **argv)
{
int node;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &node);
printf("Hello World from Node %d\n",node);
/*variables used for gathering timing statistics*/
double mytime,
maxtime,
mintime,
avgtime;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Barrier(MPI_COMM_WORLD); /*synchronize all processes*/
mytime = MPI_Wtime(); /*get time just before work section */
work();
mytime = MPI_Wtime() - mytime; /*get time just after work section*/
/*compute max, min, and average timing statistics*/
MPI_Reduce(&mytime, &maxtime, 1, MPI_DOUBLE,MPI_MAX, 0, MPI_COMM_WORLD);
MPI_Reduce(&mytime, &mintime, 1, MPI_DOUBLE, MPI_MIN, 0,MPI_COMM_WORLD);
MPI_Reduce(&mytime, &avgtime, 1, MPI_DOUBLE, MPI_SUM, 0,MPI_COMM_WORLD);
/* plot the output */
if (myrank == 0) {
avgtime /= numprocs;
printf("Min: %lf Max: %lf Avg: %lf\n", mintime, maxtime,avgtime);
}
MPI_Finalize();
return 0;
}
If I run this on my OSX laptop, this is what I get:
urcaurca$ mpirun -n 4 ./a.out
Hello World from Node 3
Hello World from Node 0
Hello World from Node 2
Hello World from Node 1
Min: 0.000974 Max: 0.985291 Avg: 0.493081
I'm trying to send a number to p-1 processes: process 0 sends this value to all other processes. I use an MPI_Send command to do this. When I explicitly write out MPI_Send calls for 3 processes, it works fine, but when I put them in a loop, I get the expected output followed by a segmentation fault. Here is my code:
#include <stdlib.h>
#include <mpi.h>
#include "a1.h"
//AUTHORS
//LAKSHAN SIVANANTHAN - 1150161
//RAZMIG PAPISSIAN - 1152517
int main(int argc, char** argv)
{
RGB *image;
int width, height, max;
int windowLength = atoi(argv[3]);
int my_rank, p, local_height, source, i;
int dest;
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
int *processorRows;
processorRows = (int*)malloc(sizeof(int)*(p+1));
if (my_rank == 0) {
printf("Process %d is reading...\n", my_rank);
image = readPPM(argv[1], &width, &height, &max);
//calculate rows to each process
for (i=0; i<p; i++) {
processorRows[i] = height/p;
}
for (i=0; i< height%p; i++){
processorRows[i]++;
}
for (dest=1; dest<p; dest++) {
MPI_Send(processorRows + dest, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
//MPI_Send(processorRows + 2, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
//MPI_Send(processorRows + 3, 1, MPI_INT, 3, 0, MPI_COMM_WORLD);
}
}
else {
MPI_Recv(processorRows, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
printf("I am Process %d and will run %d rows...\n", my_rank, *processorRows);
}
//processImage(width, height, image, windowLength);
//writePPM(argv[2], width, height, max, image);
free(image);
free(processorRows);
MPI_Finalize();
return(0);
}
If I were to remove the for loop, replace "dest" with 1, and uncomment the other 2 MPI_SEND lines, it works completely fine when running mpirun -np 4 ./program
Not sure what's going on here...
I'm not exactly sure what you are trying to accomplish. But, from the statement
Process 0 sends this value to all other processes.
and from that part of the code, I would expect you to do a scatter from process 0 to all other PEs rather than these send-receive loop tricks.
Remove all the send-receive pairs and the loops, and just use a single scatter operation. Here is the link for the MPI_Scatter operation: https://www.open-mpi.org/doc/v1.8/man3/MPI_Scatter.3.php. If you are unsure about the scatter operation, have a look at this neat explanation: http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/
It looks like the size of the processorRows array matches the total number of processes used, and you are trying to send one element of this processorRows array to each of the other ranks. Hence, your code should look something like the one below:
int *processorRows;
processorRows = (int*)malloc(sizeof(int)*(p+1));
if (my_rank == 0) {
printf("Process %d is reading...\n", my_rank);
image = readPPM(argv[1], &width, &height, &max);
for (i=0; i<p; i++) {
processorRows[i] = height/p;
}
for (i=0; i< height%p; i++){
processorRows[i]++;
}
}
MPI_Scatter(processorRows, 1, MPI_INT, processorRows, 1, MPI_INT, 0, MPI_COMM_WORLD);
I removed the
#include "a.h"
and
image = readPPM(argv[1], &width, &height, &max);
since I do not have these files, set the height manually to 10, and the code worked. Maybe the problem is with the height variable?
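One caveat (my addition, not part of the original answer): the MPI standard does not allow the root to pass the same buffer as both the send and receive buffer of MPI_Scatter. A hedged variant of the call above would use MPI_IN_PLACE on rank 0:
if (my_rank == 0) {
    /* root's own count is already sitting in processorRows[0], so no self-copy is needed */
    MPI_Scatter(processorRows, 1, MPI_INT, MPI_IN_PLACE, 1, MPI_INT, 0, MPI_COMM_WORLD);
} else {
    /* send arguments are ignored on non-root ranks */
    MPI_Scatter(NULL, 0, MPI_INT, processorRows, 1, MPI_INT, 0, MPI_COMM_WORLD);
}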
I'm a beginner in MPI programming. I'm trying to write a program that dynamically takes in one-dimensional arrays of different sizes (multiples of 100, 1000, 10000, 1000000 and so on) and scatters them to the allotted processor cores. The processor cores calculate the sum of the received elements and send the sum back. The root process prints the sum of the elements in the input array.
I used MPI_Scatter() and MPI_Reduce() to solve the problem. However, when the number of processor cores allotted is odd, some of the data gets left out. For example, with an input data size of 100 and 3 processes, only 99 elements are added and the last one is left out.
I searched for alternatives and found that MPI_Scatterv() can be used for uneven distribution of data, but there is no material available to guide me through its implementation. Can someone help me? I'm posting my code here. Thanks in advance.
#include <stdio.h>
#include <mpi.h>
#include <stdlib.h>
void readArray(char * fileName, double ** a, int * n);
int Numprocs, MyRank;
int mpi_err;
#define Root 0
void init_it(int *argc, char ***argv) {
mpi_err = MPI_Init(argc, argv);
mpi_err = MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);
mpi_err = MPI_Comm_size(MPI_COMM_WORLD, &Numprocs);
}
int main(int argc, char** argv) {
/* .......Variables Initialisation ......*/
int index;
double *InputBuffer, *RecvBuffer, sum=0.0, psum = 0.0;
double ptime = 0.0, Totaltime= 0.0,startwtime = 0.0, endwtime = 0.0;
int Scatter_DataSize;
int DataSize;
FILE *fp;
init_it(&argc,&argv);
if (argc != 2) {
fprintf(stderr, "\n*** Usage: arraySum <inputFile>\n\n");
exit(1);
}
if (MyRank == 0) {
startwtime = MPI_Wtime();
printf("Number of nodes running %d\n",Numprocs);
/*...... Read input....*/
readArray(argv[1], &InputBuffer, &DataSize);
printf("Size of array %d\n", DataSize);
}
if (MyRank!=0) {
MPI_Recv(&DataSize, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
else {
int i;
for (i=1;i<Numprocs;i++) {
MPI_Send(&DataSize, 1, MPI_INT, i, 1, MPI_COMM_WORLD);
}
}
Scatter_DataSize = DataSize / Numprocs;
RecvBuffer = (double *)malloc(Scatter_DataSize * sizeof(double));
MPI_Barrier(MPI_COMM_WORLD);
mpi_err = MPI_Scatter(InputBuffer, Scatter_DataSize, MPI_DOUBLE,
RecvBuffer, Scatter_DataSize, MPI_DOUBLE,
0, MPI_COMM_WORLD);
for (index = 0; index < Scatter_DataSize; index++) {
psum = psum + RecvBuffer[index];
}
//printf("Processor %d computed sum %f\n", MyRank, psum);
mpi_err = MPI_Reduce(&psum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (MyRank == 0) {
endwtime = MPI_Wtime();
Totaltime = endwtime - startwtime;
printf("Total sum %f\n",sum);
printf("Total time %f\n", Totaltime);
}
MPI_Finalize();
return 0;
}
void readArray(char * fileName, double ** a, int * n) {
int count, DataSize;
double * InputBuffer;
FILE * fin;
fin = fopen(fileName, "r");
if (fin == NULL) {
fprintf(stderr, "\n*** Unable to open input file '%s'\n\n",
fileName);
exit(1);
}
fscanf(fin, "%d\n", &DataSize);
InputBuffer = (double *)malloc(DataSize * sizeof(double));
if (InputBuffer == NULL) {
fprintf(stderr, "\n*** Unable to allocate %d-length array", DataSize);
exit(1);
}
for (count = 0; count < DataSize; count++) {
fscanf(fin, "%lf", &InputBuffer[count]);
}
fclose(fin);
*n = DataSize;
*a = InputBuffer;
}
In your case, you can just play with the sendcounts[] array of MPI_Scatterv. Indeed, a trivial implementation is to compute the number of elements (say Nelement) of type sendtype that all processes but one will receive. One of the processes (for instance the last one) then gets the remaining data. In that case, sendcounts[i] = Nelement for indexes i from 0 to p-2 (p being the number of processes in the communicator, for you MPI_COMM_WORLD), and process p-1 gets sendcounts[p-1] = DataSize - Nelement*(p-1). Concerning the array of displacements displs[], you just have to specify the displacement (in number of elements) from which to take the outgoing data sent to process i (cf. [1], page 161). For the previous example this would be:
for (i=0; i<p; ++i)
displs[i]=Nelement*i;
If you decide that another process q must handle the remaining data instead, remember to set the proper displacement displs[q+1] for process q+1, with 0 ≤ q < q+1 ≤ p.
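A rough sketch of that recipe, reusing the variable names from the question (my own illustration; InputBuffer is only significant on the root):
int *sendcounts = (int *)malloc(Numprocs * sizeof(int));
int *displs     = (int *)malloc(Numprocs * sizeof(int));
int Nelement    = DataSize / Numprocs;            /* size of the even chunks */

for (int i = 0; i < Numprocs; i++) {
    sendcounts[i] = Nelement;
    displs[i]     = Nelement * i;
}
sendcounts[Numprocs-1] = DataSize - Nelement * (Numprocs - 1);  /* last rank takes the remainder */

Scatter_DataSize = sendcounts[MyRank];
RecvBuffer = (double *)malloc(Scatter_DataSize * sizeof(double));
MPI_Scatterv(InputBuffer, sendcounts, displs, MPI_DOUBLE,
             RecvBuffer, Scatter_DataSize, MPI_DOUBLE,
             0, MPI_COMM_WORLD);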
[1] MPI: A Message-Passing Interface Standard (Version 3.1): http://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
The computation of Scatter_DataSize:
Scatter_DataSize = DataSize / Numprocs;
is correct only if DataSize is a multiple of Numprocs; with your input sizes that is not guaranteed for every process count. When Numprocs does not divide DataSize evenly, you should explicitly compute the remainder and assign it to one MPI process; I suggest the last one.
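If you prefer to stay with plain MPI_Scatter, one simple alternative (my suggestion, not spelled out in this answer) is to scatter only the evenly divisible part and let the root fold the leftover tail into its own partial sum:
Scatter_DataSize = DataSize / Numprocs;
int remainder    = DataSize % Numprocs;

RecvBuffer = (double *)malloc(Scatter_DataSize * sizeof(double));
MPI_Scatter(InputBuffer, Scatter_DataSize, MPI_DOUBLE,
            RecvBuffer, Scatter_DataSize, MPI_DOUBLE, 0, MPI_COMM_WORLD);

for (index = 0; index < Scatter_DataSize; index++)
    psum += RecvBuffer[index];

if (MyRank == 0)   /* the unscattered tail elements live only on the root */
    for (index = DataSize - remainder; index < DataSize; index++)
        psum += InputBuffer[index];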
A paper by Donzis & Aditya suggests that it is possible to use a finite difference scheme that has a delay in the stencil. What does this mean? An FD scheme can be used to solve the heat equation and reads (in some simplified form)
u[t+1,i] = u[t,i] + c (u[t,i-1]-u[t,i+1])
meaning that the value at the next time step depends on the values at the same position and its neighbours at the previous time step.
This problem is easily parallelized by splitting the (in our case 1D) domain across the different processors. However, communication is needed when computing the boundary nodes on a processor, since the element u[t,i±1] is only available on another processor.
The problem is illustrated in the following graphic, which is taken from the cited paper.
An MPI implementation might use MPI_Send and MPI_Recv for synchronous computation.
Since the computation itself is fairly easy, it is the communication which might become a possible bottleneck.
A solution to the problem is given in the paper:
Instead of synchronizing, just take whatever boundary node value is available, even though it might come from an earlier time step. The method then still converges (under some assumptions).
For my work, I would like to implement the asynchronous MPI case (which is not part of the paper). The synchronous part using MPI_Send and MPI_Recv is working correctly. I extended the memory by two elements as ghost cells for the neighbouring elements and send the needed values via send and receive. The code below is basically the implementation of the figure above and is performed during each time step prior to the computation.
MPI_Send(&u[NpP],1,MPI_DOUBLE,RIGHT,rank,MPI_COMM_WORLD);
MPI_Recv(&u[0],1,MPI_DOUBLE,LEFT,LEFT,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
MPI_Send(&u[1],1,MPI_DOUBLE,LEFT,rank,MPI_COMM_WORLD);
MPI_Recv(&u[NpP+1],1,MPI_DOUBLE,RIGHT,RIGHT,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
Now, I'm by no means an MPI expert. I figured out that MPI_Put might be what I need for the asynchronous case, and after reading a little I came up with the following implementation.
Before the time loop:
MPI_Win win;
double *boundary;
MPI_Alloc_mem(sizeof(double) * 2, MPI_INFO_NULL, &boundary);
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info,"no_locks","true");
MPI_Win_create(boundary, 2*sizeof(double), sizeof(double), info, MPI_COMM_WORLD, &win);
Inside the time loop:
MPI_Put(&u[1],1,MPI_DOUBLE,LEFT,1,1,MPI_DOUBLE,win);
MPI_Put(&u[NpP],1,MPI_DOUBLE,RIGHT,0,1,MPI_DOUBLE,win);
MPI_Win_fence(0,win);
u[0] = boundary[0];
u[NpP+1] = boundary[1];
which puts the needed elements into the window (the two-element array boundary) on the neighbouring processors and takes the values u[0] and u[NpP+1] from its own boundary array.
This implementation is working and I get the same result as with MPI_Send/Recv. However, this isn't really asynchronous, since I'm still using MPI_Win_fence, which, as far as I understood, ensures synchronization.
The problem is: if I take out the MPI_Win_fence, the values inside boundary are never updated and stay at their initial values. My understanding was that without MPI_Win_fence you would take whatever value is available inside boundary, which might (or might not) have been updated by a neighbouring processor.
Does anybody have an idea to avoid the use of MPI_Win_fence while also solving the problem, that the values inside boundary are never updated?
I'm also not sure, if the code I provided is enough to understand my problem or to give any hints. If that is the case, feel free to ask, as I will try to add all the parts that are missing.
The following seems to work for me, in the sense of correct execution - a small 1d heat equation taken from one of our tutorials, using the following for the RMA stuff:
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, left, 0, rightwin );
MPI_Put(&(temperature[current][1]), 1, MPI_FLOAT, left, 0, 1, MPI_FLOAT, rightwin);
MPI_Win_unlock( left, rightwin );
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, right, 0, leftwin );
MPI_Put(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, 0, 1, MPI_FLOAT, leftwin);
MPI_Win_unlock( right, leftwin );
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, leftwin );
temperature[current][0] = *leftgc;
MPI_Win_unlock( rank, leftwin );
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, rightwin );
temperature[current][locpoints+1] = *rightgc;
MPI_Win_unlock( rank, rightwin );
In the code I have even ranks wait an extra 10ms each time step to try to make sure that things get out of sync; but looking at traces it actually looks like things remain pretty synced up. I don't know if that high degree of synchrony can be fixed by tweaking the code, or is a restriction of the implementation (IntelMPI 5.0.1), or just happens because the amount of time passing in computation is too little and communication time is dominating (but as to the last, cranking up the usleep interval doesn't seem to have an effect).
#define _BSD_SOURCE /* usleep */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>
int main(int argc, char **argv) {
/* simulation parameters */
const int totpoints=1000;
int locpoints;
const float xleft = -12., xright = +12.;
float locxleft, locxright;
const float kappa = 1.;
const int nsteps=100;
/* data structures */
float *x;
float **temperature;
/* parameters of the original temperature distribution */
const float ao=1., sigmao=1.;
float fixedlefttemp, fixedrighttemp;
int current, new;
int step, i;
float time;
float dt, dx;
float rms;
int rank, size;
int start,end;
int left, right;
int lefttag=1, righttag=2;
/* MPI Initialization */
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
locpoints = totpoints/size;
start = rank*locpoints;
end = (rank+1)*locpoints - 1;
if (rank == size-1)
end = totpoints-1;
locpoints = end-start+1;
left = rank-1;
if (left < 0) left = MPI_PROC_NULL;
right= rank+1;
if (right >= size) right = MPI_PROC_NULL;
#ifdef ONESIDED
if (rank == 0)
printf("Onesided: Allocating windows\n");
MPI_Win leftwin, rightwin;
float *leftgc, *rightgc;
MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &leftgc, &leftwin);
MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &rightgc, &rightwin);
#endif
/* set parameters */
dx = (xright-xleft)/(totpoints-1);
dt = dx*dx * kappa/10.;
locxleft = xleft + start*dx;
locxright = xleft + end*dx;
x = (float *)malloc((locpoints+2)*sizeof(float));
temperature = (float **)malloc(2 * sizeof(float *));
temperature[0] = (float *)malloc((locpoints+2)*sizeof(float));
temperature[1] = (float *)malloc((locpoints+2)*sizeof(float));
current = 0;
new = 1;
/* setup initial conditions */
time = 0.;
for (i=0; i<locpoints+2; i++) {
x[i] = locxleft + (i-1)*dx;
temperature[current][i] = ao*exp(-(x[i]*x[i]) / (2.*sigmao*sigmao));
}
fixedlefttemp = ao*exp(-(locxleft-dx)*(locxleft-dx) / (2.*sigmao*sigmao));
fixedrighttemp= ao*exp(-(locxright+dx)*(locxright+dx)/(2.*sigmao*sigmao));
#ifdef ONESIDED
*leftgc = fixedlefttemp;
*rightgc = fixedrighttemp;
#endif
/* evolve */
for (step=0; step < nsteps; step++) {
/* boundary conditions: keep endpoint temperatures fixed. */
#ifdef ONESIDED
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, left, 0, rightwin );
MPI_Put(&(temperature[current][1]), 1, MPI_FLOAT, left, 0, 1, MPI_FLOAT, rightwin);
MPI_Win_unlock( left, rightwin );
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, right, 0, leftwin );
MPI_Put(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, 0, 1, MPI_FLOAT, leftwin);
MPI_Win_unlock( right, leftwin );
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, leftwin );
temperature[current][0] = *leftgc;
MPI_Win_unlock( rank, leftwin );
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, rightwin );
temperature[current][locpoints+1] = *rightgc;
MPI_Win_unlock( rank, rightwin );
#else
temperature[current][0] = fixedlefttemp;
temperature[current][locpoints+1] = fixedrighttemp;
/* send data rightwards */
MPI_Sendrecv(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, righttag,
&(temperature[current][0]), 1, MPI_FLOAT, left, righttag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* send data leftwards */
MPI_Sendrecv(&(temperature[current][1]), 1, MPI_FLOAT, left, lefttag,
&(temperature[current][locpoints+1]), 1, MPI_FLOAT, right, lefttag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
#endif
for (i=1; i<locpoints+1; i++) {
temperature[new][i] = temperature[current][i] + dt*kappa/(dx*dx) *
(temperature[current][i+1] - 2.*temperature[current][i] +
temperature[current][i-1]) ;
}
time += dt;
if ((rank % 2) == 0)
usleep(10000u);
current = new;
new = 1 - current;
}
rms = 0.;
for (i=1;i<locpoints+1;i++) {
rms += (temperature[current][i])*(temperature[current][i]);
}
float totrms;
MPI_Reduce(&rms, &totrms, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) {
totrms = sqrt(totrms/totpoints);
printf("Step = %d, Time = %g, RMS value = %g\n", step, time, totrms);
}
#ifdef ONESIDED
MPI_Win_free(&leftwin);
MPI_Win_free(&rightwin);
#endif
free(temperature[1]);
free(temperature[0]);
free(temperature);
free(x);
MPI_Finalize();
return 0;
}
This is a clone of Jonathan Dursi's post, but with changes for MPI-3 RMA synchronization...
#define _BSD_SOURCE /* usleep */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>
int main(int argc, char **argv) {
/* simulation parameters */
const int totpoints=1000;
int locpoints;
const float xleft = -12., xright = +12.;
float locxleft, locxright;
const float kappa = 1.;
const int nsteps=100;
/* data structures */
float *x;
float **temperature;
/* parameters of the original temperature distribution */
const float ao=1., sigmao=1.;
float fixedlefttemp, fixedrighttemp;
int current, new;
int step, i;
float time;
float dt, dx;
float rms;
int rank, size;
int start,end;
int left, right;
int lefttag=1, righttag=2;
/* MPI Initialization */
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
locpoints = totpoints/size;
start = rank*locpoints;
end = (rank+1)*locpoints - 1;
if (rank == size-1)
end = totpoints-1;
locpoints = end-start+1;
left = rank-1;
if (left < 0) left = MPI_PROC_NULL;
right= rank+1;
if (right >= size) right = MPI_PROC_NULL;
#ifdef ONESIDED
if (rank == 0)
printf("Onesided: Allocating windows\n");
MPI_Win leftwin, rightwin;
float *leftgc, *rightgc;
MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &leftgc, &leftwin);
MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &rightgc, &rightwin);
MPI_Win_lock_all(MPI_MODE_NOCHECK, leftwin);
MPI_Win_lock_all(MPI_MODE_NOCHECK, rightwin);
#endif
/* set parameters */
dx = (xright-xleft)/(totpoints-1);
dt = dx*dx * kappa/10.;
locxleft = xleft + start*dx;
locxright = xleft + end*dx;
x = (float *)malloc((locpoints+2)*sizeof(float));
temperature = (float **)malloc(2 * sizeof(float *));
temperature[0] = (float *)malloc((locpoints+2)*sizeof(float));
temperature[1] = (float *)malloc((locpoints+2)*sizeof(float));
current = 0;
new = 1;
/* setup initial conditions */
time = 0.;
for (i=0; i<locpoints+2; i++) {
x[i] = locxleft + (i-1)*dx;
temperature[current][i] = ao*exp(-(x[i]*x[i]) / (2.*sigmao*sigmao));
}
fixedlefttemp = ao*exp(-(locxleft-dx)*(locxleft-dx) / (2.*sigmao*sigmao));
fixedrighttemp= ao*exp(-(locxright+dx)*(locxright+dx)/(2.*sigmao*sigmao));
#ifdef ONESIDED
*leftgc = fixedlefttemp;
*rightgc = fixedrighttemp;
#endif
/* evolve */
for (step=0; step < nsteps; step++) {
/* boundary conditions: keep endpoint temperatures fixed. */
/* RMA code assumes no conflicts in updates via MPI_Put.
If that is wrong, hopefully it is fine to use MPI_Accumulate
with MPI_SUM to accumulate the result. */
#ifdef ONESIDED
MPI_Put(&(temperature[current][1]), 1, MPI_FLOAT, left, 0, 1, MPI_FLOAT, rightwin);
MPI_Win_flush( left, rightwin );
MPI_Put(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, 0, 1, MPI_FLOAT, leftwin);
MPI_Win_flush( right, leftwin );
temperature[current][0] = *leftgc;
MPI_Win_flush( rank, leftwin );
temperature[current][locpoints+1] = *rightgc;
MPI_Win_flush( rank, rightwin );
#else
#error Define ONESIDED...
#endif
for (i=1; i<locpoints+1; i++) {
temperature[new][i] = temperature[current][i] + dt*kappa/(dx*dx) *
(temperature[current][i+1] - 2.*temperature[current][i] +
temperature[current][i-1]) ;
}
time += dt;
if ((rank % 2) == 0)
usleep(10000u);
current = new;
new = 1 - current;
}
rms = 0.;
for (i=1;i<locpoints+1;i++) {
rms += (temperature[current][i])*(temperature[current][i]);
}
float totrms;
MPI_Reduce(&rms, &totrms, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) {
totrms = sqrt(totrms/totpoints);
printf("Step = %d, Time = %g, RMS value = %g\n", step, time, totrms);
}
#ifdef ONESIDED
MPI_Win_unlock_all(leftwin);
MPI_Win_unlock_all(rightwin);
MPI_Win_free(&leftwin);
MPI_Win_free(&rightwin);
#endif
free(temperature[1]);
free(temperature[0]);
free(temperature);
free(x);
MPI_Finalize();
return 0;
}
I'm trying to do some parallel calculations and then reduce the results into one vector.
I do this by dividing a for loop into parts that should be calculated separately from the vector. Later I'd like to join all those subvectors into one main vector by replacing parts of it with the values obtained from the processes. Needless to say, I have no idea how to do it and my attempts have been in vain.
Any help will be appreciated.
MPI_Barrier(MPI_COMM_WORLD);
MPI_Bcast(A, n*n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(b, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(x0, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
printf("My id: %d, mySize: %d, myStart: %d, myEnd: %d", rank, size, mystart, myend);
while(delta > granica)
{
ii++;
delta = 0;
//if(rank > 0)
//{
for(i = mystart; i < myend; i++)
{
xNowe[i] = b[i];
for(j = 0; j < n; j++)
{
if(i != j)
{
xNowe[i] -= A[i][j] * x0[j];
}
}
xNowe[i] = xNowe[i] / A[i][i];
printf("Result in iteration %d: %d", i, xNowe[i]);
}
MPI_Reduce(xNowe, xNowe,n,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
I'm going to ignore your calculations and assume they're all doing whatever it is you want them to do; at the end, you have an array called xNowe that holds the results for your rank somewhere within it (in some subarray).
You have two options.
The first way uses an MPI_REDUCE in the way you're currently doing it.
What needs to happen is that you should probably set all of the values that do not pertain to your rank to 0, then you can just do a big MPI_REDUCE (as you're already doing), where each process contributes its xNowe array which will look something like this (depending on the input/rank/etc.):
rank: 0 1 2 3 4 5 6 7
value: 0 0 1 2 0 0 0 0
When you do the reduction (with MPI_SUM as the op), you'll get an array (on rank 0) that has each value filled in with the value contributed by each rank.
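A minimal sketch of that first option (my own illustration, using the question's variables rank, n, mystart, myend; note that the question's call passes xNowe as both the send and receive buffer, which the standard forbids, so rank 0 uses MPI_IN_PLACE here):
/* zero the entries this rank did not compute */
for (i = 0; i < n; i++)
    if (i < mystart || i >= myend)
        xNowe[i] = 0.0;

/* element-wise sum across ranks; the full result ends up in xNowe on rank 0 */
if (rank == 0)
    MPI_Reduce(MPI_IN_PLACE, xNowe, n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Reduce(xNowe, NULL, n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);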
The second way uses an MPI_GATHER. Some might consider this to be the "more proper" way.
For this version, instead of using MPI_REDUCE to get the result, you only send the data that was calculated on your rank. You wouldn't have one large array. So your code would look something like this:
MPI_Barrier(MPI_COMM_WORLD);
MPI_Bcast(A, n*n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(b, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(x0, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
printf("My id: %d, mySize: %d, myStart: %d, myEnd: %d", rank, size, mystart, myend);
while(delta > granica)
{
ii++;
delta = 0;
for(i = mystart; i < myend; i++)
{
xNowe[i-mystart] = b[i];
for(j = 0; j < n; j++)
{
if(i != j)
{
xNowe[i-mystart] -= A[i][j] * x0[j];
}
}
xNowe[i-mystart] = xNowe[i-mystart] / A[i][i];
printf("Result in iteration %d: %d", i, xNowe[i-mystart]);
}
}
MPI_Gather(xNowe, myend-mystart, MPI_DOUBLE, result, myend-mystart, MPI_DOUBLE, 0, MPI_COMM_WORLD);
You would obviously need to create a new array on rank 0 that is called result to hold the resulting values.
UPDATE:
As pointed out by Hristo in the comments below, MPI_GATHER might not work here if myend - mystart is not the same on all ranks. If that's the case, you'd need to use MPI_GATHERV which allows you to specify a different size for each rank.
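A hedged sketch of that MPI_GATHERV variant (my own illustration): first gather every rank's chunk size on rank 0, build the displacements, then gather the variable-sized chunks into result, which must hold n doubles on rank 0:
int mycount = myend - mystart;
int *counts = NULL, *displs = NULL;

if (rank == 0) {
    counts = (int *)malloc(size * sizeof(int));
    displs = (int *)malloc(size * sizeof(int));
}
/* collect every rank's chunk size on rank 0 */
MPI_Gather(&mycount, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);

if (rank == 0) {
    displs[0] = 0;
    for (int p = 1; p < size; p++)
        displs[p] = displs[p-1] + counts[p-1];
}

/* gather the variable-sized chunks in rank order */
MPI_Gatherv(xNowe, mycount, MPI_DOUBLE,
            result, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);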