A paper by Donzis & Aditya suggests that it is possible to use a finite difference scheme with a delay in the stencil. What does this mean? An FD scheme used to solve the heat equation (or some simplification of it) might read
u[t+1,i] = u[t,i] + c (u[t,i-1]-u[t,i+1])
meaning that the value at the next time step depends on the values at the same position and at its neighbours from the previous time step.
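As a point of reference, one explicit time step of that simplified stencil might look like the following serial sketch (the names u, unew, c and N are illustrative, not taken from the paper):
/* one explicit time step: each new value depends only on the old
   values at positions i-1, i and i+1 */
for (int i = 1; i < N - 1; i++)
    unew[i] = u[i] + c * (u[i-1] - u[i+1]);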
This problem can easily be parallelized by splitting the (in our case 1D) domain across the different processors. However, we need communication when computing the boundary nodes of a processor, since the element u[t,i±1] is only available on another processor.
The problem is illustrated in the following graphic, which is taken from the cited paper.
An MPI implementation might use MPI_Send and MPI_Recv for the synchronous computation.
Since the computation itself is fairly easy, it is the communication that might become a bottleneck.
A solution to the problem is given in the paper:
Instead of a synchronous exchange, just take the boundary node that is available, even though it might be the value of an earlier time step. The method then still converges (under some assumptions).
For my work, I would like to implement the asynchronous MPI case (which is not part of the paper). The synchronous part using MPI_Send and MPI_Recv is working correctly. I extended the memory by two elements as ghost cells for the neighbouring elements and send the needed values via send and receive. The code below is basically the implementation of the figure above and is performed during each time step prior to the computation.
MPI_Send(&u[NpP],1,MPI_DOUBLE,RIGHT,rank,MPI_COMM_WORLD);                       /* last interior point -> right neighbour      */
MPI_Recv(&u[0],1,MPI_DOUBLE,LEFT,LEFT,MPI_COMM_WORLD,MPI_STATUS_IGNORE);        /* left neighbour's value -> left ghost cell   */
MPI_Send(&u[1],1,MPI_DOUBLE,LEFT,rank,MPI_COMM_WORLD);                          /* first interior point -> left neighbour      */
MPI_Recv(&u[NpP+1],1,MPI_DOUBLE,RIGHT,RIGHT,MPI_COMM_WORLD,MPI_STATUS_IGNORE);  /* right neighbour's value -> right ghost cell */
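For comparison, the same ghost-cell exchange can also be written with combined MPI_Sendrecv calls, which pair each outgoing message with the matching incoming one (a sketch reusing u, NpP, LEFT and RIGHT from the snippet above; the tags are illustrative):
/* exchange to the right: send my last interior point, receive the
   left neighbour's value into the left ghost cell */
MPI_Sendrecv(&u[NpP], 1, MPI_DOUBLE, RIGHT, 0,
             &u[0],   1, MPI_DOUBLE, LEFT,  0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* exchange to the left: send my first interior point, receive the
   right neighbour's value into the right ghost cell */
MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, LEFT,  1,
             &u[NpP+1], 1, MPI_DOUBLE, RIGHT, 1,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);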
Now, I'm by no means an MPI expert. I figured out that MPI_Put might be what I need for the asynchronous case and, after reading a little, I came up with the following implementation.
Before the time loop:
MPI_Win win;
double *boundary;
MPI_Alloc_mem(sizeof(double) * 2, MPI_INFO_NULL, &boundary);
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info,"no_locks","true");
MPI_Win_create(boundary, 2*sizeof(double), sizeof(double), info, MPI_COMM_WORLD, &win);
Inside the time loop:
MPI_Put(&u[1],1,MPI_DOUBLE,LEFT,1,1,MPI_DOUBLE,win);     /* my first interior point -> boundary[1] on the left neighbour  */
MPI_Put(&u[NpP],1,MPI_DOUBLE,RIGHT,0,1,MPI_DOUBLE,win);  /* my last interior point  -> boundary[0] on the right neighbour */
MPI_Win_fence(0,win);                                    /* complete the epoch: all puts are visible afterwards           */
u[0] = boundary[0];                                      /* left ghost cell (written by the left neighbour)               */
u[NpP+1] = boundary[1];                                  /* right ghost cell (written by the right neighbour)             */
This puts the needed elements into the window, namely boundary (an array with two elements), on the neighbouring processors, and then reads u[0] and u[NpP+1] from the local boundary array.
This implementation works and I get the same result as with MPI_Send/Recv. However, it isn't really asynchronous, since I'm still using MPI_Win_fence, which, as far as I understand, ensures synchronization.
The problem is: if I take out the MPI_Win_fence, the values inside boundary are never updated and stay at their initial values. My understanding was that without MPI_Win_fence you would take whatever value is currently available inside boundary, which might (or might not) have been updated by a neighbouring processor.
Does anybody have an idea how to avoid the use of MPI_Win_fence while also solving the problem that the values inside boundary are never updated?
I'm also not sure if the code I provided is enough to understand my problem or to give any hints. If that is the case, feel free to ask, as I will try to add all the parts that are missing.
The following seems to work for me, in the sense of correct execution: a small 1D heat equation taken from one of our tutorials, using the following for the RMA stuff:
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, left, 0, rightwin );
MPI_Put(&(temperature[current][1]), 1, MPI_FLOAT, left, 0, 1, MPI_FLOAT, rightwin);
MPI_Win_unlock( left, rightwin );
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, right, 0, leftwin );
MPI_Put(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, 0, 1, MPI_FLOAT, leftwin);
MPI_Win_unlock( right, leftwin );
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, leftwin );
temperature[current][0] = *leftgc;
MPI_Win_unlock( rank, leftwin );
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, rightwin );
temperature[current][locpoints+1] = *rightgc;
MPI_Win_unlock( rank, rightwin );
In the code I have even ranks wait an extra 10ms each time step to try to make sure that things get out of sync; but looking at traces it actually looks like things remain pretty synced up. I don't know if that high degree of synchrony can be fixed by tweaking the code, or is a restriction of the implementation (IntelMPI 5.0.1), or just happens because the amount of time passing in computation is too little and communication time is dominating (but as to the last, cranking up the usleep interval doesn't seem to have an effect).
#define _BSD_SOURCE /* usleep */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>
int main(int argc, char **argv) {
/* simulation parameters */
const int totpoints=1000;
int locpoints;
const float xleft = -12., xright = +12.;
float locxleft, locxright;
const float kappa = 1.;
const int nsteps=100;
/* data structures */
float *x;
float **temperature;
/* parameters of the original temperature distribution */
const float ao=1., sigmao=1.;
float fixedlefttemp, fixedrighttemp;
int current, new;
int step, i;
float time;
float dt, dx;
float rms;
int rank, size;
int start,end;
int left, right;
int lefttag=1, righttag=2;
/* MPI Initialization */
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
locpoints = totpoints/size;
start = rank*locpoints;
end = (rank+1)*locpoints - 1;
if (rank == size-1)
end = totpoints-1;
locpoints = end-start+1;
left = rank-1;
if (left < 0) left = MPI_PROC_NULL;
right= rank+1;
if (right >= size) right = MPI_PROC_NULL;
#ifdef ONESIDED
if (rank == 0)
printf("Onesided: Allocating windows\n");
MPI_Win leftwin, rightwin;
float *leftgc, *rightgc;
MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &leftgc, &leftwin);
MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &rightgc, &rightwin);
#endif
/* set parameters */
dx = (xright-xleft)/(totpoints-1);
dt = dx*dx * kappa/10.;
locxleft = xleft + start*dx;
locxright = xleft + end*dx;
x = (float *)malloc((locpoints+2)*sizeof(float));
temperature = (float **)malloc(2 * sizeof(float *));
temperature[0] = (float *)malloc((locpoints+2)*sizeof(float));
temperature[1] = (float *)malloc((locpoints+2)*sizeof(float));
current = 0;
new = 1;
/* setup initial conditions */
time = 0.;
for (i=0; i<locpoints+2; i++) {
x[i] = locxleft + (i-1)*dx;
temperature[current][i] = ao*exp(-(x[i]*x[i]) / (2.*sigmao*sigmao));
}
fixedlefttemp = ao*exp(-(locxleft-dx)*(locxleft-dx) / (2.*sigmao*sigmao));
fixedrighttemp= ao*exp(-(locxright+dx)*(locxright+dx)/(2.*sigmao*sigmao));
#ifdef ONESIDED
*leftgc = fixedlefttemp;
*rightgc = fixedrighttemp;
#endif
/* evolve */
for (step=0; step < nsteps; step++) {
/* boundary conditions: keep endpoint temperatures fixed. */
#ifdef ONESIDED
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, left, 0, rightwin );
MPI_Put(&(temperature[current][1]), 1, MPI_FLOAT, left, 0, 1, MPI_FLOAT, rightwin);
MPI_Win_unlock( left, rightwin );
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, right, 0, leftwin );
MPI_Put(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, 0, 1, MPI_FLOAT, leftwin);
MPI_Win_unlock( right, leftwin );
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, leftwin );
temperature[current][0] = *leftgc;
MPI_Win_unlock( rank, leftwin );
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, rightwin );
temperature[current][locpoints+1] = *rightgc;
MPI_Win_unlock( rank, rightwin );
#else
temperature[current][0] = fixedlefttemp;
temperature[current][locpoints+1] = fixedrighttemp;
/* send data rightwards */
MPI_Sendrecv(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, righttag,
&(temperature[current][0]), 1, MPI_FLOAT, left, righttag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* send data leftwards */
MPI_Sendrecv(&(temperature[current][1]), 1, MPI_FLOAT, left, lefttag,
&(temperature[current][locpoints+1]), 1, MPI_FLOAT, right, lefttag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
#endif
for (i=1; i<locpoints+1; i++) {
temperature[new][i] = temperature[current][i] + dt*kappa/(dx*dx) *
(temperature[current][i+1] - 2.*temperature[current][i] +
temperature[current][i-1]) ;
}
time += dt;
if ((rank % 2) == 0)
usleep(10000u);
current = new;
new = 1 - current;
}
rms = 0.;
for (i=1;i<locpoints+1;i++) {
rms += (temperature[current][i])*(temperature[current][i]);
}
float totrms;
MPI_Reduce(&rms, &totrms, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) {
totrms = sqrt(totrms/totpoints);
printf("Step = %d, Time = %g, RMS value = %g\n", step, time, totrms);
}
#ifdef ONESIDED
MPI_Win_free(&leftwin);
MPI_Win_free(&rightwin);
#endif
free(temperature[1]);
free(temperature[0]);
free(temperature);
free(x);
MPI_Finalize();
return 0;
}
This is a clone of Jonathan Dursi's post, but with changes for MPI-3 RMA synchronization...
#define _BSD_SOURCE /* usleep */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>
int main(int argc, char **argv) {
/* simulation parameters */
const int totpoints=1000;
int locpoints;
const float xleft = -12., xright = +12.;
float locxleft, locxright;
const float kappa = 1.;
const int nsteps=100;
/* data structures */
float *x;
float **temperature;
/* parameters of the original temperature distribution */
const float ao=1., sigmao=1.;
float fixedlefttemp, fixedrighttemp;
int current, new;
int step, i;
float time;
float dt, dx;
float rms;
int rank, size;
int start,end;
int left, right;
int lefttag=1, righttag=2;
/* MPI Initialization */
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
locpoints = totpoints/size;
start = rank*locpoints;
end = (rank+1)*locpoints - 1;
if (rank == size-1)
end = totpoints-1;
locpoints = end-start+1;
left = rank-1;
if (left < 0) left = MPI_PROC_NULL;
right= rank+1;
if (right >= size) right = MPI_PROC_NULL;
#ifdef ONESIDED
if (rank == 0)
printf("Onesided: Allocating windows\n");
MPI_Win leftwin, rightwin;
float *leftgc, *rightgc;
MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &leftgc, &leftwin);
MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &rightgc, &rightwin);
MPI_Win_lock_all(MPI_MODE_NOCHECK, leftwin);
MPI_Win_lock_all(MPI_MODE_NOCHECK, rightwin);
#endif
/* set parameters */
dx = (xright-xleft)/(totpoints-1);
dt = dx*dx * kappa/10.;
locxleft = xleft + start*dx;
locxright = xleft + end*dx;
x = (float *)malloc((locpoints+2)*sizeof(float));
temperature = (float **)malloc(2 * sizeof(float *));
temperature[0] = (float *)malloc((locpoints+2)*sizeof(float));
temperature[1] = (float *)malloc((locpoints+2)*sizeof(float));
current = 0;
new = 1;
/* setup initial conditions */
time = 0.;
for (i=0; i<locpoints+2; i++) {
x[i] = locxleft + (i-1)*dx;
temperature[current][i] = ao*exp(-(x[i]*x[i]) / (2.*sigmao*sigmao));
}
fixedlefttemp = ao*exp(-(locxleft-dx)*(locxleft-dx) / (2.*sigmao*sigmao));
fixedrighttemp= ao*exp(-(locxright+dx)*(locxright+dx)/(2.*sigmao*sigmao));
#ifdef ONESIDED
*leftgc = fixedlefttemp;
*rightgc = fixedrighttemp;
#endif
/* evolve */
for (step=0; step < nsteps; step++) {
/* boundary conditions: keep endpoint temperatures fixed. */
/* RMA code assumes no conflicts in updates via MPI_Put.
If that is wrong, hopefully it is fine to use MPI_Accumulate
with MPI_SUM to accumulate the result. */
#ifdef ONESIDED
MPI_Put(&(temperature[current][1]), 1, MPI_FLOAT, left, 0, 1, MPI_FLOAT, rightwin);
MPI_Win_flush( left, rightwin );
MPI_Put(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, 0, 1, MPI_FLOAT, leftwin);
MPI_Win_flush( right, leftwin );
temperature[current][0] = *leftgc;
MPI_Win_flush( rank, leftwin );
temperature[current][locpoints+1] = *rightgc;
MPI_Win_flush( rank, rightwin );
#else
#error Define ONESIDED...
#endif
for (i=1; i<locpoints+1; i++) {
temperature[new][i] = temperature[current][i] + dt*kappa/(dx*dx) *
(temperature[current][i+1] - 2.*temperature[current][i] +
temperature[current][i-1]) ;
}
time += dt;
if ((rank % 2) == 0)
usleep(10000u);
current = new;
new = 1 - current;
}
rms = 0.;
for (i=1;i<locpoints+1;i++) {
rms += (temperature[current][i])*(temperature[current][i]);
}
float totrms;
MPI_Reduce(&rms, &totrms, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) {
totrms = sqrt(totrms/totpoints);
printf("Step = %d, Time = %g, RMS value = %g\n", step, time, totrms);
}
#ifdef ONESIDED
MPI_Win_unlock_all(leftwin);
MPI_Win_unlock_all(rightwin);
MPI_Win_free(&leftwin);
MPI_Win_free(&rightwin);
#endif
free(temperature[1]);
free(temperature[0]);
free(temperature);
free(x);
MPI_Finalize();
return 0;
}
Related
I am learning MPI with C and I wrote a program based on the one presented in this link: http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml.
In this code a vector containing 1e8 values is summed. However, I am observing that the run time gets larger when using more processes. The code is given below:
/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml
Code which splits a vector and send information to other processes.
In case the main vector does not split equally among all processes, the leftover is passed to process id 1.
Process id 0 is the root process. Therefore it does not count while passing information.
Each process will calculate the partial sum of vector values and send it back to root process, which will calculate the total sum.
Since the processes are independent, the printing order will be different at each run.
compile as: mpicc -o vector_sum vector_send.c -lm
run as: time mpirun -n x vector_sum
x = number of splits desired + root process. For example: if x = 3, the vector will be split in two.
*/
#include<stdio.h>
#include<mpi.h>
#include<math.h>
#define vec_len 100000000
double vec1[vec_len];
double vec2[vec_len];
int main(int argc, char* argv[]){
// defining program variables
int i;
double sum, partial_sum;
// defining parallel step variables
int my_id, num_proc, ierr, an_id, root_process; // id of process and total number of processes
int num_2_send, num_2_recv, start_point, vec_size, rows_per_proc, leftover;
ierr = MPI_Init(&argc, &argv);
root_process = 0;
ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
if(my_id == root_process){
// Root process: Define vector size, how to split vector and send information to workers
vec_size = 1e8; // size of main vector
for(i = 0; i < vec_size; i++){
//vec1[i] = pow(-1.0,i+2)/(2.0*(i+1)-1.0); // defining main vector... Correct answer for total sum = 0.78539816339
vec1[i] = pow(i,2)+1.0; // defining main vector...
//printf("Main vector position %d: %f\n", i, vec1[i]); // uncomment if youwhish to print the main vector
}
rows_per_proc = vec_size / (num_proc - 1); // average values per process: using (num_proc - 1) because proc 0 does not count as a worker.
rows_per_proc = floor(rows_per_proc); // getting the maximum integer possible.
leftover = vec_size - (num_proc - 1)*rows_per_proc; // counting the leftover.
// spliting and sending the values
for(an_id = 1; an_id < num_proc; an_id++){
if(an_id == 1){ // worker id 1 will have more values if there is any leftover.
num_2_send = rows_per_proc + leftover; // counting the amount of data to be sent.
start_point = (an_id - 1)*num_2_send; // defining initial position in the main vector (data will be sent from here)
}
else{
num_2_send = rows_per_proc;
start_point = (an_id - 1)*num_2_send + leftover; // starting point for other processes if there is leftover.
}
ierr = MPI_Send(&num_2_send, 1, MPI_INT, an_id, 1234, MPI_COMM_WORLD); // sending the information of how many data is going to workers.
ierr = MPI_Send(&vec1[start_point], num_2_send, MPI_DOUBLE, an_id, 1234, MPI_COMM_WORLD); // sending pieces of the main vector.
}
sum = 0;
for(an_id = 1; an_id < num_proc; an_id++){
ierr = MPI_Recv(&partial_sum, 1, MPI_DOUBLE, an_id, 4321, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // recieving partial sum.
sum = sum + partial_sum;
}
printf("Total sum = %f.\n", sum);
}
else{
// Workers:define which operation will be carried out by each one
ierr = MPI_Recv(&num_2_recv, 1, MPI_INT, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // recieving the information of how many data worker must expect.
ierr = MPI_Recv(&vec2, num_2_recv, MPI_DOUBLE, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // recieving main vector pieces.
partial_sum = 0;
for(i=0; i < num_2_recv; i++){
//printf("Position %d from worker id %d: %d\n", i, my_id, vec2[i]); // uncomment if youwhish to print position, id and value of splitted vector
partial_sum = partial_sum + vec2[i];
}
printf("Partial sum of %d: %f\n",my_id, partial_sum);
ierr = MPI_Send(&partial_sum, 1, MPI_DOUBLE, root_process, 4321, MPI_COMM_WORLD); // sending partial sum to root process.
}
ierr = MPI_Finalize();
}
Obs.: Compile as
mpicc -o vector_sum vector_send.c -lm
and run as:
time mpirun -n x vector_sum
with x = 2 and 5. You will see that with x=5 it takes more time to run.
Did I do something wrong? I did not expect it to be slower, since the summation of each chunk is independent. Or is it a matter of how the program sends the information to each process? It seems to me that the loops sending the information to each process are responsible for this longer run time.
As suggested by Gilles Gouaillardet (https://stackoverflow.com/users/8062491/gilles-gouaillardet): I modified the code to generate the vector pieces in each process instead of passing them from the root process. It worked! Now the elapsed time is smaller for more processes. I am posting the new code below:
/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml
Code which calculate the sum of a vector using parallel computation.
In case the main vector does not split equally among all processes, the leftover is passed to process id 1.
Process id 0 is the root process. Therefore it does not count while passing information.
Each process will generate and calculate the partial sum of the vector values and send it back to the root process, which will calculate the total sum.
Since the processes are independent, the printing order will be different at each run.
compile as: mpicc -o vector_sum vector_send.c -lm
run as: time mpirun -n x vector_sum
x = number of splits desired + root process. For example: if x = 3, the vector will be split in two.
Acknowledgements: I would like to thanks Gilles Gouaillardet (https://stackoverflow.com/users/8062491/gilles-gouaillardet) for the helpful suggestion.
*/
#include<stdio.h>
#include<mpi.h>
#include<math.h>
#define vec_len 100000000
double vec2[vec_len];
int main(int argc, char* argv[]){
// defining program variables
int i;
double sum, partial_sum;
// defining parallel step variables
int my_id, num_proc, ierr, an_id, root_process; // id of process and total number of processes
int vec_size, rows_per_proc, leftover, num_2_gen, start_point;
ierr = MPI_Init(&argc, &argv);
root_process = 0;
ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
if(my_id == root_process){
vec_size = 1e8; // defining main vector size
rows_per_proc = vec_size / (num_proc - 1); // average values per process: using (num_proc - 1) because proc 0 does not count as a worker.
rows_per_proc = floor(rows_per_proc); // getting the maximum integer possible.
leftover = vec_size - (num_proc - 1)*rows_per_proc; // counting the leftover.
// defining the number of data and position corresponding to main vector
for(an_id = 1; an_id < num_proc; an_id++){
if(an_id == 1){ // worker id 1 will have more values if there is any leftover.
num_2_gen = rows_per_proc + leftover; // counting the amount of data to be generated.
start_point = (an_id - 1)*num_2_gen; // defining corresponding initial position in the main vector.
}
else{
num_2_gen = rows_per_proc;
start_point = (an_id - 1)*num_2_gen + leftover; // defining corresponding initial position in the main vector for other processes if there is leftover.
}
ierr = MPI_Send(&num_2_gen, 1, MPI_INT, an_id, 1234, MPI_COMM_WORLD); // sending the information of how many data must be generated.
ierr = MPI_Send(&start_point, 1, MPI_INT, an_id, 1234, MPI_COMM_WORLD); // sending the information of initial positions on main vector.
}
sum = 0;
for(an_id = 1; an_id < num_proc; an_id++){
ierr = MPI_Recv(&partial_sum, 1, MPI_DOUBLE, an_id, 4321, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // recieving partial sum.
sum = sum + partial_sum;
}
printf("Total sum = %f.\n", sum);
}
else{
ierr = MPI_Recv(&num_2_gen, 1, MPI_INT, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // recieving the information of how many data worker must generate.
ierr = MPI_Recv(&start_point, 1, MPI_INT, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // recieving the information of initial positions.
// generate and sum vector pieces
partial_sum = 0;
for(i = start_point; i < start_point + num_2_gen; i++){
vec2[i] = pow(i,2)+1.0;
partial_sum = partial_sum + vec2[i];
}
printf("Partial sum of %d: %f\n",my_id, partial_sum);
ierr = MPI_Send(&partial_sum, 1, MPI_DOUBLE, root_process, 4321, MPI_COMM_WORLD); // sending partial sum to root process.
}
ierr = MPI_Finalize();
return 0;
}
In this new version, instead of passing the pieces of the main vector, only the information needed to generate those pieces in each process is passed.
The new code using MPI_Reduce() is faster and simpler than the previous one:
/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml
Code which calculate the sum of a vector using parallel computation.
In case the main vector does not split equally among all processes, the leftover is passed to process id 0.
Process id 0 is the root process. However, it will also perform part of calculations.
Each process will generate and calculate the partial sum of the vector values. It will be used MPI_Reduce() to calculate the total sum.
Since the processes are independent, the printing order will be different at each run.
compile as: mpicc -o vector_sum vector_sum.c -lm
run as: time mpirun -n x vector_sum
x = number of splits desired + root process. For example: if x = 3, the vector will be split in two.
Acknowledgements: I would like to thanks Gilles Gouaillardet (https://stackoverflow.com/users/8062491/gilles-gouaillardet) for the helpful suggestion.
*/
#include<stdio.h>
#include<mpi.h>
#include<math.h>
#define vec_len 100000000
double vec2[vec_len];
int main(int argc, char* argv[]){
// defining program variables
int i;
double sum, partial_sum;
// defining parallel step variables
int my_id, num_proc, ierr, an_id, root_process;
int vec_size, rows_per_proc, leftover, num_2_gen, start_point;
vec_size = 1e8; // defining the main vector size
ierr = MPI_Init(&argc, &argv);
root_process = 0;
ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
rows_per_proc = vec_size/num_proc; // getting the number of elements for each process.
rows_per_proc = floor(rows_per_proc); // getting the maximum integer possible.
leftover = vec_size - num_proc*rows_per_proc; // counting the leftover.
if(my_id == 0){
num_2_gen = rows_per_proc + leftover; // if there is leftover, it is calculate in process 0
start_point = my_id*num_2_gen; // the corresponding position on the main vector
}
else{
num_2_gen = rows_per_proc;
start_point = my_id*num_2_gen + leftover; // the corresponding position on the main vector
}
partial_sum = 0;
for(i = start_point; i < start_point + num_2_gen; i++){
vec2[i] = pow(i,2) + 1.0; // defining vector values
partial_sum += vec2[i]; // calculating partial sum
}
printf("Partial sum of process id %d: %f.\n", my_id, partial_sum);
MPI_Reduce(&partial_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, root_process, MPI_COMM_WORLD); // calculating total sum
if(my_id == root_process){
printf("Total sum is %f.\n", sum);
}
ierr = MPI_Finalize();
return 0;
}
I am working on a parallel processing program that uses MPI_Send() and MPI_Recv() instead of MPI_Reduce(). I understand that MPI_Send() will need to send a value from each processor to the root processor, i.e. rank 0, and MPI_Recv() will need to receive all of the values from each processor.
I keep running into the problem that the value passed to MPI_Send() never arrives on the receiving side, which leaves the final value at 0. The MPI_Reduce() call is still in the code but commented out, to show what needs to be replaced. Can anyone help?
#include "mpi.h"
#include <stdio.h>
#include <math.h>
int main( int argc, char *argv[])
{
int n, i;
double PI25DT = 3.141592653589793238462643;
double pi, h, sum, x;
int numprocs, myid;
double startTime, endTime;
/* Initialize MPI and get number of processes and my number or rank*/
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
/* Processor zero sets the number of intervals and starts its clock*/
if (myid==0) {
n=600000000;
startTime=MPI_Wtime();
for (int i = 0; i < numprocs; i++) {
if (i != myid) {
MPI_Send(&n, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
}
}
}
else {
MPI_Recv(&n, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
/* Calculate the width of intervals */
h = 1.0 / (double) n;
/* Initialize sum */
sum = 0.0;
/* Step over each inteval I own */
for (i = myid+1; i <= n; i += numprocs) {
/* Calculate midpoint of interval */
x = h * ((double)i - 0.5);
/* Add rectangle's area = height*width = f(x)*h */
sum += (4.0/(1.0+x*x))*h;
}
/* Get sum total on processor zero */
//MPI_Reduce(&sum,&pi,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
double value = 0;
if (myid != 0) {
MPI_Send(&sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
}
else {
for (int i = 1; i < numprocs; i++) {
MPI_Recv(&value, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
pi += value;
}
}
/* Print approximate value of pi and runtime*/
if (myid==0) {
printf("pi is approximately %.16f, Error is %e\n",
pi, fabs(pi - PI25DT));
endTime=MPI_Wtime();
printf("runtime is=%.16f",endTime-startTime);
}
MPI_Finalize();
return 0;
}
You are using MPI_INT to send a value of type double:
if (myid != 0) {
MPI_Send(&sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
// ^^^^^^^
}
int is 4 bytes long; double is 8 bytes long. Although the receive operation succeeds, it cannot construct a value of type MPI_DOUBLE given only 4 bytes from the message, so it doesn't write anything into value and it remains 0.0. Indeed, if you replace:
MPI_Recv(&value, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
with
MPI_Status status;
int count;
MPI_Recv(&value, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_DOUBLE, &count);
if (count == MPI_UNDEFINED) {
printf("Short message received\n");
MPI_Abort(MPI_COMM_WORLD, 0);
}
your program will abort, indicating that the body of the conditional statement was executed due to MPI_Get_count() returning MPI_UNDEFINED in count, which signals that the length of the received message was not an integer multiple of the size of MPI_DOUBLE.
Also, pi must be explicitly initialised to sum before the receive loop, otherwise you will get the wrong value of pi due to the following two errors:
pi is left uninitialised and has an arbitrary initial value, and
the contribution of rank 0 is not added to the final result.
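Putting both fixes together, the receive part might look like the following sketch (reusing sum, pi, value, myid and numprocs from the code above):
/* send the partial sum with the matching MPI_DOUBLE type and seed pi
   with rank 0's own contribution before accumulating the others */
if (myid != 0) {
    MPI_Send(&sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
} else {
    pi = sum;                    /* rank 0's own contribution */
    for (int i = 1; i < numprocs; i++) {
        MPI_Recv(&value, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        pi += value;
    }
}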
If I have this code:
int main(void) {
int result=0;
int num[6] = {1, 2, 4, 3, 7, 1};
if (my_rank != 0) {
MPI_Reduce(num, &result, 6, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
} else {
MPI_Reduce(num, &result, 6, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
printf("result = %d\n", result);
}
}
the result printed is 1;
But if num[0] = 9, then the result is 9.
I read that to solve this problem I must define the variable num as an array.
I can't understand how the function MPI_Reduce works with MPI_MIN. Why, if num[0] is not equal to the smallest number, must I define the variable num as an array?
MPI_Reduce performs a reduction over the members of the communicator - not the members of the local array. sendbuf and recvbuf must both be of the same size.
I think the standard says it best:
Thus, all processes provide input buffers and output buffers of the same length, with elements of the same type. Each process can provide one element, or a sequence of elements, in which case the combine operation is executed element-wise on each entry of the sequence.
MPI does not take the minimum of all elements in the array; you have to do that manually.
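For illustration, a minimal self-contained sketch of both points (not the asker's code): the receive buffer is an array of the same length, the reduction is element-wise, and the overall minimum is then taken locally on the root:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int num[6] = {1, 2, 4, 3, 7, 1};  /* every rank contributes 6 values */
    int result[6];                    /* recvbuf must also hold 6 values */

    /* result[k] = minimum of num[k] over all ranks (element-wise) */
    MPI_Reduce(num, result, 6, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        int min = result[0];          /* minimum over the array: done by hand */
        for (int k = 1; k < 6; k++)
            if (result[k] < min) min = result[k];
        printf("overall minimum = %d\n", min);
    }
    MPI_Finalize();
    return 0;
}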
You can use MPI_MIN to obtain the minimum value among those passed via the reduction.
Let's examine the function declaration:
int MPI_Reduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype
datatype, MPI_Op op, int root, MPI_Comm comm)
Each process sends its value (or array of values) using the buffer sendbuf.
The process identified by the root id receives the buffers and stores them in the buffer recvbuf. The number of elements to receive from each of the other processes is specified in count, so that recvbuf must be allocated with dimension sizeof(datatype)*count.
If each process has only one integer to send (count = 1) then recvbuf is also an integer; if each process has two integers then recvbuf is an array of integers of size 2. See this nice post for further explanations and nice pictures.
Now it should be clear that your code is wrong: sendbuf and recvbuf must be of the same size, and there is no need for the condition on my_rank. Simply, recvbuf has meaning only for the root process and sendbuf for the others.
In your example you can assign one or more elements of the array to each process and then compute the minimum value (if there are as many processes as values in the array) or the array of minimum values (if there are more values than processes).
Here is a working example that illustrates the usage of MPI_MIN, MPI_MAX and MPI_SUM (slightly modified from this), in the case of simple values (not arrays).
Each process does some work, depending on its rank, and sends the time spent doing the work to the root process. The root process collects the times and outputs the min, max and average values of the times.
#include <stdio.h>
#include <mpi.h>
int myrank, numprocs;
/* just a function to waste some time */
float work()
{
float x, y = 0.;  /* initialise the accumulator */
if (myrank%2) {
for (int i = 0; i < 100000000; ++i) {
x = i/0.001;
y += x;
}
} else {
for (int i = 0; i < 100000; ++i) {
x = i/0.001;
y += x;
}
}
return y;
}
int main(int argc, char **argv)
{
int node;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &node);
printf("Hello World from Node %d\n",node);
/*variables used for gathering timing statistics*/
double mytime,
maxtime,
mintime,
avgtime;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Barrier(MPI_COMM_WORLD); /*synchronize all processes*/
mytime = MPI_Wtime(); /*get time just before work section */
work();
mytime = MPI_Wtime() - mytime; /*get time just after work section*/
/*compute max, min, and average timing statistics*/
MPI_Reduce(&mytime, &maxtime, 1, MPI_DOUBLE,MPI_MAX, 0, MPI_COMM_WORLD);
MPI_Reduce(&mytime, &mintime, 1, MPI_DOUBLE, MPI_MIN, 0,MPI_COMM_WORLD);
MPI_Reduce(&mytime, &avgtime, 1, MPI_DOUBLE, MPI_SUM, 0,MPI_COMM_WORLD);
/* plot the output */
if (myrank == 0) {
avgtime /= numprocs;
printf("Min: %lf Max: %lf Avg: %lf\n", mintime, maxtime,avgtime);
}
MPI_Finalize();
return 0;
}
If I run this on my OSX laptop, this is what I get:
urcaurca$ mpirun -n 4 ./a.out
Hello World from Node 3
Hello World from Node 0
Hello World from Node 2
Hello World from Node 1
Min: 0.000974 Max: 0.985291 Avg: 0.493081
I'm a beginner in MPI programming. I'm trying to write a program that dynamically takes in one-dimensional arrays of different sizes (multiples of 100, 1000, 10000, 1000000 and so on) and scatters them to the allotted processor cores. The processor cores calculate the sum of the received elements and send the sum back. The root process prints the sum of the elements in the input array.
I used MPI_Scatter() and MPI_Reduce() to solve the problem. However, when the number of processor cores allotted is odd, some of the data gets left out. For example, with an input data size of 100 and 3 processes, only 99 elements are added and the last one is left out.
I searched for alternatives and found that MPI_Scatterv() can be used for uneven distribution of data. But there is no material available to guide me on its implementation. Can someone help me? I'm posting my code here. Thanks in advance.
#include <stdio.h>
#include <mpi.h>
#include <stdlib.h>
void readArray(char * fileName, double ** a, int * n);
int Numprocs, MyRank;
int mpi_err;
#define Root 0
void init_it(int *argc, char ***argv) {
mpi_err = MPI_Init(argc, argv);
mpi_err = MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);
mpi_err = MPI_Comm_size(MPI_COMM_WORLD, &Numprocs);
}
int main(int argc, char** argv) {
/* .......Variables Initialisation ......*/
int index;
double *InputBuffer, *RecvBuffer, sum=0.0, psum = 0.0;
double ptime = 0.0, Totaltime= 0.0,startwtime = 0.0, endwtime = 0.0;
int Scatter_DataSize;
int DataSize;
FILE *fp;
init_it(&argc,&argv);
if (argc != 2) {
fprintf(stderr, "\n*** Usage: arraySum <inputFile>\n\n");
exit(1);
}
if (MyRank == 0) {
startwtime = MPI_Wtime();
printf("Number of nodes running %d\n",Numprocs);
/*...... Read input....*/
readArray(argv[1], &InputBuffer, &DataSize);
printf("Size of array %d\n", DataSize);
}
if (MyRank!=0) {
MPI_Recv(&DataSize, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, NULL);
}
else {
int i;
for (i=1;i<Numprocs;i++) {
MPI_Send(&DataSize, 1, MPI_INT, i, 1, MPI_COMM_WORLD);
d[i]= i*Numprocs;
}
}
Scatter_DataSize = DataSize / Numprocs;
RecvBuffer = (double *)malloc(Scatter_DataSize * sizeof(double));
MPI_Barrier(MPI_COMM_WORLD);
mpi_err = MPI_Scatter(InputBuffer, Scatter_DataSize, MPI_DOUBLE,
RecvBuffer, Scatter_DataSize, MPI_DOUBLE,
0, MPI_COMM_WORLD);
for (index = 0; index < Scatter_DataSize; index++) {
psum = psum + RecvBuffer[index];
}
//printf("Processor %d computed sum %f\n", MyRank, psum);
mpi_err = MPI_Reduce(&psum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (MyRank == 0) {
endwtime = MPI_Wtime();
Totaltime = endwtime - startwtime;
printf("Total sum %f\n",sum);
printf("Total time %f\n", Totaltime);
}
MPI_Finalize();
return 0;
}
void readArray(char * fileName, double ** a, int * n) {
int count, DataSize;
double * InputBuffer;
FILE * fin;
fin = fopen(fileName, "r");
if (fin == NULL) {
fprintf(stderr, "\n*** Unable to open input file '%s'\n\n",
fileName);
exit(1);
}
fscanf(fin, "%d\n", &DataSize);
InputBuffer = (double *)malloc(DataSize * sizeof(double));
if (InputBuffer == NULL) {
fprintf(stderr, "\n*** Unable to allocate %d-length array", DataSize);
exit(1);
}
for (count = 0; count < DataSize; count++) {
fscanf(fin, "%lf", &InputBuffer[count]);
}
fclose(fin);
*n = DataSize;
*a = InputBuffer;
}
In your case, you may just play with the sendcounts[] array of MPI_Scatterv. Indeed, a trivial implementation would be to compute the number of elements (let's say Nelement) of type sendtype that all the processes but one will receive. One of the processes (for instance the last one) will get the remaining data. In that case, sendcounts[i] = Nelement for indexes i from 0 to p-2 (p is the number of processes in the communicator, for you MPI_COMM_WORLD). Then the process p-1 will get sendcounts[p-1] = DataSize-Nelement*(p-1). Concerning the array of displacements displs[], you just have to specify the displacement (in number of elements) from which to take the outgoing data for process i (cf. [1] page 161). For the previous example this would be:
for (i=0; i<p; ++i)
displs[i]=Nelement*i;
If you decide that another process q must compute the other data, remember to set the correct displacement displs[q+1] for the process q+1, with 0 ≤ q < q+1 ≤ p.
[1] MPI: A Message-Passing Interface Standard (Version 3.1): http://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
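A sketch of that recipe (the names Nelement, sendcounts and displs are introduced here; DataSize, Numprocs, MyRank, InputBuffer and RecvBuffer come from the question's code, and DataSize is assumed to be known on every rank, as it is after the initial send/receive):
/* equal chunks of Nelement values for every rank; the last rank also
   gets the remainder */
int Nelement = DataSize / Numprocs;
int *sendcounts = (int *)malloc(Numprocs * sizeof(int));
int *displs     = (int *)malloc(Numprocs * sizeof(int));
for (int i = 0; i < Numprocs; i++) {
    sendcounts[i] = Nelement;
    displs[i]     = Nelement * i;
}
sendcounts[Numprocs-1] = DataSize - Nelement*(Numprocs-1);

RecvBuffer = (double *)malloc(sendcounts[MyRank] * sizeof(double));
MPI_Scatterv(InputBuffer, sendcounts, displs, MPI_DOUBLE,
             RecvBuffer, sendcounts[MyRank], MPI_DOUBLE,
             0, MPI_COMM_WORLD);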
The computation of Scatter_DataSize:
Scatter_DataSize = DataSize / Numprocs;
is correct only if DataSize is a multiple of Numprocs which, in your case, as DataSize is always even, occurs when Numprocs is even. When Numprocs is odd you should explicitly compute the remainder and assign it to one MPI process; I suggest the last one.
Just looking over some notes prior to an interview and am struggling to understand how Odd-Even sort works in parallel architectures.
int MPI_OddEven_Sort(int n, double *a, int root, MPI_Comm comm)
{
int rank, size, i, sorted_result;
double *local_a;
// get rank and size of comm
MPI_Comm_rank(comm, &rank); //&rank = address of rank
MPI_Comm_size(comm, &size);
local_a = (double *) calloc(n / size, sizeof(double));
// scatter the array a to local_a
MPI_Scatter(a, n / size, MPI_DOUBLE, local_a, n / size, MPI_DOUBLE,
root, comm);
// sort local_a
merge_sort(n / size, local_a);
//odd-even part
for (i = 0; i < size; i++) {
if ((i + rank) % 2 == 0) { // means i and rank have same nature
if (rank < size - 1) {
MPI_Compare(n / size, local_a, rank, rank + 1, comm);
}
} else if (rank > 0) {
MPI_Compare(n / size, local_a, rank - 1, rank, comm);
}
MPI_Barrier(comm);
// test if array is sorted
MPI_Is_Sorted(n / size, local_a, root, comm, &sorted_result);
// is sorted gives integer 0 or 1, if 0 => array is sorted
if (sorted_result == 0) {
break;
} // check for iterations
}
// gather local_a to a
MPI_Gather(local_a, n / size, MPI_DOUBLE, a, n / size, MPI_DOUBLE,
root, comm);
return MPI_SUCCESS;
}
is some code I wrote for this function (not today nor yesterday!). Can someone please break down how it is working?
I'm scattering my array a to each processor, which is getting a copy of local_a (which is of size n/size)
Merge sort is being called on each local_a.
What is going on after this? (Assuming I am correct so far!)
It's sort of fun to see these PRAM-type sorting networks popping up again after all these years. The original mental model of parallel computing for these things was massively parallel arrays of tiny processors as "comparators", eg the Connection Machines - back in the day when networking was cheap compared to CPU/RAM. Of course that ended up looking very different from the supercomputers of the mid to late 80s and on, and even more so than the x86 clusters of the late 90s on; but now they're starting to come back in vogue with GPUs and other accelerators which actually do look a bit like that future past if you squint.
It looks like what you have above is something more like a Baudet-Stevenson odd-even sort, which was already starting to move in the direction of assuming that the processors would have multiple items stored locally and you could make good use of the processors by sorting those local lists in between communication steps.
Fleshing out your code and simplifying it a bit, we have something like this:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int merge(double *ina, int lena, double *inb, int lenb, double *out) {
int i,j;
int outcount=0;
for (i=0,j=0; i<lena; i++) {
while (j < lenb && (inb[j] < ina[i])) {  /* check j before reading inb[j] */
out[outcount++] = inb[j++];
}
out[outcount++] = ina[i];
}
while (j<lenb)
out[outcount++] = inb[j++];
return 0;
}
int domerge_sort(double *a, int start, int end, double *b) {
if ((end - start) <= 1) return 0;
int mid = (end+start)/2;
domerge_sort(a, start, mid, b);
domerge_sort(a, mid, end, b);
merge(&(a[start]), mid-start, &(a[mid]), end-mid, &(b[start]));
for (int i=start; i<end; i++)
a[i] = b[i];
return 0;
}
int merge_sort(int n, double *a) {
double b[n];
domerge_sort(a, 0, n, b);
return 0;
}
void printstat(int rank, int iter, char *txt, double *la, int n) {
printf("[%d] %s iter %d: <", rank, txt, iter);
for (int j=0; j<n-1; j++)
printf("%6.3lf,",la[j]);
printf("%6.3lf>\n", la[n-1]);
}
void MPI_Pairwise_Exchange(int localn, double *locala, int sendrank, int recvrank,
MPI_Comm comm) {
/*
* the sending rank just sends the data and waits for the results;
* the receiving rank receives it, sorts the combined data, and returns
* the correct half of the data.
*/
int rank;
double remote[localn];
double all[2*localn];
const int mergetag = 1;
const int sortedtag = 2;
MPI_Comm_rank(comm, &rank);
if (rank == sendrank) {
MPI_Send(locala, localn, MPI_DOUBLE, recvrank, mergetag, MPI_COMM_WORLD);
MPI_Recv(locala, localn, MPI_DOUBLE, recvrank, sortedtag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
MPI_Recv(remote, localn, MPI_DOUBLE, sendrank, mergetag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
merge(locala, localn, remote, localn, all);
int theirstart = 0, mystart = localn;
if (sendrank > rank) {
theirstart = localn;
mystart = 0;
}
MPI_Send(&(all[theirstart]), localn, MPI_DOUBLE, sendrank, sortedtag, MPI_COMM_WORLD);
for (int i=mystart; i<mystart+localn; i++)
locala[i-mystart] = all[i];
}
}
int MPI_OddEven_Sort(int n, double *a, int root, MPI_Comm comm)
{
int rank, size, i;
double *local_a;
// get rank and size of comm
MPI_Comm_rank(comm, &rank); //&rank = address of rank
MPI_Comm_size(comm, &size);
local_a = (double *) calloc(n / size, sizeof(double));
// scatter the array a to local_a
MPI_Scatter(a, n / size, MPI_DOUBLE, local_a, n / size, MPI_DOUBLE,
root, comm);
// sort local_a
merge_sort(n / size, local_a);
//odd-even part
for (i = 1; i <= size; i++) {
printstat(rank, i, "before", local_a, n/size);
if ((i + rank) % 2 == 0) { // means i and rank have same nature
if (rank < size - 1) {
MPI_Pairwise_Exchange(n / size, local_a, rank, rank + 1, comm);
}
} else if (rank > 0) {
MPI_Pairwise_Exchange(n / size, local_a, rank - 1, rank, comm);
}
}
printstat(rank, i-1, "after", local_a, n/size);
// gather local_a to a
MPI_Gather(local_a, n / size, MPI_DOUBLE, a, n / size, MPI_DOUBLE,
root, comm);
if (rank == root)
printstat(rank, i, " all done ", a, n);
return MPI_SUCCESS;
}
int main(int argc, char **argv) {
MPI_Init(&argc, &argv);
int n = argc-1;
double a[n];
for (int i=0; i<n; i++)
a[i] = atof(argv[i+1]);
MPI_OddEven_Sort(n, a, 0, MPI_COMM_WORLD);
MPI_Finalize();
return 0;
}
So the way this works is that the list is evenly split up between processors (non-equal distributions are easily handled too, but it's a lot of extra bookkeeping which doesn't add much to this discussion).
We first sort our local lists (which is O(n/P ln n/P)). There's no reason it has to be a merge sort, of course, except that here we can re-use that merge code in the following steps. Then we do P neighbour exchange steps, half in each direction. The model here was that there was a linear network where we could communicate directly and quickly with immediate neighbours, and perhaps not at all with neighbours further away.
The original odd-even sorting network is the case where each processor has one key, in which case the communication is easy - you compare your item with your neighbour, and swap if necessary (so that this is basically a parallel bubble sort). In this case, we do a simple parallel sort between pairs of processes - here, each pair just sends all data to one of the pair, that pair merges the already locally sorted lists O(N/P), and then gives the appropriate half of the data back to the other processor. I took out your check-if-done; it can be shown that it's completed in P neighbour exchanges. You can certainly add it back in just in case of early termination; however, all the processors have to agree when everything's done, which requires something like an all reduce, which breaks the original model somewhat.
So we have O(n) data transfer per link (sending and receiving n/P items, P times each), and each processor does (n/P ln n/P) + (2 n/P - 1)*P/2 = O(n/P ln n/P + n) comparisons; in this case there's a scatter and a gather to be considered as well, but in general this sort is done with the data in place.
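Written out, that per-processor operation count is just the local merge sort plus roughly P/2 pairwise merges of two blocks of n/P keys each (a restatement of the expression above, in LaTeX notation):
\[
\underbrace{\tfrac{n}{P}\ln\tfrac{n}{P}}_{\text{local merge sort}}
+ \underbrace{\left(2\tfrac{n}{P}-1\right)\cdot\tfrac{P}{2}}_{\approx P/2 \text{ pairwise merges}}
= O\!\left(\tfrac{n}{P}\ln\tfrac{n}{P} + n\right)
\]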
Running the above with, for clarity, the same example as in the linked document gives (with output re-ordered to make it easier to read):
$ mpirun -np 4 ./baudet-stevenson 43 54 63 28 79 81 32 47 84 17 25 49
[0] before iter 1: <43.000,54.000,63.000>
[1] before iter 1: <28.000,79.000,81.000>
[2] before iter 1: <32.000,47.000,84.000>
[3] before iter 1: <17.000,25.000,49.000>
[0] before iter 2: <43.000,54.000,63.000>
[1] before iter 2: <28.000,32.000,47.000>
[2] before iter 2: <79.000,81.000,84.000>
[3] before iter 2: <17.000,25.000,49.000>
[0] before iter 3: <28.000,32.000,43.000>
[1] before iter 3: <47.000,54.000,63.000>
[2] before iter 3: <17.000,25.000,49.000>
[3] before iter 3: <79.000,81.000,84.000>
[0] before iter 4: <28.000,32.000,43.000>
[1] before iter 4: <17.000,25.000,47.000>
[2] before iter 4: <49.000,54.000,63.000>
[3] before iter 4: <79.000,81.000,84.000>
[0] after iter 4: <17.000,25.000,28.000>
[1] after iter 4: <32.000,43.000,47.000>
[2] after iter 4: <49.000,54.000,63.000>
[3] after iter 4: <79.000,81.000,84.000>
[0] all done iter 5: <17.000,25.000,28.000,32.000,43.000,47.000,49.000,54.000,63.000,79.000,81.000,84.000>