Successive calls to MPI_Iscatterv with the same communicator - c

I am trying to scatter two different, independent arrays from rank 0 to all other ranks, over the same communicator, using the non-blocking version of the communication.
Something along these lines:
//do some stuff with arrays here...
MPI_IScatterv(array1, partial_size1, displs1,
MPI_DOUBLE, local1, partial_size1,
MPI_DOUBLE, 0, some_communicator, &request);
MPI_IScatterv(array2, partial_size2, displs2,
MPI_DOUBLE, local2, partial_size2,
MPI_DOUBLE, 0, some_communicator, &request);
//do some stuff where none of the arrays is needed...
MPI_Wait(&request, &status);
//do stuff with the arrays...
So... is it possible (or rather, is it guaranteed to always be error-free) to make two successive calls to MPI_Iscatterv on the same communicator, or might that mix up the messages from the two scatters, since there are no tags?

Yes, according to the MPI standard it is possible to have multiple non-blocking collective operations outstanding at once. In particular, on page 197, in Section 5.12 (NONBLOCKING COLLECTIVE OPERATIONS):
Multiple nonblocking collective operations can be outstanding on a single communicator. If the nonblocking call causes some system resource to be exhausted, then it will fail and generate an MPI exception. Quality implementations of MPI should ensure that
this happens only in pathological cases. That is, an MPI implementation should be able to
support a large number of pending nonblocking operations.
Nevertheless, make sure that different requests are used for the successive calls to MPI_Iscatterv(). The function MPI_Waitall() is useful to check the completion of multiple non-blocking operations.
MPI_Request requests[2];
MPI_Iscatterv(...,&requests[0]);
MPI_Iscatterv(...,&requests[1]);
MPI_Waitall(2,requests,...);
A sample code showing how it can be done:
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#include <math.h>
int main(int argc, char *argv[]) {
    MPI_Request requests[42];
    MPI_Init(&argc, &argv);
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int version, subversion;
    MPI_Get_version(&version, &subversion);
    if (rank == 0) { printf("MPI version %d.%d\n", version, subversion); }

    // dimensions
    int nbscatter = 5;
    int nlocal = 2;
    double *array = NULL;
    int i, j, k;

    // build a 2D array of nbscatter lines and nlocal*size columns on the root process
    if (rank == 0) {
        array = malloc(nlocal * nbscatter * size * sizeof(double));
        if (array == NULL) { printf("malloc failure\n"); }
        for (i = 0; i < nbscatter; i++) {
            for (j = 0; j < size * nlocal; j++) {
                array[i * size * nlocal + j] = j + 0.01 * i;
                printf("%lf ", array[i * size * nlocal + j]);
            }
            printf("\n");
        }
    }

    // on each process, a 2D array of nbscatter lines and nlocal columns
    double *arrayloc = malloc(nlocal * nbscatter * sizeof(double));
    if (arrayloc == NULL) { printf("malloc failure2\n"); }

    // counts and displacements
    int *displs;
    int *scounts;
    displs = malloc(nbscatter * size * sizeof(int));
    if (displs == NULL) { printf("malloc failure\n"); }
    scounts = malloc(nbscatter * size * sizeof(int));
    if (scounts == NULL) { printf("malloc failure\n"); }

    for (i = 0; i < nbscatter; i++) {
        for (j = 0; j < size; j++) {
            displs[i * size + j] = j * nlocal;
            scounts[i * size + j] = nlocal;
        }
        // scatter the lines
        if (rank == 0) {
            MPI_Iscatterv(&array[i * nlocal * size], &scounts[i * size], &displs[i * size], MPI_DOUBLE,
                          &arrayloc[i * nlocal], nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD, &requests[i]);
        } else {
            MPI_Iscatterv(NULL, &scounts[i * size], &displs[i * size], MPI_DOUBLE,
                          &arrayloc[i * nlocal], nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD, &requests[i]);
        }
    }

    MPI_Status status[nbscatter];
    if (MPI_Waitall(nbscatter, requests, status) != MPI_SUCCESS) {
        printf("MPI_Waitall() failed\n");
    }
    if (rank == 0) {
        free(array);
    }
    free(displs);
    free(scounts);

    // print the local array, containing the scattered columns
    for (k = 0; k < size; k++) {
        if (rank == k) {
            printf("on rank %d\n", k);
            for (i = 0; i < nbscatter; i++) {
                for (j = 0; j < nlocal; j++) {
                    printf("%lf ", arrayloc[i * nlocal + j]);
                }
                printf("\n");
            }
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }
    free(arrayloc);
    MPI_Finalize();
    return 0;
}
Compile with mpicc main.c -o main -Wall and run with mpirun -np 4 main.


Recursive Function using MPI library

I am using a recursive function to find the determinant of a 5x5 matrix. Although this sounds like a trivial problem, and it could be parallelized with OpenMP if the dimensions were huge, I am trying to use MPI instead. However, I can't understand how to deal with the recursion that accumulates the results.
So my question is: how do I use MPI for this?
PS: The matrix is a Hilbert matrix, so the answer would be 0.
I have written the code below, but I think it simply does the same work n times rather than dividing the problem and then accumulating the result.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#define ROWS 5
double getDeterminant(double matrix[ROWS][ROWS], int iDim);

double matrix[ROWS][ROWS] = {
    {1.0, -0.5, -0.33, -0.25, -0.2},
    {-0.5, 0.33, -0.25, -0.2, -0.167},
    {-0.33, -0.25, 0.2, -0.167, -0.1428},
    {-0.25, -0.2, -0.167, 0.1428, -0.125},
    {-0.2, -0.167, -0.1428, -0.125, 0.111},
};

int rank, size, tag = 0;

int main(int argc, char** argv)
{
    //Status of messages for each individual row
    MPI_Status status[ROWS];
    //Message ID or Rank
    MPI_Request req[ROWS];
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double result;
    result = getDeterminant(matrix, ROWS);
    printf("The determinant is %lf\n", result);
    //Set barrier to wait for all processes
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

double getDeterminant(double matrix[ROWS][ROWS], int iDim)
{
    int iCols, iMinorRow, iMinorCol, iTemp, iSign;
    double c[5];
    double tempMat[5][5];
    double dDet;
    dDet = 0;
    if (iDim == 2)
    {
        dDet = (matrix[0][0] * matrix[1][1]) - (matrix[0][1] * matrix[1][0]);
        return dDet;
    }
    else
    {
        for (iCols = 0; iCols < iDim; iCols++)
        {
            int temp_row = 0, temp_col = 0;
            for (iMinorRow = 0; iMinorRow < iDim; iMinorRow++)
            {
                for (iMinorCol = 0; iMinorCol < iDim; iMinorCol++)
                {
                    if (iMinorRow != 0 && iMinorCol != iCols)
                    {
                        tempMat[temp_row][temp_col] = matrix[iMinorRow][iMinorCol];
                        temp_col++;
                        if (temp_col >= iDim - 1)
                        {
                            temp_row++;
                            temp_col = 0;
                        }
                    }
                }
            }
            //Handling the alternating signs while calculating the determinant
            for (iTemp = 0, iSign = 1; iTemp < iCols; iTemp++)
            {
                iSign = (-1) * iSign;
            }
            //Evaluating the cofactor once the minor has been reduced to 2x2
            c[iCols] = iSign * getDeterminant(tempMat, iDim - 1);
        }
        for (iCols = 0, dDet = 0.0; iCols < iDim; iCols++)
        {
            dDet = dDet + (matrix[0][iCols] * c[iCols]);
        }
        return dDet;
    }
}
The expected result is a very small value close to 0. I do get that result, but without actually using MPI.
The provided program will be executed in n processes: mpirun launches the n processes and they all execute the provided code. That is the expected behaviour. Unlike OpenMP, MPI is not a shared-memory programming model but a distributed-memory one; it uses message passing to communicate between processes. There are no global variables in MPI: all the data in your program is local to your process. If you need to share data between processes, you have to send it explicitly, using point-to-point operations like MPI_Send to reach specific processes or collective operations like MPI_Bcast to reach all of them.
For your application to behave as expected, you have to tailor it to MPI (unlike in OpenMP, where you can use pragmas). Every process has an identifier, or rank. Typically, rank 0 (let's call it your main process) should pass the data to all processes using MPI_Send (or another method), and the remaining processes should receive it with a matching MPI_Recv. After receiving its local data from the main process, each process should perform some computation on it and then send the result back to the main process, which aggregates the results. This is a very basic MPI scenario; you can also use collectives, MPI IO, and so on.
MPI does not do anything by itself for synchronization or data sharing. It just launches n instances of the application and provides the required routines. The application developer is in charge of communication (data structures, etc.) and synchronization (using MPI_Barrier, etc.) among processes.
The following is a simple send/receive program using MPI. When you run it, say with n as 2, two copies of the program are launched. Using MPI_Comm_rank(), each process gets its own id, which can be used for further computations and for controlling the flow of the code. In the code below, the process with rank 0 sends the variable number using MPI_Send and the process with rank 1 receives this value using MPI_Recv. The if and else if distinguish the processes and change the control flow to send or receive the data. This is a very basic MPI program that shares data between processes.
// Find out rank, size
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

int number;
if (world_rank == 0) {
    number = -1;
    MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (world_rank == 1) {
    MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process 1 received number %d from process 0\n", number);
}
Here is a tutorial on MPI.

Looking for an MPI routine

So just today I started messing around with the MPI library in C, and after trying it out a bit I've found myself in a situation where I need the following:
A routine that sends a message to one randomly chosen process (sitting in a blocking receive), while leaving all the other processes blocked.
Does such a routine exist? If not, how can something like this be accomplished?
No, such a routine does not exist. However, you can easily build one from the routines available in the MPI standard. For example, a routine that sends to a random process other than the current one could look like this:
int MPI_SendRand(void *data, int count, MPI_Datatype type, int tag, MPI_Comm comm) {
    // one process sends to a randomly chosen peer
    int comm_size, my_rank, dest;
    MPI_Comm_rank(comm, &my_rank);
    MPI_Comm_size(comm, &comm_size);
    // pick a random rank in [0, comm_size) excluding my_rank
    do {
        dest = rand() % comm_size;
    } while (dest == my_rank);
    return MPI_Send(data, count, type, dest, tag, comm);
}
can be used as follows:
if (rank == master) {
    MPI_SendRand(some_data, some_count, MPI_BYTE, 0, MPI_COMM_WORLD);
} else {
    // the rest wait in a blocking receive
    MPI_Recv(some_buff, some_count, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // do work...
}

Comparing CPU utilization during MPI thread deadlock using mvapich2 vs. openmpi

I noticed that when I have a deadlocked MPI program, e.g. wait.c
#include <stdio.h>
#include <mpi.h>
int main(int argc, char * argv[])
{
    int taskID = -1;
    int NTasks = -1;
    int a = 11;
    int b = 22;
    MPI_Status Stat;

    /* MPI Initializations */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskID);
    MPI_Comm_size(MPI_COMM_WORLD, &NTasks);

    if(taskID == 0)
        MPI_Send(&a, 1, MPI_INT, 1, 66, MPI_COMM_WORLD);
    else //if(taskID == 1)
        MPI_Recv(&b, 1, MPI_INT, 0, 66, MPI_COMM_WORLD, &Stat);

    printf("Task %i : a: %i b: %i\n", taskID, a, b);
    MPI_Finalize();
    return 0;
}
When I compile wait.c with the mvapich2-2.1 library (which itself was compiled using gcc-4.9.2) and run it (e.g. mpirun -np 4 ./a.out) I notice (via top), that all 4 processors are chugging along at 100%.
When I compile wait.c with the openmpi-1.6 library (which itself was compiled using gcc-4.9.2) and run it (e.g. mpirun -np 4 ./a.out), I notice (via top), that 2 processors are chugging at 100% and 2 at 0%.
Presumably the 2 at 0% are the ones that completed communication.
QUESTION : Why is there a difference in CPU usage between openmpi and mvapich2? Is this the expected behavior? When the CPU usage is 100%, is that from constantly checking to see if a message is being sent?
Both implementations busy-wait on MPI_Recv() in order to minimize latencies. This explains why ranks 2 and 3 are at 100% with either of the two MPI implementations.
Now, clearly, ranks 0 and 1 progress to the MPI_Finalize() call, and this is where the two implementations differ: mvapich2 busy-waits in MPI_Finalize() while openmpi does not.
To answer your question: yes, the ranks are at 100% because they are constantly checking whether a message has arrived, and that is the expected behaviour.
If you are not on InfiniBand, you can observe this by attaching strace to one of the processes: you should see a number of poll() invocations there.

MPI Deadlock with collective functions

I'm writing a simple program in C with MPI library.
The intent of this program is the following:
I have a group of processes that perform an iterative loop. At the end of each iteration, all processes in the communicator must call two collective functions (MPI_Allreduce and MPI_Bcast): the first determines the rank of the process that generated the minimum value of the num.val variable, and the second broadcasts from that source rank (num_min.idx_v) to all processes in the communicator MPI_COMM_WORLD.
The problem is that I don't know whether the i-th process will have been finalized before the collective functions are called. Each process has a probability of 1/10 of terminating at every iteration; this simulates the behaviour of the real program that I'm implementing. As soon as the first process terminates, the others deadlock.
This is the code:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

typedef struct double_int {
    double val;
    int idx_v;
} double_int;

int main(int argc, char **argv)
{
    int n = 10;
    int max_it = 4000;
    int proc_id, n_proc;
    double *x = (double *)malloc(n * sizeof(double));

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &n_proc);
    MPI_Comm_rank(MPI_COMM_WORLD, &proc_id);
    srand(proc_id);

    double_int num_min;
    double_int num;
    int k;
    for (k = 0; k < max_it; k++) {
        num.idx_v = proc_id;
        num.val = rand() / (double)RAND_MAX;
        if ((rand() % 10) == 0) {
            printf("iter %d: proc %d terminated\n", k, proc_id);
            MPI_Finalize();
            exit(EXIT_SUCCESS);
        }
        MPI_Allreduce(&num, &num_min, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);
        MPI_Bcast(x, n, MPI_DOUBLE, num_min.idx_v, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    exit(EXIT_SUCCESS);
}
Perhaps I should create a new group and a new communicator before calling the MPI_Finalize function in the if statement? How should I solve this?
If you have control over a process before it terminates, you should have it send a non-blocking flag to a rank that cannot terminate early (let's call it the root rank). Then, instead of a blocking allreduce, you could have every rank send its value to the root rank.
The root rank would post non-blocking receives for both the possible flag and the value; every rank will have sent one or the other. Once all ranks are accounted for, you can do the reduction on the root rank, remove the exited ranks from the communication, and broadcast the result.
If your ranks exit without notice, I am not sure what options you have.

MPI RMA operation: Ordering between MPI_Win_free and local load

I'm trying to do a simple test of MPI's RMA operations using MPI_Win_lock and MPI_Win_unlock. The program just lets process 0 update an integer value on process 1 and display it.
The below program runs correctly (at least the result seems correct to me):
#include <mpi.h>
#include <stdio.h>

#define root 0

int main(int argc, char *argv[])
{
    int myrank, nprocs;
    MPI_Win nwin;
    int *st;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    MPI_Alloc_mem(1 * sizeof(int), MPI_INFO_NULL, &st);
    st[0] = 0;

    if (myrank != root) {
        MPI_Win_create(st, 1 * sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &nwin);
    }
    else {
        MPI_Win_create(NULL, 0, sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &nwin);
    }

    if (myrank == root) {
        st[0] = 1;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, nwin);
        MPI_Put(st, 1, MPI_INT, 1, 0, 1, MPI_INT, nwin);
        MPI_Win_unlock(1, nwin);
        MPI_Win_free(&nwin);
    }
    else { // rank 1
        MPI_Win_free(&nwin);
        printf("Rank %d, st = %d\n", myrank, st[0]);
    }

    MPI_Free_mem(st);
    MPI_Finalize();
    return 0;
}
The output I got is Rank 1, st = 1. But curiously, if I switch the lines in the else block for rank 1 to
else { // rank 1
    printf("Rank %d, st = %d\n", myrank, st[0]);
    MPI_Win_free(&nwin);
}
The output is Rank 1, st = 0.
I cannot figure out the reason behind this. The reason I need to put MPI_Win_free after loading the data is that originally I wanted to put all of this inside a while loop and let rank 0 determine when to stop the loop: when the condition is satisfied, rank 0 updates the flag (st) on rank 1. I tried putting MPI_Win_free outside the while loop so that the window is only freed after the loop. Does it now mean that I cannot do this and instead need to create and free the window on every iteration of the loop?
I'll be honest, MPI RMA is not my speciality, but I'll give this a shot:
The problem is that you're running into a race condition. When you do the MPI_Put operation, it sends the data from rank 0 to rank 1 to be put into the buffer at some point in the future. You don't have any control over that from rank 0's perspective.
On rank 1's side, you're not doing anything to complete the operation. I know that RMA (one-sided) operations sound like they shouldn't require any intervention on the target side, but they do require a bit. When you use one-sided operations, you have to have something on the receiving side that also synchronizes the data. In this case, you're mixing MPI put/get operations with non-MPI load/store operations. This is erroneous and results in the race condition you're seeing. When you move the MPI_Win_free first, it completes all of the outstanding operations, so your data is correct.
You can find out lots more about passive target synchronization (which is what you're doing here) with this question: MPI with C: Passive RMA synchronization.
