Looking for an MPI routine - c

So just today I started messing around with the MPI library in C and I've tried it out some and have now found myself in a situation where I need the following:
A routine that'll send a message to a random process in a blocking receive while leaving the others still blocked.
Does such a routine exist? If not, how can something like this be accomplished?

No, such routine does not exist. However, you can easily build one using the available routines in the MPI standard. For example if you want a routine that sends to a random process which is not the current one you can write the following:
int MPI_SendRand(void *data, unsigned size, int tag, MPI_Comm comm, MPI_Status *status) {
// one process sends
int comm_size, my_rank, dest;
MPI_Comm_rank(comm, &my_rank);
MPI_Comm_size(comm, &comm_size);
// random number between [0, comm_size) excluding my_rank
while ((dst = ((float)rand())/RAND_MAX*comm_size)) == my_rank) ;
return MPI_Send(data, size, dst, tag, comm, status);
}
can be used as follows:
if (rank == master) {
MPI_SendRand(some_data, sime_size, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
// the rest waits
MPI_Recv(some_buff, some_size, MPI_SOURCE_ANY, MPI_TAG_ANY, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
// do work...
}

Related

Successive calls to MPI_IScatterv with same communicator

I am trying to scatter two different, independent arrays from rank 0 to all others, over the same communicator, using non-blocking version of communication.
Something along these lines:
//do some stuff with arrays here...
MPI_IScatterv(array1, partial_size1, displs1,
MPI_DOUBLE, local1, partial_size1,
MPI_DOUBLE, 0, some_communicator, &request);
MPI_IScatterv(array2, partial_size2, displs2,
MPI_DOUBLE, local2, partial_size2,
MPI_DOUBLE, 0, some_communicator, &request);
//do some stuff where none of the arrays is needed...
MPI_Wait(&request, &status);
//do stuff with the arrays...
So... is it possible (or rather if it is guaranteed to always be error-free) to use two successive calls to MPI_IScatterv using the same communicator, or might that affect the result - mess up the messages from both scatters, since there are no tags?
Yes, it is possible to perform multiple non-blocking collective operations at once according to the MPI standard. In particular, on page 197, in section 5.12. NONBLOCKING COLLECTIVE OPERATIONS:
Multiple nonblocking collective operations can be outstanding on a single communicator. If the nonblocking call causes some system resource to be exhausted, then it will fail and generate an MPI exception. Quality implementations of MPI should ensure that
this happens only in pathological cases. That is, an MPI implementation should be able to
support a large number of pending nonblocking operations.
Nevertheless, make sure that different request are used for the successive calls to MPI_Iscatterv(). The function MPI_Waitall() is useful to check the completion of multiple non blocking operations.
MPI_Request requests[2];
MPI_Iscatterv(...,&requests[0]);
MPI_Iscatterv(...,&requests[1]);
MPI_Waitall(2,requests,...);
A sample code showing how it can be done:
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#include <math.h>
int main(int argc, char *argv[]) {
MPI_Request requests[42];
MPI_Init(&argc,&argv);
int size,rank;
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
int version,subversion;
MPI_Get_version( &version, &subversion );
if(rank==0){printf("MPI version %d.%d\n",version,subversion);}
//dimensions
int nbscatter=5;
int nlocal=2;
double* array=NULL;
int i,j,k;
//build a 2D array of nbscatter lines and nlocal*size columns on root process
if(rank==0){
array=malloc(nlocal*nbscatter*size*sizeof(double));
if(array==NULL){printf("malloc failure\n");}
for(i=0;i<nbscatter;i++){
for(j=0;j<size*nlocal;j++){
array[i*size*nlocal+j]=j+0.01*i;
printf("%lf ",array[i*size*nlocal+j]);
}
printf("\n");
}
}
//on each process, a 2D array of nbscatter lines and nlocal columns
double* arrayloc=malloc(nlocal*nbscatter*sizeof(double));
if(arrayloc==NULL){printf("malloc failure2\n");}
//counts and displacements
int* displs;
int* scounts;
displs = malloc(nbscatter*size*sizeof(int));
if(displs==NULL){printf("malloc failure\n");}
scounts = malloc(nbscatter*size*sizeof(int));
if(scounts==NULL){printf("malloc failure\n");}
for(i=0;i<nbscatter;i++){
for(j=0;j<size;j++){
displs[i*size+j]=j*nlocal;
scounts[i*size+j]=nlocal;
}
// scatter the lines
if(rank==0){
MPI_Iscatterv(&array[i*nlocal*size], &scounts[i*size], &displs[i*size],MPI_DOUBLE,&arrayloc[i*nlocal], nlocal,MPI_DOUBLE, 0, MPI_COMM_WORLD, &requests[i]);
}else{
MPI_Iscatterv(NULL, &scounts[i*size], &displs[i*size],MPI_DOUBLE,&arrayloc[i*nlocal], nlocal,MPI_DOUBLE, 0, MPI_COMM_WORLD, &requests[i]);
}
}
MPI_Status status[nbscatter];
if(MPI_Waitall(nbscatter,requests,status)!=MPI_SUCCESS){
printf("MPI_Waitall() failed\n");
}
if(rank==0){
free(array);
}
free(displs);
free(scounts);
//print the local array, containing the scattered columns
for(k=0;k<size;k++){
if(rank==k){
printf("on rank %d\n",k);
for(i=0;i<nbscatter;i++){
for(j=0;j<nlocal;j++){
printf("%lf ",arrayloc[i*nlocal+j]);
}
printf("\n");
}
}
MPI_Barrier(MPI_COMM_WORLD);
}
free(arrayloc);
MPI_Finalize();
return 0;
}
To be compiled by mpicc main.c -o main -Wall and ran by mpirun -np 4 main

Is there a method in MPI that acts like a MPI_Bcast but for an individual process, instead of operating on an entire communicator?

Essentially what I am looking for here is a simple MPI_SendRecv() routine that allows me to synchronize the same buffer by specifying a source and a destination processor.
In my mind the function call for my Ideal_MPI_SendRecv() function would look precisely like MPI_Bcast() but would contain a destination process instead of a Communicator.
It might be called as follows:
Ideal_MPI_SendRecv(&somebuffer, bufferlength, datatype, source_proc, destination_proc);
If not, is there any reason? It seems like this method would be the perfect method to synchronize a variable's values between two processes.
No, there is no such call in MPI since it is trivial to implement it using point-to-point communication. Of course you could write one, for example (with some rudimentary support for error handling):
// Just a random tag that is unlikely to be used by the rest of the program
#define TAG_IDEAL_SNDRCV 11223
int Ideal_MPI_SendRecv(void *buf, int count, MPI_Datatype datatype,
int source, int dest, MPI_Comm comm)
{
int rank;
int err;
if (source == dest)
return MPI_SUCCESS;
err = MPI_Comm_rank(comm, &rank);
if (err != MPI_SUCCESS)
return err;
if (rank == source)
err = MPI_Send(buf, count, datatype, dest, TAG_IDEAL_SNDRCV, comm);
else if (rank == dest)
err = MPI_Recv(buf, count, datatype, source, TAG_IDEAL_SNDRCV, comm,
MPI_STATUS_IGNORE);
return err;
}
// Example: transfer 'int buf[10]' from rank 0 to rank 2
Ideal_MPI_SendRecv(buf, 10, MPI_INT, 0, 2, MPI_COMM_WORLD);
You could also add another output argument of type MPI_Status * and store the status of MPI_Recv there. It could be useful if both processes have different buffer sizes.
Another option would be, if you have to do that many times within a fixed set of ranks, e.g. always from rank 0 to rank 2, you could simply create a new communicator and broadcast inside it:
int rank;
MPI_Comm buddycomm;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_split(MPI_COMM_WORLD, (!rank || rank == 2) ? 0 : MPI_UNDEFINED, rank,
&buddycomm);
// Transfer 'int buf[10]' from rank 0 to rank 2
MPI_Bcast(buf, 10, MPI_INT, 0, buddycomm);
This, of course, is an overkill since the broadcast is more expensive than the simple combination of MPI_Send and MPI_Recv.
Perhaps you want to call MPI_Send on one process (the source process, with the values you want) and MPI_Recv on another process (the one which doesn't initially have the values you want)?
If not, could you clarify how what you're trying to accomplish differs from a simple point-to-point message?

MPI_Waitall is failing

I wonder if anyone can shed some light on the MPI_Waitall function for me. I have a program passing information using MPI_Isend and MPI_Irecv. After all the sends and receives are complete, one process in the program (in this case, process 0), will print a message. My Isend/Irecv are working, but the message prints out at some random point in the program; so I am trying to use MPI_Waitall to wait until all the requests are done before printing the message. I receive the following error message:
Fatal error in PMPI_Waitall: Invalid MPI_Request, error stack:
PMPI_Waitall(311): MPI_Waitall(count=16, req_array=0x16f70d0, status_array=0x16f7260) failed
PMPI_Waitall(288): The supplied request in array element 1 was invalid (kind=0)
Here is some relevant code:
MPI_Status *status;
MPI_Request *request;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
status = (MPI_Status *) malloc(numtasks * sizeof(MPI_Status));
request = (MPI_Request *) malloc(numtasks * sizeof(MPI_Request));
/* Generate Data to send */
//Isend/Irecvs look like this:
MPI_Isend(&data, count, MPI_INT, dest, tag, MPI_COMM_WORLD, &request[taskid]);
MPI_Irecv(&data, count, MPI_INT, source, tag, MPI_COMM_WORLD, &request[taskid]);
MPI_Wait(&request[taskid], &status[taskid]
/* Calculations and such */
if (taskid == 0) {
MPI_Waitall (numtasks, request, status);
printf ("All done!\n");
}
MPI_Finalize();
Without the call to MPI_Waitall, the program runs cleanly, but the "All done" message prints as soon as process 0's Isend/Irecv messages complete, instead of after all Isend/Irecvs complete.
Thank you for any help you can provide.
You are only setting one element of the request array, namely request[taskid] (and by the way you overwrite the send request handle with the receive one, irrevocably losing the former). Remember, MPI is used to program distributed memory machines and each MPI process has its own copy of the request array. Setting one element in rank taskid does not magically propagate the value to the other ranks, and even if it does, requests have only local validity. The proper implementation would be:
MPI_Status status[2];
MPI_Request request[2];
MPI_Init(&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &taskid);
MPI_Comm_size (MPI_COMM_WORLD, &numtasks);
/* Generate Data to send */
//Isend/Irecvs look like this:
MPI_Isend (&data, count, MPI_INT, dest, tag, MPI_COMM_WORLD, &request[0]);
// ^^^^
// ||
// data race !!
// ||
// vvvv
MPI_Irecv (&data, count, MPI_INT, source, tag, MPI_COMM_WORLD, &request[1]);
// Wait for both operations to complete
MPI_Waitall(2, request, status);
/* Calculations and such */
// Wait for all processes to reach this line in the code
MPI_Barrier(MPI_COMM_WORLD);
if (taskid == 0) {
printf ("All done!\n");
}
MPI_Finalize();
By the way, there is a data race in your code. Both MPI_Isend and MPI_Irecv are using the same data buffer, which is incorrect. If you are simply trying to send the content of data to dest and then receive into it from source, then use MPI_Sendrecv_replace instead and forget about the non-blocking operations:
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &taskid);
MPI_Comm_size (MPI_COMM_WORLD, &numtasks);
/* Generate Data to send */
MPI_Sendrecv_replace (&data, count, MPI_INT, dest, tag, source, tag,
MPI_COMM_WORLD, &status);
/* Calculations and such */
// Wait for all processes to reach this line in the code
MPI_Barrier(MPI_COMM_WORLD);
if (taskid == 0) {
printf ("All done!\n");
}
MPI_Finalize();

Reason for MPI 'rc' variable and MPI_Get_Count()

In the ping-pong program below, what use is the rc variable? It is constantly updated but never used.
Plus what does MPI_Get_Count() do?
#include "mpi.h"
#include <stdio.h>
int main(int argc, char * argv [])
int numtasks, rank, dest, source, rc, count, tag=1;
char inmsg, outmsg;
MPI_Status Stat ;
MPI_Init (&argc,&argv);
MPI_Comm_size (MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
if (rank == 0) {
dest = source = 1;outmsg=’x’;
rc = MPI_Send (&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
rc = MPI_Recv (&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
}
else if (rank == 1) {
dest = source = 0;outmsg=’y’;
rc = MPI_Recv (&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
rc = MPI_Send (&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
}
rc = MPI_Get_count (&Stat, MPI_CHAR, &count);
printf("Task %d: Received %d char(s) from task %d with tag %d \n", rank, count, Stat.MPI_SOURCE,Stat.MPI_TAG);
MPI_Finalize ();
}
The bit about MPI_Get_count is answered by the documentation in the "Output" section, further down that linked page.
As for rc, here's the best explanation I can offer with no access to the author of this code or any related notes. all MPI routines in the C bindings return an error code. Some compilers check whether one is dropping return values on the floor, as it might indicate an error in the code, and generate warnings where they see that happen. Thus, to keep those warnings from appearing, this code assigns the return value to the variable rc.
That said, many compilers also warn about setting a variable that's never used, which is the case here. An idiom for telling a compiler "yes, I know I'm ignoring this return value, leave me alone" is (void)function_call(foo, bar, baz); (i.e. cast the return value to void). This is most often seen on calls to functions whose return value really should be checked, like write(). Writing it on every MPI call rather than silencing an offending warning would be rather ugly.

MPI Send and receive questions

I have questions about MPI send and receive operations.
Suppose, we have 2 MPI threads that try to send message to each other. Following are three code snippets doing that:
First (Blocking 'send' and 'receive'):
...
int data = ...;
...
MPI_Send( &data, sizeof( int ), MPI_INT,
(my_id == 0)?1:0, 0, MPI_COMM_WORLD );
MPI_Status status;
MPI_Recv( &data, sizeof( int ), MPI_INT,
(my_id == 0)?1:0, 0, MPI_COMM_WORLD, &status );
...
Second (Non-blocking 'send' but blocking 'receive'):
...
int data = ...;
...
MPI_Request request;
MPI_Isend( &data, sizeof( int ), MPI_INT,
(my_id == 0)?1:0, 0, MPI_COMM_WORLD, &request);
MPI_Status status;
MPI_Recv( &data, sizeof( int ), MPI_INT,
(my_id == 0)?1:0, 0, MPI_COMM_WORLD, &status );
// Synchronize sender & receiver
MPI_Wait( &request, &status);
...
Third (Non-blocking 'receive' with blocking 'send'):
...
int data = ...;
...
MPI_Request request;
MPI_Irecv( &data, sizeof( int ), MPI_INT,
(my_id == 0)?1:0, 0, MPI_COMM_WORLD, &request );
MPI_Send( &data, sizeof( int ), MPI_INT,
(my_id == 0)?1:0, 0, MPI_COMM_WORLD);
MPI_Status status;
// Synchronize sender & receiver
MPI_Wait( &request, &status);
...
I guess there are potential problems with above three code but I want your opinion. So, I have the following questions:
What are the (potential) problems (if any) with 3 codes given above?
Which of the above three code are valid/correct considering MPI standard so that it can work with all MPI implementations?
What is the best way (if not one of above 3 please write it) to do that?
In the third code, what if we change the order of MPI_Irecv and MPI_Send call?
PS: By the way, I have tried executing them using Scali MPI and all of them worked!
Your first implementation is likely to cause a deadlock, especially if the comminication is done in synchronized mode (maybe it worked in your tests, because the communication was buffered; it's not likely to be the case for large data).
The other two implementations should work without deadlocking. I believe it's considered better practice to initiate receive operations before sends, so I would personally favour the 3rd implementation. From the MPI standard, section 3.7:
Advice to users
[...]
The message-passing model implies that communication is initiated by the sender. The communication will generally have lower overhead if a receive is already posted when the sender initiates the communication (data can be moved directly to the receive buffer, and there is no need to queue a pending send request). However, a receive operation can complete only after the matching send has occurred. The use of nonblocking receives allows one to achieve lower communication overheads without blocking the receiver while it waits for the send.
The third implementation with order MPI_Send/MPI_Irecv can deadlock in the MPI_Send call for the same reasons as the first implementation.

Resources