I have questions about MPI send and receive operations.
Suppose we have two MPI threads that try to send a message to each other. The following are three code snippets doing that:
First (Blocking 'send' and 'receive'):
...
int data = ...;
...
MPI_Send( &data, sizeof( int ), MPI_INT,
          (my_id == 0)?1:0, 0, MPI_COMM_WORLD );
MPI_Status status;
MPI_Recv( &data, sizeof( int ), MPI_INT,
          (my_id == 0)?1:0, 0, MPI_COMM_WORLD, &status );
...
Second (Non-blocking 'send' but blocking 'receive'):
...
int data = ...;
...
MPI_Request request;
MPI_Isend( &data, sizeof( int ), MPI_INT,
           (my_id == 0)?1:0, 0, MPI_COMM_WORLD, &request );
MPI_Status status;
MPI_Recv( &data, sizeof( int ), MPI_INT,
          (my_id == 0)?1:0, 0, MPI_COMM_WORLD, &status );
// Synchronize sender & receiver
MPI_Wait( &request, &status );
...
Third (Non-blocking 'receive' with blocking 'send'):
...
int data = ...;
...
MPI_Request request;
MPI_Irecv( &data, sizeof( int ), MPI_INT,
           (my_id == 0)?1:0, 0, MPI_COMM_WORLD, &request );
MPI_Send( &data, sizeof( int ), MPI_INT,
          (my_id == 0)?1:0, 0, MPI_COMM_WORLD );
MPI_Status status;
// Synchronize sender & receiver
MPI_Wait( &request, &status );
...
I suspect there are potential problems with the three code snippets above, but I want your opinion. So I have the following questions:
What are the potential problems (if any) with the three code snippets given above?
Which of the above three snippets are valid/correct according to the MPI standard, so that they work with all MPI implementations?
What is the best way to do this (if it is not one of the above three, please write it)?
In the third snippet, what happens if we change the order of the MPI_Irecv and MPI_Send calls?
PS: By the way, I have tried executing them using Scali MPI and all of them worked!
Your first implementation is likely to cause a deadlock, especially if the communication is done in synchronous mode (maybe it worked in your tests because the communication was buffered; that is not likely to be the case for large data).
The other two implementations should work without deadlocking. I believe it's considered better practice to initiate receive operations before sends, so I would personally favour the 3rd implementation. From the MPI standard, section 3.7:
Advice to users
[...]
The message-passing model implies that communication is initiated by the sender. The communication will generally have lower overhead if a receive is already posted when the sender initiates the communication (data can be moved directly to the receive buffer, and there is no need to queue a pending send request). However, a receive operation can complete only after the matching send has occurred. The use of nonblocking receives allows one to achieve lower communication overheads without blocking the receiver while it waits for the send.
The third implementation with the order reversed (MPI_Send before MPI_Irecv) can deadlock in the MPI_Send call for the same reason as the first implementation.
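For concreteness, here is a minimal sketch of the favoured pattern (receive posted first, then the send, then the wait), using separate send and receive buffers and a hypothetical peer variable. Note also that the count argument should be the number of elements (1 here), not sizeof(int), since MPI counts elements of the given datatype rather than bytes:
int peer = (my_id == 0) ? 1 : 0;   // the other rank, as in the snippets above
int send_data = my_id;             // illustrative payload
int recv_data;

MPI_Request request;
MPI_Status status;

MPI_Irecv( &recv_data, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &request );
MPI_Send( &send_data, 1, MPI_INT, peer, 0, MPI_COMM_WORLD );
MPI_Wait( &request, &status );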
I have a master process and several slave processes. I want every slave process to send one integer back to the master, so I guess I should gather them using MPI_Gather. But somehow it doesn't work, and I have started to think that MPI_Gather is incompatible with MPI_Send.
The relevant lines of code look like this:
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &process_id);
MPI_Comm_size(MPI_COMM_WORLD, &process_count);
int full_word_count = 0;
int* receiving_buffer = (int*)malloc(sizeof(int) * 100);
if (process_id == 0)
{
    // Some Master code here ...
    MPI_Gather(full_word_count, 1, MPI_INT, receiving_buffer, 1, MPI_INT, 0, MPI_COMM_WORLD);
    // ...
}
else
{
    // Some Slave code here ...
    MPI_Send(full_word_count, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    //...
}
MPI_Finalize();
I know that I used "1" for MPI_Gather because I tried to run it with only two processes, so process 1 would send and process 0 would gather; of course, for more processes I should modify it using ranks. But my main question here is whether (and if yes, how) I can use MPI_Gather combined with MPI_Send for a situation like this.
MPI_Gather() is a collective operation and must hence be called by all the ranks of the communicator. They also must provide matching signatures (datatype and count) and all use the same root value.
Note that the send buffer of the root rank is also gathered into the receive buffer, so if the send count is 1, you really should allocate your receive buffer with
int* receiving_buffer = (int*)malloc(sizeof(int) * process_count);
and since all ranks send 1 * MPI_INT, a correct receive signature is also 1 * MPI_INT.
Also note that "threads" is improper in this context. MPI tasks or MPI processes are the right terminology.
Keep in mind that the standard does not specify how a collective operation should be implemented. In the case of MPI_Gather(), a naive implementation would have all MPI tasks send their buffer to the root rank, but a more sophisticated algorithm can be used, such as a tree-based gather, in which case not all tasks would send their buffer directly to the root rank.
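For illustration, a minimal corrected sketch (assuming every rank, master and slaves alike, computes its own full_word_count, and reusing the variable names from the question) could look like:
int full_word_count = 0;            // computed by each rank
int* receiving_buffer = NULL;
if (process_id == 0)                // the receive buffer only matters at the root
    receiving_buffer = (int*)malloc(sizeof(int) * process_count);

// Every rank calls MPI_Gather with the same counts, types and root.
// Note the & on the send argument; the root's own value ends up in slot 0.
MPI_Gather(&full_word_count, 1, MPI_INT,
           receiving_buffer, 1, MPI_INT,
           0, MPI_COMM_WORLD);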
I am trying to implement some form of persistent communication. Somehow the following code keeps hanging - I guess I must have introduced a deadlock, but I can't really wrap my head around it...
MPI_Request r[4];
[...]
MPI_Send_init(&Arr[1][1], 1, MPI_DOUBLE, 1, A, MPI_COMM_WORLD, &r[0]);
MPI_Recv_init(&Arr[1][0], 1, MPI_DOUBLE, 0, A, MPI_COMM_WORLD, &r[1]);
MPI_Send_init(&Arr[2][1], 1, MPI_DOUBLE, 0, B, MPI_COMM_WORLD, &r[2]);
MPI_Recv_init(&Arr[2][0], 1, MPI_DOUBLE, 1, B, MPI_COMM_WORLD, &r[3]);
[...]
MPI_Startall(4, r);
MPI_Waitall(4, r, MPI_STATUSES_IGNORE);
I think this is perfect material for a deadlock - what would be the remedy here if I want to initialize these send/receive requests and just start them all later with MPI_Startall and MPI_Waitall?
EDIT: So if I do
MPI_Start(&r[0]);
MPI_Wait(&r[0], &status);
Then it does not hang. Invoking something like:
for (int k=0; k<1; k++)
{
MPI_Start(&r[k]);
MPI_Wait(&r[k], &status);
}
fails and hangs, if that helps.
Your tags do not match. For example, rank 0 receives from itself with tag A, but it sends to itself with tag B.
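Assuming the intent is simply for two ranks to exchange one double per tag, a hedged sketch with matching tags and ranks (the partner arithmetic only works for exactly two processes) could be:
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int partner = 1 - rank;             // only valid for exactly two ranks

MPI_Request r[4];
// Each rank sends with a given tag and receives with the same tag from its
// partner, so every send has a matching receive on the other side.
MPI_Send_init(&Arr[1][1], 1, MPI_DOUBLE, partner, A, MPI_COMM_WORLD, &r[0]);
MPI_Recv_init(&Arr[1][0], 1, MPI_DOUBLE, partner, A, MPI_COMM_WORLD, &r[1]);
MPI_Send_init(&Arr[2][1], 1, MPI_DOUBLE, partner, B, MPI_COMM_WORLD, &r[2]);
MPI_Recv_init(&Arr[2][0], 1, MPI_DOUBLE, partner, B, MPI_COMM_WORLD, &r[3]);

MPI_Startall(4, r);
MPI_Waitall(4, r, MPI_STATUSES_IGNORE);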
I have to admit I'm not familiar with the concept of MPI requests and the MPI_Send/Recv_init. However, I could reproduce the deadlock with simple sends and receives. This is the code (it has a deadlock):
double someVal = 3.5;
const int MY_FIRST_TAG = 42;
MPI_Send(&someVal, 1, MPI_DOUBLE, 1, MY_FIRST_TAG, MPI_COMM_WORLD);
MPI_Recv(&someVal, 1, MPI_DOUBLE, 0, MY_FIRST_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
Even if you run it with only two processes, the problem is the following: both process 0 and process 1 send a message to process 1. Then both processes want to receive a message from process 0. Process 1 can, because process 0 actually sent a message to process 1. But nobody sent a message to process 0. Consequently, that process will wait there forever.
How to fix: You need to specify that only process 0 sends to process 1 and only process 1 is supposed to receive from process 0. You can simply do it with:
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
    MPI_Send(&someVal, 1, MPI_DOUBLE, 1, MY_FIRST_TAG, MPI_COMM_WORLD);
else // Assumption: Only two processes
    MPI_Recv(&someVal, 1, MPI_DOUBLE, 0, MY_FIRST_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
I'm not 100% sure how to translate this to the concept of requests and MPI_Send/Recv_init, but maybe this helps you nonetheless.
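That said, a hedged sketch of the same if/else fix expressed with persistent requests (again assuming exactly two processes) might look like:
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

MPI_Request req;
if (rank == 0)
    MPI_Send_init(&someVal, 1, MPI_DOUBLE, 1, MY_FIRST_TAG, MPI_COMM_WORLD, &req);
else // Assumption: only two processes
    MPI_Recv_init(&someVal, 1, MPI_DOUBLE, 0, MY_FIRST_TAG, MPI_COMM_WORLD, &req);

MPI_Start(&req);
MPI_Wait(&req, MPI_STATUS_IGNORE);
MPI_Request_free(&req);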
Is there any way in MPI to get the total number of bytes transferred by my entire MPI program in C?
The best way is to use an MPI profiling tool such as the simple mpiP. There are more sophisticated/heavyweight tools that can also do that, such as Score-P. If you are running your code at an HPC site, you should check what is already available there.
Not that I know of directly, but you could adapt the following code to your purposes:
uint64_t bytes_recv = 0;

void CommRecv(MyObject* a){
    MPI_Status status;

    // Block until a message from any source is available, without receiving it yet
    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

    // Query the size of the probed message and add it to the global counter
    int msg_size;
    MPI_Get_count(&status, MPI_BYTE, &msg_size);
    bytes_recv += msg_size;

    // Allocate a buffer to hold the incoming data
    char* buf = (char*)malloc(msg_size);
    assert(buf != NULL);

    // Receive the message that was just probed (source and tag taken from the status)
    MPI_Recv(buf, msg_size, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // Do stuff
    free(buf);
}
The point here is to wrap the standard MPI communication functions with functions that keep track of data transfer statistics. Internally, these functions use MPI_Get_count() to retrieve the size of the incoming message. This is then added to a global variable that tracks communication over all of the wrapped MPI functions.
At the end of the program you can accumulate each instance's global variables on the master process.
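For instance, a hedged sketch of that final aggregation (assuming the bytes_recv counter above and a rank obtained from MPI_Comm_rank) could be:
// Sum the per-rank counters on rank 0 just before MPI_Finalize()
// (MPI_UINT64_T requires MPI 2.2 or later).
uint64_t total_bytes = 0;
MPI_Reduce(&bytes_recv, &total_bytes, 1, MPI_UINT64_T, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("Total bytes received across all ranks: %llu\n",
           (unsigned long long)total_bytes);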
So just today I started messing around with the MPI library in C, I've tried it out a bit, and I have now found myself in a situation where I need the following:
A routine that will send a message to one random process that is sitting in a blocking receive, while leaving the other processes still blocked.
Does such a routine exist? If not, how can something like this be accomplished?
No, such a routine does not exist. However, you can easily build one using the routines available in the MPI standard. For example, if you want a routine that sends to a random process other than the current one, you can write the following:
int MPI_SendRand(void *data, int size, int tag, MPI_Comm comm) {
    // one process sends to a randomly chosen peer
    int comm_size, my_rank, dst;
    MPI_Comm_rank(comm, &my_rank);
    MPI_Comm_size(comm, &comm_size);
    // pick a random destination in [0, comm_size) excluding my_rank
    do {
        dst = rand() % comm_size;
    } while (dst == my_rank);
    return MPI_Send(data, size, MPI_BYTE, dst, tag, comm);
}
It can be used as follows:
if (rank == master) {
    MPI_SendRand(some_data, some_size, 0, MPI_COMM_WORLD);
} else {
    // the rest waits
    MPI_Recv(some_buff, some_size, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // do work...
}
Essentially what I am looking for here is a simple MPI_SendRecv() routine that allows me to synchronize the same buffer by specifying a source and a destination processor.
In my mind the function call for my Ideal_MPI_SendRecv() function would look precisely like MPI_Bcast() but would contain a destination process instead of a Communicator.
It might be called as follows:
Ideal_MPI_SendRecv(&somebuffer, bufferlength, datatype, source_proc, destination_proc);
If not, is there any reason? It seems like this would be the perfect way to synchronize a variable's value between two processes.
No, there is no such call in MPI since it is trivial to implement it using point-to-point communication. Of course you could write one, for example (with some rudimentary support for error handling):
// Just a random tag that is unlikely to be used by the rest of the program
#define TAG_IDEAL_SNDRCV 11223

int Ideal_MPI_SendRecv(void *buf, int count, MPI_Datatype datatype,
                       int source, int dest, MPI_Comm comm)
{
    int rank;
    int err = MPI_SUCCESS;   // ranks other than source/dest simply return success

    if (source == dest)
        return MPI_SUCCESS;

    err = MPI_Comm_rank(comm, &rank);
    if (err != MPI_SUCCESS)
        return err;

    if (rank == source)
        err = MPI_Send(buf, count, datatype, dest, TAG_IDEAL_SNDRCV, comm);
    else if (rank == dest)
        err = MPI_Recv(buf, count, datatype, source, TAG_IDEAL_SNDRCV, comm,
                       MPI_STATUS_IGNORE);

    return err;
}
// Example: transfer 'int buf[10]' from rank 0 to rank 2
Ideal_MPI_SendRecv(buf, 10, MPI_INT, 0, 2, MPI_COMM_WORLD);
You could also add another output argument of type MPI_Status * and store the status of MPI_Recv there. It could be useful if both processes have different buffer sizes.
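For example, a hedged sketch of such a variant (hypothetical name Ideal_MPI_SendRecv_Status; the status argument is only filled in on the destination rank) might look like this:
int Ideal_MPI_SendRecv_Status(void *buf, int count, MPI_Datatype datatype,
                              int source, int dest, MPI_Comm comm,
                              MPI_Status *status)
{
    int rank;
    int err = MPI_SUCCESS;

    if (source == dest)
        return MPI_SUCCESS;

    err = MPI_Comm_rank(comm, &rank);
    if (err != MPI_SUCCESS)
        return err;

    if (rank == source)
        err = MPI_Send(buf, count, datatype, dest, TAG_IDEAL_SNDRCV, comm);
    else if (rank == dest)
        // The receiver can later call MPI_Get_count(status, datatype, &n)
        // to find out how many elements actually arrived.
        err = MPI_Recv(buf, count, datatype, source, TAG_IDEAL_SNDRCV, comm, status);

    return err;
}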
Another option, if you have to do this many times between a fixed pair of ranks, e.g. always from rank 0 to rank 2, would be to simply create a new communicator and broadcast inside it:
int rank;
MPI_Comm buddycomm;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
// Ranks 0 and 2 end up in buddycomm; all other ranks pass MPI_UNDEFINED and
// get MPI_COMM_NULL, so only ranks 0 and 2 may make the broadcast call below.
MPI_Comm_split(MPI_COMM_WORLD, (!rank || rank == 2) ? 0 : MPI_UNDEFINED, rank,
               &buddycomm);
// Transfer 'int buf[10]' from rank 0 to rank 2
MPI_Bcast(buf, 10, MPI_INT, 0, buddycomm);
This, of course, is overkill, since the broadcast is more expensive than the simple combination of MPI_Send and MPI_Recv.
Perhaps you want to call MPI_Send on one process (the source process, with the values you want) and MPI_Recv on another process (the one which doesn't initially have the values you want)?
If not, could you clarify how what you're trying to accomplish differs from a simple point-to-point message?