Is there any way in MPI to get the total number of bytes transferred by my entire MPI program in C?
The best way is to use an MPI profiling tool such as the lightweight mpiP. More sophisticated, heavyweight tools such as Score-P can do this as well. If you are running your code at an HPC site, check whether one of these tools is already available there.
Not that I know of directly, but you could adapt the following code to your purposes:
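#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <mpi.h>

uint64_t bytes_recv = 0; // running total of bytes received by this process

void CommRecv(MyObject* a){
    MPI_Status status;
    // Wait until a message is pending, without receiving it yet
    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    int msg_size;
    MPI_Get_count(&status, MPI_BYTE, &msg_size);
    bytes_recv += msg_size;
    // Allocate a buffer to hold the incoming data
    char* buf = (char*)malloc(msg_size);
    assert(buf!=NULL);
    // Receive exactly the message that was probed (same source and tag)
    MPI_Recv(buf, msg_size, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    //Do stuff
    free(buf);
}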
The point here is to wrap the standard MPI communication functions with functions that keep track of data transfer statistics. Internally, these functions use MPI_Get_count() to retrieve the size of the incoming message. This is then added to a global variable that tracks communication over all of the wrapped MPI functions.
At the end of the program you can accumulate each instance's global variables on the master process.
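The snippet above only counts received bytes. A send-side wrapper can do the same bookkeeping, since the size of an outgoing message is the element count times the datatype size. The sketch below shows just that bookkeeping in plain C. The comm_stats struct and the function names are hypothetical, and type_size is assumed to come from MPI_Type_size in a real wrapper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-process counters, one instance per MPI process. */
typedef struct {
    uint64_t bytes_sent;
    uint64_t bytes_recv;
} comm_stats;

/* Call from a send wrapper before MPI_Send; type_size is the datatype's
 * size in bytes (assumed to be obtained with MPI_Type_size). */
void stats_record_send(comm_stats *s, int count, int type_size)
{
    assert(count >= 0 && type_size >= 0);
    s->bytes_sent += (uint64_t)count * (uint64_t)type_size;
}

/* Call from a receive wrapper with the byte count from MPI_Get_count. */
void stats_record_recv(comm_stats *s, int msg_size)
{
    assert(msg_size >= 0);
    s->bytes_recv += (uint64_t)msg_size;
}
```

At the end of the run, the per-process totals could be summed on rank 0, for example with MPI_Reduce using MPI_UINT64_T and MPI_SUM.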
I am trying to send data between all processes where I have an array on each process such as
int local_data[] = {0*rank,1*rank,2*rank,3*rank};
I have a corresponding flag array where each value in that array points to which process I should be sending this value, for example:
int part[] = {0,1,3,2};
so this means local_data[0] should go to process with rank 0
local_data[2] should go to process with rank 3 and so on.
The values in the flag array change from one process to the next (all within the range 0 to P-1, where P is the total number of processes available).
Using this, what I am currently doing is:
for(int i=0; i<local_len; i++){
if(part[i] != rank){
MPI_Send(&local_data[i], 1,MPI_INT, part[i], 0, MPI_COMM_WORLD);
MPI_Recv(&temp,1, MPI_INT, rank, 0, MPI_COMM_WORLD, &status );
recvbuf[j] = temp;
j++;
}
else{
recvbuf[j] = local_data[i];
j++;
}
}
Here I only send and receive when part[i] != rank, to avoid a process sending to and receiving from itself.
recvbuf is the array in which each process receives its values. It can be longer than the initial local_data length.
I also tried
MPI_Sendrecv(&local_data[i], 1,MPI_INT, part[i], 0, &temp, 1, MPI_INT, rank, 0, MPI_COMM_WORLD, &status);
The program gets stuck either way.
How do I go about solving this?
Is the All-to-All collective the way to go here?
Your basic problem is that your send call goes to a dynamically determined target, but there is no corresponding logic to determine which processes need to do a receive at all, and if so, from where.
If the logic of your application implies that everyone will send to everyone, then you can use MPI_Alltoall.
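If every process sends to only some of the others, but you know that each will receive exactly four messages, then you can combine MPI_Isend for the sends with MPI_Recv from MPI_ANY_SOURCE. You need MPI_Isend because, strictly speaking, the version with blocking MPI_Send can deadlock: the standard allows MPI_Send to block until the matching receive has been posted. It may happen to work if your MPI implementation has an "eager mode" for small messages.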
If the number of sends and the targets are entirely random, then you need something like MPI_Ibarrier to detect that all is over and done.
But I suspect you're leaving out major information here. Why is the length of local_data 4? Is the part array a permutation? Et cetera.
Following @GillesGouaillardet's advice, I used MPI_Alltoallv to solve this problem.
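The thread does not show the MPI_Alltoallv setup, so here is a minimal sketch of the bookkeeping it needs, assuming the values are grouped by destination rank before the call. The helper name build_send_layout and its parameters are hypothetical. Only the packing arithmetic is shown, and the MPI calls themselves are left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: groups local_data by destination rank so that it can
 * be handed to MPI_Alltoallv. sendcounts and sdispls must each hold nprocs
 * ints; sendbuf must hold n ints. Returns 0 on success, -1 on bad input. */
int build_send_layout(const int *local_data, const int *part, int n,
                      int nprocs, int *sendcounts, int *sdispls, int *sendbuf)
{
    assert(nprocs > 0 && n >= 0);

    /* Count how many values go to each destination rank. */
    for (int p = 0; p < nprocs; p++) sendcounts[p] = 0;
    for (int i = 0; i < n; i++) {
        if (part[i] < 0 || part[i] >= nprocs) return -1;
        sendcounts[part[i]]++;
    }

    /* Displacements are the exclusive prefix sum of the counts. */
    sdispls[0] = 0;
    for (int p = 1; p < nprocs; p++)
        sdispls[p] = sdispls[p - 1] + sendcounts[p - 1];

    /* Pack each value into the next free slot of its destination's segment. */
    int *next = malloc((size_t)nprocs * sizeof *next);
    if (next == NULL) return -1;
    for (int p = 0; p < nprocs; p++) next[p] = sdispls[p];
    for (int i = 0; i < n; i++) sendbuf[next[part[i]]++] = local_data[i];
    free(next);
    return 0;
}
```

Each rank would then typically exchange sendcounts with MPI_Alltoall (one int per rank) to learn its receive counts, build the receive displacements with the same prefix sum, and pass the packed buffer to MPI_Alltoallv.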
What I want to achieve is to broadcast partial result to other threads and receive other threads' result at a different line of code, it can be expressed as the following pseudo code:
if have any incoming message:
    read the message and compare it with the local optimal
    if is optimal:
        update the local optimal
calculate local result
if local result is better than local optimal:
    update local optimal
    send the local optimal to others
The question is, MPI_Bcast/MPI_Ibcast do the send and receive in the same place, what I want is separate send and receive. I wonder if MPI has builtin support for my purpose, or if I can only achieve this by calling MPI_Send/MPI_Isend in a for loop?
What I want to achieve is to broadcast partial result to other threads
and receive other threads' result at a different line of code, it can
be expressed as the following pseudo code:
Typically, in MPI and in this context, one tends to use the term process rather than thread:
The question is, MPI_Bcast/MPI_Ibcast do the send and receive in the
same place, what I want is separate send and receive.
This is the typical use case for MPI_Allreduce:
Combines values from all processes and distributes the result back to
all processes
So an example that illustrates your pseudo code:
#include <stdio.h>
#include <mpi.h>

int main(int argc,char *argv[]){
    MPI_Init(NULL,NULL); // Initialize the MPI environment
    int world_rank;
    int world_size;
    MPI_Comm_rank(MPI_COMM_WORLD,&world_rank);
    MPI_Comm_size(MPI_COMM_WORLD,&world_size);
    int my_local_optimal = world_rank;
    // MPI_IN_PLACE: send and receive buffers must not alias, so reduce in place
    MPI_Allreduce(MPI_IN_PLACE, &my_local_optimal, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    printf("Step 1 : Process %d -> max local %d \n", world_rank, my_local_optimal);
    my_local_optimal += world_rank * world_size;
    MPI_Allreduce(MPI_IN_PLACE, &my_local_optimal, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    printf("Step 2 : Process %d -> max local %d \n", world_rank, my_local_optimal);
    MPI_Finalize();
    return 0;
}
So all processes start with a local optimal:
int my_local_optimal = world_rank;
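then they perform an MPI_Allreduce:
MPI_Allreduce(MPI_IN_PLACE, &my_local_optimal, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
which computes the maximum (i.e., MPI_MAX) of the variable my_local_optimal over all processes and stores that value in my_local_optimal on every process. MPI_IN_PLACE is passed as the send buffer because MPI does not allow the same variable to be used as both the send and the receive buffer.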
Conceptually, the difference between the aforementioned approach and:
if have any incoming message:
    read the message and compare it with the local optimal
    if is optimal:
        update the local optimal
is that you neither explicitly check "if have any incoming message:" nor "if is optimal": you just calculate the max among the processes and update the local optimal accordingly. This makes the approach much simpler to handle.
In my example I used MPI_MAX; in your code, you need to use the operation that defines what is optimal.
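To make the two steps concrete, the following serial sketch replays the arithmetic of the example for a given number of processes without running MPI. The function names are hypothetical, and the loop over ranks only stands in for what MPI_Allreduce with MPI_MAX computes collectively.

```c
#include <assert.h>

/* Stand-in for MPI_Allreduce with MPI_MAX: maximum over per-rank values. */
static int max_over_ranks(const int *values, int world_size)
{
    assert(world_size > 0);
    int best = values[0];
    for (int r = 1; r < world_size; r++)
        if (values[r] > best) best = values[r];
    return best;
}

/* Replays the example: returns the value every rank holds after step 2
 * and stores the step 1 result in *step1. This sketch supports up to
 * 64 simulated ranks. */
int replay_example(int world_size, int *step1)
{
    int local[64];
    assert(world_size > 0 && world_size <= 64);

    /* my_local_optimal = world_rank */
    for (int r = 0; r < world_size; r++) local[r] = r;
    /* first reduction: every rank ends up with the same maximum */
    *step1 = max_over_ranks(local, world_size);

    /* my_local_optimal += world_rank * world_size */
    for (int r = 0; r < world_size; r++)
        local[r] = *step1 + r * world_size;
    /* second reduction */
    return max_over_ranks(local, world_size);
}
```

With four processes, every rank would report 3 in step 1 (the highest rank) and 15 in step 2 (3 + 3 * 4).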
I have a master process and several slave processes. I want every slave process to send one integer back to the master, so I guess I should gather them using MPI_Gather. But somehow it doesn't work, and I have started to think that MPI_Gather is incompatible with MPI_Send.
The relevant lines of code look like this:
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &process_id);
MPI_Comm_size(MPI_COMM_WORLD, &process_count);
int full_word_count = 0;
int* receiving_buffer = (int*)malloc(sizeof(int) * 100);
if (process_id == 0)
{
// Some Master code here ...
MPI_Gather(full_word_count, 1, MPI_INT, receiving_buffer, 1, MPI_INT, 0, MPI_COMM_WORLD);
// ...
}
else
{
// Some Slave code here ...
MPI_Send(full_word_count, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
//...
}
MPI_Finalize();
I used "1" as the count for MPI_Gather because I only ran with two processes, so process 1 would send and process 0 would gather; of course, for more processes I should modify it using ranks. But my main question is whether I can combine MPI_Gather with MPI_Send in a situation like this and, if so, how.
MPI_Gather() is a collective operation and must hence be called by all the ranks of the communicator. They also must provide matching signatures (datatype and count) and all use the same root value.
Note the send buffer of the root rank is also gathered into the receive buffers, so if the send count is 1, you really should allocate your receive buffer with
int* receiving_buffer = (int*)malloc(sizeof(int) * process_count);
and since all ranks send 1 * MPI_INT, the correct receive signature is also 1 * MPI_INT.
Also note that "threads" is improper in this context. MPI tasks or MPI processes are the right terminology.
Keep in mind that the standard does not specify how a collective operation should be implemented. In the case of MPI_Gather(), a naive implementation would have all MPI tasks send their buffer to the root rank. But some more sophisticated algorithm can be used such as a tree-based gather, and in that case, not all tasks would send their buffer to the root rank.
So just today I started messing around with the MPI library in C and I've tried it out some and have now found myself in a situation where I need the following:
A routine that'll send a message to a random process in a blocking receive while leaving the others still blocked.
Does such a routine exist? If not, how can something like this be accomplished?
No, such a routine does not exist. However, you can easily build one from the routines available in the MPI standard. For example, if you want a routine that sends to a random process other than the current one, you can write the following:
// Sends `size` bytes to a randomly chosen rank other than the caller.
// Requires at least two processes in comm.
int MPI_SendRand(void *data, int size, int tag, MPI_Comm comm) {
    int comm_size, my_rank, dst;
    MPI_Comm_rank(comm, &my_rank);
    MPI_Comm_size(comm, &comm_size);
    // random rank in [0, comm_size) excluding my_rank
    do {
        dst = rand() % comm_size;
    } while (dst == my_rank);
    return MPI_Send(data, size, MPI_BYTE, dst, tag, comm);
}
It can be used as follows:
if (rank == master) {
    // one process sends
    MPI_SendRand(some_data, some_size, 0, MPI_COMM_WORLD);
} else {
    // the rest waits
    MPI_Recv(some_buff, some_size, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // do work...
}
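The retry loop in MPI_SendRand can be avoided by drawing from the comm_size - 1 valid candidates directly. The sketch below shows one way to do that in plain C. The helper name pick_other_rank is hypothetical, and rand() with modulo is used only for brevity (it has a slight bias and should be seeded differently on each process).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: returns a rank in [0, comm_size) other than my_rank.
 * Requires comm_size >= 2. */
int pick_other_rank(int my_rank, int comm_size)
{
    assert(comm_size >= 2 && my_rank >= 0 && my_rank < comm_size);
    /* Draw from the comm_size - 1 candidates, then skip over my_rank. */
    int dst = rand() % (comm_size - 1);
    if (dst >= my_rank) dst++;
    return dst;
}
```

The result can be passed as the destination argument of MPI_Send in place of the value computed by the loop.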
I am trying to modify my program so the code looks better. Right now I am using MPI_Send and MPI_Recv, but I am trying to make it work with MPI_Scatter. I have an array called All_vals and I am trying to send Sub_arr_len equal parts to every slave.
So after I first find the number of values I should send to every slave, I have to send and receive the number of values and then send and receive the values themselves. I am trying to change those send/recv calls to MPI_Scatter, and I also need to handle the case where the number of values cannot be divided into equal parts, such as 20 numbers with 3 processes. The slaves should then get 20/3=6 values each, but the master should get the rest, 20-2*6=8.
Here's the part I am trying to edit:
int master(int argc, char **argv){
    ...
    for (i=0; i<ntasks; i++)
    {
        Sub_arr_len[i]=Num_vals/ntasks; //Finding the number of values I should send to every process
        printf("\n\nIn range %d there are %d\n\n",i,Sub_arr_len[i]);
    }
    for(i=1;i<ntasks;i++) {
        Sub_len = Sub_arr_len[i];
        // MPI_INT is the C datatype; MPI_INTEGER is the Fortran one
        MPI_Isend(&Sub_len,1, MPI_INT, i, i, MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
        Sub_arr_start += Sub_arr_len[i-1];
        printf("\r\nSub_arr_start = %d ",Sub_arr_start);
        MPI_Isend(&All_vals[Sub_arr_start],Sub_len, MPI_INT, i, i, MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
    }
    ...
}
int slave(){
    MPI_Irecv(&Sub_arr_len, 1, MPI_INT, 0, myrank, MPI_COMM_WORLD, &request);
    MPI_Wait(&request,&status);
    printf("\r\nSLAVE:Sub_arr_len = %d\n\n",Sub_arr_len);
    All_vals = (int*) malloc(Sub_arr_len * sizeof(int));
    MPI_Irecv(&All_vals[0], Sub_arr_len, MPI_INT, 0, myrank, MPI_COMM_WORLD, &request);
    MPI_Wait(&request,&status);
}
I am trying to set up the scatter, but I am doing something wrong, so I would love it if someone could help me build it.
With MPI_Scatter, each rank involved must receive the same number of elements, including the root. Generally the root process is a normal participant in the scatter operation and "receives" its own chunk. This means you also need to specify a receive buffer for the root process. You basically have the following options:
1. Use MPI_IN_PLACE as recvbuf on the root, so that the root does not send anything to itself. Combine that with scattering a tail of the original send buffer from the root, chosen so that this tail has a number of elements divisible by the number of processes. E.g. for 20 elements and 3 processes, scatter &All_vals[2], which has a total of 18 elements, 6 for each rank (including the root again). The root can then use All_vals[0] through All_vals[7].
2. Pad the send buffer with a few unused elements so that its length is divisible by the number of processes.
3. Use MPI_Scatterv to send unequal numbers of elements to each rank. This requires setting up the send counts and displacements for each rank.
The last option has some distinct advantages. It creates the least load imbalance among processes and allows all processes to use the same code. If applicable, the second option is very simple and still has good load balance.
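For the MPI_Scatterv option, the send counts and displacements can be computed with a few lines of integer arithmetic. The sketch below is one possible layout, assuming the remainder is spread over the lowest ranks. The helper name scatterv_layout is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: splits n elements over nprocs ranks as evenly as
 * possible. The first n % nprocs ranks get one extra element.
 * sendcounts and displs must each hold nprocs ints. */
void scatterv_layout(int n, int nprocs, int *sendcounts, int *displs)
{
    assert(n >= 0 && nprocs > 0);
    int base = n / nprocs;   /* elements every rank gets */
    int extra = n % nprocs;  /* leftover elements to distribute */
    int offset = 0;
    for (int r = 0; r < nprocs; r++) {
        sendcounts[r] = base + (r < extra ? 1 : 0);
        displs[r] = offset;  /* start index of rank r's chunk */
        offset += sendcounts[r];
    }
}
```

For 20 values and 3 processes this gives counts of 7, 7 and 6 at offsets 0, 7 and 14. Since every rank can run the same computation, each one knows its own receive count (sendcounts[rank]) without a separate message for the length, and the root can pass both arrays to MPI_Scatterv.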