MPI_Scatter usage - C

I am trying to modify my program so the code looks better. Right now I am using MPI_Send and MPI_Recv, but I am trying to make it work with MPI_Scatter. I have an array called All_vals and I am trying to send Sub_arr_len equal parts to every slave.
So after I first find the number of values I should send to every slave, I have to send and receive that count and then send and receive the values themselves. I am trying to change those send/recv calls to MPI_Scatter, and to handle the case where the number of values cannot be divided into equal parts, e.g. when I have 20 numbers but 3 processes. The slaves should then have 20/3 = 6 values each, while the master should have the rest, 20 - 2*6 = 8.
Here's the part I am trying to edit:
int master(int argc, char **argv){
    ...
    for (i = 0; i < ntasks; i++)
    {
        Sub_arr_len[i] = Num_vals / ntasks; //Finding the number of values I should send to every process
        printf("\n\nIn range %d there are %d\n\n", i, Sub_arr_len[i]);
    }
    for (i = 1; i < ntasks; i++) {
        Sub_len = Sub_arr_len[i];
        MPI_Isend(&Sub_len, 1, MPI_INTEGER, i, i, MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
        Sub_arr_start += Sub_arr_len[i-1];
        printf("\r\nSub_arr_start = %d ", Sub_arr_start);
        MPI_Isend(&All_vals[Sub_arr_start], Sub_len, MPI_INTEGER, i, i, MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
    }
    ...
}
int slave(){
    MPI_Irecv(&Sub_arr_len, 1, MPI_INTEGER, 0, myrank, MPI_COMM_WORLD, &request);
    MPI_Wait(&request, &status);
    printf("\r\nSLAVE:Sub_arr_len = %d\n\n", Sub_arr_len);
    All_vals = (int*) malloc(Sub_arr_len * sizeof(MPI_INTEGER));
    MPI_Irecv(&All_vals[0], Sub_arr_len, MPI_INTEGER, 0, myrank, MPI_COMM_WORLD, &request);
    MPI_Wait(&request, &status);
}
I am trying to make the scatter work, but I am doing something wrong, so I would appreciate it if someone could help me build it.

With MPI_Scatter, each rank involved must receive the same number of elements, including the root. Generally the root process is a normal participant in the scatter operation and "receives" its own chunk, so you also need to specify a receive buffer for the root process. You basically have the following options:
Use MPI_IN_PLACE as recvbuf on the root; by doing that, the root will not send anything to itself. Combine that with scattering a tail of the original send buffer from the root such that this "tail" has a number of elements divisible by the number of processes. E.g. for 20 elements, scatter &All_vals[2], which has a total of 18 elements, 6 for each rank (including the root again). The root can then use All_vals[0-7].
Pad the send buffer with a few elements that don't do anything, such that it becomes divisible by the number of processes.
Use MPI_Scatterv to send unequal numbers of elements to each rank; a sketch of how to set up the send counts and displacements follows below.
The last option has some distinct advantages. It creates the least load imbalance among processes and allows all processes to use the same code. If applicable, the second option is very simple and still gives a good load balance.
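A minimal sketch of the MPI_Scatterv variant, using the 20-values / 3-processes example from the question; apart from All_vals and Num_vals, the names (sendcounts, displs, my_vals) are introduced here purely for illustration:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Sketch only: distribute Num_vals ints with MPI_Scatterv so the root keeps
 * the remainder, as in the 20-values / 3-processes example above. */
int main(int argc, char **argv)
{
    int ntasks, myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    int Num_vals = 20;
    int *All_vals = NULL;
    if (myrank == 0) {                              /* only the root owns the full array */
        All_vals = malloc(Num_vals * sizeof(int));
        for (int i = 0; i < Num_vals; i++) All_vals[i] = i;
    }

    int *sendcounts = malloc(ntasks * sizeof(int));
    int *displs = malloc(ntasks * sizeof(int));
    int base = Num_vals / ntasks;                   /* e.g. 20 / 3 = 6         */
    sendcounts[0] = Num_vals - base * (ntasks - 1); /* root keeps 20 - 2*6 = 8 */
    displs[0] = 0;
    for (int i = 1; i < ntasks; i++) {
        sendcounts[i] = base;
        displs[i] = displs[i - 1] + sendcounts[i - 1];
    }

    int my_count = sendcounts[myrank];
    int *my_vals = malloc(my_count * sizeof(int));

    /* Every rank, including the root, makes the same call with root 0;
     * All_vals, sendcounts and displs are only read on the root. */
    MPI_Scatterv(All_vals, sendcounts, displs, MPI_INT,
                 my_vals, my_count, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d got %d values\n", myrank, my_count);

    MPI_Finalize();
    return 0;
}
Run with 3 processes, rank 0 ends up with 8 values and ranks 1 and 2 with 6 each, matching the example above.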

Related

MPI send and recv all processes to one another

I am trying to send data between all processes where I have an array on each process such as
int local_data[] = {0*rank,1*rank,2*rank,3*rank};
I have a corresponding flag array where each value in that array indicates which process I should send that value to, for example:
int part[] = {0,1,3,2};
so this means local_data[0] should go to the process with rank 0,
local_data[2] should go to the process with rank 3, and so on.
The values in the flag array change from one process to another (all within the range 0 to P-1, where P is the total number of processes available).
Using this, what I am currently doing is:
for(int i=0; i<local_len; i++){
    if(part[i] != rank){
        MPI_Send(&local_data[i], 1, MPI_INT, part[i], 0, MPI_COMM_WORLD);
        MPI_Recv(&temp, 1, MPI_INT, rank, 0, MPI_COMM_WORLD, &status);
        recvbuf[j] = temp;
        j++;
    }
    else{
        recvbuf[j] = local_data[i];
        j++;
    }
}
where I am only sending and receiving data if part[i] != rank, to avoid sending to and receiving from the same process.
recvbuf is the array I receive the values into on each process. It can be longer than the initial local_data length.
I also tried
MPI_Sendrecv(&local_data[i], 1,MPI_INT, part[i], 0, &temp, 1, MPI_INT, rank, 0, MPI_COMM_WORLD, &status);
The program gets stuck both ways.
How do I go about solving this?
Is the All-to-All collective the way to go here?
Your basic problem is that your send call goes to a dynamically determined target, but there is no corresponding logic to determine which processes need to do a receive at all, and if so, from where.
If the logic of your application implies that everyone will send to everyone, then you can use MPI_Alltoall.
If everyone sends to some, but you know that you will receive exactly four messages, then you can combine MPI_Isend for the sends with MPI_Recv from MPI_ANY_SOURCE (a sketch follows at the end of this answer). Note that you need Isend because, strictly speaking, your code will otherwise deadlock. It may work if your MPI has an "eager mode" for small messages.
If the number of sends and the targets are entirely random, then you need something like MPI_Ibarrier to detect that all is over and done.
But I suspect you're leaving out major information here. Why is the length of local_data 4? Is the part array a permutation? Et cetera.
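For the second option, here is a rough sketch of the Isend + MPI_ANY_SOURCE pattern, assuming (as guessed above) that part[] is effectively a permutation across ranks, so every rank receives exactly local_len values, one of them from itself; the example part[] below assumes 4 processes:
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

int local_len = 4;
int local_data[4] = {0 * rank, 1 * rank, 2 * rank, 3 * rank};
int part[4] = {0, 1, 3, 2};            /* destination rank of each value */
int recvbuf[4];
MPI_Request reqs[4];

for (int i = 0; i < local_len; i++)    /* non-blocking sends cannot deadlock */
    MPI_Isend(&local_data[i], 1, MPI_INT, part[i], 0, MPI_COMM_WORLD, &reqs[i]);

for (int i = 0; i < local_len; i++)    /* accept the incoming values in any order */
    MPI_Recv(&recvbuf[i], 1, MPI_INT, MPI_ANY_SOURCE, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MPI_Waitall(local_len, reqs, MPI_STATUSES_IGNORE);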
Following @GillesGouaillardet's advice, I used MPI_Alltoallv to solve this problem.
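A sketch of how that MPI_Alltoallv solution can look; the counting and packing logic below is an assumption about the general case where part[] may route several values to the same rank:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* data and destination map as in the question; run with 4 ranks */
    int local_len = 4;
    int local_data[4] = {0 * rank, 1 * rank, 2 * rank, 3 * rank};
    int part[4] = {0, 1, 3, 2};

    int *sendcounts = calloc(nprocs, sizeof(int));
    int *recvcounts = malloc(nprocs * sizeof(int));
    int *sdispls = malloc(nprocs * sizeof(int));
    int *rdispls = malloc(nprocs * sizeof(int));

    for (int i = 0; i < local_len; i++)   /* how many values go to each rank */
        sendcounts[part[i]]++;

    /* exchange the counts so every rank learns how much it will receive from each other rank */
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, MPI_COMM_WORLD);

    sdispls[0] = rdispls[0] = 0;
    for (int p = 1; p < nprocs; p++) {
        sdispls[p] = sdispls[p - 1] + sendcounts[p - 1];
        rdispls[p] = rdispls[p - 1] + recvcounts[p - 1];
    }
    int recv_total = rdispls[nprocs - 1] + recvcounts[nprocs - 1];

    /* pack local_data so values for the same destination are contiguous */
    int *sendbuf = malloc(local_len * sizeof(int));
    int *offset = calloc(nprocs, sizeof(int));
    for (int i = 0; i < local_len; i++) {
        int dst = part[i];
        sendbuf[sdispls[dst] + offset[dst]++] = local_data[i];
    }

    int *recvbuf = malloc(recv_total * sizeof(int));
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                  recvbuf, recvcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    for (int i = 0; i < recv_total; i++)
        printf("rank %d received %d\n", rank, recvbuf[i]);

    MPI_Finalize();
    return 0;
}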

Does MPI support only broadcasting?

What I want to achieve is to broadcast a partial result to other threads and receive the other threads' results at a different line of code. It can be expressed as the following pseudo code:
if have any incoming message:
    read the message and compare it with the local optimal
    if is optimal:
        update the local optimal
calculate local result
if local result is better than local optimal:
    update local optimal
    send the local optimal to others
The question is, MPI_Bcast/MPI_Ibcast do the send and receive in the same place; what I want is a separate send and receive. I wonder if MPI has built-in support for my purpose, or if I can only achieve this by calling MPI_Send/MPI_Isend in a for loop.
What I want to achieve is to broadcast a partial result to other threads and receive the other threads' results at a different line of code. It can be expressed as the following pseudo code:
Typically, in MPI and in this context, one tends to use the term process rather than thread.
The question is, MPI_Bcast/MPI_Ibcast do the send and receive in the same place; what I want is a separate send and receive.
This is the typical use case for MPI_Allreduce:
Combines values from all processes and distributes the result back to all processes
So an example that illustrates your pseudo code:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]){
    MPI_Init(NULL, NULL); // Initialize the MPI environment
    int world_rank;
    int world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int my_local_optimal = world_rank;
    // MPI_IN_PLACE lets the same variable act as both input and output
    // (passing the same address twice is not allowed by the MPI standard)
    MPI_Allreduce(MPI_IN_PLACE, &my_local_optimal, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    printf("Step 1 : Process %d -> max local %d \n", world_rank, my_local_optimal);

    my_local_optimal += world_rank * world_size;
    MPI_Allreduce(MPI_IN_PLACE, &my_local_optimal, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    printf("Step 2 : Process %d -> max local %d \n", world_rank, my_local_optimal);

    MPI_Finalize();
    return 0;
}
So all processes start with a local optimal:
int my_local_optimal = world_rank;
then they perform an MPI_Allreduce:
MPI_Allreduce(MPI_IN_PLACE, &my_local_optimal, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
which will basically take the max value (i.e., MPI_MAX) of the variable my_local_optimal across all processes and store that value back into my_local_optimal (MPI_IN_PLACE tells MPI that the input and output are the same buffer).
Conceptually, the difference between the aforementioned approach and:
if have any incoming message:
    read the message and compare it with the local optimal
    if is optimal:
        update the local optimal
is that you neither explicitly check "if have any incoming message:" nor "if is optimal": you just calculate the max among the processes and update the local optimal accordingly. This makes the approach much simpler to handle.
In my example I have used MPI_MAX; however, you need to use the operation that, in your code, defines what is optimal.
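If "optimal" is not simply a maximum or minimum, MPI also lets you register a user-defined reduction with MPI_Op_create. A rough sketch follows; better_of and my_optimal are made-up names, and the comparison inside is just a placeholder for your own criterion:
/* user-defined reduction: combine "in" into "inout", keeping the better value */
static void better_of(void *in, void *inout, int *len, MPI_Datatype *dtype)
{
    int *a = (int *)in, *b = (int *)inout;
    for (int i = 0; i < *len; i++)
        if (a[i] > b[i])              /* placeholder criterion: larger wins */
            b[i] = a[i];
}

/* ... later, e.g. in main after MPI_Init: */
MPI_Op my_optimal;
MPI_Op_create(better_of, 1, &my_optimal);   /* 1 = the operation is commutative */
MPI_Allreduce(MPI_IN_PLACE, &my_local_optimal, 1, MPI_INT,
              my_optimal, MPI_COMM_WORLD);
MPI_Op_free(&my_optimal);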

How should I broadcast with MPI if I want to compute the sum across multiple processors?

I am starting to get into parallel computing and have started with MPI using C. I understand how to do such a thing using point-to-point communication (send/recv); however, my confusion comes when I try to use collective communication with bcast and reduce.
My code goes as follows:
int collective(int val, int rank, int n, int *toSum){
    int *globalBuf=malloc(n*sizeof(int*));
    int globalSum=0;
    int localSum=0;
    struct timespec before;
    if(rank==0){
        //only rank 0 will start timer
        clock_gettime(CLOCK_MONOTONIC, &before);
    }
    int numInts=(val*100000)/n;
    int *mySum = malloc((numInts)*sizeof(int *));
    int j;
    for(j=rank*numInts;j<numInts*rank+numInts;j++){
        localSum=localSum+(toSum[j]);
    }
    MPI_Bcast(&localSum, 1, MPI_INT, rank, MPI_COMM_WORLD);
    MPI_Reduce(&localSum, &globalSum, n, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if(rank==0){
        printf("Communicative sum = %d\n", globalSum);
        //only rank 0 will end the timer
        //and display
        struct timespec after;
        clock_gettime(CLOCK_MONOTONIC, &after);
        printf("Time to complete = %f\n",(after.tv_nsec-before.tv_nsec));
    }
}
Where the parameters being passed in can be described as:
val = the number of total ints that need to be summed - divided by 100000
rank= the rank of this process
n = the total number of processes
toSum = the ints that are going to be added together
Where I begin to run into errors is when I try to broadcast this process's localSum to be handled by rank 0.
I will explain what I've put into the function calls so you can possibly understand where my confusion comes from.
For MPI_Bcast:
&localSum - the address of this process's sum
1 - there is one value that I want to broadcast, the int held by localSum
MPI_INT - meaning implied
rank - the rank of this process that is broadcasting
MPI_COMM_WORLD - meaning implied
For MPI_Reduce
&localSum - the address of the variable that it will be "reducing"
&globalSum - the address of the variable that I want to hold the reduced values of localSum
n - the number of "localSum"s that this process will reduce (n is the number of processes)
MPI_INT - meaning implied
MPI_SUM - meaning implied
0 - I want rank 0 to be the process that will reduce so it can print
MPI_COMM_WORLD - meaning implied
When I look through the code, I feel it makes sense logically, and it compiles okay; however, when I run the program with m processes, I get the following error message:
Assertion failed in file src/mpi/coll/helper_fns.c at line 84: FALSE
memcpy argument memory ranges overlap, dst_=0x7fffffffd2ac src_=0x7fffffffd2a8 len_=16
internal ABORT - process 0
Can anyone help me find a solution? Apologies to anyone who sees this as second nature; this is only my third parallel program, and my first time using bcast/reduce!
I see two issues in the calls to the collective operations (MPI_Bcast, MPI_Reduce) in your code. First, in MPI_Reduce, you are reducing an integer localSum from every process into an integer globalSum, i.e. a single integer. But in your MPI_Reduce call you are trying to reduce n values, when in reality you just need to reduce 1 value from each of the n processes. That may cause this error.
The reduce should ideally be like this, if you want to reduce a single value:
MPI_Reduce(&localSum, &globalSum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
For the broadcast,
MPI_Bcast(&localSum, 1, MPI_INT, rank, MPI_COMM_WORLD);
every rank is broadcasting in your call above. According to the general idea of a broadcast, there should be one root process that broadcasts the value to all the processes. So the call should look like this:
int rootProcess = 0;
MPI_Bcast(&localSum, 1, MPI_INT, rootProcess, MPI_COMM_WORLD);
Here, rootProcess will send the value contained in its localSum to all the processes. Meanwhile, all the processes calling this broadcast will receive the value from rootProcess and store it in their local variable localSum.
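As a minimal sketch of just the summation step inside the question's collective() function, with the corrected count of 1 (the local summation loop is unchanged from the question):
int localSum = 0, globalSum = 0;
for (int j = rank * numInts; j < rank * numInts + numInts; j++)
    localSum = localSum + toSum[j];

/* count is 1: each of the n ranks contributes exactly one int */
MPI_Reduce(&localSum, &globalSum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

if (rank == 0)
    printf("Communicative sum = %d\n", globalSum);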

Can MPI_Gather be used to receive data from threads that use MPI_Send?

I have a master process and several slave processes. I want every slave process to send back one integer to the master, so I guess I should gather them using MPI_Gather. But somehow it doesn't work, and I started to think that MPI_Gather is incompatible with MPI_Send.
The relevant lines of code look like this:
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &process_id);
MPI_Comm_size(MPI_COMM_WORLD, &process_count);
int full_word_count = 0;
int* receiving_buffer = (int*)malloc(sizeof(int) * 100);
if (process_id == 0)
{
    // Some Master code here ...
    MPI_Gather(full_word_count, 1, MPI_INT, receiving_buffer, 1, MPI_INT, 0, MPI_COMM_WORLD);
    // ...
}
else
{
    // Some Slave code here ...
    MPI_Send(full_word_count, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    //...
}
MPI_Finalize();
I also know that I used "1" for MPI_Gather because I tried to run it with only two processes, so process 1 would send and process 0 would gather; of course, for more processes I should modify it using ranks. But my main question here is whether (and if yes, how) I can use MPI_Gather combined with MPI_Send in a situation like this.
MPI_Gather() is a collective operation and must hence be called by all the ranks of the communicator. They also must provide matching signatures (datatype and count) and all use the same root value.
Note that the send buffer of the root rank is also gathered into the receive buffer, so if the send count is 1, you really should allocate your receive buffer with
int* receiving_buffer = (int*)malloc(sizeof(int) * process_count)
and since all ranks send 1 * MPI_INT, a correct receive signature is also 1 * MPI_INT.
Also note that "threads" is improper in this context. MPI tasks or MPI processes are the right terminology.
Keep in mind that the standard does not specify how a collective operation should be implemented. In the case of MPI_Gather(), a naive implementation would have all MPI tasks send their buffer to the root rank. But some more sophisticated algorithm can be used such as a tree-based gather, and in that case, not all tasks would send their buffer to the root rank.
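A sketch of how the gather could then look, with every rank (the root included) making the same MPI_Gather call; the receive buffer only needs to be allocated on the root (process_id and process_count are the variables from the question):
int full_word_count = 0;            /* each rank's own count, computed earlier */
int *receiving_buffer = NULL;
if (process_id == 0)
    receiving_buffer = malloc(sizeof(int) * process_count);

/* no MPI_Send needed: the gather moves one int from every rank to the root */
MPI_Gather(&full_word_count, 1, MPI_INT,
           receiving_buffer, 1, MPI_INT,
           0, MPI_COMM_WORLD);

if (process_id == 0) {
    /* receiving_buffer[i] now holds rank i's full_word_count */
}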

What will happen when root in MPI_Bcast is random

Usually when we call MPI_Bcast, the root must be decided. But now I have an application that does not care who broadcasts the message: the node that broadcasts is random, but once the message is broadcast, a global variable must be consistent. My understanding is that since MPI_Bcast is a collective function, all nodes need to call it, but the order may differ, so whoever arrives at MPI_Bcast first broadcasts the message to the others. I ran the following code with 3 nodes. I think if node 1 (rank==1) arrives at MPI_Bcast first, it will send its local_count value (1) to the other nodes, and then all nodes update global_count with the same local_count, so one of my expected results is (the output order does not matter):
node 0, global count is 1
node 1, global count is 1
node 2, global count is 1
But the actual result is always (the output order does not matter):
node 1, global count is 1
node 0, global count is 0
node 2, global count is 2
This result is exactly the same as for the code without MPI_Bcast. So is there anything wrong with my understanding of MPI_Bcast or with my code? Thanks.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
    int rank, size;
    int local_count, global_count;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    global_count = 0;
    local_count = rank;
    MPI_Bcast(&local_count, 1, MPI_INT, rank, MPI_COMM_WORLD);
    global_count += local_count;

    printf("node %d, global count is: %d\n", rank, global_count);
    MPI_Finalize();
}
The code is a simplified case. In my application, there is some computation before MPI_Bcast and I don't know which node will finish the computation first. Whenever a node reaches the MPI_Bcast point, it needs to broadcast its own computation result (a local variable), and all nodes then update the global variable. So all nodes need to broadcast a message, but we don't know the order. How do I implement this idea?
Normally what you have written will lead to deadlock. In a typical MPI_Bcast case the process with rank root sends its data to all other processes in the communicator. You need to specify the same root rank in those receiving processes so they know whom to "listen" to. This is an oversimplified description, since usually a hierarchical broadcast is used in order to reduce the total operation time, but with three processes this hierarchical implementation reduces to a very simple linear one. In your case the process with rank 0 will try to send the same message to both processes 1 and 2. At the same time process 1 will not receive that message but will instead try to send its own to processes 0 and 2. Process 2 will also be trying to send to 0 and 1. In the end, every process is sending messages that no other process is willing to receive. This is an almost sure recipe for disaster.
Why doesn't your program hang? Because the messages being sent are very small, only one MPI_INT element, and the number of processes is small, so those sends are all being buffered internally by the MPI library in every process: they never reach their destinations, but nevertheless the calls made internally by MPI_Bcast do not block, and your code gets execution control back although the operation is still in progress. This is undefined behaviour, since the MPI library is not required by the standard to buffer anything: some implementations might buffer, others might not.
If you are trying to compute the sum of all local_count variables in all processes, then just use MPI_Allreduce. Replace this:
global_count = 0;
local_count = rank;
MPI_Bcast(&local_count, 1, MPI_INT, rank, MPI_COMM_WORLD);
global_count += local_count;
with this:
local_count = rank;
MPI_Allreduce(&local_count, &global_count, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
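For reference, a sketch of the question's full example with that replacement applied:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    int local_count, global_count;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local_count = rank;
    /* every rank contributes local_count; every rank gets the sum back */
    MPI_Allreduce(&local_count, &global_count, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("node %d, global count is: %d\n", rank, global_count);
    MPI_Finalize();
}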
The correct usage of MPI_Bcast is that all processes call the function with the same root. Even if you don't care who the broadcaster is, all the processes must call the function with the same root rank in order to listen to the broadcaster. In your code, each process calls MPI_Bcast with its own rank, each different from the others.
I haven't looked at the standard document, but it is likely that you trigger undefined behavior by calling MPI_Bcast with different root ranks.
From the Open MPI docs:
"MPI_Bcast broadcasts a message from the process with rank root to all processes of the group, itself included. It is called by all members of group using the same arguments for comm, root. On return, the contents of root's communication buffer has been copied to all processes."
Calling it with mismatched root values from different ranks might cause some problems.
A safer way is to hand-code your own broadcast function and call that.
It's basically a for loop and an MPI_Send call and shouldn't be too difficult to implement yourself.
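A sketch of such a hand-rolled broadcast (linear, so it will not scale like a real MPI_Bcast; my_bcast is a made-up name, and all ranks still have to agree on which rank acts as the root):
/* linear broadcast: the root sends to every other rank, everyone else receives */
void my_bcast(void *data, int count, MPI_Datatype type, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        for (int i = 0; i < size; i++)
            if (i != root)
                MPI_Send(data, count, type, i, 0, comm);
    } else {
        MPI_Recv(data, count, type, root, 0, comm, MPI_STATUS_IGNORE);
    }
}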
