What will happen when root in MPI_Bcast is random - c

Usually when we call MPI_Bcast, the root must be decided. But now I have an application that does not care who broadcasts the message: the node that broadcasts is random, but once the message is broadcast, a global variable must be consistent. My understanding is that since MPI_Bcast is a collective function, all nodes need to call it, but the order may differ, and whichever node arrives at MPI_Bcast first will broadcast its message to the others. I ran the following code with 3 nodes. I think that if node 1 (rank==1) arrives at MPI_Bcast first, it will send its local_count value (1) to the other nodes, and then all nodes update global_count with the same local_count, so one of my expected results is (the output order does not matter):
node 0, global count is 1
node 1, global count is 1
node 2, global count is 1
But the actual result is always (the output order does not matter):
node 1, global count is 1
node 0, global count is 0
node 2, global count is 2
This result is exactly the same as the code without MPI_Bcast. So is there anything wrong with my understanding of MPI_Bcast or my code? Thanks.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    int local_count, global_count;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    global_count = 0;
    local_count = rank;
    MPI_Bcast(&local_count, 1, MPI_INT, rank, MPI_COMM_WORLD);
    global_count += local_count;

    printf("node %d, global count is: %d\n", rank, global_count);

    MPI_Finalize();
}
The code is a simplified case. In my application, there are some computations before MPI_Bcast and I don't know who will finish the computation first. Whenever a node reaches the MPI_Bcast point, it needs to broadcast its own computation result (a local variable), and all nodes then update the global variable. So all nodes need to broadcast a message, but we don't know the order. How can I implement this idea?

Normally what you have written will lead to deadlock. In a typical MPI_Bcast case the process with rank root sends its data to all other processes in the communicator. You need to specify the same root rank in those receiving processes so they know whom to "listen" to. This is an oversimplified description, since usually a hierarchical broadcast is used in order to reduce the total operation time, but with three processes this hierarchical implementation reduces to a very simple linear one. In your case the process with rank 0 will try to send the same message to both processes 1 and 2. At the same time process 1 will not receive that message but will instead try to send its own to processes 0 and 2. Process 2 will also be trying to send to 0 and 1. In the end every process is sending messages that no other process is willing to receive. This is an almost sure recipe for disaster.
Why doesn't your program hang? Because the messages being sent are very small (a single MPI_INT element) and the number of processes is small, so those sends are all buffered internally by the MPI library in every process. They never reach their destinations, but the calls made internally by MPI_Bcast do not block, and your code gets execution control back even though the operation is still in progress. This is undefined behaviour, since the MPI library is not required by the standard to buffer anything: some implementations might buffer, others might not.
If you are trying to compute the sum of all local_count variables in all processes, then just use MPI_Allreduce. Replace this:
global_count = 0;
local_count = rank;
MPI_Bcast(&local_count, 1, MPI_INT, rank, MPI_COMM_WORLD);
global_count += local_count;
with this:
local_count = rank;
MPI_Allreduce(&local_count, &global_count, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
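For reference, a complete runnable version of the fix might look like this (a sketch reusing the question's variable names):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    int local_count, global_count;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local_count = rank;
    /* sum local_count over all ranks and give every rank the result */
    MPI_Allreduce(&local_count, &global_count, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("node %d, global count is: %d\n", rank, global_count);

    MPI_Finalize();
    return 0;
}

With 3 processes every node prints a global count of 3 (0 + 1 + 2), and the value is guaranteed to be consistent on all ranks.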

The correct usage of MPI_Bcast is that all processes call the function with the same root. Even if you don't care about who the broadcaster is, all the processes must call the function with the same rank so they listen to the same broadcaster. In your code, each process is calling MPI_Bcast with its own rank, all different from each other.
I haven't looked at the standard document, but it is likely that you trigger undefined behavior by calling MPI_Bcast with different root ranks.

From the Open MPI docs:
"MPI_Bcast broadcasts a message from the process with rank root to all processes of the group, itself included. It is called by all members of group using the same arguments for comm, root. On return, the contents of root’s communication buffer has been copied to all processes."
Calling it with ranks != root might cause some problems.
A safer way is to hand-code your own broadcast function and call that. It's basically a for loop around an MPI_Send call and shouldn't be too difficult to implement yourself.
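For illustration, here is a minimal sketch of such a hand-rolled linear broadcast, assuming every rank agrees on who the root is (the my_bcast name is made up):

#include <mpi.h>
#include <stdio.h>

static void my_bcast(int *value, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        /* the root sends its value to every other rank, one at a time */
        for (int dest = 0; dest < size; dest++)
            if (dest != root)
                MPI_Send(value, 1, MPI_INT, dest, 0, comm);
    } else {
        /* everyone else receives exactly one message from the root */
        MPI_Recv(value, 1, MPI_INT, root, 0, comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int root = (size > 1) ? 1 : 0;   /* broadcast from rank 1 if it exists */
    int local_count = rank;          /* only the root's value survives */
    my_bcast(&local_count, root, MPI_COMM_WORLD);
    printf("node %d, local_count is now %d\n", rank, local_count);

    MPI_Finalize();
    return 0;
}

Note this does not solve the "whoever finishes first" problem either: just like MPI_Bcast, every rank still has to agree on the root before calling it.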

Related

mpi send and recv all processes to one another

I am trying to send data between all processes where I have an array on each process such as
int local_data[] = {0*rank,1*rank,2*rank,3*rank};
I have a corresponding flag array where each value in that array points to which process I should be sending this value, for example:
int part[] = {0,1,3,2};
so this means local_data[0] should go to the process with rank 0, local_data[2] should go to the process with rank 3, and so on.
The values in the flag array change from one process to the other (all within the range 0 to P-1, where P is the total number of processes available).
Using this, What I am currently doing is :
for(int i=0; i<local_len; i++){
    if(part[i] != rank){
        MPI_Send(&local_data[i], 1, MPI_INT, part[i], 0, MPI_COMM_WORLD);
        MPI_Recv(&temp, 1, MPI_INT, rank, 0, MPI_COMM_WORLD, &status);
        recvbuf[j] = temp;
        j++;
    }
    else{
        recvbuf[j] = local_data[i];
        j++;
    }
}
where I am only sending and receiving data if part[i] != rank, to avoid sending to and receiving from the same process.
recvbuf is the array I receive the values into on each process. It can be longer than the initial local_data length.
I also tried
MPI_Sendrecv(&local_data[i], 1,MPI_INT, part[i], 0, &temp, 1, MPI_INT, rank, 0, MPI_COMM_WORLD, &status);
The program gets stuck both ways.
How do I go about solving this?
Is the All-to-All collective the way to go here?
Your basic problem is that your send call goes to a dynamically determined target, but there is no corresponding logic to determine which processes need to do a receive at all, and if so, from where.
If the logic of your application implies that everyone will send to everyone, then you can use MPI_Alltoall.
If everyone sends to some, but you know that you will receive exactly four messages, then you can combine MPI_Isend for the sends with MPI_Recv from MPI_ANY_SOURCE (a sketch of this pattern follows this answer). Note that you need Isend because your code will, strictly speaking, deadlock. It may work if your MPI has an "eager mode" for small messages.
If the number of sends and the targets are entirely random, then you need something like MPI_Ibarrier to detect that all is over and done.
But I suspect you're leaving out major information here. Why is the length of local_data 4? Is the part array a permutation? Et cetera.
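For the second case, a minimal sketch of the MPI_Isend + MPI_Recv(MPI_ANY_SOURCE) pattern might look like this; the part[] map here is made up so that every rank receives exactly local_len values, which is the assumption the answer relies on:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { local_len = 4 };
    int local_data[local_len];
    int part[local_len];
    for (int i = 0; i < local_len; i++) {
        local_data[i] = i * rank;
        part[i] = (rank + i) % size;     /* made-up destination map */
    }

    /* post all sends without blocking */
    MPI_Request reqs[local_len];
    for (int i = 0; i < local_len; i++)
        MPI_Isend(&local_data[i], 1, MPI_INT, part[i], 0, MPI_COMM_WORLD, &reqs[i]);

    /* receive exactly local_len values from whoever sends them */
    int recvbuf[local_len];
    for (int i = 0; i < local_len; i++)
        MPI_Recv(&recvbuf[i], 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Waitall(local_len, reqs, MPI_STATUSES_IGNORE);

    for (int i = 0; i < local_len; i++)
        printf("rank %d got %d\n", rank, recvbuf[i]);

    MPI_Finalize();
    return 0;
}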
Following @GillesGouaillardet's advice, I used MPI_Alltoallv to solve this problem.
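For completeness, here is a sketch of how that can look; this is my illustration, not the poster's actual code. Each rank counts how many of its values go to every destination, exchanges the counts with MPI_Alltoall, and then moves the values with MPI_Alltoallv:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* example data and destination map as in the question,
       clamped with % size so the example runs with any number of ranks */
    int local_len = 4;
    int local_data[4] = {0 * rank, 1 * rank, 2 * rank, 3 * rank};
    int part[4] = {0 % size, 1 % size, 3 % size, 2 % size};

    /* how many ints this rank sends to each destination */
    int *sendcounts = calloc(size, sizeof(int));
    for (int i = 0; i < local_len; i++)
        sendcounts[part[i]]++;

    /* tell every rank how much it will receive from us */
    int *recvcounts = malloc(size * sizeof(int));
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, MPI_COMM_WORLD);

    /* displacements derived from the counts */
    int *sdispls = malloc(size * sizeof(int));
    int *rdispls = malloc(size * sizeof(int));
    sdispls[0] = rdispls[0] = 0;
    for (int r = 1; r < size; r++) {
        sdispls[r] = sdispls[r - 1] + sendcounts[r - 1];
        rdispls[r] = rdispls[r - 1] + recvcounts[r - 1];
    }
    int total_recv = rdispls[size - 1] + recvcounts[size - 1];

    /* pack the send buffer grouped by destination */
    int *sendbuf = malloc(local_len * sizeof(int));
    int *offset = calloc(size, sizeof(int));
    for (int i = 0; i < local_len; i++) {
        int dest = part[i];
        sendbuf[sdispls[dest] + offset[dest]++] = local_data[i];
    }

    int *recvbuf = malloc(total_recv * sizeof(int));
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                  recvbuf, recvcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    for (int i = 0; i < total_recv; i++)
        printf("rank %d received %d\n", rank, recvbuf[i]);

    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
    free(sendbuf); free(offset); free(recvbuf);
    MPI_Finalize();
    return 0;
}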

Does MPI support only broadcasting?

What I want to achieve is to broadcast a partial result to other threads and receive other threads' results at a different line of code; it can be expressed as the following pseudo code:
if have any incoming message:
    read the message and compare it with the local optimal
    if is optimal:
        update the local optimal
calculate local result
if local result is better than local optimal:
    update local optimal
    send the local optimal to others
The question is, MPI_Bcast/MPI_Ibcast do the send and receive in the same place, but what I want is a separate send and receive. I wonder if MPI has built-in support for my purpose, or if I can only achieve this by calling MPI_Send/MPI_Isend in a for loop?
What I want to achieve is to broadcast partial result to other threads and receive other threads' result at a different line of code, it can be expressed as the following pseudo code:
Typically, in MPI and in this context, one tends to use the term process rather than thread.
The question is, MPI_Bcast/MPI_Ibcast do the send and receive in the same place, what I want is separate send and receive.
This is the typical use case for an MPI_Allreduce:
Combines values from all processes and distributes the result back to all processes
So an example that illustrates your pseudo code:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]){
    MPI_Init(NULL, NULL); // Initialize the MPI environment
    int world_rank;
    int world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int my_local_optimal = world_rank;

    // MPI_IN_PLACE: my_local_optimal is both the input and the output buffer
    MPI_Allreduce(MPI_IN_PLACE, &my_local_optimal, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    printf("Step 1 : Process %d -> max local %d \n", world_rank, my_local_optimal);

    my_local_optimal += world_rank * world_size;
    MPI_Allreduce(MPI_IN_PLACE, &my_local_optimal, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    printf("Step 2 : Process %d -> max local %d \n", world_rank, my_local_optimal);

    MPI_Finalize();
    return 0;
}
So all processes start with a local optimal:
int my_local_optimal = world_rank;
then they perform an MPI_Allreduce:
MPI_Allreduce(MPI_IN_PLACE, &my_local_optimal, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
which basically takes the max value (i.e., MPI_MAX) of the variable my_local_optimal across all processes and stores that value back into my_local_optimal. MPI_IN_PLACE is used because the same variable serves as both input and output; passing the same buffer twice as distinct send and receive buffers is not allowed by the standard.
Conceptually, the difference between the aforementioned approach and:
if have any incoming message:
    read the message and compare it with the local optimal
    if is optimal:
        update the local optimal
is that you neither explicitly check "if have any incoming message:" nor "if is optimal": you just calculate the max among the processes and update the local optimal accordingly. This makes the approach much simpler to handle.
In my example I have used MPI_MAX; however, you need to use the operation (in your code) that defines what is optimal or not.
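If "optimal" is not a plain max or min, MPI lets you reduce with a user-defined operation via MPI_Op_create. A hedged sketch, where the better_of function and the cost values are made up for illustration:

#include <mpi.h>
#include <stdio.h>

/* user-defined reduction: keep the smaller "cost" of each pair
   (assumes the data are plain MPI_INT values) */
static void better_of(void *in, void *inout, int *len, MPI_Datatype *dt)
{
    int *a = (int *)in, *b = (int *)inout;
    for (int i = 0; i < *len; i++)
        if (a[i] < b[i])          /* "better" here means smaller cost */
            b[i] = a[i];
}

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Op my_best;
    MPI_Op_create(better_of, 1 /* commutative */, &my_best);

    int local_optimal = 100 - rank;           /* stand-in for a real cost */
    MPI_Allreduce(MPI_IN_PLACE, &local_optimal, 1, MPI_INT, my_best, MPI_COMM_WORLD);
    printf("rank %d: global optimal cost = %d\n", rank, local_optimal);

    MPI_Op_free(&my_best);
    MPI_Finalize();
    return 0;
}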

How should I broadcast with MPI if I want to compute the sum across multiple processors?

I am starting to get into parallel computing and have started with MPI using C. I understand how to do such a thing using p2p (send/recv); however, my confusion comes when I try to use collective communication with bcast and reduce.
My code goes as follows:
int collective(int val, int rank, int n, int *toSum){
    int *globalBuf=malloc(n*sizeof(int*));
    int globalSum=0;
    int localSum=0;
    struct timespec before;
    if(rank==0){
        //only rank 0 will start timer
        clock_gettime(CLOCK_MONOTONIC, &before);
    }
    int numInts=(val*100000)/n;
    int *mySum = malloc((numInts)*sizeof(int *));
    int j;
    for(j=rank*numInts;j<numInts*rank+numInts;j++){
        localSum=localSum+(toSum[j]);
    }
    MPI_Bcast(&localSum, 1, MPI_INT, rank, MPI_COMM_WORLD);
    MPI_Reduce(&localSum, &globalSum, n, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if(rank==0){
        printf("Communicative sum = %d\n", globalSum);
        //only rank 0 will end the timer
        //an display
        struct timespec after;
        clock_gettime(CLOCK_MONOTONIC, &after);
        printf("Time to complete = %f\n",(after.tv_nsec-before.tv_nsec));
    }
}
Where the parameters being passed in can be described as:
val = the number of total ints that need to be summed - divided by 100000
rank= the rank of this process
n = the total number of processes
toSum = the ints that are going to be added together
Where I begin to run into errors is when I try to broadcast this processor's localSum to be handled by rank 0.
I will explain what I've put into the function call so you can possibly understand where my confusion comes from.
For MPI_Bcast:
&localSum - the address of this processes sum
1 - there is one value that I want to broadcast, the int held by localSum
MPI_INT - meaning implied
rank - the rank of this process that is broadcasting
MPI_COMM_WORLD - meaning implied
For MPI_Reduce
&localSum - the address of the variable that it will "reducing"
&globalSum - the address of the variable that I want to hold the reduced values of localSum
n - the number of "localSum"s that this process will reduce (n is number of processes)
MPI_INT - meaning implied
MPI_SUM - meaning implied
0 - I want rank 0 to be the process that will reduce so it can print
MPI_COMM_WORLD - meaning implied
When I look through the code, I feel it makes sense logically, and it compiles okay; however, when I run the program with m processes, I get the following error message:
Assertion failed in file src/mpi/coll/helper_fns.c at line 84: FALSE
memcpy argument memory ranges overlap, dst_=0x7fffffffd2ac src_=0x7fffffffd2a8 len_=16
internal ABORT - process 0
Can anyone help me find a solution? Apologies to anyone who sees this as second nature; this is only my third parallel program, and my first time using bcast/reduce!
I see two issues in the calls to the collective operations (MPI_Bcast, MPI_Reduce) in your code. First, in MPI_Reduce you are reducing an integer localSum from every process into an integer globalSum, i.e. a single integer. But in your MPI_Reduce call you are trying to reduce n values, when in reality you just need to reduce 1 value from each of the n processes. That is likely what causes this error.
The reduce should ideally be like this, if you want to reduce a single value:
MPI_Reduce(&localSum, &globalSum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
For the broadcast,
MPI_Bcast(&localSum, 1, MPI_INT, rank, MPI_COMM_WORLD);
in your call above, every rank is broadcasting. According to the general idea of a broadcast, there should be one root process that broadcasts the value to all the processes. So the call should be like this:
int rootProcess = 0;
MPI_Bcast(&localSum, 1, MPI_INT, rootProcess, MPI_COMM_WORLD);
Here, rootProcess will send the value contained in its localSum to all the processes. Meanwhile, all the processes calling this broadcast will receive the value from rootProcess and store it in their local variable localSum.
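Putting both corrections together: since the goal is just a global sum printed by rank 0, the MPI_Bcast of localSum can simply be dropped. A minimal runnable sketch (with made-up data so the result is easy to check):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each rank sums its own chunk */
    int numInts = 1000;
    int localSum = 0;
    for (int j = 0; j < numInts; j++)
        localSum += 1;                       /* stand-in for toSum[j] */

    /* one value per rank, combined with MPI_SUM on rank 0 */
    int globalSum = 0;
    MPI_Reduce(&localSum, &globalSum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Communicative sum = %d (expected %d)\n", globalSum, numInts * size);

    MPI_Finalize();
    return 0;
}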

Can MPI_Gather be used to receive data from threads that use MPI_Send?

I have a master process and several slave processes. I want every slave process to send one integer back to the master, so I guess I should gather them using MPI_Gather. But somehow it doesn't work, and I started to think that MPI_Gather is incompatible with MPI_Send.
The relevant lines of code look like this:
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &process_id);
MPI_Comm_size(MPI_COMM_WORLD, &process_count);

int full_word_count = 0;
int* receiving_buffer = (int*)malloc(sizeof(int) * 100);

if (process_id == 0)
{
    // Some Master code here ...
    MPI_Gather(full_word_count, 1, MPI_INT, receiving_buffer, 1, MPI_INT, 0, MPI_COMM_WORLD);
    // ...
}
else
{
    // Some Slave code here ...
    MPI_Send(full_word_count, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    //...
}
MPI_Finalize();
I also know that I used "1" for MPI_Gather because I tried to run it with only two processes, so process 1 would send and process 0 would gather; of course, for more processes I should modify it using ranks. But my main question is whether I can use (and if yes, how) MPI_Gather combined with MPI_Send for a situation like this.
MPI_Gather() is a collective operation and must hence be called by all the ranks of the communicator. They also must provide matching signatures (datatype and count) and all use the same root value.
Note the send buffer of the root rank is also gathered into the receive buffer, so if the send count is 1, you really should allocate your receive buffer with
int* receiving_buffer = (int*)malloc(sizeof(int) * process_count)
and since all ranks send 1 * MPI_INT, a correct receive signature is also 1 * MPI_INT.
Also note that "threads" is improper in this context. MPI tasks or MPI processes are the right terminology.
Keep in mind that the standard does not specify how a collective operation should be implemented. In the case of MPI_Gather(), a naive implementation would have all MPI tasks send their buffer to the root rank. But some more sophisticated algorithm can be used such as a tree-based gather, and in that case, not all tasks would send their buffer to the root rank.
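A hedged sketch of the corrected pattern, where every rank (root included) calls MPI_Gather with the same signature and no MPI_Send is involved (the full_word_count values are made up):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int process_id, process_count;
    MPI_Comm_rank(MPI_COMM_WORLD, &process_id);
    MPI_Comm_size(MPI_COMM_WORLD, &process_count);

    int full_word_count = 10 * process_id;   /* stand-in for the real count */

    /* only the root needs a receive buffer, sized to one int per rank */
    int *receiving_buffer = NULL;
    if (process_id == 0)
        receiving_buffer = malloc(sizeof(int) * process_count);

    MPI_Gather(&full_word_count, 1, MPI_INT,
               receiving_buffer, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (process_id == 0) {
        for (int i = 0; i < process_count; i++)
            printf("rank %d sent %d\n", i, receiving_buffer[i]);
        free(receiving_buffer);
    }

    MPI_Finalize();
    return 0;
}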

Changing value of a variable with MPI

#include<stdio.h>
#include<mpi.h>

int a=1;
int *p=&a;

int main(int argc, char **argv)
{
    MPI_Init(&argc,&argv);
    int rank,size;
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Comm_size(MPI_COMM_WORLD,&size);
    //printf("Address val: %u \n",p);
    *p=*p+1;
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    printf("Value of a : %d\n",*p);
    return 0;
}
Here I am trying to execute the program with 3 processes, where each tries to increment the value of a by 1, so the value at the end of execution of all processes should be 4. Then why is the value printed as only 2 at the printf statement after MPI_Finalize()? And isn't it the case that the parallel execution stops at MPI_Finalize() and there should be only one process running after it? Then why do I get the print statement 3 times, once for each process, during execution?
It is a common misunderstanding to think that mpi_init starts up the requested number of processes (or whatever mechanism is used to implement MPI) and that mpi_finalize stops them. It's better to think of mpi_init starting the MPI system on top of a set of operating-system processes. The MPI standard is silent on what MPI actually runs on top of and how the underlying mechanism(s) is/are started. In practice a call to mpiexec (or mpirun) is likely to fire up a requested number of processes, all of which are alive when the program starts. It is also likely that the processes will continue to live after the call to mpi_finalize until the program finishes.
This means that prior to the call to mpi_init, and after the call to mpi_finalize it is likely that there is a number of o/s processes running, each of them executing the same program. This explains why you get the printf statement executed once for each of your processes.
As to why the value of a is set to 2 rather than to 4, well, essentially you are running n copies of the same program (where n is the number of processes) each of which adds 1 to its own version of a. A variable in the memory of one process has no relationship to a variable of the same name in the memory of another process. So each process sets a to 2.
To get any data from one process to another the processes need to engage in message-passing.
EDIT, in response to OP's comment
Just as a variable in the memory of one process has no relationship to a variable of the same name in the memory of another process, a pointer (which is a kind of variable) has no relationship to a pointer of the same name in the memory of another process. Do not be fooled: if the "same" pointer has the "same" address in multiple processes, those addresses are in different address spaces and are not the same; the pointers don't point to the same place.
An analogy: 1 High Street, Toytown is not the same address as 1 High Street, Legotown; there is a coincidence in names across address spaces.
To get any data (pointer or otherwise) from one process to another the processes need to engage in message-passing. You seem to be clinging to a notion that MPI processes share memory in some way. They don't, let go of that notion.
Since MPI only gives you the option to communicate between separate processes, you have to do message passing. For your purpose there is something like MPI_Allreduce, which can sum data over the separate processes. Note that this adds the values, so in your case you want to sum the increments and add the sum to p afterwards:
int inc = 1;
MPI_Allreduce(MPI_IN_PLACE, &inc, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
*p += inc;
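A runnable version of that fragment might look like this (a sketch, not the original program); with 3 processes, every rank ends up with a == 4:

#include <mpi.h>
#include <stdio.h>

int a = 1;
int *p = &a;

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int inc = 1;   /* this rank's contribution */
    /* MPI_IN_PLACE: inc is both input and output, so after the call it
       holds the sum of all contributions */
    MPI_Allreduce(MPI_IN_PLACE, &inc, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    *p += inc;

    printf("rank %d: value of a : %d\n", rank, *p);

    MPI_Finalize();
    return 0;
}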
In your implementation there is no communication between the spawned processes. Each process has its own int a variable, which it increments and prints to the screen. Making the variable global doesn't make it shared between processes, and all the pointer gimmicks show me that you don't know what you are doing. I would suggest learning a little more about C and operating systems before you move on.
Anyway, you have to make the processes communicate. Here's what an example might look like:
#include<stdio.h>
#include<mpi.h>

// this program will count the number of spawned processes in a *very* bad way
int main(int argc, char **argv)
{
    int partial = 1;
    int sum;
    int my_id = 0;

    // let's just assume the process with id 0 is root
    int root_process = 0;

    // spawn processes, etc.
    MPI_Init(&argc,&argv);

    // every process learns his id
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    // all processes add their 'partial' to the 'sum'
    MPI_Reduce(&partial, &sum, 1, MPI_INT, MPI_SUM, root_process, MPI_COMM_WORLD);

    // de-init MPI
    MPI_Finalize();

    // the root process communicates the summation result
    if (my_id == root_process)
    {
        printf("Sum total : %d\n", sum);
    }

    return 0;
}
