MPI merge multiple intercoms into a single intracomm - c

I trying to set up bunch of spawned processes into a single intracomm. I need to spawn separate processes into unique working directories since these subprocesses will write out a bunch of files. After all the processes are spawned in want to merge them into a single intra communicator. To try this out I set up a simple test program.
int main(int argc, const char * argv[]) {
int rank, size;
const int num_children = 5;
int error_codes;
MPI_Init(&argc, (char ***)&argv);
MPI_Comm parentcomm;
MPI_Comm childcomm;
MPI_Comm intracomm;
MPI_Comm_get_parent(&parentcomm);
if (parentcomm == MPI_COMM_NULL) {
printf("Current Command %s\n", argv[0]);
for (size_t i = 0; i < num_children; i++) {
MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &childcomm, &error_codes);
MPI_Intercomm_merge(childcomm, 0, &intracomm);
MPI_Barrier(childcomm);
}
} else {
MPI_Intercomm_merge(parentcomm, 1, &intracomm);
MPI_Barrier(parentcomm);
}
printf("Test\n");
MPI_Barrier(intracomm);
printf("Test2\n");
MPI_Comm_rank(intracomm, &rank);
MPI_Comm_size(intracomm, &size);
printf("Rank %d of %d\n", rank + 1, size);
MPI_Barrier(intracomm);
MPI_Finalize();
return 0;
}
When I run this I get all 6 processes but my intracomm is only speaking between the parent and the last child spawned. The resulting output is
Test
Test
Test
Test
Test
Test
Test2
Rank 1 of 2
Test2
Rank 2 of 2
Is there a way to merge multiple communicators into a single communicator? Also note that I'm spawning these one at a time since I need each subprocess to execute in a unique working directory.

I realize I'm a year out of date with this answer, but I thought maybe other people might want to see an implementation of this. As the original respondent said, there is no way to merge three (or more) communicators. You have to build up the new intra-comm one at a time. Here is the code I use. This version deletes the original intra-comm; you may or may not want to do that depending on your particular application:
#include <mpi.h>
// The Borg routine: given
// (1) a (quiesced) intra-communicator with one or more members, and
// (2) a (quiesced) inter-communicator with exactly two members, one
// of which is rank zero of the intra-communicator, and
// the other of which is an unrelated spawned rank,
// return a new intra-communicator which is the union of both inputs.
//
// This is a collective operation. All ranks of the intra-
// communicator, and the remote rank of the inter-communicator, must
// call this routine. Ranks that are members of the intra-comm must
// supply the proper value for the "intra" argument, and MPI_COMM_NULL
// for the "inter" argument. The remote inter-comm rank must
// supply MPI_COMM_NULL for the "intra" argument, and the proper value
// for the "inter" argument. Rank zero (only) of the intra-comm must
// supply proper values for both arguments.
//
// N.B. It would make a certain amount of sense to split this into
// separate routines for the intra-communicator processes and the
// remote inter-communicator process. The reason we don't do that is
// that, despite the relatively few lines of code, what's going on here
// is really pretty complicated, and requires close coordination of the
// participating processes. Putting all the code for all the processes
// into this one routine makes it easier to be sure everything "lines up"
// properly.
MPI_Comm
assimilateComm(MPI_Comm intra, MPI_Comm inter)
{
MPI_Comm peer = MPI_COMM_NULL;
MPI_Comm newInterComm = MPI_COMM_NULL;
MPI_Comm newIntraComm = MPI_COMM_NULL;
// The spawned rank will be the "high" rank in the new intra-comm
int high = (MPI_COMM_NULL == intra) ? 1 : 0;
// If this is one of the (two) ranks in the inter-comm,
// create a new intra-comm from the inter-comm
if (MPI_COMM_NULL != inter) {
MPI_Intercomm_merge(inter, high, &peer);
} else {
peer = MPI_COMM_NULL;
}
// Create a new inter-comm between the pre-existing intra-comm
// (all of it, not only rank zero), and the remote (spawned) rank,
// using the just-created intra-comm as the peer communicator.
int tag = 12345;
if (MPI_COMM_NULL != intra) {
// This task is a member of the pre-existing intra-comm
MPI_Intercomm_create(intra, 0, peer, 1, tag, &newInterComm);
}
else {
// This is the remote (spawned) task
MPI_Intercomm_create(MPI_COMM_SELF, 0, peer, 0, tag, &newInterComm);
}
// Now convert this inter-comm into an intra-comm
MPI_Intercomm_merge(newInterComm, high, &newIntraComm);
// Clean up the intermediaries
if (MPI_COMM_NULL != peer) MPI_Comm_free(&peer);
MPI_Comm_free(&newInterComm);
// Delete the original intra-comm
if (MPI_COMM_NULL != intra) MPI_Comm_free(&intra);
// Return the new intra-comm
return newIntraComm;
}

If you are going to do this by calling MPI_COMM_SPAWN multiple times, then you'll have to do it more carefully. After you call SPAWN the first time, the spawned process will also need to take part in the next call to SPAWN, otherwise it will be left out of the communicator you're merging. it ends up looking like this:
The problem is that only two processes are participating in each MPI_INTERCOMM_MERGE and you can't merge three communicators so you'll never end up with one big communicator that way.
If you instead have each process participate in the merge as it goes, you end up with one big communicator in the end:
Of course, you can just spawn all of your extra processes at once, but it sounds like you might have other reasons for not doing that.

Related

Can MPI_Gather be used to receive data from threads that use MPI_Send?

I have a master process and more slave processes. I want that every slave process to send back to the master one integer, so I guess I should gather them using MPI_Gather. But somehow it doesn't work and I started to think that MPI_Gather is incompatible with MPI_Send.
The relevant lines of code look like this:
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &process_id);
MPI_Comm_size(MPI_COMM_WORLD, &process_count);
int full_word_count = 0;
int* receiving_buffer = (int*)malloc(sizeof(int) * 100);
if (process_id == 0)
{
// Some Master code here ...
MPI_Gather(full_word_count, 1, MPI_INT, receiving_buffer, 1, MPI_INT, 0, MPI_COMM_WORLD);
// ...
}
else
{
// Some Slave code here ...
MPI_Send(full_word_count, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
//...
}
MPI_Finalize();
I also know that I used "1" for MPI_Gather because I tried to run only for two processes, so process 1 would send, and process 0 would gather; of course, for more processes I should modify it using ranks. But my main question here is that I can use (and if yes, how) MPI_Gather combined with MPI_Send for a situation like this.
MPI_Gather() is a collective operation and must hence be called by all the ranks of the communicator. They also must provide matching signatures (datatype and count) and all use the same root value.
Note the send buffer of the root rank is also gathered into the receive buffers, so if the send count is 1, you really should allocate your receive buffer with
int* receiving_buffer = (int*)malloc(sizeof(int) * process_count)
and since all ranks send 1 * MPI_INT, a correct receive signature is also be 1 * MPI_INT.
Also note that "threads" is improper in this context. MPI tasks or MPI processes are the right terminology.
Keep in mind that the standard does not specify how a collective operation should be implemented. In the case of MPI_Gather(), a naive implementation would have all MPI tasks send their buffer to the root rank. But some more sophisticated algorithm can be used such as a tree-based gather, and in that case, not all tasks would send their buffer to the root rank.

Barrier call stuck in Open MPI (C program)

I am practicing synchronization through barrier by using Open MPI message communication. I have created an array of struct called containers. Each container is linked to its neighbor on the right, and the two elements at both ends are also linked, forming a circle.
In the main() testing client, I run MPI with multiple processes (mpiexec -n 5 ./a.out), and they are supposed to be synchronized by calling the barrier() function, however, my code is stuck at the last process. I am looking for help with the debugging. Please see my code below:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>
typedef struct container {
int labels;
struct container *linked_to_container;
int sense;
} container;
container *allcontainers; /* an array for all containers */
int size_containers_array;
int get_next_container_id(int current_container_index, int max_index)
{
if (max_index - current_container_index >= 1)
{
return current_container_index + 1;
}
else
return 0; /* elements at two ends are linked */
}
container *get_container(int index)
{
return &allcontainers[index];
}
void container_init(int num_containers)
{
allcontainers = (container *) malloc(num_containers * sizeof(container)); /* is this right to malloc memory on the array of container when the struct size is still unknown?*/
size_containers_array = num_containers;
int i;
for (i = 0; i < num_containers; i++)
{
container *current_container = get_container(i);
current_container->labels = 0;
int next_container_id = get_next_container_id(i, num_containers - 1); /* max index in all_containers[] is num_containers-1 */
current_container->linked_to_container = get_container(next_container_id);
current_container->sense = 0;
}
}
void container_barrier()
{
int current_container_id, my_sense = 1;
int tag = current_container_id;
MPI_Request request[size_containers_array];
MPI_Status status[size_containers_array];
MPI_Comm_rank(MPI_COMM_WORLD, &current_container_id);
container *current_container = get_container(current_container_id);
int next_container_id = get_next_container_id(current_container_id, size_containers_array - 1);
/* send asynchronous message to the next container, wait, then do blocking receive */
MPI_Isend(&my_sense, 1, MPI_INT, next_container_id, tag, MPI_COMM_WORLD, &request[current_container_id]);
MPI_Wait(&request[current_container_id], &status[current_container_id]);
MPI_Recv(&my_sense, 1, MPI_INT, next_container_id, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
void free_containers()
{
free(allcontainers);
}
int main(int argc, char **argv)
{
int my_id, num_processes;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &num_processes);
MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
container_init(num_processes);
printf("Hello world from thread %d of %d \n", my_id, num_processes);
container_barrier();
printf("passed barrier \n");
MPI_Finalize();
free_containers();
return 0;
}
The problem is the series of calls:
MPI_Isend()
MPI_Wait()
MPI_Recv()
This is a common source of confusion. When you use a "nonblocking" call in MPI, you are essentially telling the MPI library that you want to do some operation (send) with some data (my_sense). MPI gives you back an MPI_Request object with the guarantee that the call will be finished by the time a completion function finishes that MPI_Request.
The problem you have here is that you're calling MPI_Isend and immediately calling MPI_Wait before ever calling MPI_Recv on any rank. This means that all of those send calls get queued up but never actually have anywhere to go because you've never told MPI where to put the data by calling MPI_Recv (which tells MPI that you want to put the data in my_sense).
The reason this works part of the time is that MPI expects that things might not always sync up perfectly. If you smaller messages (which you do), MPI reserves some buffer space and will let your MPI_Send operations complete and the data gets stashed in that temporary space for a while until you call MPI_Recv later to tell MPI where to move the data. Eventually though, this won't work anymore. The buffers will be full and you'll need to actually start receiving your messages. For you, this means that you need to switch the order of your operations. Instead of doing a non-blocking send, you should do a non-blocking receive first, then do your blocking send, then wait for your receive to finish:
MPI_Irecv()
MPI_Send()
MPI_Wait()
The other option is to turn both functions into nonblocking functions and use MPI_Waitall instead:
MPI_Isend()
MPI_Irecv()
MPI_Waitall()
This last option is usually the best. The only thing that you'll need to be careful about is that you don't overwrite your own data. Right now you're using the same buffer for both the send and receive operations. If both of these are happening at the same time, there's no guarantees about the ordering. Normally this doesn't make a difference. Whether you send the message first or receive it doesn't really matter. However, in this case it does. If you receive data first, you'll end up sending the same data back out again instead of sending the data you had before the receive operation. You can solve this by using a temporary buffer to stage your data and move it to the right place when it's safe.

MPI RMA operation: Ordering between MPI_Win_free and local load

I'm trying to do a simple test on MPI's RMA operation using MPI_Win_lock and MPI_Win_unlock. The program just let process 0 to update the integer value in process 1 and display it.
The below program runs correctly (at least the result seems correct to me):
#include "mpi.h"
#include "stdio.h"
#define root 0
int main(int argc, char *argv[])
{
int myrank, nprocs;
int send, recv, err;
MPI_Win nwin;
int *st;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Alloc_mem(1*sizeof(int), MPI_INFO_NULL, &st);
st[0] = 0;
if (myrank != root) {
MPI_Win_create(st, 1*sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &nwin);
}
else {
MPI_Win_create(NULL, 0, sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &nwin);
}
if (myrank == root) {
st[0] = 1;
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, nwin);
MPI_Put(st, 1, MPI_INT, 1, 0, 1, MPI_INT, nwin);
MPI_Win_unlock(1, nwin);
MPI_Win_free(&nwin);
}
else { // rank 1
MPI_Win_free(&nwin);
printf("Rank %d, st = %d\n", myrank, st[0]);
}
MPI_Free_mem(st);
MPI_Finalize();
return 0;
}
The output I got is Rank 1, st = 1. But curiously, if I switch the lines in the else block for rank 1 to
else { // rank 1
printf("Rank %d, st = %d\n", myrank, st[0]);
MPI_Win_free(&nwin);
}
The output is Rank 1, st = 0.
I cannot find out the reason behind it, and why I need to put MPI_Win_free after loading the data is originally I need to put all the stuff in a while loop and let rank 0 to determine when to stop the loop. When condition is satisfied, I try to let rank 0 to update the flag (st) in rank 1. I try to put the MPI_Win_free outside the while loop so that the window will only be freed after the loop. Now it seems that I cannot do this and need to create and free the window every time in the loop?
I'll be honest, MPI RMA is not my speciality, but I'll give this a shot:
The problem is that you're running into a race condition. When you do the MPI_PUT operation, it sends the data from rank 0 to rank 1 to be put into the buffer at some point in the future. You don't have any control over that from rank 0's perspective.
One rank 1's side, you're not doing anything to complete the operation. I know that RMA (or one-sided operations) sound like they shouldn't require any intervention on the target side, but the do require a bit. When you use one-sided operations, you have to have something on the receiving side that also synchronizes the data. In this case, you're trying to use MPI put/get operations in combination with non-MPI load store operations. This is erroneous and results in the race condition you're seeing. When you switch the MPI_WIN_FREE to be first, you complete all of the outstanding operations so your data is correct.
You can find out lots more about passive target synchronization (which is what you're doing here) with this question: MPI with C: Passive RMA synchronization.

Changing value of a variable with MPI

#include<stdio.h>
#include<mpi.h>
int a=1;
int *p=&a;
int main(int argc, char **argv)
{
MPI_Init(&argc,&argv);
int rank,size;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
//printf("Address val: %u \n",p);
*p=*p+1;
MPI_Barrier(MPI_COMM_WORLD);
MPI_Finalize();
printf("Value of a : %d\n",*p);
return 0;
}
Here, I am trying to execute the program with 3 processes where each tries to increment the value of a by 1, so the value at the end of execution of all processes should be 4. Then why does the value printed as 2 only at the printf statement after MPI_Finalize(). And isnt it that the parallel execution stops at MPI_Finalize() and there should be only one process running after it. Then why do I get the print statement 3 times, one for each process, during execution?
It is a common misunderstanding to think that mpi_init starts up the requested number of processes (or whatever mechanism is used to implement MPI) and that mpi_finalize stops them. It's better to think of mpi_init starting the MPI system on top of a set of operating-system processes. The MPI standard is silent on what MPI actually runs on top of and how the underlying mechanism(s) is/are started. In practice a call to mpiexec (or mpirun) is likely to fire up a requested number of processes, all of which are alive when the program starts. It is also likely that the processes will continue to live after the call to mpi_finalize until the program finishes.
This means that prior to the call to mpi_init, and after the call to mpi_finalize it is likely that there is a number of o/s processes running, each of them executing the same program. This explains why you get the printf statement executed once for each of your processes.
As to why the value of a is set to 2 rather than to 4, well, essentially you are running n copies of the same program (where n is the number of processes) each of which adds 1 to its own version of a. A variable in the memory of one process has no relationship to a variable of the same name in the memory of another process. So each process sets a to 2.
To get any data from one process to another the processes need to engage in message-passing.
EDIT, in response to OP's comment
Just as a variable in the memory of one process has no relationship to a variable of the same name in the memory of another process, a pointer (which is a kind of variable) has no relationship to a pointer of the same name in the memory of another process. Do not be fooled, if the ''same'' pointer has the ''same'' address in multiple processes, those addresses are in different address spaces and are not the same, the pointers don't point to the same place.
An analogy: 1 High Street, Toytown is not the same address as 1 High Street, Legotown; there is a coincidence in names across address spaces.
To get any data (pointer or otherwise) from one process to another the processes need to engage in message-passing. You seem to be clinging to a notion that MPI processes share memory in some way. They don't, let go of that notion.
Since MPI is only giving you the option to communicate between separate processes, you have to do message passing. For your purpose there is something like MPI_Allreduce, which can sum data over the separate processes. Note that this adds the values, so in your case you want to sum the increment, and add the sum later to p:
int inc = 1;
MPI_Allreduce(MPI_IN_PLACE, &inc, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
*p += inc;
In your implementation there is no communication between the spawned threads. Each process has his own int a variable which it increments and prints to the screen. Making the variable global doesn't make it shared between processes and all the pointer gimmicks show me that you don't know what you are doing. I would suggest learning a little more C and Operating Systems before you move on.
Anyway, you have to make the processes communicate. Here's how an example might look like:
#include<stdio.h>
#include<mpi.h>
// this program will count the number of spawned processes in a *very* bad way
int main(int argc, char **argv)
{
int partial = 1;
int sum;
int my_id = 0;
// let's just assume the process with id 0 is root
int root_process = 0;
// spawn processes, etc.
MPI_Init(&argc,&argv);
// every process learns his id
MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
// all processes add their 'partial' to the 'sum'
MPI_Reduce(&partial, &sum, 1, MPI_INT, MPI_SUM, root_process, MPI_COMM_WORLD);
// de-init MPI
MPI_Finalize();
// the root process communicates the summation result
if (my_id == root_process)
{
printf("Sum total : %d\n", sum);
}
return 0;
}

What will happen when root in MPI_Bcast is random

Usually when we call MPI_Bcast, the root must be decided. But now I have an application that does not care who will broadcast the message. It means the node that will broadcast the message is random, but once the message is broadcasted, a global variable must be consistent. My understanding is that since MPI_Bcast is a collective function, all nodes need to call it but the order may be different. So who arrives at MPI_Bcast first, who will broadcast the message to others. I ran the following code with 3 nodes, I think if node 1 (rank==1) arrives at MPI_Bcast first, it will send the local_count value (1) to other nodes,and then all nodes update global_count with the same local_count, so one of my expected result is (the output order does not matter)
node 0, global count is 1
node 1, global count is 1
node 2, global count is 1
But the the actual result is always (the output order does not matter):
node 1, global count is 1
node 0, global count is 0
node 2, global count is 2
This result is exactly the same as the code without MPI_Bcast. So is there anything wrong with my understanding of MPI_Bcast or my code. Thanks.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
int rank, size;
int local_count, global_count;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
global_count = 0;
local_count = rank;
MPI_Bcast(&local_count, 1, MPI_INT, rank, MPI_COMM_WORLD);
global_count += local_count;
printf("node %d, global count is: %d\n", rank, global_count);
MPI_Finalize();
}
The code is a simplified case. In my application, there are some computations before MPI_Bcast and I don't know who will finish the computation first. Whenever a node comes to MPI_Bcast point, it needs to broadcast its own computation result local variable and all nodes update the global variable. So all nodes need to broadcast a message but we don't know the order. How to implement this idea?
Normally what you have written will lead to deadlock. In a typical MPI_Bcast case the process with rank root sends its data to all other processes in the comunicator. You need to specify the same root rank in those receiving processes so they know whom to "listen" to. This is an oversimplified description since usually hierarchial broadcast is used in orderto reduce the total operation time but with three processes this hierarchial implementation reduces to the very simple linear one. In your case process with rank 0 will try to send the same message to both processes 1 and 2. In the same time process 1 would not receive that message but instead will try to send its one to processes 0 and 2. Process 2 will also be trying to send to 0 and 1. In the end every process will be sending messages that no other process will be willing to receive. This is an almost sure recipe for distaster.
Why your program doesn't hang? Because messages being sent are very small, only one MPI_INT element and also the number of processes is small, thus those sends are all being buffered internally by the MPI library in every process - they never reach their destinations but nevertheless the calls made internally by MPI_Bcast do not block and your code gets back execution control although the operation is still in progress. This is undefined behaviour since the MPI library is not required to buffer anything by the standard - some implementations might buffer, others might not.
If you are trying to compute the sum of all local_count variables in all processes, then just use MPI_Allreduce. Replace this:
global_count = 0;
local_count = rank;
MPI_Bcast(&local_count, 1, MPI_INT, rank, MPI_COMM_WORLD);
global_count += local_count;
with this:
local_count = rank;
MPI_Allreduce(&local_count, &global_count, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
The correct usage of MPI_BCast is that all processes call the function with the same root. Even if you don't care about who the broadcaster is, all the processes must call the function with the same rank to listen to the broadcaster. In your code, each process are calling MPI_BCast with their own rank and all different from each other.
I haven't look at the standard document, but it is likely that you trigger undefined behavior by calling MPI_BCast with different ranks.
from the openmpi docs
"MPI_Bcast broadcasts a message from the process with rank root to all processes of the group, itself included. It is called by all members of group using the same arguments for comm, root. On return, the contents of root’s communication buffer has been copied to all processes. "
calling it might from ranks != root might cause some problems
a safer way is to hard code your own broadcast function and call that
its basically a for loop and a mpi_send command and shouldn't be to difficult to implement your self

Resources