MPI RMA operation: Ordering between MPI_Win_free and local load - c

I'm trying to do a simple test of MPI's RMA operations using MPI_Win_lock and MPI_Win_unlock. The program just lets process 0 update an integer value in process 1, which then displays it.
The below program runs correctly (at least the result seems correct to me):
#include "mpi.h"
#include "stdio.h"

#define root 0

int main(int argc, char *argv[])
{
    int myrank, nprocs;
    int send, recv, err;
    MPI_Win nwin;
    int *st;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    MPI_Alloc_mem(1*sizeof(int), MPI_INFO_NULL, &st);
    st[0] = 0;

    if (myrank != root) {
        MPI_Win_create(st, 1*sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &nwin);
    }
    else {
        MPI_Win_create(NULL, 0, sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &nwin);
    }

    if (myrank == root) {
        st[0] = 1;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, nwin);
        MPI_Put(st, 1, MPI_INT, 1, 0, 1, MPI_INT, nwin);
        MPI_Win_unlock(1, nwin);
        MPI_Win_free(&nwin);
    }
    else { // rank 1
        MPI_Win_free(&nwin);
        printf("Rank %d, st = %d\n", myrank, st[0]);
    }

    MPI_Free_mem(st);
    MPI_Finalize();
    return 0;
}
The output I got is Rank 1, st = 1. But curiously, if I switch the lines in the else block for rank 1 to
else { // rank 1
    printf("Rank %d, st = %d\n", myrank, st[0]);
    MPI_Win_free(&nwin);
}
The output is Rank 1, st = 0.
I cannot figure out the reason behind this. The reason the ordering of MPI_Win_free and the local load matters to me is that, originally, I need to put all of this inside a while loop and let rank 0 decide when to stop: when the stopping condition is satisfied, rank 0 updates the flag (st) in rank 1. I tried to put MPI_Win_free outside the while loop so that the window is only freed after the loop. Now it seems I cannot do this and would have to create and free the window on every iteration of the loop?

I'll be honest, MPI RMA is not my speciality, but I'll give this a shot:
The problem is that you're running into a race condition. When you do the MPI_PUT operation, it sends the data from rank 0 to rank 1 to be put into the buffer at some point in the future. You don't have any control over that from rank 0's perspective.
On rank 1's side, you're not doing anything to complete the operation. I know that RMA (one-sided) operations sound like they shouldn't require any intervention on the target side, but they do require a bit. When you use one-sided operations, you have to have something on the receiving side that also synchronizes the data. In this case, you're trying to use MPI put/get operations in combination with non-MPI load/store operations. That is erroneous and results in the race condition you're seeing. When you move MPI_WIN_FREE to come first, it completes all of the outstanding operations, so your data is correct.
You can find out lots more about passive target synchronization (which is what you're doing here) with this question: MPI with C: Passive RMA synchronization.
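If you want rank 1 to be able to read the flag without freeing the window (for example inside your while loop), the target has to synchronize its own window before the ordinary load. Here is a minimal sketch of one such pattern (untested), assuming only two ranks and using a barrier purely to order rank 0's unlock before rank 1's read:
if (myrank == root) {
    st[0] = 1;
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, nwin);
    MPI_Put(st, 1, MPI_INT, 1, 0, 1, MPI_INT, nwin);
    MPI_Win_unlock(1, nwin);          /* the Put is complete at the target here */
    MPI_Barrier(MPI_COMM_WORLD);      /* tell rank 1 the epoch is over */
}
else { // rank 1
    MPI_Barrier(MPI_COMM_WORLD);      /* wait until rank 0 has unlocked */
    /* A lock/unlock epoch on the local window synchronizes the public
       window copy with the private copy, so the plain load sees the data. */
    MPI_Win_lock(MPI_LOCK_SHARED, myrank, 0, nwin);
    printf("Rank %d, st = %d\n", myrank, st[0]);
    MPI_Win_unlock(myrank, nwin);
}
MPI_Win_free(&nwin);   /* can now stay outside the loop */
With this kind of synchronization on the target side, the window can be created once and freed after the loop.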

Related

(C) MPI Processes Fail to Respond After For Loop

Suppose that I have a code block that attempts to calculate the first 15 numbers in the Fibonacci sequence and distribute each number among 3 processes (MPI_Send) using a for loop, as shown in the code block below.
int main(int argc, char* argv[]) {
    int rank, size, recieve_data1, recieve_data2;
    MPI_Init(NULL, NULL);
    MPI_Status status;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Available ranks are: %d \n \n", rank); // first rank rollcall
    fflush(stdout);

    int num1 = 1; int num2 = 1;
    int RecieveNum; int SumNum;

    for (int n = 0; n < 16; n++) {
        if (rank == 0) {
            // perform the fibb sequence algorithim
            SumNum = num1 + num2;
            num1 = num2;
            num2 = SumNum;
            // define the sorting algorithim
            int DeliverTo = (n % 3) + 1;
            // send calculated result
            MPI_Send(&SumNum, 1, MPI_INT, DeliverTo, 1, MPI_COMM_WORLD);
        }
        else {
            // recieve the element integer
            MPI_Recv(&RecieveNum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
            // print and flush the buffer
            printf("I am process rank %d, and I recieved the number %d. \n", rank, RecieveNum);
            fflush(stdout);
        }
    }

    printf("Available ranks are: %d \n \n", rank); // second rank roll call
    fflush(stdout);

    /*
    more code that I run...
    */

    MPI_Finalize();
    return 0;
}
Before the first for loop is called, processes 0, 1, 2, 3 all respond to the first printf("Available ranks are: %d \n \n", rank);. However, after executing the first for loop and reaching the second printf, only process 0 responds. I was expecting all 4 processes 0-3 to respond again after the for loop. To solve this problem, I isolated this section of code and attempted to debug it for several hours with no success. This issue is particularly problematic, as I have additional code (not shown here for the sake of being concise) that needs the numbers generated by this sequence.
Finally, I am running the code by building the solution, running the VS terminal as an administrator, and typing mpiexec -n 4 my_file_name.exe. No build errors or compilation mistakes occurred. From what I can see (correct me if I'm wrong), all processes hang after completing the for loop, but I am unsure why or how to fix it.
After searching the website, I did not see anything that answered this question (from my point of view). I am a bit of an MPI (and Stack Overflow) newbie, so any code pointers are also welcome. Thanks.
You have process zero compute who to send to, and then every other process posts a receive. That means that all the processes that are not the computed receiver will hang.
This scenario, where you send to a dynamically computed receiver, is not easy to do in MPI. You either need to
send a message to all other processes saying "no, I have nothing for you", or
send the message to everyone and have all-but-one process ignore the data, or
use one-sided operations where you MPI_Put the data on the computed receiver.
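In this particular code, though, the target rank is computed from n alone, so every rank can work out for itself whether it is the receiver on a given iteration before posting a receive. A minimal sketch of that fix (untested), assuming the same 4-rank setup and variables as in the question:
for (int n = 0; n < 16; n++) {
    int DeliverTo = (n % 3) + 1;              // same mapping rank 0 uses
    if (rank == 0) {
        SumNum = num1 + num2;
        num1 = num2;
        num2 = SumNum;
        MPI_Send(&SumNum, 1, MPI_INT, DeliverTo, 1, MPI_COMM_WORLD);
    }
    else if (rank == DeliverTo) {             // receive only when targeted
        MPI_Recv(&RecieveNum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
        printf("I am process rank %d, and I received the number %d.\n", rank, RecieveNum);
        fflush(stdout);
    }
}
With this change, each rank posts exactly as many receives as rank 0 sends to it, so nobody is left blocking after the loop.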

How should I broadcast with MPI if I want to compute the sum across multiple processors?

I am starting to get into parallel computing, and have started with MPI using C. I understand how to do this with point-to-point calls (send/recv); my confusion is when I try to use collective communication with bcast and reduce.
My code goes as follows:
int collective(int val, int rank, int n, int *toSum){
    int *globalBuf = malloc(n*sizeof(int*));
    int globalSum = 0;
    int localSum = 0;
    struct timespec before;

    if(rank==0){
        //only rank 0 will start timer
        clock_gettime(CLOCK_MONOTONIC, &before);
    }

    int numInts = (val*100000)/n;
    int *mySum = malloc((numInts)*sizeof(int *));
    int j;
    for(j=rank*numInts; j<numInts*rank+numInts; j++){
        localSum = localSum + (toSum[j]);
    }

    MPI_Bcast(&localSum, 1, MPI_INT, rank, MPI_COMM_WORLD);
    MPI_Reduce(&localSum, &globalSum, n, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if(rank==0){
        printf("Communicative sum = %d\n", globalSum);
        //only rank 0 will end the timer
        //and display
        struct timespec after;
        clock_gettime(CLOCK_MONOTONIC, &after);
        printf("Time to complete = %f\n", (after.tv_nsec-before.tv_nsec));
    }
}
Where the parameters being passed in can be described as:
val = the number of total ints that need to be summed - divided by 100000
rank= the rank of this process
n = the total number of processes
toSum = the ints that are going to be added together
Where I begin to run into errors is when I try to broadcast this processor's localSum to be handled by rank 0.
I will explain what I've put into the function calls so you can see where my confusion comes from.
For MPI_Bcast:
&localSum - the address of this process's sum
1 - there is one value that I want to broadcast, the int held by localSum
MPI_INT - meaning implied
rank - the rank of this process that is broadcasting
MPI_COMM_WORLD - meaning implied
For MPI_Reduce
&localSum - the address of the variable that it will "reducing"
&globalSum - the address of the variable that I want to hold the reduced values of localSum
n - the number of "localSum"s that this process will reduce (n is number of processes)
MPI_INT - meaning implied
MPI_SUM - meaning implied
0 - I want rank 0 to be the process that will reduce so it can print
MPI_COMM_WORLD - meaning implied
When I look through the code, I feel it makes sense logically, and it compiles okay; however, when I run the program with m processes, I get the following error message:
Assertion failed in file src/mpi/coll/helper_fns.c at line 84: FALSE
memcpy argument memory ranges overlap, dst_=0x7fffffffd2ac src_=0x7fffffffd2a8 len_=16
internal ABORT - process 0
Can anyone help me find a solution? Apologies to anyone who sees this as second nature; this is only my third parallel program, and my first time using bcast/reduce!
I see two issues in the calls to collective operations (MPI_Bcast, MPI_Reduce) in your code. First, in MPI_Reduce you are reducing an integer localSum from every process into an integer globalSum: basically a single integer. But in your MPI_Reduce call you are trying to reduce n values, when in reality you just need to reduce 1 value from each of the n processes. That is what causes this error.
The reduce should ideally be like this, if you want to reduce a single value:
MPI_Reduce(&localSum, &globalSum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
For the broadcast,
MPI_Bcast(&localSum, 1, MPI_INT, rank, MPI_COMM_WORLD);
every rank is acting as the root in your call. In a broadcast there should be a single root process that broadcasts the value to all the other processes. So the call should look like this:
int rootProcess = 0;
MPI_Bcast(&localSum, 1, MPI_INT, rootProcess, MPI_COMM_WORLD);
Here, rootProcess will send the value contained in its localSum to all the processes. Meanwhile, every process calling this broadcast will receive the value from rootProcess and store it in its local variable localSum.
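For the sum itself, note that only the reduce is actually needed to get the total onto rank 0; a broadcast only makes sense afterwards, if every rank needs the final result. A minimal sketch of how the two corrected calls usually fit together, assuming toSum is already distributed as in the question:
int localSum = 0, globalSum = 0;
for (int j = rank*numInts; j < rank*numInts + numInts; j++)
    localSum += toSum[j];

// one value per rank is combined into one value on rank 0
MPI_Reduce(&localSum, &globalSum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

// rank 0 is the single root; every other rank receives its globalSum
MPI_Bcast(&globalSum, 1, MPI_INT, 0, MPI_COMM_WORLD);
If every rank needs the total anyway, MPI_Allreduce combines the two steps into a single call.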

Recursive Function using MPI library

I am using a recursive function to find the determinant of a 5x5 matrix. This sounds like a trivial problem, and it could be handled with OpenMP if the dimensions were huge, but I am trying to use MPI to solve it. However, I can't understand how to deal with the recursion that accumulates the results.
So my question is: how do I use MPI for this?
PS: The matrix is a Hilbert matrix so the answer would be 0
I have written the code below, but I think it simply does the same work n times rather than dividing the problem and then accumulating the result.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#define ROWS 5
double getDeterminant(double matrix[5][5], int iDim);
double matrix[ROWS][ROWS] = {
{1.0, -0.5, -0.33, -0.25,-0.2},
{-0.5, 0.33, -0.25, -0.2,-0.167},
{-0.33, -0.25, 0.2, -0.167,-0.1428},
{-0.25,-0.2, -0.167,0.1428,-0.125},
{-0.2, -0.167,-0.1428,-0.125,0.111},
};
int rank, size, tag = 0;
int main(int argc, char** argv)
{
    //Status of messages for each individual rows
    MPI_Status status[ROWS];
    //Message ID or Rank
    MPI_Request req[ROWS];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double result;
    result = getDeterminant(matrix, ROWS);
    printf("The determinant is %lf\n", result);

    //Set barrier to wait for all processess
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
double getDeterminant(double matrix[ROWS][ROWS], int iDim)
{
    int iCols, iMinorRow, iMinorCol, iTemp, iSign;
    double c[5];
    double tempMat[5][5];
    double dDet;
    dDet = 0;

    if (iDim == 2)
    {
        dDet = (matrix[0][0] * matrix[1][1]) - (matrix[0][1] * matrix[1][0]);
        return dDet;
    }
    else
    {
        for (iCols = 0; iCols < iDim; iCols++)
        {
            int temp_row = 0, temp_col = 0;
            for (iMinorRow = 0; iMinorRow < iDim; iMinorRow++)
            {
                for (iMinorCol = 0; iMinorCol < iDim; iMinorCol++)
                {
                    if (iMinorRow != 0 && iMinorCol != iCols)
                    {
                        tempMat[temp_row][temp_col] = matrix[iMinorRow][iMinorCol];
                        temp_col++;
                        if (temp_col >= iDim - 1)
                        {
                            temp_row++;
                            temp_col = 0;
                        }
                    }
                }
            }
            //Handling the alternate signs while calculating the determinant
            for (iTemp = 0, iSign = 1; iTemp < iCols; iTemp++)
            {
                iSign = (-1) * iSign;
            }
            //Evaluating what has been calculated if the resulting matrix is 2x2
            c[iCols] = iSign * getDeterminant(tempMat, iDim - 1);
        }
        for (iCols = 0, dDet = 0.0; iCols < iDim; iCols++)
        {
            dDet = dDet + (matrix[0][iCols] * c[iCols]);
        }
        return dDet;
    }
}
The expected result should be a very small value close to 0. I am getting that result, but without actually using MPI.
The provided program will be executed in n processes: mpirun launches the n processes and they all execute the provided code. That is the expected behaviour. Unlike OpenMP, MPI is not a shared-memory programming model but a distributed-memory programming model. It uses message passing to communicate between processes. There are no global variables in MPI; all the data in your program is local to your process. If you need to share data between processes, you have to send it explicitly with MPI_Send, MPI_Bcast, etc. You can use collective operations like MPI_Bcast to send it to all processes, or point-to-point operations like MPI_Send to send it to specific processes.
For your application to do what you expect, you have to tailor it for MPI (unlike in OpenMP, where you can use pragmas). All processes have an identifier, or rank. Typically, rank 0 (let's call it your main process) should pass the data to all processes using MPI_Send (or another method) and the remaining processes should receive it with a matching MPI_Recv. After receiving its local data from the main process, each process should perform some computation on it and then send the results back to the main process, which aggregates them. This is a very basic scenario using MPI; you can also use MPI I/O, etc.
MPI does not do anything by itself for synchronization or data sharing. It just launches n instances of the application and provides the required routines. The application developer is in charge of communication (data structures etc.), synchronization (using MPI_Barrier), and so on among processes.
Following is a simple send/receive program using MPI. When you run the code below, say with n as 2, two copies of the program are launched. Using MPI_Comm_rank(), each process gets its id, which can be used for further computation and for controlling the flow of the code. In the code below, the process with rank 0 sends the variable number using MPI_Send, and the process with rank 1 receives this value using MPI_Recv. The if and else if differentiate between the processes and change the control flow to send and receive the data. This is a very basic MPI program that shares data between processes.
// Find out rank, size
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

int number;
if (world_rank == 0) {
    number = -1;
    MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (world_rank == 1) {
    MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process 1 received number %d from process 0\n", number);
}
Here is a tutorial on MPI.
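To make the determinant code itself follow that pattern, one option is to split the cofactor expansion along row 0 across the ranks and reduce the partial sums onto rank 0. A rough sketch under those assumptions (untested; buildMinor is a hypothetical helper that copies the matrix with row 0 and column iCols removed, i.e. the inner double loop from getDeterminant):
double partial = 0.0, det = 0.0;
for (int iCols = rank; iCols < ROWS; iCols += size) {
    double minor[ROWS][ROWS];
    buildMinor(matrix, minor, iCols);                 // hypothetical helper
    double sign = (iCols % 2 == 0) ? 1.0 : -1.0;      // alternating cofactor sign
    partial += sign * matrix[0][iCols] * getDeterminant(minor, ROWS - 1);
}
// combine the per-rank partial sums on rank 0
MPI_Reduce(&partial, &det, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("The determinant is %lf\n", det);
Each rank then only evaluates the cofactors for its own columns, so the recursion no longer runs identically on every process.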

MPI merge multiple intercomms into a single intracomm

I am trying to set up a bunch of spawned processes in a single intracomm. I need to spawn the processes into unique working directories since these subprocesses will write out a bunch of files. After all the processes are spawned, I want to merge them into a single intra-communicator. To try this out I set up a simple test program.
int main(int argc, const char * argv[]) {
    int rank, size;
    const int num_children = 5;
    int error_codes;

    MPI_Init(&argc, (char ***)&argv);

    MPI_Comm parentcomm;
    MPI_Comm childcomm;
    MPI_Comm intracomm;

    MPI_Comm_get_parent(&parentcomm);
    if (parentcomm == MPI_COMM_NULL) {
        printf("Current Command %s\n", argv[0]);
        for (size_t i = 0; i < num_children; i++) {
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &childcomm, &error_codes);
            MPI_Intercomm_merge(childcomm, 0, &intracomm);
            MPI_Barrier(childcomm);
        }
    } else {
        MPI_Intercomm_merge(parentcomm, 1, &intracomm);
        MPI_Barrier(parentcomm);
    }

    printf("Test\n");
    MPI_Barrier(intracomm);
    printf("Test2\n");

    MPI_Comm_rank(intracomm, &rank);
    MPI_Comm_size(intracomm, &size);
    printf("Rank %d of %d\n", rank + 1, size);

    MPI_Barrier(intracomm);
    MPI_Finalize();
    return 0;
}
When I run this I get all 6 processes, but my intracomm only spans the parent and the last child spawned. The resulting output is
Test
Test
Test
Test
Test
Test
Test2
Rank 1 of 2
Test2
Rank 2 of 2
Is there a way to merge multiple communicators into a single communicator? Also note that I'm spawning these one at a time since I need each subprocess to execute in a unique working directory.
I realize I'm a year out of date with this answer, but I thought maybe other people might want to see an implementation of this. As the original respondent said, there is no way to merge three (or more) communicators. You have to build up the new intra-comm one at a time. Here is the code I use. This version deletes the original intra-comm; you may or may not want to do that depending on your particular application:
#include <mpi.h>
// The Borg routine: given
// (1) a (quiesced) intra-communicator with one or more members, and
// (2) a (quiesced) inter-communicator with exactly two members, one
// of which is rank zero of the intra-communicator, and
// the other of which is an unrelated spawned rank,
// return a new intra-communicator which is the union of both inputs.
//
// This is a collective operation. All ranks of the intra-
// communicator, and the remote rank of the inter-communicator, must
// call this routine. Ranks that are members of the intra-comm must
// supply the proper value for the "intra" argument, and MPI_COMM_NULL
// for the "inter" argument. The remote inter-comm rank must
// supply MPI_COMM_NULL for the "intra" argument, and the proper value
// for the "inter" argument. Rank zero (only) of the intra-comm must
// supply proper values for both arguments.
//
// N.B. It would make a certain amount of sense to split this into
// separate routines for the intra-communicator processes and the
// remote inter-communicator process. The reason we don't do that is
// that, despite the relatively few lines of code, what's going on here
// is really pretty complicated, and requires close coordination of the
// participating processes. Putting all the code for all the processes
// into this one routine makes it easier to be sure everything "lines up"
// properly.
MPI_Comm
assimilateComm(MPI_Comm intra, MPI_Comm inter)
{
    MPI_Comm peer = MPI_COMM_NULL;
    MPI_Comm newInterComm = MPI_COMM_NULL;
    MPI_Comm newIntraComm = MPI_COMM_NULL;

    // The spawned rank will be the "high" rank in the new intra-comm
    int high = (MPI_COMM_NULL == intra) ? 1 : 0;

    // If this is one of the (two) ranks in the inter-comm,
    // create a new intra-comm from the inter-comm
    if (MPI_COMM_NULL != inter) {
        MPI_Intercomm_merge(inter, high, &peer);
    } else {
        peer = MPI_COMM_NULL;
    }

    // Create a new inter-comm between the pre-existing intra-comm
    // (all of it, not only rank zero), and the remote (spawned) rank,
    // using the just-created intra-comm as the peer communicator.
    int tag = 12345;
    if (MPI_COMM_NULL != intra) {
        // This task is a member of the pre-existing intra-comm
        MPI_Intercomm_create(intra, 0, peer, 1, tag, &newInterComm);
    }
    else {
        // This is the remote (spawned) task
        MPI_Intercomm_create(MPI_COMM_SELF, 0, peer, 0, tag, &newInterComm);
    }

    // Now convert this inter-comm into an intra-comm
    MPI_Intercomm_merge(newInterComm, high, &newIntraComm);

    // Clean up the intermediaries
    if (MPI_COMM_NULL != peer) MPI_Comm_free(&peer);
    MPI_Comm_free(&newInterComm);

    // Delete the original intra-comm
    if (MPI_COMM_NULL != intra) MPI_Comm_free(&intra);

    // Return the new intra-comm
    return newIntraComm;
}
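For completeness, here is a rough sketch (untested) of how the spawn loop from the question might drive this routine. Assumptions: the fragment runs inside main() after MPI_Init, only the original parent ever calls MPI_Comm_spawn (over MPI_COMM_SELF), and num_children is a constant known to every process.
MPI_Comm parentcomm, childcomm, intracomm;
MPI_Comm_get_parent(&parentcomm);

if (parentcomm == MPI_COMM_NULL) {
    MPI_Comm_dup(MPI_COMM_SELF, &intracomm);                 /* parent starts alone */
} else {
    intracomm = assimilateComm(MPI_COMM_NULL, parentcomm);   /* new child folds itself in */
}

int members;
MPI_Comm_size(intracomm, &members);                          /* parent + children so far */
for (int i = members - 1; i < num_children; i++) {
    int myrank;
    MPI_Comm_rank(intracomm, &myrank);
    if (myrank == 0) {                                       /* the original parent */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &childcomm, MPI_ERRCODES_IGNORE);
    } else {
        childcomm = MPI_COMM_NULL;                           /* non-leaders just join the merge */
    }
    intracomm = assimilateComm(intracomm, childcomm);
}
/* at this point every process, parent and children alike, is in one intracomm */
Every rank that is already part of the merged intra-comm takes part in each merge, which is exactly the requirement described in the answer below.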
If you are going to do this by calling MPI_COMM_SPAWN multiple times, then you'll have to do it more carefully. After you call SPAWN the first time, the spawned process also needs to take part in the next call to SPAWN; otherwise it will be left out of the communicator you're merging.
The problem in your version is that only two processes participate in each MPI_INTERCOMM_MERGE, and since you can't merge three communicators, you'll never end up with one big communicator that way.
If you instead have each process participate in the merge as it goes, you end up with one big communicator in the end.
Of course, you could just spawn all of your extra processes at once, but it sounds like you have other reasons for not doing that.

Is there a method in MPI that acts like an MPI_Bcast but for an individual process, instead of operating on an entire communicator?

Essentially what I am looking for here is a simple MPI_SendRecv() routine that allows me to synchronize the same buffer by specifying a source and a destination processor.
In my mind, the call to my Ideal_MPI_SendRecv() function would look much like MPI_Bcast(), but it would specify a destination process instead of a communicator.
It might be called as follows:
Ideal_MPI_SendRecv(&somebuffer, bufferlength, datatype, source_proc, destination_proc);
If not, is there any reason? It seems like this would be the perfect method for synchronizing a variable's value between two processes.
No, there is no such call in MPI, since it is trivial to implement using point-to-point communication. Of course you could write one yourself, for example (with some rudimentary error handling):
// Just a random tag that is unlikely to be used by the rest of the program
#define TAG_IDEAL_SNDRCV 11223
int Ideal_MPI_SendRecv(void *buf, int count, MPI_Datatype datatype,
                       int source, int dest, MPI_Comm comm)
{
    int rank;
    int err;

    if (source == dest)
        return MPI_SUCCESS;

    err = MPI_Comm_rank(comm, &rank);
    if (err != MPI_SUCCESS)
        return err;

    if (rank == source)
        err = MPI_Send(buf, count, datatype, dest, TAG_IDEAL_SNDRCV, comm);
    else if (rank == dest)
        err = MPI_Recv(buf, count, datatype, source, TAG_IDEAL_SNDRCV, comm,
                       MPI_STATUS_IGNORE);

    return err;
}

// Example: transfer 'int buf[10]' from rank 0 to rank 2
Ideal_MPI_SendRecv(buf, 10, MPI_INT, 0, 2, MPI_COMM_WORLD);
You could also add another output argument of type MPI_Status * and store the status of MPI_Recv there. It could be useful if both processes have different buffer sizes.
Another option would be, if you have to do that many times within a fixed set of ranks, e.g. always from rank 0 to rank 2, you could simply create a new communicator and broadcast inside it:
int rank;
MPI_Comm buddycomm;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_split(MPI_COMM_WORLD, (!rank || rank == 2) ? 0 : MPI_UNDEFINED, rank,
&buddycomm);
// Transfer 'int buf[10]' from rank 0 to rank 2
MPI_Bcast(buf, 10, MPI_INT, 0, buddycomm);
This, of course, is overkill, since the broadcast is more expensive than the simple combination of MPI_Send and MPI_Recv.
Perhaps you want to call MPI_Send on one process (the source process, with the values you want) and MPI_Recv on another process (the one which doesn't initially have the values you want)?
If not, could you clarify how what you're trying to accomplish differs from a simple point-to-point message?
