I'm writing a simple program in C with the MPI library.
The intent of this program is the following:
I have a group of processes that perform an iterative loop, and at the end of this loop all processes in the communicator must call two collective functions (MPI_Allreduce and MPI_Bcast). The first one determines the id of the process that generated the minimum value of the num.val variable, and the second one broadcasts from the source num_min.idx_v to all processes in the communicator MPI_COMM_WORLD.
The problem is that I don't know whether the i-th process will be finalized before calling the collective functions. Every process has a probability of 1/10 of terminating at each iteration; this simulates the behaviour of the real program that I'm implementing. As soon as the first process terminates, the remaining ones deadlock.
This is the code:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
typedef struct double_int {
    double val;
    int idx_v;
} double_int;

int main(int argc, char **argv)
{
    int n = 10;
    int max_it = 4000;
    int proc_id, n_proc;
    double *x = (double *)malloc(n*sizeof(double));
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &n_proc);
    MPI_Comm_rank(MPI_COMM_WORLD, &proc_id);
    srand(proc_id);
    double_int num_min;
    double_int num;
    int k;
    for (k = 0; k < max_it; k++) {
        num.idx_v = proc_id;
        num.val = rand()/(double)RAND_MAX;
        if ((rand() % 10) == 0) {
            printf("iter %d: proc %d terminated\n", k, proc_id);
            MPI_Finalize();
            exit(EXIT_SUCCESS);
        }
        MPI_Allreduce(&num, &num_min, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);
        MPI_Bcast(x, n, MPI_DOUBLE, num_min.idx_v, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    exit(EXIT_SUCCESS);
}
Perhaps I should create a new group and a new communicator before calling the MPI_Finalize function in the if statement? How should I solve this?
If you have control over a process before it terminates, you can send a non-blocking flag to a rank that cannot terminate early (let's call it the root rank). Then, instead of the blocking allreduce, you could have every rank send its value to the root rank.
The root rank could post non-blocking receives for a possible flag and for the value; every rank would have to have sent one or the other. Once all ranks are accounted for, you can do the reduction on the root rank, remove the exited ranks from the communication, and broadcast the result.
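A minimal sketch of that scheme, assuming rank 0 is the root that never exits early; the tags TAG_VALUE and TAG_EXIT and the alive[] bookkeeping array are inventions of this sketch, not part of your code. For brevity it uses one blocking receive per live rank instead of non-blocking ones, which works here because every live rank sends exactly one message per iteration:
#define TAG_VALUE 1
#define TAG_EXIT  2

/* --- on a worker rank (proc_id != 0), once per iteration --- */
if ((rand() % 10) == 0) {
    /* announce the exit before leaving, so the root stops expecting us */
    MPI_Send(&num, 1, MPI_DOUBLE_INT, 0, TAG_EXIT, MPI_COMM_WORLD);
    MPI_Finalize();
    exit(EXIT_SUCCESS);
} else {
    MPI_Send(&num, 1, MPI_DOUBLE_INT, 0, TAG_VALUE, MPI_COMM_WORLD);
    /* ...and later receive the reduced num_min back from the root */
}

/* --- on the root rank (proc_id == 0), once per iteration --- */
num_min = num;                      /* root's own candidate */
for (int r = 1; r < n_proc; r++) {
    if (!alive[r]) continue;        /* alive[] starts out as all ones */
    double_int incoming;
    MPI_Status st;
    MPI_Recv(&incoming, 1, MPI_DOUBLE_INT, r, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
    if (st.MPI_TAG == TAG_EXIT)
        alive[r] = 0;               /* drop the exited rank from the bookkeeping */
    else if (incoming.val < num_min.val)
        num_min = incoming;         /* manual MINLOC reduction */
}
/* ...then send num_min (and the x buffer) back to the ranks that are still
   alive, again with point-to-point messages rather than MPI_Bcast. */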
If your ranks exit without notice, I am not sure what options you have.
Related
I'm creating MPI groups in a loop to perform a task, but when I want to free the group, the computation aborts. When should I free the group?
The error I get is:
[KLArch:13617] *** An error occurred in MPI_Comm_free
[KLArch:13617] *** reported by process [1712324609,2]
[KLArch:13617] *** on communicator MPI_COMM_WORLD
[KLArch:13617] *** MPI_ERR_COMM: invalid communicator
[KLArch:13617] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[KLArch:13617] *** and potentially your MPI job)
[KLArch:13611] 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[KLArch:13611] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
What I intend to do is a calculation with an increasing number of processes in parallel as a benchmark for MPI, i.e. doing the whole calculation with only 1 process and taking the time, then running the same calculation with 2 processes and taking the time, then with 4 processes, and so on, and comparing how the problem scales with the number of processes.
MWE:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
int j = 0;
int size = 0;
int rank = 0;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
// Get the group of processes in MPI_COMM_WORLD
MPI_Group world_group;
MPI_Comm_group(MPI_COMM_WORLD, &world_group);
// Construct a group containing all of the ranks smaller than i in MPI_COMM_WORLD
for (int i = 1; i <= size; i++)
{
int group_ranks[i];
for (int j = 0; j < i; j++)
{
group_ranks[j] = j;
}
// Construct a group with all the ranks smaller than i
MPI_Group sub_group;
MPI_Group_incl(world_group, i, group_ranks, &sub_group);
// Create a communicator based on the group
MPI_Comm sub_comm;
MPI_Comm_create(MPI_COMM_WORLD, sub_group, &sub_comm);
int sub_rank = -1;
int sub_size = -1;
// If this rank isn't in the new communicator, it will be
// MPI_COMM_NULL. Using MPI_COMM_NULL for MPI_Comm_rank or
// MPI_Comm_size is erroneous
if (MPI_COMM_NULL != sub_comm)
{
MPI_Comm_rank(sub_comm, &sub_rank);
MPI_Comm_size(sub_comm, &sub_size);
}
// Do some work
printf("WORLD RANK/SIZE: %d/%d \t Group RANK/SIZE: %d/%d\n",
rank, size, sub_rank, sub_size);
// Free the communicator and group
MPI_Barrier(MPI_COMM_WORLD);
//MPI_Comm_free(&sub_comm);
//MPI_Group_free(&sub_group);
j = 0;
}
MPI_Group_free(&world_group);
MPI_Finalize();
return 0;
}
The code throws the error if I uncomment MPI_Comm_free(&sub_comm); and MPI_Group_free(&sub_group); at the end of the loop.
Let me collect some remarks about this code.
That barrier is not needed.
if you test this on a multi-node system you have to be aware that your processes are not spread evenly: 13 processes on 3 six-core nodes would give you 6+6+1, which is unbalanced; you would want 5+4+4 or so. Your mpiexec would spread a full run correctly; achieving the same balance for the subsets in your code is a little harder. Just be aware of this since you are doing benchmarking.
It's a little tricky to get this code right. When you make a subgroup, all processes get the same value for the group, including the ones that are not in the group: they do not, for instance, get MPI_GROUP_NULL. Then you have to call MPI_Comm_create collectively on the large communicator; processes that are not in the group get MPI_COMM_NULL as the result, and they do not participate in any actions on the subcommunicator. Also, and this was your problem: those processes must not free the subcommunicator, but they do free the subgroup.
(That last point was also pointed out by #GillesGouaillardet)
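Concretely, assuming the rest of the MWE stays as posted, the end of the loop body would look something like this:
// Free the communicator and group
if (sub_comm != MPI_COMM_NULL)
    MPI_Comm_free(&sub_comm);   /* only the ranks that actually got a communicator */
MPI_Group_free(&sub_group);     /* every rank holds a valid group handle */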
Suppose that I have a code block that attempts to calculate the first 15 numbers in the Fibonacci sequence and distributes each unique number among 3 processes (MPI_Send) using a for loop as shown in the code block below.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int rank, size, recieve_data1, recieve_data2;
    MPI_Init(NULL, NULL);
    MPI_Status status;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Available ranks are: %d \n \n", rank); // first rank roll call
    fflush(stdout);
    int num1 = 1; int num2 = 1;
    int RecieveNum; int SumNum;
    for (int n = 0; n < 16; n++) {
        if (rank == 0) {
            // perform the Fibonacci sequence algorithm
            SumNum = num1 + num2;
            num1 = num2;
            num2 = SumNum;
            // define the sorting algorithm
            int DeliverTo = (n % 3) + 1;
            // send the calculated result
            MPI_Send(&SumNum, 1, MPI_INT, DeliverTo, 1, MPI_COMM_WORLD);
        }
        else {
            // receive the element integer
            MPI_Recv(&RecieveNum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
            // print and flush the buffer
            printf("I am process rank %d, and I received the number %d. \n", rank, RecieveNum);
            fflush(stdout);
        }
    }
    printf("Available ranks are: %d \n \n", rank); // second rank roll call
    fflush(stdout);
    /*
    more code that I run...
    */
    MPI_Finalize();
    return 0;
}
Before the first for loop is called, processes 0, 1, 2, 3 all respond to the printf("Available ranks are: %d \n \n", rank); on line 25. However, after executing the first for loop and using a second printf, only process 0 responds. I was expecting all 4 processes 0 - 3 to respond again after the execution of the for loop. To solve this problem, I isolated this section of code and attempted to debug it for several hours with no success. This particular issue proves to be problematic, as I have additional code (not shown here for the sake of being concise) that will access the numbers generated from this sequence.
Finally, I am running the code by building the solution, running the VS terminal as an administrator, and typing mpiexec -n 4 my_file_name.exe. No build errors or compilation mistakes occurred. From what I can see (correct me if I'm wrong), all processes hang after completing the for loop, but I am unsure why or how to fix it.
After searching the website, I did not see anything that answered this question (from my point of view). I am a bit of an MPI (and Stack Overflow) newbie, so any code pointers are also welcome. Thanks
You have process zero compute whom to send to, and then every other process does a receive. That means that all processes that are not the computed receiver will hang.
This scenario where you send to a dynamically computed receiver is not easy to do in MPI. You either need to
send a message to all other processes "no, I have nothing for you", or
send a message to all, and all-but-one process ignores the data, or
use one-sided operations where you MPI_Put the data on the computed receiver.
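For the first two options the loop body could look roughly like the sketch below; NOTHING is a made-up sentinel for this example (a separate tag would work just as well), and the surrounding variables are the ones from your code:
#define NOTHING -1   /* hypothetical sentinel meaning "no number for you this round" */

if (rank == 0) {
    SumNum = num1 + num2;
    num1 = num2;
    num2 = SumNum;
    int DeliverTo = (n % 3) + 1;
    // send something to every other rank, so each of them can post exactly one receive
    for (int dest = 1; dest < size; dest++) {
        int payload = (dest == DeliverTo) ? SumNum : NOTHING;
        MPI_Send(&payload, 1, MPI_INT, dest, 1, MPI_COMM_WORLD);
    }
} else {
    MPI_Recv(&RecieveNum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    if (RecieveNum != NOTHING) {   // all-but-one process ignores the data
        printf("I am process rank %d, and I received the number %d.\n", rank, RecieveNum);
        fflush(stdout);
    }
}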
I have the following code
int main(void) {
...
int numtasks, rank;
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int k = 2;
if (rank == 3) {
do {
...
} while (number_of_not_finalized_processes >= numtasks - k);
}
...
MPI_Finalize();
}
and want to loop the process with rank 3 until k processes have been finalized with MPI_Finalize(). How can I find out how many processes have been finalized and define number_of_not_finalized_processes? Or is there a better alternative way to find out how many processes are still alive or have passed a certain part of the code?
You can solve this with one-sided communication. Create a window where every process before it finalizes:
Acquires a lock on the window on rank 3
Uses MPI_Accumulate to increase the counter
This is a very asynchronous solution. If your processes do something synchronized, say a loop where they dynamically decide to exit, you could instead, at the synchronization point, create a subcommunicator of only the active processes.
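A minimal sketch of that idea, assuming rank 3 is the one that keeps looping; the fin_counter variable, the window layout and the exit condition are choices of this sketch rather than the only way to do it:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int numtasks, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int k = 2;
    int fin_counter = 0;   /* the exposed counter; only rank 3's copy matters */
    MPI_Win win;
    /* window creation is collective; only rank 3 exposes a nonzero-sized region */
    MPI_Win_create(&fin_counter, (MPI_Aint)(rank == 3 ? sizeof(int) : 0),
                   sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank != 3) {
        /* ... the work of this rank ... */
        int one = 1;
        /* just before finalizing: atomically bump the counter on rank 3 */
        MPI_Win_lock(MPI_LOCK_SHARED, 3, 0, win);
        MPI_Accumulate(&one, 1, MPI_INT, 3, 0, 1, MPI_INT, MPI_SUM, win);
        MPI_Win_unlock(3, win);
    } else {
        int finalized = 0;
        do {
            /* ... the work of rank 3 ... */
            /* atomic read of the counter (MPI_NO_OP coexists safely with the accumulates) */
            int dummy = 0;
            MPI_Win_lock(MPI_LOCK_SHARED, 3, 0, win);
            MPI_Fetch_and_op(&dummy, &finalized, MPI_INT, 3, 0, MPI_NO_OP, win);
            MPI_Win_unlock(3, win);
        } while (finalized < k);   /* loop until k processes have announced their exit */
    }

    /* collective: the other ranks wait here until rank 3 leaves its loop */
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}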
I am using a recursive function to find the determinant of a 5x5 matrix. Although this sounds like a trivial problem, it can be solved using OpenMP if the dimensions are huge. I am trying to use MPI to solve it, but I can't understand how to deal with the recursion that is accumulating the results.
So my question is, How do I use MPI for this?
PS: The matrix is a Hilbert matrix so the answer would be 0
I have written the code below, but I think it simply does the same work n times rather than dividing the problem and then accumulating the result.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#define ROWS 5
double getDeterminant(double matrix[5][5], int iDim);
double matrix[ROWS][ROWS] = {
{1.0, -0.5, -0.33, -0.25,-0.2},
{-0.5, 0.33, -0.25, -0.2,-0.167},
{-0.33, -0.25, 0.2, -0.167,-0.1428},
{-0.25,-0.2, -0.167,0.1428,-0.125},
{-0.2, -0.167,-0.1428,-0.125,0.111},
};
int rank, size, tag = 0;
int main(int argc, char** argv)
{
//Status of messages for each individual rows
MPI_Status status[ROWS];
//Message ID or Rank
MPI_Request req[ROWS];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
double result;
result = getDeterminant(matrix, ROWS);
printf("The determinant is %lf\n", result);
//Set barrier to wait for all processess
MPI_Barrier(MPI_COMM_WORLD);
MPI_Finalize();
return 0;
}
double getDeterminant(double matrix[ROWS][ROWS], int iDim)
{
int iCols, iMinorRow, iMinorCol, iTemp, iSign;
double c[5];
double tempMat[5][5];
double dDet;
dDet = 0;
if (iDim == 2)
{
dDet = (matrix[0][0] * matrix[1][1]) - (matrix[0][1] * matrix[1][0]);
return dDet;
}
else
{
for (iCols = 0; iCols < iDim; iCols++)
{
int temp_row = 0, temp_col = 0;
for (iMinorRow = 0; iMinorRow < iDim; iMinorRow++)
{
for (iMinorCol = 0; iMinorCol < iDim; iMinorCol++)
{
if (iMinorRow != 0 && iMinorCol != iCols)
{
tempMat[temp_row][temp_col] = matrix[iMinorRow][iMinorCol];
temp_col++;
if (temp_col >= iDim - 1)
{
temp_row++;
temp_col = 0;
}
}
}
}
//Handling the alternating signs while calculating the determinant
for (iTemp = 0, iSign = 1; iTemp < iCols; iTemp++)
{
iSign = (-1) * iSign;
}
//Evaluating what has been calculated if the resulting matrix is 2x2
c[iCols] = iSign * getDeterminant(tempMat, iDim - 1);
}
for (iCols = 0, dDet = 0.0; iCols < iDim; iCols++)
{
dDet = dDet + (matrix[0][iCols] * c[iCols]);
}
return dDet;
}
}
The expected result should be a very small value close to 0. I am getting that result, but I am not really using MPI.
The provided program will be executed by n processes. mpirun launches the n processes and they all execute the provided code; that is the expected behaviour. Unlike OpenMP, MPI is not a shared-memory programming model but a distributed-memory programming model. It uses message passing to communicate between processes. There are no global variables in MPI: all the data in your program is local to your process. If you need to share data between processes, you have to send it explicitly with MPI_Send, MPI_Bcast, etc. You can use collective operations like MPI_Bcast to send it to all processes, or point-to-point operations like MPI_Send to send it to specific processes.
For your application to behave as expected, you have to tailor it for MPI (unlike OpenMP, where you can use pragmas). All processes have an identifier, or rank. Typically, rank 0 (let's call it your main process) should pass the data to all processes using MPI_Send (or any other method), and the remaining processes should receive it using MPI_Recv (use MPI_Recv to match MPI_Send). After receiving its local data from the main process, each process should perform some computation on it and then send the results back to the main process, which aggregates them. This is a very basic scenario using MPI; you can also use MPI I/O, etc.
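Applied to your determinant code, that pattern could look roughly like the fragment below, replacing the single getDeterminant call in main. It keeps your getDeterminant and matrix as they are; each rank expands along its own subset of the first-row columns and MPI_Reduce sums the partial results on rank 0. The cyclic distribution of the columns over ranks is just one possible choice:
double local_det = 0.0, global_det = 0.0;
/* each rank takes columns rank, rank+size, rank+2*size, ... of the first row */
for (int iCols = rank; iCols < ROWS; iCols += size) {
    double tempMat[5][5];
    int temp_row = 0, temp_col = 0;
    /* build the minor of column iCols, exactly as in getDeterminant */
    for (int iMinorRow = 0; iMinorRow < ROWS; iMinorRow++) {
        for (int iMinorCol = 0; iMinorCol < ROWS; iMinorCol++) {
            if (iMinorRow != 0 && iMinorCol != iCols) {
                tempMat[temp_row][temp_col] = matrix[iMinorRow][iMinorCol];
                temp_col++;
                if (temp_col >= ROWS - 1) {
                    temp_row++;
                    temp_col = 0;
                }
            }
        }
    }
    double sign = (iCols % 2 == 0) ? 1.0 : -1.0;
    local_det += sign * matrix[0][iCols] * getDeterminant(tempMat, ROWS - 1);
}
/* combine the partial sums; only rank 0 holds the full determinant afterwards */
MPI_Reduce(&local_det, &global_det, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("The determinant is %lf\n", global_det);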
MPI does not do anything by itself for synchronization or data sharing. It just launches n instances of the application and provides the required routines. It is the application developer who is in charge of communication (data structures, etc.), synchronization (using MPI_Barrier), and so on among the processes.
Following is a simple send/receive program using MPI. When you run the code below, say with n as 2, two copies of this program will be launched. In the program, each process gets its id using MPI_Comm_rank(); we can use this id for further computations and to control the flow of the code. Below, the process with rank 0 sends the variable number using MPI_Send and the process with rank 1 receives this value using MPI_Recv. The if and else if branches differentiate between the processes and change the control flow to send and receive the data. This is a very basic MPI program that shares data between processes.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    // Find out rank, size
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int number;
    if (world_rank == 0) {
        number = -1;
        MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (world_rank == 1) {
        MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received number %d from process 0\n", number);
    }

    MPI_Finalize();
    return 0;
}
Here is a tutorial on MPI.
I am practicing synchronization through barriers using Open MPI message communication. I have created an array of structs called containers. Each container is linked to its neighbor on the right, and the two elements at the ends are also linked, forming a circle.
In the main() testing client, I run MPI with multiple processes (mpiexec -n 5 ./a.out), and they are supposed to be synchronized by calling the container_barrier() function; however, my code gets stuck at the last process. I am looking for help with the debugging. Please see my code below:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>
typedef struct container {
int labels;
struct container *linked_to_container;
int sense;
} container;
container *allcontainers; /* an array for all containers */
int size_containers_array;
int get_next_container_id(int current_container_index, int max_index)
{
if (max_index - current_container_index >= 1)
{
return current_container_index + 1;
}
else
return 0; /* elements at two ends are linked */
}
container *get_container(int index)
{
return &allcontainers[index];
}
void container_init(int num_containers)
{
allcontainers = (container *) malloc(num_containers * sizeof(container)); /* is this right to malloc memory on the array of container when the struct size is still unknown?*/
size_containers_array = num_containers;
int i;
for (i = 0; i < num_containers; i++)
{
container *current_container = get_container(i);
current_container->labels = 0;
int next_container_id = get_next_container_id(i, num_containers - 1); /* max index in all_containers[] is num_containers-1 */
current_container->linked_to_container = get_container(next_container_id);
current_container->sense = 0;
}
}
void container_barrier()
{
int current_container_id, my_sense = 1;
int tag = current_container_id;
MPI_Request request[size_containers_array];
MPI_Status status[size_containers_array];
MPI_Comm_rank(MPI_COMM_WORLD, &current_container_id);
container *current_container = get_container(current_container_id);
int next_container_id = get_next_container_id(current_container_id, size_containers_array - 1);
/* send asynchronous message to the next container, wait, then do blocking receive */
MPI_Isend(&my_sense, 1, MPI_INT, next_container_id, tag, MPI_COMM_WORLD, &request[current_container_id]);
MPI_Wait(&request[current_container_id], &status[current_container_id]);
MPI_Recv(&my_sense, 1, MPI_INT, next_container_id, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
void free_containers()
{
free(allcontainers);
}
int main(int argc, char **argv)
{
int my_id, num_processes;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &num_processes);
MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
container_init(num_processes);
printf("Hello world from thread %d of %d \n", my_id, num_processes);
container_barrier();
printf("passed barrier \n");
MPI_Finalize();
free_containers();
return 0;
}
The problem is the series of calls:
MPI_Isend()
MPI_Wait()
MPI_Recv()
This is a common source of confusion. When you use a "nonblocking" call in MPI, you are essentially telling the MPI library that you want to do some operation (a send) with some data (my_sense). MPI gives you back an MPI_Request object with the guarantee that the operation will be finished by the time a completion function (such as MPI_Wait) completes that MPI_Request.
The problem you have here is that you're calling MPI_Isend and immediately calling MPI_Wait before ever calling MPI_Recv on any rank. This means that all of those send calls get queued up but never actually have anywhere to go because you've never told MPI where to put the data by calling MPI_Recv (which tells MPI that you want to put the data in my_sense).
The reason this works part of the time is that MPI expects that things might not always sync up perfectly. If you send smaller messages (which you do), MPI reserves some buffer space and will let your send operations complete, with the data stashed in that temporary space for a while until you call MPI_Recv later to tell MPI where to move it. Eventually though, this won't work anymore. The buffers will fill up and you'll need to actually start receiving your messages. For you, this means that you need to switch the order of your operations. Instead of doing a non-blocking send, you should do a non-blocking receive first, then do your blocking send, then wait for your receive to finish:
MPI_Irecv()
MPI_Send()
MPI_Wait()
The other option is to turn both functions into nonblocking functions and use MPI_Waitall instead:
MPI_Isend()
MPI_Irecv()
MPI_Waitall()
This last option is usually the best. The only thing you'll need to be careful about is that you don't overwrite your own data. Right now you're using the same buffer for both the send and receive operations. If both of these are happening at the same time, there are no guarantees about the ordering. Normally this doesn't make a difference: whether you send the message first or receive it doesn't really matter. However, in this case it does. If you receive the data first, you'll end up sending the same data back out again instead of sending the data you had before the receive operation. You can solve this by using a temporary buffer to stage your data and move it to the right place when it's safe.
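As a concrete illustration of that last option, container_barrier() could be rewritten roughly as below, keeping the rest of your code unchanged. Two details in this sketch go beyond the buffer issue: it receives from the previous container in the ring rather than the next one (otherwise the send to the right-hand neighbour has no matching receive), and it uses a fixed tag, since in the original the tag was read from current_container_id before that variable was set:
void container_barrier()
{
    int current_container_id;
    int my_sense = 1;      /* outgoing value */
    int their_sense = 0;   /* separate buffer for the incoming value */
    MPI_Request request[2];

    MPI_Comm_rank(MPI_COMM_WORLD, &current_container_id);
    int next_container_id = get_next_container_id(current_container_id,
                                                  size_containers_array - 1);
    int prev_container_id = (current_container_id + size_containers_array - 1)
                            % size_containers_array;
    int tag = 0;

    /* post both operations, then complete them together */
    MPI_Isend(&my_sense, 1, MPI_INT, next_container_id, tag,
              MPI_COMM_WORLD, &request[0]);
    MPI_Irecv(&their_sense, 1, MPI_INT, prev_container_id, tag,
              MPI_COMM_WORLD, &request[1]);
    MPI_Waitall(2, request, MPI_STATUSES_IGNORE);
}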