Good afternoon, I've developed a 2D FFT in MPI for scientific purposes.
Everything worked until I introduced MPI_Scatterv.
Since then something odd has been happening. As long as I stay below 64 modes I get no problems, but when I go above that I get the message:
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 0 with PID 0 on node MacBook-Pro-di-Mirco
> exited on signal 11 (Segmentation fault: 11).
I can't figure out where the mistake is, but I'm pretty sure it is in MPI_Scatterv.
Could anyone help me please?
/********************************** Setup factors for scattering **********************************/
// Alloc the arrays
int* displs = (int *)malloc(size*sizeof(int));
int* scounts = (int *)malloc(size*sizeof(int));
int* receive = (int *)malloc(size*sizeof(int));
// Setup matrix
int modes_per_proc[size];
for (int i = 0; i < size; i++){
modes_per_proc[i] = 0;
}
// Set modes per processor
cores_handler( nx*nz, size, modes_per_proc);
// Scattering parameters
for (int i=0; i<size; ++i) {
scounts[i] = modes_per_proc[i]*ny*2;
receive[i] = scounts[i];
displs[i] = displs[i-1] + modes_per_proc[i-1] *ny*2; // *2 to handle complex numbers
if (i == 0 ) displs[0] = 0;
}
/************************************************ Data scattering ***********************************************/
MPI_Scatterv(U, scounts, displs, MPI_DOUBLE, u, receive[rank] , MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
The cores_handler function:
void cores_handler( int modes, int size, int modes_per_proc[size]) {
int rank =0;
int check=0;
for (int i = 0; i < modes; i++) {
modes_per_proc[rank] = modes_per_proc[rank]+1;
rank = rank+1;
if (rank == size ) rank = 0;
}
for (int i = 0; i < size; i++){
//printf("%d modes on rank %d\n", modes_per_proc[i], i);
check = check+modes_per_proc[i];
}
if ( (int)(check - modes) != 0 ) {
printf("[ERROR] check - modes = %d!!\nUnable to scatter modes properly\nAbort... \n", check - modes);
}
}
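One thing worth checking in the scattering setup above: on the first pass of the loop (i equal to 0), the expression displs[i-1] + modes_per_proc[i-1] reads one element before the start of both arrays. That is undefined behaviour in C, even though displs[0] is overwritten on the next line. Whether this is what triggers the segmentation fault above 64 modes cannot be confirmed from the code shown, so the following is only a sketch of a safer way to build the arrays, not a diagnosis. The helper name fill_scatter_layout and its parameters are made up for illustration; elems_per_mode stands for ny*2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the original code): fills the count and
 * displacement arrays for MPI_Scatterv from the number of modes per rank.
 * elems_per_mode plays the role of ny*2 (two doubles per complex value).
 * Returns the total number of elements, i.e. the length the send buffer
 * must have on the root. */
int fill_scatter_layout(int size, const int *modes_per_proc,
                        int elems_per_mode, int *scounts, int *displs)
{
    assert(size > 0 && modes_per_proc != NULL);
    assert(scounts != NULL && displs != NULL);

    int offset = 0; /* running prefix sum; never indexes element -1 */
    for (int i = 0; i < size; ++i) {
        scounts[i] = modes_per_proc[i] * elems_per_mode;
        displs[i] = offset;
        offset += scounts[i];
    }
    return offset;
}
```

The return value can be compared against the allocated length of U on the root, and scounts[rank] against the allocated length of u on each rank, before MPI_Scatterv is called.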
I'm trying to write an implementation of a hyper quicksort in MPI, but I'm having an issue where a process gets stuck on MPI_Recv().
While testing with 2 processes, it seems that inside the else of the if (rank % comm_sz == 0), process 1 never receives the pivot from process 0. Process 0 successfully sends its pivot and recurses through the method correctly. I put in some debug print statements and got this output:
(arr, 0, 2, 0, 9)
Rank 0 sending pivot 7 to 1
(arr, 1, 2, 0, 9)
Rank 1 pre-recv from 0
After which, the post-recv message from rank 1 never prints. Rank 0 prints its post-send message and continues through its section of the array. Is there something wrong with my implementation of MPI_Send() or MPI_Recv() that may be causing this?
Here is my code for the quicksort:
(For reference, comm_sz in the parameters for the method refers to the number of processes looking at that section of the array.)
void hyper_quick(int *array, int rank, int comm_sz, int s, int e) {
printf("(arr, %d, %d, %d, %d)\n", rank, comm_sz, s, e);
// Keeps recursing until there is only one element
if (s < e) {
int pivot;
if (comm_sz > 1) {
// One process gets a random pivot within its range and sends that to every process looking at that range
if (rank % comm_sz == 0) {
pivot = rand() % (e - s) + s;
for (int i = rank + 1; i < comm_sz; i++) {
int partner = rank + i;
printf("Rank %d sending pivot %d to %d\n", rank, pivot, partner);
MPI_Send(&pivot, 1, MPI_INT, partner, rank, MPI_COMM_WORLD);
printf("Rank %d successfully sent %d to %d\n", rank, pivot, partner);
}
}
else {
int partner = rank - (rank % comm_sz);
printf("Rank %d pre-recv from %d\n", rank, partner);
MPI_Recv(&pivot, 1, MPI_INT, partner, rank, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("Rank %d received pivot %d from %d\n", rank, pivot, partner);
}
}
else {
pivot = rand() % (e - s) + s;
}
int tmp = array[pivot];
array[pivot] = array[e];
array[e] = tmp;
// Here is where the actual quick sort happens
int i = s;
int j = e - 1;
while (i < j) {
while (array[e] >= array[i] && i < j) {
i++;
}
while (array[e] < array[j] && i < j) {
j--;
}
if (i < j) {
tmp = array[i];
array[i] = array[j];
array[j] = tmp;
}
}
if (array[e] < array[i]) {
tmp = array[i];
array[i] = array[e];
array[e] = tmp;
pivot = i;
}
else {
pivot = e;
}
// Split remaining elements between remaining processes
if (comm_sz > 1) {
// Elements greater than pivot
if (rank % comm_sz >= comm_sz/2) {
hyper_quick(array, rank, comm_sz/2, pivot + 1, e);
}
// Elements lesser than pivot
else {
hyper_quick(array, rank, comm_sz/2, s, pivot - 1);
}
}
// Recurse remaining elements in current process
else {
hyper_quick(array, rank, 1, s, pivot - 1);
hyper_quick(array, rank, 1, pivot + 1, e);
}
}
}
Rank 0 sending pivot 7 to 1
MPI_Send(&pivot, 1, MPI_INT, partner, rank, MPI_COMM_WORLD);
                                      ^^^^
So the sender's tag is its own rank, which is zero.
Rank 1 pre-recv from 0
MPI_Recv(&pivot, 1, MPI_INT, partner, rank, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                                      ^^^^
And the receiver's tag is its own rank, which is one.
If the receiver asks only for messages with a specific tag, it will not receive a message with a different tag. Passing the same tag value on both sides, for example a constant or the sender's rank, makes the two calls match.
Sometimes there are cases when A might have to send many different types of messages to B. Instead of B having to go through extra measures to differentiate all these messages, MPI allows senders and receivers to also specify message IDs with the message (known as tags). When process B only requests a message with a certain tag number, messages with different tags will be buffered by the network until B is ready for them. [MPI Tutorial -- Send and Receive]
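The matching rule can be illustrated without running MPI at all. The following is a toy model, not MPI's API: toy_envelope, TOY_ANY_TAG and toy_matches are invented names that only mimic how a receive's source and tag are compared against an incoming message (TOY_ANY_TAG stands in for MPI_ANY_TAG).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for MPI_ANY_TAG; the real constant lives in <mpi.h>. */
#define TOY_ANY_TAG (-1)

/* Invented for illustration: the part of a message envelope that matters here. */
struct toy_envelope {
    int source; /* rank of the sender */
    int tag;    /* tag passed to the send call */
};

/* Returns true when a receive posted for (want_source, want_tag) would
 * accept the message described by msg. */
bool toy_matches(struct toy_envelope msg, int want_source, int want_tag)
{
    assert(msg.source >= 0); /* ranks are never negative */
    if (msg.source != want_source)
        return false;
    return want_tag == TOY_ANY_TAG || msg.tag == want_tag;
}
```

With the question's values, the message from rank 0 carries tag 0 while rank 1 waits for tag 1, so toy_matches returns false. With the same tag on both sides, or a wildcard on the receiver, it returns true.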
I have a page_rank code where I have to compute the page rank in parallel. My code hangs at the point where I wrote the following MPI_Send and MPI_Recv calls. What could be the problem?
int **sendto_list = (int **)malloc(comm_size*sizeof(int*));
for(i=0; i < comm_size; i++) {
sendto_list[i] = (int *)malloc(18*sizeof(int));
sendto_list[i][0] = 16;
sendto_list[i][1] = 0;
}
int temp_data = 1;
for(i=0; i < comm_size; i++) {
if(request_list[i][1] > 0) {
for(k=0; k < request_list[i][1]; ) {
for(j=0; j < 200; j++) {
if( k >= request_list[i][1] )
break;
sendrecv_buffer_int[j] = request_list[i][k+2];
k++;
}
// Request appropriate process for pagerank.
if(i!= my_rank)
MPI_Send(&temp_data, 1, MPI_INT, i, TAG_PR_REQ, MPI_COMM_WORLD);
}
}
if( i != my_rank )
MPI_Send(&temp_data, 1, MPI_INT, i, TAG_PR_DONE, MPI_COMM_WORLD);
}
int expected_requests = 0, done = 0,temp,s;
s=0;
while( (done == 0) && (comm_size > 1) ) {
if(expected_requests == (comm_size - 1))
break;
int count;
// Receive pagerank requests or messages with TAG_PR_DONE(can be from any process).
MPI_Recv(&temp, 1, MPI_INT, MPI_ANY_SOURCE ,MPI_ANY_TAG, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_INT, &count);
switch(status.MPI_TAG) {
case TAG_PR_REQ:{
for(i = 0 ; i < count; i++)
insert_into_adj_list(&sendto_list[status.MPI_SOURCE], sendrecv_buffer_int[i], num_nodes);
break;
}
case TAG_PR_DONE:{
expected_requests++;
break;
}
default:
break;
}
}
A cursory glance over your code suggests that your MPI_Send() calls are blocking and nothing is receiving them to free them up.
If (request_list[i][1] > 0) and (i != my_rank) both evaluate to true, you perform 2 MPI_Send() operations to process rank i, but you only have 1 matching MPI_Recv() operation in each process, i.e. in process rank i.
You may want to try changing
if(request_list[i][1] > 0) {
...
}
if( i != my_rank )
MPI_Send(&temp_data, 1, MPI_INT, i, TAG_PR_DONE, MPI_COMM_WORLD);
to
if(request_list[i][1] > 0) {
...
} else if( i != my_rank ) {
MPI_Send(&temp_data, 1, MPI_INT, i, TAG_PR_DONE, MPI_COMM_WORLD);
}
Note the addition of else, turning if into else if. This ensures only 1 MPI_Send() operation per process. It does not look like both MPI_Send() operations should be executed when the above conditions are true.
Alternatively, if you need to, you could look into MPI_Isend() and MPI_Irecv(), although I don't think they will solve your problem entirely in this case. I still think you need the else if clause.
I'd also like to point out that in C there is no need to cast the return value of malloc(). This topic has been covered extensively on StackOverflow so I won't dwell on it too long.
You should also check that the result of malloc() is a valid pointer. It returns NULL if an error occurred.
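As a sketch of those last two points, the sendto_list allocation could be wrapped in a helper that neither casts malloc() nor ignores its result. The names alloc_int_table and free_int_table are hypothetical, and what to do on failure (for example calling MPI_Abort) is left to the caller.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocates a table of rows x cols ints, checking every
 * malloc() result and never casting it. On failure it frees whatever was
 * already allocated and returns NULL. */
int **alloc_int_table(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);

    int **table = malloc(rows * sizeof *table);
    if (table == NULL)
        return NULL;

    for (size_t i = 0; i < rows; i++) {
        table[i] = malloc(cols * sizeof *table[i]);
        if (table[i] == NULL) {
            while (i > 0)          /* undo the rows allocated so far */
                free(table[--i]);
            free(table);
            return NULL;
        }
    }
    return table;
}

/* Releases a table obtained from alloc_int_table(); NULL is accepted. */
void free_int_table(int **table, size_t rows)
{
    if (table == NULL)
        return;
    for (size_t i = 0; i < rows; i++)
        free(table[i]);
    free(table);
}
```

For the code in the question this would be called as alloc_int_table(comm_size, 18), followed by a NULL check before the first element of each row is written.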
Just looking over some notes prior to an interview and am struggling to understand how Odd-Even sort works in parallel architectures.
int MPI_OddEven_Sort(int n, double *a, int root, MPI_Comm comm)
{
int rank, size, i, sorted_result;
double *local_a;
// get rank and size of comm
MPI_Comm_rank(comm, &rank); //&rank = address of rank
MPI_Comm_size(comm, &size);
local_a = (double *) calloc(n / size, sizeof(double));
// scatter the array a to local_a
MPI_Scatter(a, n / size, MPI_DOUBLE, local_a, n / size, MPI_DOUBLE,
root, comm);
// sort local_a
merge_sort(n / size, local_a);
//odd-even part
for (i = 0; i < size; i++) {
if ((i + rank) % 2 == 0) { // means i and rank have same nature
if (rank < size - 1) {
MPI_Compare(n / size, local_a, rank, rank + 1, comm);
}
} else if (rank > 0) {
MPI_Compare(n / size, local_a, rank - 1, rank, comm);
}
MPI_Barrier(comm);
// test if array is sorted
MPI_Is_Sorted(n / size, local_a, root, comm, &sorted_result);
// is sorted gives integer 0 or 1, if 0 => array is sorted
if (sorted_result == 0) {
break;
} // check for iterations
}
// gather local_a to a
MPI_Gather(local_a, n / size, MPI_DOUBLE, a, n / size, MPI_DOUBLE,
root, comm);
return MPI_SUCCESS;
}
is some code I wrote for this function (not today nor yesterday!). Can someone please break down how it is working ?
I'm scattering my array a to each processor, which is getting a copy of local_a (which is of size n/size)
Merge sort is being called on each local_a.
What is going on after this? (Assuming I am correct so far!)
It's sort of fun to see these PRAM-type sorting networks popping up again after all these years. The original mental model of parallel computing for these things was massively parallel arrays of tiny processors acting as "comparators", e.g. the Connection Machines, back in the day when networking was cheap compared to CPU/RAM. Of course that ended up looking very different from the supercomputers of the mid to late 80s and on, and even more so from the x86 clusters of the late 90s on; but now they're starting to come back in vogue with GPUs and other accelerators, which actually do look a bit like that future past if you squint.
It looks like what you have above is something more like a Baudet-Stevenson odd-even sort, which was already starting to move in the direction of assuming that the processors would have multiple items stored locally and you could make good use of the processors by sorting those local lists in between communication steps.
Fleshing out your code and simplifying it a bit, we have something like this:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int merge(double *ina, int lena, double *inb, int lenb, double *out) {
int i,j;
int outcount=0;
for (i=0,j=0; i<lena; i++) {
while (j < lenb && inb[j] < ina[i]) { /* check the bound before reading inb[j] */
out[outcount++] = inb[j++];
}
out[outcount++] = ina[i];
}
while (j<lenb)
out[outcount++] = inb[j++];
return 0;
}
int domerge_sort(double *a, int start, int end, double *b) {
if ((end - start) <= 1) return 0;
int mid = (end+start)/2;
domerge_sort(a, start, mid, b);
domerge_sort(a, mid, end, b);
merge(&(a[start]), mid-start, &(a[mid]), end-mid, &(b[start]));
for (int i=start; i<end; i++)
a[i] = b[i];
return 0;
}
int merge_sort(int n, double *a) {
double b[n];
domerge_sort(a, 0, n, b);
return 0;
}
void printstat(int rank, int iter, char *txt, double *la, int n) {
printf("[%d] %s iter %d: <", rank, txt, iter);
for (int j=0; j<n-1; j++)
printf("%6.3lf,",la[j]);
printf("%6.3lf>\n", la[n-1]);
}
void MPI_Pairwise_Exchange(int localn, double *locala, int sendrank, int recvrank,
MPI_Comm comm) {
/*
* the sending rank just sends the data and waits for the results;
* the receiving rank receives it, sorts the combined data, and returns
* the correct half of the data.
*/
int rank;
double remote[localn];
double all[2*localn];
const int mergetag = 1;
const int sortedtag = 2;
MPI_Comm_rank(comm, &rank);
if (rank == sendrank) {
/* use the communicator that was passed in, not MPI_COMM_WORLD */
MPI_Send(locala, localn, MPI_DOUBLE, recvrank, mergetag, comm);
MPI_Recv(locala, localn, MPI_DOUBLE, recvrank, sortedtag, comm, MPI_STATUS_IGNORE);
} else {
MPI_Recv(remote, localn, MPI_DOUBLE, sendrank, mergetag, comm, MPI_STATUS_IGNORE);
merge(locala, localn, remote, localn, all);
int theirstart = 0, mystart = localn;
if (sendrank > rank) {
theirstart = localn;
mystart = 0;
}
MPI_Send(&(all[theirstart]), localn, MPI_DOUBLE, sendrank, sortedtag, comm);
for (int i=mystart; i<mystart+localn; i++)
locala[i-mystart] = all[i];
}
}
int MPI_OddEven_Sort(int n, double *a, int root, MPI_Comm comm)
{
int rank, size, i;
double *local_a;
// get rank and size of comm
MPI_Comm_rank(comm, &rank); //&rank = address of rank
MPI_Comm_size(comm, &size);
local_a = (double *) calloc(n / size, sizeof(double));
// scatter the array a to local_a
MPI_Scatter(a, n / size, MPI_DOUBLE, local_a, n / size, MPI_DOUBLE,
root, comm);
// sort local_a
merge_sort(n / size, local_a);
//odd-even part
for (i = 1; i <= size; i++) {
printstat(rank, i, "before", local_a, n/size);
if ((i + rank) % 2 == 0) { // means i and rank have same nature
if (rank < size - 1) {
MPI_Pairwise_Exchange(n / size, local_a, rank, rank + 1, comm);
}
} else if (rank > 0) {
MPI_Pairwise_Exchange(n / size, local_a, rank - 1, rank, comm);
}
}
printstat(rank, i-1, "after", local_a, n/size);
// gather local_a to a
MPI_Gather(local_a, n / size, MPI_DOUBLE, a, n / size, MPI_DOUBLE,
root, comm);
if (rank == root)
printstat(rank, i, " all done ", a, n);
return MPI_SUCCESS;
}
int main(int argc, char **argv) {
MPI_Init(&argc, &argv);
int n = argc-1;
double a[n];
for (int i=0; i<n; i++)
a[i] = atof(argv[i+1]);
MPI_OddEven_Sort(n, a, 0, MPI_COMM_WORLD);
MPI_Finalize();
return 0;
}
So the way this works is that the list is evenly split up between processors (non-equal distributions are easily handled too, but it's a lot of extra bookkeeping which doesn't add much to this discussion).
We first sort our local lists (which is O((n/P) ln(n/P))). There's no reason it has to be a merge sort, of course, except that here we can re-use that merge code in the following steps. Then we do P neighbour exchange steps, half in each direction. The model here was that there was a linear network where we could communicate directly and quickly with immediate neighbours, and perhaps not at all with neighbours further away.
The original odd-even sorting network is the case where each processor has one key, in which case the communication is easy - you compare your item with your neighbour's, and swap if necessary (so that this is basically a parallel bubble sort). In this case, we do a simple parallel sort between pairs of processes - here, each pair just sends all data to one of the pair, that process merges the already locally sorted lists (O(n/P)), and then gives the appropriate half of the data back to the other processor. I took out your check-if-done; it can be shown that it's completed in P neighbour exchanges. You can certainly add it back in just in case of early termination; however, all the processors have to agree when everything's done, which requires something like an all reduce, which breaks the original model somewhat.
So we have O(n) data transfer per link (sending and receiving n/P items P times each), and each processor does (n/P) ln(n/P) + (2n/P - 1)*P/2 = O((n/P) ln(n/P) + n) comparisons; in this case there's a scatter and a gather to be considered as well, but in general this sort is done with data in place.
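The exchange pattern described above can also be tried in a single process, which makes it easy to convince yourself that P phases are enough. This is only a serial sketch: simulate_block_odd_even and merge_split are names invented here, block r of the array stands in for the local list of rank r, and qsort replaces the merge sort for the local step.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles */
static int cmp_double(const void *x, const void *y)
{
    double a = *(const double *)x, b = *(const double *)y;
    return (a > b) - (a < b);
}

/* Merge two sorted blocks of length len, then leave the smaller half in lo
 * and the larger half in hi (the net effect of one pairwise exchange). */
static void merge_split(double *lo, double *hi, int len)
{
    double *all = malloc(2 * (size_t)len * sizeof *all);
    assert(all != NULL);
    int i = 0, j = 0, k = 0;
    while (i < len && j < len)
        all[k++] = (hi[j] < lo[i]) ? hi[j++] : lo[i++];
    while (i < len) all[k++] = lo[i++];
    while (j < len) all[k++] = hi[j++];
    memcpy(lo, all, (size_t)len * sizeof *all);
    memcpy(hi, all + len, (size_t)len * sizeof *all);
    free(all);
}

/* Serial stand-in for the MPI routine: a holds nblocks blocks of len
 * elements each; block r plays the role of rank r's local list. */
void simulate_block_odd_even(double *a, int nblocks, int len)
{
    assert(nblocks > 0 && len > 0);

    /* local sort on every "rank" */
    for (int r = 0; r < nblocks; r++)
        qsort(a + (size_t)r * len, (size_t)len, sizeof *a, cmp_double);

    /* nblocks exchange phases, alternating which neighbours pair up */
    for (int phase = 1; phase <= nblocks; phase++)
        for (int r = 0; r + 1 < nblocks; r++)
            if ((phase + r) % 2 == 0) /* same pairing rule as the MPI loop */
                merge_split(a + (size_t)r * len,
                            a + (size_t)(r + 1) * len, len);
}
```

Called with the 12 values 43 54 63 28 79 81 32 47 84 17 25 49 as 4 blocks of 3, the array ends up fully sorted after the 4 phases.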
Running the above with, for clarity, the same example as in the linked document gives (with output re-ordered to make it easier to read):
$ mpirun -np 4 ./baudet-stevenson 43 54 63 28 79 81 32 47 84 17 25 49
[0] before iter 1: <43.000,54.000,63.000>
[1] before iter 1: <28.000,79.000,81.000>
[2] before iter 1: <32.000,47.000,84.000>
[3] before iter 1: <17.000,25.000,49.000>
[0] before iter 2: <43.000,54.000,63.000>
[1] before iter 2: <28.000,32.000,47.000>
[2] before iter 2: <79.000,81.000,84.000>
[3] before iter 2: <17.000,25.000,49.000>
[0] before iter 3: <28.000,32.000,43.000>
[1] before iter 3: <47.000,54.000,63.000>
[2] before iter 3: <17.000,25.000,49.000>
[3] before iter 3: <79.000,81.000,84.000>
[0] before iter 4: <28.000,32.000,43.000>
[1] before iter 4: <17.000,25.000,47.000>
[2] before iter 4: <49.000,54.000,63.000>
[3] before iter 4: <79.000,81.000,84.000>
[0] after iter 4: <17.000,25.000,28.000>
[1] after iter 4: <32.000,43.000,47.000>
[2] after iter 4: <49.000,54.000,63.000>
[3] after iter 4: <79.000,81.000,84.000>
[0] all done iter 5: <17.000,25.000,28.000,32.000,43.000,47.000,49.000,54.000,63.000,79.000,81.000,84.000>
I have a 2D array which is distributed across a MPI process grid (3 x 2 processes in this example). The values of the array are generated within the process which that chunk of the array is distributed to, and I want to gather all of those chunks together at the root process to display them.
So far, I have the code below. This generates a cartesian communicator, finds out the co-ordinates of the MPI process and works out how much of the array it should get based on that (as the array size need not be a multiple of the cartesian grid size). I then create a new MPI derived datatype which will send the whole of that process's subarray as one item (that is, the stride, blocklength and count are different for each process, as each process has a different sized array). However, when I come to gather the data together with MPI_Gather, I get a segmentation fault.
I think this is because I shouldn't be using the same datatype for sending and receiving in the MPI_Gather call. The data type is fine for sending the data, as it has the right count, stride and blocklength, but when it gets to the other end it'll need a very different derived datatype. I'm not sure how to calculate the parameters for this datatype - does anyone have any ideas?
Also, if I'm approaching this from completely the wrong angle then please let me know!
#include<stdio.h>
#include<array_alloc.h>
#include<math.h>
#include<mpi.h>
int main(int argc, char ** argv)
{
int size, rank;
int dim_size[2];
int periods[2];
int A = 2;
int B = 3;
MPI_Comm cart_comm;
MPI_Datatype block_type;
int coords[2];
float **array;
float **whole_array;
int n = 10;
int rows_per_core;
int cols_per_core;
int i, j;
int x_start, x_finish;
int y_start, y_finish;
/* Initialise MPI */
MPI_Init(&argc, &argv);
/* Get the rank for this process, and the number of processes */
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
{
/* If we're the master process */
whole_array = alloc_2d_float(n, n);
/* Initialise whole array to silly values */
for (i = 0; i < n; i++)
{
for (j = 0; j < n; j++)
{
whole_array[i][j] = 9999.99;
}
}
for (j = 0; j < n; j ++)
{
for (i = 0; i < n; i++)
{
printf("%f ", whole_array[j][i]);
}
printf("\n");
}
}
/* Create the cartesian communicator */
dim_size[0] = B;
dim_size[1] = A;
periods[0] = 1;
periods[1] = 1;
MPI_Cart_create(MPI_COMM_WORLD, 2, dim_size, periods, 1, &cart_comm);
/* Get our co-ordinates within that communicator */
MPI_Cart_coords(cart_comm, rank, 2, coords);
rows_per_core = ceil(n / (float) A);
cols_per_core = ceil(n / (float) B);
if (coords[0] == (B - 1))
{
/* We're at the far end of a row */
cols_per_core = n - (cols_per_core * (B - 1));
}
if (coords[1] == (A - 1))
{
/* We're at the bottom of a col */
rows_per_core = n - (rows_per_core * (A - 1));
}
printf("X: %d, Y: %d, RpC: %d, CpC: %d\n", coords[0], coords[1], rows_per_core, cols_per_core);
MPI_Type_vector(rows_per_core, cols_per_core, cols_per_core + 1, MPI_FLOAT, &block_type);
MPI_Type_commit(&block_type);
array = alloc_2d_float(rows_per_core, cols_per_core);
if (array == NULL)
{
printf("Problem with array allocation.\nExiting\n");
return 1;
}
for (j = 0; j < rows_per_core; j++)
{
for (i = 0; i < cols_per_core; i++)
{
array[j][i] = (float) (i + 1);
}
}
MPI_Barrier(MPI_COMM_WORLD);
MPI_Gather(array, 1, block_type, whole_array, 1, block_type, 0, MPI_COMM_WORLD);
/*
if (rank == 0)
{
for (j = 0; j < n; j ++)
{
for (i = 0; i < n; i++)
{
printf("%f ", whole_array[j][i]);
}
printf("\n");
}
}
*/
/* Close down the MPI environment */
MPI_Finalize();
}
The 2D array allocation routine I have used above is implemented as:
float **alloc_2d_float( int ndim1, int ndim2 ) {
float **array2 = malloc( ndim1 * sizeof( float * ) );
int i;
if( array2 != NULL ){
array2[0] = malloc( ndim1 * ndim2 * sizeof( float ) );
if( array2[ 0 ] != NULL ) {
for( i = 1; i < ndim1; i++ )
array2[i] = array2[0] + i * ndim2;
}
else {
free( array2 );
array2 = NULL;
}
}
return array2;
}
This is a tricky one. You're on the right track, and yes, you will need different types for sending and receiving.
The sending part is easy -- if you're sending the whole subarray array, then you don't even need the vector type; you can send the entire (rows_per_core)*(cols_per_core) contiguous floats starting at &(array[0][0]) (or array[0], if you prefer).
It's the receiving that's the tricky part, as you've gathered. Let's start with the simplest case -- assuming that everything divides evenly so all the blocks have the same size. Then you can use the very helpful MPI_Type_create_subarray (you could always cobble this together with vector types, but for higher-dimensional arrays this becomes tedious, as you need to create 1 intermediate type for each dimension of the array except the last).
Also, rather than hardcoding the decomposition, you can use the also-helpful MPI_Dims_create to create an as-square-as-possible decomposition of your ranks. Note that this doesn't necessarily have anything to do with MPI_Cart_create, although you can use it for the requested dimensions. I'm going to skip the cart_create stuff here, not because it's not useful, but because I want to focus on the gather stuff.
So if everyone has the same size of array, then root is receiving the same data type from everyone, and one can use a very simple subarray type to get their data:
MPI_Type_create_subarray(2, whole_array_size, sub_array_size, starts,
MPI_ORDER_C, MPI_FLOAT, &block_type);
MPI_Type_commit(&block_type);
where sub_array_size[] = {rows_per_core, cols_per_core}, whole_array_size[] = {n,n}, and for here, starts[]={0,0} - i.e., we'll just assume that every block starts at the start of the array.
The reason for this is that we can then use Gatherv to explicitly set the displacements into the array:
for (int i=0; i<size; i++) {
counts[i] = 1; /* one block_type per rank */
int row = (i % A);
int col = (i / A);
/* displacement into the whole_array */
disps[i] = (col*cols_per_core + row*(rows_per_core)*n);
}
MPI_Gatherv(array[0], rows_per_core*cols_per_core, MPI_FLOAT,
recvptr, counts, disps, resized_type, 0, MPI_COMM_WORLD);
So now everyone sends their data in one chunk, and it's received through that type into the right part of the array. For this to work, I've resized the type so that its extent is just one float, so the displacements can be calculated in that unit:
MPI_Type_create_resized(block_type, 0, 1*sizeof(float), &resized_type);
MPI_Type_commit(&resized_type);
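To see what the displacement loop above produces, the same arithmetic can be pulled out into a small function and run without MPI. The name block_displacements is made up for this sketch; it assumes, as the loop does, that A and B divide n evenly and that rank i holds block-row i % A and block-column i / A.

```c
#include <assert.h>

/* Hypothetical helper mirroring the loop above: for an n x n array of floats
 * split into equal blocks over an A x B grid of ranks, fill disps[] with the
 * offset (counted in floats) of each rank's block inside the row-major whole
 * array. disps must have room for A*B entries. */
void block_displacements(int n, int A, int B, int *disps)
{
    assert(A > 0 && B > 0 && n % A == 0 && n % B == 0);
    int rows_per_core = n / A;
    int cols_per_core = n / B;

    for (int i = 0; i < A * B; i++) {
        int row = i % A; /* block-row held by rank i */
        int col = i / A; /* block-column held by rank i */
        disps[i] = col * cols_per_core + row * rows_per_core * n;
    }
}
```

With n = 12, A = 2 and B = 3 the offsets come out as 0, 72, 4, 76, 8, 80: consecutive ranks walk down a block-column before moving on to the next one. These are offsets in floats, which is why the receive type needs an extent of one float.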
The whole code is below:
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include<mpi.h>
float **alloc_2d_float( int ndim1, int ndim2 ) {
float **array2 = malloc( ndim1 * sizeof( float * ) );
int i;
if( array2 != NULL ){
array2[0] = malloc( ndim1 * ndim2 * sizeof( float ) );
if( array2[ 0 ] != NULL ) {
for( i = 1; i < ndim1; i++ )
array2[i] = array2[0] + i * ndim2;
}
else {
free( array2 );
array2 = NULL;
}
}
return array2;
}
void free_2d_float( float **array ) {
if (array != NULL) {
free(array[0]);
free(array);
}
return;
}
void init_array2d(float **array, int ndim1, int ndim2, float data) {
for (int i=0; i<ndim1; i++)
for (int j=0; j<ndim2; j++)
array[i][j] = data;
return;
}
void print_array2d(float **array, int ndim1, int ndim2) {
for (int i=0; i<ndim1; i++) {
for (int j=0; j<ndim2; j++) {
printf("%6.2f ", array[i][j]);
}
printf("\n");
}
return;
}
int main(int argc, char ** argv)
{
int size, rank;
int dim_size[2];
int periods[2];
MPI_Datatype block_type, resized_type;
float **array;
float **whole_array;
float *recvptr;
int *counts, *disps;
int n = 10;
int rows_per_core;
int cols_per_core;
int i, j;
int whole_array_size[2];
int sub_array_size[2];
int starts[2];
int A, B;
/* Initialise MPI */
MPI_Init(&argc, &argv);
/* Get the rank for this process, and the number of processes */
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
{
/* If we're the master process */
whole_array = alloc_2d_float(n, n);
recvptr = &(whole_array[0][0]);
/* Initialise whole array to silly values */
for (i = 0; i < n; i++)
{
for (j = 0; j < n; j++)
{
whole_array[i][j] = 9999.99;
}
}
print_array2d(whole_array, n, n);
puts("\n\n");
}
/* Create the cartesian communicator */
MPI_Dims_create(size, 2, dim_size);
A = dim_size[1];
B = dim_size[0];
periods[0] = 1;
periods[1] = 1;
rows_per_core = ceil(n / (float) A);
cols_per_core = ceil(n / (float) B);
if (rows_per_core*A != n) {
if (rank == 0) fprintf(stderr,"Aborting: rows %d don't divide by %d evenly\n", n, A);
MPI_Abort(MPI_COMM_WORLD,1);
}
if (cols_per_core*B != n) {
if (rank == 0) fprintf(stderr,"Aborting: cols %d don't divide by %d evenly\n", n, B);
MPI_Abort(MPI_COMM_WORLD,2);
}
array = alloc_2d_float(rows_per_core, cols_per_core);
printf("%d, RpC: %d, CpC: %d\n", rank, rows_per_core, cols_per_core);
whole_array_size[0] = n;
sub_array_size [0] = rows_per_core;
whole_array_size[1] = n;
sub_array_size [1] = cols_per_core;
starts[0] = 0; starts[1] = 0;
MPI_Type_create_subarray(2, whole_array_size, sub_array_size, starts,
MPI_ORDER_C, MPI_FLOAT, &block_type);
MPI_Type_commit(&block_type);
MPI_Type_create_resized(block_type, 0, 1*sizeof(float), &resized_type);
MPI_Type_commit(&resized_type);
if (array == NULL)
{
printf("Problem with array allocation.\nExiting\n");
MPI_Abort(MPI_COMM_WORLD,3);
}
init_array2d(array,rows_per_core,cols_per_core,(float)rank);
counts = (int *)malloc(size * sizeof(int));
disps = (int *)malloc(size * sizeof(int));
/* note -- we're just using MPI_COMM_WORLD rank here to
* determine location, not the cart_comm for now... */
for (int i=0; i<size; i++) {
counts[i] = 1; /* one block_type per rank */
int row = (i % A);
int col = (i / A);
/* displacement into the whole_array */
disps[i] = (col*cols_per_core + row*(rows_per_core)*n);
}
MPI_Gatherv(array[0], rows_per_core*cols_per_core, MPI_FLOAT,
recvptr, counts, disps, resized_type, 0, MPI_COMM_WORLD);
free_2d_float(array);
if (rank == 0) print_array2d(whole_array, n, n);
if (rank == 0) free_2d_float(whole_array);
MPI_Finalize();
}
Minor thing -- you don't need the barrier before the gather. In fact, you hardly ever really need a barrier, and they're expensive operations for a few reasons, and can hide problems -- my rule of thumb is to never, ever, use barriers unless you know exactly why the rule needs to be broken in this case. In this case in particular, the collective gather routine already does all the synchronization the operation needs, so just use that.
Now, moving onto the harder stuff. If things don't divide evenly, you have a few options. The simplest, though not necessarily the best, is just to pad the array so that it does divide evenly, even if just for this operation.
If you can arrange it so that the number of columns does divide evenly, even if the number of rows doesn't, then you can still use the gatherv: create a vector type for each part of the row, and gatherv the appropriate number of rows from each processor. That would work fine.
If you definitely have the case where neither can be counted on to divide, and you can't pad data for sending, then there are three sub-options I can see:
1. As susterpatt suggests, do point-to-point. For small numbers of tasks, this is fine, but as it gets larger, this will be significantly less efficient than the collective operations.
2. Create a communicator consisting of all the processors not on the outer edges, and use exactly the code above to gather their data; and then point-to-point the edge tasks' data.
3. Don't gather to process 0 at all; use the Distributed array type to describe the layout of the array, and use MPI-IO to write all the data to a file; once that's done, you can have process zero display the data in some way if you like.
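Whichever of these options is used, each rank's block length and starting index along a dimension have to be worked out first, and padding needs the rounded-up size. Here is a sketch of one common way to do that. Note that it is a balanced split where block sizes differ by at most one, not the ceil-based split used in the question, and the names block_len, block_start and padded_len are invented for illustration.

```c
#include <assert.h>

/* Hypothetical helper: length of block idx when n items are split over
 * parts blocks; the first n % parts blocks get one extra item. */
int block_len(int n, int parts, int idx)
{
    assert(parts > 0 && idx >= 0 && idx < parts && n >= 0);
    return n / parts + (idx < n % parts ? 1 : 0);
}

/* Hypothetical helper: index of the first item of block idx. */
int block_start(int n, int parts, int idx)
{
    assert(parts > 0 && idx >= 0 && idx < parts && n >= 0);
    int extra = n % parts;
    return idx * (n / parts) + (idx < extra ? idx : extra);
}

/* Smallest multiple of parts that is >= n: the dimension to allocate if
 * the array is padded so that it divides evenly. */
int padded_len(int n, int parts)
{
    assert(parts > 0 && n >= 0);
    return (n + parts - 1) / parts * parts;
}
```

For n = 10 split three ways this gives lengths 4, 3, 3 starting at 0, 4, 7, and a padded size of 12. Applied once per dimension, the same functions give the rows and columns of every block in the process grid.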
It looks like the first argument to your MPI_Gather call should probably be array[0], and not array.
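The reason is that, with the allocation routine from the question, array points at a table of row pointers while array[0] points at the floats themselves. A small check makes this visible; alloc_2d is a shortened copy of the question's allocator and is_row_major_at is a made-up helper for this sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Same layout as the question's alloc_2d_float, shortened: one block of row
 * pointers plus one contiguous block holding all the floats. */
float **alloc_2d(int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    float **a = malloc((size_t)rows * sizeof *a);
    if (a == NULL)
        return NULL;
    a[0] = malloc((size_t)rows * cols * sizeof **a);
    if (a[0] == NULL) {
        free(a);
        return NULL;
    }
    for (int i = 1; i < rows; i++)
        a[i] = a[0] + (size_t)i * cols;
    return a;
}

/* Returns 1 if element (i, j) lives at offset i*cols + j from a[0], i.e.
 * the data can be treated as one contiguous buffer starting at a[0]. */
int is_row_major_at(float **a, int cols, int i, int j)
{
    return &a[i][j] == a[0] + (size_t)i * cols + j;
}
```

Passing array to MPI would make it read or write the row-pointer table, whereas array[0] (equivalently &array[0][0]) is the start of the contiguous data.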
Also, if you need to get different amounts of data from each rank, you might be better off using MPI_Gatherv.
Finally, note that gathering all your data in one place to do output is not scalable in many circumstances. As the amount of data grows, it will eventually exceed the memory available to rank 0. You might be much better off distributing the output work (if you are writing to a file, using MPI IO or other library calls) or doing point-to-point sends to rank 0 one at a time, to limit the total memory consumption.
On the other hand, I would not recommend coordinating each of your ranks printing to standard output, one after another, because some major MPI implementations don't guarantee that standard output will be produced in order. Cray's MPI, in particular, jumbles up standard output pretty thoroughly if multiple ranks print.
According to this (emphasis mine):
The type-matching conditions for the collective operations are more strict than the corresponding conditions between sender and receiver in point-to-point. Namely, for collective operations, the amount of data sent must exactly match the amount of data specified by the receiver. Distinct type maps between sender and receiver are still allowed.
Sounds to me like you have two options:
1. Pad smaller submatrices so that all processes send the same amount of data, then crop the matrix back to its original size after the Gather. If you're feeling adventurous, you might try defining the receiving typemap so that paddings are automatically overwritten during the Gather operation, thus eliminating the need for the crop afterwards. This could get a bit complicated though.
2. Fall back to point-to-point communication. Much more straightforward, but possibly higher communication costs.
Personally, I'd go with option 2.