I'm attempting to combine the row/column 2-D arrays output by each process into a single complete 2-D array on all processes. Essentially, I have a large NxN 2-D array (4000x4000 or larger) that requires the same operation to be carried out on all elements. My intention is to break this down into sections of either rows or columns, each handled by one process. Once all sections have been completed, I need each process to have the entire array.
I have looked at multiple examples of MPI collectives such as Allgatherv but could not find one that combined rows/columns from N processes. Could someone please tell me how I can implement this?
Below is a boilerplate example of what I'm trying to achieve. Each process creates a master matrix and then calculates the set of rows it's responsible for. It then creates a 2-D array of that size and copies the data from the master. It then carries out its operation on each element of the 2-D array. I then need to use Allgatherv to collect each process's 2-D array and combine them to overwrite the master matrix. Please note that from my point of view it is not important whether I use rows or columns; I believed that sticking to one, and not trying to create multiple submatrices, would let me easily increase the number of processes without added complexity.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
double **alloc_2d_array(int m, int n) {
double **x;
int i;
x = (double **)malloc(m*sizeof(double *));
x[0] = (double *)calloc(m*n,sizeof(double));
for ( i = 1; i < m; i++ )
x[i] = &x[0][i*n];
return x;
}
int main(int argc, char *argv[]) {
int n = 8;
int rank, size;
int root_rank = 0;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
// Report active to console
printf("Rank: %d, reporting!\n", rank);
// Make master matrix
double ** master_matrix = alloc_2d_array(n, n);
// Set starting values in master matrix
for (int i=0; i<n; i++) {
for (int j=0; j<n; j++) {
master_matrix[i][j] = i*n+j;
}
}
// Calculate each rank's section of the matrix
int interval = n/size;
int section_end = interval*(rank+1);
int section_start = section_end - interval;
if (rank == size-1) {
section_end += n % size;
}
int section_length = section_end-section_start;
printf("Start: %d, End: %d\n", section_start, section_end);
// Make local rows
double ** local_sect = alloc_2d_array(section_length, n);
// Copy this rank's rows out of master_matrix (copy values, not row pointers)
for (int i=0; i<section_length; i++) {
for (int j=0; j<n; j++) {
local_sect[i][j] = master_matrix[section_start+i][j];
}
}
// Carry out operation (in this example, adding 7)
for (int i=0; i<section_length; i++) {
for (int j=0; j<n; j++) {
local_sect[i][j] = local_sect[i][j]+7;
}
}
// Use Allgatherv to overwrite master matrix to new complete matrix
// MPI_Allgatherv(my_values, my_values_count, MPI_INT, buffer, counts, displacements, MPI_INT, MPI_COMM_WORLD);
// Print new master matrix out on all processes
printf("NEW MASTER MATRIX\n");
for (int i=0; i<n; i++) {
for (int j=0; j<n; j++) {
printf("%f ", master_matrix[i][j]);
}
printf("\n");
}
MPI_Finalize();
}
I suggest reading more in depth about MPI_Type_create_subarray, which allows you to create subarrays from multidimensional arrays; you can then gather them together with MPI_Gatherv so that they line up the way you want.
You might want to check this example that I've encountered in the past:
How to combine subarrays of different widths using only one array for send and receive in MPI
So, I have this problem where I have an array and I must count how many of its numbers are greater than the number at index k. I implemented a master-worker strategy in which the master takes care of the I/O and splits the work among the workers. In the master thread I created the array in a matrix-like shape, so I could pass the sub-arrays easily to the workers (I know this sounds weird). Also in the master thread, I read all the values from the input into my sub-arrays and set comp (the comparison value) to the value at index k.
Then I pass the work portion size, the comparison value and the work data to all the threads (including the master, which gets its own share of the work). Finally, every worker does its job and reports its result to the master, which adds the values received from the workers to its own and then prints the total on the screen.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <math.h>
int main(int argc, char *args[]){
int rank, psize;
MPI_Status status;
MPI_Init(&argc, &args);
MPI_Comm_size(MPI_COMM_WORLD, &psize);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int *workvet, worksize, comp;
if(rank == 0){
int tam, k;
int **subvets, portion;
scanf("%d", &tam);
scanf("%d", &k);
portion = ceil((float)tam/(float)psize);
subvets = malloc(sizeof(int) * psize);
for(int i = 0; i < psize; i++)
subvets[i] = calloc(portion, sizeof(int));
for(int i = 0; i < psize; i++){
for(int j = 0; j < portion; j++){
if((i*j+j) < tam)
scanf("%d ", &subvets[i][j]);
if((i*j+j) == k)
comp = subvets[i][j];
}
}
for(int i = 1; i < psize; i++){
MPI_Send(&portion, 1, MPI_INT, i, i, MPI_COMM_WORLD);
MPI_Send(&comp, 1, MPI_INT, i, i, MPI_COMM_WORLD);
MPI_Send(subvets[i], portion, MPI_INT, i, i, MPI_COMM_WORLD);
}
workvet = calloc(portion, sizeof(int));
workvet = subvets[0];
worksize = portion;
} else {
MPI_Recv(&worksize, 1, MPI_INT, 0, rank, MPI_COMM_WORLD, &status);
MPI_Recv(&comp, 1, MPI_INT, 0, rank, MPI_COMM_WORLD, &status);
workvet = calloc(worksize, sizeof(int));
MPI_Recv(workvet, worksize, MPI_INT, 0, rank, MPI_COMM_WORLD, &status);
}
int maior = 0;
for(int i = 0; i < worksize; i++){
if(workvet[i] > comp)
maior++;
}
if(rank == 0){
int temp;
for(int i = 1; i < psize; i++){
MPI_Recv(&temp, 1, MPI_INT, i, rank, MPI_COMM_WORLD, &status);
maior += temp;
}
printf("%d números maiores que %d", maior, comp);
} else {
MPI_Send(&maior, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
}
MPI_Finalize();
}
My problem is that it looks like it's stuck in a loop. When trying to debug, I put a printf in the main for loop that does the comparison in the sub-arrays, and it printed endlessly; however, when I put the same printf anywhere else in the code, nothing is printed. I don't know where I'm going wrong, or how I can debug my code.
Input data:
10 // size
7 // k
1 2 3 4 5 6 7 8 9 10 // elements
So, my program should count how many elements are greater than the element of index 7, which corresponds to the value 8, and this should return 2 in this case.
This is prefaced by my top comment re. tam being uninitialized.
There are a number of additional issues ...
You're doing scanf to get a value for comp, but in the loops below it, you're assigning a new value to it (i.e. the prompted value is being trashed). That may be perfectly fine if the original value is treated as a default [if the loop fails to assign a new value], but it seems a bit rickety to me.
AFAICT, you are trying to loop on workvet in all processes. But, for the client ones, this does nothing because you don't send back the result [see below].
The clients are sending back maior but they never compute a value for it. And, main does not receive that value. It computes one of its own.
maior has no definition in your posted code. And, therefore, it is uninitialized [even in main].
It looks like you want the clients to send back a single scalar value of their computed value of maior, but they do no calculation for it.
Thus, the clients send back a garbage maior value that the main process tries to sum.
You're sending portion to the clients, but they receive it as worksize. And, after main sends it, it assigns portion to worksize. I'd recommend using the same name in all places to reduce some confusion.
You've not provided any sample data so it's hard to debug this further here. Part of the problem is that only some of the values in subvets are initialized with the scanf in main, based on the if [or so it appears ...].
So, the clients will loop over possibly uninitialized values in the given subvets array [sent to the client which receives it as workvet].
If the setup loops for subvets are correct as far as which values to send (that is, only certain selected values should be sent), I'm not sure you can do what you want with the 2D array method you have.
Without a problem statement describing the input data and what you want to do with it, it's difficult to divine what would be the correct code, but ...
A few guesses ...
You're calculating highest in all processes [probably useless in main], but then nobody does anything with it. My guess is that you want to calculate this in the client processes only. And, send this back to main as maior.
Then, main can sum the maior values from all the clients?
UPDATE:
I actually changed maior to highest to post the issue here, so it would make a bit of sense (maior is greater in portuguese) but failed to do so for all instances
As I mentioned, I guessed as much -- no worries. Side note: In fact, your English is quite good. And, it was nice of you to translate the code. Some others post in English, but leave the code in their native language. This can slow things down a bit. Sometimes, I've put the code into Google translate just to try to make sense of it.
I just updated the code without the translation to reflect what I'm working on. So, for the subvets part I actually thought of this being a matrix, where I would send each of its lines as being one array to each of the worker threads, and the if statement is there to only read up until the size of the array has been reached, thus, leaving the rest of the values as 0 (because I used calloc, thus making this approach fit to the problem I have to solve)
There's really no need for a 2D array. Just fill a 1D array, and then give each worker different offsets and counts into that single array [see below].
Trying to do everything in a single function main is probably what caused some of the problems with separating main and worker tasks.
Splitting things up into [more] functions can make things easier. We can use the same variable names in master and worker for the same data without any naming conflicts.
Also, a good maxim ... Don't replicate code
The various MPI_* calls take a lot of parameters because they're general purpose. Isolating them to wrapper functions can make things simpler and debugging easier.
Note that the second argument to MPI_Send/MPI_Recv is a count and not number of bytes (hence, not sizeof) (i.e. a bug). By putting them in wrapper functions, the call could be fixed once in a single place.
I did make a slight change to the split logic. In your code [AFAICT] you were having the main/master process do some of the calculation. That's fine but I prefer to have the main process available as a control process and not encumbered by much data calculation. So, in my version, only the worker processes actually process the array.
Sometimes it helps to isolate the calculation algorithm/logic from the MPI code. I did this below by putting it in a function docalc. This allowed the adding of a diagnostic cross check at the end.
Anyway, below is the code. It's been heavily refactored and has many comments:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <math.h>
// _dbgprt -- debug print
#define _dbgprt(_fmt...) \
do { \
printf("%d: ",myrank); \
printf(_fmt); \
} while (0)
#ifdef DEBUG
#define dbgprt(_fmt...) \
_dbgprt(_fmt)
#else
#define dbgprt(_fmt...) \
do { \
} while (0)
#endif
int myrank; // current rank
int numproc; // number of processes in comm group
// dataload -- read in the data
int *
dataload(FILE *xfsrc,int worksize)
{
int *workvet;
// get enough space
workvet = calloc(worksize,sizeof(int));
// fill the array
for (int idx = 0; idx < worksize; ++idx)
fscanf(xfsrc,"%d",&workvet[idx]);
return workvet;
}
// docalc -- count number of values greater than limit
int
docalc(int *workvet,int worksize,int k)
{
int count = 0;
for (int idx = 0; idx < worksize; ++idx) {
if (workvet[idx] > k)
count += 1;
}
return count;
}
// sendint -- send some data
void
sendint(int rankto,int *data,int count)
{
int tag = 0;
// NOTE: second argument is an array _count_ and _not_ the number of bytes
MPI_Send(data,count,MPI_INT,rankto,tag,MPI_COMM_WORLD);
}
// recvint -- receive some data
void
recvint(int rankfrom,int *data,int count)
{
int tag = 0;
MPI_Status status;
MPI_Recv(data,count,MPI_INT,rankfrom,tag,MPI_COMM_WORLD,&status);
}
// worker -- perform all worker operations
void
worker(void)
{
int master = 0;
// get array count
int worksize;
recvint(master,&worksize,1);
// get limit value
int k;
recvint(master,&k,1);
// allocate space for data
int *workvet = calloc(worksize,sizeof(int));
// get that data
recvint(master,workvet,worksize);
// calculate number of elements higher than limit
int count = docalc(workvet,worksize,k);
// send back result
sendint(master,&count,1);
}
// master -- perform all master operations
void
master(int argc,char **argv)
{
int isfile;
FILE *xfsrc;
int workrank;
// get the data either from stdin or from a file passed on the command line
do {
isfile = 0;
xfsrc = stdin;
if (argc <= 0)
break;
xfsrc = fopen(*argv,"r");
if (xfsrc == NULL) {
perror(*argv);
exit(1);
}
isfile = 1;
} while (0);
// get number of data elements
int worksize;
fscanf(xfsrc,"%d",&worksize);
// get limit [pivot]
int k;
fscanf(xfsrc,"%d",&k);
dbgprt("master: PARAMS worksize=%d k=%d\n",worksize,k);
// read in the data array
int *workvet = dataload(xfsrc,worksize);
if (isfile)
fclose(xfsrc);
// get number of workers
// NOTE: we do _not_ have the master do calculations [for simplicity]
// usually, for large data, we want the master free to control things
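int numworkers = numproc - 1;
// guard: with a single process there are no workers (avoids divide by zero below)
if (numworkers < 1) {
fprintf(stderr,"need at least 2 processes (1 master + 1 or more workers)\n");
MPI_Abort(MPI_COMM_WORLD,1);
}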
// get number of elements for each worker
int workper = worksize / numworkers;
dbgprt("master: LOOP numworkers=%d workper=%d\n",numworkers,workper);
// send data to other workers
int remain = worksize;
int offset = 0;
int portion;
for (workrank = 1; workrank < numproc; ++workrank,
offset += portion, remain -= portion) {
// get amount for this worker
portion = workper;
// last proc must get all remaining work
if (workrank == (numproc - 1))
portion = remain;
dbgprt("master: WORK/%d offset=%d portion=%d\n",
workrank,offset,portion);
// send the worker's data count
sendint(workrank,&portion,1);
// send the pivot point
sendint(workrank,&k,1);
// send the data to worker
sendint(workrank,&workvet[offset],portion);
}
// accumulate count
int total = 0;
int count;
for (workrank = 1; workrank < numproc; ++workrank) {
recvint(workrank,&count,1);
total += count;
}
printf("%d numbers bigger than %d\n",total,k);
// do cross check of MPI result against a simple single process solution
#ifdef CHECK
count = docalc(workvet,worksize,k);
printf("master count was %d -- %s\n",
count,(count == total) ? "PASS" : "FAIL");
#endif
}
// main -- main program
int
main(int argc,char **argv)
{
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numproc);
MPI_Comm_rank(MPI_COMM_WORLD,&myrank);
// skip over program name
--argc;
++argv;
if (myrank == 0)
master(argc,argv);
else
worker();
MPI_Finalize();
return 0;
}
How do you send blocks of a 2-D array to different processors? Suppose the 2-D array size is 400x400 and I want to send blocks of size 100x100 to different processors. The idea is that each processor will perform computation on its separate block and send its result back to the first processor for the final result.
I am using MPI in C programs.
Let me start by saying that you generally don't really want to do this - scatter and gather huge chunks of data from some "master" process. Normally you want each task to be chugging away at its own piece of the puzzle, and you should aim to never have one processor need a "global view" of the whole data; as soon as you require that, you limit scalability and the problem size. If you're doing this for I/O - one process reads the data, then scatters it, then gathers it back for writing, you'll want eventually to look into MPI-IO.
Getting to your question, though, MPI has very nice ways of pulling arbitrary data out of memory, and scatter/gathering it to and from a set of processors. Unfortunately that requires a fair number of MPI concepts - MPI Types, extents, and collective operations. A lot of the basic ideas are discussed in the answer to this question -- MPI_Type_create_subarray and MPI_Gather.
Update - In the cold light of day, this is a lot of code and not a lot of explanation. So let me expand a little bit.
Consider a 1d integer global array that task 0 has that you want to distribute to a number of MPI tasks, so that they each get a piece in their local array. Say you have 4 tasks, and the global array is [01234567]. You could have task 0 send four messages (including one to itself) to distribute this, and when it's time to re-assemble, receive four messages to bundle it back together; but that obviously gets very time consuming at large numbers of processes. There are optimized routines for these sorts of operations - scatter/gather operations. So in this 1d case you'd do something like this:
int global[8]; /* only task 0 has this */
int local[2]; /* everyone has this */
const int root = 0; /* the processor with the initial global data */
if (rank == root) {
for (int i=0; i<8; i++) global[i] = i; /* fill all 8 elements */
}
MPI_Scatter(global, 2, MPI_INT, /* send everyone 2 ints from global */
local, 2, MPI_INT, /* each proc receives 2 ints into local */
root, MPI_COMM_WORLD); /* sending process is root, all procs in */
/* MPI_COMM_WORLD participate */
After this, the processors' data would look like
task 0: local:[01] global: [01234567]
task 1: local:[23] global: [garbage-]
task 2: local:[45] global: [garbage-]
task 3: local:[67] global: [garbage-]
That is, the scatter operation takes the global array and sends contiguous 2-int chunks to all the processors.
To re-assemble the array, we use the MPI_Gather() operation, which works exactly the same but in reverse:
for (int i=0; i<2; i++)
local[i] = local[i] + rank;
MPI_Gather(local, 2, MPI_INT, /* everyone sends 2 ints from local */
global, 2, MPI_INT, /* root receives 2 ints from each proc into global */
root, MPI_COMM_WORLD); /* recv'ing process is root, all procs in */
/* MPI_COMM_WORLD participate */
and now the data looks like
task 0: local:[01] global: [0134679a]
task 1: local:[34] global: [garbage-]
task 2: local:[67] global: [garbage-]
task 3: local:[9a] global: [garbage-]
Gather brings all the data back, and here a is 10 because I didn't think my formatting through carefully enough upon starting this example.
What happens if the number of processes doesn't evenly divide the number of data points, and we need to send different numbers of items to each process? Then you need a generalized version of scatter, MPI_Scatterv(), which lets you specify the counts for each processor, and displacements -- where in the global array that piece of data starts. So let's say you had an array of characters [abcdefghi] with 9 characters, and you were going to assign every process two characters except the last, which got three. Then you'd need
char global[9]; /* only task 0 has this */
char local[3]={'-','-','-'}; /* everyone has this */
int mynum; /* how many items */
const int root = 0; /* the processor with the initial global data */
if (rank == 0) {
for (int i=0; i<9; i++) global[i] = 'a'+i; /* fill all 9 characters */
}
int counts[4] = {2,2,2,3}; /* how many pieces of data everyone has */
mynum = counts[rank];
int displs[4] = {0,2,4,6}; /* the starting point of everyone's data */
/* in the global array */
MPI_Scatterv(global, counts, displs, /* proc i gets counts[i] pts from displs[i] */
MPI_CHAR,
local, mynum, MPI_CHAR, /* I'm receiving mynum MPI_CHARs into local */
root, MPI_COMM_WORLD);
Now the data looks like
task 0: local:[ab-] global: [abcdefghi]
task 1: local:[cd-] global: [garbage--]
task 2: local:[ef-] global: [garbage--]
task 3: local:[ghi] global: [garbage--]
You've now used scatterv to distribute the irregular amounts of data. The displacement in each case is two*rank from the start of the array (measured in characters; the displacement is in units of the type being sent for a scatter or received for a gather; it's not generally in bytes or something), and the counts are {2,2,2,3}. If it had been the first processor we wanted to have 3 characters, we would have set counts={3,2,2,2} and displacements would have been {0,3,5,7}. Gatherv again works exactly the same but in reverse; the counts and displs arrays would remain the same.
Now, for 2D, this is a bit trickier. If we want to send 2d subblocks of a 2d array, the data we're sending is no longer contiguous. If we're sending (say) 3x3 subblocks of a 6x6 array to 4 processors, the data we're sending has holes in it:
2D Array
---------
|000|111|
|000|111|
|000|111|
|---+---|
|222|333|
|222|333|
|222|333|
---------
Actual layout in memory
[000111000111000111222333222333222333]
(Note that all high-performance computing comes down to understanding the layout of data in memory.)
If we want to send the data that is marked "1" to task 1, we need to skip three values, send three values, skip three values, send three values, skip three values, send three values. A second complication is where the subregions stop and start; note that region "1" doesn't start where region "0" stops; after the last element of region "0", the next location in memory is partway through region "1".
Let's tackle the first layout problem first - how to pull out just the data we want to send. We could always just copy out all the "0" region data to another, contiguous array, and send that; if we planned it out carefully enough, we could even do that in such a way that we could call MPI_Scatter on the results. But we'd rather not have to transpose our entire main data structure that way.
So far, all the MPI data types we've used are simple ones - MPI_INT specifies (say) 4 bytes in a row. However, MPI lets you create your own data types that describe arbitrarily complex data layouts in memory. And this case -- rectangular subregions of an array -- is common enough that there's a specific call for that. For the 2-dimensional case we're describing above,
MPI_Datatype newtype;
int sizes[2] = {6,6}; /* size of global array */
int subsizes[2] = {3,3}; /* size of sub-region */
int starts[2] = {0,0}; /* let's say we're looking at region "0",
which begins at index [0,0] */
MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_INT, &newtype);
MPI_Type_commit(&newtype);
This creates a type which picks out just the region "0" from the global array; we could send just that piece of data now to another processor
MPI_Send(&(global[0][0]), 1, newtype, dest, tag, MPI_COMM_WORLD); /* region "0" */
and the receiving process could receive it into a local array. Note that the receiving process, if it's only receiving it into a 3x3 array, can not describe what it's receiving as a type of newtype; that no longer describes the memory layout. Instead, it's just receiving a block of 3*3 = 9 integers:
MPI_Recv(&(local[0][0]), 3*3, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
Note that we could do this for other sub-regions, too, either by creating a different type (with different start array) for the other blocks, or just by sending at the starting point of the particular block:
MPI_Send(&(global[0][3]), 1, newtype, dest, tag, MPI_COMM_WORLD); /* region "1" */
MPI_Send(&(global[3][0]), 1, newtype, dest, tag, MPI_COMM_WORLD); /* region "2" */
MPI_Send(&(global[3][3]), 1, newtype, dest, tag, MPI_COMM_WORLD); /* region "3" */
Finally, note that we require global and local to be contiguous chunks of memory here; that is, &(global[0][0]) and &(local[0][0]) (or, equivalently, *global and *local) point to contiguous 6*6 and 3*3 chunks of memory; that isn't guaranteed by the usual way of allocating dynamic multi-d arrays. It's shown how to do this below.
Now that we understand how to specify subregions, there's only one more thing to discuss before using scatter/gather operations, and that's the "size" of these types. We couldn't just use MPI_Scatter() (or even scatterv) with these types yet, because of their extent. A type made by MPI_Type_create_subarray carries the extent of the whole global array (36 integers here), and the data it actually picks out spans 15 integers (flat indices 0 through 14); either way, where one block ends doesn't line up nicely with where the next block begins, so we can't just use scatter - it would pick the wrong place to start sending data to the next processor.
Of course, we could use MPI_Scatterv() and specify the displacements ourselves, and that's what we'll do - except the displacements are in units of the send-type extent, and that doesn't help us either; the blocks start at offsets of (0,3,18,21) integers from the start of the global array, and those offsets can't be expressed as integer multiples of that extent.
To deal with this, MPI lets you set the extent of the type for the purposes of these calculations. It doesn't truncate the type; it's just used for figuring out where the next element starts given the last element. For types like these with holes in them, it's frequently handy to set the extent to be something smaller than the distance in memory to the actual end of the type.
We can set the extent to be anything that's convenient to us. We could just make the extent 1 integer, and then set the displacements in units of integers. In this case, though, I like to set the extent to be 3 integers - the size of a sub-row - that way, block "1" starts immediately after block "0", and block "3" starts immediately after block "2". Unfortunately, it doesn't quite work as nicely when jumping from block "1" to block "2", but that can't be helped.
So to scatter the subblocks in this case, we'd do the following:
MPI_Datatype type, resizedtype;
int sizes[2] = {6,6}; /* size of global array */
int subsizes[2] = {3,3}; /* size of sub-region */
int starts[2] = {0,0}; /* let's say we're looking at region "0",
which begins at index [0,0] */
/* as before */
MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_INT, &type);
/* change the extent of the type */
MPI_Type_create_resized(type, 0, 3*sizeof(int), &resizedtype);
MPI_Type_commit(&resizedtype);
Here we've created the same block type as before, but we've resized it; we haven't changed where the type "starts" (the 0) but we've changed where it "ends" (3 ints). We didn't mention this before, but the MPI_Type_commit is required to be able to use the type; but you only need to commit the final type you actually use, not any intermediate steps. You use MPI_Type_free to free the type when you're done.
So now, finally, we can scatterv the blocks: the data manipulations above are a little complicated, but once it's done, the scatterv looks just like before:
int counts[4] = {1,1,1,1}; /* how many pieces of data everyone has, in units of blocks */
int displs[4] = {0,1,6,7}; /* the starting point of everyone's data */
/* in the global array, in block extents */
MPI_Scatterv(global, counts, displs, /* proc i gets counts[i] types from displs[i] */
resizedtype,
local, 3*3, MPI_INT, /* I'm receiving 3*3 MPI_INTs into local */
root, MPI_COMM_WORLD);
And now we're done, after a little tour of scatter, gather, and MPI derived types.
An example code which shows both the gather and the scatter operation, with character arrays, follows. Running the program:
$ mpirun -n 4 ./gathervarray
Global array is:
0123456789
3456789012
6789012345
9012345678
2345678901
5678901234
8901234567
1234567890
4567890123
7890123456
Local process on rank 0 is:
|01234|
|34567|
|67890|
|90123|
|23456|
Local process on rank 1 is:
|56789|
|89012|
|12345|
|45678|
|78901|
Local process on rank 2 is:
|56789|
|89012|
|12345|
|45678|
|78901|
Local process on rank 3 is:
|01234|
|34567|
|67890|
|90123|
|23456|
Processed grid:
AAAAABBBBB
AAAAABBBBB
AAAAABBBBB
AAAAABBBBB
AAAAABBBBB
CCCCCDDDDD
CCCCCDDDDD
CCCCCDDDDD
CCCCCDDDDD
CCCCCDDDDD
and the code follows.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include "mpi.h"
int malloc2dchar(char ***array, int n, int m) {
/* allocate the n*m contiguous items */
char *p = (char *)malloc(n*m*sizeof(char));
if (!p) return -1;
/* allocate the row pointers into the memory */
(*array) = (char **)malloc(n*sizeof(char*));
if (!(*array)) {
free(p);
return -1;
}
/* set up the pointers into the contiguous memory */
for (int i=0; i<n; i++)
(*array)[i] = &(p[i*m]);
return 0;
}
int free2dchar(char ***array) {
/* free the memory - the first element of the array is at the start */
free(&((*array)[0][0]));
/* free the pointers into the memory */
free(*array);
return 0;
}
int main(int argc, char **argv) {
char **global, **local;
const int gridsize=10; // size of grid
const int procgridsize=2; // size of process grid
int rank, size; // rank of current process and no. of processes
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (size != procgridsize*procgridsize) {
fprintf(stderr,"%s: Only works with np=%d for now\n", argv[0], procgridsize*procgridsize);
MPI_Abort(MPI_COMM_WORLD,1);
}
if (rank == 0) {
/* fill in the array, and print it */
malloc2dchar(&global, gridsize, gridsize);
for (int i=0; i<gridsize; i++) {
for (int j=0; j<gridsize; j++)
global[i][j] = '0'+(3*i+j)%10;
}
printf("Global array is:\n");
for (int i=0; i<gridsize; i++) {
for (int j=0; j<gridsize; j++)
putchar(global[i][j]);
printf("\n");
}
}
/* create the local array which we'll process */
malloc2dchar(&local, gridsize/procgridsize, gridsize/procgridsize);
/* create a datatype to describe the subarrays of the global array */
int sizes[2] = {gridsize, gridsize}; /* global size */
int subsizes[2] = {gridsize/procgridsize, gridsize/procgridsize}; /* local size */
int starts[2] = {0,0}; /* where this one starts */
MPI_Datatype type, subarrtype;
MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_CHAR, &type);
MPI_Type_create_resized(type, 0, gridsize/procgridsize*sizeof(char), &subarrtype);
MPI_Type_commit(&subarrtype);
char *globalptr=NULL;
if (rank == 0) globalptr = &(global[0][0]);
/* scatter the array to all processors */
int sendcounts[procgridsize*procgridsize];
int displs[procgridsize*procgridsize];
if (rank == 0) {
for (int i=0; i<procgridsize*procgridsize; i++) sendcounts[i] = 1;
int disp = 0;
for (int i=0; i<procgridsize; i++) {
for (int j=0; j<procgridsize; j++) {
displs[i*procgridsize+j] = disp;
disp += 1;
}
disp += ((gridsize/procgridsize)-1)*procgridsize;
}
}
MPI_Scatterv(globalptr, sendcounts, displs, subarrtype, &(local[0][0]),
gridsize*gridsize/(procgridsize*procgridsize), MPI_CHAR,
0, MPI_COMM_WORLD);
/* now all processors print their local data: */
for (int p=0; p<size; p++) {
if (rank == p) {
printf("Local process on rank %d is:\n", rank);
for (int i=0; i<gridsize/procgridsize; i++) {
putchar('|');
for (int j=0; j<gridsize/procgridsize; j++) {
putchar(local[i][j]);
}
printf("|\n");
}
}
MPI_Barrier(MPI_COMM_WORLD);
}
/* now each processor has its local array, and can process it */
for (int i=0; i<gridsize/procgridsize; i++) {
for (int j=0; j<gridsize/procgridsize; j++) {
local[i][j] = 'A' + rank;
}
}
/* it all goes back to process 0 */
MPI_Gatherv(&(local[0][0]), gridsize*gridsize/(procgridsize*procgridsize), MPI_CHAR,
globalptr, sendcounts, displs, subarrtype,
0, MPI_COMM_WORLD);
/* don't need the local data anymore */
free2dchar(&local);
/* or the MPI data type */
MPI_Type_free(&subarrtype);
if (rank == 0) {
printf("Processed grid:\n");
for (int i=0; i<gridsize; i++) {
for (int j=0; j<gridsize; j++) {
putchar(global[i][j]);
}
printf("\n");
}
free2dchar(&global);
}
MPI_Finalize();
return 0;
}
I just found it easier to check it that way.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include "mpi.h"
/*
This is a version with integers, rather than char arrays, presented in this
very good answer: http://stackoverflow.com/a/9271753/2411320
It will initialize the 2D array, scatter it, increase every value by 1 and then gather it back.
*/
int malloc2D(int ***array, int n, int m) {
int i;
/* allocate the n*m contiguous items */
int *p = malloc(n*m*sizeof(int));
if (!p) return -1;
/* allocate the row pointers into the memory */
(*array) = malloc(n*sizeof(int*));
if (!(*array)) {
free(p);
return -1;
}
/* set up the pointers into the contiguous memory */
for (i=0; i<n; i++)
(*array)[i] = &(p[i*m]);
return 0;
}
int free2D(int ***array) {
/* free the memory - the first element of the array is at the start */
free(&((*array)[0][0]));
/* free the pointers into the memory */
free(*array);
return 0;
}
int main(int argc, char **argv) {
int **global, **local;
const int gridsize=4; // size of grid
const int procgridsize=2; // size of process grid
int rank, size; // rank of current process and no. of processes
int i, j, p;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (size != procgridsize*procgridsize) {
fprintf(stderr,"%s: Only works with np=%d for now\n", argv[0], procgridsize*procgridsize);
MPI_Abort(MPI_COMM_WORLD,1);
}
if (rank == 0) {
/* fill in the array, and print it */
malloc2D(&global, gridsize, gridsize);
int counter = 0;
for (i=0; i<gridsize; i++) {
for (j=0; j<gridsize; j++)
global[i][j] = ++counter;
}
printf("Global array is:\n");
for (i=0; i<gridsize; i++) {
for (j=0; j<gridsize; j++) {
printf("%2d ", global[i][j]);
}
printf("\n");
}
}
/* create the local array which we'll process */
malloc2D(&local, gridsize/procgridsize, gridsize/procgridsize);
/* create a datatype to describe the subarrays of the global array */
int sizes[2] = {gridsize, gridsize}; /* global size */
int subsizes[2] = {gridsize/procgridsize, gridsize/procgridsize}; /* local size */
int starts[2] = {0,0}; /* where this one starts */
MPI_Datatype type, subarrtype;
MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_INT, &type);
MPI_Type_create_resized(type, 0, gridsize/procgridsize*sizeof(int), &subarrtype);
MPI_Type_commit(&subarrtype);
int *globalptr=NULL;
if (rank == 0)
globalptr = &(global[0][0]);
/* scatter the array to all processors */
int sendcounts[procgridsize*procgridsize];
int displs[procgridsize*procgridsize];
if (rank == 0) {
for (i=0; i<procgridsize*procgridsize; i++)
sendcounts[i] = 1;
int disp = 0;
for (i=0; i<procgridsize; i++) {
for (j=0; j<procgridsize; j++) {
displs[i*procgridsize+j] = disp;
disp += 1;
}
disp += ((gridsize/procgridsize)-1)*procgridsize;
}
}
MPI_Scatterv(globalptr, sendcounts, displs, subarrtype, &(local[0][0]),
gridsize*gridsize/(procgridsize*procgridsize), MPI_INT,
0, MPI_COMM_WORLD);
/* now all processors print their local data: */
for (p=0; p<size; p++) {
if (rank == p) {
printf("Local process on rank %d is:\n", rank);
for (i=0; i<gridsize/procgridsize; i++) {
putchar('|');
for (j=0; j<gridsize/procgridsize; j++) {
printf("%2d ", local[i][j]);
}
printf("|\n");
}
}
MPI_Barrier(MPI_COMM_WORLD);
}
/* now each processor has its local array, and can process it */
for (i=0; i<gridsize/procgridsize; i++) {
for (j=0; j<gridsize/procgridsize; j++) {
local[i][j] += 1; // increase by one the value
}
}
/* it all goes back to process 0 */
MPI_Gatherv(&(local[0][0]), gridsize*gridsize/(procgridsize*procgridsize), MPI_INT,
globalptr, sendcounts, displs, subarrtype,
0, MPI_COMM_WORLD);
/* don't need the local data anymore */
free2D(&local);
/* or the MPI data type */
MPI_Type_free(&subarrtype);
if (rank == 0) {
printf("Processed grid:\n");
for (i=0; i<gridsize; i++) {
for (j=0; j<gridsize; j++) {
printf("%2d ", global[i][j]);
}
printf("\n");
}
free2D(&global);
}
MPI_Finalize();
return 0;
}
Output:
linux16:>mpicc -o main main.c
linux16:>mpiexec -n 4 main
Global array is:
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
Local process on rank 0 is:
| 1 2 |
| 5 6 |
Local process on rank 1 is:
| 3 4 |
| 7 8 |
Local process on rank 2 is:
| 9 10 |
|13 14 |
Local process on rank 3 is:
|11 12 |
|15 16 |
Processed grid:
2 3 4 5
6 7 8 9
10 11 12 13
14 15 16 17
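The displacement loop in the code above counts in units of the resized type's extent (gridsize/procgridsize elements), not in single elements. A minimal sketch of the same arithmetic in closed form, with no MPI calls, may make that easier to check; the function name block_displs is hypothetical and not part of the original answer:

```c
#include <assert.h>

/* Displacement of each block's top-left corner, measured in units of
 * (gridsize/procgridsize) elements, which is the extent given to the
 * resized subarray type. Blocks are numbered row by row. */
void block_displs(int gridsize, int procgridsize, int *displs) {
    assert(procgridsize > 0 && gridsize % procgridsize == 0);
    int blocksize = gridsize / procgridsize;
    for (int i = 0; i < procgridsize; i++) {
        for (int j = 0; j < procgridsize; j++) {
            /* element offset is i*blocksize*gridsize + j*blocksize;
             * dividing by blocksize gives the value below */
            displs[i * procgridsize + j] = i * blocksize * procgridsize + j;
        }
    }
}
```

For gridsize=4 and procgridsize=2 this gives displacements 0, 1, 4, 5, which matches the blocks starting at the values 1, 3, 9 and 11 in the output above.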
I have a 2D double precision array that is being manipulated in parallel by several processes. Each process manipulates a part of the array, and at the end of every iteration, I need to ensure that all the processes have the SAME copy of the 2D array.
Assume an array of size 10*10 and 2 processes (or processors). Process 1 (P1) manipulates the first 5 rows of the 2D array (5*10=50 elements in total) and P2 manipulates the last 5 rows (50 elements total). At the end of each iteration, I need P1 to have (ITS OWN first 5 rows + P2's last 5 rows). P2 should have (P1's first 5 rows + its OWN last 5 rows). I hope the scenario is clear.
I am trying to broadcast using the code given below. But my program keeps exiting with this error: "APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)".
I am already using a contiguous 2D memory allocator as pointed out here: MPI_Bcast a dynamic 2d array by Jonathan. But I am still getting the same error.
Can someone help me out?
My code:
double **grid, **oldgrid;
int gridsize; // size of grid
int rank, size; // rank of current process and no. of processes
int rowsforeachprocess, offset; // to keep track of rows that need to be handled by each process
/* allocation, MPI_Init, and lots of other stuff */
rowsforeachprocess = ceil((float)gridsize/size);
offset = rank*rowsforeachprocess;
/* Each process is handling "rowsforeachprocess" #rows.
* Lots of work done here
* Now I need to broadcast these rows to all other processes.
*/
for(i=0; i<gridsize; i++){
MPI_Bcast(&(oldgrid[i]), gridsize-2, MPI_DOUBLE, (i/rowsforeachprocess), MPI_COMM_WORLD);
}
Part 2: The code above is part of a parallel solver for the Laplace equation using 1D decomposition and I did not want to use a Master-worker model. Will my code be easier if I use a Master-worker model?
The crash-causing problem here is a 2d-array pointer issue -- &(oldgrid[i]) is a pointer-to-a-pointer to doubles, not a pointer to doubles, and it points to the pointer to row i of your array, not to row i of your array. You want MPI_Bcast(&(oldgrid[i][0]),.. or MPI_Bcast(oldgrid[i],....
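To see the difference without any MPI involved, here is a small sketch. It assumes the usual contiguous allocation (one block of row pointers, one block of data); alloc_contiguous and points_into_data are hypothetical helper names used only for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Contiguous m x n array of doubles plus a separate block of row pointers. */
double **alloc_contiguous(int m, int n) {
    assert(m > 0 && n > 0);
    double **rows = malloc((size_t)m * sizeof *rows);
    if (!rows) return NULL;
    rows[0] = calloc((size_t)m * n, sizeof **rows);
    if (!rows[0]) { free(rows); return NULL; }
    for (int i = 1; i < m; i++)
        rows[i] = rows[0] + (size_t)i * n;
    return rows;
}

/* Returns 1 if ptr lies inside the block that holds the m*n doubles. */
int points_into_data(double **grid, int m, int n, const void *ptr) {
    uintptr_t lo = (uintptr_t)grid[0];
    uintptr_t hi = lo + (uintptr_t)m * n * sizeof(double);
    uintptr_t p = (uintptr_t)ptr;
    return p >= lo && p < hi;
}
```

With a grid from alloc_contiguous, points_into_data returns 1 for grid[i] and for &(grid[i][0]), but 0 for &(grid[i]): the latter is the address of a slot in the row-pointer block, so handing it to MPI_Bcast makes MPI overwrite pointers rather than doubles.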
There's another way to do this, too, which uses only one expensive collective call instead of one per row; if you need everyone to have a copy of the whole array, you can use MPI_Allgather to gather the data together and distribute it to everyone; or, in the general case where the processes don't have the same number of rows, MPI_Allgatherv. Instead of the loop over broadcasts, this would look a little like:
{
int *counts = malloc(size*sizeof(int));
int *displs = malloc(size*sizeof(int));
for (int i=0; i<size; i++) {
counts[i] = rowsforeachprocess*gridsize;
displs[i] = i*rowsforeachprocess*gridsize;
}
counts[size-1] = (gridsize-(size-1)*rowsforeachprocess)*gridsize;
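/* each rank's rows already sit at their final position inside oldgrid, so
 * gather in place: MPI does not allow the send buffer to overlap the
 * receive buffer, and with MPI_IN_PLACE the send count and type are ignored */
MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
oldgrid[0], counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);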
free(counts);
free(displs);
}
where counts are the number of items sent by each task, and displs are the displacements.
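As a worked check of those two arrays, here is a minimal sketch that computes them without any MPI calls. The function name row_counts_displs is hypothetical, and clamping trailing ranks to zero rows is an added assumption for the case where the ceil() rule leaves nothing for the last ranks; the snippet above does not handle that case:

```c
#include <assert.h>

/* counts[] and displs[] (both in elements) for a row-wise split of a
 * gridsize x gridsize array over nprocs ranks, using the same
 * ceil(gridsize/nprocs) rows-per-rank rule as above. */
void row_counts_displs(int gridsize, int nprocs, int *counts, int *displs) {
    assert(gridsize > 0 && nprocs > 0);
    int rows_each = (gridsize + nprocs - 1) / nprocs;  /* integer ceil */
    for (int r = 0; r < nprocs; r++) {
        int first = r * rows_each;
        if (first > gridsize) first = gridsize;   /* rank gets no rows */
        int last = first + rows_each;
        if (last > gridsize) last = gridsize;     /* short final chunk */
        counts[r] = (last - first) * gridsize;
        displs[r] = first * gridsize;
    }
}
```

For gridsize=10 on 3 ranks this yields counts of 40, 40, 20 and displacements of 0, 40, 80.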
But finally, are you sure that every process has to have a copy of the entire array? If you're just computing a Laplacian, you probably just need neighboring rows, not the whole array.
This would look like:
int main(int argc, char**argv) {
double **oldgrid;
const int gridsize=10; // size of grid
int rank, size; // rank of current process and no. of processes
int rowsforeachprocess; // to keep track of rows that need to be handled by each process
int offset, mynumrows;
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
rowsforeachprocess = (int)ceil((float)gridsize/size);
offset = rank*rowsforeachprocess;
mynumrows = rowsforeachprocess;
if (rank == size-1)
mynumrows = gridsize-offset;
malloc2ddouble(&oldgrid, mynumrows+2, gridsize);
for (int i=0; i<mynumrows+2; i++)
for (int j=0; j<gridsize; j++)
oldgrid[i][j] = rank;
/* exchange row data with neighbours */
int highneigh = rank+1;
if (rank == size-1) highneigh = 0;
int lowneigh = rank-1;
if (rank == 0) lowneigh = size-1;
/* send data to high neighbour and receive from low */
MPI_Sendrecv(oldgrid[mynumrows], gridsize, MPI_DOUBLE, highneigh, 1,
oldgrid[0], gridsize, MPI_DOUBLE, lowneigh, 1,
MPI_COMM_WORLD, &status);
/* send data to low neighbour and receive from high */
MPI_Sendrecv(oldgrid[1], gridsize, MPI_DOUBLE, lowneigh, 1,
oldgrid[mynumrows+1], gridsize, MPI_DOUBLE, highneigh, 1,
MPI_COMM_WORLD, &status);
for (int proc=0; proc<size; proc++) {
if (rank == proc) {
printf("Rank %d:\n", proc);
for (int i=0; i<mynumrows+2; i++) {
for (int j=0; j<gridsize; j++) {
printf("%f ", oldgrid[i][j]);
}
printf("\n");
}
printf("\n");
}
MPI_Barrier(MPI_COMM_WORLD);
}
MPI_Finalize();
return 0;
}
I have a 2D array which is distributed across an MPI process grid (3 x 2 processes in this example). The values of the array are generated within the process which that chunk of the array is distributed to, and I want to gather all of those chunks together at the root process to display them.
So far, I have the code below. This generates a cartesian communicator, finds out the co-ordinates of the MPI process and works out how much of the array it should get based on that (as the array need not be a multiple of the cartesian grid size). I then create a new MPI derived datatype which will send the whole of that process's subarray as one item (that is, the stride, blocklength and count are different for each process, as each process has different sized arrays). However, when I come to gather the data together with MPI_Gather, I get a segmentation fault.
I think this is because I shouldn't be using the same datatype for sending and receiving in the MPI_Gather call. The data type is fine for sending the data, as it has the right count, stride and blocklength, but when it gets to the other end it'll need a very different derived datatype. I'm not sure how to calculate the parameters for this datatype - does anyone have any ideas?
Also, if I'm approaching this from completely the wrong angle then please let me know!
#include<stdio.h>
#include<array_alloc.h>
#include<math.h>
#include<mpi.h>
int main(int argc, char ** argv)
{
int size, rank;
int dim_size[2];
int periods[2];
int A = 2;
int B = 3;
MPI_Comm cart_comm;
MPI_Datatype block_type;
int coords[2];
float **array;
float **whole_array;
int n = 10;
int rows_per_core;
int cols_per_core;
int i, j;
int x_start, x_finish;
int y_start, y_finish;
/* Initialise MPI */
MPI_Init(&argc, &argv);
/* Get the rank for this process, and the number of processes */
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
{
/* If we're the master process */
whole_array = alloc_2d_float(n, n);
/* Initialise whole array to silly values */
for (i = 0; i < n; i++)
{
for (j = 0; j < n; j++)
{
whole_array[i][j] = 9999.99;
}
}
for (j = 0; j < n; j ++)
{
for (i = 0; i < n; i++)
{
printf("%f ", whole_array[j][i]);
}
printf("\n");
}
}
/* Create the cartesian communicator */
dim_size[0] = B;
dim_size[1] = A;
periods[0] = 1;
periods[1] = 1;
MPI_Cart_create(MPI_COMM_WORLD, 2, dim_size, periods, 1, &cart_comm);
/* Get our co-ordinates within that communicator */
MPI_Cart_coords(cart_comm, rank, 2, coords);
rows_per_core = ceil(n / (float) A);
cols_per_core = ceil(n / (float) B);
if (coords[0] == (B - 1))
{
/* We're at the far end of a row */
cols_per_core = n - (cols_per_core * (B - 1));
}
if (coords[1] == (A - 1))
{
/* We're at the bottom of a col */
rows_per_core = n - (rows_per_core * (A - 1));
}
printf("X: %d, Y: %d, RpC: %d, CpC: %d\n", coords[0], coords[1], rows_per_core, cols_per_core);
MPI_Type_vector(rows_per_core, cols_per_core, cols_per_core + 1, MPI_FLOAT, &block_type);
MPI_Type_commit(&block_type);
array = alloc_2d_float(rows_per_core, cols_per_core);
if (array == NULL)
{
printf("Problem with array allocation.\nExiting\n");
return 1;
}
for (j = 0; j < rows_per_core; j++)
{
for (i = 0; i < cols_per_core; i++)
{
array[j][i] = (float) (i + 1);
}
}
MPI_Barrier(MPI_COMM_WORLD);
MPI_Gather(array, 1, block_type, whole_array, 1, block_type, 0, MPI_COMM_WORLD);
/*
if (rank == 0)
{
for (j = 0; j < n; j ++)
{
for (i = 0; i < n; i++)
{
printf("%f ", whole_array[j][i]);
}
printf("\n");
}
}
*/
/* Close down the MPI environment */
MPI_Finalize();
}
The 2D array allocation routine I have used above is implemented as:
float **alloc_2d_float( int ndim1, int ndim2 ) {
float **array2 = malloc( ndim1 * sizeof( float * ) );
int i;
if( array2 != NULL ){
array2[0] = malloc( ndim1 * ndim2 * sizeof( float ) );
if( array2[ 0 ] != NULL ) {
for( i = 1; i < ndim1; i++ )
array2[i] = array2[0] + i * ndim2;
}
else {
free( array2 );
array2 = NULL;
}
}
return array2;
}
This is a tricky one. You're on the right track, and yes, you will need different types for sending and receiving.
The sending part is easy -- if you're sending the whole subarray array, then you don't even need the vector type; you can send the entire (rows_per_core)*(cols_per_core) contiguous floats starting at &(array[0][0]) (or array[0], if you prefer).
It's the receiving that's the tricky part, as you've gathered. Let's start with the simplest case -- assuming that everything divides evenly so all the blocks have the same size. Then you can use the very helpful MPI_Type_create_subarray (you could always cobble this together with vector types, but for higher-dimensional arrays this becomes tedious, as you need to create 1 intermediate type for each dimension of the array except the last).
Also, rather than hardcoding the decomposition, you can use the also-helpful MPI_Dims_create to create an as-square-as-possible decomposition of your ranks. Note that this doesn't necessarily have anything to do with MPI_Cart_create, although you can use it for the requested dimensions. I'm going to skip the cart_create stuff here, not because it's not useful, but because I want to focus on the gather stuff.
So if everyone has the same size of array, then root is receiving the same data type from everyone, and one can use a very simple subarray type to get their data:
MPI_Type_create_subarray(2, whole_array_size, sub_array_size, starts,
MPI_ORDER_C, MPI_FLOAT, &block_type);
MPI_Type_commit(&block_type);
where sub_array_size[] = {rows_per_core, cols_per_core}, whole_array_size[] = {n,n}, and for here, starts[]={0,0} -- i.e., we'll just assume that every block starts at the start of the array.
The reason for this is that we can then use Gatherv to explicitly set the displacements into the array:
for (int i=0; i<size; i++) {
counts[i] = 1; /* one block_type per rank */
int row = (i % A);
int col = (i / A);
/* displacement into the whole_array */
disps[i] = (col*cols_per_core + row*(rows_per_core)*n);
}
MPI_Gatherv(array[0], rows_per_core*cols_per_core, MPI_FLOAT,
recvptr, counts, disps, resized_type, 0, MPI_COMM_WORLD);
So now everyone sends their data in one chunk, and it's received through the subarray type into the right part of the array. For this to work, I've resized the type so that its extent is just one float, so the displacements can be calculated in that unit:
MPI_Type_create_resized(block_type, 0, 1*sizeof(float), &resized_type);
MPI_Type_commit(&resized_type);
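Because the extent is one float, each displacement is simply the index of the block's top-left element in the flattened whole array. A small sketch of that calculation on its own, with no MPI calls, assuming everything divides evenly; the function name gather_displs is hypothetical:

```c
#include <assert.h>

/* Displacements, in units of one float, of each rank's block inside an
 * n x n array split into A row-blocks and B column-blocks. Ranks are
 * mapped to blocks the same way as in the loop above:
 * row = rank % A, col = rank / A. */
void gather_displs(int n, int A, int B, int *disps) {
    assert(A > 0 && B > 0 && n % A == 0 && n % B == 0);
    int rows_per_core = n / A;
    int cols_per_core = n / B;
    for (int i = 0; i < A * B; i++) {
        int row = i % A;
        int col = i / A;
        disps[i] = col * cols_per_core + row * rows_per_core * n;
    }
}
```

For n=6, A=2, B=3 the blocks are 3 rows by 2 columns and the displacements come out as 0, 18, 2, 20, 4, 22.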
The whole code is below:
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include<mpi.h>
float **alloc_2d_float( int ndim1, int ndim2 ) {
float **array2 = malloc( ndim1 * sizeof( float * ) );
int i;
if( array2 != NULL ){
array2[0] = malloc( ndim1 * ndim2 * sizeof( float ) );
if( array2[ 0 ] != NULL ) {
for( i = 1; i < ndim1; i++ )
array2[i] = array2[0] + i * ndim2;
}
else {
free( array2 );
array2 = NULL;
}
}
return array2;
}
void free_2d_float( float **array ) {
if (array != NULL) {
free(array[0]);
free(array);
}
return;
}
void init_array2d(float **array, int ndim1, int ndim2, float data) {
for (int i=0; i<ndim1; i++)
for (int j=0; j<ndim2; j++)
array[i][j] = data;
return;
}
void print_array2d(float **array, int ndim1, int ndim2) {
for (int i=0; i<ndim1; i++) {
for (int j=0; j<ndim2; j++) {
printf("%6.2f ", array[i][j]);
}
printf("\n");
}
return;
}
int main(int argc, char ** argv)
{
int size, rank;
int dim_size[2];
int periods[2];
MPI_Datatype block_type, resized_type;
float **array;
float **whole_array;
float *recvptr;
int *counts, *disps;
int n = 10;
int rows_per_core;
int cols_per_core;
int i, j;
int whole_array_size[2];
int sub_array_size[2];
int starts[2];
int A, B;
/* Initialise MPI */
MPI_Init(&argc, &argv);
/* Get the rank for this process, and the number of processes */
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
{
/* If we're the master process */
whole_array = alloc_2d_float(n, n);
recvptr = &(whole_array[0][0]);
/* Initialise whole array to silly values */
for (i = 0; i < n; i++)
{
for (j = 0; j < n; j++)
{
whole_array[i][j] = 9999.99;
}
}
print_array2d(whole_array, n, n);
puts("\n\n");
}
/* Choose the process grid; MPI_Dims_create only fills in entries that are 0 */
dim_size[0] = 0;
dim_size[1] = 0;
MPI_Dims_create(size, 2, dim_size);
A = dim_size[1];
B = dim_size[0];
periods[0] = 1;
periods[1] = 1;
rows_per_core = ceil(n / (float) A);
cols_per_core = ceil(n / (float) B);
if (rows_per_core*A != n) {
if (rank == 0) fprintf(stderr,"Aborting: rows %d don't divide by %d evenly\n", n, A);
MPI_Abort(MPI_COMM_WORLD,1);
}
if (cols_per_core*B != n) {
if (rank == 0) fprintf(stderr,"Aborting: cols %d don't divide by %d evenly\n", n, B);
MPI_Abort(MPI_COMM_WORLD,2);
}
array = alloc_2d_float(rows_per_core, cols_per_core);
printf("%d, RpC: %d, CpC: %d\n", rank, rows_per_core, cols_per_core);
whole_array_size[0] = n;
sub_array_size [0] = rows_per_core;
whole_array_size[1] = n;
sub_array_size [1] = cols_per_core;
starts[0] = 0; starts[1] = 0;
MPI_Type_create_subarray(2, whole_array_size, sub_array_size, starts,
MPI_ORDER_C, MPI_FLOAT, &block_type);
MPI_Type_commit(&block_type);
MPI_Type_create_resized(block_type, 0, 1*sizeof(float), &resized_type);
MPI_Type_commit(&resized_type);
if (array == NULL)
{
printf("Problem with array allocation.\nExiting\n");
MPI_Abort(MPI_COMM_WORLD,3);
}
init_array2d(array,rows_per_core,cols_per_core,(float)rank);
counts = (int *)malloc(size * sizeof(int));
disps = (int *)malloc(size * sizeof(int));
/* note -- we're just using MPI_COMM_WORLD rank here to
* determine location, not the cart_comm for now... */
for (int i=0; i<size; i++) {
counts[i] = 1; /* one block_type per rank */
int row = (i % A);
int col = (i / A);
/* displacement into the whole_array */
disps[i] = (col*cols_per_core + row*(rows_per_core)*n);
}
MPI_Gatherv(array[0], rows_per_core*cols_per_core, MPI_FLOAT,
recvptr, counts, disps, resized_type, 0, MPI_COMM_WORLD);
free_2d_float(array);
if (rank == 0) print_array2d(whole_array, n, n);
if (rank == 0) free_2d_float(whole_array);
MPI_Finalize();
}
Minor thing -- you don't need the barrier before the gather. In fact, you hardly ever really need a barrier; they're expensive operations for a few reasons and can hide problems -- my rule of thumb is to never, ever, use barriers unless you know exactly why the rule needs to be broken in this case. Here in particular, the collective gather routine already does all the synchronization it needs, so the barrier adds nothing; just use the gather.
Now, moving onto the harder stuff. If things don't divide evenly, you have a few options. The simplest, though not necessarily the best, is just to pad the array so that it does divide evenly, even if just for this operation.
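The padded size is just the array dimension rounded up to the next multiple of the number of blocks along that dimension. A minimal sketch of that arithmetic, with the hypothetical helper name padded_size:

```c
#include <assert.h>

/* Smallest multiple of parts that is >= n, i.e. the dimension to
 * allocate so that n rows (or columns) split evenly into parts blocks. */
int padded_size(int n, int parts) {
    assert(n > 0 && parts > 0);
    return ((n + parts - 1) / parts) * parts;
}
```

For the n=10 array from the question on a 3 x 2 process grid, this gives a padded size of 12 along the dimension split 3 ways and 10 along the dimension split 2 ways; the extra rows or columns are simply ignored after the gather.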
If you can arrange it so that the number of columns does divide evenly, even if the number of rows doesn't, then you can still use Gatherv: create a vector type for each part of a row, and gather the appropriate number of rows from each processor. That would work fine.
If you definitely have the case where neither can be counted on to divide, and you can't pad data for sending, then there are three sub-options I can see:
As susterpatt suggests, do point-to-point. For small numbers of tasks, this is fine, but as it gets larger, this will be significantly less efficient than the collective operations.
Create a communicator consisting of all the processors not on the outer edges, and use exactly the code above to gather their data; and then point-to-point the edge tasks' data.
Don't gather to process 0 at all; use the Distributed array type to describe the layout of the array, and use MPI-IO to write all the data to a file; once that's done, you can have process zero display the data in some way if you like.
It looks like the first argument to your MPI_Gather call should probably be array[0], and not array.
Also, if you need to get different amounts of data from each rank, you might be better off using MPI_Gatherv.
Finally, note that gathering all your data in one place to do output is not scalable in many circumstances. As the amount of data grows, eventually, it will exceed the memory available to rank 0. You might be much better off distributing the output work (if you are writing to a file, using MPI IO or other library calls) or doing point-to-point sends to rank 0 one at a time, to limit the total memory consumption.
On the other hand, I would not recommend coordinating each of your ranks printing to standard output, one after another, because some major MPI implementations don't guarantee that standard output will be produced in order. Cray's MPI, in particular, jumbles up standard output pretty thoroughly if multiple ranks print.
According to this (emphasis by me):
The type-matching conditions for the collective operations are more strict than the corresponding conditions between sender and receiver in point-to-point. Namely, for collective operations, the amount of data sent must exactly match the amount of data specified by the receiver. Distinct type maps between sender and receiver are still allowed.
Sounds to me like you have two options:
1. Pad smaller submatrices so that all processes send the same amount of data, then crop the matrix back to its original size after the Gather. If you're feeling adventurous, you might try defining the receiving typemap so that paddings are automatically overwritten during the Gather operation, thus eliminating the need for the crop afterwards. This could get a bit complicated though.
2. Fall back to point-to-point communication. Much more straightforward, but possibly higher communication costs.
Personally, I'd go with option 2.