I want to send 2d chunks of data using MPI_GATHER. For example: I have 2x3 arrays on each node and I want 8x3 array on root, if I have 4 nodes. For 1d arrays, MPI_GATHER sorts data according to MPI ranks, but for 2d data it creates a mess!
What is the clean way to put chunks in order?
I expected the output of this code:
program testmpi
use mpi
implicit none
integer :: send (2,3)
integer :: rec (4,3)
integer :: ierror,my_rank,i,j
call MPI_Init(ierror)
MPI_DATA_TYPE type_col
! find out process rank
call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
if (my_rank==0) then
send=1
do i=1,2
print*,(send(i,j),j=1,3)
enddo
endif
if (my_rank==1) then
send=5
! do 1,2
! print*,(send(i,j),j=1,3)
! enddo
endif
call MPI_GATHER(send,6,MPI_INTEGER,rec,6,MPI_INTEGER,0,MPI_COMM_WORLD,ierror)
if (my_rank==0) then
print*,'<><><><><>rec'
do i=1,4
print*,(rec(i,j),j=1,3)
enddo
endif
call MPI_Finalize(ierror)
end program testmpi
to be something like this :
1 1 1
1 1 1
5 5 5
5 5 5
but it looks like this:
1 1 5
1 1 5
1 5 5
1 5 5
The following a literal Fortran translation of this answer. I had thought this was unnecessary, but the multiple differences in array indexing and memory layout might mean that it is worth doing a Fortran version.
Let me start by saying that you generally don't really want to do this - scatter and gather huge chunks of data from some "master" process. Normally you want each task to be chugging away at its own piece of the puzzle, and you should aim to never have one processor need a "global view" of the whole data; as soon as you require that, you limit scalability and the problem size. If you're doing this for I/O - one process reads the data, then scatters it, then gathers it back for writing, you'll want eventually to look into MPI-IO.
Getting to your question, though, MPI has very nice ways of pulling arbitrary data out of memory, and scatter/gathering it to and from a set of processors. Unfortunately that requires a fair number of MPI concepts - MPI Types, extents, and collective operations. A lot of the basic ideas are discussed in the answer to this question -- MPI_Type_create_subarray and MPI_Gather .
Consider a 1d integer global array that task 0 has that you want to distribute to a number of MPI tasks, so that they each get a piece in their local array. Say you have 4 tasks, and the global array is [0,1,2,3,4,5,6,7]. You could have task 0 send four messages (including one to itself) to distribute this, and when it's time to re-assemble, receive four messages to bundle it back together; but that obviously gets very time consuming at large numbers of processes. There are optimized routines for these sorts of operations - scatter/gather operations. So in this 1d case you'd do something like this:
integer, dimension(8) :: global ! only root has this
integer, dimension(2) :: local ! everyone has this
integer, parameter :: root = 0
integer :: rank, comsize
integer :: i, ierr
call MPI_Init(ierr)
call MPI_Comm_size(MPI_COMM_WORLD, comsize, ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
if (rank == root) then
global = [ (i, i=1,8) ]
endif
call MPI_Scatter(global, 2, MPI_INTEGER, & ! send everyone 2 ints from global
local, 2, MPI_INTEGER, & ! each proc recieves 2 into
root, & ! sending process is root,
MPI_COMM_WORLD, ierr) ! all procs in COMM_WORLD participate
After this, the processors' data would look like
task 0: local:[1,2] global: [1,2,3,4,5,6,7,8]
task 1: local:[3,4] global: [garbage]
task 2: local:[5,6] global: [garbage]
task 3: local:[7,8] global: [garbage]
That is, the scatter operation takes the global array and sends contiguous 2-int chunks to all the processors.
To re-assemble the array, we use the MPI_Gather() operation, which works exactly the same but in reverse:
local = local + rank
call MPI_Gather (local, 2, MPI_INTEGER, & ! everyone sends 2 ints from local
global, 2, MPI_INTEGER, & ! root receives 2 ints each proc into global
root, & ! receiving process is root,
MPI_COMM_WORLD, ierr) ! all procs in COMM_WORLD participate
And now the arrays look like:
task 0: local:[1,2] global: [1,2,4,5,7,8,10,11]
task 1: local:[4,5] global: [garbage-]
task 2: local:[7,8] global: [garbage-]
task 3: local:[10,11] global: [garbage-]
Gather brings all the data back.
What happens if the number of data points doesn't evenly divide the number of processes, and we need to send different numbers of items to each process? Then you need a generalized version of scatter, MPI_Scatterv, which lets you specify the counts for each processor, and displacements -- where in the global array that piece of data starts. So let's say with the same 4 tasks you had an array of characters [a,b,c,d,e,f,g,h,i] with 9 characters, and you were going to assign every process two characters except the last, that got three. Then you'd need
character, dimension(9) :: global
character, dimension(3) :: local
integer, dimension(4) :: counts
integer, dimension(4) :: displs
if (rank == root) then
global = [ (achar(i+ichar('a')), i=0,8) ]
endif
local = ['-','-','-']
counts = [2,2,2,3]
displs = [0,2,4,6]
mycounts = counts(rank+1)
call MPI_Scatterv(global, counts, displs, & ! proc i gets counts(i) chars from displs(i)
MPI_CHARACTER, &
local, mycounts, MPI_CHARACTER, & ! I get mycounts chars into
root, & ! root rank does sending
MPI_COMM_WORLD, ierr) ! all procs in COMM_WORLD participate
Now the data looks like
task 0: local:"ab-" global: "abcdefghi"
task 1: local:"cd-" global: *garbage*
task 2: local:"ef-" global: *garbage*
task 3: local:"ghi" global: *garbage*
You've now used scatterv to distribute the irregular amounts of data. The displacement in each case is two*rank (measured in characters; the displacement is in unit of the types being sent for a scatter or received for a gather; it's not generally in bytes or something) from the start of the array, and the counts are [2,2,2,3]. If it had been the first processor we wanted to have 3 characters, we would have set counts=[3,2,2,2] and displacements would have been [0,3,5,7]. Gatherv again works exactly the same but reverse; the counts and displs arrays would remain the same.
Now, for 2D, this is a bit trickier. If we want to send 2d sublocks of a 2d array, the data we're sending now no longer is contiguous. If we're sending (say) 3x3 subblocks of a 6x6 array to 4 processors, the data we're sending has holes in it:
2D Array
---------
|000|222|
|000|222|
|000|222|
|---+---|
|111|333|
|111|333|
|111|333|
---------
Actual layout in memory
[000111000111000111222333222333222333]
(Note that all high-performance computing comes down to understanding the layout of data in memory.)
If we want to send the data that is marked "1" to task 1, we need to skip three values, send three values, skip three values, send three values, skip three values, send three values. A second complication is where the subregions stop and start; note that region "1" doesn't start where region "0" stops; after the last element of region "0", the next location in memory is partway-way through region "1".
Let's tackle the first layout problem first - how to pull out just the data we want to send. We could always just copy out all the "0" region data to another, contiguous array, and send that; if we planned it out carefully enough, we could even do that in such a way that we could call MPI_Scatter on the results. But we'd rather not have to transpose our entire main data structure that way.
So far, all the MPI data types we've used are simple ones - MPI_INTEGER specifies (say) 4 bytes in a row. However, MPI lets you create your own data types that describe arbitrarily complex data layouts in memory. And this case -- rectangular subregions of an array -- is common enough that there's a specific call for that. For the 2-dimensional case we're describing above,
integer :: newtype;
integer, dimension(2) :: sizes, subsizes, starts
sizes = [6,6] ! size of global array
subsizes = [3,3] ! size of sub-region
starts = [0,0] ! let's say we're looking at region "0"
! which begins at offset [0,0]
call MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_FORTRAN, MPI_INTEGER, newtype, ierr)
call MPI_Type_commit(newtype, ierr)
This creates a type which picks out just the region "0" from the global array. Note that even in Fortran, the start parameter is given as an offset (eg, 0-based) from the start of the array, not an index (eg, 1-based).
We could send just that piece of data now to another processor
call MPI_Send(global, 1, newtype, dest, tag, MPI_COMM_WORLD, ierr) ! send region "0"
and the receiving process could receive it into a local array. Note that the receiving process, if it's only receiving it into a 3x3 array, can not describe what it's receiving as a type of newtype; that no longer describes the memory layout, because there aren't big skips between the end of one row and the start of the next. Instead, it's just receiving a block of 3*3 = 9 integers:
call MPI_Recv(local, 3*3, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, ierr)
Note that we could do this for other sub-regions, too, either by creating a different type (with different start array) for the other blocks, or just by sending starting from the first location of the particular block:
if (rank == root) then
call MPI_Send(global(4,1), 1, newtype, 1, tag, MPI_COMM_WORLD, ierr)
call MPI_Send(global(1,4), 1, newtype, 2, tag, MPI_COMM_WORLD, ierr)
call MPI_Send(global(4,4), 1, newtype, 3, tag, MPI_COMM_WORLD, ierr)
local = global(1:3, 1:3)
else
call MPI_Recv(local, 3*3, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, rstatus, ierr)
endif
Now that we understand how to specify subregions, there's only one more thing to discuss before using scatter/gather operations, and that's the "size" of these types. We couldn't just use MPI_Scatter() (or even scatterv) with these types yet, because these types have an extent of 15 integers; that is, where they end is 15 integers after they start -- and where they end doesn't line up nicely with where the next block begins, so we can't just use scatter - it would pick the wrong place to start sending data to the next processor.
Of course, we could use MPI_Scatterv() and specify the displacements ourselves, and that's what we'll do - except the displacements are in units of the send-type size, and that doesn't help us either; the blocks start at offsets of (0,3,18,21) integers from the start of the global array, and the fact that a block ends 15 integers from where it starts doesn't let us express those displacements in integer multiples at all.
To deal with this, MPI lets you set the extent of the type for the purposes of these calculations. It doesn't truncate the type; it's just used for figuring out where the next element starts given the last element. For types like these with holes in them, it's frequently handy to set the extent to be something smaller than the distance in memory to the actual end of the type.
We can set the extent to be anything that's convenient to us. We could just make the extent 1 integer, and then set the displacements in units of integers. In this case, though, I like to set the extent to be 3 integers - the size of a sub-column - that way, block "1" starts immediately after block "0", and block "3" starts immediately after block "2". Unfortunately, it doesn't quite work as nicely when jumping from block "2" to block "3", but that can't be helped.
So to scatter the subblocks in this case, we'd do the following:
integer(kind=MPI_ADDRESS_KIND) :: extent
starts = [0,0]
sizes = [6, 6]
subsizes = [3, 3]
call MPI_Type_create_subarray(2, sizes, subsizes, starts, &
MPI_ORDER_FORTRAN, MPI_INTEGER, &
newtype, ierr)
call MPI_Type_size(MPI_INTEGER, intsize, ierr)
extent = 3*intsize
call MPI_Type_create_resized(newtype, 0, extent, resizedtype, ierr)
call MPI_Type_commit(resizedtype, ierr)
Here we've created the same block type as before, but we've resized it; we haven't changed where the type "starts" (the 0) but we've changed where it "ends" (3 integers). We didn't mention this before, but the MPI_Type_commit is required to be able to use the type; but you only need to commit the final type you actually use, not any intermediate steps. You use MPI_Type_free to free the committed type when you're done.
So now, finally, we can scatterv the blocks: the data manipulations above are a little complicated, but once it's done, the scatterv looks just like before:
counts = 1 ! we will send one of these new types to everyone
displs = [0,1,6,7] ! the starting point of everyone's data
! in the global array, in block extents
call MPI_Scatterv(global, counts, displs, & ! proc i gets counts(i) types from displs(i)
resizedtype, &
local, 3*3, MPI_INTEGER, & ! I'm receiving 3*3 int
root, MPI_COMM_WORLD, ierr) !... from (root, MPI_COMM_WORLD)
And now we're done, after a little tour of scatter, gather, and MPI derived types.
An example code which shows both the gather and the scatter operation, with character arrays, follows. Running the program:
$ mpirun -np 4 ./scatter2d
global array is:
000222
000222
000222
111333
111333
111333
Rank 0 received:
000
000
000
Rank 1 received:
111
111
111
Rank 2 received:
222
222
222
Rank 3 received:
333
333
333
Rank 0 sending:
111
111
111
Rank 1 sending:
222
222
222
Rank 2 sending:
333
333
333
Rank 3 sending:
444
444
444
Root received:
111333
111333
111333
222444
222444
222444
and the code follows:
program scatter
use mpi
implicit none
integer, parameter :: gridsize = 6 ! size of array
integer, parameter :: procgridsize = 2 ! size of process grid
character, allocatable, dimension (:,:) :: global, local
integer, dimension(procgridsize**2) :: counts, displs
integer, parameter :: root = 0
integer :: rank, comsize
integer :: localsize
integer :: i, j, row, col, ierr, p, charsize
integer, dimension(2) :: sizes, subsizes, starts
integer :: newtype, resizedtype
integer, parameter :: tag = 1
integer, dimension(MPI_STATUS_SIZE) :: rstatus
integer(kind=MPI_ADDRESS_KIND) :: extent, begin
call MPI_Init(ierr)
call MPI_Comm_size(MPI_COMM_WORLD, comsize, ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
if (comsize /= procgridsize**2) then
if (rank == root) then
print *, 'Only works with np = ', procgridsize**2, ' for now.'
endif
call MPI_Finalize(ierr)
stop
endif
localsize = gridsize/procgridsize
allocate( local(localsize, localsize) )
if (rank == root) then
allocate( global(gridsize, gridsize) )
forall( col=1:procgridsize, row=1:procgridsize )
global((row-1)*localsize+1:row*localsize, &
(col-1)*localsize+1:col*localsize) = &
achar(ichar('0')+(row-1)+(col-1)*procgridsize)
end forall
print *, 'global array is: '
do i=1,gridsize
print *, global(i,:)
enddo
endif
starts = [0,0]
sizes = [gridsize, gridsize]
subsizes = [localsize, localsize]
call MPI_Type_create_subarray(2, sizes, subsizes, starts, &
MPI_ORDER_FORTRAN, MPI_CHARACTER, &
newtype, ierr)
call MPI_Type_size(MPI_CHARACTER, charsize, ierr)
extent = localsize*charsize
begin = 0
call MPI_Type_create_resized(newtype, begin, extent, resizedtype, ierr)
call MPI_Type_commit(resizedtype, ierr)
counts = 1 ! we will send one of these new types to everyone
forall( col=1:procgridsize, row=1:procgridsize )
displs(1+(row-1)+procgridsize*(col-1)) = (row-1) + localsize*procgridsize*(col-1)
endforall
call MPI_Scatterv(global, counts, displs, & ! proc i gets counts(i) types from displs(i)
resizedtype, &
local, localsize**2, MPI_CHARACTER, & ! I'm receiving localsize**2 chars
root, MPI_COMM_WORLD, ierr) !... from (root, MPI_COMM_WORLD)
do p=1, comsize
if (rank == p-1) then
print *, 'Rank ', rank, ' received: '
do i=1, localsize
print *, local(i,:)
enddo
endif
call MPI_Barrier(MPI_COMM_WORLD, ierr)
enddo
local = achar( ichar(local) + 1 )
do p=1, comsize
if (rank == p-1) then
print *, 'Rank ', rank, ' sending: '
do i=1, localsize
print *, local(i,:)
enddo
endif
call MPI_Barrier(MPI_COMM_WORLD, ierr)
enddo
call MPI_Gatherv( local, localsize**2, MPI_CHARACTER, & ! I'm sending localsize**2 chars
global, counts, displs, resizedtype,&
root, MPI_COMM_WORLD, ierr)
if (rank == root) then
print *, ' Root received: '
do i=1,gridsize
print *, global(i,:)
enddo
endif
call MPI_Type_free(newtype,ierr)
if (rank == root) deallocate(global)
deallocate(local)
call MPI_Finalize(ierr)
end program scatter
So that's the general solution. For your particular case, where we are just appending by rows, we don't need a Gatherv, we can just use a gather, because in this case, all of the displacements are the same -- before, in the 2d block case we had one displacement going 'down', and then jumps in that displacement as you went 'across' to the next column of blocks. Here, the displacement is always one extent from the previous one, so we don't need to give displacements explicitly. So a final code looks like:
program testmpi
use mpi
implicit none
integer, dimension(:,:), allocatable :: send, recv
integer, parameter :: nsendrows = 2, nsendcols = 3
integer, parameter :: root = 0
integer :: ierror, my_rank, comsize, i, j, ierr
integer :: blocktype, resizedtype
integer, dimension(2) :: starts, sizes, subsizes
integer (kind=MPI_Address_kind) :: start, extent
integer :: intsize
call MPI_Init(ierror)
call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
call MPI_Comm_size(MPI_COMM_WORLD, comsize, ierror)
allocate( send(nsendrows, nsendcols) )
send = my_rank
if (my_rank==root) then
! we're going to append the local arrays
! as groups of send rows
allocate( recv(nsendrows*comsize, nsendcols) )
endif
! describe what these subblocks look like inside the full concatenated array
sizes = [ nsendrows*comsize, nsendcols ]
subsizes = [ nsendrows, nsendcols ]
starts = [ 0, 0 ]
call MPI_Type_create_subarray( 2, sizes, subsizes, starts, &
MPI_ORDER_FORTRAN, MPI_INTEGER, &
blocktype, ierr)
start = 0
call MPI_Type_size(MPI_INTEGER, intsize, ierr)
extent = intsize * nsendrows
call MPI_Type_create_resized(blocktype, start, extent, resizedtype, ierr)
call MPI_Type_commit(resizedtype, ierr)
call MPI_Gather( send, nsendrows*nsendcols, MPI_INTEGER, & ! everyone send 3*2 ints
recv, 1, resizedtype, & ! root gets 1 resized type from everyone
root, MPI_COMM_WORLD, ierr)
if (my_rank==0) then
print*,'<><><><><>recv'
do i=1,nsendrows*comsize
print*,(recv(i,j),j=1,nsendcols)
enddo
endif
call MPI_Finalize(ierror)
end program testmpi
Running this with 3 processes gives:
$ mpirun -np 3 ./testmpi
<><><><><>recv
0 0 0
0 0 0
1 1 1
1 1 1
2 2 2
2 2 2
Here's another code block for any other struggling Fortran beginners out there like myself. It shows two different ways to achieve a MPI_Gather on a nx * ny array divided into M * N blocks, with one block on each process.
One way uses MPI derived datatypes, the other simply sends 1D raw data and sorts if afterward on the main node.
Both produce an ordered M * N 2D array. For the example of 3 x 2 sub-arrays in x * y, each having 4 x 5 elements:
n mpi ranks: 6
rank 0 has topology coords 0,0
rank 1 has topology coords 0,1
rank 2 has topology coords 1,0
rank 3 has topology coords 1,1
rank 4 has topology coords 2,0
rank 5 has topology coords 2,1
1 1 1 1 3 3 3 3 5 5 5 5
1 1 1 1 3 3 3 3 5 5 5 5
1 1 1 1 3 3 3 3 5 5 5 5
1 1 1 1 3 3 3 3 5 5 5 5
1 1 1 1 3 3 3 3 5 5 5 5
0 0 0 0 2 2 2 2 4 4 4 4
0 0 0 0 2 2 2 2 4 4 4 4
0 0 0 0 2 2 2 2 4 4 4 4
0 0 0 0 2 2 2 2 4 4 4 4
0 0 0 0 2 2 2 2 4 4 4 4
Note that if the raw data is sent and not re-arranged it would be ordered by default like so:
5 5 5 5 5 5 5 5 5 5 5 5
4 4 4 4 5 5 5 5 5 5 5 5
4 4 4 4 4 4 4 4 4 4 4 4
3 3 3 3 3 3 3 3 4 4 4 4
3 3 3 3 3 3 3 3 3 3 3 3
2 2 2 2 2 2 2 2 2 2 2 2
1 1 1 1 2 2 2 2 2 2 2 2
1 1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0
The following code can be compiled with
mpif90 -Wall test.F90 -o test.out
and run with
mpirun -n 6 test.out
program main
use mpi
use, intrinsic :: iso_fortran_env
use iso_c_binding
implicit none
! ===
integer(int64) i, j
integer(int32) nx , ny
integer(int32) nxr , nyr
integer(int32) npx , npy
integer(int32) ri , rj
integer, dimension(2) :: mpi_coords, mpi_coords2
integer rank, n_ranks, ierror
integer rank_cart
integer comm, comm2d
real(real64), dimension(:,:), allocatable :: A, A_global
real(real64), dimension(:), allocatable :: B_global
integer(int32), dimension(:), allocatable :: lengths, displacements
integer(int32) subarraytype, resizedtype
integer(int32) dblsize
integer(kind=MPI_ADDRESS_KIND) :: start, extent
! === MPI interface initialization
call MPI_INIT(ierror)
comm = MPI_COMM_WORLD
call MPI_COMM_SIZE(comm, n_ranks, ierror)
call MPI_COMM_RANK(comm, rank, ierror)
if (rank .eq. 0) then
print '(a, i0)', 'n mpi ranks: ', n_ranks
end if
npx = 3 !! n processes in x
nxr = 4 !! n pts per rank in x
nx = npx*nxr !! n pts x total
npy = 2 !! n processes in y
nyr = 5 !! n pts per rank in y
ny = npy*nyr !! n pts y total
! === check that n_ranks are equal to the hardcoded number of processes in [x,y]
if (n_ranks .ne. npx*npy) then
if (rank .eq. 0) then
print '(a)','n_ranks != npx*npy'
end if
call MPI_Abort(MPI_COMM_WORLD, -1, ierror)
call MPI_Finalize(ierror)
end if
call MPI_BARRIER(comm, ierror)
! === create 2D Cartesian grid
call MPI_Cart_create(comm, 2, (/npx,npy/), (/.true.,.false./), .true., comm2d, ierror)
! === get rank in 2D communicator
call MPI_Comm_rank(comm2d, rank_cart, ierror)
! === get this rank ID and coordinates within the 2D topology
call MPI_Cart_coords(comm2d, rank_cart, 2, mpi_coords, ierror)
! === print topology
if (.true.) then
print '(a,i0,a,i0,a,i0)', 'rank ', rank_cart, ' has topology coords ', mpi_coords(1), ',', mpi_coords(2)
end if
call MPI_BARRIER(comm, ierror)
call flush(6)
! === populate data
allocate( A(nxr,nyr) )
A(:,:) = real(rank,real64)
! ===
if (.true.) then !! use MPI derived types
allocate (lengths(n_ranks))
allocate (displacements(n_ranks))
call MPI_Type_create_subarray(2, (/nx,ny/), (/nxr,nyr/), (/0,0/), &
MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, subarraytype, ierror)
call MPI_Type_size(MPI_DOUBLE_PRECISION, dblsize, ierror)
start = 0
extent = nxr*dblsize
call MPI_Type_create_resized(subarraytype, start, extent, resizedtype, ierror)
call MPI_Type_commit(resizedtype, ierror)
lengths = 1
!! displacements = [ npx*0*nyr+0, npx*1*nyr+0, &
!! npx*0*nyr+1, npx*1*nyr+1, &
!! npx*0*nyr+2, npx*1*nyr+2 ] !! for 3x2
!! displacements = [ npx*0*nyr+0, npx*1*nyr+0, npx*2*nyr+0, &
!! npx*0*nyr+1, npx*1*nyr+1, npx*2*nyr+1 ] !! for 2x3
do i=0,n_ranks-1
call MPI_Cart_coords(comm2d, int(i,int32), 2, mpi_coords2, ierror)
ri = mpi_coords2(1)
rj = mpi_coords2(2)
displacements(i+1) = npx*rj*nyr + ri
end do
if (rank .eq. 0) then
allocate( A_global(nx,ny) )
end if
call MPI_Gatherv( A , nxr*nyr, MPI_DOUBLE_PRECISION, &
A_global, lengths, displacements, resizedtype, &
0, comm2d, ierror )
else !! send raw data in 1D, then re-arrange
if (rank .eq. 0) then
allocate( A_global(nx,ny) )
allocate( B_global(nx*ny) )
end if
call MPI_Gather( A , nxr*nyr, MPI_DOUBLE_PRECISION, &
B_global, nxr*nyr, MPI_DOUBLE_PRECISION, &
0, comm2d, ierror )
if (rank .eq. 0) then
if (.true.) then !! re-arrange data
do i=0,n_ranks-1
call MPI_Cart_coords(comm2d, int(i,int32), 2, mpi_coords2, ierror)
ri = mpi_coords2(1)
rj = mpi_coords2(2)
A_global( (ri+0)*nxr+1:(ri+1)*nxr+1 , &
(rj+0)*nyr+1:(rj+1)*nyr+1 ) &
= &
reshape( B_global( (i+0)*nxr*nyr+1 : (i+1)*nxr*nyr+1 ) , (/nxr,nyr/) )
end do
else !! dont do re-arrange
A_global(:,:) = reshape( B_global , (/nx,ny/) )
end if
end if
if (rank .eq. 0) then
deallocate( B_global )
end if
end if
! ===
!! print the array to the terminal
if (rank .eq. 0) then
do j=ny, 1, -1
do i=1, nx
write(*,'(i0,a)',advance='no') int(A_global(i,j)) , " "
end do
write (*,*) ''
end do
end if
!! save data to binary file
if (rank==0) then
open(3, file=trim("A.dat"), access="stream")
write(3) reshape( A_global , (/nx,ny/) )
close(3)
end if
if (rank .eq. 0) then
deallocate( A_global )
end if
! ===
call MPI_BARRIER(comm, ierror)
call MPI_FINALIZE(ierror)
end program main
A final note: pulling all data to one process (for I/O purposes or otherwise) is typically very bad practice for parallel programming. The moment this is done, the code will lose potential scalability. The method shown here is really only fit for smaller codes or for debugging. For I/O purposes, one should seriously look into collective MPI-I/O or libraries such as HDF5 which allow for collective I/O.
I'm writing some code which does calculations on a large 3D grid, and uses a halo exchange procedure so that it can work using MPI. I'm getting the wrong results from my code, and I'm pretty sure it's because of the halo exchange not working properly.
Basically I have a large 3D array, a chunk of which is held on each process. Each process has an array which is 2 elements bigger in each dimension than the chunk of data it is holding - so that we can halo exchange into each face of the array without affecting the data stored in the rest of the array. I have the following code to do the halo exchange communication:
MPI_Type_vector(g->ny, g->nx, g->nx, MPI_DOUBLE, &face1);
MPI_Type_commit(&face1);
MPI_Type_vector(2*g->ny, 1, g->nx, MPI_DOUBLE, &face2);
MPI_Type_commit(&face2);
MPI_Type_vector(g->nz, g->nx, g->nx * g->ny, MPI_DOUBLE, &face3);
MPI_Type_commit(&face3);
/* Send to WEST receive from EAST */
MPI_Sendrecv(&(g->data)[current][0][0][0], 1, face1, g->west, tag,
&(g->data)[current][0][0][0], 1, face1, g->east, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Send to EAST receive from WEST */
MPI_Sendrecv(&(g->data)[current][g->nz-1][0][0], 1, face1, g->east, tag,
&(g->data)[current][g->nz-1][0][0], 1, face1, g->west, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Send to NORTH receive from SOUTH */
MPI_Sendrecv(&(g->data)[current][0][0][0], 1, face2, g->north, tag,
&(g->data)[current][0][0][0], 1, face2, g->south, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Send to SOUTH receive from NORTH */
MPI_Sendrecv(&(g->data)[current][0][g->ny-1][0], 1, face2, g->south, tag,
&(g->data)[current][0][0][0], 1, face2, g->north, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Send to UP receive from DOWN */
MPI_Sendrecv(&(g->data)[current][0][0][0], 1, face3, g->up, tag,
&(g->data)[current][0][0][0], 1, face3, g->down, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Send to DOWN receive from UP */
MPI_Sendrecv(&(g->data)[current][0][0][g->nx-1], 1, face3, g->down, tag,
&(g->data)[current][0][0][g->nx-1], 1, face3, g->up, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
g->nx, g->ny and g->nz are the sizes of the array chunk that this process is holding, and g->west, g->east, g->north, g->south, g->up and g->down are the ranks of the adjacent processes in each direction, found using the following code:
/* Who are my neighbours in each direction? */
MPI_Cart_shift( cart_comm, 2, 1, &g->north, &g->south);
MPI_Cart_shift( cart_comm, 1, 1, &g->west, &g->east);
MPI_Cart_shift( cart_comm, 0, 1, &g->up, &g->down);
The array on each process is defined as:
array[2][g->nz][g->ny][g->nx]
(It has two copies because I need one to update into each time through my update routine, once I've done the halo exchange).
Can anyone tell me if I'm doing the communication correctly? Particularly the defining of the vector types. Will the vector types I've defined in the code extract each face of a 3D array? And do the MPI_Sendrecv calls look right?
I'm completely lost as to why my code isn't working, but I'm pretty sure it's communications related.
So I'm a big fan of using MPI_Type_create_subarray for pulling out slices of arrays; it's easier to keep straight than vector types. In general, you can't use a single vector type to describe multi-d guardcells (because there are multiple strides, you need to create vectors of vectors), but I think because you're only using 1 guardcell in each direction here that you're ok.
So let's consider the x-face GC; here you're sending an entire y-z plane to your x-neighbour. In memory, this looks like this given your array layout:
+---------+
| #|
| #|
| #|
| #| z=2
| #|
+---------+
| #|
| #|
| #| z=1
| #|
| #|
+---------+
| #|
^| #|
|| #| z=0
y| #|
| #|
+---------+
x->
so you're looking to send count=(ny*nz) blocks of 1 value, each strided by nx. I'm assuming here nx, ny, and nz include guardcells, and that you're sending the corner values. If you're not sending corner values, subarray is the way to go. I'm also assuming, crucially, that g->data is a contiguous block of nx*ny*nz*2 (or 2 contiguous blocks of nx*ny*nz) doubles, otherwise all is lost.
So your type create should look like
MPI_Type_vector((g->ny*g->nz), 1, g->nx, MPI_DOUBLE, &face1);
MPI_Type_commit(&face1);
Note that we are sending a total of count*blocksize = ny*nz values, which is right, and we are striding over count*stride = nx*ny*nz memory in the process, which is also right.
Ok, so the y face looks like this:
+---------+
|#########|
| |
| |
| | z=2
| |
+---------+
|#########|
| |
| | z=1
| |
| |
+---------+
|#########|
^| |
|| | z=0
y| |
| |
+---------+
x->
So you have nz blocks of nx values, each separated by stride nx*ny. So your type create should look like
MPI_Type_vector(g->nz, g->nx, (g->nx)*(g->ny), MPI_DOUBLE, &face2);
MPI_Type_commit(&face2);
And again double-checking, you're sending count*blocksize = nz*nx values, striding count*stride = nx*ny*nz memory. Check.
Finally, sending z-face data involves sending an entire x-y plane:
+---------+
|#########|
|#########|
|#########| z=2
|#########|
|#########|
+---------+
| |
| |
| | z=1
| |
| |
+---------+
| |
^| |
|| | z=0
y| |
| |
+---------+
x->
MPI_Type_vector(1, (g->nx)*(g->ny), 1, MPI_DOUBLE, &face3);
MPI_Type_commit(&face3);
And again double-checking, you're sending count*blocksize = nx*ny values, striding count*stride = nx*ny memory. Check.
Update:
I didn't take a look at your Sendrecvs, but there might be something there, too. Notice that you have to use a pointer to the first piece of data you're sending with an vector data type.
First off, if you have array size nx in the x direction, and you have two guardcells (one on either side), your left guardcell is 0, right is nx-1, and your 'real' data extends from 1..nx-2. So to send your westmost data to your west neighbour, and to receive into your eastmost guardcell from your east neighbour, you would want
/* Send to WEST receive from EAST */
MPI_Sendrecv(&(g->data)[current][0][0][g->nx-2], 1, face1, g->west, westtag,
&(g->data)[current][0][0][0], 1, face1, g->east, westtag,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Send to EAST receive from WEST */
MPI_Sendrecv(&(g->data)[current][0][0][1], 1, face1, g->east, easttag,
&(g->data)[current][0][0][g->nx-1], 1, face1, g->west, easttag,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
(I like to use different tags for each stage of communication, helps keep things sorted.)
Likewise for the other directions.