MPI_Allreduce mixes elements in the sum - arrays

I am parallelising a Fortran code which works without problems in a non-MPI version. Below is an excerpt of the code.
Every processor does the following:
For a certain number of particles, it evolves certain quantities in the loop "do 203"; over a given interval divided into Nint subintervals (j=1,Nint), every processor produces an element of the vectors Nx1(j), Nx2(j).
Then the vectors Nx1(j), Nx2(j) are sent to the root (mype = 0), which in every subinterval (j=1,Nint) sums the contributions from all processors: Nx1(j) from processor 1 + Nx1(j) from processor 2, and so on. The root sums over every value of j (every subinterval) and produces Nx5(j), Nx6(j).
Another issue is that if I deallocate the variables, the code hangs after the end of the calculation without completing the execution; but I don't know whether this is related to the MPI_Allreduce issue.
include "mpif.h"
...
integer*4 ....
...
real*8
...
call MPI_INIT(mpierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, npe, mpierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, mype, mpierr)
! Allocate variables
allocate(Nx1(Nint),Nx5(Nint))
...
! Parameters
...
call MPI_Barrier (MPI_COMM_WORLD, mpierr)
! Loop on particles
do 100 npartj=1,npart_local
call init_random_seed()
call random_number (rand)
...
! Initial condition
...
do 203 i=1,1000000 ! loop for time evolution of single particle
if(ufinp.gt.p1.and.ufinp.le.p2)then
do j=1,Nint ! spatial position at any momentum
ls(j) = lb+(j-1)*Delta/Nint !Left side of sub-interval across shock
rs(j) = ls(j)+Delta/Nint
if(y(1).gt.ls(j).and.y(1).lt.rs(j))then !position-ordered
Nx1(j)=Nx1(j)+1
endif
enddo
endif
if(ufinp.gt.p2.and.ufinp.le.p3)then
do j=1,Nint ! spatial position at any momentum
ls(j) = lb+(j-1)*Delta/Nint !Left side of sub-interval across shock
rs(j) = ls(j)+Delta/Nint
if(y(1).gt.ls(j).and.y(1).lt.rs(j))then !position-ordered
Nx2(j)=Nx2(j)+1
endif
enddo
endif
203 continue
100 continue
call MPI_Barrier (MPI_COMM_WORLD, mpierr)
print*,"To be summed"
do j=1,Nint
call MPI_ALLREDUCE (Nx1(j),Nx5(j),npe,mpi_integer,mpi_sum,
& MPI_COMM_WORLD, mpierr)
call MPI_ALLREDUCE (Nx2(j),Nx6(j),npe,mpi_integer,mpi_sum,
& MPI_COMM_WORLD, mpierr)
enddo
if(mype.eq.0)then
do j=1,Nint
write(1,107)ls(j),Nx5(j),Nx6(j)
enddo
107 format(3(F13.2,2X,i6,2X,i6))
endif
call MPI_Barrier (MPI_COMM_WORLD, mpierr)
print*,"Now deallocate"
! deallocate(Nx1) !inserting the de-allocate
! deallocate(Nx2)
close(1)
call MPI_Finalize(mpierr)
end
! Subroutines
...

Then the vectors Nx1(j), Nx2(j) are sent to the root (mype = 0), which in every subinterval (j=1,Nint) sums the contributions from all processors: Nx1(j) from processor 1 + Nx1(j) from processor 2, and so on. The root sums over every value of j (every subinterval) and produces Nx5(j), Nx6(j).
This is not what an allreduce does. A reduction means the summation is performed in parallel across all processes; an allreduce means that all processes receive the result of that summation.
Your MPI_Allreduces:
call MPI_ALLREDUCE (Nx1(j),Nx5(j),npe,mpi_integer,mpi_sum, &
& MPI_COMM_WORLD, mpierr)
call MPI_ALLREDUCE (Nx2(j),Nx6(j),npe,mpi_integer,mpi_sum, &
& MPI_COMM_WORLD, mpierr)
actually look like the count should be 1 here, because count states how many elements each process contributes (and receives) per call, i.e. the length of the buffer being reduced element-wise, not how many contributions there are in total.
However, you do not need that loop at all, because an allreduce can handle multiple elements at once. So instead of the loop around your allreduces, I believe you actually want something like:
integer :: Nx1(nint)
integer :: Nx2(nint)
integer :: Nx5(nint)
integer :: Nx6(nint)
call MPI_ALLREDUCE (Nx1, Nx5, nint, mpi_integer, mpi_sum, &
& MPI_COMM_WORLD, mpierr)
call MPI_ALLREDUCE (Nx2, Nx6, nint, mpi_integer, mpi_sum, &
& MPI_COMM_WORLD, mpierr)
Nx5 will contain the element-wise sum of Nx1 across all processes, and Nx6 the sum of Nx2.
The information in your question is a bit sparse, so I am not quite sure whether this is what you are looking for.

Related

How do I declare an integer of kind MPI_OFFSET_KIND in Fortran, as required by MPI_File_set_view, to prevent overflow when writing large data?

Intro:
I am trying to write a large set of data to a single file using MPI IO, using the code below.
The problem I encounter is an integer overflow (in the variable disp), so the MPI IO does not work properly.
The reason for this is, I think, the declaration of the variable disp (integer (kind=MPI_OFFSET_KIND) :: disp = 0) in the subroutine write_to_file(...), since disp overflows for the process with the highest rank.
Question:
Can I somehow define disp with kind=MPI_OFFSET_KIND but a larger range? I did not find a solution for that, other than writing to multiple files, but I would prefer writing into a single file.
Some context:
The code is just part of a code which I use to output (and read; but I cut that part from the code example to make it easier) scalar (ivars = 1), vector (ivars = 3) or tensor (ivars = 3, 6 or 9) values into binary files. The size of the 3D grid is defined by imax, jmax and kmax, where kmax is decomposed by Px processes into Mk. Lately the 3D grid grew to a size at which I encountered the described problem.
Code Example: MPI_IO_LargeFile.f90
"""
PROGRAM MPI_IO_LargeFile
use MPI
implicit none
integer rank, ierr, Px
integer i, j, k, cnt
integer imax, jmax, kmax, Mk
integer status(MPI_STATUS_SIZE)
integer ivars
real*4, dimension(:,:,:,:), allocatable :: outarr, dataarr
call MPI_Init(ierr)
call MPI_Comm_size(MPI_COMM_WORLD, Px, ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
imax = 256
jmax = 512
kmax = 1024
Mk = kmax / Px
if (rank < 1) print *, 'Preparing dataarr'
ivars = 6
allocate(dataarr(ivars,imax,jmax,Mk))
call RANDOM_NUMBER(dataarr)
! Output of Small File
if (rank < 1) print *, 'Output of SmallFile.txt'
ivars = 3
allocate(outarr(ivars,imax,jmax,Mk))
outarr(:,:,:,:) = dataarr(1:3,:,:,:)
call write_to_file(rank, 'SmallFile.txt', outarr)
deallocate(outarr)
! Output of Large File
if (rank < 1) print *, 'Output of LargeFile.txt'
ivars = 6
allocate(outarr(ivars,imax,jmax,Mk))
outarr(:,:,:,:) = dataarr(1:6,:,:,:)
call write_to_file(rank, 'LargeFile.txt', outarr)
deallocate(outarr)
deallocate(dataarr)
call MPI_Finalize(ierr)
CONTAINS
subroutine write_to_file(myrank, filename, arr)
implicit none
integer, intent(in) :: myrank
integer :: ierr, file, varsize
character(len=*), intent(in):: filename
real*4, dimension(:,:,:,:), allocatable, intent(inout) :: arr
integer (kind=MPI_OFFSET_KIND) :: disp = 0
varsize = size(arr)
disp = myrank * varsize * 4
write(*,*) rank, varsize, disp
call MPI_File_open(MPI_COMM_WORLD, filename, &
& MPI_MODE_WRONLY + MPI_MODE_CREATE, &
& MPI_INFO_NULL, file, ierr )
call MPI_File_set_view(file, disp, MPI_REAL4, &
& MPI_REAL4, 'native', MPI_INFO_NULL, ierr)
call MPI_File_write(file, arr, varsize, &
& MPI_REAL4, MPI_STATUS_IGNORE, ierr)
call MPI_FILE_CLOSE(file, ierr)
end subroutine write_to_file
END PROGRAM MPI_IO_LargeFile
"""
Output of Code: MPI_IO_LargeFile.f90
mpif90 MPI_IO_LargeFile.f90 -o MPI_IO_LargeFile
mpirun -np 4 MPI_IO_LargeFile
Preparing dataarr
Output of SmallFile.txt
2 100663296 805306368
1 100663296 402653184
3 100663296 1207959552
0 100663296 0
Output of LargeFile.txt
1 201326592 805306368
0 201326592 0
2 201326592 1610612736
3 201326592 -1879048192
mca_fbtl_posix_pwritev: error in writev:Invalid argument
mca_fbtl_posix_pwritev: error in writev:Invalid argument
The problem is that the multiplication in
disp = myrank * varsize * 4
overflowed, since every variable involved is a default integer.
One solution, provided by @Gilles in the comments of the question, was simply to change this line to
disp = myrank * size(arr, kind=MPI_OFFSET_KIND) * 4
size(arr, kind=MPI_OFFSET_KIND) returns the array size as an integer of kind MPI_OFFSET_KIND, so the whole product is evaluated in that wider kind and no longer overflows.
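An equivalent way to avoid the overflow (a sketch along the same lines, not part of the quoted fix, reusing myrank and varsize from the subroutine above) is to promote every factor before multiplying:
! widen each factor so no intermediate product is evaluated in default-integer arithmetic
disp = int(myrank, MPI_OFFSET_KIND) * int(varsize, MPI_OFFSET_KIND) * 4_MPI_OFFSET_KIND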
Thank you for your help.

MPI: writing vectors of unequal size to a file

I have a small doubt regarding file writing in MPI. Let's say I have N processes working on a program. At the end of the program, each process has m particles (positions + velocities), but the number of particles m differs for each process. How would I write all the particle info (pos + vel) to a single file? From searching, I understood that I can do this with MPI_File_open, MPI_File_set_view and MPI_File_write_all, but it seemed I would need the same number of particles in each process. Any ideas how I could do it in my case?
You don't need the same number of particles on each processor. What you do need is for every processor to participate. One or more could very well have zero particles, even.
Allgather is a fine way to do it, and the single integer exchanged among all processes is not such a large overhead.
However, a better way is to use MPI_SCAN:
incr = numparts;
MPI_Scan(&incr, &new_offset, 1, MPI_LONG_LONG_INT,
MPI_SUM, MPI_COMM_WORLD);
new_offset -= incr; /* or skip this with MPI_EXSCAN, but \
then rank 0 has an undefined result */
MPI_File_write_at_all(fh, new_offset, buf, count, datatype, status);
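In Fortran (the language of the question), a sketch of the same scan-then-write idea might look like the following. This is a standalone toy example, not code from the question: the file name, the use of six real*8 values per particle (position + velocity), and the per-rank counts are all assumptions.
program write_unequal
use mpi
implicit none
integer :: rank, nprocs, numparts, scanned, ierr, fh
integer(kind=MPI_OFFSET_KIND) :: offset
real*8, allocatable :: particles(:,:)   ! (6, numparts): position + velocity per particle
call MPI_Init(ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
numparts = 10 + rank                    ! a different particle count on every rank
allocate(particles(6, numparts))
particles = dble(rank)
! Inclusive prefix sum of the counts, then subtract this rank's own count
! to get the number of particles written by the lower ranks.
call MPI_Scan(numparts, scanned, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr)
offset = int(scanned - numparts, MPI_OFFSET_KIND) * 6 * 8   ! byte offset (6 real*8 per particle)
call MPI_File_open(MPI_COMM_WORLD, 'particles.dat', &
     MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)
call MPI_File_write_at_all(fh, offset, particles, 6*numparts, &
     MPI_DOUBLE_PRECISION, MPI_STATUS_IGNORE, ierr)
call MPI_File_close(fh, ierr)
deallocate(particles)
call MPI_Finalize(ierr)
end program write_unequal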
You need to perform
call MPI_Allgather(np, 1, MPI_INTEGER, procnp, 1, &
     MPI_INTEGER, MPI_COMM_WORLD, ierr)
where np is the number of particles per process and procnp is an array of size nprocs (the number of processes). This gives every process an array with the number of particles on all the other processes. That way MPI_File_set_view can be chosen correctly for each process by calculating the offset based on the process id. The code to get the offset is something like:
procdisp = 0
! Obtain the displacement of this process using all the other processes' np
do i = 1, irank-1
   procdisp = procdisp + procnp(i)*datasize
enddo
This was taken from Fortran code, so irank runs from 1 to nprocs.
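Continuing that snippet, the displacement then feeds into the file view before the collective write. A rough sketch only: fh, datasize, the particle buffer and the six doubles per particle are assumptions, and note that the displacement passed to MPI_File_set_view must be of kind MPI_OFFSET_KIND.
integer(kind=MPI_OFFSET_KIND) :: procdisp
! ... procdisp accumulated as above, in bytes ...
call MPI_File_set_view(fh, procdisp, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION, &
                       'native', MPI_INFO_NULL, ierr)
call MPI_File_write_all(fh, particles, np*6, MPI_DOUBLE_PRECISION, &
                        MPI_STATUS_IGNORE, ierr)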

MPI Fortran synchronizing data

I have a program that I am parallelizing with MPI. I have attached the test code, which breaks an array into multiple segments across a given number of processes and then uses a broadcast to combine the results from all of the processes and distribute them to all of the processes. The problem occurs when the distribution is not the same across all of the processes, which causes the broadcast to fail. The following code works with 6 processes but not with 8.
program main
include 'mpif.h'
integer i, rank, nprocs, count, start, stop, nloops
integer err, n1
REAL srt, sp
REAL b(1,1:10242)
REAL c(1,1:10242)
integer, Allocatable :: jjsta(:), jjlen(:)
allocate (jjsta(0:nprocs-1),jjlen(0:nprocs-1))
call MPI_Init(ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
n1 = 10242
call para_range(1,n1,nprocs,rank,start,stop)
jjsta(rank) = start
jjlen(rank) = stop -start +1
srt = MPI_WTIME()
do i=start,stop
c(1,i) = i
enddo
sp =MPI_WTIME()
print *,"Process ",rank," performed ", jjlen(rank), &
" iterations of the loop."
! CALL MPI_ALLGATHERV(c(1,jjsta(rank)), jjlen(rank), MPI_REAL,&
! b,jjlen(rank),jjsta(rank)+jjlen(rank)+1,MPI_REAL,MPI_COMM_WORLD,ierr)
CALL MPI_BCAST(c(1,jjsta(rank)), jjlen(rank), MPI_REAL,&
rank, MPI_COMM_WORLD, ierr)
call MPI_Finalize(ierr)
deallocate(jjsta,jjlen)
end program main
subroutine para_range(n1, n2, nprocs, irank, ista, iend)
integer(4) :: n1 ! Lowest value of iteration variable
integer(4) :: n2 ! Highest value of iteration variable
integer(4) :: nprocs ! # cores
integer(4) :: irank ! Iproc (rank)
integer(4) :: ista ! Start of iterations for rank iproc
integer(4) :: iend ! End of iterations for rank iproc
iwork1 = ( n2 -n1 + 1 ) / nprocs
! print *, iwork1
iwork2 = MOD(n2 -n1 + 1, nprocs)
! print *, iwork2
ista = irank* iwork1 + n1 + MIN(irank, iwork2)
iend = ista + iwork1 -1
if (iwork2 > irank) iend = iend + 1
return
end subroutine para_range
When I run it with 8 processes with the MPI_Bcast commented out, it works and this is what it prints:
Process 2 performed 1280 iterations of the loop.
Process 1 performed 1281 iterations of the loop.
Process 4 performed 1280 iterations of the loop.
Process 5 performed 1280 iterations of the loop.
Process 3 performed 1280 iterations of the loop.
Process 6 performed 1280 iterations of the loop.
Process 0 performed 1281 iterations of the loop.
Process 7 performed 1280 iterations of the loop.
But when I try to combine all of the results with the broadcast, this is the error I get:
Fatal error in MPI_Bcast:
Message truncated, error stack:
MPI_Bcast(1128)...................: MPI_Bcast(buf=0x2af182d8d028, count=1280, MPI_REAL, root=4, MPI_COMM_WORLD) failed
knomial_2level_Bcast(1268)........:
MPIDI_CH3U_Receive_data_found(259): Message from rank 0 and tag 2 truncated; 5124 bytes received but buffer size is 5120
Process 4 performed 1280 iterations of the loop.
I also tried MPI_Allgatherv to combine the results in array b, and I get the following error:
Fatal error in MPI_Allgatherv:
Message truncated, error stack:
MPI_Allgatherv(1050)................: MPI_Allgatherv(sbuf=0x635df8, scount=1280, MPI_REAL, rbuf=0x62bde8, rcounts=0x2b3ff297b244, displs=0x7fff3ebc1fb0, MPI_REAL, MPI_COMM_WORLD) failed
MPIR_Allgatherv(243)................:
MPIC_Sendrecv(117)..................:
MPIDI_CH3U_Request_unpack_uebuf(631): Message truncated; 5124 bytes received but buffer size is 5120
The problem when combining the arrays seems to me to be that the length of the array being sent is not updated. I can't figure out how to solve the problem; any help would be appreciated. Thank you.
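Not an accepted answer, but a sketch of one way the counts could be made consistent before the collective, reusing the variables from the code above. The assumption is that the goal is to gather every rank's segment of c into b on all ranks; MPI_Allgatherv needs per-rank counts and 0-based displacements that are identical on every rank.
integer, allocatable :: displs(:)
integer :: mystart, mylen
allocate(displs(0:nprocs-1))
mystart = start
mylen = stop - start + 1
! every rank learns every other rank's start index and segment length
call MPI_Allgather(mystart, 1, MPI_INTEGER, jjsta, 1, MPI_INTEGER, MPI_COMM_WORLD, ierr)
call MPI_Allgather(mylen, 1, MPI_INTEGER, jjlen, 1, MPI_INTEGER, MPI_COMM_WORLD, ierr)
displs = jjsta - 1                       ! 0-based offsets into b(1,1:10242)
! counts and displacements now agree on all ranks, so the gather cannot truncate
call MPI_Allgatherv(c(1,mystart), mylen, MPI_REAL, &
                    b, jjlen, displs, MPI_REAL, MPI_COMM_WORLD, ierr)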

MPI collective output 5 noncontiguous 3D arrays in special form

As part of my course work I have to write an MPI program to solve a PDE in continuum mechanics (Fortran).
In the sequential program, the file is written as follows:
do i=1,XX
do j=1,YY
do k=1,ZZ
write(ifile) R(i,j,k)
write(ifile) U(i,j,k)
write(ifile) V(i,j,k)
write(ifile) W(i,j,k)
write(ifile) P(i,j,k)
end do
end do
end do
In the parallel program, I write the same as follows:
! parallelization takes place only along the X axis
call MPI_TYPE_CREATE_SUBARRAY(4, [INT(5), INT(ZZ),INT(YY), INT(XX)], [5,ZZ,YY,PDB(iam).Xelements], [0, 0, 0, PDB(iam).Xoffset], MPI_ORDER_FORTRAN, MPI_FLOAT, slice, ierr)
call MPI_TYPE_COMMIT(slice, ierr)
call MPI_FILE_OPEN(MPI_COMM_WORLD, cFileName, IOR(MPI_MODE_CREATE, MPI_MODE_WRONLY), MPI_INFO_NULL, ifile, ierr)
do i = 1,PDB(iam).Xelements
do j = 1,YY
do k = 1,ZZ
dataTmp(1,k,j,i) = R(i,j,k)
dataTmp(2,k,j,i) = U(i,j,k)
dataTmp(3,k,j,i) = V(i,j,k)
dataTmp(4,k,j,i) = W(i,j,k)
dataTmp(5,k,j,i) = P(i,j,k)
end do
end do
end do
call MPI_FILE_SET_VIEW(ifile, offset, MPI_FLOAT, slice, 'native', MPI_INFO_NULL, ierr)
call MPI_FILE_WRITE_ALL(ifile, dataTmp, 5*PDB(iam).Xelements*YY*ZZ, MPI_FLOAT, wstatus, ierr)
call MPI_BARRIER(MPI_COMM_WORLD, ierr)
It works well. But I'm not sure about using the array dataTmp. Which solution will be faster and more correct? What about using a 4D array like dataTmp throughout the whole program? Or maybe I should create 5 special MPI types with different displacements?
Using dataTmp is fine, if you have the memory space. Your MPI_FILE_WRITE_ALL call will be the most expensive part of this code.
You've done the hard part, setting an MPI-IO file view. If you want to get rid of dataTmp, you could create an MPI datatype to describe the arrays (probably using MPI_Type_hindexed and MPI_Get_address), then use MPI_BOTTOM as the memory buffer.
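For illustration, a sketch of that MPI_BOTTOM approach might look like the following. It reuses the names from the question's code (R, U, V, W, P, PDB(iam).Xelements, ifile, slice, offset, wstatus), uses the newer name MPI_Type_create_hindexed, and is an assumption about how it could be wired up, not the answer's actual implementation: one block of length 1 is created per value, in exactly the order the file view expects.
integer :: memtype, n
integer, allocatable :: blocklens(:)
integer(kind=MPI_ADDRESS_KIND), allocatable :: addrs(:)
n = 5*ZZ*YY*PDB(iam).Xelements
allocate(blocklens(n), addrs(n))
blocklens = 1                            ! one value per block
n = 0
do i = 1, PDB(iam).Xelements
  do j = 1, YY
    do k = 1, ZZ
      ! absolute addresses, stored in the same R,U,V,W,P order the file expects
      call MPI_Get_address(R(i,j,k), addrs(n+1), ierr)
      call MPI_Get_address(U(i,j,k), addrs(n+2), ierr)
      call MPI_Get_address(V(i,j,k), addrs(n+3), ierr)
      call MPI_Get_address(W(i,j,k), addrs(n+4), ierr)
      call MPI_Get_address(P(i,j,k), addrs(n+5), ierr)
      n = n + 5
    end do
  end do
end do
call MPI_Type_create_hindexed(n, blocklens, addrs, MPI_FLOAT, memtype, ierr)
call MPI_Type_commit(memtype, ierr)
call MPI_FILE_SET_VIEW(ifile, offset, MPI_FLOAT, slice, 'native', MPI_INFO_NULL, ierr)
! With MPI_BOTTOM the absolute addresses inside memtype locate the data,
! so no dataTmp copy is needed.
call MPI_FILE_WRITE_ALL(ifile, MPI_BOTTOM, 1, memtype, wstatus, ierr)
Whether this beats the explicit dataTmp copy depends on how well the MPI library handles such a fine-grained datatype, which ties into the memory/performance balance discussed below.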
If I/O speed is an issue and you have the option, I'd suggest changing the file format - or alternatively, how the data is laid out in memory - to be more closely lined up: in the serial code, writing data in this transposed and interleaved way is going to be very slow:
program testoutput
implicit none
integer, parameter :: XX=512, YY=512, ZZ=512
real, dimension(:,:,:), allocatable :: R, U, V, W, P
integer :: timer
integer :: ifile
real :: elapsed
integer :: i,j,k
allocate(R(XX,YY,ZZ), P(XX,YY,ZZ))
allocate(U(XX,YY,ZZ), V(XX,YY,ZZ), W(XX,YY,ZZ))
R = 1.; U = 2.; V = 3.; W = 4.; P = 5.
open(newunit=ifile, file='interleaved.dat', form='unformatted', status='new')
call tick(timer)
do i=1,XX
do j=1,YY
do k=1,ZZ
write(ifile) R(i,j,k)
write(ifile) U(i,j,k)
write(ifile) V(i,j,k)
write(ifile) W(i,j,k)
write(ifile) P(i,j,k)
end do
end do
end do
elapsed=tock(timer)
close(ifile)
print *,'Elapsed time for interleaved: ', elapsed
open(newunit=ifile, file='noninterleaved.dat', form='unformatted',status='new')
call tick(timer)
write(ifile) R
write(ifile) U
write(ifile) V
write(ifile) W
write(ifile) P
elapsed=tock(timer)
close(ifile)
print *,'Elapsed time for noninterleaved: ', elapsed
deallocate(R,U,V,W,P)
contains
subroutine tick(t)
integer, intent(OUT) :: t
call system_clock(t)
end subroutine tick
! returns time in seconds from now to time described by t
real function tock(t)
integer, intent(in) :: t
integer :: now, clock_rate
call system_clock(now,clock_rate)
tock = real(now - t)/real(clock_rate)
end function tock
end program testoutput
Running gives
$ gfortran -Wall io-serial.f90 -o io-serial
$ ./io-serial
Elapsed time for interleaved: 225.755005
Elapsed time for noninterleaved: 4.01700020
As Rob Latham, who knows more than a few things about this stuff, says, your transposition for the parallel version is fine - it does the interleaving and transposing explicitly in memory, where it's much faster, and then blasts it out to disk. It's about as fast as the IO is going to get.
You can definitely avoid the dataTmp array by writing one or five individual data types to do the transposition/interleaving for you on the way out to disk via the MPI_File_write_all routine. That will give you a bit more of a balance between memory usage and performance. You won't be explicitly defining a big 3-D array, but the MPI-IO code will improve performance over looping over individual elements by doing a fair bit of buffering, meaning that a certain amount of memory is being set aside to do the writing efficiently. The good news is that the balance will be tunable by setting MPI-IO hints in the Info variable; the bad news is that the code is likely to be less clear than what you have now.
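For completeness, hints are attached through an MPI info object when the file is opened; a minimal sketch, reusing cFileName and ifile from the question's code. The hint names cb_buffer_size and romio_cb_write are common ROMIO hints, and whether they have any effect depends on the MPI implementation and the filesystem.
integer :: info
call MPI_Info_create(info, ierr)
call MPI_Info_set(info, 'cb_buffer_size', '16777216', ierr)   ! collective-buffering buffer size, in bytes
call MPI_Info_set(info, 'romio_cb_write', 'enable', ierr)     ! force collective buffering on writes
call MPI_FILE_OPEN(MPI_COMM_WORLD, cFileName, IOR(MPI_MODE_CREATE, MPI_MODE_WRONLY), &
                   info, ifile, ierr)
call MPI_Info_free(info, ierr)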

Writing to files with MPI

I'm writing to a file as follows. The order does not necessarily matter (though it would be nice if I could get it ordered by K, as it would be inherently in the serial code).
CALL MPI_BARRIER(MPI_COMM_WORLD, IERR)
OPEN(EIGENVALUES_UP_IO, FILE=EIGENVALUES_UP_PATH, ACCESS='APPEND')
WRITE(EIGENVALUES_UP_IO, *) K * 0.0001_DP * PI, (EIGENVALUES(J), J = 1, ATOM_COUNT)
CLOSE(EIGENVALUES_UP_IO)
I'm aware this is likely to be the worst option.
I've taken a look at MPI_FILE_WRITE_AT etc. but I'm not sure they (directly) take data in the form that I have?
The file must be in the same format as this, which comes out as one line per K, with ATOM_COUNT + 1 columns. The values are REAL(8).
I've hunted over and over, and can't find any simple references on achieving this. Any help? :)
Similar code in C (assuming it's basically the same as FORTRAN) is just as useful
Thanks!
So determining the right IO strategy depends on a lot of factors. If you are just sending back a handful of eigenvalues, and you're stuck writing out ASCII, you might be best off just sending all the data back to process 0 to write. This is not normally a winning strategy, as it obviously doesn't scale; but if the amount of data is very small, it could well be better than the contention involved in trying to write out to a shared file (which is, again, harder with ASCII).
Some code is below which will schlep the amount of data back to proc 0, assuming everyone has the same amount of data.
Another approach would just be to have everyone write out their own ks and eigenvalues, and then as a postprocessing step once the program is finished, cat them all together. That avoids the MPI step, and (with the right filesystem) can scale up quite a ways, and is easy; whether that's better is fairly easily testable, and will depend on the amount of data, number of processors, and underlying file system.
program testio
use mpi
implicit none
integer, parameter :: atom_count = 5
integer, parameter :: kpertask = 2
integer, parameter :: fileunit = 7
integer, parameter :: io_master = 0
double precision, parameter :: pi = 3.14159
integer :: totalk
integer :: ierr
integer :: rank, nprocs
integer :: handle
integer(kind=MPI_OFFSET_KIND) :: offset
integer :: filetype
integer :: j,k
double precision, dimension(atom_count, kpertask) :: eigenvalues
double precision, dimension(kpertask) :: ks
double precision, allocatable, dimension(:,:):: alleigenvals
double precision, allocatable, dimension(:) :: allks
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
totalk = nprocs*kpertask
!! setup test data
do k=1,kpertask
ks(k) = (rank*kpertask+k)*1.d-4*PI
do j=1,atom_count
eigenvalues(j,k) = rank*100+j
enddo
enddo
!! Everyone sends proc 0 their data
if (rank == 0) then
allocate(allks(totalk))
allocate(alleigenvals(atom_count, totalk))
endif
call MPI_GATHER(ks, kpertask, MPI_DOUBLE_PRECISION, &
allks, kpertask, MPI_DOUBLE_PRECISION, &
io_master, MPI_COMM_WORLD, ierr)
call MPI_GATHER(eigenvalues, kpertask*atom_count, MPI_DOUBLE_PRECISION, &
alleigenvals, kpertask*atom_count, MPI_DOUBLE_PRECISION, &
io_master, MPI_COMM_WORLD, ierr)
if (rank == 0) then
open(unit=fileunit, file='output.txt')
do k=1,totalk
WRITE(fileunit, *) allks(k), (alleigenvals(j,k), j = 1, atom_count)
enddo
close(unit=fileunit)
deallocate(allks)
deallocate(alleigenvals)
endif
call MPI_FINALIZE(ierr)
end program testio
If you can determine how long each rank's write will be, you can call MPI_SCAN(size, offset, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD) to compute the offset that each rank should start at, and then they can all call MPI_FILE_WRITE_AT. This is probably more suitable if you have a lot of data, and you are confident that your MPI implementation does the write efficiently (doesn't serialize internally, or the like).
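A sketch of that scan-and-write-at approach, reusing names from the question (K, EIGENVALUES, ATOM_COUNT, DP, PI, EIGENVALUES_UP_PATH). The explicit format, the fixed-size LINE buffer, and writing one line per rank per call are assumptions made for brevity; with several K values per rank you would scan the rank's total byte count instead.
CHARACTER(LEN=32*(ATOM_COUNT+1)) :: LINE
INTEGER :: MYLEN, SCANNED, IERR, FH
INTEGER(KIND=MPI_OFFSET_KIND) :: MYOFF
! Format the line into a character buffer first, so its byte length is known
WRITE(LINE, '(*(ES23.15, 1X))') K * 0.0001_DP * PI, (EIGENVALUES(J), J = 1, ATOM_COUNT)
MYLEN = LEN_TRIM(LINE) + 1
LINE(MYLEN:MYLEN) = NEW_LINE('A')
! Inclusive prefix sum of the byte counts; subtract our own count to get our start offset
CALL MPI_SCAN(MYLEN, SCANNED, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, IERR)
MYOFF = SCANNED - MYLEN
CALL MPI_FILE_OPEN(MPI_COMM_WORLD, EIGENVALUES_UP_PATH, &
                   MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, FH, IERR)
CALL MPI_FILE_WRITE_AT(FH, MYOFF, LINE, MYLEN, MPI_CHARACTER, MPI_STATUS_IGNORE, IERR)
CALL MPI_FILE_CLOSE(FH, IERR)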
