OpenMP reduction of large arrays in Fortran

I know that similar questions have been asked before: Openmp array reductions with Fortran, Reducing on array in OpenMP, even in the Intel forums (https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/345415), but I would like to know your opinion, because the scaling I get is not what I expect.
So I need to fill a really large array of complex numbers, which I would like to parallelize with OpenMP. Our first approach is this one:
COMPLEX(KIND=DBL), ALLOCATABLE :: huge_array(:)
COMPLEX(KIND=DBL), ALLOCATABLE :: thread_huge_array(:)
INTEGER :: huge_number, index1, index2, index3, index4, index5, bignumber1, bignumber2, smallnumber, depending_index
ALLOCATE(huge_array(huge_number))
!$OMP PARALLEL FIRSTPRIVATE(thread_huge_array) PRIVATE(depending_index)
ALLOCATE(thread_huge_array(SIZE(huge_array)))
thread_huge_array = ZERO
!$OMP DO
DO index1=1,bignumber1
! Some calculations
DO index2=1,bignumber2
! Some calculations
DO index3=1,6
DO index4=1,6
DO index5=1,smallnumber
depending_index = function(index1, index2, index3, index4, index5)
thread_huge_array(depending_index) = thread_huge_array(depending_index) ! plus the computed contribution (omitted in this excerpt)
ENDDO
ENDDO
ENDDO
ENDDO
ENDDO
!$OMP END DO
!$OMP BARRIER
!$OMP MASTER
huge_array = ZERO
!$OMP END MASTER
!$OMP BARRIER ! make sure huge_array is zeroed before any thread accumulates into it
!$OMP CRITICAL
huge_array = huge_array + thread_huge_array
!$OMP END CRITICAL
DEALLOCATE(thread_huge_array)
!$OMP END PARALLEL
So, with that approach, we get good scalability up to 8 cores, reasonable scalability up to 32 cores, and from 40 cores on it is slower than with 16 cores (we have a machine with 80 physical cores). Of course, we cannot use the REDUCTION clause because the array is so big that it doesn't fit on the stack (even after increasing ulimit to the maximum allowed on the machine).
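For reference, the reduction variant we ruled out would look roughly like this; it is the per-thread private copy of huge_array that overflows the stack:
!$OMP PARALLEL DO REDUCTION(+:huge_array) PRIVATE(depending_index)
DO index1=1,bignumber1
! same nested loops and calculations as above, accumulating directly into huge_array
ENDDO
!$OMP END PARALLEL DO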
We have tried a different approach, shown here:
COMPLEX(KIND=DBL), ALLOCATABLE :: huge_array(:)
COMPLEX(KIND=DBL), POINTER :: thread_huge_array(:,:)
INTEGER :: huge_number
ALLOCATE(huge_array(huge_number))
ALLOCATE(thread_huge_array(SIZE(huge_array),omp_get_max_threads()))
thread_huge_array = ZERO
!$OMP PARALLEL PRIVATE(num_thread, depending_index)
num_thread = omp_get_thread_num()+1
!$OMP DO
DO index1=1,bignumber1
! Some calculations
DO index2=1,bignumber2
! Some calculations
DO index3=1,6
DO index4=1,num_weights_sp
DO index5=1,smallnumber
depending_index = function(index1, index2, index3, index4, index5)
thread_huge_array(depending_index, num_thread) = thread_huge_array(depending_index, num_thread) ! plus the computed contribution (omitted in this excerpt)
ENDDO
ENDDO
ENDDO
ENDDO
ENDDO
!$OMP END DO
!$OMP END PARALLEL
huge_array = ZERO
DO index_ii = 1,omp_get_max_threads()
huge_array = huge_array + thread_huge_array(:,index_ii)
ENDDO
DEALLOCATE(thread_huge_array)
DEALLOCATE(huge_array)
And in this last case we obtain longer times for the method (because the memory allocation is much bigger) and a worse relative speedup.
Can you provide some hints to achieve a better speedup? Or is it impossible to handle such huge arrays with OpenMP?

How to split a file with multiple columns into multiple files with two columns each using Fortran 90

I have a physics simulation program that generates a file with six columns: one for the time and the other five for physical properties. I need to write a Fortran 90 program that reads this file and generates five files with two columns each, one for the time and another for a physical property.
I have used F90 before, but I only know how to create files and write to them; I have no idea how to read data back from a file and generate new files from it.
I don't expect to have the problem solved, I just want to know where to find information. Any advice will be useful.
I don't know a priori how many rows the program will generate.
Here is an example which has not been tested...
It is a bit of a kindergarten approach, but it may be helpful. It is often better to keep the variables as separate arrays, as that vectorises better thanks to the contiguous memory layout. One could also read the values straight into the 6 separate arrays and avoid the 6xN array altogether.
PROGRAM ABC
IMPLICIT NONE
REAL, DIMENSION(:,:), ALLOCATABLE :: My_File_Data
REAL, DIMENSION(:), ALLOCATABLE :: My_Data1
REAL, DIMENSION(:), ALLOCATABLE :: My_Data2
REAL, DIMENSION(:), ALLOCATABLE :: My_Data3
REAL, DIMENSION(:), ALLOCATABLE :: My_Data4
REAL, DIMENSION(:), ALLOCATABLE :: My_Data5
REAL, DIMENSION(:), ALLOCATABLE :: My_Data6
INTEGER :: Index, LUN, I, IO_Status
OPEN(NEWUNIT=LUN, FILE='abc.dat')
! First pass: count the rows.
Index = 0
FirstPass: DO WHILE(.TRUE.)
READ(UNIT=LUN, FMT=*, IOSTAT=IO_Status)
IF(IO_Status /= 0) EXIT FirstPass
Index = Index + 1
ENDDO FirstPass
REWIND(LUN)
ALLOCATE(My_File_Data(6,Index))
ALLOCATE(My_Data1(Index))
ALLOCATE(My_Data2(Index))
ALLOCATE(My_Data3(Index))
ALLOCATE(My_Data4(Index))
ALLOCATE(My_Data5(Index))
ALLOCATE(My_Data6(Index))
! Second pass: read the values into the 6xN array.
SecondPass: DO I = 1, Index
READ(UNIT=LUN, FMT=*) My_File_Data(:,I)
ENDDO SecondPass
DO I = 1, Index
My_Data1(I) = My_File_Data(1,I)
ENDDO
! What follows is more elegant...
My_Data2(:) = My_File_Data(2,:) !Where the left-hand (:) is redundant... It seems more readable, but there are some reasons not to use it... (LTR)
My_Data3 = My_File_Data(3,:)
My_Data4 = My_File_Data(4,:)
My_Data5 = My_File_Data(5,:)
My_Data6 = My_File_Data(6,:)
DEALLOCATE(My_File_Data)
!Etc
END PROGRAM ABC
The first step is to read in the data. In the following instructions, we will first loop over the file and count the number of rows, nrows. This value will be used to allocate a data array to the necessary size. We then return to the beginning of the file and read in our data in a second loop.
Declare an integer variable to act as a file handle/reference.
Declare an allocatable array of reals (floats) to hold the data.
Loop over the file to count the number of lines in the file. Remove header lines from the count.
Allocate the data array to the proper size, (nrows,nvalues).
Return to the beginning of the file. Repeat the loop over each of the rows, reading all values from the row into your data array.
Close the file.
The next step is to create 5 new files, each containing the time and one of the 5 property measurements:
Loop over each of the 5 properties contained in data.
For each jth property, open a new file.
Loop over the data array, writing the time and jth property to a new line.
Close the file.
Here is working code you can use or modify to suit your needs:
program SO
implicit none
integer :: i, j, nrows, nvalues, funit, ios
real, allocatable, dimension(:,:) :: data
character(len=10), dimension(5) :: outfiles
!! begin
nvalues = 5
nrows = 0
open(newunit=funit, file='example.txt', status='old', iostat=ios)
if (ios /= 0) then
print *, 'File could not be opened.'
call exit
else
do
read(funit,*,iostat=ios)
if (ios == 0) then
nrows = nrows + 1
elseif (ios < 0) then !! End of file (EOF).
exit !! The 'exit' stmt breaks out of the loop.
else !! Error if > 0.
print *, 'Read error at line ', nrows + 1
call exit() !! The 'exit' intrinsic ends the program.
endif !! We may pass an optional exit code.
enddo
endif
nrows = nrows - 1 !! 'nrows-1': Remove column headers from count.
if (allocated(data)) deallocate(data) !! This test follows standard "best practices".
allocate(data(nrows,nvalues+1))
rewind(funit)
read(funit, *) !! Skip column headers.
do i = 1,nrows
read(funit, *) data(i,:) !! Read values into array.
enddo
close(funit)
!! Output file names.
outfiles = ['prop1.txt', 'prop2.txt', 'prop3.txt', 'prop4.txt', 'prop5.txt']
do j = 1,nvalues
open(newunit=funit, file=outfiles(j), status='replace', iostat=ios)
if (ios /= 0) then
print *, 'Could not open output file: ',outfiles(j)
call exit()
endif
write(funit,"(a)") "time "//outfiles(j)(1:5)
do i = 1,nrows
write(funit,"(f0.0,t14,es14.6)") data(i,1), data(i,j+1)
enddo
close(funit)
enddo
end program SO
All the other answers want to read in everything at once. I think that's too much of a bother.
Firstly, I'd check if I even needed Fortran for that. The Linux command cut can be used very effectively here. For example, if your data is comma separated, you could simply use
for i in {1..5}; do
cut -d, -f1,$((i+1)) data.txt > data${i}.txt;
done
to do the whole thing.
If you need Fortran, here's how I'd go about it:
Open all files
In a permanent loop, read in the whole row at once.
If you encounter an error, it's probably EOF, so exit the loop
Write the data to the output files.
Here's some basic code:
program split
implicit none
integer :: t, d(5), u_in, u_out(5)
integer :: i
integer :: ios
open(newunit=u_in, file='data.txt', status="old", action="read")
open(newunit=u_out(1), file='temperature.txt', status='unknown', action='write')
open(newunit=u_out(2), file='pressure.txt', status='unknown', action='write')
open(newunit=u_out(3), file='pair_energy.txt', status='unknown', action='write')
open(newunit=u_out(4), file='ewald_energy.txt', status='unknown', action='write')
open(newunit=u_out(5), file='pppm_energy.txt', status='unknown', action='write')
read(u_in, *) ! omit the column names
write(u_out(1), *) "Time Temperature"
write(u_out(2), *) "Time Pressure"
write(u_out(3), *) "Time Pair Energy"
write(u_out(4), *) "Time Ewald Energy"
write(u_out(5), *) "Time PPPM Energy"
do
read(u_in, *, iostat=ios) t, d
if (ios /= 0) exit
do i = 1, 5
write(u_out(i), *) t, d(i)
end do
end do
close(u_in)
do i = 1, 5
close(u_out(i))
end do
end program split
Cheers

openmp fortran reduction and critical not working for array

I am currently trying to get a Fortran FE (finite element) code to work with OpenMP. I have a loop over all elements, ie, that I want to run in parallel. Here is a simplified part of the code that is not working:
!$omp parallel do default(none) shared(nelm,A,res,enod) private(ie,Fe,B,edof)
do ie=1,nelm
call calcB(B,A(:,ie))
call calcFe(Fe,B)
write(*,*) Fe !writes Fe=40d0, this is correct
call getEdof(edof,enod(:,ie))
!$OMP CRITICAL
res(edof)=res(edof)+fe
!$OMP END CRITICAL
enddo
!$omp end parallel do
The purpose of the code is to calculate a force Fe and then add it to res at edof. The force is calculated by calcFe, and the calculated force is correct, but the resulting res is incorrect after the loop.
If I replace calcFe with simply Fe=40d0 and then add it to res, the result is correct after the loop:
!$omp parallel do default(none) shared(nelm,A,res,enod) private(ie,Fe,B,edof)
do ie=1,nelm
call calcB(B,A(:,ie))
Fe=40d0
call getEdof(edof,enod(:,ie))
!$OMP CRITICAL
res(edof)=res(edof)+fe
!$OMP END CRITICAL
enddo
!$omp end parallel do
What causes this error? In both cases Fe=40d0 is declared private, but only one of them gives the correct result. Instead of using !$OMP CRITICAL I could use reduction, but it gives the same error. In the program several large and sparse matrices are also used, but they are passive / not used during the loop. My supervisor has had problems with OpenMP and sparse matrices before and suspects that they are using the same memory. If the error is not apparent, what debugger is best to use? I'm a novice to Fortran, OpenMP and programming in general.
I'm using ifort to compile and my OS is Ubuntu.
EDIT: Added simplified code that you can run, although this code works correctly.
In the code there are two loops, one parallel and one serial, so they should give the same result, res and res2.
program main
use omp_lib
implicit none
integer :: ie, nelm,enod(4,50*50),edof(12),i,j,k
double precision ::B(12,8),fe(12),A(12,12,2500),res(2601*3),res2(2601*3),finish,start
!creates enod
i=1
do j=1,50
ie=j
do k=1,50
nelm=k
enod(:,i)=(/ 51*(nelm-1)+1+ie-1, 51*(nelm-1)+1+ie, 51*(nelm)+1+ie-1, 51*(nelm)+1+ie /)
i=i+1
end do
end do
A=1d0
res2=0d0
nelm=2500
start=omp_get_wtime()
!$omp parallel do default(none) shared(nelm,A,enod) private(ie,fe,edof,B) reduction(+:res2)
do ie=1,nelm
call calcB(B,A(:,:,ie))
call calcFe(fe,B) !the calculated fe is always 2304
!can write fe=2304 to get correct result with real code
call getEdof(edof,enod(:,ie))
res2(edof)=res2(edof)+fe
end do
!$omp end parallel do
finish=omp_get_wtime()
write(*,*) 'time: ', finish-start
res=0d0
nelm=2500
start=omp_get_wtime()
do ie=1,nelm
call calcB(B,A(:,:,ie))
call calcFe(fe,B)
call getEdof(edof,enod(:,ie))
res(edof)=res(edof)+fe
end do
finish=omp_get_wtime()
write(*,*) 'time: ', finish-start
write(*,*) 'difference: ',sum(res2-res)
write(*,*) sum(res2)
stop
end program main
subroutine calcB(B,A)
double precision ::B(12,8),A(12,12),C(12)
integer ::gp
C=1d0
do gp=1,8
B(:,gp)=matmul(A,C)
end do
end subroutine calcB
subroutine calcFe(fe,B)
double precision ::fe(12),B(12,8),D(12,12)
integer ::gp
fe=0d0
D=2d0
do gp=1,8
fe=fe+matmul(D,B(:,gp))
end do
end subroutine calcFe
subroutine getEdof(edof,enod)
implicit none
integer,intent(in) :: enod(4)
integer,intent(out):: edof(12)
edof=0
edof(1:3) =(/ enod(1)*3-2, enod(1)*3-1, enod(1)*3 /)
edof(4:6) =(/ enod(2)*3-2, enod(2)*3-1, enod(2)*3 /)
edof(7:9) =(/ enod(3)*3-2, enod(3)*3-1, enod(3)*3 /)
edof(10:12)=(/ enod(4)*3-2, enod(4)*3-1, enod(4)*3 /)
end subroutine getedof
And the make file
FF = ifort -O3 -openmp
OBJ1 = main.f90
ls: $(FORT_OBJS)
$(FF) -o exec $(OBJ1)
Unfortunately this piece of code works, so I'm unable to replicate the error. res2 and res are calculated in parallel and in serial. In my real program I have set all values to 1d0 in order to get a constant fe. The calculated fe is correct; if I add a write(*,*) fe after calcFe I see that the values are correct. I then add these values to res2 and compare them with the serial res. They differ by a large margin, so it is not a numerical round-off error. If I simply set fe=2304 in my main program I get the correct answer, even though fe is already 2304 when write is used.
In my real program all the subroutines are in different modules; do I need to take any special care because of this?
Also, in one of the modules some global variables are used. They are read-only, but since they are not declared in the subroutine they are not automatically made private? This should be no issue, since I set all variables used to calculate fe to a constant; the global variables are not used directly to calculate fe.
Solved it: it started working when I added -openmp to the makefile for my modules. Apparently the modules need to be compiled with -openmp, not just the main file.

MPI_Allreduce mixes elements in the sum

I am parallelising a Fortran code which works with no problem in the non-MPI version. Below is an excerpt of the code.
Every processor does the following:
For a certain number of particles it evolves certain quantities in the loop "do 203"; in a given interval, which is divided into Nint subintervals (j=1,Nint), every processor produces an element of the vectors Nx1(j), Nx2(j).
Then, the vectors Nx1(j), Nx2(j) are sent to the root (mype =0) which in every subinterval (j=1,Nint) sums all the contributions for every processor: Nx1(j) from processor 1 + Nx1(j) from processor 2.... The root sums for every value of j (every subinterval), and produces Nx5(j), Nx6(j).
Another issue is that if I deallocate the variables, the code hangs after the end of the calculation without completing the execution; but I don't know if this is related to the MPI_Allreduce issue.
include "mpif.h"
...
integer*4 ....
...
real*8
...
call MPI_INIT(mpierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, npe, mpierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, mype, mpierr)
! Allocate variables
allocate(Nx1(Nint),Nx5(Nint))
...
! Parameters
...
call MPI_Barrier (MPI_COMM_WORLD, mpierr)
! Loop on particles
do 100 npartj=1,npart_local
call init_random_seed()
call random_number (rand)
...
! Initial condition
...
do 203 i=1,1000000 ! loop for time evolution of single particle
if(ufinp.gt.p1.and.ufinp.le.p2)then
do j=1,Nint ! spatial position at any momentum
ls(j) = lb+(j-1)*Delta/Nint !Left side of sub-interval across shock
rs(j) = ls(j)+Delta/Nint
if(y(1).gt.ls(j).and.y(1).lt.rs(j))then !position-ordered
Nx1(j)=Nx1(j)+1
endif
enddo
endif
if(ufinp.gt.p2.and.ufinp.le.p3)then
do j=1,Nint ! spatial position at any momentum
ls(j) = lb+(j-1)*Delta/Nint !Left side of sub-interval across shock
rs(j) = ls(j)+Delta/Nint
if(y(1).gt.ls(j).and.y(1).lt.rs(j))then !position-ordered
Nx2(j)=Nx2(j)+1
endif
enddo
endif
203 continue
100 continue
call MPI_Barrier (MPI_COMM_WORLD, mpierr)
print*,"To be summed"
do j=1,Nint
call MPI_ALLREDUCE (Nx1(j),Nx5(j),npe,mpi_integer,mpi_sum,
& MPI_COMM_WORLD, mpierr)
call MPI_ALLREDUCE (Nx2(j),Nx6(j),npe,mpi_integer,mpi_sum,
& MPI_COMM_WORLD, mpierr)
enddo
if(mype.eq.0)then
do j=1,Nint
write(1,107)ls(j),Nx5(j),Nx6(j)
enddo
107 format(3(F13.2,2X,i6,2X,i6))
endif
call MPI_Barrier (MPI_COMM_WORLD, mpierr)
print*,"Now deallocate"
! deallocate(Nx1) !inserting the de-allocate
! deallocate(Nx2)
close(1)
call MPI_Finalize(mpierr)
end
! Subroutines
...
Then, the vectors Nx1(j), Nx2(j) are sent to the root (mype =0) which in every subinterval (j=1,Nint) sums all the contributions for every processor: Nx1(j) from processor 1 + Nx1(j) from processor 2.... The root sums for every value of j (every subinterval), and produces Nx5(j), Nx6(j).
This is not what an allreduce does. A reduction means that the summation is done in parallel across all processes. An allreduce means that all processes will get the result of the summation.
Your MPI_Allreduces:
call MPI_ALLREDUCE (Nx1(j),Nx5(j),npe,mpi_integer,mpi_sum, &
& MPI_COMM_WORLD, mpierr)
call MPI_ALLREDUCE (Nx2(j),Nx6(j),npe,mpi_integer,mpi_sum, &
& MPI_COMM_WORLD, mpierr)
Actually, it looks like the count should be 1 here, because count states how many elements each process contributes from its buffer, not how many there will be in total.
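A minimal sketch of that per-element fix, keeping your loop and variable names:
do j=1,Nint
call MPI_ALLREDUCE (Nx1(j), Nx5(j), 1, mpi_integer, mpi_sum, &
& MPI_COMM_WORLD, mpierr)
call MPI_ALLREDUCE (Nx2(j), Nx6(j), 1, mpi_integer, mpi_sum, &
& MPI_COMM_WORLD, mpierr)
enddo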
However, you actually do not need that loop at all, because allreduce is luckily capable of handling multiple elements at once. So, instead of the loop around your allreduces, I believe you actually want something like:
integer :: Nx1(nint)
integer :: Nx2(nint)
integer :: Nx5(nint)
integer :: Nx6(nint)
call MPI_ALLREDUCE (Nx1, Nx5, nint, mpi_integer, mpi_sum, &
& MPI_COMM_WORLD, mpierr)
call MPI_ALLREDUCE (Nx2, Nx6, nint, mpi_integer, mpi_sum, &
& MPI_COMM_WORLD, mpierr)
Nx5 will contain the sum of Nx1 across all processes, and Nx6 the sum of Nx2.
The information in your question is a little bit lacking, so I am not quite sure, if this is what you are looking for.

MPI collective output 5 noncontiguous 3D arrays in special form

As part of my coursework I have to write an MPI program (in Fortran) to solve a PDE from continuum mechanics.
In the sequential program, the file is written as follows:
do i=1,XX
do j=1,YY
do k=1,ZZ
write(ifile) R(i,j,k)
write(ifile) U(i,j,k)
write(ifile) V(i,j,k)
write(ifile) W(i,j,k)
write(ifile) P(i,j,k)
end do
end do
end do
In the parallel program, I write the same as follows:
! Parallelization takes place only along the X axis
call MPI_TYPE_CREATE_SUBARRAY(4, [INT(5), INT(ZZ),INT(YY), INT(XX)], [5,ZZ,YY,PDB(iam).Xelements], [0, 0, 0, PDB(iam).Xoffset], MPI_ORDER_FORTRAN, MPI_FLOAT, slice, ierr)
call MPI_TYPE_COMMIT(slice, ierr)
call MPI_FILE_OPEN(MPI_COMM_WORLD, cFileName, IOR(MPI_MODE_CREATE, MPI_MODE_WRONLY), MPI_INFO_NULL, ifile, ierr)
do i = 1,PDB(iam).Xelements
do j = 1,YY
do k = 1,ZZ
dataTmp(1,k,j,i) = R(i,j,k)
dataTmp(2,k,j,i) = U(i,j,k)
dataTmp(3,k,j,i) = V(i,j,k)
dataTmp(4,k,j,i) = W(i,j,k)
dataTmp(5,k,j,i) = P(i,j,k)
end do
end do
end do
call MPI_FILE_SET_VIEW(ifile, offset, MPI_FLOAT, slice, 'native', MPI_INFO_NULL, ierr)
call MPI_FILE_WRITE_ALL(ifile, dataTmp, 5*PDB(iam).Xelements*YY*ZZ, MPI_FLOAT, wstatus, ierr)
call MPI_BARRIER(MPI_COMM_WORLD, ierr)
It works well. But I'm not sure about the use of the dataTmp array. Which solution would be faster and more correct? What about using a 4D array like dataTmp throughout the whole program? Or, maybe, should I create 5 special MPI types with different displacements?
Using dataTmp is fine, if you have the memory space. Your MPI_FILE_WRITE_ALL call will be the most expensive part of this code.
You've done the hard part, setting an MPI-IO file view. If you want to get rid of dataTmp, you could create an MPI datatype to describe the arrays (probably using MPI_Type_create_hindexed and MPI_Get_address), then use MPI_BOTTOM as the memory buffer.
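Here is an untested sketch of that idea, reusing the names from your code (nlocal stands in for PDB(iam).Xelements; memtype, disp, blocklen and idx are new names introduced here). The flat displacement list itself costs memory, so in practice you would build it from nested types, but it shows the mechanism:
integer :: memtype, idx, ierr
integer, allocatable :: blocklen(:)
integer(kind=MPI_ADDRESS_KIND), allocatable :: disp(:)
allocate(disp(5*ZZ*YY*nlocal), blocklen(5*ZZ*YY*nlocal))
blocklen = 1
idx = 0
do i = 1, nlocal
do j = 1, YY
do k = 1, ZZ
! record the absolute address of every element in file order: R,U,V,W,P interleaved
call MPI_Get_address(R(i,j,k), disp(idx+1), ierr)
call MPI_Get_address(U(i,j,k), disp(idx+2), ierr)
call MPI_Get_address(V(i,j,k), disp(idx+3), ierr)
call MPI_Get_address(W(i,j,k), disp(idx+4), ierr)
call MPI_Get_address(P(i,j,k), disp(idx+5), ierr)
idx = idx + 5
end do
end do
end do
call MPI_Type_create_hindexed(size(disp), blocklen, disp, MPI_FLOAT, memtype, ierr)
call MPI_Type_commit(memtype, ierr)
! with the file view already set as in your code:
call MPI_FILE_WRITE_ALL(ifile, MPI_BOTTOM, 1, memtype, wstatus, ierr)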
If I/O speed is an issue and you have the option, I'd suggest changing the file format - or, alternatively, how the data is laid out in memory - so that the two line up more closely: in the serial code, writing data in this transposed and interleaved way is going to be very slow:
program testoutput
implicit none
integer, parameter :: XX=512, YY=512, ZZ=512
real, dimension(:,:,:), allocatable :: R, U, V, W, P
integer :: timer
integer :: ifile
real :: elapsed
integer :: i,j,k
allocate(R(XX,YY,ZZ), P(XX,YY,ZZ))
allocate(U(XX,YY,ZZ), V(XX,YY,ZZ), W(XX,YY,ZZ))
R = 1.; U = 2.; V = 3.; W = 4.; P = 5.
open(newunit=ifile, file='interleaved.dat', form='unformatted', status='new')
call tick(timer)
do i=1,XX
do j=1,YY
do k=1,ZZ
write(ifile) R(i,j,k)
write(ifile) U(i,j,k)
write(ifile) V(i,j,k)
write(ifile) W(i,j,k)
write(ifile) P(i,j,k)
end do
end do
end do
elapsed=tock(timer)
close(ifile)
print *,'Elapsed time for interleaved: ', elapsed
open(newunit=ifile, file='noninterleaved.dat', form='unformatted',status='new')
call tick(timer)
write(ifile) R
write(ifile) U
write(ifile) V
write(ifile) W
write(ifile) P
elapsed=tock(timer)
close(ifile)
print *,'Elapsed time for noninterleaved: ', elapsed
deallocate(R,U,V,W,P)
contains
subroutine tick(t)
integer, intent(OUT) :: t
call system_clock(t)
end subroutine tick
! returns time in seconds from now to time described by t
real function tock(t)
integer, intent(in) :: t
integer :: now, clock_rate
call system_clock(now,clock_rate)
tock = real(now - t)/real(clock_rate)
end function tock
end program testoutput
Running gives
$ gfortran -Wall io-serial.f90 -o io-serial
$ ./io-serial
Elapsed time for interleaved: 225.755005
Elapsed time for noninterleaved: 4.01700020
As Rob Latham, who knows more than a few things about this stuff, says, your transposition for the parallel version is fine - it does the interleaving and transposing explicitly in memory, where it's much faster, and then blasts it out to disk. It's about as fast as the IO is going to get.
You can definitely avoid the dataTmp array by writing one or five individual datatypes to do the transposition/interleaving for you on the way out to disk via the MPI_File_write_all routine. That will give you a bit more of a balance between memory usage and performance. You won't be explicitly defining a big 3-D array, but the MPI-IO code will still improve performance over looping over individual elements by doing a fair bit of buffering, meaning that a certain amount of memory is set aside to do the writing efficiently. The good news is that the balance is tunable by setting MPI-IO hints in the Info variable; the bad news is that the code is likely to be less clear than what you have now.
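For instance, hints can be passed through an Info object at file-open time. The hint names below are common ROMIO hints; whether they help, and which values make sense, depends on your MPI library and file system:
integer :: info, ierr
call MPI_Info_create(info, ierr)
call MPI_Info_set(info, 'cb_buffer_size', '16777216', ierr) ! collective buffering size, in bytes
call MPI_Info_set(info, 'cb_nodes', '4', ierr)              ! number of I/O aggregators
call MPI_FILE_OPEN(MPI_COMM_WORLD, cFileName, IOR(MPI_MODE_CREATE, MPI_MODE_WRONLY), info, ifile, ierr)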

Writing to files with MPI

I'm writing to a file as follows. The order does not necessarily matter (though it would be nice if I could get it ordered by K, as it would be inherently in serial code):
CALL MPI_BARRIER(MPI_COMM_WORLD, IERR)
OPEN(EIGENVALUES_UP_IO, FILE=EIGENVALUES_UP_PATH, ACCESS='APPEND')
WRITE(EIGENVALUES_UP_IO, *) K * 0.0001_DP * PI, (EIGENVALUES(J), J = 1, ATOM_COUNT)
CLOSE(EIGENVALUES_UP_IO)
I'm aware this is likely to be the worst option.
I've taken a look at MPI_FILE_WRITE_AT etc. but I'm not sure they (directly) take data in the form that I have?
The file must be in the same format as this, which comes out as a line per K, with ATOM_COUNT + 1 columns. The values are REAL(8)
I've hunted over and over, and can't find any simple references on achieving this. Any help? :)
Similar code in C (assuming it's basically the same as FORTRAN) is just as useful
Thanks!
So determining the right IO strategy depends on a lot of factors. If you are just sending back a handful of eigenvalues, and you're stuck writing out ASCII, you might be best off just sending all the data back to process 0 to write. This is not normally a winning strategy, as it obviously doesn't scale; but if the amount of data is very small, it could well be better than the contention involved in trying to write out to a shared file (which is, again, harder with ASCII).
Some code is below which will schlep the data back to proc 0, assuming everyone has the same amount of data.
Another approach would just be to have everyone write out their own ks and eigenvalues, and then as a postprocessing step once the program is finished, cat them all together. That avoids the MPI step, and (with the right filesystem) can scale up quite a ways, and is easy; whether that's better is fairly easily testable, and will depend on the amount of data, number of processors, and underlying file system.
program testio
use mpi
implicit none
integer, parameter :: atom_count = 5
integer, parameter :: kpertask = 2
integer, parameter :: fileunit = 7
integer, parameter :: io_master = 0
double precision, parameter :: pi = 3.14159
integer :: totalk
integer :: ierr
integer :: rank, nprocs
integer :: handle
integer(kind=MPI_OFFSET_KIND) :: offset
integer :: filetype
integer :: j,k
double precision, dimension(atom_count, kpertask) :: eigenvalues
double precision, dimension(kpertask) :: ks
double precision, allocatable, dimension(:,:):: alleigenvals
double precision, allocatable, dimension(:) :: allks
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
totalk = nprocs*kpertask
!! setup test data
do k=1,kpertask
ks(k) = (rank*kpertask+k)*1.d-4*PI
do j=1,atom_count
eigenvalues(j,k) = rank*100+j
enddo
enddo
!! Everyone sends proc 0 their data
if (rank == 0) then
allocate(allks(totalk))
allocate(alleigenvals(atom_count, totalk))
endif
call MPI_GATHER(ks, kpertask, MPI_DOUBLE_PRECISION, &
allks, kpertask, MPI_DOUBLE_PRECISION, &
io_master, MPI_COMM_WORLD, ierr)
call MPI_GATHER(eigenvalues, kpertask*atom_count, MPI_DOUBLE_PRECISION, &
alleigenvals, kpertask*atom_count, MPI_DOUBLE_PRECISION, &
io_master, MPI_COMM_WORLD, ierr)
if (rank == 0) then
open(unit=fileunit, file='output.txt')
do k=1,totalk
WRITE(fileunit, *) allks(k), (alleigenvals(j,k), j = 1, atom_count)
enddo
close(unit=fileunit)
deallocate(allks)
deallocate(alleigenvals)
endif
call MPI_FINALIZE(ierr)
end program testio
If you can determine how long each rank's write will be, you can call MPI_SCAN(size, offset, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD) to compute the offset that each rank should start at, and then they can all call MPI_FILE_WRITE_AT. This is probably more suitable if you have a lot of data, and you are confident that your MPI implementation does the write efficiently (doesn't serialize internally, or the like).
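A rough, untested sketch of that approach (fh stands for a file you have already opened with MPI_FILE_OPEN, record and its length 256 are placeholders, and the other names come from your write statement):
character(len=256) :: record        ! this rank's formatted output (one line here as a placeholder)
integer :: mylen, runningtotal, ierr
integer(kind=MPI_OFFSET_KIND) :: myoffset
write(record, *) K * 0.0001_DP * PI, (EIGENVALUES(J), J = 1, ATOM_COUNT)
mylen = len_trim(record) + 1
record(mylen:mylen) = new_line('a')  ! terminate the record so the lines concatenate cleanly
call MPI_SCAN(mylen, runningtotal, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr)
myoffset = runningtotal - mylen      ! MPI_SCAN is inclusive, so subtract this rank's own length
call MPI_FILE_WRITE_AT(fh, myoffset, record, mylen, MPI_CHARACTER, MPI_STATUS_IGNORE, ierr)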
