I am having small doubt regarding file writing in MPI. Lets say I have "N" no of process working on a program. At the end of the program, each process will have "m" number of particles (positions+velocities). But the number of particles, m , differs for each process. How would I write all the particle info (pos + vel) in a single file. What I understood from searching is that I can do so with MPI_File_open, MPI_File_set_view,MPI_File_write_all, But I need to have same no of particles in each process. Any ideas how I could do it in my case ?
You don't need the same number of particles on each processor. What you do need is for every processor to participate. One or more could very well have zero particles, even.
Allgather is a fine way to do it, and the single integer exchanged among all processes is not such large overhead.
However, a better way is to use MPI_SCAN:
incr = numparts;
MPI_Scan(&incr, &new_offset, 1, MPI_LONG_LONG_INT,
MPI_SUM, MPI_COMM_WORLD);
new_offset -= incr; /* or skip this with MPI_EXSCAN, but \
then rank 0 has an undefined result */
MPI_File_write_at_all(fh, new_offset, buf, count, datatype, status);
You need to perform
MPI_Allgather(np, 1, MPI_INTEGER, procnp, 1, &
MPI_INTEGER, MPI_COMM_WORLD, ierr)
where np is the number of particles per process and procnp is an array of size number of processes nprocs. This gives you an array on every process of the number of molecules on all other processes. That way MPI_File_set_view can be chosen correctly for each processes by calculating the offset based on the process id. This psudocode to get the offset is something like,
procdisp = 0
!Obtain displacement of each processor using all other procs' np
for i = 1, irank -1
procdisp = procdisp + procnp(i)*datasize
enddo
This was taken from fortran code so irank is from 1 to nprocs
Related
I have just started taking courses in HPC and am doing an assignment where we are expected to implement Reduce function equivalent to the MPI_Reduce on MPI_SUM... Easy enough right? Here is what I did:
I started with basic concept of sending data/array from all nodes to the root-node (0-th ranked process) and there I computed the sum.
As a second step I optimized it further so that each process sends data to it's mirror image which computes the sum,and this process keeps repeating until the result in finally present in the root-node (0-th process). My implementation is as follows:
for(k=(size-1); k>0; k/=2)
{
if(rank<=k)
{
if(rank<=(k/2))
{
//receiving the buffers from different processes and computing them
MPI_Recv(rec_buffer, count, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
for(i=0; i<count; i++)
{
res[i] += rec_buffer[i];
}
}
else
{
MPI_Send(res, count, MPI_INT, k-rank, 0, MPI_COMM_WORLD);
}
}
}
But the thing is this code performs significantly poorer compared to the MPI_Reduce function itself.
So how can I further optimize it? What can I do differently to make if better? I can't make the sum loop multi-threaded as it is required that we do it in a single thread. I can may be optimize the sum loop but not sure how and where to begin.
I apologize for a pretty basic question but I am really starting to get feet wet in the field of HPC. Thanks!
Your second approach is the right one since you do the same number of communications but you have parallelized the reduce operation (sum in your case) and the communication (since you communicate between subsets). You typically do reduce operation like described in Reduction operator.
However you may want to try asynchronous communication using MPI_Isend and MPI_Irecv to improve performance and get closer to MPI_Reduce performance.
#GillesGouillardet provided one implementation, you can see that the communications in the code are done with isend and irecv (look for "MCA_PML_CALL( isend" and "MCA_PML_CALL( irecv" )
I'm new with OpenCL and I'm trying to understand this example program written by Apple here.
The goal of the program is to calculate the square of each element of an input array and write the result in a new array.
You can see that the input array has dimension: 1024. The number of work groups is 1024 and the size of each of those is the max CL_KERNEL_WORK_GROUP_SIZE.
Can anybody explain me what's the point of using so many work-items in each work group if in the Kernel there's no get_local_id() call? Could they use 1 as the size of each work group? what would have been the difference?
Thanks.
Some code to show the point:
// Get the maximum work group size for executing the kernel on the device
//
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), &local, NULL);
// Execute the kernel over the entire range of our 1d input data set
// using the maximum number of work group items for this device
//
global = count;
err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
Your global work size is executed in chunks of local work size (in theory), if you set 1 as your local work group size, then it would execute only 1 thread in each local work group. On GPUs, work groups match to compute units - if you have a work group size of 1, your 1 thread may potentially occupy a whole compute unit. This is really, really horribly slow
I am parallelising a fortran code which works with no problem in a no-MPI version. Below is an excerpt of the code.
Every processor does the following:
For a certain number of particles it evolves certain quantities in the loop "do 203"; in a given interval which is divided in Nint subintervals (j=1,Nint), every processor produces an element of the vectors Nx1(j), Nx2(j).
Then, the vectors Nx1(j), Nx2(j) are sent to the root (mype =0) which in every subinterval (j=1,Nint) sums all the contributions for every processor: Nx1(j) from processor 1 + Nx1(j) from processor 2.... The root sums for every value of j (every subinterval), and produces Nx5(j), Nx6(j).
Another issue is that if I deallocate the variables the code remains in standby after the end of the calculation without completing the execution; but I don't know if this is related to the MPI_Allreduce issue.
include "mpif.h"
...
integer*4 ....
...
real*8
...
call MPI_INIT(mpierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, npe, mpierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, mype, mpierr)
! Allocate variables
allocate(Nx1(Nint),Nx5(Nint))
...
! Parameters
...
call MPI_Barrier (MPI_COMM_WORLD, mpierr)
! Loop on particles
do 100 npartj=1,npart_local
call init_random_seed()
call random_number (rand)
...
Initial condition
...
do 203 i=1,1000000 ! loop for time evolution of single particle
if(ufinp.gt.p1.and.ufinp.le.p2)then
do j=1,Nint ! spatial position at any momentum
ls(j) = lb+(j-1)*Delta/Nint !Left side of sub-interval across shock
rs(j) = ls(j)+Delta/Nint
if(y(1).gt.ls(j).and.y(1).lt.rs(j))then !position-ordered
Nx1(j)=Nx1(j)+1
endif
enddo
endif
if(ufinp.gt.p2.and.ufinp.le.p3)then
do j=1,Nint ! spatial position at any momentum
ls(j) = lb+(j-1)*Delta/Nint !Left side of sub-interval across shock
rs(j) = ls(j)+Delta/Nint
if(y(1).gt.ls(j).and.y(1).lt.rs(j))then !position-ordered
Nx2(j)=Nx2(j)+1
endif
enddo
endif
203 continue
100 continue
call MPI_Barrier (MPI_COMM_WORLD, mpierr)
print*,"To be summed"
do j=1,Nint
call MPI_ALLREDUCE (Nx1(j),Nx5(j),npe,mpi_integer,mpi_sum,
& MPI_COMM_WORLD, mpierr)
call MPI_ALLREDUCE (Nx2(j),Nx6(j),npe,mpi_integer,mpi_sum,
& MPI_COMM_WORLD, mpierr)
enddo
if(mype.eq.0)then
do j=1,Nint
write(1,107)ls(j),Nx5(j),Nx6(j)
enddo
107 format(3(F13.2,2X,i6,2X,i6))
endif
call MPI_Barrier (MPI_COMM_WORLD, mpierr)
print*,"Now deallocate"
! deallocate(Nx1) !inserting the de-allocate
! deallocate(Nx2)
close(1)
call MPI_Finalize(mpierr)
end
! Subroutines
...
Then, the vectors Nx1(j), Nx2(j) are sent to the root (mype =0) which in every subinterval (j=1,Nint) sums all the contributions for every processor: Nx1(j) from processor 1 + Nx1(j) from processor 2.... The root sums for every value of j (every subinterval), and produces Nx5(j), Nx6(j).
This is not what an allreduce does. Reduction means the summation is done in parallel across all processes. allreduce means all processes will get the result of the summing.
Your MPI_Allreduces:
call MPI_ALLREDUCE (Nx1(j),Nx5(j),npe,mpi_integer,mpi_sum, &
& MPI_COMM_WORLD, mpierr)
call MPI_ALLREDUCE (Nx2(j),Nx6(j),npe,mpi_integer,mpi_sum, &
& MPI_COMM_WORLD, mpierr)
Actually look like the count should be 1 here. This is because count just states how many elements you are to receive from each process, not how many there will be in total.
However, you actually do not need that loop, because the allreduce luckily is capable of handling multiple elements all at once. Thus, I believe instead of the loop with your allreduces, you actually want something like:
integer :: Nx1(nint)
integer :: Nx2(nint)
integer :: Nx5(nint)
integer :: Nx6(nint)
call MPI_ALLREDUCE (Nx1, Nx5, nint, mpi_integer, mpi_sum, &
& MPI_COMM_WORLD, mpierr)
call MPI_ALLREDUCE (Nx2, Nx6, nint, mpi_integer, mpi_sum, &
& MPI_COMM_WORLD, mpierr)
Nx5 will contain the sum of Nx1 across all partitions, and Nx6 the sum across Nx2.
The information in your question is a little bit lacking, so I am not quite sure, if this is what you are looking for.
I'm trying to recombine sub-arrays without the dark-grey rows with MPI_Gatherv. Picture's worth a thousand words:
graphical overview of the ghost/halo dark-grey cells http://img535.imageshack.us/img535/9118/ghostcells.jpg
How would you send only parts of *sendbuf (the first parameter in MPI_Gatherv manual) to the root process (without a wasteful rewriting in another structure, this time without the dark-grey rows)? The *displacements (the 4th parameter) is only relevant to the *recvbuf of the root process.
Thank you.
Update (or, being more precise)
I wanted to also send the "boundary" (light-grey) cells ... not just the "interior" (white) cells. As osgx correctly pointed out: in this case the MPI_Gatherv suffices ... some conditional array indexing will do it.
What about constructing a datatype, which will allow you to send only white (Interior) cells?
The combined (derived) datatype can be a MPI_Type_indexed.
The only problem will be with very first line and very last line in processes P0 and PN, because P1 and PN should send more data then P2....PN-1
For Interior+Boundary you can construct datatype of single "line" with
MPI_Datatype LineType;
MPI_Type_vector (1, row_number, 0 , MPI_DOUBLE, &LineType)
MPI_Type_commit ( &LineType);
Then (for ARRAY sized [I][J] splitted for stripecount stripes )
for (i=0; i< processes_number; ++i) {
displs[i] = i*(I/stripecount)+1; // point to second line in each stripe
rcounts[i] = (I/stripecount) -2 ;
}
rcounts[0] ++; // first and last processes must send one line more
rcounts[processes_number-1] ++;
displs[0] -= 1; // first process should send first line of stripe
// lastprocess displacement is ok, because it should send last line of stripe
source_ptr = ARRAY[displs[rank]];
lines_to_send = rcounts[rank];
MPI_Gatherv( source_ptr, lines_to_send, LineType, recv_buf, rcounts, displs, LineType, root, comm);
I'm writing to a file as follows. The order does not necessarily matter (though it would be nice if I could get it ordered by K, as would be inherently in serial code)
CALL MPI_BARRIER(MPI_COMM_WORLD, IERR)
OPEN(EIGENVALUES_UP_IO, FILE=EIGENVALUES_UP_PATH, ACCESS='APPEND')
WRITE(EIGENVALUES_UP_IO, *) K * 0.0001_DP * PI, (EIGENVALUES(J), J = 1, ATOM_COUNT)
CLOSE(EIGENVALUES_UP_IO)
I'm aware this is likely to be the worst option.
I've taken a look at MPI_FILE_WRITE_AT etc. but I'm not sure they (directly) take data in the form that I have?
The file must be in the same format as this, which comes out as a line per K, with ATOM_COUNT + 1 columns. The values are REAL(8)
I've hunted over and over, and can't find any simple references on achieving this. Any help? :)
Similar code in C (assuming it's basically the same as FORTRAN) is just as useful
Thanks!
So determining the right IO strategy depends on a lot of factors. If you are just sending back a handful of eigenvalues, and you're stuck writing out ASCII, you might be best off just sending all the data back to process 0 to write. This is not normally a winning strategy, as it obviously doesn't scale; but if the amount of data is very small, it could well be better than the contention involved in trying to write out to a shared file (which is, again, harder with ASCII).
Some code is below which will schlep the amount of data back to proc 0, assuming everyone has the same amount of data.
Another approach would just be to have everyone write out their own ks and eigenvalues, and then as a postprocessing step once the program is finished, cat them all together. That avoids the MPI step, and (with the right filesystem) can scale up quite a ways, and is easy; whether that's better is fairly easily testable, and will depend on the amount of data, number of processors, and underlying file system.
program testio
use mpi
implicit none
integer, parameter :: atom_count = 5
integer, parameter :: kpertask = 2
integer, parameter :: fileunit = 7
integer, parameter :: io_master = 0
double precision, parameter :: pi = 3.14159
integer :: totalk
integer :: ierr
integer :: rank, nprocs
integer :: handle
integer(kind=MPI_OFFSET_KIND) :: offset
integer :: filetype
integer :: j,k
double precision, dimension(atom_count, kpertask) :: eigenvalues
double precision, dimension(kpertask) :: ks
double precision, allocatable, dimension(:,:):: alleigenvals
double precision, allocatable, dimension(:) :: allks
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
totalk = nprocs*kpertask
!! setup test data
do k=1,kpertask
ks(k) = (rank*kpertask+k)*1.d-4*PI
do j=1,atom_count
eigenvalues(j,k) = rank*100+j
enddo
enddo
!! Everyone sends proc 0 their data
if (rank == 0) then
allocate(allks(totalk))
allocate(alleigenvals(atom_count, totalk))
endif
call MPI_GATHER(ks, kpertask, MPI_DOUBLE_PRECISION, &
allks, kpertask, MPI_DOUBLE_PRECISION, &
io_master, MPI_COMM_WORLD, ierr)
call MPI_GATHER(eigenvalues, kpertask*atom_count, MPI_DOUBLE_PRECISION, &
alleigenvals, kpertask*atom_count, MPI_DOUBLE_PRECISION, &
io_master, MPI_COMM_WORLD, ierr)
if (rank == 0) then
open(unit=fileunit, file='output.txt')
do k=1,totalk
WRITE(fileunit, *) allks(k), (alleigenvals(j,k), j = 1, atom_count)
enddo
close(unit=fileunit)
deallocate(allks)
deallocate(alleigenvals)
endif
call MPI_FINALIZE(ierr)
end program testio
If you can determine how long each rank's write will be, you can call MPI_SCAN(size, offset, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD) to compute the offset that each rank should start at, and then they can all call MPI_FILE_WRITE_AT. This is probably more suitable if you have a lot of data, and you are confident that your MPI implementation does the write efficiently (doesn't serialize internally, or the like).