When using MPI_reduce, does the root processor apply the specified MPI operation on itself as well?
For example, assume the following code is run by all processors including the root, does root reduce it's local_sum into global_sum as if it is non-root one?
int local_sum;
int global_sum;
int i;
for (i = 0; i < 5; i++) {
local_sum += rand_nums[i];
}
MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, MPI_SUM, ROOT, MPI_COMM_WORLD);
Yes, reduce is also applied to the root itself. Maybe in your mind, you think that MPI just add numbers from other ranks to the local_sum variable in the root.
However, what MPI actually does is that it reduces from all ranks (including the root) in your communicator and put the result to global_sum.
If MPI doesn't reduce the root itself, then it doesn't make sense having two parameters for the MPI_Reduce call.
Related
I have just started taking courses in HPC and am doing an assignment where we are expected to implement Reduce function equivalent to the MPI_Reduce on MPI_SUM... Easy enough right? Here is what I did:
I started with basic concept of sending data/array from all nodes to the root-node (0-th ranked process) and there I computed the sum.
As a second step I optimized it further so that each process sends data to it's mirror image which computes the sum,and this process keeps repeating until the result in finally present in the root-node (0-th process). My implementation is as follows:
for(k=(size-1); k>0; k/=2)
{
if(rank<=k)
{
if(rank<=(k/2))
{
//receiving the buffers from different processes and computing them
MPI_Recv(rec_buffer, count, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
for(i=0; i<count; i++)
{
res[i] += rec_buffer[i];
}
}
else
{
MPI_Send(res, count, MPI_INT, k-rank, 0, MPI_COMM_WORLD);
}
}
}
But the thing is this code performs significantly poorer compared to the MPI_Reduce function itself.
So how can I further optimize it? What can I do differently to make if better? I can't make the sum loop multi-threaded as it is required that we do it in a single thread. I can may be optimize the sum loop but not sure how and where to begin.
I apologize for a pretty basic question but I am really starting to get feet wet in the field of HPC. Thanks!
Your second approach is the right one since you do the same number of communications but you have parallelized the reduce operation (sum in your case) and the communication (since you communicate between subsets). You typically do reduce operation like described in Reduction operator.
However you may want to try asynchronous communication using MPI_Isend and MPI_Irecv to improve performance and get closer to MPI_Reduce performance.
#GillesGouillardet provided one implementation, you can see that the communications in the code are done with isend and irecv (look for "MCA_PML_CALL( isend" and "MCA_PML_CALL( irecv" )
I am new to MPI programing. I have a 8 by 10 array that I need to use to find the summation of each row parallely. In rank 0 (process 0), it will generate the 8 by 10 matrix using a 2 dimensional array. Then I would use tag number as the first index value(row number) of the array. This way, I can use a unique buffer to send through Isend. However, it looks like my method of tag number generation for Isend is not working. Can you please look in to the following code and tell me if I am passing the 2D array correctly and tag number. When I run this code, it stop just after executing rannk 1 and waits. I use 3 process for this example and use the command mpirun -np 3 test please let me know how to tackle this problem with an example if possible.
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[])
{
MPI_Init(&argc, &argv);
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
int tag = 1;
int arr[8][10];
MPI_Request request;
MPI_Status status;
int source = 0;
int dest;
printf ("\n--Current Rank: %d\n", world_rank);
if (world_rank == 0)
{
int i = 0;
int a, b, x, y;
printf("* Rank 0 excecuting\n");
for(a=0; a<8/(world_size-1); a++)//if -np is 3, this will loop 4 times
{
for(b=0; b<(world_size-1); b++)//if -np is 3, this loops will loop 2 times
{//So, if -np is 3, due to both of these loops, Isend will be called 8 times
dest = b+1;
tag = a+b;//create a uniqe tag value each time, which can be use as first index value of array
//Error: This tag value passing to Isend doesn't seems to be workiing
MPI_Isend(&arr[tag][0], 10, MPI_INT, dest, tag, MPI_COMM_WORLD, &request);
}
}
for(x=0; x<8; x++)//Generating the whole 8 by 10 2D array
{
i++;
for ( y = 0; y < 10; y++ )
{
arr[x][y] = i;
}
}
}
else
{
int a, b;
for(b=1; b<=8/(world_size-1); b++)
{
int sum = 0;
int i;
MPI_Irecv(&arr[tag][0], 10, MPI_INT, source, tag, MPI_COMM_WORLD, &request);
MPI_Wait (&request, &status);
//Error: not getting the correct tag value
for(i = 0; i<10; i++)
{
sum = arr[tag][i]+sum;
}
printf("\nSum is: %d at rank: %d and tag is:%d\n", sum, world_rank, tag);
}
}
MPI_Finalize();
}
The tag issue is because of how the tag is computed (or not) on different processes. You're initializing the tag values for all processes as
int tag = 1;
and later, for process rank 0 you set the tag to
tag = a+b;
which, for the first time this is set, will set tag to 0 because both a and b start out as zero. However, for processes with rank above 0, the tag is never changed. They will continue to have the tag set to 1.
The tag uniquely identifies the message being sent by MPI_Isend and MPI_Irecv, which means that a send and its corresponding receive must have the same tag for the data transfer to succeed. Because the tags are mismatched between processes for most of the receives, the transfers are mostly unsuccessful. This causes processes with rank higher than 0 to eventually block (wait) forever on the call to MPI_Wait.
In order to fix this, you have to make sure to change the tags for the processes with rank above zero. However, before we can do that, there's a few other issues worth touching up on.
With the way you've set your tag for the rank 0 process right now, tag can only ever have values 0 to 4 (assuming 3 processes). This is because a is limited to the range 0 to 3, and b can only have values 0 or 1. The maximum possible sum of these values is 4. This means that when you access your array using arr[tag][0], you will miss out on a lot of the data, and you'll re-send the same rows several times. I recommend changing the way you approach sending each subarray (which you're currently accessing with tag) so that you have only one for loop to determine which subarray to send, rather than two embedded loops. Then, you can calculate the process to send the array to as
dest = subarray_index%(world_size - 1) + 1;
This will alternate the desitnations between the processes with rank greater than zero. You can keep the tag as just subarray_index. On the receiving side you'll need to calculate the tag per process, per receive.
Finally, I saw that you were initializing your array after you sent the data. You want to do that beforehand.
Combining all these aspects, we get
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[])
{
MPI_Init(&argc, &argv);
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
int tag = 1;
int arr[8][10];
MPI_Request request;
MPI_Status status;
int source = 0;
int dest;
printf ("\n--Current Rank: %d\n", world_rank);
if (world_rank == 0)
{
int i = 0;
int a, b, x, y;
printf("* Rank 0 excecuting\n");
//I've moved the array generation to before the sends.
for(x=0; x<8; x++)//Generating the whole 8 by 10 2D array
{
i++;
for ( y = 0; y < 10; y++ )
{
arr[x][y] = i;
}
}
//I added a subarray_index as mentioned above.
int subarray_index;
for(subarray_index=0; subarray_index < 8; subarray_index++)
{
dest = subarray_index%(world_size - 1) + 1;
tag = subarray_index;
MPI_Isend(&arr[subarray_index][0], 10, MPI_INT, dest, tag, MPI_COMM_WORLD, &request);
}
}
else
{
int a, b;
for(b=0; b<8/(world_size-1); b++)
{
int sum = 0;
int i;
//We have to do extra calculations here. These match tag, dest, and subarray.
int my_offset = world_rank-1;
tag = b*(world_size-1) + my_offset;
int subarray = b;
MPI_Irecv(&arr[subarray][0], 10, MPI_INT, source, tag, MPI_COMM_WORLD, &request);
MPI_Wait (&request, &status);
for(i = 0; i<10; i++)
{
sum = arr[subarray][i]+sum;
}
printf("\nSum is: %d at rank: %d and tag is:%d\n", sum, world_rank, tag);
}
}
MPI_Finalize();
}
There's a one thing that still seems a bit unfinished in this version for you to consider: what will happen if your number of processes changes? For example, if you have 4 processes instead of 3, it looks like you may run into some trouble with the loop
for(b=0; b<8/(world_size-1); b++)
because each process will execute it the same number of times, but the amount of data sent doesn't cleanly split for 3 workers (non-rank-zero processes).
However, if that is not a concern to you, then you do not need to handle such cases.
Aside from the obvious question: "why on earth would you want to do that?", there are so many problems here that I'm not sure I'll be able to list them all. I'll give it a try though:
Tag: it seems that the bulk of your method is to use the tag as an indicator of where to look for the receiver. But there are (at least) two major flaws here:
Since tag isn't know before reception, what is &arr[tag][0] supposed to be?
Tags in MPI are messages "identifier"... On normal circumstances a given communication (send and matching receive) should have a matching tag. This can be alleviated by using MPI_ANY_TAG special tag on the receiving side, and retrieving its actual value using the MPI_TAG field of the reception's status. But that's another story.
Bottom line here is that the method isn't such a good one.
Data initialisation: one of the major principles of non-blocking MPI communications is that you should never modify a buffer you used for a communication between the post of the communication (the MPI_Isend() here) and its finalisation (which is missing here). Therefore, your data generation must happen before the attempts to communicate the data.
Speaking of which, communication finalisation: you have too finalise your sending communications. This can be done using either a wait-type call (MPI_Wait() or MPI_Waitall()), or an "infinite" loop of test-type calls (MPI_Test() and such)...
The MPI_Irecv(): why are you using a non-blocking receive when the very next call is MPI_Wait()? If you want a blocking receive, just call MPI_Recv() directly.
So fundamentally, what you try to do here doesn't look right. Therefore, I'm very reluctant in trying to propose you a corrected version since I don't understand the actual problem you try to solve. Is this code a reduced version of a bigger real one (or an initial version of something supposed to grow), or just a toy example meant for you to learn how MPI send/receive works? Is ther any fundamental reason why you're not using a collective communication such as MPI_Scatter()?
Depending on your answer on these questions, I can try to produce a valid version.
I am having small doubt regarding file writing in MPI. Lets say I have "N" no of process working on a program. At the end of the program, each process will have "m" number of particles (positions+velocities). But the number of particles, m , differs for each process. How would I write all the particle info (pos + vel) in a single file. What I understood from searching is that I can do so with MPI_File_open, MPI_File_set_view,MPI_File_write_all, But I need to have same no of particles in each process. Any ideas how I could do it in my case ?
You don't need the same number of particles on each processor. What you do need is for every processor to participate. One or more could very well have zero particles, even.
Allgather is a fine way to do it, and the single integer exchanged among all processes is not such large overhead.
However, a better way is to use MPI_SCAN:
incr = numparts;
MPI_Scan(&incr, &new_offset, 1, MPI_LONG_LONG_INT,
MPI_SUM, MPI_COMM_WORLD);
new_offset -= incr; /* or skip this with MPI_EXSCAN, but \
then rank 0 has an undefined result */
MPI_File_write_at_all(fh, new_offset, buf, count, datatype, status);
You need to perform
MPI_Allgather(np, 1, MPI_INTEGER, procnp, 1, &
MPI_INTEGER, MPI_COMM_WORLD, ierr)
where np is the number of particles per process and procnp is an array of size number of processes nprocs. This gives you an array on every process of the number of molecules on all other processes. That way MPI_File_set_view can be chosen correctly for each processes by calculating the offset based on the process id. This psudocode to get the offset is something like,
procdisp = 0
!Obtain displacement of each processor using all other procs' np
for i = 1, irank -1
procdisp = procdisp + procnp(i)*datasize
enddo
This was taken from fortran code so irank is from 1 to nprocs
How do I access a global vector from an individual thread in MPI?
I'm using a library - specifically, an ODE solver library - called CVODE (part of SUNDIALS). The library works with MPI, so that multiple threads are running in parallel. They are all running the same code. Each thread sends the thread "next to" it a piece of data. But I want one of the threads (rank=0) to print out the state of the data at some points.
The library includes functions so that each thread can access their own data (the local vector). But there is no method to access the global vector.
I need to output the values of all of the equations at specific times. To do so, I would need access to the global vector. Anyone know how get at all of the data in an MPI vector (using CVODE, if possible)?
For example, here is my code that each thread runs
for (iout=1, tout=T1; iout <= NOUT; iout++, tout += DTOUT) {
flag = CVode(cvode_mem, tout, u, &t, CV_NORMAL);
if(check_flag(&flag, "CVode", 1, my_pe)) break;
if (my_pe == 0) PrintData(t, u);
}
...
static void PrintData(realtype t, N_Vector u) {
I want to print data from all threads in here
}
In function f (the function I'm solving), I pass data back and forth using MPI_Send and MPI_Recv. But I can't really do that in PrintData because the other processes have run ahead. Also, I don't want to add messaging overhead. I want to access the global vector in PrintData, and then just print out what's needed. Is it possible?
Edit: While waiting for a better answer, I programmed each thread passing the data back to the 0th thread. I don't think that's adding too much messaging overhead, but I'd still like to hear from you experts if there's a better method (I'm sure there isn't any worse ones! :D ).
Edit 2: Although angainor's solution is surely superior, I stuck with the one I had created. For future reference of anyone who has the same question, here is the basics of how I did it:
/* Is called by all threads */
static void PrintData(realtype t, N_Vector u, UserData data) {
... declarations and such ...
for (n=1; n<=my_length; n++) {
mass_num = my_base + n;
z[mass_num - 1] = udata[n-1];
z[mass_num - 1 + N] = udata[n - 1 + my_length];
}
if (my_pe != 0) {
MPI_Send(&z, 2*N, PVEC_REAL_MPI_TYPE, 0, my_pe, comm);
} else {
for (i=1; i<npes; i++) {
MPI_Recv(&z1, 2*N, PVEC_REAL_MPI_TYPE, i, i, comm, &status);
for (n=0; n<2*N; n++)
z[n] = z[n] + z1[n];
}
... now I can print it out however I like...
return;
}
When using MPI the individual threads do not have access to a 'global'
vector. They are not threads, they are processes that can run on
different physical computers and therefore can not have direct access to global data.
To do what you want you can either send the vector to one of the MPI processes (you did that) and print it there, or to print local worker parts in sequence. Use a function like this:
void MPI_write_ivector(int thrid, int nthr, int vec_dim, int *v)
{
int i, j;
int curthr = 0;
MPI_Barrier(MPI_COMM_WORLD);
while(curthr!=nthr){
if(curthr==thrid){
printf("thread %i writing\n", thrid);
for(i=0; i<vec_dim; i++) printf("%d\n", v[i]);
fflush(stdout);
curthr++;
MPI_Bcast(&curthr, 1, MPI_INT, thrid, MPI_COMM_WORLD);
} else {
MPI_Bcast(&curthr, 1, MPI_INT, curthr, MPI_COMM_WORLD);
}
}
}
All MPI processes should call it at the same time since there is a barrier and broadcast inside. Essentially, the procedure makes sure that all the MPI processes print their vector part in order, starting from rank 0. The data is not messed up since only
one process writes at any given time.
In the example above, Broadcast is used since it gives more flexibility on the order in which the threads should print their results - the thread that currently outputs can decide, who comes next. You could also skip the broadcast and only use a barrier
void MPI_write_ivector(int thrid, int nthr, int vec_dim, int *v)
{
int i, j;
int curthr = 0;
while(curthr!=nthr){
if(curthr==thrid){
printf("thread %i writing\n", thrid);
for(i=0; i<vec_dim; i++) printf("%d\n", v[i]);
fflush(stdout);
}
MPI_Barrier(MPI_COMM_WORLD);
curthr++;
}
}
I am thinking about implementing a wrapper for MPI that imitates OpenMP's way
of parallelizing for loops.
begin_parallel_region( chunk_size=100 , num_proc=10 );
for( int i=0 ; i<1000 ; i++ )
{
//some computation
}
end_parallel_region();
The code above distributes computation inside the for loop to 10 slave MPI processors.
Upon entering the parallel region, the chunk size and number of slave processors are provided.
Upon leaving the parallel region, the MPI processors are synched and are put idle.
EDITED in response to High Performance Mark.
I have no intention to simulate the OpenMP's shared memory model.
I propose this because I need it.
I am developing a library that is required to build graphs from mathetical functions.
In these mathetical functions, there often exist for loops like the one below.
for( int i=0 ; i<n ; i++ )
{
s = s + sin(x[i]);
}
So I want to first be able to distribute sin(x[i]) to slave processors and at the end reduce to the single varible just like in OpenMP.
I was wondering if there is such a wrapper out there so that I don't have to reinvent the wheel.
Thanks.
There is no such wrapper out there which has escaped from the research labs into widespread use. What you propose is not so much re-inventing the wheel as inventing the flying car.
I can see how you propose to write MPI code which simulates OpenMP's approach to sharing the burden of loops, what is much less clear is how you propose to have MPI simulate OpenMP's shared memory model ?
In a simple OpenMP program one might have, as you suggest, 10 threads each perform 10% of the iterations of a large loop, perhaps updating the values of a large (shared) data structure. To simulate that inside your cunning wrapper in MPI you'll either have to (i) persuade single-sided communications to behave like shared memory (this might be doable and will certainly be difficult) or (ii) distribute the data to all processes, have each process independently compute 10% of the results, then broadcast the results all-to-all so that at the end of execution each process has all the data that the others have.
Simulating shared memory computing on distributed memory hardware is a hot topic in parallel computing, always has been, always will be. Google for distributed shared memory computing and join the fun.
EDIT
Well, if you've distributed x across processes then individual processes can compute sin(x[i]) and you can reduce the sum on to one process using MPI_Reduce.
I must be missing something about your requirements because I just can't see why you want to build any superstructure on top of what MPI already provides. Nevertheless, my answer to your original question remains No, there is no such wrapper as you seek and all the rest of my answer is mere commentary.
Yes, you could do this, for specific tasks. But you shouldn't.
Consider how you might implement this; the begin part would distribute the data, and the end part would bring the answer back:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>
typedef struct state_t {
int globaln;
int localn;
int *locals;
int *offsets;
double *localin;
double *localout;
double (*map)(double);
} state;
state *begin_parallel_mapandsum(double *in, int n, double (*map)(double)) {
state *s = malloc(sizeof(state));
s->globaln = n;
s->map = map;
/* figure out decomposition */
int size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
s->locals = malloc(size * sizeof(int));
s->offsets = malloc(size * sizeof(int));
s->offsets[0] = 0;
for (int i=0; i<size; i++) {
s->locals[i] = (n+i)/size;
if (i < size-1) s->offsets[i+1] = s->offsets[i] + s->locals[i];
}
/* allocate local arrays */
s->localn = s->locals[rank];
s->localin = malloc(s->localn*sizeof(double));
s->localout = malloc(s->localn*sizeof(double));
/* distribute */
MPI_Scatterv( in, s->locals, s->offsets, MPI_DOUBLE,
s->localin, s->locals[rank], MPI_DOUBLE,
0, MPI_COMM_WORLD);
return s;
}
double end_parallel_mapandsum(state **s) {
double localanswer=0., answer;
/* sum up local answers */
for (int i=0; i<((*s)->localn); i++) {
localanswer += ((*s)->localout)[i];
}
/* and get global result. Everyone gets answer */
MPI_Allreduce(&localanswer, &answer, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
free( (*s)->localin );
free( (*s)->localout );
free( (*s)->locals );
free( (*s)->offsets );
free( (*s) );
return answer;
}
int main(int argc, char **argv) {
int rank;
double *inputs;
double result;
int n=100;
const double pi=4.*atan(1.);
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
inputs = malloc(n * sizeof(double));
for (int i=0; i<n; i++) {
inputs[i] = 2.*pi/n*i;
}
}
state *s=begin_parallel_mapandsum(inputs, n, sin);
for (int i=0; i<s->localn; i++) {
s->localout[i] = (s->map)(s->localin[i]);
}
result = end_parallel_mapandsum(&s);
if (rank == 0) {
printf("Calculated result: %lf\n", result);
double trueresult = 0.;
for (int i=0; i<n; i++) trueresult += sin(inputs[i]);
printf("True result: %lf\n", trueresult);
}
MPI_Finalize();
}
That constant distribute/gather is a terrible communications burden to sum up a few numbers, and is antithetical to the entire distributed-memory computing model.
To a first approximation, shared memory approaches - OpenMP, pthreads, IPP, what have you - are about scaling computations faster; about throwing more processors at the same chunk of memory. On the other hand, distributed-memory computing is about scaling a computation bigger; about using more resourses, particularly memory, than can be found on a single computer. The big win of using MPI is when you're dealing with problem sets which can't fit on any one node's memory, ever. So when doing distributed-memory computing, you avoid having all the data in any one place.
It's important to keep that basic approach in mind even when you are just using MPI on-node to use all the processors. The above scatter/gather approach will just kill performance. The more idiomatic distributed-memory computing approach is for the logic of the program to already have distributed the data - that is, your begin_parallel_region and end_parallel_region above would have already been built into the code above your loop at the very beginning. Then, every loop is just
for( int i=0 ; i<localn ; i++ )
{
s = s + sin(x[i]);
}
and when you need to exchange data between tasks (or reduce a result, or what have you) then you call the MPI functions to do those specific tasks.
Is MPI a must or are you just trying to run your OpenMP-like code on a cluster? In the latter case, I propose you to take a look at Intel's Cluster OpenMP:
http://www.hpcwire.com/hpcwire/2006-05-19/openmp_on_clusters-1.html