Writing to a file using a different view for each MPI process - C

I'm trying to write different matrices to a file, each one processed by a different MPI process, merging their content as described in the following image:
Is there any way I can obtain the desired output using a suitable MPI-View?
To help answer the question, I attach a simple code where, with respect to
the previous image, the white columns also get included in the output file.
#include <mpi.h>

#define N 6

int main(int argc, char **argv) {
    double A[N*N];
    int mpi_rank, mpi_size;
    MPI_File file;
    MPI_Status status;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    MPI_File_open(MPI_COMM_WORLD, "test.dat", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &file);

    /* Vector type: N blocks of N doubles, strided by N*mpi_size doubles in the file. */
    MPI_Datatype my_type;
    MPI_Type_vector(N, N, N*mpi_size, MPI_DOUBLE, &my_type);
    MPI_Type_commit(&my_type);

    MPI_Offset disp = mpi_rank*N*sizeof(double);
    MPI_File_set_view(file, disp, MPI_DOUBLE, my_type, "native", MPI_INFO_NULL);

    MPI_Datatype row_type;
    MPI_Type_contiguous(N, MPI_DOUBLE, &row_type);
    MPI_Type_commit(&row_type);

    MPI_File_write(file, A, N, row_type, &status);

    MPI_File_close(&file);
    MPI_Finalize();
    return 0;
}

I admire your industry in re-inventing MPI_TYPE_SUBARRAY but surely you could just do that instead of creating contig-of-vectors?
You are so close. In order to omit the ghost cells from your output, I would simply define a subarray memory type that does not describe them.
I think that could be done your way with N-1 instead of N as the vector block length... but just use subarray and make it clearer. Long ago, vectors like this were indeed idiomatic MPI, but MPI-2 introduced subarray types back in 1997, so I think you are ok using this "newfangled" feature.
You can indeed shift the file view with the offset parameter. You could also have every file view start at offset 0 and instead vary the subarray-in-file arguments. All processes would have the same "global array" values; you'd adjust the start[] and possibly the count[] for each processor.
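A minimal sketch of the subarray route, assuming each rank's N x N block sits side by side in a global N x (N*mpi_size) matrix (names like file_type are mine; adjust lsizes/starts, or pass a second subarray describing the local buffer, to leave out the white ghost columns):

int gsizes[2] = { N, N * mpi_size };   /* global array described by the file  */
int lsizes[2] = { N, N };              /* this rank's block                   */
int starts[2] = { 0, N * mpi_rank };   /* vary the start instead of disp      */

MPI_Datatype file_type;
MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                         MPI_ORDER_C, MPI_DOUBLE, &file_type);
MPI_Type_commit(&file_type);

/* The displacement can stay 0 because the per-rank shift is in starts[]. */
MPI_File_set_view(file, 0, MPI_DOUBLE, file_type, "native", MPI_INFO_NULL);
MPI_File_write_all(file, A, N * N, MPI_DOUBLE, &status);
MPI_Type_free(&file_type);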

Related

How to read different intervals of lines from a text file in different processes using MPI in C

I am trying to portion out 1 million lines of float numbers to 16 different processes. For example,
process 0 needs to read lines 1-62500 and
process 1 needs to read lines 62501-125000, etc.
I have tried the following code, but every process reads lines 1-62500. How can I change the line interval for each process?
MPI_Init(NULL, NULL);
n = 1000000 / numberOfProcesses;
FILE *myFile;
myFile = fopen("input.txt", "r");
i = 0;
k = n + 1;
while (k--) {
    fscanf(myFile, "%f", &input[i]);
    i++;
}
fclose(myFile);
MPI_Finalize();
Assuming numberOfProcesses = 4 and numberOfLines = 16:
//n = 1000000/numberOfProcesses;        // replaced by the line below
n = numberOfLines/numberOfProcesses;    // so the new n will be 4
FILE *myFile;
myFile = fopen("input.txt","r");
i = 0;
k = n + 1;                              // (5 in this example)
In your program, all processes read the file from the same location or offset. What you need to do is make each process read from its own specific line or offset. For example, rank 0 should read from line 0, rank 1 from line n, rank 2 from line 2*n, etc. Pass this as a parameter to fseek.
n = numberOfLines/numberOfProcesses;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
file_start = n * rank;
fseek(myFile, file_start, SEEK_SET);
fseek will go to the offset (file_start) of the file. Then file_start will be 0 for rank 0, 4 for rank 1, 8 for rank 2, and so on.
The while loop should also be modified accordingly.
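For illustration, the adjusted loop could look like this (a sketch only: it assumes file_start has already been converted to a byte offset, e.g. by multiplying by a fixed line width, since fseek counts bytes rather than lines):

fseek(myFile, file_start, SEEK_SET);   /* jump to this rank's chunk */
for (i = 0; i < n; i++) {              /* read exactly n values for this rank */
    fscanf(myFile, "%f", &input[i]);
}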
As #Gilles pointed out in the comments, we are explicitly assuming the number of lines in the file here. This can lead to many issues.
To get scalability and parallel performance benefits, it is better to use MPI-IO, which offers great features for parallel file operations. MPI-IO was developed for exactly this kind of use case.
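A minimal sketch of that approach, assuming the data is first converted to a binary file of doubles (the file name input.dat and the total count are placeholders), so that every rank can compute its byte offset directly:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long total = 1000000;                  /* total values in the file */
    long n = total / size;                       /* values per process       */
    double *input = malloc(n * sizeof(double));

    MPI_File_open(MPI_COMM_WORLD, "input.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    /* Each rank reads its own contiguous chunk, collectively, at a byte offset. */
    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_read_at_all(fh, offset, input, n, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(input);
    MPI_Finalize();
    return 0;
}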

Understanding the broadcast operation in MPI

I have an MPI program that was given to us. It receives an integer and a double as input from the user and has each process announce the values it receives.
For example:
user input = 7 10.1
Output:
Process 1 got 7 and 10.100000
Process 2 got 7 and 10.100000
.
.
I understand that each process just has to announce the values given by the user through a single broadcast, but the code seemed so complicated that I couldn't understand the logic of it.
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank; //rank of the process
    struct {int a; double b;} value;
    MPI_Datatype mystruct;
    int blocklens[2]; //what is this?
    MPI_Aint indices[2]; //what is this?
    MPI_Datatype oldtype[2];

    MPI_Init(&argc, &argv); //initialize MPI environment
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    blocklens[0] = 1;
    blocklens[1] = 1;
    oldtype[0] = MPI_INT;
    oldtype[1] = MPI_DOUBLE;
    MPI_Get_address(&value.a, &indices[0]);
    MPI_Get_address(&value.b, &indices[1]);
    indices[1] = indices[1] - indices[0];
    indices[0] = 0;
    MPI_Type_create_struct(2, blocklens, indices, oldtype, &mystruct);
    MPI_Type_commit(&mystruct);

    while (value.a >= 0) {
        if (rank == 0) {
            printf("Enter an integer and double: ");
            fflush(stdout);
            scanf("%d %lf", &value.a, &value.b);
        }
        MPI_Bcast(&value, 1, mystruct, 0, MPI_COMM_WORLD);
        printf("Process %d got %d and %lf\n", rank, value.a, value.b);
    }
    MPI_Type_free(&mystruct);
    MPI_Finalize();
    return 0;
}
I would appreciate it if someone could give me a run-through of how the code works, as I find it really hard to understand.
This code creates an MPI derived datatype so that struct value can be broadcast in a single MPI call.
This is IMHO a bad example since:
the offsetof() macro should be used to (directly) populate the displacements array (indices is a very poor choice of name here)
the predefined MPI_DOUBLE_INT datatype is a perfect fit (do not forget to swap a and b in the struct value definition)
as a matter of taste, I'd rather recommend you pass the values via the command line rather than reading them from stdin (this is very subjective, but from experience, you will avoid surprises)
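To make the first two points concrete, here is a sketch (mine, not from the original code) of both alternatives; struct pair is just a named version of the anonymous struct so that offsetof() can be applied to it:

#include <stddef.h>   /* offsetof */

struct pair { int a; double b; };
int          blocklens[2] = { 1, 1 };
MPI_Aint     disps[2]     = { offsetof(struct pair, a), offsetof(struct pair, b) };
MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };
MPI_Datatype mystruct;
MPI_Type_create_struct(2, blocklens, disps, types, &mystruct);
MPI_Type_commit(&mystruct);

/* Alternative: with the fields swapped (double first), the predefined
 * MPI_DOUBLE_INT type already matches the layout, so no derived datatype
 * is needed at all:
 *     struct { double b; int a; } value;
 *     MPI_Bcast(&value, 1, MPI_DOUBLE_INT, 0, MPI_COMM_WORLD);
 */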

Consistency of MPI_Fetch_and_op

I am trying to understand the MPI function MPI_Fetch_and_op() through a small example, and I ran into a strange behaviour I would like to understand.
In the example, the process with rank 0 waits until processes 1..4 have each incremented the value of result by one before carrying on.
With the default value 0 for assert used in MPI_Win_lock_all(), I sometimes (about 1 run in 10) get an infinite loop, in which the value of result[0] on the MASTER only ever gets updated to 3. The terminal output looks like the following snippet:
result: 3
result: 3
result: 3
...
According to the documentation, the function MPI_Fetch_and_op is atomic.
This operation is atomic with respect to other "accumulate" operations.
First Question:
Why is it not updating the value of result[0] to 4?
If I change the value of assert to MPI_MODE_NOCHECK, it seems to work.
Second Question:
Why does it work with MPI_MODE_NOCHECK?
According to the documentation, I thought this means the mutual exclusion has to be organized in a different way. Can someone explain this passage from the documentation of MPI_Win_lock_all()?
MPI_MODE_NOCHECK
No other process holds, or will attempt to acquire, a conflicting lock, while the caller holds the window lock. This is useful when mutual exclusion is achieved by other means, but the coherence operations that may be attached to the lock and unlock calls are still required.
Thanks in advance!
Example program:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MASTER 0

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;

    int r, p;
    MPI_Comm_rank(comm, &r);
    MPI_Comm_size(comm, &p);
    printf("Hello from %d\n", r);

    int result[1] = {0};
    //int assert = MPI_MODE_NOCHECK;
    int assert = 0;
    int one = 1;

    MPI_Win win_res;
    MPI_Win_allocate(1 * sizeof(MPI_INT), sizeof(MPI_INT), MPI_INFO_NULL, comm, &result[0], &win_res);
    MPI_Win_lock_all(assert, win_res);

    if (r == MASTER) {
        result[0] = 0;
        do {
            MPI_Fetch_and_op(&result, &result, MPI_INT, r, 0, MPI_NO_OP, win_res);
            printf("result: %d\n", result[0]);
        } while (result[0] != 4);
        printf("Master is done!\n");
    } else {
        MPI_Fetch_and_op(&one, &result, MPI_INT, 0, 0, MPI_SUM, win_res);
    }

    MPI_Win_unlock_all(win_res);
    MPI_Win_free(&win_res);
    MPI_Finalize();
    return 0;
}
Compiled with the following Makefile:
MPICC = mpicc
CFLAGS = -g -std=c99 -Wall -Wpedantic -Wextra

all: fetch_and

fetch_and: main.c
	$(MPICC) $(CFLAGS) -o $@ main.c

clean:
	rm fetch_and

run: all
	mpirun -np 5 ./fetch_and
Your code works for me, unchanged. But that may be coincidence. There are many problems with your code. Let me point out what I see:
You hard-coded the number of processes in the test result[0] != 4
You hard-coded the master value into MPI_Fetch_and_op(&one, &result, MPI_INT, 0
Passing the same address as update and result seems dangerous to me: MPI_Fetch_and_op(&result, &result
And my compiler complains about the first parameter since it is in effect an int** (actually int (*)[1])
I'm not sure why you don't get the same complaint on the second parameter,
...but I'm not happy about that second parameter anyway, since the fetch operation writes into memory that you designated to be the window buffer. I guess the lack of coherence here saves you.
You initialize the window with result[0] = 0; but I don't think that is coherent with the window, so again, you may just be lucky.
I would think that MPI_Win_allocate(1 * sizeof(MPI_INT), sizeof(MPI_INT), MPI_INFO_NULL, comm, &result[0] would also be some sort of memory corruption, since result is an output here, but it is a statically allocated array.
Similarly, Win_free tries to deallocate the memory buffer, but that was, as already remarked, a static buffer, so again: memory corruption.
Your use of Win_lock_all is not appropriate: it means that one process locks the window on all targets. Without any competing locks! You are locking the window on only one process, but from all possible origins. I'd use an ordinary lock.
Finally, RMA calls are non-blocking. Normally, consistency is ensured by a Win_fence or Win_unlock. But because you are using a long-lived lock, you need to follow the Fetch_and_op with an MPI_Win_flush_local.
Ok, so that's a dozen cases of, eh, less than ideal programming. Still, in my setup it works. (Sometimes. Sometimes it also hangs.) So you may want to clean up your code a little. Your logic is correct, but your actual implementation is not.
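To make those points concrete, here is a minimal sketch of how the middle of the program could look with the fixes applied (it reuses r, p, comm and one from the question; win_buf, local and dummy are my own names, and this is untested):

int *win_buf;      /* window memory returned by MPI_Win_allocate */
int  local = 0;    /* separate local result buffer               */
int  dummy = 0;    /* origin buffer, ignored by MPI_NO_OP        */
MPI_Win win_res;

MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL, comm, &win_buf, &win_res);

/* Initialize the counter inside a lock epoch, then make everyone wait for it. */
if (r == MASTER) {
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, MASTER, 0, win_res);
    *win_buf = 0;
    MPI_Win_unlock(MASTER, win_res);
}
MPI_Barrier(comm);

if (r == MASTER) {
    do {
        MPI_Win_lock(MPI_LOCK_SHARED, MASTER, 0, win_res);
        MPI_Fetch_and_op(&dummy, &local, MPI_INT, MASTER, 0, MPI_NO_OP, win_res);
        MPI_Win_unlock(MASTER, win_res);   /* completes the fetch */
    } while (local != p - 1);              /* p - 1 workers, nothing hard-coded */
    printf("Master is done!\n");
} else {
    MPI_Win_lock(MPI_LOCK_SHARED, MASTER, 0, win_res);
    MPI_Fetch_and_op(&one, &local, MPI_INT, MASTER, 0, MPI_SUM, win_res);
    MPI_Win_unlock(MASTER, win_res);
}
MPI_Win_free(&win_res);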

Writing to multiple shared files with MPI-IO

I'm running a simulation with thousands of MPI processes and need to write output data to a small set of files. For example, even though I might have 10,000 processes I only want to write out 10 files, with 1,000 writing to each one (at some appropriate offset). AFAIK the correct way to do this is to create a new communicator for the groups of processes that will be writing to the same files, open a shared file for that communicator with MPI_File_open(), and then write to it with MPI_File_write_at_all(). Is that correct? The following code is a toy example that I wrote up:
#include <mpi.h>
#include <math.h>
#include <stdio.h>

const int MAX_NUM_FILES = 4;

int main(){
    MPI_Init(NULL, NULL);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int numProcs;
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    int numProcsPerFile = ceil(((double) numProcs) / MAX_NUM_FILES);
    int targetFile = rank / numProcsPerFile;

    MPI_Comm fileComm;
    MPI_Comm_split(MPI_COMM_WORLD, targetFile, rank, &fileComm);
    int targetFileRank;
    MPI_Comm_rank(fileComm, &targetFileRank);

    char filename[20]; // Sufficient for testing purposes
    snprintf(filename, 20, "out_%d.dat", targetFile);
    printf(
        "Proc %d: writing to file %s with rank %d\n", rank, filename,
        targetFileRank);

    MPI_File outFile;
    MPI_File_open(
        fileComm, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY,
        MPI_INFO_NULL, &outFile);
    char bufToWrite[4];
    snprintf(bufToWrite, 4, "%3d", rank);
    MPI_File_write_at_all(
        outFile, targetFileRank * 3,
        bufToWrite, 3, MPI_CHAR, MPI_STATUS_IGNORE);
    MPI_File_close(&outFile);

    MPI_Finalize();
}
I can compile with mpicc file.c -lm and run, say, 20 processes with mpirun -np 20 a.out, and I get the expected output (four files with five entries each), but I'm unsure whether this is the technically correct/most optimal way of doing it. Is there anything I should do differently?
Your approach is correct. To clarify, we need to revisit the standard and the definitions. Here is the MPI_File_open API from MPI: A Message-Passing Interface Standard Version 2.2 (page 391):
int MPI_File_open(MPI_Comm comm, char *filename, int amode, MPI_Info info,
MPI_File *fh)
Description:
MPI_FILE_OPEN opens the file identified by the file name filename on all processes in the comm communicator group. MPI_FILE_OPEN is a collective routine: all processes must provide the same value for amode, and all processes must provide filenames that reference the same file. (Values for info may vary.) comm must be an intracommunicator; it is erroneous to pass an intercommunicator to MPI_FILE_OPEN.
intracommunicator vs intercommunicator (page 134):
For the purposes of this chapter, it is sufficient to know that there are two types of communicators: intra-communicators and inter-communicators. An intracommunicator can be thought of as an identifier for a single group of processes linked with a context. An intercommunicator identifies two distinct groups of processes linked with a context.
The point of passing an intracommunicator to MPI_File_open() is to specify the set of processes that will perform operations on the file. The MPI runtime needs this information so that it can enforce the appropriate synchronizations when collective I/O operations occur. It is the programmer's responsibility to understand the logic of the application and create/choose the correct intracommunicators.
MPI_Comm_split() is a powerful API that splits a communicator group into disjoint subgroups, which can be used for different purposes, including MPI I/O.
I think it's probably a typo above, but it's the "_all" that signifies a collective operation.
The main point I wanted to make, however, was that the reason the collective operations are faster is that they enable the I/O system to aggregate data from many processes. You may issue 1000 writes from 1000 processes, but with the collective form this might be aggregated into a single large write to the file (rather than 1000 small writes). This is of course a best-case scenario, but the improvements can be dramatic - for access to a shared file I have seen collective I/O go 1000 times faster than non-collective, admittedly for more complicated IO patterns than this.
MPI_File_write_at_all should be the most efficient way to do this. Collective IO functions are typically fastest for large non-contiguous parallel writes to a shared file and the _all variant combines the seek and the write into one call.

MPI_File_open: Can it be made to give up if it finds a file in use?

I have a situation in an MPI code where many processes will be reading many files and constructing their own domains by getting various pieces of data from various files. Most files will be read by several processes. Most processes will read from several files. I am trying to figure out a way to keep all processes active. I thought that I might write code so that each process cycles through its list of files (determined at run time, impossible to determine beforehand), tries to open one with MPI_File_open, and then, if it sees its current file already in use, goes on and tries the next file. This cycle would continue until all data is read.
But is it possible to make MPI_File_open behave this way? As far as I can tell, if MPI_File_open sees a file already in use, it just waits until it can open it. I haven't been able to find anything that changes this behavior.
It looks like you can pass an info object to MPI_File_open to specify how long to wait before moving on to a new file. This seems to be implementation dependent, but from the Open MPI docs it seems the hint shared_file_timeout specifies how long to wait if the file is locked before returning MPI_ERR_TIMEDOUT. Something like this could work (I've only tested that this compiles/runs correctly when the file is not locked):
#include "mpi.h"
#include <stdio.h>
#include <sys/file.h>
int main( int argc, char *argv[] )
{
MPI_Fint handleA, handleB;
int rc, ec, rank;
MPI_File fh;
MPI_Info info;
//int fd = open("temp", O_CREAT | O_RDWR, 0666);
//int result = flock(fd, LOCK_EX);
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
MPI_Info_create( &info );
MPI_Info_set(info, "shared_file_timeout", "10.0");
ec = MPI_File_open( MPI_COMM_WORLD, "temp", MPI_MODE_RDONLY, info, &fh );
if (ec != MPI_SUCCESS) {
char estring[MPI_MAX_ERROR_STRING];
int len;
MPI_Error_string(ec, error_string, &len);
fprintf(stderr, "%3d: %s\n", rank, error_string);
} else{
fprintf(stderr, "%3d: %s\n", rank, "Success");
}
MPI_File_close( &fh );
MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_ARE_FATAL);
MPI_Finalize();
return 0;
}
A few notes: you probably need to set the MPI_Errhandler to ensure the MPI_ERR_TIMEDOUT error doesn't result in termination. I'm not sure how to make this portable across different MPI implementations, as the standard does not seem to specify useful hints for this case, leaving it to implementers. For MPICH this does not work and just blocks endlessly (I can't see an option in MPICH to time out). Non-blocking file open is being considered among the advanced features of MPI-3, so probably not soon.
The other alternative is to simply check whether the file is locked in whatever language you are using, and then open it with MPI only if it's not locked.
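On POSIX systems, that check could be a non-blocking flock() probe, sketched below (it assumes the processes writing the files actually hold flock() locks; and since MPI_File_open is collective, all ranks of the communicator would still need to agree on the result, e.g. with an MPI_Allreduce, before calling it):

#include <fcntl.h>      /* open  */
#include <sys/file.h>   /* flock */
#include <unistd.h>     /* close */

/* Returns 1 if no other process currently holds an exclusive flock() lock
 * on the file, 0 otherwise (or if the file cannot be opened at all). */
int file_is_free(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;
    if (flock(fd, LOCK_SH | LOCK_NB) != 0) {   /* would block: file is in use */
        close(fd);
        return 0;
    }
    flock(fd, LOCK_UN);
    close(fd);
    return 1;
}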
