MPI simple data transfer program - C

I'm supposed to send an integer from one process to another, and it has to be done on my university's shell server...
First I wrote my solution code, which should look something like this (at least I think so...):
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int currentRank = -1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &currentRank);
    if (currentRank == 0) {
        int numberToSend = 1;
        MPI_Send(&numberToSend, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    else if (currentRank == 1) {
        int recivedNumber;
        MPI_Recv(&recivedNumber, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Recived number = %d\n", recivedNumber);
    }
    MPI_Finalize();
    return 0;
}
Then I should create some name.pbs file... and run it. But I can't understand how to specify the number of processors... I tried the following:
#PBS -l nodes=2:ppn=2
#PBS -N cnt
#PBS -j oe
mpiexec ~/mpi1
But later on I still have no idea what to do with this in PuTTY. The qstat command seems to do nothing... only with qstat -Q or -q does it show me some 'statistics', but there are 0 values everywhere... it's my first program in MPI and I really don't understand it at all...
And when I try to run my program I get:
164900#halite:~$ ./transfer1
Fatal error in MPI_Send: Invalid rank, error stack:
MPI_Send(174): MPI_Send(buf=0x7fffd28ec640, count=1, MPI_INT, dest=1, tag=0, MPI_COMM_WORLD) failed
MPI_Send(99).: Invalid rank has value 1 but must be nonnegative and less than 1
Can anyone explain to me how to run this on the server?

The example code works fine here, tested with OpenMPI and GCC.
The problem is that when you run the code you need to specify the number of processes to your mpirun instance. You may have correctly allocated them using Torque or whatever scheduler you are using, but you are running the compiled code as if it were serial; you need to launch it with MPI. Here's an example with the associated output:
mpirun -np 2 ./example
Recived number = 1
With a different launcher (Hydra, PBS) or a different MPI version you need to follow the same pattern as above and tell your MPI run command the number of processes.
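For the batch side of it, a minimal sketch of a PBS script along the same lines might look like the one below (the executable path ~/mpi1 and the -np 2 count are taken from your question; the exact mpiexec/mpirun invocation depends on your site's MPI installation, so check its documentation):
#PBS -l nodes=1:ppn=2
#PBS -N cnt
#PBS -j oe
cd $PBS_O_WORKDIR
mpiexec -np 2 ~/mpi1
Submit it with qsub name.pbs; qstat then shows the job while it is queued or running, and the output ends up in a file named after the job (cnt.o<jobid>) in the submission directory.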

Related

Consistency of MPI_Fetch_and_op

I am trying to understand the MPI function `MPI_Fetch_and_op()` through a small example and ran into a strange behaviour I would like to understand.
In the example, the process with rank 0 waits until the processes 1..4 have each incremented the value of result by one before carrying on.
With the default value 0 for assert used in the function MPI_Win_lock_all(), I sometimes (1 run out of 10) get an infinite loop: the value of result[0] in the MASTER only ever gets updated to 3. The terminal output looks like the following snippet:
result: 3
result: 3
result: 3
...
According to the documentation, the function MPI_Fetch_and_op is atomic:
This operation is atomic with respect to other "accumulate" operations.
First Question:
Why is it not updating the value of result[0] to 4?
If I change the value of assert to MPI_MODE_NOCHECK, it seems to work.
Second Question:
Why does it work with MPI_MODE_NOCHECK?
According to the documentation I thought this means the mutual exclusion has to be organized in a different way. Can someone explain the passage from the documentation of MPI_Win_lock_all()?
MPI_MODE_NOCHECK
No other process holds, or will attempt to acquire, a conflicting lock, while the caller holds the window lock. This is useful when mutual exclusion is achieved by other means, but the coherence operations that may be attached to the lock and unlock calls are still required.
Thanks in advance!
Example program:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MASTER 0

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    int r, p;
    MPI_Comm_rank(comm, &r);
    MPI_Comm_size(comm, &p);
    printf("Hello from %d\n", r);

    int result[1] = {0};
    //int assert = MPI_MODE_NOCHECK;
    int assert = 0;
    int one = 1;

    MPI_Win win_res;
    MPI_Win_allocate(1 * sizeof(MPI_INT), sizeof(MPI_INT), MPI_INFO_NULL, comm, &result[0], &win_res);
    MPI_Win_lock_all(assert, win_res);

    if (r == MASTER) {
        result[0] = 0;
        do {
            MPI_Fetch_and_op(&result, &result, MPI_INT, r, 0, MPI_NO_OP, win_res);
            printf("result: %d\n", result[0]);
        } while (result[0] != 4);
        printf("Master is done!\n");
    } else {
        MPI_Fetch_and_op(&one, &result, MPI_INT, 0, 0, MPI_SUM, win_res);
    }

    MPI_Win_unlock_all(win_res);
    MPI_Win_free(&win_res);
    MPI_Finalize();
    return 0;
}
Compiled with the following Makefile:
MPICC = mpicc
CFLAGS = -g -std=c99 -Wall -Wpedantic -Wextra

all: fetch_and

fetch_and: main.c
	$(MPICC) $(CFLAGS) -o $@ main.c

clean:
	rm fetch_and

run: all
	mpirun -np 5 ./fetch_and
Your code works for me, unchanged. But that may be coincidence. There are many problems with your code. Let me point out what I see:
You hard-coded the number of processes in the test result[0] != 4
You hard-coded the master value into MPI_Fetch_and_op(&one, &result, MPI_INT, 0
Passing the same address as update and result seems dangerous to me: MPI_Fetch_and_op(&result, &result
And my compiler complains about the first parameter, since it is in effect an int** (actually int (*)[1]).
I'm not sure why you don't get the same complaint on the second parameter...
...but I'm not happy about that second parameter anyway, since the fetch operation writes into memory that you designated to be the window buffer. I guess the lack of coherence here saves you.
You initialize the window with result[0] = 0; but I don't think that is coherent with the window so again, you may just be lucky.
I would think that MPI_Win_allocate(1 * sizeof(MPI_INT), sizeof(MPI_INT), MPI_INFO_NULL, comm, &result[0] would also be some sort of memory corruption since result is an output here, but it is a statically allocated array.
Similarly, Win_free tries to deallocate the memory buffer, but that was, as already remarked, a static buffer, so again: memory corruption.
Your use of Win_lock_all is not appropriate: it means that one process locks the window on all targets. Without any competing locks!! You are locking the window on only one process, but from all possible origins. I'd use an ordinary lock.
Finally, RMA calls are non-blocking. Normally, consistency is ensured by a Win_fence or Win_unlock. But because you are using a long-lived lock, you need to follow the Fetch_and_op with an MPI_Win_flush_local.
Ok, so that's a dozen cases of, eh, less than ideal programming. Still, in my setup it works. (Sometimes. Sometimes it also hangs.) So you may want to clean up your code a little. Your logic is correct, but your actual implementation is not.
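To make those points concrete, here is a minimal sketch (my own rewrite, not tested on your system, with my own variable names) that uses MPI-allocated window memory, ordinary per-operation lock/unlock epochs on the master's window instead of a long-lived lock_all, and the process count instead of a hard-coded 4:
#include <mpi.h>
#include <stdio.h>

#define MASTER 0

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    int r, p;
    MPI_Comm_rank(comm, &r);
    MPI_Comm_size(comm, &p);

    int *win_buf;                       /* window memory allocated by MPI, not a stack array */
    MPI_Win win_res;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL, comm, &win_buf, &win_res);

    if (r == MASTER) {
        /* initialize the counter inside an exclusive access epoch */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, MASTER, 0, win_res);
        win_buf[0] = 0;
        MPI_Win_unlock(MASTER, win_res);
    }
    MPI_Barrier(comm);                  /* make sure initialization is done before workers start */

    if (r == MASTER) {
        int result = -1;
        const int dummy = 0;
        do {
            /* read the counter atomically with respect to the workers' accumulates */
            MPI_Win_lock(MPI_LOCK_SHARED, MASTER, 0, win_res);
            MPI_Fetch_and_op(&dummy, &result, MPI_INT, MASTER, 0, MPI_NO_OP, win_res);
            MPI_Win_unlock(MASTER, win_res);
            printf("result: %d\n", result);
        } while (result != p - 1);      /* p-1 workers, not a hard-coded 4 */
        printf("Master is done!\n");
    } else {
        int one = 1, old;
        MPI_Win_lock(MPI_LOCK_SHARED, MASTER, 0, win_res);
        MPI_Fetch_and_op(&one, &old, MPI_INT, MASTER, 0, MPI_SUM, win_res);
        MPI_Win_unlock(MASTER, win_res);
    }

    MPI_Win_free(&win_res);
    MPI_Finalize();
    return 0;
}
In this variant each MPI_Win_unlock completes the preceding Fetch_and_op both locally and at the target, so no separate flush call is needed.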

Writing to multiple shared files with MPI-IO

I'm running a simulation with thousands of MPI processes and need to write output data to a small set of files. For example, even though I might have 10,000 processes I only want to write out 10 files, with 1,000 writing to each one (at some appropriate offset). AFAIK the correct way to do this is to create a new communicator for the groups of processes that will be writing to the same files, open a shared file for that communicator with MPI_File_open(), and then write to it with MPI_File_write_at_all(). Is that correct? The following code is a toy example that I wrote up:
#include <mpi.h>
#include <math.h>
#include <stdio.h>

const int MAX_NUM_FILES = 4;

int main(){
    MPI_Init(NULL, NULL);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int numProcs;
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    int numProcsPerFile = ceil(((double) numProcs) / MAX_NUM_FILES);
    int targetFile = rank / numProcsPerFile;

    MPI_Comm fileComm;
    MPI_Comm_split(MPI_COMM_WORLD, targetFile, rank, &fileComm);
    int targetFileRank;
    MPI_Comm_rank(fileComm, &targetFileRank);

    char filename[20]; // Sufficient for testing purposes
    snprintf(filename, 20, "out_%d.dat", targetFile);
    printf(
        "Proc %d: writing to file %s with rank %d\n", rank, filename,
        targetFileRank);

    MPI_File outFile;
    MPI_File_open(
        fileComm, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY,
        MPI_INFO_NULL, &outFile);

    char bufToWrite[4];
    snprintf(bufToWrite, 4, "%3d", rank);
    MPI_File_write_at_all(
        outFile, targetFileRank * 3,
        bufToWrite, 3, MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&outFile);
    MPI_Finalize();
}
I can compile with mpicc file.c -lm and run, say, 20 processes with mpirun -np 20 a.out, and I get the expected output (four files with five entries each), but I'm unsure whether this is the technically correct/most optimal way of doing it. Is there anything I should do differently?
Your approach is correct. To clarify, we need to revisit the standard and the definitions. Here is the MPI_File_open API from MPI: A Message-Passing Interface Standard Version 2.2 (page 391):
int MPI_File_open(MPI_Comm comm, char *filename, int amode, MPI_Info info,
MPI_File *fh)
Description:
MPI_FILE_OPEN opens the file identified by the file name filename on all processes in
the comm communicator group. MPI_FILE_OPEN is a collective routine: all processes must
provide the same value for amode, and all processes must provide filenames that reference
the same file. (Values for info may vary.) comm must be an intracommunicator; it is
erroneous to pass an intercommunicator to MPI_FILE_OPEN.
intracommunicator vs intercommunicator (page 134):
For the purposes of this chapter, it is sufficient to know that there are two types
of communicators: intra-communicators and inter-communicators. An intracommunicator
can be thought of as an identifier for a single group of processes linked with a context. An
intercommunicator identifies two distinct groups of processes linked with a context.
The point of passing an intracommunicator to MPI_File_open() is to specify the set of processes that will perform operations on the file. The MPI runtime needs this information so that it can enforce appropriate synchronizations when collective I/O operations occur. It is the programmer's responsibility to understand the logic of the application and create/choose the correct intracommunicators.
MPI_Comm_split() is a powerful API that allows you to split a communicator's group into disjoint subgroups for different use cases, including MPI I/O.
I think it's probably a typo above, but it's the "_all" that signifies a collective operation.
The main point I wanted to make, however, was that the reason the collective operations are faster is that they enable the I/O system to aggregate data from many processes. You may issue 1000 writes from 1000 processes, but with the collective form this might be aggregated into a single large write to the file (rather than 1000 small writes). This is of course a best-case scenario, but the improvements can be dramatic - for access to a shared file I have seen collective I/O go 1000 times faster than non-collective, admittedly for more complicated IO patterns than this.
MPI_File_write_at_all should be the most efficient way to do this. Collective IO functions are typically fastest for large non-contiguous parallel writes to a shared file and the _all variant combines the seek and the write into one call.

Implementing the ls command in C

I'm trying to implement the ls command in C with as many flags as possible, but I'm having issues getting the correct minor and major numbers of files. Here's an example of what I did:
> ls -l ~/../../dev/tty
crw-rw-rw- 1 root tty 5, 0 Nov 25 13:30
This is the normal ls command; as you can see, the major is 5 and the minor is 0.
My program shows the following:
Minor: 6
Major: 0
I'm still a beginner so I don't really understand the issue here. This is what I've done so far (the program is not identical to the ls command yet, it only shows information about a file):
int disp_file_info(char **argv)
{
    struct stat sb;

    stat(argv[1], &sb);
    printf("Inode: %d\n", sb.st_ino);
    printf("Hard Links: %d\n", sb.st_nlink);
    printf("Size: %d\n", sb.st_size);
    printf("Allocated space: %d\n", sb.st_blocks);
    printf("Minor: %d\n", minor(sb.st_dev));
    printf("Major: %d\n", major(sb.st_dev));
    printf("UID: %d\n", sb.st_uid);
    printf("GID: %d\n", sb.st_gid);
}
For now this only obtains certain information about a file; everything seems to be correct when compared with the ls command except for the minor and major numbers.
You are using st_dev, which is the device on which the file resides. You want st_rdev, which is the device the special file "is"/represents. (You should first check whether the file is a device node, though.)
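A minimal sketch of that fix (my own illustration; it assumes Linux/glibc, where major() and minor() come from <sys/sysmacros.h>) could look like this:
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>  /* major()/minor() on Linux/glibc */

/* Print device numbers only for character/block device nodes. */
static void print_device_numbers(const char *path)
{
    struct stat sb;
    if (stat(path, &sb) != 0) {
        perror("stat");
        return;
    }
    if (S_ISCHR(sb.st_mode) || S_ISBLK(sb.st_mode)) {
        /* st_rdev identifies the device the node represents;
           st_dev is the device the file itself lives on. */
        printf("Major: %u\n", major(sb.st_rdev));
        printf("Minor: %u\n", minor(sb.st_rdev));
    } else {
        printf("%s is not a device node\n", path);
    }
}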

MPI_File_open: Can it be made to give up if it finds a file in use?

I have a situation in an MPI code where many processes will be reading many files and constructing their own domains by getting various pieces of data from various files. Most files will be read by several processes. Most processes will read from several files. I am trying to figure out a way to keep all processes active. I thought that I might try to write code so that each process will cycle through its list of files (determined at run time, impossible to determine before), try to open with MPI_File_open, then, if it sees its current file already in use, go on and try the next file. This cycle would continue until all data is read.
But is it possible to make MPI_File_open behave in this way? As far as I can tell, if MPI_File_open sees a file already in use, it just waits until it can open it. I haven't been able to find anything that changes this behavior.
It looks like you can pass an info object to MPI_File_open to specify how long to wait before moving on to a new file. This seems to be implementation dependent, but from the OpenMPI docs it seems the hint shared_file_timeout specifies how long to wait if the file is locked before returning MPI_ERR_TIMEDOUT. Something like this could work (I've only tested that this compiles/runs correctly when the file is not locked):
#include "mpi.h"
#include <stdio.h>
#include <sys/file.h>
int main( int argc, char *argv[] )
{
MPI_Fint handleA, handleB;
int rc, ec, rank;
MPI_File fh;
MPI_Info info;
//int fd = open("temp", O_CREAT | O_RDWR, 0666);
//int result = flock(fd, LOCK_EX);
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
MPI_Info_create( &info );
MPI_Info_set(info, "shared_file_timeout", "10.0");
ec = MPI_File_open( MPI_COMM_WORLD, "temp", MPI_MODE_RDONLY, info, &fh );
if (ec != MPI_SUCCESS) {
char estring[MPI_MAX_ERROR_STRING];
int len;
MPI_Error_string(ec, error_string, &len);
fprintf(stderr, "%3d: %s\n", rank, error_string);
} else{
fprintf(stderr, "%3d: %s\n", rank, "Success");
}
MPI_File_close( &fh );
MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_ARE_FATAL);
MPI_Finalize();
return 0;
}
A few notes: you probably need to set an MPI_Errhandler to ensure the MPI_ERR_TIMEDOUT error doesn't result in termination. I'm not sure how to make this portable across different MPI implementations; the standard does not seem to specify useful hints for this case, leaving it to implementers. For MPICH this does not work and just blocks endlessly (I can't see an option in MPICH to time out). Non-blocking file open is being considered among the advanced features of MPI-3, so probably not soon.
The other alternative is to simply check whether the file is locked in whatever language you are using and then open it with MPI only if it's not locked, along the lines of the sketch below.
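For example, a rough POSIX sketch of such a probe (my own illustration; it only helps if the processes using the file actually hold a flock() lock themselves, which MPI does not do for you, and the check-then-open sequence is inherently racy):
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Try a non-blocking shared lock; returns 1 if the file appears free,
   0 if another process currently holds a flock() lock on it. */
static int file_seems_free(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;                  /* cannot open it at all */
    if (flock(fd, LOCK_SH | LOCK_NB) != 0) {
        close(fd);
        return 0;                  /* someone else holds the lock */
    }
    flock(fd, LOCK_UN);            /* release immediately; we were only probing */
    close(fd);
    return 1;
}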

Writing on file using a different view for each mpi process

I'm trying to write different matrices to a file, each processed by a different MPI process, merging their content as described in the following image:
Is there any way I can obtain the desired output using a suitable MPI view?
To help answer the question, I attach a simple code where, compared with the previous image, the white columns also get included in the output file.
#include <mpi.h>

#define N 6

int main(int argc, char **argv) {
    double A[N*N];
    int mpi_rank, mpi_size;
    MPI_File file;
    MPI_Status status;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    MPI_File_open(MPI_COMM_WORLD, "test.dat", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &file);

    MPI_Datatype my_type;
    MPI_Type_vector(N, N, N*mpi_size, MPI_DOUBLE, &my_type);
    MPI_Type_commit(&my_type);

    MPI_Offset disp = mpi_rank*N*sizeof(double);
    MPI_File_set_view(file, disp, MPI_DOUBLE, my_type, "native", MPI_INFO_NULL);

    MPI_Datatype row_type;
    MPI_Type_contiguous(N, MPI_DOUBLE, &row_type);
    MPI_Type_commit(&row_type);

    MPI_File_write(file, A, N, row_type, &status);

    MPI_File_close(&file);
    MPI_Finalize();
    return 0;
}
I admire your industry in re-inventing MPI_TYPE_SUBARRAY but surely you could just do that instead of creating contig-of-vectors?
You are so close. In order to omit the ghost cells from your output, I would simply define a subarray memory type that does not describe them.
I think that could be done your way with N-1 instead of N as the vector block length... but just use subarray and make it clearer. Long ago, vectors like this were indeed idiomatic MPI, but MPI-2 introduced subarray types back in 1997, so I think you are ok using this "newfangled" feature.
You can indeed shift the file view with the offset parameter. You could also have every process's view start at displacement 0 and instead vary the subarray-in-file arguments: all processes would use the same "global array" sizes, and you'd adjust the start[] and possibly the count[] for each processor, as in the sketch below.
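A minimal sketch of that subarray-based view (my own illustration of the suggestion above, not code from the question; it assumes each rank holds a plain N x N block with no ghost columns in memory, and that the blocks sit side by side in the file):
#include <mpi.h>

#define N 6

int main(int argc, char **argv) {
    MPI_Init(NULL, NULL);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double A[N*N];                       /* this rank's N x N block */
    for (int i = 0; i < N*N; i++) A[i] = rank;

    /* The file holds an N x (N*size) global matrix; this rank owns the
       N x N block starting at column rank*N. */
    int gsizes[2] = { N, N * size };
    int lsizes[2] = { N, N };
    int starts[2] = { 0, rank * N };
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File file;
    MPI_File_open(MPI_COMM_WORLD, "test.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &file);
    /* Every rank uses displacement 0; the subarray type places the data. */
    MPI_File_set_view(file, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(file, A, N * N, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&file);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}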
