I'm running a simulation with thousands of MPI processes and need to write output data to a small set of files. For example, even though I might have 10,000 processes I only want to write out 10 files, with 1,000 writing to each one (at some appropriate offset). AFAIK the correct way to do this is to create a new communicator for the groups of processes that will be writing to the same files, open a shared file for that communicator with MPI_File_open(), and then write to it with MPI_File_write_at_all(). Is that correct? The following code is a toy example that I wrote up:
#include <mpi.h>
#include <math.h>
#include <stdio.h>
const int MAX_NUM_FILES = 4;
int main(){
MPI_Init(NULL, NULL);
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int numProcs;
MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
int numProcsPerFile = ceil(((double) numProcs) / MAX_NUM_FILES);
int targetFile = rank / numProcsPerFile;
MPI_Comm fileComm;
MPI_Comm_split(MPI_COMM_WORLD, targetFile, rank, &fileComm);
int targetFileRank;
MPI_Comm_rank(fileComm, &targetFileRank);
char filename[20]; // Sufficient for testing purposes
snprintf(filename, 20, "out_%d.dat", targetFile);
printf(
"Proc %d: writing to file %s with rank %d\n", rank, filename,
targetFileRank);
MPI_File outFile;
MPI_File_open(
fileComm, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY,
MPI_INFO_NULL, &outFile);
char bufToWrite[4];
snprintf(bufToWrite, 4, "%3d", rank);
MPI_File_write_at_all(
outFile, targetFileRank * 3,
bufToWrite, 3, MPI_CHAR, MPI_STATUS_IGNORE);
MPI_File_close(&outFile);
MPI_Finalize();
}
I can compile with mpicc file.c -lm and run, say, 20 processes with mpirun -np 20 a.out, and I get the expected output (four files with five entries each), but I'm unsure whether this is the technically correct/most optimal way of doing it. Is there anything I should do differently?
Your approach is correct. To clarify, let's revisit the standard and its definitions. The MPI_File_open API, from MPI: A Message-Passing Interface Standard, Version 2.2 (page 391):
int MPI_File_open(MPI_Comm comm, char *filename, int amode, MPI_Info info,
MPI_File *fh)
Description:
MPI_FILE_OPEN opens the file identified by the file name filename on all processes in
the comm communicator group. MPI_FILE_OPEN is a collective routine: all processes must
provide the same value for amode, and all processes must provide filenames that reference
the same file. (Values for info may vary.) comm must be an intracommunicator; it is
erroneous to pass an intercommunicator to MPI_FILE_OPEN.
intracommunicator vs intercommunicator (page 134):
For the purposes of this chapter, it is sufficient to know that there are two types
of communicators: intra-communicators and inter-communicators. An intracommunicator
can be thought of as an identifier for a single group of processes linked with a context. An
intercommunicator identifies two distinct groups of processes linked with a context.
The point of passing an intracommunicator to MPI_File_open() is to specify the set of processes that will perform operations on the file. The MPI runtime needs this information so it can enforce the appropriate synchronization when collective I/O operations occur. It is the programmer's responsibility to understand the logic of the application and to create/choose the correct intracommunicators.
MPI_Comm_split() is a powerful API that lets you split a communicator's group into disjoint subgroups for different use cases, including MPI I/O.
Just to be clear on the naming: it's the "_all" suffix that signifies a collective operation, while the "_at" part is what supplies the explicit file offset.
The main point I wanted to make, however, was that the reason the collective operations are faster is that they enable the I/O system to aggregate data from many processes. You may issue 1000 writes from 1000 processes, but with the collective form this might be aggregated into a single large write to the file (rather than 1000 small writes). This is of course a best-case scenario, but the improvements can be dramatic - for access to a shared file I have seen collective I/O go 1000 times faster than non-collective, admittedly for more complicated IO patterns than this.
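As an aside, the degree of aggregation can sometimes be steered through MPI-IO hints passed at open time. Below is a minimal sketch reusing fileComm and filename from the question's code; the hint names (cb_nodes, cb_buffer_size) are ROMIO-style hints, so they are implementation-specific assumptions and are silently ignored where unsupported:
MPI_Info info;
MPI_Info_create(&info);
/* ROMIO-style collective-buffering hints (implementation-specific; ignored if unknown):
   number of aggregator processes and the size of each collective buffer */
MPI_Info_set(info, "cb_nodes", "4");
MPI_Info_set(info, "cb_buffer_size", "16777216");
MPI_File outFile;
MPI_File_open(fileComm, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &outFile);
MPI_Info_free(&info);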
MPI_File_write_at_all should be the most efficient way to do this. Collective I/O functions are typically fastest for large non-contiguous parallel writes to a shared file; the _at part folds the seek and the write into one call, and the _all suffix is what makes it collective.
I am trying to portion out 1 million lines of float numbers to 16 different processes. For example,
process 0 needs to read between lines 1-62500 and
process 1 needs to read between lines 62501-125000 etc.
I have tried the following code, but every process reads the lines between 1-62500. How can I change the line interval for each process?
MPI_Init(NULL, NULL);
n=1000000/numberOfProcesses;
FILE *myFile;
myFile = fopen("input.txt","r");
i=0;
k = n+1;
while(k--){
fscanf(myFile,"%f",&input[i]);
i++;
}
fclose(myFile);
MPI_Finalize();
Assuming numberOfProcesses = 4 and numberOfLines = 16:
// n = 1000000/numberOfProcesses;  // replaced: the new n will be 4
n = numberOfLines/numberOfProcesses;
FILE *myFile;
myFile = fopen("input.txt","r");
i = 0;
k = n+1; // (5)
From your program, all processes will read the file from the same location or offset. What you need to do is to make each process read from their own specific line or offset. For example, rank 0 should read from 0, rank 1 from n, rank 2 from 2*n etc. Pass this as parameter to fseek.
n = numberOfLines/numberOfProcesses;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
file_start = n*rank;
fseek(myFile, file_start, SEEK_SET);
fseek will move to the given offset (file_start) within the file. Note that fseek takes a byte offset, so this maps directly onto line numbers only if every line has a fixed width; with the example values, file_start would be 0 for rank 0, 4 for rank 1, 8 for rank 2, and so on.
The while loop should also be modified accordingly; a minimal sketch of the per-rank read is below.
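Here is a minimal, self-contained sketch of that idea. Since the input is a text file whose lines can have different lengths, this version skips over the lower ranks' lines by reading them rather than by seeking to a byte offset; the file name and the total line count are assumptions taken from the question:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define TOTAL_LINES 1000000   /* assumed to be known in advance, as in the question */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, numberOfProcesses;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numberOfProcesses);

    int n = TOTAL_LINES / numberOfProcesses;
    float *input = malloc(n * sizeof(float));
    float dummy;
    int i;

    FILE *myFile = fopen("input.txt", "r");

    /* skip the lines owned by lower ranks; text lines can vary in length,
       so we skip by reading instead of calling fseek with a byte offset */
    for (i = 0; i < rank * n; i++)
        fscanf(myFile, "%f", &dummy);

    /* read this rank's block of n values */
    for (i = 0; i < n; i++)
        fscanf(myFile, "%f", &input[i]);

    fclose(myFile);
    free(input);
    MPI_Finalize();
    return 0;
}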
As #Gilles pointed out in comments, here we are explicitly assuming the number of lines in the file. This can lead to many issues.
To get scalability and parallel performance benefits, it is better to use MPI I/O, which offers great features for parallel file operations. MPI I/O was developed for exactly this kind of use case.
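For illustration, here is a hedged sketch of the MPI-IO version. It assumes the values have first been converted to a fixed-width binary file (called input.bin here), so every rank can compute its byte offset and read its block collectively with MPI_File_read_at_all:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int totalValues = 1000000;          /* total number of floats in the file */
    int n = totalValues / nprocs;             /* values per process */
    float *input = malloc(n * sizeof(float));

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "input.bin", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* each rank reads its own contiguous block, collectively */
    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(float);
    MPI_File_read_at_all(fh, offset, input, n, MPI_FLOAT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    printf("rank %d read %d values, first = %f\n", rank, n, input[0]);
    free(input);
    MPI_Finalize();
    return 0;
}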
I have a situation in an MPI code where many processes will be reading many files and constructing their own domains by getting various pieces of data from various files. Most files will be read by several processes. Most processes will read from several files. I am trying to figure out a way to keep all processes active. I thought that I might try to write code so that each process will cycle through its list of files (determined at run time, impossible to determine before), try to open with MPI_File_open, then, if it sees its current file already in use, go on and try the next file. This cycle would continue until all data is read.
But is it possible to make MPI_File_open behave in this way? As far as I can tell, if MPI_File_open sees a file already in use, it just waits until it can open it. I haven't been able to find anything that changes this behavior.
It looks like you can pass an info object to MPI_File_open to specify how long to wait before moving on to a new file. This seems to be implementation dependent, but from the Open MPI docs it appears the hint shared_file_timeout specifies how long to wait if the file is locked before returning MPI_ERR_TIMEDOUT. Something like this could work (I've only tested that this compiles/runs correctly when the file is not locked):
#include "mpi.h"
#include <stdio.h>
#include <sys/file.h>
int main( int argc, char *argv[] )
{
    int ec, rank;
MPI_File fh;
MPI_Info info;
//int fd = open("temp", O_CREAT | O_RDWR, 0666);
//int result = flock(fd, LOCK_EX);
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
MPI_Info_create( &info );
MPI_Info_set(info, "shared_file_timeout", "10.0");
ec = MPI_File_open( MPI_COMM_WORLD, "temp", MPI_MODE_RDONLY, info, &fh );
if (ec != MPI_SUCCESS) {
        char error_string[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(ec, error_string, &len);
        fprintf(stderr, "%3d: %s\n", rank, error_string);
} else{
fprintf(stderr, "%3d: %s\n", rank, "Success");
}
MPI_File_close( &fh );
MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_ARE_FATAL);
MPI_Finalize();
return 0;
}
A few notes: you probably need to set the error handler to MPI_ERRORS_RETURN to ensure the MPI_ERR_TIMEDOUT error doesn't result in termination. I'm not sure how to make this portable across different MPI implementations; the standard does not seem to specify useful hints for this case, leaving it to implementers. For MPICH this does not work and just blocks endlessly (I can't see an option in MPICH to time out). Non-blocking file open is being considered among the advanced features of MPI-3, so probably not soon.
The other alternative is to simply check whether the file is locked, in whatever language you are using, and then open it with MPI only if it's not locked. A heuristic sketch of such a check follows.
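As a rough sketch of that check on POSIX systems: ask the kernel whether anyone currently holds an fcntl lock on the file. Whether your MPI implementation actually takes such a lock on open is implementation-specific, so treat a "not locked" answer as a hint only. Also, since MPI_File_open is collective, one rank (say rank 0) should run the check and broadcast the decision before all ranks call the open:
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* heuristic: report whether some other process holds a POSIX (fcntl) lock on the file */
static int file_is_locked(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;                          /* cannot open: report "not locked" here */

    struct flock fl;
    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;                   /* ask: "could we take a write lock right now?" */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                          /* 0 means "the whole file" */
    fcntl(fd, F_GETLK, &fl);
    close(fd);

    return fl.l_type != F_UNLCK;           /* F_UNLCK back means nobody holds a lock */
}

int main(void)
{
    printf("temp locked: %d\n", file_is_locked("temp"));
    return 0;
}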
I'm trying to write different matrices to a file, each processed by a different MPI process, merging their content as described in the following image:
Is there any way I can obtain the desired output using a suitable MPI file view?
To help answering the question I attach a simple code where, with respect to
the previous image, the white columns also get included in the output file.
#include <mpi.h>
#define N 6
int main(int argc, char **argv) {
double A[N*N];
int mpi_rank, mpi_size;
MPI_File file;
MPI_Status status;
MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
MPI_File_open(MPI_COMM_WORLD, "test.dat", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &file);
MPI_Datatype my_type;
MPI_Type_vector(N, N, N*mpi_size, MPI_DOUBLE, &my_type);
MPI_Type_commit(&my_type);
MPI_Offset disp = mpi_rank*N*sizeof(double);
MPI_File_set_view(file, disp, MPI_DOUBLE, my_type, "native", MPI_INFO_NULL);
MPI_Datatype row_type;
MPI_Type_contiguous(N, MPI_DOUBLE, &row_type);
MPI_Type_commit(&row_type);
MPI_File_write(file, A, N, row_type, &status);
MPI_File_close(&file);
MPI_Finalize();
return 0;
}
I admire your industry in re-inventing MPI_Type_create_subarray, but surely you could just use that instead of creating a contig-of-vectors?
You are so close. In order to omit the ghost cells from your output, I would simply define a subarray memory type that does not describe them.
I think that could be done your way, with N-1 instead of N as the vector block length... but just use subarray and make it clearer. Long ago, vectors like this were indeed idiomatic MPI, but MPI-2 introduced subarray types back in 1997, so I think you are OK using this "newfangled" feature.
You can indeed shift the file view with the disp parameter. You could also have every rank's view start at displacement 0 and instead vary the subarray-in-file arguments: all processes would describe the same "global array" sizes, and you'd adjust the starts (and possibly the subsizes) for each process. A sketch along those lines follows.
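To make that concrete, here is a minimal sketch using MPI_Type_create_subarray for both the memory layout and the file view. The ghost-column width G, the initialization of A, and the file name are assumptions for illustration, not taken from the question:
#include <mpi.h>
#define N 6
#define G 1                       /* assumed ghost-column width on each side */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* local array: N rows, N interior columns plus G ghost columns per side */
    double A[N][N + 2*G];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N + 2*G; j++)
            A[i][j] = rank;

    /* memory type: the N x N interior block, skipping the ghost columns */
    MPI_Datatype mem_type;
    int mem_sizes[2]  = {N, N + 2*G};
    int mem_subsz[2]  = {N, N};
    int mem_starts[2] = {0, G};
    MPI_Type_create_subarray(2, mem_sizes, mem_subsz, mem_starts,
                             MPI_ORDER_C, MPI_DOUBLE, &mem_type);
    MPI_Type_commit(&mem_type);

    /* file type: this rank's N x N block inside the N x (N*size) global array */
    MPI_Datatype file_type;
    int f_sizes[2]  = {N, N * size};
    int f_subsz[2]  = {N, N};
    int f_starts[2] = {0, N * rank};
    MPI_Type_create_subarray(2, f_sizes, f_subsz, f_starts,
                             MPI_ORDER_C, MPI_DOUBLE, &file_type);
    MPI_Type_commit(&file_type);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "test.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, file_type, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, A, 1, mem_type, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&mem_type);
    MPI_Type_free(&file_type);
    MPI_Finalize();
    return 0;
}
Each rank writes one instance of mem_type (its N x N interior) into its own block of the N x (N*size) global array described by file_type, so the ghost columns never reach the file.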
The problem: I have a few text files (10) with a number on every line. I need to have them split across some threads I create using the pthread library. These worker threads are to find the largest prime number sent to them (and, overall, the largest prime from all of the text files).
My current thoughts on a solution: I am thinking of having two arrays: one holding all of the text files, and the other a binary file from which I can read, say, 1000 lines at a time, passing each thread a struct containing an id, a file pointer, and a file position, and letting it crank through that.
A little bit of what I am talking about:
pthread_create(&threads[index],NULL,workerThread,(void *)threadFields[index]);//Pass struct to each worker
Struct:
typedef struct threadFields{
int *id, *position;
FILE *Fin;
}tField;
If anyone has any insight or a better solution, it would be greatly appreciated.
EDIT:
Okay so I found a solution to my problem and I believe it is similar to what SaveTheRbtz suggested. Here is what I implemented:
I took the files and merged them into one binary file and kept track of it in the loop (I had to account for how many bytes each entry was; this was hard-coded):
struct threadFields *info = threadStruct;
int index;
int id = info->id;
unsigned int currentNum = 0;
int Seek = info->StartPos;
unsigned int localLargestPrime = 0;
char *buffer = malloc(50);
int isPrime = 0;
while(Seek<info->EndPos){
for(index = 0; index < 1000; index++){//Loop 1000 times
fseek(fileOut,Seek*sizeof(char)*20, SEEK_SET);
fgets(buffer,20,fileOut);
Seek++;
currentNum = atoi(buffer);
if(currentNum>localLargestPrime && currentNum > 0){
isPrime = ChkPrim(currentNum);
if( isPrime == 1)
localLargestPrime = currentNum;
}
        }
    }
Could you do ten threads, each of which processes a file specified as an argument? Each thread reads its own file, checking whether each value is larger than the largest prime it has recorded so far and, if so, checking whether the new number is prime. Then, when it's finished, it can return the prime to the coordinator thread. The coordinator thread sits back and waits for the threads to finish, collecting the largest prime from each thread and keeping only the largest. You can probably use 0 as a sentinel value to indicate 'no primes found (yet)'. A minimal sketch of this scheme follows.
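A minimal sketch of that one-thread-per-file scheme, assuming the files are passed as command-line arguments and each line holds one unsigned integer; the per-thread maximum is handed back through pthread_join's value pointer:
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int is_prime(unsigned int n)
{
    if (n < 2) return 0;
    for (unsigned int d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

static void *worker(void *arg)
{
    const char *path = arg;
    unsigned int largest = 0, value;     /* 0 is the sentinel for "no prime found yet" */
    FILE *f = fopen(path, "r");
    if (!f) return (void *)(uintptr_t)0;

    while (fscanf(f, "%u", &value) == 1)
        if (value > largest && is_prime(value))
            largest = value;

    fclose(f);
    return (void *)(uintptr_t)largest;   /* per-file result returned via pthread_join */
}

int main(int argc, char **argv)
{
    int nfiles = argc - 1;               /* one thread per file named on the command line */
    pthread_t *tid = malloc(nfiles * sizeof(pthread_t));
    unsigned int overall = 0;

    for (int i = 0; i < nfiles; i++)
        pthread_create(&tid[i], NULL, worker, argv[i + 1]);

    for (int i = 0; i < nfiles; i++) {   /* the coordinator just waits and keeps the max */
        void *res;
        pthread_join(tid[i], &res);
        if ((unsigned int)(uintptr_t)res > overall)
            overall = (unsigned int)(uintptr_t)res;
    }

    printf("largest prime: %u\n", overall);
    free(tid);
    return 0;
}
With ten files you would run it as ./a.out file0.txt ... file9.txt, compiled with -pthread.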
Let's say I wanted 11 threads instead of 10; how would I split the workload then?
I'd have the 11th thread do pthread_exit() immediately. If you want to make coordination problems for yourself, you can, but why make life harder than you have to?
If you absolutely must have 11 threads process 10 files and divvy up the work, then I suppose I would probably have a set of 10 file streams initially in a queue. The threads would wait on a 'queue not empty' condition to get a file stream (mutexes and condition variables and all that). When a thread acquires a file stream, it would read one number from the file and push the stream back onto the queue (signalling 'queue not empty'), then process the number. On EOF, a thread would close the file and not push it back onto the queue (so the threads have to detect 'no file streams left with unread data'). This means that each thread would read about one eleventh of the data, depending on how long the prime calculation takes for the numbers it actually reads. That's much, much, much trickier to code than a simple one-thread-per-file solution, but it scales (more or less) to an arbitrary number of threads and files. In particular, it could be used to have 7 threads process 10 files, as well as 17 threads process 10 files.
Looks like a job for a message queue:
A set of "supplier" threads that split the data into chunks and put them on the queue. In your case a chunk can be represented by a file name or an (fd, offset, size) tuple. For simplicity there can be one such supplier.
A number of "worker" threads that pull data from the input queue, process it, and put results on another queue. For performance reasons there are usually many workers; for example, if your task is CPU-intensive then sysconf(_SC_NPROCESSORS_ONLN) should be a good choice.
One "aggregator" thread that "reduces" the result queue to a single value. For your case that's a simple max() function.
This highly scalable solution lets you combine many different kinds of processing stages into an easily understandable pipeline. A sketch of this scheme appears below.
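Here is a sketch of that pipeline, simplified for illustration: one chunk per input file, the supplier role played by the main thread, and the aggregator folded into a mutex-protected running maximum rather than a separate result queue. NWORKERS, QSIZE, and the file-name chunks are assumptions, not requirements:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define QSIZE 16
#define NWORKERS 4            /* e.g. sysconf(_SC_NPROCESSORS_ONLN) in real code */

/* a chunk of work; here simply a file name, but it could be an (fd, offset, size) tuple */
typedef struct { const char *path; } chunk_t;

/* a small bounded queue protected by a mutex and two condition variables */
static chunk_t queue[QSIZE];
static int q_head = 0, q_tail = 0, q_count = 0, q_done = 0;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t q_not_full = PTHREAD_COND_INITIALIZER;

/* the "aggregator": a mutex-protected running maximum */
static unsigned int overall_max = 0;
static pthread_mutex_t max_lock = PTHREAD_MUTEX_INITIALIZER;

static void q_push(chunk_t c)
{
    pthread_mutex_lock(&q_lock);
    while (q_count == QSIZE)
        pthread_cond_wait(&q_not_full, &q_lock);
    queue[q_tail] = c;
    q_tail = (q_tail + 1) % QSIZE;
    q_count++;
    pthread_cond_signal(&q_not_empty);
    pthread_mutex_unlock(&q_lock);
}

static int q_pop(chunk_t *c)            /* returns 0 when the supplier is finished */
{
    pthread_mutex_lock(&q_lock);
    while (q_count == 0 && !q_done)
        pthread_cond_wait(&q_not_empty, &q_lock);
    if (q_count == 0) { pthread_mutex_unlock(&q_lock); return 0; }
    *c = queue[q_head];
    q_head = (q_head + 1) % QSIZE;
    q_count--;
    pthread_cond_signal(&q_not_full);
    pthread_mutex_unlock(&q_lock);
    return 1;
}

static int is_prime(unsigned int n)
{
    if (n < 2) return 0;
    for (unsigned int d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

static void *worker(void *arg)
{
    chunk_t c;
    (void)arg;
    while (q_pop(&c)) {                 /* pull a chunk, process it, fold into the result */
        unsigned int v, local = 0;
        FILE *f = fopen(c.path, "r");
        if (!f) continue;
        while (fscanf(f, "%u", &v) == 1)
            if (v > local && is_prime(v)) local = v;
        fclose(f);
        pthread_mutex_lock(&max_lock);
        if (local > overall_max) overall_max = local;
        pthread_mutex_unlock(&max_lock);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t tid[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);

    for (int i = 1; i < argc; i++)      /* the supplier: push one chunk per file */
        q_push((chunk_t){ argv[i] });

    pthread_mutex_lock(&q_lock);        /* tell the workers nothing more is coming */
    q_done = 1;
    pthread_cond_broadcast(&q_not_empty);
    pthread_mutex_unlock(&q_lock);

    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    printf("largest prime: %u\n", overall_max);
    return 0;
}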
I have a C application which generates a lot of output and for which speed is critical. The program is basically a loop over a large (8-12 GB) binary input file which must be read sequentially. In each iteration the read bytes are processed and output is generated and written to multiple files, but never to multiple files at the same time: if there are 4 output files, at any point where output is generated you write to either file 0, 1, 2, or 3. At the end of the iteration I currently write the output using fwrite(), thereby waiting for the write operation to finish. The total number of output operations is large, up to 4 million per file, and the output files range in size from 100 MB to 3.5 GB. The program runs on a basic multicore processor.
I want to write output in a separate thread and I know this can be done with
Asynchronous I/O
Creating threads
I/O completion ports
I have two types of questions, namely conceptual and code-specific.
Conceptual Question
What would be the best approach? Note that the application should be portable to Linux; however, I don't see how that would matter much for my choice among 1-3, since I would write a wrapper around anything kernel/API-specific. For me the most important criterion is speed. I have read that option 1 is not that likely to increase the performance of the program, and that the kernel in any case creates new threads for the I/O operation, so why not use option (2) immediately, with the advantage that it seems easier to program (also since I did not succeed with option (1), see the code issues below)?
Note that I read https://stackoverflow.com/questions/3689759/how-can-i-run-a-specific-function-of-thread-asynchronously-in-c-c, but I don't see a motivation for what to use based on the nature of the application. So I hope somebody could give me some advice on what would be best in my situation. Also, from the book "Windows System Programming" by Johnson M. Hart, I know that the recommendation is to use threads, mainly because of the simplicity. However, will it also be fastest?
Code Question
This question involves the attempts I have made so far to make asynchronous I/O work. I understand that it's a big piece of code and not that easy to look into. In any case, I would really appreciate any attempt.
To decrease execution time I am trying to write the output in the background using the Win32 API, via CreateFile() with FILE_FLAG_OVERLAPPED and an OVERLAPPED structure. I have created a sample program in which I try to get this to work. However, I encountered two problems:
The file is only opened in overlapped mode when I delete an already existing file (I have tried using CreateFile in different modes (CREATE_ALWAYS, CREATE_NEW, OPEN_EXISTING), but this does not help).
Only the first WriteFile is executed asynchronously; the remaining WriteFile calls are synchronous. For this problem I already consulted http://support.microsoft.com/kb/156932. It seems that the problem is related to the fact that "any write operation to a file that extends its length will be synchronous". I've already tried to solve this by increasing the file size/valid data size (the commented region in the code), but I still cannot get it to work. I'm aware that, to get the most out of asynchronous I/O, I should perhaps CreateFile with FILE_FLAG_NO_BUFFERING, but I cannot get that to work either.
Please note that the program creates a file of about 120 MB in the directory of execution. Also note that the "not ok" print statements are not desirable; I would like to see "can do work in background" appear on my screen... What goes wrong here?
#include <windows.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ASYNC // remove this definition to run synchronously (i.e. using fwrite)
#ifdef ASYNC
OVERLAPPED *pOverlapped;
HANDLE pFile;
#else
FILE *pFile;
#endif
#define DIM_X 100
#define DIM_Y 150000
#define _PRINTERROR(msgs)\
{printf("file: %s, line: %d, %s",__FILE__,__LINE__,msgs);\
fflush(stdout);\
return 0;} \
#define _PRINTF(msgs)\
{printf(msgs);\
fflush(stdout);} \
#define _START_TIMER \
time_t time1,time2; \
clock_t clock1; \
time(&time1); \
printf("start time: %s",ctime(&time1)); \
fflush(stdout);
#define _END_TIMER\
time(&time2);\
clock1 = clock();\
printf("end time: %s",ctime(&time2));\
printf("elapsed processor time: %.2f\n",(((float)clock1)/CLOCKS_PER_SEC));\
fflush(stdout);
double aio_dat[DIM_Y] = {0};
double do_compute(double A,double B, int arr_len);
int main()
{
_START_TIMER;
const char *pName = "test1.bin";
DWORD dwBytesToWrite;
BOOL bErrorFlag = FALSE;
int j=0;
int i=0;
int fOverlapped=0;
#ifdef ASYNC
// create / open the file
pFile=CreateFile(pName,
GENERIC_WRITE, // open for writing
0, // share write access
NULL, // default security
CREATE_ALWAYS, // create new/overwrite existing
FILE_FLAG_OVERLAPPED, // | FILE_FLAG_NO_BUFFERING, // overlapped file
NULL); // no attr. template
// check whether file opening was ok
if(pFile==INVALID_HANDLE_VALUE){
printf("%x\n",GetLastError());
_PRINTERROR("file not opened properly\n");
}
// make the overlapped structure
pOverlapped = calloc(1,sizeof(struct _OVERLAPPED));
pOverlapped->Offset = 0;
pOverlapped->OffsetHigh = 0;
// put event handle in overlapped structure
if(!(pOverlapped->hEvent = CreateEvent(NULL,TRUE,FALSE,NULL))){
printf("%x\n",GetLastError());
_PRINTERROR("error in createevent\n");
}
#else
pFile = fopen(pName,"wb");
#endif
// create some output
for(j=0;j<DIM_Y;j++){
aio_dat[j] = do_compute(i, j, DIM_X);
}
// determine how many bytes should be written
dwBytesToWrite = (DWORD)sizeof(aio_dat);
for(i=0;i<DIM_X;i++){ // do this DIM_X times
#ifdef ASYNC
//if(i>0){
//SetFilePointer(pFile,dwBytesToWrite,NULL,FILE_CURRENT);
//if(!(SetEndOfFile(pFile))){
// printf("%i\n",pFile);
// _PRINTERROR("error in set end of file\n");
//}
//SetFilePointer(pFile,-dwBytesToWrite,NULL,FILE_CURRENT);
//}
// write the bytes
if(!(bErrorFlag = WriteFile(pFile,aio_dat,dwBytesToWrite,NULL,pOverlapped))){
// check whether io pending or some other error
if(GetLastError()!=ERROR_IO_PENDING){
printf("lasterror: %x\n",GetLastError());
_PRINTERROR("error while writing file\n");
}
else{
fOverlapped=1;
}
}
else{
// if you get here output got immediately written; bad!
fOverlapped=0;
}
if(fOverlapped){
// do background, this msgs is what I want to see
for(j=0;j<DIM_Y;j++){
aio_dat[j] = do_compute(i, j, DIM_X);
}
for(j=0;j<DIM_Y;j++){
aio_dat[j] = do_compute(i, j, DIM_X);
}
_PRINTF("can do work in background\n");
}
else{
// not overlapped, this message is bad
_PRINTF("not ok\n");
}
// wait to continue
if((WaitForSingleObject(pOverlapped->hEvent,INFINITE))!=WAIT_OBJECT_0){
_PRINTERROR("waiting did not succeed\n");
}
// reset event structure
if(!(ResetEvent(pOverlapped->hEvent))){
printf("%x\n",GetLastError());
_PRINTERROR("error in resetevent\n");
}
pOverlapped->Offset+=dwBytesToWrite;
#else
fwrite(aio_dat,sizeof(double),DIM_Y,pFile);
for(j=0;j<DIM_Y;j++){
aio_dat[j] = do_compute(i, j, DIM_X);
}
for(j=0;j<DIM_Y;j++){
aio_dat[j] = do_compute(i, j, DIM_X);
}
#endif
}
#ifdef ASYNC
CloseHandle(pFile);
free(pOverlapped);
#else
fclose(pFile);
#endif
_END_TIMER;
return 1;
}
double do_compute(double A,double B, int arr_len)
{
int i;
double res = 0;
double *xA = malloc(arr_len * sizeof(double));
double *xB = malloc(arr_len * sizeof(double));
if ( !xA || !xB )
abort();
for (i = 0; i < arr_len; i++) {
xA[i] = sin(A);
xB[i] = cos(B);
res = res + xA[i]*xA[i];
}
free(xA);
free(xB);
return res;
}
Useful links
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/cref_cls/common/cppref_asynchioC_aio_read_write_eg.htm
http://www.ibm.com/developerworks/linux/library/l-async/?ca=dgr-lnxw02aUsingPOISIXAIOAPI
http://www.flounder.com/asynchexplorer.htm#Asynchronous%20I/O
I know this is a big question and I would like to thank everybody in advance who takes the trouble reading it and perhaps even respond!
You should be able to get this to work using the OVERLAPPED structure.
You're on the right track: the system is preventing you from writing asynchronously because every WriteFile extends the size of the file. However, you're doing the file-size extension wrong. Simply calling SetEndOfFile does not mark the extended region as valid data, so writes into it are still forced to be synchronous. Use the SetFileValidData function as well. This will allocate clusters for your file (note that they will contain whatever garbage the disk had there), and you should then be able to execute WriteFile and your computation in parallel.
I would stay away from FILE_FLAG_NO_BUFFERING. You're after more performance with parallelism I presume? Don't prevent the cache from doing its job.
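For illustration, here is a hedged sketch of that pre-extension step, reusing pFile, dwBytesToWrite, DIM_X and the _PRINTERROR macro from the question's program. Note that SetFileValidData only succeeds if the process holds (and has enabled) the SE_MANAGE_VOLUME_NAME privilege, and the newly valid region exposes whatever old data was in those clusters:
/* pre-extend the file and mark the range as valid so that later overlapped
   WriteFile calls do not have to extend the file synchronously */
LARGE_INTEGER fileSize;
fileSize.QuadPart = (LONGLONG)dwBytesToWrite * DIM_X;   /* total output size, known up front here */

if (!SetFilePointerEx(pFile, fileSize, NULL, FILE_BEGIN) || !SetEndOfFile(pFile))
    _PRINTERROR("could not extend file\n");

if (!SetFileValidData(pFile, fileSize.QuadPart))
    _PRINTERROR("SetFileValidData failed (privilege not held?)\n");

/* no need to move the file pointer back: each overlapped WriteFile
   takes its offset from pOverlapped->Offset */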
Another option that you did not consider is a memory mapped file. Those are available on Windows and Linux. There is a handy Boost abstraction that you could use.
With a memory mapped file, every thread in your process could write its output to the file on its own time, assuming that the record sizes are known and each thread has its own output area.
The operating system will take care of writing the mapped pages to disk when needed or when it gets around to it or when you close the file. Maybe when you close the file. Now that I think about it, some operating systems may require that you call msync to guarantee it.
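As a rough illustration of the memory-mapped approach on the POSIX side (the Windows equivalent would use CreateFileMapping/MapViewOfFile), here is a minimal sketch; the per-writer region size and the file name are assumptions, and the per-thread writes are stood in for by a plain loop:
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NREGIONS 4            /* one pre-assigned region per writer (thread) */
#define REGION_SIZE 4096      /* assumed fixed output area per writer */

int main(void)
{
    size_t total = (size_t)NREGIONS * REGION_SIZE;
    int fd = open("out.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || ftruncate(fd, (off_t)total) != 0) {
        perror("open/ftruncate");
        return 1;
    }
    char *base = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* each writer (here just a loop standing in for threads) fills its own region */
    for (int t = 0; t < NREGIONS; t++) {
        char msg[64];
        int len = snprintf(msg, sizeof msg, "record from writer %d\n", t);
        memcpy(base + (size_t)t * REGION_SIZE, msg, (size_t)len);
    }
    msync(base, total, MS_SYNC);   /* optional: force the pages to disk now */
    munmap(base, total);
    close(fd);
    return 0;
}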
I don't see why you would want to write asynchronously. Doing things in parallel does not make them faster in all cases. If you write two files at the same time to the same disk, it will almost always be a lot slower. If that is the case, just write them one after another.
If you have a fancy drive like an SSD or a virtual RAM drive, parallel writing could be faster. You would have to create the file at its full size and then do your parallel magic.
Asynchronous writing is nice, but is done by any OS anyway. The potential gain for you is that you can do other things than writing to disk, like displaying a progress bar. This is where multi-threading can help you.
So IMHO you should use serial writing, or parallel writing to multiple disks.
hth