In my software I read information from a stream X (stdout of another process) with process 1, then I send the information read to the other N-1 processes and finally I collected in process 1 all data elaborated by the N processes.
Now my question is: "What's the most efficient way to share the information read from the stream between processes?"
PS. Processes may also be in different computer connected through a network.
Here I list some possibilities:
Counting lines of stream (M lines), save to N files M/N
lines and send to each process 1 file.
Counting lines of stream (M lines), allocate enough memory to contain all information, send to each process directly the information.
But I think that these can be some problem:
Writing so much files can be an overhead and sending files over a network isn't efficient at all.
I need enough memory in process 1, so that process can be a bottleneck .
What do you suggest, do you have better ideas?
I'm using MPI on C to make this computation.
Using files is just fine if performance is not an issue. The advantage is, that you keep everything modular with the files as a decoupled interface. You can even use very simple command line tools:
./YOUR_COMMAND > SPLIT_ALL
split -n l/$(N) -d SPLIT_ALL SPLIT_FILES
Set N in your shell or replace appropriately.
Note: Unfortunately you cannot pipe directly into split in this case, because it then cannot determine the total number of lines when reading from stdin. If round robin, rather than contiguous split is fine, you can pipe directly:
./YOUR_COMMAND | split -n r/$(N) -d - SPLIT_FILES
You second solution is also fine - if you have enough memory. Keep in mind to use appropriate collective operations, e.g. MPI_Scatter(v) for sending, and MPI_Gather or MPI_Reduce for receiving the data from the clients.
If you run out of memory, then buffer the input in chunks (of for instance 100,000 lines), and then scatter the chunks to your workers, compute, collect the result, and repeat.
Related
I've read that pipes need to have a limited capacity. But I don't understand why. What happens if a process writes into a pipe without a limit?
It's due to buffering. Pipes are not "magical", pipes do not ensure all processes process each individual byte or character in lockstep. Instead pipes buffer inter-process output and then pass the buffer along. And this buffer size limit is what you're referring to. In many Linux distros and in macOS the buffer size is 64KiB.
Imagine there's a process that outputs 1GB of data every second to stdout - and it's piped to another process that can only process 100 bytes of data every minute on stdin - consider that those gigabytes of data have to go somewhere. If there was an infinitely sized buffer than you would quickly fill up the memory-space of whatever OS component owns the pipe and then start paging out to disk - and then your pagefile on disk would fill up - and that's not good.
By having maximum buffer sizes, the output process will be notified when it's filled the buffer and it's free to handle that event however is appropriate (e.g. by pausing output if it's a random number generator, by dropping data if it's a network monitor, by crashing, etc).
Internal mechanisms aside, I suspect the root issue behind the question is one of terminology. Pipes have limited capacity, but unlimited overall volume of data transferred.
The analogy to a piece of physical plumbing is pretty good: a given piece of water pipe has a characteristic internal volume defined by its length, its shape, and the cross section of its interior. At any given time, it cannot hold any more water than fits in that volume, so if you close a valve at its downstream end then water eventually (maybe immediately) stops flowing into its other end because all the available space within -- the pipe's capacity -- is full. Until and unless the pipe is permanently closed, however, there is no bound on how much water may be able traverse it over its lifetime.
In my parallel program, there was a big matrix. Each process computed and stored a part of it. Then the program wrote the matrix to a file by letting each process wrote its own part of the matrix in the correct order. The output file is in "unformatted" form. But when I tried to read the file in a serial code (I have the correct size of the big matrix allocated), I got an error which I don't understand.
My question is: in an MPI program, how do you get a binary file as the serial version output for a big matrix which is stored by different processes?
Here is my attempt:
if(ThisProcs == RootProcs) then
open(unit = file_restart%unit, file = file_restart%file, form = 'unformatted')
write(file_restart%unit)psi
close(file_restart%unit)
endif
#ifdef USEMPI
call mpi_barrier(mpi_comm_world,MPIerr)
#endif
do i = 1, NProcs - 1
if(ThisProcs == i) then
open(unit = file_restart%unit, file = file_restart%file, form = 'unformatted', status = 'old', position = 'append')
write(file_restart%unit)psi
close(file_restart%unit)
endif
#ifdef USEMPI
call mpi_barrier(mpi_comm_world,MPIerr)
#endif
enddo
Psi is the big matrix, it is allocated as:
Psi(N_lattice_points, NPsiStart:NPsiEnd)
But when I tried to load the file in a serial code:
open(2,file=File1,form="unformatted")
read(2)psi
forrtl: severe (67): input statement requires too much data, unit 2 (I am using MSVS 2012+intel fortran 2013)
How can I fix the parallel part to make the binary file readable for the serial code? Of course one can combine them into one big matrix in the MPI program, but is there an easier way?
Edit 1
The two answers are really nice. I'll use access = "stream" to solve my problem. And I just figured I can use inquire to check whether the file is "sequential" or "stream".
This isn't a problem specific to MPI, but would also happen in a serial program which took the same approach of writing out chunks piecemeal.
Ignore the opening and closing for each process and look at the overall connection and transfer statements. Your connection is an unformatted file using sequential access. It's unformatted because you explicitly asked for that, and sequential because you didn't ask for anything else.
Sequential file access is based on records. Each of your write statements transfers out a record consisting of a chunk of the matrix. Conversely, your input statement attempts to read from a single record.
Your problem is that while you try to read the entire matrix from the first record of the file that record doesn't contain the whole matrix. It doesn't contain anything like the correct amount of data. End result: "input statement requires too much data".
So, you need to either read in the data based on the same record structure, or move away from record files.
The latter is simple, use stream access
open(unit = file_restart%unit, file = file_restart%file, &
form = 'unformatted', access='stream')
Alternatively, read with a similar loop structure:
do i=1, NPROCS
! read statement with a slice
end do
This of course requires understanding the correct slicing.
Alternatively, one can consider using MPI-IO for output, which is very similar to using stream output. Read this back in with stream access. You can find about this concept elsewhere on SO.
Fortran unformatted sequential writes in record files are not quite completely raw data. Each write will have data before and after the record in a processor dependent form. The size of your reads cannot exceed the record size of your writes. This means if psi is written in two writes, you will need to read it back in two reads, you cannot read it in at once.
Perhaps the most straightforward option is to instead use stream access instead of sequential. A stream file is indexed by bytes (generally) and does not contain record start and end information. Using this access method you can split the write but read all at once. Stream access is a feature of Fortran 2003.
If you stick with sequential access, you'll need to know how many MPI ranks wrote the file and loop over properly sized records to read the data as it was written. You could make the user specify the number of ranks or store that as the first record in the file and read that first to determine how to read the rest of the data.
If you are writing MPI, why not MPI-IO? Each process will call MPI_File_set_view to set a subarray view of the file, then each process can collectively write the data with MPI_FILE_WRITE_ALL . This approach is likely to scale really well on big machines (though your approach will be fine up to oh, maybe 100 processors.)
I 'd like to write a large array in c in a .csv file.
Would it be possible to write it in parallel?
maybe using a OpenMP ?
The piece of code I'd like to parallelize is a typical IO operation in a file.
Given a resutVector1 and a resultVector2 of size n,
fp=fopen("output.csv","w+");
for(i=0;i<n;i++){
fprintf(fp,"%f,%f\n",resultVector1[i],resultVector2[i]);
}
fclose(fp);
You are going to run into a number of problems trying to perform a parallel write to a single file.
w+ truncates an existing file to 0 length before the write operations or creates a new file, How are you going to coordinate the writing of parallel file pointers?
In any case if you have multiple writers, you will need to synchronize them and you will lose any speed advantage you would have had over a sequential write. In fact, they will probably be slower due to the synchronization overhead than a single dedicated sequential write thread.
Thinking about your question a bit more. If you really had a huge array, say 500 million integers and you really needed the fastest way to read/write this array to a persistent file. You could divide the array by the number of dedicated threads you can allocate, write each segment to a separate file. You can then read this array back into your array by doing a parallel read of this data. In this case you can use a Parallel For type of pattern and avoid the synchronization lock overhead you have with a single file.
So in the example I gave, if you have 4 threads, you will divide the array inter quarters where each thread will write/read its own quarter to and from its separate file.
Note: if all the files are on the same disk drive you may have some I/O slowdown do the multiple simultaneous read/write operations going on at different parts of the disk. This effect can be mediated if you are able to save each file to a different disk/server.
You could open 2 files and write each vector in its own file, this MIGHT help but I won't bet on it, it would depend on the architecture of your platform I think. Plus if you need both in the same file you still have to copy it together, which again takes time.
Also the writes to the harddrive itself are probably the bottleneck here so there is no need to speed up the way you fill up the buffer to the harddrive.
You might open two files on two different harddrives, but I still doubt this would give you a real speed up.
The question triggered me to write pread, a parallel read method implemented using pthread library. Given the file size FILESIZE and the number of threads n, pread method will slice the input file into roughly equal chunks of size FILESIZE/n and assign each chunk to a thread. Then, each thread starts reading the file using fread from different offsets of file with predefined BUFFFERSIZE in parallel. You can find the implementation here.
This is an ongoing implementation, I'm still working on parallel write side.
I am writing a multi-threaded application and as of now I have this idea. I have a FILE*[n] where n is a number determined at runtime. I open all the n files for reading and then multiple threads can access to read it. The computation on the data of each file is equivalent i.e. if serial execution is supposed then each file will remain in memory for the same time.
Each files can be arbitrarily large so on should not assume that they can be loaded in memory.
Now in such a scenario I want to reduce the number of disk IO's that occur. It would be great if someone can suggest any shared memory model for such scenario (I don't know if I am using one because I have very less idea of how things are implemented) .I am not sure how should I achieve this. In other words i just want to know what is the most efficient model to implement such a scenario. I am using C.
EDIT: A more detailed scenario.
The actual problem is I have n bloom filters for data contained in n files and once all the elements from a file are inserted in the corresponding bloom filter I need to need to do membership testing. Since membership testing is a read-only process on data file I can read file from multiple threads and this problem can be easily parallelized. Now the number of files having data are fairly large(around 20k and note that number of files equals number of bloom filter) so I choose to spawn a thread for testing against a bloom-filter i.e. each bloom filter will have its own thread and that will read every other file one by one and test the membership of data against the bloom filter. I wan to minimize disk IO in such a case.
At the start use the mmap() function to map the files into memory, instead of opening/reading FILE*'s. After that spawn the threads which read the files.
In that way the OS buffers the accesses in memory, only performing disk io when the cache becomes full.
If your program is multi-threaded, all the threads are sharing memory unless you take steps to create thread-local storage. You don't need o/s shared memory directly. The way to minimize I/O is to ensure that each file is read only once if at all possible, and similarly that results files are only written once each.
How you do that depends on the processing you're doing.
f each thread is responsible for processing a file in its entirety, then the thread simply reads the file; you can't reduce the I/O any more than that. If a file must be read by several threads, then you should try to memory map the file so that it is available to all the relevant threads. If you're using a 32-bit program and the files are too big to all fit in memory, you can't necessarily do the memory mapping. Then you need to work out how the different threads will process each file, and try to minimize the number of times different threads have to reread the files. If you're using a 64-bit program, you may have enough virtual memory to handle all the files via memory mapped I/O. You still want to keep the number of times that the data is accessed to a minimum. Similar concepts apply to the output files.
I'm training a typical map-reduce architecture (in O.S. classes) and I'm free to decide how the master process will tell its N child processes to parse a log. So, I'm kind of stuck in these two possibilities:
count the number of rows and give X rows for each map OR
each map reads the line of its ID and the next line to read= current_one+number_of_existent_maps
E.g.: with 3 maps, each one is going to read these lines:
Map1: 1, 4, 7, 10, 13
Map2: 2, 5, 8, 11, 14
Map3: 3, 6, 9, 12, 15
I have to do this in order to out-perform a single process that parses the entire log file, so the way I split the job between child processes has to be consistent with this objective.
Which one do you think is best? How can I do the scanf or fgets to adapt to 1) or 2)?
I would be happy with some example code for 2), because the fork/pipes are not my problem :P
RE-EDIT:
I'm not encouraged to use select here, only between map procs and the reduce process that will be monitoring the reads. I have restrictions now and :
I want each process to read total_lines/N lines each. But it seems like I have to make map procs open the file and then read the respective lines. So here are my doubts:
1- Is it bad or even possible to make every procs open the file simultaneously or almost simultaneously? How will that help in speeding up?
2- If it isn't possible to do that, I will have a parent opening the file (instead of each child doing that)that sends a struct with min and max limit and then the map procs will read whatever the lines they are responsible for, process them and give the reduce process a result (this doesn't matter for the problem now).
How can I divide correctly the number of lines by N maps and putting them to read at the same time? I think fseek() may be a good weapon, but I don't know HOW I can use it. Help, please!
If I understood correctly, you want to have all processes reading lines from a single file. I don't recommend this, it's kinda messy, and you'll have to a) read the same parts of the file several times or b) use locking/mutex or some other mechanism to avoid that. It'll get complicated and hard to debug.
I'd have a master process read the file, and assign lines to a subprocess pool. You can use shared memory to speed this up, and reduce the need for data-copying IPC; or use threads.
As for examples, I answered a question about forking and IPC and gave a code snippet on an example function that forks and returns a pair of pipes for parent-child communication. Let me look that up (...) here it is =P Can popen() make bidirectional pipes like pipe() + fork()?
edit: I kept thinking about this =P. Here's an idea:
Have a master process spawn subprocesses with something similar to what I showed in the link above.
Each process starts by sending a byte up to the master to signal it's ready, and blocking on read().
Have the master process read a line from the file to a shared memory buffer, and block on select() on its children pipes.
When select() returns, read one of the bytes that signal readiness and send to that subprocess the offset of the line in the shared memory space.
The master process repeats (reads a line, blocks on select, reads a byte to consume the readiness event, etc.)
The children process the line in whatever way you need, then send a byte to the master to signal readiness once again.
(You can avoid the shared memory buffer if you want, and send the lines down the pipes, though it'll involve constant data-copying. If the processing of each line is computationally expensive, it won't really make a difference; but if the lines require little processing, it may slow you down).
I hope this helps!
edit 2 based on Newba's comments:
Okay, so no shared memory. Use the above model, only instead of sending down the pipe the offset of the line read in the shared memory space, send the whole line. This may sound to you like you're wasting time when you could just read it from the file, but trust me, you're not. Pipes are orders of magnitude faster than reads from regular files in a hard disk, and if you wanted subprocesses to read directly from the file, you'll run into the problem I pointed at the start of the answer.
So, master process:
Spawn subprocesses using something like the function I wrote (link above) that creates pipes for bidirectional communication.
Read a line from the file into a buffer (private, local, no shared memory whatsoever).
You now have data ready to be processed. Call select() to block on all the pipes that communicate you with your subprocesses.
Choose any of the pipes that have data available, read one byte from it, and then send the line you have waiting to be processed in the buffer down the corresponding pipe (remember, we had 2 per child process, on to go up, one to go down).
Repeat from step 2, i.e. read another line.
Child processes:
When they start, they have a reading pipe and a writing pipe at their disposal. Send a byte down your writing pipe to signal the master process you are ready and waiting for data to process (this is the single byte we read in step 4 above).
Block on read(), waiting for the master process (that knows you are ready because of step 1) to send you data to process. Keep reading until you reach a newline (you said you were reading lines, right?). Note I'm following your model, sending a single line to each process at a time, you could send multiple lines if you wanted.
Process the data.
Return to step 1, i.e. send another byte to signal you are ready for more data.
There you go, simple protocol to assign tasks to as many subprocesses as you want. It may be interesting to run a test with 1 child, n children (where n is the number of cores in your computer) and more than n children, and compare performances.
Whew, that was a long answer. I really hope I helped xD
Since each of the processes is going to have to read the file in its entirety (unless the log lines are all of the same length, which is unusual), there really isn't a benefit to your proposal 2.
If you are going to split up the work into 3, then I would do:
Measure (stat()) the size of the log file - call it N bytes.
Allocate the range of bytes 0..(N/3) to first child.
Allocate the range of bytes (N/3)+1..2(N/3) to the second child.
Allocate the range of bytes 2(N/3)+1..end to the third child.
Define that the second and third children must synchronize by reading forward to the first line break after their start position.
Define that each child is responsible for reading to the first line break on or after the end of their range.
Note that the third child (last child) may have to do more work if the log file is growing.
Then the processes are reading independent segments of the file.
(Of course, with them all sharing the file, then the system buffer pool saves rereading the disk, but the data is still copied to each of the three processes, only to have each process throw away 2/3 of what was copied as someone else's job.)
Another, more radical option:
mmap() the log file into memory.
Assign the children to different segments of the file along the lines described previously.
If you're on a 64-bit machine, this works pretty well. If your log files are not too massive (say 1 GB or less), you can do it on a 32-bit machine too. As the file size grows above 1 GB or so, you may start running into memory mapping and allocation issues, though you might well get away with it until you reach a size somewhat less than 4 GB (on a 32-bit machine). The other issue here is with growing log files. AFAIK, mmap() doesn't map extra memory as extra data is written to the file.
Use a master and slave queue pattern.
The master sets up the slaves which sit waiting on a queue for work items.
The master then reads the file line by line.
Each line then represents a work item that you put on the queue
with a function pointer of how do the work.
One of the waiting slaves then takes the item of the queue
A slave processes a work item.
When a slave has finished it rejoins the work queue.