Joining output binary files from MPI simulation - c

I have 64 output binary files from an MPI simulation using a C code.
The files correspond to the output of 64 processes. What would be a way to join all those files into a single file, perhaps using a C script?

Since this was tagged MPI, I'll offer an MPI solution, though it might not be something the questioner can do.
If you are able to modify the simulation, why not adopt an MPI-IO approach? Even better, look into HDF5 or Parallel-NetCDF and get a self-describing file format, platform portability, and a host of analysis and vis tools that already understand your file format.
But no matter which approach you take, the general idea is to use MPI to describe which part of the file belongs to each process. The easiest example is when each process contributes to a 1D array: for a logically global array of N items split across P processes, each process contributes N/P items, starting at offset myrank * (N/P).
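The decomposition arithmetic can be sketched in plain C. This is only an illustration: the function names block_count and block_start are hypothetical, and the sketch assumes an even split where any remainder goes to the lowest ranks. Multiplying block_start by the element size gives the byte offset a rank would pass to a call such as MPI_File_write_at.

```c
#include <assert.h>
#include <stddef.h>

/* Number of items owned by `rank` when n items are split over nprocs
 * processes; the first (n % nprocs) ranks each get one extra item. */
size_t block_count(size_t n, int nprocs, int rank)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    size_t base = n / (size_t)nprocs;
    size_t extra = n % (size_t)nprocs;
    return base + ((size_t)rank < extra ? 1 : 0);
}

/* Index of the first item owned by `rank` in the logically global array. */
size_t block_start(size_t n, int nprocs, int rank)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    size_t base = n / (size_t)nprocs;
    size_t extra = n % (size_t)nprocs;
    size_t r = (size_t)rank;
    /* Every lower rank holds `base` items, plus one for each lower rank
     * that received an extra item. */
    return r * base + (r < extra ? r : extra);
}
```

When N divides evenly by the number of processes, this reduces to N/P items per rank at offset myrank * (N/P).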

Since all the output files are fairly small and the same size, it would be easy to use MPI_Gather to assemble one large binary array on one node, which could then be written to a file. If allocating a large array is an issue, you could simply use MPI_Isend and MPI_Recv to pass the pieces to one process and write them to the file one piece at a time.
Obviously this is a pretty primitive solution, but it is also very straightforward, foolproof and really won't take notably longer (assuming you're doing all this at the end of your simulation).
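If the simulation cannot be changed and the per-process files already exist, a serial C program can also join them after the fact. The following is a minimal sketch, not the questioner's actual setup: it assumes the files share a common prefix followed by the rank and ".bin" (a hypothetical naming scheme), and that the desired layout is simply rank 0's bytes, then rank 1's, and so on. The names append_stream and join_files are made up for this example.

```c
#include <stdio.h>

/* Append everything readable from `in` to `out`.
 * Returns 0 on success, -1 on a read or write error. */
static int append_stream(FILE *in, FILE *out)
{
    char buf[65536];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}

/* Concatenate <prefix>0.bin, <prefix>1.bin, ... <prefix>(nfiles-1).bin
 * into out_path, in rank order. Returns 0 on success, -1 on error. */
int join_files(const char *prefix, int nfiles, const char *out_path)
{
    FILE *out = fopen(out_path, "wb");
    if (!out)
        return -1;

    int status = 0;
    for (int rank = 0; rank < nfiles && status == 0; rank++) {
        char name[4096];
        snprintf(name, sizeof name, "%s%d.bin", prefix, rank);
        FILE *in = fopen(name, "rb");
        if (!in) {
            status = -1;
            break;
        }
        status = append_stream(in, out);
        fclose(in);
    }
    if (fclose(out) != 0)
        status = -1;
    return status;
}
```

With that naming assumption, join_files("out_", 64, "joined.bin") would read out_0.bin through out_63.bin.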

Related

Editing single value in file in C efficiently

I have a C program that currently edits a single value in a parameter file by using sed through a system call. I'd like to change the program to use the C file libraries to edit this value, but the only way I know how to do this is by reading in the entire file, changing the value, and rewriting the file. Is there a more efficient way to do this? The program is intended for use on an embedded device so I'd like to use the most efficient solution possible.
Working with files is like working with arrays in the sense that one can't truly perform insertions and deletions. Insertions and deletions require shifting (copying) the rest of the file/array. Only replacing elements is possible (by opening the file for reading and writing, and using seek).
Reading and writing the entire file is quite efficient, especially for tiny files. If the memory usage isn't an issue, that's the approach I would take.
Other solutions might be better in specific circumstances, but the approach you describe is generally the best.
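For the replace-in-place case, a minimal sketch might look like the following. It assumes the value sits at a known byte offset and that the new text is exactly as long as the old text (for example, a zero-padded number); the function name overwrite_at is hypothetical.

```c
#include <stdio.h>

/* Overwrite `len` bytes starting at byte `offset` of the file at `path`,
 * leaving the rest of the file untouched.
 * Returns 0 on success, -1 on error.
 * The replacement must be exactly as long as the text it replaces;
 * a longer or shorter value would need the rest of the file shifted. */
int overwrite_at(const char *path, long offset, const char *bytes, size_t len)
{
    FILE *f = fopen(path, "r+b");   /* update mode: existing contents are kept */
    if (!f)
        return -1;

    int status = 0;
    if (fseek(f, offset, SEEK_SET) != 0 || fwrite(bytes, 1, len, f) != len)
        status = -1;
    if (fclose(f) != 0)
        status = -1;
    return status;
}
```

If the new value can have a different length than the old one, this approach no longer applies and rewriting the file is the simpler route.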

MPI programming get external file

I have a problem: how do I read an external file, file.txt, in my MPI code in C? file.txt contains 10000 words. I want to filter these words to remove symbols and numbers, and get output like this:
A
As
America
And
Are
Aztec
B
Bald
Bass
Best
up to Z
My question is: how can I process it in parallel?
It's unclear if you are asking about the MPI_File routines for parallel I/O, or if you are asking how to process a file in MPI. I'm going to assume you're asking about the MPI_File routines.
For unformatted text files, it can be difficult to come up with a parallel decomposition strategy. Your file has 10000 words, including symbols and numbers, so it's not actually a whole lot of data.
If you know how to use the POSIX system calls open, read, and close, then you can, in a first pass, simply replace those calls with MPI_File_open, MPI_File_read, and MPI_File_close.
You can ignore details like the MPI file view, in-memory datatypes, and collective I/O: your data is probably not large enough to warrant more sophisticated techniques.

loading multiple files of different lengths into one large array in openmp

I have 4 files (file1,file2,file3,file4) of different lengths (n1,n2,n3,n4) which each contain the following type of data:
x1,y1,z1
x2,y2,z2
...
xn,yn,zn
What is the quickest way to load these into memory - can it be done simultaneously to create one large array (i.e. totarray(1:n1+n2+n3+n4,1:3)) from the 4 smaller arrays? If this can't be done in openmp - what would be the fastest way to do this? At the moment, I simply loop over each filename and append its data to the bottom of a temporary array, which is filled with the new data in each iteration. There are millions of entries in each file and I want to speed up this read. Thanks
Unless each file is on a different medium, the fastest way of doing this is probably to read the files one at a time, which is what it sounds like you're doing. In this case, OpenMP will not help you, and might make things worse, as the threads would be competing for a single, slow disk. This assumes that you are I/O bound, though.
You do not specify what format your file is in, though. If it is in binary format, then you can't do much better unless you want to start with compression. If it is in text format, though, you are probably CPU bound due to all the text parsing involved, and can probably get huge speedups simply by moving to a binary format. This will be much more efficient than OpenMP parallelization would be.
HDF is a good binary format you might consider, but you could also go with something as simple as Fortran "unformatted" files.
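To illustrate why a binary format avoids the parsing cost, here is a minimal sketch in C rather than Fortran. It assumes each entry is three doubles stored as raw bytes; the Point type and the function names save_points and load_points are made up for this example. A raw dump like this is only readable on machines with the same byte order and struct layout, which is one reason to prefer a format such as HDF for anything that has to be portable.

```c
#include <stdio.h>

typedef struct {
    double x, y, z;
} Point;

/* Write n points as raw binary. Returns 0 on success, -1 on error. */
int save_points(const char *path, const Point *pts, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    int status = (fwrite(pts, sizeof *pts, n, f) == n) ? 0 : -1;
    if (fclose(f) != 0)
        status = -1;
    return status;
}

/* Read up to n points from `path` straight into `dst`, with no text
 * parsing. `dst` can point into the middle of one large preallocated
 * array. Returns the number of points actually read. */
size_t load_points(const char *path, Point *dst, size_t n)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return 0;
    size_t got = fread(dst, sizeof *dst, n, f);
    fclose(f);
    return got;
}
```

Loading several files into one array is then a matter of calling load_points once per file with dst advanced by the number of points already read, for example load_points("file2.bin", total + n1, n2) with hypothetical file names.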

Is saving a binary file a standard? Is it limited to only 1 type?

When should a programmer use .bin files? (practical examples).
Is it popular (or accepted) to save different data types in one file?
When iterating over the data in a file (that has several data types), the program must know the exact length of every data type, and I find that limiting.
If you mean for some idealized general purpose application data, text files are often preferred because they provide transparency to the user, and might also make it easier to (for instance) move the data to a different application and avoid lock-in.
Binary files are mostly used for performance and compactness reasons; encoding things as text has non-trivial overhead in both of these departments (today, perhaps mostly in size), which is sometimes prohibitive.
Binary files are used whenever compactness or speed of reading/writing are required.
Those two requirements are closely related in the obvious way that reading and writing small files is fast, but there's one other important reason that binary I/O can be fast: when the records have fixed length, that makes random access to records in the file much easier and faster.
As an example, suppose you want to do a binary search within the records of a file (they'd have to be sorted, of course), without loading the entire file to memory (maybe because the file is so large that it doesn't fit in RAM). That can be done efficiently only when you know how to compute the offset of the "midpoint" between two records, without having to parse arbitrarily large parts of a file just to find out where a record starts or ends.
(As noted in the comments, random access can be achieved with text files as well; it's just usually harder to implement and slower.)
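A minimal sketch of such a search in C, assuming a made-up fixed-length Record with an integer key and a file whose records are already sorted by that key (the type, its field sizes, and the function name find_record are hypothetical):

```c
#include <stdio.h>

typedef struct {
    int key;
    char payload[28];
} Record;

/* Binary search over `nrec` fixed-length records sorted by key, reading
 * one record per step instead of loading the whole file.
 * Returns the record's index and fills *out on success;
 * returns -1 if the key is absent or an I/O error occurs. */
long find_record(FILE *f, long nrec, int key, Record *out)
{
    long lo = 0, hi = nrec - 1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        /* Fixed record length: the byte offset is a single multiplication. */
        if (fseek(f, mid * (long)sizeof(Record), SEEK_SET) != 0)
            return -1;
        if (fread(out, sizeof *out, 1, f) != 1)
            return -1;
        if (out->key == key)
            return mid;
        if (out->key < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}
```

Each step costs one seek and one small read, so a file with a million records needs about twenty reads.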
I think when embedded developers see a ".bin" file, it's generally a flattened version of an ELF or the like, intended for programming as firmware for a processor. For instance, putting the Linux kernel into flash (depending on your bootloader).
As a general practice of whether or not to use binary files, you see it done for many reasons. Text requires parsing, and that can be a great deal of overhead. If it's intended to be usable by the user though, binary is a poor format, and text really shines.
Where binary is best is for performance. You can do things like map it into memory, and take advantage of the structure to speed up access. Sometimes, you'll have two binary files, one with data, and one with metadata, that can be used to help with searching through gobs of data. For example, Git does this. It defines an index format, a pack format, and an object format that all work together to save the history of your project in a readily accessible, but compact way.

send glib hashtable with MPI

I recently came across a problem with my parallel program. Each process has several glib hashtables that need to be exchanged with other processes, and these hashtables may be quite large. What is the best approach to achieve that?
1. create a derived datatype
2. use MPI pack and unpack
3. send keys & values as arrays (a problem, since the number of elements is not known at compile time)
I haven't used 1 & 2 before and don't even know if that's possible, which is why I am asking you guys.
Pack/unpack creates a copy of your data: if your maps are large, you'll want to avoid that. This also rules out your 3rd option.
You can indeed define a custom datatype, but it'll be a little tricky. See the end of this answer for an example (replacing "graph" with "map" and "node" with "pair" as you read). I suggest you read up on these topics to get a firm understanding of what you need to do.
That the number of elements is not known at compile time shouldn't be a real issue. You can just send a message containing the payload size before sending the map contents. This will let the receiving process allocate just enough memory for the receive buffer.
You may also want to consider simply printing the contents of your maps to files, and then having the processes read each other's output. This is much more straightforward, but also less elegant and much slower than message passing.
