MPI programming: reading an external file - C

I have a problem: how do I read an external file.txt in my MPI code in C? The file.txt contains 10000 words, and I want to filter the words, removing symbols and numbers, so that I get output like this:
A
As
America
And
Are
Aztec
B
Bald
Bass
Best
up to Z
My question is: how do I process this with parallel computing?

It's unclear whether you are asking about the MPI_File routines for parallel I/O, or about how to process a file in MPI. I'm going to assume you're asking about the MPI_File routines.
For unformatted text files, it can be difficult to come up with a parallel decomposition strategy. Your file has 10000 words, including symbols and numbers, so it's not actually a whole lot of data.
If you know how to use the POSIX system calls open, read, and close, then as a first pass you can simply replace those calls with MPI_File_open, MPI_File_read, and MPI_File_close.
You can ignore details like the MPI file view, in-memory datatypes, and collective I/O: your data is probably not large enough to warrant the more sophisticated techniques.
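A minimal sketch of that first pass might look like the following. The file name and the byte-range splitting are illustrative assumptions, and a real word filter would also have to deal with words that straddle a chunk boundary:

/* Minimal sketch: every rank opens file.txt with MPI-IO and reads its own
 * byte range. The file name and the chunk-splitting scheme are illustrative
 * assumptions, not part of the original question. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "file.txt",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    MPI_Offset fsize;
    MPI_File_get_size(fh, &fsize);

    /* Split the file into roughly equal byte ranges, one per rank. */
    MPI_Offset chunk = fsize / nprocs;
    MPI_Offset start = (MPI_Offset)rank * chunk;
    MPI_Offset len   = (rank == nprocs - 1) ? fsize - start : chunk;

    char *buf = malloc(len + 1);
    MPI_File_read_at(fh, start, buf, (int)len, MPI_CHAR, MPI_STATUS_IGNORE);
    buf[len] = '\0';

    /* ... filter digits and symbols out of buf here ... */

    free(buf);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}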

Related

edit all files via multi-threading in C

If you had a base file directory with an unknown number of files, plus additional folders containing files, and needed to rename every file to append the date it was created on,
e.g. filename.ext -> filename_09_30_2021.ext
Assuming the renaming function was already created and returns 1 on success, 0 on failure, and -1 on error,
int rename_file(char * filename)
I'm having trouble understanding how you would write the multi-threaded file parsing section to increase the speed.
Would it have to first break the entire file tree down into, say, 4 parts of char arrays holding filenames and then create 4 threads to tackle each section?
Wouldn't that be counterproductive and slower than a single thread going down the file tree and renaming files as it finds them, instead of listing them first for the multi-threading?
In general, you get better performance from multithreading for CPU-intensive operations. In this case, you'll probably see little to no improvement; it's even quite possible that it gets slower.
The bottleneck here is not the CPU. It's reading from the disk.
Related: An answer I wrote about access times in general https://stackoverflow.com/a/45819202/6699433
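For reference, the single-threaded alternative can be a simple nftw() walk that calls the question's rename_file() on every regular file as it is found. This is only a sketch: rename_file() is assumed to exist as declared above, and "base_dir" is a placeholder for the base directory.

/* Sketch: single-threaded walk of the tree with nftw(), renaming files as
 * they are found. rename_file() is the function assumed by the question. */
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

int rename_file(char *filename);   /* provided elsewhere, per the question */

static int visit(const char *fpath, const struct stat *sb,
                 int typeflag, struct FTW *ftwbuf)
{
    (void)sb; (void)ftwbuf;
    if (typeflag == FTW_F && rename_file((char *)fpath) != 1)
        fprintf(stderr, "could not rename %s\n", fpath);
    return 0;                      /* keep walking */
}

int main(void)
{
    if (nftw("base_dir", visit, 16, FTW_PHYS) == -1) {
        perror("nftw");
        return 1;
    }
    return 0;
}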

GloVe: training with a single text file. Does GloVe try to read it into memory, or is it streamed?

I need to train some glove models to compare them with word2vec and fasttext output. It's implemented in C, and I can't read C code. The github is here.
The training corpus needs to be formatted into a single text file. For me, this would be >>100G -- way too big for memory. Before I waste time constructing such a thing, I'd be grateful if someone could tell me whether the glove algo tries to read the thing into memory, or whether it streams it from disk.
If the former, then glove's current implementation wouldn't be compatible with my data (I think). If the latter, I'd have at it.
GloVe first constructs a word co-occurrence matrix and later works on that. While constructing this matrix, the linked implementation streams the input file on several threads; each thread reads one line at a time.
The required memory will mainly depend on the number of unique words in your corpus, as long as the lines are not excessively long.
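To illustrate what "streaming" means in practice, line-at-a-time reading keeps only the current line (plus whatever per-word statistics you accumulate) in memory, so a corpus far larger than RAM is not a problem. This is just an illustrative sketch, not the actual GloVe source, and "corpus.txt" is a placeholder name:

/* Illustrative sketch (not the GloVe code itself): reading a huge corpus one
 * line at a time keeps memory bounded by the line length plus the per-word
 * statistics you keep. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(void)
{
    FILE *fp = fopen("corpus.txt", "r");   /* placeholder file name */
    if (!fp) { perror("fopen"); return 1; }

    char *line = NULL;
    size_t cap = 0;
    ssize_t len;

    while ((len = getline(&line, &cap, fp)) != -1) {
        (void)len;   /* line length, unused in this sketch */
        /* ... tokenize the line and update co-occurrence counts ... */
    }

    free(line);
    fclose(fp);
    return 0;
}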

How efficient is reading a file one byte at a time in C?

After going through most of the book, "The C Programming Language," I think I have a decent grasp on programming in C. One common C idiom presented in that book is reading a file a single byte at a time, using functions like getchar() and fgetc(). So far, I've been using these functions to do all IO in my C programs.
My question is, is this an efficient way of reading a file? Does a call to get a single byte require a lot of overhead that can be minimized if I read multiple bytes into a buffer at a time, for instance by using the read() system call on Unix systems? Or do the operating system and C library handle a buffer behind the scenes to make it more efficient? Also, does this work the same way for writing to files a single byte at a time?
I would like to know how this generally works in C, but if it is implementation- or OS-specific, I would like to know how it works with GCC on common Unix-like systems (like macOS and Linux).
Using getchar() etc is efficient because the standard I/O library uses buffering to read many bytes at once (saving them in a buffer) and doles them out one at a time when you call getchar().
Using read() to read a single byte at a time is much slower, typically, because it makes a full system call each time. It still isn't catastrophically slow, but it is nowhere near as fast as reading 512, or 4096, bytes into a buffer.
Those are broad, sweeping statements. There are many caveats that could be added, but they are a reasonable general outline of the performance of getchar(), etc.
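A rough sketch of the contrast, counting the bytes in a file both ways (the file name comes from the command line; this is just an illustration, not a benchmark):

/* Both loops count the bytes in a file. The first makes one library call per
 * byte, which is cheap because stdio buffers internally; the second makes one
 * read() system call per byte, which is much slower. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

long count_with_getc(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (!fp) return -1;
    long n = 0;
    while (getc(fp) != EOF)          /* stdio refills its buffer as needed */
        n++;
    fclose(fp);
    return n;
}

long count_with_read_one_byte(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    long n = 0;
    char byte;
    while (read(fd, &byte, 1) == 1)  /* one system call per byte: slow */
        n++;
    close(fd);
    return n;
}

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    printf("getc:    %ld bytes\n", count_with_getc(argv[1]));
    printf("read(1): %ld bytes\n", count_with_read_one_byte(argv[1]));
    return 0;
}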

Joining output binary files from MPI simulation

I have 64 output binary files from an MPI simulation using a C code.
The files correspond to the output of 64 processes. What would be a way to join all those files into a single file, perhaps using a C program?
Since this was tagged MPI, I'll offer an MPI solution, though it might not be something the questioner can do.
If you are able to modify the simulation, why not adopt an MPI-IO approach? Even better, look into HDF5 or Parallel-NetCDF and get a self-describing file format, platform portability, and a host of analysis and vis tools that already understand your file format.
But no matter which approach you take, the general idea is to use MPI to describe which part of the file belongs to each process. The easiest example is when each process contributes to a 1-D array: for a logically global array of N items spread over P processes, each process contributes N/P items at offset rank * (N/P).
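A minimal sketch of that offset idea with MPI-IO, assuming each rank owns an equal-sized contiguous chunk of a 1-D array of doubles (the sizes, data, and file name are placeholders):

/* Sketch: every rank writes its local chunk of a 1-D double array into one
 * shared file at an offset determined by its rank. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int local_n = 1000;                  /* items owned by this rank */
    double *local = malloc(local_n * sizeof *local);
    for (int i = 0; i < local_n; i++)
        local[i] = rank;                       /* dummy data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "combined.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * local_n * sizeof(double);
    MPI_File_write_at_all(fh, offset, local, local_n,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(local);
    MPI_Finalize();
    return 0;
}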
Since all the output files are fairly small and the same size, it would be easy to use MPI_Gather to assemble one large binary array on one node, which could then be written to a file. If allocating one large array is an issue, you could instead use MPI_Isend and MPI_Recv to send the pieces to a single rank and write them to the file one piece at a time.
Obviously this is a pretty primitive solution, but it is also very straightforward and foolproof, and it really won't take notably longer (assuming you're doing all this at the end of your simulation).
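A sketch of that gather-and-write idea, again assuming equal-sized chunks of doubles and placeholder sizes, data, and file name:

/* Sketch: every rank sends its chunk to rank 0 with MPI_Gather; rank 0
 * writes the assembled array with ordinary stdio. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int local_n = 1000;                      /* items per rank */
    double *local = malloc(local_n * sizeof *local);
    for (int i = 0; i < local_n; i++)
        local[i] = rank;                           /* dummy data */

    double *all = NULL;
    if (rank == 0)
        all = malloc((size_t)nprocs * local_n * sizeof *all);

    MPI_Gather(local, local_n, MPI_DOUBLE,
               all,   local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        FILE *fp = fopen("combined.bin", "wb");
        fwrite(all, sizeof *all, (size_t)nprocs * local_n, fp);
        fclose(fp);
        free(all);
    }

    free(local);
    MPI_Finalize();
    return 0;
}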

Parsing: load into memory or use stream

I'm writing a little parser and I would like to know the advantages and disadvantages of the different ways to load the data to be parsed. The two ways that I thought of are:
Load the file's contents into a string then parse the string (access the character at an array position)
Parse as reading the file stream (fgetc)
The former would let me have two functions, parse_from_file and parse_from_string; however, I believe this approach will use more memory. The latter does not have the disadvantage of using more memory.
Does anyone have any advice on the matter?
Reading the entire file in or memory mapping it will be faster, but may cause issues if you want your language to be able to #include other files as these would be memory mapped or read into memory as well.
The stdio functions would also work well, because they usually buffer data for you; they are general purpose, though, so they also watch for usage patterns that differ from reading a file from start to finish, but that shouldn't add much overhead.
A good balance is to have a large circular buffer (x * 2 * 4096 is a good size) which you load with file data and then have your tokenizer read from. Whenever a block's worth of data has been passed to your tokenizer (and you know that it is not going to be pushed back) you can refill that block with new data from the file and update some buffer location info.
Another thing to consider is whether the tokenizer will ever need to read from a pipe or from a person typing text directly. In those cases your reads may return less data than you asked for without being at the end of the file, and the buffering method I mentioned above gets more complicated. The stdio buffering is good for this, as it can easily be switched to or from line buffering, block buffering, or no buffering.
Using GNU flex (the fast lexical analyzer generator, not the Adobe Flash thing) or something similar can greatly ease all of this. You should look into using it to generate the C code for your tokenizer (lexical analysis).
Whatever you do, try to make it so that your code can easily be switched to a different form of next-character peek and consume functions, so that if you change your mind you won't have to start over.
Consider using lex (and perhaps yacc, if the language of your grammar matches its capabilities). Lex will handle all the fiddly details of lexical analysis for you and produce efficient code. You can probably beat its memory footprint by a few bytes, but how much effort do you want to expend into that?
The most efficient approach on a POSIX system would probably be neither of the two (or a variant of the first, if you like): just map the file read-only with mmap and parse it from there. Modern systems are quite efficient at this: they prefetch data when they detect streaming access, multiple instances of your program that parse the same file share the same physical pages of memory, and so on. And the interfaces are relatively simple to handle, I think.
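A minimal sketch of that mmap approach (the file name is a placeholder, and error handling is kept to the bare minimum):

/* Sketch: map the whole file read-only and hand the parser a
 * (pointer, length) pair. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("input.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);                       /* the mapping stays valid after close */

    /* ... parse data[0 .. st.st_size - 1] like an in-memory array
       (note: the mapping is not NUL-terminated) ... */

    munmap((void *)data, st.st_size);
    return 0;
}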
