MPI: Distributing segments of a large file does not speed up execution - C

I have a large bioinformatics file (FASTA), and I am using MPI to read the file at specific regions depending on the current process's ID (rank). Then, I translate the nucleotide sequences into their corresponding proteins.
#include <mpi.h>
#include <stdlib.h>
int main(int argc, char* argv[]){
MPI_File in;
int id;
int p;
long buffersize = 3000000000/p;
MPI_Offset fileStart = buffersize * id;
char* nucleotides = (char*)malloc(buffersize);
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDONLY, MPI_INFO_NULL, &in);
MPI_File_read_at(in, fileStart, nucleotides, buffersize, MPI_CHAR, MPI_STATUS_IGNORE);
/* Calculations */
/* Write result */
MPI_File_close(&in);
free(nucleotides);
MPI_Finalize();
return 0;
}
I expect a speedup correlated with the number of machines running the algorithm. However, running my application across multiple machines does not change the execution time; it appears to be independent of the number of machines listed in my hostfile.
Any ideas on how to get the expected behavior of more machines decreasing the read time?

To turn the comments into an answer:
Turn this
#include <mpi.h>
#include <stdlib.h>
int main(int argc, char* argv[]){
MPI_File in;
int id;
int p;
//you are using p uninitialized here!!!
long buffersize = 3000000000/p;
//same applies to id
MPI_Offset fileStart = buffersize * id;
char* nucleotides = (char*)malloc(buffersize);
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDONLY, MPI_INFO_NULL, &in);
MPI_File_read_at(in, fileStart, nucleotides, buffersize, MPI_CHAR, MPI_STATUS_IGNORE);
/* Calculations */
/* Write result */
MPI_File_close(&in);
free(nucleotides);
MPI_Finalize();
return 0;
}
into this:
#include <mpi.h>
#include <stdlib.h>
int main(int argc, char* argv[]){
MPI_File in;
MPI_Init(&argc, &argv);
int id;
int p;
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &p);
long buffersize = 3000000000/p;
MPI_Offset fileStart = buffersize * id;
char* nucleotides = (char*)malloc(buffersize);
MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDONLY, MPI_INFO_NULL, &in);
MPI_File_read_at(in, fileStart, nucleotides, buffersize, MPI_CHAR, MPI_STATUS_IGNORE);
/* Calculations */
/* Write result */
MPI_File_close(&in);
free(nucleotides);
MPI_Finalize();
return 0;
}
No guarantees that the rest of the code above will work.
I would highly recommend making yourself comfortable with C programming before starting to write MPI code, because otherwise you might end up lost. MPI issues are difficult to debug.
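As a further, hedged sketch (not part of the original answer): the hard-coded 3000000000 can also be replaced by querying the actual file size with MPI_File_get_size, so each rank derives its own offset and length from the file itself. Error handling is omitted, and reads larger than INT_MAX bytes per rank would still have to be split into several calls:
#include <mpi.h>
#include <stdlib.h>
int main(int argc, char* argv[]){
    MPI_File in;
    MPI_Offset filesize, fileStart, buffersize;
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDONLY, MPI_INFO_NULL, &in);
    MPI_File_get_size(in, &filesize);      /* query the real size instead of hard-coding it */
    buffersize = filesize / p;
    fileStart = buffersize * id;
    if (id == p - 1)
        buffersize = filesize - fileStart; /* last rank also takes the remainder */
    char* nucleotides = malloc(buffersize);
    /* note: the count argument is an int, so per-rank chunks above ~2 GB
       would have to be read in several smaller calls */
    MPI_File_read_at(in, fileStart, nucleotides, (int)buffersize, MPI_CHAR, MPI_STATUS_IGNORE);
    /* Calculations */
    /* Write result */
    MPI_File_close(&in);
    free(nucleotides);
    MPI_Finalize();
    return 0;
}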

Related

MPI_Scatter produces write error, bad address (3)

I am receiving a write error when trying to scatter a dynamically allocated matrix (it is contiguous); it happens when 5 or more cores are involved in the computation. I have placed printfs and it occurs in the scatter. The code is the following:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cblas.h>
#include <sys/time.h>
int main(int argc, char* argv[])
{
int err = MPI_Init(&argc, &argv);
MPI_Comm world;
world=MPI_COMM_WORLD;
int size = 0;
err = MPI_Comm_size(world, &size);
int rank = 0;
err = MPI_Comm_rank(world, &rank);
int n_rows=2400, n_cols=2400, n_rpc=n_rows/size;
float *A, *Asc, *B, *C; //Dyn alloc A B and C
Asc=malloc(n_rpc*n_cols*sizeof(float));
B=malloc(n_rows*n_cols*sizeof(float));
C=malloc(n_rows*n_cols*sizeof(float));
A=malloc(n_rows*n_cols*sizeof(float));
if(rank==0)
{
for (int i=0; i<n_rows; i++)
{
for (int j=0; j<n_cols; j++)
{
A[i*n_cols+j]= i+1.0;
B[i*n_cols+j]=A[i*n_cols+j];
}
}
}
struct timeval start, end;
if(rank==0) gettimeofday(&start,NULL);
MPI_Bcast(B, n_rows*n_cols, MPI_FLOAT, 0, MPI_COMM_WORLD);
if(rank==0) printf("Before Scatter\n"); //It is breaking here
MPI_Scatter(A, n_rpc*n_cols, MPI_FLOAT, Asc, n_rpc*n_cols, MPI_FLOAT, 0, MPI_COMM_WORLD);
if(rank==0) printf("After Scatter\n");
/* Some computation */
err = MPI_Finalize();
if (err) DIE("MPI_Finalize");
return err;
}
Up to 4 cores, it works correctly and performs the scatter, but with 5 or more it does not, and I cannot find a clear reason.
The error message is as follows:
[raspberrypi][[26238,1],0][btl_tcp_frag.c:130:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error (0xac51e0, 8)
Bad address(3)
[raspberrypi][[26238,1],0][btl_tcp_frag.c:130:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error (0xaf197048, 29053982)
Bad address(1)
[raspberrypi:05345] pml_ob1_sendreq.c:308 FATAL
Thanks in advance!
There are multiple errors. First of all, take care to always use the same type when defining variables. Then, when you use scatter, the send count and the receive count are the same, and you will be sending Elements/Cores elements per rank. Also, when receiving with gather, you have to receive the same amount you sent, so again Elements/Cores.
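As a minimal sketch of what this means with the variable names from the question (C_part is an illustrative local result buffer that does not appear in the original code, and n_rows is assumed to be divisible by size):
int n_rpc = n_rows / size;        /* rows per core */
int count = n_rpc * n_cols;       /* the same count on the send side and the receive side */
float *C_part = malloc(count * sizeof(float)); /* illustrative per-rank result block */
/* every rank receives count floats of A into Asc; A only has to be fully valid on the root */
MPI_Scatter(A, count, MPI_FLOAT, Asc, count, MPI_FLOAT, 0, MPI_COMM_WORLD);
/* ... compute C_part from Asc and B ... */
/* every rank contributes exactly the count it received, so the root collects n_rows*n_cols floats into C */
MPI_Gather(C_part, count, MPI_FLOAT, C, count, MPI_FLOAT, 0, MPI_COMM_WORLD);
free(C_part);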

C - MPI - Bsend does not work with buffer attach

I want the buffer to be full before I start receiving. It seems to me that the buffer size is smaller than necessary.
I followed the documentation to the letter, but I cannot see the error:
https://www.mpich.org/static/docs/v3.1/www3/MPI_Buffer_attach.html
In some cases it seems that, with a specific buffer size, I can perform more Bsends than theoretically allowed.
#include <stdio.h>
#include <mpi.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char *argv[])
{
int rank, size, tag=0;
MPI_Status status;
MPI_Request request;
int flag;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int bsends_totales=1000000;
if (rank!=0) // Slave nodes
{
int size_bsend;
MPI_Pack_size( 1, MPI_INT, MPI_COMM_WORLD, &size_bsend );
int size_buffer=bsends_totales*(size_bsend+MPI_BSEND_OVERHEAD);
int * buffer = malloc(size_buffer);
memset(buffer,0,size_buffer);
MPI_Buffer_attach(buffer,size_buffer); // Outgoing buffer
int enviar=4;
int sends_realizados=0;
for (int i=0;i<bsends_totales;i++)
{
printf("BSENDS realizados... %d\n",sends_realizados);
MPI_Bsend(&enviar,1,MPI_INT,0,tag,MPI_COMM_WORLD);
sends_realizados=sends_realizados+1;
}
printf("BSENDS TOTALES REALIZADOS: %d\n",sends_realizados);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Buffer_detach(&buffer,&size_buffer);
free(buffer);
printf("TERMINE\n");
}
else //Master
{
int recibido;
MPI_Barrier(MPI_COMM_WORLD);
for (int i=0;i<bsends_totales;i++)
{
MPI_Recv(&recibido,1,MPI_INT,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
}
}
MPI_Finalize();
return 0;
}
Before the buffer can be filled, an error appears.
OUTPUT:
BSENDS realizados... 119696
BSENDS realizados... 119697
BSENDS realizados... 119698
BSENDS realizados... 119699
BSENDS realizados... 119700
code exit 11

Open MPI starts very fast but slows down massively soon after

I have a C program that takes a very large file (can be 5 GB to 65 GB), transposes the data in the file, and then writes out the transposed data to other files. In total, the results files are approximately 30 times larger due to the transformation. I am using Open MPI, so each processor used writes to its own file.
Each processor writes the first ~18 GB of data to its own results file at a very fast speed. However, at this stage the program slows to a crawl and the %CPU shown in the top command output drops drastically from ~100% to 0.3%.
Can anyone suggest a reason for this? Am I reaching some system limit?
Code:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
unsigned long long impute_len=0;
void write_results(unsigned long long, unsigned long long, int);
void main(int argc, char **argv){
// the impute output
FILE *impute_fp = fopen("infile.txt", "r"); /* FILE* declaration assumed; it is not shown in the original excerpt */
// find input file length
fseek(impute_fp, 0, SEEK_END);
impute_len=ftell(impute_fp);
//mpi magic - hopefully!
MPI_Status status;
unsigned long long proc_id, ierr, num_procs, tot_recs, recs_per_proc,
root_recs, start_byte, end_byte, start_recv, end_recv;
// Now replicte this process to create parallel processes.
ierr = MPI_Init(&argc, &argv);
//find out process ID, and how many processes were started.
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &proc_id);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
if(proc_id == 0){
tot_recs = impute_len/54577; //54577 is length of each line
recs_per_proc = tot_recs/num_procs;
if(tot_recs % num_procs != 0){
recs_per_proc=recs_per_proc+1;
root_recs = tot_recs-(recs_per_proc*(num_procs-1));
}else{
root_recs = recs_per_proc;
}
//distribute a portion to each child process
int z=0;
for(int x=1; x<num_procs; x++){
start_byte = ((root_recs*54577))+(z*(recs_per_proc*54577));
end_byte = ((root_recs*54577))+((z+1)*(recs_per_proc*54577));
ierr = MPI_Send(&start_byte, 1 , MPI_UNSIGNED_LONG_LONG, x, 0, MPI_COMM_WORLD);
ierr = MPI_Send(&end_byte, 1 , MPI_UNSIGNED_LONG_LONG, x, 0, MPI_COMM_WORLD);
z++;
}
//root proc bit of work
write_results(0, (root_recs*54577), proc_id);
}else{
//must be a slave process
ierr = MPI_Recv(&start_recv, 1, MPI_UNSIGNED_LONG_LONG, 0, 0, MPI_COMM_WORLD, &status);
ierr = MPI_Recv(&end_recv, 1, MPI_UNSIGNED_LONG_LONG, 0, 0, MPI_COMM_WORLD, &status);
//Write my portion of file
write_results(start_recv, end_recv, proc_id);
}
ierr = MPI_Finalize();
fclose(impute_fp);
}
void write_results(unsigned long long start, unsigned long long end, int proc_id){
/* logic to write out transposed data here (results_fp is opened in this elided part) */
fclose(results_fp);
}

How to modify MPI blocking send and receive to non-blocking

I am trying to understand the difference between blocking and non-blocking message passing mechanisms in parallel processing using MPI. Suppose we have the following blocking code:
#include <stdio.h>
#include <string.h>
#include "mpi.h"
int main (int argc, char* argv[]) {
const int maximum_message_length = 100;
const int rank_0= 0;
char message[maximum_message_length+1];
MPI_Status status; /* Info about receive status */
int my_rank; /* This process ID */
int num_procs; /* Number of processes in run */
int source; /* Process ID to receive from */
int destination; /* Process ID to send to */
int tag = 0; /* Message ID */
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
/* clients processes */
if (my_rank != rank_0) {
sprintf(message, "Hello world from process# %d", my_rank);
MPI_Send(message, strlen(message) + 1, MPI_CHAR, rank_0, tag, MPI_COMM_WORLD);
} else {
/* rank 0 process */
for (source = 0; source < num_procs; source++) {
if (source != rank_0) {
MPI_Recv(message, maximum_message_length + 1, MPI_CHAR, source, tag,
MPI_COMM_WORLD,&status);
fprintf(stderr, "%s\n", message);
}
}
}
MPI_Finalize();
}
Each processor executes its task and sends it back to rank_0 (the receiver). rank_0 runs a loop from 1 to n-1 processes and prints them sequentially (step i in the loop may not proceed if the current client hasn't sent its task yet). How do I modify this code to achieve the non-blocking mechanism using MPI_Isend and MPI_Irecv? Do I need to remove the loop in the receiver part (rank_0) and explicitly state MPI_Irecv(..) for each client, i.e.
MPI_Irecv(message, maximum_message_length + 1, MPI_CHAR, source, tag,
MPI_COMM_WORLD,&status);
Thank you.
What you do with non-blocking communication is post the communication and then immediately proceed with your program to do other stuff, which again might be posting more communication. In particular, you can post all receives at once and wait on them to complete only later on.
This is what you would typically do in your scenario here.
Note, however, that this specific setup is a bad example, as it basically just reimplements an MPI_Gather!
Here is how you would typically go about the non-blocking communication in your setup. First, you need some storage for all the messages to end up in, and also a list of request handles to keep track of the non-blocking communication requests, so the first part of the code needs to be changed accordingly:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "mpi.h"
int main (int argc, char* argv[]) {
const int maximum_message_length = 100;
const int server_rank = 0;
char message[maximum_message_length+1];
char *allmessages;
MPI_Status *status; /* Info about receive status */
MPI_Request *req; /* Non-Blocking Requests */
int my_rank; /* This process ID */
int num_procs; /* Number of processes in run */
int source; /* Process ID to receive from */
int tag = 0; /* Message ID */
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
/* clients processes */
if (my_rank != server_rank) {
sprintf(message, "Hello world from process# %d", my_rank);
MPI_Send(message, maximum_message_length + 1, MPI_CHAR, server_rank,
tag, MPI_COMM_WORLD);
} else {
No need for non-blocking sends here. Now we go on and receive all these messages on server_rank. We need to loop over all of them and store a request handle for each of them:
/* rank 0 process */
allmessages = malloc((maximum_message_length+1)*num_procs);
status = malloc(sizeof(MPI_Status)*num_procs);
req = malloc(sizeof(MPI_Request)*num_procs);
for (source = 0; source < num_procs; source++) {
req[source] = MPI_REQUEST_NULL;
if (source != server_rank) {
/* Post non-blocking receive for source */
MPI_Irecv(allmessages+(source*(maximum_message_length+1)),
maximum_message_length + 1, MPI_CHAR, source, tag,
MPI_COMM_WORLD, req+source);
/* Proceed without waiting on the receive */
/* (posting further receives */
}
}
/* Wait on all communications to complete */
MPI_Waitall(num_procs, req, status);
/* Print the messages in order to the screen */
for (source = 0; source < num_procs; source++) {
if (source != server_rank) {
fprintf(stderr, "%s\n",
allmessages+(source*(maximum_message_length+1)));
}
}
}
MPI_Finalize();
}
After posting the non-blocking receives, we need to wait on all of them to complete in order to print the messages in the correct order. To do this, an MPI_Waitall is used, which allows us to block until all request handles are satisfied. Note that I include the server_rank here for simplicity, but set its request to MPI_REQUEST_NULL initially, so it will be ignored.
If you do not care about the order, you could process the communications as soon as they become available by looping over the requests and employing MPI_Waitany. That returns as soon as any communication is completed, and you can act on the corresponding data.
With MPI_Gather that code would look like this:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "mpi.h"
int main (int argc, char* argv[]) {
const int maximum_message_length = 100;
const int server_rank = 0;
char message[maximum_message_length+1];
char *allmessages;
int my_rank; /* This process ID */
int num_procs; /* Number of processes in run */
int source; /* Process ID to receive from */
int tag = 0; /* Message ID */
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
if (my_rank == server_rank) {
allmessages = malloc((maximum_message_length+1)*num_procs);
}
sprintf(message, "Hello world from process# %d", my_rank);
MPI_Gather(message, (maximum_message_length+1), MPI_CHAR,
allmessages, (maximum_message_length+1), MPI_CHAR,
server_rank, MPI_COMM_WORLD);
if (my_rank == server_rank) {
/* Print the messages in order to the screen */
for (source = 0; source < num_procs; source++) {
if (source != server_rank) {
fprintf(stderr, "%s\n",
allmessages+(source*(maximum_message_length+1)));
}
}
}
MPI_Finalize();
}
And with MPI-3 you can even use a non-blocking MPI_Igather.
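A minimal sketch of that variant, reusing the buffers from the gather version above (the request variable greq is just an illustrative name):
MPI_Request greq;
MPI_Igather(message, (maximum_message_length+1), MPI_CHAR,
            allmessages, (maximum_message_length+1), MPI_CHAR,
            server_rank, MPI_COMM_WORLD, &greq);
/* do other useful work here while the gather progresses */
MPI_Wait(&greq, MPI_STATUS_IGNORE);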
If you don't care about the ordering, the last part (starting with MPI_Waitall) could be done with MPI_Waitany like this:
for (int i = 0; i < num_procs-1; i++) {
/* Wait on any next communication to complete */
MPI_Waitany(num_procs, req, &source, status);
fprintf(stderr, "%s\n",
allmessages+(source*(maximum_message_length+1)));
}

MPI - indefinite send and recv

If I am using blocking send and recv (MPI_Send(), MPI_Recv()), how can I make these two operations repeat indefinitely? That is, repeating over and over again?
Sample code:
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD,&numtasks);
MPI_Comm_rank (MPI_COMM_WORLD,&rank);
if(rank==0){
rc=MPI_Send(msg,1,MPI_CHAR,1,1,MPI_COMM_WORLD);
rc=MPI_Recv(msg,1,MPI_CHAR,1,1,MPI_COMM_WORLD,&status);
}else{
rc=MPI_Recv(msg,1,MPI_CHAR,0,0,MPI_COMM_WORLD,&status);
rc=MPI_Send(msg,1,MPI_CHAR,0,0,MPI_COMM_WORLD);
}
I have tried to put a while(1) before if(rank==0) and it did the job, but I see there are several sends, then several receives, and I want it like: send(0), receive(1), send(1), receive(0).
You can code a ring of send-receive operations easily by using MPI_Sendrecv:
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype,
int dest, int sendtag, void *recvbuf, int recvcount,
MPI_Datatype recvtype, int source, int recvtag,
MPI_Comm comm, MPI_Status *status)
As you can see, it is only a condensed version of an MPI_Send and an MPI_Recv, but it comes in handy when every process needs to both send and receive something.
The following code works for any number of processes (you can adapt it to your needs):
CODE UPDATE #1 (Using MPI_Sendrecv)
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>
int main (int argc, char *argv[])
{
int size, rank, value, next, prev, sendval, recval;
double t0, t;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
value = 5;
if (size > 1)
{
next = (rank + 1)% size;
prev = (size+rank - 1)% size;
sendval = value + rank;
for (;;)
{
t0 = MPI_Wtime();
MPI_Sendrecv(&sendval, 1, MPI_INT, next, 1, &recval, 1, MPI_INT, prev, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
t = MPI_Wtime();
fprintf(stdout, "[%d of %d]: Sended %d to process %d, Received %d from process %d (MPI_SendRecv Time: %f)\n",rank, size-1, sendval, next, recval, prev, (t - t0));
}
}
MPI_Finalize();
return 0;
}
CODE UPDATE #2 (Using separate MPI_Send/MPI_Recv)
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>
int main (int argc, char *argv[])
{
int size, rank, value, next, prev, sendval, recval;
double s0, s, r, r0;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
value = 5;
if (size > 1)
{
next = (rank + 1)% size;
prev = (size+rank - 1)% size;
sendval = value + rank;
for (;;)
{
s0 = MPI_Wtime();
MPI_Send(&sendval, 1, MPI_INT, next, 1, MPI_COMM_WORLD);
s = MPI_Wtime();
fprintf(stdout, "[%d of %d]: Sended %d to process %d (MPI_Send Time: %f)\n", rank, size-1, sendval, next, s-s0);
r0 = MPI_Wtime();
MPI_Recv(&recval, 1, MPI_INT, prev, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
r = MPI_Wtime();
fprintf(stdout, "[%d of %d]: Received %d from process %d (MPI_Recv Time: %f)\n", rank, size-1, recval, prev, r-r0);
}
}
MPI_Finalize();
return 0;
}
Running Example
mpicc -o sendrecv sendrecv.c
mpirun -n 2 sendrecv
[0 of 1]: Sended 5 to process 1, Received 6 from process 1 (MPI_SendRecv Time: 0.000121)
[1 of 1]: Sended 6 to process 0, Received 5 from process 0 (MPI_SendRecv Time: 0.000068)
...
It is impossible to give an accurate answer to that without seeing at least the basic layout of your code. Generally, you would place the Send and Receive operations inside an infinite loop. Or, if you are hard-pressed for optimal communication costs (or simply feeling adventurous), you could use persistent Send and Receive requests.
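For completeness, here is a hedged sketch of the persistent-request variant (this is not from the original answers; it assumes exactly two ranks, like the question's fragment, and the buffer names and iteration count are illustrative):
#include <stdio.h>
#include <mpi.h>
int main (int argc, char *argv[])
{
    int rank, size;
    char sendbuf = 'x', recvbuf = 0;  /* illustrative one-byte payloads */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size == 2)                    /* assumes exactly two ranks, as in the question */
    {
        int other = 1 - rank;         /* partner rank */
        MPI_Request req[2];
        /* set the communication up once ... */
        MPI_Send_init(&sendbuf, 1, MPI_CHAR, other, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Recv_init(&recvbuf, 1, MPI_CHAR, other, 0, MPI_COMM_WORLD, &req[1]);
        for (int i = 0; i < 10; i++)  /* ... and restart it on every iteration */
        {
            MPI_Startall(2, req);
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
            printf("[%d] iteration %d: received '%c' from %d\n", rank, i, recvbuf, other);
        }
        MPI_Request_free(&req[0]);
        MPI_Request_free(&req[1]);
    }
    MPI_Finalize();
    return 0;
}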
