As a continuation of my previous question, I have modified the code to handle a variable number of processes. However, the way Gatherv behaves in my code seems unreliable: in roughly one run out of three or four, the tail of the collecting buffer ends up corrupted, apparently due to some memory problem. Sample code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main (int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int world_size, *sendarray;
    int rank, *rbuf = NULL, count, total_counts = 0;
    int *displs = NULL, i, *rcounts = NULL;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    if (rank == 0) {
        displs = malloc((world_size+1)*sizeof(int));
        for (int i = 1; i <= world_size; i++) displs[i] = 0;
        rcounts = malloc(world_size*sizeof(int));
        sendarray = malloc(1*sizeof(int));
        for (int i = 0; i < 1; i++) sendarray[i] = 1111;
        count = 1;
    }
    if (rank != 0) {
        int size = rank*2;
        sendarray = malloc(size*sizeof(int));
        for (int i = 0; i < size; i++) sendarray[i] = rank;
        count = size;
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Gather(&count, 1, MPI_INT, rcounts, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) {
        displs[0] = 0;
        for (int i = 1; i <= world_size; i++) {
            for (int j = 0; j < i; j++) displs[i] += rcounts[j];
        }
        total_counts = 0;
        for (int i = 0; i < world_size; i++) total_counts += rcounts[i];
        rbuf = malloc(10*sizeof(int));
    }

    MPI_Gatherv(sendarray, count, MPI_INT, rbuf, rcounts,
                displs, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        int SIZE = total_counts;
        for (int i = 0; i < SIZE; i++) printf("(%d) %d ", i, rbuf[i]);
        free(rbuf);
        free(displs);
        free(rcounts);
    }
    if (rank != 0) free(sendarray);
    MPI_Finalize();
}
Why is this happening and is there a way to fix it?
This becomes much worse in my actual project. Each sending buffer contains 150 doubles. The receiving buffer gets very dirty, and sometimes I get a bad-termination error with exit code 6 or 11.
Can anyone at least reproduce my errors?
My guess: I am allocating memory for sendarray separately in each process. If my virtual machine were mapped 1-to-1 to the hardware, there would probably be no such problem. But I have only 2 cores and run 4 or more processes. Could that be the reason?
Change this line:
rbuf = malloc(10*sizeof(int));
to:
rbuf = malloc(total_counts*sizeof(int));
As a side note: each MPI process exists in its own process address space, so processes cannot stomp on each other's data except through erroneous data explicitly passed through the MPI_XXX functions, which results in undefined behavior.
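For reference, here is a minimal sketch of the root-rank setup once the fix is applied (same variable names as the question's code; the displacements are computed as a prefix sum over rcounts, which is equivalent to the question's nested loop but simpler):

if (rank == 0) rcounts = malloc(world_size * sizeof(int));
MPI_Gather(&count, 1, MPI_INT, rcounts, 1, MPI_INT, 0, MPI_COMM_WORLD);

if (rank == 0) {
    displs = malloc(world_size * sizeof(int));
    displs[0] = 0;
    total_counts = rcounts[0];
    for (int i = 1; i < world_size; i++) {
        displs[i] = displs[i-1] + rcounts[i-1];  /* prefix sum of counts */
        total_counts += rcounts[i];
    }
    rbuf = malloc(total_counts * sizeof(int));   /* sized to the real total */
}
MPI_Gatherv(sendarray, count, MPI_INT, rbuf, rcounts,
            displs, MPI_INT, 0, MPI_COMM_WORLD);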
Related
I am getting a write error when trying to scatter a dynamically allocated matrix (it is contiguous); it happens when 5 or more cores are involved in the computation. I have placed printfs, and the failure occurs in the scatter. The code is as follows:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cblas.h>
#include <sys/time.h>

/* DIE was not defined in the original post; a minimal stand-in: */
#define DIE(msg) do { fprintf(stderr, "%s\n", msg); exit(1); } while (0)

int main(int argc, char* argv[])
{
    int err = MPI_Init(&argc, &argv);
    MPI_Comm world;
    world = MPI_COMM_WORLD;
    int size = 0;
    err = MPI_Comm_size(world, &size);
    int rank = 0;
    err = MPI_Comm_rank(world, &rank);

    int n_rows = 2400, n_cols = 2400, n_rpc = n_rows/size;
    float *A, *Asc, *B, *C; // Dyn alloc A B and C
    Asc = malloc(n_rpc*n_cols*sizeof(float));
    B = malloc(n_rows*n_cols*sizeof(float));
    C = malloc(n_rows*n_cols*sizeof(float));
    A = malloc(n_rows*n_cols*sizeof(float));

    if (rank == 0)
    {
        for (int i = 0; i < n_rows; i++)
        {
            for (int j = 0; j < n_cols; j++)
            {
                A[i*n_cols+j] = i+1.0;
                B[i*n_cols+j] = A[i*n_cols+j];
            }
        }
    }

    struct timeval start, end;
    if (rank == 0) gettimeofday(&start, NULL);

    MPI_Bcast(B, n_rows*n_cols, MPI_FLOAT, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("Before Scatter\n"); // It is breaking here
    MPI_Scatter(A, n_rpc*n_cols, MPI_FLOAT, Asc, n_rpc*n_cols, MPI_FLOAT, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("After Scatter\n");

    /* Some computation */

    err = MPI_Finalize();
    if (err) DIE("MPI_Finalize");
    return err;
}
Up to 4 cores it works correctly and performs the scatter, but with 5 or more it does not, and I cannot find a clear reason.
The error message is as follows:
[raspberrypi][[26238,1],0][btl_tcp_frag.c:130:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error (0xac51e0, 8)
Bad address(3)
[raspberrypi][[26238,1],0][btl_tcp_frag.c:130:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error (0xaf197048, 29053982)
Bad address(1)
[raspberrypi:05345] pml_ob1_sendreq.c:308 FATAL
Thanks in advance!
There are multiple errors. First of all, take care to always use the same type when defining variables. Then, when you use scatter, the send count and the receive count are the same, and you will be sending Elements/Cores. Likewise, when receiving with gather, you have to receive the same amount that was sent, so again Elements/Cores.
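To make the count rule concrete, here is a hedged sketch of a matching scatter/gather pair for the code above (Csc is a hypothetical per-rank result block; this assumes n_rows is evenly divisible by size):

int chunk = n_rpc * n_cols;  /* elements per rank = (n_rows/size)*n_cols */
MPI_Scatter(A, chunk, MPI_FLOAT, Asc, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
/* ... compute chunk results into Csc (hypothetical local buffer) ... */
MPI_Gather(Csc, chunk, MPI_FLOAT, C, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

Both the send count and the receive count are the same chunk size, and the gather receives exactly what each rank sent.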
I am new to MPI and I am trying to manage arrays of different sizes in parallel and then pass them to the main process, unsuccessfully so far.
I have learned that
MPI_Gatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
void *recvbuf, const int *recvcounts, const int *displs,
MPI_Datatype recvtype, int root, MPI_Comm comm)
is the way to go in this case.
Here is my sample code, which doesn't work because of memory issues (I think).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main (int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int world_size, *sendarray;
    int rank, *rbuf = NULL, count;
    int *displs = NULL, i, *rcounts = NULL;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    if (rank == 0) {
        rbuf = malloc(10*sizeof(int));
        displs = malloc(world_size*sizeof(int));
        rcounts = malloc(world_size*sizeof(int));
        rcounts[0] = 1;
        rcounts[1] = 3;
        rcounts[2] = 6;
        displs[0] = 1;
        displs[1] = 3;
        displs[2] = 6;
        sendarray = malloc(1*sizeof(int));
        for (int i = 0; i < 1; i++) sendarray[i] = 1;
        count = 1;
    }
    if (rank == 1) {
        sendarray = malloc(3*sizeof(int));
        for (int i = 0; i < 3; i++) sendarray[i] = 2;
        count = 3;
    }
    if (rank == 2) {
        sendarray = malloc(6*sizeof(int));
        for (int i = 0; i < 6; i++) sendarray[i] = 3;
        count = 6;
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Gatherv(sendarray, count, MPI_INT, rbuf, rcounts,
                displs, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        int SIZE = 10;
        for (int i = 0; i < SIZE; i++) printf("(%d) %d ", i, rbuf[i]);
        free(rbuf);
        free(displs);
        free(rcounts);
    }
    if (rank != 0) free(sendarray);
    MPI_Finalize();
}
Specifically, when I run it, I get
(0) 0 (1) 1 (2) 0 (3) 2 (4) 2 (5) 2 (6) 3 (7) 3 (8) 3 (9) 3
Instead of something like this
(0) 1 (1) 2 (2) 2 (3) 2 (4) 3 (5) 3 (6) 3 (7) 3 (8) 3 (9) 3
Why is that?
What is even more interesting is that the missing elements seem to be stored in the 11th and 12th elements of rbuf, even though those aren't supposed to exist in the first place.
Your program is very close to working. If you change these lines:
displs[0]=1;
displs[1]=3;
displs[2]=6;
to this:
displs[0]=0;
displs[1]=displs[0]+rcounts[0];
displs[2]=displs[1]+rcounts[1];
you will get the expected output. The displs array gives, for each process i, the offset into the receiving buffer at which that process's data is placed.
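The same fix generalizes to any number of ranks as a prefix sum over rcounts, for example:

displs[0] = 0;
for (int i = 1; i < world_size; i++)
    displs[i] = displs[i-1] + rcounts[i-1];  /* each block starts where the previous one ends */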
I have the following code which works:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h> /* added: needed for rand() */

int main(int argc, char** argv) {
    int world_rank, world_size;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int n = 10000;
    int ni, i;
    double t[n];
    int x[n];
    int buf[n];
    int buf_size = n*sizeof(int);
    MPI_Buffer_attach(buf, buf_size);

    if (world_rank == 0) {
        for (ni = 0; ni < n; ++ni) {
            int msg_size = ni;
            int msg[msg_size];
            for (i = 0; i < msg_size; ++i) {
                msg[i] = rand();
            }
            double time0 = MPI_Wtime();
            MPI_Bsend(&msg, msg_size, MPI_INT, 1, 0, MPI_COMM_WORLD);
            t[ni] = MPI_Wtime() - time0;
            x[ni] = msg_size;
            MPI_Barrier(MPI_COMM_WORLD);
            printf("P0 sent msg with size %d\n", msg_size);
        }
    }
    else if (world_rank == 1) {
        for (ni = 0; ni < n; ++ni) {
            int msg_size = ni;
            int msg[msg_size];
            MPI_Request request;
            MPI_Barrier(MPI_COMM_WORLD);
            MPI_Irecv(&msg, msg_size, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
            MPI_Wait(&request, MPI_STATUS_IGNORE);
            printf("P1 received msg with size %d\n", msg_size);
        }
    }
    MPI_Buffer_detach(&buf, &buf_size);
    MPI_Finalize();
}
As soon as I remove the print statements, the program crashes, telling me there is an MPI_ERR_BUFFER: invalid buffer pointer. If I remove only one of the print statements, the other print statements are still executed, so I believe it crashes at the end of the program. I don't see why it crashes, and the fact that it does not crash when I use the print statements goes beyond my logic...
Would anybody have a clue what is going on here?
You are simply not providing enough buffer space to MPI. In buffered mode, all ongoing messages are stored in the buffer space which is used as a ring buffer. In your code, there can be multiple messages that need to be buffered, regardless of the printf. Note that not even 2*n*sizeof(int) would be enough buffer space - the barriers do not provide a guarantee that the buffer is locally freed even though the corresponding receive is completed. You would have to provide (n*(n-1)/2)*sizeof(int) memory to be sure, or something in-between and hope.
Bottom line: Don't use buffered mode.
Generally, use standard blocking send calls and write the application such that it doesn't deadlock. Tune the MPI implementation so that small messages are sent eagerly regardless of the receiver, to avoid wait times on late receivers.
If you want to overlap communication and computation, use nonblocking messages - providing proper memory for each communication.
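For completeness, if buffered mode is kept anyway, the attach size has to cover the worst case described above, plus MPI_BSEND_OVERHEAD bytes of bookkeeping per pending message. A minimal sketch with the question's n (treat this as an upper-bound estimate; the overhead constant is implementation-defined):

/* Payload worst case: 0 + 1 + ... + (n-1) ints all buffered at once,
   plus per-message overhead for n messages. */
int safe_size = (n * (n - 1) / 2) * (int)sizeof(int) + n * MPI_BSEND_OVERHEAD;
void *bsendbuf = malloc(safe_size);
MPI_Buffer_attach(bsendbuf, safe_size);
/* ... MPI_Bsend loop as in the question ... */
MPI_Buffer_detach(&bsendbuf, &safe_size);
free(bsendbuf);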
I have a project where I need to time a (deliberately bad) implementation of MPI_Bcast using MPI_Isend and MPI_Irecv, and compare it against MPI_Bcast. Because the time on these programs is 0.000000 seconds, I need to use a large array (as I have done). What is not yet in my code below is that the for loop and the MPI_Irecv/Isend calls should themselves be wrapped in a loop to make the program take a useful amount of time to finish.
Here is my code, and I'll discuss the problem I am having below it:
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int a = 1000000000;
    int i, N;
    int Start_time, End_time, Elapse_Time;
    int proc_rank, partner, world_size;
    MPI_Status stat;
    float mydata[a];
    MPI_Request request;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &proc_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    Start_time = MPI_Wtime();
    for (i = 0; i < a; i++) {
        mydata[i] = 0.2567*i;
    }
    MPI_Irecv(mydata, a, MPI_BYTE, 0, 1, MPI_COMM_WORLD, &request);
    MPI_Isend(mydata, a, MPI_BYTE, 0, 1, MPI_COMM_WORLD, &request);
    End_time = MPI_Wtime();
    Elapse_Time = End_time - Start_time;

    printf("Time on process %d is %f Seconds.\n", proc_rank, Elapse_Time);

    MPI_Finalize;
    return 0;
}
When I run this using the command mpirun -np 4 ./a.out, I only get the time for one processor, but I'm not really sure why. I guess I'm just not understanding how these functions work, or how I should be using them.
Thank you for the help!
There are a few different issues in your code, all likely to cause it to crash or behave strangely:
As already mentioned by @Olaf, the allocation of the array mydata on the stack is a very bad idea. For arrays this large, you should definitely go for an allocation on the heap with an explicit call to malloc(). Even so, you are playing with some serious chunks of memory here, so be careful not to exhaust what's available on your machine. Moreover, some MPI libraries have difficulty dealing with messages larger than 2GB, which is the case for yours. So again, be careful with that.
You use mydata for both sending and receiving. However, once you have posted a non-blocking communication, you cannot reuse the corresponding buffer until the communication is finished. So in your case, you'll need two arrays: one for sending and one for receiving.
The type of the data you pass to your MPI calls, namely MPI_BYTE, isn't coherent with the actual type of the data you transfer, namely float. You should use MPI_FLOAT instead.
You call MPI_Irecv() and MPI_Isend() without calling any valid MPI_Wait() or MPI_Test() functions. This is wrong since this means that the communications might never occur.
MPI_Wtime() returns a double, not an int. This isn't an error per se but it might lead to unexpected results. Moreover, the format requested in your call to printf() corresponds to a floating point data, not an integer, so you have to make it coherent.
(Minor - typo) You missed the () for MPI_Finalize().
(Minor - I guess) You only communicate with process #0...
So here is some possible version of a working code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int a = 1000000000;
    int i, from, to;
    double Start_time, End_time, Elapse_Time;
    int proc_rank, world_size;
    float *mysenddata, *myrecvdata;
    MPI_Request requests[2];

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &proc_rank );
    MPI_Comm_size( MPI_COMM_WORLD, &world_size );

    Start_time = MPI_Wtime();

    mysenddata = (float*) malloc( a * sizeof( float ) );
    myrecvdata = (float*) malloc( a * sizeof( float ) );
    assert( mysenddata != NULL ); /* very crude sanity check */
    assert( myrecvdata != NULL ); /* very crude sanity check */

    for ( i = 0; i < a; i++ ) {
        mysenddata[i] = 0.2567 * i;
    }

    from = ( proc_rank + world_size - 1 ) % world_size;
    to = ( proc_rank + 1 ) % world_size;

    MPI_Irecv( myrecvdata, a, MPI_FLOAT, from, 1, MPI_COMM_WORLD, &requests[0] );
    MPI_Isend( mysenddata, a, MPI_FLOAT, to, 1, MPI_COMM_WORLD, &requests[1] );
    MPI_Waitall( 2, requests, MPI_STATUSES_IGNORE );

    End_time = MPI_Wtime();
    Elapse_Time = End_time - Start_time;
    printf( "Time on process %d is %f Seconds.\n", proc_rank, Elapse_Time );

    free( mysenddata );
    free( myrecvdata );

    MPI_Finalize();
    return 0;
}
NB: for the sake of having code that works in all circumstances, I implemented a communication ring here, where process 0 sends to process 1 and receives from process size-1... However, in the context of your re-implementation of a broadcast, you can just ignore this (i.e. the from and to parameters).
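As a usage note: with a = 1000000000 the two float buffers alone take roughly 8 GB per process, so you may want to start with a smaller value (say a = 10000000) while testing. Compiling and running would then look something like this (bcast_ring.c is a placeholder file name):
$ mpicc bcast_ring.c -o bcast_ring
$ mpirun -np 4 ./bcast_ring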
The only explanation I see is that your other process crashes before the print. Try commenting out parts of your code and re-executing it.
Try it this way and see if you spot a difference:
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &proc_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

/*Start_time = MPI_Wtime();
for (i = 0; i < a; i++) {
    mydata[i] = 0.2567*i;
}
MPI_Irecv(mydata, a, MPI_BYTE, 0, 1, MPI_COMM_WORLD, &request);
MPI_Isend(mydata, a, MPI_BYTE, 0, 1, MPI_COMM_WORLD, &request);
End_time = MPI_Wtime();
Elapse_Time = End_time - Start_time;*/

printf("I'm process %d.\n", proc_rank);

MPI_Finalize;
(1) I am wondering how I can use MPI to speed up the time-consuming computation in the loop of my code below:
int main(int argc, char ** argv)
{
    // some operations
    f(size);
    // some operations
    return 0;
}

void f(int size)
{
    // some operations
    int i;
    double * array = new double [size];
    for (i = 0; i < size; i++) // how can I use MPI to speed up this loop to compute all elements in the array?
    {
        array[i] = complicated_computation(); // time consuming computation
    }
    // some operations using all elements in array
    delete [] array;
}
As shown in the code, I want to do some operations before and after the part to be parallelized with MPI, but I don't know how to specify where the parallel part begins and ends.
(2) My current code uses OpenMP to speed up the computation:
void f(int size)
{
    // some operations
    int i;
    double * array = new double [size];
    omp_set_num_threads(_nb_threads);
    #pragma omp parallel shared(array) private(i)
    {
        #pragma omp for schedule(dynamic) nowait
        for (i = 0; i < size; i++) // how can I use MPI to speed up this loop to compute all elements in the array?
        {
            array[i] = complicated_computation(); // time consuming computation
        }
    }
    // some operations using all elements in array
}
I wonder, if I switch to MPI, whether it is possible to have the code written for both OpenMP and MPI. If it is possible, how should the code be written, and how do I compile and run it?
(3) Our cluster has three versions of MPI: mvapich-1.0.1, mvapich2-1.0.3, openmpi-1.2.6.
Is their usage the same, especially in my case?
Which one is best for me to use?
Thanks and regards!
UPDATE:
I'd like to explain a bit more about my question of how to specify the start and end of the parallel part. In the following toy code, I want to limit the parallel part to within function f():
#include "mpi.h"
#include <stdio.h>
#include <string.h>
void f();
int main(int argc, char **argv)
{
printf("%s\n", "Start running!");
f();
printf("%s\n", "End running!");
return 0;
}
void f()
{
char idstr[32]; char buff[128];
int numprocs; int myid; int i;
MPI_Status stat;
printf("Entering function f().\n");
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
if(myid == 0)
{
printf("WE have %d processors\n", numprocs);
for(i=1;i<numprocs;i++)
{
sprintf(buff, "Hello %d", i);
MPI_Send(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD); }
for(i=1;i<numprocs;i++)
{
MPI_Recv(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD, &stat);
printf("%s\n", buff);
}
}
else
{
MPI_Recv(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
sprintf(idstr, " Processor %d ", myid);
strcat(buff, idstr);
strcat(buff, "reporting for duty\n");
MPI_Send(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
}
MPI_Finalize();
printf("Leaving function f().\n");
}
However, the output is not what I expected. The printf parts before and after the parallel part are executed by every process, not just the main process:
$ mpirun -np 3 ex2
Start running!
Entering function f().
Start running!
Entering function f().
Start running!
Entering function f().
WE have 3 processors
Hello 1 Processor 1 reporting for duty
Hello 2 Processor 2 reporting for duty
Leaving function f().
End running!
Leaving function f().
End running!
Leaving function f().
End running!
So it seems to me the parallel part is not limited between MPI_Init() and MPI_Finalize().
Besides this one, I am still hoping someone could answer my other questions. Thanks!
Quick edit (because I either can't figure out how to leave comments, or I'm not allowed to leave comments yet) -- 3lectrologos is incorrect about the parallel part of MPI programs. You cannot do serial work before MPI_Init and after MPI_Finalize and expect it to actually be serial -- it will still be executed by every MPI process.
I think part of the issue is that the "parallel part" of an MPI program is the entire program. MPI starts executing the same program (your main function) on each node you specify at approximately the same time. The MPI_Init call just sets certain things up so the program can use the MPI calls correctly.
The correct "template" (in pseudo-code) for what I think you want to do would be:
int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) { // Do the serial part on a single MPI process
        printf("Performing serial computation on cpu %d\n", myid);
        PreParallelWork();
    }

    ParallelWork(); // Every MPI process will run the parallel work

    if (myid == 0) { // Do the final serial part on a single MPI process
        printf("Performing the final serial computation on cpu %d\n", myid);
        PostParallelWork();
    }

    MPI_Finalize();
    return 0;
}
MPI_Init (with &argc and &argv, as MPI implementations require) really must be the first executed statement of main(), and MPI_Finalize() must be the very last.
main() will be started on every node in the MPI environment. Parameters such as the number of nodes, the node id, and the master node's address may be passed via argc and argv.
Here is the framework:
#include "mpi.h"
#include <stdio.h>
#include <string.h>
void f();
int numprocs; int myid;
int main(int argc, char **argv)
{
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
if(myid == 0)
{ /* main process. user interaction is ONLY HERE */
printf("%s\n", "Start running!");
MPI_Send ... requests with job
/*may be call f in main too*/
MPU_Reqv ... results..
printf("%s\n", "End running!");
}
else
{
/* Slaves. Do sit here and wait a job from main process */
MPI_Recv(.input..);
/* dispatch input by parsing it
(if there can be different types of work)
or just do the work */
f(..)
MPI_Send(.results..);
}
MPI_Finalize();
return 0;
}
If all the values in the array are independent, then it should be trivially parallelizable: split the array into chunks of roughly equal size, give each chunk to a node, and then combine the results back together.
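A hedged sketch of that scheme using MPI_Gather, assuming nprocs divides size evenly (the names nprocs, chunk, and local are illustrative, and complicated_computation() stands in for the real work):

int nprocs, rank;
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

int chunk = size / nprocs;                      /* elements per process */
double *local = malloc(chunk * sizeof(double));
for (int i = 0; i < chunk; i++)
    local[i] = complicated_computation();       /* each rank fills its own chunk */

double *array = NULL;
if (rank == 0)
    array = malloc(size * sizeof(double));      /* full result lives on the root */
MPI_Gather(local, chunk, MPI_DOUBLE, array, chunk, MPI_DOUBLE,
           0, MPI_COMM_WORLD);
/* rank 0 can now run the "operations using all elements in array" */
free(local);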
The easiest migration from OpenMP to a cluster may be Intel's "Cluster OpenMP".
For MPI, you need to completely rewrite the dispatching of work.