I have been learning to use some MPI functions. When I try to use MPI_Reduce, I get "stack smashing detected" when I run my code:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

void main(int argc, char **argv) {
    int i, rank, size;
    int sendBuf, recvBuf, count;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sendBuf = rank;
    count = size;
    MPI_Reduce(&sendBuf, &recvBuf, count, MPI_INT,
               MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Sum is %d\n", recvBuf);
    }
    MPI_Finalize();
}
My code seems to be okay: it should print the sum of all ranks in recvBuf on process 0. In this case, it should print Sum is 45 if I run my code with 10 processes: mpirun -np 10 myexecutefile. But I don't know why my code has the error:
Sum is 45
*** stack smashing detected ***: example6 terminated
[ubuntu:06538] *** Process received signal ***
[ubuntu:06538] Signal: Aborted (6)
[ubuntu:06538] Signal code: (-6)
[ubuntu:06538] *** Process received signal ***
[ubuntu:06538] Signal: Segmentation fault (11)
[ubuntu:06538] Signal code: (128)
[ubuntu:06538] Failing at address: (nil)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
What is the problem and how can I fix it?
In
MPI_Reduce(&sendBuf, &recvBuf, count, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
the argument count must be the number of elements in the send buffer. Since sendBuf is a single integer, use count = 1; instead of count = size;.
Why Sum is 45 was still printed correctly is hard to explain. Accessing values out of bounds is undefined behavior: the problem could have gone unnoticed, or the segmentation fault could have been raised before Sum is 45 was printed. Such is the magic of undefined behavior...
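To make the fix concrete, a minimal sketch of the corrected call (the rest of the program is unchanged):

    sendBuf = rank;
    // count describes the send buffer: a single int, so count is 1
    MPI_Reduce(&sendBuf, &recvBuf, 1, MPI_INT,
               MPI_SUM, 0, MPI_COMM_WORLD);
    // on process 0, recvBuf now holds the sum of all ranks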
I'm creating MPI groups in a loop to perform a task, but when I want to free a group, the computation aborts. When should I free the groups?
The error I get is:
[KLArch:13617] *** An error occurred in MPI_Comm_free
[KLArch:13617] *** reported by process [1712324609,2]
[KLArch:13617] *** on communicator MPI_COMM_WORLD
[KLArch:13617] *** MPI_ERR_COMM: invalid communicator
[KLArch:13617] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[KLArch:13617] *** and potentially your MPI job)
[KLArch:13611] 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[KLArch:13611] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
What I intend to do is a calculation with an increasing number of processes in parallel, as a benchmark for MPI: run the whole calculation with only 1 process and take the time, run the same calculation with 2 processes and take the time, then with 4 processes, and so on, and compare how the problem scales with the number of processes.
MWE:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int size = 0;
    int rank = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Get the group of processes in MPI_COMM_WORLD
    MPI_Group world_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    for (int i = 1; i <= size; i++)
    {
        int group_ranks[i];
        for (int j = 0; j < i; j++)
        {
            group_ranks[j] = j;
        }

        // Construct a group with all the ranks smaller than i
        MPI_Group sub_group;
        MPI_Group_incl(world_group, i, group_ranks, &sub_group);

        // Create a communicator based on the group
        MPI_Comm sub_comm;
        MPI_Comm_create(MPI_COMM_WORLD, sub_group, &sub_comm);

        int sub_rank = -1;
        int sub_size = -1;
        // If this rank isn't in the new communicator, sub_comm will be
        // MPI_COMM_NULL. Using MPI_COMM_NULL for MPI_Comm_rank or
        // MPI_Comm_size is erroneous.
        if (MPI_COMM_NULL != sub_comm)
        {
            MPI_Comm_rank(sub_comm, &sub_rank);
            MPI_Comm_size(sub_comm, &sub_size);
        }

        // Do some work
        printf("WORLD RANK/SIZE: %d/%d \t Group RANK/SIZE: %d/%d\n",
               rank, size, sub_rank, sub_size);

        // Free the communicator and group
        MPI_Barrier(MPI_COMM_WORLD);
        //MPI_Comm_free(&sub_comm);
        //MPI_Group_free(&sub_group);
    }

    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}
The code throws the error if I uncomment MPI_Comm_free(&sub_comm); and MPI_Group_free(&sub_group); at the end of the loop.
Let me collect some remarks about this code.

- That barrier is not needed.
- If you test this on a multi-node system, be aware that your processes are not spread evenly: 13 processes on 3 six-core nodes would give you 6+6+1, which is unbalanced. You would want 5+4+4 or so. Your mpiexec would do this correctly; achieving it in your code is a little harder. Just be aware of this since you are doing benchmarking.
- It's a little tricky to get this code right. When you make a subgroup, all processes get the same value for the group, including the ones that are not in the group; in particular, they do not get MPI_GROUP_NULL. Then you have to call MPI_Comm_create collectively on the large communicator; processes that are not in the group get MPI_COMM_NULL as the result, and they do not participate in the actions on the subcommunicator. Also, and this was your problem: they must not free the subcommunicator, but they do free the subgroup.

(That last point was also pointed out by @GillesGouaillardet.)
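As a sketch, the cleanup inside the loop would then look like this (the guard around MPI_Comm_free is the essential part):

    // Every rank holds a valid sub_group handle, so everyone frees it.
    MPI_Group_free(&sub_group);
    // Ranks outside the subgroup got MPI_COMM_NULL from MPI_Comm_create;
    // MPI_COMM_NULL must not be passed to MPI_Comm_free.
    if (MPI_COMM_NULL != sub_comm)
        MPI_Comm_free(&sub_comm);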
I'm new to MPI and I want my C program, which needs to be launched with two processes, to output this:
Hello
Good bye
Hello
Good bye
... (20 times)
But when I start it with mpirun -n 2 ./main it gives me this error and the program doesn't work:
*** An error occurred in MPI_Send
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[laptop:3786023] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
*** An error occurred in MPI_Send
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[laptop:3786024] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[61908,1],0]
Exit code: 1
--------------------------------------------------------------------------
Here's my code:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <unistd.h>
#include <string.h>

#define N 32

void send (int rank)
{
    char mess[N];
    if (rank == 0)
        sprintf(mess, "Hello");
    else
        sprintf(mess, "Good bye");
    MPI_Send(mess, strlen(mess), MPI_CHAR, !rank, 0, MPI_COMM_WORLD);
}

void receive (int rank)
{
    char buf[N];
    MPI_Status status;
    MPI_Recv(buf, N, MPI_CHAR, !rank, 0, MPI_COMM_WORLD, &status);
    printf("%s\n", buf);
}

int main (int argc, char **argv)
{
    if (MPI_Init(&argc, &argv)) {
        fprintf(stderr, "Erreur MPI_Init\n");
        exit(1);
    }
    int size, rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (size_t i = 0; i < 20; i++) {
        if (rank == 0) {
            send(rank);
            receive(rank);
        } else {
            receive(rank);
            send(rank);
        }
    }
    MPI_Finalize();
    return 0;
}
I don't understand this error and I don't know how to debug it.
Thank you if you can help me solve this (probably silly) problem!
The main error is that your function is called send(), which conflicts with the libc function of the same name and hence results in this bizarre behavior.
To avoid another undefined behavior, caused by the receiver reading uninitialized data, you also have to send the NUL-terminating character, e.g.
MPI_Send(mess, strlen(mess)+1, MPI_CHAR, !rank, 0, MPI_COMM_WORLD);
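A minimal sketch of the fix, with the function renamed (send_message is a hypothetical name picked for illustration; any name that doesn't clash with libc works):

    // Renamed from send() so it no longer collides with the libc socket call.
    void send_message(int rank)
    {
        char mess[N];
        sprintf(mess, rank == 0 ? "Hello" : "Good bye");
        // strlen(mess) + 1 so the terminating '\0' is transmitted too
        MPI_Send(mess, strlen(mess) + 1, MPI_CHAR, !rank, 0, MPI_COMM_WORLD);
    }

The call sites in main() change from send(rank) to send_message(rank) accordingly.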
I don't know how to fix the problem with this program so far. Its purpose is to add up all the numbers in an array, but I can barely manage to send the arrays before errors start to appear. It has to do with the for loop in the my_rank != 0 section.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]){
    int my_rank, p, source, dest, tag, total, n = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    //15 processors (1-15), not including processor 0
    if(my_rank != 0){
        MPI_Recv( &n, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
        int arr[n];
        MPI_Recv( arr, n, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
        //printf("%i ", my_rank);
        int i;
        for(i = ((my_rank-1)*(n/15)); i < ((my_rank-1)+(n/15)); i++ ){
            //printf("%i ", arr[0]);
        }
    }
    else{
        printf("Please enter an integer:\n");
        scanf("%i", &n);
        int i;
        int arr[n];
        for(i = 0; i < n; i++){
            arr[i] = i + 1;
        }
        for(dest = 0; dest < p; dest++){
            MPI_Send( &n, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
            MPI_Send( arr, n, MPI_INT, dest, tag, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
}
When I take that for loop out, it compiles and runs, but when I put it back in it just stops working. Here is the error it gives me:
[compute-0-24.local:1072] *** An error occurred in MPI_Recv
[compute-0-24.local:1072] *** on communicator MPI_COMM_WORLD
[compute-0-24.local:1072] *** MPI_ERR_RANK: invalid rank
[compute-0-24.local:1072] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
Please enter an integer:
--------------------------------------------------------------------------
mpirun has exited due to process rank 8 with PID 1072 on
node compute-0-24 exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[compute-0-16.local][[31957,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.4.237 failed: Connection refused (111)
[cs-cluster:11677] 14 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[cs-cluster:11677] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
There are two problems in the code you posted:
The send loop starts from dest = 0, which means that the process of rank zero will send to itself. However, since there is no receiving code for process zero, this won't work. Just make the loop start from dest = 1 and that should solve it.
The tag you use isn't initialised, so its value can be anything (which is OK in itself), but it can also be a different anything on each process, which will lead the various communications to never match each other. Just initialise tag = 0, for example, and that should fix it.
With this, your code snippet should work.
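A sketch of both fixes applied to the sending part (everything else unchanged):

    tag = 0;                           // deterministic, identical tag on every rank
    for(dest = 1; dest < p; dest++){   // start at 1: rank 0 has no receiving code
        MPI_Send( &n, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
        MPI_Send( arr, n, MPI_INT, dest, tag, MPI_COMM_WORLD);
    }

The receiving ranks must likewise use source = 0 and tag = 0 so the messages match.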
Learn to read the informative error messages that Open MPI gives you and to apply some general debugging strategies.
[compute-0-24.local:1072] *** An error occurred in MPI_Recv
[compute-0-24.local:1072] *** on communicator MPI_COMM_WORLD
[compute-0-24.local:1072] *** MPI_ERR_RANK: invalid rank
The library is telling you that the receive operation was called with an invalid rank value. Armed with that knowledge, you take a look at your code:
int my_rank, p, source, dest, tag, total, n = 0;
...
//15 processors(1-15) not including processor 0
if(my_rank != 0){
MPI_Recv( &n, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
...
The rank is source. source is an automatic variable declared some lines before but never initialised, so its initial value is completely random. You fix it by assigning source an initial value of 0, or by simply replacing it with 0, since you've already hard-coded the rank of the sender by singling out its code in the else branch of the if statement.
The presence of the above error eventually prompts you to examine the other variables too. Thus you notice that tag is also used uninitialised, and you either initialise it to e.g. 0 or replace it altogether.
Now your program is almost correct. You notice that it seems to work fine for n up to about 33000 (the default eager limit of the self transport divided by sizeof(int)), but then it hangs for larger values. You either fire up a debugger or simply add a printf statement before and after each send and receive operation, and you discover that already the first call to MPI_Send with dest equal to 0 never returns. You then take a closer look at your code and discover this:
for(dest = 0; dest < p; dest++){
dest starts from 0, but this is wrong since rank 0 is only sending data and not receiving. You fix it by setting the initial value to 1.
Your program should now work as intended (or at least for values of n that do not lead to stack overflow in int arr[n];). Congratulations! Now go and learn about MPI_Probe and MPI_Get_count, which will help you do the same without explicitly sending the length of the array first. Then learn about MPI_Scatter and MPI_Reduce, which will enable you to implement the algorithm even more elegantly.
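For instance, a sketch of a receiver that uses MPI_Probe and MPI_Get_count instead of a separate length message (assuming the sender is rank 0 and the tag is 0):

    MPI_Status status;
    int n;
    // Wait until a message from rank 0 with tag 0 is pending, without receiving it.
    MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
    // Ask how many MPI_INT elements the pending message carries.
    MPI_Get_count(&status, MPI_INT, &n);
    int *arr = malloc(n * sizeof(int));   // heap allocation also avoids the stack overflow
    MPI_Recv(arr, n, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

This makes the first MPI_Send of n unnecessary.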
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myid, numprocs, number_of_completed_operation;
    char message = 'a';

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    MPI_Request* requests = (MPI_Request*)malloc((numprocs - 1)*sizeof(MPI_Request));
    MPI_Status* statuses = (MPI_Status*)malloc(sizeof(MPI_Status)*(numprocs - 1));
    int* indices = (int *)malloc((numprocs - 1)*sizeof(int));
    char* buf = (char *)malloc((numprocs - 1)*sizeof(char));

    if (myid != numprocs - 1)
    { //worker
        printf("***this is sender %d\n", myid);
        MPI_Send(&message, 1, MPI_CHAR, numprocs - 1, 110, MPI_COMM_WORLD);
        printf("*.*sender %d is done\n", myid);
    }
    else if (myid == numprocs - 1)
    {
        //master
        int number_of_left_messages = numprocs - 1; // numprocs - 1 messages are expected to arrive
        int i;
        for (i = 0; i < numprocs - 1; i++)
        {
            MPI_Irecv(&buf+i, 1, MPI_CHAR, i, 110, MPI_COMM_WORLD, &requests[i]);
        }
        MPI_Waitsome(numprocs - 1, requests, &number_of_completed_operation, indices, statuses);
        number_of_left_messages = number_of_left_messages - number_of_completed_operation;
        printf("number of completed operation is %d\n", number_of_completed_operation);
        printf("left message amount is %d\n", number_of_left_messages);
        int j;
        for (j = 0; j < numprocs - 1; j++)
        {
            printf("-------------\n");
            printf("index is %d\n", indices[j]);
            printf("source is %d\n", statuses[j].MPI_SOURCE);
            //printf("good\n");
            printf("--------====\n");
        }
        while (number_of_left_messages > 0)
        {
            MPI_Waitsome(numprocs - 1, requests, &number_of_completed_operation, indices, statuses);
            printf("number of completed operation is %d\n", number_of_completed_operation);
            for (j = 0; j < numprocs - 1; j++)
            {
                printf("-------------\n");
                printf("index is %d\n", indices[j]);
                printf("source is %d\n", statuses[j].MPI_SOURCE);
                printf("--------====\n");
            }
            number_of_left_messages = number_of_left_messages - number_of_completed_operation;
            printf("left message amount is %d\n", number_of_left_messages);
The logic is simple: I set the final process as the master process and all the other processes as workers. The workers each send a message to the master, and the master uses MPI_Waitsome to receive them.
When I set the number of processes to 4 or larger, the system shows me the following error:
[soit-mpi-pro-1:12197] *** An error occurred in MPI_Waitsome
[soit-mpi-pro-1:12197] *** reported by process [140533176729601,140531329925123]
[soit-mpi-pro-1:12197] *** on communicator MPI_COMM_WORLD
[soit-mpi-pro-1:12197] *** MPI_ERR_REQUEST: invalid request
[soit-mpi-pro-1:12197] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[soit-mpi-pro-1:12197] *** and potentially your MPI job)
It looks like your call to MPI_Irecv might be the problem. Remove the extra & before buf (you have a double pointer instead of a pointer):
MPI_Irecv(buf+i, 1, MPI_CHAR, i, 110, MPI_COMM_WORLD, &requests[i]);
When I fix that, add closing braces and a call to MPI_Finalize(), and remove a bunch of extra output, I don't have any issues running your program:
$ mpiexec -n 8 ./a.out
***this is sender 3
*.*sender 3 is done
***this is sender 4
*.*sender 4 is done
***this is sender 5
*.*sender 5 is done
***this is sender 6
*.*sender 6 is done
***this is sender 0
*.*sender 0 is done
***this is sender 1
*.*sender 1 is done
***this is sender 2
*.*sender 2 is done
number of completed operation is 1
left message amount is 6
number of completed operation is 1
left message amount is 5
number of completed operation is 1
left message amount is 4
number of completed operation is 1
left message amount is 3
number of completed operation is 1
left message amount is 2
number of completed operation is 1
left message amount is 1
number of completed operation is 1
left message amount is 0
I have no idea if it gets the right answer or not, but that's a different question.
You are passing MPI_Irecv the address of the pointer buf itself plus an offset, instead of its value. When the message is received, it overwrites the last byte (on little-endian systems like x86/x64) of the value of one or more nearby stack variables, which, depending on the stack layout, might include requests and statuses. Therefore MPI_Waitsome receives a pointer that doesn't point to the beginning of the array of requests but rather somewhere before it, after it, or in the middle of it, hence some of the request handles are invalid and MPI_Waitsome complains. On a big-endian system, this would overwrite the highest byte of the address and would much rather result in an invalid address and a segmentation fault.
Either use buf+i (as per Wesley Bland's answer) or use &buf[i]. I usually find it a matter of personal taste whether one uses the first or the second form.
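For the record, a tiny sketch of why the two correct forms are interchangeable and the original one is not:

    // buf + i and &buf[i] are the same address: that of the i-th char in the buffer.
    // Either of these is correct (post exactly one of them):
    MPI_Irecv(buf + i, 1, MPI_CHAR, i, 110, MPI_COMM_WORLD, &requests[i]);
    // MPI_Irecv(&buf[i], 1, MPI_CHAR, i, 110, MPI_COMM_WORLD, &requests[i]);
    // &buf + i, by contrast, starts from the address of the pointer variable itself
    // and offsets it in units of sizeof(char *), so the receive writes over nearby
    // stack memory instead of into the buffer.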
I'm trying to exit my program gracefully if RdInput returns an error.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#define MASTER 0
#define Abort(x) MPI_Abort(MPI_COMM_WORLD, x)
#define Bcast(send_data, count, type) MPI_Bcast(send_data, count, type, MASTER, GROUP) //root --> MASTER
#define Finalize() MPI_Finalize()
int main(int argc, char **argv){
    //Code
    if( rank == MASTER ) {
        time (&start);
        printf("Initialized at %s\n", ctime (&start) );
        //Read file
        error = RdInput();
    }
    Bcast(&error, 1, INT); Wait();
    if( error = 1 ) MPI_Abort(1);
    //Code
    Finalize();
}
Program output:
mpirun -np 2 code.x
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Initialized at Wed May 30 11:34:46 2012
Error [RdInput]: The file "input.mga" is not available!
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 7369 on
node einstein exiting improperly. There are two reasons this could occur:
//More error message.
What can I do to gracefully exit an MPI program without printing this huge error message?
If you have this logic in your code:
Bcast(&error, 1, INT);
if( error = 1 ) MPI_Abort(1);
then you're just about done (although you don't need any kind of wait after a broadcast). The trick, as you've discovered, is that MPI_Abort() does not do "graceful"; it basically is there to shut things down in whatever way possible when something's gone horribly wrong.
In this case, since now everyone agrees on the error code after the broadcast, just do a graceful end of your program:
MPI_Bcast(&error, 1, MPI_INT, MASTER, MPI_COMM_WORLD);
if (error != 0) {
    if (rank == 0) {
        fprintf(stderr, "Error: Program terminated with error code %d\n", error);
    }
    MPI_Finalize();
    exit(error);
}
It's an error to call MPI_Finalize() and keep on going with more MPI stuff, but that's not what you're doing here, so you're fine.