Segmentation fault error due to MPI_Comm_size - C

I have a Fortran code which is designed to run with the default communicator MPI_COMM_WORLD, but I intend to run it on a few processors only. I have another code which uses MPI_Comm_split to obtain another communicator MyComm. It is an integer, and I got 3 when I printed its value. Now I am calling a C function from my Fortran code to get the rank and size corresponding to MyComm, but I am facing several issues here.
In Fortran, when I printed MyComm, its value was 3, but when I print it inside the C function, it becomes 17278324. I also printed the value of MPI_COMM_WORLD; its value was about 1140850688. I don't know what these values mean or why the value of MyComm changed.
My code compiles properly and creates the executable, but when I executed it, I got a segmentation fault error. I used gdb to debug my code, and the process terminated at the following line:
Program terminated with signal 11, Segmentation fault.
#0 0x00007fe5e8f6248c in PMPI_Comm_size (comm=0x107a574, size=0x13c4ba0) at pcomm_size.c:62
62 *size = ompi_comm_size((ompi_communicator_t*)comm);
I noticed that MPI_Comm_rank gives the correct rank for MyComm; the issue is only with MPI_Comm_size. There was no such issue with MPI_COMM_WORLD, so I am unable to understand what is causing this. I checked my inputs but found no clue. Here is my C code:
#include <stdio.h>
#include "utils_sub_names.h"
#include <mpi.h>
#define MAX_MSGTAG 1000
int flag_msgtag=0;
MPI_Request mpi_msgtags[MAX_MSGTAG];
char *ibuff;
int ipos,nbuff;
MPI_Comm MyComm;
void par_init_fortran (MPI_Fint *MyComm_r, MPI_Fint *machnum, MPI_Fint *machsize)
{
    MPI_Fint comm_in;
    comm_in = *MyComm_r;
    MyComm = MPI_Comm_f2c(comm_in);
    printf("my comm is %d \n", MyComm);
    MPI_Comm_rank(MyComm, machnum);
    printf("my machnum is %d \n ", machnum);
    MPI_Comm_rank(MyComm, machsize);
    printf("my machnum is %d \n ", machsize);
}
Edit:
I want to declare MyComm as a global communicator for all the functions listed in my C code, but I don't know why my communicator is still invalid. Note that MPI is initialized and finalized in Fortran only; I expect I don't have to initialize it in C again. I am using the following Fortran code.
implicit none
include 'mpif.h'
integer :: MyColor, MyCOMM, MyError, MyKey, Nnodes
integer :: MyRank, pelast
CALL mpi_init (MyError)
CALL mpi_comm_size (MPI_COMM_WORLD, Nnodes, MyError)
CALL mpi_comm_rank (MPI_COMM_WORLD, MyRank, MyError)
MyColor=1
MyKey=0
CALL mpi_comm_split (MPI_COMM_WORLD, MyColor, MyKey, MyComm,MyError)
CALL ramcpl (MyComm)
CALL mpi_barrier (MPI_COMM_WORLD, MyError)
CALL MCTWorld_clean ()
CALL mpi_finalize (MyError)
My subroutine ramcpl is located in another file:
subroutine ramcpl (MyComm_r)
implicit none
integer :: MyComm_r, ierr
.
.
.
CALL par_init_fortran (MyComm_r, my_mpi_num,nmachs);
End Subroutine ramcpl
The command line and the output are:
mpirun -np 4 ./ramcplM ramcpl.in
Model Coupling:
[localhost:31472] *** Process received signal ***
[localhost:31473] *** Process received signal ***
[localhost:31472] Signal: Segmentation fault (11)
[localhost:31472] Signal code: Address not mapped (1)
[localhost:31472] Failing at address: (nil)
[localhost:31473] Signal: Segmentation fault (11)
[localhost:31473] Signal code: Address not mapped (1)
[localhost:31473] Failing at address: (nil)
[localhost:31472] [ 0] /lib64/libpthread.so.0() [0x3120c0f7e0]
[localhost:31472] [ 1] ./ramcplM(par_init_fortran_+0x122) [0x842db2]
[localhost:31472] [ 2] ./ramcplM(__rams_MOD_rams_cpl+0x7a0) [0x8428c0]
[localhost:31472] [ 3] ./ramcplM(MAIN__+0xea6) [0x461086]
[localhost:31472] [ 4] ./ramcplM(main+0x2a) [0xc3eefa]
[localhost:31472] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x312081ed1d]
[localhost:31472] [ 6] ./ramcplM() [0x45e2d9]
[localhost:31472] *** End of error message ***
[localhost:31473] [ 0] /lib64/libpthread.so.0() [0x3120c0f7e0]
[localhost:31473] [ 1] ./ramcplM(par_init_fortran_+0x122) [0x842db2]
[localhost:31473] [ 2] ./ramcplM(__rammain_MOD_ramcpl+0x7a0) [0x8428c0]
[localhost:31473] [ 3] ./ramcplM(MAIN__+0xea6) [0x461086]
[localhost:31473] [ 4] ./ramcplM(main+0x2a) [0xc3eefa]
[localhost:31473] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x312081ed1d]
[localhost:31473] [ 6] ./ramcplM() [0x45e2d9]
[localhost:31473] *** End of error message ***

The handles in Fortran and C are NOT compatible. Use MPI_Comm_f2c (https://linux.die.net/man/3/mpi_comm_f2c) and the related conversion functions. Pass the communicator between Fortran and C as an integer (MPI_Fint), not as an MPI_Comm.
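A minimal sketch of that pattern, assuming the Fortran side passes the handle returned by MPI_Comm_split as a plain integer (the function and argument names below are illustrative, not the ones from the question):

#include <stdio.h>
#include <mpi.h>
/* Receives the Fortran communicator handle as MPI_Fint and converts it
   to a C MPI_Comm before calling any C-side MPI routine with it. */
void par_init_c (MPI_Fint *fcomm, MPI_Fint *rank_out, MPI_Fint *size_out)
{
    MPI_Comm comm = MPI_Comm_f2c(*fcomm);   /* Fortran handle -> C handle */
    int rank, size;
    MPI_Comm_rank(comm, &rank);             /* use local ints, then copy back */
    MPI_Comm_size(comm, &size);
    *rank_out = (MPI_Fint) rank;
    *size_out = (MPI_Fint) size;
    printf("rank %d of %d\n", rank, size);
}

Note also that printing an MPI_Comm with %d is not meaningful: in Open MPI the C handle is a pointer type, which is likely why MyComm shows up as 17278324 inside the C function instead of 3.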

Related

Open MPI oversubscription failing

I'm trying to run an MPI C program with more processes than the CPUs I have, using the --oversubscribe flag. The problem is that even if I do that, it returns a segmentation fault.
$ mpirun -np 8 --oversubscribe ./a
[0piero:68195] *** Process received signal ***
[0piero:68195] Signal: Segmentation fault (11)
[0piero:68195] Signal code: Address not mapped (1)
[0piero:68195] Failing at address: 0x7fd162194c80
[0piero:68185] *** Process received signal ***
[0piero:68185] Signal: Segmentation fault (11)
[0piero:68185] Signal code: Address not mapped (1)
[0piero:68185] Failing at address: 0x7ffbf42f13e0
[0piero:68185] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7ffbf88e2520]
[0piero:68185] [ 1] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc2b0)[0x7ffbf48d42b0]
[0piero:68185] [ 2] /lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7ffbf8934b43]
[0piero:68185] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7ffbf89c6a00]
[0piero:68185] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 0 on node 0piero exited on signal 11 (Segmentation fault).
I'm currently using Open MPI 4.1.2.

I'm learning MPI with C and don't understand why my code doesn't work

I'm new to MPI and I want my C program, which needs to be launched with two processes, to output this:
Hello
Good bye
Hello
Good bye
... (20 times)
But when I start it with mpirun -n 2 ./main it gives me this error and the program doesn't work:
*** An error occurred in MPI_Send
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[laptop:3786023] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
*** An error occurred in MPI_Send
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[laptop:3786024] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[61908,1],0]
Exit code: 1
--------------------------------------------------------------------------
Here's my code:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <unistd.h>
#include <string.h>
#define N 32
void send (int rank)
{
    char mess[N];
    if (rank == 0)
        sprintf(mess, "Hello");
    else
        sprintf(mess, "Good bye");
    MPI_Send(mess, strlen(mess), MPI_CHAR, !rank, 0, MPI_COMM_WORLD);
}
void receive (int rank)
{
    char buf[N];
    MPI_Status status;
    MPI_Recv(buf, N, MPI_CHAR, !rank, 0, MPI_COMM_WORLD, &status);
    printf("%s\n", buf);
}
int main (int argc, char **argv)
{
    if (MPI_Init(&argc, &argv)) {
        fprintf(stderr, "Erreur MPI_Init\n");
        exit(1);
    }
    int size, rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (size_t i = 0; i < 20; i++) {
        if (rank == 0) {
            send(rank);
            receive(rank);
        } else {
            receive(rank);
            send(rank);
        }
    }
    MPI_Finalize();
    return 0;
}
I don't understand that error and I don't know how to debug it.
Thank you if you can help me solve this (probably silly) problem!
The main error is that your function is called send(), which conflicts with the libc function of the same name and hence results in this bizarre behavior.
In order to avoid another undefined behavior caused by reading uninitialized data on the receiving side, you also have to send the NUL terminating character, e.g.
MPI_Send(mess, strlen(mess)+1, MPI_CHAR, !rank, 0, MPI_COMM_WORLD);
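A minimal sketch of the corrected sender, assuming the function is simply renamed (send_msg is an example name; anything that does not clash with libc will do) and the rest of the program stays as posted:

/* renamed from send() so it no longer shadows the libc socket function */
void send_msg (int rank)
{
    char mess[N];
    if (rank == 0)
        sprintf(mess, "Hello");
    else
        sprintf(mess, "Good bye");
    /* +1 so the terminating '\0' travels with the string */
    MPI_Send(mess, strlen(mess)+1, MPI_CHAR, !rank, 0, MPI_COMM_WORLD);
}

The call sites in main() would then call send_msg(rank) instead of send(rank).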

Trying to receive a vector with MPI_Recv

I'm implementing the Chan and Dehne sorting algorithm using MPI and the CGM realistic parallel model. So far each process receives N/p numbers from the original vector, each process then sorts its numbers sequentially using quicksort, each process then creates a sample from its local vector (the sample has size p), and each process then sends its sample to P0. P0 should receive all samples in a bigger vector of size p*p so it can accommodate the data from all processes. This is where I'm stuck: it seems to be working, but for some reason after P0 receives all the data it exits with Signal: Segmentation fault (11). Thank you.
Here is the relevant part of the code:
// Step 2. Each process calculates it's local sample with size comm_sz
local_sample = create_local_sample(sub_vec, n_over_p, comm_sz);
// Step 3. Each process sends it's local sample to P0
if (my_rank == 0) {
    global_sample_receiver = (int*) malloc(pow(comm_sz,2) * sizeof(int));
    global_sample_receiver = local_sample;
    for (i = 1; i < comm_sz; i++) {
        MPI_Recv(global_sample_receiver + (i * comm_sz), comm_sz, MPI_INT,
                 i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
} else {
    MPI_Send(local_sample, comm_sz, MPI_INT, 0, 0, MPI_COMM_WORLD);
}
printf("P%d got here\n", my_rank);
MPI_Finalize();
What is funny is that every process reaches the line printf("P%d got here\n", my_rank); and therefore prints to the terminal. Also, global_sample_receiver does contain the data it is supposed to contain at the end, but the program still finishes with a segmentation fault.
Here is the output:
P2 got here
P0 got here
P3 got here
P1 got here
[Krabbe-Ubuntu:05969] *** Process received signal ***
[Krabbe-Ubuntu:05969] Signal: Segmentation fault (11)
[Krabbe-Ubuntu:05969] Signal code: Address not mapped (1)
[Krabbe-Ubuntu:05969] Failing at address: 0x18000003e7
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 5969 on node Krabbe-Ubuntu
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Edit: I found the problem; it turns out local_sample also needed a malloc.
The issue is that you overwrite global_sample_receiver (which is a pointer) with local_sample (which is another pointer) on rank zero.
If you want to set the first comm_sz elements of global_sample_receiver to the first comm_sz elements of local_sample, then you have to copy the data (i.e. not the pointer) manually:
memcpy(global_sample_receiver, local_sample, comm_sz * sizeof(int));
That being said, the natural MPI way of doing this is via MPI_Gather().
Here is what step 3 would look like:
// Step 3. Each process sends it's local sample to P0
if (my_rank == 0) {
global_sample_receiver = (int*)malloc(pow(comm_sz,2)*sizeof(int));
}
MPI_Gather(local_sample, comm_sz, MPI_INT, global_sample_receiver, comm_sz, MPI_INT, 0, MPI_COMM_WORLD);

Stack smashing detected when using MPI_Reduce

I have learned to use some MPI functions. When I try to use MPI_Reduce, I get stack smashing detected when I run my code:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
void main(int argc, char **argv) {
    int i, rank, size;
    int sendBuf, recvBuf, count;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    sendBuf = rank;
    count = size;
    MPI_Reduce(&sendBuf, &recvBuf, count, MPI_INT,
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        printf("Sum is %d\n", recvBuf);
    }
    MPI_Finalize();
}
My code seems okay to me. It should print the sum of all ranks in recvBuf on process 0. In this case, it should print Sum is 45 if I run my code with 10 processes (mpirun -np 10 myexecutefile). But I don't know why my code has this error:
Sum is 45
*** stack smashing detected ***: example6 terminated
[ubuntu:06538] *** Process received signal ***
[ubuntu:06538] Signal: Aborted (6)
[ubuntu:06538] Signal code: (-6)
[ubuntu:06538] *** Process received signal ***
[ubuntu:06538] Signal: Segmentation fault (11)
[ubuntu:06538] Signal code: (128)
[ubuntu:06538] Failing at address: (nil)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
What is the problem and how can I fix it?
In
MPI_Reduce(&sendBuf, &recvBuf, count, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
The argument count must be the number of elements in the send buffer. Since sendBuf is a single integer, use count = 1; instead of count = size;.
The reason why Sum is 45 was still printed correctly is hard to explain. Accessing values out of bounds is undefined behavior: the problem could have remained unnoticed, or the segmentation fault could have been raised before Sum is 45 was printed. The magic of undefined behavior...
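Alternatively, if the intent really is to reduce size elements per rank, both the send and the receive buffer must be arrays of at least that length. A minimal sketch, assuming <stdlib.h> is included for malloc/free:

int *sendArr = malloc(size * sizeof(int));
int *recvArr = malloc(size * sizeof(int));   /* only significant on rank 0 */
for (i = 0; i < size; i++)
    sendArr[i] = rank;                       /* each rank contributes size elements */
MPI_Reduce(sendArr, recvArr, size, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("Element-wise sum of ranks: %d\n", recvArr[0]);   /* 45 with 10 processes */
free(sendArr);
free(recvArr);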

I don't see what the issue is in my MPI program

I don't know how to fix the problem with this program so far. The purpose of this program is to add up all the numbers in an array, but I can only barely manage to send the arrays before errors start to appear. It has to do with the for loop in the if(my_rank != 0) section.
#include <stdio.h>
#include <mpi.h>
int main(int argc, char* argv[]){
    int my_rank, p, source, dest, tag, total, n = 0;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    //15 processors(1-15) not including processor 0
    if(my_rank != 0){
        MPI_Recv( &n, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
        int arr[n];
        MPI_Recv( arr, n, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
        //printf("%i ", my_rank);
        int i;
        for(i = ((my_rank-1)*(n/15)); i < ((my_rank-1)+(n/15)); i++ ){
            //printf("%i ", arr[0]);
        }
    }
    else{
        printf("Please enter an integer:\n");
        scanf("%i", &n);
        int i;
        int arr[n];
        for(i = 0; i < n; i++){
            arr[i] = i + 1;
        }
        for(dest = 0; dest < p; dest++){
            MPI_Send( &n, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
            MPI_Send( arr, n, MPI_INT, dest, tag, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
}
When I take that for loop out, it compiles and runs, but when I put it back in it just stops working. Here is the error it gives me:
[compute-0-24.local:1072] *** An error occurred in MPI_Recv
[compute-0-24.local:1072] *** on communicator MPI_COMM_WORLD
[compute-0-24.local:1072] *** MPI_ERR_RANK: invalid rank
[compute-0-24.local:1072] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
Please enter an integer:
--------------------------------------------------------------------------
mpirun has exited due to process rank 8 with PID 1072 on
node compute-0-24 exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[compute-0-16.local][[31957,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.4.237 failed: Connection refused (111)
[cs-cluster:11677] 14 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[cs-cluster:11677] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
There are two problems in the code you posted:
The send loop starts from dest=0, which means that the process of rank zero will send to itself. However, since there is no receiving part for process zero, this won't work. Just make the loop start from dest=1 and that should solve it.
The tag you use isn't initialised, so its value can be anything (which is OK), but it can be a different anything on each process, which will lead to the various communications never matching each other. Just initialise tag=0, for example, and that should fix it.
With this, your code snippet should work.
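A minimal sketch of those two fixes, using the variable names from the snippet above (the rest of the program is unchanged):

tag = 0;                              /* a defined tag value, identical on every rank */

/* in the rank-0 branch: start at 1, since rank 0 sends but never receives */
for (dest = 1; dest < p; dest++) {
    MPI_Send(&n, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
    MPI_Send(arr, n, MPI_INT, dest, tag, MPI_COMM_WORLD);
}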
Learn to read the informative error messages that Open MPI gives you and to apply some general debugging strategies.
[compute-0-24.local:1072] *** An error occurred in MPI_Recv
[compute-0-24.local:1072] *** on communicator MPI_COMM_WORLD
[compute-0-24.local:1072] *** MPI_ERR_RANK: invalid rank
The library is telling you that the receive operation was called with an invalid rank value. Armed with that knowledge, you take a look at your code:
int my_rank, p, source, dest, tag, total, n = 0;
...
//15 processors(1-15) not including processor 0
if(my_rank != 0){
MPI_Recv( &n, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
...
The rank is source. source is an automatic variable declared some lines before but never initialised, so its initial value is completely random. You fix it by giving source an initial value of 0, or by simply replacing it with 0, since you have already hard-coded the rank of the sender by singling out its code in the else branch of the if operator.
The presence of the above error eventually hints that you should examine the other variables too. Thus you notice that tag is also used uninitialised, and you either initialise it to e.g. 0 or replace it altogether.
Now your program is almost correct. You notice that it seems to work fine for n up to about 33000 (the default eager limit of the self transport divided by sizeof(int)), but then it hangs for larger values. You either fire up a debugger or simply add a printf statement before and after each send and receive operation, and you discover that already the first call to MPI_Send with dest equal to 0 never returns. You then take a closer look at your code and discover this:
for(dest = 0; dest < p; dest++){
dest starts from 0, but this is wrong since rank 0 is only sending data and not receiving. You fix it by setting the initial value to 1.
Your program should now work as intended (or at least for values of n that do not lead to stack overflow in int arr[n];). Congratulations! Now go and learn about MPI_Probe and MPI_Get_count, which will help you do the same without explicitly sending the length of the array first. Then learn about MPI_Scatter and MPI_Reduce, which will enable you to implement the algorithm even more elegantly.
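A minimal sketch of the MPI_Probe / MPI_Get_count idea for the my_rank != 0 branch, assuming tag has been initialised as above and arr is switched to heap allocation (<stdlib.h> for malloc/free); the length no longer has to be sent separately:

MPI_Probe(0, tag, MPI_COMM_WORLD, &status);    /* wait for the message from rank 0 */
MPI_Get_count(&status, MPI_INT, &n);           /* number of MPI_INT elements it carries */
int *arr = malloc(n * sizeof(int));            /* heap allocation instead of int arr[n] */
MPI_Recv(arr, n, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
/* ... process the slice of arr assigned to this rank ... */
free(arr);

The sender then only needs the second MPI_Send, the one that transmits arr itself.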
