Open MPI oversubscription failing - C

I'm trying to run an MPI C program with more processes than I have CPUs, using the --oversubscribe flag. The problem is that even with the flag, it returns a segmentation fault.
$ mpirun -np 8 --oversubscribe ./a
[0piero:68195] *** Process received signal ***
[0piero:68195] Signal: Segmentation fault (11)
[0piero:68195] Signal code: Address not mapped (1)
[0piero:68195] Failing at address: 0x7fd162194c80
[0piero:68185] *** Process received signal ***
[0piero:68185] Signal: Segmentation fault (11)
[0piero:68185] Signal code: Address not mapped (1)
[0piero:68185] Failing at address: 0x7ffbf42f13e0
[0piero:68185] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7ffbf88e2520]
[0piero:68185] [ 1] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc2b0)[0x7ffbf48d42b0]
[0piero:68185] [ 2] /lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7ffbf8934b43]
[0piero:68185] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7ffbf89c6a00]
[0piero:68185] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 0 on node 0piero exited on signal 11 (Segmentation fault).
I'm currently using Open MPI 4.1.2.
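The source of ./a isn't shown, so a first step is to check whether oversubscription itself is the problem. A minimal test program (a sketch; the file and program names are arbitrary):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d says hello\n", rank, size);
    MPI_Finalize();
    return 0;
}

If mpicc hello.c -o hello followed by mpirun -np 8 --oversubscribe ./hello completes cleanly, the segfault comes from ./a itself (the libstdc++ frame in the backtrace suggests C++ code) rather than from the --oversubscribe flag.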

Related

Reset after hard fault

I'm trying to debug a hard fault in a C++ firmware project for the micro:bit v1.5. The issue at hand is that after a hard fault I would like to reset the microcontroller and start anew, but issuing the dreaded monitor reset halt does not work, and execution never restarts properly after a hard fault.
I'm using pyocd (v0.33.1) as my gdb debug server and a custom-built gdb (v8.2.1) with proper support for the nRF51 series.
This is an example interaction with gdb. I set a breakpoint on HardFault_Handler and start execution. The firmware correctly spawns its tasks, but eventually one of them faults and the HardFault handler gets called. After this I would like to reset the microcontroller and start anew.
I expect the microcontroller to spawn the same set of tasks, but this never happens, and it also never goes back to main, so I'm thinking there must be a specific way to reset it correctly.
What command should I issue to reset the flow of execution to start with main or one of the routines from gcc_startup?
(gdb) info breakpoints
Num Type Disp Enb Address What
1 breakpoint keep y 0x000290e2 ../support/libs/nrfx/mdk/gcc_startup_nrf51.S:234
(gdb) c
Continuing.
[New Thread 2]
[New Thread 536884080]
[New Thread 536880760]
[New Thread 536884152]
Thread 2 "Handler mode" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 2]
0x000006b0 in ?? ()
(gdb) info threads
Id Target Id Frame
* 2 Thread 2 "Handler mode" (HardFault) 0x000006b0 in ?? ()
3 Thread 536884080 "IDL" (Ready; Priority 0) prvIdleTask (pvParameters=0x0)
at ../support/freertos/tasks.c:3225
4 Thread 536880760 "KNL" (Ready; Priority 1) starlight::sys::Task::<lambda(void*)>::_FUN(void *)
() at ../include/starlight/sys/task.hpp:154
5 Thread 536884152 "Tmr" (Running; Priority 2) __DSB ()
at ../support/libs/CMSIS-Core/Include/cmsis_gcc.h:946
(gdb) monitor reset halt
Resetting target with halt
Successfully halted device on reset
(gdb) c
Continuing.
[New Thread 1]
Thread 6 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1]
0x000006b0 in ?? ()
(gdb) info threads
Id Target Id Frame
* 6 Thread 1 (HardFault) 0x000006b0 in ?? ()
(gdb) monitor reset halt
Resetting target with halt
Successfully halted device on reset
(gdb) c
Continuing.
Thread 6 received signal SIGSEGV, Segmentation fault.
0x000006b0 in ?? ()
(gdb) backtrace
#0 0x000006b0 in ?? ()
#1 <signal handler called>
Backtrace stopped: Cannot access memory at address 0x4b0547f8
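One way to narrow down where the fault comes from before resetting (a sketch, not from the original post; it assumes the standard Cortex-M exception stack frame used by the nRF51's Cortex-M0) is a HardFault handler that hands the stacked program counter to a C function the debugger can inspect:

#include <stdint.h>

/* Select the stack that was in use when the fault hit (bit 2 of
   EXC_RETURN in lr) and pass the exception frame to a C handler.
   Cortex-M0 compatible assembly. */
__attribute__((naked)) void HardFault_Handler(void)
{
    __asm__ volatile(
        "movs r0, #4              \n"
        "mov  r1, lr              \n"
        "tst  r0, r1              \n"
        "beq  1f                  \n"
        "mrs  r0, psp             \n"
        "b    2f                  \n"
        "1: mrs  r0, msp          \n"
        "2: ldr  r1, =hardfault_c \n"
        "bx   r1                  \n");
}

void hardfault_c(uint32_t *frame)
{
    /* Stacked frame layout: r0, r1, r2, r3, r12, lr, pc, xPSR */
    volatile uint32_t stacked_pc = frame[6];
    volatile uint32_t stacked_lr = frame[5];
    (void)stacked_pc;
    (void)stacked_lr;
    for (;;) { /* spin so the values can be read from gdb */ }
}

With the faulting address known, it is easier to tell whether the reset itself fails or whether the firmware simply faults again at the same spot after restarting.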

MPI cannot create parallel process, fails at MPI_INIT

I am learning the MPI interface for C and just installed MPI on my system (macOS Mojave 10.14.6), following this tutorial.
Everything went fine, and now I wanted to try my first simple program.
I tried to understand the error and searched for a solution, but could not find one on my own. I am using CLion as my IDE.
main.c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hi from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
CMakeLists.txt
cmake_minimum_required(VERSION 3.15)
# The compiler must be set before project() for CMake to honor it
set(CMAKE_C_COMPILER /opt/openmpi/bin/mpicc)
set(CMAKE_CXX_COMPILER /opt/openmpi/bin/mpic++)
project(uebung02 C)
set(CMAKE_C_STANDARD 99)
add_executable(uebung02 main.c)
Error output
/Users/admin/Documents/HU/VSuA/uebung02/cmake-build-debug/uebung02
--------------------------------------------------------------------------
PMIx has detected a temporary directory name that results
in a path that is too long for the Unix domain socket:

    Temp dir: /var/folders/9j/54dxfbf1451dk82nn7y99d8r0000gn/T/openmpi-sessions-501#admins-MacBook-Pro_0/15653

Try setting your TMPDIR environmental variable to point to
something shorter in length
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

    orte_ess_init failed
    --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[admins-MacBook-Pro.local:44625] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 582
[admins-MacBook-Pro.local:44625] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 166
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

    ompi_mpi_init: ompi_rte_init failed
    --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------
[admins-MacBook-Pro.local:44625] *** An error occurred in MPI_Init
[admins-MacBook-Pro.local:44625] *** on a NULL communicator
[admins-MacBook-Pro.local:44625] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[admins-MacBook-Pro.local:44625] *** and potentially your MPI job)
[admins-MacBook-Pro.local:44625] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Process finished with exit code 1
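The PMIx message itself points at the fix: the default macOS temp directory under /var/folders/... is too long for the Unix domain socket path. Setting TMPDIR to something short (for example /tmp) in the shell or in CLion's run configuration before launching should resolve it. As a sketch (assuming the singleton launch inherits the process environment, which it normally does), the override can even be done in code before MPI_Init:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    /* Point Open MPI's session directory at a short path before
       MPI_Init spawns its support daemon. Equivalent to running
       the program with TMPDIR=/tmp in the environment. */
    setenv("TMPDIR", "/tmp", 1);

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hi from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}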

Segmentation fault error due to MPI_comm_size

I have a Fortran code which is designed to run with the default communicator MPI_COMM_WORLD, but I intend to run it with only a few processors. I have another code which uses MPI_comm_split to get another communicator, MyComm. It is an integer, and I got 3 when I printed its value. Now I am calling a C function from my Fortran code to get the rank and size corresponding to MyComm, but I am facing several issues.
In Fortran, when I printed MyComm, its value was 3, but when I print it inside the C function it becomes 17278324. I also printed the value of MPI_COMM_WORLD; its value was about 1140850688. I don't know what these values mean, or why the value of MyComm changed.
My code compiles properly and creates the executable, but when I executed it I got a segmentation fault. I used gdb to debug my code, and the process terminated at the following line:
Program terminated with signal 11, Segmentation fault.
#0 0x00007fe5e8f6248c in PMPI_Comm_size (comm=0x107a574, size=0x13c4ba0) at pcomm_size.c:62
62 *size = ompi_comm_size((ompi_communicator_t*)comm);
I noticed that MPI_comm_rank gives the correct rank for MyComm; the issue is only with MPI_comm_size, and there was no such issue with MPI_COMM_WORLD. So I am unable to understand what is causing this. I checked my inputs but did not find any clue. Here is my C code:
#include <stdio.h>
#include "utils_sub_names.h"
#include <mpi.h>

#define MAX_MSGTAG 1000

int flag_msgtag = 0;
MPI_Request mpi_msgtags[MAX_MSGTAG];
char *ibuff;
int ipos, nbuff;
MPI_Comm MyComm;

void par_init_fortran(MPI_Fint *MyComm_r, MPI_Fint *machnum, MPI_Fint *machsize)
{
    MPI_Fint comm_in;
    comm_in = *MyComm_r;
    MyComm = MPI_Comm_f2c(comm_in);
    printf("my comm is %d \n", MyComm);
    MPI_Comm_rank(MyComm, machnum);
    printf("my machnum is %d \n", *machnum);
    MPI_Comm_size(MyComm, machsize);
    printf("my machsize is %d \n", *machsize);
}
Edit:
I want to declare MyComm as a global communicator for all the functions in my C code, but I don't know why my communicator is still invalid. Note that the MPI routines are initialized and finalized in Fortran only; I expect I don't have to initialize them in C again. I am using the following Fortran code.
implicit none
include 'mpif.h'

integer :: MyColor, MyCOMM, MyError, MyKey, Nnodes
integer :: MyRank, pelast

CALL mpi_init (MyError)
CALL mpi_comm_size (MPI_COMM_WORLD, Nnodes, MyError)
CALL mpi_comm_rank (MPI_COMM_WORLD, MyRank, MyError)

MyColor = 1
MyKey = 0
CALL mpi_comm_split (MPI_COMM_WORLD, MyColor, MyKey, MyComm, MyError)
CALL ramcpl (MyComm)

CALL mpi_barrier (MPI_COMM_WORLD, MyError)
CALL MCTWorld_clean ()
CALL mpi_finalize (MyError)
My subroutine ramcpl is located elsewhere:
subroutine ramcpl (MyComm_r)
    implicit none
    integer :: MyComm_r, ierr
    .
    .
    .
    CALL par_init_fortran (MyComm_r, my_mpi_num, nmachs)
End Subroutine ramcpl
The command line and the output are:
mpirun -np 4 ./ramcplM ramcpl.in
Model Coupling:
[localhost:31472] *** Process received signal ***
[localhost:31473] *** Process received signal ***
[localhost:31472] Signal: Segmentation fault (11)
[localhost:31472] Signal code: Address not mapped (1)
[localhost:31472] Failing at address: (nil)
[localhost:31473] Signal: Segmentation fault (11)
[localhost:31473] Signal code: Address not mapped (1)
[localhost:31473] Failing at address: (nil)
[localhost:31472] [ 0] /lib64/libpthread.so.0() [0x3120c0f7e0]
[localhost:31472] [ 1] ./ramcplM(par_init_fortran_+0x122) [0x842db2]
[localhost:31472] [ 2] ./ramcplM(__rams_MOD_rams_cpl+0x7a0) [0x8428c0]
[localhost:31472] [ 3] ./ramcplM(MAIN__+0xea6) [0x461086]
[localhost:31472] [ 4] ./ramcplM(main+0x2a) [0xc3eefa]
[localhost:31472] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x312081ed1d]
[localhost:31472] [ 6] ./ramcplM() [0x45e2d9]
[localhost:31472] *** End of error message ***
[localhost:31473] [ 0] /lib64/libpthread.so.0() [0x3120c0f7e0]
[localhost:31473] [ 1] ./ramcplM(par_init_fortran_+0x122) [0x842db2]
[localhost:31473] [ 2] ./ramcplM(__rammain_MOD_ramcpl+0x7a0) [0x8428c0]
[localhost:31473] [ 3] ./ramcplM(MAIN__+0xea6) [0x461086]
[localhost:31473] [ 4] ./ramcplM(main+0x2a) [0xc3eefa]
[localhost:31473] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x312081ed1d]
[localhost:31473] [ 6] ./ramcplM() [0x45e2d9]
[localhost:31473] *** End of error message ***
The handles in Fortran and C are NOT compatible. Use MPI_Comm_f2c (https://linux.die.net/man/3/mpi_comm_f2c) and the related conversion functions. Pass the communicator between C and Fortran as an integer, not as an MPI_Comm.
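Putting that advice into code, a corrected sketch of the C side (keeping the original names; the printf of the raw handle is dropped, since an MPI_Comm is an opaque handle and not meaningfully printable as an int):

#include <stdio.h>
#include "utils_sub_names.h"
#include <mpi.h>

MPI_Comm MyComm;  /* global C-side communicator */

void par_init_fortran(MPI_Fint *MyComm_r, MPI_Fint *machnum, MPI_Fint *machsize)
{
    int rank, size;

    /* Convert the Fortran integer handle to a C MPI_Comm. */
    MyComm = MPI_Comm_f2c(*MyComm_r);

    MPI_Comm_rank(MyComm, &rank);
    MPI_Comm_size(MyComm, &size);
    printf("my machnum is %d\n", rank);
    printf("my machsize is %d\n", size);

    /* Hand plain integers back to Fortran. */
    *machnum  = (MPI_Fint) rank;
    *machsize = (MPI_Fint) size;
}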

(Manjaro Linux) OpenMPI not running any kind of compiled C multiprocess program

I have Manjaro Linux 17.1.10 (kernel 4.17.0-2) on a ThinkPad T420 (Core i5 2450M, upgraded to 8 GB RAM), and I'm trying to run some C programs using OpenMPI (version 3.1.0), but I'm having trouble running them.
I am able to compile them without issues, both from the terminal using mpicc and with Eclipse's "build" option, but when I try to run them, either from the terminal or from Eclipse (configured as a parallel application in the launch options), I face errors no matter which code I run.
I'm trying to run the MPI hello world C project that comes with Eclipse Parallel:
/*
 ============================================================================
 Name        : test.c
 Author      : MGMX
 Version     : 1
 Copyright   : GNU 3.0
 Description : Hello MPI World in C
 ============================================================================
 */
#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char* argv[]) {
    int my_rank;          /* rank of process */
    int p;                /* number of processes */
    int source;           /* rank of sender */
    int dest;             /* rank of receiver */
    int tag = 0;          /* tag for messages */
    char message[100];    /* storage for message */
    MPI_Status status;    /* return status for receive */

    /* start up MPI */
    MPI_Init(&argc, &argv);

    /* find out process rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* find out number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (my_rank != 0) {
        /* create message */
        sprintf(message, "Hello MPI World from process %d!", my_rank);
        dest = 0;
        /* use strlen+1 so that '\0' gets transmitted */
        MPI_Send(message, strlen(message) + 1, MPI_CHAR,
                 dest, tag, MPI_COMM_WORLD);
    } else {
        printf("Hello MPI World From process 0: Num processes: %d\n", p);
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag,
                     MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    }

    /* shut down MPI */
    MPI_Finalize();
    return 0;
}
If I run the program with a plain ./test without parameters, I get output:
[mgmx#ThinkPad Debug]$ ./test
Hello MPI World From process 0: Num processes: 1
but if I use mpirun I get different results depending on the number of processes (-np #) I select, and all of them throw errors:
[mgmx#ThinkPad Debug]$ mpirun -np 0 test
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
[mgmx#ThinkPad Debug]$ mpirun -np 1 test
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[49948,1],0]
Exit code: 1
--------------------------------------------------------------------------
[mgmx#ThinkPad Debug]$ mpirun -np 2 test
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
[mgmx#ThinkPad Debug]$ mpirun -np 3 test
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
test
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[mgmx#ThinkPad Debug]$ mpirun -np 4 test
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
test
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
I am running everything locally.
Here is the output of --version for both mpicc and mpirun:
[mgmx#ThinkPad Debug]$ mpicc --version
gcc (GCC) 8.1.1 20180531
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.
[mgmx#ThinkPad Debug]$ mpirun --version
mpirun (Open MPI) 3.1.0
Report bugs to http://www.open-mpi.org/community/help/
(Eclipse's "about" window was attached as a screenshot, not reproduced here.)
I also installed OpenMPI in various ways: from the Manjaro repository using pacman, using pamac (a pacman/yaourt GUI), and the git version from the AUR using both yaourt and pamac.
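One detail worth checking (an observation from the transcripts above, not a confirmed diagnosis): the program is launched as plain test, and mpirun resolves program names through PATH, where test is also a standard shell utility that exits with status 1 when given no arguments. That would explain the runs that silently end with Exit code: 1. Launching with an explicit path removes the ambiguity:

[mgmx#ThinkPad Debug]$ mpirun -np 2 ./test

For -np 3 and -np 4, the "not enough slots" message is a separate issue on a two-core machine; the --oversubscribe flag from the first question above addresses it.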

stack smashing detected when using MPI_Reduce

I have learned to use some MPI functions. When I try to use MPI_Reduce, I get stack smashing detected when I run my code:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

void main(int argc, char **argv) {
    int i, rank, size;
    int sendBuf, recvBuf, count;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sendBuf = rank;
    count = size;
    MPI_Reduce(&sendBuf, &recvBuf, count, MPI_INT,
               MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Sum is %d\n", recvBuf);
    }
    MPI_Finalize();
}
It seems to me that the code is okay. It should print the sum of all ranks in recvBuf on process 0; in this case it should print Sum is 45 when run with 10 processes (mpirun -np 10 myexecutefile). But I don't know why my code produces this error:
Sum is 45
*** stack smashing detected ***: example6 terminated
[ubuntu:06538] *** Process received signal ***
[ubuntu:06538] Signal: Aborted (6)
[ubuntu:06538] Signal code: (-6)
[ubuntu:06538] *** Process received signal ***
[ubuntu:06538] Signal: Segmentation fault (11)
[ubuntu:06538] Signal code: (128)
[ubuntu:06538] Failing at address: (nil)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
What is the problem and how can I fix it?
In
MPI_Reduce(&sendBuf, &recvBuf, count, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
the argument count must be the number of elements in the send buffer. Since sendBuf is a single integer, use count = 1; instead of count = size;.
The reason why Sum is 45 was printed correctly is hard to explain. Accessing values out of bounds is undefined behavior: the problem could have gone unnoticed, or the segmentation fault could have been raised before Sum is 45 was printed. The magic of undefined behavior...
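Applying that fix, a corrected, self-contained version of the program (count hard-coded to 1, the unused variables dropped, and main given its standard int return type):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    int sendBuf, recvBuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sendBuf = rank;
    /* count = 1: sendBuf holds a single MPI_INT */
    MPI_Reduce(&sendBuf, &recvBuf, 1, MPI_INT,
               MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Sum is %d\n", recvBuf);  /* prints 45 with -np 10 */
    }
    MPI_Finalize();
    return 0;
}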
