Issues when running an MPI program on two cluster nodes (C)

I have a very simple MPI program:
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int my_rank;
    int size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (my_rank == 0 || my_rank == 18 || my_rank == 36) {
        char hostbuffer[256];
        gethostname(hostbuffer, sizeof(hostbuffer));
        printf("Hostname: %s\n", hostbuffer);
    }
    MPI_Finalize();
    return 0;
}
I am running it on a cluster with two nodes. I have a makefile, and with the mpicc command I generate the cannon.run executable. I run it with the following command:
time mpirun --mca btl ^openib -n 64 -hostfile ../second_machinefile ./cannon.run
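For reference, an Open MPI hostfile for two nodes typically looks like this (the hostnames and slot counts here are hypothetical, chosen so that 64 ranks fit across two nodes):
machine_1 slots=32
machine_2 slots=32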
second_machinefile contains the names of these two nodes. The weird problem is that when I run this command from one node, it executes normally; however, when I run the command from the other node, I get this error:
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x30
After running it under GDB I got this backtrace:
#0 0x00007ffff646e936 in ?? ()
from /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so
#1 0x00007ffff6449733 in pmix_common_dstor_init ()
from /lib/x86_64-linux-gnu/libmca_common_dstore.so.1
#2 0x00007ffff646e5b4 in ?? ()
from /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so
#3 0x00007ffff659e46e in pmix_gds_base_select ()
from /lib/x86_64-linux-gnu/libpmix.so.2
#4 0x00007ffff655688d in pmix_rte_init ()
from /lib/x86_64-linux-gnu/libpmix.so.2
#5 0x00007ffff6512d7c in PMIx_Init () from /lib/x86_64-linux-gnu/libpmix.so.2
#6 0x00007ffff660afe4 in ext2x_client_init ()
from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so
#7 0x00007ffff72e1656 in ?? ()
from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so
#8 0x00007ffff7a9d11a in orte_init ()
from /lib/x86_64-linux-gnu/libopen-rte.so.40
#9 0x00007ffff7d6de62 in ompi_mpi_init ()
from /lib/x86_64-linux-gnu/libmpi.so.40
#10 0x00007ffff7d9c17e in PMPI_Init () from /lib/x86_64-linux-gnu/libmpi.so.40
#11 0x00005555555551d6 in main ()
which, to be honest, I don't fully understand.
My main confusion is that the program executes properly from machine_1: it connects to machine_2 without errors and processes are initialized on both machines. But when I try to execute the same command from machine_2, it is not able to connect to machine_1. The program also runs correctly on machine_2 alone when I decrease the number of processes so that they fit on one machine.
Is there anything I am doing wrong? Or what could I try in order to better understand the cause of the problem?

This is indeed a bug in Open PMIx that is addressed at https://github.com/openpmix/openpmix/pull/1580
Meanwhile, a workaround is to blacklist the gds/ds21 component:
One option is to
export PMIX_MCA_gds=^ds21
before invoking mpirun.
Another option is to add the following line
gds = ^ds21
to the PMIx config file located at <pmix_prefix>/etc/pmix-mca-params.conf
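For example, assuming a Bourne-style shell, the variable can also be set inline for a single run:
PMIX_MCA_gds=^ds21 mpirun --mca btl ^openib -n 64 -hostfile ../second_machinefile ./cannon.run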

Related

Reset after hard fault

I'm trying to debug a hard fault in a C++ firmware project for the microbit v1.5.
The issue at hand is that after a hard fault I would like to reset the microcontroller and start anew, but issuing the dreaded monitor reset halt does not work and execution never restarts properly after a hard fault.
I'm using pyocd (v0.33.1) as my GDB debug server and a custom-built gdb (v8.2.1) with proper support for the nrf51 series.
This is an example interaction with gdb. I set a breakpoint on HardFault_Handler and start execution. The firmware correctly spawns tasks but eventually one of the tasks faults and the HardFault handler gets called. After this I would like to reset the microcontroller and start anew.
I expect the microcontroller to spawn the same set of tasks, but this never happens, and it also never goes back to main, so I'm thinking there must be a specific way to reset it correctly.
What command should I issue to reset the flow of execution to start with main or one of the routines from gcc_startup?
(gdb) info breakpoints
Num Type Disp Enb Address What
1 breakpoint keep y 0x000290e2 ../support/libs/nrfx/mdk/gcc_startup_nrf51.S:234
(gdb) c
Continuing.
[New Thread 2]
[New Thread 536884080]
[New Thread 536880760]
[New Thread 536884152]
Thread 2 "Handler mode" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 2]
0x000006b0 in ?? ()
(gdb) info threads
Id Target Id Frame
* 2 Thread 2 "Handler mode" (HardFault) 0x000006b0 in ?? ()
3 Thread 536884080 "IDL" (Ready; Priority 0) prvIdleTask (pvParameters=0x0)
at ../support/freertos/tasks.c:3225
4 Thread 536880760 "KNL" (Ready; Priority 1) starlight::sys::Task::<lambda(void*)>::_FUN(void *)
() at ../include/starlight/sys/task.hpp:154
5 Thread 536884152 "Tmr" (Running; Priority 2) __DSB ()
at ../support/libs/CMSIS-Core/Include/cmsis_gcc.h:946
(gdb) monitor reset halt
Resetting target with halt
Successfully halted device on reset
(gdb) c
Continuing.
[New Thread 1]
Thread 6 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1]
0x000006b0 in ?? ()
(gdb) info threads
Id Target Id Frame
* 6 Thread 1 (HardFault) 0x000006b0 in ?? ()
(gdb) monitor reset halt
Resetting target with halt
Successfully halted device on reset
(gdb) c
Continuing.
Thread 6 received signal SIGSEGV, Segmentation fault.
0x000006b0 in ?? ()
(gdb) backtrace
#0 0x000006b0 in ?? ()
#1 <signal handler called>
Backtrace stopped: Cannot access memory at address 0x4b0547f8

SIGABRT inside fgets

I have a relatively simple program that runs a bunch of shell scripts, concatenates their output into one string, and sets it as the status bar for my display manager. For the most part everything works fine, but from time to time it crashes for no apparent reason. Having inspected the coredump, I found the following backtrace:
#0 0x00007fda71dc3d22 raise (libc.so.6 + 0x3cd22)
#1 0x00007fda71dad862 abort (libc.so.6 + 0x26862)
#2 0x00007fda71e05d28 __libc_message (libc.so.6 + 0x7ed28)
#3 0x00007fda71e0d92a malloc_printerr (libc.so.6 + 0x8692a)
#4 0x00007fda71e1109c _int_malloc (libc.so.6 + 0x8a09c)
#5 0x00007fda71e12397 malloc (libc.so.6 + 0x8b397)
#6 0x00007fda71dfb564 _IO_file_doallocate (libc.so.6 + 0x74564)
#7 0x00007fda71e09db0 _IO_doallocbuf (libc.so.6 + 0x82db0)
#8 0x00007fda71e08cbc _IO_file_underflow@@GLIBC_2.2.5 (libc.so.6 + 0x81cbc)
#9 0x00007fda71e09e66 _IO_default_uflow (libc.so.6 + 0x82e66)
#10 0x00007fda71dfcf2c _IO_getline_info (libc.so.6 + 0x75f2c)
#11 0x00007fda71dfbe8a _IO_fgets (libc.so.6 + 0x74e8a)
#12 0x0000564c2b290484 getcmd (dwmblocks + 0x1484)
#13 0x0000564c2b2906ab getsigcmds (dwmblocks + 0x16ab)
#14 0x0000564c2b290b6f sighandler (dwmblocks + 0x1b6f)
#15 0x00007fda71dc3da0 __restore_rt (libc.so.6 + 0x3cda0)
#16 0x00007fda71e112cc _int_malloc (libc.so.6 + 0x8a2cc)
#17 0x00007fda71e13175 __libc_calloc (libc.so.6 + 0x8c175)
#18 0x00007fda71f83d23 XOpenDisplay (libX11.so.6 + 0x30d23)
#19 0x0000564c2b290952 setroot (dwmblocks + 0x1952)
#20 0x0000564c2b290b1a statusloop (dwmblocks + 0x1b1a)
#21 0x0000564c2b290e28 main (dwmblocks + 0x1e28)
#22 0x00007fda71daeb25 __libc_start_main (libc.so.6 + 0x27b25)
#23 0x0000564c2b29020e _start (dwmblocks + 0x120e)
The last function of the program that ran before the crash looks something like this:
void getcmd(const Block *block, char *output)
{
    ...
    char *cmd = block->command;
    FILE *cmdf = popen(cmd, "r");
    if (!cmdf) {
        return;
    }
    char tmpstr[CMDLENGTH] = "";
    char *s;
    int e;
    do {
        errno = 0;
        s = fgets(tmpstr, CMDLENGTH - (strlen(delim) + 1), cmdf);
        e = errno;
    } while (!s && e == EINTR);
    pclose(cmdf);
    ...
}
So it's just calling popen and trying to read the output with fgets.
From the backtrace it is apparent that SIGABRT is generated inside the fgets call. I have two questions:
How is that even possible? Isn't fgets supposed to return a string, or an error if anything went wrong, and let me deal with that error instead of bringing the whole program down?
What should I do to prevent this behavior?
UPDATE:
Inspecting strings from the coredump, I found out that the error malloc_printerr was trying to report was malloc(): mismatching next->prev_size (unsorted).
Don't know if it means anything...
UPDATE:
It appears the problem is that getcmd is called from a signal handler, but popen and fgets are not async-signal-safe.
UPDATE:
I've added setvbuf(cmdf, NULL, _IONBF, 0); after the popen call to make the stream unbuffered, so that fgets wouldn't try to allocate buffers with malloc, hopefully preventing the crash. Unfortunately, I can't reliably reproduce the crash, so I can't tell whether this hack helps.
From the stack trace, I can see two calls to malloc with a signal handler between them. This is going to fail because malloc is (generally) not reentrant, so trying to call it from a signal handler is never a good idea. In general, you should not call ANY POSIX async-unsafe function in a signal handler unless you can somehow guarantee that the signal will never be delivered while any other async-unsafe function is running [1].
So the real question here is: why does your signal handler need to call popen or fgets (both async-unsafe), and what can you do about it? What is the signal being caught? Is it likely to be fatal anyway (SIGSEGV or SIGBUS), or is it an informational signal like SIGIO?
If it is a fatal signal, you should be looking into why it is occurring; the failure in the signal handler is secondary.
If it is a non-fatal signal, then you should move the async-unsafe code out of the signal handler and have the signal handler either set some global variable that the main program will check, or arrange for another thread to do whatever work is needed (see the sketch after the footnote below).
[1] This is possible but quite hard -- it generally requires wrapping sigblock calls around all calls to async-unsafe things. However, if you only have a few of those in your main program, it may be practical.
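A minimal sketch of the flag-based pattern, assuming a hypothetical main loop (the real dwmblocks code differs): the handler only records that a signal arrived, and all popen/fgets work happens in normal context.
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* sig_atomic_t is the only integer type guaranteed safe to write
 * from a handler and read from the main loop. */
static volatile sig_atomic_t pending = 0;

static void sighandler(int signum)
{
    pending = signum;  /* async-signal-safe: a single store, nothing else */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sighandler;
    sigaction(SIGUSR1, &sa, NULL);

    for (;;) {
        if (pending) {
            int sig = pending;
            pending = 0;
            /* Normal context: popen()/fgets()/malloc() are safe here. */
            printf("handling signal %d outside the handler\n", sig);
        }
        sleep(1);  /* simple polling; real code might use sigsuspend or
                      the self-pipe trick instead to avoid missed wakeups */
    }
}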
Your code is calling popen() to run some arbitrary Linux command.
The "arbitrary command" is calling XOpenDisplay() to display an X Windows GUI to the user.
The crash is occurring in malloc(), deep inside XOpenDisplay. Many other C library functions also use malloc() - including popen().
THEORY: You've corrupted memory, hence the "malloc()" failure.
LIKELY CANDIDATE: fgets(tmpstr, CMDLENGTH-(strlen(delim)+1), cmdf);
You need to ensure that n (the second argument) is NEVER larger than sizeof(tmpstr).
It certainly looks like you're trying to do that (n should always be less than CMDLENGTH)... but it's worth double-checking.
SUGGESTION: try Valgrind
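A typical invocation (the binary name is taken from the backtrace above):
valgrind ./dwmblocks
Memcheck, Valgrind's default tool, reports heap corruption such as out-of-bounds writes at the point they happen, rather than later when malloc() aborts.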

Deadlock (fork + malloc) libc (glibc-2.17, glibc-2.23)

I'm running into a pretty annoying problem:
I have a program which creates one thread at the start; this thread launches other programs during its execution (fork() immediately followed by execve()).
Here is the backtrace of both threads at the point where my program reaches (I think) the deadlock:
Thread 2 (LWP 8839):
#0 0x00007ffff6cdf736 in __libc_fork () at ../sysdeps/nptl/fork.c:125
#1 0x00007ffff6c8f8c0 in _IO_new_proc_open (fp=fp#entry=0x7ffff00031d0, command=command#entry=0x7ffff6c26e20 "ps -u brejon | grep \"cvc\"
#2 0x00007ffff6c8fbcc in _IO_new_popen (command=0x7ffff6c26e20 "ps -u user | grep \"cvc\" | wc -l", mode=0x42c7fd "r") at iopopen.c:296
#3-4 ...
#5 0x00007ffff74d9434 in start_thread (arg=0x7ffff6c27700) at pthread_create.c:333
#6 0x00007ffff6d0fcfd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 1 (LWP 8835):
#0 __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007ffff6ca0ad9 in malloc_atfork (sz=140737337120848, caller=) at arena.c:179
#2 0x00007ffff6c8d875 in __GI__IO_file_doallocate (fp=0x17a72230) at filedoalloc.c:127
#3 0x00007ffff6c9a964 in __GI__IO_doallocbuf (fp=fp#entry=0x17a72230) at genops.c:398
#4 0x00007ffff6c99de8 in _IO_new_file_overflow (f=0x17a72230, ch=-1) at fileops.c:820
#5 0x00007ffff6c98f8a in _IO_new_file_xsputn (f=0x17a72230, data=0x17a16420, n=682) at fileops.c:1331
#6 0x00007ffff6c6fcb2 in _IO_vfprintf_internal (s=0x17a72230, format=, ap=ap#entry=0x7fffffffcf18) at vfprintf.c:1632
#7 0x00007ffff6c76a97 in __fprintf (stream=, format=) at fprintf.c:32
#8-11 ...
#12 0x000000000042706e in main (argc=3, argv=0x7fffffffd698, envp=0x7fffffffd6b8) at mains/ignore/.c:146
Both stay stuck here forever, with both glibc-2.17 and glibc-2.23.
Any help is welcome :'D
EDIT:
Here is a minimal example:
1 #include <stdlib.h>
2 #include <pthread.h>
3 #include <unistd.h>
4
5 void * thread_handler(void * args)
6 {
7     char * argv[] = { "/usr/bin/ls" };
8     char * newargv[] = { "/usr/bin/ls", NULL };
9     char * newenviron[] = { NULL };
10     while (1) {
11         if (vfork() == 0) {
12             execve(argv[0], newargv, newenviron);
13         }
14     }
15
16     return 0;
17 }
18
19 int main(void)
20 {
21     pthread_t thread;
22     pthread_create(&thread, NULL, thread_handler, NULL);
23
24     int * dummy_alloc;
25
26     while (1) {
27         dummy_alloc = malloc(sizeof(int));
28         free(dummy_alloc);
29     }
30
31     return 0;
32 }
Environment :
user:deadlock$ cat /etc/redhat-release
Scientific Linux release 7.3 (Nitrogen)
user:deadlock$ ldd --version
ldd (GNU libc) 2.17
EDIT 2:
The rpm package version is glibc-2.17-196.el7.x86_64.
I cannot get the line numbers using the rpm package. Here is the backtrace using the glibc shipped with the distribution; the line numbers were resolved with debuginfo:
(gdb) thread apply all bt
Thread 2 (Thread 0x7ffff77fb700 (LWP 59753)):
#0 vfork () at ../sysdeps/unix/sysv/linux/x86_64/vfork.S:44
#1 0x000000000040074e in thread_handler (args=0x0) at deadlock.c:11
#2 0x00007ffff7bc6e25 in start_thread (arg=0x7ffff77fb700) at pthread_create.c:308
#3 0x00007ffff78f434d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Thread 1 (Thread 0x7ffff7fba740 (LWP 59746)):
#0 0x00007ffff7878226 in _int_free (av=0x7ffff7bb8760 , p=0x602240, have_lock=0) at malloc.c:3927
#1 0x00000000004007aa in main () at deadlock.c:28
This is a custom-compiled glibc. It is possible that something went wrong with the installation. Note that Red Hat Enterprise Linux 7.4 backports a fix for a deadlock between malloc and fork, and you are missing it because you compiled your own glibc. The fix for the upstream bug went only into upstream version 2.24, so if you are basing your custom build on an earlier version, you may not have this fix. (Although the backtrace would look different for that one.)
I think we fixed at least one other post-2.17 libio-related deadlock bug.
EDIT: I have been dealing with fork-related deadlocks for too long. There are multiple issues with the reproducer as posted:
There is no waitpid call for the PID. As a result, the process table quickly fills with zombies.
No error checking for execve. If the /usr/bin/ls pathname does not exist (for example, on a system which did not undergo UsrMove), execve will return, and the next iteration of the loop will launch another vfork call.
I fixed both issues (because debugging what is approaching a fork bomb is not fun at all), but I can't reproduce a hang with glibc-2.17-196.el7.x86_64.
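For reference, a sketch of the repaired thread handler with both fixes applied; this is an assumption of what the corrected reproducer looks like, not the original poster's exact code:
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/wait.h>

void * thread_handler(void * args)
{
    char * newargv[] = { "/usr/bin/ls", NULL };
    char * newenviron[] = { NULL };
    while (1) {
        pid_t pid = vfork();
        if (pid == 0) {
            execve(newargv[0], newargv, newenviron);
            _exit(127);             /* reached only if execve fails */
        } else if (pid > 0) {
            waitpid(pid, NULL, 0);  /* reap the child to avoid zombies */
        }
    }
    return 0;
}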

getaddrinfo and gethostbyname crashing when called from child thread?

We have created a multithreaded, single-core application running on Ubuntu.
When we call getaddrinfo and gethostbyname from the main process, it does not crash.
However, when we create a thread from the main process and call getaddrinfo and gethostbyname from that created thread, it always crashes.
Kindly help.
Please find the call stack below:
#0 0xf7e9f890 in ?? () from /lib/i386-linux-gnu/libc.so.6
#1 0xf7e9fa73 in __res_ninit () from /lib/i386-linux-gnu/libc.so.6
#2 0xf7ea0a68 in __res_maybe_init () from /lib/i386-linux-gnu/libc.so.6
#3 0xf7e663be in ?? () from /lib/i386-linux-gnu/libc.so.6
#4 0xf7e696bb in getaddrinfo () from /lib/i386-linux-gnu/libc.so.6
#5 0x080c4e35 in mn_task_entry (args=0xa6c4130 <ipc_os_input_params>) at /home/nextg/Alps_RT/mn/src/mn_main.c:699
#6 0xf7fa5d78 in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
#7 0xf7e9001e in clone () from /lib/i386-linux-gnu/libc.so.6
The reason getaddrinfo was crashing is that the child thread making the call did not have sufficient stack space.
Using ACE C++ version 6.5.1 library classes which use ACE_Thread::spawn_n with the default ACE_DEFAULT_THREAD_PRIORITY (1024*1024) will crash when calling gethostbyname/getaddrinfo inside the child thread, as reported by Syed Aslam. libxml2 schema parsing takes forever; using a child thread, it segfaulted after calling xmlNanoHTTPConnectHost while trying to resolve schemaLocation.
ACE_Task activate:
const ACE_TCHAR *thr_name[1];
thr_name[0] = "Flerf";
// libxml2-2.9.7/nanohttp.c:1133
// gethostbyname will crash when the child thread making the call
// has insufficient stack space.
size_t stack_sizes[1] = {
    ACE_DEFAULT_THREAD_STACKSIZE * 100
};
const INT ret = this->activate (
    THR_NEW_LWP /*Light Weight Process*/ | THR_JOINABLE,
    1,
    0 /*force_active*/,
    ACE_DEFAULT_THREAD_PRIORITY,
    -1 /*grp_id*/,
    NULL /*task*/,
    NULL /*thread_handles[]*/,
    NULL /*stack[]*/,
    stack_sizes /*stack_size[]*/,
    NULL /*thread_ids[]*/,
    thr_name
);
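The same idea can be expressed in plain pthreads for readers not using ACE; a minimal sketch (the hostname, service, and the 8 MiB stack size are arbitrary choices for illustration):
#include <pthread.h>
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>

static void * resolver(void * arg)
{
    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    /* Needs enough stack: the resolver code allocates sizeable buffers. */
    if (getaddrinfo("example.com", "80", &hints, &res) == 0)
        freeaddrinfo(res);
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;
    pthread_attr_init(&attr);
    /* Request a larger stack than the default that proved too small. */
    pthread_attr_setstacksize(&attr, 8 * 1024 * 1024);
    pthread_create(&tid, &attr, resolver, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}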

How can I get GDB to tell me what address caused a segfault?

I'd like to know if my program is accessing NULL pointers or stale memory.
The backtrace looks like this:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2b0fa4c8 (LWP 1333)]
0x299a6ad4 in pthread_mutex_lock () from /lib/libpthread.so.0
(gdb) bt
#0 0x299a6ad4 in pthread_mutex_lock () from /lib/libpthread.so.0
#1 0x0058e900 in ?? ()
With GDB 7 and higher, you can examine the $_siginfo structure that is filled out when the signal occurs, and determine the faulting address:
(gdb) p $_siginfo._sifields._sigfault.si_addr
If it shows (void *) 0x0 (or a small number) then you have a NULL pointer dereference.
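For instance, in the NULL-dereference case the session would look like this (the printed value is illustrative):
(gdb) p $_siginfo._sifields._sigfault.si_addr
$1 = (void *) 0x0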
Run your program under GDB. When the segfault occurs, GDB will inform you of the line and statement of your program, along with the variable and its associated address.
You can use the "print" (p) command in GDB to inspect variables. If the crash occurred in a library call, you can use the "frame" series of commands to see the stack frame in question.
