I have a relatively simple program that runs a bunch of shell scripts, concatenates their output into one string and sets it as the statusbar for my display manager. For the most part everything is working
fine, but from time to time it crashes for no apparent reason. Having inspected the coredump, I found the following backtrace:
#0 0x00007fda71dc3d22 raise (libc.so.6 + 0x3cd22)
#1 0x00007fda71dad862 abort (libc.so.6 + 0x26862)
#2 0x00007fda71e05d28 __libc_message (libc.so.6 + 0x7ed28)
#3 0x00007fda71e0d92a malloc_printerr (libc.so.6 + 0x8692a)
#4 0x00007fda71e1109c _int_malloc (libc.so.6 + 0x8a09c)
#5 0x00007fda71e12397 malloc (libc.so.6 + 0x8b397)
#6 0x00007fda71dfb564 _IO_file_doallocate (libc.so.6 + 0x74564)
#7 0x00007fda71e09db0 _IO_doallocbuf (libc.so.6 + 0x82db0)
#8 0x00007fda71e08cbc _IO_file_underflow@@GLIBC_2.2.5 (libc.so.6 + 0x81cbc)
#9 0x00007fda71e09e66 _IO_default_uflow (libc.so.6 + 0x82e66)
#10 0x00007fda71dfcf2c _IO_getline_info (libc.so.6 + 0x75f2c)
#11 0x00007fda71dfbe8a _IO_fgets (libc.so.6 + 0x74e8a)
#12 0x0000564c2b290484 getcmd (dwmblocks + 0x1484)
#13 0x0000564c2b2906ab getsigcmds (dwmblocks + 0x16ab)
#14 0x0000564c2b290b6f sighandler (dwmblocks + 0x1b6f)
#15 0x00007fda71dc3da0 __restore_rt (libc.so.6 + 0x3cda0)
#16 0x00007fda71e112cc _int_malloc (libc.so.6 + 0x8a2cc)
#17 0x00007fda71e13175 __libc_calloc (libc.so.6 + 0x8c175)
#18 0x00007fda71f83d23 XOpenDisplay (libX11.so.6 + 0x30d23)
#19 0x0000564c2b290952 setroot (dwmblocks + 0x1952)
#20 0x0000564c2b290b1a statusloop (dwmblocks + 0x1b1a)
#21 0x0000564c2b290e28 main (dwmblocks + 0x1e28)
#22 0x00007fda71daeb25 __libc_start_main (libc.so.6 + 0x27b25)
#23 0x0000564c2b29020e _start (dwmblocks + 0x120e)
The last function of my program that ran before the crash looks something like this:
void getcmd(const Block *block, char *output)
{
    ...
    char *cmd = block->command;
    FILE *cmdf = popen(cmd, "r");
    if (!cmdf) {
        return;
    }
    char tmpstr[CMDLENGTH] = "";
    char *s;
    int e;
    do {
        errno = 0;
        /* Retry if fgets was interrupted by a signal before reading anything. */
        s = fgets(tmpstr, CMDLENGTH - (strlen(delim) + 1), cmdf);
        e = errno;
    } while (!s && e == EINTR);
    pclose(cmdf);
    ...
}
So it's just calling popen and trying to read the output with fgets.
From the backtrace it is apparent that SIGABRT is generated inside the fgets call. I have two questions:
How is that even possible? Isn't fgets supposed to return a string, or an error if anything went wrong, and let me deal with that error instead of bringing the whole program down?
What should I do to prevent that behavior?
UPDATE:
Inspecting strings from coredump I found out that error which malloc_printerr was trying to report was malloc(): mismatching next->prev_size (unsorted).
Don't know if it means anything...
UPDATE:
It appears the problem is that getcmd is called from a signal handler, but popen and fgets are not async-signal-safe.
UPDATE:
I've added setvbuf(cmdf, NULL, _IONBF, 0); after popen call to make stream unbuffered so fgets wouldn't try to allocate buffers with malloc and hopefully prevent that crash. Unfortunately, I can't reliably reproduce the crash, so I can't tell if this hack helps.
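For reference, the change described in this update looks roughly like this right after the popen call (a sketch, not the exact patch):

FILE *cmdf = popen(cmd, "r");
if (!cmdf)
    return;
/* Make the stream unbuffered so fgets reads straight into tmpstr instead
   of allocating an internal stdio buffer with malloc. */
setvbuf(cmdf, NULL, _IONBF, 0);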
From the stack trace, I can see two calls to malloc with a signal handler between them. This is going to fail because malloc is (generally) not reentrant, so trying to call it from a signal handler is never a good idea. In general, you should not call ANY POSIX async-unsafe function in a signal handler unless you can somehow guarantee that the signal will never be delivered while running any other async-unsafe function1.
So the real question here is why does your signal need to call popen or fgets (both async-unsafe) and what can you do about it? What is the signal being caught? Is it likely to be fatal anyways (SIGSEGV or SIGBUS), or is it an informational signal like SIGIO?
If it is a fatal signal, you should be looking into why it is occurring; the failure in the signal handler is secondary.
If it is a non-fatal signal, then you should move the async-unsafe code out of the signal handler and have the signal handler either set some global variable that the main program will check, or arrange for another thread to do whatever work is needed (a minimal sketch of the flag approach follows the footnote below).
1This is possible but quite hard -- generally requires wrapping sigblock calls around all calls to async unsafe things. However, if you only have a few of those in your main program, it may be practical.
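A minimal sketch of that flag approach, assuming a signal whose only job is to trigger a status refresh (any names beyond those visible in the backtrace are illustrative):

#include <signal.h>

static volatile sig_atomic_t refresh_pending = 0;

/* Async-signal-safe: the handler only sets a flag. */
static void sighandler(int signum)
{
    (void)signum;
    refresh_pending = 1;
}

/* Main loop, running in normal (non-signal) context:
 *
 *     while (running) {
 *         if (refresh_pending) {
 *             refresh_pending = 0;
 *             // run getsigcmds()/popen()/fgets() here, outside the handler
 *         }
 *         // ... sleep or poll ...
 *     }
 */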
Your code is calling popen() to run some arbitrary Linux command.
The "arbitrary command" is calling XOpenDisplay() to display an X Windows GUI to the user.
The crash is occurring in malloc(), deep inside XOpenDisplay. Many other C library functions also use malloc() - including popen().
THEORY: You've corrupted memory, hence the "malloc()" failure.
LIKELY CANDIDATE: fgets(tmpstr, CMDLENGTH-(strlen(delim)+1), cmdf);
<= You need to ensure that "n" (the second argument) is NEVER larger than sizeof(tmpstr)-1.
It certainly looks like you're trying to do that ("n" should always be less than CMDLENGTH)... but it's worth double-checking.
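A cheap way to rule that out is to clamp the length before the call, for example (a defensive sketch, reusing the variable names from the question):

/* Never pass fgets a length larger than the buffer, even if delim is
   unexpectedly long (which would make the subtraction wrap around). */
size_t limit = strlen(delim) + 1;
size_t n = (limit < CMDLENGTH) ? (CMDLENGTH - limit) : 1;
s = fgets(tmpstr, (int)n, cmdf);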
SUGGESTION: try Valgrind
I have a very simple MPI program:
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int my_rank;
    int my_new_rank;   /* unused in this excerpt */
    int size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (my_rank == 0 || my_rank == 18 || my_rank == 36) {
        char hostbuffer[256];
        gethostname(hostbuffer, sizeof(hostbuffer));
        printf("Hostname: %s\n", hostbuffer);
    }

    MPI_Finalize();
    return 0;
}
I am running it on a cluster with two nodes. I have a makefile, and with the mpicc command I generate the cannon.run executable. I run it with the following command:
time mpirun --mca btl ^openib -n 64 -hostfile ../second_machinefile ./cannon.run
In second_machinefile I have the names of these two nodes. The weird problem is that when I run this command from one node it executes normally, but when I run the command from the other node I get this error:
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x30
After trying to run it with GDB I got this backtrace:
#0 0x00007ffff646e936 in ?? ()
from /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so
#1 0x00007ffff6449733 in pmix_common_dstor_init ()
from /lib/x86_64-linux-gnu/libmca_common_dstore.so.1
#2 0x00007ffff646e5b4 in ?? ()
from /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so
#3 0x00007ffff659e46e in pmix_gds_base_select ()
from /lib/x86_64-linux-gnu/libpmix.so.2
#4 0x00007ffff655688d in pmix_rte_init ()
from /lib/x86_64-linux-gnu/libpmix.so.2
#5 0x00007ffff6512d7c in PMIx_Init () from /lib/x86_64-linux-gnu/libpmix.so.2
#6 0x00007ffff660afe4 in ext2x_client_init ()
from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so
#7 0x00007ffff72e1656 in ?? ()
from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so
#8 0x00007ffff7a9d11a in orte_init ()
from /lib/x86_64-linux-gnu/libopen-rte.so.40
#9 0x00007ffff7d6de62 in ompi_mpi_init ()
from /lib/x86_64-linux-gnu/libmpi.so.40
#10 0x00007ffff7d9c17e in PMPI_Init () from /lib/x86_64-linux-gnu/libmpi.so.40
#11 0x00005555555551d6 in main ()
which to be honest I don't fully understand.
My main confusion is that the program executes properly from machine_1: it connects to machine_2 without errors and processes are initialized on both machines. But when I try to execute the same command from machine_2, it is not able to connect to machine_1. The program also runs correctly when I run it only on machine_2, after decreasing the number of processes so that they fit on one machine.
Is there anything I am doing wrong? Or what could I try in order to better understand the cause of the problem?
This is indeed a bug in Open PMIx that is addressed at https://github.com/openpmix/openpmix/pull/1580
Meanwhile, a workaround is to blacklist the gds/ds21 component:
One option is to
export PMIX_MCA_gds=^ds21
before invoking mpirun
Another option is to add the following line
gds = ^ds21
to the PMIx config file located in <pmix_prefix>/etc/pmix-mca-params.conf
I'm running into a pretty annoying problem:
I have a program which creates one thread at the start; this thread launches other things during its execution (fork() immediately followed by execve()).
Here is the backtrace of both threads at the point where my program has (I think) reached the deadlock:
Thread 2 (LWP 8839):
#0 0x00007ffff6cdf736 in __libc_fork () at ../sysdeps/nptl/fork.c:125
#1 0x00007ffff6c8f8c0 in _IO_new_proc_open (fp=fp#entry=0x7ffff00031d0, command=command#entry=0x7ffff6c26e20 "ps -u brejon | grep \"cvc\"
#2 0x00007ffff6c8fbcc in _IO_new_popen (command=0x7ffff6c26e20 "ps -u user | grep \"cvc\" | wc -l", mode=0x42c7fd "r") at iopopen.c:296
#3-4 ...
#5 0x00007ffff74d9434 in start_thread (arg=0x7ffff6c27700) at pthread_create.c:333
#6 0x00007ffff6d0fcfd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 1 (LWP 8835):
#0 __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007ffff6ca0ad9 in malloc_atfork (sz=140737337120848, caller=) at arena.c:179
#2 0x00007ffff6c8d875 in __GI__IO_file_doallocate (fp=0x17a72230) at filedoalloc.c:127
#3 0x00007ffff6c9a964 in __GI__IO_doallocbuf (fp=fp#entry=0x17a72230) at genops.c:398
#4 0x00007ffff6c99de8 in _IO_new_file_overflow (f=0x17a72230, ch=-1) at fileops.c:820
#5 0x00007ffff6c98f8a in _IO_new_file_xsputn (f=0x17a72230, data=0x17a16420, n=682) at fileops.c:1331
#6 0x00007ffff6c6fcb2 in _IO_vfprintf_internal (s=0x17a72230, format=, ap=ap#entry=0x7fffffffcf18) at vfprintf.c:1632
#7 0x00007ffff6c76a97 in __fprintf (stream=, format=) at fprintf.c:32
#8-11 ...
#12 0x000000000042706e in main (argc=3, argv=0x7fffffffd698, envp=0x7fffffffd6b8) at mains/ignore/.c:146
Both stay stuck here forever, with both glibc-2.17 and glibc-2.23.
Any help is welcomed :'D
EDIT :
Here is a minimal example :
1 #include <stdlib.h>
2 #include <pthread.h>
3 #include <unistd.h>
4
5 void * thread_handler(void * args)
6 {
7 char * argv[] = { "/usr/bin/ls" };
8 char * newargv[] = { "/usr/bin/ls", NULL };
9 char * newenviron[] = { NULL };
10 while (1) {
11 if (vfork() == 0) {
12 execve(argv[0], newargv, newenviron);
13 }
14 }
15
16 return 0;
17 }
18
19 int main(void)
20 {
21 pthread_t thread;
22 pthread_create(&thread, NULL, thread_handler, NULL);
23
24 int * dummy_alloc;
25
26 while (1) {
27 dummy_alloc = malloc(sizeof(int));
28 free(dummy_alloc);
29 }
30
31 return 0;
32 }
Environment :
user:deadlock$ cat /etc/redhat-release
Scientific Linux release 7.3 (Nitrogen)
user:deadlock$ ldd --version
ldd (GNU libc) 2.17
EDIT 2 :
The rpm package version is : glibc-2.17-196.el7.x86_64
I could not get the line numbers using the rpm package at first. Here is the backtrace using the glibc shipped with the distribution (line numbers resolved after installing debuginfo):
(gdb) thread apply all bt
Thread 2 (Thread 0x7ffff77fb700 (LWP 59753)):
#0 vfork () at ../sysdeps/unix/sysv/linux/x86_64/vfork.S:44
#1 0x000000000040074e in thread_handler (args=0x0) at deadlock.c:11
#2 0x00007ffff7bc6e25 in start_thread (arg=0x7ffff77fb700) at pthread_create.c:308
#3 0x00007ffff78f434d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Thread 1 (Thread 0x7ffff7fba740 (LWP 59746)):
#0 0x00007ffff7878226 in _int_free (av=0x7ffff7bb8760 , p=0x602240, have_lock=0) at malloc.c:3927
#1 0x00000000004007aa in main () at deadlock.c:28
This is a custom-compiled glibc. It is possible that something went wrong with the installation. Note that Red Hat Enterprise Linux 7.4 backports a fix for a deadlock between malloc and fork, and you are missing that because you compiled your own glibc. The fix for the upstream bug went only into upstream version 2.24, so if you are basing your custom build on that, you may not have this fix. (Although the backtrace would look different for that one.)
I think we fixed at least another post-2.17 libio-related deadlock bug.
EDIT: I have been dealing with fork-related deadlocks for too long. There are multiple issues with the reproducer as posted:
There is no waitpid call for the PID. As a result, the process table will be quickly filled with zombies.
No error checking for execve. If the /usr/bin/ls pathname does not exist (for example, on a system which did not undergo UsrMove), execve will return, and the next iteration of the loop will launch another vfork call.
I fixed both issues (because debugging what is approaching a fork bomb is not fun at all), but I can't reproduce a hang with glibc-2.17-196.el7.x86_64.
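For reference, the reproducer's thread body with both fixes applied would look roughly like this (a sketch; error handling is still minimal):

#include <sys/wait.h>
#include <unistd.h>

void *thread_handler(void *args)
{
    (void)args;
    char *newargv[] = { "/usr/bin/ls", NULL };
    char *newenviron[] = { NULL };

    while (1) {
        pid_t pid = vfork();
        if (pid == 0) {
            execve(newargv[0], newargv, newenviron);
            _exit(127);            /* execve failed: never fall through after vfork */
        } else if (pid > 0) {
            waitpid(pid, NULL, 0); /* reap the child so zombies don't accumulate */
        }
    }
    return 0;
}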
We have created a multithreaded, single core application running on Ubuntu.
When we call getaddrinfo and gethostbyname from the main process, it does not crash.
However when we create a thread from the main process and the functions getaddrinfo and gethostbyname are called from the created thread, it always crashes.
Kindly help.
Please find the call stack below:
#0 0xf7e9f890 in ?? () from /lib/i386-linux-gnu/libc.so.6
#1 0xf7e9fa73 in __res_ninit () from /lib/i386-linux-gnu/libc.so.6
#2 0xf7ea0a68 in __res_maybe_init () from /lib/i386-linux-gnu/libc.so.6
#3 0xf7e663be in ?? () from /lib/i386-linux-gnu/libc.so.6
#4 0xf7e696bb in getaddrinfo () from /lib/i386-linux-gnu/libc.so.6
#5 0x080c4e35 in mn_task_entry (args=0xa6c4130 <ipc_os_input_params>) at /home/nextg/Alps_RT/mn/src/mn_main.c:699
#6 0xf7fa5d78 in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
#7 0xf7e9001e in clone () from /lib/i386-linux-gnu/libc.so.6
The reason getaddrinfo was crashing is that the child thread making the call did not have sufficient stack space.
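With plain pthreads, the equivalent fix is to request a larger stack before creating the thread, along these lines (a sketch; the helper name and the 8 MiB figure are illustrative):

#include <pthread.h>

/* Create a worker thread with an explicitly larger stack so the resolver
   code behind getaddrinfo/gethostbyname has enough room. */
int spawn_with_big_stack(pthread_t *tid, void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    int rc;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 8 * 1024 * 1024);  /* 8 MiB, illustrative */
    rc = pthread_create(tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return rc;
}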
Using the ACE C++ (version 6.5.1) library classes that call ACE_Thread::spawn_n with the default ACE_DEFAULT_THREAD_PRIORITY (1024*1024) will crash when calling gethostbyname/getaddrinfo inside the child thread, as reported by Syed Aslam. libxml2 schema parsing took forever, and a child thread segfaulted after calling xmlNanoHTTPConnectHost as it tried to resolve schemaLocation.
ACE_Task activate
const ACE_TCHAR *thr_name[1];
thr_name[0] = "Flerf";
// libxml2-2.9.7/nanohttp.c:1133
// gethostbyname will crash when child thread making the call
// has insufficient stack space.
size_t stack_sizes[1] = {
ACE_DEFAULT_THREAD_STACKSIZE * 100
};
const int ret = this->activate (
THR_NEW_LWP/*Light Weight Process*/ | THR_JOINABLE,
1,
0/*force_active*/,
ACE_DEFAULT_THREAD_PRIORITY,
-1/*grp_id*/,
NULL/*task*/,
NULL/*thread_handles[]*/,
NULL/*stack[]*/,
stack_sizes/*stack_size[]*/,
NULL/*thread_ids[]*/,
thr_name
);
Firstly, I use the pthread library to write multithreaded C programs. My threads sometimes hang, blocked on the mutexes they are waiting for. When I use the strace utility and find a thread in the FUTEX_WAIT state, I want to know which thread holds that mutex at that time, but I don't know how I could do it. Are there any utilities that could do that?
Someone told me the Java virtual machine supports this, so I want to know whether Linux supports this feature.
You can use knowledge of the mutex internals to do this. Ordinarily this wouldn't be a very good idea, but it's fine for debugging.
Under Linux with the NPTL implementation of pthreads (which is any modern glibc), you can examine the __data.__owner member of the pthread_mutex_t structure to find out the thread that currently has it locked. This is how to do it after attaching to the process with gdb:
(gdb) thread 2
[Switching to thread 2 (Thread 0xb6d94b90 (LWP 22026))]#0 0xb771f424 in __kernel_vsyscall ()
(gdb) bt
#0 0xb771f424 in __kernel_vsyscall ()
#1 0xb76fec99 in __lll_lock_wait () from /lib/i686/cmov/libpthread.so.0
#2 0xb76fa0c4 in _L_lock_89 () from /lib/i686/cmov/libpthread.so.0
#3 0xb76f99f2 in pthread_mutex_lock () from /lib/i686/cmov/libpthread.so.0
#4 0x080484a6 in thread (x=0x0) at mutex_owner.c:8
#5 0xb76f84c0 in start_thread () from /lib/i686/cmov/libpthread.so.0
#6 0xb767784e in clone () from /lib/i686/cmov/libc.so.6
(gdb) up 4
#4 0x080484a6 in thread (x=0x0) at mutex_owner.c:8
8 pthread_mutex_lock(&mutex);
(gdb) print mutex.__data.__owner
$1 = 22025
(gdb)
(I switch to the hung thread; do a backtrace to find the pthread_mutex_lock() it's stuck on; change stack frames to find out the name of the mutex that it's trying to lock; then print the owner of that mutex). This tells me that the thread with LWP ID 22025 is the culprit.
You can then use thread find 22025 to find out the gdb thread number for that thread and switch to it.
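The same internals can also be read from inside the program while debugging, though it relies on glibc/NPTL's private layout and is not portable (a debug-only sketch):

#include <pthread.h>
#include <stdio.h>

/* Peek at the NPTL-internal owner field of a mutex. __data.__owner is a
   glibc implementation detail: the LWP id of the holder, or 0 if unlocked. */
static void print_mutex_owner(pthread_mutex_t *m, const char *name)
{
    fprintf(stderr, "%s owner LWP: %d\n", name, m->__data.__owner);
}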
I don't know of any such facility so I don't think you will get off that easily - and it probably wouldn't be as informative as you think in helping to debug your program. As low tech as it might seem, logging is your friend in debugging these things. Start collecting your own little logging functions. They don't have to be fancy, they just have to get the job done while debugging.
Sorry for the C++ but something like:
void logit(const bool acquired, const char* lockname, const int linenum)
{
    pthread_mutex_lock(&log_mutex);
    if (!acquired)
        logfile << pthread_self() << " tries lock " << lockname << " at " << linenum << endl;
    else
        logfile << pthread_self() << " has lock " << lockname << " at " << linenum << endl;
    pthread_mutex_unlock(&log_mutex);
}
void someTask()
{
    logit(false, "some_mutex", __LINE__);
    pthread_mutex_lock(&some_mutex);
    logit(true, "some_mutex", __LINE__);
    // do stuff ...
    pthread_mutex_unlock(&some_mutex);
}
Logging isn't a perfect solution but nothing is. It usually gets you what you need to know.
Normally libc/platform calls are abstracted by an OS abstraction layer. Mutex deadlocks can be tracked using an owner variable and pthread_mutex_timedlock. Whenever a thread takes the lock, it should update the variable with its own tid (from gettid(); you can also keep another variable for the pthread id). When another thread blocks and times out in pthread_mutex_timedlock, it can print the value of the owner tid and pthread id, and this way you can easily find the owner thread. Please find the code snippet below; note that not all error conditions are handled.
pid_t ownerTid;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

class TimedMutex {
public:
    TimedMutex()
    {
        struct timespec abs_time;
        while (1)
        {
            /* pthread_mutex_timedlock expects an absolute CLOCK_REALTIME timestamp. */
            clock_gettime(CLOCK_REALTIME, &abs_time);
            abs_time.tv_sec += 10;
            if (pthread_mutex_timedlock(&mutex, &abs_time) == ETIMEDOUT)
            {
                log("Lock held by thread=%d for more than 10 secs", ownerTid);
                continue;
            }
            /* Lock acquired: record our tid so a blocked thread can report it. */
            ownerTid = gettid();
            break;
        }
    }

    ~TimedMutex()
    {
        pthread_mutex_unlock(&mutex);
    }
};
There are other ways to find deadlocks; this link might help: http://yusufonlinux.blogspot.in/2010/11/debugging-core-using-gdb.html.
Please read the link below. It has a generic solution for finding the lock owner; it works even if the lock is inside a library and you don't have the source code.
https://en.wikibooks.org/wiki/Linux_Applications_Debugging_Techniques/Deadlocks
I'd like to know if my program is accessing NULL pointers or stale memory.
The backtrace looks like this:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2b0fa4c8 (LWP 1333)]
0x299a6ad4 in pthread_mutex_lock () from /lib/libpthread.so.0
(gdb) bt
#0 0x299a6ad4 in pthread_mutex_lock () from /lib/libpthread.so.0
#1 0x0058e900 in ?? ()
With GDB 7 and higher, you can examine the $_siginfo structure that is filled out when the signal occurs, and determine the faulting address:
(gdb) p $_siginfo._sifields._sigfault.si_addr
If it shows (void *) 0x0 (or a small number) then you have a NULL pointer dereference.
Run your program under GDB. When the segfault occurs, GDB will stop at the faulting line and statement, so you can see which variables and addresses were involved.
You can use the "print" (p) command in GDB to inspect variables. If the crash occurred in a library call, you can use the "frame" series of commands to move to the stack frame in question.