gdb script to isolate a thread based on variable value - c

I have a program having over 300 threads to which I have attached gdb. I need to identify one particular thread whose call stack has a frame containing a variable whose value I want to use for matching. Can I script this in gdb?
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f16c1eeb700 (LWP 18833))]
#4 0x00007f17f3a3bdd5 in start_thread () from /lib64/libpthread.so.0
(gdb) backtrace
#0 0x00007f17f3a3fd12 in pthread_cond_timedwait##GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f17e72838be in __afr_shd_healer_wait (healer=healer#entry=0x7f17e05203d0) at afr-self-heald.c:101
#2 0x00007f17e728392d in afr_shd_healer_wait (healer=healer#entry=0x7f17e05203d0) at afr-self-heald.c:125
#3 0x00007f17e72848e8 in afr_shd_index_healer (data=0x7f17e05203d0) at afr-self-heald.c:572
#4 0x00007f17f3a3bdd5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f17f3302ead in clone () from /lib64/libc.so.6
(gdb) frame 3
#3 0x00007f17e72848e8 in afr_shd_index_healer (data=0x7f17e05203d0) at afr-self-heald.c:572
572 afr_shd_healer_wait (healer);
(gdb) p this->name
$6 = 0x7f17e031b910 "testvol-replicate-0"
For example, can I run a macro to loop over each thread, go to frame 3 in each of it, inspect the variable this->name and print the thead number only if the value matches testvol-replicate-0 ?

It's possible to integrate Python into GDB. Then, with the Python GDB API, you could loop over threads and search for a match. Below two examples of debugging threads with GDB and Python.
https://www.linuxjournal.com/article/11027
https://fy.blackhats.net.au/blog/html/2017/08/04/so_you_want_to_script_gdb_with_python.html

Related

How can I find which of my code cause the SIGABRT in pthread_crate.c file, but without any my code information in the coredump

There is a coredump caused by my program, which shows below:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
Core was generated by `./remote_speaker plug:SLAVE='dmix:tlv320aic3106au' default rtmp://pili-publish.'.
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0xffff802791d0 (LWP 1511))]
(gdb) bt
#0 __GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x0000ffff812b9f54 in __GI_abort () at abort.c:79
#2 0x0000ffff81304d3c in __libc_message (action=action#entry=do_abort, fmt=fmt#entry=0xffff813bf638 "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x0000ffff8130c32c in malloc_printerr (str=str#entry=0xffff813bafb0 "free(): invalid pointer") at malloc.c:5332
#4 0x0000ffff8130db04 in _int_free (av=0xffff813fb9f8 <main_arena>, p=0xffff7d783ff0, have_lock=<optimized out>) at malloc.c:4173
#5 0x0000ffff81310b50 in tcache_thread_shutdown () at malloc.c:2964
#6 __malloc_arena_thread_freeres () at arena.c:949
#7 0x0000ffff81313e8c in __libc_thread_freeres () at thread-freeres.c:38
#8 0x0000ffff81614844 in start_thread (arg=0xffffc2f50ff6) at pthread_create.c:493
#9 0x0000ffff81365a7c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb)
The total threads info shows below:
(gdb) info threads
Id Target Id Frame
* 1 Thread 0xffff802791d0 (LWP 1511) __GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
2 Thread 0xffff80a7a1d0 (LWP 1510) 0x0000ffff8135d084 in __GI___poll (fds=0xffff80a79740, nfds=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
3 Thread 0xffff8127c010 (LWP 1492) 0x0000ffff81339d88 in __GI___nanosleep (requested_time=requested_time#entry=0xffffc2f510a0, remaining=remaining#entry=0x0)
at ../sysdeps/unix/sysv/linux/nanosleep.c:28
4 Thread 0xffff7e2751d0 (LWP 1515) 0x0000ffff81365bcc in __GI_epoll_pwait (epfd=6, events=0x594160 <self+128>, maxevents=32, timeout=100, set=0x0)
at ../sysdeps/unix/sysv/linux/epoll_pwait.c:42
5 Thread 0xffff8127b1d0 (LWP 1509) futex_abstimed_wait_cancelable (private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x596538 <output_notice>)
at ../sysdeps/unix/sysv/linux/futex-internal.h:208
6 Thread 0xffff7fa781d0 (LWP 1512) 0x0000ffff81339d84 in __GI___nanosleep (requested_time=requested_time#entry=0xffff7fa77870, remaining=remaining#entry=0xffff7fa77870)
at ../sysdeps/unix/sysv/linux/nanosleep.c:28
thread 1-6
(gdb) thread 1
[Switching to thread 1 (Thread 0xffff802791d0 (LWP 1511))]
#0 __GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 in ../sysdeps/unix/sysv/linux/raise.c
(gdb) bt
#0 __GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x0000ffff812b9f54 in __GI_abort () at abort.c:79
#2 0x0000ffff81304d3c in __libc_message (action=action#entry=do_abort, fmt=fmt#entry=0xffff813bf638 "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x0000ffff8130c32c in malloc_printerr (str=str#entry=0xffff813bafb0 "free(): invalid pointer") at malloc.c:5332
#4 0x0000ffff8130db04 in _int_free (av=0xffff813fb9f8 <main_arena>, p=0xffff7d783ff0, have_lock=<optimized out>) at malloc.c:4173
#5 0x0000ffff81310b50 in tcache_thread_shutdown () at malloc.c:2964
#6 __malloc_arena_thread_freeres () at arena.c:949
#7 0x0000ffff81313e8c in __libc_thread_freeres () at thread-freeres.c:38
#8 0x0000ffff81614844 in start_thread (arg=0xffffc2f50ff6) at pthread_create.c:493
#9 0x0000ffff81365a7c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) thread 2
[Switching to thread 2 (Thread 0xffff80a7a1d0 (LWP 1510))]
#0 0x0000ffff8135d084 in __GI___poll (fds=0xffff80a79740, nfds=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
41 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) bt
#0 0x0000ffff8135d084 in __GI___poll (fds=0xffff80a79740, nfds=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
#1 0x00000000004e8714 in nn_efd_wait ()
#2 0x00000000004e46a0 in nn_sock_recv ()
#3 0x00000000004e24b0 in nn_recvmsg ()
#4 0x00000000004e1ef4 in nn_recv ()
#5 0x0000000000445c2c in nanomsg_recv ()
#6 0x0000ffff816148f8 in start_thread (arg=0xffffc2f50ff6) at pthread_create.c:479
#7 0x0000ffff81365a7c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) thread 3
[Switching to thread 3 (Thread 0xffff8127c010 (LWP 1492))]
#0 0x0000ffff81339d88 in __GI___nanosleep (requested_time=requested_time#entry=0xffffc2f510a0, remaining=remaining#entry=0x0) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
28 ../sysdeps/unix/sysv/linux/nanosleep.c: No such file or directory.
(gdb) bt
#0 0x0000ffff81339d88 in __GI___nanosleep (requested_time=requested_time#entry=0xffffc2f510a0, remaining=remaining#entry=0x0) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
#1 0x0000ffff8135fb54 in usleep (useconds=<optimized out>) at ../sysdeps/posix/usleep.c:32
#2 0x0000000000434278 in main ()
(gdb) thread 4
[Switching to thread 4 (Thread 0xffff7e2751d0 (LWP 1515))]
#0 0x0000ffff81365bcc in __GI_epoll_pwait (epfd=6, events=0x594160 <self+128>, maxevents=32, timeout=100, set=0x0) at ../sysdeps/unix/sysv/linux/epoll_pwait.c:42
42 ../sysdeps/unix/sysv/linux/epoll_pwait.c: No such file or directory.
(gdb) bt
#0 0x0000ffff81365bcc in __GI_epoll_pwait (epfd=6, events=0x594160 <self+128>, maxevents=32, timeout=100, set=0x0) at ../sysdeps/unix/sysv/linux/epoll_pwait.c:42
#1 0x00000000004f2a14 in nn_poller_wait ()
#2 0x00000000004e712c in nn_worker_routine ()
#3 0x00000000004e9eb8 in nn_thread_main_routine ()
#4 0x0000ffff816148f8 in start_thread (arg=0xffff80a796c6) at pthread_create.c:479
#5 0x0000ffff81365a7c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) thread 5
[Switching to thread 5 (Thread 0xffff8127b1d0 (LWP 1509))]
#0 futex_abstimed_wait_cancelable (private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x596538 <output_notice>) at ../sysdeps/unix/sysv/linux/futex-internal.h:208
208 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0 futex_abstimed_wait_cancelable (private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x596538 <output_notice>) at ../sysdeps/unix/sysv/linux/futex-internal.h:208
#1 do_futex_wait (sem=sem#entry=0x596538 <output_notice>, abstime=0x0, clockid=0) at sem_waitcommon.c:112
#2 0x0000ffff8161dd10 in __new_sem_wait_slow (sem=0x596538 <output_notice>, abstime=0x0, clockid=0) at sem_waitcommon.c:184
#3 0x0000000000529aa8 in async_output ()
#4 0x0000ffff816148f8 in start_thread (arg=0xffffc2f50fa6) at pthread_create.c:479
#5 0x0000ffff81365a7c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) thread 6
[Switching to thread 6 (Thread 0xffff7fa781d0 (LWP 1512))]
#0 0x0000ffff81339d84 in __GI___nanosleep (requested_time=requested_time#entry=0xffff7fa77870, remaining=remaining#entry=0xffff7fa77870) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
28 ../sysdeps/unix/sysv/linux/nanosleep.c: No such file or directory.
(gdb) bt
#0 0x0000ffff81339d84 in __GI___nanosleep (requested_time=requested_time#entry=0xffff7fa77870, remaining=remaining#entry=0xffff7fa77870) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
#1 0x0000ffff81339c14 in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2 0x0000000000434ee0 in play_sound ()
#3 0x0000ffff816148f8 in start_thread (arg=0xffffc2f50ff6) at pthread_create.c:479
#4 0x0000ffff81365a7c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
I am sorry that I can not post the source code, because the code is a little much important for us.
The question is that I can not find which of my code caused this dump. There is no any my code in thread 1 after the pthread_create function. For a normal way, there must be some of my code in the threads, Is that right?
Your help would be greatly appreciated.
The question is that I can not find which of my code caused this dump
Any crash inside malloc or free is a 99.9% sign of heap corruption (freeing something twice, freeing unallocated memory, writing past the end of allocated buffer, etc.).
Here you have free() telling you that you are freeing something that was not allocated. The address being freed: 0xffff7d783ff0 looks like a stack address. It is probable that you have freed some stack address earlier.
Unfortunately, it is nearly impossible to debug heap corruption via post-mortem debugging, because the root cause of corruption may have happened 1000s of instructions earlier, possibly in completely unrelated code.
The good news: instrumenting your program with address sanitizer (gcc -fsanitize=address ...) and running such program through your tests (you do have tests, right?) usually leads you straight to the problem.

How can I find out the source of this glibc backtrace originating with clone()?

This backtrace comes from a deadlock situation in a multi-threaded application.
The other deadlocked threads are locking inside the call to malloc(), and appear
to be waiting on this thread.
I don't understand what creates this thread, since it deadlocks before calling
any functions in my application:
Thread 6 (Thread 0x7ff69d43a700 (LWP 14191)):
#0 0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00007ff6a299460d in _L_lock_27 () from /usr/lib64/libc.so.6
#2 0x00007ff6a29945bd in arena_thread_freeres () from /usr/lib64/libc.so.6
#3 0x00007ff6a2994662 in __libc_thread_freeres () from /usr/lib64/libc.so.6
#4 0x00007ff6a3875e38 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6
clone() is used to implement fork(), pthread_create(), and perhaps other functions. See here and here.
How can I find out if this trace comes from a fork(), pthread_create(), a signal handler, or something else? Do I just need to dig through the glibc code, or can I use gdb or some other tool? Why does this thread need the internal glibc lock? This would be useful in determining the cause of the deadlock.
Additional information and research:
malloc() is thread-safe, but not reentrant (recursive-safe) (see this and this, so malloc() is also not async-signal-safe. We don't define signal handlers for this process, so I know that we don't call malloc() from signal handlers. The deadlocked threads don't ever call recursive functions, and callbacks are handled in a new thread, so I don't think we should need to worry about reentrancy here. (Maybe I'm wrong?)
This deadlock happens when many callbacks are being spawned to signal (ultimately kill) different processes. The callbacks are spawned in their own threads.
Are we possibly using malloc in an unsafe way?
Possibly related:
glibc malloc internals
Malloc inside of signal handler causes deadlock.
How are signal handlers delivered in a multi-threaded application?
glibc fork/malloc deadlock bug that was fixed in glibc-2.17-162.el7. This looks similar, but is NOT my bug - I'm on a fixed version of glibc.
(I've been unsuccessful in creating a minimal, complete, verifiable example. Unfortunately the only way to reproduce is with the application (Slurm), and it's quite difficult to reproduce.)
EDIT:
Here's the backtrace from all the threads. Thread 6 is the trace I originally posted. Thread 1 is just waiting on a pthread_join(). Threads 2-5 are locked after a call to malloc(). Thread 7 is listening for messages and spawning callbacks in new threads (threads 2-5). Those would be callbacks that would eventually signal other processes.
Thread 7 (Thread 0x7ff69e672700 (LWP 12650)):
#0 0x00007ff6a291aa3d in poll () from /usr/lib64/libc.so.6
#1 0x00007ff6a3c09064 in _poll_internal (shutdown_time=<optimized out>, nfds=2,
pfds=0x7ff6980009f0) at ../../../../slurm/src/common/eio.c:364
#2 eio_handle_mainloop (eio=0xf1a970) at ../../../../slurm/src/common/eio.c:328
#3 0x000000000041ce78 in _msg_thr_internal (job_arg=0xf07760)
at ../../../../../slurm/src/slurmd/slurmstepd/req.c:245
#4 0x00007ff6a3875e25 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6
Thread 6 (Thread 0x7ff69d43a700 (LWP 14191)):
#0 0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00007ff6a299460d in _L_lock_27 () from /usr/lib64/libc.so.6
#2 0x00007ff6a29945bd in arena_thread_freeres () from /usr/lib64/libc.so.6
#3 0x00007ff6a2994662 in __libc_thread_freeres () from /usr/lib64/libc.so.6
#4 0x00007ff6a3875e38 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6
Thread 5 (Thread 0x7ff69e773700 (LWP 22471)):
#0 0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00007ff6a28af7d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2 0x00007ff6a28a7ca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3 0x00007ff6a28ad0fe in malloc () from /usr/lib64/libc.so.6
#4 0x00007ff6a3c02e60 in slurm_xmalloc (size=size#entry=24, clear=clear#entry=false,
file=file#entry=0x7ff6a3c1f1f0 "../../../../slurm/src/common/pack.c",
line=line#entry=152, func=func#entry=0x7ff6a3c1f4a6 <__func__.7843> "init_buf")
at ../../../../slurm/src/common/xmalloc.c:86
#5 0x00007ff6a3b2e5b7 in init_buf (size=16384)
at ../../../../slurm/src/common/pack.c:152
#6 0x000000000041caab in _handle_accept (arg=0x0)
at ../../../../../slurm/src/slurmd/slurmstepd/req.c:384
#7 0x00007ff6a3875e25 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x7ff6a4086700 (LWP 5633)):
#0 0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00007ff6a28af7d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2 0x00007ff6a28a7ca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3 0x00007ff6a28ad0fe in malloc () from /usr/lib64/libc.so.6
#4 0x00007ff6a3c02e60 in slurm_xmalloc (size=size#entry=24, clear=clear#entry=false,
file=file#entry=0x7ff6a3c1f1f0 "../../../../slurm/src/common/pack.c",
line=line#entry=152, func=func#entry=0x7ff6a3c1f4a6 <__func__.7843> "init_buf")
at ../../../../slurm/src/common/xmalloc.c:86
#5 0x00007ff6a3b2e5b7 in init_buf (size=16384)
at ../../../../slurm/src/common/pack.c:152
#6 0x000000000041caab in _handle_accept (arg=0x0)
at ../../../../../slurm/src/slurmd/slurmstepd/req.c:384
#7 0x00007ff6a3875e25 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7ff69d53b700 (LWP 12963)):
#0 0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00007ff6a28af7d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2 0x00007ff6a28a7ca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3 0x00007ff6a28ad0fe in malloc () from /usr/lib64/libc.so.6
#4 0x00007ff6a3c02e60 in slurm_xmalloc (size=size#entry=24, clear=clear#entry=false,
file=file#entry=0x7ff6a3c1f1f0 "../../../../slurm/src/common/pack.c",
line=line#entry=152, func=func#entry=0x7ff6a3c1f4a6 <__func__.7843> "init_buf")
at ../../../../slurm/src/common/xmalloc.c:86
#5 0x00007ff6a3b2e5b7 in init_buf (size=16384)
at ../../../../slurm/src/common/pack.c:152
#6 0x000000000041caab in _handle_accept (arg=0x0)
at ../../../../../slurm/src/slurmd/slurmstepd/req.c:384
#7 0x00007ff6a3875e25 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7ff69f182700 (LWP 19734)):
#0 0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00007ff6a28af7d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2 0x00007ff6a28a7ca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3 0x00007ff6a28ad0fe in malloc () from /usr/lib64/libc.so.6
#4 0x00007ff6a3c02e60 in slurm_xmalloc (size=size#entry=24, clear=clear#entry=false,
file=file#entry=0x7ff6a3c1f1f0 "../../../../slurm/src/common/pack.c",
line=line#entry=152, func=func#entry=0x7ff6a3c1f4a6 <__func__.7843> "init_buf")
at ../../../../slurm/src/common/xmalloc.c:86
#5 0x00007ff6a3b2e5b7 in init_buf (size=16384)
at ../../../../slurm/src/common/pack.c:152
#6 0x000000000041caab in _handle_accept (arg=0x0)
at ../../../../../slurm/src/slurmd/slurmstepd/req.c:384
#7 0x00007ff6a3875e25 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ff6a4088880 (LWP 12616)):
#0 0x00007ff6a3876f57 in pthread_join () from /usr/lib64/libpthread.so.0
#1 0x000000000041084a in _wait_for_io (job=0xf07760)
at ../../../../../slurm/src/slurmd/slurmstepd/mgr.c:2219
#2 job_manager (job=job#entry=0xf07760)
at ../../../../../slurm/src/slurmd/slurmstepd/mgr.c:1397
#3 0x000000000040ca07 in main (argc=1, argv=0x7fffacab93d8)
at ../../../../../slurm/src/slurmd/slurmstepd/slurmstepd.c:172
The presence of start_thread() in the backtrace indicates that this is a pthread_create() thread.
__libc_thread_freeres() is a function that glibc calls at thread exit, which invokes a set of callbacks to free internal per-thread state. This indicates the thread you have highlighted is in the process of exiting.
arena_thread_freeres() is one of those callbacks. It is for the malloc arena allocator, and it moves the free list from the exiting thread's private arena to the global free list. To do this, it must take a lock that protects the global free list (this is the list_lock in arena.c).
It appears to be this lock that the highlighted thread (Thread 6) is blocked on.
The arena allocator installs pthread_atfork() handlers which lock the list lock at the start of fork() processing, and unlock it at the end. This means that while other pthread_atfork() handlers are running, all other threads will block on this lock.
Are you installing your own pthread_atfork() handlers? It seems likely that one of these may be causing your deadlock.

Process threads stuck with __pthread_enable_asynccancel ()

I am debugging an issue with a multi-threaded TCP server application on CentOS platform. The application suddenly stopped processing connections. Even the event logging with syslog was not seen in the log files. It was as if the application become a black hole.
I killed the process with signal 11 to get the core dump. In the core dump I observed that all threads are stuck with similar patterns.
Ex:
Thread 19 (Thread 0xb2de8b70 (LWP 3722)):
#0 0x00c9e424 in __kernel_vsyscall ()
#1 0x00c17189 in __pthread_enable_asynccancel () from /lib/libpthread.so.0
#2 0x080b367d in server_debug_stats_timer_handler (tmp=0x0) at server_debug_utils.c:75 ==> this line as a event print with syslog(...)
All most all threads are attempting a syslog(...) print but gets stuck with __pthread_enable_asynccancel ().
What is being done with __pthread_enable_asynccancel () and why it isn't returning?
Here is info reg from the thread mentioned:
(gdb) info reg
eax 0xfffffe00 -512
ecx 0x80 128
edx 0x2 2
ebx 0x154ea3c4 357475268
esp 0xb2de8174 0xb2de8174
ebp 0xb2de81a8 0xb2de81a8
esi 0x0 0
edi 0x0 0
eip 0xc9e424 0xc9e424 <__kernel_vsyscall+16>
eflags 0x200246 [ PF ZF IF ID ]
cs 0x73 115
ss 0x7b 123
ds 0x7b 123
es 0x7b 123
fs 0x0 0
gs 0x33 51
(gdb)
(gdb) print $orig_eax
$1 = 240
($orig_eax = 240 is SYS_futex)
One of the thread state is as shown below:
Thread 27 (Thread 0xa97d9b70 (LWP 3737)):
#0 0x00c9e424 in __kernel_vsyscall ()
#1 0x00faabb3 in inet_ntop () from /lib/libc.so.6
#2 0x00f20e76 in freopen64 () from /lib/libc.so.6
#3 0x00f96a55 in fcvt_r () from /lib/libc.so.6
#4 0x00f96fd7 in qfcvt_r () from /lib/libc.so.6
#5 0x00a19932 in app_signal_handler (signum=11) at appBaseClass.cpp:920
#6 <signal handler called>
#7 0x00f2aec5 in sYSMALLOc () from /lib/libc.so.6
#8 0x0043431a in CRYPTO_free () from /usr/local/lr/packages/stg_app/5.3.8/lib/ssl/libcrypto.so.10
#9 0x00000000 in ?? ()
(gdb) print $orig_eax $5 = 240`
Your stuck threads are in the futex() syscall, probably on an internal glibc lock taken within malloc() (the syslog() call allocates memory internally).
The reason for this deadlock isn't apparent, but I have two suggestions:
If you call syslog() (or any other non-AS-safe function, like printf()) from a signal handler anywhere in your program, that could cause a deadlock like this;
It's possible you're being bitten by a bug in futex() that was introduced in Linux 3.14 and fixed in 3.18. The bug was also backported RHEL 6.6 so would have been present for a while in CentOS too. The effect of this bug was to cause processes to fail to wake up from a FUTEX_WAIT when they should have.

getaddrinfo and gethostbyname crashing when called from child thread?

We have created a multithreaded, single core application running on Ubuntu.
When we call getaddrinfo and gethostbyname from the main process, it does not crash.
However when we create a thread from the main process and the functions getaddrinfo and gethostbyname are called from the created thread, it always crashes.
Kindly help.
Please find the call stack below:
#0 0xf7e9f890 in ?? () from /lib/i386-linux-gnu/libc.so.6
#1 0xf7e9fa73 in __res_ninit () from /lib/i386-linux-gnu/libc.so.6
#2 0xf7ea0a68 in __res_maybe_init () from /lib/i386-linux-gnu/libc.so.6
#3 0xf7e663be in ?? () from /lib/i386-linux-gnu/libc.so.6
#4 0xf7e696bb in getaddrinfo () from /lib/i386-linux-gnu/libc.so.6
#5 0x080c4e35 in mn_task_entry (args=0xa6c4130 <ipc_os_input_params>) at /home/nextg/Alps_RT/mn/src/mn_main.c:699
#6 0xf7fa5d78 in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
#7 0xf7e9001e in clone () from /lib/i386-linux-gnu/libc.so.6
The reason the getaddrinfo was crashing because, the child thread making the call did not have sufficient stack space.
Using ACE C++ version 6.5.1 libraries classes which use ACE_Thread::spawn_n with default ACE_DEFAULT_THREAD_PRIORITY (1024*1024) will crash when calling gethostbyname/getaddrinfo inside child as reported by Syed Aslam. libxml2 schema parsing takes forever, using a child thread Segment Faulted after calling xmlNanoHTTPConnectHost as it tries to resolve schemaLocation.
ACE_Task activate
const ACE_TCHAR *thr_name[1];
thr_name[0] = "Flerf";
// libxml2-2.9.7/nanohttp.c:1133
// gethostbyname will crash when child thread making the call
// has insufficient stack space.
size_t stack_sizes[1] = {
ACE_DEFAULT_THREAD_STACKSIZE * 100
};
const INT ret = this->activate (
THR_NEW_LWP/*Light Weight Process*/ | THR_JOINABLE,
1,
0/*force_active*/,
ACE_DEFAULT_THREAD_PRIORITY,
-1/*grp_id*/,
NULL/*task*/,
NULL/*thread_handles[]*/,
NULL/*stack[]*/,
stack_sizes/*stack_size[]*/,
NULL/*thread_ids[]*/,
thr_name
);

How can I get GDB to tell me what address caused a segfault?

I'd like to know if my program is accessing NULL pointers or stale memory.
The backtrace looks like this:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2b0fa4c8 (LWP 1333)]
0x299a6ad4 in pthread_mutex_lock () from /lib/libpthread.so.0
(gdb) bt
#0 0x299a6ad4 in pthread_mutex_lock () from /lib/libpthread.so.0
#1 0x0058e900 in ?? ()
With GDB 7 and higher, you can examine the $_siginfo structure that is filled out when the signal occurs, and determine the faulting address:
(gdb) p $_siginfo._sifields._sigfault.si_addr
If it shows (void *) 0x0 (or a small number) then you have a NULL pointer dereference.
Run your program under GDB. When the segfault occurs, GDB will inform you of the line and statement of your program, along with the variable and its associated address.
You can use the "print" (p) command in GDB to inspect variables. If the crash occurred in a library call, you can use the "frame" series of commands to see the stack frame in question.

Resources