I have a user-level thread library, and I changed a benchmark program to use my threads instead of pthreads, but it always gets stuck somewhere in the code where there is a malloc or free call.
This is the output of gdb:
^C
Program received signal SIGINT, Interrupt.
__lll_lock_wait_private ()
at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
95 ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) where
#0 __lll_lock_wait_private ()
at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007ffff7569bb3 in _int_free (av=0x7ffff78adc00 <main_arena>,
p=0x6f1f40, have_lock=0) at malloc.c:3929
#2 0x00007ffff756d89c in __GI___libc_free (mem=<optimized out>)
at malloc.c:2950
#3 0x000000000040812d in mbuffer_free (m=m@entry=0x6a7660) at mbuffer.c:209
#4 0x00000000004038a8 in write_chunk_to_file (chunk=0x6a7610,
fd=<optimized out>) at encoder.c:279
#5 Reorder (targs=0x7fffffffab60,
targs@entry=<error reading variable: value has been optimized out>)
at encoder.c:1292
#6 0x000000000040b069 in wrapper_function (func=<optimized out>,
arg=<optimized out>) at gtthread.c:75
#7 0x00007ffff7532620 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#8 0x0000000000000000 in ?? ()
(gdb)
And here is some code from mbuffer.c:
208 if(ref==0) {
209 pausee();
210 free(m->mcb->ptr);
211 resume();
212 m->mcb->ptr=NULL;
213 free(m->mcb);
214 m->mcb=NULL;
215 }
The pausee and resume functions:
void pausee(){
    //printf("pauseeing\n");
    sigemptyset(&mask);
    sigaddset(&mask, SIGPROF); // block SIGPROF...
    if (sigprocmask(SIG_BLOCK, &mask, &orig_mask) < 0) {
        perror("sigprocmask");
        exit(1);
    }
}

void resume(){
    //printf("restarting\n");
    sigemptyset(&mask);
    sigaddset(&mask, SIGPROF); // unblock SIGPROF...
    if (sigprocmask(SIG_SETMASK, &orig_mask, NULL) < 0) {
        perror("sigprocmask");
        exit(1);
    }
}
I'm not sure whether this is related to my problem, but for scheduling the threads I use the SIGPROF signal and a handler function. I tried blocking SIGPROF before every malloc/free call, but it had no effect.
I don't have any concurrent threads; only one thread runs at a time.
Any help or idea would be very much appreciated.
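For reference, blocking the scheduling signal across the entire allocator call, as described above, would look something like this (a minimal sketch; the wrapper names are hypothetical, and a local sigset_t replaces the shared globals so a saved mask can't be clobbered):

#include <signal.h>
#include <stdlib.h>

/* Hypothetical wrappers: keep SIGPROF blocked for the whole allocator
 * call, so the signal-driven scheduler can never switch threads while
 * glibc's internal malloc lock is held. */
void *sched_safe_malloc(size_t n)
{
    sigset_t set, old;
    void *p;

    sigemptyset(&set);
    sigaddset(&set, SIGPROF);
    sigprocmask(SIG_BLOCK, &set, &old);   /* enter critical section */
    p = malloc(n);
    sigprocmask(SIG_SETMASK, &old, NULL); /* restore previous mask */
    return p;
}

void sched_safe_free(void *p)
{
    sigset_t set, old;

    sigemptyset(&set);
    sigaddset(&set, SIGPROF);
    sigprocmask(SIG_BLOCK, &set, &old);
    free(p);
    sigprocmask(SIG_SETMASK, &old, NULL);
}

Every allocation in the program has to go through such a wrapper for this to help; note that in the mbuffer.c excerpt above, the second free() (line 213, free(m->mcb)) runs after resume(), outside the protected region.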
As for your mbuffer.c code, I see no mistake in it. I suggest you write a test for mbuffer.c to rule it out, instead of just guessing.
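A minimal sketch of such a test, assuming a PARSEC-style mbuffer API in which mbuffer_create(mbuffer_t *m, size_t size) returns 0 on success and mbuffer_free(mbuffer_t *m) releases the buffer (adjust to the actual mbuffer.h):

#include <assert.h>
#include "mbuffer.h"

/* Exercise create/free in a tight loop, single-threaded and without
 * the signal-driven scheduler, to rule out mbuffer.c itself. */
int main(void)
{
    int i;
    for (i = 0; i < 100000; i++) {
        mbuffer_t m;
        assert(mbuffer_create(&m, 4096) == 0);
        mbuffer_free(&m);
    }
    return 0;
}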
Related
My program produced a coredump, shown below:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
Core was generated by `./remote_speaker plug:SLAVE='dmix:tlv320aic3106au' default rtmp://pili-publish.'.
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0xffff802791d0 (LWP 1511))]
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x0000ffff812b9f54 in __GI_abort () at abort.c:79
#2 0x0000ffff81304d3c in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0xffff813bf638 "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x0000ffff8130c32c in malloc_printerr (str=str@entry=0xffff813bafb0 "free(): invalid pointer") at malloc.c:5332
#4 0x0000ffff8130db04 in _int_free (av=0xffff813fb9f8 <main_arena>, p=0xffff7d783ff0, have_lock=<optimized out>) at malloc.c:4173
#5 0x0000ffff81310b50 in tcache_thread_shutdown () at malloc.c:2964
#6 __malloc_arena_thread_freeres () at arena.c:949
#7 0x0000ffff81313e8c in __libc_thread_freeres () at thread-freeres.c:38
#8 0x0000ffff81614844 in start_thread (arg=0xffffc2f50ff6) at pthread_create.c:493
#9 0x0000ffff81365a7c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb)
The info for all threads is shown below:
(gdb) info threads
Id Target Id Frame
* 1 Thread 0xffff802791d0 (LWP 1511) __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
2 Thread 0xffff80a7a1d0 (LWP 1510) 0x0000ffff8135d084 in __GI___poll (fds=0xffff80a79740, nfds=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
3 Thread 0xffff8127c010 (LWP 1492) 0x0000ffff81339d88 in __GI___nanosleep (requested_time=requested_time@entry=0xffffc2f510a0, remaining=remaining@entry=0x0)
at ../sysdeps/unix/sysv/linux/nanosleep.c:28
4 Thread 0xffff7e2751d0 (LWP 1515) 0x0000ffff81365bcc in __GI_epoll_pwait (epfd=6, events=0x594160 <self+128>, maxevents=32, timeout=100, set=0x0)
at ../sysdeps/unix/sysv/linux/epoll_pwait.c:42
5 Thread 0xffff8127b1d0 (LWP 1509) futex_abstimed_wait_cancelable (private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x596538 <output_notice>)
at ../sysdeps/unix/sysv/linux/futex-internal.h:208
6 Thread 0xffff7fa781d0 (LWP 1512) 0x0000ffff81339d84 in __GI___nanosleep (requested_time=requested_time@entry=0xffff7fa77870, remaining=remaining@entry=0xffff7fa77870)
at ../sysdeps/unix/sysv/linux/nanosleep.c:28
Backtraces for threads 1-6:
(gdb) thread 1
[Switching to thread 1 (Thread 0xffff802791d0 (LWP 1511))]
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 in ../sysdeps/unix/sysv/linux/raise.c
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x0000ffff812b9f54 in __GI_abort () at abort.c:79
#2 0x0000ffff81304d3c in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0xffff813bf638 "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x0000ffff8130c32c in malloc_printerr (str=str@entry=0xffff813bafb0 "free(): invalid pointer") at malloc.c:5332
#4 0x0000ffff8130db04 in _int_free (av=0xffff813fb9f8 <main_arena>, p=0xffff7d783ff0, have_lock=<optimized out>) at malloc.c:4173
#5 0x0000ffff81310b50 in tcache_thread_shutdown () at malloc.c:2964
#6 __malloc_arena_thread_freeres () at arena.c:949
#7 0x0000ffff81313e8c in __libc_thread_freeres () at thread-freeres.c:38
#8 0x0000ffff81614844 in start_thread (arg=0xffffc2f50ff6) at pthread_create.c:493
#9 0x0000ffff81365a7c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) thread 2
[Switching to thread 2 (Thread 0xffff80a7a1d0 (LWP 1510))]
#0 0x0000ffff8135d084 in __GI___poll (fds=0xffff80a79740, nfds=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
41 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) bt
#0 0x0000ffff8135d084 in __GI___poll (fds=0xffff80a79740, nfds=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
#1 0x00000000004e8714 in nn_efd_wait ()
#2 0x00000000004e46a0 in nn_sock_recv ()
#3 0x00000000004e24b0 in nn_recvmsg ()
#4 0x00000000004e1ef4 in nn_recv ()
#5 0x0000000000445c2c in nanomsg_recv ()
#6 0x0000ffff816148f8 in start_thread (arg=0xffffc2f50ff6) at pthread_create.c:479
#7 0x0000ffff81365a7c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) thread 3
[Switching to thread 3 (Thread 0xffff8127c010 (LWP 1492))]
#0 0x0000ffff81339d88 in __GI___nanosleep (requested_time=requested_time@entry=0xffffc2f510a0, remaining=remaining@entry=0x0) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
28 ../sysdeps/unix/sysv/linux/nanosleep.c: No such file or directory.
(gdb) bt
#0 0x0000ffff81339d88 in __GI___nanosleep (requested_time=requested_time@entry=0xffffc2f510a0, remaining=remaining@entry=0x0) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
#1 0x0000ffff8135fb54 in usleep (useconds=<optimized out>) at ../sysdeps/posix/usleep.c:32
#2 0x0000000000434278 in main ()
(gdb) thread 4
[Switching to thread 4 (Thread 0xffff7e2751d0 (LWP 1515))]
#0 0x0000ffff81365bcc in __GI_epoll_pwait (epfd=6, events=0x594160 <self+128>, maxevents=32, timeout=100, set=0x0) at ../sysdeps/unix/sysv/linux/epoll_pwait.c:42
42 ../sysdeps/unix/sysv/linux/epoll_pwait.c: No such file or directory.
(gdb) bt
#0 0x0000ffff81365bcc in __GI_epoll_pwait (epfd=6, events=0x594160 <self+128>, maxevents=32, timeout=100, set=0x0) at ../sysdeps/unix/sysv/linux/epoll_pwait.c:42
#1 0x00000000004f2a14 in nn_poller_wait ()
#2 0x00000000004e712c in nn_worker_routine ()
#3 0x00000000004e9eb8 in nn_thread_main_routine ()
#4 0x0000ffff816148f8 in start_thread (arg=0xffff80a796c6) at pthread_create.c:479
#5 0x0000ffff81365a7c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) thread 5
[Switching to thread 5 (Thread 0xffff8127b1d0 (LWP 1509))]
#0 futex_abstimed_wait_cancelable (private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x596538 <output_notice>) at ../sysdeps/unix/sysv/linux/futex-internal.h:208
208 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0 futex_abstimed_wait_cancelable (private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x596538 <output_notice>) at ../sysdeps/unix/sysv/linux/futex-internal.h:208
#1 do_futex_wait (sem=sem@entry=0x596538 <output_notice>, abstime=0x0, clockid=0) at sem_waitcommon.c:112
#2 0x0000ffff8161dd10 in __new_sem_wait_slow (sem=0x596538 <output_notice>, abstime=0x0, clockid=0) at sem_waitcommon.c:184
#3 0x0000000000529aa8 in async_output ()
#4 0x0000ffff816148f8 in start_thread (arg=0xffffc2f50fa6) at pthread_create.c:479
#5 0x0000ffff81365a7c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) thread 6
[Switching to thread 6 (Thread 0xffff7fa781d0 (LWP 1512))]
#0 0x0000ffff81339d84 in __GI___nanosleep (requested_time=requested_time@entry=0xffff7fa77870, remaining=remaining@entry=0xffff7fa77870) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
28 ../sysdeps/unix/sysv/linux/nanosleep.c: No such file or directory.
(gdb) bt
#0 0x0000ffff81339d84 in __GI___nanosleep (requested_time=requested_time@entry=0xffff7fa77870, remaining=remaining@entry=0xffff7fa77870) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
#1 0x0000ffff81339c14 in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2 0x0000000000434ee0 in play_sound ()
#3 0x0000ffff816148f8 in start_thread (arg=0xffffc2f50ff6) at pthread_create.c:479
#4 0x0000ffff81365a7c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
I am sorry that I cannot post the source code; it is too important to us.
The problem is that I cannot find which part of my code caused this dump. There is none of my code in thread 1 after the pthread_create function. Normally there should be some of my code in the thread's backtrace, is that right?
Your help would be greatly appreciated.
The problem is that I cannot find which part of my code caused this dump
Any crash inside malloc or free is a 99.9% sign of heap corruption (freeing something twice, freeing unallocated memory, writing past the end of an allocated buffer, etc.).
Here free() is telling you that you are freeing something that was not allocated. The address being freed, 0xffff7d783ff0, looks like a stack address. It is probable that you freed some stack address earlier.
Unfortunately, it is nearly impossible to debug heap corruption via post-mortem debugging, because the root cause of corruption may have happened 1000s of instructions earlier, possibly in completely unrelated code.
The good news: instrumenting your program with address sanitizer (gcc -fsanitize=address ...) and running such program through your tests (you do have tests, right?) usually leads you straight to the problem.
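For illustration, freeing a stack address (the hypothesis above) is easy to reproduce and yields the very same glibc message; an ASan build points at the faulty free directly. A hypothetical example, not the poster's code:

#include <stdlib.h>

int main(void)
{
    int x = 42;
    int *p = &x;
    free(p); /* invalid free: p points into the stack, not the heap.
              * glibc aborts with "free(): invalid pointer";
              * a -fsanitize=address build reports this line directly. */
    return 0;
}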
What will happen if we sleep in an interrupt handler on an SMP machine?
I wrote a sample keyboard driver and added a sleep in the interrupt handler:
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/interrupt.h>
#include <linux/delay.h>
#include <linux/sched/signal.h>
MODULE_LICENSE("GPL");
static int irq = 1, dev = 0xaa, counter = 0;
static irqreturn_t keyboard_handler(int irq, void *dev)
{
    pr_info("Keyboard Counter:%d\n", counter++);
    msleep(1000); /* sleeping in hard-IRQ (atomic) context */
    return IRQ_NONE;
}

/* registering irq */
static int test_interrupt_init(void)
{
    pr_info("%s: In init\n", __func__);
    return request_irq(irq, keyboard_handler, IRQF_SHARED,
                       "my_keyboard_handler", &dev);
}

static void test_interrupt_exit(void)
{
    pr_info("%s: In exit\n", __func__);
    synchronize_irq(irq); /* synchronize interrupt */
    free_irq(irq, &dev);
}
module_init(test_interrupt_init);
module_exit(test_interrupt_exit);
The system ran for a few minutes and then panicked. Why can't the system keep working with one processor stalled, given that the timer interrupt will still fire on the other CPU and can schedule processes?
Backtrace captured using a kgdb setup:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1254]
0x0000000000000000 in irq_stack_union ()
(gdb) bt
#0 0x0000000000000000 in irq_stack_union ()
#1 0xffffffff810ad8e4 in ttwu_activate (en_flags=<optimized out>, p=<optimized out>, rq=<optimized out>)
at kernel/sched/core.c:1638
#2 ttwu_do_activate (rq=0xffff888237422c40, p=0xffffffff82213780 <init_task>, wake_flags=9,
rf=0xffffc900022b3f10) at kernel/sched/core.c:1697
#3 0xffffffff810aec00 in sched_ttwu_pending () at kernel/sched/core.c:1740
#4 0xffffffff810aedcd in scheduler_ipi () at kernel/sched/core.c:1771
#5 0xffffffff81a01aef in reschedule_interrupt () at arch/x86/entry/entry_64.S:888
#6 0xffffc900022b3f58 in ?? ()
#7 0xffffffff81a01aea in reschedule_interrupt () at arch/x86/entry/entry_64.S:888
#8 0x0000000000000002 in irq_stack_union ()
#9 0x00007fcec3421b40 in ?? ()
#10 0x0000000000000006 in irq_stack_union ()
#11 0x00007fceb00008c0 in ?? ()
#12 0x0000000000000002 in irq_stack_union ()
#13 0x00000000020bd380 in ?? ()
#14 0x0012c8d2cc413914 in ?? ()
Cannot access memory at address 0x5000
The keyboard interrupt can happen at any time, including during a kernel call. Normally that's OK: the interrupt happens, the driver does its thing, the interrupt handler returns, and the kernel continues.
But if you sleep in the interrupt handler, the kernel is left in an intermediate state. Other processors can't execute kernel calls, because the kernel is already busy; each is forced to wait for it to come back, and it never does. No wonder it panics!
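If the handler genuinely needs to sleep, the usual pattern is to split it into a hard-IRQ half that must not sleep and a threaded handler where sleeping is allowed. A minimal sketch of that split, reusing the module's irq, dev, and counter variables (the function names are hypothetical):

/* Hard-IRQ half: atomic context, must not sleep. */
static irqreturn_t keyboard_quick(int irq, void *dev)
{
    return IRQ_WAKE_THREAD; /* defer to the threaded half */
}

/* Threaded half: runs in a kernel thread, so msleep() is legal. */
static irqreturn_t keyboard_thread(int irq, void *dev)
{
    pr_info("Keyboard Counter:%d\n", counter++);
    msleep(1000);
    return IRQ_NONE;
}

static int test_interrupt_init(void)
{
    return request_threaded_irq(irq, keyboard_quick, keyboard_thread,
                                IRQF_SHARED, "my_keyboard_handler", &dev);
}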
This backtrace comes from a deadlock situation in a multi-threaded application.
The other deadlocked threads are locking inside the call to malloc(), and appear
to be waiting on this thread.
I don't understand what creates this thread, since it deadlocks before calling
any functions in my application:
Thread 6 (Thread 0x7ff69d43a700 (LWP 14191)):
#0 0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00007ff6a299460d in _L_lock_27 () from /usr/lib64/libc.so.6
#2 0x00007ff6a29945bd in arena_thread_freeres () from /usr/lib64/libc.so.6
#3 0x00007ff6a2994662 in __libc_thread_freeres () from /usr/lib64/libc.so.6
#4 0x00007ff6a3875e38 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6
clone() is used to implement fork(), pthread_create(), and perhaps other functions.
How can I find out if this trace comes from a fork(), pthread_create(), a signal handler, or something else? Do I just need to dig through the glibc code, or can I use gdb or some other tool? Why does this thread need the internal glibc lock? This would be useful in determining the cause of the deadlock.
Additional information and research:
malloc() is thread-safe, but not reentrant (recursive-safe), so malloc() is also not async-signal-safe. We don't define signal handlers for this process, so I know that we don't call malloc() from signal handlers. The deadlocked threads don't ever call recursive functions, and callbacks are handled in a new thread, so I don't think we should need to worry about reentrancy here. (Maybe I'm wrong?)
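(For background: the reentrancy hazard described above is easy to picture. If a signal arrives while a thread is inside malloc() holding an arena lock, and the handler calls malloc() again, the thread blocks on the lock it itself holds. A hypothetical illustration, not this application's code:)

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static void handler(int sig)
{
    (void)sig;
    /* Not async-signal-safe: if the signal interrupted a malloc()
     * in this same thread, this call blocks forever on the arena
     * lock that the interrupted malloc() still holds. */
    free(malloc(32));
}

int main(void)
{
    signal(SIGALRM, handler);
    alarm(1);
    for (;;)
        free(malloc(64)); /* eventually interrupted mid-allocation */
}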
This deadlock happens when many callbacks are being spawned to signal (ultimately kill) different processes. The callbacks are spawned in their own threads.
Are we possibly using malloc in an unsafe way?
Possibly related:
glibc malloc internals
Malloc inside of signal handler causes deadlock.
How are signal handlers delivered in a multi-threaded application?
glibc fork/malloc deadlock bug that was fixed in glibc-2.17-162.el7. This looks similar, but is NOT my bug - I'm on a fixed version of glibc.
(I've been unsuccessful in creating a minimal, complete, verifiable example. Unfortunately the only way to reproduce is with the application (Slurm), and it's quite difficult to reproduce.)
EDIT:
Here's the backtrace from all the threads. Thread 6 is the trace I originally posted. Thread 1 is just waiting on a pthread_join(). Threads 2-5 are locked after a call to malloc(). Thread 7 is listening for messages and spawning callbacks in new threads (threads 2-5). Those would be callbacks that would eventually signal other processes.
Thread 7 (Thread 0x7ff69e672700 (LWP 12650)):
#0 0x00007ff6a291aa3d in poll () from /usr/lib64/libc.so.6
#1 0x00007ff6a3c09064 in _poll_internal (shutdown_time=<optimized out>, nfds=2,
pfds=0x7ff6980009f0) at ../../../../slurm/src/common/eio.c:364
#2 eio_handle_mainloop (eio=0xf1a970) at ../../../../slurm/src/common/eio.c:328
#3 0x000000000041ce78 in _msg_thr_internal (job_arg=0xf07760)
at ../../../../../slurm/src/slurmd/slurmstepd/req.c:245
#4 0x00007ff6a3875e25 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6
Thread 6 (Thread 0x7ff69d43a700 (LWP 14191)):
#0 0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00007ff6a299460d in _L_lock_27 () from /usr/lib64/libc.so.6
#2 0x00007ff6a29945bd in arena_thread_freeres () from /usr/lib64/libc.so.6
#3 0x00007ff6a2994662 in __libc_thread_freeres () from /usr/lib64/libc.so.6
#4 0x00007ff6a3875e38 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6
Thread 5 (Thread 0x7ff69e773700 (LWP 22471)):
#0 0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00007ff6a28af7d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2 0x00007ff6a28a7ca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3 0x00007ff6a28ad0fe in malloc () from /usr/lib64/libc.so.6
#4 0x00007ff6a3c02e60 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false,
file=file@entry=0x7ff6a3c1f1f0 "../../../../slurm/src/common/pack.c",
line=line@entry=152, func=func@entry=0x7ff6a3c1f4a6 <__func__.7843> "init_buf")
at ../../../../slurm/src/common/xmalloc.c:86
#5 0x00007ff6a3b2e5b7 in init_buf (size=16384)
at ../../../../slurm/src/common/pack.c:152
#6 0x000000000041caab in _handle_accept (arg=0x0)
at ../../../../../slurm/src/slurmd/slurmstepd/req.c:384
#7 0x00007ff6a3875e25 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x7ff6a4086700 (LWP 5633)):
#0 0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00007ff6a28af7d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2 0x00007ff6a28a7ca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3 0x00007ff6a28ad0fe in malloc () from /usr/lib64/libc.so.6
#4 0x00007ff6a3c02e60 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false,
file=file@entry=0x7ff6a3c1f1f0 "../../../../slurm/src/common/pack.c",
line=line@entry=152, func=func@entry=0x7ff6a3c1f4a6 <__func__.7843> "init_buf")
at ../../../../slurm/src/common/xmalloc.c:86
#5 0x00007ff6a3b2e5b7 in init_buf (size=16384)
at ../../../../slurm/src/common/pack.c:152
#6 0x000000000041caab in _handle_accept (arg=0x0)
at ../../../../../slurm/src/slurmd/slurmstepd/req.c:384
#7 0x00007ff6a3875e25 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7ff69d53b700 (LWP 12963)):
#0 0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00007ff6a28af7d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2 0x00007ff6a28a7ca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3 0x00007ff6a28ad0fe in malloc () from /usr/lib64/libc.so.6
#4 0x00007ff6a3c02e60 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false,
file=file@entry=0x7ff6a3c1f1f0 "../../../../slurm/src/common/pack.c",
line=line@entry=152, func=func@entry=0x7ff6a3c1f4a6 <__func__.7843> "init_buf")
at ../../../../slurm/src/common/xmalloc.c:86
#5 0x00007ff6a3b2e5b7 in init_buf (size=16384)
at ../../../../slurm/src/common/pack.c:152
#6 0x000000000041caab in _handle_accept (arg=0x0)
at ../../../../../slurm/src/slurmd/slurmstepd/req.c:384
#7 0x00007ff6a3875e25 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7ff69f182700 (LWP 19734)):
#0 0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00007ff6a28af7d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2 0x00007ff6a28a7ca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3 0x00007ff6a28ad0fe in malloc () from /usr/lib64/libc.so.6
#4 0x00007ff6a3c02e60 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false,
file=file@entry=0x7ff6a3c1f1f0 "../../../../slurm/src/common/pack.c",
line=line@entry=152, func=func@entry=0x7ff6a3c1f4a6 <__func__.7843> "init_buf")
at ../../../../slurm/src/common/xmalloc.c:86
#5 0x00007ff6a3b2e5b7 in init_buf (size=16384)
at ../../../../slurm/src/common/pack.c:152
#6 0x000000000041caab in _handle_accept (arg=0x0)
at ../../../../../slurm/src/slurmd/slurmstepd/req.c:384
#7 0x00007ff6a3875e25 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ff6a4088880 (LWP 12616)):
#0 0x00007ff6a3876f57 in pthread_join () from /usr/lib64/libpthread.so.0
#1 0x000000000041084a in _wait_for_io (job=0xf07760)
at ../../../../../slurm/src/slurmd/slurmstepd/mgr.c:2219
#2 job_manager (job=job@entry=0xf07760)
at ../../../../../slurm/src/slurmd/slurmstepd/mgr.c:1397
#3 0x000000000040ca07 in main (argc=1, argv=0x7fffacab93d8)
at ../../../../../slurm/src/slurmd/slurmstepd/slurmstepd.c:172
The presence of start_thread() in the backtrace indicates that this is a pthread_create() thread.
__libc_thread_freeres() is a function that glibc calls at thread exit, which invokes a set of callbacks to free internal per-thread state. This indicates the thread you have highlighted is in the process of exiting.
arena_thread_freeres() is one of those callbacks. It is for the malloc arena allocator, and it moves the free list from the exiting thread's private arena to the global free list. To do this, it must take a lock that protects the global free list (this is the list_lock in arena.c).
It appears to be this lock that the highlighted thread (Thread 6) is blocked on.
The arena allocator installs pthread_atfork() handlers which lock the list lock at the start of fork() processing, and unlock it at the end. This means that while other pthread_atfork() handlers are running, all other threads will block on this lock.
Are you installing your own pthread_atfork() handlers? It seems likely that one of these may be causing your deadlock.
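As a hypothetical illustration (not Slurm's code, just the shape of the bug): a prepare handler that takes an application lock which another thread can hold while inside malloc() creates a lock-order cycle with the allocator's own fork()-time locking:

#include <pthread.h>

static pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;

/* fork() runs prepare handlers while glibc also serializes the malloc
 * arenas. If thread A holds app_lock and is blocked inside malloc(),
 * while the forking thread sits in this prepare handler, neither can
 * proceed, and exiting threads then pile up behind the arena list
 * lock, as in the Thread 6 backtrace above. */
static void prepare(void) { pthread_mutex_lock(&app_lock); }
static void parent(void)  { pthread_mutex_unlock(&app_lock); }
static void child(void)   { pthread_mutex_unlock(&app_lock); }

void install_handlers(void)
{
    pthread_atfork(prepare, parent, child);
}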
I'm running into a pretty annoying problem:
I have a program which creates one thread at the start; this thread launches other things during its execution (fork() immediately followed by execve()).
Here is the bt of both threads at the point where my program reached (I think) the deadlock:
Thread 2 (LWP 8839):
#0 0x00007ffff6cdf736 in __libc_fork () at ../sysdeps/nptl/fork.c:125
#1 0x00007ffff6c8f8c0 in _IO_new_proc_open (fp=fp@entry=0x7ffff00031d0, command=command@entry=0x7ffff6c26e20 "ps -u brejon | grep \"cvc\"
#2 0x00007ffff6c8fbcc in _IO_new_popen (command=0x7ffff6c26e20 "ps -u user | grep \"cvc\" | wc -l", mode=0x42c7fd "r") at iopopen.c:296
#3-4 ...
#5 0x00007ffff74d9434 in start_thread (arg=0x7ffff6c27700) at pthread_create.c:333
#6 0x00007ffff6d0fcfd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 1 (LWP 8835):
#0 __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007ffff6ca0ad9 in malloc_atfork (sz=140737337120848, caller=<optimized out>) at arena.c:179
#2 0x00007ffff6c8d875 in __GI__IO_file_doallocate (fp=0x17a72230) at filedoalloc.c:127
#3 0x00007ffff6c9a964 in __GI__IO_doallocbuf (fp=fp@entry=0x17a72230) at genops.c:398
#4 0x00007ffff6c99de8 in _IO_new_file_overflow (f=0x17a72230, ch=-1) at fileops.c:820
#5 0x00007ffff6c98f8a in _IO_new_file_xsputn (f=0x17a72230, data=0x17a16420, n=682) at fileops.c:1331
#6 0x00007ffff6c6fcb2 in _IO_vfprintf_internal (s=0x17a72230, format=<optimized out>, ap=ap@entry=0x7fffffffcf18) at vfprintf.c:1632
#7 0x00007ffff6c76a97 in __fprintf (stream=<optimized out>, format=<optimized out>) at fprintf.c:32
#8-11 ...
#12 0x000000000042706e in main (argc=3, argv=0x7fffffffd698, envp=0x7fffffffd6b8) at mains/ignore/.c:146
Both stay stuck there forever, with both glibc 2.17 and glibc 2.23.
Any help is welcomed :'D
EDIT:
Here is a minimal example:
1 #include <stdlib.h>
2 #include <pthread.h>
3 #include <unistd.h>
4
5 void * thread_handler(void * args)
6 {
7 char * argv[] = { "/usr/bin/ls" };
8 char * newargv[] = { "/usr/bin/ls", NULL };
9 char * newenviron[] = { NULL };
10 while (1) {
11 if (vfork() == 0) {
12 execve(argv[0], newargv, newenviron);
13 }
14 }
15
16 return 0;
17 }
18
19 int main(void)
20 {
21 pthread_t thread;
22 pthread_create(&thread, NULL, thread_handler, NULL);
23
24 int * dummy_alloc;
25
26 while (1) {
27 dummy_alloc = malloc(sizeof(int));
28 free(dummy_alloc);
29 }
30
31 return 0;
32 }
Environment:
user:deadlock$ cat /etc/redhat-release
Scientific Linux release 7.3 (Nitrogen)
user:deadlock$ ldd --version
ldd (GNU libc) 2.17
EDIT 2:
The rpm package version is glibc-2.17-196.el7.x86_64.
At first I could not get line numbers with the rpm package; installing the debuginfo package solved that. Here is the backtrace using the glibc shipped with the distribution:
(gdb) thread apply all bt
Thread 2 (Thread 0x7ffff77fb700 (LWP 59753)):
#0 vfork () at ../sysdeps/unix/sysv/linux/x86_64/vfork.S:44
#1 0x000000000040074e in thread_handler (args=0x0) at deadlock.c:11
#2 0x00007ffff7bc6e25 in start_thread (arg=0x7ffff77fb700) at pthread_create.c:308
#3 0x00007ffff78f434d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Thread 1 (Thread 0x7ffff7fba740 (LWP 59746)):
#0 0x00007ffff7878226 in _int_free (av=0x7ffff7bb8760 <main_arena>, p=0x602240, have_lock=0) at malloc.c:3927
#1 0x00000000004007aa in main () at deadlock.c:28
This is a custom-compiled glibc. It is possible that something went wrong with the installation. Note that Red Hat Enterprise Linux 7.4 backports a fix for a deadlock between malloc and fork, and you are missing that because you compiled your own glibc. The fix for the upstream bug went only into upstream version 2.24, so if you are basing your custom build on an earlier version, you may not have this fix. (Although the backtrace would look different for that one.)
I think we fixed at least another post-2.17 libio-related deadlock bug.
EDIT: I have been dealing with fork-related deadlocks for too long. There are multiple issues with the reproducer as posted:
There is no waitpid call for the PID. As a result, the process table will be quickly filled with zombies.
No error checking for execve. If the /usr/bin/ls pathname does not exist (for example, on a system which did not undergo UsrMove), execve will return, and the next iteration of the loop will launch another vfork call.
I fixed both issues (because debugging what is approaching a fork bomb is not fun at all), but I can't reproduce a hang with glibc-2.17-196.el7.x86_64.
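For reference, the reproducer's thread function with both fixes applied would look roughly like this (a sketch of the changes described above, not the exact code that was tested):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

void *thread_handler(void *args)
{
    char *newargv[] = { "/usr/bin/ls", NULL };
    char *newenviron[] = { NULL };

    (void)args;
    while (1) {
        pid_t pid = vfork();
        if (pid == 0) {
            execve(newargv[0], newargv, newenviron);
            _exit(127);            /* execve failed: never fall through */
        }
        if (pid > 0)
            waitpid(pid, NULL, 0); /* reap the child; no zombie build-up */
    }
    return 0;
}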
I'm developing a project on an embedded Linux OS (uClinux, MIPS CPU), and it crashes occasionally.
When I check the coredump with gdb, I can see that it received a SIGILL signal.
Sometimes I can see a backtrace, which shows it died in pthread_mutex_lock; but most of the time the backtrace is not valid.
A valid backtrace:
(gdb) bt
#0 <signal handler called>
#1 0x2ab87fd8 in sigsuspend () from /lib/libc.so.0
#2 0x2aade80c in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0
#3 0x2aadc7ac in __pthread_alt_lock () from /lib/libpthread.so.0
#4 0x2aad81a4 in pthread_mutex_lock () from /lib/libpthread.so.0
#5 0x0042fde8 in aos_mutex_lock (mutex=0x66bea8) at ../../source/ssp/os/sys/linux/aos_lock_linux.c:184
An invalid backtrace:
(gdb) bt
#0 0x00690430 in ?? ()
#1 0x00690430 in ?? ()
I used pthread_attr_setstackaddr to set up a stack for each thread, so that I can recover a thread's call frames by inspecting its stack. That way, too, I found it died in pthread_mutex_lock.
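(For context, that setup is along these lines; a minimal sketch using pthread_attr_setstack, the modern replacement for the deprecated pthread_attr_setstackaddr:)

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define THREAD_STACK_SIZE (256 * 1024)

/* Give each thread a caller-owned, page-aligned stack at a known
 * address, so its frames can be dug out of a coredump by hand even
 * when gdb cannot unwind. */
int create_with_known_stack(pthread_t *t, void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    void *stack = NULL;

    if (posix_memalign(&stack, (size_t)sysconf(_SC_PAGESIZE),
                       THREAD_STACK_SIZE) != 0)
        return -1;
    pthread_attr_init(&attr);
    pthread_attr_setstack(&attr, stack, THREAD_STACK_SIZE);
    return pthread_create(t, &attr, fn, arg);
}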
I use wrappers for lock and unlock, like this:
struct aos_mutex_t
{
    pthread_mutex_t mutex;
    S8 obj_name[AOS_MAX_OBJ_NAME];
    S32 lFlag;
};

S32 aos_mutex_lock(AOS_MUTEX_T *mutex)
{
    S32 status;

    AOS_ASSERT_RETURN(mutex, AOS_EINVAL);
    mutex->lFlag++; /* incremented before the lock is actually held */
    status = pthread_mutex_lock(&mutex->mutex);
    if (status == 0)
    {
        return AOS_SUCC;
    }
    else
    {
        return AOS_RETURN_OS_ERROR(status);
    }
}

/*
 * aos_mutex_unlock()
 */
S32 aos_mutex_unlock(AOS_MUTEX_T *mutex)
{
    S32 status;

    AOS_ASSERT_RETURN(mutex, AOS_EINVAL);
    status = pthread_mutex_unlock(&mutex->mutex);
    mutex->lFlag--; /* decremented after the unlock */
    if (status == 0)
        return AOS_SUCC;
    else
    {
        return AOS_RETURN_OS_ERROR(status);
    }
}
All of these mutexes are initialized before use.
I tried running the program under gdb; it didn't die.
I wrote a simple program in which 11 threads do nothing but lock and unlock in a tight loop. It didn't die.
Any suggestions?