Our multi-threaded process is deadlocked in several threads, each showing the 3 frames below at the top of the stack. GDB shows that another thread is stuck in fork (called via popen), which is presumably why malloc_atfork, instead of malloc, is being called to allocate memory.
#0 0x00007f4f02c4aeec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00007f4f02bc807c in _L_lock_14817 () from /usr/lib64/libc.so.6
#2 0x00007f4f02bc51df in malloc_atfork () from /usr/lib64/libc.so.6
There is a Red Hat bug (https://bugzilla.redhat.com/show_bug.cgi?id=906468) about a deadlock in glibc between fork and malloc, and there are other reports of deadlocks in malloc_atfork.
And this link, https://sourceware.org/ml/libc-alpha/2016-02/msg00269.html, from February 2016, contains a patch that removes malloc_atfork.
Does anyone know a solution to this problem?
While this is a bug in glibc, it should not be able to happen except when you are calling fork from an async-signal context, where it has interrupted code that's already holding the malloc lock and the interrupted code cannot make forward progress. Otherwise, it's another thread holding the lock, and that thread should eventually make forward progress and allow the fork to continue.
Are you possibly calling popen from a signal handler? If so, that's not valid usage, and you should expect it to be able to fail in many other ways, not just this one.
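To make that concrete, here is a minimal sketch (everything in it is hypothetical, purely to illustrate the hazard) of the kind of invalid usage that produces exactly this stack: the handler's popen() calls fork() and malloc() while the interrupted thread may already hold the malloc lock.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void on_alarm(int sig)
{
    (void)sig;
    /* popen() is not async-signal-safe: it calls fork() and malloc(). */
    FILE *p = popen("echo hello", "r");
    if (p)
        pclose(p);
}

int main(void)
{
    signal(SIGALRM, on_alarm);
    alarm(1);
    for (;;) {
        /* If SIGALRM arrives while this malloc() holds the arena lock,
         * the handler's fork() -> malloc_atfork() path can block forever
         * on that same lock. */
        void *q = malloc(64);
        free(q);
    }
}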
Related
I have a fairly straightforward C program that runs much faster on one thread than on multiple threads.
(I'm running on a four-core i5 processor.)
By using the highly scientific "GDB halt debugging" technique, I've determined that it looks like only one thread is actually executing at a time.
Basically, when I hit ^C in GDB and type info threads, I get something like this:
Id Target Id Frame
29 Thread 0x7ffff5cec700 (LWP 14787) "corr" __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
28 Thread 0x7ffff64ed700 (LWP 14786) "corr" __lll_unlock_wake_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:341
27 Thread 0x7ffff6cee700 (LWP 14785) "corr" 0x00007ffff752ca2c in __random () at random.c:296
26 Thread 0x7ffff74ef700 (LWP 14784) "corr" __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
* 1 Thread 0x7ffff7fd5740 (LWP 14755) "corr" 0x00007ffff78bf66b in pthread_join (threadid=140737342535424, thread_return=0x7fffffffdd80) at pthread_join.c:92
(Thread 1 is the main thread; threads 26–29 are worker threads.)
A quick Google search seems to imply that these functions have something to do with deadlock detection, but I can't get much beyond that.
What are these functions, and why are they slowing down?
Possibly relevant:
If I join with each thread immediately after creating it, and before creating the others (i.e., not really multithreading at all, but still incurring the thread overhead), this effect does not occur, and my program runs more quickly.
In case it's useful, here's a code dump (159 lines).
Your threads are fighting over the random number generator. rand keeps a single hidden state protected by a lock, so whenever one thread is using it, the others have to wait until the lock is released. You should use rand_r (or lrand48_r, or whatever sane random number generator meets your needs) instead of rand, so that each thread has its own private state; see the sketch below.
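For illustration, here is a minimal sketch of the rand_r approach (the thread count, seeds, and loop bound are arbitrary):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 4

static void *worker(void *arg)
{
    /* Per-thread RNG state, seeded from the thread index passed in arg. */
    unsigned int seed = (unsigned int)(size_t)arg;
    unsigned long sum = 0;
    for (int i = 0; i < 10000000; i++)
        sum += rand_r(&seed);   /* no shared lock, unlike rand() */
    printf("thread %zu: sum=%lu\n", (size_t)arg, sum);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (size_t i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)(i + 1));
    for (size_t i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}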
My program is deadlocking and here are the top 4 frames of the deadlock:
#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#1 0x00007f926250b7aa in _L_lock_12502 () at malloc.c:3507
#2 0x00007f926250a2df in malloc_atfork (sz=12, caller=<value optimized out>) at arena.c:217
#3 0x00007f926250881a in __libc_calloc (n=<value optimized out>, elem_size=<value optimized out>) at malloc.c:4040
I'm leaning towards this being a problem caused by something I'm doing wrong. We see the deadlock when stressing the server and taking it to high usage levels, but otherwise we can't reproduce this. Does anyone know what kind of mistake causes this?
Per POSIX, after calling fork in a multithreaded process, the child process is in an async signal context and undefined behavior is invoked if you do anything other than calling async-signal-safe functions before calling _exit or one of the exec family functions.
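In practice that means the child does nothing between fork and exec (or _exit) except async-signal-safe calls. A minimal sketch of that pattern (the helper name is hypothetical):

#include <sys/wait.h>
#include <unistd.h>

/* Run a command from a multithreaded process: the child touches
 * nothing but async-signal-safe functions before exec or _exit. */
int run_command(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: no malloc(), no printf(), no locking allowed here. */
        execvp(argv[0], argv);
        _exit(127);   /* exec failed; use _exit(), never exit() */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}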
You most frequently get a deadlock when different execution threads acquire shared resources in different orders; showing up only under stress is a good indicator of this. Suppose threads A and B acquire resources 1 and 2 in these orders:
A == 1 2
B == 2 1
Now suppose there is a thread reschedule right after A acquires 1 but before it grabs 2. Thread B runs, acquires 2, then tries to grab 1 and blocks. Control returns to A, which now blocks waiting on resource 2, held by B, while B waits on resource 1, held by A. Neither can proceed: deadlock.
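A minimal sketch of this interleaving with two pthread mutexes (the sleep() calls just widen the race window so the deadlock reproduces reliably):

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;

static void *thread_a(void *arg)
{
    pthread_mutex_lock(&lock1);   /* A holds 1 */
    sleep(1);                     /* give B time to grab 2 */
    pthread_mutex_lock(&lock2);   /* blocks forever: B holds 2 */
    pthread_mutex_unlock(&lock2);
    pthread_mutex_unlock(&lock1);
    return arg;
}

static void *thread_b(void *arg)
{
    pthread_mutex_lock(&lock2);   /* B holds 2 */
    sleep(1);
    pthread_mutex_lock(&lock1);   /* blocks forever: A holds 1 */
    pthread_mutex_unlock(&lock1);
    pthread_mutex_unlock(&lock2);
    return arg;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);   /* never returns: both threads are deadlocked */
    pthread_join(b, NULL);
    return 0;
}

The standard fix is to pick one global order for the locks and have every thread acquire them in that order.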
Another cause of deadlocks is a slight variation on this, where one execution path claims a resource without honoring the resource locking; this will mislead other execution threads that follow the rules.
Hope this helps.
I am using POSIX threads. My question is whether a thread can cancel itself by passing its own thread ID to the pthread_cancel function.
If yes, what are its implications?
Also, if a main program creates two threads and one of the threads cancels the other, what happens to the return value and the resources of the cancelled thread?
And how can the main program tell which thread was cancelled, since the main program is not cancelling any of the threads?
I am using asynchronous cancellation.
Any help is appreciated.
Q1: Yes, a thread can cancel itself. However, doing so has all of the negative consequences of cancellation in general; you probably want to use pthread_exit instead, which is somewhat more predictable.
Q2: When a thread has been cancelled, it doesn't get to generate a return value; instead, pthread_join will put the special value PTHREAD_CANCELED in the location pointed to by its retval argument. Unfortunately, you have to know by some other means that a specific thread has definitely terminated (in some fashion) before you call pthread_join, or the calling thread will block forever. There is no portable equivalent of waitpid(..., WNOHANG) nor of waitpid(-1, ...). (The manpage says "If you believe you need this functionality, you probably need to rethink your application design" which makes me want to punch someone in the face.)
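For illustration, a minimal sketch of detecting cancellation at join time (the victim thread is hypothetical; sleep() is a cancellation point):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *victim(void *arg)
{
    (void)arg;
    for (;;)
        sleep(1);   /* deferred cancellation is acted on here */
}

int main(void)
{
    pthread_t t;
    void *ret;

    pthread_create(&t, NULL, victim, NULL);
    sleep(1);
    pthread_cancel(t);
    pthread_join(t, &ret);   /* blocks until the thread has terminated */
    if (ret == PTHREAD_CANCELED)
        printf("thread was cancelled\n");
    else
        printf("thread returned %p\n", ret);
    return 0;
}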
Q2a: It depends what you mean by "resources of the thread". The thread control block and stack will be deallocated. All destructors registered with pthread_cleanup_push or pthread_key_create will be executed (on the thread, before it terminates); some runtimes also execute C++ class destructors for objects on the stack. It is the application programmer's responsibility to make sure that all resources owned by the thread are covered by one of these mechanisms. Note that some of these mechanisms have inherent race conditions; for instance, it is impossible to open a file and push a cleanup that closes it as an atomic action, so there is a window where cancellation can leak the open file. (Do not think this can be worked around by pushing the cleanup before opening the file, because a common implementation of deferred cancels is to check for them whenever a system call returns, i.e. exactly timed to hit the tiny gap between the OS writing the file descriptor number to the return-value register, and the calling function copying that register to the memory location where the cleanup expects it to be.)
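Here is a sketch of the cleanup-handler mechanism for that file example (the worker is hypothetical); note it still has the open-to-push window just described:

#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

static void close_fd(void *arg)
{
    close(*(int *)arg);   /* runs if the thread is cancelled or exits */
}

static void *worker(void *arg)
{
    int fd = open((const char *)arg, O_RDONLY);
    if (fd < 0)
        return NULL;
    /* Window: a cancel delivered between open() returning and this push
     * would leak fd, exactly as described above. */
    pthread_cleanup_push(close_fd, &fd);

    char buf[4096];
    while (read(fd, buf, sizeof buf) > 0)   /* read() is a cancellation point */
        ;                                   /* ... process buf ... */

    pthread_cleanup_pop(1);   /* pop and run the handler, closing fd */
    return NULL;
}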
Qi: you didn't ask this, but you should be aware that a thread with asynchronous cancellation enabled is officially not allowed to do anything other than pure computation. The behavior is undefined if it calls any library function other than pthread_cancel, pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED), or pthread_setcancelstate(PTHREAD_CANCEL_DISABLE).
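If you do need asynchronous cancelability, a sketch of confining it to a pure-compute region (the work loop is a placeholder):

#include <pthread.h>

static void *worker(void *arg)
{
    volatile unsigned long acc = 0;
    int oldtype;

    /* Async cancel is enabled only while we do pure computation... */
    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &oldtype);
    for (unsigned long n = 0; n < 1000000000UL; n++)
        acc += n;
    /* ...and restored to deferred before any library call. */
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &oldtype);

    (void)arg;
    return (void *)acc;
}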
Q1. Yes, a thread can cancel itself.
Q2. If one thread cancels another, the cancelled thread's resources hang around until the main thread joins it with the pthread_join() function (if the thread is joinable). If the cancelled thread is never joined, its resources are freed when the program ends/terminates.
Q3. I am not sure, but the main program doesn't know which thread was cancelled.
A thread can cancel any other thread (within the same process), including itself.
Threads do not have return values in the general sense (they can have a return status only); the resources of the thread will be freed upon cancellation.
The main program can store each thread's handle and test whether it is still valid.
Is there any way to find out where the signal that interrupted a call to sleep() came from?
I have a ginormous amount of code, and I get this stacktrace from gdb:
#0 0x00418422 in __kernel_vsyscall ()
#1 0x001adfc6 in nanosleep () from /lib/libc.so.6
#2 0x001adde1 in sleep () from /lib/libc.so.6
#3 0x080a3cbd in MRT::setUp (this=0x9c679d8) at /code/Core/exec/mrt.cc:50
#4 0x080a1efc in main (argc=13, argv=0xbfcb6934) at /code/Core/exec/rpn.cc:211
I'm not entirely sure what all the code does, but I think this is what is going on:
Program 1 starts
Calls program 2 for shared memory allocation
Waits predetermined amount of time for allocation to complete
Program 1 continues
What I need is to find out what interrupts the sleep.
At the time you attached GDB to the program, the sleep was in fact not interrupted by anything -- your stack trace indicates that your program is still blocked in the sleep system call.
Do you know what the sleep address inside setUp() is? For example, sleep(&variable). Look for all callers of wakeup(&variable); one of them is the sleep breaker. If there are too many, I would add a trace array to remember the wakeups that were issued, i.e. just store the PC from where wakeup was called... you can read that in the core file.
If you are sure that the sleep is interruptible and that it was actually interrupted, then I would do what another poster said: catch the signal in a signal handler, capture the signal info, and re-arm it with the same signal.
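For example, a sketch using sigaction with SA_SIGINFO so the handler records the sender's PID (SIGUSR1 is an assumption, install it for whichever signal you suspect; si_pid is only meaningful for signals sent by a process via kill() or sigqueue()):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t sender_pid;

static void on_signal(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    sender_pid = info->si_pid;   /* PID of the sending process, if any */
}

int main(void)
{
    struct sigaction sa = { 0 };
    sa.sa_sigaction = on_signal;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGUSR1, &sa, NULL);

    unsigned left = sleep(60);   /* returns early if a signal interrupts it */
    if (left > 0)
        printf("sleep interrupted with %us left; sender pid %ld\n",
               left, (long)sender_pid);
    return 0;
}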
If you are attaching to a running process, the process is interrupted by GDB itself to allow you to debug. The stack trace you observe is simply the stack of the running process at the time you attached to it. sleep() would not be an unreasonable system call for the process to be in when you are attaching to a process that appears to be idle.
If you are debugging a core file that shows the stack trace in sleep(), then when you start GDB to load a core file, it will display the top of the current stack frame of the core file. But just above that, it shows the signal that caused the core file. I wrote a test program, and this is what it showed when I loaded the core file into GDB:
Core was generated by `./a.out'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000400458 in main ()
(gdb)
A core file is just a process snapshot; it is not always due to an internal error in the code. Sometimes it is generated by a signal delivered from an external program or the shell, and sometimes by executing the generate-core-file command from within GDB. In these cases, your core file may not actually point to anything wrong, just the state the program was in at the time the core file was created.
(Working in Win32 api , in C environment with VS2010)
I have a two-thread app. The first thread spawns the second, waits for a given interval ('TIMEOUT'), and then calls TerminateThread() on it.
Meanwhile, the second thread calls NetServerEnum().
It appears that when the timeout is reached, whether NetServerEnum returned successfully or not, the first thread gets deadlocked.
I've already noticed that NetServerEnum creates worker threads of its own.
I ultimately end up with one of those threads deadlocked, typically in ntdll.dll!RtlInitializeExceptionChain, unable to exit my process gracefully.
As this is too long for a comment, allow me to quote MSDN verbatim in answer form (emphasis mine):
TerminateThread is a dangerous function that should only be used in the most extreme cases. You should call TerminateThread only if you know exactly what the target thread is doing, and you control all of the code that the target thread could possibly be running at the time of the termination. For example, TerminateThread can result in the following problems:
If the target thread owns a critical section, the critical section will not be released.
If the target thread is allocating memory from the heap, the heap lock will not be released.
*If the target thread is executing certain kernel32 calls when it is terminated, the kernel32 state for the thread's process could be inconsistent.*
If the target thread is manipulating the global state of a shared DLL, the state of the DLL could be destroyed, affecting other users of the DLL.
From reading this, it is easy to understand why it is a bad idea to cancel (terminate) a thread that is stuck in a system call.
A possible alternative to the OP's design would be to spawn off a thread that calls NetServerEnum() and simply let it run until the system call returns.
In the meantime, the main thread could do other things, for example informing the user that scanning the network is taking longer than expected.
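A minimal sketch of that approach for a console program (names are hypothetical and the 5-second wait is arbitrary): the main thread waits with a timeout and keeps the user informed, but never terminates the worker.

#include <windows.h>
#include <lm.h>
#include <stdio.h>

#pragma comment(lib, "netapi32.lib")

static DWORD WINAPI enum_thread(LPVOID param)
{
    SERVER_INFO_101 *buf = NULL;
    DWORD read = 0, total = 0;
    NET_API_STATUS st = NetServerEnum(NULL, 101, (LPBYTE *)&buf,
                                      MAX_PREFERRED_LENGTH, &read, &total,
                                      SV_TYPE_ALL, NULL, NULL);
    if (st == NERR_Success && buf != NULL)
        NetApiBufferFree(buf);   /* always free the enumeration buffer */
    (void)param;
    return (DWORD)st;
}

int main(void)
{
    HANDLE h = CreateThread(NULL, 0, enum_thread, NULL, 0, NULL);
    if (h == NULL)
        return 1;

    /* Poll with a timeout instead of calling TerminateThread. */
    while (WaitForSingleObject(h, 5000) == WAIT_TIMEOUT)
        printf("Still scanning the network, please wait...\n");

    CloseHandle(h);
    return 0;
}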