Deadlock inside malloc_atfork - c

My program is deadlocking and here are the top 4 frames of the deadlock:
#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#1 0x00007f926250b7aa in _L_lock_12502 () at malloc.c:3507
#2 0x00007f926250a2df in malloc_atfork (sz=12, caller=<value optimized out>) at arena.c:217
#3 0x00007f926250881a in __libc_calloc (n=<value optimized out>, elem_size=<value optimized out>) at malloc.c:4040
I'm leaning towards this being a problem caused by something I'm doing wrong. We see the deadlock when stressing the server and taking it to high usage levels, but otherwise we can't reproduce this. Does anyone know what kind of mistake causes this?

Per POSIX, after calling fork in a multithreaded process, the child process is in an async signal context and undefined behavior is invoked if you do anything other than calling async-signal-safe functions before calling _exit or one of the exec family functions.

You most frequently get a deadlock if difering execution threads acquire shared resources in a different order. Appearing under stress is a good indicator of this. Support you have:
A == 1 2
B == 2 1
Now, suppose you get a thread reschedule right after A acquires 1 but before it grabs 2. Thread B runs and acquires 2 and then control returns to A; it is now blocked waiting on resource 2, which is held by B which is waiting for resource 1 held by A. Now, A cannot proceed and neither can B; deadlock.
Another cause of deadlocks is a slight variation on this, where one execution path claims a resource without honoring the resource locking; this will mislead other execution threads that follow the rules.
Hope this helps.

Related

How does _dl_runtime_resolve achieve thread safety?

I'm reading this article: https://ypl.coffee/dl-resolve/
Since the GOT table is resolved at runtime (at first call to this function), I'm wondering how does _dl_runtime_resolve ensure it's multi-thread safe?
how does _dl_runtime_resolve ensure it's multi-thread safe?
Any pure function is thread-safe by definition.
_dl_runtime_resolve is not pure (updates the GOT slot), but if two threads are resolving the same symbol in parallel, both threads will resolve to the same symbol (i.e. they'll compute the exact same target address), and will write the same value into the GOT slot.
While technically this is a data race, practically it doesn't lead to any observable problems.

Why is notify required inside a critical section?

I'm reading this book here (official link, it's free) to understand threads and parallel programming.
Here's the question.
Why does the book say that pthread_cond_signal must be done with a lock held to prevent data race? I wasn't sure, so I referred to this question (and this question too), which basically said "no, it's not required". Why would a race condition occur?
What and where is the race condition being described?
The code and passage in question is as follows.
...
The code to wake a thread, which would run in some other thread, looks like this:
pthread_mutex_lock(&lock);
ready = 1;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);
A few things to note about this code sequence. First, when signaling (as well as when modifying the global variable ready), we always make sure to have the lock held. This ensures that we don’t accidentally introduce a race condition into our code.
...
(please refer to the free, official pdf to get context.)
I couldn't comment with a small question in the link-2, so here is a full question.
Edit 1: I understand the lock is to control access to the ready variable. I am wondering why there's a race condition associated with the signaling. Specifically,
First, when signaling [...] we always make sure to have the lock held. This ensures that we don’t accidentally introduce a race condition into our code
Edit 2: I've seen resources and comments (from links commented below and during my own research), sometimes within the same page that say it doesn't matter or you must put it within a lock for Predictable BehaviorTM (would be nice if this can be touched upon too, if the behavior can be other than spurious wakeups). What must I follow?
Edit 3: I'm looking for more of a 'theoretical' answer, not implementation specific so that I can understand the core idea. I understand answers to these can be platform specific, but an answer that focuses on the core ideas of lock, mutex, condition variable as all implementations must follow these semantics, perhaps adding their own little quirks. Example, wait() can wake up spuriously, and given bad timing of signaling, can happen on 'pure' implementations too. Mentioning these would help.
My apologies for so many edits, but my dearth of in-depth knowledge in this field is confusing the heck outta me.
Any insight would be really helpful, thanks. Also, please feel free to point me to books where I can read these concepts in detail, and where I can learn C++ with these concepts too. Thanks.
Why does the book say that pthread_cond_signal must be done with a lock held to prevent data race? I wasn't sure, so I referred to this
question (and this question too), which basically said "no, it's not
required". Why would a race condition occur?
The book not presenting a complete example, my best guess as to the intended meaning is that there can be a data race with the CV itself if it is signaled without the associated mutex being held. That may be the case for some CV implementations, but the book is talking specifically about pthreads, and pthreads CVs are not subject to such a limitation. Neither is C++ std::condition_variable, which is what the two other SO questions you referred to are talking about. So in that sense, the book is just wrong.
It is true that one can compose examples of poor CV use, in conjunction with which signaling under protection of the associated mutex largely protects against data races, but signaling without such protection is susceptible to data races. But In such a case, the fault is not with the signaling itself, but with the waiting, and if that's what the book means then it is deceptively worded. And probably still wrong.
What and where is the race condition being described?
One can only guess what the author had in mind.
For the record, the proper usage of condition variables involves firstly determining what condition one wants to ensure holds before execution proceeds. That condition will necessarily involve shared variables, else there is no reason to expect that anything another thread does could change whether the condition is satisfied. That being the case, all access to the shared variables involved needs to be protected by a mutex if more than one thread is alive.
That mutex should then, secondly, also be the one associated with the CV, and threads must wait on the CV only while the mutex is held. This is a requirement of every CV implementation I know, and it protects against signals being missed and possible deadlock resulting from that. Consider this faulty, and somewhat contrived, example:
// BAD
int temp;
result = pthread_mutex_lock(m);
// handle failure results ...
temp = shared;
result = pthread_mutex_unlock(m);
// handle failure results ...
if (temp == 0) {
result = pthread_cond_wait(cv, m);
// handle failure results ...
}
// do something ...
Suppose that it was allowed to wait on the CV without holding the mutex, as that code does. That code supposes that at some point in the future, some other thread (T2) will update shared (under protection of the mutex) and then signal the CV to tell the waiting one (T1) that it can proceed. But what if T2 does that between when T1 unlocks the mutex and when it begins its wait? It doesn't matter whether T2 signals the CV under protection of the mutex or not -- T1 will begin a wait for a signal that has already been delivered. And CV signals do not queue.
So suppose that T1 only waits under protection of the mutex, as is in fact required. That's not enough. Consider this:
// ALSO BAD
result = pthread_mutex_lock(m);
// handle failure results ...
if (shared == 0) {
result = pthread_cond_wait(cv, m);
// handle failure results ...
}
result = pthread_mutex_unlock(m);
// handle failure results ...
// do something ...
This is still wrong, because it does not reliably prevent T1 from proceeding past the wait when the condition of interest is unsatisfied. Such a scenario can arise from
the signal being legitimately sent and received even though the particular condition of interest to T1 is not satisfied
the signal being legitimately sent and received, and the condition being satisfied when the signal is sent, but T2 or another thread modifying the shared variable again before T1 returns from its wait.
spurious return from the wait, which is very rare, but does occasionally happen in many real-world implementations.
None of that depends on T2 sending the signal without mutex protection.
The correct way to wait on a condition variable is to check the condition of interest before waiting, and afterward to loop back and check again before proceeding:
// OK
result = pthread_mutex_lock(m);
// handle failure results ...
while (shared == 0) { // <-- 'while', not 'if'
result = pthread_cond_wait(cv, m);
// handle failure results ...
}
// typically, shared = 0 at this point
result = pthread_mutex_unlock(m);
// handle failure results ...
// do something ...
It may sometimes be the case that thread T1 executing that code will return from its wait when the condition is not satisfied, but if ever it does then it will simply return to waiting instead of proceeding when it shouldn't. If other threads signal only under protection of the mutex then that should be rare, but still possible. If other threads signal without mutex protection then T1 may wake more often than strictly needed, but there is no data race involved, and no inherent risk of misbehavior.
Why does the book say that pthread_cond_signal must be done with a lock held to prevent data race? I wasn't sure, so I referred to this question (and this question too), which basically said "no, it's not required". Why would a race condition occur?
Yes, condition variable notification should generally be performed with the corresponding mutex locked. The reason is not so much to avoid a race condition but to avoid a missed or superfluous notification.
Consider the following piece of code:
std::queue< int > events;
std::mutex mutex;
std::condition_variable cond;
// Thread 1
void consume_events()
{
std::unique_lock< std::mutex > lock(mutex); // #1
while (true)
{
if (events.empty()) // #2
{
cond.wait(lock); // #3
continue;
}
// Process an event
events.pop();
}
}
// Thread 2
void produce_event(int event)
{
{
std::unique_lock< std::mutex > lock(mutex); // #4
events.push(event); // #5
} // #6
cond.notify_one(); // #7
}
This is a classical example of one producer/one consumer queue of data.
In the line #1 the consumer (Thread 1) locks the mutex. Then, in line #2, it tests if there are any events in the queue and, if there are none, in line #3 unlocks mutex and blocks. When the notification on the condition variable happens, the thread unblocks, immediately locks mutex and continues execution past line #3 (which is to go to line #2 again).
In the line #4 the producer (Thread 2) locks the mutex and in line #5 it enqueues a new event. Because the mutex is locked, event queue modification is safe (line #5 cannot be executed concurrently with line #2), so there is no data race. Then, in line #6, the mutex is unlocked and in line #7 the condition variable is notified.
It is possible that the following happens:
Thread 2 acquires the mutex in line #4.
Thread 1 attempts to acquire the mutex in line #1 or #3 (upon being unblocked by a previous notification). Since the mutex is locked by Thread 2, Thread 1 blocks.
Thread 2 enqueues the event in line #5 and unlocks the mutex in line #6.
Thread 1 unblocks and acquires the mutex. In line #2 it sees that the event queue is not empty and processes the event. On the next loop iteration the queue is empty and the thread blocks in line #3.
Thread 2 notifies Thread 1 in line #7. But there are no queued events, and Thread 1 wakes up in vain.
Though in this particular example, the extra wake up is benign, depending on the loop contents, it may be detrimental. The correct code should call notify_one before unlocking the mutex.
Another example is when one thread is used to initiate some work in the other thread without an explicit queue of events:
std::mutex mutex;
std::condition_variable cond;
// Thread 1
void process_work()
{
std::unique_lock< std::mutex > lock(mutex); // #1
while (true)
{
cond.wait(lock); // #2
// Do some processing // #3
}
}
// Thread 2
void initiate_work_processing()
{
cond.notify_one(); // #4
}
In this case Thread 1 waits until it is time to perform some activity (e.g. render a frame in a video game). Thread 2 periodically initiates that activity by notifying Thread 1 via condition variable.
The problem is that the condition variable does not buffer notifications and acts only on the threads that are actually blocked on it at the point of notification. If there are no threads blocked then the notification does nothing. This means that the following sequence of events is possible:
Thread 1 acquires the mutex in line #1 and blocks in line #2.
Thread 2 decides it is time to perform the periodic activity and notifies Thread 1 in line #4.
Thread 1 unblocks and goes to perform the activities (e.g. render a frame).
It turns out that this frame is a lot of work, and when Thread 2 comes to notify Thread 1 about the next frame in line #2, Thread 1 is still busy with the previous one. This notification gets missed.
Thread 1 is finally done with the frame and blocks in line #2. The user observes a frame dropped.
The above wouldn't have happened if Thread 2 locked mutex before notifying Thread 1 in line #4. If Thread 1 is still busy rendering a frame, Thread 2 would block until Thread 1 is done and only then issue the notification.
However, the correct solution for the above task is to introduce a flag or some other data protected by the mutex that Thread 2 can use to signal Thread 1 that it is time to perform its activities. Aside from fixing the missed notification problem, this also takes care of spurious wakeups.
What and where is the race condition being described?
Definition of a data race depends on the memory model used in the particular environment. This means primarily your programming language memory model and may include the underlying hardware memory model (if the programming language relies on the hardware memory model, which is the case with e.g. Assembler).
C++ defines data races as follows:
When an evaluation of an expression writes to a memory location and another evaluation reads or modifies the same memory location, the expressions are said to conflict. A program that has two conflicting evaluations has a data race unless
both evaluations execute on the same thread or in the same signal handler, or
both conflicting evaluations are atomic operations (see std::atomic), or
one of the conflicting evaluations happens-before another (see std::memory_order)
If a data race occurs, the behavior of the program is undefined.
So basically, when multiple threads access the same memory location concurrently (by means other than std::atomic) and at least one of the threads is modifying the data at that location, that is a data race.

Deadlock in malloc_atfork

Our multi-threaded process is deadlocked in several threads, each showing the 3 frames below at the top of the stack. GDB shows that another thread is stuck in fork (called via popen), which is presumably why malloc_atfork, instead of malloc, is being called to allocate memory.
#0 0x00007f4f02c4aeec in __lll_lock_wait_private () from
/usr/lib64/libc.so.6
#1 0x00007f4f02bc807c in _L_lock_14817 () from /usr/lib64/libc.so.6
#2 0x00007f4f02bc51df in malloc_atfork () from /usr/lib64/libc.so.6
There is a RedHat bug (https://bugzilla.redhat.com/show_bug.cgi?id=906468) about a deadlock in glibc between fork and malloc and other reports about deadlocks in malloc_atfork.
And this link, https://sourceware.org/ml/libc-alpha/2016-02/msg00269.html, from Feb, 2016, contains a patch for removing malloc_atfork.
Does anyone know a solution to this problem?
While this is a bug in glibc, it should not be able to happen except when you are calling fork from an async-signal context, where it has interrupted code that's already holding the malloc lock and the interrupted code cannot make forward progress. Otherwise, it's another thread holding the lock, and that thread should eventually make forward progress and allow the fork to continue.
Are you possibly calling popen from a signal handler? If so, that's not valid usage, and you should expect it to be able to fail in many other ways, not just this one.

Thread not acquiring available mutex

The following is a bit of output from gdb. It seems to me that the thread running here is not actually acquiring the lock.
120 pthread_mutex_lock(&queue_p->lock);
(gdb) p queue_p->lock->__data->__owner
$45 = 0 //This makes sense; the instruction hasn't run yet.
(gdb) n
121 if (queue_p->back == NULL) {
(gdb) p queue_p->lock->__data->__owner
$46 = 0 // What?
There are no other threads running when this happens. It's just a single line of code, but it's not operating the way I expect. Which probably means my expectations are wrong. Can anyone un-befuddle me?
EDIT: Hoping this may be of use to others. What I missed is that the main thread continues execution after it creates threads. This is obvious to me now, but at the time I posted this question, I was under the influence of a lifetime spent writing single-threaded procedural code.
So my code created threads, each of which attempted to acquire this mutex. But before they could, the main thread had run its course, which included destroying the mutex. I was focussed on the code shown and didn't fully comprehend that it wasn't the only code running.

Find what interrupts sleep()

Is there any way to find what where the signal that interrupted a call to sleep() came from?
I have a ginormous amount of code, and I get this stacktrace from gdb:
#0 0x00418422 in __kernel_vsyscall ()
#1 0x001adfc6 in nanosleep () from /lib/libc.so.6
#2 0x001adde1 in sleep () from /lib/libc.so.6
#3 0x080a3cbd in MRT::setUp (this=0x9c679d8) at /code/Core/exec/mrt.cc:50
#4 0x080a1efc in main (argc=13, argv=0xbfcb6934) at /code/Core/exec/rpn.cc:211
I'm not entirely sure what all the code does, but I think this is what is going on:
Program 1 starts
Calls program 2 for shared memory allocation
Waits predetermined amount of time for allocation to complete
Program 1 continues
Find what interrupts sleep
At the time you attached GDB to the program, the sleep was in fact not interrupted by anything -- your stack trace indicates that your program is still blocked in the sleep system call.
Do you know what the sleep address is inside setup()? For example, sleep(&variable). Look for all callers of wakeup(&variable) and one of them is the sleep breaker. If there are too many, then I would add a trace array to remember the wakeups that were issued i.e. just store the PC from where wakeup was called...you can read that in the core file.
If you are sure that the sleep is interruptible && the sleep was actually interrupted, then I would do what one other poster said...catch the signal in a signal handler, capture signal info and re-arm it with the same signal.
If you are attaching to a running process, the process is interrupted by GDB itself to allow you to debug. The stack trace you observe is simply the stack of the running process at the time you attached to it. sleep() would not be an unreasonable system call for the process to be in when you are attaching to a process that appears to be idle.
If you are debugging a core file that shows the stack trace in sleep(), then when you start GDB to load a core file, it will display the top of the current stack frame of the core file. But just above that, it shows the signal that caused the core file. I wrote a test program, and this is what it showed when I loaded the core file into GDB:
Core was generated by `./a.out'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000400458 in main ()
(gdb)
A core file is just a process snapshot, it is not always due to an internal error from the code. Sometimes it is generated by a signal delivered from an external program or the shell. Sometimes it is generated by executing the command generate-core-file from within GDB. In these cases, your core file may not actually point to anything wrong, but just the state the program was in at the time the core file was created.

Resources