I have a problem that realloc() deadlocks sometime after clone() syscall.
My code is:
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/types.h>
#define CHILD_STACK_SIZE 4096*4
#define gettid() syscall(SYS_gettid)
#define log(str) fprintf(stderr, "[pid:%d tid:%d] "str, getpid(),gettid())
int clone_func(void *arg){
int *ptr=(int*)malloc(10);
int i;
for (i=1; i<200000; i++)
ptr = realloc(ptr, sizeof(int)*i);
free(ptr);
return 0;
}
int main(){
int flags = 0;
flags = CLONE_VM;
log("Program started.\n");
int *ptr=NULL;
ptr = malloc(16);
void *child_stack_start = malloc(CHILD_STACK_SIZE);
int ret = clone(clone_func, child_stack_start +CHILD_STACK_SIZE, flags, NULL, NULL, NULL, NULL);
int i;
for (i=1; i<200000; i++)
ptr = realloc(ptr, sizeof(int)*i);
free(ptr);
return 0;
}
the callstack in gdb is:
[pid:13268 tid:13268] Program started.
^Z[New LWP 13269]
Program received signal SIGTSTP, Stopped (user).
0x000000000040ba0e in __lll_lock_wait_private ()
(gdb) bt
#0 0x000000000040ba0e in __lll_lock_wait_private ()
#1 0x0000000000408630 in _L_lock_11249 ()
#2 0x000000000040797f in realloc ()
#3 0x0000000000400515 in main () at test-realloc.c:36
(gdb) i thr
2 LWP 13269 0x000000000040ba0e in __lll_lock_wait_private ()
* 1 LWP 13268 0x000000000040ba0e in __lll_lock_wait_private ()
(gdb) thr 2
[Switching to thread 2 (LWP 13269)]#0 0x000000000040ba0e in __lll_lock_wait_private ()
(gdb) bt
#0 0x000000000040ba0e in __lll_lock_wait_private ()
#1 0x0000000000408630 in _L_lock_11249 ()
#2 0x000000000040797f in realloc ()
#3 0x0000000000400413 in clone_func (arg=0x7fffffffe53c) at test-realloc.c:20
#4 0x000000000040b889 in clone ()
#5 0x0000000000000000 in ?? ()
My OS is debian linux-2.6.32-5-amd64, with GNU C Library (Debian EGLIBC 2.11.3-4) stable release version 2.11.3. I deeply suspect that eglibc is the criminal on this bug.
On clone() syscall, is it not enough before using realloc()?
You cannot use clone with CLONE_VM yourself -- or if you do, you have to at least make sure you restrict yourself from invoking any function from the standard library after calling clone in either the parent or the child. In order for multiple threads or processes to share the same memory, the implementations of any functions which access shared resources (like the heap) need to
be aware of the fact that multiple flows of control are potentially accessing it so they can arrange to perform the appropriate synchronization, and
be able to obtain information about their own identities via the thread pointer, usually stored in a special machine register. This is completely implementation-internal, and thus you cannot arrange for a new "thread" which you create yourself via clone to have a properly setup thread pointer.
The proper solution is to use pthread_create, not clone.
You cannot do this:
for (i=0; i<200000; i++)
ptr = realloc(ptr, sizeof(int)*i);
free(ptr);
The first time through the loop, i is zero. realloc( ptr, 0 ) is equivalent to free( ptr ), and you cannot free twice.
I add a flag, CLONE_SETTLS, in clone() syscall. Then the deadlock is gone.
So I think eglibc's realloc() used some TLS data. When new thread create without a new TLS, some locks (in TLS) shared between this thread and his father, and realloc() using those locks stucked. So, if somebody want to use clone() directly, the best way is alloc a new TLS to new thread.
code snippet likes this:
flags = CLONE_VM | CLONE_SETTLS;
struct user_desc* p_tls_desc = malloc(sizeof(struct user_desc));
clone(clone_func, child_stack_start +CHILD_STACK_SIZE, flags, NULL, NULL, p_tls_desc, NULL);
Related
Im running into a pretty annoying problem :
I have a program which creates one thread at the start, this thread will launch other stuff during its execution (fork() immediatly followed by execve()).
Here is the bt of both threads at the point where my program reached (I think) the deadlock :
Thread 2 (LWP 8839):
#0 0x00007ffff6cdf736 in __libc_fork () at ../sysdeps/nptl/fork.c:125
#1 0x00007ffff6c8f8c0 in _IO_new_proc_open (fp=fp#entry=0x7ffff00031d0, command=command#entry=0x7ffff6c26e20 "ps -u brejon | grep \"cvc\"
#2 0x00007ffff6c8fbcc in _IO_new_popen (command=0x7ffff6c26e20 "ps -u user | grep \"cvc\" | wc -l", mode=0x42c7fd "r") at iopopen.c:296
#3-4 ...
#5 0x00007ffff74d9434 in start_thread (arg=0x7ffff6c27700) at pthread_create.c:333
#6 0x00007ffff6d0fcfd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 1 (LWP 8835):
#0 __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007ffff6ca0ad9 in malloc_atfork (sz=140737337120848, caller=) at arena.c:179
#2 0x00007ffff6c8d875 in __GI__IO_file_doallocate (fp=0x17a72230) at filedoalloc.c:127
#3 0x00007ffff6c9a964 in __GI__IO_doallocbuf (fp=fp#entry=0x17a72230) at genops.c:398
#4 0x00007ffff6c99de8 in _IO_new_file_overflow (f=0x17a72230, ch=-1) at fileops.c:820
#5 0x00007ffff6c98f8a in _IO_new_file_xsputn (f=0x17a72230, data=0x17a16420, n=682) at fileops.c:1331
#6 0x00007ffff6c6fcb2 in _IO_vfprintf_internal (s=0x17a72230, format=, ap=ap#entry=0x7fffffffcf18) at vfprintf.c:1632
#7 0x00007ffff6c76a97 in __fprintf (stream=, format=) at fprintf.c:32
#8-11 ...
#12 0x000000000042706e in main (argc=3, argv=0x7fffffffd698, envp=0x7fffffffd6b8) at mains/ignore/.c:146
Both stays stuck here forever with both glibc-2.17 and glibc-2.23
Any help is welcomed :'D
EDIT :
Here is a minimal example :
1 #include <stdlib.h>
2 #include <pthread.h>
3 #include <unistd.h>
4
5 void * thread_handler(void * args)
6 {
7 char * argv[] = { "/usr/bin/ls" };
8 char * newargv[] = { "/usr/bin/ls", NULL };
9 char * newenviron[] = { NULL };
10 while (1) {
11 if (vfork() == 0) {
12 execve(argv[0], newargv, newenviron);
13 }
14 }
15
16 return 0;
17 }
18
19 int main(void)
20 {
21 pthread_t thread;
22 pthread_create(&thread, NULL, thread_handler, NULL);
23
24 int * dummy_alloc;
25
26 while (1) {
27 dummy_alloc = malloc(sizeof(int));
28 free(dummy_alloc);
29 }
30
31 return 0;
32 }
Environment :
user:deadlock$ cat /etc/redhat-release
Scientific Linux release 7.3 (Nitrogen)
user:deadlock$ ldd --version
ldd (GNU libc) 2.17
EDIT 2 :
The rpm package version is : glibc-2.17-196.el7.x86_64
I can not get the line numbers using the rpm package. Here is the BT using the glibc given with the distribution : solved with debuginfo.
(gdb) thread apply all bt
Thread 2 (Thread 0x7ffff77fb700 (LWP 59753)):
#0 vfork () at ../sysdeps/unix/sysv/linux/x86_64/vfork.S:44
#1 0x000000000040074e in thread_handler (args=0x0) at deadlock.c:11
#2 0x00007ffff7bc6e25 in start_thread (arg=0x7ffff77fb700) at pthread_create.c:308
#3 0x00007ffff78f434d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Thread 1 (Thread 0x7ffff7fba740 (LWP 59746)):
#0 0x00007ffff7878226 in _int_free (av=0x7ffff7bb8760 , p=0x602240, have_lock=0) at malloc.c:3927
#1 0x00000000004007aa in main () at deadlock.c:28
This is a custom-compiled glibc. It is possible that something went wrong with the installation. Note that Red Hat Enterprise Linux 7.4 backports a fix for a deadlock between malloc and fork, and you are missing that because you compiled your own glibc. The fix for the upstream bug went only into upstream version 2.24, so if you are basing your custom build on that, you may not have this fix. (Although the backtrace would look differently for that one.)
I think we fixed at least another post-2.17 libio-related deadlock bug.
EDIT I have been dealing with fork-related deadlocks for too long. There are multiple issues with the reproducer as posted:
There is no waitpid call for the PID. As a result, the process table will be quickly filled with zombies.
No error checking for execve. If the /usr/bin/ls pathname does not exist (for example, on a system which did not undergo UsrMove), execve will return, and the next iteration of the loop will launch another vfork call.
I fixed both issues (because debugging what is approaching a fork bomb is not fun at all), but I can't reproduce a hang with glibc-2.17-196.el7.x86_64.
There is a crash inside 'pthread_join' when main function calls it and before that the child thread already terminated. This is the backtrace from gdb:
Core was generated by `./bin/test'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0xb76fb530 in __call_tls_dtors#plt () from /lib/libpthread.so.0
(gdb) bt
#0 0xb76fb530 in __call_tls_dtors#plt () from /lib/libpthread.so.0
#1 0xb76fdd5a in start_thread (arg=0xb40fab40) at pthread_create.c:319
#2 0xb762f74e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:129
The pthread activation function receives NULL argument and return NULL argument. I am clueless why I am seeing this crash consistently.
Could somebody help what could be wrong in child thread activation function? I am using Fedora 20 and gcc version 4.8.3 20140911 (Red Hat 4.8.3-7) (GCC)
Skeleton of child activation function is below
void* testControl(void* param)
{
...................
return NULL;
}
As the my code is huge I am giving here the code snippet which explains how I am creating child threads, their exits and termination.
unsigned long int rcThId1;
unsigned long int rcThId2;
unsigned long int rcThId3;
unsigned long int rcThId4;
unsigned long int rcThId5;
unsigned long int rcThId6;
void* rcControl1(void* arg)
{
bool th_loop = true;
while(th_loop)
{
/*Listen and receive the message on message queue*/
...........
..........
switch(message_type)
{
............
............
case EXIT:
th_loop = false;
break;
default:
break;
}
}
return NULL;
}
/*Activation functions for rcControl2 rcControl3 rcControl4
rcControl5 rcControl6 similar to the defination of rcControl1*/
int main(void)
{
pthread_create(&rcThId1,NULL,rcControl1,NULL);
pthread_create(&rcThId2,NULL,rcControl2,NULL);
pthread_create(&rcThId3,NULL,rcControl3,NULL);
pthread_create(&rcThId4,NULL,rcControl4,NULL);
pthread_create(&rcThId5,NULL,rcControl5,NULL);
pthread_create(&rcThId6,NULL,rcControl6,NULL);
..............
..............
/*Post EXIT event to Thread1*/
/*Post EXIT event to Thread2*/
/*Post EXIT event to Thread3*/
/*Post EXIT event to Thread4*/
/*Post EXIT event to Thread5*/
/*Post EXIT event to Thread6*/
/*By now all threads would have already exited */
pthread_join(rcThId1, NULL);/*Inside this function crash is happening*/
pthread_join(rcThId2, NULL);
pthread_join(rcThId3, NULL);
pthread_join(rcThId4, NULL);
pthread_join(rcThId5, NULL);
pthread_join(rcThId6, NULL);
return 0;
}
Inside pthread_join(rcThId1, NULL); call the crash is happened.
Thanks
In the (pseudo)code you posted, the main issue is the type of thread identifiers: they all should be of type pthread_t. But you have unsigned long ints. The crash is most likely because pthread_join() attempts to read rcThId1 et al as if they are pthread_t which they are not.
Change the type of rcThId1 ..rcThId6 to pthread_t.
You should be getting some warnings. If not compiler with:
gcc -Wall -Wextra -pedantic-errors
Aside:
You probably need to have th thread ids as global. Move them inside main() unless you have a good reason not to.
We have created a multithreaded, single core application running on Ubuntu.
When we call getaddrinfo and gethostbyname from the main process, it does not crash.
However when we create a thread from the main process and the functions getaddrinfo and gethostbyname are called from the created thread, it always crashes.
Kindly help.
Please find the call stack below:
#0 0xf7e9f890 in ?? () from /lib/i386-linux-gnu/libc.so.6
#1 0xf7e9fa73 in __res_ninit () from /lib/i386-linux-gnu/libc.so.6
#2 0xf7ea0a68 in __res_maybe_init () from /lib/i386-linux-gnu/libc.so.6
#3 0xf7e663be in ?? () from /lib/i386-linux-gnu/libc.so.6
#4 0xf7e696bb in getaddrinfo () from /lib/i386-linux-gnu/libc.so.6
#5 0x080c4e35 in mn_task_entry (args=0xa6c4130 <ipc_os_input_params>) at /home/nextg/Alps_RT/mn/src/mn_main.c:699
#6 0xf7fa5d78 in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
#7 0xf7e9001e in clone () from /lib/i386-linux-gnu/libc.so.6
The reason the getaddrinfo was crashing because, the child thread making the call did not have sufficient stack space.
Using ACE C++ version 6.5.1 libraries classes which use ACE_Thread::spawn_n with default ACE_DEFAULT_THREAD_PRIORITY (1024*1024) will crash when calling gethostbyname/getaddrinfo inside child as reported by Syed Aslam. libxml2 schema parsing takes forever, using a child thread Segment Faulted after calling xmlNanoHTTPConnectHost as it tries to resolve schemaLocation.
ACE_Task activate
const ACE_TCHAR *thr_name[1];
thr_name[0] = "Flerf";
// libxml2-2.9.7/nanohttp.c:1133
// gethostbyname will crash when child thread making the call
// has insufficient stack space.
size_t stack_sizes[1] = {
ACE_DEFAULT_THREAD_STACKSIZE * 100
};
const INT ret = this->activate (
THR_NEW_LWP/*Light Weight Process*/ | THR_JOINABLE,
1,
0/*force_active*/,
ACE_DEFAULT_THREAD_PRIORITY,
-1/*grp_id*/,
NULL/*task*/,
NULL/*thread_handles[]*/,
NULL/*stack[]*/,
stack_sizes/*stack_size[]*/,
NULL/*thread_ids[]*/,
thr_name
);
I am interested in running a multithreaded application in the supervision of another monitoring process. The monitoring process should be able to get and set CPU registers of all the threads in the monitored application. I know how to do this for a single threaded application. But I'm interested in knowing how to extend this for multithreaded applications.
You can use thread id instead of pid in ptrace and it should work fine. However thread management needs to be done by you.
Use thread id instead of pid in ptrace, is not a solution.
Because in Linux-64, pthread_t--unsigned long, pid_t--unsigned int.
I wondered this issue, too.
I have another method to get thread-reg-info, using gdb.
This is my code:
void *ThrFunc(void *para)
{
printf("hello world.\n");
sleep(-1); // suspend the thread.
}
int main()
{
pthread_t ptid;
int ret = pthread_create(&ptid, NULL, ThrFunc, NULL);
if(ret != 0)
{
exit(errno);
}
pthread_join(ptid, NULL);// suspend the main thread.
return 0;
}
The following is gdb debug details:
(gdb) info thread
2 Thread 0x7ffff7fe9700 (LWP 4533) 0x00000033d98ab91d in nanosleep () from /lib64/libc.so.6
* 1 Thread 0x7ffff7feb720 (LWP 4530) 0x00000033d9c080ad in pthread_join () from /lib64/libpthread.so.0
(gdb) info reg
rax 0xfffffffffffffe00 -512
...
rip 0x33d9c080ad 0x33d9c080ad <pthread_join+269>
eflags 0x246 [ PF ZF IF ]
...
(gdb) thread 2
[Switching to thread 2 (Thread 0x7ffff7fe9700 (LWP 4533))]#0 0x00000033d98ab91d in nanosleep () from /lib64/libc.so.6
(gdb) info thread
* 2 Thread 0x7ffff7fe9700 (LWP 4533) 0x00000033d98ab91d in nanosleep () from /lib64/libc.so.6
1 Thread 0x7ffff7feb720 (LWP 4530) 0x00000033d9c080ad in pthread_join () from /lib64/libpthread.so.0
(gdb) info reg
rax 0xfffffffffffffdfc -516
...
rip 0x33d98ab91d 0x33d98ab91d <nanosleep+45>
eflags 0x293 [ CF AF SF IF ]
...
I hope this will help you.
By the way, I also want to know: How to use ptrace() to get a thread registers details?
Firstly, I use pthread library to write multithreading C programs. Threads always hung by their waited mutexes. When I use the strace utility to find a thread in the FUTEX_WAIT status, I want to know which thread holds that mutex at that time. But I don't know how I could I do it. Are there any utilities that could do that?
Someone told me the Java virtual machine supports this, so I want to know whether Linux support this feature.
You can use knowledge of the mutex internals to do this. Ordinarily this wouldn't be a very good idea, but it's fine for debugging.
Under Linux with the NPTL implementation of pthreads (which is any modern glibc), you can examine the __data.__owner member of the pthread_mutex_t structure to find out the thread that currently has it locked. This is how to do it after attaching to the process with gdb:
(gdb) thread 2
[Switching to thread 2 (Thread 0xb6d94b90 (LWP 22026))]#0 0xb771f424 in __kernel_vsyscall ()
(gdb) bt
#0 0xb771f424 in __kernel_vsyscall ()
#1 0xb76fec99 in __lll_lock_wait () from /lib/i686/cmov/libpthread.so.0
#2 0xb76fa0c4 in _L_lock_89 () from /lib/i686/cmov/libpthread.so.0
#3 0xb76f99f2 in pthread_mutex_lock () from /lib/i686/cmov/libpthread.so.0
#4 0x080484a6 in thread (x=0x0) at mutex_owner.c:8
#5 0xb76f84c0 in start_thread () from /lib/i686/cmov/libpthread.so.0
#6 0xb767784e in clone () from /lib/i686/cmov/libc.so.6
(gdb) up 4
#4 0x080484a6 in thread (x=0x0) at mutex_owner.c:8
8 pthread_mutex_lock(&mutex);
(gdb) print mutex.__data.__owner
$1 = 22025
(gdb)
(I switch to the hung thread; do a backtrace to find the pthread_mutex_lock() it's stuck on; change stack frames to find out the name of the mutex that it's trying to lock; then print the owner of that mutex). This tells me that the thread with LWP ID 22025 is the culprit.
You can then use thread find 22025 to find out the gdb thread number for that thread and switch to it.
I don't know of any such facility so I don't think you will get off that easily - and it probably wouldn't be as informative as you think in helping to debug your program. As low tech as it might seem, logging is your friend in debugging these things. Start collecting your own little logging functions. They don't have to be fancy, they just have to get the job done while debugging.
Sorry for the C++ but something like:
void logit(const bool aquired, const char* lockname, const int linenum)
{
pthread_mutex_lock(&log_mutex);
if (! aquired)
logfile << pthread_self() << " tries lock " << lockname << " at " << linenum << endl;
else
logfile << pthread_self() << " has lock " << lockname << " at " << linenum << endl;
pthread_mutex_unlock(&log_mutex);
}
void someTask()
{
logit(false, "some_mutex", __LINE__);
pthread_mutex_lock(&some_mutex);
logit(true, "some_mutex", __LINE__);
// do stuff ...
pthread_mutex_unlock(&some_mutex);
}
Logging isn't a perfect solution but nothing is. It usually gets you what you need to know.
Normally libc/platforms calls are abstracted by OS abstraction layer. The mutex dead locks can be tracked using a owner variable and pthread_mutex_timedlock. Whenever the thread locks it should update the variable with own tid(gettid() and can also have another variable for pthread id storage) . So when the other threads blocks and timed out on pthread_mutex_timedlock it can print the value of owner tid and pthread_id. this way you can easily find out the owner thread. please find the code snippet below, note that all the error conditions are not handled
pid_t ownerTid;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
class TimedMutex {
public:
TimedMutex()
{
struct timespec abs_time;
while(1)
{
clock_gettime(CLOCK_MONOTONIC, &abs_time);
abs_time.tv_sec += 10;
if(pthread_mutex_timedlock(&mutex,&abs_time) == ETIMEDOUT)
{
log("Lock held by thread=%d for more than 10 secs",ownerTid);
continue;
}
ownerTid = gettid();
}
}
~TimedMutex()
{
pthread_mutex_unlock(&mutex);
}
};
There are other ways to find out dead locks, maybe this link might help http://yusufonlinux.blogspot.in/2010/11/debugging-core-using-gdb.html.
Please read below link, This has a generic solution for finding the lock owner. It works even if lock in side a library and you don't have the source code.
https://en.wikibooks.org/wiki/Linux_Applications_Debugging_Techniques/Deadlocks