thread died in the pthread_mutex_lock with SIGILL signal - c

I'm developing a project on an embedded Linux OS (uClinux, MIPS CPU). It crashes occasionally.
When I examine the core dump with gdb, I can see that the process received a SIGILL signal.
Sometimes the backtrace is usable and shows that the thread died in pthread_mutex_lock, but most of the time the backtrace is not valid.
A valid backtrace:
(gdb) bt
#0 <signal handler called>
#1 0x2ab87fd8 in sigsuspend () from /lib/libc.so.0
#2 0x2aade80c in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0
#3 0x2aadc7ac in __pthread_alt_lock () from /lib/libpthread.so.0
#4 0x2aad81a4 in pthread_mutex_lock () from /lib/libpthread.so.0
#5 0x0042fde8 in aos_mutex_lock (mutex=0x66bea8) at ../../source/ssp/os/sys/linux/aos_lock_linux.c:184
An invalid backtrace:
(gdb) bt
#0 0x00690430 in ?? ()
#1 0x00690430 in ?? ()
I used pthread_attr_setstackaddr to set up a stack for each thread so that I could inspect its call frames by examining the stack. That also showed that the thread died in pthread_mutex_lock.
I use wrappers for lock and unlock like this:
struct aos_mutex_t
{
pthread_mutex_t mutex;
S8 obj_name[AOS_MAX_OBJ_NAME];
S32 lFlag;
};
S32 aos_mutex_lock(AOS_MUTEX_T *mutex)
{
S32 status;
AOS_ASSERT_RETURN(mutex, AOS_EINVAL);
mutex->lFlag++;
status = pthread_mutex_lock( &mutex->mutex );
if (status == 0)
{
return AOS_SUCC;
}
else
{
return AOS_RETURN_OS_ERROR(status);
}
}
/*
* aos_mutex_unlock()
*/
S32 aos_mutex_unlock(AOS_MUTEX_T *mutex)
{
S32 status;
AOS_ASSERT_RETURN(mutex, AOS_EINVAL);
status = pthread_mutex_unlock( &mutex->mutex );
mutex->lFlag--;
if (status == 0)
return AOS_SUCC;
else
{
return AOS_RETURN_OS_ERROR(status);
}
}
All of these mutexes are initialized before they are used.
When I ran the program under gdb, it didn't die.
I also wrote a simple test program in which 11 threads do nothing but lock and unlock in an endless loop. It didn't die either.
Are there any suggestions?

Related

What happens if we sleep in an interrupt handler on SMP

What will happen if we sleep in an interrupt handler on an SMP machine?
I wrote a sample keyboard driver and added a sleep to the interrupt handler:
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/interrupt.h>
#include <linux/delay.h>
#include <linux/sched/signal.h>
MODULE_LICENSE("GPL");
static int irq = 1, dev = 0xaa, counter = 0;
static irqreturn_t keyboard_handler(int irq, void *dev)
{
pr_info("Keyboard Counter:%d\n", counter++);
msleep(1000);
return IRQ_NONE;
}
/* registering irq */
static int test_interrupt_init(void)
{
pr_info("%s: In init\n", __func__);
return request_irq(irq, keyboard_handler, IRQF_SHARED,"my_keyboard_handler", &dev);
}
static void test_interrupt_exit(void)
{
pr_info("%s: In exit\n", __func__);
synchronize_irq(irq); /* synchronize interrupt */
free_irq(irq, &dev);
}
module_init(test_interrupt_init);
module_exit(test_interrupt_exit);
The system ran for a few minutes and then panicked. Why can't the system keep working with one processor stalled, given that timer interrupts will still fire on the other CPU and can schedule processes?
Backtrace captured using a kgdb setup:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1254]
0x0000000000000000 in irq_stack_union ()
(gdb) bt
#0 0x0000000000000000 in irq_stack_union ()
#1 0xffffffff810ad8e4 in ttwu_activate (en_flags=<optimized out>, p=<optimized out>, rq=<optimized out>)
at kernel/sched/core.c:1638
#2 ttwu_do_activate (rq=0xffff888237422c40, p=0xffffffff82213780 <init_task>, wake_flags=9,
rf=0xffffc900022b3f10) at kernel/sched/core.c:1697
#3 0xffffffff810aec00 in sched_ttwu_pending () at kernel/sched/core.c:1740
#4 0xffffffff810aedcd in scheduler_ipi () at kernel/sched/core.c:1771
#5 0xffffffff81a01aef in reschedule_interrupt () at arch/x86/entry/entry_64.S:888
#6 0xffffc900022b3f58 in ?? ()
#7 0xffffffff81a01aea in reschedule_interrupt () at arch/x86/entry/entry_64.S:888
#8 0x0000000000000002 in irq_stack_union ()
#9 0x00007fcec3421b40 in ?? ()
#10 0x0000000000000006 in irq_stack_union ()
#11 0x00007fceb00008c0 in ?? ()
#12 0x0000000000000002 in irq_stack_union ()
#13 0x00000000020bd380 in ?? ()
#14 0x0012c8d2cc413914 in ?? ()
Cannot access memory at address 0x5000
The keyboard interrupt can happen at any time - including during a kernel call. Normally, that’s OK: the interrupt happens, the driver does its thing, the interrupt handler returns, and the kernel continues.
But if you sleep() in the interrupt handler, the kernel is in an intermediate state. Other processors can’t execute kernel calls, because it’s already busy. Each will be forced to pause waiting for the kernel - which isn’t coming back. No wonder it panics!

__lll_lock_wait_private () when using malloc/free

I have a user-level thread library, and I changed a benchmark program to use mythreads instead of pthreads, but it always gets stuck somewhere in the code where there is a malloc or free call.
This is the output of gdb:
^C
Program received signal SIGINT, Interrupt.
__lll_lock_wait_private ()
at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
95 ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) where
#0 __lll_lock_wait_private ()
at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007ffff7569bb3 in _int_free (av=0x7ffff78adc00 <main_arena>,
p=0x6f1f40, have_lock=0) at malloc.c:3929
#2 0x00007ffff756d89c in __GI___libc_free (mem=<optimized out>)
at malloc.c:2950
#3 0x000000000040812d in mbuffer_free (m=m@entry=0x6a7660) at mbuffer.c:209
#4 0x00000000004038a8 in write_chunk_to_file (chunk=0x6a7610,
fd=<optimized out>) at encoder.c:279
#5 Reorder (targs=0x7fffffffab60,
targs@entry=<error reading variable: value has been optimized out>)
at encoder.c:1292
#6 0x000000000040b069 in wrapper_function (func=<optimized out>,
arg=<optimized out>) at gtthread.c:75
#7 0x00007ffff7532620 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#8 0x0000000000000000 in ?? ()
(gdb)
and here is some code
mbuffer.c
208 if(ref==0) {
209 pausee();
210 free(m->mcb->ptr);
211 resume();
212 m->mcb->ptr=NULL;
213 free(m->mcb);
214 m->mcb=NULL;
215 }
pausee and resume functions
void pausee(){
//printf("pauseeing\n");
sigemptyset(&mask);
sigaddset(&mask, SIGPROF); // block SIGPROF...
if (sigprocmask(SIG_BLOCK, &mask, &orig_mask) < 0) {
perror ("sigprocmask");
exit(1);
}
}
void resume(){
//printf("restarting\n");
sigemptyset(&mask);
sigaddset(&mask, SIGPROF); // unblock SIGPROF...
if (sigprocmask(SIG_SETMASK, &orig_mask, NULL) < 0) {
perror ("sigprocmask");
exit(1);
}
}
I'm not sure whether it's related to my problem, but I use the SIGPROF signal and a handler function to schedule the threads. I tried blocking SIGPROF before every malloc/free call, but it had no effect.
I don't have any concurrent threads; only one thread runs at a time.
Any help or ideas would be very much appreciated.
The mbuffer.c code you posted looks correct. I suggest testing mbuffer.c on its own to confirm or rule out your guess.

getaddrinfo and gethostbyname crashing when called from child thread?

We have created a multithreaded, single core application running on Ubuntu.
When we call getaddrinfo and gethostbyname from the main process, it does not crash.
However when we create a thread from the main process and the functions getaddrinfo and gethostbyname are called from the created thread, it always crashes.
Kindly help.
Please find the call stack below:
#0 0xf7e9f890 in ?? () from /lib/i386-linux-gnu/libc.so.6
#1 0xf7e9fa73 in __res_ninit () from /lib/i386-linux-gnu/libc.so.6
#2 0xf7ea0a68 in __res_maybe_init () from /lib/i386-linux-gnu/libc.so.6
#3 0xf7e663be in ?? () from /lib/i386-linux-gnu/libc.so.6
#4 0xf7e696bb in getaddrinfo () from /lib/i386-linux-gnu/libc.so.6
#5 0x080c4e35 in mn_task_entry (args=0xa6c4130 <ipc_os_input_params>) at /home/nextg/Alps_RT/mn/src/mn_main.c:699
#6 0xf7fa5d78 in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
#7 0xf7e9001e in clone () from /lib/i386-linux-gnu/libc.so.6
getaddrinfo was crashing because the child thread making the call did not have sufficient stack space.
Code using ACE C++ 6.5.1 library classes that call ACE_Thread::spawn_n with the default ACE_DEFAULT_THREAD_PRIORITY (1024*1024) will crash when calling gethostbyname/getaddrinfo inside the child thread, as reported by Syed Aslam. In my case, libxml2 schema parsing took forever, and a child thread segfaulted after calling xmlNanoHTTPConnectHost while it tried to resolve schemaLocation.
ACE_Task activate
const ACE_TCHAR *thr_name[1];
thr_name[0] = "Flerf";
// libxml2-2.9.7/nanohttp.c:1133
// gethostbyname will crash when child thread making the call
// has insufficient stack space.
size_t stack_sizes[1] = {
ACE_DEFAULT_THREAD_STACKSIZE * 100
};
const INT ret = this->activate (
THR_NEW_LWP/*Light Weight Process*/ | THR_JOINABLE,
1,
0/*force_active*/,
ACE_DEFAULT_THREAD_PRIORITY,
-1/*grp_id*/,
NULL/*task*/,
NULL/*thread_handles[]*/,
NULL/*stack[]*/,
stack_sizes/*stack_size[]*/,
NULL/*thread_ids[]*/,
thr_name
);

PTHREAD_CANCEL_ASYNCHRONOUS Cancels the whole process

In a C program, I am using PTHREAD_CANCEL_ASYNCHRONOUS to cancel a thread immediately, as soon as pthread_cancel is called from the parent thread. But it causes the whole process to crash with a segmentation fault. The child thread's job is to fetch some data from a database server. My logic is that if it doesn't get the data within 10 seconds, the parent thread should kill it.
I want to kill only the child thread, not the whole process.
struct str_thrd_data
{
SQLHANDLE hstmt;
int rc;
bool thrd_completed_flag;
};
void * str_in_thread_call(void *in_str_arg)
{
int thrd_rc;
struct str_thrd_data *str_arg;
str_arg = in_str_arg;
thrd_rc = pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, NULL);
if (thrd_rc != 0)
handle_error_en(thrd_rc, "pthread_setcancelstate");
thrd_rc = pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
if (thrd_rc != 0)
handle_error_en(thrd_rc, "pthread_setcancelstate");
thrd_rc = pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);
if (thrd_rc != 0)
handle_error_en(thrd_rc, "pthread_setcanceltype");
// Code to call SQL Dynamic Query from a Database Server. This takes time more than 10 seconds.
thrd_rc = SQLExecute(hstmt);
printf("\n*********************Normal Thread termination withing timelimit %d\n",str_arg->rc);
str_arg->thrd_completed_flag = true;
}
int main()
{
printf("\nPJH: New THread created.\n");
pthread_attr_t tattr;
pthread_t th;
size_t mysize = 1;
struct str_thrd_data atd;
atd.hstmt = hstmt;
atd.rc= rc;
atd.thrd_completed_flag = false;
thrd_rc = pthread_attr_init(&tattr);
thrd_rc = pthread_attr_setstacksize(&tattr, mysize);
thrd_rc = pthread_create(&th, &tattr, &str_in_thread_call, &atd);
if (thrd_rc != 0)
handle_error_en(thrd_rc, "pthread_create");
// While loop to count down to 10 seconds.
while(timeout !=0)
{
printf("%d Value of rc=%d\n",timeout, atd.rc);
if(atd.rc != 999) break;
timeout--;
usleep(10000);
}
rc = atd.rc;
//Condition to check if thread is completed or not yet.
if(atd.thrd_completed_flag == false)
{
//Thread not completed within time, so kill it now.
printf("PJH ------- 10 Seconds Over\n");
thrd_rc = pthread_cancel(th);
printf("PJH ------- Thread Cancelled Immediately \n");
if (thrd_rc != 0)
{
handle_error_en(thrd_rc, "pthread_cancel");
}
printf("\nPJH &&&&&&&& Thread Cancelled Manually\n");
}
thrd_rc = pthread_join(th,NULL);
// some other job .....
}
Running gdb process_name corefile shows the backtrace below, which consists mostly of SQL library functions:
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x0059fe30 in raise () from /lib/libc.so.6
#2 0x005a1741 in abort () from /lib/libc.so.6
#3 0xdef3f5d7 in ?? () from /usr/lib/libstdc++.so.5
#4 0xdef3f624 in std::terminate() () from /usr/lib/libstdc++.so.5
#5 0xdef3f44c in __gxx_personality_v0 () from /usr/lib/libstdc++.so.5
#6 0x007e1917 in ?? () from /lib/libgcc_s.so.1
#7 0x007e1c70 in _Unwind_ForcedUnwind () from /lib/libgcc_s.so.1
#8 0x007cda46 in _Unwind_ForcedUnwind () from /lib/libpthread.so.0
#9 0x007cb471 in __pthread_unwind () from /lib/libpthread.so.0
#10 0x007c347a in sigcancel_handler () from /lib/libpthread.so.0
#11 <signal handler called>
#12 0xffffe410 in __kernel_vsyscall ()
#13 0x0064decb in semop () from /lib/libc.so.6
#14 0xe0245901 in sqloSSemP () from /opt/IBM/db2/V9.1/lib32/libdb2.so.1
#15 0xe01e7f3c in sqlccipcrecv(sqlcc_comhandle*, sqlcc_cond*) () from /opt/IBM/db2/V9.1/lib32/libdb2.so.1
#16 0xe03fe135 in sqlccrecv () from /opt/IBM/db2/V9.1/lib32/libdb2.so.1
#17 0xe02a0307 in sqljcReceive(sqljCmnMgr*) () from /opt/IBM/db2/V9.1/lib32/libdb2.so.1
#18 0xe02d0ba3 in sqljrReceive(sqljrDrdaArCb*, db2UCinterface*) () from /opt/IBM/db2/V9.1/lib32/libdb2.so.1
#19 0xe02c510d in sqljrDrdaArExecute(db2UCinterface*, UCstpInfo*) () from /opt/IBM/db2/V9.1/lib32/libdb2.so.1
#20 0xe01392bc in CLI_sqlCallProcedure(CLI_STATEMENTINFO*, CLI_ERRORHEADERINFO*) () from /opt/IBM/db2/V9.1/lib32/libdb2.so.1
#21 0xe00589c7 in SQLExecute2(CLI_STATEMENTINFO*, CLI_ERRORHEADERINFO*) () from /opt/IBM/db2/V9.1/lib32/libdb2.so.1
#22 0xe0050fc9 in SQLExecute () from /opt/IBM/db2/V9.1/lib32/libdb2.so.1
#23 0x080a81f7 in apcd_in_thread_call (in_apcd_arg=0xbc8e8f34) at dcs_db2_execute.c:357
#24 0x007c4912 in start_thread () from /lib/libpthread.so.0
#25 0x0064c60e in clone () from /lib/libc.so.6
Asynchronous thread cancellation can only be safely used on threads which perform a very restricted set of operations — the official rules are long and confusing, but in effect threads subject to async cancels can only perform pure computation. They can't do I/O, they can't allocate memory, they can't take locks of any kind, and they can't call any library function that might do any of the above. There is no way it is safe to apply async cancels to a thread that talks to a database.
Deferred cancellation is less restricted, but is still extremely finicky. If your database library is not coded to cope with the possibility that the calling thread might be cancelled mid-operation — and it probably isn't — then you can't safely use deferred cancellation, either.
You will need to find some other mechanism for aborting queries which run too long.
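For reference, the mechanics of deferred cancellation look like this in a thread that does nothing but wait. It is only a sketch of the mechanism under that assumption, with hypothetical names (worker, mark_cleaned, cancel_demo); it does not make it safe to cancel a thread that is inside a database library call.

```c
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

/* Cleanup handler: runs while the cancelled thread unwinds. */
static void mark_cleaned(void *arg)
{
    *(int *)arg = 1;
}

/* Worker with the default (deferred) cancel type: a cancel request
 * takes effect only at a cancellation point such as sleep(). */
static void *worker(void *arg)
{
    pthread_cleanup_push(mark_cleaned, arg);
    for (;;)
        sleep(1);               /* cancellation point */
    pthread_cleanup_pop(0);     /* never reached; pairs with the push */
    return NULL;
}

/* Returns 0 if the worker was cancelled and its cleanup handler ran,
 * 1 if not, and -1 if a pthread call failed. */
int cancel_demo(void)
{
    pthread_t th;
    void *result = NULL;
    int cleaned = 0;

    if (pthread_create(&th, NULL, worker, &cleaned) != 0)
        return -1;
    if (pthread_cancel(th) != 0)
        return -1;
    if (pthread_join(th, &result) != 0)
        return -1;

    return (result == PTHREAD_CANCELED && cleaned == 1) ? 0 : 1;
}
```

The cleanup handler is the only place the cancelled thread gets to release anything it holds, which is why library code that was not written with such handlers cannot be cancelled safely.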
EDIT: Since this is DB2 and the confusingly-named "CLI" API, try using SQLSetStmtAttr to set the SQL_ATTR_QUERY_TIMEOUT attribute on the prepared statement. This is the full list of parameters that can be set this way, and here is some more discussion of query timeouts.
SON OF EDIT: According to a friend who has done a lot more database work than me, it is quite likely that there is a server-side mechanism for cancelling slow queries regardless of their source. If this exists in DB2 it may be more convenient than manually setting timeouts on all your queries client-side, especially as it may be able to log slow queries so you know which ones they are and can optimize them.
Since the database client code is probably not written in such a way that it can deal with cancellation (most library code isn't), I don't think this approach will work. See Zack's answer for details.
If you need to be able to cancel database connections, you will probably have to proxy the connection and kill the proxy. Basically, what you would do is create a second thread that listens on a port and forwards the connection to the database server, and direct your database client to connect to this port on localhost instead of the real database server/port. The proxy thread could then be cancellable (with normal deferred cancellation, not asynchronous), with a cancellation cleanup handler to shutdown the sockets. Losing connection to the database server via a closed socket (rather than just a non-responsive socket) should cause the database client library code to return with an error, and you can then have its thread exit too.
Keep in mind when setting up such a proxy that you will need to make sure you don't introduce security issues with access to the database.
Here is a sketch of the code you could use for a proxy, without any error checking logic and without anything to account for unintended clients connecting:
int s, c;
unsigned short port;
struct addrinfo *ai;
struct sockaddr_in sa;
char portstr[8];
/* Service "0" lets the kernel pick a free port; node and service cannot both be null. */
getaddrinfo(0, "0", &(struct addrinfo){ .ai_flags = AI_PASSIVE, .ai_family = AF_INET, .ai_socktype = SOCK_STREAM }, &ai);
s = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
bind(s, ai->ai_addr, ai->ai_addrlen);
freeaddrinfo(ai);
getsockname(s, (void *)&sa, &(socklen_t){sizeof sa});
port = ntohs(sa.sin_port);
/* Here, do something to pass the port (assigned by kernel) back to the caller. */
listen(s, 1);
c = accept(s, (void *)&sa, &(socklen_t){sizeof sa});
close(s);
/* "dbserver" and "dbport" are placeholders for the real database host and service. */
getaddrinfo("dbserver", "dbport", 0, &ai);
s = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
connect(s, ai->ai_addr, ai->ai_addrlen);
freeaddrinfo(ai);
At this point, you have two sockets, s connected to the database server, and c connected to the database client in another thread of your program. Whatever you read from one should be written to the other; use poll to detect which one is ready for reading or writing.
During the above setup code, cancellation should be blocked except around the accept and connect calls, and at those points, you need appropriate cleanup handlers to close your sockets and call freeaddrinfo if cancellation happens. It might make sense to copy the data you're using from getaddrinfo to local variables so you can freeaddrinfo before the blocking calls and not have to worry about doing it from a cancellation cleanup handler.

getting and settings CPU registers of multiple threads using ptrace

I am interested in running a multithreaded application under the supervision of another monitoring process. The monitoring process should be able to get and set the CPU registers of all the threads in the monitored application. I know how to do this for a single-threaded application, but I'm interested in knowing how to extend it to multithreaded applications.
You can use the thread ID instead of the pid in ptrace and it should work fine. However, you have to manage the threads yourself.
Using a pthread_t thread ID instead of the pid in ptrace is not a solution,
because on 64-bit Linux pthread_t is an unsigned long, while pid_t is an int.
I wondered about this issue, too.
I have another method of getting per-thread register info: using gdb.
This is my code:
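#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
void *ThrFunc(void *para)
{
printf("hello world.\n");
sleep(-1); // suspend the thread (the argument wraps to UINT_MAX seconds).
return NULL;
}
int main()
{
pthread_t ptid;
int ret = pthread_create(&ptid, NULL, ThrFunc, NULL);
if(ret != 0)
{
exit(ret); // pthread_create returns the error number; it does not set errno.
}
pthread_join(ptid, NULL);// suspend the main thread.
return 0;
}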
The following is the gdb session:
(gdb) info thread
2 Thread 0x7ffff7fe9700 (LWP 4533) 0x00000033d98ab91d in nanosleep () from /lib64/libc.so.6
* 1 Thread 0x7ffff7feb720 (LWP 4530) 0x00000033d9c080ad in pthread_join () from /lib64/libpthread.so.0
(gdb) info reg
rax 0xfffffffffffffe00 -512
...
rip 0x33d9c080ad 0x33d9c080ad <pthread_join+269>
eflags 0x246 [ PF ZF IF ]
...
(gdb) thread 2
[Switching to thread 2 (Thread 0x7ffff7fe9700 (LWP 4533))]#0 0x00000033d98ab91d in nanosleep () from /lib64/libc.so.6
(gdb) info thread
* 2 Thread 0x7ffff7fe9700 (LWP 4533) 0x00000033d98ab91d in nanosleep () from /lib64/libc.so.6
1 Thread 0x7ffff7feb720 (LWP 4530) 0x00000033d9c080ad in pthread_join () from /lib64/libpthread.so.0
(gdb) info reg
rax 0xfffffffffffffdfc -516
...
rip 0x33d98ab91d 0x33d98ab91d <nanosleep+45>
eflags 0x293 [ CF AF SF IF ]
...
I hope this helps.
By the way, I would also like to know how to use ptrace() to get a thread's register details.
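On Linux, the IDs that ptrace accepts are kernel thread IDs (the LWP numbers gdb prints), not pthread_t values, and they are listed under /proc/<pid>/task. A rough sketch of one possible approach follows, assuming Linux with glibc on an architecture where <sys/user.h> provides struct user_regs_struct (for example x86-64); list_tids and read_thread_regs are hypothetical names.

```c
#include <dirent.h>
#include <elf.h>            /* NT_PRSTATUS */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>        /* struct iovec */
#include <sys/user.h>       /* struct user_regs_struct */
#include <sys/wait.h>

/* Collect the kernel thread IDs of process pid from /proc/<pid>/task.
 * Returns how many were stored in tids, or -1 if the directory is unreadable. */
int list_tids(pid_t pid, pid_t *tids, int max)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%d/task", (int)pid);

    DIR *dir = opendir(path);
    if (!dir)
        return -1;

    int n = 0;
    struct dirent *ent;
    while (n < max && (ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;                   /* skip "." and ".." */
        tids[n++] = (pid_t)strtol(ent->d_name, NULL, 10);
    }
    closedir(dir);
    return n;
}

/* Attach to one thread, read its general-purpose registers, then detach.
 * Returns 0 on success, -1 on failure. */
int read_thread_regs(pid_t tid, struct user_regs_struct *regs)
{
    int status;
    struct iovec iov = { .iov_base = regs, .iov_len = sizeof *regs };

    if (ptrace(PTRACE_ATTACH, tid, NULL, NULL) == -1)
        return -1;

    /* __WALL lets waitpid() report threads other than the group leader. */
    if (waitpid(tid, &status, __WALL) == -1 || !WIFSTOPPED(status)) {
        ptrace(PTRACE_DETACH, tid, NULL, NULL);
        return -1;
    }

    long rc = ptrace(PTRACE_GETREGSET, tid,
                     (void *)(uintptr_t)NT_PRSTATUS, &iov);
    ptrace(PTRACE_DETACH, tid, NULL, NULL);
    return rc == -1 ? -1 : 0;
}
```

The monitor would call list_tids on the target pid and then read_thread_regs on each ID; writing registers back works the same way with PTRACE_SETREGSET. The monitor has to be a separate process, since a tracer cannot attach to threads in its own thread group.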
