I've combined my Eagle GUI library with Allegro 5 for graphics and input. When I use Allegro 5's al_register_trace_handler function to pipe the output from allegro's debugging log to my own, I get a deadlock in a thread spawned by allegro to create a Win32 window and display. It specifically hangs on a call to ALLEGRO_INFO, which is a logging macro used by allegro. The CRITICAL_SECTION used to prevent race conditions in the log shows up as held by my main thread. When I print the CRITICAL_SECTION in gdb, I get the following report:
(gdb) p *(trace_info.trace_mutex.cs)
$1 = {DebugInfo = 0xffffffff, LockCount = -2, RecursionCount = 176, OwningThread = 0x4750, LockSemaphore = 0x0, SpinCount = 33556432}
Thread 0x4750 is the main thread, as identified by gdb and info threads.
If I don't register a trace handler with allegro, everything works fine; but if I do, and I use a debugging level of 'Debug' or 'Info', it deadlocks in the mentioned log output call. I found a case where the allegro trace function wasn't releasing the CRITICAL_SECTION when a trace handler was registered, and I thought releasing the lock there would fix it, but it did nothing; the output remains the same.
Does the RecursionCount value in the critical section indicate a failure to properly unlock the log's mutex (CS), and why is the lock still held by the main thread?
I'm reaching the end of my debugging skills. I log the state of all my own threads and none of them are in contention. But the fact that main holds the CRITICAL_SECTION being used by allegro in a different thread seems to indicate I've done something wrong.
So, any help getting relevant info out of allegro and gdb would be appreciated. Like I said, it works fine if I don't register a trace handler, but if I do, it hangs on allegro code.
Advice and debugging tips welcome. Please and thank you for helping me out.
Marc
The offending LeaveCriticalSection call was missing from the path in the allegro code where a user trace handler is used. The following patch fixed the problem.
--- C:/Users/Marc/AppData/Local/Temp/TortoiseGit/debug-619c69e3.002.c Thu May 13 11:18:03 2021
+++ E:/usr/libs/Allegro52X/src/debug.c Wed May 12 11:20:57 2021
@@ -300,6 +300,7 @@
if (_al_user_trace_handler) {
_al_user_trace_handler(static_trace_buffer);
static_trace_buffer[0] = '\0';
+ _al_mutex_unlock(&trace_info.trace_mutex);
return;
}
First SO question, so here it goes.
I'm not asking for someone to review the code; I want to get to the bottom of this.
It would be helpful if someone knew what change in the kernel could be responsible for the following.
At university we were tasked with implementing extended functionality in a modeled operating system written in C by my professor, which models each core with a pthread.
Project github forked by me.
We had to implement the necessary functionality through the required syscalls (multithreading, sockets, pipes, MLFQ, etc.).
After implementing each functionality we had to confirm that it was working using the validate_api program.
Problem time:
validate_api.c contains a lot of tests to check the functionality of the OS.
BOOT_TEST: bare-boots the machine and tests something.
A simple test for creating a new thread inside a process:
BOOT_TEST(test_create_join_thread,
"Test that a process thread can be created and joined. Also, that "
"the argument of the thread is passed correctly."
)
{
int flag = 0;
int task(int argl, void* args) {
ASSERT(args == &flag);
*(int*)args = 1;
return 2;
}
Tid_t t = CreateThread(task, sizeof(flag), &flag);
/* Success in creating thread */
ASSERT(t!=NOTHREAD);
int exitval;
/* Join should succeed */
ASSERT(ThreadJoin(t, &exitval)==0);
/* Exit status should be correct */
ASSERT(exitval==2);
/* Shared variable should be updated */
ASSERT(flag==1);
/* A second Join should fail! */
ASSERT(ThreadJoin(t, NULL)==-1);
return 0;
}
As you can see, there is a nested function called task(), which is the starting point of the thread that is going to be created using the CreateThread() syscall we implemented.
The problem is that, although the thread is created correctly, when it is scheduled to run, the program exits with a segmentation fault: it cannot access the memory of the task function, and gdb doesn't even recognize it as a symbol (in the thread struct field pointing to it). The weird thing is that this happens ONLY when using a kernel version newer than 5.7. I opened an issue in the original project's repo.
Running the actual OS and its programs is fine, with no issues whatsoever; only validate_api fails, because of that nested function. If I move the task function into global scope, the test finishes successfully. The same goes for every other test that has a nested function inside.
Note: The project is finished (1 month now); I downgraded to 5.4 just to test my implementation.
Note2: I don't need help with the implementation of any functionality (the project is finished anyway); I just want to figure out why it doesn't work on kernels > 5.7.
Note3: I'm here because my prof. doesn't respond to my repeated emails regarding the issue.
I tried compiling with -fno-stack-protector and linking with -z execstack, with no luck. Also, simple nested functions like:
#include <stdio.h>

int main(void){
    void foo(void){
        puts("Hello there");
    }
    foo();
    return 0;
}

work with any kernel.
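One difference worth noting: the simple example above only calls foo() directly, while the failing tests pass task's address to CreateThread. With GCC, taking a nested function's address materialises a trampoline on the enclosing frame's stack, so that stack must be executable when the pointer is invoked. A hedged sketch of the distinction (GNU C extension, gcc only; names are made up):

```c
/* Sketch of the distinction (GNU C nested functions, gcc only): calling a
 * nested function directly needs no trampoline, but taking its address
 * makes gcc build a trampoline on the enclosing stack frame, which must
 * then be executable when the pointer is invoked. */

static int call_through_pointer(int (*fn)(int), int arg)
{
    return fn(arg);                  /* jumps through the stack trampoline */
}

int demo(void)
{
    int base = 22;
    int add_base(int x) { return x + base; }   /* nested function */
    /* A direct call like add_base(20) needs no trampoline; going through
     * the pointer does. */
    return call_through_pointer(add_base, 20); /* 42 if the stack is executable */
}
```

If the modeled OS runs its threads on stacks it allocates itself, whether those pages are executable depends on how they were mapped, which would make the behaviour sensitive to kernel-side changes.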
Machine Details:
Arch Linux - 5.10 / 5.4 LTS
GCC 10.2
Thank you
UPDATE:
The test joins the thread, so it never goes out of scope.
I'm trying to debug my application that is based on an STM32F3 uC running FreeRTOS. I have manually set the PSP to an invalid value (e.g. 0) at random places in thread context in the application expecting my memManageFault/busFault/usageFault/hardFault handlers to fire. Unfortunately none of the fault handlers are executed, but the core locks up on the first push to the invalid stack. What am I missing?
Some more details from the lockup state:
SCB->SHCSR: 0x74001 (all three faultHandlers are enabled, busFault pending, memFault active)
SCB->HFSR: 0x40000000 (fault escalated to hardFault even though all handlers are defined and enabled)
SCB->CFSR: 0x28601 (BFAR valid, precise error)
SCB->BFAR/SCB->MMFAR: 0xfffffff7 (erroneous SP after sub, I assume)
PRIMASK/FAULTMASK/BASEPRI: 0
MSP: 0x2000ffe0 (still valid, the handler should run just fine)
Any ideas are welcome.
It seems like once again the core is right and I am wrong. The mistake I made was that although I had implemented the HardFault_Handler as a naked function, all the other fault handlers were simple application failure hooks implemented in C, trying to access the stack in whatever context they interrupted. Needless to say, things went wrong quickly.
Implementing all handlers in asm solved the issue of the core locking up on a corrupted SP.
"busFault pending, memFault active" - the memFault handler has itself caused a busFault, and that kills the micro.
Exception stacking uses the same stack as the current context. By providing an invalid stack pointer, you've prevented any of the exception handlers from completing. Lockup specifically addresses this scenario.
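The fix described above can be sketched like this (a hedged outline for Cortex-M with GCC; fault_dump_c is a hypothetical C helper that must run purely off the MSP and validate the passed frame pointer before dereferencing it, since the PSP may be garbage):

```c
/* Hedged sketch: a naked HardFault handler for Cortex-M. It never pushes
 * to the faulting stack; it only determines which SP was active and hands
 * that value to a C routine running on the (still valid) MSP. */
__attribute__((naked)) void HardFault_Handler(void)
{
    __asm volatile(
        "tst lr, #4      \n"   /* EXC_RETURN bit 2: which stack was in use? */
        "ite eq          \n"
        "mrseq r0, msp   \n"   /* fault came from MSP (handler) context     */
        "mrsne r0, psp   \n"   /* fault came from PSP (thread) context      */
        "b fault_dump_c  \n"   /* hypothetical helper: inspect the frame    */
    );
}
```

The same pattern would apply to the memManage, busFault, and usageFault handlers; a plain C handler that touches the faulting stack re-faults and escalates to lockup.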
What would be the correct way to prevent a soft lockup/unresponsiveness in a long-running while loop in a C program?
(dmesg is reporting a soft lockup)
Pseudo code is like this:
while( worktodo ) {
worktodo = doWork();
}
My code is of course way more complex, and also includes a printf statement which gets executed once a second to report progress, but the problem is that the program ceases to respond to Ctrl+C at this point.
Things I've tried which do work (but I want an alternative):
- doing printf every loop iteration (don't know why, but the program becomes responsive again that way (???)) - wastes a lot of performance due to unneeded printf calls (each doWork() call does not take very long)
- using sleep/usleep/... - also seems like a waste of (processing) time to me, as the whole program will already be running several hours at full speed
What I'm thinking about is some kind of process_waiting_events() function or the like; normal signals seem to be working fine, as I can use kill from a different shell to stop the program.
Additional background info: I'm using GWAN and my code is running inside the main.c "maintenance script", which seems to be running in the main thread as far as I can tell.
Thank you very much.
P.S.: Yes I did check all other threads I found regarding soft lockups, but they all seem to ask about why soft lockups occur, while I know the why and want to have a way of preventing them.
P.P.S.: Optimizing the program (making it run shorter) is not really a solution, as I'm processing a 29GB bz2 file which extracts to about 400GB xml, at the speed of about 10-40MB per second on a single thread, so even at max speed I would be bound by I/O and still have it running for several hours.
While the posted answer using threads might possibly be an option, it would in reality just shift the problem to a different thread. My solution after all was using
sleep(0)
I also tested sched_yield / pthread_yield, both of which didn't really help. Unfortunately I've been unable to find a good resource which documents sleep(0) on Linux, but for Windows the documentation states that using a value of 0 lets the thread yield its remaining part of the current CPU slice.
It turns out that sleep(0) is most probably relying on what is called timer slack in linux - an article about this can be found here: http://lwn.net/Articles/463357/
Another possibility is using nanosleep(&(struct timespec){0}, NULL), which does not necessarily rely on timer slack: the Linux man pages for nanosleep state that if the requested interval is below clock granularity, it will be rounded up to clock granularity, which on Linux depends on CLOCK_MONOTONIC according to the man pages. Thus, a value of 0 nanoseconds is perfectly valid and should always work, as clock granularity can never be 0.
Hope this helps someone else as well ;)
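As a minimal, hedged sketch of that workaround (drain, worktodo_units, and the decrement are made-up stand-ins for the real work loop):

```c
#include <time.h>

/* Hedged sketch of the workaround above: a zero-length nanosleep per
 * iteration acts as a yield point with essentially no delay, giving the
 * kernel a chance to deliver signals between work units. */
void drain(int *worktodo_units)       /* stand-in for the real loop */
{
    while (*worktodo_units > 0) {
        --*worktodo_units;            /* stands in for doWork() */
        nanosleep(&(struct timespec){0}, NULL);   /* yield, ~no delay */
    }
}
```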
Your scenario is not really a soft lockup; it is a process that is busy doing something.
How about this pseudo code:
void workerThread()
{
while(workToDo)
{
if(threadSignalled)
break;
workToDo = DoWork()
}
}
void sighandler()
{
signal worker thread to finish
waitForWorkerThreadFinished;
}
void main()
{
InstallSignalHandler;
CreateSemaphore
StartThread;
waitForWorkerThreadFinished;
}
Clearly a timing issue. Using a signalling mechanism should remove the problem.
The use of printf solves the problem because printf accesses the console, which is an expensive and time-consuming operation; in your case that gives the worker enough time to complete its work.
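A runnable C version of the pseudo code above might look like this (the names are stand-ins; do_work replaces the asker's doWork(), and real code should prefer sigaction over signal):

```c
#include <pthread.h>
#include <signal.h>

/* Runnable sketch of the signalling approach: the worker polls a flag
 * set by the SIGINT handler, so Ctrl+C is honoured without putting a
 * sleep inside the hot loop. */

static volatile sig_atomic_t stop_requested = 0;

static void on_sigint(int sig)
{
    (void)sig;
    stop_requested = 1;               /* async-signal-safe: just set a flag */
}

static int do_work(int *remaining)    /* stands in for doWork() */
{
    return --*remaining > 0;
}

static void *worker(void *arg)
{
    int *remaining = arg;
    while (!stop_requested && do_work(remaining))
        ;                             /* full-speed loop, no sleeps */
    return NULL;
}

int run_to_completion(int units)
{
    pthread_t t;
    signal(SIGINT, on_sigint);        /* Ctrl+C now just flips the flag */
    pthread_create(&t, NULL, worker, &units);
    pthread_join(t, NULL);            /* main thread stays free to wait */
    return units;                     /* 0 when all work was done */
}
```

The key point is that the hot loop itself never blocks; interruption is handled by checking a flag, so there is no performance cost per iteration beyond one volatile read.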
I am trying to implement my own new schedule(). I want to debug my code.
Can I use printk function in sched.c?
I used printk but it doesn't work. What did I miss?
Do you know how often schedule() is called? It's probably called faster than your computer can flush the print buffer to the log. I would suggest using another method of debugging. For instance running your kernel in QEMU and using remote GDB by loading the kernel.syms file as a symbol table and setting a breakpoint. Other virtualization software offers similar features. Or do it the manual way and walk through your code. Using printk in interrupt handlers is typically a bad idea (unless you're about to panic or stall).
If the error you are seeing doesn't happen often think of using BUG() or BUG_ON(cond) instead. These do conditional error messages and shouldn't happen as often as a non-conditional printk
Editing the schedule() function itself is typically a bad idea (unless you want to support multiple run queues, etc.). It's much better and easier to instead modify a scheduler class. Look at the code of the CFS scheduler to do this. If you want to accomplish something else, I can give better advice.
It's not safe to call printk while holding the runqueue lock. A special function, printk_sched, was introduced to provide a mechanism for using printk while holding the runqueue lock (https://lkml.org/lkml/2012/3/13/13). Unfortunately it can only print one message per tick (and there cannot be more than one tick while holding the runqueue lock, because interrupts are disabled). This is because an internal buffer is used to save the message.
You can either use lttng2 for logging to user space or patch the implementation of printk_sched to use a statically allocated pool of buffers that can be used within a tick.
Try trace_printk().
printk() has too much overhead, and schedule() gets called again before previous printk() calls finish. This creates a livelock.
Here is a good article about it: https://lwn.net/Articles/365835/
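As a hypothetical illustration (the function and call site are made up; trace_printk() itself is the real ftrace API, and its output appears in /sys/kernel/debug/tracing/trace rather than on the console):

```c
#include <linux/sched.h>
#include <linux/kernel.h>

/* Hypothetical hook inside a scheduling path: trace_printk() writes to the
 * per-CPU ftrace ring buffer, so it avoids the console path that makes
 * printk() unsafe (and livelock-prone) under the runqueue lock. */
static inline void debug_context_switch(struct task_struct *prev,
                                        struct task_struct *next)
{
    trace_printk("switch %d (%s) -> %d (%s)\n",
                 prev->pid, prev->comm, next->pid, next->comm);
}
```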
It depends; basically it should work fine.
Try using dmesg in a shell to trace your printk; if the message is not there, you apparently didn't invoke it.
if (p->mm && printk_ratelimit()) {
	printk(KERN_INFO "process %d (%s) no longer affine to cpu%d\n",
	       task_pid_nr(p), p->comm, cpu);
}

return dest_cpu;
}
There are sections in sched.c where printk doesn't work, e.g.:
static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
{
	if (unlikely(!irqs_disabled())) {
		/* printk() doesn't work good under rq->lock */
		raw_spin_unlock(&this_rq->lock);
		BUG_ON(1);
	}

	return _double_lock_balance(this_rq, busiest);
}
EDIT
You may try to printk once every 1000 calls instead of every time.
I'm debugging a piece of (embedded) software. I've set a breakpoint on a function, and for some reason, once I've reached that breakpoint and continue I always come back to the function (which is an initialisation function which should only be called once). When I remove the breakpoint, and continue, GDB tells me:
Program received signal SIGTRAP, Trace/breakpoint trap.
Since I was working with breakpoints, I'm assuming I fell in a "breakpoint trap". What is a breakpoint trap?
Breakpoint trap just means the processor has hit a breakpoint. There are two possibilities for why this is happening. Most likely, your initialization code is being hit because your CPU is resetting and hitting the breakpoint again. The other possibility would be that the code where you set the breakpoint is actually run in places other than initialization. Sometimes with aggressive compiler optimization it can be hard to tell exactly which code your breakpoint maps to and which execution paths can get there.
The other possibility I can think of: your process is running more than one thread (say, x and y). Thread y hits the breakpoint, but you have attached gdb to thread x. This case also produces a Trace/breakpoint trap.
I got this problem running a Linux project in Visual Studio 2015 and debugging remotely. My solution: Project Properties -> Configuration Properties -> Debugging -> Debugging Mode, and change the value from "gdbserver" to "gdb".
If you use VBAT as a backup supply and your backup voltage drops below 1.65 V, then you get the same problem after connecting to a power supply.
In this case you have to disconnect all power supplies and reconnect with the correct voltage level. Then the problem with debugging goes away.
I was stuck with the same problem, and in my case the solution was to decrease the SWD frequency. (I've got soldered connections between the MCU and the host that are not so reliable.) I changed 4000 kHz to 100 kHz and the problem was gone.