pthread_mutex_lock locks, but no owner is set - c

I've been working on this one for a few days -
As a background, I'm working on taking a single-threaded C program and making it multi-threaded. I have recently discovered a new deadlock case, but when I look at the mutex in gdb I see that
__lock=2 yet __owner=0
This is not a recursive mutex. Has anyone seen this? The program I'm working on is a daemon and this case only happens after executing at a high-throughput rate for over 20 minutes (approximately) and then relaxing the load. If you have any ideas I'd be grateful.
Edit - I neglected to mention that all of my other threads are idle at this time.
Cheers

This is to be expected. A normal (non-recursive, non-errorchecking) mutex has no need to store its owner, and some time can be saved skipping the step of looking up the caller's thread id. (This makes little difference on x86 but can be a huge difference on platforms like MIPS with broken ABIs, where there is no thread register and getting the thread id incurs a fault into kernelspace.)
The deadlock you're seeing is almost certainly due either to a thread trying to lock a mutex it already holds, or to an actual logic error where two or more threads are each waiting for mutexes the other holds.
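One way to check for the first case (a thread relocking a mutex it already holds) is to temporarily switch the mutex to the POSIX error-checking type, which records its owner and reports the relock instead of hanging. The following is a minimal sketch; the helper names init_errorcheck_mutex and relock_result are made up for illustration and are not part of the program in the question.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Initialise *m as an error-checking mutex. Returns 0 on success. */
static int init_errorcheck_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc;

    assert(m != NULL);
    rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Lock the same mutex twice from one thread and return the result of the
 * second attempt. With an error-checking mutex the second lock fails with
 * EDEADLK; with a normal mutex it would block forever. */
static int relock_result(void)
{
    pthread_mutex_t m;
    int first, second;

    if (init_errorcheck_mutex(&m) != 0)
        return -1;
    first = pthread_mutex_lock(&m);
    assert(first == 0);
    second = pthread_mutex_lock(&m);   /* same thread, already the owner */
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return second;
}
```

Checking the return value of every pthread_mutex_lock call while the mutex is in this mode should show whether a self-deadlock is involved. Because an error-checking mutex has to track its owner, the owner field should also be meaningful in gdb in this mode.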

As far as I can tell, this is due to a limitation of the pthread library. Whenever I have found parts of the code that use excessive locking and unlocking and heavily stressed that section of the code, I have had this kind of failure. I have solved them by re-writing these sections to minimize their locking, which is easier code to maintain (less error checking when re-acquiring potentially freed objects) and eliminates some overhead.

I just fixed the issue I was having - stack corruption caused the mutex.__data.__lock value to get set to some ridiculous number (4 billion-ish) just prior to attempting the pthread_mutex_lock call. See if you can set a breakpoint, or print debugging info on the value of __lock just prior to performing the lock operation, and I'm willing to bet it's invalid right before the deadlock occurs.

Related

Does a page fault cause a thread context switch on Linux?

If a thread suffers a major fault while trying to read from an address, and the data must be swapped in from "disk", does Linux take advantage of that to run another waiting thread, if there is one?
From what I've read online, the answer is yes. But I haven't seen anything conclusive.
That depends on the scheduler you use. In general, the answer is yes, unless the disk operation is sufficiently fast or unless the kernel has another reason not to swap in a different process.

Making process survive failure in its thread

I'm writing an app that has many independent threads. Since I'm doing quite low-level, dangerous stuff there, threads may fail (SIGSEGV, SIGBUS, SIGFPE), but they should not kill the whole process. Is there a proper way to do this?
Currently I intercept the aforementioned signals and call pthread_exit(NULL) in their signal handler. It seems to work, but since pthread_exit is not an async-signal-safe function I'm a bit concerned about this solution.
I know that splitting this app into multiple processes would solve the problem, but in this case it's not a feasible option.
EDIT: I'm aware of all the Bad Things™ that can happen (I'm experienced in low-level system and kernel programming) due to ignoring SIGSEGV/SIGBUS/SIGFPE, so please try to answer my particular question instead of giving me lessons about reliability.
The PROPER way to do this is to let the whole process die, and start another one. You don't explain WHY this isn't appropriate, but in essence, that's the only way that is completely safe against various nasty corner cases (which may or may not apply in your situation).
I'm not aware of any method that is 100% safe that doesn't involve letting the whole process die. (Note also that sometimes just the act of continuing from this sort of error is "undefined behaviour" - it doesn't mean that you are definitely going to fall over, just that it MAY be a problem).
It's of course possible that someone knows of some clever trick that works, but I'm pretty certain that the only 100% guaranteed method is to kill the entire process.
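For reference, the "let it die and start another one" approach can be sketched as below, assuming the dangerous work can be moved into a forked child (which the question says is not feasible in this particular case). The names run_isolated, crashing_worker and healthy_worker are hypothetical and only illustrate the idea.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef int (*worker_fn)(void);

/* Run worker() in a child process and wait for it.
 * Returns the number of the signal that killed the child,
 * 0 if the child exited on its own, or -1 if fork/waitpid failed. */
static int run_isolated(worker_fn worker)
{
    pid_t pid;
    int status;

    assert(worker != NULL);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(worker());            /* child: never returns to the caller */

    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    if (WIFSIGNALED(status))
        return WTERMSIG(status);    /* e.g. SIGSEGV, SIGBUS, SIGFPE */
    return 0;
}

/* Demo worker that dies with SIGSEGV under the default signal action. */
static int crashing_worker(void)
{
    raise(SIGSEGV);
    return 0;
}

/* Demo worker that finishes normally. */
static int healthy_worker(void)
{
    return 0;
}
```

A supervisor would call run_isolated in a loop and start a fresh worker whenever the return value is a signal number. The crash stays confined to the child's address space, so the parent's state cannot be corrupted by it.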
Low-latency code design involves a careful "be aware of the system you run on" type of coding and deployment. That means, for example, that standard IPC mechanisms (say, using SysV msgsnd/msgget to pass messages between processes, or pthread_cond_wait/pthread_cond_signal on the PThreads side) as well as ordinary locking primitives (adaptive mutexes) are to be considered rather slow ... because they involve something that takes thousands of CPU cycles ... namely, context switches.
Instead, use "hot-hot" handoff mechanisms such as the disruptor pattern - both producers as well as consumers spin in tight loops permanently polling a single or at worst a small number of atomically-updated memory locations that say where the next item-to-be-processed is found and/or to mark a processed item complete. Bind all producers / consumers to separate CPU cores so that they will never context switch.
In this type of usecase, whether you use separate threads (and get the memory sharing implicitly by virtue of all threads sharing the same address space) or separate processes (and get the memory sharing explicitly by using shared memory for the data-to-be-processed as well as the queue mgmt "metadata") makes very little difference because TLBs and data caches are "always hot" (you never context switch).
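The polling handoff described above can be sketched as a single-producer, single-consumer ring buffer using C11 atomics. This is a simplified illustration of the idea and not the actual disruptor implementation; the names spsc_ring, ring_try_push and ring_try_pop and the ring size are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8u   /* assumed capacity; must be a power of two */

struct spsc_ring {
    _Atomic size_t head;        /* next slot the consumer will read  */
    _Atomic size_t tail;        /* next slot the producer will write */
    int slots[RING_SIZE];
};

/* Producer side: returns false if the ring is full. */
static bool ring_try_push(struct spsc_ring *r, int value)
{
    size_t tail, head;

    assert(r != NULL);
    tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SIZE)
        return false;
    r->slots[tail % RING_SIZE] = value;
    /* release: the slot contents become visible before the new tail */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool ring_try_pop(struct spsc_ring *r, int *out)
{
    size_t head, tail;

    assert(r != NULL && out != NULL);
    head = atomic_load_explicit(&r->head, memory_order_relaxed);
    tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[head % RING_SIZE];
    /* release: the slot is free for reuse only after it has been read */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

In a hot-hot setup the consumer would call ring_try_pop in a tight loop on its own pinned core and the producer would do the same with ring_try_push, so neither side makes a system call or sleeps.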
If your "processors" are unstable and/or have no guaranteed completion time, you need to add a "reaper" mechanism anyway to deal with failed / timed out messages, but such garbage collection mechanisms necessarily introduce jitter (latency spikes). That's because you need a system call to determine whether a specific thread or process has exited, and system call latency is a few micros even in best case.
From my point of view, you're trying to mix oil and water here; you're required to use library code not specifically written for use in low-latency deployments / library code not under your control, combined with the requirement to do message dispatch with nanosec latencies. There is no way to make e.g. pthread_cond_signal() give you nsec latency because it must do a system call to wake the target up, and that takes longer.
If your "handler code" relies on the "rich" environment, and a huge amount of "state" is shared between these and the main program ... it sounds a bit like saying "I need to make a steam-driven airplane break the sound barrier"...

FreeRTOS - Stack corruption on STM32F4

I am currently having problems with what I think is stack corruption or some configuration error while running FreeRTOS on an STM32F407 target.
I have looked at FreeRTOS stack corruption on STM32F4 with gcc but got no help there.
The application runs two tasks and relies on one CAN interrupt. The workflow is as follows:
The two tasks, network_task and app_task, are created along with two queues, raw_msg_queue and app_msg_queue. The CAN interrupt is also set up.
The network_task has the highest priority and starts waiting on the raw_msg_queue, indefinitely.
The app_task is next and starts waiting on the app_msg_queue.
The CAN interrupt then triggers because of an external event, adding a CAN message to the raw_msg_queue.
The network_task wakes up, processes the message, adds the processed message to the app_msg_queue and then continues to wait on the raw_msg_queue.
The app_task wakes up and I get a hard fault.
The thing is that I have wrapped the calls that app_task makes to xQueueReceive in two layers for end-user convenience and portability. The full call chain for app_task is network_receive(..) -> os_queue_receive(..) -> xQueueReceive(..). This works well, but when it returns from xQueueReceive(..) it only manages to return to os_queue_receive(..) before it returns to a seemingly random memory location and I get a hard fault.
The stack sizes should be adequate and are set to 2048 for both, all large data structures are passed around as pointers.
I am running my code on two STM32F407. FreeRTOS is at version 7.4.2, the latest at the time of writing.
I am really hoping that someone can help me out here!
First, you can take a look here and try to get more info about the hard fault.
You may also want to check your interrupt priority setting, as the tricky ARM Cortex-M interrupt priority mechanism causes some trouble in FreeRTOS. Refer to here.
I know this question is rather old, but perhaps this could help other people out facing a similar issue. In FreeRTOS, you can utilize the
void vApplicationStackOverflowHook(xTaskHandle xTask, signed char *pcTaskName)
function to detect a stack overflow and grab relevant information about the offending task. It's possible that data would be corrupt due to the overflow, but you can at least address the fact that an overflow occurred (reset system, set error flag/LED, etc.)
For this specific question, I'd be curious to see the thread initialization code as well as the interrupt routine. If the problem is in fact an overflow, I think it would be fairly simple to adjust these parameters until the problem goes away. You mention 2048 bytes should be sufficient for each thread - if that's truly the case, I doubt the problem is an overflow. At that point, it's more likely you're dereferencing a dangling pointer to a stale memory address.

Overhead of Spin Loop in terms of cache coherence

Say a thread in one core is spinning on a variable which will be updated by a thread running on another core. My question is what is the overhead at cache level. Will the waiting thread cache the variable and therefore does not cause any traffic on the bus until the writing thread writes to that variable?
How can this overhead be reduced. Does x86 pause instruction help?
I believe all modern x86 CPUs use the MESI protocol. So the spinning "reader" thread will likely have a cached copy of the data in either "exclusive" or "shared" mode, generating no memory bus traffic while you spin.
It is only when the other core writes to the location that it will have to perform cross-core communication.
[update]
A "spinlock" like this is only a good idea if you will not be spinning for very long. If it may be a while before the variable gets updated, use a mutex + condition variable instead, which will put your thread to sleep so that it adds no overhead while it waits.
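The mutex + condition variable alternative can be sketched as follows. This is a minimal illustration; the ready_flag type and its helper functions are hypothetical names, not something from the question.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct ready_flag {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            ready;
};

#define READY_FLAG_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false }

/* Block (sleeping, not spinning) until another thread sets the flag. */
static void ready_flag_wait(struct ready_flag *f)
{
    assert(f != NULL);
    pthread_mutex_lock(&f->lock);
    while (!f->ready)                        /* loop guards against spurious wakeups */
        pthread_cond_wait(&f->cond, &f->lock);
    pthread_mutex_unlock(&f->lock);
}

/* Set the flag and wake one waiter. */
static void ready_flag_set(struct ready_flag *f)
{
    assert(f != NULL);
    pthread_mutex_lock(&f->lock);
    f->ready = true;
    pthread_cond_signal(&f->cond);
    pthread_mutex_unlock(&f->lock);
}

/* Thread entry point for the writer side; arg is a struct ready_flag *. */
static void *setter_thread(void *arg)
{
    ready_flag_set(arg);
    return NULL;
}
```

The waiting thread consumes no CPU while it is blocked in pthread_cond_wait, at the cost of a wake-up that goes through the kernel. That is why this is the better choice only when the wait may be long.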
(Incidentally, I suspect a lot of people -- including me -- are wondering "what are you actually trying to do?")
If you spin lock for short intervals you are usually fine. However there is a timer interrupt on Linux (and I assume similar on other OSes) so if you spin lock for 10 ms or close to it you will see a cache disturbance.
I have heard it's possible to modify the Linux kernel to prevent all interrupts on specific cores so that this disturbance goes away, but I don't know what is involved in doing this.
In the case of two threads the overhead may be negligible, but it would still be a good idea to write a simple benchmark. For instance, if you implement spinlocks, measure how much time the thread spends spinning.
This effect on the cache is called cache line bouncing.
I tested this extensively in this post. The overhead in general is incurred by the bus-locking component of the spinlock, usually the instruction "xchg reg,mem" or some variant of it. Since that particular overhead cannot be avoided you have the options of economizing on the frequency with which you invoke the spinlock and performing the absolute minimum amount of work necessary - once the lock is in place - before releasing it.
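One common way to economize on the locked instruction is a test-and-test-and-set loop: attempt the atomic exchange once, and while the lock is held spin on plain loads, which are served from the local cache as described earlier in this thread. The sketch below uses C11 atomics; the ttas_lock names are hypothetical, and the cpu_relax macro assumes a GCC- or Clang-compatible compiler for the x86 pause instruction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* On x86 emit the pause instruction while spinning; elsewhere do nothing.
 * __builtin_ia32_pause is a GCC/Clang builtin (an assumption about the toolchain). */
#if defined(__x86_64__) || defined(__i386__)
#define cpu_relax() __builtin_ia32_pause()
#else
#define cpu_relax() ((void)0)
#endif

struct ttas_lock {
    atomic_bool locked;
};

/* Single attempt: returns true if the lock was acquired. */
static bool ttas_try_acquire(struct ttas_lock *l)
{
    assert(l != NULL);
    /* The exchange is the expensive, bus-locking step (xchg on x86). */
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

static void ttas_acquire(struct ttas_lock *l)
{
    assert(l != NULL);
    while (!ttas_try_acquire(l)) {
        /* Read-only spin: no exchange is issued until the lock looks free. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            cpu_relax();
    }
}

static void ttas_release(struct ttas_lock *l)
{
    assert(l != NULL);
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

This does not remove the cost of the exchange itself. It only avoids repeating it while the lock is known to be held, which reduces cache line bouncing between the waiting cores.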

What could cause the dead loop, indicated by print "Dead loop on virtual device " in linux kernel?

The print comes when the 'current lock owner' of a kernel resource is current CPU. I don't know what could lead to this condition. Couldn't find much on the net. Anyone debugged this?
It is a diagnostic message intended to bring a possible deadlock to attention.
In this particular case there is a transmit queue that is protected by a spinlock. In addition to this lock, the transmit queue also maintains an "owner" field which contains a CPUID which is set when this spinlock is held.
As you probably know, a spinlock will always spin on a CPU if the lock requested has already been taken.
So at this location the code checks if the cpu is the same one that locked the spinlock.
If it's not the same CPU, it performs the operations that might need the lock to be taken.
On the other hand, if it's the same CPU, something is not right, i.e. we should actually be spinning waiting for the lock. Probably we got here due to an incorrect interrupt handler/bottom half.
Since this indicates a potential deadlock a diagnostic message is printed :).
Debug? You mean, need to know where in the source?
Ok, Got it.
This typically happens when you enter the same function twice, referencing the same kernel resource, in a single execution context of the Linux kernel (e.g. a single instance of a softIRQ, etc).
The way out of this is to make sure you don't re-enter the function twice in the same execution context. It's a bug in your code if this happens.
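The owner check described in the answers above can be illustrated with a single-threaded userspace analogue. This is not the kernel's code; tx_queue, queue_xmit and reenter are hypothetical names, and the context id stands in for the CPU id stored in the real owner field.

```c
#include <assert.h>
#include <stddef.h>

#define NO_OWNER (-1)

struct tx_queue {
    int lock_owner;   /* id of the context holding the lock, or NO_OWNER */
};

typedef int (*xmit_fn)(struct tx_queue *q, int ctx);

/* Transmit on q from context ctx, optionally calling inner() while the
 * lock is held. Returns 0 on success, or -1 if ctx already owns the lock,
 * which is the situation the kernel reports as a dead loop. */
static int queue_xmit(struct tx_queue *q, int ctx, xmit_fn inner)
{
    int rc;

    assert(q != NULL);
    if (q->lock_owner == ctx)
        return -1;                      /* spinning here would never end */

    /* Real code would spin until the lock is free; this single-threaded
     * sketch assumes nobody else can be holding it. */
    assert(q->lock_owner == NO_OWNER);
    q->lock_owner = ctx;
    rc = inner ? inner(q, ctx) : 0;     /* work done under the lock */
    q->lock_owner = NO_OWNER;
    return rc;
}

/* Inner step that re-enters the transmit path from the same context. */
static int reenter(struct tx_queue *q, int ctx)
{
    return queue_xmit(q, ctx, NULL);
}
```

Calling queue_xmit with reenter as the inner step reproduces the re-entry case: the nested call sees its own context id in lock_owner and bails out instead of spinning on a lock it can never obtain.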
