FreeRTOS - Stack corruption on STM32F4 - c

I am currently having problems with what I think is stack corruption of some error of configuration while running FreeRTOS on an STM32F407 target.
I have looked at FreeRTOS stack corruption on STM32F4 with gcc but got no help there.
The application runs two tasks and relies on one CAN interrupt. The workflow is as follows:
The two tasks, network_task and app_task is created along with two queues, raw_msg_queue and app_msg_queue. The CAN interrupt is also set up.
The network_task has the highest priority and starts waiting on the raw_msg_queue, indefinitely.
The app_task is next and starts waiting on the app_msg_queue.
The CAN interrupt then triggers because of an external event, adding a CAN message to the raw_msg_queue.
The network_task wakes up, process the message, adds the processed message to the app_msg_queue and then continues to wait on the raw_msg_queue.
The app_task wakes up and I get a hard fault.
The thing is that I have wrapped the calls that app_task makes to xQueueReceive in two steps because of end-user convenience and portability. The app_task total function chain is that it calls network_receive(..) -> os_queue_receive(..) -> xQueueReceive(..). This works well, but when it returns from xQueueReceive(..) it only manages to return to os_queue_receive(..) before it returns to a seemingly random memory location and i get a hard-fault.
The stack sizes should be adequate and are set to 2048 for both, all large data structures are passed around as pointers.
I am running my code on two STM32F407. FreeRTOS is at version 7.4.2, the latest at the time of writing.
I am really hoping that someone can help me out here!

First, you can take a look here and try to get more info about the hard fault.
You may also want to check your interrupt priority setting, as the tricky ARM Cortex-M interrupt priority mechanism causes some trouble in FreeRTOS. Refer to here.

I know this question is rather old, but perhaps this could help other people out facing a similar issue. In FreeRTOS, you can utilize the
void vApplicationStackOverflowHook(xTaskHandle xTask, signed char *pcTaskName)
function to detect a stack overflow and grab relevent information about the offending task. It's possible that data would be corrupt due to the overflow, but you can atleast address the fact that an overflow occured (reset system, set error flag/LED, etc.)
For this specific question, I'd be curious to see the thread initialization code as well as the interrupt routine. If the problem is in fact an overflow, I think it would be fairly simply to adjust these parameters until the problem goes away. You mention 2048 bytes should be sufficient for each thread - if that's truly the case, I doubt the problem is an overflow. At that point, it's more likely you're dereferencing a dangling pointer to a stale memory address.

Related

Linux: Emulating memory via signal handler

I basically want to emulate memory by catching SIGSEGV to specific locations. These locations will be zero-permission-mapped using mmap(). Performance is something that doesn't matter too much as it's just an experiment. I figured out how to figure out the memory location accessed in the handler, but I am stuck trying to actually figure out weather a read of a write has happened and how do simulate a successful read with fake data or how to simulate a successful write, intercepting the data written.
Can you give me any tips, or other approaches (maybe something that hasn't got anything to do with signals at all) to this problem?
I wish there was more to find about this on the wide internet, guess nobody had this kind of stupid idea before lol
Thanks
I am not sure I understand what you mean by simulate. There is much more unknowns - the width of access, side effects, etc. In any case, if you would find all the missing pieces, what you are going to do next?
If you simply simulate an access, and return from the handler, the CPU will rerun the faulty instruction, and guess what? - it will immediately segfault again.
Now opening a can of worms.
You may try to
find what the instruction pointer in ucontext points to
decode the pointed instruction, simulate it the way you want (tricky)
keeping in mind that the instruction may have side effects (like setting/clearing flags, or modifying registers)
figure out what the nex instruction would be (doubleplustricky)
doctor ucontext appropriately
and return from the handler.
I am sure there will be too many corner cases.
Now opening another can of worms
Another approach, a more debugger-like, is to implement a single-step handler. In a segfault handler
back up the protected area, and prepare it for a read
remove the protection
enable a single-step flag in ucontext
get ready to handle the single-step exception
return
and in a single-step handler
compare the protected area with a back-up. If they are different, it was a write access; do what is deem necessary
restore the protection
return

Optimization and Time Slicing Causes Multitasking Data Issues

I am using FreeRtos and have multiple tasks using the same code at the same priority level. To test my code I pass the same data into each task. When optimization is above -O0 and timeslicing is turned on, there is some sort of problem where the context is not being saved correctly.
My understanding is the each Task has its own stack, and on the context switch from one to another, the stack pointer will be updated accordingly, assuring that each Task stays independent. This isn't happening for me. When I run each task individually, I get one answer, but if I test by running all three tasks, I get one answer correctly and the others are slightly off. There is some sort of crossover of data between the tasks making them not truly independent.
Any idea where this issue could be coming from? I am not using any global variables and my code is reentrant as far as I can tell.
In case anyone runs into this I discovered the problem.
I am running FreeRtos on an Arm Cortex-A9 chip. To prevent processor register corruption a task must not use any floating point registers unless it has a floating point context. In the case of my project, the tasks were not created by default with floating point context.
I added
portTASK_USES_FLOATING_POINT()
to the beginning of my task. That corrected the error and the multitasking works now.
Note that I also had to add this to my UnitTest task that was calling the original three "broken" tasks, as posting to a Queue is error prone as well.
You can see more here: https://www.freertos.org/Using-FreeRTOS-on-Cortex-A-Embedded-Processors.html
and here: https://www.freertos.org/FreeRTOS_Support_Forum_Archive/April_2017/freertos_FreeRtos_native_Floats_and_Task_switching_03b24664j.html

Ptrace mprotect debugging trouble

I'm having trouble with an research project.
What i am trying to is to use ptrace to watch the execution of a target process.
With the help of ptrace i am injecting a mprotect syscall into the targets code segment (similar to a breakpoint) and set the stack protection to PROT_NONE.
After that i restore the original instructions and let the target continue.
When i get an invalid permisson segfault i again inject the syscall to unprotect the stack again and afterwards i execute the instruction which caused the segfault and protect the stack again.
(This does indeed work for simple programs.)
My problem now is, that with this setup the target (pretty) randomly crashes in library function calls (no matter whether i use dynamic or static linking).
By crashing i mean, it either tries to access memory which for some reason is not mapped, or it just keeps hanging in the function __lll_lock_wait_private (that was following a malloc call).
Let me emphasis again, that the crashes don't always happen and don't always happen at the same positions.
It kind of sounds like an synchronisation problem but as far as i can tell (meaning i looked into /proc/pid/tasks/) there is only one thread running.
So do you have any clue what could be the reason for this?
Please tell me your suggestions even if you are not sure, i am running out of ideas here ...
It's also possible the non-determinism is created by address space randomization.
You may want to disable that to try and make the problem more deterministic.
EDIT:
Given that turning ASR off 'fixes' the problem then maybe the under-lying problem might be:
Somewhere thinking 0 is invalid when it should be valid, or visaversa. (What I had).
Using addresses from one run against a different run?

pthread_mutex_lock locks, but no owner is set

I've been working on this one for a few days -
As a background, I'm working on taking a single-threaded C program and making it multi-threaded. I have recently discovered a new deadlock case, but when I look at the mutex in gdb I see that
__lock=2 yet __owner=0
This is not a recursive mutex. Has anyone seen this? The program I'm working on is a daemon and this case only happens after executing at a high-throughput rate for over 20 minutes (approximately) and then relaxing the load. If you have any ideas I'd be grateful.
Edit - I neglected to mention that all of my other threads are idle at this time.
Cheers
This is to be expected. A normal (non-recursive, non-errorchecking) mutex has no need to store its owner, and some time can be saved skipping the step of looking up the caller's thread id. (This makes little difference on x86 but can be a huge difference on platforms like MIPS with broken ABIs, where there is no thread register and getting the thread id incurs a fault into kernelspace.)
The deadlock you're seeing it almost certainly due either to the thread trying to lock a mutex it already holds, or an actual logic error where two or more threads are each waiting for mutexes the other holds.
As far as I can tell, this is due to a limitation of the pthread library. Whenever I have found parts of the code that use excessive locking and unlocking and heavily stressed that section of the code, I have had this kind of failure. I have solved them by re-writing these sections to minimize their locking, which is easier code to maintain (less error checking when re-acquiring potentially freed objects) and eliminates some overhead.
I just fixed the issue I was having - stack corruption caused the mutex.__data.__lock value to get set to some ridiculous number (4 billion-ish) just prior to attempting the pthread_mutex_lock call. See if you can set a breakpoint, or print debugging info on the value of __lock just prior to performing the lock operation, and I'm willing to bet it's invalid right before the deadlock occurs.

What is a privileged instruction?

I have added some code which compiles cleanly and have just received this Windows error:
---------------------------
(MonTel Administrator) 2.12.7: MtAdmin.exe - Application Error
---------------------------
The exception Privileged instruction.
(0xc0000096) occurred in the application at location 0x00486752.
I am about to go on a bug hunt, and I am expecting it to be something silly that I have done which just happens to produce this message. The code compiles cleanly with no errors or warnings. The size of the EXE file has grown to 1,454,132 bytes and includes links to ODCS.lib, but it is otherwise pure C to the Win32 API, with DEBUG on (running on a P4 on Windows 2000).
To answer the question, a privileged instruction is a processor op-code (assembler instruction) which can only be executed in "supervisor" (or Ring-0) mode.
These types of instructions tend to be used to access I/O devices and protected data structures from the windows kernel.
Regular programs execute in "user mode" (Ring-3) which disallows direct access to I/O devices, etc...
As others mentioned, the cause is probably a corrupted stack or a messed up function pointer call.
This sort of thing usually happens when using function pointers that point to invalid data.
It can also happen if you have code that trashes your return stack. It can sometimes be quite tricky to track these sort of bugs down because they usually are hard to reproduce.
A privileged instruction is an IA-32 instruction that is only allowed to be executed in Ring-0 (i.e. kernel mode). If you're hitting this in userspace, you've either got a really old EXE, or a corrupted binary.
As I suspected, it was something silly that I did. I think I solved this twice as fast because of some of the clues in comments in the messages above. Thanks to those, especially those who pointed to something early in the app overwriting the stack. I actually found several answers here more useful than the post I have marked as answering the question as they clued and queued me as to where to look, though I think it best sums up the answer.
As it turned out, I had just added a button that went over the maximum size of an array holding some toolbar button information (which was on the stack). I had forgotten that
#define MAX_NUM_TOOBAR_BUTTONS (24)
even existed!
First probability that I can think of is, you may be using a local array and it is near the top of the function declaration. Your bounds checking gone insane and overwrite the return address and it points to some instruction that only kernel is allowed to execute.
The error location 0x00486752 seems really small to me, before where executable code usually lives. I agree with Daniel, it looks like a wild pointer to me.
I saw this with Visual c++ 6.0 in the year 2000.
The debug C++ library had calls to physical I/O instructions in it, in an exception handler.
If I remember correctly, it was dumping status to an I/O port that used to be for DMA base registers, which I assume someone at Microsoft was using for a debugger card.
Look for some error condition that might be latent causing diagnostics code to run.
I was debugging, backtracked and read the dissassembly. It was an exception while processing std::string, maybe indexing off the end.
The CPU of most processors manufactured in the last 15 years have some special instructions which are very powerful. These privileged instructions are kept for operating system kernel applications and are not able to be used by user written programs.
This restricts the damage that a user-written program can inflict upon the system and cuts down the number of times that the system actually crashes.
When executing in kernel mode, the operating system has unrestricted access to both the kernel and the user program's memory.
The load instructions for the base and limit registers are privileged instructions.

Resources