Optimization and Time Slicing Cause Multitasking Data Issues - C

I am using FreeRTOS and have multiple tasks using the same code at the same priority level. To test my code I pass the same data into each task. When optimization is above -O0 and time slicing is turned on, there is some sort of problem where the context is not being saved correctly.
My understanding is that each task has its own stack, and on the context switch from one to another the stack pointer is updated accordingly, ensuring that each task stays independent. This isn't happening for me. When I run each task individually, I get one answer, but if I test by running all three tasks, one answer is correct and the others are slightly off. There is some sort of crossover of data between the tasks, so they are not truly independent.
Any idea where this issue could be coming from? I am not using any global variables and my code is reentrant as far as I can tell.

In case anyone runs into this, I discovered the problem.
I am running FreeRTOS on an Arm Cortex-A9 chip. To prevent processor register corruption, a task must not use any floating point registers unless it has a floating point context. In my project, the tasks were not created with a floating point context by default.
I added
portTASK_USES_FLOATING_POINT()
to the beginning of my task. That corrected the error and the multitasking works now.
Note that I also had to add this to the UnitTest task that was calling the original three "broken" tasks, since posting to a queue is affected by this as well.
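A minimal sketch of where the call goes (the task name below is a placeholder, not from the original project):

void vMyTask( void *pvParameters )
{
    /* Cortex-A port only: give this task a floating point context before it
       touches any floating point registers or passes floats around. */
    portTASK_USES_FLOATING_POINT();

    for( ;; )
    {
        /* ... floating point work, queue posts, etc. ... */
    }
}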
You can see more here: https://www.freertos.org/Using-FreeRTOS-on-Cortex-A-Embedded-Processors.html
and here: https://www.freertos.org/FreeRTOS_Support_Forum_Archive/April_2017/freertos_FreeRtos_native_Floats_and_Task_switching_03b24664j.html


How to jump out of and resume at arbitrary locations in C code without refactoring

BACKGROUND
I'm integrating MicroPython into my custom cooperative multitasking OS (no, my company won't change to preemptive).
MicroPython uses garbage collection, and this takes much more time than my allotted time slice even when there is nothing to collect, i.e. I called it twice in a row, timed it, and it still takes a lot of time.
OBVIOUS SOLUTION
Yes, I could refactor the MicroPython source, but then whenever there's a change . . .
IDEAL SOLUTION
The ideal solution would involve calling some function void pause(&func_in_call_stack) that would jump out, leaving the stack intact, all the way to the function that is at the top of the call stack, say main. And resume would . . . resume.
QUESTION
Is it possible, using C and assembly, to implement pause?
UPDATE
As I wrote this, I realize that the C-based exception handling code nlr_push()/nlr_pop() already does most of what I need.
Your question is about implementing context switching. As we've covered fairly exhaustively in comments, support for context switching is among the key characteristics of any multitasking system, and of a multitasking OS in particular. Inasmuch as you posit no OS support for context switching, you are talking about implementing multitasking for a single-tasking OS.
That you describe the OS as providing some kind of task queue ("to relinquish control, a thread must simply exit its run loop") does not change this, though to some extent we could consider it a question of semantics. I imagine that a typical task for such a system would operate by creating and executing a series of microtasks (the work of the "run loop"), providing a shared, mutable memory context to each. Such a run loop could safely exit and later be reentered, to resume generating microtasks from where it left off.
Dividing tasks into microtasks at boundaries defined by affirmative application action (i.e. your pause()) would depend on capabilities beyond those provided by ISO C. Very likely, however, it could be done with the help of some assembly, plus some kind of framework support. You need at least these things:
A mechanism for recording a task's current execution context -- stack, register contents, and maybe other details. This is inherently system-specific.
A task-associated place to store recorded execution context. There are various ways in which such a thing could be established. Promising alternatives include (i) provided by the OS; (ii) provided by some kind of userland multi-tasking system running on top of the OS; (iii) built into the task by the compiler.
A mechanism for restoring recorded execution context -- this, too, will be system-specific.
If the OS does not provide such features, then you could consider the (now removed) POSIX context system as a model interface for recording and restoring execution context. (See makecontext(), swapcontext(), getcontext(), and setcontext().) You would need to implement those yourself, however, and you might want to wrap them to present a simpler interface to applications. Details will be highly dependent on hardware and underlying OS.
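For a feel of that model interface, here is a minimal pause/resume sketch using the glibc <ucontext.h> functions directly (the names main_ctx/task_ctx and the stack size are arbitrary); a home-grown version would replace these calls with your own system-specific implementations:

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;
static char task_stack[64 * 1024];

/* The long-running work: each swapcontext() is a "pause" back to main. */
static void task_func(void)
{
    for (int i = 0; i < 3; i++) {
        printf("task step %d\n", i);
        swapcontext(&task_ctx, &main_ctx);   /* pause(): jump out, keep stack */
    }
}

int main(void)
{
    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp   = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link          = &main_ctx;   /* where to go when task_func returns */
    makecontext(&task_ctx, task_func, 0);

    for (int i = 0; i < 4; i++) {
        printf("main: resuming task\n");
        swapcontext(&main_ctx, &task_ctx);   /* resume(): jump back in */
    }
    return 0;
}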
As an alternative, you might implement transparent multitasking support for such a system by providing compilers that emit specially instrumented code (i.e. even more specially instrumented than you otherwise need). For example, consider compilers that emit bytecode for a VM of your own design. The VMs in which the resulting programs run would naturally track the state of the program running within, and could yield after each sequence of a certain number of opcodes.
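To make the last idea concrete, a toy sketch of such an interpreter loop that returns to its caller after a fixed opcode budget (the opcode set and struct layout are invented for illustration):

#include <stddef.h>

enum opcode { OP_NOP, OP_ADD, OP_HALT };

struct vm {
    const unsigned char *code;
    size_t pc;           /* saved between calls: this *is* the "paused" state */
    long acc;
    int halted;
};

/* Run at most 'budget' opcodes, then return to the caller (the cooperative
   scheduler). Calling vm_run() again resumes exactly where it left off. */
void vm_run(struct vm *vm, int budget)
{
    while (!vm->halted && budget-- > 0) {
        switch (vm->code[vm->pc++]) {
        case OP_NOP:                   break;
        case OP_ADD:  vm->acc += 1;    break;
        case OP_HALT: vm->halted = 1;  break;
        }
    }
}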

Performance of System()

For the C function system(), would it affect the hardware counters if you are trying to see how the command you ran performed?
For example, let's say I'm using the Performance API (PAPI) and the program is a precompiled matrix multiplication application:
int events[7] = { PAPI_TOT_CYC, PAPI_TOT_INS, PAPI_FP_INS, PAPI_FP_OPS,
                  PAPI_LD_INS, PAPI_SR_INS, PAPI_BR_TKN };  /* events inferred from the output below */
long long values[7];

PAPI_start_counters(events, 7);
system("./matmul");
PAPI_read_counters(values, 7);
//Print out values
PAPI_stop_counters(values, 7);
I am obviously missing a bit, but what I am trying to find out is whether it is possible, through the use of said counters, to get the performance of a program I'm running.
From my tests I would get wild numbers like the ones below. They are obviously wrong; I just want to find out why.
Total Cycles =========== 140733358872510
Instructions Completed =========== 4203968
Floating Point Instructions =========== 0
Floating Point Operations =========== 4196867
Loads =========== 140733358872804
Stores =========== 4204037
Branches Taken =========== 15774436
system() is a very slow function in general. On Linux, it spawns /bin/sh (forking and executing a full shell process), which parses your command, and spawns the second program. Loading these two programs requires loading the code to memory, initializing all their libraries, executing startup code, etc. Only then will the program code actually start executing.
Because of the unpredictability of disk access and Linux process scheduling, timing system() calls has a very high inherent variability. Therefore, you won't get accurate results even if you use a high-performance counter.
The better solution would be to compile the target program as a library instead. Load it before initializing your counters, then just execute the main function from the library. That way, all the code executes in your process, and you have negligible startup time. Your performance numbers will be much more precise this way.
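A rough sketch of that approach, assuming the matrix multiplication has been rebuilt as a shared library libmatmul.so exposing a matmul_main() entry point (both names are hypothetical):

#include <dlfcn.h>
#include <papi.h>
#include <stdio.h>
/* Build with: gcc measure.c -ldl -lpapi */

int main(void)
{
    int events[2] = { PAPI_TOT_CYC, PAPI_TOT_INS };
    long long values[2];

    /* Load and resolve the work *before* starting the counters. */
    void *lib = dlopen("./libmatmul.so", RTLD_NOW);
    void (*matmul_main)(void) = (void (*)(void))dlsym(lib, "matmul_main");

    PAPI_start_counters(events, 2);
    matmul_main();                      /* only this code runs under the counters */
    PAPI_stop_counters(values, 2);      /* stop counting and read the final counts */

    printf("cycles=%lld  instructions=%lld\n", values[0], values[1]);
    dlclose(lib);
    return 0;
}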
Do you have access to the code of matmul? If so, it's much more precise to instrument and measure only the code you're interested in. That means you wrap only those instructions (or C statements) in counters that you want to measure.
For more information see:
Related discussion here
Intel® Performance Counter Monitor here
Performance measurements with x86 RDTSC instruction here
As stated above, measuring using PAPI to wrap system() invocations carries way too much process overhead to give you any idea of how fast your math code is actually running.
The numbers you are getting are odd, but not necessarily wrong. The huge disparity between the instructions completed and the cycles probably indicates that the "matmul" executable is doing a lot of waiting for external operations (e.g. disk I/O) to complete. I do not know the specifics of the FP Instructions and FP Operations counters, but if PAPI displays those values differently, it has a reason.
What is interesting is that the loads and cycles are obviously connected, as are the instructions/FP ops and the stores.
I would have to know about the internals of "matmul" in order to give you a better description.

FreeRTOS - Stack corruption on STM32F4

I am currently having problems with what I think is stack corruption or some error of configuration while running FreeRTOS on an STM32F407 target.
I have looked at FreeRTOS stack corruption on STM32F4 with gcc but got no help there.
The application runs two tasks and relies on one CAN interrupt. The workflow is as follows:
The two tasks, network_task and app_task, are created along with two queues, raw_msg_queue and app_msg_queue. The CAN interrupt is also set up.
The network_task has the highest priority and starts waiting on the raw_msg_queue, indefinitely.
The app_task is next and starts waiting on the app_msg_queue.
The CAN interrupt then triggers because of an external event, adding a CAN message to the raw_msg_queue.
The network_task wakes up, processes the message, adds the processed message to the app_msg_queue and then continues to wait on the raw_msg_queue.
The app_task wakes up and I get a hard fault.
The thing is that I have wrapped the calls that app_task makes to xQueueReceive in two steps, for end-user convenience and portability. The app_task's total function chain is network_receive(..) -> os_queue_receive(..) -> xQueueReceive(..). This works well, but when it returns from xQueueReceive(..) it only manages to get back to os_queue_receive(..) before it returns to a seemingly random memory location and I get a hard fault.
The stack sizes should be adequate and are set to 2048 for both tasks; all large data structures are passed around as pointers.
I am running my code on two STM32F407. FreeRTOS is at version 7.4.2, the latest at the time of writing.
I am really hoping that someone can help me out here!
First, you can take a look here and try to get more info about the hard fault.
You may also want to check your interrupt priority setting, as the tricky ARM Cortex-M interrupt priority mechanism causes some trouble in FreeRTOS. Refer to here.
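On the STM32F4, the usual rule is that any interrupt whose ISR calls a ...FromISR() API must have a priority numerically greater than or equal to configMAX_SYSCALL_INTERRUPT_PRIORITY. A hedged sketch, assuming a typical CMSIS setup, the stock STM32F4 CAN IRQ name, and the common configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY convention in FreeRTOSConfig.h:

/* The CAN RX ISR presumably calls xQueueSendFromISR(), so its NVIC priority
   must be numerically at or above the maximum "syscall" priority. */
NVIC_SetPriority(CAN1_RX0_IRQn, configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY);
NVIC_EnableIRQ(CAN1_RX0_IRQn);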
I know this question is rather old, but perhaps this could help out other people facing a similar issue. In FreeRTOS, you can utilize the
void vApplicationStackOverflowHook(xTaskHandle xTask, signed char *pcTaskName)
function to detect a stack overflow and grab relevant information about the offending task. It's possible that data would be corrupt due to the overflow, but you can at least address the fact that an overflow occurred (reset the system, set an error flag/LED, etc.).
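A minimal sketch of such a hook (it assumes configCHECK_FOR_STACK_OVERFLOW is set to 1 or 2 in FreeRTOSConfig.h; what you do inside the trap is up to you):

void vApplicationStackOverflowHook(xTaskHandle xTask, signed char *pcTaskName)
{
    (void) xTask;
    (void) pcTaskName;   /* name of the offending task; may itself be corrupted */

    taskDISABLE_INTERRUPTS();
    for (;;)
    {
        /* Trap here: toggle an error LED, log the task name, or reset the system. */
    }
}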
For this specific question, I'd be curious to see the thread initialization code as well as the interrupt routine. If the problem is in fact an overflow, I think it would be fairly simple to adjust these parameters until the problem goes away. You mention 2048 bytes should be sufficient for each thread - if that's truly the case, I doubt the problem is an overflow. At that point, it's more likely you're dereferencing a dangling pointer to a stale memory address.

How can I prevent a task from being moved from one CPU to another?

I am currently building a kernel module and I want to handle SMP issues in a reasonably optimal way.
Currently, I have a set of objects and each one is bound to a particular CPU. The following code illustrates this:
struct my_object {
    int a_field;
};

struct my_object cpu_object[NR_CPUS];

/*
 * cpu_object[i] is "bound" to CPU number "i" !
 */
A simple call to smp_processor_id() will then give me the processor on which the current code is running. So if I have a function foo that does some work using the CPU-bound objects described above, it might look like:
void foo()
{
    int cpu = smp_processor_id();

    do_some_work_with(cpu_object[cpu]);
}
The question is: how do I guarantee that
there is no CPU switch between the cpu assignment and the call to do_some_work_with()?
do_some_work_with() will only run on cpu?
At the moment, the solution I am thinking about is:
Disable preemption using a spinlock
Get the CPU with smp_processor_id
Set the processor affinity of the current task to make it stick with the current CPU
Enable preemption again, releasing the lock
Do the work do_some_work_with()
Reset the affinity to its previous state
To me this is quite barbaric, and I was wondering if there was a smarter and lighter way to do it.
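(For reference, the standard lightweight idiom for keeping a short section on the CPU it started on, without touching task affinity at all, is get_cpu()/put_cpu(), which simply brackets the access with preemption disabled; a minimal sketch of foo() written that way:)

#include <linux/smp.h>          /* get_cpu(), put_cpu() */

void foo(void)
{
        int cpu = get_cpu();                    /* disables preemption, returns current CPU */

        do_some_work_with(cpu_object[cpu]);     /* cannot be migrated in here */

        put_cpu();                              /* re-enables preemption */
}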
Thanks in advance.
EDIT :
As stated in the comments, I edit to explain why I feel I need such features.
I have to perform on-the-fly encryption on a filesystem level.
To do so, I will use the Kernel built-in cryptographic support (struct crypto_tfm and friends). Here is the original issue...
On multi-core machines, it is possible to perform multiple R/W operations at the same time. The common fs layer does it and does it well. But here I come and mess things up:
A struct crypto_tfm-like object is in charge for the ciphering operation
The same transform object cannot be used concurrently, since some parameters (the private key and initialization vector) would be altered and screw up the whole process.
A naive solution, as described below, is completely out of the question due to the complex cipher allocation system built into the crypto API:
Allocate the crypto_tfm transformation
Perform the ciphering operation
Free the transformation object
A classic scheme where only one transform is available prevents multiple concurrent R/W operations since one task would have to wait for another to release the lock held to protect the transform object.
For these reasons, I need to deal with multiple transformation objects. I must find an efficient scheme that allows concurrent R/W. I feel my "Y" here is "the solution that is simple, neat ... and wrong".
Any suggestion would be much appreciated.
Note : If I use a solution like the one I gave in the original question, I would limit it to very short sections to avoid heavy impact on CPU load balancing.
So, based on your edited question, I have to say that I think your solution is wrong.
The right thing to do is to have a "per operation" crypto_tfm that follows that operation across CPUs. Using the "current CPU" is not the right thing here. [What happens if this is running on a system with hot-swappable CPUs, and someone disconnects the CPU your task is running on - and never puts one back in its place?]
If it's costly to allocate a crypto_tfm per operation, then you have to find some way to avoid allocating/freeing the objects - have a pool of them, assign an available one to the current operation, and when the operation is complete, put it back on the available list.
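A bare-bones sketch of such a pool (the tfm_entry/tfm_pool_* names are made up for illustration, and initialization is left out; a real version would pre-allocate a few transforms and decide whether to allocate or sleep when the list runs empty):

#include <linux/crypto.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct tfm_entry {
        struct list_head   list;
        struct crypto_tfm *tfm;
};

static LIST_HEAD(tfm_free_list);
static DEFINE_SPINLOCK(tfm_pool_lock);

/* Grab an available transform for the current operation. */
static struct tfm_entry *tfm_pool_get(void)
{
        struct tfm_entry *e = NULL;

        spin_lock(&tfm_pool_lock);
        if (!list_empty(&tfm_free_list)) {
                e = list_first_entry(&tfm_free_list, struct tfm_entry, list);
                list_del(&e->list);
        }
        spin_unlock(&tfm_pool_lock);
        return e;                       /* NULL: pool empty, allocate or wait */
}

/* Return the transform to the pool when the operation completes. */
static void tfm_pool_put(struct tfm_entry *e)
{
        spin_lock(&tfm_pool_lock);
        list_add(&e->list, &tfm_free_list);
        spin_unlock(&tfm_pool_lock);
}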

Programmatically Detect Context Switch via Assembly

I am aware that one cannot listen for, detect, and perform some action upon encountering context switches on Windows machines via managed languages such as C#, Java, etc. However, I was wondering if there was a way of doing this using assembly (or some other language, perhaps C)? If so, could you provide a small code snippet that gives an idea of how to do this (as I am relatively new to kernel programming)?
What this code will essentially be designed to do is run in the background on a standard Windows UI and listen for when a particular process is either context switched in or out of the CPU. Upon hearing either of these actions, it will send a signal. To clarify, I am looking to detect only the context switches directly involving a specific process, not all context switches. What I ultimately would like to achieve is to be able to notify another machine (via a signal over the internet) whenever a specific process begins making use of the CPU, as well as when it ceases doing so.
My first attempt at doing this involved simply calculating the CPU usage percentage of the specific process, but this ultimately proved to be too coarse-grained to catch the most minute calculations. For example, I wrote a test program that simply performed the operation 2+2 and placed the answer inside an int. The CPU usage method did not pick up on this. Thus, I am looking for something lower level, hence the origin of this question. If there are potential alternatives, I would be more than happy to field them.
There's Event Tracing for Windows (ETW), which you can configure to receive messages about a variety of events occurring in the system.
You should be able to receive messages about thread scheduling events. The CSwitch class of events is for that.
Sorry, I don't know any good ETW samples that you could easily reuse for your task. Read MSDN and look around.
Simon pointed out a good link explaining why ETW can be useful. Very enlightening: http://randomascii.wordpress.com/2012/05/11/the-lost-xperf-documentationcpu-scheduling/
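For a starting point, a hedged sketch of starting the NT Kernel Logger with context-switch events enabled (it needs administrator rights; the consuming side, OpenTrace()/ProcessTrace() with a callback that filters the CSwitch events down to your target process, is left out):

#define INITGUID                  /* so SystemTraceControlGuid is actually defined */
#include <windows.h>
#include <evntrace.h>
#include <stdlib.h>
/* Link with advapi32.lib */

int main(void)
{
    ULONG size = sizeof(EVENT_TRACE_PROPERTIES) + sizeof(KERNEL_LOGGER_NAME);
    EVENT_TRACE_PROPERTIES *props = (EVENT_TRACE_PROPERTIES *)calloc(1, size);
    TRACEHANDLE session = 0;

    props->Wnode.BufferSize    = size;
    props->Wnode.Guid          = SystemTraceControlGuid;     /* kernel logger session */
    props->Wnode.Flags         = WNODE_FLAG_TRACED_GUID;
    props->Wnode.ClientContext = 1;                          /* QPC timestamps */
    props->EnableFlags         = EVENT_TRACE_FLAG_CSWITCH;   /* context-switch events */
    props->LogFileMode         = EVENT_TRACE_REAL_TIME_MODE;
    props->LoggerNameOffset    = sizeof(EVENT_TRACE_PROPERTIES);

    if (StartTrace(&session, KERNEL_LOGGER_NAME, props) != ERROR_SUCCESS)
        return 1;

    /* ... OpenTrace()/ProcessTrace() here to actually receive the CSwitch events ... */

    ControlTrace(session, KERNEL_LOGGER_NAME, props, EVENT_TRACE_CONTROL_STOP);
    free(props);
    return 0;
}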
Please see the edits below; in particular update 3 - ETW appears to be the way to go.
In theory you could install your own trap handler for the old int 2Eh and the new sysenter. However, in practice this isn't going to be as easy anymore as it used to be because of Patchguard (since Vista) and signing requirements. I'm not aware of any other generic means to detect context switches, meaning you'd have to roll your own. All context switches of the OS go through call gates (the aforementioned trap handlers) and ReactOS allows you to peek behind the scenes if you feel uncomfortable with debugging/disassembling.
However, in either case there shouldn't be a generic way to install something like this without kernel mode privileges (usually referred to as ring 0) - anything else would be a security flaw in Windows. I'm not aware of a Windows-supplied method to achieve what you want either.
The book "Undocumented Windows NT" has a pretty good chapter about the exact topic (although obviously targeted at the old int 2Eh method).
If you can live with hooking only certain functions, you may be able to get away with some filter driver(s) or user-mode API hooking. Depends on your exact requirements.
Update: reading your updated question, I think you need to read up on the internals, in particular on the concept of IRQLs (not to be confused with IRQs from DOS times) and the scheduler. The problem is that there can - and usually will - be literally hundreds of context switches every second. However, your watcher process (the one watching for context switches) will, like any user-mode process, be preemptible. This means that there is no way for you to achieve real-time signaling or anything close to it, which puts a big question mark over the method.
What is it actually that you want to achieve? The number of context switches doesn't really give you anything. Every single SEH exception will cause a context switch. What is it that you are interested in? Perhaps performance counters cater to your needs better?
Update 2: the sheer number of context switches, even for a single thread, within a single second will be flabbergasting. So assuming you'd install your own trap handler, you'd still end up (adversely) affecting all other threads on the system (after all, you'd catch every context switch, check whether it involves the process/threads you care about, and then either do your thing or pass it on).
If you could tell us what you ultimately want to achieve, not with the means already pre-defined, we may be able to suggest alternatives.
Update 3: so apparently I was wrong in one respect here. Windows comes with something on board that signals context switches, and ETW can be harnessed to tap into those. Thanks to Simon for pointing it out.
