Cortex-M4F lazy FPU stacking - arm

I'm writing threading code for a Cortex M4F. Everything's working and I'm now looking into making FPU context switching more efficient via lazy stacking.
I've read ARM's AN298 and implemented the alternative approach based on disabling the FPU and handling UsageFault, but the lower (S0-S15) registers are not being saved/restored correctly by the hardware. I think the problem lies in figure 11:
According to this, when PendSV runs FPCAR should point to the space reserved in Task A's stack. But as I see it, since CONTROL.FPCA is high in Task C, FPCAR will be updated to point to Task C's stack when entering PendSV. If so, S0-S15 and FPSCR will be saved to Task C's stack instead of Task A's, which is of course not correct.
Am I missing something here, or is the appnote wrong?
On a side note, I checked some open source RTOSes. FreeRTOS and mbed RTOS always stack S16-S31 during the context switch, which in turn triggers automatic S0-S15 stacking, i.e. they use lazy stacking only to reduce interrupt latency but do full state preservation for tasks (as in the first approach outlined in the appnote). The TNKernel port for the M4F uses the UsageFault approach, but fully saves/restores S0-S31 in software, effectively bypassing any problem with FPCAR (at the cost of 48 loads/stores instead of 32, since the 16 hardware-stacked registers get overwritten on restore). Nobody seems to be using the UsageFault approach while preserving only S16-S31.
(By the way, this is also posted at ARM Community, but a lot of questions seem to go unanswered there. If I get an answer there, I'll replicate it here, too)

It took a while, but in the end I found out how to do this as efficiently as possible.
First off, the appnote is wrong. My initial explanation of the way FPCAR is updated is correct. Note that FPCAR is updated even when the FPU is disabled. Also, by testing, I determined that FPCAR does indeed always point to the interrupted stack.
My first approach was to manipulate FPCAR, LSPACT and EXC_RETURN, with the UsageFault handler pending PendSV. Of course, for this to work it's essential that writing FPCAR doesn't count as an FPU operation from a lazy stacking perspective. When the documentation is lacking, we can only hack the answers out of the CPU...
LDR  R2, =0xE000EF38     ; FPCAR
LDR  R3, =0xDEADBEEF
STR  R3, [R2]            ; corrupt FPCAR - does this count as an FPU op?
VSTM R1, {S16-S31}       ; first explicit FPU op after the corruption
UDF                      ; deliberate fault if we make it this far
FPCAR is at 0xE000EF38, and the VSTM is part of the context-saving routine. The idea is that, if writing FPCAR counts as an FPU operation, lazy stacking will be triggered by the STR itself, before FPCAR is overwritten, so it will succeed with the still-valid FPCAR and execution will then fault on the UDF. Otherwise, lazy stacking will only happen on the VSTM, with a corrupted FPCAR, resulting in a bus fault.
Indeed, I got a bus fault. Yay! I repeated the test with a valid address: no fault, works perfectly. So saving is simple enough. Restoring requires pending PendSV and manipulating FPCAR, LSPACT and EXC_RETURN inside it, to cause S0-S15 for the current thread to be restored on exception return. The problem here is that you can't keep state for the current thread on its own stack, as it's going to be popped off. Copying is inefficient, so the best bet is to point FPCAR at the persistent TCB state instead of saving the CPU-generated one.
This gets quite complex: it requires performing a PendSV after the UsageFault, and it has quite a few corner cases and races. There's a better way.
The approach I ended up using runs completely inside UsageFault and bypasses hardware stacking, without losing efficiency over it. After enabling the FPU and determining an FPU context switch is required, I:
Set LSPACT to zero;
Save/restore the full S0-S31 state to/from the TCB;
Set LSPACT back to one.
By doing this, I can work on the whole S0-S31 state without lazy stacking getting in the way, because with LSPACT at zero the CPU thinks it has already stacked the context. This of course relies on the UsageFault handler not using FPU ops outside of the save/restore and not being preempted by FPU-using ISRs, which are pretty trivial assumptions given that it's hand-coded ASM and that fault handlers can't be preempted by ISRs. I also tried disabling lazy stacking via ASPEN/LSPEN instead of working on LSPACT, but that doesn't seem to work (it still triggers lazy stacking, verified by setting an invalid FPCAR).
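For reference, a rough C-with-inline-asm sketch of that sequence (the real handler is hand-written assembly; the TCB layout and the fpu_regs field are just placeholders, FPCCR is the register at 0xE000EF34 and its bit 0 is LSPACT):

#include <stdint.h>

#define FPCCR  (*(volatile uint32_t *)0xE000EF34u)   /* FP Context Control Register */

struct tcb {
    uint32_t fpu_regs[32];   /* persistent S0-S31 state (placeholder layout) */
    /* ... */
};

/* Called from the UsageFault handler, after the FPU has been re-enabled and
   an FPU context switch has been found to be necessary. */
static void fpu_context_switch(struct tcb *prev, struct tcb *next)
{
    FPCCR &= ~1u;   /* LSPACT = 0: the CPU now believes S0-S15 are already
                       stacked, so the FPU ops below won't trigger lazy stacking */

    __asm volatile ("vstm %0, {s0-s15}"  : : "r"(&prev->fpu_regs[0])  : "memory");
    __asm volatile ("vstm %0, {s16-s31}" : : "r"(&prev->fpu_regs[16]) : "memory");
    __asm volatile ("vldm %0, {s0-s15}"  : : "r"(&next->fpu_regs[0]));
    __asm volatile ("vldm %0, {s16-s31}" : : "r"(&next->fpu_regs[16]));

    FPCCR |= 1u;    /* LSPACT back to 1 */
}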
Efficiency-wise, this is as efficient as hardware stacking. If I wanted to nitpick, it even saves one cycle, as I don't need to write back the incremented pointer.
By the way, I kept the description of the first approach even though I didn't end up using it, because I think there's some useful info in there if anyone else comes looking for this.

Related

Optimization and Time Slicing Causes Multitasking Data Issues

I am using FreeRTOS and have multiple tasks using the same code at the same priority level. To test my code I pass the same data into each task. When optimization is above -O0 and time slicing is turned on, there is some sort of problem where the context is not being saved correctly.
My understanding is that each task has its own stack, and that on the context switch from one to another the stack pointer is updated accordingly, ensuring that each task stays independent. This isn't happening for me. When I run each task individually, I get one answer, but if I run all three tasks together, one answer is correct and the others are slightly off. There is some sort of crossover of data between the tasks making them not truly independent.
Any idea where this issue could be coming from? I am not using any global variables and my code is reentrant as far as I can tell.
In case anyone runs into this: I discovered the problem.
I am running FreeRTOS on an ARM Cortex-A9 chip. To prevent processor register corruption, a task must not use any floating-point registers unless it has a floating-point context. In my project, the tasks were not created with a floating-point context by default.
I added
portTASK_USES_FLOATING_POINT()
to the beginning of my task. That corrected the error and the multitasking works now.
Note that I also had to add this to my UnitTest task that was calling the original three "broken" tasks, as posting to a Queue is error prone as well.
You can see more here: https://www.freertos.org/Using-FreeRTOS-on-Cortex-A-Embedded-Processors.html
and here: https://www.freertos.org/FreeRTOS_Support_Forum_Archive/April_2017/freertos_FreeRtos_native_Floats_and_Task_switching_03b24664j.html
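For anyone else hitting this, the change really is that small; a minimal sketch (the task name and body are placeholders):

#include "FreeRTOS.h"
#include "task.h"

static void vCalcTask( void *pvParameters )
{
    /* Cortex-A port: give this task a floating-point context so the FPU
       registers are saved and restored for it on every context switch. */
    portTASK_USES_FLOATING_POINT();

    for( ;; )
    {
        /* ... floating-point work, queue posts, etc. ... */
    }
}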

Why Do Page Faults and Unrecoverable Errors Need to be Unmaskable?

Looking for a quick clarification on why unrecoverable errors and page faults must be non-maskable. What happens when they aren't?
Interrupts and exceptions are very different kinds of events.
An interrupt is an event external to the CPU that arrives at the processor asynchronously (its moment of arrival does not depend on the currently executing program).
An exception is an event internal to the CPU that happens as a side effect of executing an instruction.
Consider the processor as an overcomplex, unstoppable automaton with a well-defined and strictly specified behavior. It continuously fetches, decodes, and executes instructions, one by one. As it executes each instruction, it applies the result to the state of the automaton (registers and memory) according to the instruction's type. It moves without pauses or interruptions; you can only change the direction of this continuous instruction crunching using jumps and function calls.
This automaton-like model, backed by well-defined and strictly specified instruction behavior, makes the processor extremely predictable and convenient to program, both for compilers and for software engineers. When you look at an assembler listing, you can say precisely what the processor will do when it executes that program. However, under some specific circumstances, the execution of an instruction can fall outside this well-defined model, and in such cases the CPU literally does not know what to do next or how to react. For example, the program tries to divide by zero. What reaction do you expect? What value should be placed into the target register as the result of the division? How can the CPU report to the program that something went wrong? Now imagine another case: the program jumps to a virtual address that has no physical address mapped to it. How should the CPU proceed with its unstoppable fetch-decode-execute job? From where should it take the next instruction to execute? Which instruction should it execute? Or should it simply hang in response? There is no way out of such states.
An exception is a tool for the CPU to get out of such situations gracefully and resume its unstoppable movement. At the same time, it is a tool to report the encountered error to the operating system and ask it to help with handling it. If you could mask exceptions, you would steal that tool from the CPU and put all of the above issues back on the table. CPU designers do not have good answers for them and do not want to see them, so they make exceptions unmaskable.

Practical Delimited Continuations in C / x64 ASM

I've looked at a paper called A Primer on Scheduling Fork-Join Parallelism with Work Stealing. I want to implement continuation stealing, where the rest of the code after the call to spawn is eligible to be stolen. Here's the code from the paper.
1 e();
2 spawn f();
3 g();
4 sync;
5 h();
An important design choice is which branch to offer to thief threads.
Using Figure 1, the choices are:
Child Stealing:
f() is made available to thief threads.
The thread that executed e() executes g().
Continuation Stealing:
Also called “parent stealing”.
The thread that executed e() executes f().
The continuation (which will next call g()) becomes available to thief threads.
I hear that saving a continuation requires saving all sets of registers (volatile, non-volatile, and FPU). In the fiber implementation I did, I ended up implementing child stealing. I've read about the (theoretical) negatives of child stealing (an unbounded number of runnable tasks; see the paper for more info), so I want to use continuations instead.
I'm thinking of two functions, shift and reset, where reset delimits the current continuation, and shift reifies the current continuation. Is what I'm asking even plausible in a C environment?
EDIT: I'm thinking of making reset save return address / NV GPRs for the current function call (= line 3), and making shift transfer control to the next continuation after returning a value to the caller of reset.
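For concreteness, here is a minimal one-shot sketch of the reset/shift shape on top of POSIX ucontext (every name here is made up for illustration; it ignores work stealing, thread safety and multi-shot resumption, and a real implementation would save the return address / non-volatile GPRs itself instead of relying on swapcontext):

#include <stdio.h>
#include <ucontext.h>

static ucontext_t caller_ctx;   /* continuation of the code after reset()  */
static ucontext_t body_ctx;     /* context running the delimited body      */
static int        shift_value;  /* value "returned" through shift()        */

/* shift: suspend the rest of the body and return v to reset's caller. */
static void shift(int v)
{
    shift_value = v;
    swapcontext(&body_ctx, &caller_ctx);   /* body_ctx can later be resumed */
}

static void body(void)
{
    puts("body: before shift");
    shift(42);
    puts("body: resumed after shift");     /* runs only if someone resumes us */
}

/* reset: run fn() on its own stack, delimited by this call. */
static int reset(void (*fn)(void))
{
    static char stack[64 * 1024];          /* one-shot, non-reentrant sketch */

    getcontext(&body_ctx);
    body_ctx.uc_stack.ss_sp   = stack;
    body_ctx.uc_stack.ss_size = sizeof stack;
    body_ctx.uc_link          = &caller_ctx;
    makecontext(&body_ctx, fn, 0);

    swapcontext(&caller_ctx, &body_ctx);   /* run the body until shift/return */
    return shift_value;
}

int main(void)
{
    printf("reset returned %d\n", reset(body));
    /* Resuming the captured continuation would be another swapcontext into
       body_ctx; omitted to keep this one-shot. */
    return 0;
}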
I've implemented work stealing for an HLL called PARLANSE rather than C, on x86. PARLANSE is used daily to build production symbolic parallel programs at the million-line scale.
In general, you have to preserve the registers of both the continuation and the "child".
Consider that your compiler may see a computation in f(), see the same computation in g(), and lift that computation to the point just before the spawn, placing the result in a register that both f() and g() use as an implied parameter.
Yes, this assumes a sophisticated compiler, but if you are using a stupid compiler that doesn't optimize, why are you trying to go parallel for speed?
More specifically, however, your compiler could arrange for the registers to be empty before the call to spawn if it understood what spawn means. Then neither the continuation nor the child has to preserve registers. (The PARLANSE compiler in fact does this.)
So how much has to be saved depends on how much your compiler is willing to help, and that depends on whether it knows what spawn really does.
Your local friendly C compiler likely doesn't know about your implementation of spawn. So either you do something to force a register flush (don't ask me how, it's your compiler) or you accept that you personally don't know what's in the registers, and your implementation preserves them all to be safe.
If the amount of work spawned is significant, arguably it wouldn't matter if you saved all the registers. However, x86 (and other modern architectures) seems to have an enormous amount of state, mostly in the vector registers, that might be in use; last time I looked it was well in excess of 500 bytes, roughly 100 writes to memory to save it all, and IMHO that's an excessive price. If you don't believe these registers are going to be passed from the parent thread to the spawned thread, then you can work on enforcing spawn with no registers.
If your spawn routine wakes up using a standard continuation mechanism you have invented, then you also have to worry about whether your continuations pass large register state or not. Same problem, same solutions as for spawn: the compiler has to help, or you personally have to intervene.
You'll find this a lot of fun.
[If you want to make it really interesting, try timeslicing the threads in case they go into a deep computation without an occasional yield, causing thread starvation. Now you surely have to save the entire state. I managed to get PARLANSE to realize spawning with no registers saved, yet have the time slicing save/restore the full register state, by saving the full state on a time slice and continuing at a special place that refilled all the registers before passing control to the time-sliced PC location.]

remapping Interrupt vectors and boot block

I am not able to understand the concept of remapping interrupt vectors or the boot block. What is the use of remapping the vector table? How does it work with and without remapping? Any links to good articles on this? I googled for this but was unable to find a good answer. What is the advantage of mapping RAM to 0x0000 and mapping whatever existed at 0x0000 elsewhere? Is it that execution is faster from 0x0000?
It's a simple matter of practicality. The reset vector is at 0x0*, and when the system first powers up the core is going to start fetching instructions from there. Thus you have to have some code available there immediately from powerup - it's got to be some kind of ROM, since RAM would be uninitialised at this point. Now, once you've got through the initial boot process and started your application proper, you have a problem - your exception vectors, and the code to handle them, are in ROM! What if you want to install a different interrupt handler? What if you want to switch the reset vector for a warm-reset handler? By having the vector area remappable, the application is free to switch out the ROM boot firmware for the RAM area in which it's installed its own vectors and handler code.
Of course, this may not always be necessary - e.g. for a microcontroller running a single dedicated application which handles powerup itself - but as soon as you get into the more complex realm of separate bootloaders and application code it becomes more important. Performance is also a theoretical concern, at least - if you have slow flash but fast RAM you might benefit from copying your vectors and interrupt handlers into that RAM - but I think that's far less of an issue on modern micros.
Furthermore, if an application wants to be able to update the boot flash at runtime, then it absolutely needs a way of putting the vectors and handlers elsewhere. Otherwise, if an interrupt fires whilst the flash block is in programming mode, the device will lock up in a recursive hard fault due to not being able to read from the vectors, never finish the programming operation and brick itself.
Whilst most types of ARM core have some means to change their own vector base address, some (like the Cortex-M0), not to mention plenty of non-ARM cores, do not, which necessitates this kind of non-architecture-specific, system-level remapping functionality to achieve the same result. In the case of microcontrollers built around older cores like the ARM7TDMI, it's also quite likely for there to be no RAM behind the fixed alternative "high vectors" address (more suited for use with an MMU), rendering that option useless.
* Yeah, OK, 0x4 if we're talking Cortex-M, but you know what I mean... ;)
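For cores that do expose a vector table base register (Cortex-M3/M4 and later have VTOR at 0xE000ED08), the same effect can be had without any system-level remapping by copying the table into RAM and repointing VTOR; a rough sketch, with the vector count and placement being application-specific:

#include <stdint.h>
#include <string.h>

#define SCB_VTOR      (*(volatile uint32_t *)0xE000ED08u)
#define VECTOR_COUNT  (16u + 32u)   /* 16 system exceptions + 32 IRQs (example) */

/* VTOR requires the table to be aligned to the vector count rounded up to a
   power of two, times 4 bytes - 256 bytes covers the 48 entries used here. */
static uint32_t ram_vectors[VECTOR_COUNT] __attribute__((aligned(256)));

void relocate_vectors(void)
{
    const uint32_t *rom_vectors = (const uint32_t *)SCB_VTOR;

    memcpy(ram_vectors, rom_vectors, sizeof ram_vectors);
    __asm volatile ("dsb" ::: "memory");   /* copy must complete...          */
    SCB_VTOR = (uint32_t)ram_vectors;
    __asm volatile ("dsb" ::: "memory");   /* ...and the new base be visible */
    __asm volatile ("isb");                /* ...before any further fetches  */
}

/* After this, individual entries can be swapped at runtime, e.g.
       ram_vectors[16 + MY_IRQn] = (uint32_t)my_new_handler;       */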

Real-life use cases of barriers (DSB, DMB, ISB) in ARM

I understand that DSB, DMB, and ISB are barriers that prevent reordering of instructions.
I can also find lots of very good explanations for each of them, but it is pretty hard to imagine the cases where I actually have to use them.
Also, in open source code I see those barriers from time to time, but it is quite hard to understand why they are used. Just as an example, in the Linux kernel 3.7 tcp_rcv_synsent_state_process function, there are the following lines:
if (unlikely(po->origdev))
    sll->sll_ifindex = orig_dev->ifindex;
else
    sll->sll_ifindex = dev->ifindex;
smp_mb();
if (po->tp_version <= TPACKET_V2)
    __packet_set_status(po, h.raw, status);
where smp_mb() is basically DMB.
Could you give me some of your real-life examples?
It would help understand more about barriers.
Sorry, not going to give you a straight-out example like you're asking, because as you are already looking through the Linux source code, you have plenty of those to go around, and they don't appear to help. No shame in that - every sane person is at least initially confused by memory access ordering issues :)
If you are mainly an application developer, then there is every chance you won't need to worry too much about it - whatever concurrency frameworks you use will resolve it for you.
If you are mainly a device driver developer, then examples are fairly straightforward to find - whenever there is a dependency in your code on a previous access having had an effect (cleared an interrupt source, written a DMA descriptor) before some other access is performed (re-enabling interrupts, initiating the DMA transaction).
If you are in the process of developing a concurrency framework (or debugging one), you probably need to read up on the topic a bit more - but your question suggests a superficial curiosity rather than an immediate need?
If you are developing your own method for passing data between threads, not based on primitives provided by a concurrency framework, that is for all intents and purposes a concurrency framework.
Paul McKenney wrote an excellent paper on the need for memory barriers, and what effects they actually have in the processor: Memory Barriers: a Hardware View for Software Hackers
If that's a bit too hardcore, I wrote a 3-part blog series that's a bit more lightweight, and finishes off with an ARM-specific view. First part is Memory access ordering - an introduction.
But if it is specifically lists of examples you are after, especially for the ARM architecture, you could do a lot worse than Barrier Litmus Tests and Cookbook.
The extra-extra light programmer's view and not entirely architecturally correct version is:
DMB - whenever a memory access requires ordering with regards to another memory access.
DSB - whenever a memory access needs to have completed before program execution progresses.
ISB - whenever instruction fetches need to explicitly take place after a certain point in the program, for example after memory map updates or after writing code to be executed. (In practice, this means "throw away any prefetched instructions at this point".)
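As one tiny example of the DSB + ISB pairing from that last bullet (a sketch only; the register is the standard ARMv7-M MPU control register at 0xE000ED94, and the actual region configuration is omitted):

#define MPU_CTRL  (*(volatile unsigned int *)0xE000ED94u)

static inline void mpu_enable(void)
{
    MPU_CTRL |= 1u;             /* set the MPU ENABLE bit                        */
    __asm volatile ("dsb");     /* wait for the configuration writes to complete */
    __asm volatile ("isb");     /* refetch instructions under the new memory map */
}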
Usually you need to use a memory barrier in cases where you have to make SURE that memory accesses occur in a specific order. This might be required for a number of reasons; usually it's required when two or more processes/threads or a hardware component access the same memory structure, which has to be kept consistent.
It's used very often in DMA transfers. A simple DMA control structure might look like this:
struct dma_control {
    u32   owner;
    void *data;
    u32   len;
};
The owner will usually be set to something like OWNER_CPU or OWNER_HARDWARE, to indicate which of the two participants is allowed to work with the structure.
Code which changes this will usually look like this:
dma->data = data;
dma->len = length;
smp_mb();
dma->owner = OWNER_HARDWARE;
So, data and len are always set before the ownership gets transferred to the DMA hardware. Otherwise the engine might see stale data, like a pointer or length which was not updated, because the CPU reordered the memory accesses.
The same goes for processes or threads running on different cores. They could communicate in a similar manner.
One simple example of a barrier requirement is a spinlock. If you implement a spinlock using compare-and-swap (or LDREX/STREX on ARM) and without a barrier, the processor is allowed to speculatively load values from memory and lazily store computed values to memory, and neither of those is required to happen in the order of the loads/stores in the instruction stream.
The DMB in particular prevents memory access reordering around the DMB. Without DMB, the processor could reorder a store to memory protected by the spinlock after the spinlock is released. Or the processor could read memory protected by the spinlock before the spinlock was actually locked, or while it was locked by a different context.
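A hedged sketch of where those barriers sit in an ARMv7 spinlock (0 = free, 1 = held; WFE/SEV, fairness and error handling are all omitted):

#include <stdint.h>

static void spin_lock(volatile uint32_t *lock)
{
    uint32_t status, value;

    do {
        __asm volatile ("ldrex %0, [%1]" : "=r"(value) : "r"(lock));
        if (value != 0) {           /* already held: go round again */
            status = 1;
            continue;
        }
        __asm volatile ("strex %0, %2, [%1]"
                        : "=&r"(status) : "r"(lock), "r"(1u) : "memory");
    } while (status != 0);

    /* Reads/writes of the protected data must not be observed
       before the lock is seen as taken. */
    __asm volatile ("dmb" ::: "memory");
}

static void spin_unlock(volatile uint32_t *lock)
{
    /* All accesses to the protected data must complete
       before the lock is released. */
    __asm volatile ("dmb" ::: "memory");
    *lock = 0;
}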
unixsmurf already pointed it out, but I'll also point you toward Barrier Litmus Tests and Cookbook. It has some pretty good examples of where and why you should use barriers.
