I'm looking for hints on using a dynamic memory allocator safely in a multi-threaded system. Details of the issue:
the code is written in C and will run on a Cortex-M3 processor, with an RTOS (CooCox OS),
the TLSF memory allocator will be used (other allocators might be used if I find one better suited, as long as it is free and open-source),
the solution I'm looking for is a way to use the memory allocator safely from both OS tasks and interrupts.
So far I have thought of 2 possible approaches, both of which have details that are still unclear to me:
disable and enable interrupts when calling allocator functions. Problem - if I'm not mistaken, I can't enable and disable interrupts in normal mode, only in privileged mode (which, again if I'm not mistaken, means only in interrupt handlers), yet I also need to do this from normal runtime code, to prevent interrupts and task switching during memory-handler operations.
call the allocator from an SWI. This one is still very unclear to me. First: is SWI the same as FIQ? If so, is it true that FIQ code needs to be written in assembly (the allocator is written in C)? I also still have a few doubts about calling an FIQ from an IRQ (that scenario would happen, though not often), but most likely this part will not cause issues.
So any ideas on possible solutions for this situation?
Regarding your suggestions 1 and 2:
On Cortex-M3 you can enable and disable interrupts at any time in privileged-level code through the CMSIS intrinsics __disable_irq()/__enable_irq(). Privileged level is not restricted to handler mode; thread-mode code can run at privileged level too (and in many small RTOSes that is the default).
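To illustrate the pattern (a sketch only: __get_PRIMASK, __disable_irq and __set_PRIMASK are the real CMSIS intrinsic names, but here they are stubbed with a plain variable so the code can be compiled off-target; saving and restoring PRIMASK, rather than blindly re-enabling, keeps the section safe to nest):

```c
#include <stdint.h>

uint32_t primask_reg;                               /* stub for the PRIMASK register */
static uint32_t __get_PRIMASK(void)       { return primask_reg; }
static void     __disable_irq(void)       { primask_reg = 1; }
static void     __set_PRIMASK(uint32_t v) { primask_reg = v; }

void critical_op(void)
{
    uint32_t saved = __get_PRIMASK();   /* remember current interrupt state */
    __disable_irq();                    /* mask interrupts                  */
    /* ... allocator call or other non-interruptible work ... */
    __set_PRIMASK(saved);               /* restore previous state, nest-safe */
}
```

On the real target the three helpers come from the CMSIS headers and compile to single MRS/MSR/CPSID instructions.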
SWI and FIQ are concepts from legacy ARM architectures. They do not exist in Cortex-M3.
You would ideally not want to perform memory allocation in an interrupt handler - even if the allocator is deterministic, it may still take a significant amount of time; I can think of few reasons why you would ever want to do that.
The best approach is to modify the TLSF code to take an RTOS mutex in each of the calls with external linkage. Other libraries I have used already contain stubs that normally do nothing, but which you can override with your own implementation to map them onto any RTOS.
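A sketch of what that wrapping looks like. The tlsf_malloc/tlsf_free names are placeholders (backed here by the C library so the sketch is self-contained), and the POSIX mutex stands in for whatever mutex primitive your RTOS provides:

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t alloc_mtx = PTHREAD_MUTEX_INITIALIZER;

static void *tlsf_malloc(size_t n) { return malloc(n); } /* placeholder for TLSF */
static void  tlsf_free(void *p)    { free(p); }          /* placeholder for TLSF */

void *safe_malloc(size_t n)
{
    pthread_mutex_lock(&alloc_mtx);     /* serialize all allocator entry points */
    void *p = tlsf_malloc(n);
    pthread_mutex_unlock(&alloc_mtx);
    return p;
}

void safe_free(void *p)
{
    pthread_mutex_lock(&alloc_mtx);
    tlsf_free(p);
    pthread_mutex_unlock(&alloc_mtx);
}
```

On the target, the mutex would be created once at startup with the RTOS's own API, before any task or allocation runs.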
Now, you cannot of course use a mutex in an ISR, but as I said you should probably not be allocating memory there either. If you really must perform allocation in an interrupt handler, then enabling/disabling interrupts is your only option, but you are then confounding all the real-time deterministic behaviour that an RTOS provides. A better solution is to have your ISR do no more than issue an event flag or semaphore to a thread-context handler. This allows you to use all RTOS services and scheduling, and the context-switch time from the ISR to a high-priority thread will be insignificant compared to the memory-allocation time.
Another possibility would be to not use this allocator at all, but instead use a fixed-block allocator using RTOS queues. You pre-allocate blocks of memory (statically or dynamically), post pointers to the start of each block onto a queue, then to allocate you simply receive a pointer from the queue, and to free you post back to the queue. If memory is exhausted (queue is empty), you can baulk or block on the queue (do not block in an ISR though). You can create multiple queues for different sized blocks, and use the one appropriate to your needs (ensuring you post back to the same queue of course!)
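A minimal sketch of that fixed-block scheme, with the RTOS queue modelled by a plain ring buffer so the example is self-contained (a real implementation would use the RTOS queue primitive instead, so that callers can block when the pool is empty, and so that it is safe across tasks):

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE  64
#define BLOCK_COUNT 8

static uint8_t pool[BLOCK_COUNT][BLOCK_SIZE]; /* statically pre-allocated blocks */
static void  *freeq[BLOCK_COUNT];             /* stands in for the RTOS queue    */
static size_t head, tail, count;

void pool_init(void)
{
    head = tail = 0;
    count = BLOCK_COUNT;
    for (size_t i = 0; i < BLOCK_COUNT; i++)
        freeq[i] = pool[i];                   /* post every block onto the queue */
}

void *block_alloc(void)
{
    if (count == 0)                 /* pool exhausted                            */
        return NULL;                /* an RTOS version could block or baulk here */
    void *p = freeq[head];
    head = (head + 1) % BLOCK_COUNT;
    count--;
    return p;
}

void block_free(void *p)
{
    freeq[tail] = p;                /* post the pointer back onto the queue */
    tail = (tail + 1) % BLOCK_COUNT;
    count++;
}
```

With multiple block sizes you would simply keep one such pool (one queue) per size.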
Would it be necessary to use a mutex for atomic operations on shared memory in a multicore environment, where one CPU only ever reads and the other CPU only ever writes? I am guessing that this may depend on the architecture, so if an example is needed, then ARM (Cortex) and/or ESP32?
I already know that a mutex is not needed for atomic operations in a single-core environment where one thread is only ever reading and the other thread only ever writing (https://www.freertos.org/FreeRTOS_Support_Forum_Archive/May_2019/freertos_Shared_variable_between_a_write_thred_and_a_read_thread_a0408decbaj.html).
One solution that has been around for decades (I already used this 30 years ago) is the concept of mailboxes.
The simplest mailbox is just a structure or buffer with a flag. The flag should be of the minimum size that can be accessed in an atomic operation by both processors sharing the memory. It should also be located at a memory address that both processors see as "aligned", to ensure single-cycle read/write accesses, e.g. a 32-bit word boundary in the case of 32-bit ARM processors. This might be tricky to implement on non-RISC-like architectures.
The flag usage is very simple. The processor that writes the data waits for the flag to signal "buffer empty" (maybe a simple null value), then writes the data to the mailbox's buffer and signals "buffer not empty" by setting a magic number in the flag (maybe any non-null value).
The processor receiving the data just has to wait for the flag to signal "buffer not empty" before reading the data, then set the flag back to "buffer empty".
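In C the mechanism might be sketched like this (mbox_send/mbox_recv and the magic value are invented for illustration; volatile stops the compiler from caching the flag, but on real multicore hardware the alignment requirement above, and possibly the memory barriers discussed in other answers here, still apply):

```c
#include <stdint.h>

#define MBOX_EMPTY 0u
#define MBOX_FULL  0xA5A5A5A5u       /* arbitrary non-null magic value */

struct mailbox {
    volatile uint32_t flag;          /* atomic on aligned 32-bit access */
    uint32_t data;
};

/* Sender: wait for "buffer empty", write the data, signal "not empty". */
void mbox_send(struct mailbox *m, uint32_t value)
{
    while (m->flag != MBOX_EMPTY)
        ;                            /* poll */
    m->data = value;
    m->flag = MBOX_FULL;
}

/* Receiver: wait for "buffer not empty", read, signal "empty" again. */
uint32_t mbox_recv(struct mailbox *m)
{
    while (m->flag != MBOX_FULL)
        ;                            /* poll */
    uint32_t v = m->data;
    m->flag = MBOX_EMPTY;
    return v;
}
```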
Whether or not you have primitives supporting this mechanism without relying on constant polling of the flag depends tightly on your hardware and operating system.
I've used this mechanism on heterogeneous architectures (processor + co-processor of different architectures/capabilities running different applications), but homogeneous multicore processors are well supported by many RTOSes today, including FreeRTOS, and other mechanisms such as queues and semaphores/mutexes are probably more appropriate for the synchronization part. Some current SoCs support hardware semaphores and memory-access interrupts that can improve performance greatly.
EDIT:
There is one FreeRTOS feature that can assist you here: message buffers. There is an example using ST's STM32H745 dual-core SoC [here], which comes with a companion article [here] written by FreeRTOS's Richard Barry.
I know this might be a strange usage. I just want to know if I can use LDREX/STREX with SCU disabled.
I am using a dual-core Cortex-A9 SoC. The two cores run in AMP mode: each core has its own OS. Although the memory controller is a shared resource, each core has its own memory space; one cannot access the other's memory space. Because no cache coherency is required, the SCU isn't enabled. At the same time, I also have a shared memory region that both cores can access. The shared memory region is non-cached to avoid cache-coherency issues.
I define a spin lock in this shared memory region. This spin lock is used to protect shared resource accessing. Right now, the spin lock is implemented simply like this:
void spin_lock(uint32_t *lock)
{
    while (*lock)
        ;          /* wait for the lock to become free            */
    *lock = 1;     /* take it (not atomic with the test above)    */
}

void spin_unlock(uint32_t *lock)
{
    *lock = 0;
}
where lock is a variable in shared memory, so both cores can access it.
The problem with this implementation is that access to lock is not exclusive: the other core can take the lock between the test and the set. That's why I want to use LDREX/STREX to implement the spin lock. Please allow me to restate my question:
Can I use LDREX/STREX without SCU enabled?
Thank you!
So ... the direct answer to your question is that yes, it is possible - so long as something else out in the memory system implements an exclusive monitor for the shared memory region. If nothing does, then your STREXs will always receive an OKAY response (rather than EXOKAY), observable as a failure in the result register.
However, why would you not enable the SCU?
Clearly, what you are trying to do requires a coherent view of memory between the two operating systems, at least for that region. And with PIPT data caches, you are not going to see any aliasing of cache lines regardless of how they are mapped in each image.
Overall, the answer is no. There are two issues here:
1) You cannot use load/store exclusive on uncached memory. The exclusive operations operate only on "normal", idempotent memory.
2) The ARM manual doesn't specify how exclusive monitors work in conjunction with memory coherence, but any sane implementation is essentially going to put the monitor in the cache-line acquisition mechanism. If you disable cache-line snooping, you have most likely rendered the monitors non-functional on your chip.
Your only (poorly formed) question,
Can I use LDREX/STREX without SCU enabled?
In an ideal ARM universe, yes, it is possible. That is, it is possible that somewhere, some day, you might be able to do this. I think you mean:
Can I use LDREX/STREX without SCU enabled in my system?
Unfortunately, the ARM ARM is a bit of a political/bureaucratic document. You must take extreme care when reading "strongly advised", "UNPREDICTABLE", "UNKNOWN" and "can". All programmers would like ldrex/strex to apply to all memory. In fact, if the bus controller (typically an AXI NIC) implemented a monitor, there would be no trouble supporting the much-loved swp instruction. There are various posts on Stack Overflow where people want to replace swp with ldrex/strex.
After you read and re-read the doublespeak of the ARM ARM (it is written for the programmer, but also for the silicon implementer), it becomes pretty clear that the monitor logic is probably implemented in the cache. A cache controller must implement dirty-line broadcasts. Dirty-line broadcasts are very similar to a 'monitor', and your 'reservation granule' is most likely a cache line in size (what a coincidence).
The ARM ARM is written as a generic document for anyone who may wish to implement a Cortex-A CPU. It is written so that their hands (creativity) are not tied to implementing the monitor within the cache.
So you need to read the specific documentation for your particular Cortex-A9 SoC. It will probably only support ldrex/strex on cached memory. In fact, it is advisable to issue a pld to ensure the memory is in cache before doing the ldrex, and this will mean you need to activate the SCU in your system. I guess you are concerned about the additional cycle(s) of latency that the SCU will add?
I think some of this information has confused many extremely intelligent people. Beware the difference between possible and is. Everyone on Stack Overflow probably desires the case where the monitor is implemented in the bus controller (or core memory chip). However, for most real chips, this is not the case.
For certain, if you want to future-proof your code/OS for porting to newer or different Cortex-A CPUs, you should not make this assumption even if your chipset does support a 'global monitor' outside the cache subsystems.
I am working with a multithreaded bare-metal C/Assembler application on a Cortex-A9.
I have some shared variables, i.e. addresses that are used by more than one thread. To perform an atomic exchange of a variable's value I use LDREX and STREX. Now my question is whether I need LDREX and STREX on every access to one of these variables, even if interrupts are disabled.
Assume the following example:
Thread 1 uses LDREX and STREX to exchange the value at address a.
Thread 2 disables interrupts, uses normal LDR and STR to exchange the value at address a, does something else that should not be interrupted, and then enables interrupts again.
What happens if Thread 1 gets interrupted right after its LDREX by Thread 2? Does the STREX in Thread 1 still recognize that there was an access to address a, or do I have to use LDREX and STREX in Thread 2, too?
LDREX/STREX are something that has to be implemented by the chip vendor, hopefully to ARM's specification. You can and should get the ARM documentation on the topic; in this case, in addition to the ARM ARMs and TRMs, you should get the AMBA AXI documentation.
So if you have
ldrex thread 1
interrupt
ldrex thread 2
strex thread 2
return from interrupt
strex thread 1
Between the thread 2 ldrex and strex there has been no modification of that memory location, so the strex should work. But between the thread 1 ldrex and its strex there has been a modification to that location: the thread 2 strex. So in theory the thread 1 strex should fail, and you have to retry your thread 1 ldrex/strex pair until it works. But that is exactly the design: you keep trying the ldrex/strex pair in a loop until it succeeds.
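The retry loop looks like this in C (a sketch using the GCC/Clang __atomic builtins, which on ARMv7 compile down to exactly such an ldrex/strex loop: load exclusive, compare, store exclusive, branch back on failure; the function names match the question's example):

```c
#include <stdint.h>

void spin_lock(uint32_t *lock)
{
    uint32_t expected;
    do {
        expected = 0;   /* we may only take the lock if it is currently free */
    } while (!__atomic_compare_exchange_n(lock, &expected, 1,
                                          0, /* strong CAS: no spurious fail */
                                          __ATOMIC_ACQUIRE,
                                          __ATOMIC_RELAXED));
}

void spin_unlock(uint32_t *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
```

The compare-exchange fails (and the loop retries) whenever the strex reports that the location was touched between the exclusive load and the exclusive store.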
But this is all implementation defined, so you have to look at the specific chip vendor, model and revision, and do your own experiments. The bug in Linux, for example, is that the ldrex/strex retry is an infinite loop; apply it to a system/situation where ldrex/strex is not supported and you get an OKAY instead of an EXOKAY, so the strex fails forever and you are stuck in that loop forever (ever wonder how I know all of this? I had to debug this problem at the logic level).
First off, ARM documents that exclusive-access support is not required for uniprocessor systems, so the ldrex/strex pair CAN fail to work IF you touch vendor-specific logic on single-core systems. Uniprocessor or not, if your ldrex/strex stays within the ARM logic (the L1 and optional L2 caches), then the pair is governed by ARM and not the chip vendor, so you fall under one set of rules; if the pair touches system memory outside the ARM core, then you fall under the vendor's rules.
The big problem is that ARM's documentation is unusually incomplete on this topic. Depending on which manual you read, and where in the manual, it says for example that the strex fails if some OTHER master has modified that location; in your case it is the same master, so the location has been modified, but since it was modified by you, the second strex should succeed. Then the same document says that another exclusive read resets the monitor to a different address; well, what if it is another exclusive read of the same address?
Basically yours is a question about two exclusive writes to the same address without an exclusive read in between: does/should the second succeed? A very good question... I can't see that there is a definitive answer, either across all the ARM cores or across the whole world of ARM-based chips.
The bottom line with ldrex/strex is that it is not purely ARM-core specific but also chip (vendor) specific. You need to do experiments to ensure you can use that instruction pair on that system (uniprocessor or not). You need to know what the ARM core does (the caches) and what happens when the exclusive access goes out past the core to the vendor logic. Repeat for every core and vendor you care to port this code to.
Apologies for just throwing an "it's wrong" statement at dwelch, but I did not have time to write a proper answer yesterday. dwelch's answer to your question is correct - but pieces of it are at the very least easy to misinterpret.
The short answer is that, yes, you need to either disable interrupts for both threads or use ldrex/strex for both threads.
But to set one thing straight: support for ldrex/strex is mandatory in all ARM processors of architecture v6 or later (with the exception of v6-M microcontrollers). Support for SWP, however, is optional for certain ARMv7 processors.
The behaviour of ldrex/strex is dependent on whether your MMU is enabled and what memory type and attributes the accessed region is configured with. Certain possible configurations will require additional support to be added to either the interconnect or RAM controllers in order for ldrex/strex to be able to operate correctly.
The entire concept is based around the idea of local and global exclusive monitors. If operating on memory regions marked as non-shareable (in a uniprocessor configuration), the processor needs only be concerned with the local monitor.
In multi-core configurations, coherent regions are managed using what is architecturally considered to be a global monitor, but still resides within the multi-core processor and does not rely on externally implemented logic.
Now, dwelch is correct in that there are way too many "implementation defined" options surrounding this. The sequence you describe is NOT architecturally guaranteed to work. The architecture does not require that an str transitions the local (or global) monitor from the exclusive to the open state (although certain implementations might do so).
Hence, the architecturally safe options are:
Use ldrex/strex in both contexts.
Disable interrupts in both contexts.
I understand that DSB, DMB, and ISB are barriers that prevent reordering of instructions.
I can also find lots of very good explanations for each of them, but it is pretty hard to imagine the cases where I would actually have to use them.
Also, I see those barriers from time to time in open-source code, but it is quite hard to understand why they are used. Just as an example, in the Linux 3.7 kernel's tcp_rcv_synsent_state_process function, there is a passage as follows:
if (unlikely(po->origdev))
sll->sll_ifindex = orig_dev->ifindex;
else
sll->sll_ifindex = dev->ifindex;
smp_mb();
if (po->tp_version <= TPACKET_V2)
__packet_set_status(po, h.raw, status);
where smp_mb() is basically DMB.
Could you give me some of your real-life examples? It would help me understand more about barriers.
Sorry, not going to give you a straight-out example like you're asking, because as you are already looking through the Linux source code, you have plenty of those to go around, and they don't appear to help. No shame in that - every sane person is at least initially confused by memory access ordering issues :)
If you are mainly an application developer, then there is every chance you won't need to worry too much about it - whatever concurrency frameworks you use will resolve it for you.
If you are mainly a device driver developer, then examples are fairly straightforward to find - whenever there is a dependency in your code on a previous access having had an effect (cleared an interrupt source, written a DMA descriptor) before some other access is performed (re-enabling interrupts, initiating the DMA transaction).
If you are in the process of developing a concurrency framework (or debugging one), you probably need to read up on the topic a bit more - but your question suggests a superficial curiosity rather than an immediate need?
If you are developing your own method for passing data between threads, not based on primitives provided by a concurrency framework, that is for all intents and purposes a concurrency framework.
Paul McKenney wrote an excellent paper on the need for memory barriers, and what effects they actually have in the processor: Memory Barriers: a Hardware View for Software Hackers
If that's a bit too hardcore, I wrote a 3-part blog series that's a bit more lightweight, and finishes off with an ARM-specific view. First part is Memory access ordering - an introduction.
But if it is specifically lists of examples you are after, especially for the ARM architecture, you could do a lot worse than Barrier Litmus Tests and Cookbook.
The extra-extra light programmer's view and not entirely architecturally correct version is:
DMB - whenever a memory access requires ordering with regards to another memory access.
DSB - whenever a memory access needs to have completed before program execution progresses.
ISB - whenever instruction fetches need to explicitly take place after a certain point in the program, for example after memory map updates or after writing code to be executed. (In practice, this means "throw away any prefetched instructions at this point".)
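As a concrete sketch of the DMB case, here is the publication pattern from the question's kernel snippet written with C11 fences, which on ARM compile down to DMB (publish/consume are invented names for illustration):

```c
#include <stdatomic.h>

static int payload;          /* the data being handed over           */
static atomic_int ready;     /* the flag the consumer polls          */

void publish(int v)
{
    payload = v;                                  /* write the data first */
    atomic_thread_fence(memory_order_release);    /* DMB on ARM           */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

int consume(void)
{
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                                         /* wait for the flag    */
    atomic_thread_fence(memory_order_acquire);    /* DMB on ARM           */
    return payload;                               /* data is now valid    */
}
```

Without the fences, the compiler or the processor would be free to make the flag store visible before the data store, and the consumer could read stale data.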
Usually you need a memory barrier in cases where you have to make SURE that memory accesses occur in a specific order. This might be required for a number of reasons; usually it's required when two or more processes/threads or a hardware component access the same memory structure, which has to be kept consistent.
It's used very often in DMA transfers. A simple DMA control structure might look like this:
struct dma_control {
    u32   owner;
    void *data;
    u32   len;
};
The owner field will usually be set to something like OWNER_CPU or OWNER_HARDWARE, to indicate which of the two participants is allowed to work with the structure.
Code which changes this will usually look like this:
dma->data = data;
dma->len = length;
smp_mb();
dma->owner = OWNER_HARDWARE;
So data and len are always set before the ownership is transferred to the DMA hardware. Otherwise the engine might see stale data, like a pointer or length which was not updated, because the CPU reordered the memory accesses.
The same goes for processes or threads running on different cores. They could communicate in a similar manner.
One simple example of a barrier requirement is a spinlock. If you implement a spinlock using compare-and-swap (or LDREX/STREX on ARM) without a barrier, the processor is allowed to speculatively load values from memory and lazily store computed values to memory, and neither of those is required to happen in the order of the loads/stores in the instruction stream.
The DMB in particular prevents memory accesses from being reordered around it. Without a DMB, the processor could reorder a store to memory protected by the spinlock to after the spinlock is released. Or the processor could read memory protected by the spinlock before the spinlock was actually locked, or while it was locked by a different context.
unixsmurf already pointed it out, but I'll also point you toward Barrier Litmus Tests and Cookbook. It has some pretty good examples of where and why you should use barriers.
I wanted to know how to implement my own threading library.
What I have is a CPU (PowerPC architecture) and the C Standard Library.
Is there an open source light-weight implementation I can look at?
At its very simplest a thread will need:
Some memory for stack space
Somewhere to store its context (i.e. register contents, program counter, stack pointer, etc.)
On top of that you will need to implement a simple "kernel" that is responsible for the thread switching. And if you're trying to implement pre-emptive threading, then you'll also need a periodic source of interrupts, e.g. a timer, in which case you can execute your thread-switching code in the timer interrupt.
Take a look at the setjmp()/longjmp() routines, and the corresponding jmp_buf structure. This will give you easy access to the stack pointer so that you can assign your own stack space, and will give you a simple way of capturing all of the register contents to provide your thread's context.
Typically the longjmp() function is a wrapper for a return-from-interrupt instruction, which fits very nicely with having the thread-scheduling functionality in the timer interrupt. You will need to check the implementation of longjmp() and jmp_buf for your platform, though.
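A minimal, portable illustration of the save/restore idea (this only shows the jump mechanics on a single stack; a real thread library would additionally install a separate stack pointer into each thread's jmp_buf, which is platform-specific and not shown here):

```c
#include <setjmp.h>

int trace[4];                /* records the order of execution */
int n;

static jmp_buf ctx;

static void fake_thread(void)
{
    trace[n++] = 2;
    longjmp(ctx, 1);         /* "yield": restore the saved context */
}

int demo(void)
{
    trace[n++] = 1;
    if (setjmp(ctx) == 0)    /* saves registers, SP and PC; returns 0 now...   */
        fake_thread();
    trace[n++] = 3;          /* ...and returns non-zero here after the longjmp */
    return n;                /* trace ends up as {1, 2, 3}                     */
}
```

A cooperative scheduler built this way keeps one jmp_buf per thread and longjmps from the yielding thread into the next runnable one.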
Try looking at thread implementations for smaller microprocessors, which typically don't have OSes, e.g. the Atmel AVR or Microchip PIC.
For example : discussion on AVRFreaks
For a decent thread library you need:
atomic operations to avoid races (to implement e.g a mutex)
some OS support to do the scheduling and to avoid busy waiting
some OS support to implement context switching
All three are beyond the scope of what C99 offers you. Atomic operations are introduced in C11, but so far C11 implementations don't seem to be ready, so these are usually implemented in assembler. For the latter two, you'd have to rely on your OS.
Maybe you could look at C++11, which has threading support. I'd start by picking some of its most useful primitives (futures, for example), see how they work, and do a simple implementation.
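For reference, where C11 is supported, this is roughly what the atomic building block for the mutex case looks like (a sketch; my_lock/my_unlock are invented names, and a real library would ask the OS to block instead of spinning):

```c
#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void my_lock(void)
{
    while (atomic_flag_test_and_set(&lock_flag))
        ;                    /* spin until we are the one who set the flag */
}

void my_unlock(void)
{
    atomic_flag_clear(&lock_flag);
}
```

atomic_flag is the one C11 type guaranteed to be lock-free, which makes it the natural primitive to build a mutex on.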