Critical sections in ARM - c

I am experienced in implementing critical sections on the AVR family of processors, where all you do is disable interrupts (with a memory barrier, of course), perform the critical operation, and then re-enable interrupts:
#include <avr/interrupt.h> // for cli()/sei()

void my_critical_function()
{
    cli(); // Disable interrupts
    // Mission-critical code here
    sei(); // Re-enable interrupts
}
Now my question is this:
Does this simple method apply to the ARM architecture as well? I have heard things about the processor doing lookahead on instructions, and other black magic, and was wondering primarily whether these types of things could be problematic for this implementation of critical sections.

Assuming you're on a Cortex-M processor, take a look at the LDREX and STREX instructions, which are available in C via the __LDREXW() and __STREXW() macros provided by CMSIS (the Cortex Microcontroller Software Interface Standard). They can be used to build extremely lightweight mutual exclusion mechanisms.
Basically,
data = __LDREXW(address)
works like data = *address except that it sets an 'exclusive access flag' in the CPU. When you've finished manipulating your data, write it back using
success = __STREXW(data, address)
which is like *address = data but will only succeed in writing if the exclusive access flag is still set. If it does succeed in writing then it also clears the flag. It returns 0 on success and 1 on failure. If the STREX fails, you have to go back to the LDREX and try again.
For simple exclusive access to a shared variable, nothing else is required. For example:
do {
    data = __LDREXW(address);
    data++;
} while (__STREXW(data, address));
The interesting thing about this mechanism is that it's effectively 'last come, first served'; if this code is interrupted and the interrupt uses LDREX and STREX, the interrupt's STREX will succeed and the (lower-priority) user code will have to retry.
If you're using an operating system, the same primitives can be used to build 'proper' semaphores and mutexes (see this application note, for example); but then again if you're using an OS you probably already have access to mutexes through its API!
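For illustration only, here is a minimal sketch of how these primitives could back a simple lock (assuming the CMSIS intrinsics and a lock word in ordinary RAM; a real RTOS mutex would add ownership tracking and blocking/waiting):
/* Assumes the CMSIS device header (e.g. "stm32f4xx.h") has been included. */
#include <stdint.h>

static volatile uint32_t lock = 0;   /* 0 = free, 1 = taken */

int try_lock(void)
{
    if (__LDREXW(&lock) != 0) {
        __CLREX();                    /* lock already taken: drop the exclusive monitor */
        return 0;                     /* failed to acquire */
    }
    return __STREXW(1, &lock) == 0;   /* STREX returns 0 on success */
}

void unlock(void)
{
    __DMB();                          /* make protected writes visible before releasing */
    lock = 0;
}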

The ARM architecture family is very wide; as I understand it, you probably mean ARM Cortex-M microcontrollers.
You can use this technique, but many ARM MCUs offer much more. As I do not know the actual hardware, I can only give you some examples:
Bit-band regions: in these memory regions you can set and reset individual bits atomically.
Hardware semaphores (STM32H7).
Hardware mutexes (some NXP MCUs).
etc.
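To illustrate the bit-band idea, here is a sketch using the standard Cortex-M3/M4 SRAM bit-band mapping (the variable name is made up, and the variable must actually be placed in the bit-band region, i.e. the first 1 MB of SRAM):
#include <stdint.h>

/* Each bit of the SRAM region starting at 0x20000000 has its own word-sized alias
 * starting at 0x22000000, so a single store sets or clears one bit atomically. */
#define BITBAND_SRAM(byte_addr, bit) \
    (*(volatile uint32_t *)(0x22000000u + (((uint32_t)(byte_addr) - 0x20000000u) * 32u) + ((bit) * 4u)))

volatile uint32_t flags;   /* must reside in the bit-band region */

void set_flag3(void)
{
    BITBAND_SRAM(&flags, 3) = 1;   /* atomically sets bit 3 of 'flags' */
}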

Related

Benchmarking on CMSIS RTOS, Cortex-M33

I'm trying to time the duration of a function on a Cortex-M33 with CMSIS RTOS. I'm currently reading cycles directly from the ARM_CM_DWT_CYCCNT register.
This is working, but I'm wondering whether I can do anything else to increase the precision / reduce the variance of my measurement, e.g. limit interrupts?
Some third party code has included the use of int_lock() and int_unlock(lock) but I can't find any CMSIS RTOS documentation of this usage.
Your "third-pary" routines are most likely some sort of wrapper around a CPSR manipulation. Intuitively I'd expect to wrap the code section in something like :
lock_handle = int_lock() ;
...
int_unlock( lock_handle );
But unless you have documentation or access to the source of these functions to be certain how they behave, you might do well to avoid using them.
CMSIS RTOS does not provide facilities to disable interrupts or implement critical sections. In general, needing critical sections in your application is a sign of a design failure. More generally, however, CMSIS-Core (CMSIS is more than just CMSIS RTOS) has the NVIC API with NVIC_DisableIRQ() and NVIC_EnableIRQ() functions to enable/disable individual interrupts (in case some, such as SysTick, are essential to your benchmark).
The means to globally disable all interrupts are part of the CMSIS-Core register access API, which defines __disable_irq() and __enable_irq().
It is likely that the third-party enable/disable functions provide a handle to ensure that enabling and disabling are correctly paired, or perhaps a nesting counter so that only the outermost enable of a nested disable re-enables interrupts. These are all mechanisms that are trivial to implement using the available CMSIS primitives, but may not be needed in your case.
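For example, a minimal sketch of such a pairing wrapper built on the CMSIS-Core primitives (the function names mirror the third-party ones but are otherwise an assumption):
#include <stdint.h>
/* Assumes the device header has pulled in CMSIS-Core. */

static inline uint32_t int_lock(void)
{
    uint32_t primask = __get_PRIMASK(); /* remember whether interrupts were already masked */
    __disable_irq();                    /* set PRIMASK: mask all maskable interrupts */
    return primask;                     /* caller keeps this as the "lock handle" */
}

static inline void int_unlock(uint32_t primask)
{
    __set_PRIMASK(primask);             /* restore previous state; re-enables only if
                                           interrupts were enabled before int_lock() */
}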

How could we sleep when we are executing a syscall that execute in interrupt mode

When I execute a system call to do a write or something else, the handler corresponding to the exception executes in interrupt mode (on Cortex-M3 the IPSR register has a non-zero value, 0xb). And what I have learned is that when we execute code in interrupt mode we cannot sleep and cannot use functions that might block...
My question is: is there any kind of mechanism with which the ISR could still execute in interrupt mode and at the same time use functions that might block, or is there some kind of trick that is implemented?
Caveat: This is more of a comment than an answer but is too big to fit in a comment or series of comments.
TL;DR: Needing to sleep or execute a blocking operation from an ISR is a fundamental misdesign. This seems like an XY problem: https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
Doing sleep [even in application code] is generally a code smell. Why do you feel you need to sleep [vs. some event/completion driven mechanism]?
Please edit your question and add clarification [i.e. don't just add comments]
When I am executing a system call to do write or something else
What is your application doing? A write to what device? What "something else"?
What is the architecture/board, kernel/distro? (e.g. Raspberry Pi running Raspbian? nvidia Jetson? Beaglebone? Xilinx FPGA with petalinux?)
What is the H/W device and where did the device driver come from? Did you write the device driver yourself or is it a standard one that comes with the kernel/distro? If you wrote it, please post it in your question.
Is the device configured properly? (e.g.) Are the DTB entries correct?
Is the device a block device, such as a disk controller? Or, is it a character device, such as a UART? Does the device transfer data via DMA? Or, does it transfer data by reading/writing to/from an IO port?
What do you mean by "exception"? Generally, an exception is an abnormal condition (e.g. segfault, bus error, etc.). Please describe the exact context/scenario in which this occurs.
Generally, an ISR does little things. (e.g.) Grab [and save] status from the device. Clear/rearm the interrupt in the interrupt controller. Start the next queued transfer request. Wake up the sleeping base level task (usually the task that executed the syscall [waiting on a completion event in kernel mode]).
More elaborate actions are generally deferred and handled in the interrupt's "bottom half" handler and/or tasklet. Or, the base level is woken up and it handles the remaining processing.
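To make that split concrete, here is a minimal sketch (device and function names are hypothetical) of the usual pattern: the ISR only acknowledges the hardware and defers anything that might block to a workqueue running in process context:
#include <linux/interrupt.h>
#include <linux/workqueue.h>

static struct work_struct mydev_work;   /* initialise with INIT_WORK() in probe */

static void mydev_work_fn(struct work_struct *work)
{
    /* Process context: sleeping and blocking calls are allowed here. */
}

static irqreturn_t mydev_isr(int irq, void *dev_id)
{
    /* Interrupt context: acknowledge the device, never sleep. */
    schedule_work(&mydev_work);         /* defer the rest to the bottom half */
    return IRQ_HANDLED;
}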
What kernel subsystems are involved? Are you using platform drivers? Are you interfacing from within the DMA device driver framework? Are message buses involved (e.g. I2C, SPI, etc.)?
Interrupt and device handling in the linux kernel is somewhat different than what one might do in a "bare metal" system or RTOS (e.g. FreeRTOS). So, if you're coming from those environments, you'll need to think about restructuring your driver code [and/or application code].
What are your requirements for device throughput and latency?
You may wish to consult a good book on linux device driver design. And, you may wish to consult the kernel's Documentation subdirectory.
If you're able to provide more information, I may be able to help you further.
UPDATE:
A system call is not really in the same class as a hardware interrupt as far as the kernel is concerned, even if the CPU hardware uses the same sort of exception vector mechanisms for handling both hardware and software interrupts. The kernel treats the system call as a transition from user mode to kernel mode. – Ian Abbott
This is a succinct/great explanation. The "mode" or "context" has little to do with how we got/get there from a H/W mechanism.
The CPU doesn't really "understand" interrupt mode [as defined by the kernel]. It understands "supervisor" vs "user" privilege level [sometimes called "mode"].
When executing at user privilege level, an interrupt/exception will cause a transition from "user" level to "supervisor" level. The CPU may have a special register that specifies the address of the [initial] supervisor stack pointer. Atomically, it swaps in the value, pushing the user SP onto the new kernel stack.
If the interrupt is interrupting a CPU that is already at supervisor level, the existing [supervisor] SP will be used unchanged.
Note that x86 has privilege "ring" levels. User mode is ring 3 and the highest [most privileged] level is ring 0. For arm, some arches can have a "hypervisor" privilege level [which is higher privilege than "supervisor" privilege].
The setup of the mode/context is handled in arch/arm/kernel/entry-*.S code.
An svc is a synchronous interrupt [generated by a special CPU instruction]. The resulting context is the context of the currently executing thread. It is analogous to "call function in kernel mode". The resulting context is "kernel thread mode". At that point, it's not terribly useful to think of it as an "interrupt" anymore.
In fact, on some arches, the syscall instruction/mechanism doesn't use the interrupt vector table. It may have a fixed address or use a "call gate" mechanism (e.g. x86).
Each thread has its own stack which is different than the initial/interrupt stack.
Thus, once the entry code has established the context/mode, it is not executing in "interrupt mode". So, the full range of kernel functions is available to it.
An interrupt from a H/W device is asynchronous [may occur at any time the CPU's internal interrupt enable flag is set]. It may interrupt a userspace application [executing in application mode] OR kernel thread mode OR an existing ISR executing in interrupt mode [from another interrupt]. The resulting ISR is executing in "interrupt mode" or "ISR mode".
Because the ISR can interrupt a kernel thread, it may not do certain things. For example, if the CPU were in [kernel] thread mode, and it was in the middle of a kmalloc call [GFP_KERNEL], the ISR would see partial state and any action that tried to adjust the kernel's heap would result in corruption of the heap.
This is a design choice by linux for speed.
Kernel ISRs may be classified as "fast interrupts". The ISR executes with the CPU interrupt enable [IE] flag cleared. No normal H/W interrupt may interrupt the ISR.
So, if another H/W device asserts its interrupt request line [in the external interrupt controller], that request will be "pending". That is, the request line has been asserted but the CPU has not acknowledged it [and the CPU has not jumped via the interrupt table].
The request will remain pending until the CPU allows further interrupts by asserting IE. Or, the CPU may clear the pending interrupt without taking action by clearing the pending interrupt in the interrupt controller.
For a "slow" interrupt ISR, the interrupt entry code will clear the interrupt in the external interrupt controller. It will then rearm interrupts by setting IE and call the ISR. This ISR can be interrupted by other [higher priority] interrupts. The result is a "stacked" interrupt.
I have been searching all over the place, and the conclusion I come to is that interrupts have a higher priority than some of the exceptions in the Linux kernel.
An exception [synchronous interrupt] can be interrupted if the IE flag is enabled. An svc is treated differently but after the entry code is executed, the IE flag is set, so the actual syscall code [executing in kernel thread mode] can be interrupted by a H/W interrupt.
Or, in limited circumstances, the kernel code can generate an exception (e.g. a page fault caused by a kernel action [which is usually deemed fatal]).
but I am still looking at how exactly the context switch happens when executing an exception and letting the processor execute in thread mode while the SVCall exception is pending (was preempted and has not returned yet)... I think when I understand that, it will be clearer to me.
I think you have to be very careful with the terminology. In particular, when combining terms from disparate sources. Although user mode, kernel thread mode, or interrupt mode can be considered a context [in the dictionary sense of the word], context switching usually means that the current thread is suspended, the scheduler selects a new thread to run and resumes it. That is separate from the user-to-kernel transition.
And if there is any recommended resources about that for ARM-Cortex-M3/4, it would be nice
Here is something: https://interrupt.memfault.com/blog/arm-cortex-m-exceptions-and-nvic But, be very careful in applying the terminology therein. What it considers "pending" only exists in the kernel during the entry code. What is more relevant is what the kernel does to set up mode/context and the terms are not equivalent.
So, from the kernel's standpoint, it's probably better to not consider an svc as "pending".

DSB on ARM Cortex M4 processors

I have read the ARM documentation, and it appears to say in some places that the Cortex-M4 can reorder memory writes, while in other places it indicates that the M4 will not.
Specifically, I am wondering whether the DSB instruction is needed, as in:
volatile int flag = 0;
char buffer[10];

void foo(char c)
{
    __ASM volatile ("dsb" : : : "memory");
    __disable_irq(); // disable IRQ as we use flag in the ISR
    buffer[0] = c;
    flag = 1;
    __ASM volatile ("dsb" : : : "memory");
    __enable_irq();
}
Uh, it depends on what your flag is, and it also varies from chip to chip.
In case that flag is stored in memory:
DSB is not needed here. An interrupt handler that would access flag would have to load it from memory first. Even if your previous write is still in progress the CPU will make sure that the load following the store will happen in the correct order.
If your flag is stored in peripheral memory:
Now it gets interesting. Let's assume flag is in some hardware peripheral. A write to it may make an interrupt pending or acknowledge an interrupt (i.e. clear a pending interrupt). Contrary to the memory example above, this effect happens without the CPU having to read the flag first. So the automatic ordering of stores and loads won't help you. Also, writes to flag may take effect with a surprisingly long delay due to different clock domains between the CPU and the peripheral.
So the following scenario can happen:
you write flag=1 to clear a handled interrupt.
you enable interrupts by calling __enable_irq()
interrupts get enabled; the write of flag=1 is still pending.
wheee, an interrupt is pending and the CPU jumps to the interrupt handler.
flag=1 takes effect. You're now in an interrupt handler without anything to do.
Executing a DSB in front of __enable_irq() will prevent this problem because whatever is triggered by flag=1 will be in effect before __enable_irq() executes.
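In code, that recommendation looks roughly like this (a sketch assuming the CMSIS intrinsics and a flag that lives in a peripheral register; the function name is made up):
void clear_irq_and_reenable(volatile uint32_t *periph_flag)
{
    *periph_flag = 1;   /* acknowledge/clear the interrupt at the peripheral */
    __DSB();            /* wait for the write to complete before continuing */
    __enable_irq();     /* only now allow interrupts again */
}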
If you think that this case is purely academic: Nope, it's real.
Just think about a real-time clock. These usually run at 32 kHz. If you write into its peripheral space from a CPU running at 64 MHz, it can take a whopping 2000 cycles before the write takes effect. For real-time clocks the data sheet usually shows specific sequences that make sure you don't run into this problem.
The same thing can however happen with slow peripherals.
My personal anecdote happened when implementing power saving late in a project. Everything was working fine. Then we reduced the peripheral clock speed of the I²C and SPI peripherals to the lowest possible speed we could get away with. This can save lots of power and extend battery life. What we found was that interrupts suddenly started to do unexpected things. They seemed to fire twice each time, wreaking havoc. Putting a DSB at the end of each affected interrupt handler fixed this because - you can guess - the lower clock speed meant we left the interrupt handlers before the clearing of the interrupt source had taken effect, due to the slow peripheral clock.
This section of the Cortex-M4 Devices Generic User Guide enumerates the factors which can affect reordering:
the processor can reorder some memory accesses to improve efficiency, providing this does not affect the behavior of the instruction sequence.
the processor has multiple bus interfaces
memory or devices in the memory map have different wait states
some memory accesses are buffered or speculative.
You should also bear in mind that both DSB and ISB are often required (in that order), and that C does not make any guarantees about the ordering (except for in-thread volatile accesses).
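As a sketch (CMSIS intrinsics assumed), the usual pairing looks like:
__DSB();   /* ensure all outstanding memory accesses have completed */
__ISB();   /* flush the pipeline so following instructions see the new context */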
You will often observe that the short pipeline and instruction sequences can combine in such a way that the race conditions seem unreachable with a specific compiled image, but this isn't something you can rely on. Either the timing conditions might be rare (but possible), or subsequent code changes might change the resulting instruction sequence.

Atomic Block for reading vs ARM SysTicks

I am currently porting my DCF77 library (you may find the source code at GitHub) from Arduino (AVR based) to Arduino Due (ARM Cortex M3). I am an absolute beginner with the ARM platform.
With the AVR-based Arduino I can use avr-libc to get atomic blocks. Basically this blocks all interrupts during the block and re-enables them afterwards. For the AVR this was fine. Now for the ARM Cortex things start to get complicated.
First of all: for the current uses of the library this approach would work as well. So my first question is: is there something similar to the "ATOMIC" macros of avr-libc for ARM? Obviously other people have thought of something in this direction. Since I am using gcc I could enhance these macros to work almost exactly like the avr-libc ATOMIC macros. I already found some CMSIS documentation; however, this seems only to provide an "enable_irq" macro instead of a "restore_irq" macro.
Question 1: is there any library out there (for gcc) that already does this?
Because ARM has interrupts with different priorities, I could establish atomicity in different ways as well. In my case the "atomic" blocks must only make sure that they are not interrupted by the SysTick interrupt. So I would not actually need to block everything to make my blocks "atomic enough". Searching further, I found an ARM synchronization primitives article in the developer infocenter. In particular there is a hint at lockless programming. According to the article this is an advanced concept and there are many publications on it. Searching the net I found only general explanations of the concept, e.g. here. I assume that a lockless implementation would be very cool, but at this time I don't feel confident enough on ARM to implement it from scratch.
Question 2: does anyone have some hints for me on lockless reads of memory blocks on ARM Cortex M3?
As I already said, I only need to protect the lower-priority thread from SysTick. So another option would be to disable SysTick briefly. Since I am implementing a timing-sensitive clock algorithm, this must not slow down the overall SysTick frequency in the long run. Introducing some small jitter would be OK though. At this time I find this most attractive.
Question 3: is there any good way to block SysTick interrupts without losing any ticks?
I also found the CMSIS documentation for semaphores. However, I am somewhat overwhelmed. In particular I am wondering whether I should use CMSIS at all and how to do this on an Arduino Due.
Question 4: What would be my best option? Or where should I continue reading?
Partial Answer:
With the hint from Notlikethat I implemented:
#if defined(ARDUINO_ARCH_AVR)

#include <util/atomic.h>
#define CRITICAL_SECTION ATOMIC_BLOCK(ATOMIC_RESTORESTATE)

#elif defined(ARDUINO_ARCH_SAM)

// Workaround as suggested by Stack Overflow user "Notlikethat"
// http://stackoverflow.com/questions/27998059/atomic-block-for-reading-vs-arm-systicks
static inline int __int_disable_irq(void) {
    int primask;
    asm volatile("mrs %0, PRIMASK\n" : "=r"(primask));
    asm volatile("cpsid i\n");
    return primask & 1;                   // 1 --> interrupts were already masked on entry
}

static inline void __int_restore_irq(int *primask) {
    if (!(*primask)) {
        asm volatile ("" ::: "memory");   // compiler barrier before unmasking
        asm volatile("cpsie i\n");
    }
}

// This critical section macro borrows heavily from
// avr-libc util/atomic.h
// --> http://www.nongnu.org/avr-libc/user-manual/atomic_8h_source.html
#define CRITICAL_SECTION \
    for (int primask_save __attribute__((__cleanup__(__int_restore_irq))) = __int_disable_irq(), \
         __ToDo = 1; __ToDo; __ToDo = 0)

#else
#error Unsupported controller architecture
#endif
This macro does more or less what I need. However, I find there is room for improvement, as it blocks all interrupts although it would be sufficient to block only SysTick. So Question 3 is still open.
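For illustration, a hypothetical use of the macro (the variable name is made up):
volatile unsigned int shared_counter;

void bump_counter(void) {
    CRITICAL_SECTION {
        shared_counter++;   // interrupts are masked for the duration of this block
    }
}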
Most of what you've referenced is about synchronising memory accesses between multiple CPUs, or pre-emptively scheduled threads on the same CPU, which seems entirely inappropriate given the stated situation. "Atomicity" in that sense refers to guaranteeing that when one observer is updating memory, any observer reading memory sees either the initial state, or the updated state, but never something part-way in between.
"Atomicity" with respect to interrupts follows the same principle - i.e. ensuring that if an interrupt occurs, a sequence of code has either not run at all, or run completely - but is a conceptually different thing1. There are only two things guaranteed to be atomic w.r.t. interrupts: a single instruction2, or a sequence of instructions executed with interrupts disabled.
The "right" way to achieve that is indeed via the CPSID/CPSIE instructions, which are wrapped in the __disable_irq()/__enable_irq() intrinsics. Note that there are two "stages" of interrupt handling in the system: the M3 core itself only has a single IRQ signal - it's the external NVIC's job to do all the routing/multiplexing/prioritisation of the system IRQs into this one line. When the CPU wants to enter a critical section, all it needs to do is mask its own IRQ input with CPSID, do what it needs, then unmask with CPSIE, at which point any pending IRQ from the NVIC will be taken immediately.
For the case of nested/re-entrant critical sections, the intrinsics provide a handy int __disable_irq(void) form which returns the previous state, so you can unmask conditionally on that.
For other compilers which don't offer such intrinsics, it's straightforward enough to roll your own, e.g.:
static inline int disable_irq(void) {
    int primask;
    asm volatile("mrs %0, PRIMASK\n"
                 "cpsid i\n" : "=r"(primask) : : "memory");
    return primask & 1;            // 1 --> interrupts were already masked on entry
}

static inline void enable_irq(int primask) {
    if (!primask)                  // only unmask if interrupts were enabled on entry
        asm volatile("cpsie i\n" ::: "memory");
}
[1] One confusing overlap is that the latter sense is often used to achieve the former in single-CPU multitasking - if interrupts are off, no other thread can get scheduled until you've finished, and thus will never see partially-updated memory.
[2] With the possible exception of load/store-multiple instructions - in the low-latency interrupt configuration, these can be interrupted, and either restarted or continued upon return.
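A hypothetical usage of those helpers (the shared variable name is made up):
volatile unsigned int shared_value;

void update_shared(unsigned int value)
{
    int was_masked = disable_irq();  // mask IRQs, remember previous PRIMASK state
    shared_value = value;            // protected update
    enable_irq(was_masked);          // unmask only if IRQs were enabled before
}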

fiq & irq handler -- arm

I am new to ARM and have some doubts related to IRQ and FIQ. Please help clarify these.
How many FIQ and IRQ channels does ARM have?
And how many handlers can we write for each channel?
Also, if we can register multiple handlers for a single interrupt channel, how does ARM know which handler to run?
The distinction between IRQ and FIQ goes right back to the early days of ARM, when it was designed by Acorn. It was always the case that the IRQ line was attached to an interrupt controller that multiplexed a large number of interrupt sources together. This is precisely what happens in all modern ARMs.
The rationale behind the FIQ was to provide an extremely low-latency response with maximum priority (it can safely pre-empt the IRQ handler). The comparatively large number of shadow registers facilitates writing handlers that store the handler's state in CPU registers rather than hitting the stack.
The shadow registers are almost the opposite set to those commonly used by the APCS for function calls, so writing handlers in C would cause a push and eventual pop of up to 8 non-shadowed registers. Having any kind of interrupt demultiplexing wipes out any performance advantage that FIQ might have given.
All of this means that there is only really any benefit in using FIQ for very specialised applications where a really hard real-time interrupt response is required for one interrupting device, and you're willing to write your handler in assembler. You'll also be left with working out how to synchronise with the rest of the system - some of which might rely on disabling IRQ to keep data synchronised.
Traditionally the ARM has one interrupt line which you can send to one of two handlers, FIQ or IRQ. FIQ has a larger bank of FIQ-mode-only registers, so you have fewer that need to be saved on the stack. From there you read the vendor-specific registers, if any, to determine the source of the interrupt and then branch into separate handlers.
More recently there have been ARM architectures with many interrupts (128, 256), each with a separate handler. So asking something generic about ARM is about like asking something generic about x86.
All of this information is easily available in the ARM Architecture Reference Manuals for the different architectures, and the pinout of the core (what the vendor builds its chip around) is documented in the Technical Reference Manuals for the various cores (also very easy to obtain). infocenter.arm.com has the architecture and technical reference manuals as well as the AMBA/AXI documentation (the bus the vendor connects to). Your question is completely answered in those documents.
The ARM processor directly supports only ONE IRQ and ONE FIQ. ARM supports multiple interrupts through a peripheral called an interrupt controller. The ARM standard interrupt controllers are called GIC (Generic Interrupt Controller).
The GIC has a number of inputs for peripherals to connect their interrupt lines, and two output lines that connect to IRQ and FIQ. Basically it acts as a mux. A GIC driver will set up configuration such as interrupt priority, type (IRQ/FIQ), masking, etc.
In traditional ARM systems there is one entry each for IRQ and FIQ in the exception vector table. Depending on which line the interrupt fired on, the IRQ or FIQ handler is called. The interrupt handler queries the GIC (the GIC CPU interface registers, to be specific) to get the interrupt number. Based on this interrupt number, the corresponding device handler is invoked.
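A rough sketch of that dispatch loop for a GICv2-style CPU interface (the base address and the dispatch function are placeholders; check your SoC's manual for the real values):
#define GICC_BASE 0x2C002000u   /* placeholder: GIC CPU interface base for your SoC */
#define GICC_IAR  (*(volatile unsigned int *)(GICC_BASE + 0x0C))   /* Interrupt Acknowledge */
#define GICC_EOIR (*(volatile unsigned int *)(GICC_BASE + 0x10))   /* End Of Interrupt */

extern void call_device_handler(unsigned int id);   /* hypothetical per-device dispatch */

void irq_dispatch(void)
{
    unsigned int iar = GICC_IAR;        /* acknowledge; low 10 bits are the interrupt ID */
    unsigned int id  = iar & 0x3FFu;
    call_device_handler(id);            /* run the handler registered for this ID */
    GICC_EOIR = iar;                    /* signal end of interrupt to the GIC */
}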
Number of interrupts depends on the specific GIC implementation. So you would have to check the manual for the interrupt controller in your system to get those specifics.
Note: The interrupt handling is slightly different depending on which specific ARM core you are coding for.
Actually the question is a bit tricky. You must specify in the question which ARM architecture you are working with. The ARMv7-A and ARMv7-R Architecture Reference Manual (ARM ARM) specifies one FIQ and one IRQ, as many have already answered. But ARMv7-M (used in Cortex-M processors) integrates an interrupt controller into the processor, and thus offers one NMI (instead of FIQ) and up to 240 IRQ lines.
For more information: ARMv7-A and ARMv7-R Architecture Reference Manual: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0406c/index.html
ARMv7-M Architecture Reference Manual: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0403e.b/index.html
As an example, the Cortex-M4 spec sheet: http://www.arm.com/products/processors/cortex-m/cortex-m4-processor.php
