Why is there a barrier() in the KCOV code in the Linux kernel?

In the Linux KCOV code, why is this barrier() placed here?
void notrace __sanitizer_cov_trace_pc(void)
{
    struct task_struct *t;
    enum kcov_mode mode;

    t = current;
    /*
     * We are interested in code coverage as a function of a syscall inputs,
     * so we ignore code executed in interrupts.
     */
    if (!t || in_interrupt())
        return;
    mode = READ_ONCE(t->kcov_mode);
    if (mode == KCOV_MODE_TRACE) {
        unsigned long *area;
        unsigned long pos;

        /*
         * There is some code that runs in interrupts but for which
         * in_interrupt() returns false (e.g. preempt_schedule_irq()).
         * READ_ONCE()/barrier() effectively provides load-acquire wrt
         * interrupts, there are paired barrier()/WRITE_ONCE() in
         * kcov_ioctl_locked().
         */
        barrier();
        area = t->kcov_area;
        /* The first word is number of subsequent PCs. */
        pos = READ_ONCE(area[0]) + 1;
        if (likely(pos < t->kcov_size)) {
            area[pos] = _RET_IP_;
            WRITE_ONCE(area[0], pos);
        }
    }
}
A barrier() call prevents the compiler from re-ordering instructions. However, how is that related to interrupts here? Why is it needed for semantic correctness?

Without barrier(), the compiler would be free to access t->kcov_area before t->kcov_mode. It's unlikely to want to do that in practice, but that's not the point. Without some kind of barrier, C rules allow the compiler to create asm that doesn't do what we want. (The C11 memory model has no ordering guarantees beyond what you impose explicitly: via stdatomic in C11, or via barriers like barrier() or smp_rmb() in Linux / GNU C.)
As described in the comment, barrier() is creating an acquire-load wrt. code running on the same core, which is all you need for interrupts.
mode = READ_ONCE(t->kcov_mode);
if (mode == KCOV_MODE_TRACE) {
    ...
    barrier();
    area = t->kcov_area;
    ...
I'm not familiar with kcov in general, but it looks like seeing a certain value in t->kcov_mode with an acquire load makes it safe to read t->kcov_area. (Because whatever code writes that object writes kcov_area first, then does a release-store to kcov_mode.)
https://preshing.com/20120913/acquire-and-release-semantics/ explains acq / rel synchronization in general.
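For context, the paired release side in kcov_ioctl_locked() looks roughly like this (a simplified sketch of the kernel code from memory; locking and error handling omitted):

/* KCOV_ENABLE: publish the coverage area, then the mode. */
t->kcov_size = kcov->size;
t->kcov_area = kcov->area;
/* See comment in __sanitizer_cov_trace_pc(). */
barrier();                            /* order the stores at compile time */
WRITE_ONCE(t->kcov_mode, kcov->mode);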
Why isn't smp_rmb() required? (Even on weakly-ordered ISAs where acquire ordering would need a fence instruction to guarantee seeing other stores done by another core.)
An interrupt handler runs on the same core that was doing the other operations, just like a signal handler interrupts a thread and runs in its context. struct task_struct *t = current means that the data we're looking at is local to a single task. This is equivalent to something within a single thread in user-space. (Kernel pre-emption leading to re-scheduling on a different core will use whatever memory barriers are necessary to preserve correct execution of a single thread when that other core accesses the memory this task had been using).
The user-space C11 stdatomic equivalent of this barrier is atomic_signal_fence(memory_order_acquire). Signal fences only have to block compile-time reordering (like Linux barrier()), unlike atomic_thread_fence that has to emit a memory barrier asm instruction.
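A minimal user-space analogue (my sketch, not kernel code): the main thread publishes a buffer to a signal handler running in the same thread, using signal fences that cost nothing at run time:

#include <stdatomic.h>
#include <signal.h>

static int *area;                   /* payload, written before the flag */
static volatile sig_atomic_t mode;  /* flag checked by the handler */

void publish(int *buf) {
    area = buf;
    atomic_signal_fence(memory_order_release); /* compile-time only */
    mode = 1;
}

void on_signal(int sig) {
    (void)sig;
    if (mode == 1) {
        atomic_signal_fence(memory_order_acquire);
        area[0] = 42; /* safe: area was published before mode was set */
    }
}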
Out-of-order CPUs do reorder things internally, but the cardinal rule of OoO exec is to preserve the illusion of instructions running one at a time, in order, for the core running the instructions. This is why you don't need a memory barrier for the asm equivalent of a = 1; b = a; to correctly load the 1 you just stored; hardware preserves the illusion of serial execution[1] in program order. (Typically via having loads snoop the store buffer and store-forward from stores to loads for stores that haven't committed to L1d cache yet.)
Instructions in an interrupt handler logically run after the point where the interrupt happened (as per the interrupt-return address). Therefore we just need the asm instructions in the right order (barrier()), and hardware will make everything work.
Footnote 1: There are some explicitly-parallel ISAs like IA-64 and the Mill, but they provide rules that asm can follow to be sure that one instruction sees the effect of another earlier one. Same for classic MIPS I load delay slots and stuff like that. Compilers take care of this for compiled C.

Related

Thread safe queue implementation (or alternative data structure)

I'm trying to implement a thread-safe queue that will hold data coming in on a UART buffer. The queue is written to as part of the UART RX-complete ISR. This queue now holds the data that came in on the UART RX channel. The queue also needs to be read by the application using another thread to process the data. But since I'm running all of this on a bare-metal system without any RTOS support, I'm wondering if there is a better data structure to use here, because when I'm using queues there is one common variable that both threads need to access, and this might cause a race condition.
I realize as I'm writing this that this is the producer-consumer problem and the only way I have solved this in the past is with mutexes. Is there an alternative to that approach?
Edit:
The processor being used is an ST Micro Cortex-M0 based processor. I looked into some mutex implementations for the M0 but couldn't find anything definitive. This is mostly because the M0 does not support the LDREX or STREX instructions that are present on M3 and M4 systems and are used for implementing the atomic operations required for mutexes.
As for the system, the code runs straight to main after booting and has NO OS functionality. Even the scheduler was something that was written by me and simply looks at a table that holds function pointers and calls them.
The requirement is that one thread writes into a memory location from the ISR to store data coming in through the UART RX channel and another thread reads from those memory locations to process the data received. So my initial thought was that I would push to a queue from the ISR and read from it using the application thread, but that is looking less and less feasible because of the race condition that comes out of a producer-consumer setup (with the ISR being the producer and the application being the consumer).
Your M0 is a uniprocessor, so you can disable interrupts to provide basic mutual exclusion.
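The code below assumes a ring-buffer type roughly like this (the answer doesn't define Q, so this is a hypothetical definition):

typedef struct {
    unsigned char *buf; /* storage, len bytes */
    int len;            /* capacity; one slot is always left empty */
    volatile int head;  /* read index, advanced by q_get() */
    volatile int tail;  /* write index, advanced by q_put() */
} Q;

With that in place, the queue operations: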
/* Enqueue c; returns 0 on success, -1 if the queue is full. */
int q_put(int c, Q *q) {
    int ps, n, r;

    ps = disable();                 /* enter critical section */
    if ((n = q->tail + 1) == q->len)
        n = 0;                      /* wrap around */
    if (n != q->head) {             /* not full */
        q->buf[q->tail] = c;
        q->tail = n;
        r = 0;
    } else {
        r = -1;                     /* full: one slot stays empty */
    }
    restore(ps);                    /* leave critical section */
    return r;
}

/* Dequeue one byte; returns it, or -1 if the queue is empty. */
int q_get(Q *q) {
    int ps, n, r;

    ps = disable();
    if ((n = q->head) == q->tail) { /* empty */
        r = -1;
    } else {
        r = q->buf[n] & 0xff;
        q->head = (n + 1 == q->len) ? 0 : n + 1;
    }
    restore(ps);
    return r;
}
where disable() disables interrupts and returns the previous interrupt state, and restore() sets the interrupt state back to its argument.
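On a Cortex-M0, disable()/restore() could be built on PRIMASK, for example with the CMSIS intrinsics (a sketch; the disable/restore names come from the answer above, and the CMSIS calls are my assumption about the toolchain):

/* Assumes a CMSIS device header has been included, providing
   __get_PRIMASK(), __set_PRIMASK() and __disable_irq(). */

/* Mask all interrupts and return the previous PRIMASK state. */
static inline int disable(void) {
    int ps = __get_PRIMASK();
    __disable_irq();
    return ps;
}

/* Restore the interrupt state previously returned by disable(). */
static inline void restore(int ps) {
    __set_PRIMASK(ps);
}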
If it is bare metal, then you won't have any mutexes or higher-level concepts, so you need to implement something similar yourself. This is a common scenario, however.
The normal data type to use for this is a ring buffer, which is a kind of queue implemented over a circular array. You should write it as a separate module, and give it two parameters: the interrupt register and the bit mask for setting/clearing that register. Then let the ring buffer code temporarily disable the UART RX interrupt during a copy from the ring buffer to the calling application. This will protect against race conditions.
Since UART is most of the time relatively slow (< 115.2 kbps), disabling the RX interrupt for a brief moment is harmless: you only need a couple of microseconds to do the copy. The theory behind this is that the ISR runs once per data byte received, but the caller runs completely asynchronously in relation to the data. So the caller must not block the ISR for longer than the time it takes to clock in 2 data bytes, or there will be overrun errors and data loss.
In practice this means the caller should only block the ISR for less time than it takes to clock in 1 data byte, because we don't know how far the UART has gotten in clocking in the current byte at the time we disable the RX interrupt. If the interrupt is disabled at the point the byte is clocked in, that should still be fine, since it becomes a pending interrupt and triggers once you enable the RX interrupt again. (At least all UART hardware I've ever used works like this, but double-check the behavior of your specific one just to be sure.)
This all assumes that you can do the copy faster than the time it takes to clock in 1+8+1 new bits on the UART (8 data bits, no parity, 1 stop bit). So if you are running at, for example, 115.2 kbps, your code must be faster than 1/115200 * (1+8+1) = 86.8 us. If you are only copying less than a 32-bit word during that time, a Cortex-M should have no trouble keeping up, assuming you run at a sensible clock speed (8-48 MHz, something like that) and not some low-power clock.
You always need to check for overrun and framing errors. Depending on the UART hardware, these might be separate interrupts or the same one as RX. Then handle those errors in whatever way makes sense for the application. If both sender and receiver are configured correctly and you didn't mess up the timing calculations, you shouldn't see any such errors.

Critical sections in ARM

I am experienced in implementing critical sections on the AVR family of processors, where all you do is disable interrupts (with a memory barrier of course), do the critical operation, and then reenable interrupts:
void my_critical_function()
{
    cli(); // Disable interrupts
    // Mission critical code here
    sei(); // Enable interrupts
}
Now my question is this:
Does this simple method apply to the ARM architecture of processor as well? I have heard things about the processor doing lookahead on the instructions, and other black magic, and was wondering primarily if these types of things could be problematic to this implementation of critical sections.
Assuming you're on a Cortex-M processor, take a look at the LDREX and STREX instructions, which are available in C via the __LDREXW() and __STREXW() macros provided by CMSIS (the Cortex Microcontroller Software Interface Standard). They can be used to build extremely lightweight mutual exclusion mechanisms.
Basically,
data = __LDREXW(address)
works like data = *address except that it sets an 'exclusive access flag' in the CPU. When you've finished manipulating your data, write it back using
success = __STREXW(data, address)
which is like *address = data but will only succeed in writing if the exclusive access flag is still set. If it does succeed in writing then it also clears the flag. It returns 0 on success and 1 on failure. If the STREX fails, you have to go back to the LDREX and try again.
For simple exclusive access to a shared variable, nothing else is required. For example:
do {
    data = __LDREXW(address);
    data++;
} while (__STREXW(data, address));
The interesting thing about this mechanism is that it's effectively 'last come, first served': if this code is interrupted and the interrupt itself uses LDREX and STREX, the interrupt's STREX will succeed and the (lower-priority) user code will have to retry.
If you're using an operating system, the same primitives can be used to build 'proper' semaphores and mutexes (see this application note, for example); but then again if you're using an OS you probably already have access to mutexes through its API!
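For instance, a minimal try-lock built on these intrinsics might look like this (a sketch assuming the CMSIS __LDREXW/__STREXW/__CLREX intrinsics, available on M3/M4 but not M0; a production mutex would also want memory barriers such as __DMB() at acquire and release):

#include <stdint.h>

/* Returns 1 if the lock was acquired, 0 if it was already held. */
static inline int try_lock(volatile uint32_t *lock)
{
    if (__LDREXW(lock) != 0) { /* already locked? */
        __CLREX();             /* drop the exclusive-access flag */
        return 0;
    }
    return __STREXW(1, lock) == 0; /* __STREXW returns 0 on success */
}

static inline void unlock(volatile uint32_t *lock)
{
    *lock = 0; /* a plain store releases the lock */
}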
The ARM architecture family is very wide, and as I understand it you probably mean ARM Cortex-M microcontrollers.
You can use this technique, but many ARM uCs offer much more. As I do not know your actual hardware, I can only give you some examples:
Bit-band regions: in these memory regions you can set and reset bits in an atomic way.
Hardware semaphores (STM32H7).
Hardware mutexes (some NXP uCs).
etc. etc.

Issue with global variable while making 32-bit counter

I am trying to do quadrature decoding using an Atmel XMEGA AVR microcontroller. The XMEGA has only 16-bit counters, and in addition I have used up all the available timers.
Now, to make a 32-bit counter I have used one 16-bit counter, and in its overflow/underflow interrupt I increment/decrement a 16-bit global variable, so that by combining the two we get a 32-bit counter.
ISR(timer_16bit)
{
    if (quad_enc_mov_forward)
    {
        timer_over_flow++;
    }
    else if (quad_enc_mov_backward)
    {
        timer_over_flow--;
    }
}
so far it is working fine. But I need to use this 32-bit value in various tasks running parallel. I'm trying to read 32-bit values as below
uint32_t current_count = timer_over_flow;
current_count = current_count << 16;
current_count = current_count + timer_16bit_count;
`timer_16bit_count` is a hardware register.
Now the problem I am facing: when I read timer_over_flow into current_count in the first statement, by the time I add timer_16bit_count an overflow may have occurred and the 16-bit timer may have become zero. This may result in a totally wrong value.
And I am trying to read this 32-bit value in multiple tasks.
Is there a way to prevent this data corruption and get a working 32-bit value?
Details sought by different members:
My motor can move forward or backward and accordingly counter increments/decrements.
In the case of the ISR: before starting my motor I set the global variables (quad_enc_mov_forward & quad_enc_mov_backward) so that if there is an overflow/underflow, timer_over_flow gets changed accordingly.
Variables that are modified in the ISR are declared as volatile.
Multiple tasks means that I'm using RTOS Kernel with about 6 tasks (mostly 3 tasks running parallel).
In the XMEGA I'm directly reading the TCC0_CNT register for the low 16 bits.
One solution is:
uint16_t a, b, c;
do {
    a = timer_over_flow;
    b = timer_16bit_count;
    c = timer_over_flow;
} while (a != c);
uint32_t counter = (uint32_t) a << 16 | b;
Per comment from user5329483, this must not be used with interrupts disabled, since the hardware counter fetched into b may be changing while the interrupt service routine (ISR) that modifies timer_over_flow would not run if interrupts are disabled. It is necessary that the ISR interrupt this code if a wrap occurs during it.
This gets the counters and checks whether the high word changed. If it did, this code tries again. When the loop exits, we know the low word did not wrap during the reads. (Unless there is a possibility we read the high word, then the low word wrapped, then we read the low word, then it wrapped the other way, then we read the high word. If that can happen in your system, an alternative is to add a flag that the ISR sets when the high word changes. The reader would clear the flag, read the timer words, and read the flag. If the flag is set, it tries again.)
Note that timer_over_flow, timer_16bit_count, and the flag, if used, must be volatile.
If the wrap-two-times scenario cannot happen, then you can eliminate the loop (a sketch follows the steps below):
Read a, b, and c as above.
Compare b to 0x8000.
If b has a high value, either there was no wrap, it was read before a wrap upward (0xffff to 0), or it was read after a wrap downward. Use the lower of a or c.
Otherwise, either there was no wrap, b was read after a wrap upward, or it was read before a wrap downward. Use the larger of a or c.
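Putting those steps into code (my sketch; variable names follow the question, and it assumes the wrap-two-times scenario cannot happen):

uint32_t read_counter(void)
{
    uint16_t a = timer_over_flow;   /* high word, first sample */
    uint16_t b = timer_16bit_count; /* low word (hardware register) */
    uint16_t c = timer_over_flow;   /* high word, second sample */
    uint16_t hi;

    if (b & 0x8000)           /* b in upper half: any wrap was upward-after */
        hi = (a < c) ? a : c; /*   or downward-before; take the lower */
    else                      /* b in lower half: any wrap was upward-before */
        hi = (a > c) ? a : c; /*   or downward-after; take the larger */
    return (uint32_t)hi << 16 | b;
}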
The #1 fundamental embedded systems programming FAQ:
Any variable shared between the caller and an ISR, or between different ISRs, must be protected against race conditions. To prevent some compilers from doing incorrect optimizations, such variables should also be declared as volatile.
Those who don't understand the above are not qualified to write code containing ISRs. Or programs containing multiple processes or threads for that matter. Programmers who don't realize the above will always write very subtle, very hard-to-catch bugs.
Some means to protect against race conditions could be one of these:
Temporarily disabling the specific interrupt during access.
Temporarily disabling all maskable interrupts during access (crude way).
Atomic access, verified in the machine code.
A mutex or semaphore. On single-core MCUs where interrupts cannot be interrupted in turn, you can use a bool as a "poor man's mutex" (sketched below).
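As an illustration of that last option, a "poor man's mutex" on a single-core MCU can be as simple as this (a sketch; it works because the ISR can interrupt the main loop but never the other way around):

#include <stdbool.h>

volatile bool busy = false; /* set by the main loop around shared access */

/* main loop */
busy = true;
/* ... read/modify the shared data ... */
busy = false;

/* ISR */
if (busy) {
    /* the main loop owns the data right now: record the event in a
       separate flag and process it later, instead of touching the data */
}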
Just reading TCC0_CNT in multithreaded code is a race condition if you do not handle it correctly. Check the section on reading 16-bit registers in the XMEGA manual. You should read the low byte first (this will probably be handled transparently for you by the compiler). When the low byte is read, the high byte is (atomically) copied into the TEMP register. Reading the high byte then reads the TEMP register, not the counter. In this way an atomic read of the 16-bit value is ensured, but only if there is no access to the TEMP register between the low- and high-byte reads.
Note that this TEMP register is shared between all counters, so a context switch at the right (wrong) moment will probably trash its content and therefore your high byte. You need to disable interrupts for this 16-bit read. Because the AVR executes one more instruction after sei before any pending interrupt is taken, the best way is probably:
cli
ld [low_byte]
sei
ld [high byte]
It disables interrupts for four CPU cycles (if I counted it correctly).
An alternative would be to save the shared TEMP register(s) on each context switch. It is possible (though I'm not sure how likely) that your OS already does this, but be sure to check. Even so, you need to make sure colliding access does not occur from an ISR.
This precaution should be applied to any 16bit register read in your code. Either make sure TEMP register is correctly saved/restored (or not used by multiple threads at all) or disable interrupts when reading/writing 16bit value.
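In C, that four-cycle window is roughly what you get from avr-libc's ATOMIC_BLOCK (a sketch; TCC0 is assumed to be the XMEGA timer/counter the question reads, and the compiler emits the low-then-high byte order itself):

#include <avr/io.h>
#include <util/atomic.h>

static inline uint16_t read_tcc0_cnt(void)
{
    uint16_t v;
    ATOMIC_BLOCK(ATOMIC_RESTORESTATE) {
        v = TCC0.CNT; /* low byte read first, high byte comes via TEMP */
    }
    return v;
}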
This problem is indeed a very common and very hard one. All solutions to it will have a caveat regarding timing constraints in the lower-priority layers. To clarify this: the highest-priority function in your system is the hardware counter; its response time defines the maximum frequency you can eventually sample. The next lower priority in your solution is the interrupt routine, which tries to keep track of bit 2^16, and the lowest is your application-level code, which tries to read the 32-bit value. The question now is whether you can quantify the shortest time between two level changes on the A and B inputs of your encoder. The shortest time usually occurs not at the highest speed your real-world axis rotates, but when halting at a position: through minimal vibrations the encoder can swing back and forth between two increments, thereby producing e.g. a falling and a rising edge on the same encoder output in short succession. Iff (if and only if) you can guarantee that your interrupt processing time is shorter (by a margin) than this minimal time can you use such a method to virtually extend the counting range of your encoder.

Test program for CPU out of order effect

I wrote a multi-threaded program to demonstrate the out-of-order effect of an Intel processor. The program is attached at the end of this post.
The expected result is that x is printed as either 42 or 0 by handler1. However, the actual result is always 42, which suggests that the out-of-order effect is not occurring.
I compiled the program with the command "gcc -pthread -O0 out-of-order-test.c"
I run the compiled program on Ubuntu 12.04 LTS (Linux kernel 3.8.0-29-generic) on Intel IvyBridge processor Intel(R) Xeon(R) CPU E5-1650 v2.
Does anyone know what I should do to see the out of order effect?
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h> /* for sleep() */
#include <pthread.h>

int f = 0, x = 0;

void* handler1(void *data)
{
    while (f == 0)
        ; /* spin until handler2 sets the flag */
    // Memory fence required here
    printf("%d\n", x);
    return NULL;
}

void* handler2(void *data)
{
    x = 42;
    // Memory fence required here
    f = 1;
    return NULL;
}

int main(int argc, char *argv[])
{
    pthread_t tid1, tid2;
    pthread_create(&tid1, NULL, handler1, NULL);
    pthread_create(&tid2, NULL, handler2, NULL);
    sleep(1);
    return 0;
}
You are mixing up race conditions with the out-of-order execution paradigm. Unfortunately, I am pretty sure you cannot "expose" out-of-order execution, as it is explicitly designed and implemented in such a way as to shield you (the running program and its data) from its effects.
More specifically: the out-of-order execution takes place "inside" a CPU in its full entirety. The results of out-of-order instructions are not directly posted to the register file but are instead queued up to preserve the order.
So even if the instructions themselves are executed out of order (based on various rules that primarily ensure that those instructions can be run independently of each other) their results are always re-ordered to be in a correct sequence as is expected by an outside observer.
What your program does is try (very crudely) to simulate a race condition in which you hope to see the assignment to f happen ahead of the assignment to x, while at the same time hoping a context switch happens at exactly that moment, and assuming the new thread will be scheduled on the very same CPU core as the other one.
However, as explained above, even if you do get lucky enough to hit all the listed conditions (schedule the second thread right after the f assignment but before the x assignment, and have the new thread scheduled on the very same CPU core), which in itself is an extremely low-probability event, even then all you really expose is a potential race condition, not out-of-order execution.
Sorry to disappoint you but your program won't help you with observing the out-of-order execution effects. At least not with a high enough probability as to be practical.
You may read a bit more about out-of-order execution here:
http://courses.cs.washington.edu/courses/csep548/06au/lectures/introOOO.pdf
UPDATE
Having given it some thought, I think you could try modifying the instructions on the fly in the hope of exposing out-of-order execution. But even then I'm afraid this approach will fail, as the new "updated" instruction won't be correctly reflected in the CPU's pipeline. What I mean is: the CPU will most likely already have fetched and decoded the instruction you are about to modify, so what is executed will no longer match the content of the memory word (even the one in the CPU's L1 cache).
But this technique, assuming it can help you at all, requires some advanced programming directly in assembly and requires your code to run at the highest privilege level (ring 0). I would recommend extreme caution with writing self-modifying code, as it has great potential for side effects.
PLEASE NOTE: The following only addresses MEMORY reordering. To my knowledge you cannot observe out-of-order execution outside the pipeline, since that would constitute a failure of the CPU to adhere to its interface (e.g. you should tell Intel; it would be a bug). Specifically, there would have to be a failure in the reorder buffer and instruction-retirement bookkeeping.
According to Intel's documentation (specifically Volume 3A, section 8.2.3.4):
The Intel-64 memory-ordering model allows a load to be reordered with an earlier store to a different location.
It also specifies (I'm summarizing, but all of this is available in section 8.2, Memory Ordering, with examples in 8.2.3) that loads are never reordered with loads, stores are never reordered with stores, and stores are never reordered with earlier loads. This means there are implicit fences (3 of the weak types) between these operations on Intel 64.
To observe memory reordering, you just need to implement that example with sufficient care to actually observe the effects. Here is a link to a full implementation I did that demonstrates this. (I will follow up with more details in the accompanying post here.)
Essentially the first thread (processor_0 from the example) does this:
x = 1;
#if CPU_FENCE
__cpu_fence();
#endif
r1 = y;
inside of a while loop in its own thread (pinned to a CPU using SCHED_FIFO:99).
The second (observer, in my demo) does this:
y = 1;
#if CPU_FENCE
__cpu_fence();
#endif
r2 = x;
also in a while loop in its own thread with the same scheduler settings.
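For reference, __cpu_fence() here is presumably something along these lines on x86-64 (an assumption on my part; the demo's actual macro may differ, though the answer below notes that the fenced build uses the mfence instruction):

/* Hypothetical definition of __cpu_fence() for x86-64: a full hardware
   memory barrier plus a compiler barrier. */
#define __cpu_fence() asm volatile("mfence" ::: "memory")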
Reorders are checked for like this (exactly as specified in the example):
if (r1 == 0 && r2 == 0)
    ++reorders;
With the CPU_FENCE disabled, this is what I see:
[ 0][myles][~/projects/...](master) sudo ./build/ooo
after 100000 attempts, 754 reorders observed
With the CPU_FENCE enabled (which uses the "heavyweight" mfence instruction) I see:
[ 0][myles][~/projects/...](master) sudo ./build/ooo
after 100000 attempts, 0 reorders observed
I hope this clarifies things for you!

Atomic Block for reading vs ARM SysTicks

I am currently porting my DCF77 library (you may find the source code at GitHub) from Arduino (AVR based) to Arduino Due (ARM Cortex M3). I am an absolute beginner with the ARM platform.
With the AVR based Arduino I can use avr-libc to get atomic blocks. Basically this blocks all interrupts during the block and will allow interrupts later on again. For the AVR this was fine. Now for the ARM Cortex things start to get complicated.
First of all: for the current uses of the library this approach would work as well. So my first question is: is there something similar to the "ATOMIC" macros of avr-libc for ARM? Obviously other people have thought of something in this direction. Since I am using gcc, I could enhance these macros to work almost exactly like the avr-libc ATOMIC macros. I already found some CMSIS documentation; however, it seems only to provide an "enable_irq" macro instead of a "restore_irq" macro.
Question 1: is there any library out there (for gcc) that already does this?
Because ARM has different priority interrupts I could establish the atomicity in different ways as well. In my case the "atomic" blocks must only make sure that they are not interrupted by the systick interrupt. So actually I would not need to block everything to make my blocks "atomic enough". Searching further I found an ARM synchronization primitives article in the developer infocenter. Especially there is a hint at lockless programming. According to the article this is an advanced concept and that there are many publications on it. Searching the net I found only general explanations of the concept, e.g. here. I assume that a lockless implementation would be very cool but at this time I feel not confident enough on ARM to implement this from scratch.
Question 2: does anyone have some hints for me on lockless reads of memory blocks on ARM Cortex M3?
As I already said I only need to protect the lower priority thread from sysTicks. So another option would be to disable sysTicks briefly. Since I am implementing a timing sensitive clock algorithm this must not slow down the overall sysTick frequency in the long run. Introducing some small jitter would be OK though. At this time I would find this most attractive.
Question 3: is there any good way to block sysTick interrupts without losing any ticks?
I also found the CMSIS documentation for semaphores. However I am somewhat overwhelmed. Especially I am wondering if I should use CMSIS and how to do this on an Arduino Due.
Question 4: What would be my best option? Or where should I continue reading?
Partial Answer:
With the hint from Notlikethat I implemented:
#if defined(ARDUINO_ARCH_AVR)
#include <util/atomic.h>
#define CRITICAL_SECTION ATOMIC_BLOCK(ATOMIC_RESTORESTATE)
#elif defined(ARDUINO_ARCH_SAM)
// Workaround as suggested by Stackoverflow user "Notlikethat"
// http://stackoverflow.com/questions/27998059/atomic-block-for-reading-vs-arm-systicks
static inline int __int_disable_irq(void) {
    int primask;
    // Save PRIMASK, then mask interrupts. The "memory" clobber keeps the
    // compiler from moving shared-data accesses above this point.
    asm volatile("mrs %0, PRIMASK\n"
                 "cpsid i\n" : "=r"(primask) :: "memory");
    return primask & 1;
}

static inline void __int_restore_irq(int *primask) {
    // Re-enable interrupts only if they were enabled on entry.
    if (!(*primask)) {
        asm volatile("cpsie i\n" ::: "memory");
    }
}

// This critical section macro borrows heavily from
// avr-libc util/atomic.h
// --> http://www.nongnu.org/avr-libc/user-manual/atomic_8h_source.html
#define CRITICAL_SECTION for (int primask_save __attribute__((__cleanup__(__int_restore_irq))) = __int_disable_irq(), __ToDo = 1; __ToDo; __ToDo = 0)
#else
#error Unsupported controller architecture
#endif
This macro does more or less what I need. However, I find there is room for improvement, as it blocks all interrupts although it would be sufficient to block only SysTicks. So Question 3 is still open.
Most of what you've referenced is about synchronising memory accesses between multiple CPUs, or pre-emptively scheduled threads on the same CPU, which seems entirely inappropriate given the stated situation. "Atomicity" in that sense refers to guaranteeing that when one observer is updating memory, any observer reading memory sees either the initial state, or the updated state, but never something part-way in between.
"Atomicity" with respect to interrupts follows the same principle - i.e. ensuring that if an interrupt occurs, a sequence of code has either not run at all, or run completely - but is a conceptually different thing1. There are only two things guaranteed to be atomic w.r.t. interrupts: a single instruction2, or a sequence of instructions executed with interrupts disabled.
The "right" way to achieve that is indeed via the CPSID/CPSIE instructions, which are wrapped in the __disable_irq()/__enable_irq() intrinsics. Note that there are two "stages" of interrupt handling in the system: the M3 core itself only has a single IRQ signal - it's the external NVIC's job to do all the routing/multiplexing/prioritisation of the system IRQs into this one line. When the CPU wants to enter a critical section, all it needs to do is mask its own IRQ input with CPSID, do what it needs, then unmask with CPSIE, at which point any pending IRQ from the NVIC will be taken immediately.
For the case of nested/re-entrant critical sections, the intrinsics provide a handy int __disable_irq(void) form which returns the previous state, so you can unmask conditionally on that.
For other compilers which don't offer such intrinsics, it's straightforward enough to roll your own, e.g.:
static inline int disable_irq(void) {
    int primask;
    /* Save PRIMASK, then mask interrupts. The "memory" clobber stops the
       compiler from moving shared-memory accesses above this point. */
    asm volatile("mrs %0, PRIMASK\n"
                 "cpsid i\n" : "=r"(primask) :: "memory");
    return primask & 1;
}

static inline void enable_irq(int primask) {
    /* Unmask only if interrupts were enabled on entry (PRIMASK was 0). */
    if (!primask)
        asm volatile("cpsie i\n" ::: "memory");
}
[1] One confusing overlap is that the latter sense is often used to achieve the former in single-CPU multitasking: if interrupts are off, no other thread can get scheduled until you've finished, and thus will never see partially-updated memory.
[2] With the possible exception of load/store-multiple instructions - in the low-latency interrupt configuration, these can be interrupted, and either restarted or continued upon return.
