I think I'm missing something obvious here. I have code like this:
int *a = <some_address>;
int *b = <another_address>;
// ...
// desired atomic swap of addresses a and b here
int *tmp = a;
a = b; b = tmp;
I want to do this atomically. I've been looking at __c11_atomic_exchange and on OSX the OSAtomic methods, but nothing seems to perform a straight swap atomically. They can all write a value into 1 of the 2 variables atomically, but never both.
Any ideas?
What is effectively and efficiently possible depends a bit on your architecture. In any case, the two pointers that you want to swap need to be adjacent in memory; the easiest way is to put them in a struct such as
typedef struct pair { void *a[2]; } pair;
Then, if you have a full C11 compiler, you can use _Atomic(pair) as an atomic type to swap the contents of such a pair: load the current pair into a temporary, construct a new one with the two values swapped, do a compare-exchange to store the new value, and iterate if necessary:
inline
void pair_swap(_Atomic(pair) *myPair) {
  pair actual = { 0 };
  pair future = { 0 };
  // The first compare-exchange normally fails, but it loads the current value
  // into `actual`; each subsequent attempt then tries to store the swapped pair.
  while (!atomic_compare_exchange_weak(myPair, &actual, future)) {
    future.a[0] = actual.a[1];
    future.a[1] = actual.a[0];
  }
}
Depending on your architecture, this may or may not be realized with a lock-free primitive operation. Modern 64-bit architectures usually have 128-bit atomics, so this can compile down to just a few 128-bit assembler instructions.
But this depends a lot on the quality of the atomics implementation on your platform. P99 provides an implementation of the "function-like" part of atomic operations that supports 128-bit swap where it is detected; on my machine (64-bit Intel Linux) this effectively compiles into a lock cmpxchg16b instruction.
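For illustration, a minimal usage sketch of the above (assuming pair and pair_swap() are visible in the same translation unit; the variables here are made up):
#include <stdatomic.h>

static int x = 1, y = 2;

// Both pointers live together in one atomic pair object.
static _Atomic(pair) ptrs;

int main(void) {
    pair initial = { { &x, &y } };
    atomic_init(&ptrs, initial);

    pair_swap(&ptrs);               // atomically swap the two addresses

    pair now = atomic_load(&ptrs);  // now.a[0] == &y, now.a[1] == &x
    return 0;
}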
Relevant documentation:
https://en.cppreference.com/w/c/language/atomic
https://en.cppreference.com/w/c/thread
https://en.cppreference.com/w/c/atomic/atomic_exchange
https://en.cppreference.com/w/c/atomic/atomic_compare_exchange
You should use a mutex.
Or, if you don't want your thread to go to sleep when the mutex isn't immediately available, you can often improve efficiency with a spin lock: the number of iterations a spin lock actually has to spin is frequently so small that spinning is faster than putting a thread to sleep and paying for a full context switch.
C11 has a whole suite of atomic types and functions which allow you to easily implement your own spin lock to accomplish this.
Keep in mind, however, that a basic spin lock must not be used on a bare metal (no operating system) single core (processor) system, as it will result in deadlock if the main context takes the lock and then while the lock is still taken, an ISR fires and tries to take the lock. The ISR will block indefinitely.
To avoid this, more-complicated locking mechanisms must be used, such as a spin lock with a timeout, an ISR which auto-reschedules itself for a slightly later time if the lock is taken, automatic yielding, etc.
Quick answer:
Here is what we will be building up to:
int *a = &some_int1;
int *b = &some_int2;
spinlock_t spinlock = SPINLOCK_UNLOCKED;
// swap `a` and `b`
atomic_swap_int_ptr(&spinlock, &a, &b);
Here are the details:
For a basic spin lock (mutex) on a multi-core system, which is where spin locks work best, try this:
Basic spin lock implementation in C11 using _Atomic types
#include <stdbool.h>
#include <stdatomic.h>

#define SPINLOCK_UNLOCKED 0
#define SPINLOCK_LOCKED   1

typedef volatile atomic_bool spinlock_t;

// This is how you'd create a new spinlock to use in the functions below.
// spinlock_t spinlock = SPINLOCK_UNLOCKED;

/// \brief Set the spin lock to `SPINLOCK_LOCKED`, and return the previous
/// state of the `spinlock` object.
/// \details If the previous state was `SPINLOCK_LOCKED`, you can know that
/// the lock was already taken, and you do NOT now own the lock. If the
/// previous state was `SPINLOCK_UNLOCKED`, however, then you have now locked
/// the lock and own the lock.
bool atomic_test_and_set(spinlock_t* spinlock)
{
    bool spinlock_val_old = atomic_exchange(spinlock, SPINLOCK_LOCKED);
    return spinlock_val_old;
}
/// Spin (loop) until you take the spin lock.
void spinlock_lock(spinlock_t* spinlock)
{
    // Spin lock: loop forever until we get the lock; we know the lock was
    // successfully obtained after exiting this while loop because the
    // atomic_test_and_set() function locks the lock and returns the previous
    // lock value. If the previous lock value was SPINLOCK_LOCKED then the lock
    // was **already** locked by another thread or process. Once the previous
    // lock value was SPINLOCK_UNLOCKED, however, then it indicates the lock
    // was **not** locked before we locked it, but now it **is** locked because
    // we locked it, indicating we own the lock.
    while (atomic_test_and_set(spinlock) == SPINLOCK_LOCKED) {};
}

/// Release the spin lock.
void spinlock_unlock(spinlock_t* spinlock)
{
    *spinlock = SPINLOCK_UNLOCKED;
}
Now, you can use the spin lock, for your example, in your multi-core system like this:
/// Atomically swap two pointers to int (`int *`).
/// NB: this does NOT swap their contents, meaning: the integers they point to.
/// Rather, it swaps the pointers (addresses) themselves.
void atomic_swap_int_ptr(spinlock_t * spinlock, int ** ptr1, int ** ptr2)
{
    spinlock_lock(spinlock);

    // critical, atomic section start
    int * temp = *ptr1;
    *ptr1 = *ptr2;
    *ptr2 = temp;
    // critical, atomic section end

    spinlock_unlock(spinlock);
}
// Demo to perform an address swap
int *a = &some_int1;
int *b = &some_int2;
spinlock_t spinlock = SPINLOCK_UNLOCKED;
// swap `a` and `b`
atomic_swap_int_ptr(&spinlock, &a, &b); // <=========== final answer
References
https://en.wikipedia.org/wiki/Test-and-set#Pseudo-C_implementation_of_a_spin_lock
C11 Concurrency support library and atomic_bool and other types: https://en.cppreference.com/w/c/thread
C11 atomic_exchange() function: https://en.cppreference.com/w/c/atomic/atomic_exchange
Other links
Note that you could use C11's atomic_flag type as well, with its corresponding atomic_flag_test_and_set() function, instead of creating your own atomic_test_and_set() function on top of atomic_exchange() as I did above (see the sketch after these links).
_Atomic types in C11: https://en.cppreference.com/w/c/language/atomic
C++11 Implementation of Spinlock using header <atomic>
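For reference, a minimal sketch of that atomic_flag alternative might look like this:
#include <stdatomic.h>

// Same idea as the spin lock above, but built on C11's atomic_flag.
// ATOMIC_FLAG_INIT gives the "clear" (unlocked) state.
typedef atomic_flag flag_spinlock_t;
// flag_spinlock_t lock = ATOMIC_FLAG_INIT;

void flag_spinlock_lock(flag_spinlock_t* lock)
{
    // atomic_flag_test_and_set() returns the previous state: keep spinning
    // while it was already set (i.e. someone else holds the lock).
    while (atomic_flag_test_and_set(lock)) {}
}

void flag_spinlock_unlock(flag_spinlock_t* lock)
{
    atomic_flag_clear(lock);
}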
TODO
This implementation would need to be modified to add safety anti-deadlock mechanisms such as automatic deferral, timeout, and retry, to prevent deadlock on single core and bare-metal systems. See my notes here: Add basic mutex (lock) and spin lock implementations in C11, C++11, AVR, and STM32 and here: What are the various ways to disable and re-enable interrupts in STM32 microcontrollers in order to implement atomic access guards?
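As one possible starting point for the timeout/retry item, here is a sketch only, built on the spinlock_t type and atomic_test_and_set() function above; the deferral/retry policy is left entirely to the caller:
#include <stdbool.h>
#include <stdint.h>

/// Sketch: attempt to take the lock, but give up after `max_spins` failed
/// attempts instead of spinning forever (e.g. so an ISR can reschedule
/// itself rather than deadlock). Returns true if the lock was obtained.
bool spinlock_lock_bounded(spinlock_t* spinlock, uint32_t max_spins)
{
    for (uint32_t i = 0; i < max_spins; i++) {
        if (atomic_test_and_set(spinlock) == SPINLOCK_UNLOCKED) {
            return true;    // lock acquired; the caller now owns it
        }
    }
    return false;           // lock not acquired; the caller must defer/retry
}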
Related
I have learned from SO threads here and here, among others, that it is not safe to assume that reads/writes of data in multithreaded applications are atomic at the OS/hardware level, and corruption of data may result. I would like to know the simplest way of making reads and writes of int variables atomic, using the <stdatomic.h> C11 library with the GCC compiler on Linux.
If I currently have an int assignment in a thread: messageBox[i] = 2, how do I make this assignment atomic? Equally for a reading test, like if (messageBox[i] == 2).
For C11 atomics you don't even have to use functions. If your implementation (= compiler) supports atomics you can just add an atomic specifier to a variable declaration and then subsequently all operations on that are atomic:
_Atomic(int) toto = 65;
...
toto += 2; // is an atomic read-modify-write operation
...
if (toto == 67) // is an atomic read of toto
Atomics have their price (they need much more computing resources), but as long as you use them sparingly they are the perfect tool to synchronize threads.
If I currently have an int assignment in a thread: messageBox[i] = 2, how do I make this assignment atomic? Equally for a reading test, like if (messageBox[i] == 2).
You almost never have to do anything. In almost every case, the data which your threads share (or communicate with) are protected from concurrent access via such things as mutexes, semaphores and the like. The implementation of the base operations ensures the synchronization of memory.
The reason for these atomics is to help you construct safer race conditions in your code. There are a number of hazards with them; including:
ai += 7;
would use an atomic protocol if ai were suitably defined. Trying to decipher race conditions is not aided by obscuring the implementation.
There is also a highly machine dependent portion to them. The line above, for example, could fail [1] on some platforms, but how is that failure communicated back to the program? It is not [2].
Only one operation has the option of dealing with failure; atomic_compare_exchange_(weak|strong). Weak just tries once, and lets the program choose how and whether to retry. Strong retries endlessly. It isn't enough to just try once -- spurious failures due to interrupts can occur -- but endless retries on a non-spurious failure is no good either.
Arguably, for robust programs or widely applicable libraries, the only bit of them you should use is atomic_compare_exchange_weak().
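To illustrate that retry pattern (a sketch, not part of the answer above): an add built on atomic_compare_exchange_weak() in which the caller decides how many retries are acceptable and what to do on persistent failure. (For a plain add you would normally just use atomic_fetch_add; the point here is the caller-controlled retry.)
#include <stdatomic.h>
#include <stdbool.h>

// Sketch: add `delta` using a weak CAS loop with a bounded number of
// retries, so the caller can observe and handle persistent failure.
bool bounded_atomic_add(atomic_int *obj, int delta, int max_retries)
{
    int expected = atomic_load(obj);
    for (int i = 0; i < max_retries; i++) {
        // On failure, `expected` is reloaded with the current value.
        if (atomic_compare_exchange_weak(obj, &expected, expected + delta))
            return true;
    }
    return false;   // give up; the caller chooses what to do next
}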
[1] Load-linked, store-conditional (ll-sc) is a common means for making atomic transactions on asynchronous bus architectures. The load-linked sets a little flag on a cache line, which will be cleared if any other bus agent attempts to modify that cache line. Store-conditional stores a value iff the little flag is set in the cache, and clears the flag; if the flag has been cleared, store-conditional signals an error, so an appropriate retry operation can be attempted. From these two operations, you can construct any atomic operation you like on a completely asynchronous bus architecture.
ll-sc can have subtle dependencies on the caching attributes of the location. Permissible cache attributes are platform dependent, as is which operations may be performed between the ll and sc.
If you put an ll-sc operation on a poorly cached access, and blindly retry, your program will lock up. This isn't just speculation; I had to debug one of these on an ARMv7-based "safe" system.
[2]:
#include <stdatomic.h>

int f(atomic_int *x) {
    return (*x)++;
}

f:
        dmb     ish
.L2:
        ldrex   r3, [r0]
        adds    r2, r3, #1
        strex   r1, r2, [r0]
        cmp     r1, #0
        bne     .L2        /* note the retry loop */
        dmb     ish
        mov     r0, r3
        bx      lr
The most portable way is to use one of the C11 atomic types. You can also use a spinlock atomic operation to guard non-atomic variables. Here is a simple pthread producer/consumer example to play with; modify as desired. Notice that cnt_non and cnt_vol can be corrupted.
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

atomic_uint cnt_atomic;
int cnt_non;
volatile int cnt_vol;

typedef atomic_uint lock_t;
lock_t lockholder = 0;

#define LOCK_C 0x01
#define LOCK_P 0x02

int cnt_lock; /* not atomic on purpose to test spinlock */
atomic_int lock_held_c, lock_held_p;
void
lock(lock_t *bitarrayp, uint32_t desired)
{
    uint32_t expected = 0; /* lock is not held */

    /* the value in expected is updated if it does not match
     * the value in bitarrayp. If the comparison fails then compare
     * the returned value with the lock bits and update the appropriate
     * counter.
     */
    do {
        if (expected & LOCK_P) lock_held_p++;
        if (expected & LOCK_C) lock_held_c++;
        expected = 0;
    } while (!atomic_compare_exchange_weak(bitarrayp, &expected, desired));
}

void
unlock(lock_t *bitarrayp)
{
    *bitarrayp = 0;
}
void*
fn_c(void *thr_data)
{
    (void)thr_data;
    for (int i = 0; i < 40000; i++) {
        cnt_atomic++;
        cnt_non++;
        cnt_vol++;

        /* lock, increment, unlock */
        lock(&lockholder, LOCK_C);
        cnt_lock++;
        unlock(&lockholder);
    }
    return NULL;
}

void*
fn_p(void *thr_data)
{
    (void)thr_data;
    for (int i = 0; i < 30000; i++) {
        cnt_atomic++;
        cnt_non++;
        cnt_vol++;

        /* lock, increment, unlock */
        lock(&lockholder, LOCK_P);
        cnt_lock++;
        unlock(&lockholder);
    }
    return NULL;
}
void
drv_pc(void)
{
    pthread_t thr[2];
    pthread_create(&thr[0], NULL, fn_c, NULL);
    pthread_create(&thr[1], NULL, fn_p, NULL);
    for (int n = 0; n < 2; ++n)
        pthread_join(thr[n], NULL);

    printf("cnt_a=%u, cnt_non=%d cnt_vol=%d\n", cnt_atomic, cnt_non, cnt_vol);
    printf("lock %d held_c=%d held_p=%d\n", cnt_lock, lock_held_c, lock_held_p);
}
that it is not safe to assume that reads/writes of data in
multithreaded applications are atomic at the OS/hardware level, and
corruption of data may result
Actually, non-composite operations on types like int are atomic on all reasonable architectures. What you read is simply a hoax.
(An increment is a composite operation: it has a read, a calculation, and a write component. Each component is atomic but the whole composite operation is not.)
But atomicity at the hardware level isn't the issue. The high-level language you use simply doesn't support that kind of manipulation on regular types. You need to use atomic types to even have the right to manipulate objects in such a way that the question of atomicity is relevant: when you are potentially modifying an object in use by another thread.
(Or volatile types. But don't use volatile. Use atomics.)
I've been having an implementation discussion where the idea that a CPU can choose to completely reorder the storing of memory has come up.
I was initializing a static array in C using code similar to:
static int array[10];
static int array_initialized = 0;
void initialize () {
    array[0] = 1;
    array[1] = 2;
    ...
    array_initialized = -1;
}
and it is used later similar to:
int get_index(int index) {
    if (!array_initialized) initialize();
    if (index < 0 || index > 9) return -1;
    return array[index];
}
Is it possible for the CPU to reorder memory access in a multi-core Intel architecture (or other architecture) such that it sets array_initialized before the initialize function has finished setting the array elements? Or so that another execution thread can see array_initialized as non-zero before the entire array has been initialized in its view of the memory?
TL:DR: to make lazy-init safe if you don't do it before starting multiple threads, you need an _Atomic flag.
is it possible for the CPU to reorder memory access in a multi-core Intel (x86) architecture
No, such reordering is possible at compile time only. x86 asm effectively has acquire/release semantics for normal loads/stores. (seq_cst + a store buffer with store forwarding).
https://preshing.com/20120625/memory-ordering-at-compile-time/
(or other architecture)
Yes, most other ISAs have a weaker asm memory model that does allow StoreStore reordering and LoadLoad reordering. (Effectively memory_order_relaxed, or sort of like memory_order_consume on ISAs other than Alpha AXP, but compilers don't try to maintain data dependencies.)
None of this really matters from C because the C memory model is very weak, allowing compile-time reordering and simultaneous read/write or write+write of any object is data-race UB.
Data Race UB is what lets a compiler keep static variables in registers for the life of a function / inside a loop when compiling for "normal" ISAs.
Having 2 threads run this function is C data-race UB if array_initialized isn't already set before either of them run. (e.g. by having the main thread run it once before starting any more threads). And remove the array_initialized flag entirely, unless you have a use for the lazy-init feature before starting any more threads.
It's 100% safe for a single thread, regardless of how many other threads are running: the C programming model guarantees that a single thread always sees its own operations in program order. (Just like asm for all normal ISAs; other than explicit parallelism in ISAs like Itanium, you always see your own operations in order. It's only other threads seeing your operations where things get weird).
Starting a new thread is (I think) always a "full barrier", or in C terms "synchronizes with" the new thread. Stuff in the new thread can't happen before anything in the parent thread. So just calling get_index once from the main thread makes it safe with no further barriers for other threads to run get_index after that.
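In other words, a pattern like the following sketch (using pthreads for illustration, and the get_index() from the question):
#include <pthread.h>
#include <stddef.h>

// The lazy init runs once in the main thread; pthread_create()
// synchronizes-with the new thread, so the workers need no extra barriers.
static void *worker(void *arg)
{
    (void)arg;
    get_index(3);   // safe: array[] was initialized before this thread existed
    return NULL;
}

int main(void)
{
    get_index(0);   // triggers initialize() exactly once, still single-threaded

    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}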
You could make lazy init thread-safe with an _Atomic flag
This is similar to what gcc does for function-local static variables with non-constant initializers. Check out the code-gen for that if you're curious: a read-only check of an already-init flag and then a call to an init function that makes sure only one thread runs the initializer.
This requires an acquire load in the fast-path for the already-initialized state. That's free on x86 and SPARC-TSO (same asm as a normal load), but not on weaker ISAs. AArch64 has an acquire load instruction, other ISAs need some barrier instructions.
Turn your array_initialized flag into a 3-state _Atomic variable:
init not started (e.g. init == 0). Check for this with an acquire load.
init started but not finished (e.g. init == -1)
init finished (e.g. init == 1)
You can leave static int array[10]; itself non-atomic by making sure exactly 1 thread "claims" responsibility for doing the init, using atomic_compare_exchange_strong (which will succeed for exactly one thread). And then have other threads spin-wait for the INIT_FINISHED state.
Using initial state == 0 lets it be in the BSS, hopefully next to the data. Otherwise we might prefer INIT_FINISHED=0 for ISAs where branching on an int from memory being (non)zero is slightly more efficient than other numbers. (e.g. AArch64 cbnz, MIPS bne $reg, $zero).
We could get the best of both worlds (cheapest possible fast-path for the already-init case) while still having the flag in the BSS: Have the main thread write it with INIT_NOTSTARTED = -1 before starting any more threads.
Having the flag next to the array is helpful for a small array where the flag is probably in the same cache line as the data we want to index. Or at least the same 4k page.
#include <stdatomic.h>
#include <stdbool.h>
#ifdef __x86_64__
#include <immintrin.h>
#define SPINLOOP_BODY _mm_pause()
#else
#define SPINLOOP_BODY /**/
#endif
#ifdef __GNUC__
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)
#define NOINLINE __attribute__((noinline))
#else
#define unlikely(expr) (expr)
#define likely(expr) (expr)
#define NOINLINE /**/
#endif
enum init_states {
    INIT_NOTSTARTED = 0,
    INIT_STARTED = -1,
    INIT_FINISHED = 1  // optional: make this 0 to speed up the fast-path on some ISAs, and store an INIT_NOTSTARTED before the first call
};

static int array[10];
static _Atomic int array_initialized = INIT_NOTSTARTED;

// called either before or during init.
// One thread claims responsibility for doing the init, others spin-wait
NOINLINE  // this is rare, make sure it doesn't bloat the fast-path
void initialize(void) {
    bool winner = false;

    // check read-only if another thread has already claimed init
    if (array_initialized == INIT_NOTSTARTED) {
        int expected = INIT_NOTSTARTED;
        winner = atomic_compare_exchange_strong(&array_initialized, &expected, INIT_STARTED);
        // seq_cst memory order is fine. Weaker might be ok but it only has to run once
    }

    if (winner) {
        array[0] = 1;
        // ...
        atomic_store_explicit(&array_initialized, INIT_FINISHED, memory_order_release);
    } else {
        // spin-wait for the winner in other threads
        // yield(); optional.
        // Or use some kind of mutex or condition var if init is really slow,
        // otherwise just spin on a seq_cst load. (Or acquire is fine.)
        while (array_initialized != INIT_FINISHED)
            SPINLOOP_BODY;  // x86 only
        // winner's release store syncs with our load:
        // array[] stores Happened Before this point so we can read it without UB
    }
}

int get_index(int index) {
    // atomic acquire load is fine, doesn't need seq_cst. Cheaper than seq_cst on PowerPC
    if (unlikely(atomic_load_explicit(&array_initialized, memory_order_acquire) != INIT_FINISHED))
        initialize();

    if (unlikely(index < 0 || index > 9)) return -1;
    return array[index];
}
This does compile to correct-looking and efficient asm on Godbolt. Without unlikely() macros, gcc/clang think that at least the stand-alone version of get_index has initialize() and/or return -1 as the most likely fast-path.
And compilers wanted to inline the init function, which would be silly because it only runs once per thread at most. Hopefully profile-guided optimization would correct that.
I have a function reading from some volatile memory which is updated by a DMA. The DMA never operates on the same memory location as the function. My application is performance critical. Hence, I realized the execution time is improved by approx. 20% if I do not declare the memory as volatile. Within the scope of my function the memory is non-volatile. However, I have to be sure that the next time the function is called, the compiler knows that the memory may have changed.
The memory is two two-dimensional arrays:
volatile uint16_t memoryBuffer[2][10][20] = {0};
The DMA operates on the opposite "matrix" than the program function:
void myTask(uint8_t indexOppositeOfDMA)
{
    for(uint8_t n=0; n<10; n++)
    {
        for(uint8_t m=0; m<20; m++)
        {
            //Do some stuff with memory (readings only):
            foo(memoryBuffer[indexOppositeOfDMA][n][m]);
        }
    }
}
Is there a proper way to tell my compiler that memoryBuffer is non-volatile inside the scope of myTask() but may be changed the next time I call myTask(), so I can obtain the 20% performance improvement?
Platform Cortex-M4
The problem without volatile
Let's assume that volatile is omitted from the data array. Then the C compiler
and the CPU do not know that its elements change outside the program-flow. Some
things that could happen then:
The whole array might be loaded into the cache when myTask() is called for
the first time. The array might stay in the cache forever and is never
updated from the "main" memory again. This issue is more pressing on multi-core
CPUs if myTask() is bound to a single core, for example.
If myTask() is inlined into the parent function, the compiler might decide
to hoist loads outside of the loop even to a point where the DMA transfer
has not been completed.
The compiler might even be able to determine that no write happens to
memoryBuffer and assume that the array elements stay at 0 all the time
(which would again trigger a lot of optimizations). This could happen if
the program was rather small and all the code is visible to the compiler
at once (or LTO is used).
Remember: After all the compiler does not know anything about the DMA
peripheral and that it is writing "unexpectedly and wildly into memory"
(from a compiler perspective).
If the compiler is dumb/conservative and the CPU not very sophisticated (single core, no out-of-order execution), the code might even work without the volatile declaration. But it also might not...
The problem with volatile
Making
the whole array volatile is often a pessimisation. For speed reasons you
probably want to unroll the loop. So instead of loading from the
array and incrementing the index alternatingly such as
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
it can be faster to load multiple elements at once and increment the index
in larger steps such as
load memoryBuffer[m]
load memoryBuffer[m + 1]
load memoryBuffer[m + 2]
load memoryBuffer[m + 3]
m += 4;
This is especially true, if the loads can be fused together (e.g. to perform
one 32-bit load instead of two 16-bit loads). Further you want the
compiler to use SIMD instruction to process multiple array elements with
a single instruction.
These optimizations are often prevented if the load happens from
volatile memory because compilers are usually very conservative with
load/store reordering around volatile memory accesses.
Again the behavior differs between compiler vendors (e.g. MSVC vs GCC).
Possible solution 1: fences
So you would like to make the array non-volatile but add a hint for the compiler/CPU saying "when you see this line (execute this statement), flush the cache and reload the array from memory". In C11 you could insert an atomic_thread_fence at the beginning of myTask(). Such fences prevent the re-ordering of loads/stores across them.
Since we do not have a C11 compiler, we use intrinsics for this task. The ARMCC compiler has a __dmb() intrinsic (data memory barrier). For GCC you may want to look at __sync_synchronize() (doc).
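A sketch of what that could look like for the function from the question (assuming memoryBuffer is now declared without volatile and that foo() takes the element by value; whether a plain C11 fence, __dmb() or __sync_synchronize() is the right choice depends on your toolchain, as noted above):
#include <stdint.h>
#include <stdatomic.h>   // C11; with ARMCC use __dmb(), with older GCC __sync_synchronize()

extern void foo(uint16_t value);     // assumed, as in the question

uint16_t memoryBuffer[2][10][20];    // note: no volatile anymore

void myTask(uint8_t indexOppositeOfDMA)
{
    // Hint to the compiler/CPU: do not move loads of memoryBuffer above this
    // point, so values cached from a previous call are not reused.
    atomic_thread_fence(memory_order_acquire);

    for (uint8_t n = 0; n < 10; n++)
        for (uint8_t m = 0; m < 20; m++)
            foo(memoryBuffer[indexOppositeOfDMA][n][m]);
}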
Possible solution 2: atomic variable holding the buffer state
We use the following pattern a lot in our codebase (e.g. when reading data from
SPI via DMA and calling a function to analyze it): The buffer is declared as
plain array (no volatile) and an atomic flag is added to each buffer, which
is set when the DMA transfer has finished. The code looks something
like this:
typedef struct Buffer
{
uint16_t data[10][20];
// Flag indicating if the buffer has been filled. Only use atomic instructions on it!
int filled;
// C11: atomic_int filled;
// C++: std::atomic_bool filled{false};
} Buffer_t;
Buffer_t buffers[2];
Buffer_t* volatile currentDmaBuffer; // using volatile here because I'm lazy
void setupDMA(void)
{
    for (int i = 0; i < 2; ++i)
    {
        int bufferFilled;
        // Atomically load the flag.
        bufferFilled = __sync_fetch_and_or(&buffers[i].filled, 0);
        // C11: bufferFilled = atomic_load(&buffers[i].filled);
        // C++: bufferFilled = buffers[i].filled;

        if (!bufferFilled)
        {
            currentDmaBuffer = &buffers[i];
            ... configure DMA to write to buffers[i].data and start it
        }
    }

    // If you end up here, there is no free buffer available because the
    // data processing takes too long.
}
void DMA_done_IRQHandler(void)
{
    // ... stop DMA if needed

    // Atomically set the flag indicating that the buffer has been filled.
    __sync_fetch_and_or(&currentDmaBuffer->filled, 1);
    // C11: atomic_store(&currentDmaBuffer->filled, 1);
    // C++: currentDmaBuffer->filled = true;

    currentDmaBuffer = 0;

    // ... possibly start another DMA transfer ...
}
void myTask(Buffer_t* buffer)
{
    for (uint8_t n=0; n<10; n++)
        for (uint8_t m=0; m<20; m++)
            foo(buffer->data[n][m]);

    // Reset the flag atomically.
    __sync_fetch_and_and(&buffer->filled, 0);
    // C11: atomic_store(&buffer->filled, 0);
    // C++: buffer->filled = false;
}

void waitForData(void)
{
    // ... see setupDma(void) ...
}
The advantage of pairing the buffers with an atomic is that you are able to detect when the processing is too slow, meaning that you have to buffer more, make the incoming data slower, or make the processing code faster, whatever is sufficient in your case.
Possible solution 3: OS support
If you have an (embedded) OS, you might resort to other patterns instead of using volatile arrays. The OS we use features memory pools and queues. The latter can be filled from a thread or an interrupt and a thread can block on
the queue until it is non-empty. The pattern looks a bit like this:
MemoryPool pool; // A pool to acquire DMA buffers.
Queue bufferQueue; // A queue for pointers to buffers filled by the DMA.
void* volatile currentBuffer; // The buffer currently filled by the DMA.
void setupDMA(void)
{
    currentBuffer = MemoryPool_Allocate(&pool, 20 * 10 * sizeof(uint16_t));
    // ... make the DMA write to currentBuffer
}

void DMA_done_IRQHandler(void)
{
    // ... stop DMA if needed

    Queue_Post(&bufferQueue, currentBuffer);
    currentBuffer = 0;
}

void myTask(void)
{
    void* buffer = Queue_Wait(&bufferQueue);
    [... work with buffer ...]
    MemoryPool_Deallocate(&pool, buffer);
}
This is probably the easiest approach to implement but only if you have an OS
and if portability is not an issue.
Here you say that the buffer is non-volatile:
"memoryBuffer is non-volatile inside the scope of myTask"
But here you say that it must be volatile:
"but may be changed next time i call myTask"
These two sentences are contradictory. Clearly the memory area must be volatile, or the compiler can't know that it may be updated by the DMA.
However, I rather suspect that the actual performance loss comes from accessing this memory region repeatedly through your algorithm, forcing the compiler to read it back over and over again.
What you should do is to take a local, non-volatile copy of the part of the memory you are interested in:
void myTask(uint8_t indexOppositeOfDMA)
{
    for(uint8_t n=0; n<10; n++)
    {
        for(uint8_t m=0; m<20; m++)
        {
            volatile uint16_t* data = &memoryBuffer[indexOppositeOfDMA][n][m];
            uint16_t local_copy = *data; // this access is volatile and won't get optimized away

            foo(&local_copy); // optimizations possible here

            // if needed, write back again:
            *data = local_copy; // optional
        }
    }
}
You'll have to benchmark it, but I'm pretty sure this should improve performance.
Alternatively, you could first copy the whole part of the array you are interested in, then work on that, before writing it back. That should help performance even more.
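A sketch of that variant, copying element by element so every volatile read is preserved (names taken from the question; foo() is assumed to take the value):
#include <stdint.h>

extern volatile uint16_t memoryBuffer[2][10][20];  // as in the question
extern void foo(uint16_t value);                   // assumed read-only processing

void myTask(uint8_t indexOppositeOfDMA)
{
    uint16_t local[10][20];

    // Copy element by element: every read of the volatile source is kept,
    // but everything below works on ordinary, freely optimizable memory.
    for (uint8_t n = 0; n < 10; n++)
        for (uint8_t m = 0; m < 20; m++)
            local[n][m] = memoryBuffer[indexOppositeOfDMA][n][m];

    // The compiler is now free to unroll, fuse and vectorize this loop.
    for (uint8_t n = 0; n < 10; n++)
        for (uint8_t m = 0; m < 20; m++)
            foo(local[n][m]);

    // If the data were modified, write it back to the volatile buffer here (optional).
}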
You're not allowed to cast away the volatile qualifier. [1]
If the array must be defined holding volatile elements then the only two options, "that let the compiler know that the memory has changed", are to keep the volatile qualifier, or use a temporary array which is defined without volatile and is copied to the proper array after the function call. Pick whichever is faster.
[1] Quoted from ISO/IEC 9899:201x, 6.7.3 Type qualifiers, ¶6:
If an attempt is
made to refer to an object defined with a volatile-qualified type through use of an lvalue
with non-volatile-qualified type, the behavior is undefined.
It seems to me that you are passing half of the buffer to myTask, and each half does not need to be volatile. So I wonder if you could solve your issue by defining the buffer as such and then passing a pointer to one of the half-buffers to myTask. I'm not sure whether this will work, but maybe something like this...
typedef struct memory_buffer {
    uint16_t buffer[10][20];
} memory_buffer;

volatile memory_buffer double_buffer[2];

void myTask(memory_buffer *mem_buf)
{
    for(uint8_t n=0; n<10; n++)
    {
        for(uint8_t m=0; m<20; m++)
        {
            //Do some stuff with memory:
            foo(mem_buf->buffer[n][m]);
        }
    }
}
I don't know your platform/MCU/SoC, but usually DMAs have interrupts that trigger on a programmable threshold.
What I can imagine is to remove the volatile keyword and use the interrupt as a semaphore for the task.
In other words:
The DMA is programmed to interrupt when the last byte of the buffer is written.
The task blocks on a semaphore/flag, waiting for the flag to be released.
When the DMA fires the interrupt, the routine changes the buffer pointed to by the DMA for the next transfer and sets the flag that unblocks the task so it can process the data.
Something like:
uint16_t memoryBuffer[2][10][20];
volatile uint8_t PingPong = 0;

void interrupt(void)
{
    // Change current DMA pointed buffer
    PingPong ^= 1;
}

void myTask(void)
{
    static uint8_t lastPingPong = 0;

    if (lastPingPong != PingPong)
    {
        for (uint8_t n = 0; n < 10; n++)
        {
            for (uint8_t m = 0; m < 20; m++)
            {
                //Do some stuff with memory:
                foo(memoryBuffer[PingPong][n][m]);
            }
        }

        lastPingPong = PingPong;
    }
}
I'm trying to instrument some functions in my application to see how long they take. I'm recording all of the times in-memory using a linked list.
In the process, I've introduced a global variable that keeps track of the end of the list. When I enter a new timing region, I insert a new record at the end of the list. Fairly simple stuff.
Some of the functions I want to track are called in OpenMP regions, however. Which means they'll likely be called multiple times in parallel. And this is where I'm stumped.
If this was using normal Pthreads, I'd simply wrap access to the global variable in a mutex and call it a day. However, I'm unsure: will this strategy still work with functions called in an OpenMP region? As in, will they respect the lock?
For example (won't compile, but I think gets the point across):
Record *head;
Record *tail;

void start_timing(char *name) {
    Record *r = create_record(name);
    tail->next_record = r;
    tail = r;
    return r;
}

int foo(void) {
    Record r = start_timing("foo");
    //Do something...
    stop_timing(r);
}

int main(void) {
    Record r = start_timing("main");
    //Do something...

    #pragma omp parallel for...
    for (int i = 0; i < 42; i++) {
        foo();
    }

    //Do some more...
    stop_timing(r);
}
Which I would then update to:
void start_timing(char *name) {
    Record *r = create_record(name);

    acquire_mutex_on_tail();
    tail->next_record = r;
    tail = r;
    release_mutex_on_tail();

    return r;
}
(Apologies if this has an obvious answer - I'm relatively inexperienced with the OpenMP framework and multithreading in general.)
The idiomatic mutex solution is to use OpenMP locks:
omp_set_lock(&taillock);
tail->next_record = r;
tail = r;
omp_unset_lock(&taillock);
and somewhere:
omp_lock_t taillock;
omp_init_lock(&taillock);
...
omp_destroy_lock(&taillock);
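Put together, a sketch of the full function might look like this (assuming Record, create_record() and tail from the question, and that omp_init_lock(&taillock) runs once before the first timed region):
#include <omp.h>

extern Record *tail;            // from the question
static omp_lock_t taillock;     // omp_init_lock(&taillock) once at startup

Record *start_timing(char *name) {
    Record *r = create_record(name);

    omp_set_lock(&taillock);
    tail->next_record = r;
    tail = r;
    omp_unset_lock(&taillock);

    return r;
}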
The simple OpenMP solution:
Record *start_timing(char *name) {
    Record *r = create_record(name);

    #pragma omp critical
    {
        tail->next_record = r;
        tail = r;
    }

    return r;
}
That creates an implicit global lock bound to the source code line. For some detailed discussions see the answers to this question.
For practical purposes, using Pthread locks will also work, at least for scenarios where OpenMP is based on Pthreads.
A word of warning
Using locks in performance measurement code is dangerous. So is memory allocation, which also often implies taking locks. This means that start_timing has a significant cost, and performance will get even worse with more threads. That doesn't even consider the cache invalidation from having one thread allocate a chunk of memory (the record) and then another thread modify it (the tail pointer).
Now that may be fine if the sections you measure take seconds, but it will cause great overhead and perturbation when your sections are only hundreds of cycles.
To create a scalable performance tracing facility, you must pre-allocate thread-local memory in larger chunks and have each thread write only to its local part (see the sketch below).
You can also choose to use one of the existing measurement infrastructures, such as Score-P.
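A rough sketch of that direction (illustrative only; all names here are made up, and a real tracer would also need a way to merge and flush the per-thread buffers outside the measured region):
#include <stdint.h>

#define EVENTS_PER_THREAD 4096

typedef struct {
    const char *name;
    uint64_t    start_ticks;
    uint64_t    stop_ticks;
} TimingEvent;

// One pre-allocated chunk per thread: only the owning thread writes to it,
// so the hot path needs no lock, no malloc, and shares no cache lines.
static _Thread_local TimingEvent tl_events[EVENTS_PER_THREAD];
static _Thread_local int tl_count;

static inline TimingEvent *start_timing_local(const char *name, uint64_t now_ticks)
{
    if (tl_count >= EVENTS_PER_THREAD)
        return 0;   // chunk full: drop the event, or flush later outside the timed code
    TimingEvent *e = &tl_events[tl_count++];
    e->name = name;
    e->start_ticks = now_ticks;
    return e;
}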
Overhead & perturbation
First, distinguish between the two (linked concepts). Overhead is extra time you spend, while perturbation refers to the impact on what you measure (i.e. you now measure something different than what happens without the measurement). Overhead is undesirable in large quantities, but perturbation is much worse.
Yes, you can avoid some of the perturbation by pausing the timer during your expensive measurement runtime (the overhead remains). However, in a multi-threaded context this is still very problematic.
Slowing down progress in one thread may lead to other threads waiting for it, e.g. during an implicit barrier. How do you attribute the waiting time of that thread and others that follow transitively?
Memory allocation is usually locked - so if you allocate memory during measurement runtime, you will slow down other threads that depend on memory allocation. You could try to mitigate with memory pools, but I'd avoid the linked list in the first place.
Does anyone know of a good/correct implementation of Peterson's Lock algorithm in C? I can't seem to find this. Thanks.
Peterson's algorithm cannot be implemented correctly in C99, as explained in who ordered memory fences on an x86.
Peterson's algorithm is as follows:
              thread "id"                        thread "other"

LOCK:
    interested[id] = 1                 interested[other] = 1
    turn = other                       turn = id

    while turn == other                while turn == id
      and interested[other] == 1         and interested[id] == 1

UNLOCK:
    interested[id] = 0                 interested[other] = 0
There are some hidden assumptions here. To begin with, each thread must note its interest in acquiring the lock before giving away its turn. Giving away the turn must make visible to the other thread that we are interested in acquiring the lock.
Also, as in every lock, memory accesses in the critical section cannot be hoisted past the lock() call, nor sunk past the unlock(). I.e.: lock() must have at least acquire semantics, and unlock() must have at least release semantics.
In C11, the simplest way to achieve this would be to use a sequentially consistent memory order, which makes the code run as if it were a simple interleaving of threads running in program order (WARNING: totally untested code, but it's similar to an example in Dmitriy V'jukov's Relacy Race Detector):
void lock(int id)
{
    atomic_store(&interested[id], 1);
    atomic_store(&turn, 1 - id);

    while (atomic_load(&turn) == 1 - id
           && atomic_load(&interested[1 - id]) == 1);
}

void unlock(int id)
{
    atomic_store(&interested[id], 0);
}
This ensures that the compiler doesn't make optimizations that break the algorithm (by hoisting/sinking loads/stores across atomic operations), and emits the appropriate CPU instructions to ensure the CPU also doesn't break the algorithm. The default memory order for C11/C++11 atomic operations that don't explicitly select a memory order is the sequentially consistent memory order.
C11/C++11 also support weaker memory models, allowing as much optimization as possible. The following is a translation to C11 of the translation to C++11 by Anthony Williams of an algorithm originally by Dmitriy V'jukov in the syntax of his own Relacy Race Detector
[petersons_lock_with_C++0x_atomics] [the-inscrutable-c-memory-model]. If this algorithm is incorrect it's my fault (WARNING: also untested code, but based on good code from Dmitriy V'jukov and Anthony Williams):
void lock(int id)
{
    atomic_store_explicit(&interested[id], 1, memory_order_relaxed);
    atomic_exchange_explicit(&turn, 1 - id, memory_order_acq_rel);

    while (atomic_load_explicit(&interested[1 - id], memory_order_acquire) == 1
           && atomic_load_explicit(&turn, memory_order_relaxed) == 1 - id);
}

void unlock(int id)
{
    atomic_store_explicit(&interested[id], 0, memory_order_release);
}
Notice the exchange with acquire and release semantics. An exchange is an
atomic RMW operation. Atomic RMW operations always read the last value stored
before the write in the RMW operation. Also, an acquire on an atomic object
that reads a write from a release on that same atomic object (or any later
write on that object from the thread that performed the release or any later
write from any atomic RMW operation) creates a synchronizes-with relation
between the release and the acquire.
So, this operation is a synchronization point between the threads; there is
always a synchronizes-with relationship between the exchange in one thread and
the last exchange performed by any thread (or the initialization of turn, for
the very first exchange).
So we have a sequenced-before relationship between the store to interested[id]
and the exchange from/to turn, a synchronizes-with relationship between two
consecutive exchanges from/to turn, and a sequenced-before relationship
between the exchange from/to turn and the load of interested[1 - id]. This
amounts to a happens-before relationship between accesses to interested[x] in
different threads, with turn providing the synchronization between threads.
This forces all the ordering needed to make the algorithm work.
So how were these things done before C11? It involved using compiler and
CPU-specific magic. As an example, let's see the pretty strongly-ordered x86.
IIRC, all x86 loads have acquire semantics, and all stores have release
semantics (save non-temporal moves, in SSE, used precisely to achieve higher
performance at the cost of occasionally needing to issue CPU fences to achieve
coherence between CPUs). But this is not enough for Peterson's algorithm, as
Bartosz Milewsky explains at
who-ordered-memory-fences-on-an-x86 ,
for Peterson's algorithm to work we need to establish an ordering between
accesses to turn and interested, failing to do that may result in seeing loads
from interested[1 - id] before writes to interested[id], which is a bad thing.
So a way to do that in GCC/x86 would be (WARNING: although I tested something similar to the following, actually a modified version of the code at wrong-implementation-of-petersons-algorithm , testing is nowhere near assuring correctness of multithreaded code):
void lock(int id)
{
    interested[id] = 1;
    turn = 1 - id;

    __asm__ __volatile__("mfence");

    do {
        __asm__ __volatile__("":::"memory");
    } while (turn == 1 - id
             && interested[1 - id] == 1);
}

void unlock(int id)
{
    interested[id] = 0;
}
The MFENCE prevents stores and loads to different memory addresses from being
reordered. Otherwise the write to interested[id] could be queued in the store
buffer while the load of interested[1 - id] proceeds. On many current
microarchitectures a SFENCE may be enough, since it may be implemented as a
store buffer drain, but IIUC SFENCE doesn't need to be implemented that way,
and may simply prevent reordering between stores. So SFENCE may not be enough everywhere, and we need a full MFENCE.
The compiler barrier (__asm__ __volatile__("":::"memory")) prevents
the compiler from deciding that it already knows the value of turn. We're
telling the compiler that we've clobbered memory, so all values cached in
registers must be reloaded from memory.
P.S: I feel this needs a closing paragraph, but my brain is drained.
I won't make any assertions about how good or correct the implementation is, but it was tested (briefly). This is a straight translation of the algorithm described on wikipedia.
struct petersonslock_t {
    volatile unsigned flag[2];
    volatile unsigned turn;
};
typedef struct petersonslock_t petersonslock_t;

petersonslock_t petersonslock () {
    petersonslock_t l = { { 0U, 0U }, ~0U };
    return l;
}

void petersonslock_lock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    l->flag[p] = 1;
    l->turn = !p;
    while (l->flag[!p] && (l->turn == !p)) {}
};

void petersonslock_unlock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    l->flag[p] = 0;
};
Greg points out that on an SMP architecture with slightly relaxed memory coherency (such as x86), although the loads to the same memory location are in order, loads to different locations on one processor may appear out of order to the other processor.
Jens Gustedt and ninjalj recommend modifying the original algorithm to use the atomic_flag type. This means setting the flags and turns would use the atomic_flag_test_and_set and clearing them would use atomic_flag_clear from C11. Alternatively, a memory barrier could be imposed between updates to flag.
Edit: I originally attempted to correct for this by writing to the same memory location for all the states. ninjalj pointed out that the bitwise operations turned the state operations into RMW rather than the loads and stores of the original algorithm. So, atomic bitwise operations are required. C11 provides such operators, as does GCC with built-ins. The algorithm below uses GCC built-ins, but wrapped in macros so that it can easily be changed to some other implementation. However, modifying the original algorithm above is the preferred solution.
struct petersonslock_t {
    volatile unsigned state;
};
typedef struct petersonslock_t petersonslock_t;

#define ATOMIC_OR(x,v)  __sync_or_and_fetch(&x, v)
#define ATOMIC_AND(x,v) __sync_and_and_fetch(&x, v)

petersonslock_t petersonslock () {
    petersonslock_t l = { 0x000000U };
    return l;
}

void petersonslock_lock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    unsigned mask = (p == 0) ? 0xFF0000 : 0x00FF00;
    ATOMIC_OR(l->state, (p == 0) ? 0x000100 : 0x010000);
    (p == 0) ? ATOMIC_OR(l->state, 0x000001) : ATOMIC_AND(l->state, 0xFFFF00);
    while ((l->state & mask) && (l->state & 0x0000FF) == !p) {}
};

void petersonslock_unlock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    ATOMIC_AND(l->state, (p == 0) ? 0xFF00FF : 0x00FFFF);
};