I have a single-writer, multiple-reader situation. There's a counter that one thread is writing to, and any thread may read this counter. Since the single writing thread doesn't have to worry about contending with other threads for data access, is the following code safe?
#include <stdatomic.h>
#include <stdint.h>
_Atomic uint32_t counter;
// Only 1 thread calls this function. No other thread is allowed to.
uint32_t increment_counter() {
atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
return counter; // This is the line in question.
}
// Any thread may call this function.
uint32_t load_counter() {
return atomic_load_explicit(&counter, memory_order_relaxed);
}
The writer thread just reads the counter directly without calling any atomic_load* function. This should be safe (since it's safe for multiple threads to read a value), but I don't know if declaring a variable _Atomic restricts you from using that variable directly, or if you're required to always read it using one of the atomic_load* functions.
Yes, all operations that you do on _Atomic objects are guaranteed to be effected as if you would issue the corresponding call with sequential consistency. And in your particular case an evaluation is equivalent to atomic_load.
But the algorithm as used there is wrong, because by doing an atomic_fetch_add and an evaluation the value that is returned may already be change by another thread. Correct would be
uint32_t ret = atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
return ret+1;
This looks a bit suboptimal because the addition is done twice, but a good optimizer will sort this out.
If you rewrite the function, this question goes away:
uint32_t increment_counter() {
return 1 + atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}
Related
I have a few questions regarding memory barriers.
Say I have the following C code (it will be run both from C++ and C code, so atomics are not possible) that writes an array into another one. Multiple threads may call thread_func(), and I want to make sure that my_str is returned only after it was initialized fully. In this case, it is a given that the last byte of the buffer can't be 0. As such, checking for the last byte as not 0, should suffice.
Due to reordering by compiler/CPU, this can be a problem as the last byte might get written before previous bytes, causing my_str to be returned with a partially copied buffer. So to get around this, I want to use a memory barrier. A mutex will work of course, but would be too heavy for my uses.
Keep in mind that all threads will call thread_func() with the same input, so even if multiple threads call init() a couple of times, it's OK as long as in the end, thread_func() returns a valid my_str, and that all subsequent calls after initialization return my_str directly.
Please tell me if all the following different code approaches work, or if there could be issues in some scenarios as aside from getting the solution to the problem, I'd like to get some more information regarding memory barriers.
__sync_bool_compare_and_swap on last byte. If I understand correctly, any memory store/load would not be reordered, not just the one for the particular variable that is sent to the command. Is that correct? if so, I would expect this to work as all previous writes of the previous bytes should be made before the barrier moves on.
#define STR_LEN 100
static uint8_t my_str[STR_LEN] = {0};
static void init(uint8_t input_buf[STR_LEN])
{
for (int i = 0; i < STR_LEN - 1; ++i) {
my_str[i] = input_buf[i];
}
__sync_bool_compare_and_swap(my_str, 0, input_buf[STR_LEN - 1]);
}
const char * thread_func(char input_buf[STR_LEN])
{
if (my_str[STR_LEN - 1] == 0) {
init(input_buf);
}
return my_str;
}
__sync_bool_compare_and_swap on each write. I would expect this to work as well, but to be slower than the first one.
static void init(char input_buf[STR_LEN])
{
for (int i = 0; i < STR_LEN; ++i) {
__sync_bool_compare_and_swap(my_str + i, 0, input_buf[i]);
}
}
__sync_synchronize before each byte copy. I would expect this to work as well, but is this slower or faster than (2)? __sync_bool_compare_and_swap is supposed to be a full barrier as well, so which would be preferable?
static void init(char input_buf[STR_LEN])
{
for (int i = 0; i < STR_LEN; ++i) {
__sync_synchronize();
my_str[i] = input_buf[i];
}
}
__sync_synchronize by condition. As I understand it, __sync_synchronize is both a HW and SW memory barrier. As such, since the compiler can't tell the value of use_sync it shouldn't reorder. And the HW reordering will be done only if use_sync is true. is that correct?
static void init(char input_buf[STR_LEN], bool use_sync)
{
for (int i = 0; i < STR_LEN; ++i) {
if (use_sync) {
__sync_synchronize();
}
my_str[i] = input_buf[i];
}
}
GNU C legacy __sync builtins are not recommended for new code, as the manual says.
Use the __atomic builtins which can take a memory-order parameter like C11 stdatomic. But they're still builtins and still work on plain types not declared _Atomic, so using them is like C++20 std::atomic_ref. In C++20, use std::atomic_ref<unsigned char>(my_str[STR_LEN - 1]), but C doesn't provide an equivalent so you'd have to use compiler builtins to hand-roll it.
Just do the last store separately with a release store in the writer, not an RMW, and definitely not a full memory barrier (__sync_synchronize()) between every byte!!! That's way slower than necessary, and defeats any optimization to use memcpy. Also, you need the store of the final byte to be at least RELEASE, not a plain store, so readers can synchronize with it. See also Who's afraid of a big bad optimizing compiler? re: how exactly compilers can break your code if you try to hand-roll lockless code with just barriers, not atomic loads or stores. (It's written for Linux kernel code, where a macro would use *(volatile char*) to hand-roll something close to __atomic_store_n with __ATOMIC_RELAXED`)
So something like
__atomic_store_n(&my_str[STR_LEN - 1], input_buf[STR_LEN - 1], __ATOMIC_RELEASE);
The if (my_str[STR_LEN - 1] == 0) load in thread_func is of course data-race UB when there are concurrent writers.
For safety it needs to be an acquire load, like __atomic_load_n(&my_str[STR_LEN - 1], __ATOMIC_ACQUIRE) == 0, since you need a thread that loads a non-0 value to also see all other stores by another thread that ran init(). (Which did a release-store to that location, creating acquire/release synchronization and guaranteeing a happens-before relationship between these threads.)
See https://preshing.com/20120913/acquire-and-release-semantics/
Writing the same value non-atomically is also UB in ISO C and ISO C++. See Race Condition with writing same value in C++? and others.
But in practice it should be fine except with clang -fsanitize=thread. In theory a DeathStation9000 could implement non-atomic stores by storing value+1 and then subtracting 1, so temporarily there's be a different value in memory. But AFAIK there aren't real compilers that do that. I'd have a look at the generated asm on any new compiler / ISA combination you're trying, just to make sure.
It would be hard to test; the init stuff can only race once per program invocation. But there's no fully safe way to do it that doesn't totally suck for performance, AFAIK. Perhaps doing the init with a cast to _Atomic unsigned char* or typedef _Atomic unsigned long __attribute__((may_alias)) aliasing_atomic_ulong; as a building block for a manual copy loop?
Bonus question: if(use_sync) __sync_synchronize() inside the loop.
Since the compiler can't tell the value of use_sync it shouldn't reorder.
Optimization is possible to asm that works something like if(use_sync) { slow barrier loop } else { no-barrier loop }. This is called "loop unswitching": making two loops and branching once to decide which to run, instead of every iteration. GCC has been able to do that optimization (in some cases) since 3.4. So that defeats your attempt to take advantage of how the compiler would compile to trick it into doing more ordering than the source actually requires.
And the HW reordering will be done only if use_sync is true.
Yes, that part is correct.
Also, inlining and constant-propagation of use_sync could easily defeat this, unless use_sync was a volatile global or something. At that point you might as well just make a separate _Atomic unsigned char array_init_done flag / guard variable.
And you can use it for mutual exclusion by having threads try to set it to 1 with int old = guard.exchange(1), with the winner of the race being the one to run init while they spin-wait (or C++20 .wait(1)) for the guard variable to become 2 or -1 or something, which the winner of the race will set after finishing init.
Have a look at the asm GCC makes for non-constant-initialized static local vars; they check a guard variable with an acquire load, only doing locking to have one thread do the run_once init stuff and the others wait for that result. IIRC there's a Q&A about doing that yourself with atomics.
I see the list of builtins at https://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html. But for an atomic set, do you need to use the pair __sync_lock_test_and_set and __sync_lock_release?
I have seen this example of this on https://attractivechaos.wordpress.com/2011/10/06/multi-threaded-programming-efficiency-of-locking/.
volatile int lock = 0;
void *worker(void*)
{
while (__sync_lock_test_and_set(&lock, 1));
// critical section
__sync_lock_release(&lock);
}
But if I use this example, and do my atomic set inside the critical section, then atomic sets to different variables will be unnecessarily serialized.
Appreciate any input on how to do an atomic set where I have multiple atomic variables.
As per definition need to use both
__sync_synchronize (...)
This builtin issues a full memory barrier. type
__sync_lock_test_and_set (type *ptr, type value, ...)
This builtin, as described by Intel, is not a traditional test-and-set
operation, but rather an atomic exchange operation. It writes value
into *ptr, and returns the previous contents of *ptr. Many targets
have only minimal support for such locks, and do not support a full
exchange operation. In this case, a target may support reduced
functionality here by which the only valid value to store is the
immediate constant 1. The exact value actually stored in *ptr is
implementation defined.
This builtin is not a full barrier, but rather an acquire barrier.
This means that references after the builtin cannot move to (or be
speculated to) before the builtin, but previous memory stores may not
be globally visible yet, and previous memory loads may not yet be
satisfied.
void __sync_lock_release (type *ptr, ...)
This builtin releases the
lock acquired by __sync_lock_test_and_set. Normally this means writing
the constant 0 to *ptr. This builtin is not a full barrier, but rather
a release barrier. This means that all previous memory stores are
globally visible, and all previous memory loads have been satisfied,
but following memory reads are not prevented from being speculated to
before the barrier.
I came up with this solution. Please reply if you know a better one:
typedef struct {
volatile int lock; // must be initialized to 0 before 1st call to atomic64_set
volatile long long counter;
} atomic64_t;
static inline void atomic64_set(atomic64_t *v, long long i)
{
// see https://attractivechaos.wordpress.com/2011/10/06/multi-threaded-programming-efficiency-of-locking/
// for an explanation of __sync_lock_test_and_set
while (__sync_lock_test_and_set(&v->lock, 1)) { // we don't have the lock, so busy wait until
while (v->lock); // it is released (i.e. lock is set to 0)
} // by the holder via __sync_lock_release()
// critical section
v->counter = i;
__sync_lock_release(&v->lock);
}
Code Snippet:
int secret_foo(void)
{
int key = get_secret();
/* use the key to do highly privileged stuff */
....
/* Need to clear the value of key on the stack before exit */
key = 0;
/* Any half decent compiler would probably optimize out the statement above */
/* How can I convince it not to do that? */
return result;
}
I need to clear the value of a variable key from the stack before returning (as shown in the code).
In case you are curious, this was an actual customer requirement (embedded domain).
You can use volatile (emphasis mine):
Every access (both read and write) made through an lvalue expression of volatile-qualified type is considered an observable side effect for the purpose of optimization and is evaluated strictly according to the rules of the abstract machine (that is, all writes are completed at some time before the next sequence point). This means that within a single thread of execution, a volatile access cannot be optimized out or reordered relative to another visible side effect that is separated by a sequence point from the volatile access.
volatile int key = get_secret();
volatile might be overkill sometimes as it would also affect all the other uses of a variable.
Use memset_s (since C11): http://en.cppreference.com/w/c/string/byte/memset
memset may be optimized away (under the as-if rules) if the object modified by this function is not accessed again for the rest of its lifetime. For that reason, this function cannot be used to scrub memory (e.g. to fill an array that stored a password with zeroes). This optimization is prohibited for memset_s: it is guaranteed to perform the memory write.
int secret_foo(void)
{
int key = get_secret();
/* use the key to do highly privileged stuff */
....
memset_s(&key, sizeof(int), 0, sizeof(int));
return result;
}
You can find other solutions for various platforms/C standards here: https://www.securecoding.cert.org/confluence/display/c/MSC06-C.+Beware+of+compiler+optimizations
Addendum: have a look at this article Zeroing buffer is insufficient which points out other problems (besides zeroing the actual buffer):
With a bit of care and a cooperative compiler, we can zero a buffer — but that's not what we need. What we need to do is zero every location where sensitive data might be stored. Remember, the whole reason we had sensitive information in memory in the first place was so that we could use it; and that usage almost certainly resulted in sensitive data being copied onto the stack and into registers.
Your key value might have been copied into another location (like a register or temporary stack/memory location) by the compiler and you don't have any control to clear that location.
If you go with dynamic allocation you can control wiping that memory and not be bound by what the system does with the stack.
int secret_foo(void)
{
int *key = malloc(sizeof(int));
*key = get_secret();
memset(key, 0, sizeof(int));
// other magical things...
return result;
}
One solution is to disable compiler optimizations for the section of the code that you dont want optimizations:
int secret_foo(void) {
int key = get_secret();
#pragma GCC push_options
#pragma GCC optimize ("O0")
key = 0;
#pragma GCC pop_options
return result;
}
This question already has answers here:
Why is volatile needed in C?
(18 answers)
Closed 9 years ago.
I am writing program for ARM with Linux environment. its not a low level program, say app level
Can you clarify me what is the difference between,
int iData;
vs
volatile int iData;
Does it have hardware specific impact ?
Basically, volatile tells the compiler "the value here might be changed by something external to this program".
It's useful when you're (for instance) dealing with hardware registers, that often change "on their own", or when passing data to/from interrupts.
The point is that it tells the compiler that each access of the variable in the C code must generate a "real" access to the relevant address, it can't be buffered or held in a register since then you wouldn't "see" changes done by external parties.
For regular application-level code, volatile should never be needed unless (of course) you're interacting with something a lot lower-level.
The volatile keyword specifies that variable can be modified at any moment not by a program.
If we are talking about embedded, then it can be e.g. hardware state register. The value that it contains may be modified by the hardware at any unpredictable moment.
That is why, from the compiler point of view that means that compiler is forbidden to apply optimizations on this variable, as any kind of assumption is wrong and can cause unpredictable result on the program execution.
By making a variable volatile, every time you access the variable, you force the CPU to fetch it from memory rather than from a cache. This is helpful in multithreaded programs where many threads could reuse the value of a variable in a cache. To prevent such reuse ( in multithreaded program) volatile keyword is used. This ensures that any read or write to an volatile variable is stable (not cached)
Generally speaking, the volatile keyword is intended to prevent the compiler from applying any optimizations on the code that assume values of variables cannot change "on their own."
(from Wikipedia)
Now, what does this mean?
If you have a variable that could have its contents changed at any time, usually due to another thread acting on it while you are possibly referencing this variable in a main thread, then you may wish to mark it as volatile. This is because typically a variable's contents can be "depended on" with certainty for the scope and nature of the context in which the variable is used. But if code outside your scope or thread is also affecting the variable, your program needs to know to expect this and query the variable's true contents whenever necessary, more than the normal.
This is a simplification of what is going on, of course, but I doubt you will need to use volatile in most programming tasks.
In the following example, global_data is not explicitly modified. so when the optimization is done, compiler thinks that, it is not going to modified anyway. so it assigns global_data with 0. And uses 0, whereever global_data is used.
But actually global_data updated through some other process/method(say through ptrace ). by using volatile you can force it to read always from memory. so you can get updated result.
#include <stdio.h>
volatile int global_data = 0;
int main()
{
FILE *fp = NULL;
int data = 0;
printf("\n Address of global_data:%x \n", &global_data);
while(1)
{
if(global_data == 0)
{
continue;
}
else if(global_data == 2)
{
;
break;
}
}
return 0;
}
volatile keyword can be used,
when the object is a memory mapped io port.
An 8 bit memory mapped io port at physical address 0x15 can be declared as
char const ptr = (char *) 0x15;
Suppose that we want to change the value at that port at periodic intervals.
*ptr = 0 ;
while(*ptr){
*ptr = 4;//Setting a value
*ptr = 0; // Clearing after setting
}
It may get optimized as
*ptr = 0 ;
while(0){
}
Volatile supress the compiler optimization and compiler assumes that tha value can
be changed at any time even if no explicit code modify it.
Volatile char *const ptr = (volatile char * )0x15;
Used when the object is a modified by ISR.
Sometimes ISR may change tha values used in the mainline codes
static int num;
void interrupt(void){
++num;
}
int main(){
int val;
val = num;
while(val != num)
val = num;
return val;
}
Here the compiler do some optimizations to the while statement.ie the compiler
produce the code it such a way that the value of num will always read form the cpu
registers instead of reading from the memory.The while statement will always be
false.But in actual scenario the valu of num may get changed in the ISR and it will
reflect in the memory.So if the variable is declared as volatile the compiler will know
that the value should always read from the memory
volatile means that variables value could be change any time by any external source. in GCC if we dont use volatile than it optimize the code which is sometimes gives unwanted behavior.
For example if we try to get real time from an external real time clock and if we don't use volatile there then what compiler do is it will always display the value which is stored in cpu register so it will not work the way we want. if we use volatile keyword there then every time it will read from the real time clock so it will serve our purpose....
But as u said you are not dealing with any low level hardware programming then i don't think you need to use volatile anywhere
thanks
I have to use two threads; one to do various operations on matrices, and the other to monitor virtual memory at various points in the matrix operation process. This method is required to use a global state variable 'flag'.
So far I have the following (leaving some out for brevity):
int flag = 0;
int allocate_matrices(int dimension)
{
while (flag == 0) {} //busy wait while main prints memory state
int *matrix = (int *) malloc(sizeof(int)*dimension*dimension);
int *matrix2 = (int *) malloc(sizeof(int)*dimension*dimension);
flag = 0;
while (flag == 0) {} //busy wait while main prints memory state
// more similar actions...
}
int memory_stats()
{
while (flag == 0)
{ system("top"); flag = 1; }
}
int main()
{ //threads are created and joined for these two functions }
As you might expect, the system("top") call happens once, the the matrices are allocated, then the program falls into an infinite loop. It seems apparent to me that this is because the thread assigned to the memory_stats function has already completed its duty, so flag will never be updated again.
Is there an elegant way around this? I know I have to print memory stats four times, so it occurs to me that I could write four while loops in the memory_stats function with busy waiting contingent on the global flag in between each of them, but that seems clunky to me. Any help or pointers would be appreciated.
One of the possible reasons for the hang is that flag is a regular variable and the compiler sees that it's never set to a non-zero value between flag = 0; and while (flag == 0) {} or in this while inside allocate_matrices(). And so it "thinks" the variable stays 0 and the loop becomes infinite. The compiler is entirely oblivious to your threads.
You could define flag as volatile to prevent the above from happening, but you'll likely run into other issues after adding volatile. For one thing, volatile does not guarantee atomicity of variable modifications.
Another issue is that if the compiler sees an infinite loop that has no side effects, it may be considered undefined behavior and anything could happen, or, at least, not what you're thinking should, also this.
You need to use proper synchronization primitives like mutexes.
You can lock it with mutex. I assume you use pthread.
pthread_mutex_t mutex;
pthread_mutex_lock(&mutex);
flag=1;
pthread_mutex_unlock (&mutex);
Here is a very good tutorial about pthreads, mutexes and other stuff: https://computing.llnl.gov/tutorials/pthreads/
Your problem could be solved with a C compiler that follows the latest C standard, C11. C11 has threads and a data type called atomic_flag, that can basically used for a spin lock as you have it in your question.
First of all, the variable flag needs to be declared volatile or else the compiler has license to omit reads to it after the first one.
With that out of the way, a sequencer/event_counter can be used: one thread may increment the variable when it's odd, the other when it's even. Since one thread always "owns" the variable, and transfers the ownership with the increment, there is no race condition.