Double-checked locking pattern issue in C

Like everybody who has looked into this, I've read the paper http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
I have a question about the barriers when DCLP (the double-checked locking pattern) is implemented on a C structure. Here is the code:
#include <stdlib.h>
typedef struct _singleton_object {
    int x;
    int y;
} sobject;
static sobject *singleton_object = NULL;
sobject *get_singleton_instance(void)
{
    sobject *tmp = singleton_object;
    /* Insert barrier here - compiler or cpu specific or both? */
    if (tmp == NULL) {
        mutex_lock(&lock); /* assume lock is declared and initialized properly */
        tmp = singleton_object;
        if (tmp == NULL) {
            tmp = malloc(sizeof(sobject)); /* assume malloc succeeds */
            tmp->x = 5;
            tmp->y = 7;
            /* Insert barrier here - compiler or cpu specific or both? */
            singleton_object = tmp;
        }
        mutex_unlock(&lock);
    }
    return tmp;
}
The first question is as in the comments: when the paper says to insert barriers, does it mean just the compiler, the CPU, or both? I assume both.
My second question is: what prevents the compiler from replacing tmp with singleton_object in the code? What forces the load of singleton_object into tmp, which could live in a register or on the stack in compiler-generated code? What if the compiler, at every reference to tmp, actually loads singleton_object from memory into a register and then discards that value?
It seems like the solution in the paper referenced above depends on the fact that we are using the local variable. If the compiler does not load the value of the pointer variable into the local variable tmp, we are back to the original problem described in the paper.
My third question is: assuming the compiler does copy the value of singleton_object locally into a register or onto the stack (i.e. variable tmp), why do we need the first barrier? There should be no reordering of tmp = singleton_object and if (tmp == NULL) at the beginning of the function, since there is an implicit read-after-write dependency on tmp. Also, even if we read a stale value from the CPU's cache in the first load into tmp, it should read as NULL. If it is not NULL, then the object construction should be complete, since the thread/CPU that constructs it executes the barrier, which ensures that the stores to x and y are visible to all CPUs before singleton_object has a non-NULL value.

When the paper describes insert barriers does it mean just the compiler, CPU or both?
Both barriers should be CPU barriers (which imply compiler barriers).
what prevents the compiler from replacing tmp with singleton_object in the code?
The barrier after the assignment
sobject *tmp = singleton_object;
means, among other things (for both the CPU and the compiler):
All read accesses issued before the barrier must be completed before the barrier.
Because of that, the compiler is not allowed to read the singleton_object variable in place of tmp after the barrier.
If it (singleton_object) is not NULL, then the object construction should be complete, since the thread/CPU that constructs it executes the barrier, which ensures that the stores to x and y are visible to all CPUs before singleton_object has a non-NULL value.
You need a barrier on the reading side as well before actually using these "visible" fields x and y. Without that barrier, the reading thread may use stale values.
As a rule, every synchronization between different threads requires some sort of barrier on both sides: read and write.
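For comparison, here is a minimal sketch of the same pattern written with C11 atomics, assuming a C11 compiler that provides <stdatomic.h> and POSIX threads; the mutex name singleton_lock is a stand-in for the question's lock. The acquire load on the fast path and the release store after construction are the two "sides" of the barrier described above, and they constrain both the compiler and the CPU.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct {
    int x;
    int y;
} sobject;

static _Atomic(sobject *) singleton_object = NULL;
/* Stand-in for the question's "lock" (assumption: a POSIX mutex). */
static pthread_mutex_t singleton_lock = PTHREAD_MUTEX_INITIALIZER;

sobject *get_singleton_instance(void)
{
    /* Reader side: the acquire load pairs with the release store below,
     * so a non-NULL pointer guarantees x and y are visible. */
    sobject *tmp = atomic_load_explicit(&singleton_object, memory_order_acquire);
    if (tmp == NULL) {
        pthread_mutex_lock(&singleton_lock);
        /* Any earlier store was made under the same mutex, which already
         * orders it before this point, so a relaxed load is enough. */
        tmp = atomic_load_explicit(&singleton_object, memory_order_relaxed);
        if (tmp == NULL) {
            tmp = malloc(sizeof *tmp);
            if (tmp != NULL) {
                tmp->x = 5;
                tmp->y = 7;
                /* Writer side: publish only after construction is done. */
                atomic_store_explicit(&singleton_object, tmp, memory_order_release);
            }
        }
        pthread_mutex_unlock(&singleton_lock);
    }
    return tmp;
}
```

Because tmp is obtained from an atomic load, the compiler cannot replace later uses of tmp with fresh reads of singleton_object, which also addresses the second question.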

Related

Read optimizations on shared memory

Suppose you have a function that makes several read accesses to a shared variable whose access is atomic. Everything runs in the same process: imagine them as threads of a process, or as software running on a bare-metal platform with no MMU.
As a requirement, you must ensure that the value read is consistent for the whole length of the function, so the code must not re-read the memory location; it has to keep the value in a local variable or in a register. How can we ensure that this behaviour is respected?
As an example (shared is the only shared variable):
extern uint32_t a, b, shared;
void useless_function()
{
    __ASM volatile ("":::"memory");
    uint32_t value = shared;
    a = value * 2;
    b = value << 3;
}
Can value be optimized out by direct readings of shared variable in some contexts? If yes, how can I be sure this cannot happen?

As a requirement, you must ensure that the value read is consistent for the whole length of the function, so the code must not re-read the memory location; it has to keep the value in a local variable or in a register. How can we ensure that this behaviour is respected?
You can do that with READ_ONCE macro from Linux kernel:
/*
* Prevent the compiler from merging or refetching reads or writes. The
* compiler is also forbidden from reordering successive instances of
* READ_ONCE and WRITE_ONCE, but only when the compiler is aware of some
* particular ordering. One way to make the compiler aware of ordering is to
* put the two invocations of READ_ONCE or WRITE_ONCE in different C
* statements.
*
* These two macros will also work on aggregate data types like structs or
* unions. If the size of the accessed data type exceeds the word size of
* the machine (e.g., 32 bits or 64 bits) READ_ONCE() and WRITE_ONCE() will
* fall back to memcpy(). There's at least two memcpy()s: one for the
* __builtin_memcpy() and then one for the macro doing the copy of variable
* - '__u' allocated on the stack.
*
* Their two major use cases are: (1) Mediating communication between
* process-level code and irq/NMI handlers, all running on the same CPU,
* and (2) Ensuring that the compiler does not fold, spindle, or otherwise
* mutilate accesses that either do not require ordering or that interact
* with an explicit memory barrier or atomic instruction that provides the
* required ordering.
*/
E.g.:
uint32_t value = READ_ONCE(shared);
The READ_ONCE macro essentially accesses the object you read through a volatile-qualified lvalue, and the compiler cannot emit extra reads or writes for volatile accesses.
The above is equivalent to:
uint32_t value = *(uint32_t volatile*)&shared;
Alternatively:
uint32_t value;
memcpy(&value, &shared, sizeof value);
memcpy breaks the dependency between shared and value, so that the compiler cannot re-load shared instead of loading value.
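If you can change the declaration of shared, C11 atomics express the same intent portably. A minimal sketch, assuming a C11 toolchain with <stdatomic.h> and assuming shared is redeclared as _Atomic uint32_t (unlike the plain uint32_t in the question): a relaxed atomic load is a single read that the compiler may not repeat, so value is a private snapshot, and it also removes the data race that makes a plain read of a concurrently modified variable undefined.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: shared is declared atomic instead of plain uint32_t. */
_Atomic uint32_t shared;
uint32_t a, b;

void useless_function(void)
{
    /* Exactly one load of shared; later uses read only the local copy. */
    uint32_t value = atomic_load_explicit(&shared, memory_order_relaxed);
    a = value * 2;
    b = value << 3;
}
```

memory_order_relaxed is enough here because only the single-read property is needed, not ordering with respect to other variables.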

In the example given, you are not using the variable value in the function at all, so it will definitely be optimised.
Also, as mentioned in the comments, in a multitasking system the value of shared can change within the function.
What I need is that shared is read only once and its local value is kept for the whole function, not re-evaluated.
I would suggest something like the following:
extern uint32_t a, b, shared;
void useless_function()
{
    __ASM volatile ("":::"memory");
    uint32_t value = shared;
    a = value * 2;
    b = value << 3;
}
Here shared is read only once in the function. It will be read again on next call of the function.

(Conditional?) creation of local variables in function

I came across this question: How can I store values into a multi-parameter struct and pass typedef struct to a function?
I suggested declaring a variable of type dtErrorMessage once before the if and using it for both conditions, instead of declaring two different variables, as shown here:
// Initial version
if (GetTemperature() <= -20)
{
    dtErrorMessage low;
    low.errorCode = ERROR_CODE_UNDER_TEMP;
    low.temperature = GetTemperature();
    SendErrorReport(&low);
}
if (GetTemperature() >= 160)
{
    dtErrorMessage high;
    high.errorCode = ERROR_CODE_OVER_TEMP;
    high.temperature = GetTemperature();
    SendErrorReport(&high);
}
// My version
dtErrorMessage err;
int16_t temp = GetTemperature(); // Do not consider change during evaluation; int8_t could never reach 160
if (temp <= -20)
{
    err.errorCode = ERROR_CODE_UNDER_TEMP;
    err.temperature = temp;
    SendErrorReport(&err);
}
else if (temp >= 160)
{
    err.errorCode = ERROR_CODE_OVER_TEMP;
    err.temperature = temp;
    SendErrorReport(&err);
}
else
{
    // Do not send error report
}
My question is: (from an embedded point of view) am I right that, this way, two local variables will be created in RAM regardless of the condition? Consequently, using one unconditional variable declaration before the if and then using it for both conditions would reduce the required RAM, right?
I couldn't find the correct terms to search for getting this answered myself.

The lifetime of a variable with automatic storage duration is until the end of the block. Storage will be guaranteed and it will retain constant address (i.e. the address given by e.g. &) until the end of the block, and after that all pointers to that object will become indeterminate.
The C standard does not say that low and high must occupy the same portion of memory! The storage must be guaranteed until the end of the block containing the declaration, but it can be around longer too. On the other hand the as-if rule says that the program needs to only behave as if it was a program compiled for the abstract machine according to the rules of the C standard, so a high-quality implementation would probably
not reserve memory for both low and high simultaneously at different addresses, and
not reserve memory for err in case the else branch is taken.
In effect, this makes the execution behaviour of the two versions virtually identical.
The difference is mostly stylistic: your version needs less typing, and the original version accommodates bad compilers.
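A third option declares a single err inside the only block that needs it, so its lifetime (in the sense described above) ends with that block and no message object exists when nothing is reported. The sketch below is self-contained only because it adds hypothetical definitions that the question does not show: the dtErrorCode enum and its values, the field types of dtErrorMessage, a stub SendErrorReport that just records what it received, and a CheckTemperature wrapper. A good compiler will likely generate similar code for all three versions.

```c
#include <stdint.h>

/* Hypothetical definitions; the original question does not show them. */
typedef enum { ERROR_CODE_UNDER_TEMP = 1, ERROR_CODE_OVER_TEMP = 2 } dtErrorCode;

typedef struct {
    dtErrorCode errorCode;
    int16_t temperature;  /* int16_t: an int8_t could never hold 160 */
} dtErrorMessage;

/* Hypothetical stub: remembers the last report instead of sending it. */
static dtErrorMessage last_report;
static int reports_sent;

static void SendErrorReport(const dtErrorMessage *msg)
{
    last_report = *msg;
    reports_sent++;
}

/* Evaluates the temperature once and sends at most one report. */
static void CheckTemperature(int16_t temp)
{
    if (temp <= -20 || temp >= 160) {
        dtErrorMessage err;  /* lifetime ends at the closing brace */
        err.errorCode = (temp <= -20) ? ERROR_CODE_UNDER_TEMP : ERROR_CODE_OVER_TEMP;
        err.temperature = temp;
        SendErrorReport(&err);
    }
}
```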

What is the GCC builtin for an atomic set?

I see the list of builtins at https://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html. But for an atomic set, do you need to use the pair __sync_lock_test_and_set and __sync_lock_release?
I have seen this example on https://attractivechaos.wordpress.com/2011/10/06/multi-threaded-programming-efficiency-of-locking/.
#include <stddef.h>
volatile int lock = 0;
void *worker(void *arg)
{
    (void)arg;
    while (__sync_lock_test_and_set(&lock, 1))
        ; /* spin until the lock is acquired */
    // critical section
    __sync_lock_release(&lock);
    return NULL;
}
But if I use this example, and do my atomic set inside the critical section, then atomic sets to different variables will be unnecessarily serialized.
I'd appreciate any input on how to do an atomic set when I have multiple atomic variables.

As per the GCC documentation, you need to use both:
__sync_synchronize (...)
This builtin issues a full memory barrier.
type __sync_lock_test_and_set (type *ptr, type value, ...)
This builtin, as described by Intel, is not a traditional test-and-set operation, but rather an atomic exchange operation. It writes value into *ptr, and returns the previous contents of *ptr. Many targets have only minimal support for such locks, and do not support a full exchange operation. In this case, a target may support reduced functionality here by which the only valid value to store is the immediate constant 1. The exact value actually stored in *ptr is implementation defined.
This builtin is not a full barrier, but rather an acquire barrier. This means that references after the builtin cannot move to (or be speculated to) before the builtin, but previous memory stores may not be globally visible yet, and previous memory loads may not yet be satisfied.
void __sync_lock_release (type *ptr, ...)
This builtin releases the lock acquired by __sync_lock_test_and_set. Normally this means writing the constant 0 to *ptr. This builtin is not a full barrier, but rather a release barrier. This means that all previous memory stores are globally visible, and all previous memory loads have been satisfied, but following memory reads are not prevented from being speculated to before the barrier.

I came up with this solution. Please reply if you know a better one:
typedef struct {
    volatile int lock;          // must be initialized to 0 before 1st call to atomic64_set
    volatile long long counter;
} atomic64_t;
static inline void atomic64_set(atomic64_t *v, long long i)
{
    // see https://attractivechaos.wordpress.com/2011/10/06/multi-threaded-programming-efficiency-of-locking/
    // for an explanation of __sync_lock_test_and_set
    while (__sync_lock_test_and_set(&v->lock, 1)) { // we don't have the lock, so busy wait until
        while (v->lock);                              // it is released (i.e. lock is set to 0)
    }                                                 // by the holder via __sync_lock_release()
    // critical section
    v->counter = i;
    __sync_lock_release(&v->lock);
}
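A lock-free alternative, sketched under the assumption of GCC 4.7 or later (or Clang), which provide the __atomic builtins: __atomic_store_n is itself an atomic set, so each variable is updated independently and nothing is serialized behind a shared spinlock. The atomic64_read helper is a hypothetical companion added for symmetry. On 32-bit targets without native 64-bit atomics, the compiler may emit calls that require linking with -latomic.

```c
#include <stdint.h>

typedef struct {
    int64_t counter;  /* no lock field needed */
} atomic64_t;

/* Atomic set: a single store with sequentially consistent ordering. */
static inline void atomic64_set(atomic64_t *v, int64_t i)
{
    __atomic_store_n(&v->counter, i, __ATOMIC_SEQ_CST);
}

/* Hypothetical companion: atomic read of the same field. */
static inline int64_t atomic64_read(atomic64_t *v)
{
    return __atomic_load_n(&v->counter, __ATOMIC_SEQ_CST);
}
```

If only atomicity of the store is needed and no ordering with other memory, __ATOMIC_RELAXED is a cheaper memory order for both calls.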

Clear variable on the stack

Code Snippet:
int secret_foo(void)
{
    int key = get_secret();
    /* use the key to do highly privileged stuff */
    ....
    /* Need to clear the value of key on the stack before exit */
    key = 0;
    /* Any half decent compiler would probably optimize out the statement above */
    /* How can I convince it not to do that? */
    return result;
}
I need to clear the value of a variable key from the stack before returning (as shown in the code).
In case you are curious, this was an actual customer requirement (embedded domain).

You can use volatile (emphasis mine):
Every access (both read and write) made through an lvalue expression of volatile-qualified type is considered an observable side effect for the purpose of optimization and is evaluated strictly according to the rules of the abstract machine (that is, all writes are completed at some time before the next sequence point). This means that within a single thread of execution, a volatile access cannot be optimized out or reordered relative to another visible side effect that is separated by a sequence point from the volatile access.
volatile int key = get_secret();

volatile might sometimes be overkill, as it also affects all the other uses of the variable.
Use memset_s (C11, Annex K): http://en.cppreference.com/w/c/string/byte/memset. Note that Annex K is optional: memset_s is only available if the implementation defines __STDC_LIB_EXT1__ and you define __STDC_WANT_LIB_EXT1__ to 1 before including <string.h> (glibc, for example, does not provide it).
memset may be optimized away (under the as-if rules) if the object modified by this function is not accessed again for the rest of its lifetime. For that reason, this function cannot be used to scrub memory (e.g. to fill an array that stored a password with zeroes). This optimization is prohibited for memset_s: it is guaranteed to perform the memory write.
int secret_foo(void)
{
    int key = get_secret();
    /* use the key to do highly privileged stuff */
    ....
    memset_s(&key, sizeof(int), 0, sizeof(int));
    return result;
}
You can find other solutions for various platforms/C standards here: https://www.securecoding.cert.org/confluence/display/c/MSC06-C.+Beware+of+compiler+optimizations
Addendum: have a look at this article Zeroing buffer is insufficient which points out other problems (besides zeroing the actual buffer):
With a bit of care and a cooperative compiler, we can zero a buffer — but that's not what we need. What we need to do is zero every location where sensitive data might be stored. Remember, the whole reason we had sensitive information in memory in the first place was so that we could use it; and that usage almost certainly resulted in sensitive data being copied onto the stack and into registers.
Your key value might have been copied into another location (like a register or temporary stack/memory location) by the compiler and you don't have any control to clear that location.
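Where neither volatile on the variable itself nor memset_s is available, a common portable workaround is to write the zeros through a volatile-qualified pointer. A minimal sketch (the name secure_zero is hypothetical): GCC and Clang treat each write through the volatile lvalue as a side effect they must keep, even though the object is never read again. As noted above, this still cannot clear copies the compiler made in registers or other stack slots.

```c
#include <stddef.h>

/* Hypothetical helper: zeroes n bytes at p through a volatile pointer,
 * so the stores are not removed as dead stores. */
static void secure_zero(void *p, size_t n)
{
    volatile unsigned char *vp = (volatile unsigned char *)p;
    while (n--)
        *vp++ = 0;
}
```

Usage in the question's function would be secure_zero(&key, sizeof key); just before return result;.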

If you go with dynamic allocation you can control wiping that memory and not be bound by what the system does with the stack.
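int secret_foo(void)
{
    int *key = malloc(sizeof *key); /* assume malloc succeeds */
    *key = get_secret();
    // other magical things...
    memset(key, 0, sizeof *key); /* wipe after use; a memset right before free may still be optimized away */
    free(key);
    return result;
}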

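One solution is to disable compiler optimizations for the section of code you don't want optimized. With GCC, the optimize pragma applies to the functions defined after it (it is not allowed inside a function body), so it has to wrap the whole function:
#pragma GCC push_options
#pragma GCC optimize ("O0")
int secret_foo(void)
{
    int key = get_secret();
    key = 0;
    return result;
}
#pragma GCC pop_options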

Volatile keyword in C [duplicate]

I am writing a program for ARM in a Linux environment. It's not a low-level program; say, application level.
Can you clarify the difference between
int iData;
and
volatile int iData;
Does it have a hardware-specific impact?

Basically, volatile tells the compiler "the value here might be changed by something external to this program".
It's useful when you're (for instance) dealing with hardware registers, that often change "on their own", or when passing data to/from interrupts.
The point is that it tells the compiler that each access of the variable in the C code must generate a "real" access to the relevant address, it can't be buffered or held in a register since then you wouldn't "see" changes done by external parties.
For regular application-level code, volatile should never be needed unless (of course) you're interacting with something a lot lower-level.

The volatile keyword specifies that a variable can be modified at any moment by something other than the program.
If we are talking about embedded systems, it can be, e.g., a hardware status register. The value it contains may be modified by the hardware at any unpredictable moment.
That is why the compiler is forbidden to apply optimizations that assume it knows the variable's value (such as keeping it in a register or merging or removing accesses), as any such assumption could be wrong and cause unpredictable program behavior.

By making a variable volatile, every time you access the variable you force the compiler to emit a real load from (or store to) its memory location, rather than reusing a copy it keeps in a register. This is helpful in multithreaded programs where threads could otherwise keep reusing a stale copy of the variable. Note, however, that volatile does not make accesses atomic and does not order them with respect to other threads; real inter-thread synchronization still needs atomics, locks, or barriers.

Generally speaking, the volatile keyword is intended to prevent the compiler from applying any optimizations on the code that assume values of variables cannot change "on their own."
(from Wikipedia)
Now, what does this mean?
If you have a variable whose contents could change at any time, usually because another thread acts on it while you may be referencing it in the main thread, you may wish to mark it as volatile. Normally a variable's contents can be relied on with certainty within the scope and context in which the variable is used. But if code outside your scope or thread also affects the variable, your program needs to expect this and query the variable's true contents whenever necessary, rather than relying on a previously read value.
This is a simplification of what is going on, of course, but I doubt you will need to use volatile in most programming tasks.

In the following example, global_data is never explicitly modified, so when optimization is enabled the compiler may conclude that it never changes: it can assume global_data is always 0 and use 0 wherever global_data is used.
But global_data may actually be updated by some other process or mechanism (say, through ptrace). Using volatile forces the compiler to read it from memory every time, so you get the updated value.
#include <stdio.h>
volatile int global_data = 0;
int main(void)
{
    /* %p with a void * is the correct way to print an address */
    printf("\n Address of global_data:%p \n", (void *)&global_data);
    while (1)
    {
        if (global_data == 0)
        {
            continue;
        }
        else if (global_data == 2)
        {
            break;
        }
    }
    return 0;
}

The volatile keyword can be used:
when the object is a memory-mapped I/O port.
An 8-bit memory-mapped I/O port at physical address 0x15 can be declared as
char *const ptr = (char *)0x15;
Suppose that we want to change the value at that port at periodic intervals:
*ptr = 0;
while (*ptr) {
    *ptr = 4; // Setting a value
    *ptr = 0; // Clearing after setting
}
It may get optimized as
*ptr = 0;
while (0) {
}
volatile suppresses this optimization: the compiler assumes that the value can change at any time, even if no explicit code modifies it.
volatile char *const ptr = (volatile char *)0x15;
when the object is modified by an ISR.
Sometimes an ISR may change the values used in the mainline code:
static int num;
void interrupt(void) {
    ++num;
}
int main() {
    int val;
    val = num;
    while (val != num)
        val = num;
    return val;
}
Here the compiler may optimize the while statement, i.e. it may generate code in which num is always read from a CPU register instead of from memory, so the while condition is always false. But in the actual scenario the value of num may be changed by the ISR, and that change is reflected in memory. If the variable is declared volatile (static volatile int num;), the compiler knows that its value must always be read from memory.
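On a hosted system like the question's Linux application, the closest analogue of an ISR is a signal handler. A minimal sketch using only standard C <signal.h>: the flag shared between the handler and the main code is declared volatile sig_atomic_t, the type the C standard specifies for this purpose. The names got_signal, on_signal and raise_and_check are hypothetical.

```c
#include <signal.h>

/* Written by the handler, read by the main code: volatile so each read
 * really goes to memory, sig_atomic_t so the access cannot be torn by
 * signal delivery. */
static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Hypothetical helper: installs the handler, raises the signal once and
 * returns what the main code observes afterwards (-1 on setup failure). */
int raise_and_check(void)
{
    if (signal(SIGINT, on_signal) == SIG_ERR)
        return -1;
    if (raise(SIGINT) != 0)
        return -1;
    /* raise() does not return until the handler has returned. */
    return got_signal;
}
```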

volatile means that a variable's value could be changed at any time by an external source. In GCC, if we don't use volatile, the optimizer may produce code that sometimes behaves in unwanted ways.
For example, if we read the real time from an external real-time clock without volatile, the compiler may keep reusing the value stored in a CPU register, so it will not work the way we want. With the volatile keyword, every access reads from the real-time clock, which serves our purpose.
But as you said you are not dealing with any low-level hardware programming, I don't think you need to use volatile anywhere.