ensure reading a struct as a whole - c

I have a shared struct variable between 2 threads:
struct myStruct {
    long a;
    long b;
    long c;
};
struct myStruct A;
All 3 fields of A are initialized to zero. Then the 1st thread will update them:
A.a = 1;
A.b = 2;
A.c = 3;
And the 2nd thread will read from it. What I want to ensure is that the 2nd thread reads A as a whole: either the old value {0, 0, 0} or the new value {1, 2, 3}, never a torn mix such as {1, 2, 0}.
The struct doesn't fit in 64 bits, so I cannot use GCC's builtin atomics, and I don't want to use a mutex either, so I came up with 2 guard flags:
struct {
long a;
long b;
long c;
volatile int beginCount, endCount;
} A;
then 1st thread will:
A.beginCount++;
A.a = 1;
A.b = 2;
A.c = 3;
A.endCount++;
and the 2nd will loop until it gets a consistent struct:
int begin, end;
myStruct tmp;
do {
begin = A.beginCount;
end = A.endCount;
tmp = A;
} while (!(begin == A.beginCount && end == A.endCount && A.beginCount == A.endCount));
// now tmp will be either {0,0,0} or {1,2,3}
Are those 2 guard flags enough? If not, please point out the specific combination of thread scheduling that could break it.
Edit 1: the reason I don't want to use a mutex is that the 1st thread has high priority and should not wait for anything. If the 1st thread wants to write while the 2nd is reading, the 1st thread still writes anyway, and the 2nd thread has to redo the read until it gets a consistent value. We can't do that with a mutex, at least not in any way I'm aware of.
Edit 2: about the environment: this code runs on a multiprocessor system, and I dedicate 1 entire CPU core to each thread.
Edit 3: I know that synchronization without a mutex or atomics is very tricky. I've listed all the combinations I could think of and could not find one that breaks the code. So please don't just tell me that it won't work; I would really appreciate it if you pointed out when it will break.

"I don't want to use mutex either"
On a uniprocessor system, if the first thread gets preempted while writing, the reading thread will spend its time slice spinning needlessly. You do want a mutex in such a case.
Neither Linux futexes nor Windows' CriticalSections context-switch in the uncontended case, and on multiprocessor systems they spin a while before yielding.
Why reimplement the exact same mechanism?

There is absolutely no portable way to do what you want. Some very high-end systems have transactional memory that may be able to accomplish what you want, but the normal pattern for using transactional memory anyway is to write your code with locks and rely on the lock implementation to use transactions.
Simply use a mutex to protect both reads and writes. There is no other way to make your code correct, but lots of ways to make it "seem correct to testing" until it violates an invariant and crashes a few months later or gets run on a slightly different environment/cpu and starts crashing every time you run it.

My first advice is that you really should implement it using a mutex (be sure each thread holds the mutex for as little time as possible) and see if you actually run into any problems. Most likely you will find out that using a mutex works just fine, and that nothing more is required. Doing it that way has the advantage of being portable to any hardware, simple to understand, and easy to debug.
That said, if you insist on not using a mutex, then the only other option is to use atomic variables. Since atomic variables are word-sized, you won't be able to make an entire struct atomic, but you can fake it by instantiating an array of structs instead (the necessary size of the array will depend on how often you intend to update the struct) and then using atomic integers as indices into the "currently valid for reading" and "okay to write into" structs in the array.
Reading the current value of the struct out of the array is simple enough: you just need to read from the "currently valid for reading" index in the array, which is guaranteed not to be written to.
Writing a new value is more elaborate. You need to atomically increment the "okay to write into" index (and wrap it around if necessary to avoid indexing off the end of the array, and check for an overflow condition if, after doing this, the okay-for-writing index equals the read-from index). Then write your new struct into the slot specified by the okay-for-writing index. Then you've got to do an atomic compare-and-set operation to set the read-from index equal to the okay-for-writing index; if the compare-and-set operation fails, another thread beat you to the update, so you need to restart the whole set() process and repeat it until the compare-and-set operation succeeds.
(If this all sounds dubious and error-prone, that's because it is. It can be implemented correctly, but it's very easy to instead implement it almost-correctly, and end up with code that works 99.999% of the time and then does something regrettable and non-reproducible the other 0.001% of the time. Consider yourself warned :))
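For the specific single-writer case in the question, a different lock-free shape from the index array described above is a sequence counter (usually called a seqlock), which is close to what the two guard flags were reaching for. Plain volatile counters don't order the non-volatile field accesses around them, so this sketch swaps in C11 atomics and fences instead. It assumes a C11 compiler with <stdatomic.h> and exactly one writer thread; the names Triple, triple_write and triple_read are made up for illustration.

```c
#include <stdatomic.h>

typedef struct { long a, b, c; } Triple;

/* Sequence counter: even = stable, odd = a write is in progress. */
static atomic_uint seq;
/* The fields are atomic so that concurrent access is not a data race. */
static _Atomic long fa, fb, fc;

/* Single writer only. Never blocks and never retries. */
void triple_write(Triple v)
{
    unsigned s = atomic_load_explicit(&seq, memory_order_relaxed);
    atomic_store_explicit(&seq, s + 1, memory_order_relaxed);  /* mark busy */
    atomic_thread_fence(memory_order_release);  /* field stores stay after the mark */
    atomic_store_explicit(&fa, v.a, memory_order_relaxed);
    atomic_store_explicit(&fb, v.b, memory_order_relaxed);
    atomic_store_explicit(&fc, v.c, memory_order_relaxed);
    atomic_store_explicit(&seq, s + 2, memory_order_release);  /* publish */
}

/* Any number of readers. Retries until it gets an untorn snapshot. */
Triple triple_read(void)
{
    Triple v;
    unsigned before, after;
    do {
        before = atomic_load_explicit(&seq, memory_order_acquire);
        v.a = atomic_load_explicit(&fa, memory_order_relaxed);
        v.b = atomic_load_explicit(&fb, memory_order_relaxed);
        v.c = atomic_load_explicit(&fc, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);  /* field loads stay before the recheck */
        after = atomic_load_explicit(&seq, memory_order_relaxed);
    } while (before != after || (before & 1u));
    return v;
}
```

The writer never waits, which was the stated requirement; the cost is that a reader can spin for as long as the writer keeps writing.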

Related

How Do I Enforce Write-only Behavior to a Page in Windows?

I'm reading the documentation for the Win32 VirtualAlloc function, and in the protections listed, there is no PAGE_WRITEONLY option that would pair with the PAGE_READONLY option, naturally. Is there any way to obtain such support by the operating system, or will I have to implement this in software somehow, or can I use processor features that may be available for implementing such things in hardware from user code? A software implementation is undesirable for obvious performance reasons.
Now this also introduces an obvious problem: the memory cannot be read, effectively making the writes an expensive NOP sequence, so the question is whether or not I can make a page have different protections from different contexts so that from one context, the page is write-only, but from another context, the page is read-only.
Security is only one small consideration, but in principle, it is for the sake of ensuring consistency of implementation with design which has security as a benefit (no unwanted reading of what should only be written from one context and vice versa). If you only need to write to something (which is obvious in the case of e.g. the output of a procedure, a hardware send buffer [or software model thereof in RAM], etc.), then it is worthwhile to ensure it is only written, and if you only need to read something, then it is worthwhile to ensure it is only read.
Reading your comments, I think you are looking for a lock system where only one thread can write to or read from memory at the same time. Is that correct?
You may be looking for the cmpxchg instruction, which Windows exposes through the functions InterlockedCompareExchange, InterlockedCompareExchange64 and InterlockedCompareExchange128. These compare a 32/64/128-bit value in memory with an expected value and store a new value at that location if they are equal. You can compare it to the following C code:
if(a==b)
a = c;
The difference between this C example and the cmpxchg instruction is that cmpxchg is one single instruction, while the C example consists of multiple instructions. This means cmpxchg cannot be interrupted partway through, whereas the C example can. If the C example is interrupted after the 'if' statement and before the assignment, another thread will get CPU time and can change variable 'a'. This cannot happen with cmpxchg.
This still causes problems if the system has multiple cores. To fix this, the lock prefix is used, which synchronizes the operation across all the CPUs. The Windows API functions I mentioned above already use it, so you don't need to worry about this.
For every piece of memory you want to lock, you create an integer. You use InterlockedCompareExchange to set this variable to '1', but only if it equals '0'. If the function reports that it didn't equal '0', you wait by calling Sleep and retry until it does. Every thread needs to set this variable back to '0' when it's done using the memory.
Example:
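#include <windows.h>
#include <stdio.h>

LONG volatile lock = 0;  /* 0 = free, 1 = held */

DWORD WINAPI newThread(LPVOID param)
{
    int var = (int)(INT_PTR)param;
    /* Request the lock: atomically set it to 1, but only if it is currently 0 */
    while (InterlockedCompareExchange(&lock, 1, 0) != 0)
        Sleep(1);
    printf("Thread %lx (%d) got the lock, waiting %d ms before releasing the lock.\n", GetCurrentThreadId(), var, var * 100);
    /* Do whatever you want to do */
    Sleep(var * 100);
    printf("Lock released.\n");
    /* Unlock with an interlocked store rather than a plain assignment */
    InterlockedExchange(&lock, 0);
    return 0;
}

int main(void)
{
    for (int i = 0; i < 100; i++)
        CreateThread(NULL, 0, newThread, (LPVOID)(INT_PTR)i, 0, NULL);
    /* End only the main thread; the process lives on until the workers finish */
    ExitThread(0);
}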

Self-written Mutex for 2+ Threads

I have written the following code, and so far in all my tests it seems as if I have written a working Mutex for my 4 Threads, but I would like to get someone else's opinion on the validity of my solution.
typedef struct Mutex{
int turn;
int * waiting;
int num_processes;
} Mutex;
void enterLock(Mutex * lock, int id){
int i;
for(i = 0; i < lock->num_processes; i++){
lock->waiting[id] = 1;
if (i != id && lock->waiting[i])
i = -1;
lock->waiting[id] = 0;
}
printf("ID %d Entered\n",id);
}
void leaveLock(Mutex * lock, int id){
printf("ID %d Left\n",id);
lock->waiting[id] = 0;
}
void foo(Mutex * lock, int id){
enterLock(lock,id);
// do stuff now that i have access
leaveLock(lock,id);
}
I feel compelled to write an answer here because the question is a good one, considering that it could help others understand the general problem with mutual exclusion. In your case, you went a long way to hide this problem, but you can't avoid it. It boils down to this:
01 /* pseudo-code */
02 if (! mutex.isLocked())
03 mutex.lock();
You always have to expect a thread switch between lines 02 and 03. So there is a possible situation where two threads both find the mutex unlocked and are interrupted after that, only to resume later and each lock the mutex individually. You will have two threads entering the critical section at the same time.
What you definitely need to implement reliable mutual exclusion is therefore an atomic operation that tests a condition and at the same time sets a value, without any chance of being interrupted in between.
01 /* pseudo-code */
02 while (! test_and_lock(mutex));
As long as this test_and_lock function cannot be interrupted, your implementation is safe. Until C11, C didn't provide anything like this, so implementations of pthreads needed to use e.g. assembly or special compiler intrinsics. With C11, there is finally a "standard" way to write atomic operations like this, but I can't give an example here, because I don't have experience doing that. For general use, the pthreads library will give you what you need.
Edit: of course, this is still simplified. In a multi-processor scenario, you need to ensure that even memory accesses are mutually exclusive.
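For what it's worth, here is roughly what a C11 version of test_and_lock could look like. This is only a sketch, assuming a compiler that ships <stdatomic.h>; atomic_flag is the one type the standard guarantees to be lock-free, and the names mutex_flag, spin_lock and spin_unlock are invented for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_flag mutex_flag = ATOMIC_FLAG_INIT;  /* clear = unlocked */

/* Atomically set the flag and report whether we were the one who locked it.
   The test and the set happen as one indivisible operation. */
bool test_and_lock(atomic_flag *m)
{
    return !atomic_flag_test_and_set_explicit(m, memory_order_acquire);
}

void spin_lock(atomic_flag *m)
{
    while (!test_and_lock(m))
        ;  /* busy-wait until the holder clears the flag */
}

void spin_unlock(atomic_flag *m)
{
    atomic_flag_clear_explicit(m, memory_order_release);
}
```

The acquire/release orderings also take care of the multi-processor caveat: everything written inside the critical section is visible to the next thread that gets the lock.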
The problem I see in your code:
The idea behind a mutex is to provide mutual exclusion, meaning that when thread_a is in the critical section, thread_b must wait for thread_a (in case it also wants to enter).
This waiting should be implemented in the enterLock function. But what you have is a for loop that might end well before thread_a has left the critical section, so thread_b could also enter; hence you don't have mutual exclusion.
Way to fix it:
Take a look, for example, at Peterson's algorithm or Dekker's (more complicated). What they do is called busy waiting, which is basically a while loop that says:
while(i can't enter) { do nothing and wait...}
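A minimal sketch of Peterson's algorithm for exactly two threads (ids 0 and 1) might look like the following. It assumes C11 atomics with the default sequentially consistent ordering, which the algorithm relies on; with plain ints it would not be correct on current hardware. The Peterson struct and function names are made up here.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool wants[2];  /* wants[i]: thread i wants to enter */
    atomic_int  turn;      /* which thread is allowed to go first on a tie */
} Peterson;

void peterson_lock(Peterson *p, int id)
{
    int other = 1 - id;
    atomic_store(&p->wants[id], true);
    atomic_store(&p->turn, other);  /* politely let the other thread go first */
    /* Busy-wait while the other thread wants in and it is its turn. */
    while (atomic_load(&p->wants[other]) && atomic_load(&p->turn) == other)
        ;
}

void peterson_unlock(Peterson *p, int id)
{
    atomic_store(&p->wants[id], false);
}
```

Unlike the for loop in the question, the state set in peterson_lock stays set until peterson_unlock, so the second thread actually has something to wait on.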
You are totally ignoring the topic of memory models. Unless you are on a machine with a sequentially consistent memory model (which none of today's PC CPUs are), your code is incorrect, as a store executed by one thread is not necessarily immediately visible to other CPUs. However, your code seems to assume exactly this.
Bottom line: use the existing synchronization primitives provided by the OS or a runtime library such as the POSIX or Win32 API, and don't try to be smart and implement this yourself. Unless you have years of experience in parallel programming as well as in-depth knowledge of CPU architecture, chances are quite good that you will end up with an incorrect implementation. And debugging parallel programs can be hell...
After enterLock() returns, the state of the Mutex object is the same as before the function was called. Hence it will not prevent a second thread from entering the same Mutex object even before the first one has released it by calling leaveLock(). There is no mutual exclusion.

pointer shared between two threads without mutex [duplicate]

Is there a problem with multiple threads using the same integer memory location between pthreads in a C program without any synchronization utilities?
To simplify the issue,
Only one thread will write to the integer
Multiple threads will read the integer
This pseudo-C illustrates what I am thinking
#include <pthread.h>

void *thread_main(void *arg) {
    int *a = arg;
    // wait for something to finish
    // dereference 'a', make decision based on its value
    return NULL;
}

int main(void) {
    int value = 0;
    pthread_t threads[10];
    for (int i = 0; i < 10; i++) {
        pthread_create(&threads[i], NULL, thread_main, &value);
    }
    // do something
    value = 1;
}
I assume it is safe, since an integer occupies one processor word, and reading/writing to a word should be the most atomic of operations, right?
Your pseudo-code is NOT safe.
Although accessing a word-sized integer is indeed atomic, meaning that you'll never see an intermediate value, but either "before write" or "after write", this isn't enough for your outlined algorithm.
You are relying on the relative order of the write to a and making some other change that wakes the thread. This is not an atomic operation and is not guaranteed on modern processors.
You need some sort of memory fence to prevent write reordering. Otherwise it's not guaranteed that other threads EVER see the new value.
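To make the fence idea concrete, here is a small sketch using C11 release/acquire atomics rather than a hand-placed barrier. It assumes <stdatomic.h> is available; payload, ready, publish and try_consume are hypothetical names standing in for "the work" and the flag from the question.

```c
#include <stdatomic.h>

int payload;        /* plain data, written before the flag is set */
atomic_int ready;   /* 0 = not yet, 1 = payload is valid */

/* Writer: finish the work, then publish it. The release store keeps the
   payload write from being reordered after the flag write. */
void publish(int result)
{
    payload = result;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Reader: returns 1 and fills *out once the writer has published.
   The acquire load keeps the payload read from moving before the check. */
int try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return 0;
    *out = payload;
    return 1;
}
```

A reader that sees ready == 1 is then guaranteed to see the payload written before it, which is the ordering the original pseudo-code silently depends on.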
Unlike Java, where you explicitly start a thread, POSIX threads start executing immediately.
So there is no guarantee that the assignment of 1 to value in the main function (assuming that is what your pseudocode refers to) will be executed before or after the threads try to access it.
So while it is safe to read the integer concurrently, you need some synchronization if you write to the value that the threads are meant to use.
Otherwise there is no guarantee which value they will read (in order to act on it, as you note).
You should not make assumptions about multithreading, e.g. that there is some processing in each thread before it accesses the value.
There are no guarantees.
I wouldn't count on it. The compiler may emit code that assumes it knows what the value of 'value' is at any given time in a CPU register without re-loading it from memory.
EDIT:
Ben is correct (and I'm an idiot for saying he wasn't) that there is the possibility that the CPU will re-order the instructions and execute them down multiple pipelines at the same time. This means that value=1 could possibly get set before the pipeline performing "the work" has finished. In my defense (not a full idiot?) I have never seen this happen in real life, and we do have an extensive thread library, we do run exhaustive long-term tests, and this pattern is used throughout. I would have seen it if it were happening, but none of our tests ever crash or produce the wrong answer. But... Ben is correct, the possibility exists. It is probably happening all the time in our code, but the re-ordering is not setting flags early enough for the consumers of the data protected by the flags to use the data before it's finished. I will be changing our code to include barriers, because there is no guarantee that this will continue to work in the wild. I believe the correct solution is similar to this:
Threads that read the value:
...
if (value)
{
__sync_synchronize(); // don't pipeline any of the work until after checking value
DoSomething();
}
...
The thread that sets the value:
...
DoStuff();
__sync_synchronize(); // Don't pipeline "setting value" until after finishing stuff
value = 1; // Stuff Done
...
That being said, I found this to be a simple explanation of barriers.
COMPILER BARRIER
Memory barriers affect the CPU. Compiler barriers affect the compiler. Volatile will not keep the compiler from re-ordering code. Here for more info.
I believe you can use this code to keep gcc from rearranging the code during compile time:
#define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")
So maybe this is what should really be done?
#define GENERAL_BARRIER() do { COMPILER_BARRIER(); __sync_synchronize(); } while(0)
Threads that read the value:
...
if (value)
{
GENERAL_BARRIER(); // don't pipeline any of the work until after checking value
DoSomething();
}
...
The thread that sets the value:
...
DoStuff();
GENERAL_BARRIER(); // Don't pipeline "setting value" until after finishing stuff
value = 1; // Stuff Done
...
Using GENERAL_BARRIER() keeps gcc from re-ordering the code and also keeps the CPU from re-ordering the code. Now, I wonder whether gcc won't re-order code over its memory barrier builtin, __sync_synchronize(), anyway, which would make the use of COMPILER_BARRIER redundant.
X86
As Ben points out, different architectures have different rules regarding how they rearrange code in the execution pipelines. Intel seems to be fairly conservative. So the barriers might not be required nearly as much on Intel. Not a good reason to avoid the barriers though, since that could change.
ORIGINAL POST:
We do this all the time. It's perfectly safe (not for all situations, but for a lot of them). Our application runs on thousands of servers in a huge farm with 16 instances per server, and we don't have race conditions. You are correct to wonder why people use mutexes to protect already atomic operations. In many situations the lock is a waste of time. Reading and writing 32-bit integers is atomic on most architectures. Don't try that with 32-bit bit-fields though!
Processor write re-ordering is not going to affect one thread reading a global value set by another thread. In fact, the result using locks is the same as the result without locks. If you win the race and check the value before it's changed ... well, that's the same as winning the race to lock the value so no one else can change it while you read it. Functionally the same.
The volatile keyword tells the compiler not to store a value in a register, but to keep referring to the original memory location. This should have no effect unless you are optimizing code. We have found that the compiler is pretty smart about this and have not yet run into a situation where volatile changed anything. The compiler seems to be pretty good at coming up with candidates for register optimization. I suspect that the const keyword might encourage register optimization on a variable.
The compiler might re-order code in a function if it knows the end result will not be different. I have not seen the compiler do this with global variables, because the compiler has no idea how changing the order of accesses to a global variable will affect code outside of the immediate function.
If a function is acting up, you can control the optimization level at the function level using __attribute__.
Now, that said, if you use that flag as a gateway to allow only one thread of a group to perform some work, that won't work. Example: Thread A and Thread B could both read the flag. Thread A gets scheduled out. Thread B sets the flag to 1 and starts working. Thread A wakes up, sets the flag to 1 and starts working. Oops! To avoid locks and still do something like that, you need to look into atomic operations, specifically gcc atomic builtins like __sync_bool_compare_and_swap(&value, old, new). This allows you to set value = new if value is currently old. In the previous example, if value = 1, only one thread (A or B) could execute __sync_bool_compare_and_swap(&value, 1, 2) and change value from 1 to 2. The losing thread would fail. __sync_bool_compare_and_swap returns the success of the operation.
Deep down, there is a "lock" when you use the atomic builtins, but it is a hardware instruction and very fast when compared to using mutexes.
That said, use mutexes when you have to change a lot of values at the same time. Atomic operations (as of today) only work when all the data that has to change atomically can fit into a contiguous 8, 16, 32, 64 or 128 bits.
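As a sketch of the gateway case above, using the same gcc builtin: value starts at 1, and whichever thread swaps it to 2 owns the work. try_claim is a made-up helper name, and the 1/2 encoding is just the one from the example, not anything gcc prescribes.

```c
#include <stdbool.h>

int value = 1;  /* 1 = work available, 2 = claimed by some thread */

/* Returns true for exactly one caller; every later caller gets false,
   because the compare (value == 1) and the store (value = 2) are one
   atomic step. */
bool try_claim(void)
{
    return __sync_bool_compare_and_swap(&value, 1, 2);
}
```

Each thread would call try_claim() and only start working if it returns true, so Thread A waking up late simply loses the swap instead of duplicating the work.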
Assume the first thing you're doing in the thread function is sleeping for a second. Then value will definitely be 1 after that.
In any case, you should at least declare the shared variable volatile. However, you should in all cases prefer some other form of thread IPC or synchronisation; in this case it looks like a condition variable is what you actually need.
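A condition-variable version of "wait for something to finish, then look at the value" could be sketched like this with plain pthreads. The helper names set_value and wait_for_value are invented, and treating 0 as "not set yet" is an assumption carried over from the question's int value = 0.

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int value = 0;  /* 0 means "not set yet" */

/* Called by the single writer once its work is done. */
void set_value(int v)
{
    pthread_mutex_lock(&lock);
    value = v;
    pthread_cond_broadcast(&cond);  /* wake every waiting reader */
    pthread_mutex_unlock(&lock);
}

/* Called by readers; sleeps (no spinning) until the value has been set. */
int wait_for_value(void)
{
    pthread_mutex_lock(&lock);
    while (value == 0)              /* loop guards against spurious wakeups */
        pthread_cond_wait(&cond, &lock);
    int v = value;
    pthread_mutex_unlock(&lock);
    return v;
}
```

The mutex supplies the memory ordering as well, so no volatile or explicit barrier is needed on value.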
Hm, I guess it is safe, but why don't you just declare a function that returns the value to the other threads, as they will only read it?
Because the simple idea of passing pointers to separate threads is already a security failure, in my humble opinion. What I'm asking is: why give out a (modifiable, publicly accessible) integer address when you only need the value?

Race condition and mutex

I have 2 questions regarding threads: one is about race conditions and the other is about mutexes.
So the first question:
I've read about race conditions on the Wikipedia page:
http://en.wikipedia.org/wiki/Race_condition
And in the example of a race condition between 2 threads, this is shown:
http://i60.tinypic.com/2vrtuz4.png
Until now I believed that threads work in parallel to each other, but judging from this picture it seems that I misunderstood how the computer carries out actions.
In this picture only 1 action is done at a time, and although the threads get switched from time to time so that the other thread gets to do some actions, it is still 1 action at a time done by the computer. Is it really like this? Is there no "real" parallel computing, just 1 action at a time at a very fast rate, which gives the illusion of parallel computing?
This leads me to my second question, about mutexes.
I've read that if threads read/write the same memory we need some sort of synchronization mechanism. I've read that the normal data types won't do and we need a mutex.
Let's take for example the following code :
#include <stdio.h>
#include <stdbool.h>
#include <windows.h>
#include <process.h>
bool lock = false;
void increment(void*);
void decrement(void*);
int main()
{
int n = 5;
HANDLE hIncrement = (HANDLE)_beginthread(increment, 0, (void*)&n);
HANDLE hDecrement = (HANDLE)_beginthread(decrement, 0, (void*)&n);
WaitForSingleObject(hIncrement, 1000 * 500);
WaitForSingleObject(hDecrement, 1000 * 500);
return 0;
}
void increment(void *p)
{
int *n = p;
for(int i = 0; i < 10; i++)
{
while (lock)
{
}
lock = true;
(*n)++;
lock = false;
}
}
void decrement(void *p)
{
int *n = p;
for(int i = 0; i < 10; i++)
{
while (lock)
{
}
lock = true;
(*n)--;
lock = false;
}
}
Now in my example here, I use bool lock as my synchronization mechanism to avoid a race condition between the 2 threads over the memory pointed to by n.
Now what I did here obviously won't work, because although I avoided a race condition between the 2 threads over the memory pointed to by n, a new race condition over the bool lock variable may occur.
Let's consider the following sequence of events (A = increment thread, B = decrement thread):
A gets out of the while loop since lock is false
A gets to set lock to true
B waits in the while loop because lock is set to true
A increments the value pointed to by n
A sets lock to false
A gets to the while loop
A gets out of the while loop since lock is false
B gets out of the while loop since lock is false
A sets lock to true
B sets lock to true
and from here we get unexpected behavior from 2 unsynchronized threads, because the bool lock is not race-condition-proof.
OK, so far this is my understanding, and the solution to the problem above is that we need a mutex.
I'm fine with that: a data type that will magically be race-condition-proof.
I just don't understand how it won't happen with a mutex type whereas it will with every other type. Here lies my problem: I want to understand why a mutex works and how.
About your first question: whether there are actually several different threads running at once, or whether it is just implemented as fast switching, is a matter of your hardware. Typical PCs these days have several cores (often with more than one hardware thread each), so you have to assume that things actually DO happen at the same time.
But even if you have only a single-core system, things are not quite so easy. This is because the compiler is usually allowed to re-order instructions in order to optimize code. It can also e.g. choose to cache a variable in a CPU register instead of loading it from memory every time you access it, and it also doesn't have to write it back to memory every time you write to that variable. The compiler is allowed to do that as long as the result is the same AS IF it had run your original code in its original order - as long as nobody else is looking closely at what's actually going on, such as a different thread.
And once you actually do have different cores, consider that they all have their own CPU registers and even their own cache. Even if a thread on one core wrote to a certain variable, as long as that core doesn't write its cache back to the shared memory a different core won't see that change.
In short, you have to be very careful in making any assumptions about what happens when two threads access variables at the same time, especially in C/C++. The interactions can be so surprising that I'd say, to stay on the safe side, you should make sure that there are no race conditions in your code, e.g. by always using mutexes for accessing memory that is shared between threads.
Which is where we can neatly segue into the second question: What's so special about mutexes, and how can they work if all basic data types are not threadsafe?
The thing about mutexes is that they are implemented with a lot of knowledge about the system for which they are being used (hardware and operating system), and with either the direct help or a deep knowledge of the compiler itself.
The C language does not give you direct access to all the capabilities of your hardware and operating system, because platforms can be very different from each other. Instead, C focuses on providing a level of abstraction that allows you to compile the same code for many different platforms. The different "basic" data types are just something that the C standard came up with as a set of data types which can in some way be supported on almost any platform - but the actual hardware that your program will be compiled for is usually not limited to those types and operations.
In other words, not everything that you can do with your PC can be expressed in terms of C's ints, bytes, assignments, arithmetic operators and so on. For example, PCs often calculate with 80-bit floating point types which are usually not mapped directly to a C floating point type at all. More to the point of our topic, there are also CPU instructions that influence how multiple CPU cores will work together. Additionally, if you know the CPU, you often know a few things about the behaviour of the basic types that the C standard doesn't guarantee (for example, whether loads and stores to 32-bit integers are atomic). With that extra knowledge, it can become possible to implement mutexes for that particular platform, and it will often require code that is e.g. written directly in assembly language, because the necessary features are not available in plain C.
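Since C11, some of those CPU features are reachable from plain C through <stdatomic.h>. As a sketch (assuming a C11 compiler, and leaving out the _beginthread plumbing from the question), the increment/decrement pair needs no lock at all if n is an atomic_int, because each update becomes a single atomic read-modify-write:

```c
#include <stdatomic.h>

/* Each atomic_fetch_add/atomic_fetch_sub is one indivisible
   read-modify-write, so two threads can never lose an update. */
void increment(atomic_int *n)
{
    for (int i = 0; i < 10; i++)
        atomic_fetch_add(n, 1);
}

void decrement(atomic_int *n)
{
    for (int i = 0; i < 10; i++)
        atomic_fetch_sub(n, 1);
}
```

Running both functions concurrently on a counter that starts at 5 should always end at 5, whatever the interleaving; a mutex is still the better tool once more than one variable has to change together.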

Can an integer be shared between threads safely?

Is there a problem with multiple threads using the same integer memory location between pthreads in a C program without any synchronization utilities?
To simplify the issue,
Only one thread will write to the integer
Multiple threads will read the integer
This pseudo-C illustrates what I am thinking
void thread_main(int *a) {
//wait for something to finish
//dereference 'a', make decision based on its value
}
int value = 0;
for (int i=0; i<10; i++)
pthread_create(NULL,NULL,thread_main,&value);
}
// do something
value = 1;
I assume it is safe, since an integer occupies one processor word, and reading/writing to a word should be the most atomic of operations, right?
Your pseudo-code is NOT safe.
Although accessing a word-sized integer is indeed atomic, meaning that you'll never see an intermediate value, but either "before write" or "after write", this isn't enough for your outlined algorithm.
You are relying on the relative order of the write to a and making some other change that wakes the thread. This is not an atomic operation and is not guaranteed on modern processors.
You need some sort of memory fence to prevent write reordering. Otherwise it's not guaranteed that other threads EVER see the new value.
Unlike java where you explicitly start a thread, posix threads start executing immediatelly.
So there is no guarantee that the value you set to 1 in main function (assuming that is what you refer in your pseudocode) will be executed before or after the threads try to access it.
So while it is safe to read the integer concurrently, you need to do some synchronization if you need to write to the value in order to be used by the threads.
Otherwise there is no guarantee what is the value they will read (in order to act depending on the value as you note).
You should not be making assumptions on multithreading e.g.that there is some processing in each thread befor accessing the value etc.
There are no guarantees
I wouldn't count on it. The compiler may emit code that assumes it knows what the value of 'value' is at any given time in a CPU register without re-loading it from memory.
EDIT:
Ben is correct (and I'm an idiot for saying he wasn't) that there is the possibility that the cpu will re-order the instructions and execute them down multiple pipelines at the same time. This means that the value=1 could possibly get set before the pipeline performing "the work" finished. In my defense (not a full idiot?) I have never seen this happen in real life and we do have an extensive thread library and we do run exhaustive long term tests and this pattern is used throughout. I would have seen it if it were happening, but none of our tests ever crash or produce the wrong answer. But... Ben is correct, the possibility exists. It is probably happening all the time in our code, but the re-ordering is not setting flags early enough that the consumers of the data protected by the flags can use the data before its finished. I will be changing our code to include barriers, because there is no guarantee that this will continue to work in the wild. I believe the correct solution is similar to this:
Threads that read the value:
...
if (value)
{
__sync_synchronize(); // don't pipeline any of the work until after checking value
DoSomething();
}
...
The thread that sets the value:
...
DoStuff()
__sync_synchronize(); // Don't pipeline "setting value" until after finishing stuff
value = 1; // Stuff Done
...
That being said, I found this to be a simple explanation of barriers.
COMPILER BARRIER
Memory barriers affect the CPU. Compiler barriers affect the compiler. Volatile will not keep the compiler from re-ordering code. Here for more info.
I believe you can use this code to keep gcc from rearranging the code during compile time:
#define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")
So maybe this is what should really be done?
#define GENERAL_BARRIER() do { COMPILER_BARRIER(); __sync_synchronize(); } while(0)
Threads that read the value:
...
if (value)
{
GENERAL_BARRIER(); // don't pipeline any of the work until after checking value
DoSomething();
}
...
The thread that sets the value:
...
DoStuff()
GENERAL_BARRIER(); // Don't pipeline "setting value" until after finishing stuff
value = 1; // Stuff Done
...
Using GENERAL_BARRIER() keeps gcc from re-ordering the code and also keeps the cpu from re-ordering the code. Now, I wonder if gcc wont re-order code over its memory barrier builtin, __sync_synchronize(), which would make the use of COMPILER_BARRIER redundant.
X86
As Ben points out, different architectures have different rules regarding how they rearrange code in the execution pipelines. Intel seems to be fairly conservative. So the barriers might not be required nearly as much on Intel. Not a good reason to avoid the barriers though, since that could change.
ORIGINAL POST:
We do this all the time. It's perfectly safe (not for all situations, but a lot). Our application runs on thousands of servers in a huge farm with 16 instances per server and we don't have race conditions. You are correct to wonder why people use mutexes to protect already atomic operations. In many situations the lock is a waste of time. Reading and writing 32-bit integers is atomic on most architectures. Don't try that with 32-bit bit-fields though!
Processor write re-ordering is not going to affect one thread reading a global value set by another thread. In fact, the result with locks is the same as the result without locks. If you win the race and check the value before it's changed ... well, that's the same as winning the race to lock the value so no one else can change it while you read it. Functionally the same.
The volatile keyword tells the compiler not to keep a value in a register, but to keep referring to the original memory location. This should have no effect unless you are optimizing code. We have found that the compiler is pretty smart about this and have not yet run into a situation where volatile changed anything. The compiler seems to be pretty good at coming up with candidates for register optimization. I suspect that the const keyword might encourage register optimization on a variable.
The compiler might re-order code in a function if it knows the end result will not be different. I have not seen the compiler do this with global variables, because the compiler has no idea how changing the order of a global variable will affect code outside of the immediate function.
If a function is acting up, you can control the optimization level at the function level using __attribute__.
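As a rough illustration of per-function optimization control, GCC accepts an optimize function attribute. The sketch below assumes GCC; other compilers may ignore or reject the attribute, and read_flag is a hypothetical function invented for this example. GCC's manual describes the attribute as a debugging aid rather than something for production code.

```c
/* GCC-specific: compile only this function as if -O0 had been given,
   regardless of the optimization level used for the rest of the file.
   read_flag is a made-up example function. */
__attribute__((optimize("O0")))
int read_flag(const int *flag)
{
    return *flag;   /* plain load, compiled without optimization */
}
```

This only changes how the one function is compiled; it adds no memory barrier and no atomicity.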
Now, that said, if you use that flag as a gateway to allow only one thread of a group to perform some work, that won't work. Example: Thread A and Thread B both read the flag and see it unset. Thread A gets scheduled out. Thread B sets the flag to 1 and starts working. Thread A wakes up, sets the flag to 1, and starts working. Oops! To avoid locks and still do something like that, you need to look into atomic operations, specifically gcc atomic builtins like __sync_bool_compare_and_swap(&value, old, new). This sets value = new only if value is currently old. In the previous example, if value = 1, only one thread (A or B) could execute __sync_bool_compare_and_swap(&value, 1, 2) and change value from 1 to 2. The losing thread would fail. __sync_bool_compare_and_swap returns true if the swap happened and false otherwise.
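The gateway idea can be sketched as follows. This is an illustration only, assuming GCC's __sync builtins and POSIX threads; gate, winners, NTHREADS and count_winners are hypothetical names, and the gate goes from 0 to 1 here rather than from 1 to 2.

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4           /* arbitrary thread count for the illustration */

static int gate;             /* 0 = free, 1 = claimed */
static int winners;          /* how many threads got through the gate */

static void *worker(void *arg)
{
    (void)arg;
    /* Atomically: if gate == 0, set it to 1 and return true. */
    if (__sync_bool_compare_and_swap(&gate, 0, 1))
        __sync_fetch_and_add(&winners, 1);   /* only the winning thread gets here */
    return NULL;
}

/* Starts NTHREADS workers racing for the gate; returns how many won. */
int count_winners(void)
{
    pthread_t t[NTHREADS];
    int i, rc = 0;

    gate = 0;
    winners = 0;
    for (i = 0; i < NTHREADS; i++) {
        rc = pthread_create(&t[i], NULL, worker, NULL);
        assert(rc == 0);
    }
    (void)rc;
    for (i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return winners;
}
```

However the threads are scheduled, exactly one compare-and-swap succeeds, so count_winners() returns 1. With a plain "read the flag, then set it" sequence, more than one thread could get through.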
Deep down, there is a "lock" when you use the atomic builtins, but it is a hardware instruction and very fast when compared to using mutexes.
That said, use mutexes when you have to change a lot of values at the same time. Atomic operations (as of today) only work when all the data that has to change atomically fits into a contiguous 8, 16, 32, 64 or 128 bits.
Assume the first thing you do in the thread function is sleep for a second. Then value will definitely be 1 after that.
In any case, you should at least declare the shared variable volatile. However, you should always prefer some other form of thread IPC or synchronisation; in this case it looks like a condition variable is what you actually need.
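For reference, waiting on a condition variable could look roughly like this. It is a minimal sketch assuming POSIX threads; lock, cond, setter and wait_for_value are hypothetical names, and value is the shared flag from the discussion.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int value;            /* shared flag, protected by lock */

static void *setter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    value = 1;                    /* change the shared state under the lock */
    pthread_cond_signal(&cond);   /* wake the waiting thread, if any */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Starts the setter thread, sleeps until value is set, returns what was seen. */
int wait_for_value(void)
{
    pthread_t t;
    int seen, rc;

    value = 0;                    /* no other thread exists yet */
    rc = pthread_create(&t, NULL, setter, NULL);
    assert(rc == 0);
    (void)rc;

    pthread_mutex_lock(&lock);
    while (value == 0)                    /* loop guards against spurious wakeups */
        pthread_cond_wait(&cond, &lock);  /* releases lock while sleeping */
    seen = value;
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    return seen;
}
```

Unlike spinning on a volatile flag, the waiting thread sleeps until it is signalled, and the mutex supplies the ordering guarantees that the barriers provide in the lock-free version.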
Hm, I guess it is secure, but why don't you just declare a function that returns the value to the other threads, as they will only read it?
Because the simple idea of passing pointers to separate threads is already a security failure, in my humble opinion. What I'm asking is: why give out a (modifiable, publicly accessible) integer address when you only need the value?
