Practical Delimited Continuations in C / x64 ASM

I've looked at a paper called A Primer on Scheduling Fork-Join Parallelism with Work Stealing. I want to implement continuation stealing, where the rest of the code after calling spawn is eligible to be stolen. Here's the code from the paper.
1 e();
2 spawn f();
3 g();
4 sync;
5 h();
An important design choice is which branch to offer to thief threads.
Using Figure 1, the choices are:
Child Stealing:
f() is made available to thief threads.
The thread that executed e() executes g().
Continuation Stealing:
Also called “parent stealing”.
The thread that executed e() executes f().
The continuation (which will next call g()) becomes available to thief threads.
I hear that saving a continuation requires saving both sets of registers (volatile/non-volatile/FPU). In the fiber implementation I did, I ended up implementing child stealing. I read about the (theoretical) negatives of child stealing (unbounded number of runnable tasks, see the paper for more info), so I want to use continuations instead.
I'm thinking of two functions, shift and reset, where reset delimits the current continuation, and shift reifies the current continuation. Is what I'm asking even plausible in a C environment?
EDIT: I'm thinking of making reset save return address / NV GPRs for the current function call (= line 3), and making shift transfer control to the next continuation after returning a value to the caller of reset.

I've implemented work stealing for a HLL called PARLANSE rather than C on an x86. PARLANSE is used daily to build production symbolic parallel programs at the million line scale.
In general, you have to preserve the registers of both the continuation and the "child".
Consider that your compiler may see a computation in f() and see the same computation in g(), and might lift that computation to the point just before the spawn, and place that computation result in a register that both f() and g() use as an implied parameter.
Yes, this assumes a sophisticated compiler, but if you are using a stupid compiler that doesn't optimize, why are you trying to go parallel for speed?
In specific, however, your compiler could arrange for the registers to be empty before the call to spawn if it understood what spawn means. Then neither the continuation nor the child has to preserve registers. (The PARLANSE compiler in fact does this).
So how much has to be saved depends on how much your compiler is willing to help, and that depends on whether it knows what spawn really does.
Your local friendly C compiler likely doesn't know about your implementation of spawn. So either you do something to force a register flush (don't ask me, it's your compiler) or you put up with the fact that you personally don't know what's in the registers, and your implementation preserves them all to be safe.
If the amount of work spawned is significant, arguably it wouldn't matter if you saved all the registers. However, the x86 (and other modern architectures) seems to have an enormous amount of state, mostly in the vector registers, that might be in use; last time I looked it was well in excess of 500 bytes, roughly 100 writes to memory to save these, and IMHO that's an excessive price. If you don't believe these registers are going to be passed from the parent thread to the spawned thread, then you can work on enforcing spawn with no registers.
If your spawn routine wakes up using a standard continuation mechanism you have invented, then you have to worry about whether your continuations pass large register state or not, also. Same problem, same solutions as for spawn; the compiler has to help or you personally have to intervene.
You'll find this a lot of fun.
[If you want to make it really interesting, try timeslicing the threads in case they go into deep computation without an occasional yield, causing thread starvation. Now you surely have to save the entire state. I managed to get PARLANSE to realize spawning with no registers saved, yet have the time slicing save/restore full register state, by saving full state on a time slice, and continuing at a special place that refilled all the registers before it passed control to the time-sliced PC location.]

Related

how to jump out of and resume at arbitrary locations in c-code without refactoring

BACKGROUND
I'm integrating MicroPython into my custom cooperative multitasking OS (no, my company won't change to preemptive).
MicroPython uses garbage collection, and this takes much more time than my allotted time slice even when there's nothing to collect, i.e. I called it twice in a row, timed it, and it still takes A LOT of time.
OBVIOUS SOLUTION
Yes I could refactor micropython source but then whenever there's a change . . .
IDEAL SOLUTION
The ideal solution would involve calling some function void pause(&func_in_call_stack) that would jump out, leaving the stack intact, all the way to the function that is at the top of the call stack, say main. And resume would . . . resume.
QUESTION
Is it possible, using C and assembly, to implement pause?
UPDATE
As I wrote this, I realize that the C-based exception handling code nlr_push()/nlr_pop() already does most of what I need.
Your question is about implementing context switching. As we've covered fairly exhaustively in comments, support for context switching is among the key characteristics of any multitasking system, and of a multitasking OS in particular. Inasmuch as you posit no OS support for context switching, you are talking about implementing multitasking for a single-tasking OS.
That you describe the OS as providing some kind of task queue ("to relinquish control, a thread must simply exit its run loop") does not change this, though to some extent we could consider it a question of semantics. I imagine that a typical task for such a system would operate by creating and executing a series of microtasks (the work of the "run loop"), providing a shared, mutable memory context to each. Such a run loop could safely exit and later be reentered, to resume generating microtasks from where it left off.
Dividing tasks into microtasks at boundaries defined by affirmative application action (i.e. your pause()) would depend on capabilities beyond those provided by ISO C. Very likely, however, it could be done with the help of some assembly, plus some kind of framework support. You need at least these things:
A mechanism for recording a task's current execution context -- stack, register contents, and maybe other details. This is inherently system-specific.
A task-associated place to store recorded execution context. There are various ways in which such a thing could be established. Promising alternatives include (i) provided by the OS; (ii) provided by some kind of userland multi-tasking system running on top of the OS; (iii) built into the task by the compiler.
A mechanism for restoring recorded execution context -- this, too, will be system-specific.
If the OS does not provide such features, then you could consider the (now removed) POSIX context system as a model interface for recording and restoring execution context. (See makecontext(), swapcontext(), getcontext(), and setcontext().) You would need to implement those yourself, however, and you might want to wrap them to present a simpler interface to applications. Details will be highly dependent on hardware and underlying OS.
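As a rough illustration of that interface, here is a minimal sketch of a pause/resume pair built on the ucontext calls named above, assuming a platform that still ships <ucontext.h> (glibc on Linux does). The names pause_task, deep_work, task_entry and run_task_in_slices are hypothetical, and a system without these calls would need its own assembly equivalent.

```c
#include <ucontext.h>

static ucontext_t scheduler_ctx;      /* where pause_task() returns to */
static ucontext_t task_ctx;           /* the paused task's saved state */
static char task_stack[64 * 1024];    /* private stack for the task */
static int steps_done;

/* Save the task's context and jump back to the scheduler. */
static void pause_task(void) {
    swapcontext(&task_ctx, &scheduler_ctx);
}

/* Stand-in for long-running work buried a few calls deep. */
static void deep_work(void) {
    for (int i = 0; i < 3; i++) {
        steps_done++;
        pause_task();   /* the task's stack stays intact while it is paused */
    }
}

static void task_entry(void) {
    deep_work();
}

/* Resumes the task until it has done three steps.
 * Returns how many times the task was resumed. */
int run_task_in_slices(void) {
    int resumes = 0;
    steps_done = 0;

    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &scheduler_ctx;   /* continue here if task_entry returns */
    makecontext(&task_ctx, task_entry, 0);

    while (steps_done < 3) {
        swapcontext(&scheduler_ctx, &task_ctx);  /* start or resume the task */
        resumes++;
    }
    return resumes;
}
```

Each call to swapcontext in the loop plays the role of "resume", and the cooperative OS is free to run other work between iterations. The key design point is that the task runs on its own stack, so nothing has to be unwound when it pauses.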
As an alternative, you might implement transparent multitasking support for such a system by providing compilers that emit specially instrumented code (i.e. even more specially instrumented than you otherwise need). For example, consider compilers that emit bytecode for a VM of your own design. The VMs in which the resulting programs run would naturally track the state of the program running within, and could yield after each sequence of a certain number of opcodes.
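The VM idea can be sketched in a few lines. This is a toy, assuming an invented three-opcode instruction set (OP_INC, OP_LOOP_LT, OP_HALT) and a hypothetical vm_run_slice function; it only shows that an interpreter whose whole state lives in a struct can stop after a fixed opcode budget and pick up again later.

```c
enum { OP_INC, OP_LOOP_LT, OP_HALT };

typedef struct {
    int pc;      /* index of the next opcode */
    int acc;     /* single accumulator register */
    int halted;  /* set once OP_HALT has executed */
} vm_state;

/* Counts acc up to 5, then halts. OP_LOOP_LT takes (limit, target). */
static const int program[] = { OP_INC, OP_LOOP_LT, 5, 0, OP_HALT };

/* Execute at most `budget` opcodes, then return so the caller can yield.
 * Returns 1 once the program has halted, 0 if it was merely paused. */
int vm_run_slice(vm_state *vm, const int *code, int budget) {
    while (!vm->halted && budget-- > 0) {
        switch (code[vm->pc]) {
        case OP_INC:
            vm->acc++;
            vm->pc += 1;
            break;
        case OP_LOOP_LT:
            if (vm->acc < code[vm->pc + 1])
                vm->pc = code[vm->pc + 2];   /* jump back to target */
            else
                vm->pc += 3;                 /* fall through past operands */
            break;
        case OP_HALT:
            vm->halted = 1;
            break;
        }
    }
    return vm->halted;
}

/* Drive the program to completion, counting how many slices it took. */
int run_in_slices(vm_state *vm, int budget) {
    int slices = 0;
    do {
        slices++;   /* a cooperative scheduler would run other tasks here */
    } while (!vm_run_slice(vm, program, budget));
    return slices;
}
```

Because pc and acc are ordinary struct fields, pausing costs nothing beyond returning from the function, which is what makes this approach attractive when the native stack cannot be captured.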

pointer shared between two threads without mutex [duplicate]

Is there a problem with multiple threads using the same integer memory location between pthreads in a C program without any synchronization utilities?
To simplify the issue,
Only one thread will write to the integer
Multiple threads will read the integer
This pseudo-C illustrates what I am thinking
void *thread_main(void *arg) {
    int *a = arg;
    // wait for something to finish
    // dereference 'a', make decision based on its value
    return NULL;
}

int value = 0;
pthread_t threads[10];
for (int i = 0; i < 10; i++) {
    pthread_create(&threads[i], NULL, thread_main, &value);
}
// do something
value = 1;
I assume it is safe, since an integer occupies one processor word, and reading/writing to a word should be the most atomic of operations, right?
Your pseudo-code is NOT safe.
Although accessing a word-sized integer is indeed atomic, meaning that you'll never see an intermediate value, but either "before write" or "after write", this isn't enough for your outlined algorithm.
You are relying on the relative order of the write to a and making some other change that wakes the thread. This is not an atomic operation and is not guaranteed on modern processors.
You need some sort of memory fence to prevent write reordering. Otherwise it's not guaranteed that other threads EVER see the new value.
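One portable way to get the required ordering is C11 <stdatomic.h>, which postdates parts of this discussion. The sketch below is only an illustration under that assumption: payload, ready, reader and publish_and_read are hypothetical names, and the release store paired with the acquire load stands in for the "memory fence" mentioned above.

```c
#include <pthread.h>
#include <stdatomic.h>

static int payload;        /* ordinary data published by the writer */
static atomic_int ready;   /* flag: nonzero once payload is valid */

static void *reader(void *out) {
    /* The acquire load pairs with the release store below: once the flag
     * reads nonzero, the earlier write to payload is visible too. */
    while (!atomic_load_explicit(&ready, memory_order_acquire)) {
        /* spin until the writer publishes */
    }
    *(int *)out = payload;
    return NULL;
}

int publish_and_read(void) {
    pthread_t thread;
    int seen = 0;

    payload = 0;
    atomic_store_explicit(&ready, 0, memory_order_relaxed);
    pthread_create(&thread, NULL, reader, &seen);

    payload = 42;                                            /* do the work */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* then publish */

    pthread_join(thread, NULL);
    return seen;
}
```

Making the flag atomic also stops the compiler from caching it in a register, which covers the compiler side of the problem as well as the CPU side.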
Unlike Java, where you explicitly start a thread, POSIX threads start executing immediately.
So there is no guarantee that the assignment of 1 in the main function (assuming that is what your pseudocode refers to) will happen before or after the threads try to access the value.
So while it is safe to read the integer concurrently, you need some synchronization if you write to the value that the threads are going to use.
Otherwise there is no guarantee which value they will read (in order to act depending on the value, as you note).
You should not make assumptions about multithreading, e.g. that there is some processing in each thread before it accesses the value.
There are no guarantees.
I wouldn't count on it. The compiler may emit code that assumes it knows what the value of 'value' is at any given time in a CPU register without re-loading it from memory.
EDIT:
Ben is correct (and I'm an idiot for saying he wasn't) that there is the possibility that the CPU will re-order the instructions and execute them down multiple pipelines at the same time. This means that the value=1 could possibly get set before the pipeline performing "the work" finished. In my defense (not a full idiot?) I have never seen this happen in real life, and we do have an extensive thread library and we do run exhaustive long-term tests, and this pattern is used throughout. I would have seen it if it were happening, but none of our tests ever crash or produce the wrong answer. But... Ben is correct, the possibility exists. It is probably happening all the time in our code, but the re-ordering is not setting flags early enough that the consumers of the data protected by the flags can use the data before it's finished. I will be changing our code to include barriers, because there is no guarantee that this will continue to work in the wild. I believe the correct solution is similar to this:
Threads that read the value:
...
if (value)
{
__sync_synchronize(); // don't pipeline any of the work until after checking value
DoSomething();
}
...
The thread that sets the value:
...
DoStuff();
__sync_synchronize(); // Don't pipeline "setting value" until after finishing stuff
value = 1; // Stuff Done
...
That being said, I found this to be a simple explanation of barriers.
COMPILER BARRIER
Memory barriers affect the CPU. Compiler barriers affect the compiler. Volatile will not keep the compiler from re-ordering code. Here for more info.
I believe you can use this code to keep gcc from rearranging the code during compile time:
#define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")
So maybe this is what should really be done?
#define GENERAL_BARRIER() do { COMPILER_BARRIER(); __sync_synchronize(); } while(0)
Threads that read the value:
...
if (value)
{
GENERAL_BARRIER(); // don't pipeline any of the work until after checking value
DoSomething();
}
...
The thread that sets the value:
...
DoStuff();
GENERAL_BARRIER(); // Don't pipeline "setting value" until after finishing stuff
value = 1; // Stuff Done
...
Using GENERAL_BARRIER() keeps gcc from re-ordering the code and also keeps the CPU from re-ordering the code. Now, I wonder if gcc won't re-order code over its memory barrier builtin, __sync_synchronize(), which would make the use of COMPILER_BARRIER redundant.
X86
As Ben points out, different architectures have different rules regarding how they rearrange code in the execution pipelines. Intel seems to be fairly conservative. So the barriers might not be required nearly as much on Intel. Not a good reason to avoid the barriers though, since that could change.
ORIGINAL POST:
We do this all the time. It's perfectly safe (not for all situations, but a lot). Our application runs on 1000's of servers in a huge farm with 16 instances per server and we don't have race conditions. You are correct to wonder why people use mutexes to protect already atomic operations. In many situations the lock is a waste of time. Reading and writing to 32 bit integers on most architectures is atomic. Don't try that with 32 bit bit-fields though!
Processor write re-ordering is not going to affect one thread reading a global value set by another thread. In fact, the result using locks is the same as the result without locks. If you win the race and check the value before it's changed ... well that's the same as winning the race to lock the value so no-one else can change it while you read it. Functionally the same.
The volatile keyword tells the compiler not to store a value in a register, but to keep referring to the original memory location. This should have no effect unless you are optimizing code. We have found that the compiler is pretty smart about this and have not run into a situation yet where volatile changed anything. The compiler seems to be pretty good at coming up with candidates for register optimization. I suspect that the const keyword might encourage register optimization on a variable.
The compiler might re-order code in a function if it knows the end result will not be different. I have not seen the compiler do this with global variables, because the compiler has no idea how changing the order of a global variable will affect code outside of the immediate function.
If a function is acting up, you can control the optimization level at the function level using __attribute__.
Now, that said, if you use that flag as a gateway to allow only one thread of a group to perform some work, that won't work. Example: Thread A and Thread B both could read the flag. Thread A gets scheduled out. Thread B sets the flag to 1 and starts working. Thread A wakes up and sets the flag to 1 and starts working. Oops! To avoid locks and still do something like that you need to look into atomic operations, specifically gcc atomic builtins like __sync_bool_compare_and_swap(&value, old, new). This allows you to set value = new if value is currently old. In the previous example, if value = 1, only one thread (A or B) could execute __sync_bool_compare_and_swap(&value, 1, 2) and change value from 1 to 2. The losing thread would fail. __sync_bool_compare_and_swap returns the success of the operation.
Deep down, there is a "lock" when you use the atomic builtins, but it is a hardware instruction and very fast when compared to using mutexes.
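The gateway case can be sketched with the same gcc builtin. This is a small illustration rather than production code; gate, winners, try_claim and count_winners are hypothetical names, and it assumes GCC or Clang, which both provide the __sync builtins.

```c
#include <pthread.h>

static int gate = 1;     /* 1 = work still unclaimed, 2 = claimed */
static int winners;      /* how many threads got through the gate */

static void *try_claim(void *arg) {
    (void)arg;
    /* Atomically: if gate == 1, set it to 2 and report success. */
    if (__sync_bool_compare_and_swap(&gate, 1, 2)) {
        __sync_fetch_and_add(&winners, 1);   /* only the winner gets here */
    }
    return NULL;
}

int count_winners(void) {
    pthread_t threads[8];
    gate = 1;
    winners = 0;
    for (int i = 0; i < 8; i++)
        pthread_create(&threads[i], NULL, try_claim, NULL);
    for (int i = 0; i < 8; i++)
        pthread_join(threads[i], NULL);
    return winners;
}
```

However the eight threads are scheduled, exactly one compare-and-swap succeeds, so count_winners() returns 1.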
That said, use mutexes when you have to change a lot of values at the same time. Atomic operations (as of today) only work when all the data that has to change atomically can fit into a contiguous 8, 16, 32, 64 or 128 bits.
Assume the first thing you're doing in the thread function is sleeping for a second. So value after that will definitely be 1.
In any case you should at least declare the shared variable volatile. However, you should in all cases prefer some other form of thread IPC or synchronisation; in this case it looks like a condition variable is what you actually need.
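For the condition-variable route, a minimal sketch with pthreads might look like the following. The names waiter and signal_value are hypothetical; the pattern itself (a mutex-protected predicate checked in a loop around pthread_cond_wait) is the standard one.

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t changed = PTHREAD_COND_INITIALIZER;
static int value;   /* protected by lock */

static void *waiter(void *out) {
    pthread_mutex_lock(&lock);
    /* Loop guards against spurious wakeups and against waking too early. */
    while (value == 0)
        pthread_cond_wait(&changed, &lock);
    *(int *)out = value;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int signal_value(void) {
    pthread_t thread;
    int seen = 0;

    pthread_mutex_lock(&lock);
    value = 0;
    pthread_mutex_unlock(&lock);

    pthread_create(&thread, NULL, waiter, &seen);

    /* ... do something ... */
    pthread_mutex_lock(&lock);
    value = 1;
    pthread_cond_broadcast(&changed);   /* wake every waiting reader */
    pthread_mutex_unlock(&lock);

    pthread_join(thread, NULL);
    return seen;
}
```

The readers sleep instead of polling, and the mutex supplies the memory ordering, so no volatile or explicit barrier is needed.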
Hm, I guess it is secure, but why don't you just declare a function that returns the value to the other threads, as they will only read it?
Because the simple idea of passing pointers to separate threads is already a security fail, in my humble opinion. What I'm telling you is: why give out a (modifiable, publicly accessible) integer address when you only need the value?

Calling convention which only allows one instance of a function at a time

Say I have multiple threads and all threads call the same function at approximately the same time.
Is there a calling convention which would only allow one instance of the function at any time? What I mean is that the function called by the second thread would only start after the function called by the first thread had returned.
Or are these calling conventions compiler specific? I don't have a whole lot of experience using them.
(Skip to the bottom if you don't care about the threading mumbo-jumbo)
As mentioned before, this is not a "calling convention" but a general problem of computing: concurrency. And the particular case where two or more threads can enter a shared zone at a time, and have a different outcome, is called a race condition (and also extends to/from electronics, and other areas).
The hard thing about threading is that computing is such a deterministic affair, but when threading gets involved, it adds a degree of uncertainty, which varies per platform/OS.
A one-thread affair would guarantee that it can do all tasks in the same order, always, but when you have multiple threads, and the order depends on how fast they can complete a task while other applications also want to use the CPU, then the underlying hardware affects the results.
There's not much of a "sure fire way to do threading"; rather, there are techniques, tools and libraries to deal with individual cases.
Locking in
The most well known technique is using semaphores (or locks), and the most well known semaphore is the mutex one, which only allows one thread at a time to access a shared space, by having a sort of "flag" that is raised once a thread has entered.
if (locked == NO)
{
locked = YES;
// Do ya' thing
locked = NO;
}
The code above, although it looks like it could work, does not guard against cases where both threads pass the if () and then set the variable (which threads can easily do). So there's hardware support for this kind of operation, which guarantees that only one thread can execute it: the testAndSet operation, which checks and then, if available, sets the variable. (Here's the x86 instruction from the instruction set)
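In portable C, test-and-set is exposed as atomic_flag in C11's <stdatomic.h>. The following is a minimal spinlock sketch built on it, assuming a C11 compiler and pthreads; spin_lock, spin_unlock, worker and run_spinlock_demo are hypothetical names, and the sched_yield() call is just a courtesy to other threads while waiting.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static atomic_flag locked = ATOMIC_FLAG_INIT;
static long counter;   /* protected by the spinlock */

static void spin_lock(void) {
    /* test_and_set returns the previous value: true means someone else
     * already holds the lock, so keep trying. */
    while (atomic_flag_test_and_set_explicit(&locked, memory_order_acquire))
        sched_yield();
}

static void spin_unlock(void) {
    atomic_flag_clear_explicit(&locked, memory_order_release);
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        spin_lock();
        counter++;          /* do ya' thing */
        spin_unlock();
    }
    return NULL;
}

long run_spinlock_demo(void) {
    pthread_t threads[4];
    counter = 0;
    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    return counter;
}
```

Unlike the if/assign version, the check and the set happen as one indivisible step, so two threads can never both see the lock as free.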
In the same vein as locks and semaphores, there's also the read-write lock, which allows multiple readers and one writer, especially useful for things with low volatility. And there are many other variations, some that limit access to X threads and whatnot.
But overall, locks are lame, since they are basically forcing serialisation of multi-threading, where threads actually need to get stuck trying to get a lock (or just testing it and leaving). Kinda defeats the purpose of having multiple threads, doesn't it?
The best solution in terms of threading is to minimise the amount of shared space that threads need to use, possibly eliminating it completely. Maybe use rwlocks when volatility is low, try to have "try and leave" kind of threads that check whether the lock is available and go away if it isn't, etc.
As my OS teacher once said (in Zen-like fashion): "The best kind of locking is the one you can avoid".
Thread Pools
Now, threading is hard, no way around it, that's why there are patterns to deal with such kind of problems, and the Thread Pool Pattern is a popular one, at least in iOS since the introduction of Grand Central Dispatch (GCD).
Instead of having a bunch of threads running amok and getting enqueued all over the place, let's have a set of threads, waiting for tasks in a "pool", and having queues of things to do, ideally, tasks that shouldn't overlap each other.
Now, the thread pattern doesn't solve the problems discussed before, but it changes the paradigm to make it easier to deal with, mentally. Instead of having to think about "threads that need to execute such and such", you just switch the focus to "tasks that need to be executed" and the matter of which thread is doing it, becomes irrelevant.
Again, pools won't solve all your problems, but it will make them easier to understand. And easier to understand may lead to better solutions.
All the theoretical things mentioned above are already implemented at the POSIX level (semaphore.h, pthread.h, etc.; pthreads has a very nice set of r/w locking functions). Try reading about them.
(Edit: I thought this thread was about Obj-C, not plain C, edited out all the Foundation and GCD stuff)
Calling convention defines how stack & registers are used to implement function calls. Because each thread has its own stack & registers, synchronising threads and calling convention are separate things.
To prevent multiple threads from executing the same code at the same time, you need a mutex. In your example of a function, you'd typically put the mutex lock and unlock inside the function's code, around the statements you don't want your threads to be executing at the same time.
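A minimal sketch of that arrangement with a pthread mutex follows. The function name one_at_a_time and the bookkeeping counters are hypothetical; the counters exist only to make the "one instance at a time" property observable.

```c
#include <pthread.h>

static pthread_mutex_t fn_lock = PTHREAD_MUTEX_INITIALIZER;
static int inside;        /* threads currently inside the protected region */
static int max_inside;    /* highest value `inside` ever reached */
static int total_calls;

void one_at_a_time(void) {
    pthread_mutex_lock(&fn_lock);     /* a second caller blocks here */
    inside++;
    if (inside > max_inside)
        max_inside = inside;
    total_calls++;                    /* the statements being protected */
    inside--;
    pthread_mutex_unlock(&fn_lock);   /* lets the next caller in */
}

static void *caller(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++)
        one_at_a_time();
    return NULL;
}

int run_callers(void) {
    pthread_t threads[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, caller, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    return total_calls;
}
```

After run_callers() returns, max_inside is 1: no two threads were ever inside the locked region together, which is the behaviour the question asks for.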
In general terms: Plain code, including function calls, does not know about threads, the operating system does. By using a mutex you tap into the system that manages the running of threads. More details are just a Google search away.
Note that C11, the new C standard revision, does include multi-threading support. But this does not change the general concept; it simply means that you can use C library functions instead of operating system specific ones.

Making process survive failure in its thread

I'm writing an app that has many independent threads. Since I'm doing quite low-level, dangerous stuff there, threads may fail (SIGSEGV, SIGBUS, SIGFPE), but they should not kill the whole process. Is there a proper way to do it?
Currently I intercept the aforementioned signals and call pthread_exit(NULL) in their signal handler. It seems to work, but since pthread_exit is not an async-signal-safe function I'm a bit concerned about this solution.
I know that splitting this app into multiple processes would solve the problem, but in this case it's not a feasible option.
EDIT: I'm aware of all the Bad Things™ that can happen (I'm experienced in low-level system and kernel programming) due to ignoring SIGSEGV/SIGBUS/SIGFPE, so please try to answer my particular question instead of giving me lessons about reliability.
The PROPER way to do this is to let the whole process die, and start another one. You don't explain WHY this isn't appropriate, but in essence, that's the only way that is completely safe against various nasty corner cases (which may or may not apply in your situation).
I'm not aware of any method that is 100% safe that doesn't involve letting the whole process die. (Note also that sometimes just the act of continuing from these sorts of errors is "undefined behaviour"; it doesn't mean that you are definitely going to fall over, just that it MAY be a problem).
It's of course possible that someone knows of some clever trick that works, but I'm pretty certain that the only 100% guaranteed method is to kill the entire process.
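For completeness, one trick that is sometimes used in place of calling pthread_exit from the handler is to jump back into the thread's own code with sigsetjmp/siglongjmp and let the thread return normally. The sketch below is not from the answers above and does not make the situation safe: locks held or heap state damaged at the time of the fault stay that way. The names on_fault, risky_worker and run_guarded_thread are hypothetical, raise(SIGSEGV) stands in for the dangerous work, and reading a thread-local variable inside a signal handler is an assumption that holds on common Linux toolchains rather than something the standards promise.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

/* Per-thread recovery point, so each thread unwinds only itself. */
static _Thread_local sigjmp_buf recover_point;
static _Thread_local volatile sig_atomic_t armed;

static void on_fault(int sig) {
    if (armed)
        siglongjmp(recover_point, sig);   /* back into risky_worker */
    signal(sig, SIG_DFL);                 /* not guarded: die as usual */
    raise(sig);
}

static void *risky_worker(void *out) {
    int *result = out;
    armed = 1;
    int sig = sigsetjmp(recover_point, 1);   /* 1 = also save the signal mask */
    if (sig == 0) {
        raise(SIGSEGV);     /* stand-in for the low-level, dangerous work */
        *result = 0;        /* reached only if nothing faulted */
    } else {
        *result = sig;      /* the thread survived the fault */
    }
    armed = 0;
    return NULL;            /* normal thread exit, no pthread_exit in a handler */
}

int run_guarded_thread(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    pthread_t thread;
    int result = -1;
    pthread_create(&thread, NULL, risky_worker, &result);
    pthread_join(thread, NULL);
    return result;
}
```

The process keeps running after the worker faults, and run_guarded_thread() reports which signal was caught. Whether the remaining state can be trusted afterwards is exactly the concern raised above.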
Low-latency code design involves a careful "be aware of the system you run on" type of coding and deployment. That means, for example, that standard IPC mechanisms (say, using SysV msgsnd/msgrcv to pass messages between processes, or pthread_cond_wait/pthread_cond_signal on the PThreads side) as well as ordinary locking primitives (adaptive mutexes) are to be considered rather slow ... because they involve something that takes thousands of CPU cycles ... namely, context switches.
Instead, use "hot-hot" handoff mechanisms such as the disruptor pattern - both producers as well as consumers spin in tight loops permanently polling a single or at worst a small number of atomically-updated memory locations that say where the next item-to-be-processed is found and/or to mark a processed item complete. Bind all producers / consumers to separate CPU cores so that they will never context switch.
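The core of such a handoff can be sketched as a single-producer, single-consumer ring where each side spins on the other's atomically updated position. This is a simplified illustration and not the full disruptor pattern; it assumes C11 atomics and pthreads, the names ring, write_pos, read_pos and run_ring_demo are hypothetical, and CPU pinning is left out.

```c
#include <pthread.h>
#include <stdatomic.h>

#define RING_SIZE 256u     /* power of two, so unsigned wraparound is harmless */
#define ITEM_COUNT 1000

static int ring[RING_SIZE];
static atomic_uint write_pos;   /* only the producer stores to this */
static atomic_uint read_pos;    /* only the consumer stores to this */

static void *producer(void *arg) {
    (void)arg;
    for (int item = 1; item <= ITEM_COUNT; item++) {
        unsigned w = atomic_load_explicit(&write_pos, memory_order_relaxed);
        /* Spin while the ring is full. */
        while (w - atomic_load_explicit(&read_pos, memory_order_acquire) == RING_SIZE) { }
        ring[w % RING_SIZE] = item;
        /* Release: the slot contents become visible before the new position. */
        atomic_store_explicit(&write_pos, w + 1, memory_order_release);
    }
    return NULL;
}

static void *consumer(void *out) {
    long sum = 0;
    for (int n = 0; n < ITEM_COUNT; n++) {
        unsigned r = atomic_load_explicit(&read_pos, memory_order_relaxed);
        /* Spin while the ring is empty. */
        while (atomic_load_explicit(&write_pos, memory_order_acquire) == r) { }
        sum += ring[r % RING_SIZE];
        /* Mark the slot as processed so the producer may reuse it. */
        atomic_store_explicit(&read_pos, r + 1, memory_order_release);
    }
    *(long *)out = sum;
    return NULL;
}

long run_ring_demo(void) {
    pthread_t prod, cons;
    long sum = 0;
    atomic_store(&write_pos, 0);
    atomic_store(&read_pos, 0);
    pthread_create(&cons, NULL, consumer, &sum);
    pthread_create(&prod, NULL, producer, NULL);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return sum;
}
```

Neither side makes a system call or takes a lock on the hot path; the only shared state is the two position counters, which is what keeps the caches warm when each thread owns a core.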
In this type of usecase, whether you use separate threads (and get the memory sharing implicitly by virtue of all threads sharing the same address space) or separate processes (and get the memory sharing explicitly by using shared memory for the data-to-be-processed as well as the queue mgmt "metadata") makes very little difference because TLBs and data caches are "always hot" (you never context switch).
If your "processors" are unstable and/or have no guaranteed completion time, you need to add a "reaper" mechanism anyway to deal with failed / timed out messages, but such garbage collection mechanisms necessarily introduce jitter (latency spikes). That's because you need a system call to determine whether a specific thread or process has exited, and system call latency is a few microseconds even in the best case.
From my point of view, you're trying to mix oil and water here; you're required to use library code not specifically written for use in low-latency deployments / library code not under your control, combined with the requirement to do message dispatch with nanosec latencies. There is no way to make e.g. pthread_cond_signal() give you nsec latency because it must do a system call to wake the target up, and that takes longer.
If your "handler code" relies on the "rich" environment, and a huge amount of "state" is shared between these and the main program ... it sounds a bit like saying "I need to make a steam-driven airplane break the sound barrier"...
