I'm studying multithreading and trying to understand the concept of semaphores and mutual exclusion. Most of the examples I find online use some sort of library (e.g. pthread) to implement the semaphore or mutex, but I'm more interested in the implementation of a simple semaphore that establishes a critical section -- no more than one thread accessing a particular region of memory.
For this task, I believe I would need a mutex (a.k.a. a binary semaphore if I understand the terminology correctly). I can see how the semaphore would prevent a race condition by "locking" the section of code to a single thread, but what prevents a race condition from occurring at the semaphore itself?
I imagine a binary semaphore to hold an int value to keep track of the lock:
Semaphore
---------
int lock = 1;

unsigned P(void)
{
    if (lock > 0) {
        lock--;
        return 0; /* success */
    }
    return 1; /* fail */
}

void V(void)
{
    lock++;
}
Suppose two threads call the P function at the same time: they both reach the if (lock > 0) check simultaneously and evaluate the condition as true. This creates a race condition where both threads are granted access to the same region of memory at the same time.
So what prevents this race condition from occurring in real world implementations of semaphores?
Locking and releasing semaphores and/or mutexes happen as atomic operations; this means the CPU cannot be taken away from the current process while they execute. This ensures that as soon as a mutex lock is started (it consists of a single CPU instruction or a few (microcode)), the process keeps the CPU until the locking/releasing is done.
There are also different ways to implement threading: either with direct support by the CPU (kernel-space) or through a library (such as pthreads) in user-space.
From OSDev.org
An atomic operation is an operation that will always be executed without any other process being able to read or change state that is read or changed during the operation. It is effectively executed as a single step, and is an important quality in a number of algorithms that deal with multiple independent processes, both in synchronization and algorithms that update shared data without requiring synchronization.
Here is a nice article on atomicity, too (although in Delphi).
The most common (although definitely not the only) way to implement most locking primitives is the compare-and-set instruction. A normal move instruction just sets the value of a memory location to whatever value you ask, while a compare-and-set instruction atomically says: "set this memory location to value X only if its current value is Y, and set a flag indicating whether the operation succeeded". The key word is "atomic": the CPU guarantees in hardware that nothing else can interfere with that operation.
Using a compare-and-swap instruction, your example P could be implemented as:
int oldlock;
retry:
    oldlock = lock;
    if (oldlock > 0) {
        /* Atomically decrement lock, but only if it still holds the
           value we just read; otherwise another thread got in first,
           so start over. (compare_and_swap here returns nonzero on
           success.) */
        if (!compare_and_swap(&lock, oldlock, oldlock - 1))
            goto retry;
        return 0; /* success */
    }
    return 1; /* fail */
Of course reality is much more complex than that, but compare-and-set is easy to understand and explain and has the nice property that it can implement (almost?) all other locking primitives.
Here's a wikipedia article.
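For a concrete illustration, the P/V pair from the question could be rebuilt on C11's <stdatomic.h> along these lines (a minimal sketch, not the only way to do it):
#include <stdatomic.h>
#include <stdbool.h>

atomic_int lock = 1;   /* 1 = free, 0 = taken (binary semaphore) */

bool P(void)
{
    int old = atomic_load(&lock);
    while (old > 0) {
        /* Succeeds only if lock still holds old; on failure, old is
           reloaded with the current value and we re-check it. */
        if (atomic_compare_exchange_weak(&lock, &old, old - 1))
            return true;   /* success */
    }
    return false;          /* fail */
}

void V(void)
{
    atomic_fetch_add(&lock, 1);
}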
The difference between a semaphore (or a mutex) and a "normal" variable isn't that big. Those libraries which offer you this functionality just make sure that the semaphore is only accessed through atomic operations. There are multiple ways to achieve that:
Special assembly instructions which guarantee atomic access, e.g.: TSL or XCHG.
Turning off interrupts before the variable is accessed and turning them back on again afterwards, so that the scheduler can't remove your process from the CPU. Be aware that this only works on single-CPU systems.
Using language-specific features like Java's synchronized keyword.
An example of how to use the TSL instruction:
enter_region: ; A "jump to" tag; function entry point.
tsl reg, flag ; Test and Set Lock; flag is the
; shared variable; it is copied
; into the register reg and flag
; then atomically set to 1.
cmp reg, #0      ; Was flag zero on entry to enter_region?
jnz enter_region ; Jump to enter_region if
; reg is non-zero; i.e.,
; flag was non-zero on entry.
ret ; Exit; i.e., flag was zero on
; entry. If we get here, tsl
; will have set it non-zero; thus,
; we have claimed the resource
; associated with flag.
leave_region:
move flag, #0 ; store 0 in flag
ret ; return to caller
By the way, as you already correctly pointed out, a mutex is just a special kind of semaphore, allowing only FALSE (represented by 0 in C) and TRUE (represented by 1 or any other value != 0) for its internal int variable, thus making it a so-called binary semaphore.
Related
I'm reading the documentation for the Win32 VirtualAlloc function, and in the protections listed, there is no PAGE_WRITEONLY option that would pair with the PAGE_READONLY option, naturally. Is there any way to obtain such support by the operating system, or will I have to implement this in software somehow, or can I use processor features that may be available for implementing such things in hardware from user code? A software implementation is undesirable for obvious performance reasons.
Now this also introduces an obvious problem: the memory cannot be read, effectively making the writes an expensive NOP sequence, so the question is whether or not I can make a page have different protections from different contexts so that from one context, the page is write-only, but from another context, the page is read-only.
Security is only one small consideration, but in principle, it is for the sake of ensuring consistency of implementation with design which has security as a benefit (no unwanted reading of what should only be written from one context and vice versa). If you only need to write to something (which is obvious in the case of e.g. the output of a procedure, a hardware send buffer [or software model thereof in RAM], etc.), then it is worthwhile to ensure it is only written, and if you only need to read something, then it is worthwhile to ensure it is only read.
Reading your comments, I think you are looking for a locking system where only one thread can write to or read from memory at the same time. Is that correct?
You may be looking for the cmpxchg instruction, which is exposed on Windows by the functions InterlockedCompareExchange, InterlockedCompareExchange64 and InterlockedCompareExchange128. Each compares a 32/64/128-bit value at a location with an expected value and copies a new value to the location if they are equal. You can compare it to the following C code:
if (a == b)
    a = c;
The difference between this C example and the cmpxchg instruction is that cmpxchg is a single instruction, while the C example consists of multiple instructions. This means cmpxchg cannot be interrupted, whereas the C example can. If the C example is interrupted after the 'if' statement and before the assignment, another thread can get CPU time and change variable 'a'. This cannot happen with cmpxchg.
This still causes problems if the system has multiple cores. To fix this, the lock prefix is used, which synchronizes the operation across all CPUs. It is also used by the Windows APIs mentioned above, so you don't have to worry about it.
For every piece of memory you want to lock, you create an integer. You use InterlockedCompareExchange to set this variable to 1, but only if it equals 0. If the return value shows it didn't equal 0, you wait by calling Sleep and retry until it succeeds. Every thread must set this variable back to 0 when it is done using the memory.
Example:
#include <stdio.h>
#include <windows.h>

LONG volatile lock;

DWORD WINAPI newThread(LPVOID param);

int main()
{
    /* init the lock */
    lock = 0;
    for (int i = 0; i < 100; i++)
        CreateThread(0, 0, newThread, (LPVOID)(INT_PTR)i, 0, 0);
    ExitThread(0); /* end main's thread but let the workers keep running */
}

DWORD WINAPI newThread(LPVOID param)
{
    int var = (int)(INT_PTR)param;
    /* Request the lock: spin until we atomically change it from 0 to 1 */
    while (InterlockedCompareExchange(&lock, 1, 0) != 0)
        Sleep(1);
    printf("Thread %lx (%d) got the lock, waiting %dms before releasing the lock.\n",
           GetCurrentThreadId(), var, var * 100);
    /* Do whatever you want to do */
    Sleep(var * 100);
    printf("Lock released.\n");
    /* unlock with an atomic store */
    InterlockedExchange(&lock, 0);
    return 0;
}
I was studying OS synchronization and I had an idea about dealing with this shared data without synchronizing, but I am not sure if it will work. Here was the code.
Now, the race condition is obviously the increment and decrement of the shared data. But what if the integer variable were atomic? I think I read something about this when I was just a beginner in CS, so the question might not be perfect. As far as I remember, it involved blocking something to prevent the increment and decrement from happening at the same time. Now, I am a bit confused about this, because if atomic variables really worked, there would not be any need to find synchronization methods for simple code like this.
Note: the code has been removed since it just changed the focus; the answer provides enough info.
As it stands, the code is indeed not safe to call concurrently, so there must be some kind of synchronization that prevents this.
Now, concerning the idea to make num_processes atomic: that could work. It wouldn't be a simple substitution though; in particular, comparing to the max and incrementing must be done atomically and not in two steps, otherwise you still have a race condition. Specifically, the following interleaving must be prevented:
Thread A checks if the limit is reached, which it isn't.
Thread B checks if the limit is reached, which it isn't.
Thread B increments the PID counter.
Thread A increments the PID counter.
Each step in and of itself is atomic, but obviously that didn't help prevent a PID overflow. Instead, the code must check that the counter is not at the limit and then increment it as one atomic operation. This is also a common task (compare-and-increment), so you should easily find existing code examples.
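For example, such a compare-and-increment with an upper bound might look like this with C11 atomics (a minimal sketch; the names are illustrative, not from the original code):
#include <stdatomic.h>
#include <stdbool.h>

/* Atomically increment *counter, but only if it is below max.
   Returns true on success, false if the limit was reached. */
static bool bounded_increment(atomic_int *counter, int max)
{
    int old = atomic_load(counter);
    while (old < max) {
        /* If *counter still equals old, set it to old + 1; otherwise
           old is refreshed with the current value and we re-check. */
        if (atomic_compare_exchange_weak(counter, &old, old + 1))
            return true;
    }
    return false;
}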
However, I'm pretty sure this isn't all the code involved, and some other code (e.g. in get_processID() or the code that releases a PID) could still require a lock around the whole thing.
For your code, synchronization is not necessary at all, because num_processes is incremented and decremented by only one process, i.e. the parent process. Also, num_processes is not a shared variable here. To create a shared variable you first have to learn about the shmget() and shmat() functions in UNIX.
A race condition arises when two or more processes want to access shared memory. An operation is atomic if it is executed entirely (i.e., with no switching in between) or not at all. For example:
Consider the increment operator on shared data. This operator is not atomic, because at the level of machine instructions the increment is performed in several steps:
1. First, load the value of the variable into a register.
2. Add one to the loaded value; the result is placed in a temporary register.
3. Store the result back into the memory location of the variable being incremented.
As you can see, this operation is done in three steps, so if there is a switch to another process before all three steps complete, it leads to undesired results. For more, you can read about race conditions at http://tutorials.jenkov.com/java-concurrency/race-conditions-and-critical-sections.html. The individual load, add and store instructions are atomic, because each is performed entirely or not at all (barring a power or system failure). But to make the whole increment operation atomic, we need synchronization, for example using semaphores or monitors; these are software synchronization techniques. I think this should make the topic clear.
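To make the contrast concrete, here is a minimal C11 sketch (with illustrative names): the plain version compiles to the three steps above and can interleave with other threads, while atomic_fetch_add performs them as one indivisible read-modify-write:
#include <stdatomic.h>

void unsafe_increment(int *c)
{
    *c = *c + 1;              /* load, add, store: another thread can run in between */
}

void safe_increment(atomic_int *c)
{
    atomic_fetch_add(c, 1);   /* one indivisible read-modify-write */
}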
I am wondering how one can avoid branching in kernels when threads have to compare and store values from local, shared or global variables. For example, the following code checks a shared variable and sets a bool flag to true accordingly:
if (shared_variable < local_value) {
    shared_bool_var = true;
}
__syncthreads();
The problem here is that all threads access the same variable and all will overwrite to true.
So I would use a threadIdx.x check to let only one thread access that variable, but this would cause branch divergence:
if (threadIdx.x == 0 && shared_variable < local_value) {
    shared_bool_var = true;
}
__syncthreads();
The question here is what I should prefer to do. In both cases it seems safe, since __syncthreads() will protect against hazards (read before write, etc.). My preference is the second solution, but usually the code is not that simple.
In the aforementioned case, is it safe to allow all threads to access 1 shared memory location or this would cause a bank conflict or serialization of memory access?
Thanks
One important thing to note: semantically and functionally speaking, the two code stanzas are not equivalent:
// set var to true if ANY thread in the block verifies the predicate
if (shared_variable < local_value) {
    shared_bool_var = true;
}

// set var to true if THE FIRST thread in the block verifies the predicate
if (threadIdx.x == 0 && shared_variable < local_value) {
    shared_bool_var = true;
}
But back to your question:
In the aforementioned case, is it safe to allow all threads to access 1 shared memory location or this would cause a bank conflict or serialization of memory access?
After verification in the CUDA programming guide, it seems there is some kind of write-collapsing mechanism that prevents serialization of write-accesses to the same address: instead, only one thread writes its value (but which thread is undefined).
CC 1.x:
If a non-atomic instruction executed by a warp writes to the same location in shared memory for more than one of the threads of the warp, only one thread per half-warp performs a write and which thread performs the final write is undefined.
CC 2.x and above:
A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank): In that case, [...] for write accesses, each address is written by only one of the threads (which thread performs the write is undefined).
Additionally:
So i would use a threadId.x check to only let one thread access that variable but this would cause branch divergence.
This isn't "more divergent" than the first code. The first stanza exhibits divergence whenever a whole warp doesn't evaluate the predicate identically. The second stanza exhibits divergence only in the first warp of every block. In both cases, none of these branches have an impact on performance: there is no else body and the if body is a single instruction.
I have 2 questions regarding threads, one about race conditions and the other about mutexes.
So, the first question:
I've read about race conditions on the Wikipedia page:
http://en.wikipedia.org/wiki/Race_condition
In its example of a race condition between 2 threads, this is shown:
http://i60.tinypic.com/2vrtuz4.png
Now, so far I believed that threads work in parallel with each other, but judging from this picture it seems I misunderstood how the computer performs actions.
From this picture only 1 action is done at a time, and although the threads get switched from time to time and the other thread gets to do some actions, this is still 1 action at a time done by the computer. Is it really like this? Is there no "real" parallel computing, just 1 action done at a time at a very fast rate, which gives the illusion of parallel computing?
This leads me to my second question about mutex.
I've read that if threads read/write the same memory, we need some sort of synchronization mechanism. I've read that normal data types won't do and that we need a mutex.
Let's take for example the following code :
#include <stdio.h>
#include <stdbool.h>
#include <windows.h>
#include <process.h>
bool lock = false;
void increment(void*);
void decrement(void*);
int main()
{
    int n = 5;
    HANDLE hIncrement = (HANDLE)_beginthread(increment, 0, (void*)&n);
    HANDLE hDecrement = (HANDLE)_beginthread(decrement, 0, (void*)&n);
    WaitForSingleObject(hIncrement, 1000 * 500);
    WaitForSingleObject(hDecrement, 1000 * 500);
    return 0;
}

void increment(void *p)
{
    int *n = p;
    for (int i = 0; i < 10; i++)
    {
        while (lock)  /* spin until the lock looks free */
        {
        }
        lock = true;  /* another thread may sneak in between the check and this store */
        (*n)++;
        lock = false;
    }
}

void decrement(void *p)
{
    int *n = p;
    for (int i = 0; i < 10; i++)
    {
        while (lock)  /* spin until the lock looks free */
        {
        }
        lock = true;  /* another thread may sneak in between the check and this store */
        (*n)--;
        lock = false;
    }
}
Now, in my example here, I use bool lock as my synchronization mechanism to avoid a race condition between the 2 threads over the memory pointed to by n.
Now, what I did here obviously won't work, because although I avoided a race condition over the memory pointed to by n, a new race condition over the bool lock variable itself may occur.
Let's consider the following sequence of events (A = increment thread, B = decrement thread) :
A gets out of the while loop since lock is false
A gets to set lock to true
B waits in the while loop because lock is set to true
A increment the value pointed by n
A sets lock to false
A gets to the while loop
A gets out of the while loop since lock is false
B gets out of the while loop since lock is false
A sets lock to true
B sets lock to true
and from here we get the unexpected behavior of 2 unsynchronized threads, because the bool lock is not race-condition-proof.
OK, so far this is my understanding, and the solution to the problem above is that we need a mutex.
I'm fine with that: a data type that will magically be race-condition-proof.
I just don't understand how, with a mutex type, this won't happen, whereas with every other type it will. Here lies my problem: I want to understand why a mutex works and how this is achieved.
About your first question: whether or not there are actually several different threads running at once, or whether it is just implemented as fast switching, is a matter of your hardware. Typical PCs these days have several cores (often with more than one hardware thread each), so you have to assume that things actually DO happen at the same time.
But even if you have only a single-core system, things are not quite so easy. This is because the compiler is usually allowed to re-order instructions in order to optimize code. It can also e.g. choose to cache a variable in a CPU register instead of loading it from memory every time you access it, and it also doesn't have to write it back to memory every time you write to that variable. The compiler is allowed to do that as long as the result is the same AS IF it had run your original code in its original order - as long as nobody else is looking closely at what's actually going on, such as a different thread.
And once you actually do have different cores, consider that they all have their own CPU registers and even their own cache. Even if a thread on one core wrote to a certain variable, as long as that core doesn't write its cache back to the shared memory a different core won't see that change.
In short, you have to be very careful in making any assumptions about what happens when two threads access variables at the same time, especially in C/C++. The interactions can be so surprising that I'd say, to stay on the safe side, you should make sure that there are no race conditions in your code, e.g. by always using mutexes for accessing memory that is shared between threads.
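To make that advice concrete, here is a minimal sketch of the increment thread from the question guarded by a pthread mutex (assuming pthreads rather than the Windows API used above; CRITICAL_SECTION would be the Windows analogue):
#include <pthread.h>

int n = 5;                                           /* shared data */
pthread_mutex_t n_lock = PTHREAD_MUTEX_INITIALIZER;  /* protects n */

void *increment(void *unused)
{
    for (int i = 0; i < 10; i++) {
        pthread_mutex_lock(&n_lock);   /* blocks until we own the lock */
        n++;                           /* critical section */
        pthread_mutex_unlock(&n_lock);
    }
    return 0;
}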
Which is where we can neatly segue into the second question: what's so special about mutexes, and how can they work if all basic data types are not thread-safe?
The thing about mutexes is that they are implemented with a lot of knowledge about the system for which they are being used (hardware and operating system), and with either the direct help or a deep knowledge of the compiler itself.
The C language does not give you direct access to all the capabilities of your hardware and operating system, because platforms can be very different from each other. Instead, C focuses on providing a level of abstraction that allows you to compile the same code for many different platforms. The different "basic" data types are just something that the C standard came up with as a set of data types which can in some way be supported on almost any platform - but the actual hardware that your program will be compiled for is usually not limited to those types and operations.
In other words, not everything that you can do with your PC can be expressed in terms of C's ints, bytes, assignments, arithmetic operators and so on. For example, PCs often calculate with 80-bit floating point types which are usually not mapped directly to a C floating point type at all. More to the point of our topic, there are also CPU instructions that influence how multiple CPU cores will work together. Additionally, if you know the CPU, you often know a few things about the behaviour of the basic types that the C standard doesn't guarantee (for example, whether loads and stores to 32-bit integers are atomic). With that extra knowledge, it becomes possible to implement mutexes for that particular platform, and it will often require code that is e.g. written directly in assembly language, because the necessary features are not available in plain C.
I have read quite a few posts that say compare-and-swap guarantees atomicity; however, I am still not able to see how it does. Here is general pseudocode for compare-and-swap:
int CAS(int *ptr, int oldvalue, int newvalue)
{
    int temp = *ptr;
    if (*ptr == oldvalue)
        *ptr = newvalue;
    return temp; /* the value *ptr held before any swap */
}
How does this guarantee atomicity? For example, if I am using this to implement a mutex,
void lock(int *mutex)
{
    /* CAS returns the old value: 0 means we swapped 0 -> 1 and now own
       the lock; nonzero means it was already locked, so keep spinning. */
    while (CAS(mutex, 0, 1) != 0);
}
how does this prevent 2 threads from acquiring the mutex at the same time? Any pointers would be really appreciated.
"general pseudo code" is not an actual code of CAS (compare and swap) implementation. Special hardware instructions are used to activate special atomic hardware in the CPU. For example, in x86 the LOCK CMPXCHG can be used (http://en.wikipedia.org/wiki/Compare-and-swap).
In gcc, for example, there is the __sync_val_compare_and_swap() builtin, which implements a hardware-specific atomic CAS. There is a description of this operation in Paul E. McKenney's wonderful book (Is Parallel Programming Hard, And, If So, What Can You Do About It?, 2014), section 4.3 "Atomic operations", pages 31-32.
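As a rough sketch of how that builtin can be used for a simple spinlock (0 = unlocked, 1 = locked; the function names are illustrative):
void lock(int *mutex)
{
    /* __sync_val_compare_and_swap returns the old value: 0 means we
       swapped 0 -> 1 and own the lock; 1 means someone else holds it. */
    while (__sync_val_compare_and_swap(mutex, 0, 1) != 0)
        ;  /* spin */
}

void unlock(int *mutex)
{
    __sync_lock_release(mutex);  /* atomically stores 0 with release semantics */
}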
If you want to know more about building higher-level synchronization on top of atomic operations, and about saving your system from spinlocks and from burning CPU cycles on active spinning, you can read about the futex mechanism in Linux. A first paper on futexes is "Futexes Are Tricky" by Ulrich Drepper (2011); another is the LWN article http://lwn.net/Articles/360699/ (and the historic one is "Fuss, Futexes and Furwocks: Fast Userland Locking in Linux", 2002).
The mutex locks described by Ulrich use atomic operations only for the "fast path" (when the mutex is not locked and our thread is the only one that wants to lock it). If the mutex was already locked, the thread goes to sleep using futex(FUTEX_WAIT, ...), after marking the mutex variable with an atomic operation to tell the unlocking thread "there is somebody sleeping, waiting on this mutex", so the unlocker knows it must wake them using futex(FUTEX_WAKE, ...).
How does it prevent two threads from acquiring the lock? Well, once any one thread succeeds, *mutex will be 1, so any other thread's CAS will fail (because it's called with expected value 0). The lock is released by storing 0 in *mutex.
Note that this is an odd use of CAS, since it essentially requires an ABA violation. Typically you'd just use a plain atomic exchange:
while (exchange(mutex, 1) == 1) { /* spin */ }
// critical section
*mutex = 0; // atomically
Or if you want to be slightly more sophisticated and store information about which thread has the lock, you can do tricks with atomic-fetch-and-add (see for example the Linux kernel spinlock code).
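In C11 atomics, that exchange-based spinlock could be written roughly as follows (a minimal sketch; a production lock would also pick the right memory orderings and spin more politely):
#include <stdatomic.h>

atomic_int mutex;   /* 0 = unlocked, 1 = locked */

void lock_mutex(void)
{
    /* Atomically swap in 1; if the previous value was already 1,
       another thread holds the lock, so keep spinning. */
    while (atomic_exchange(&mutex, 1) == 1)
        ;  /* spin */
}

void unlock_mutex(void)
{
    atomic_store(&mutex, 0);   /* atomic release of the lock */
}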
You cannot implement a real CAS in plain C like that; it is done at the hardware level, in assembly (or through compiler intrinsics such as C11's atomic_compare_exchange functions, which compile down to those instructions).