Implementing a semaphore in user-level C

An effective implementation of a semaphore requires its operations to be atomic.
I see several user-level C implementations on the internet that implement semaphores using variables such as a count, or a data structure such as a queue. But the instructions operating on those variables do not run atomically. So how can anyone implement a semaphore in user-level C?
How does the C library's semaphore.h implement semaphores?

The answer is almost certainly "it doesn't" - instead it will call into kernel services which provide the necessary atomic operations.

It's not possible in standard C before C11. What you need is, as you said, atomic operations. C11 finally specifies them; see for example stdatomic.h.
If you're on an older version of the standard, you have to either use inline assembly directly or rely on vendor-specific compiler extensions; see for example the GCC atomic builtins. Of course, processors support instructions for memory barriers, compare-and-swap operations, etc. They're just not accessible from pure C99 and earlier, because parallel execution wasn't in the scope of the standard.
After reading MartinJames' comment, I should add a clarification here: this only applies if you implement all your threading in user space. A semaphore must block the threads waiting on it, so if the threads are managed by the kernel's scheduler (as is the case with pthreads on Linux, for example), a syscall is necessary. It's outside the scope of your question, but atomic operations might still be interesting for implementing, e.g., lock-free data structures.
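As a minimal sketch of what stdatomic.h offers, assuming a C11 compiler: the names hits, count_hit and update_max below are made up for illustration and are not part of any library.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared counter; static storage starts at 0. */
static atomic_uint hits;

/* Atomically add one and return the value seen before the increment. */
unsigned count_hit(void) {
    return atomic_fetch_add(&hits, 1);
}

/* Atomically raise *max to candidate if candidate is larger.
   Returns true if this call stored a new maximum. */
bool update_max(atomic_uint *max, unsigned candidate) {
    unsigned seen = atomic_load(max);
    while (seen < candidate) {
        /* On failure, seen is refreshed with the current value. */
        if (atomic_compare_exchange_weak(max, &seen, candidate))
            return true;
    }
    return false;
}
```

Neither function blocks: a thread that loses the race in update_max simply retries with the refreshed value.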

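You could implement semaphore operations as simply as:
#include <stdatomic.h>

void sema_post(atomic_uint *value) {
    unsigned old = 0;
    /* On failure, old is refreshed with the current value and the loop retries. */
    while (!atomic_compare_exchange_weak(value, &old, old + 1));
}

void sema_wait(atomic_uint *value) {
    unsigned old = 1;
    /* Spin while the count is 0 (reloading it each time), then try to decrement. */
    while ((old == 0 && (old = atomic_load(value)) == 0) ||
           !atomic_compare_exchange_weak(value, &old, old - 1));
}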
It's OK semantically, but it does busy waiting (spinning) in sema_wait. (Note that sema_post is lock-free, although it may also spin.) Instead, it should sleep until value becomes positive. This problem cannot be solved with atomics alone, because all atomic operations are non-blocking. Here you need help from the OS kernel. So an efficient semaphore could use a similar algorithm based on atomics, but go into the kernel in two cases (see Linux futex for more details on this approach):
sema_wait: when it finds value == 0, ask to sleep
sema_post: when it has incremented value from 0 to 1, ask to wake another sleeping thread, if any
In general, to implement lock-free operations (using atomics) on a data structure, every operation must be applicable in any state. For semaphores, wait isn't applicable when the value is 0.
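The two kernel entry points listed above can be sketched with the Linux futex system call. This is a rough, Linux-only illustration, not how any particular C library does it. It assumes atomic_uint is a 32-bit word (as futex requires), the futex() wrapper and sema_trywait are helpers invented here, and for simplicity it wakes one waiter on every post instead of only on the 0 to 1 transition.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical thin wrapper: futex is reached through syscall(2). */
static long futex(atomic_uint *addr, int op, uint32_t val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Try to take one unit without blocking; 1 on success, 0 if the count is 0. */
int sema_trywait(atomic_uint *value) {
    unsigned old = atomic_load(value);
    while (old != 0) {
        /* On failure, old is refreshed and the loop re-checks it. */
        if (atomic_compare_exchange_weak(value, &old, old - 1))
            return 1;
    }
    return 0;
}

void sema_wait(atomic_uint *value) {
    while (!sema_trywait(value)) {
        /* Sleep only if the count is still 0; the kernel re-checks
           this atomically, so a concurrent post is not missed. */
        futex(value, FUTEX_WAIT, 0);
    }
}

void sema_post(atomic_uint *value) {
    atomic_fetch_add(value, 1);
    /* Simplification: always ask the kernel to wake one waiter. */
    futex(value, FUTEX_WAKE, 1);
}
```

Waking on every post costs a system call per post. A production implementation would track whether anyone is actually sleeping, so that the uncontended path stays entirely in user space.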


Is volatile necessary for the resource used in a critical section?

I am curious about whether volatile is necessary for the resources used in a critical section. Suppose I have two threads running on two CPUs and competing for a shared resource. I know I need a locking mechanism to make sure only one thread at a time operates on that shared resource. Below is the pseudocode that will be executed on those two threads.
// Read shared resource.
// Write something to shared resource.
I am wondering whether I need to make that shared resource volatile, to make sure that when a thread reads the shared resource it doesn't just get the value from a register but actually reads from the shared resource. Or should I instead use accessor functions that make each access to the shared resource volatile, with some memory barrier operations, rather than making the shared resource itself volatile?
"I am curious about whether volatile is necessary for the resources used in a critical section. Suppose I have two threads running on two CPUs and competing for a shared resource. I know I need a locking mechanism to make sure only one thread at a time operates on that shared resource."
Making sure that only one thread accesses a shared resource at a time is only part of what a locking mechanism adequate for the purpose will do. Among other things, such a mechanism will also ensure that all writes to shared objects performed by thread Ti before it releases lock L are visible to all other threads Tj after they subsequently acquire lock L. And it guarantees this in terms of the C semantics of the program, regardless of any questions of compiler optimization, register usage, CPU instruction reordering, or similar.
When such a locking mechanism is used, volatile does not provide any additional benefit for making threads' writes to shared objects be visible to each other. When such a locking mechanism is not used, volatile does not provide a complete substitute.
C's built-in (since C11) mutexes provide a suitable locking mechanism, at least when using C's built-in threads. So do pthreads mutexes, Sys V and POSIX semaphores, and various other, similar synchronization objects available in various environments, each with respect to corresponding multithreading systems. These semantics are pretty consistent across C-like multithreading implementations, extending at least as far as Java. The semantic requirements for C's built-in multithreading are described in section of the current (C17) language spec.
volatile is for indicating that an object might be accessed outside the scope of the C semantics of the program. That may happen to produce properties that interact with multithreaded execution in a way that is taken to be desirable, but that is not the purpose or intended use of volatile. If it were, or if volatile were sufficient for such purposes, then we would not also need _Atomic objects and operations.
The previous remarks focus on language-level semantics, and that is sufficient to answer the question. However, inasmuch as the question asks specifically about accessing variables' values from registers, I observe that compilers don't actually have to do anything much multithreading-specific in that area as long as acquiring and releasing locks requires calling functions.
In particular, if an execution E of function f writes to an object o that is visible to other functions or other executions of f, then the C implementation must ensure that that write is actually performed on memory before E evaluates any subsequent function call (such as is needed to release a lock). This is necessary because the value written must be visible to the execution of the called function, regardless of any other threads.
Similarly, if E uses the value of o after return from a function call (such as is needed to acquire a lock) then it must load that value from memory to ensure that it sees the effect of any write that the function may have performed.
The only thing special to multithreading in this regard is that the implementation must ensure that interprocedural analysis optimizations or similar do not subvert the needed memory reads and writes around the lock and unlock functions. In practice, this rarely requires special attention.
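To illustrate the point about locks, here is a minimal sketch using a pthreads mutex. The struct account, the acct object and deposit() are hypothetical names chosen for the example. The shared field is a plain long with no volatile qualifier.

```c
#include <pthread.h>

/* Hypothetical shared resource: a plain, non-volatile field guarded by a mutex. */
struct account {
    pthread_mutex_t lock;
    long balance;   /* only accessed while lock is held */
};

static struct account acct = { PTHREAD_MUTEX_INITIALIZER, 0 };

/* Read-modify-write inside the critical section; returns the new balance. */
long deposit(long amount) {
    pthread_mutex_lock(&acct.lock);     /* acquire: sees stores made before earlier unlocks */
    acct.balance += amount;
    long result = acct.balance;
    pthread_mutex_unlock(&acct.lock);   /* release: our store is visible to the next locker */
    return result;
}
```

Because the lock and unlock calls are opaque function calls with synchronization semantics, the compiler cannot keep balance cached in a register across them.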
The answer is no; volatile is not necessary (assuming the critical-section functions you are using were implemented correctly, and you are using them correctly, of course). Any proper critical-section API's implementation will include the memory-barriers necessary to handle flushing registers, etc, and therefore avoid the need for the volatile keyword.
volatile is normally used to inform the compiler that the data might be changed by something else (an interrupt, DMA, another CPU, ...), to prevent unexpected compiler optimizations.
So in your case you may or may not need it:
If no thread has a loop that polls a value from the shared resource while waiting for it to change, you don't really need volatile.
If you have a wait such as while (shareVal == 0) in the source code, you need to tell the compiler explicitly with the volatile qualifier.
With two CPUs, there is also a possible cache issue, where a CPU reads only the value held in its cache memory. Consider configuring the memory attributes of the shared resource properly.
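For the polling-loop case, one alternative worth considering is a C11 atomic flag, which addresses both the compiler-optimization concern and the cross-CPU visibility concern in one mechanism. The following is only a sketch. The names payload, ready, publish and wait_for_payload are invented for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;          /* plain data, written before the flag is set */
static atomic_bool ready;    /* static storage: starts out false */

/* Producer: store the data, then publish it with a release store. */
void publish(int value) {
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: spin until the flag is set. The acquire load pairs with
   the release store, so the payload written before it is visible. */
int wait_for_payload(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;  /* busy-wait; each iteration performs a fresh atomic load */
    return payload;
}
```

This still spins, so it only suits short waits. For longer waits a mutex with a condition variable, or a semaphore, avoids burning CPU time.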

Is it true that "volatile" in a userspace program tends to indicate a bug?

While googling "volatile" and its use in user space, I found emails between Theodore Tso and Linus Torvalds. According to them, the use of "volatile" in userspace is probably a bug. Check the discussion here.
They give some explanations, but I really couldn't understand them. Could anyone explain in simple language why they said so? Are we not supposed to use volatile in user space?
volatile tells the compiler that every read and write has an observable side effect; thus, the compiler can't make any assumptions about two reads or two writes in a row having the same effect.
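For instance, normally, the following code:
    int a = *x;
    int b = *x;
    if (a == b) { /* ... */ }
could be optimized so that *x is read only once and the comparison is dropped entirely, because the compiler may assume that both reads yield the same value.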
What volatile does is tell the compiler that those values might be coming from somewhere outside of the program's control, so it has to actually read those values and perform the comparison.
A lot of people have made the mistake of thinking that they could use volatile to build lock-free data structures, which would allow multiple threads to share values, and they could observe the effects of those values in other threads.
However, volatile says nothing about how different threads interact, and could be applied to values that could be cached with different values on different cores, or could be applied to values that can't be atomically written in a single operation, and so if you try to write multi-threaded or multi-core code using volatile, you can run into a lot of problems.
Instead, you need to either use locks or some other standard concurrency mechanism to communicate between threads, or use memory barriers, or use C11/C++11 atomic types and atomic operations. Locks ensure that an entire region of code has exclusive access to a variable, which can work if you have a value that is too large, too small, or not aligned to be atomically written in a single operation, while memory barriers and the atomic types and operations provide guarantees about how they work with the CPU to ensure that caches are synchronized or reads and writes happen in particular orders.
Basically, volatile winds up mostly being useful when you're interfacing with a single hardware register, which can vary outside the program's control but may not require any special atomic operations to access. It can also be used in signal handlers: because a thread can be interrupted, the handler run, and control then returned within the same thread, you need a volatile value if you want to communicate a flag to the interrupted code.
But if you're doing any kind of synchronization between threads, you should be using locks or some other concurrency primitives provided by a standard library, or really know what you're doing with regard to memory ordering and use memory barriers or atomic operations.
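The signal-handler case can be sketched as follows, assuming a POSIX system. The names got_signal, on_signal, install_handler and consume_signal are hypothetical, and SIGUSR1 is an arbitrary choice for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Flag shared between the handler and the interrupted code in the same thread. */
static volatile sig_atomic_t got_signal;

static void on_signal(int signo) {
    (void)signo;
    got_signal = 1;   /* setting the flag is all the handler does */
}

/* Install the handler for SIGUSR1; returns 0 on success, -1 on failure. */
int install_handler(void) {
    struct sigaction sa;
    sa.sa_handler = on_signal;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Returns 1 if the flag was set since the last call, and clears it. */
int consume_signal(void) {
    if (!got_signal)
        return 0;
    got_signal = 0;
    return 1;
}
```

Here volatile only stops the compiler from caching got_signal across the point where the handler may run. It is not used to communicate between threads. A signal arriving between the check and the clear in consume_signal would be coalesced with the previous one, which is acceptable for a simple flag.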

How Compare and Swap works

I have read quite a few posts saying that compare-and-swap guarantees atomicity. However, I still don't understand how it does. Here is the general pseudocode for compare-and-swap:
int CAS(int *ptr, int oldvalue, int newvalue)
{
    int temp = *ptr;
    if (*ptr == oldvalue)
        *ptr = newvalue;
    return temp;   /* the value seen before the swap */
}
How does this guarantee atomicity? For example, if I am using this to implement a mutex,
void lock(int *mutex)
{
    /* CAS returns the previous value, so 0 means we took the lock */
    while (CAS(mutex, 0, 1) != 0);
}
how does this prevent two threads from acquiring the mutex at the same time? Any pointers would be really appreciated.
The "general pseudo code" is not an actual implementation of CAS (compare and swap). Special hardware instructions are used to activate dedicated atomic hardware in the CPU. For example, on x86 the LOCK CMPXCHG instruction can be used.
In gcc, for example, there is the __sync_val_compare_and_swap() builtin, which implements a hardware-specific atomic CAS. This operation is described in Paul E. McKenney's book (Is Parallel Programming Hard, And, If So, What Can You Do About It?, 2014), section 4.3 "Atomic operations", pages 31-32.
If you want to know more about building higher-level synchronization on top of atomic operations, and about saving your system from spinlocks that burn CPU cycles on active spinning, you can read about the futex mechanism in Linux. The first paper on futexes is Futexes are tricky by Ulrich Drepper, 2011; the other is an LWN article (and the historic one is Fuss, Futexes and Furwocks: Fast Userland Locking in Linux, 2002).
The mutex locks described by Ulrich use only atomic operations on the "fast path" (when the mutex is not locked and our thread is the only one that wants to lock it). If the mutex is already locked, the thread goes to sleep using futex(FUTEX_WAIT, ...). It also marks the mutex variable with an atomic operation to tell the unlocking thread that somebody is sleeping on this mutex, so the unlocker knows that it must wake the sleepers using futex(FUTEX_WAKE, ...).
How does it prevent two threads from acquiring the lock? Well, once any one thread succeeds, *mutex will be 1, so any other thread's CAS will fail (because it's called with expected value 0). The lock is released by storing 0 in *mutex.
Note that this is an odd use of CAS, since it's essentially requiring an ABA-violation. Typically you'd just use a plain atomic exchange:
while (exchange(mutex, 1) == 1) { /* spin */ }
// critical section
*mutex = 0; // atomically
Or if you want to be slightly more sophisticated and store information about which thread has the lock, you can do tricks with atomic-fetch-and-add (see for example the Linux kernel spinlock code).
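You cannot implement CAS in plain C with ordinary loads and stores. It's done at the hardware level, and reached from C through assembly, compiler builtins, or (since C11) the stdatomic.h operations.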
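For comparison with the pseudocode in the question, here is a sketch of the same spinlock written with C11's atomic_compare_exchange_strong_explicit, which the compiler maps onto the hardware CAS instruction. The type cas_lock_t and the spin_* function names are invented for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 0 = unlocked, 1 = locked. */
typedef atomic_int cas_lock_t;

/* One CAS attempt: succeeds only if the lock was 0, atomically setting it to 1. */
bool spin_trylock(cas_lock_t *lock) {
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        lock, &expected, 1, memory_order_acquire, memory_order_relaxed);
}

void spin_lock(cas_lock_t *lock) {
    while (!spin_trylock(lock))
        ;  /* spin until the CAS succeeds */
}

void spin_unlock(cas_lock_t *lock) {
    /* Release store: writes made in the critical section become visible
       to whichever thread acquires the lock next. */
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

The comparison and the store happen as one indivisible step, so if two threads attempt the CAS at the same moment, exactly one of them sees 0 and wins.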

Linux Shared Memory Synchronization

I have implemented two applications that share data using the POSIX shared memory API (i.e. shm_open). One process updates data stored in the shared memory segment and another process reads it. I want to synchronize access to the shared memory region using some sort of mutex or semaphore. What is the most efficient way to do this? Some mechanisms I am considering are
A POSIX mutex stored in the shared memory segment (Setting the PTHREAD_PROCESS_SHARED attribute would be required)
Creating a System V semaphore using semget
Rather than a System V semaphore, I would go with a POSIX named semaphore using sem_open(), etc.
Might as well make this an answer.
You can use sem_init with pshared true to create a POSIX semaphore in your shared memory space. I have used this successfully in the past.
As for whether this is faster or slower than a shared mutex and condition variable, only profiling can tell you. On Linux I suspect they are all pretty similar since they rely on the "futex" machinery.
If efficiency is important, I would go with process-shared mutexes and condition variables.
AFAIR, each operation with a semaphore requires a syscall, so uncontended mutex should be faster than the semaphore [ab]used in mutex-like manner.
First, really benchmark to know whether performance is important. The cost of these things is often overestimated. So unless you find that the cost of accessing the control structure is of the same order of magnitude as the writes, just take whatever construct is semantically best for your use case. That would usually be the case if some 100 bytes are written per access to the control structure.
Otherwise, if the control structure is the bottleneck, you should perhaps avoid using one. C11 has the new concept of _Atomic types and operations, which can be used in cases where there are races in access to data. C11 is not yet widely implemented, but probably all modern compilers have extensions that already implement these features.
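As a sketch of the process-shared mutex option mentioned above: the struct layout and the shared_region_* function names are hypothetical. The code assumes the struct has been placed in memory that both processes have mapped (for example the segment obtained via shm_open and mmap), and that exactly one process runs the initialization before the other touches the lock.

```c
#include <pthread.h>

/* Hypothetical layout of the shared segment: the lock lives next to the data. */
struct shared_region {
    pthread_mutex_t lock;
    int value;
};

/* Initialize the mutex inside already-mapped shared memory.
   Returns 0 on success or a pthreads error number. */
int shared_region_init(struct shared_region *r) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    /* Without this attribute the mutex is only valid within one process. */
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_mutex_init(&r->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    if (rc == 0)
        r->value = 0;
    return rc;
}

/* Writer side: update the data under the lock. */
void shared_region_write(struct shared_region *r, int v) {
    pthread_mutex_lock(&r->lock);
    r->value = v;
    pthread_mutex_unlock(&r->lock);
}

/* Reader side: read the data under the same lock. */
int shared_region_read(struct shared_region *r) {
    pthread_mutex_lock(&r->lock);
    int v = r->value;
    pthread_mutex_unlock(&r->lock);
    return v;
}
```

If a process can die while holding the lock, a robust mutex (pthread_mutexattr_setrobust) may be worth looking into. That is left out here to keep the sketch short.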

Threading Implementation

I wanted to know how to implement my own threading library.
What I have is a CPU (PowerPC architecture) and the C Standard Library.
Is there an open source light-weight implementation I can look at?
At its very simplest a thread will need:
Some memory for stack space
Somewhere to store its context (ie. register contents, program counter, stack pointer, etc.)
On top of that you will need to implement a simple "kernel" that will be responsible for the thread switching. If you're trying to implement pre-emptive threading, you'll also need a periodic source of interrupts, e.g. a timer. In that case you can execute your thread-switching code in the timer interrupt.
Take a look at the setjmp()/longjmp() routines, and the corresponding jmp_buf structure. This will give you easy access to the stack pointer so that you can assign your own stack space, and will give you a simple way of capturing all of the register contents to provide your thread's context.
Typically the longjmp() function is a wrapper for a return from interrupt instruction, which fits very nicely with having thread scheduling functionality in the timer interrupt. You will need to check the implementation of longjmp() and jmp_buf for your platform though.
Try looking for thread implementations on smaller microprocessors, which typically don't have OS's. eg. Atmel AVR, or Microchip PIC.
For example : discussion on AVRFreaks
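The two ingredients listed above (a private stack and a saved context) can be demonstrated on a hosted system with the POSIX ucontext functions. Note that this sketch swaps in getcontext/makecontext/swapcontext in place of setjmp()/longjmp(), because assigning a stack through jmp_buf depends on platform-specific layout. It is cooperative only (no timer interrupt), and every name in it is made up for the example.

```c
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, worker_ctx;   /* saved register contexts */
static char worker_stack[STACK_SIZE];     /* the worker's own stack space */
static int order;                         /* records execution order as digits */

static void record(int step) {
    order = order * 10 + step;
}

/* Cooperative yield: save the worker's context and resume the main one. */
static void yield_to_main(void) {
    swapcontext(&worker_ctx, &main_ctx);
}

static void worker(void) {
    record(1);
    yield_to_main();
    record(3);
    /* Returning from here resumes uc_link, i.e. main_ctx. */
}

/* Runs the worker in two slices interleaved with the caller.
   Returns the recorded order of steps as a decimal number. */
int run_demo(void) {
    order = 0;
    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof worker_stack;
    worker_ctx.uc_link = &main_ctx;
    makecontext(&worker_ctx, worker, 0);

    swapcontext(&main_ctx, &worker_ctx);   /* first slice: runs until the yield */
    record(2);
    swapcontext(&main_ctx, &worker_ctx);   /* second slice: runs to completion */
    record(4);
    return order;
}
```

A pre-emptive version would perform the same context swap from a timer interrupt or signal handler instead of from an explicit yield call.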
For a decent thread library you need:
atomic operations to avoid races (to implement e.g a mutex)
some OS support to do the scheduling and to avoid busy waiting
some OS support to implement context switching
All three are outside the scope of what C99 offers you. Atomic operations were introduced in C11; so far, C11 implementations don't seem to be ready, so these are usually implemented in assembler. For the latter two, you'd have to rely on your OS.
Maybe you could look at C++ which has threading support. I'd start by picking some of their most useful primitives (for example futures), see how they work, and do a simple implementation.
