What is PTHREAD_MUTEX_ADAPTIVE_NP

Where can I find documentation for "adaptive" pthread mutexes? The symbol PTHREAD_MUTEX_ADAPTIVE_NP is defined on my system, but the only documentation I can find online says nothing about what an adaptive mutex is, or when it's appropriate to use.
So... what is it, and when should I use it?
For reference, my version of libc is:
GNU C Library (Ubuntu EGLIBC 2.15-0ubuntu10.5) stable release version 2.15, by Roland McGrath et al.
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.6.3.
Compiled on a Linux 3.2.50 system on 2013-09-30.
Available extensions:
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
Native POSIX Threads Library by Ulrich Drepper et al
BIND-8.2.3-T5B
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.debian.org/Bugs/>.
and "uname -a" gives
Linux desktop 3.2.0-55-generic #85-Ubuntu SMP Wed Oct 2 12:29:27 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

PTHREAD_MUTEX_ADAPTIVE_NP is something that I invented while working in the role of a glibc contributor on making LinuxThreads more reliable and perform better. LinuxThreads was the predecessor to glibc's NPTL library, originally developed as a stand-alone library by Xavier Leroy, who is also well-known as one of the creators of OCaml.
The adaptive mutex survived into NPTL in essentially unmodified form: the code is nearly identical, including the magic constants for the estimator smoothing and the maximum spin relative to the estimator.
Under SMP, when you go to acquire a mutex and see that it is locked, it can be sub-optimal to simply give up and call into the kernel to block. If the owner of the lock only holds the lock for a few instructions, it is cheaper to just wait for the execution of those instructions, and then acquire the lock with an atomic operation, instead of spending hundreds of extra cycles by making a system call.
The kernel developers know this very well, which is one reason why we have spinlocks in the Linux kernel for fast critical sections. (Among the other reasons is, of course, that code which cannot sleep, because it is in an interrupt context, can acquire spinlocks.)
The question is, how long should you wait? If you spin forever until the lock is acquired, that can be sub-optimal. User space programs are not well-written like kernel code (cough). They could have long critical sections. They also cannot disable pre-emption; sometimes critical sections blow up due to a context switch. (POSIX threads now provide real time tools to deal with this: you can put threads into a real-time priority and FIFO scheduling and such, plus configure processor affinity.)
I think we experimented with fixed iteration counts, but then I had this idea: why should we guess, when we can measure? Why don't we implement a smoothed estimator of the lock duration, similar to what we do for the TCP retransmission time-out (RTO) estimator? Each time we spin on a lock, we should measure how many spins it actually took to acquire it. Moreover, we should not spin forever: we should perhaps spin only at most twice the current estimator value. When we take a measurement, we can smooth it exponentially, in just a few instructions: take a fraction of the previous value, and of the new value, and add them together, which is the same as adding a fraction of their difference back to the estimator: say, estimator += (new_val - estimator)/8 for a blend of 7/8 of the old value and 1/8 of the new one.
You can think of this as a watchdog. Suppose that the estimator tells you that the lock, on average, takes 80 spins to acquire. You can be quite confident, then, that if you have executed 160 spins, then something is wrong: the owner of the lock is executing some exceptionally long case, or maybe has hit a page fault or was otherwise preempted. At this point the waiting thread cuts its losses and calls into the kernel to block.
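To make the estimator arithmetic concrete, here is a minimal sketch in C. It is not the glibc code: the struct, the function names and the floor of 10 spins are all assumptions made up for illustration. Only the 1/8 smoothing step and the "at most twice the estimate" cap come from the description above.

```c
#include <assert.h>

/* Hypothetical names throughout; an illustration, not glibc's implementation. */
#define MIN_SPIN_LIMIT 10   /* assumed floor so a fresh estimator still spins a bit */

struct spin_estimator {
    int estimate;   /* smoothed number of spins it took to acquire the lock */
};

/* Watchdog threshold: spin at most twice the current estimate, then block. */
static int spin_limit(const struct spin_estimator *e)
{
    int limit = 2 * e->estimate;
    return limit < MIN_SPIN_LIMIT ? MIN_SPIN_LIMIT : limit;
}

/* Blend a new measurement into the estimate: 7/8 old value, 1/8 new value. */
static void spin_update(struct spin_estimator *e, int measured_spins)
{
    assert(measured_spins >= 0);
    e->estimate += (measured_spins - e->estimate) / 8;
}
```

With an estimate of 80, spin_limit() gives 160, matching the example above. A measurement of 160 spins then only moves the estimate to 90, so a single anomalous wait does not blow up the threshold.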
Without measurement, you cannot do this accurately: there is no "one size fits all" value. Say, a fixed limit of 200 spins would be sub-optimal in a program whose critical sections are so short that a lock can almost always be fetched after waiting only 10 spins. The mutex locking function would burn through 200 iterations every time there is an anomalous wait time, instead of nicely giving up at, say, 20 and saving cycles.
This adaptive approach is specialized, in the sense that it will not work for all locks in all programs, so it is packaged as a special mutex type. For instance, it will not work very well for programs that lock mutexes for long periods: periods so long that more CPU time is wasted spinning on the large estimator values than would have been by going into the kernel. The approach is also not suitable for uniprocessors: all threads besides the one which is trying to get the lock are suspended in the kernel. The approach is also not suitable in situations in which fairness is important: it is an opportunistic lock. No matter how many other threads have been waiting, for no matter how long, or what their priority is, a new thread can come along and snatch the lock.
If you have very well-behaved code with short critical sections that are highly contended, and you're looking for better performance on SMP, the adaptive mutex may be worth a try.
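If you want to try it, the mutex type is selected through a mutex attribute like any other type. A minimal sketch, assuming glibc (the _NP constant is a non-portable extension, so this will not build against every libc); adaptive_mutex_init is a hypothetical helper name, not a library function:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* assumption: building against glibc */
#endif
#include <assert.h>
#include <pthread.h>

/* Hypothetical helper: initialise *m as an adaptive mutex.
 * Returns 0 on success, otherwise the error number from pthreads. */
static int adaptive_mutex_init(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc;

    assert(m != NULL);
    rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr);   /* the attribute is no longer needed */
    return rc;
}
```

After that the mutex is used with the ordinary pthread_mutex_lock() and pthread_mutex_unlock() calls; only the behaviour under contention differs.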

The symbol is mentioned there:
http://elias.rhi.hi.is/libc/Mutexes.html
"LinuxThreads supports only one mutex attribute: the mutex type, which is either PTHREAD_MUTEX_ADAPTIVE_NP for "fast" mutexes, PTHREAD_MUTEX_RECURSIVE_NP for "recursive" mutexes, PTHREAD_MUTEX_TIMED_NP for "timed" mutexes, or PTHREAD_MUTEX_ERRORCHECK_NP for "error checking" mutexes. As the NP suffix indicates, this is a non-portable extension to the POSIX standard and should not be employed in portable programs.
The mutex type determines what happens if a thread attempts to lock a mutex it already owns with pthread_mutex_lock. If the mutex is of the "fast" type, pthread_mutex_lock simply suspends the calling thread forever. If the mutex is of the "error checking" type, pthread_mutex_lock returns immediately with the error code EDEADLK. If the mutex is of the "recursive" type, the call to pthread_mutex_lock returns immediately with a success return code. The number of times the thread owning the mutex has locked it is recorded in the mutex. The owning thread must call pthread_mutex_unlock the same number of times before the mutex returns to the unlocked state.
The default mutex type is "timed", that is, PTHREAD_MUTEX_TIMED_NP."
EDIT: updated with info found by jthill (thanks!)
A little more info on the mutex flags and the PTHREAD_MUTEX_ADAPTIVE_NP can be found here:
"The PTHRED_MUTEX_ADAPTIVE_NP is a new mutex that is intended for high
throughput at the sacrifice of fairness and even CPU cycles. This
mutex does not transfer ownership to a waiting thread, but rather
allows for competition. Also, over an SMP kernel, the lock operation
uses spinning to retry the lock to avoid the cost of immediate
descheduling."
Which basically suggests the following: where high throughput is desirable, such a mutex can be used, but by its very nature it requires extra consideration in the thread logic. You will have to design an algorithm that can exploit these properties to achieve high throughput: something that load balances itself from within (as opposed to "from the kernel") and where order of execution is unimportant.
There was a very good book on Linux/Unix multithreaded programming whose name escapes me. If I find it I'll update.

Here you go. As I read it, it's a brutally simple mutex that doesn't care about anything except making the no-contention case run fast.

Related

Best way to synchronise threads and measure performance at sub-microsecond frequency

I'm working on a standard x86 six core SMP machine, 3.6GHz clock speed, plain C code.
I have a threaded producer/consumer scheme in which my "producer" thread is reading from file at roughly 1,000,000 lines/second, and handing the data it reads off to either two or four "consumer" threads which do a bit of work on it and then stick it into a database. While they are consuming it is busy reading the next line.
So both producer and consumers have to have some means of synchronisation which works at sub-microsecond frequency, for which I use a "busy spin wait" loop, because all the normal synchronisation mechanisms I can find are just too slow. In pseudo code terms:
Producer thread
While(something in file)
{
read a line
populate 1/2 of data double buffer
wait for consumers to idle
set some key data
set memory fence
swap buffers
}
And the consumer threads likewise
while(not told to die)
{
wait for key data change event
consume data
}
At both sides the "wait" loop is coded:
while(waiting)
{
_mm_pause(); /* Intel say this is a good hint to processor that this is a spin wait */
if(#iterations > 1000) yield_thread(); /* Sleep(0) on Windows, pthread_yield() on Linux */
}
This all works, and I get some quite nice speed-ups compared to the equivalent serial code, but my profiler (Intel's VTune Amplifier) shows that I am spending a horrendous amount of time in my busy wait loops, and the ratio of "spin" to "useful work done" is depressingly high. Given the way the profiler concentrates its feedback on the busiest sections this also means that the lines of code doing useful work tend not to be reported, since (relatively speaking) their %age of total cpu is down at the noise level ... or at least that is what the profiler is saying. They must be doing something otherwise I wouldn't see any speed up!
I can and do time things, but it is hard to distinguish between delays imposed by disk latency in the producer thread, and delays spent while the threads synchronise.
So is there a better way to measure what is actually going on? By which I mean just how much time are these threads really spending waiting for one another? Measuring time accurately is really hard at sub-microsecond resolution, the profiler doesn't seem to give me much help, and I am struggling to optimise the scheme.
Or maybe my spin wait scheme is rubbish, but I can't seem to find a better solution for sub-microsecond synchronisation.
Any hints would be really welcome :-)

Even better than fast locks is not locking at all. Try switching to a lock-free queue. Producers and consumers wouldn't need to wait at all.
Lock-free data structures are process, thread and interrupt safe (i.e. the same data structure instance can be safely used concurrently and simultaneously across cores, processes, threads and both inside and outside of interrupt handlers), never sleep (and so are safe for kernel use when sleeping is not permitted), operate without context switches, cannot fail (no need to handle error cases, as there are none), perform and scale literally orders of magnitude better than locking data structures, and liblfds itself (as of release 7.0.0) is implemented such that it performs no allocations (and so works with NUMA, stack, heap and shared memory) and compiles not just on a freestanding C89 implementation, but on a bare C89 implementation.

Thank you to all who commented above, the suggestion of making the quantum of work bigger was the key. I have now implemented a queue (1000 entry long rotating buffer) for my consumer threads, so the producer only has to wait if that queue is full, rather than waiting for its half of the double buffer in my previous scheme. So its synchronisation time is now sub-millisecond instead of sub-microsecond - well that's a surmise, but it's definitely 1000x longer than before!
If the producer hits "queue full" I can now yield its thread immediately, instead of spin waiting, safe in the knowledge that any time slice it loses will be used gainfully by the consumer threads. This does indeed show up as a small amount of sleep/spin time in the profiler. The consumer threads benefit too since they have a more even workload.
Net outcome is a 10% reduction in the overall time to read a file, and given that only part of the file is able to be processed in a threaded manner that suggests that the threaded part of the process is around 15% or more faster.
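For anyone wanting to try the same thing, here is a rough sketch of a rotating buffer along those lines, written with C11 atomics. It assumes exactly one producer and one consumer per queue (so with several consumers you would give each its own queue), and all names are invented for the example; it is not the code I actually run.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 1000                  /* usable entries */
#define QUEUE_SLOTS    (QUEUE_CAPACITY + 1)  /* one slot stays empty to tell full from empty */

struct spsc_queue {
    void *slots[QUEUE_SLOTS];
    atomic_size_t head;   /* next slot to read; written only by the consumer */
    atomic_size_t tail;   /* next slot to write; written only by the producer */
};

static void queue_init(struct spsc_queue *q)
{
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side. Returns false if the queue is full, so the caller can yield. */
static bool queue_push(struct spsc_queue *q, void *item)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t next = (tail + 1) % QUEUE_SLOTS;

    if (next == atomic_load_explicit(&q->head, memory_order_acquire))
        return false;                        /* full */
    q->slots[tail] = item;
    /* release: the slot contents become visible before the new tail does */
    atomic_store_explicit(&q->tail, next, memory_order_release);
    return true;
}

/* Consumer side. Returns false if the queue is empty. */
static bool queue_pop(struct spsc_queue *q, void **item)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);

    assert(item != NULL);
    if (head == atomic_load_explicit(&q->tail, memory_order_acquire))
        return false;                        /* empty */
    *item = q->slots[head];
    /* release: we are done reading the slot before the producer may reuse it */
    atomic_store_explicit(&q->head, (head + 1) % QUEUE_SLOTS, memory_order_release);
    return true;
}
```

The producer calls queue_push() and yields only when it returns false; the consumer loops on queue_pop(). Neither side ever waits for the other while there is room or data in the buffer.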

Modern System Architecture?

What could happen if we used Peterson's solution to the critical section problem on a modern computer? It is my understanding that systems with multiple CPUs can run into difficulty because of the ordering of memory reads and writes with respect to other reads and writes in memory, but is this the problem with most modern systems? Are there any advantages to using semaphores VS mutex locks?

Hey interesting question! So basically in order to understand what you're asking you have to ensure that you know what it is you're asking. The critical section is just the part of a program that should not be concurrently executed by any more than one of that program's processes or threads at a time. Multiple concurrent accesses are not allowed, so all that means is that only one process is interacting with the system at a time. Typically this "critical section" accesses a resource like a data structure, or network connection.
Mutual Exclusion or mutex just describes the requirement that only one concurrent process is in the critical section at a time, so concurrent access to shared data must ensure this "mutual exclusion".
So this introduces the problem! How do we assure that processes run completely independently of other processes, in other words, how do we ensure "atomic access" to the various critical sections by the threads?
There are a few solutions to the "critical-section problem" but the one you mention is Peterson's solution so we will discuss that.
Peterson's algorithm is designed for mutual exclusion and allows two tasks to share a single-use resource. They use shared memory for communicating.
In the algorithm, two tasks compete for the critical section; you'll have to look into mutual exclusion, bounded waiting and other properties a bit more for a full understanding, but the gist of it is that in Peterson's method, a process waits at most one turn to gain entrance to the critical section: if it gives priority to the other task or process, then that process will run to completion, thereby allowing the first process to enter the critical section.
That is the original solution proposed.
However, this has no guarantee of working on today's multiprocessor architectures, and it only works for two concurrent tasks. Modern CPUs (and compilers) may reorder memory reads and writes, so operations that appear sequential in the source can take effect in a different order, which breaks the algorithm unless memory barriers are added. I suggest you also take a look at locks. Hope that helps :)
Can anyone else think of anything to add that I might have missed?

It is my understanding that systems with multiple CPUs can run into difficulty because of the ordering of memory reads and writes with respect to other reads and writes in memory, but is this the problem with most modern systems?
No. Any modern systems with "less strict" memory ordering will have ways to make the memory ordering more strict where it matters (e.g. fences).
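As an illustration of "making the ordering more strict", here is a sketch of Peterson's algorithm for two threads using C11 atomics. The plain atomic_store()/atomic_load() calls default to sequentially consistent ordering, which is what the algorithm needs; with ordinary non-atomic variables the compiler or CPU would be free to reorder the accesses. The struct and function names are made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names; a two-thread sketch only (self must be 0 or 1). */
struct peterson {
    atomic_bool interested[2];   /* interested[i]: thread i wants to enter */
    atomic_int  turn;            /* which thread must wait if both want in */
};

static void peterson_init(struct peterson *p)
{
    atomic_init(&p->interested[0], false);
    atomic_init(&p->interested[1], false);
    atomic_init(&p->turn, 0);
}

static void peterson_lock(struct peterson *p, int self)
{
    int other = 1 - self;

    assert(self == 0 || self == 1);
    atomic_store(&p->interested[self], true);   /* seq_cst by default */
    atomic_store(&p->turn, other);              /* give the other thread priority */
    while (atomic_load(&p->interested[other]) &&
           atomic_load(&p->turn) == other)
        ;                                       /* busy-wait */
}

static void peterson_unlock(struct peterson *p, int self)
{
    atomic_store(&p->interested[self], false);
}
```

This is still a pure spin lock limited to two threads, so the other objections to using it on a real system apply unchanged.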
Are there any advantages to using semaphores VS mutex locks?
Mutexes are typically simpler and faster (in the same way that a boolean is simpler than a counter); but ignoring overhead a mutex is equivalent to a semaphore with "resource count = 1".
What could happen if we used Peterson's solution to the critical section problem on a modern computer?
The big problem here is that most modern operating systems support some kind of multi-tasking (e.g. multiple processes, where each process can have multiple threads), there's usually 100 other processes (just for the OS alone), and modern hardware has power management (where you try to avoid power consumption by putting CPUs to sleep when they can't do useful work). This means that (unbounded) spinning/busy waiting is a horrible idea (e.g. you can have N CPUs being wasted spinning/trying to acquire a lock while the task that currently holds the lock isn't running on any CPU because the scheduler decided that 1234 other tasks should get 10 ms of CPU time each).
Instead; to avoid (excessive) spinning you want to ask the scheduler to block your task until/unless the lock actually can be acquired; and (especially for heavily contended locks) you probably want "fairness" (to avoid the risk of timing problems that lead to some tasks being repeatedly lucky while other tasks starve and make no progress).
This ends up being "no spinning", or "brief spinning" (to avoid scheduler overhead in cases where the task holding the lock actually can/does release it quickly); followed by the task being put on a FIFO queue and the scheduler giving the CPU to a different task or putting the CPU to sleep; where if the lock is released the scheduler wakes up the first task on the FIFO queue. Of course it's never that simple (e.g. for performance you want to do as much as you can in user-space; and you need special care and cooperation between user-space and kernel to avoid race conditions - the lock being released before a task is put on the wait queue).
Fortunately modern systems also provide simpler ways to implement locks (e.g. "atomic compare and swap"), so there's no need to resort to Peterson's algorithm (even if it's just for insertion/removal of tasks from the real lock's FIFO queue).

Calling convention which only allows one instance of a function at a time

Say I have multiple threads and all threads call the same function at approximately the same time.
Is there a calling convention which would only allow one instance of the function at any time? What I mean is that the function called by the second thread would only start after the function called by the first thread had returned.
Or are these calling conventions compiler specific? I don't have a whole lot of experience using them.

(Skip to the bottom if you don't care about the threading mumbo-jumbo)
As mentioned before, this is not a "calling convention" but a general problem of computing: concurrency. And the particular case where two or more threads can enter a shared zone at a time, and have a different outcome, is called a race condition (and also extends to/from electronics, and other areas).
The hard thing about threading is that computing is such a deterministic affair, but when threading gets involved, it adds a degree of uncertainty, which varies per platform/OS.
A single-threaded program is guaranteed to do all its tasks in the same order, always. But when you have multiple threads, the order depends on how fast each can complete its task, on other applications wanting to use the CPU, and on the underlying hardware, all of which affect the results.
There's not much of a "sure fire way to do threading", as there's techniques, tools and libraries to deal with individual cases.
Locking in
The most well known technique is using semaphores (or locks), and the most well known semaphore is the mutex one, which only allows one thread at a time to access a shared space, by having a sort of "flag" that is raised once a thread has entered.
if (locked == NO)
{
locked = YES;
// Do ya' thing
locked = NO;
}
The code above, although it looks like it could work, it would not guarantee against cases where both threads pass the if () and then set the variable (which threads can easily do). So there's hardware support for this kind of operation, that guarantees that only one thread can execute it: The testAndSet operation, that checks and then, if available, sets the variable. (Here's the x86 instruction from the instruction set)
In the same vein as locks and semaphores, there's also the read-write lock, which allows either multiple readers or a single writer, and is especially useful for data with low volatility. And there are many other variations, some of which limit access to X threads and whatnot.
But overall, locks are lame, since they are basically forcing serialisation of multi-threading, where threads actually need to get stuck trying to get a lock (or just testing it and leaving). Kinda defeats the purpose of having multiple threads, doesn't it?
The best solution in terms of threading is to minimise the amount of shared space that threads need to use, possibly eliminating it completely. Maybe use rwlocks when volatility is low, or try to have "try and leave" kinds of threads that check whether the lock is available and go away if it isn't, etc.
As my OS teacher once said (in Zen-like fashion): "The best kind of locking is the one you can avoid".
Thread Pools
Now, threading is hard, no way around it, that's why there are patterns to deal with such kind of problems, and the Thread Pool Pattern is a popular one, at least in iOS since the introduction of Grand Central Dispatch (GCD).
Instead of having a bunch of threads running amok and getting enqueued all over the place, let's have a set of threads, waiting for tasks in a "pool", and having queues of things to do, ideally, tasks that shouldn't overlap each other.
Now, the thread pattern doesn't solve the problems discussed before, but it changes the paradigm to make it easier to deal with, mentally. Instead of having to think about "threads that need to execute such and such", you just switch the focus to "tasks that need to be executed" and the matter of which thread is doing it, becomes irrelevant.
Again, pools won't solve all your problems, but it will make them easier to understand. And easier to understand may lead to better solutions.
All the theoretical things mentioned above are already implemented at the POSIX level (semaphore.h, pthread.h, etc.; pthreads has a very nice set of r/w locking functions), so try reading about them.
(Edit: I thought this thread was about Obj-C, not plain C, edited out all the Foundation and GCD stuff)

Calling convention defines how stack & registers are used to implement function calls. Because each thread has its own stack & registers, synchronising threads and calling convention are separate things.
To prevent multiple threads from executing the same code at the same time, you need a mutex. In your example of a function, you'd typically put the mutex lock and unlock inside the function's code, around the statements you don't want your threads to be executing at the same time.
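A minimal sketch of that pattern with POSIX threads; next_ticket and worker are names invented for the example, and the counter stands in for whatever shared state your function touches:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;   /* shared state protected by counter_lock */

/* Only one thread at a time executes the statements between lock and unlock;
 * a second caller blocks in pthread_mutex_lock() until the first has finished. */
long next_ticket(void)
{
    long ticket;
    int rc = pthread_mutex_lock(&counter_lock);
    assert(rc == 0);

    ticket = ++counter;    /* the part that must not run concurrently */

    rc = pthread_mutex_unlock(&counter_lock);
    assert(rc == 0);
    (void)rc;
    return ticket;
}

/* Hypothetical thread body: several threads can run this at once and
 * every ticket number is still handed out exactly once. */
void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000; i++)
        next_ticket();
    return NULL;
}
```

Start the threads with pthread_create(&tid, NULL, worker, NULL) and wait for them with pthread_join(); the callers don't need to know about the mutex at all.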
In general terms: Plain code, including function calls, does not know about threads, the operating system does. By using a mutex you tap into the system that manages the running of threads. More details are just a Google search away.
Note that C11, the new C standard revision, does include multi-threading support. But this does not change the general concept; it simply means that you can use C library functions instead of operating system specific ones.

Shared semaphore between user and kernel spaces

Short version
Is it possible to share a semaphore (or any other synchronization lock) between user space and kernel space? Named POSIX semaphores have kernel persistence, that's why I was wondering if it is possible to also create, and/or access them from kernel context.
Searching the internet didn't help much due to the sea of information on normal usage of POSIX semaphores.
Long version
I am developing a unified interface to real-time systems in which I have some added book keeping to take care of, protected by a semaphore. These book keepings are done on resource allocation and deallocation, which is done in non-real-time context.
With RTAI, the thread waiting and posting a semaphore however needs to be in real-time context. This means that using RTAI's named semaphore means switching between real-time and non-real-time context on every wait/post in user space, and worse, creating a short real-time thread for every sem/wait in kernel space.
What I am looking for is a way to share a normal Linux or POSIX semaphore between kernel and user spaces so that I can safely wait/post it in non-real-time context.
Any information on this subject would be greatly appreciated. If this is not possible, do you have any other ideas how this task could be accomplished? [1]
[1] One way would be to add a system call, have the semaphore in kernel space, and have user space processes invoke that system call and the semaphore would be all managed in kernel space. I would be happier if I didn't have to patch the kernel just because of this though.

Well, you were in the right direction, but not quite -
Linux named POSIX semaphores are based on the futex, which stands for Fast User-space Mutex. As the name implies, while their implementation is assisted by the kernel, a big chunk of it is done by user code. Sharing such a semaphore between kernel and user space would require re-implementing this infrastructure in the kernel. Possible, but certainly not easy.
SysV semaphores, on the other hand, are implemented completely in the kernel and are only accessible to user space via standard system calls (e.g. semop(), semtimedop() and friends).
This means that every SysV-related operation (semaphore creation, taking or release) is actually implemented in the kernel, and you can simply call the underlying kernel function from your code to take the same semaphore from the kernel if needed.
Thus, your user code will simply call semtimedop(). That's the easy part.
The kernel part is just a little bit more tricky: you have to find the code that implements semtimedop() and related calls in the kernel (they are all in the file ipc/sem.c) and create a replica of each of the functions that does what the original function does without the calls to copy_from_user(...) and copy_to_user(..) and friends.
The reason for this is that those kernel functions expect to be called from a system call with a pointer to a user buffer, while you want to call them with parameters in kernel buffers.
Take for example semtimedop() - the relevant kernel function is sys_semtimedop() in ipc/sem.c (see here: http://lxr.free-electrons.com/source/ipc/sem.c#L1537). If you copy this function into your kernel code and just remove the parts that do copy_from_user() and copy_to_user() and simply use the passed pointers (since you'll call them from kernel space), you'll get kernel equivalent functions that can take a SysV semaphore from kernel space, alongside user space - so long as you call them from process context in the kernel (if you don't know what this last sentence means, I highly recommend reading up on Linux Device Drivers, 3rd edition).
Best of luck.

One solution I can think of is to have a /proc (or /sys or whatever) file on a main kernel module where writing 0/1 to it (or read from/write to it) would cause it to issue an up/down on a semaphore. Exporting that semaphore allows other kernel modules to directly access it while user applications would go through the /proc file system.
I'd still wait to see if the original question has an answer.

I'm not really experienced with this by any means, but here's my take. If you look at glibc's implementation of sem_open and sem_wait, it's really just creating a file in /dev/shm, mmap'ing a struct from it, and using atomic operations on it. If you want to access the named semaphore from kernel space, you will probably have to patch the tmpfs subsystem. However, I think this would be difficult, as it wouldn't be straightforward to determine if a file is meant to be a named semaphore.
An easier way would probably be to just reuse the kernel's semaphore implementation and have the kernel manage the semaphore for userspace processes. To do this, you would write a kernel module which you associate with a device file. Then define two ioctl's for the device file, one for wait, and one for post. Here is a good tutorial on writing kernel modules, including setting up a device file and adding I/O operations for it. http://www.freesoftwaremagazine.com/articles/drivers_linux. I don't know exactly how to implement an ioctl operation, but I think you can just assign a function to the ioctl member of the file_operations struct. Not sure what the function signature should be, but you could probably figure it out by digging around in the kernel source.

As I'm sure you know, even the best working solution to this would likely be very ugly. If I were in your place, I would simply concede the battle and use rendezvous points to sync the processes.

I have read your project's README and I have the following observations. Apologies in advance:
Firstly there already is a universal interface to real time systems. It is called POSIX; certainly VxWorks, Integrity and QNX are POSIX compliant and in my experience there are very few problems with portability if you develop within the POSIX API. Whether POSIX is sane or not is another matter, but it's the one we all use.
[The reason most RTOSes are POSIX compliant is because one of the big markets for them is defence equipment. And the US DoD won't let you use an OS for their non-IT equipment (eg Radars) unless it is POSIX compliant... This has pretty much made it commercially impossible to do an RTOS without giving it POSIX]
Secondly, Linux itself can be made into a pretty good real time OS by applying the PREEMPT_RT patch set. Of all the RTOSes out there this is probably the best one at the moment from the point of view of making efficient use of all these multi core CPUs. However it's not quite such a hard-realtime OS as the others, so it's quid pro quo.
RTAI takes a different approach of in effect placing their own RTOS underneath Linux and making Linux nothing more than one task running in their OS. This approach is ok up to a point, but the big penalty of RTAI is that the real time bit is now (as far as I can tell) not POSIX compliant (though the API looks like they've just stuck rt_ on the front of some POSIX function names) and interaction with other things is now, as you're discovering, quite complicated.
PREEMPT_RT is a much more intrusive patch set than RTAI, but the payback is that everything else (like POSIX and valgrind) stays completely normal. Plus nice things like FTrace are available. Book keeping is then a case of merely using existing tools, not having to write new ones. Also it looks like PREEMPT_RT is gradually worming its way into the mainstream Linux kernel anyway. That would render other patch sets like RTAI pretty much pointless.
So Linux + PREEMPT_RT gives us realtime POSIX plus a bunch of tools, just like all the other RTOSes out there; commonality across the board. Which kinda sounds like the goal of your project.
I apologise for not helping with the "how" of your project, and it is highly ungentlemanly of me to query the "why?" of it too. But I feel it is important to know that there are established things out there that seem to heavily overlap with what you're trying to do. Unseating King POSIX is going to be difficult.

I would like to answer this differently: you don't want to do this. There are good reasons why there is no interface to do this kind of thing and there are good reasons why all other kernel subsystems are designed and implemented to never need a lock shared between user and kernel space. The complexity of lock ordering and implicit locking in unexpected places will quickly get out of hand if you start playing around with userland that can prevent the kernel from doing certain things.
Let me recall a very long debugging session I did around 15 years ago to at least shed some light on what complex problems you can run into. I was involved in developing a file system where a large portion of the code was in userland. Something like FUSE.
The kernel would do a filesystem operation, package it into a message and send it to the userland daemon and wait for a reply. The userland daemon reads the message, does stuff and writes a reply to the kernel which wakes up and continues with the operation. Simple concept.
One thing you need to understand about filesystems is locking. When you're looking up a name of a file, for example "foo/bar", the kernel somehow gets the node for the directory "foo", then locks it and asks it if it has the file "bar". The filesystem code somehow finds "bar", locks it and then unlocks "foo". The locking protocol is quite straightforward (unless you're doing a rename): the parent always gets locked before the child, and the child is locked before the parent lock is released. The lookup message for the file is what would get sent to our userland daemon while the directory was still locked; when the daemon replied, the kernel would proceed to first lock "bar" and then unlock "foo".
I don't even remember the symptoms we were debugging, but I remember the issue was not trivially reproducible; it required hours and hours of filesystem torture programs until it manifested itself. But after a few weeks we figured out what was going on. Let's say that the full path to our file was "/a/b/c/foo/bar". We're in the process of doing a lookup on "bar", which means that we're holding the lock on "foo". The daemon is a normal userland process, so some operations it does can block, and it can be preempted too. It's actually talking over the network, so it can block for a long time. While we're waiting for the userland daemon, some other process wants to look up "foo" for some reason. To do this, it has the node for "c", locked of course, and asks it to look up "foo". It manages to find it and attempts to lock it (it has to be locked before we can release the lock on "c") and waits for the lock on "foo" to be released. Another process comes in and wants to look up "c"; it of course ends up waiting for that lock while holding the lock on "b". Another process waits for "b" and holds "a". Yet another process wants "a" and holds the lock on "/".
This is not a problem, not yet. This sometimes happens in normal filesystems too: locks can cascade all the way up to the root, you wait for a while for a slow disk, the disk responds, the congestion eases up, everyone gets their locks and everything keeps running fine. In our case though, the reason for holding the lock a long time was that the remote server for our distributed filesystem didn't respond. X seconds later the userland daemon times out, and just before responding to the kernel that the lookup operation on "bar" has failed, it logs a message to syslog with a timestamp. One of the things that the timestamp needs is the timezone information, so it needs to open "/etc/localtime". Of course, to do that it needs to start looking up "/etc", and for that it needs to lock "/". "/" is already locked by someone else, so the userland daemon waits for that someone else to unlock "/" while that someone else waits through a chain of 5 processes and locks for the daemon to respond. The system ends up in a total deadlock.
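The parent-before-child protocol described above is often called hand-over-hand locking. Here is a minimal user-space sketch of it, assuming a hypothetical in-memory tree of nodes (struct node, lookup_step and lookup_path are names invented for this illustration and are not taken from any kernel):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHILDREN 4

/* Hypothetical in-memory directory node, for illustration only. */
struct node {
    pthread_mutex_t lock;
    const char *name;
    struct node *children[MAX_CHILDREN]; /* unused slots are NULL */
};

/* `dir` must already be locked by the caller. Finds `name` in it, locks the
   child (if found) and only then unlocks `dir`.
   Returns the locked child, or NULL with nothing left locked. */
static struct node *lookup_step(struct node *dir, const char *name)
{
    assert(dir != NULL && name != NULL);
    struct node *child = NULL;
    for (size_t i = 0; i < MAX_CHILDREN && dir->children[i] != NULL; i++) {
        if (strcmp(dir->children[i]->name, name) == 0) {
            child = dir->children[i];
            break;
        }
    }
    if (child != NULL)
        pthread_mutex_lock(&child->lock);  /* the child is locked first ... */
    pthread_mutex_unlock(&dir->lock);      /* ... then the parent is released */
    return child;
}

/* Walks `n` path components starting at `root`.
   Returns the final node, still locked, or NULL if a component is missing. */
static struct node *lookup_path(struct node *root,
                                const char *const *parts, size_t n)
{
    struct node *cur = root;
    pthread_mutex_lock(&cur->lock);
    for (size_t i = 0; i < n && cur != NULL; i++)
        cur = lookup_step(cur, parts[i]);
    return cur;
}

/* Builds "/foo/bar", looks it up, and returns 1 if "bar" was found. */
int lookup_demo(void)
{
    static struct node bar  = { PTHREAD_MUTEX_INITIALIZER, "bar", { NULL } };
    static struct node foo  = { PTHREAD_MUTEX_INITIALIZER, "foo", { &bar } };
    static struct node root = { PTHREAD_MUTEX_INITIALIZER, "/",   { &foo } };
    const char *const parts[] = { "foo", "bar" };

    struct node *found = lookup_path(&root, parts, 2);
    if (found == NULL)
        return 0;
    int ok = strcmp(found->name, "bar") == 0;
    pthread_mutex_unlock(&found->lock);
    return ok;
}
```

At any moment a walker holds at most two locks, and always a parent before its child. That is why a stalled lookup on one node can make waiters pile up on each ancestor in turn.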
Now, maybe your code will not have problems like this. You're talking about a real-time system so there might be a level of control you have that normal kernels don't. But I'm not sure if adding an unexpected layer of locking complexity would even let you keep real time properties of the system, or really make sure that nothing you do in userland will ever create a deadlock cascade. If you don't page, if you never touch any file descriptor, if you never do memory operations and a bunch of other things I can't really think of right now you could get away with a lock shared between userland and kernel, but it will be hard and you'll probably find unexpected problems.

Multiple solutions exist in Linux/GLIBC, but none of them allows a semaphore to be shared explicitly between user and kernel space.
The kernel provides solutions to suspend threads/processes and the most efficient is the futex. Here are some details about the state of the art of the current implementations to synchronize user space applications.
SYSV services
The Linux System V (SysV) semaphores are a legacy of the eponymous Unix OS. They are based on system calls to lock/unlock semaphores. The corresponding services are:
semget() to get an identifier
semop() to make operations on the semaphores (e.g. incrementation/decrementation)
semctl() to make some control operations on the semaphores (e.g. destruction)
GLIBC (e.g. version 2.31) does not add any value on top of those services: each library function directly calls the eponymous system call. For example, semtimedop(), the timed variant of semop() (in sysdeps/unix/sysv/linux/semtimedop.c), directly invokes the corresponding system call:
int
__semtimedop (int semid, struct sembuf *sops, size_t nsops,
              const struct timespec *timeout)
{
  /* semtimedop wire-up syscall is not exported for 32-bit ABIs (they have
     semtimedop_time64 instead with uses a 64-bit time_t). */
#if defined __ASSUME_DIRECT_SYSVIPC_SYSCALLS && defined __NR_semtimedop
  return INLINE_SYSCALL_CALL (semtimedop, semid, sops, nsops, timeout);
#else
  return INLINE_SYSCALL_CALL (ipc, IPCOP_semtimedop, semid,
                              SEMTIMEDOP_IPC_ARGS (nsops, sops, timeout));
#endif
}
weak_alias (__semtimedop, semtimedop)
Nowadays, SysV semaphores (as well as other SysV IPC mechanisms such as shared memory and message queues) are considered deprecated: because every operation needs a system call, they slow down the calling processes with a systematic switch into the kernel. New applications should use the POSIX-compliant services available through GLIBC.
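For reference, here is a minimal sketch of how semget(), semop() and semctl() fit together when a SysV semaphore is used as a binary lock. The function names sysv_lock, sysv_unlock and sysv_sem_demo are invented for this example, and error handling is kept to the bare minimum:

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* On Linux the caller has to define union semun itself (see semctl(2)). */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Decrement semaphore 0 of the set: one system call per "lock". */
static int sysv_lock(int id)
{
    assert(id >= 0);
    struct sembuf down = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    return semop(id, &down, 1);
}

/* Increment semaphore 0 of the set: one system call per "unlock". */
static int sysv_unlock(int id)
{
    assert(id >= 0);
    struct sembuf up = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
    return semop(id, &up, 1);
}

/* Returns 0 if the semaphore behaved like a binary lock, -1 otherwise. */
int sysv_sem_demo(void)
{
    union semun arg = { .val = 1 };
    int result = -1;

    /* A private set containing a single semaphore. */
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (id == -1)
        return -1;

    if (semctl(id, 0, SETVAL, arg) == 0 && sysv_lock(id) == 0) {
        int held = semctl(id, 0, GETVAL);          /* 0 while "locked" */
        if (sysv_unlock(id) == 0) {
            int released = semctl(id, 0, GETVAL);  /* back to 1 */
            if (held == 0 && released == 1)
                result = 0;
        }
    }

    semctl(id, 0, IPC_RMID);  /* SysV objects outlive the process otherwise */
    return result;
}
```

Note that even the uncontended lock and unlock each go through semop(), which is the cost the paragraph above refers to.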
POSIX services
POSIX semaphores are based on fast userspace mutexes (futexes). The principle is to increment/decrement the semaphore counter in user space with atomic operations as long as there is no contention. When there is contention (multiple threads/processes want to "lock" the semaphore at the same time), a futex() system call is made either to suspend the threads/processes waiting for the semaphore to be released, or to wake up waiting threads/processes when the semaphore is "unlocked". From a performance point of view, this makes a big difference compared to the above SysV services, which systematically require a system call for every operation. The POSIX services are implemented in GLIBC for the user space part of the operations (atomic operations), with a switch into kernel space only when there is contention.
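To make the fast-path/slow-path split concrete, here is a minimal sketch of a futex-based lock. It is not GLIBC's code: it uses a commonly described three-state lock word, and the names sketch_lock, sketch_unlock and futex_lock_demo are invented for this example. It assumes Linux and assumes that atomic_int has the same layout as the 32-bit word the futex() system call expects.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NTHREADS 4
#define NITER 100000

/* Lock word: 0 = free, 1 = locked, 2 = locked and someone may be waiting. */
static atomic_int lock_word;
static long counter;

/* glibc has no futex() wrapper, so the raw system call is used. */
static long futex(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void sketch_lock(atomic_int *m)
{
    int c = 0;
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;                           /* fast path: no system call */
    if (c != 2)
        c = atomic_exchange(m, 2);        /* announce that there is a waiter */
    while (c != 0) {
        futex(m, FUTEX_WAIT_PRIVATE, 2);  /* sleeps only while *m is still 2 */
        c = atomic_exchange(m, 2);
    }
}

static void sketch_unlock(atomic_int *m)
{
    if (atomic_fetch_sub(m, 1) != 1) {    /* was 2: a waiter may be asleep */
        atomic_store(m, 0);
        futex(m, FUTEX_WAKE_PRIVATE, 1);  /* slow path: wake one waiter */
    }
}

static void *futex_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITER; i++) {
        sketch_lock(&lock_word);
        counter++;
        sketch_unlock(&lock_word);
    }
    return NULL;
}

/* Runs NTHREADS workers and returns the final counter value. */
long futex_lock_demo(void)
{
    pthread_t t[NTHREADS];
    counter = 0;
    atomic_store(&lock_word, 0);
    for (int i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&t[i], NULL, futex_worker, NULL);
        assert(rc == 0);  /* sketch only: real code should handle the error */
        (void)rc;
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return counter;
}
```

With a single thread, neither sketch_lock() nor sketch_unlock() ever reaches syscall(); the kernel is involved only once a second thread finds the word non-zero.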
For example, in GLIBC 2.31, the service to lock a semaphore is located in nptl/sem_waitcommon.c. It checks the value of the semaphore to decrement it with an atomic operation (in __new_sem_wait_fast()) and invokes the futex() system call (in __new_sem_wait_slow()) to suspend the calling thread only if the semaphore was equal to 0 before the attempt to decrement it.
static int
__new_sem_wait_fast (struct new_sem *sem, int definitive_result)
{
[...]
uint64_t d = atomic_load_relaxed (&sem->data);
do
{
if ((d & SEM_VALUE_MASK) == 0)
break;
if (atomic_compare_exchange_weak_acquire (&sem->data, &d, d - 1))
return 0;
}
while (definitive_result);
return -1;
[...]
}
[...]
static int
__attribute__ ((noinline))
__new_sem_wait_slow (struct new_sem *sem, clockid_t clockid,
const struct timespec *abstime)
{
int err = 0;
[...]
uint64_t d = atomic_fetch_add_relaxed (&sem->data,
(uint64_t) 1 << SEM_NWAITERS_SHIFT);
pthread_cleanup_push (__sem_wait_cleanup, sem);
/* Wait for a token to be available. Retry until we can grab one. */
for (;;)
{
/* If there is no token available, sleep until there is. */
if ((d & SEM_VALUE_MASK) == 0)
{
err = do_futex_wait (sem, clockid, abstime);
[...]
The POSIX services based on the futex include, for example:
sem_init() to create a semaphore
sem_wait() to lock a semaphore
sem_post() to unlock a semaphore
sem_destroy() to destroy a semaphore
To manage mutexes (i.e. binary semaphores), it is possible to use the pthread services. They are also based on the futex. For example:
pthread_mutex_init() to create/initialize a mutex
pthread_mutex_lock/unlock() to lock/unlock a mutex
pthread_mutex_destroy() to destroy a mutex
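As an illustration of the semaphore calls listed above, here is a minimal sketch that uses an unnamed POSIX semaphore initialized to 1 as a lock around a shared counter. The names sem_worker and posix_sem_demo, and the thread and iteration counts, are arbitrary choices for this example:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

#define NTHREADS 4
#define NITER 50000

static sem_t sem;
static long total;

static void *sem_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITER; i++) {
        /* sem_wait() may be interrupted by a signal, so retry on EINTR. */
        while (sem_wait(&sem) == -1 && errno == EINTR)
            ;
        total++;            /* protected by the semaphore */
        sem_post(&sem);
    }
    return NULL;
}

/* Returns the final counter value, or -1 if the semaphore cannot be created. */
long posix_sem_demo(void)
{
    pthread_t t[NTHREADS];
    total = 0;

    /* Second argument 0: shared between threads of this process only.
       Third argument 1: initial value, so the semaphore acts as a lock. */
    if (sem_init(&sem, 0, 1) != 0)
        return -1;

    for (int i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&t[i], NULL, sem_worker, NULL);
        assert(rc == 0);  /* sketch only: real code should handle the error */
        (void)rc;
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    sem_destroy(&sem);
    return total;
}
```

The same structure works with pthread_mutex_init(), pthread_mutex_lock(), pthread_mutex_unlock() and pthread_mutex_destroy() in place of the sem_*() calls.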

I was thinking about ways that kernel and user land share things directly i.e. without syscall/copyin-out cost. One thing I remembered was the RDMA model where the kernel writes/reads directly from user space, with synchronization of course. You may want to explore that model and see if it works for your purpose.

Implementing critical section

What way is better and faster to create a critical section?
With a binary semaphore, between sem_wait and sem_post.
Or with atomic operations:
#include <sched.h>
#include <stdbool.h>

void critical_code(){
    static volatile bool lock = false;
    //Enter critical section
    while ( !__sync_bool_compare_and_swap (&lock, false, true ) ){
        sched_yield();
    }
    //...
    //Leave critical section
    lock = false;
}

Regardless of what method you use, the worst performance problem with your code has nothing to do with what type of lock you use, but the fact that you're locking code rather than data.
With that said, there is no reason to roll your own spinlocks like that. Either use pthread_spin_lock if you want a spinlock, or else pthread_mutex_lock or sem_wait (with a binary semaphore) if you want a lock that can yield to other processes when contended. The code you have written is the worst of both worlds in how it uses sched_yield. The call to sched_yield will ensure that the lock waits at least a few milliseconds (and probably a whole scheduling timeslice) in the case where there's both lock contention and cpu load, and it will burn 100% cpu when there's contention but no cpu load (due to the lock-holder being blocked in IO, for instance). If you want to get any of the benefits of a spin lock, you need to be spinning without making any syscalls. If you want any of the benefits of yielding the cpu, you should be using a proper synchronization primitive which will use (on Linux) futex (or equivalent) operations to yield exactly until the lock is available - no shorter and no longer.
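To illustrate the "lock data, not code" point, here is a minimal sketch in which each object carries its own pthread mutex, so threads working on different objects never contend. The struct account type and the function names are hypothetical and only serve this example:

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4
#define NITER 50000

/* The lock lives next to the data it protects. */
struct account {
    pthread_mutex_t lock;
    long balance;
};

static void account_deposit(struct account *a, long amount)
{
    pthread_mutex_lock(&a->lock);
    a->balance += amount;
    pthread_mutex_unlock(&a->lock);
}

static long account_balance(struct account *a)
{
    pthread_mutex_lock(&a->lock);
    long b = a->balance;
    pthread_mutex_unlock(&a->lock);
    return b;
}

static void *depositor(void *arg)
{
    struct account *acct = arg;
    for (int i = 0; i < NITER; i++)
        account_deposit(acct, 1);
    return NULL;
}

/* Two threads deposit into each of two accounts; returns the combined balance. */
long accounts_demo(void)
{
    static struct account a = { PTHREAD_MUTEX_INITIALIZER, 0 };
    static struct account b = { PTHREAD_MUTEX_INITIALIZER, 0 };
    struct account *targets[NTHREADS] = { &a, &a, &b, &b };
    pthread_t t[NTHREADS];

    a.balance = 0;
    b.balance = 0;
    for (int i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&t[i], NULL, depositor, targets[i]);
        assert(rc == 0);  /* sketch only: real code should handle the error */
        (void)rc;
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return account_balance(&a) + account_balance(&b);
}
```

A single static lock inside the function, as in the question, would instead serialize every caller regardless of which data they touch.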
And if by chance all that went over your head, don't even think about writing your own locks.

Spin-locks perform better if there is little contention for the lock and/or it is never held for a long period of time. Otherwise you are better off with a lock that blocks rather than spins. There are of course hybrid locks which will spin a few times, and if the lock cannot be acquired, then they will block.
Which is better for you depends on your application. Only you can answer that question.
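A hybrid lock of the kind mentioned above can be sketched on top of a plain pthread mutex: try a bounded number of times without blocking, then fall back to a blocking acquire. The name hybrid_lock and the SPIN_LIMIT value are arbitrary assumptions for this example, not tuned numbers:

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4
#define NITER 50000
#define SPIN_LIMIT 100   /* arbitrary bound, chosen only for illustration */

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static long hits;

/* Spin briefly with trylock, then block in pthread_mutex_lock(). */
static void hybrid_lock(pthread_mutex_t *m)
{
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return;              /* acquired while spinning */
    }
    pthread_mutex_lock(m);       /* give up spinning and sleep until free */
}

static void *hybrid_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITER; i++) {
        hybrid_lock(&mtx);
        hits++;
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

/* Runs NTHREADS workers and returns the final counter value. */
long hybrid_demo(void)
{
    pthread_t t[NTHREADS];
    hits = 0;
    for (int i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&t[i], NULL, hybrid_worker, NULL);
        assert(rc == 0);  /* sketch only: real code should handle the error */
        (void)rc;
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return hits;
}
```

On glibc, the PTHREAD_MUTEX_ADAPTIVE_NP mutex type discussed at the top of this page builds the same idea into the mutex itself, so a hand-written loop like this is mainly useful for seeing the principle.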

You didn't look deep enough in the gcc documentation. The correct builtins for this type of lock are __sync_lock_test_and_set and __sync_lock_release. These have exactly the guarantees that you need for such a thing. In terms of the new C11 standard this would be the type atomic_flag with operations atomic_flag_test_and_set and atomic_flag_clear.
As R. already indicates, putting sched_yield into the loop is really a bad idea.
If the code inside the critical section takes only a few cycles, the probability that its execution falls across the boundary of a scheduling slice is small. The number of threads that will be blocked spinning actively will be at most the number of processors minus one. None of this holds if you yield execution as soon as you fail to obtain the lock immediately. If you have real contention on your lock and yield, you will have a multitude of context switches, which will bring your system almost to a halt.
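Here is a minimal sketch of the C11 variant mentioned above: a spinlock built on atomic_flag that spins without making any system call. The explicit acquire/release memory orders and the names spin_acquire, spin_release and spin_demo are choices made for this example:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4
#define NITER 20000

static atomic_flag flag = ATOMIC_FLAG_INIT;
static long shared;

static void spin_acquire(void)
{
    /* test_and_set returns the previous value: true means someone holds it. */
    while (atomic_flag_test_and_set_explicit(&flag, memory_order_acquire))
        ;  /* busy-wait, deliberately without sched_yield() */
}

static void spin_release(void)
{
    atomic_flag_clear_explicit(&flag, memory_order_release);
}

static void *spin_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITER; i++) {
        spin_acquire();
        shared++;          /* keep the critical section to a few cycles */
        spin_release();
    }
    return NULL;
}

/* Runs NTHREADS workers and returns the final counter value. */
long spin_demo(void)
{
    pthread_t t[NTHREADS];
    shared = 0;
    for (int i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&t[i], NULL, spin_worker, NULL);
        assert(rc == 0);  /* sketch only: real code should handle the error */
        (void)rc;
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return shared;
}
```

Unlike the plain store `lock = false` in the question, atomic_flag_clear_explicit() with release ordering guarantees that writes made inside the critical section are visible to the next thread that acquires the flag.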

As others have pointed out, it's not really about how fast the locking code is. This is because once a lock sequence is initiated using "xchg reg,mem", a lock signal is sent down through the caches and out to the devices on all buses. Only when the last device has acknowledged that it will hold off, which may take hundreds if not a thousand clock cycles, is the actual exchange performed. If your slowest device is a classic PCI card, it will have a bus speed of 33 MHz, which is about one hundredth of the CPU's internal clock. And the PCI device (if active) will need several clock cycles (at 33 MHz) to respond. During that time the CPU will be waiting for the acknowledgement to come back.
Most spinlocks are probably used in device drivers where the routine won't be pre-empted by the OS but might be interrupted by a higher-level driver.
A critical section is really just a spin-lock but with interfacing to the OS because it may be pre-empted.
