Why is a pthread mutex considered "slower" than a futex?

Why are POSIX mutexes considered heavier or slower than futexes? Where is the overhead coming from in the pthread mutex type? I've heard that pthread mutexes are based on futexes, and when uncontested, do not make any calls into the kernel. It seems then that a pthread mutex is merely a "wrapper" around a futex.
Is the overhead simply in the function-wrapper call and the need for the mutex function to "setup" the futex (i.e., basically the setup of the stack for the pthread mutex function call)? Or are there some extra memory barrier steps taking place with the pthread mutex?

Futexes were created to improve the performance of pthread mutexes. NPTL uses futexes; LinuxThreads predates them, which I think is where the "slower" reputation comes from. NPTL mutexes may have some additional overhead, but it shouldn't be much.
Edit:
The actual overhead basically consists of:
selecting the correct algorithm for the mutex type (normal, recursive, adaptive, error-checking; normal, robust, priority-inheritance, priority-protected), where the code heavily hints to the compiler that a normal mutex is the likely case (so it can convey that to the CPU's branch prediction logic),
and a write of the current owner of the mutex if we manage to take it. This write should normally be fast, since the owner field resides in the same cache line as the lock word we have just taken, unless the lock is heavily contended and some other CPU accessed the lock between the time we took it and the time we write the owner (this write is unneeded for normal mutexes, but needed for error-checking and recursive mutexes).
So, the cost ranges from a few cycles (typical case) to a few cycles plus a branch misprediction plus an additional cache miss (very worst case).
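To make that concrete, here is a rough sketch of what such a type-dispatching fast path looks like, using GCC/Clang atomic builtins. This is not glibc's actual code; the names and layout are purely illustrative.

/* Illustrative sketch of a type-dispatching mutex fast path; not glibc's
   actual implementation. */
enum my_mutex_kind { MY_MUTEX_NORMAL, MY_MUTEX_RECURSIVE, MY_MUTEX_ERRORCHECK };

struct my_mutex {
    int lock;                /* 0 = unlocked, 1 = locked (the futex word) */
    int kind;                /* one of my_mutex_kind */
    unsigned long owner;     /* written only for recursive/error-checking kinds */
};

int my_mutex_trylock(struct my_mutex *m, unsigned long self)
{
    int expected = 0;
    /* __builtin_expect mirrors the "this is likely a normal mutex" hint. */
    if (__builtin_expect(m->kind == MY_MUTEX_NORMAL, 1))
        return __atomic_compare_exchange_n(&m->lock, &expected, 1, 0,
                                           __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)
               ? 0 : -1;
    /* Recursive/error-checking kinds additionally record the owner, which
       normally sits in the same cache line as the lock word. */
    if (__atomic_compare_exchange_n(&m->lock, &expected, 1, 0,
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
        m->owner = self;
        return 0;
    }
    return -1;
}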

The short answer to your question is that futexes are known to be implemented about as efficiently as possible, while a pthread mutex may or may not be. At minimum, a pthread mutex has overhead associated with determining the type of mutex and futexes do not. So a futex will almost always be at least as efficient as a pthread mutex, until and unless someone thinks up some structure lighter than a futex and then releases a pthreads implementation that uses that for its default mutex.

Technically speaking pthread mutexes are not slower or faster than futexes. pthread is just a standard API, so whether they are slow or fast depends on the implementation of that API.
Specifically on Linux, pthread mutexes are implemented on top of futexes and are therefore fast. Actually, you don't want to use the futex API itself: it is very hard to use, has no appropriate wrapper functions in glibc, and requires coding in assembly, which would be non-portable. Fortunately for us, the glibc maintainers have already coded all of this under the hood of the pthread mutex API.
Now, because most operating systems did not implement futexes, what programmers usually mean by "pthread mutex" performance is the performance you get from the usual implementation of pthread mutexes, which is slower.
So, in most POSIX-compliant operating systems the pthread mutex is implemented in kernel space and is slower than a futex. On Linux they have the same performance. There could be other operating systems where pthread mutexes are implemented in user space (in the uncontended case) and therefore have better performance, but I am only aware of Linux at this point.
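If you want to convince yourself of the uncontended cost on your own machine, a minimal timing loop like the one below does the job (assuming Linux/glibc; link with -lpthread). Running it under strace also shows that no futex system calls are made as long as the mutex is never contended.

/* Times an uncontended lock/unlock pair; absolute numbers depend on the
   CPU and the libc, so treat the result as a ballpark figure only. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    struct timespec t0, t1;
    const long iters = 10 * 1000 * 1000;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per uncontended lock/unlock pair\n", ns / iters);
    return 0;
}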

Because futexes stay in userspace as much as possible, which means they require fewer system calls; that is inherently faster because the switch between user and kernel mode is expensive.
I assume you're talking about kernel threads when you talk about POSIX threads. It's entirely possible to have an entirely userspace implementation of POSIX threads which require no system calls but have other issues of their own.
My understanding is that a futex is halfway between a kernel POSIX thread and a userspace POSIX thread.

On AMD64 a futex is 4 bytes, while a NPTL pthread_mutex_t is 56 bytes! Yes, there is a significant overhead.
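The exact figure depends on the architecture and glibc version, but you can check it on your own system with a short program; a raw futex only needs a 4-byte integer:

#include <pthread.h>
#include <stdio.h>

int main(void)
{
    printf("sizeof(pthread_mutex_t) = %zu\n", sizeof(pthread_mutex_t));
    printf("futex word (int)        = %zu\n", sizeof(int));
    return 0;
}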

Related

Does POSIX specify a memory consistency model (Addressing multithreading)?

Are there any guarantees on when a memory write in one thread becomes visible in other threads using pthreads?
Compared to Java: the Java language spec has a section that specifies the interaction of locks and memory, which makes it possible to write portable multi-threaded Java code.
Is there a corresponding pthreads spec?
Sure, you can always go and make shared data volatile, but that is not what I'm after.
If this is platform dependent, is there a de facto standard? Or should another threading library be used?
POSIX specifies the memory model in 4.11 Memory Synchronization:
Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads. The following functions synchronize memory with respect to other threads:
fork()
pthread_barrier_wait()
pthread_cond_broadcast()
pthread_cond_signal()
pthread_cond_timedwait()
pthread_cond_wait()
pthread_create()
pthread_join()
pthread_mutex_lock()
pthread_mutex_timedlock()
pthread_mutex_trylock()
pthread_mutex_unlock()
pthread_spin_lock()
pthread_spin_trylock()
pthread_spin_unlock()
pthread_rwlock_rdlock()
pthread_rwlock_timedrdlock()
pthread_rwlock_timedwrlock()
pthread_rwlock_tryrdlock()
pthread_rwlock_trywrlock()
pthread_rwlock_unlock()
pthread_rwlock_wrlock()
sem_post()
sem_timedwait()
sem_trywait()
sem_wait()
semctl()
semop()
wait()
waitpid()
The pthread_once() function shall synchronize memory for the first call in each thread for a given pthread_once_t object.
The pthread_mutex_lock() function need not synchronize memory if the mutex type is PTHREAD_MUTEX_RECURSIVE and the calling thread already owns the mutex. The pthread_mutex_unlock() function need not synchronize memory if the mutex type is PTHREAD_MUTEX_RECURSIVE and the mutex has a lock count greater than one.
Unless explicitly stated otherwise, if one of the above functions returns an error, it is unspecified whether the invocation causes memory to be synchronized.
Applications may allow more than one thread of control to read a memory location simultaneously.
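As a small illustration of how this rule is used in practice (a sketch, not part of the standard's text): as long as both threads only touch the shared data while holding the same mutex, pthread_mutex_lock()/pthread_mutex_unlock() provide all the memory synchronization the reader needs.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int data;
static int ready;

static void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    data = 42;      /* both writes happen under the lock... */
    ready = 1;
    pthread_mutex_unlock(&lock);    /* ...and are published here */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    for (;;) {
        pthread_mutex_lock(&lock);
        int done = ready, value = data;
        pthread_mutex_unlock(&lock);
        if (done) {
            printf("data = %d\n", value);   /* guaranteed to print 42 */
            break;
        }
    }
    pthread_join(t, NULL);
    return 0;
}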
I am not aware that POSIX threads give such guarantees. They don't have a model for atomic access to thread-shared objects. With POSIX threads alone, the only visibility guarantees you get for modifications come from using some kind of lock.
Modern C (C11, and probably also C++11) has a model for this kind of question. It has threads and atomics (fences and all that) that give you exact rules for when you may assume that a modification done by one thread is visible to another.
The thread interface of C11 is a cooked-down version of POSIX threads, with less functionality. Unfortunately, the specification of that thread interface's semantics is still much too loose; basically, the semantics are missing in many places. But a combination of the C11 interfaces and POSIX thread semantics can give you a good view of how things work in modern systems.
Edit: So if you want guarantees for memory synchronization, use either the lock interfaces that POSIX provides or go for atomic operations. All modern compilers have extensions that provide these; gcc and family (icc, opencc, clang) have, for example, the series of __sync_* builtins. Clang in its newest version also already supports the new C11 _Atomic feature. There are also wrappers available that give the other compilers interfaces that come close to _Atomic.
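For completeness, here is a minimal sketch of the C11 way (assuming a compiler with <stdatomic.h> support, such as recent gcc or clang): a release store paired with an acquire load makes the plain write to data visible without any lock.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int data;
static atomic_int ready;    /* zero-initialized: "not ready" */

static void *producer(void *arg)
{
    (void)arg;
    data = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;   /* spin until the flag is published */
    printf("data = %d\n", data);    /* the acquire load makes 42 visible */
    pthread_join(t, NULL);
    return 0;
}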

Understanding discrepancy between POSIX and Linux/glibc sched_* functions

POSIX XSH 2.8.4 Process Scheduling defines the behavior of scheduling attributes for threads and processes. The sched_* interfaces are specified to affect the scheduling properties of the process, not the thread. This is clarified in the following passages:
The POSIX model treats a "process" as an aggregation of system resources, including one or more threads that may be scheduled by the operating system on the processor(s) it controls. Although a process has its own set of scheduling attributes, these have an indirect effect (if any) on the scheduling behavior of individual threads as described below.
and
For threads with system scheduling contention scope, the process scheduling attributes shall have no effect on the scheduling attributes or behavior either of the thread or an underlying kernel scheduling entity dedicated to that thread.
My reading of this is that, on a system where only the "system scheduling contention scope" is supported (Linux/glibc is such a system), the sched_* functions should have absolutely no observable effect.
This is contrary to the reality of the current behavior on Linux/glibc where sched_* set the scheduling attributes of a particular thread.
Aside from wanting to better understand this situation in general, I guess I have these key questions:
Is there any documentation of the rationale for this discrepancy?
Is my reading of the standard correct? In particular, it seems really surprising to me that sched_setparam and sched_setscheduler would be specified to have no effect in a single-threaded application (where the main thread is using the default scheduling policy, which cannot be changed, and system contention scope).
What is the usefulness of standard's sched_* functions? It seems to me they have no effect on most implementations, and minimal effect even on implementations that support the process contention scope. Could somebody describe the intended usage of them?
I believe the reason is that it has been this way since before NPTL was implemented, and nobody has contributed thread-group-wide scheduling attribute support to the kernel, so these functions just still do what they have always done.
And possibly because, as you point out, the way POSIX specifies them would not really be at all useful without process contention scope …
The rationale for the behaviour of sched_setparam etc. in Linux is that threads are in fact processes created by the clone(2) system call, cf. glibc/nptl/sysdeps/pthread/createthread.c.
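A small sketch of the practical consequence (Linux-specific behavior; realtime policies need root or CAP_SYS_NICE): the portable pthread_setschedparam() call is per-thread by definition, and on Linux sched_setscheduler(0, ...) also ends up acting on the calling task rather than on the whole process.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 10 };
    int err;

    /* Portable and per-thread by definition. */
    err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (err != 0)
        fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));

    /* On Linux this also affects only the calling task (thread). */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)
        perror("sched_setscheduler");

    return 0;
}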

How costly is mutex locks when thread synchronization is not critical?

I have an object (a kind of queue) which is accessed across threads. The queue object can be mutex-locked before being used by either thread.
A simpler way to manage this is by bringing the lock inside the queue object itself - hence every API will lock the queue and release when the work is done. This way, threads don't have to manage additional mutex variables along with each queue.
Now my question is, sometimes there is only one thread which is accessing queue (say it is a local variable). But since now inherently the queue would first lock its internal data structure and unlock before leaving, will this be a costly affair?
How costly is the redundant mutex_lock and mutex_unlock operation - when there is no specific need of thread synchronization?
PS:
My question is slightly related to this one: How efficient is locking an unlocked mutex? What is the cost of a mutex?
But I am looking for a specific answer for my design, and for an understanding of why.
I AM USING C, and pthread libraries.
One way to handle this is to have your queue initialization take a parameter that indicates whether a lock should be acquired or not during queue operations. If a queue is being used by a single thread, it gets initialized such that it won't acquire/release locks (or uses a lock object where the acquire/release operations are nops).
See this answer for an example of how boost::pool does something along these lines (although in C++ and as a compile time configuration): https://stackoverflow.com/a/10188784/12711
A similar concept can be applied to C code at runtime, too.
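A minimal sketch of that idea in C (the names my_queue, my_queue_init, etc. are made up for illustration): the queue remembers at init time whether its operations should take the internal mutex at all.

#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t mtx;
    bool use_lock;          /* decided once, at init time */
    /* ... actual queue storage ... */
} my_queue;

static void my_queue_init(my_queue *q, bool thread_safe)
{
    q->use_lock = thread_safe;
    if (thread_safe)
        pthread_mutex_init(&q->mtx, NULL);
}

/* The lock/unlock helpers degenerate to no-ops for single-threaded use. */
static void my_queue_lock(my_queue *q)   { if (q->use_lock) pthread_mutex_lock(&q->mtx); }
static void my_queue_unlock(my_queue *q) { if (q->use_lock) pthread_mutex_unlock(&q->mtx); }

static void my_queue_push(my_queue *q, int value)
{
    my_queue_lock(q);
    /* ... insert value into the queue storage ... */
    (void)value;
    my_queue_unlock(q);
}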
First of all: neither the C library nor pthreads implements mutex locking itself - they call into the kernel to use OS primitives for that. This implies that the performance of mutexes will vary wildly with the base OS.
If you can reduce your portability spectrum to hardware supporting atomic compare-exchange or atomic increase-and-read (such as any x86 from this millennium) you can use atomic increase-and-read to create a threadsafe queue that does not need locking.
For the .Net platform I have such a beast at http://sourceforge.net/projects/dotnetlockless - it should be quite easy to port it to C.
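For the single-producer/single-consumer case, a lock-free queue really is small; here is a sketch using C11 atomics (a zero-initialized struct is an empty ring). Anything with multiple producers or consumers needs compare-and-swap loops and much more care.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 1024      /* must be a power of two */

struct spsc_ring {
    int buf[RING_SIZE];
    atomic_size_t head;     /* advanced only by the consumer */
    atomic_size_t tail;     /* advanced only by the producer */
};

static bool ring_push(struct spsc_ring *r, int v)   /* producer side */
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SIZE)
        return false;                               /* full */
    r->buf[tail & (RING_SIZE - 1)] = v;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

static bool ring_pop(struct spsc_ring *r, int *out) /* consumer side */
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;                               /* empty */
    *out = r->buf[head & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}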

Mutexes and thread priorities with regards to scheduling on POSIX systems

In POSIX systems (Linux etc.), when multiple threads lock a common mutex - is the locking order always observed, or does thread priority bias scheduling toward higher-priority threads when it comes to choosing the next thread to enter the critical section?
Does the standard mention anything about the behavior? because as far as I can see it only seems to mention the required interface.
Please note, I'm looking of guidance on any POSIX conformant system (not just linux), so feel free to suggest the behavior of other OSes (QNX, Minix etc..)
When multiple threads are waiting to lock the same mutex, then when the mutex becomes available, the highest-priority thread will be unblocked first. If multiple threads have the same priority, which thread is unblocked depends on the scheduling algorithm used; e.g. with a FIFO policy, the thread that has been waiting the longest will be awakened first.
Thread priorities and synchronisation is quite a thorny area and you need to be very careful you don't end up with priority inversion and cause deadlock.
Chapter 5.5 of Butenhof's Programming with POSIX Threads deals with realtime scheduling.
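One standard tool against priority inversion is a priority-inheritance mutex. Whether PTHREAD_PRIO_INHERIT is available depends on the platform (Linux/glibc supports it); a minimal initialization sketch:

#include <pthread.h>

static pthread_mutex_t pi_mutex;

static int init_pi_mutex(void)
{
    pthread_mutexattr_t attr;
    int err;

    pthread_mutexattr_init(&attr);
    /* A lower-priority holder is temporarily boosted to the priority of
       the highest-priority waiter, bounding the inversion. */
    err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (err == 0)
        err = pthread_mutex_init(&pi_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return err;
}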

Implementing mutex in a user level thread library

I am developing a user level thread library as part of a project. I came up with an approach to implement a mutex. I would like to see your views before going on with it. Basically, I need to implement just 3 functions in my library
mutex_init, mutex_lock and mutex_unlock
I thought my mutex_t structure would look something like
typedef struct
{
    int available;               // indicates whether the mutex is locked or unlocked
    queue listofwaitingthreads;  // threads blocked waiting for this mutex
    gtthread_t owningthread;     // current owner, checked in mutex_unlock
} mutex_t;
In my mutex_lock function, I will first check in a while loop whether the mutex is available. If it is not, I will yield the processor so the next thread can execute.
In my mutex_unlock function, I will check whether the owning thread is the current thread. If it is, I will set available to 0.
Is this the way to go about it? Also, what about deadlock? Should I take care of those conditions in my user level library, or should I leave it to application programmers to write correct code?
This won't work, because you have a race condition. If 2 threads try to catch the lock at the same time, both will see available == 0, and both will think they succeeded with taking the mutex.
If you want to do this properly, and without using an already-existing lock, you must use hardware operations like TAS, CAS, etc.
There are algorithms that give you mutual exclusion without such hardware support, but they make assumptions that are often false. For more details, I highly recommend reading Herlihy and Shavit's The Art of Multiprocessor Programming, chapter 7.
You shouldn't worry about deadlocks in this level - mutex locks should be simple enough, and there is some assumption that the programmer using them should use care not to cause deadlocks (advanced mutexes can check for self-deadlock, meaning a thread that calls lock twice without calling unlock in the middle).
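A sketch of what the test-and-set version could look like, using the GCC/Clang __atomic builtins; gtthread_yield() stands in for whatever yield call your library provides (a hypothetical name here).

extern void gtthread_yield(void);      /* assumed: the library's own yield */

typedef struct {
    volatile char locked;              /* target of the test-and-set */
} tas_mutex_t;

static void tas_mutex_init(tas_mutex_t *m)
{
    m->locked = 0;
}

static void tas_mutex_lock(tas_mutex_t *m)
{
    /* __atomic_test_and_set atomically sets the flag and returns its old
       value, so a nonzero result means someone else already holds the lock. */
    while (__atomic_test_and_set(&m->locked, __ATOMIC_ACQUIRE))
        gtthread_yield();              /* give the holder a chance to run */
}

static void tas_mutex_unlock(tas_mutex_t *m)
{
    __atomic_clear(&m->locked, __ATOMIC_RELEASE);
}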
Not only do you have to use atomic operations to read and modify the flag (as Eran pointed out), you also have to make sure that your queue can handle concurrent accesses. This is not completely trivial; it is a bit of a chicken-and-egg problem.
But if you really implement this by spinning, you wouldn't even need such a queue. The access order to the lock would then be mostly random, though.
Just yielding would probably also not be enough; that can be quite costly if you have threads holding the lock for more than a few processor cycles. Consider using nanosleep with a small time value for the wait.
In general, a mutex implementation should look like:
Lock:
    while (trylock() == failed) {
        atomic_inc(waiter_cnt);
        atomic_sleep_if_locked();
        atomic_dec(waiter_cnt);
    }

Trylock:
    return atomic_swap(&lock, 1);

Unlock:
    atomic_store(&lock, 0);
    if (waiter_cnt) wakeup_sleepers();
Things get more complex if you want recursive mutexes, mutexes that can synchronize their own destruction (i.e. freeing the mutex is safe as soon as you get the lock), etc.
Note that atomic_sleep_if_locked and wakeup_sleepers correspond to FUTEX_WAIT and FUTEX_WAKE ops on Linux. The other atomics are probably CPU instructions, but could be system calls or kernel-assisted userspace function code, as in the case of Linux/ARM and the 0xffff0fc0 atomic compare-and-swap call.
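For comparison, here is a Linux-specific sketch of the same idea built directly on the futex syscall. It is simplified relative to the pseudocode above: instead of a separate waiter count it uses the classic 0/1/2 state encoding (0 = unlocked, 1 = locked, 2 = locked with possible waiters), so unlock only issues FUTEX_WAKE when someone may actually be sleeping.

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>

static long futex(uint32_t *addr, int op, uint32_t val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void futex_lock(uint32_t *lock)
{
    uint32_t c = 0;
    /* Fast path: 0 -> 1 with no system call. */
    if (__atomic_compare_exchange_n(lock, &c, 1, 0,
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        return;
    /* Slow path: mark the lock contended, then sleep until we grab it. */
    if (c != 2)
        c = __atomic_exchange_n(lock, 2, __ATOMIC_ACQUIRE);
    while (c != 0) {
        futex(lock, FUTEX_WAIT, 2);     /* sleep only while *lock == 2 */
        c = __atomic_exchange_n(lock, 2, __ATOMIC_ACQUIRE);
    }
}

static void futex_unlock(uint32_t *lock)
{
    /* If the old state was 2, somebody may be sleeping: wake one waiter. */
    if (__atomic_exchange_n(lock, 0, __ATOMIC_RELEASE) == 2)
        futex(lock, FUTEX_WAKE, 1);
}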
You do not need atomic instructions for a user level thread library, because all the threads are going to be user level threads of the same process. So actually when your process is given the time slice to execute, you are running multiple threads during that time slice but on the same processor. So, no two threads are going to be in the library function at the same time. Considering that the functions for mutex are already in the library, mutual exclusion is guaranteed.
