Understanding the discrepancy between POSIX and Linux/glibc sched_* functions

POSIX XSH 2.8.4 Process Scheduling defines the behavior of scheduling attributes for threads and processes. The sched_* interfaces are specified to affect the scheduling properties of the process, not the thread. This is clarified in the following passages:
The POSIX model treats a "process" as an aggregation of system resources, including one or more threads that may be scheduled by the operating system on the processor(s) it controls. Although a process has its own set of scheduling attributes, these have an indirect effect (if any) on the scheduling behavior of individual threads as described below.
and
For threads with system scheduling contention scope, the process scheduling attributes shall have no effect on the scheduling attributes or behavior either of the thread or an underlying kernel scheduling entity dedicated to that thread.
My reading of this is that, on a system where only the "system scheduling contention scope" is supported (Linux/glibc is such a system), the sched_* functions should have absolutely no observable effect.
This is contrary to the reality of the current behavior on Linux/glibc where sched_* set the scheduling attributes of a particular thread.
Aside from wanting to better understand this situation in general, I guess I have these key questions:
Is there any documentation of the rationale for this discrepancy?
Is my reading of the standard correct? In particular, it seems really surprising to me that sched_setparam and sched_setscheduler would be specified to have no effect in a single-threaded application (where the main thread is using the default scheduling policy, which cannot be changed, and system contention scope).
What is the usefulness of standard's sched_* functions? It seems to me they have no effect on most implementations, and minimal effect even on implementations that support the process contention scope. Could somebody describe the intended usage of them?

I believe the reason is that it has been this way since before NPTL was implemented, and nobody has contributed thread-group-wide scheduling attribute support to the kernel, so these functions still do what they have always done.
And possibly because, as you point out, the way POSIX specifies them would not really be at all useful without process contention scope …

The rationale for the behaviour of sched_setparam etc. in Linux is that threads are in fact processes created by the clone(2) system call, cf. glibc/nptl/sysdeps/pthread/createthread.c.
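To see that per-thread behaviour concretely, here is a minimal sketch (Linux-specific and unprivileged; SCHED_BATCH is a Linux extension chosen only because it needs no special privileges, and the timing via sleep() is a crude illustration, not robust synchronization). The main thread changes its own policy after creating a worker; the worker still reports the default SCHED_OTHER, which is exactly the thread-local behaviour described above, whereas under the strict POSIX reading from the question the call would affect either both threads or neither.

    /* Sketch: on Linux/glibc, sched_setscheduler(0, ...) changes the policy of
     * the calling thread only, not the whole process. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        (void)arg;
        sleep(1);                       /* wait until main has changed its policy */
        printf("worker policy: %d (SCHED_OTHER == %d)\n",
               sched_getscheduler(0), SCHED_OTHER);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        struct sched_param sp = { .sched_priority = 0 };

        pthread_create(&t, NULL, worker, NULL);

        if (sched_setscheduler(0, SCHED_BATCH, &sp) != 0)
            perror("sched_setscheduler");
        printf("main   policy: %d (SCHED_BATCH == %d)\n",
               sched_getscheduler(0), SCHED_BATCH);

        pthread_join(t, NULL);
        return 0;
    }

Compile with -pthread; the worker keeps reporting SCHED_OTHER even after main has switched itself to SCHED_BATCH.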

Ways to detect if any other threads exist (e.g. prior to fork)

Background, from POSIX:
A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called.
The difficulty is that we generally don't know if we're a multi-threaded process, since threads may have been created by library code. And "async-signal-safe" is a quite-severe restriction.
It is nonsensical to ask "how many threads are there?", since if other threads are still running, they may be exiting or creating new threads while we ask. We can, however, get answers (or partial answers) to simpler questions:
Is it even possible for other threads to exist?
Am I the only thread that ever existed?
Am I the only thread that exists right now?
...
For simplicity's sake let's assume:
we're not in a signal handler
nobody is mad enough to invoke UB by calling pthread_create or C11's thrd_create from a signal handler
nobody is doing threads outside of pthreads, C11, and C++11
C++11 threads appear to always be implemented in terms of pthreads (on platforms that support fork, at least)
C11 threads are very similar to pthreads, although we sometimes have to handle the functions separately.
Answers that involve arcane implementation details are encouraged, as long as they are (fairly) stable.
Some partial answers (more still needed):
Question 1 is addressed by libstdc++'s __gthread_active_p() for several libc implementations. The header is compatible with C, but it is a static function in a C++-only part of the include path, and it also relies on the existence of the macro __GXX_WEAK__, which is only predefined for C++. (libc++ unconditionally pulls in pthreads.)
Unfortunately, this is dangerously unreliable for the dlopen case (race conditions in correct user code); see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78017
Question 2 can be addressed by installing interceptors for pthread_create and thrd_create. But this can potentially be finicky (see the comments in gthr.h about interceptors).
If the results of calling clock_gettime with CLOCK_PROCESS_CPUTIME_ID and with CLOCK_THREAD_CPUTIME_ID differ, this may be proof that another thread has existed, but beware of races, resolution, and clock settability (setting these clocks is not possible on Linux, but POSIX potentially allows it); a sketch follows below.
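A sketch of that clock-comparison heuristic (the function name and the 10 ms slack value are arbitrary illustrations, not anything specified; the two clocks are read at slightly different moments and with finite resolution, so this can only ever be a hint, never proof of single-threadedness):

    #define _POSIX_C_SOURCE 199309L
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    /* Heuristic: if process CPU time noticeably exceeds this thread's CPU time,
     * some other thread has consumed CPU at some point. */
    static bool another_thread_has_probably_run(void)
    {
        struct timespec proc, self;
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &proc);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &self);

        long long diff_ns = (proc.tv_sec - self.tv_sec) * 1000000000LL
                          + (proc.tv_nsec - self.tv_nsec);

        return diff_ns > 10 * 1000 * 1000;   /* arbitrary 10 ms slack */
    }

    int main(void)
    {
        printf("another thread has probably run: %s\n",
               another_thread_has_probably_run() ? "yes" : "no");
        return 0;
    }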
Question 3 is the interesting one, anyway:
GDB is likely to know the answer, but spawning a whole other process seems unnecessary (e.g. answers involving ps should be rewritten to use /proc/ directly), and it may not have permission anyway
libthread_db.so exists but appears undocumented except in the original Solaris version. It looks like it might be possible to implement the proc_service.h callbacks for the current process however, if we ignore the "stop" part ...
On Linux, if gettid() != getpid(), you're not the main thread, thus there probably are at least two threads. (it's possible for the main thread to call pthread_exit, but this is weird)
A (somewhat) more portable version of the preceding: use __attribute__((constructor)) (or politely ask your caller) to stash the value of pthread_self() for the main thread. Unfortunately, there is a disturbing comment in libstdc++'s <thread> header (grep for __GLIBC__) about returning 0 (but I cannot reproduce this).
On Linux, if /proc is mounted and accessible, you can enumerate /proc/self/task/ (a sketch follows after this list). The code to do this is portable; it will just fail on OSes that don't provide this. (Are there others that provide this much?) Is /proc/self/status or /proc/self/stat any more portable? They have less information (and stat is hard to parse securely), but we probably don't need any more. (Need to test these for the "main thread exited" case.)
On GLIBC, we could possibly read the debug symbols to find the multiple_threads flag (sometimes global, sometimes part of struct pthread - ugh). But this is probably similar to libthread_db.so
Similarly MUSL has a count (minus one) and a linked list ... though it prefers to take an internal lock first. If we're only reading, is it safe to skip that?
If we block a signal and then kill the current process (not thread) with it, and our thread isn't the one that receives it, we know that other threads must exist to handle it. But there's no way to know how long to wait, and signals are dangerous global state anyway.
On Linux, unshare(2) ignores CLONE_THREAD for single-threaded processes and errors for multithreaded processes! (There's also some harder cases with user namespaces but I don't think they're needed)
On Linux, SELinux's setcon(3) is guaranteed to fail for multithreaded processes under certain conditions. This requires investigation; it takes some steps to correlate the kernel implementation to userland headers (there is a userland library involved).
From grepping the kernel sources, those are the only two interfaces that perform such a check, but there's nothing stopping other functions from being implemented on top of the same data structures.
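For the /proc/self/task item above, a Linux-only sketch (the function name is illustrative). Note that the count can be stale by the time the function returns, so it answers "how many threads did I see a moment ago", not "am I guaranteed to be alone":

    #include <dirent.h>
    #include <stdio.h>

    /* Count the entries under /proc/self/task; each entry is one live thread
     * of this process.  Returns the count, or -1 if /proc is not available. */
    static int count_threads(void)
    {
        DIR *d = opendir("/proc/self/task");
        if (d == NULL)
            return -1;                    /* no /proc, or not Linux */

        int n = 0;
        struct dirent *e;
        while ((e = readdir(d)) != NULL)
            if (e->d_name[0] != '.')      /* skip "." and ".." */
                n++;

        closedir(d);
        return n;
    }

    int main(void)
    {
        printf("threads right now: %d\n", count_threads());
        return 0;
    }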

What does it mean to POSIX that a thread is "suspended"?

In the course of commentary on a recent question, a subsidiary question arose about at what point a cancellation request for a pthreads thread with cancelability PTHREAD_CANCEL_DEFERRED can be expected to be acted upon. References to the standard and a bit of lawyering ensued. I'm not much concerned specifically about whether I was mistaken in my comments on that question, but I would like to be sure I understand POSIX's provisions correctly.
The most pertinent section of the standard says
Whenever a thread has cancelability enabled and a cancellation request has been made with that thread as the target, and the thread then calls any function that is a cancellation point [...], the cancellation request shall be acted upon before the function returns. If a thread has cancelability enabled and a cancellation request is made with the thread as a target while the thread is suspended at a cancellation point, the thread shall be awakened and the cancellation request shall be acted upon.
What, though, does it mean for a thread to be "suspended"? POSIX explicitly defines the term for processes, but not, as far as I can determine, for threads. On the other hand, POSIX documents thread suspension to be among the behaviors of a handful of functions, including, but not limited to, some of those related to synchronization objects. Should one then conclude that those serve collectively as the relevant definition of the term?
And as this all pertains to the question that spawned this line of inquiry, given that POSIX does not specify thread suspension as part of the behavior of read(), fread(), or any of the general file or stream I/O functions, if a thread is not making progress on account of being blocked on I/O, does that necessarily mean it is "suspended" for the purposes of cancellation?
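One way to probe what a particular implementation does (this only demonstrates behaviour on the system you run it on; it does not settle what POSIX requires) is a small experiment like the following: a thread with default, deferred cancelability blocks in read() on a pipe that never receives data, and the main thread cancels it. On Linux/glibc, read() is a cancellation point and the join reports PTHREAD_CANCELED.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static int fds[2];

    static void *blocker(void *arg)
    {
        (void)arg;
        char c;
        read(fds[0], &c, 1);      /* blocks forever: nothing is ever written */
        return (void *)1;         /* reached only if read() actually returns */
    }

    int main(void)
    {
        pthread_t t;
        void *result;

        if (pipe(fds) != 0)
            return 1;
        pthread_create(&t, NULL, blocker, NULL);
        sleep(1);                 /* crude: give the thread time to block in read() */

        pthread_cancel(t);
        pthread_join(t, &result);
        printf("thread was %s\n",
               result == PTHREAD_CANCELED ? "cancelled while blocked" : "not cancelled");
        return 0;
    }

Compile with -pthread.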
A suspended thread is one that, as you say, is blocked on a socket read, waiting for a semaphore to become available, etc.
Given that POSIX implementations vary at the tricky edges, and that there is the potential for a thread to be blocked in a function that is not a cancellation point, it might be that relying on cancellation in code that is to be ported might be more trouble than it's worth.
I've never used it; I've always chosen to have code to explicitly instruct a thread to terminate (normally a message down a pipe or queue). This is very easy with a Communicating Sequential Processes or Actor Model system.
That way clean up can be done under one's own control, freeing memory, etc. as necessary. I've no idea whether a cancelled thread will clean up its memory (I suspect not), or whether there is the option for an at_exit() type thing (there may be). On the whole I think that application behaviour is more thoroughly controlled if there is only one single way a thread can exit.
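A minimal sketch of that "message down a pipe" approach (the names and the 100 ms poll interval are illustrative, not from any library): the controller closes the write end of a control pipe instead of calling pthread_cancel, and the worker notices and shuts itself down under its own control, freeing whatever it needs to.

    #include <poll.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static int stop_pipe[2];   /* [0] = read end (worker), [1] = write end (controller) */

    static void *worker(void *arg)
    {
        (void)arg;
        struct pollfd pfd = { .fd = stop_pipe[0], .events = POLLIN };

        for (;;) {
            /* Real code would also poll its work descriptors here. */
            if (poll(&pfd, 1, 100) > 0 && (pfd.revents & (POLLIN | POLLHUP))) {
                /* explicit, controlled shutdown: free resources, flush, etc. */
                puts("worker: stop requested, cleaning up");
                return NULL;
            }
        }
    }

    int main(void)
    {
        pthread_t t;
        if (pipe(stop_pipe) != 0)
            return 1;
        pthread_create(&t, NULL, worker, NULL);

        sleep(1);
        close(stop_pipe[1]);    /* request shutdown instead of pthread_cancel() */
        pthread_join(t, NULL);
        return 0;
    }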
==EDIT==
#JohnBollinger,
The language used, "If a thread has cancelability enabled and a cancellation request is made with the thread as a target while the thread is suspended at a cancellation point", could be interpreted as IF a thread has cancelability enabled AND IF it is cancelled AND IF the implementation suspends blocked threads AND IF the thread is blocked THEN the thread shall be awakened... In other words, they're leaving it up to the implementer of the POSIX subsystem.
Cygwin's implementation of select() does not (or at least did not) result in the thread being suspended. Instead it spawns a polling thread per file descriptor to test for signalable activity, due to the fundamental lack of anything quite like select() in Windows (it gets close, but no cigar: Win32 select() works only on sockets). Implementations of select() back in the 1980s often worked this way too.
It might be for reasons like this that POSIX is reluctant to clearly define when a thread is suspended. Historically many implementations of select() were like this, making it a minefield for a standards committee to say when a thread might or might not be suspended. Of course the complexities caused by select() would also apply to a process but as POSIX does define a suspended process it does seem odd that they couldn't / didn't extend the definition to threads.
It might be down to how threads are implemented; you can conceivably have a POSIX implementation that doesn't use OS threads (a bit like the early implementations of Ada back in the days when OSes didn't do threads at all), and in such an implementation a blocked thread might not be suspended (in the sense of consuming no CPU cycles) at all.
Definition of suspend in the context of threads:
3.107 Condition Variable
A synchronization object which allows a thread to suspend execution, repeatedly, until some associated predicate becomes true. A thread whose execution is suspended on a condition variable is said to be blocked on the condition variable.
From: http://pubs.opengroup.org/onlinepubs/9699919799/
This is not a direct answer, just a definition – too large for a comment. Blocked == suspended.
read, fread, and friends are system calls and as such they will execute a context switch and execute from the kernel context until those functions complete. Interrupting a kernel context is outside the scope of pthreads thus they will not cause a cancellation.
I don't have a reference for it, but as far as I know, thread suspension in the context of POSIX threads has to do with its synchronization objects (like futexes).

Does POSIX specify a memory consistency model (Addressing multithreading)? [duplicate]

Are there any guarantees on when a memory write in one thread becomes visible in other threads using pthreads?
Comparing to Java, the Java language spec has a section that specifies the interaction of locks and memory that makes it possible to write portable multi-threaded Java code.
Is there a corresponding pthreads spec?
Sure, you can always go and make shared data volatile, but that is not what I'm after.
If this is platform dependent, is there a de facto standard? Or should another threading library be used?
POSIX specifies the memory model in 4.11 Memory Synchronization:
Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads. The following functions synchronize memory with respect to other threads:
fork()
pthread_barrier_wait()
pthread_cond_broadcast()
pthread_cond_signal()
pthread_cond_timedwait()
pthread_cond_wait()
pthread_create()
pthread_join()
pthread_mutex_lock()
pthread_mutex_timedlock()
pthread_mutex_trylock()
pthread_mutex_unlock()
pthread_spin_lock()
pthread_spin_trylock()
pthread_spin_unlock()
pthread_rwlock_rdlock()
pthread_rwlock_timedrdlock()
pthread_rwlock_timedwrlock()
pthread_rwlock_tryrdlock()
pthread_rwlock_trywrlock()
pthread_rwlock_unlock()
pthread_rwlock_wrlock()
sem_post()
sem_timedwait()
sem_trywait()
sem_wait()
semctl()
semop()
wait()
waitpid()
The pthread_once() function shall synchronize memory for the first call in each thread for a given pthread_once_t object.
The pthread_mutex_lock() function need not synchronize memory if the mutex type is PTHREAD_MUTEX_RECURSIVE and the calling thread already owns the mutex. The pthread_mutex_unlock() function need not synchronize memory if the mutex type is PTHREAD_MUTEX_RECURSIVE and the mutex has a lock count greater than one.
Unless explicitly stated otherwise, if one of the above functions returns an error, it is unspecified whether the invocation causes memory to be synchronized.
Applications may allow more than one thread of control to read a memory location simultaneously.
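To make the rules quoted above concrete, here is a minimal sketch of the usual publish/consume pattern (the variable names are illustrative): both threads bracket their accesses with pthread_mutex_lock()/pthread_mutex_unlock(), and it is those calls, being in the list above, that guarantee the reader sees the writer's stores.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int data;
    static int ready;

    static void *writer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        data = 42;                /* ordinary writes, protected by the mutex */
        ready = 1;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, writer, NULL);

        for (;;) {                /* reader: poll under the lock until published */
            pthread_mutex_lock(&lock);
            int r = ready, d = data;
            pthread_mutex_unlock(&lock);
            if (r) { printf("data = %d\n", d); break; }
        }
        pthread_join(t, NULL);
        return 0;
    }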
I am not aware that POSIX threads give such guarantees. They don't have a model for atomic access to thread-shared objects. With POSIX threads alone, the only guarantees you can have for the visibility of modifications come from using some kind of lock.
Modern C, C11 (and probably also C++11), has a model for this kind of question. It has threads and atomics (fences and all that stuff) that give you exact rules for when you may assume that a modification done by one thread is visible to another.
The thread interface of C11 is a cooked-down version of POSIX threads, with less functionality. Unfortunately, the specification of the semantics of that thread interface is still much too loose; basically the semantics are missing in many places. But a combination of the C11 interfaces and POSIX thread semantics can give you a good view of how things work in modern systems.
Edit: So if you want guarantees for memory synchronization, use either the lock interfaces that POSIX provides or go for atomic operations. All modern compilers have extensions that provide these; gcc and family (icc, opencc, clang) have e.g. the series of __sync... builtins. Clang in its newest version also already has support for the new C11 _Atomic feature. There are also wrappers available that give you interfaces for the other compilers that come close to _Atomic.
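A small C11 sketch of the atomics alternative mentioned in the edit (it assumes a toolchain that provides <stdatomic.h> and <threads.h>; older glibc versions lack the latter): a release store publishes the data, and an acquire load on the flag makes it visible without any lock.

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    static int data;
    static atomic_int ready;

    static int writer(void *arg)
    {
        (void)arg;
        data = 42;                                         /* plain write */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return 0;
    }

    int main(void)
    {
        thrd_t t;
        thrd_create(&t, writer, NULL);

        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                                              /* spin until published */
        printf("data = %d\n", data);

        thrd_join(t, NULL);
        return 0;
    }

Compile with -std=c11 -pthread.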

Are initialized pthread_mutex_t objects kernel persistent?

Question: are initialized pthread_mutex_t objects kernel persistent?
-- concern is for Linux V 2.6 onward.
Motivation:
If persistent: the object's resources will not be released without specific cleanup, i.e. pthread_mutex_destroy
In practical coding terms this means the mutex object will persist after the
creating program exits or aborts without cleanup, unless pthread_mutex_destroy
is called. I have code which is routinely removed by a nasty control program,
that employs kill -9, SIGKILL, after trying kill -15 (SIGTERM). The design
of the program is not going to change, it is vendor code. There is no way to
alter its base behavior. Correctly cleaning up the code often takes longer than the
control daemon likes, so 'zap' goes the process. This occurs frequently.
https://www.kernel.org/doc/Documentation/mutex-design.txt
From Ingo Molnar
[ this is older material which says 'yes', spinlocks are a kernel mode object ]
'struct mutex' is the new mutex type, defined in include/linux/mutex.h and
implemented in kernel/locking/mutex.c. It is a counter-based mutex with a
spinlock and a wait-list. The counter has 3 states: 1 for "unlocked", 0 for
"locked" and negative numbers (usually -1) for "locked, potential waiters
queued".
http://man7.org/linux/man-pages/man2/execve.2.html has:
All threads other than the calling thread are destroyed during an
execve(). Mutexes, condition variables, and other pthreads
objects are not preserved.
So calling one of the exec() family is not a way to determine persistence.
http://man7.org/linux/man-pages/man3/exit.3.html has nothing about mutexes one
way or the other.
Can someone point me to definitive code or documentation one way or the other?
I need to confront our vendor with something solid.
Pthreads mutexes in Linux are not kernel objects. pthread_mutex_destroy does not make any system calls because there's no kernel resource to free. strace it and see for yourself.
The linked document by Ingo Molnar talks about mutexes that are internal to the Linux kernel, not about pthreads. They are totally different beasts.
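A trivial test program for the "strace it and see for yourself" suggestion: run it under strace and note that none of the mutex calls show up as system calls. On Linux the kernel only gets involved, via futex(2), when a lock is actually contended.

    #include <pthread.h>

    int main(void)
    {
        pthread_mutex_t m;
        pthread_mutex_init(&m, NULL);
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
        pthread_mutex_destroy(&m);
        return 0;
    }

Compile with -pthread and run under strace.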

Why is a pthread mutex considered "slower" than a futex?

Why are POSIX mutexes considered heavier or slower than futexes? Where is the overhead coming from in the pthread mutex type? I've heard that pthread mutexes are based on futexes, and when uncontested, do not make any calls into the kernel. It seems then that a pthread mutex is merely a "wrapper" around a futex.
Is the overhead simply in the function-wrapper call and the need for the mutex function to "setup" the futex (i.e., basically the setup of the stack for the pthread mutex function call)? Or are there some extra memory barrier steps taking place with the pthread mutex?
Futexes were created to improve the performance of pthread mutexes. NPTL uses futexes; LinuxThreads predated futexes, which I think is where the "slower" consideration comes from. NPTL mutexes may have some additional overhead, but it shouldn't be much.
Edit:
The actual overhead basically consists of:
selecting the correct algorithm for the mutex type (normal, recursive, adaptive, error-checking; normal, robust, priority-inheritance, priority-protected), where the code heavily hints to the compiler that we are likely using a normal mutex (so it should convey that to the CPU's branch prediction logic),
and a write of the current owner of the mutex if we manage to take it which should normally be fast, since it resides in the same cache-line as the actual lock which we have just taken, unless the lock is heavily contended and some other CPU accessed the lock between the time we took it and when we attempted to write the owner (this write is unneeded for normal mutexes, but needed for error-checking and recursive mutexes).
So, a few cycles (typical case) to a few cycles + a branch misprediction + an additional cache miss (very worst case).
The short answer to your question is that futexes are known to be implemented about as efficiently as possible, while a pthread mutex may or may not be. At minimum, a pthread mutex has overhead associated with determining the type of mutex and futexes do not. So a futex will almost always be at least as efficient as a pthread mutex, until and unless someone thinks up some structure lighter than a futex and then releases a pthreads implementation that uses that for its default mutex.
Technically speaking pthread mutexes are not slower or faster than futexes. pthread is just a standard API, so whether they are slow or fast depends on the implementation of that API.
Specifically in Linux pthread mutexes are implemented as futexes and are therefore fast. Actually, you don't want to use the futex API itself as it is very hard to use, does not have the appropriate wrapper functions in glibc and requires coding in assembly which would be non portable. Fortunately for us the glibc maintainers already coded all of this for us under the hood of the pthread mutex API.
Now, because most operating systems did not implement futexes, what programmers usually mean by "pthread mutex" is the performance you get from the usual implementation of pthread mutexes, which is slower.
So it's a statistical fact that in most operating systems that are POSIX compliant the pthread mutex is implemented in kernel space and is slower than a futex. In Linux they have the same performance. It could be that there are other operating systems where pthread mutexes are implemented in user space (in the uncontended case) and therefore have better performance but I am only aware of Linux at this point.
Because they stay in userspace as much as possible, which means they require fewer system calls, which is inherently faster because the context switch between user and kernel mode is expensive.
I assume you're talking about kernel threads when you talk about POSIX threads. It's entirely possible to have an entirely userspace implementation of POSIX threads which require no system calls but have other issues of their own.
My understanding is that a futex is halfway between a kernel POSIX thread and a userspace POSIX thread.
On AMD64 a futex is 4 bytes, while a NPTL pthread_mutex_t is 56 bytes! Yes, there is a significant overhead.
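Those sizes are easy to check on your own machine; the exact sizeof(pthread_mutex_t) varies with the architecture and libc version, so treat the figures above as one data point rather than a constant. A bare futex word is just a 32-bit integer.

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        printf("futex word:      %zu bytes\n", sizeof(uint32_t));
        printf("pthread_mutex_t: %zu bytes\n", sizeof(pthread_mutex_t));
        return 0;
    }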

Resources