In a comment on the question Automatically release mutex on crashes in Unix back in 2010, jilles claimed:
glibc's robust mutexes are so fast because glibc takes dangerous shortcuts. There is no guarantee that the mutex still exists when the kernel marks it as "will cause EOWNERDEAD". If the mutex was destroyed and the memory replaced by a memory mapped file that happens to contain the last owning thread's ID at the right place and the last owning thread terminates just after writing the lock word (but before fully removing the mutex from its list of owned mutexes), the file is corrupted. Solaris and will-be-FreeBSD9 robust mutexes are slower because they do not want to take this risk.
I can't make any sense of the claim, since destroying a mutex is not legal unless it's unlocked (and thus not in any thread's robust list). I also can't find any references searching for such a bug/issue. Was the claim simply erroneous?
The reason I ask, and why I'm interested, is that this is relevant to the correctness of my own implementation, which is built on the same Linux robust-mutex primitive.
I think I found the race, and it is indeed very ugly. It goes like this:
Thread A holds the robust mutex and is unlocking it. The basic procedure is:
1. Put it in the "pending" slot of the thread's robust list header.
2. Remove it from the linked list of robust mutexes held by the current thread.
3. Unlock the mutex.
4. Clear the "pending" slot of the thread's robust list header.
The problem is that between steps 3 and 4, another thread in the same process could acquire the mutex, then unlock it, and, (rightly) believing itself to be the final user of the mutex, destroy and free/munmap it. After that, if any thread in the process creates a shared mapping of a file, device, or shared memory that happens to be assigned the same address, and the value at that location happens to match the PID of the thread that is still between steps 3 and 4 of unlocking, then, if the process is killed, the kernel will corrupt the mapped file by setting the high bit of a 32-bit integer it thinks is the mutex owner ID.
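For reference, here is the per-thread structure the kernel walks at thread exit, from the Linux UAPI header linux/futex.h (the __user annotations are dropped); the "pending" slot in the steps above is the list_op_pending field:

struct robust_list {
    struct robust_list *next;
};

struct robust_list_head {
    struct robust_list list;              /* linked list of held robust mutexes */
    long futex_offset;                    /* offset from a list node to its lock word */
    struct robust_list *list_op_pending;  /* mutex being locked/unlocked right now */
};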
The solution is to hold a global lock on mmap/munmap between steps 2 and 4 above, exactly the same as in my solution to the barrier issue described in my answer to this question:
Can a correct fail-safe process-shared barrier be implemented on Linux?
The description of the race by FreeBSD pthread developer David Xu: http://lists.freebsd.org/pipermail/svn-src-user/2010-November/003668.html
I don't think the munmap/mmap cycle is strictly required for the race. The piece of shared memory might be put to a different use as well. This is uncommon but valid.
As also mentioned in that message, more "fun" occurs if threads with different privileges access a common robust mutex. Because the node for the list of owned robust mutexes lives in the mutex itself, a thread with low privilege can corrupt a high-privilege thread's list. This could easily be exploited to make the high-privilege thread crash, and in rare cases it might allow the high-privilege thread's memory to be corrupted. Apparently Linux's robust mutexes are only designed for use by threads with the same privileges. This could easily have been avoided by making the robust list an array held entirely in the thread's own memory, instead of a linked list.
I am looking at Section 4.11 of The Open Group Base Specifications Issue 7 (IEEE Std 1003.1, 2013 Edition), which spells out the memory synchronization rules. It is the most specific treatment of the POSIX/C memory model I have managed to find in the POSIX standard.
Here's a quote:
4.11 Memory Synchronization
Applications shall ensure that access to any memory location by more
than one thread of control (threads or processes) is restricted such
that no thread of control can read or modify a memory location while
another thread of control may be modifying it. Such access is
restricted using functions that synchronize thread execution and also
synchronize memory with respect to other threads. The following
functions synchronize memory with respect to other threads:
fork() pthread_barrier_wait() pthread_cond_broadcast()
pthread_cond_signal() pthread_cond_timedwait() pthread_cond_wait()
pthread_create() pthread_join() pthread_mutex_lock()
pthread_mutex_timedlock()
pthread_mutex_trylock() pthread_mutex_unlock() pthread_spin_lock()
pthread_spin_trylock() pthread_spin_unlock() pthread_rwlock_rdlock()
pthread_rwlock_timedrdlock() pthread_rwlock_timedwrlock()
pthread_rwlock_tryrdlock() pthread_rwlock_trywrlock()
pthread_rwlock_unlock() pthread_rwlock_wrlock() sem_post()
sem_timedwait() sem_trywait() sem_wait() semctl() semop() wait()
waitpid()
(exceptions to the requirement omitted).
Basically, paraphrasing the above document, the rule is that when applications read or modify a memory location while another thread or process may be modifying it, they should make sure to synchronize thread execution and memory with respect to the other threads by calling one of the listed functions. Among them, pthread_create(3) is mentioned as providing that memory synchronization.
I understand that this basically means each of these functions implies some sort of memory barrier (although the standard does not seem to use that concept). So, for example, on returning from pthread_create(), we are guaranteed that the memory modifications made by the calling thread before the call appear to other threads (possibly running on a different CPU/core) after they also synchronize memory. But what about the newly created thread: is there an implied memory barrier before the thread starts running the thread function so that it unfailingly sees the memory modifications synchronized by pthread_create()? Is this specified by the standard? Or should we provide memory synchronization explicitly to be able to trust, per the POSIX standard, the correctness of any data we read?
Special case (which would as a special case answer the above question): does a context switch provide memory synchronization, that is, when the execution of a process or thread is started or resumed, is the memory synchronized with respect to any memory synchronization by other threads of execution?
Example:
Thread #1 creates a constant object allocated from the heap. Thread #1 then creates a new thread #2 that reads data from the object. If we can assume the new thread #2 starts with memory synchronized, then everything is fine. However, if the CPU core running the new thread has a stale copy of previously allocated (but since discarded) data in its cache instead of the new value, it might have a wrong view of the state and the application may function incorrectly.
More concretely...
Previously in the program (this is the value cached on CPU #0):
int i = 0;
Thread T0 running in CPU #0:
pthread_mutex_lock(...);
int tmp = i;
pthread_mutex_unlock(...);
Thread T1 running in CPU #1:
i = 42;
pthread_create(...);
Newly created thread T2 running in CPU #0:
printf("i=%d\n", i); /* First step in the thread function */
Without a memory barrier, that is, without synchronizing memory for thread T2, it could happen that the output would be
i=0
(previously cached, unsynchronized value).
Update:
A lot of applications using the POSIX thread library would not be thread-safe if this implementation craziness were allowed.
is there an implied memory barrier before the thread starts running the thread function so that it unfailingly sees the memory modifications synchronized by pthread_create()?
Yes. Otherwise there would be no point in pthread_create acting as memory synchronization (a barrier).
(As far as I know, this is not explicitly stated by POSIX, nor does POSIX define a standard memory model, so you'll have to decide whether you trust your implementation to do the only sane thing it possibly could: ensure synchronization before the new thread is run. I would not worry particularly about it.)
Special case (which would as a special case answer the above question): does a context switch provide memory synchronization, that is, when the execution of a process or thread is started or resumed, is the memory synchronized with respect to any memory synchronization by other threads of execution?
No, a context switch does not act as a barrier.
Thread #1 creates a constant object allocated from the heap. Thread #1 then creates a new thread #2 that reads data from the object. If we can assume the new thread #2 starts with memory synchronized, then everything is fine. However, if the CPU core running the new thread has a stale copy of previously allocated (but since discarded) data in its cache instead of the new value, it might have a wrong view of the state and the application may function incorrectly.
Since pthread_create must perform memory synchronization, this cannot happen. Any stale data residing in a CPU cache on another core must be invalidated. (Luckily, the commonly used platforms are cache coherent, so the hardware takes care of that.)
Now, if you change your object after you've created your second thread, you need memory synchronization again so all parties can see the changes and avoid race conditions. pthread mutexes are commonly used to achieve that.
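For example, a minimal sketch (the variable names are illustrative):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int value;

static void *reader(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);      /* acquiring the mutex synchronizes memory */
    printf("value=%d\n", value);    /* prints 1 or 2; both are race-free */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    value = 1;                      /* visible to t: precedes pthread_create */
    pthread_create(&t, NULL, reader, NULL);
    pthread_mutex_lock(&lock);      /* changes after creation need the mutex */
    value = 2;
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return 0;
}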
Cache-coherent architectures guarantee, by architectural design, that even separate CPUs (ccNUMA: cache-coherent Non-Uniform Memory Architecture) with independent memory channels will not, when accessing a memory location, run into the incoherency you describe in the example.
This comes at a significant performance penalty, but the application will function correctly.
Thread #1 runs on CPU0 and holds the object's memory in its L1 cache. When thread #2 on CPU1 reads the same memory address (or, more exactly, the same cache line; look up false sharing for more info), it takes a cache miss, and the line is pulled from CPU0's cache before CPU1 loads it.
You've turned the guarantee pthread_create provides into an incoherent one. The only thing the pthread_create function could possibly do is establish a "happens before" relationship between the thread that calls it and the newly-created thread.
There is no way it could establish such a relationship with existing threads. Consider two threads, one calls pthread_create, the other accesses a shared variable. What guarantee could you possibly have? "If the thread called pthread_create first, then the other thread is guaranteed to see the latest value of the variable". But that "If" renders the guarantee meaningless and useless.
Creating thread:
i = 1;
pthread_create (...)
Created thread:
if (i == 1)
...
Now, this is a coherent guarantee -- the created thread must see i as 1 since that "happened before" the thread was created. Our code made it possible for the standard to enforce a logical "happens before" relationship, and the standard did so to assure us that our code works as we expect.
Now, let's try to do that with an unrelated thread:
Creating thread:
i = 1;
pthread_create (...)
Unrelated thread:
if ( i == 1)
...
What guarantee could we possibly have, even if the standard wanted to provide one? With no synchronization between the threads, we haven't tried to establish a logical happens-before relationship. So the standard can't honor it -- there's nothing to honor. There is no particular behavior that is "right", so there is no way the standard can promise us the right behavior.
The same applies to the other functions. For example, the guarantee for pthread_mutex_lock means that a thread that acquires a mutex sees all changes made by, or seen by, any threads that have unlocked the mutex. We logically expect our thread to get the mutex "after" any threads that got the mutex "before", and the standard promises to honor that expectation so our code works.
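To make that concrete, here is a minimal sketch (the variable names are illustrative):

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
int shared = 0;

/* Thread A */
pthread_mutex_lock(&m);
shared = 42;                 /* change made while holding the mutex */
pthread_mutex_unlock(&m);

/* Thread B */
pthread_mutex_lock(&m);
/* If this lock happened after A's unlock, we are guaranteed to see
   shared == 42; if B got the mutex first, it sees 0. Both orders are
   legal -- the guarantee is only about what each order implies. */
pthread_mutex_unlock(&m);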
Suppose I have multiple threads blocking on a call to pthread_mutex_lock(). When the mutex becomes available, does the first thread that called pthread_mutex_lock() get the lock? That is, are calls to pthread_mutex_lock() in FIFO order? If not, what, if any, order are they in? Thanks!
When the mutex becomes available, does the first thread that called pthread_mutex_lock() get the lock?
No. One of the waiting threads gets the lock, but which one gets it is not determined.
FIFO order?
A FIFO mutex is rather a pattern of its own. See Implementing a FIFO mutex in pthreads.
"If there are threads blocked on the mutex object referenced by mutex when pthread_mutex_unlock() is called, resulting in the mutex becoming available, the scheduling policy shall determine which thread shall acquire the mutex."
Aside from that, the answer to your question isn't specified by the POSIX standard. It may be random, or it may be in FIFO or LIFO or any other order, according to the choices made by the implementation.
FIFO ordering is about the least efficient mutex wake order possible. Only a truly awful implementation would use it. The thread that ran most recently may be able to run again without a context switch, and the more recently a thread ran, the more of its data and code will be hot in the cache. Reasonable implementations try to give the mutex to the thread that held it most recently, most of the time.
Consider two threads that do this:
1. Acquire a mutex.
2. Adjust some data.
3. Release the mutex.
4. Go to step 1.
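In code, each thread's loop might look like this (a sketch; the data update is illustrative):

for (;;) {
    pthread_mutex_lock(&m);      /* 1. acquire the mutex */
    data++;                      /* 2. adjust some data */
    pthread_mutex_unlock(&m);    /* 3. release the mutex */
}                                /* 4. go to step 1 */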
Now imagine two threads running this code on a single-core CPU. It should be clear that FIFO mutex behavior would result in one "adjust some data" per context switch -- the worst possible outcome.
Of course, reasonable implementations generally do give some nod to fairness. We don't want one thread to make no forward progress. But that hardly justifies a FIFO implementation!
If I have two threads and one global variable (one thread constantly loops to read the variable; the other constantly loops to write to it), would anything happen that shouldn't (e.g., exceptions, errors)? If so, what is a way to prevent this? I was reading about mutex locks, and that they allow exclusive access to a variable for one thread. Does this mean that only that thread can read and write to it and no other?
Would anything happen that shouldn't?
It depends in part on the type of the variable. If the variable is, say, a string (a long array of characters), then if the writer and the reader access it at the same time, it is completely undefined what the reader will see.
This is why mutexes and other coordinating mechanisms are provided by pthreads.
Does this mean that only that thread can read and write to it and no other?
Mutexes ensure that at most one thread that is using the mutex can have permission to proceed. All other threads using the same mutex will be held up until the first thread releases the mutex. Therefore, if the code is written properly, at any time, only one thread will be able to access the variable. If the code is not written properly, then:
one thread might access the variable without checking that it has permission to do so
one thread might acquire the mutex and never release it
one thread might destroy the mutex without notifying the other
None of these is desirable behaviour, but the mere existence of a mutex does not prevent any of these happening.
Nevertheless, your code could reasonably use a mutex carefully and then the access to the global variable would be properly controlled. While it has permission via the mutex, either thread could modify the variable, or just read the variable. Either will be safe from interference by the other thread.
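A minimal sketch of such careful use (the function and variable names are illustrative):

#include <pthread.h>

pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
int g_value;                        /* the shared global variable */

void writer_step(void) {
    pthread_mutex_lock(&g_lock);    /* wait for exclusive access */
    g_value++;                      /* modify while holding the mutex */
    pthread_mutex_unlock(&g_lock);
}

int reader_step(void) {
    pthread_mutex_lock(&g_lock);
    int v = g_value;                /* read while holding the mutex */
    pthread_mutex_unlock(&g_lock);
    return v;
}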
Does this mean that only that thread can read and write to it and no other?
It means that only one thread can read or write to the global variable at a time.
The two threads will not race to access the global variable, nor will they access it at the same time at any given point.
In short, access to the global variable is synchronized.
First: in C/C++, an unsynchronized read/write of a variable does not generate any exception or system error, BUT it can generate application-level errors -- mostly because you are unlikely to fully understand how the memory is accessed, or whether the access is atomic, unless you look at the generated assembler. A multi-core CPU is likely to create hard-to-debug race conditions when you access shared memory without synchronization.
Hence
Second: you should always use synchronization -- such as mutex locks -- when dealing with shared memory. A mutex lock is cheap, so it will not really impact performance if done right. Rule of thumb: hold the lock for as short a time as possible, such as just for the duration of reading/incrementing/writing the shared memory.
However, from your description, it sounds like one of your threads does nothing BUT wait for the shared memory to change state before doing something -- that is a bad multi-threaded design that costs unnecessary CPU burn, so
Third: look at using semaphores (sem_init/sem_wait/sem_post) for synchronization between your threads if you are trying to send a "message" from one thread to the other.
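For example, a minimal sketch of that message pattern with a POSIX unnamed semaphore (the flag and variable names are illustrative):

#include <semaphore.h>

sem_t ready;                 /* initialized once with sem_init(&ready, 0, 0) */
int message;                 /* written by the producer before posting */

void *producer(void *arg) {
    message = 42;            /* prepare the "message" */
    sem_post(&ready);        /* wake the consumer; also synchronizes memory */
    return NULL;
}

void *consumer(void *arg) {
    sem_wait(&ready);        /* sleeps instead of spinning on shared memory */
    /* message is now safely readable */
    return NULL;
}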
As others have already said, when communicating between threads through "normal" objects you have to take care of race conditions. Besides mutexes and other lock structures, which are relatively heavyweight, the new C standard (C11) provides atomic types and operations that are guaranteed to be race-free. Most modern processors provide instructions for such types, and many modern compilers (in particular gcc on Linux) already provide proper interfaces for such operations.
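A minimal sketch with C11 atomics (requires a C11 compiler; the names are illustrative):

#include <stdatomic.h>

atomic_int counter = 0;      /* race-free shared variable */

void writer_thread(void) {
    atomic_store(&counter, 42);       /* atomic write, no mutex needed */
}

int reader_thread(void) {
    return atomic_load(&counter);     /* sees either 0 or 42, never a torn value */
}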
If the threads truly are only one producer and only one consumer, then (barring compiler bugs)
1) marking the variable as volatile, and
2) making sure that it is correctly aligned, so as to avoid interleaved fetches and stores
will allow you to do this without locking.
I am developing a user-level thread library as part of a project. I came up with an approach to implement mutexes, and I would like to hear your views before going on with it. Basically, I need to implement just 3 functions in my library:
mutex_init, mutex_lock and mutex_unlock
I thought my mutex_t structure would look something like
typedef struct
{
    int available;  // indicates whether the mutex is locked or unlocked
    queue listofwaitingthreads;
    gtthread_t owningthread;
} mutex_t;
In my mutex_lock function, I will first check in a while loop whether the mutex is available. If it is not, I will yield the processor for the next thread to execute.
In my mutex_unlock function, I will check whether the owner thread is the current thread. If it is, I will set available to 0.
Is this the way to go about it? Also, what about deadlock? Should I take care of those conditions in my user-level library, or should I leave it to the application programmers to write code properly?
This won't work, because you have a race condition. If two threads try to grab the lock at the same time, both will see available == 0, and both will think they succeeded in taking the mutex.
If you want to do this properly, without using an already-existing lock, you must use hardware atomic operations like TAS (test-and-set), CAS (compare-and-swap), etc.
There are algorithms that give you mutual exclusion without such hardware support, but they make assumptions that are often false. For more details, I highly recommend reading chapter 7 of Herlihy and Shavit's The Art of Multiprocessor Programming.
You shouldn't worry about deadlocks at this level -- mutex locks should be simple enough, and there is an assumption that the programmer using them takes care not to cause deadlocks (advanced mutexes can check for self-deadlock, meaning a thread that calls lock twice without calling unlock in between).
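For illustration, here is a minimal test-and-set spinlock built on C11 atomics; it is a sketch of the hardware-supported approach, not a full mutex (no waiter queue, no fairness):

#include <stdatomic.h>
#include <sched.h>

typedef struct { atomic_flag locked; } spinlock;
/* initialize with: spinlock s = { ATOMIC_FLAG_INIT }; */

void spin_lock(spinlock *l) {
    /* atomic test-and-set: returns the previous value of the flag */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        sched_yield();               /* lock is held; let another thread run */
}

void spin_unlock(spinlock *l) {
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}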
Not only do you have to use atomic operations to read and modify the flag (as Eran pointed out), you also have to make sure that your queue can handle concurrent accesses. This is not completely trivial; it is a sort of chicken-and-egg problem.
But if you really implemented this by spinning, you wouldn't even need such a queue. The access order to the lock would then be mainly random, though.
Just yielding would probably not be enough either; it can be quite costly if threads hold the lock for more than a few processor cycles. Consider using nanosleep with a small time value for the wait.
In general, a mutex implementation should look like:
Lock:

while (trylock() == failed) {
    atomic_inc(waiter_cnt);
    atomic_sleep_if_locked();
    atomic_dec(waiter_cnt);
}

Trylock:

return atomic_swap(&lock, 1);

Unlock:

atomic_store(&lock, 0);
if (waiter_cnt) wakeup_sleepers();
Things get more complex if you want recursive mutexes, mutexes that can synchronize their own destruction (i.e. freeing the mutex is safe as soon as you get the lock), etc.
Note that atomic_sleep_if_locked and wakeup_sleepers correspond to FUTEX_WAIT and FUTEX_WAKE ops on Linux. The other atomics are probably CPU instructions, but could be system calls or kernel-assisted userspace function code, as in the case of Linux/ARM and the 0xffff0fc0 atomic compare-and-swap call.
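To make the pseudocode concrete on Linux, here is a minimal non-recursive, non-robust mutex in the style of Ulrich Drepper's paper "Futexes Are Tricky", using C11 atomics (a sketch: error handling omitted, and the waiter count from the pseudocode is folded into a "contended" state 2):

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* States: 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters. */
typedef struct { atomic_int state; } futex_mutex;

static long sys_futex(void *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void futex_mutex_lock(futex_mutex *m) {
    int c = 0;
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;                              /* fast path: 0 -> 1, no syscall */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);   /* announce that we will sleep */
    while (c != 0) {
        sys_futex(&m->state, FUTEX_WAIT, 2); /* sleep while state == 2 */
        c = atomic_exchange(&m->state, 2);
    }
}

void futex_mutex_unlock(futex_mutex *m) {
    if (atomic_exchange(&m->state, 0) == 2)  /* waiters may be sleeping */
        sys_futex(&m->state, FUTEX_WAKE, 1); /* wake one of them */
}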
You do not need atomic instructions for a user-level thread library, because all the threads are going to be user-level threads of the same process. So when your process is given its time slice to execute, you are running multiple threads during that time slice, but on the same processor. Hence, no two threads are going to be in the library function at the same time. Since the mutex functions are already in the library, mutual exclusion is guaranteed.
In a past question, I asked about implementing pthread barriers without destruction races:
How can barriers be destroyable as soon as pthread_barrier_wait returns?
and Michael Burr provided a perfect solution for process-local barriers, but one that fails for process-shared barriers. We later worked through some ideas, but never reached a satisfactory conclusion, and didn't even begin to get into resource-failure cases.
Is it possible on Linux to make a barrier that meets these conditions:
Process-shared (can be created in any shared memory).
Safe to unmap or destroy the barrier from any thread immediately after the barrier wait function returns.
Cannot fail due to resource allocation failure.
Michael's attempt at solving the process-shared case (see the linked question) has the unfortunate property that some kind of system resource must be allocated at wait time, meaning the wait can fail. And it's unclear what a caller could reasonably do when a barrier wait fails, since the whole point of the barrier is that it's unsafe to proceed until the remaining N-1 threads have reached it...
A kernel-space solution might be the only way, but even that's difficult due to the possibility of a signal interrupting the wait with no reliable way to resume it...
This is not possible with the Linux futex API, and I think this can be proven as well.
We have here essentially a scenario in which N processes must be reliably awoken by one final process, and further no process may touch any shared memory after the final awakening (as it may be destroyed or reused asynchronously). While we can awaken all processes easily enough, the fundamental race condition is between the wakeup and the wait; if we issue the wakeup before the wait, the straggler never wakes up.
The usual solution to something like this is to have the straggler check a status variable atomically with the wait; this allows it to avoid sleeping at all if the wakeup has already occurred. However, we cannot do this here - as soon as the wakeup becomes possible, it is unsafe to touch shared memory!
One other approach is to actually check if all processes have gone to sleep yet. However, this is not possible with the Linux futex API; the only indication of number of waiters is the return value from FUTEX_WAKE; if it returns less than the number of waiters you expected, you know some weren't asleep yet. However, even if we find out we haven't woken enough waiters, it's too late to do anything - one of the processes that did wake up may have destroyed the barrier already!
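Concretely, the best the final process could do is inspect the return value of FUTEX_WAKE after the fact (a sketch; barrier->seq and expected_waiters are illustrative names):

#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* FUTEX_WAKE returns the number of waiters actually woken. */
long woken = syscall(SYS_futex, &barrier->seq, FUTEX_WAKE, INT_MAX,
                     NULL, NULL, 0);
if (woken < expected_waiters) {
    /* Some waiters had not gone to sleep yet. But by now one of the
       woken processes may already have destroyed the barrier, so it
       is too late to recover. */
}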
So, unfortunately, this kind of immediately-destroyable primitive cannot be constructed with the Linux futex API.
Note that in the specific case of one waiter, one waker, it may be possible to work around the problem; if FUTEX_WAKE returns zero, we know nobody has actually been awoken yet, so you have a chance to recover. Making this into an efficient algorithm, however, is quite tricky.
It's tricky to add a robust extension to the futex model that would fix this. The basic problem is, we need to know when N threads have successfully entered their wait, and atomically awaken them all. However, any of those threads may leave the wait to run a signal handler at any time - indeed, the waker thread may also leave the wait for signal handlers as well.
One possible way that may work, however, is an extension to the keyed event model in the NT API. With keyed events, threads are released from the lock in pairs; if you have a 'release' without a 'wait', the 'release' call blocks for the 'wait'.
This in itself isn't enough due to the issues with signal handlers; however, if we allow for the 'release' call to specify a number of threads to be awoken atomically, this works. You simply have each thread in the barrier decrement a count, then 'wait' on a keyed event on that address. The last thread 'releases' N - 1 threads. The kernel doesn't allow any wake event to be processed until all N-1 threads have entered this keyed event state; if any thread leaves the futex call due to signals (including the releasing thread), this prevents any wakeups at all until all threads are back.
After a long discussion with bdonlan on SO chat, I think I have a solution. Basically, we break the problem down into the two self-synchronized deallocation issues: the destroy operation and unmapping.
Handling destruction is easy: Simply make the pthread_barrier_destroy function wait for all waiters to stop inspecting the barrier. This can be done by having a usage count in the barrier, atomically incremented/decremented on entry/exit to the wait function, and having the destroy function spin waiting for the count to reach zero. (It's also possible to use a futex here, rather than just spinning, if you stick a waiter flag in the high bit of the usage count or similar.)
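A sketch of that usage-count idea (the type and field names are illustrative):

#include <stdatomic.h>
#include <sched.h>

typedef struct {
    atomic_int users;        /* threads currently inspecting the barrier */
    /* ... the rest of the barrier state ... */
} barrier_t;

void barrier_wait(barrier_t *b) {
    atomic_fetch_add(&b->users, 1);      /* entry: now inspecting the barrier */
    /* ... the actual wait logic goes here ... */
    atomic_fetch_sub(&b->users, 1);      /* exit: no longer touching it */
}

void barrier_destroy(barrier_t *b) {
    while (atomic_load(&b->users) != 0)  /* spin until nobody inspects it */
        sched_yield();
    /* now it is safe to tear the barrier down */
}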
Handling unmapping is also easy, but non-local: ensure that munmap or mmap with the MAP_FIXED flag cannot occur while barrier waiters are in the process of exiting, by adding locking to the syscall wrappers. This requires a specialized sort of reader-writer lock. The last waiter to reach the barrier should grab a read lock on the munmap rw-lock, which will be released when the final waiter exits (when decrementing the user count results in a count of 0). munmap and mmap can be made reentrant (as some programs might expect, even though POSIX doesn't require it) by making the writer lock recursive. Actually, a sort of lock where readers and writers are entirely symmetric, and each type of lock excludes the opposite type of lock but not the same type, should work best.
Well, I think I can do it with a clumsy approach...
Have the "barrier" be its own process listening on a socket. Implement barrier_wait as:
open connection to barrier process
send message telling barrier process I am waiting
block in read() waiting for reply
Once N threads are waiting, the barrier process tells all of them to proceed. Each waiter then closes its connection to the barrier process and continues.
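The client side of that wait protocol might look like this (a sketch; the socket path and message bytes are illustrative):

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int barrier_wait(const char *path) {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un sa = { .sun_family = AF_UNIX };
    strncpy(sa.sun_path, path, sizeof sa.sun_path - 1);
    if (fd < 0 || connect(fd, (struct sockaddr *)&sa, sizeof sa) < 0)
        return -1;
    char msg = 'W';            /* "I am waiting" */
    write(fd, &msg, 1);
    char reply;
    read(fd, &reply, 1);       /* blocks until the barrier process replies */
    close(fd);
    return 0;
}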
Implement barrier_destroy as:
open connection to barrier process
send message telling barrier process to go away
close connection
Once all connections are closed and the barrier process has been told to go away, it exits.
[Edit: Granted, this allocates and destroys a socket as part of the wait and release operations. But I think you can implement the same protocol without doing so; see below.]
First question: Does this protocol actually work? I think it does, but maybe I do not understand the requirements.
Second question: If it does work, can it be simulated without the overhead of an extra process?
I believe the answer is "yes". You can have each thread "take the role of" the barrier process at the appropriate time. You just need a master mutex, held by whichever thread is currently "taking the role" of the barrier process. Details, details... OK, so the barrier_wait might look like:
lock(master_mutex);
unsigned my_generation = generation;
if (++waiter_count < N) {
    /* loop to guard against spurious wakeups */
    while (my_generation == generation)
        cond_wait(master_condition_variable, master_mutex);
} else {
    ++generation;   /* release this batch of waiters */
    cond_broadcast(master_condition_variable);
}
--waiter_count;
bool do_release = time_to_die && waiter_count == 0;
unlock(master_mutex);
if (do_release)
    release_resources();
Here master_mutex (a mutex), master_condition_variable (a condition variable), waiter_count and generation (unsigned integers), N (another unsigned integer), and time_to_die (a Boolean) are all shared state allocated and initialized by barrier_init. waiter_count and generation are initialized to zero, time_to_die to false, and N to the number of threads the barrier is waiting for.
Then barrier_destroy would be:
lock(master_mutex);
time_to_die = true;
bool do_release = waiter_count == 0;
unlock(master_mutex);
if (do_release)
    release_resources();
Not sure about all the details concerning signal handling etc... But the basic idea of "last one out turns off the lights" is workable, I think.