Does the pthread API provide synchronization in a multiprocessor environment?

Does the pthread API provide synchronization in a multiprocessor environment? - c

I've just started to study the pthread API. I've been using different books and websites, and judging from what they all report, pthread synchronization functions (e.g. those involving mutexes) all work both for a uniprocessor and multiprocessor environments. But none of these sources explicitly stated it, so I wanted to know if that's actually the case (of course I believe so, I just wanted to be 100% sure).
So, if two threads running on different CPUs called a lock (e.g. pthread_mutex_lock()) on the same mutex at the same time, would the execution of this routine be executed sequentially rather than in parallel? And after the first lock is over and the thread invoking it has private access to the critical section, does the lock executed by the other thread on another CPU cause the latter thread to suspend?

Yes, it does. The POSIX API is described in terms of requirements on implementations - for example, a pthread_mutex_lock() that returns zero or EOWNERDEAD must return with the mutex locked and owned by the calling thread. There's no exception for multiprocessor environments, so conforming implementations in multiprocessor environments must continue to make it work.
So, if two threads running on different CPUs called a lock (e.g.
pthread_mutex_lock()) on the same mutex at the same time, would the
execution of this routine be executed sequentially rather than in
parallel?
It's not specified how pthread_mutex_lock() works underneath, but from an application point of view you know that if it doesn't return an error, your thread has acquired the lock.
And after the first lock is over and the thread invoking it has
private access to the critical section, does the lock executed by the
other thread on another CPU cause the latter thread to suspend?
Yes - the specification for pthread_mutex_lock() says:
If the mutex is already locked by another thread, the calling thread
shall block until the mutex becomes available.

Related

What does it mean to POSIX that a thread is "suspended"?

In the course of commentary on a recent question, a subsidiary question arose about at what point a cancellation request for a pthreads thread with cancelability PTHREAD_CANCEL_DEFERRED can be expected to be acted upon. References to the standard and a bit of lawyering ensued. I'm not much concerned specifically about whether I was mistaken in my comments on that question, but I would like to be sure I understand POSIX's provisions correctly.
The most pertinent section of the standard says
Whenever a thread has cancelability enabled and a cancellation request has been made with that thread as the target, and the thread then calls any function that is a cancellation point [...], the cancellation request shall be acted upon before the function returns. If a thread has cancelability enabled and a cancellation request is made with the thread as a target while the thread is suspended at a cancellation point, the thread shall be awakened and the cancellation request shall be acted upon.
What, though, does it mean for a thread to be "suspended"? POSIX explicitly defines the term for processes, but not, as far as I can determine, for threads. On the other hand, POSIX documents thread suspension to be among the behaviors of a handful of functions, including, but not limited to, some of those related to synchronization objects. Should one then conclude that those serve collectively as the relevant definition of the term?
And as this all pertains to the question that spawned this line of inquiry, given that POSIX does not specify thread suspension as part of the behavior of read(), fread(), or any of the general file or stream I/O functions, if a thread is not making progress on account of being blocked on I/O, does that necessarily mean it is "suspended" for the purposes of cancellation?

A suspended thread is one that, as you say, is blocked on a socket read, waiting for a semaphore to become available, etc.
Given that POSIX implementations vary at the tricky edges, and that there is the potential for a thread to be blocked in a function that is not a cancellation point, it might be that relying on cancellation in code that is to be ported might be more trouble than it's worth.
I've never used it, I've always chosen to have code to explicitly instruct a thread to terminate (normally a message down a pipe or queue). This is very easy with a Communicating Sequential Processes or Actor Model system.
That way clean up can be done under one's own control, freeing memory, etc. as necessary. I've no idea whether a cancelled thread will clean up its memory (I suspect not), or whether there is the option for an at_exit() type thing (there may be). On the whole I think that application behaviour is more thoroughly controlled if there is only one single way a thread can exit.
==EDIT==
#JohnBollinger,
The language used If a thread has cancelability enabled and a cancellation request is made with the thread as a target while the thread is suspended at a cancellation point could be interpretted as IF a thread has cancelability enabled AND IF cancelled and IF implementation suspends blocked threads AND IF the thread is blocked THEN the thread shall be awakened.... In other words, they're leaving it up to the implementer of the POSIX subsystem.
Cygwin's implementation of select() does not (or at least did not) result in the thread being suspended. Instead it spawns a polling thread per file descriptor to test for signalable activity, due to the fundamental lack of anything quite like select() in Windows (it gets close, but no cigar. Win32 select() works on only sockets). Implementations of select() back in the 1980s often worked this way too.
It might be for reasons like this that POSIX is reluctant to clearly define when a thread is suspended. Historically many implementations of select() were like this, making it a minefield for a standards committee to say when a thread might or might not be suspended. Of course the complexities caused by select() would also apply to a process but as POSIX does define a suspended process it does seem odd that they couldn't / didn't extend the definition to threads.
It might be down to how threads are implemented; you can conceivably have a POSIX implementation that doesn't use OS threads (a bit like the early implementations of ADA back in the days when OSes didn't do threads at all), and in such an implementation a blocked thread might not be suspended (in the sense of taking no CPU cycles) at all.

Definition of suspend in the context of threads:
3.107 Condition Variable
A synchronization object which allows a thread to suspend execution, repeatedly, until some associated predicate becomes true. A thread whose execution is suspended on a condition variable is said to be blocked on the condition variable.
From: http://pubs.opengroup.org/onlinepubs/9699919799/
This is not a direct answer, just a definition – too large for a comment. Blocked == suspended.

read, fread, and friends are system calls and as such they will execute a context switch and execute from the kernel context until those functions complete. Interrupting a kernel context is outside the scope of pthreads thus they will not cause a cancellation.
I don't have a reference for it, but as far as I know, thread suspension in the context of Posix threads has to do with it's synchronization object's ( like futex's ).

Does pthread_mutex_lock contains memory fence instruction? [duplicate]

This question already has answers here:
Does guarding a variable with a pthread mutex guarantee it's also not cached?
(3 answers)
Closed 3 years ago.
Do pthread_mutex_lock and pthread_mutex_unlock functions call memory fence/barrier instructions? Or do the the lower level instructions like compare_and_swap implicity have memory barriers?

Do pthread_mutex_lock and pthread_mutex_unlock functions call memory fence/barrier instructions?
They do, as well as thread creation.
Note, however, there are two types of memory barriers: compiler and hardware.
Compiler barriers only prevent the compiler from reordering reads and writes and speculating variable values, but don't prevent the CPU from reordering.
The hardware barriers prevent the CPU from reordering reads and writes. Full memory fence is usually the slowest instruction, most of the time you only need operations with acquire and release semantics (to implement spinlocks and mutexes).
With multi-threading you need both barriers most of the time.
Any function whose definition is not available in this translation unit (and is not intrinsic) is a compiler memory barrier. pthread_mutex_lock, pthread_mutex_unlock, pthread_create also issue a hardware memory barrier to prevent the CPU from reordering reads and writes.
From Programming with POSIX Threads by David R. Butenhof:
Pthreads provides a few basic rules about memory visibility. You can count on all implementations of the standard to follow these rules:
Whatever memory values a thread can see when it calls pthread_create can also be seen by the new thread when it starts. Any data written to memory after the call to pthread_create may not necessarily be seen by the new thread, even if the write occurs before the thread starts.
Whatever memory values a thread can see when it unlocks a mutex, either directly or by waiting on a condition variable, can also be seen by any thread that later locks the same mutex. Again, data written after the mutex is unlocked may not necessarily be seen by the thread that locks the mutex, even if the write occurs before the lock.
Whatever memory values a thread can see when it terminates, either by cancellation, returning from its start function, or by calling pthread_exit, can also be seen by the thread that joins with the terminated thread bycalling pthread_join. And, of course, data written after the thread terminates may not necessarily be seen by the thread that joins, even if the write occurs before the join.
Whatever memory values a thread can see when it signals or broadcasts a condition variable can also be seen by any thread that is awakened by that signal or broadcast. And, one more time, data written after the signal or broadcast may not necessarily be seen by the thread that wakes up, even if the write occurs before it awakens.
Also see C++ and Beyond 2012: Herb Sutter - atomic<> Weapons for more details.

Please take a look at section 4.12 of the POSIX specification.
Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads. [emphasis mine]
Then a list of functions is given which synchronize memory, plus a few additional notes.
If that requires memory barrier instructions on some architecture, then those must be used.
About compare_and_swap: that isn't in POSIX; check the documentation for whatever you are using. For instance, IBM defines a compare_and_swap function for AIX 5.3. which doesn't have full memory barrier semantics The documentation note says:
If compare_and_swap is used as a locking primitive, insert an isync at the start of any critical sections.
From this documentation we can guess that IBM's compare_and_swap has release semantics: since the documentation does not require a barrier for the end of the critical section. The acquiring processor needs to issue an isync to make sure it is not reading stale data, but the publishing processor doesn't have to do anything.
At the instruction level, some processors have compare and swap with certain synchronizing guarantees, and some don't.

pthread_create(3) and memory synchronization guarantee in SMP architectures

I am looking at the section 4.11 of The Open Group Base Specifications Issue 7 (IEEE Std 1003.1, 2013 Edition), section 4.11 document, which spells out the memory synchronization rules. This is the most specific by the POSIX standard I have managed to come by for detailing the POSIX/C memory model.
Here's a quote
4.11 Memory Synchronization
Applications shall ensure that access to any memory location by more
than one thread of control (threads or processes) is restricted such
that no thread of control can read or modify a memory location while
another thread of control may be modifying it. Such access is
restricted using functions that synchronize thread execution and also
synchronize memory with respect to other threads. The following
functions synchronize memory with respect to other threads:
fork() pthread_barrier_wait() pthread_cond_broadcast()
pthread_cond_signal() pthread_cond_timedwait() pthread_cond_wait()
pthread_create() pthread_join() pthread_mutex_lock()
pthread_mutex_timedlock()
pthread_mutex_trylock() pthread_mutex_unlock() pthread_spin_lock()
pthread_spin_trylock() pthread_spin_unlock() pthread_rwlock_rdlock()
pthread_rwlock_timedrdlock() pthread_rwlock_timedwrlock()
pthread_rwlock_tryrdlock() pthread_rwlock_trywrlock()
pthread_rwlock_unlock() pthread_rwlock_wrlock() sem_post()
sem_timedwait() sem_trywait() sem_wait() semctl() semop() wait()
waitpid()
(exceptions to the requirement omitted).
Basically, paraphrasing the above document, the rule is that when applications read or modify a memory location while another thread or process may modify it, they should make sure to synchronize the thread execution and memory with respect to other threads by calling one of the listed functions. Among them, pthread_create(3) is mentioned to provide that memory synchronization.
I understand that this basically means there needs to be some sort of memory barrier implied by each of the functions (although the standard seems not to use that concept). So for example returning from pthread_create(), we are guaranteed that the memory modifications made by that thread before the call appear to other threads (running possibly different CPU/core) after they also synchronize memory. But what about the newly created thread - is there implied memory barrier before the thread starts running the thread function so that it unfailingly sees the memory modifications synchronized by pthread_create()? Is this specified by the standard? Or should we provide memory synchronization explicitly to be able to trust correctness of any data we read according to POSIX standard?
Special case (which would as a special case answer the above question): does a context switch provide memory synchronization, that is, when the execution of a process or thread is started or resumed, is the memory synchronized with respect to any memory synchronization by other threads of execution?
Example:
Thread #1 creates a constant object allocated from heap. Thread #1 creates a new thread #2 that reads the data from the object. If we can assume the new thread #2 starts with memory synchronized then everything is fine. However, if the CPU core running the new thread has copy of previously allocated but since discarded data in its cache memory instead of the new value, then it might have wrong view of the state and the application may function incorrectly.
More concretely...
Previously in the program (this is the value in CPU #1 cache memory)
int i = 0;
Thread T0 running in CPU #0:
pthread_mutex_lock(...);
int tmp = i;
pthread_mutex_unlock(...);
Thread T1 running in CPU #1:
i = 42;
pthread_create(...);
Newly created thread T2 running in CPU #0:
printf("i=%d\n", i); /* First step in the thread function */
Without memory barrier, without synchronizing thread T2 memory it could happen that the output would be
i=0
(previously cached, unsynchronized value).
Update:
Lot of applications using POSIX thread library would not be thread safe if this implementation craziness was allowed.

is there implied memory barrier before the thread starts running the thread function so that it
unfailingly sees the memory modifications synchronized by pthread_create()?
Yes. Otherwise there would be no point to pthread_create acting as memory synchronization (barrier).
(This is afaik. not explicitly stated by posix, (nor does posix define a standard memory model),
so you'll have to decide whether you trust your implementation to do the only sane thing it possibly could - ensure synchronization before the new thread is run- I would not worry particularly about it).
Special case (which would as a special case answer the above question): does a context switch provide memory synchronization, that is, when the execution of a process or thread is started or resumed, is the memory synchronized with respect to any memory synchronization by other threads of execution?
No, a context switch does not act as a barrier.
Thread #1 creates a constant object allocated from heap. Thread #1 creates a new thread #2 that reads the data from the object. If we can assume the new thread #2 starts with memory synchronized then everything is fine. However, if the CPU core running the new thread has copy of previously allocated but since discarded data in its cache memory instead of the new value, then it might have wrong view of the state and the application may function incorrectly.
Since pthread_create must perform memory synchronization, this cannot happen. Any old memory that reside in a cpu cache on another core must be invalidated. (Luckily, the commonly used platforms are cache coherent, so the hardware takes care of that).
Now, if you change your object after you've created your 2. thread, you need memory synchronization again so all parties can see the changes, and otherwise avoid race conditions. pthread mutexes are commonly used to achieve that.

cache coherent architectures guarantee from the architectural design point of view that even separated CPUs (ccNUMA - cache coherent Not Uniform Memory Architecture), with independent memory channels when accessing a memory location will not incur in the incoherency you are describing in the example.
This happens with an important penalty, but the application will function correctly.
Thread #1 runs on CPU0, and hold the object memory in cache L1. When thread #2 on CPU1 read the same memory address (or more exactly: the same cache line - look for false sharing for more info), it forces a cache miss on CPU0 before loading that cache line.

You've turned the guarantee pthread_create provides into an incoherent one. The only thing the pthread_create function could possibly do is establish a "happens before" relationship between the thread that calls it and the newly-created thread.
There is no way it could establish such a relationship with existing threads. Consider two threads, one calls pthread_create, the other accesses a shared variable. What guarantee could you possibly have? "If the thread called pthread_create first, then the other thread is guaranteed to see the latest value of the variable". But that "If" renders the guarantee meaningless and useless.
Creating thread:
i = 1;
pthread_create (...)
Created thread:
if (i == 1)
...
Now, this is a coherent guarantee -- the created thread must see i as 1 since that "happened before" the thread was created. Our code made it possible for the standard to enforce a logical "happens before" relationship, and the standard did so to assure us that our code works as we expect.
Now, let's try to do that with an unrelated thread:
Creating thread:
i = 1;
pthread_create (...)
Unrelated thread:
if ( i == 1)
...
What guarantee could we possible have, even if the standard wanted to provide one? With no synchronization between the threads, we haven't tried to make a logical happens before relationship. So the standard can't honor it -- there's nothing to honor. There no particular behavior that is "right", so no way the standard can promise us the right behavior.
The same applies to the other functions. For example, the guarantee for pthread_mutex_lock means that a thread that acquires a mutex sees all changes made by, or seen by, any threads that have unlocked the mutex. We logically expect our thread to get the mutex "after" any threads that got the mutex "before", and the standard promises to honor that expectation so our code works.

What is the `pthread_mutex_lock()` wake order with multiple threads waiting?

Suppose I have multiple threads blocking on a call to pthread_mutex_lock(). When the mutex becomes available, does the first thread that called pthread_mutex_lock() get the lock? That is, are calls to pthread_mutex_lock() in FIFO order? If not, what, if any, order are they in? Thanks!

When the mutex becomes available, does the first thread that called pthread_mutex_lock() get the lock?
No. One of the waiting threads gets a lock, but which one gets it is not determined.
FIFO order?
FIFO mutex is rather a pattern already. See Implementing a FIFO mutex in pthreads

"If there are threads blocked on the mutex object referenced by mutex when pthread_mutex_unlock() is called, resulting in the mutex becoming available, the scheduling policy shall determine which thread shall acquire the mutex."
Aside from that, the answer to your question isn't specified by the POSIX standard. It may be random, or it may be in FIFO or LIFO or any other order, according to the choices made by the implementation.

FIFO ordering is about the least efficient mutex wake order possible. Only a truly awful implementation would use it. The thread that ran the most recently may be able to run again without a context switch and the more recently a thread ran, more of its data and code will be hot in the cache. Reasonable implementations try to give the mutex to the thread that held it the most recently most of the time.
Consider two threads that do this:
Acquire a mutex.
Adjust some data.
Release the mutex.
Go to step 1.
Now imagine two threads running this code on a single core CPU. It should be clear that FIFO mutex behavior would result in one "adjust some data" per context switch -- the worst possible outcome.
Of course, reasonable implementations generally do give some nod to fairness. We don't want one thread to make no forward progress. But that hardly justifies a FIFO implementation!

How costly is mutex locks when thread synchronization is not critical?

I have a an object (kind of a queue) which is accessed across the threads. The queue object can be mutex locked before used by either thread.
A simpler way to manage this is by bringing the lock inside the queue object itself - hence every API will lock the queue and release when the work is done. This way, threads don't have to manage additional mutex variables along with each queue.
Now my question is, sometimes there is only one thread which is accessing queue (say it is a local variable). But since now inherently the queue would first lock its internal data structure and unlock before leaving, will this be a costly affair?
How costly is the redundant mutex_lock and mutex_unlock operation - when there is no specific need of thread synchronization?
PS:
My question is slightly related to this one: How efficient is locking an unlocked mutex? What is the cost of a mutex?
But i am looking for a specific answer in my design and understanding of why.
I AM USING C, and pthread libraries.

One way to handle this is to have your queue initialization take a parameter that indicates whether a lock should be acquired or not during queue operations. If a queue is being used by a single thread, it gets initialized such that it won't acquire/release locks (or uses a lock object where the acquire/release operations are nops).
See this answer for an example of how boost::pool does something along these lines (although in C++ and as a compile time configuration): https://stackoverflow.com/a/10188784/12711
A similar concept can be applied to C code at runtime, too.

First of all: Neither the C library nor pthreads implements mutex locking - they call into the kernel to use OS primitives for that. This implies, that the performance of muteces will vary wildly with the base OS.
If you can reduce your portability spectrum to hardware supporting atomic compare-exchange or atomic increase-and-read (such as any x86 from this millennium) you can use atomic increase-and-read to create a threadsafe queue that does not need locking.
For the .Net platform I have such a beast at http://sourceforge.net/projects/dotnetlockless - it should be quite easy to port it to C.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight