I have a data structure which I personally implemented that now needs to be used across multiple threads.
typedef struct
{
void** array_of_elements;
size_t size;
} myStruct;
For simplicity, let's say my data structure has these functions:
// Gets a data element from the structure.
void* get(myStruct *x);
// Prints out all the data elements.
void print(myStruct *x);
// Adds an element into the structure.
void add(myStruct *x, void *to_be_added);
It's not a problem whatsoever to call get while another thread is calling print since they are both accessors. However, get and print cannot work while add is currently being called. Vice versa, add cannot work if get and print are currently in-progress.
So I changed myStruct to look like the following:
typedef struct
{
void** array_of_elements;
size_t size;
// True when a mutator is editing this struct.
bool mutating;
// The number of threads currently accessing this struct.
int accessors;
} myStruct;
Now my functions look like the following:
void* get(myStruct *x)
{
// Wait for mutating to end.
while (x->mutating);
// Indicate that another accessor is now using this struct.
x->accessors++;
// get algorithm goes here
// Declare we are finished reading.
x->accessors--;
return ...
}
// Same as above...
void print(myStruct *x)
...
void add(myStruct *x)
{
// Wait for any accessors or mutators to finish.
while (x->mutating || x->accessors > 0);
x->mutating = true;
// add algorithm here
x->mutating = false;
}
BUT, I think there are a lot of problems with this approach and I can't find a way to solve them:
One of my classmates told me using while loops like this slows the thread down immensely.
It has no sense of a queue. The first method that begins waiting for the myStruct to finish being used isn't necessarily the one that goes next.
Even IF I had a queue data structure for which thread goes next, that data structure would also need to be synchronized, which in itself is an infinite loop of needing a synchronized data structure to synchronize itself.
I think it's possible that in the same nano second one thread changes the accessors counter from 0 to 1 (meaning they want to start reading), it's possible for a mutator thread to see it's value is 0 and start mutating. Then, both a mutator thread and an accessor thread would be going at the same time.
I'm pretty sure this logic can cause grid-lock (threads waiting on each other infinitely).
I don't know how to make certain threads sleep and wake up right when they need to for this task, besides having it stuck in a while loop.
You have the right idea, just the wrong approach. I'm not sure what OS you're programming on, but you want to look at the concepts of mutex or semaphore to do what you want to do.
On Linux/Unix that is POSIX compliant, you can look at pthreads:
http://www.cs.wm.edu/wmpthreads.html
On Windows, you can look at Critical Sections for something close to a mutex concept:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms682530(v=vs.85).aspx
Or WaitForMultipleObjects for something close to a semaphore:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms687025(v=vs.85).aspx
And, yes, using while loops are a bad idea. In this case, you are using what is known as a busy loop. More reading on it here:
What is a busy loop?
Using mutex or semaphore, no while loop is required. Good luck!
Related
Each thread of my program has its own log file. In my SIGHUP handler I want to notify those threads that when a new log message arrives, they need to reopen their log files.
I want a lock-free solution based purely on flags and counters. (I do have a thread-local context structure for another purpose, so I can add new fields there).
If there was just one logging thread, I would do:
static int need_reopen = 0;
void sighancont(int signo)
...
case SIGHUP:
need_reopen = 1;
break;
...
}
void log(char *msg) {
if (need_reopen) {
need_reopen = 0;
reopen_log();
}
...
}
Of course, if there are multiple logging threads, a simple flag won't do. I'm thinking of something like this:
static volatile int reopen_counter = 0;
void sighancont(int signo)
...
case SIGHUP:
__sync_fetch_and_add(&reopen_counter, 1);
break;
...
}
void log(struct ctx_st *ctx, char *msg) {
int c = reopen_counter;
if (ctx->reopen_counter != c) {
ctx->reopen_counter = c;
reopen_log();
}
...
}
This way the logging threads are supposed to catch-up with the global counter. If the program receives SIGHUP multiple times, log files will be reopened only once.
I see the only way to break this - to send SIGHUP ~4 billion times.
Is there a better (but still simple) algorithm, e.g. with reference counting?
Your solution is simple and efficient. This is kind of a seqlock.
A few notes, to clear possible confusion from comments:
There's no "atomic variable" but atomic instruction. std::atomic, and friends, are just syntactic sugar around atomic ops - you're perfectly ok there.
The counter doesn't have to be volatile, but the accesses have to be. When you write atomic_read(x) you actually say *(volatile int*)&x.
The volatile qualifier causes all accesses to the variable to be done from memory, while you don't necessarily need that.
But, here as well, you're perfectly ok, since you read the variable into a local.
You can update the counter non-atomically if this is the one and only writer (don't forget to make it atomic_write if you remove the volatile). This would be a very small performance improvement.
The only cost here is in the log threads that have to pay for main memory access after the counter is updated. You should expect 200 cycles or so (x2 on other NUMA node)
I have a function that looks like this that I need to implement.
Threads call this function with these parameters. It's supposed to return with the correct time it accessed the CPU and if it can't access the CPU, it will wait until it can access it.
With regards to the correct time, I keep a global variable that gets updated each time, it calls it.
How do I implement the waits and synchronize it correctly.
int MLFQ(float currentTime, int tid, int remainingTime, int tprio)
My code looks something like this so far and it doesn't quite work.
update globalTime (globalTime = currentTime)
Mutex_lock
Add to MLFQ if needed (either to 5, 10, 15, 20, or 25 MLFQ)
if (canAccessCPU)
getCPU
unlock mutex
return globalTime
else
mutex_unlock
return MLFQ(globalTime, tid, remainingTime, tprio);
Your post uses pseudo code, and there are some ambiguities, so comment if I make the wrong assumptions here:
How do I implement the waits and synchronize it correctly[?]
Waits,
Waits in threads are often implemented in such a way as to not block other threads. Sleep() at the bottom of a thread worker function allows some time for the called thread to sleep, i.e. share time with other processes. In Windows, it is prototyped as:
VOID WINAPI Sleep(
_In_ DWORD dwMilliseconds
);
Linux sleep() here
Synchronizing:
Can be done in many ways. Assuming you are referring to keeping the order in which calls come in from several threads, you can create a simple struct that can be passed back as an argument that could contain a TRUE/FALSE indication of whether the uP was accessed, and the time the attempt was made:
In someheader.h file:
typedef struct {
int uPAccess;
time_t time;
}UP_CALL;
extern UP_CALL uPCall, *pUPCall;
In all of the the .c file(s) you will use:
#include "someheader.h"
In one of the .c files you must initialize the struct: perhaps in the
main fucntion:
int main(void)
{
pUPCall = &uPCall;
//other code
return 0;
}
You can now include a pointer to struct in the thread worker function, (normally globals are at risk of access contention between threads, but you are protecting using mutex), to get time of access attempt, and success of attempt
I have a math function in C which computes lots of complicated math. The function has a header:
double doTheMath(struct example *e, const unsigned int a,
const unsigned int b, const unsigned int c)
{
/* ... lots of math */
}
I would like to call this function 100 times at the same time. I read about pthreads and think it could be a good solution for what I want to achieve. My idea is as follows:
pthread_t tid[100];
pthread_mutex_t lock;
if (pthread_mutex_init(&lock, NULL) != 0)
{
printf("\n mutex init failed\n");
return 1;
}
for(i=0; i<100; i++) {
pthread_create(&(tid[i]), NULL, &doTheMath, NULL);
pthread_mutex_lock(&lock);
d += doTheMath(args);
pthread_mutex_unlock(&lock);
}
pthread_mutex_destroy(&lock);
My questions:
How to pass to the doTheMath all the arguments it needs?
Is there here really a point in using threads, will this even work as I want it to? I cant really understand it, when I lock my function call with the mutex, it won't let to call my function 100 times at the same time, right? So how can I do this?
EDIT:
So, summing up:
My function encrypts/decrypts some data using math - so I guess the order matters
When I do it like this:
pthread_t tid[100];
for(i=0; i<100; i++)
pthread_create(&(tid[i]), NULL, &doTheMath, NULL);
it will create my 100 threads (my machine is capable of running, for sure) and I dont need a mutex, so it will call my function 100 times at the same time?
How about cPU cores? When I do it like this, will all of my CPU cores be fully loaded?
Having just one, single function call will load only one core of my CPU, having 100 (or more) threads created and running and calling my function will load all my CPU cores - am I right here?
Yes, you can do this with threads. Whether calling it 100 times actually speeds things up is a separate question, as once all your CPU cores are fully loaded, trying to run more things at once is likely to decrease speed rather than increase it as processor cache efficiency is lost. For CPU intensive tasks the optimum number of threads is likely to be the number of CPU cores (or a few more).
As to your specific questions:
Q1: When you use phtread_create the last parameter is void *arg, which is an opaque pointer passed to your thread. You would normally use that as a pointer to a struct containing the parameters you want to pass. This might also be used to return the calculated sum. Note that you will have to wrap do_the_maths in a suitable second function so its signature (return value and parameters) look like that expected by pthread_create.
Q2. See above for general warning re too many threads, but uses this a useful technique.
Specifically:
You do not want to use a mutex in the way you are doing. A mutex is only needed to protect parts of your code which are accessing common data (critical sections).
You both create a thread to call doTheMath, then also call doTheMath directly from the main thread. This is incorrect. You should instead merely create all the threads (in one loop), then run another loop to wait for each of the threads to complete, and sum the returned answers.
How to pass to the doTheMath all the arguments it needs?
You'll have to create a proxy function that has a signature acceptable by pthread_create, which invokes the actual functions in its body:
struct doTheMathArgs {
struct example *e;
const unsigned int a, b, c;
};
callDoTheMath(void *data) {
struct doTheMathArgs *args = data;
doTheMath(args->e, args->a, args->b, args->c);
}
...
struct doTheMathArgs dtma;
dtma.e = ...;
...
dtma.c = ...;
pthread_create(&(tid[i]), NULL, &callDoTheMath, (void *) &dtma);
Is there here really a point in using threads, will this even work as I want it to? I cant really understand it, when I lock my function call with the mutex, it wont let to call my function 100 times at the same time, right? So how can I do this?
Unless you're working on a computer that is really capable of running 100 threads at a time your code is going to slow down the entire process. You should rather stay with a number of threads that's your maschine is capable of running.
To answer the futher part of your question you'll have to tell us how your doTheMath function works. Is each invocation of this function completley independant from others? Does the order of invocations matter?
Let's imagine that I have a few worker threads such as follows:
while (1) {
do_something();
if (flag_isset())
do_something_else();
}
We have a couple of helper functions for checking and setting a flag:
void flag_set() { global_flag = 1; }
void flag_clear() { global_flag = 0; }
int flag_isset() { return global_flag; }
Thus the threads keep calling do_something() in a busy-loop and in case some other thread sets global_flag the thread also calls do_something_else() (which could for example output progress or debugging information when requested by setting the flag from another thread).
My question is: Do I need to do something special to synchronize access to the global_flag? If yes, what exactly is the minimum work to do the synchronization in a portable way?
I have tried to figure this out by reading many articles but I am still not quite sure of the correct answer... I think it is one of the following:
A: No need to synchronize because setting or clearing the flag does not create race conditions:
We just need to define the flag as volatile to make sure that it is really read from the shared memory every time it is being checked:
volatile int global_flag;
It might not propagate to other CPU cores immediately but will sooner or later, guaranteed.
B: Full synchronization is needed to make sure that changes to the flag are propagated between threads:
Setting the shared flag in one CPU core does not necessarily make it seen by another core. We need to use a mutex to make sure that flag changes are always propagated by invalidating the corresponding cache lines on other CPUs. The code becomes as follows:
volatile int global_flag;
pthread_mutex_t flag_mutex;
void flag_set() { pthread_mutex_lock(flag_mutex); global_flag = 1; pthread_mutex_unlock(flag_mutex); }
void flag_clear() { pthread_mutex_lock(flag_mutex); global_flag = 0; pthread_mutex_unlock(flag_mutex); }
int flag_isset()
{
int rc;
pthread_mutex_lock(flag_mutex);
rc = global_flag;
pthread_mutex_unlock(flag_mutex);
return rc;
}
C: Synchronization is needed to make sure that changes to the flag are propagated between threads:
This is the same as B but instead of using a mutex on both sides (reader & writer) we set it in only in the writing side. Because the logic does not require synchronization. we just need to synchronize (invalidate other caches) when the flag is changed:
volatile int global_flag;
pthread_mutex_t flag_mutex;
void flag_set() { pthread_mutex_lock(flag_mutex); global_flag = 1; pthread_mutex_unlock(flag_mutex); }
void flag_clear() { pthread_mutex_lock(flag_mutex); global_flag = 0; pthread_mutex_unlock(flag_mutex); }
int flag_isset() { return global_flag; }
This would avoid continuously locking and unlocking the mutex when we know that the flag is rarely changed. We are just using a side-effect of Pthreads mutexes to make sure that the change is propagated.
So, which one?
I think A and B are the obvious choices, B being safer. But how about C?
If C is ok, is there some other way of forcing the flag change to be visible on all CPUs?
There is one somewhat related question: Does guarding a variable with a pthread mutex guarantee it's also not cached? ...but it does not really answer this.
The 'minimum amount of work' is an explicit memory barrier. The syntax depends on your compiler; on GCC you could do:
void flag_set() {
global_flag = 1;
__sync_synchronize(global_flag);
}
void flag_clear() {
global_flag = 0;
__sync_synchronize(global_flag);
}
int flag_isset() {
int val;
// Prevent the read from migrating backwards
__sync_synchronize(global_flag);
val = global_flag;
// and prevent it from being propagated forwards as well
__sync_synchronize(global_flag);
return val;
}
These memory barriers accomplish two important goals:
They force a compiler flush. Consider a loop like the following:
for (int i = 0; i < 1000000000; i++) {
flag_set(); // assume this is inlined
local_counter += i;
}
Without a barrier, a compiler might choose to optimize this to:
for (int i = 0; i < 1000000000; i++) {
local_counter += i;
}
flag_set();
Inserting a barrier forces the compiler to write the variable back immediately.
They force the CPU to order its writes and reads. This is not so much an issue with a single flag - most CPU architectures will eventually see a flag that's set without CPU-level barriers. However the order might change. If we have two flags, and on thread A:
// start with only flag A set
flag_set_B();
flag_clear_A();
And on thread B:
a = flag_isset_A();
b = flag_isset_B();
assert(a || b); // can be false!
Some CPU architectures allow these writes to be reordered; you may see both flags being false (ie, the flag A write got moved first). This can be a problem if a flag protects, say, a pointer being valid. Memory barriers force an ordering on writes to protect against these problems.
Note also that on some CPUs, it's possible to use 'acquire-release' barrier semantics to further reduce overhead. Such a distinction does not exist on x86, however, and would require inline assembly on GCC.
A good overview of what memory barriers are and why they are needed can be found in the Linux kernel documentation directory. Finally, note that this code is enough for a single flag, but if you want to synchronize against any other values as well, you must tread very carefully. A lock is usually the simplest way to do things.
You must not cause data race cases. It is undefined behavior and the compiler is allowed to do anything and everything it pleases.
A humorous blog on the topic: http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
Case 1: There is no synchronization on the flag, so anything is allowed to happen. For example, the compiler is allowed to turn
flag_set();
while(weArentBoredLoopingYet())
doSomethingVeryExpensive();
flag_clear()
into
while(weArentBoredLoopingYet())
doSomethingVeryExpensive();
flag_set();
flag_clear()
Note: this kind of race is actually very popular. Your millage may vary. One one hand, the de-facto implementation of pthread_call_once involves a data race like this. On the other hand, it is undefined behavior. On most versions of gcc, you can get away with it because gcc chooses not to exercise its right to optimize this way in many cases, but it is not "spec" code.
B: full synchronization is the right call. This is simply what you have to do.
C: Only synchronization on the writer could work, if you can prove that no one wants to read it while it is writing. The official definition of a data race (from the C++11 specification) is one thread writing to a variable while another thread can concurrently read or write the same variable. If your readers and writers all run at once, you still have a race case. However, if you can prove that the writer writes once, there is some synchronization, and then the readers all read, then the readers do not need synchronization.
As for caching, the rule is that a mutex lock/unlock synchronizes with all threads that lock/unlock the same mutex. This means you will not see any unusual caching effects (although under the hood, your processor can do spectacular things to make this run faster... it's just obliged to make it look like it wasn't doing anything special). If you don't synchronize, however, you get no guarantees that the other thread doesn't have changes to push that you need!
All of that being said, the question is really how much are you willing to rely on compiler specific behavior. If you want to write proper code, you need to do proper synchronization. If you are willing to rely on the compiler to be kind to you, you can get away with a lot less.
If you have C++11, the easy answer is to use atomic_flag, which is designed to do exactly what you want AND is designed to synchronize correctly for you in most cases.
For the example you have posted, case A is sufficient provided that ...
Getting and setting the flag takes only one CPU instruction.
do_something_else() is not dependent upon the flag being set during the execution of that routine.
If getting and/or setting the flag takes more than one CPU instruction, then you must some form of locking.
If do_something_else() is dependent upon the flag being set during the execution of that routine, then you must lock as in case C but the mutex must be locked before calling flag_isset().
Hope this helps.
Assigning incoming job to worker threads requires no locking. Typical example is webserver, where the request is catched by a main thread, and this main thread selects a worker. I'm trying explain it with some pesudo code.
main task {
// do forever
while (true)
// wait for job
while (x != null) {
sleep(some);
x = grabTheJob();
}
// select worker
bool found = false;
for (n = 0; n < NUM_OF_WORKERS; n++)
if (workerList[n].getFlag() != AVAILABLE) continue;
workerList[n].setJob(x);
workerList[n].setFlag(DO_IT_PLS);
found = true;
}
if (!found) panic("no free worker task! ouch!");
} // while forever
} // main task
worker task {
while (true) {
while (getFlag() != DO_IT_PLS) sleep(some);
setFlag(BUSY_DOING_THE_TASK);
/// do it really
setFlag(AVAILABLE);
} // while forever
} // worker task
So, if there are one flag, which one party sets is to A and another to B and C (the main task sets it to DO_IT_PLS, and the worker sets it to BUSY and AVAILABLE), there is no confilct. Play it with "real-life" example, say, when the teacher is giving different tasks to students. The teacher selects a student, gives him/her a task. Then, the teacher looks for next available student. When a student is ready, he/she gets back to the pool of available students.
UPDATE: just clarify, there are only one main() thread and several - configurable number of - worker threads. As main() runs only one instance, there is no need to sync the selection and launc of the workers.
This question is based on:
When is it safe to destroy a pthread barrier?
and the recent glibc bug report:
http://sourceware.org/bugzilla/show_bug.cgi?id=12674
I'm not sure about the semaphores issue reported in glibc, but presumably it's supposed to be valid to destroy a barrier as soon as pthread_barrier_wait returns, as per the above linked question. (Normally, the thread that got PTHREAD_BARRIER_SERIAL_THREAD, or a "special" thread that already considered itself "responsible" for the barrier object, would be the one to destroy it.) The main use case I can think of is when a barrier is used to synchronize a new thread's use of data on the creating thread's stack, preventing the creating thread from returning until the new thread gets to use the data; other barriers probably have a lifetime equal to that of the whole program, or controlled by some other synchronization object.
In any case, how can an implementation ensure that destruction of the barrier (and possibly even unmapping of the memory it resides in) is safe as soon as pthread_barrier_wait returns in any thread? It seems the other threads that have not yet returned would need to examine at least some part of the barrier object to finish their work and return, much like how, in the glibc bug report cited above, sem_post has to examine the waiters count after having adjusted the semaphore value.
I'm going to take another crack at this with an example implementation of pthread_barrier_wait() that uses mutex and condition variable functionality as might be provided by a pthreads implementation. Note that this example doesn't try to deal with performance considerations (specifically, when the waiting threads are unblocked, they are all re-serialized when exiting the wait). I think that using something like Linux Futex objects could help with the performance issues, but Futexes are still pretty much out of my experience.
Also, I doubt that this example handles signals or errors correctly (if at all in the case of signals). But I think proper support for those things can be added as an exercise for the reader.
My main fear is that the example may have a race condition or deadlock (the mutex handling is more complex than I like). Also note that it is an example that hasn't even been compiled. Treat it as pseudo-code. Also keep in mind that my experience is mainly in Windows - I'm tackling this more as an educational opportunity than anything else. So the quality of the pseudo-code may well be pretty low.
However, disclaimers aside, I think it may give an idea of how the problem asked in the question could be handled (ie., how can the pthread_barrier_wait() function allow the pthread_barrier_t object it uses to be destroyed by any of the released threads without danger of using the barrier object by one or more threads on their way out).
Here goes:
/*
* Since this is a part of the implementation of the pthread API, it uses
* reserved names that start with "__" for internal structures and functions
*
* Functions such as __mutex_lock() and __cond_wait() perform the same function
* as the corresponding pthread API.
*/
// struct __barrier_wait data is intended to hold all the data
// that `pthread_barrier_wait()` will need after releasing
// waiting threads. This will allow the function to avoid
// touching the passed in pthread_barrier_t object after
// the wait is satisfied (since any of the released threads
// can destroy it)
struct __barrier_waitdata {
struct __mutex cond_mutex;
struct __cond cond;
unsigned waiter_count;
int wait_complete;
};
struct __barrier {
unsigned count;
struct __mutex waitdata_mutex;
struct __barrier_waitdata* pwaitdata;
};
typedef struct __barrier pthread_barrier_t;
int __barrier_waitdata_init( struct __barrier_waitdata* pwaitdata)
{
waitdata.waiter_count = 0;
waitdata.wait_complete = 0;
rc = __mutex_init( &waitdata.cond_mutex, NULL);
if (!rc) {
return rc;
}
rc = __cond_init( &waitdata.cond, NULL);
if (!rc) {
__mutex_destroy( &pwaitdata->waitdata_mutex);
return rc;
}
return 0;
}
int pthread_barrier_init(pthread_barrier_t *barrier, const pthread_barrierattr_t *attr, unsigned int count)
{
int rc;
rc = __mutex_init( &barrier->waitdata_mutex, NULL);
if (!rc) return rc;
barrier->pwaitdata = NULL;
barrier->count = count;
//TODO: deal with attr
}
int pthread_barrier_wait(pthread_barrier_t *barrier)
{
int rc;
struct __barrier_waitdata* pwaitdata;
unsigned target_count;
// potential waitdata block (only one thread's will actually be used)
struct __barrier_waitdata waitdata;
// nothing to do if we only need to wait for one thread...
if (barrier->count == 1) return PTHREAD_BARRIER_SERIAL_THREAD;
rc = __mutex_lock( &barrier->waitdata_mutex);
if (!rc) return rc;
if (!barrier->pwaitdata) {
// no other thread has claimed the waitdata block yet -
// we'll use this thread's
rc = __barrier_waitdata_init( &waitdata);
if (!rc) {
__mutex_unlock( &barrier->waitdata_mutex);
return rc;
}
barrier->pwaitdata = &waitdata;
}
pwaitdata = barrier->pwaitdata;
target_count = barrier->count;
// all data necessary for handling the return from a wait is pointed to
// by `pwaitdata`, and `pwaitdata` points to a block of data on the stack of
// one of the waiting threads. We have to make sure that the thread that owns
// that block waits until all others have finished with the information
// pointed to by `pwaitdata` before it returns. However, after the 'big' wait
// is completed, the `pthread_barrier_t` object that's passed into this
// function isn't used. The last operation done to `*barrier` is to set
// `barrier->pwaitdata = NULL` to satisfy the requirement that this function
// leaves `*barrier` in a state as if `pthread_barrier_init()` had been called - and
// that operation is done by the thread that signals the wait condition
// completion before the completion is signaled.
// note: we're still holding `barrier->waitdata_mutex`;
rc = __mutex_lock( &pwaitdata->cond_mutex);
pwaitdata->waiter_count += 1;
if (pwaitdata->waiter_count < target_count) {
// need to wait for other threads
__mutex_unlock( &barrier->waitdata_mutex);
do {
// TODO: handle the return code from `__cond_wait()` to break out of this
// if a signal makes that necessary
__cond_wait( &pwaitdata->cond, &pwaitdata->cond_mutex);
} while (!pwaitdata->wait_complete);
}
else {
// this thread satisfies the wait - unblock all the other waiters
pwaitdata->wait_complete = 1;
// 'release' our use of the passed in pthread_barrier_t object
barrier->pwaitdata = NULL;
// unlock the barrier's waitdata_mutex - the barrier is
// ready for use by another set of threads
__mutex_unlock( barrier->waitdata_mutex);
// finally, unblock the waiting threads
__cond_broadcast( &pwaitdata->cond);
}
// at this point, barrier->waitdata_mutex is unlocked, the
// barrier->pwaitdata pointer has been cleared, and no further
// use of `*barrier` is permitted...
// however, each thread still has a valid `pwaitdata` pointer - the
// thread that owns that block needs to wait until all others have
// dropped the pwaitdata->waiter_count
// also, at this point the `pwaitdata->cond_mutex` is locked, so
// we're in a critical section
rc = 0;
pwaitdata->waiter_count--;
if (pwaitdata == &waitdata) {
// this thread owns the waitdata block - it needs to hang around until
// all other threads are done
// as a convenience, this thread will be the one that returns
// PTHREAD_BARRIER_SERIAL_THREAD
rc = PTHREAD_BARRIER_SERIAL_THREAD;
while (pwaitdata->waiter_count!= 0) {
__cond_wait( &pwaitdata->cond, &pwaitdata->cond_mutex);
};
__mutex_unlock( &pwaitdata->cond_mutex);
__cond_destroy( &pwaitdata->cond);
__mutex_destroy( &pwaitdata_cond_mutex);
}
else if (pwaitdata->waiter_count == 0) {
__cond_signal( &pwaitdata->cond);
__mutex_unlock( &pwaitdata->cond_mutex);
}
return rc;
}
17 July 20111: Update in response to a comment/question about process-shared barriers
I forgot completely about the situation with barriers that are shared between processes. And as you mention, the idea I outlined will fail horribly in that case. I don't really have experience with POSIX shared memory use, so any suggestions I make should be tempered with scepticism.
To summarize (for my benefit, if no one else's):
When any of the threads gets control after pthread_barrier_wait() returns, the barrier object needs to be in the 'init' state (however, the most recent pthread_barrier_init() on that object set it). Also implied by the API is that once any of the threads return, one or more of the the following things could occur:
another call to pthread_barrier_wait() to start a new round of synchronization of threads
pthread_barrier_destroy() on the barrier object
the memory allocated for the barrier object could be freed or unshared if it's in a shared memory region.
These things mean that before the pthread_barrier_wait() call allows any thread to return, it pretty much needs to ensure that all waiting threads are no longer using the barrier object in the context of that call. My first answer addressed this by creating a 'local' set of synchronization objects (a mutex and an associated condition variable) outside of the barrier object that would block all the threads. These local synchronization objects were allocated on the stack of the thread that happened to call pthread_barrier_wait() first.
I think that something similar would need to be done for barriers that are process-shared. However, in that case simply allocating those sync objects on a thread's stack isn't adequate (since the other processes would have no access). For a process-shared barrier, those objects would have to be allocated in process-shared memory. I think the technique I listed above could be applied similarly:
the waitdata_mutex that controls the 'allocation' of the local sync variables (the waitdata block) would be in process-shared memory already by virtue of it being in the barrier struct. Of course, when the barrier is set to THEAD_PROCESS_SHARED, that attribute would also need to be applied to the waitdata_mutex
when __barrier_waitdata_init() is called to initialize the local mutex & condition variable, it would have to allocate those objects in shared memory instead of simply using the stack-based waitdata variable.
when the 'cleanup' thread destroys the mutex and the condition variable in the waitdata block, it would also need to clean up the process-shared memory allocation for the block.
in the case where shared memory is used, there needs to be some mechanism to ensured that the shared memory object is opened at least once in each process, and closed the correct number of times in each process (but not closed entirely before every thread in the process is finished using it). I haven't thought through exactly how that would be done...
I think these changes would allow the scheme to operate with process-shared barriers. the last bullet point above is a key item to figure out. Another is how to construct a name for the shared memory object that will hold the 'local' process-shared waitdata. There are certain attributes you'd want for that name:
you'd want the storage for the name to reside in the struct pthread_barrier_t structure so all process have access to it; that means a known limit to the length of the name
you'd want the name to be unique to each 'instance' of a set of calls to pthread_barrier_wait() because it might be possible for a second round of waiting to start before all threads have gotten all the way out of the first round waiting (so the process-shared memory block set up for the waitdata might not have been freed yet). So the name probably has to be based on things like process id, thread id, address of the barrier object, and an atomic counter.
I don't know whether or not there are security implications to having the name be 'guessable'. if so, some randomization needs to be added - no idea how much. Maybe you'd also need to hash the data mentioned above along with the random bits. Like I said, I really have no idea if this is important or not.
As far as I can see there is no need for pthread_barrier_destroy to be an immediate operation. You could have it wait until all threads that are still in their wakeup phase are woken up.
E.g you could have an atomic counter awakening that initially set to the number of threads that are woken up. Then it would be decremented as last action before pthread_barrier_wait returns. pthread_barrier_destroy then just could be spinning until that counter falls to 0.