Concurrent variable access in C

I have a fairly specific question about concurrent programming in C. I have done a fair bit of research on this but have seen several conflicting answers, so I'm hoping for some clarification. I have a program that's something like the following (sorry for the longish code block):
typedef struct {
    pthread_mutex_t mutex;
    /* some shared data */
    int eventCounter;
} SharedData;

SharedData globalSharedData;

typedef struct {
    /* details unimportant */
} NewData;

void newData(NewData data) {
    int localCopyOfCounter;

    if (/* information contained in new data triggers an event */) {
        pthread_mutex_lock(&globalSharedData.mutex);
        localCopyOfCounter = ++globalSharedData.eventCounter;
        pthread_mutex_unlock(&globalSharedData.mutex);
    }
    else {
        return;
    }

    /* Perform long running computation. */

    if (localCopyOfCounter != globalSharedData.eventCounter) {
        /* A new event has happened, old information is stale and
           the current computation can be aborted. */
        return;
    }

    /* Perform another long running computation whose results
       depend on the previous one. */

    if (localCopyOfCounter != globalSharedData.eventCounter) {
        /* Another check for new event that causes information
           to be stale. */
        return;
    }

    /* Final stage of computation whose results depend on two
       previous stages. */
}
There is a pool of threads servicing the connection for incoming data, so multiple instances of newData can be running at the same time. In a multi-processor environment there are two problems I'm aware of in getting the counter handling part of this code correct: preventing the compiler from caching the shared counter copy in a register so other threads can't see it, and forcing the CPU to write the store of the counter value to memory in a timely fashion so other threads can see it.

I would prefer not to use a synchronization call around the counter checks because a partial read of the counter value is acceptable (it will produce a value different than the local copy, which should be adequate to conclude that an event has occurred). Would it be sufficient to declare the eventCounter field in SharedData to be volatile, or do I need to do something else here? Also, is there a better way to handle this?

Unfortunately, the C standard says very little about concurrency. However, most compilers will at least not cache a volatile variable in a register: it is reloaded from memory on every access (and MSVC additionally documents acquire semantics for volatile reads). That is what you want here: your code as it stands may end up comparing values cached in registers, and I wouldn't even be surprised if both comparisons were optimized out.
So the answer is yes, make the eventCounter volatile. Alternatively, if you don't want to restrict your compiler too much, you can use the following function to perform reads of eventCounter.
int load_acquire(volatile int * counter) { return *counter; }
if (localCopy != load_acquire(&sharedCopy))
// ...
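If you can use C11, <stdatomic.h> gives you this portably without relying on volatile: make the counter an atomic and load it explicitly. A minimal sketch of just the counter handling, reusing the question's names (the long-running computations are elided):

#include <pthread.h>
#include <stdatomic.h>

typedef struct {
    pthread_mutex_t mutex;
    /* some shared data */
    atomic_int eventCounter;
} SharedData;

SharedData globalSharedData;

void newDataAtomic(void)
{
    /* The increment still happens under the mutex guarding the shared data;
       atomic_fetch_add returns the old value, so add 1 to mirror ++counter. */
    pthread_mutex_lock(&globalSharedData.mutex);
    int localCopyOfCounter =
        atomic_fetch_add(&globalSharedData.eventCounter, 1) + 1;
    pthread_mutex_unlock(&globalSharedData.mutex);

    /* ... long running computation ... */

    /* The staleness check is now a well-defined atomic load; relaxed
       ordering is enough because only the value is compared. */
    if (localCopyOfCounter !=
            atomic_load_explicit(&globalSharedData.eventCounter,
                                 memory_order_relaxed))
        return;

    /* ... further stages ... */
}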

"preventing the compiler from caching the local counter copy in a register so other threads can't see it"
Your local counter copy is "local": it is created on the execution stack and visible only to the running thread. Every other thread runs on a different stack and has its own local counter variable, so there is no concurrency issue there.
Your global counter should be declared volatile to avoid register optimization.

You can also use hand-coded assembly or compiler intrinsics, which guarantee atomicity without a mutex; they can atomically ++ and -- your counter.
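For example, with gcc/clang's __atomic builtins (the older __sync builtins work similarly), both the increment and the re-read are single atomic operations; the names here are illustrative, not from the question's code:

int counter;

void example(void)
{
    /* Atomic increment that returns the new value: the intrinsic ++counter. */
    int my_copy = __atomic_add_fetch(&counter, 1, __ATOMIC_SEQ_CST);

    /* ... work ... */

    /* Atomic re-read of the current value with acquire ordering. */
    if (my_copy != __atomic_load_n(&counter, __ATOMIC_ACQUIRE)) {
        /* a newer event has occurred */
    }
}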
volatile is useless these days, for the most part; you should look at memory barriers, another low-level CPU facility that helps with multi-core contention.
However, the best advice I can give would be to bone up on the various managed and native multi-core support libraries. Some of the older ones like OpenMP or MPI (message based) are still kicking, and people will go on about how cool they are; however, for most developers, something like Intel's TBB or Microsoft's newer APIs is a better fit. I also just dug up a Code Project article whose author apparently uses cmpxchg8b, the low-level hardware route I mentioned initially...
Good luck.

Related

Why is there barrier() in KCOV code in Linux kernel?

In Linux KCOV code, why is this barrier() placed?
void notrace __sanitizer_cov_trace_pc(void)
{
    struct task_struct *t;
    enum kcov_mode mode;

    t = current;
    /*
     * We are interested in code coverage as a function of a syscall inputs,
     * so we ignore code executed in interrupts.
     */
    if (!t || in_interrupt())
        return;
    mode = READ_ONCE(t->kcov_mode);
    if (mode == KCOV_MODE_TRACE) {
        unsigned long *area;
        unsigned long pos;

        /*
         * There is some code that runs in interrupts but for which
         * in_interrupt() returns false (e.g. preempt_schedule_irq()).
         * READ_ONCE()/barrier() effectively provides load-acquire wrt
         * interrupts, there are paired barrier()/WRITE_ONCE() in
         * kcov_ioctl_locked().
         */
        barrier();
        area = t->kcov_area;
        /* The first word is number of subsequent PCs. */
        pos = READ_ONCE(area[0]) + 1;
        if (likely(pos < t->kcov_size)) {
            area[pos] = _RET_IP_;
            WRITE_ONCE(area[0], pos);
        }
    }
}
A barrier() call prevents the compiler from re-ordering instructions. However, how is that related to interrupts here? Why is it needed for semantic correctness?
Without barrier(), the compiler would be free to access t->kcov_area before t->kcov_mode. It's unlikely to want to do that in practice, but that's not the point. Without some kind of barrier, C rules allow the compiler to create asm that doesn't do what we want. (The C11 memory model has no ordering guarantees beyond what you impose explicitly: in C11 via stdatomic, or in Linux / GNU C via barriers like barrier() or smp_rmb().)
As described in the comment, barrier() is creating an acquire-load wrt. code running on the same core, which is all you need for interrupts.
mode = READ_ONCE(t->kcov_mode);
if (mode == KCOV_MODE_TRACE) {
    ...
    barrier();
    area = t->kcov_area;
    ...
I'm not familiar with kcov in general, but it looks like seeing a certain value in t->kcov_mode with an acquire load makes it safe to read t->kcov_area. (Because whatever code writes that object writes kcov_area first, then does a release-store to kcov_mode.)
https://preshing.com/20120913/acquire-and-release-semantics/ explains acq / rel synchronization in general.
Why isn't smp_rmb() required? (Even on weakly-ordered ISAs where acquire ordering would need a fence instruction to guarantee seeing other stores done by another core.)
An interrupt handler runs on the same core that was doing the other operations, just like a signal handler interrupts a thread and runs in its context. struct task_struct *t = current means that the data we're looking at is local to a single task. This is equivalent to something within a single thread in user-space. (Kernel pre-emption leading to re-scheduling on a different core will use whatever memory barriers are necessary to preserve correct execution of a single thread when that other core accesses the memory this task had been using).
The user-space C11 stdatomic equivalent of this barrier is atomic_signal_fence(memory_order_acquire). Signal fences only have to block compile-time reordering (like Linux barrier()), unlike atomic_thread_fence that has to emit a memory barrier asm instruction.
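A user-space sketch of the same pattern (the flag and payload names are hypothetical), with the compiler-only fence doing the job of barrier():

#include <stdatomic.h>

int payload;                /* written before the flag is set */
atomic_int ready;           /* flag checked by the signal handler */

void writer(void)           /* ordinary thread code */
{
    payload = 42;
    atomic_signal_fence(memory_order_release);  /* compile-time barrier only */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

void handler(int sig)       /* runs in the same thread, like an interrupt */
{
    (void)sig;
    if (atomic_load_explicit(&ready, memory_order_relaxed)) {
        atomic_signal_fence(memory_order_acquire);
        /* safe to read payload here: same-core ordering is preserved */
    }
}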
Out-of-order CPUs do reorder things internally, but the cardinal rule of OoO exec is to preserve the illusion of instructions running one at a time, in order, for the core running the instructions. This is why you don't need a memory barrier for the asm equivalent of a = 1; b = a; to correctly load the 1 you just stored; hardware preserves the illusion of serial execution (see footnote 1) in program order. (Typically via having loads snoop the store buffer and store-forward from stores to loads for stores that haven't committed to L1d cache yet.)
Instructions in an interrupt handler logically run after the point where the interrupt happened (as per the interrupt-return address). Therefore we just need the asm instructions in the right order (barrier()), and hardware will make everything work.
Footnote 1: There are some explicitly-parallel ISAs like IA-64 and the Mill, but they provide rules that asm can follow to be sure that one instruction sees the effect of another earlier one. Same for classic MIPS I load delay slots and stuff like that. Compilers take care of this for compiled C.

Safely Exiting to a Particular State in Case of Error

When writing code I often have checks to see if errors occurred. An example would be:
char *x = malloc( some_bytes );
if( x == NULL ){
    fprintf( stderr, "Malloc failed.\n" );
    exit( EXIT_FAILURE );
}
I've also used strerror( errno ) in the past.
I've only ever written small desktop applications where it didn't matter if the program exit()ed on an error.
Now, however, I'm writing C code for an embedded system (Arduino) and I don't want the system to just exit in case of an error. I want it to go to a particular state/function where it can power down systems, send error reports and idle safely.
I could simply call an error_handler() function, but I could be deep in the stack and very low on memory, leaving error_handler() inoperable.
Instead, I'd like execution to effectively collapse the stack, free up a bunch of memory and start sorting out powering down and error reporting. There is a serious fire risk if the system doesn't power down safely.
Is there a standard way that safe error handling is implemented in low memory embedded systems?
EDIT 1:
I'll limit my use of malloc() in embedded systems. In this particular case, the errors would occur when reading a file, if the file was not of the correct format.
Maybe you're waiting for the Holy and Sacred setjmp/longjmp, the one who came to save all the memory-hungry stacks from their sins?
#include <setjmp.h>

jmp_buf jumpToMeOnAnError;

void someUpperFunctionOnTheStack() {
    if (setjmp(jumpToMeOnAnError) != 0) {
        // Error handling code goes here
        // Return, abort(), while(1) {}, or whatever here...
    }
    // Do routine stuff
}

void someLowerFunctionOnTheStack() {
    if (theWorldIsOver)
        longjmp(jumpToMeOnAnError, -1);
}
Edit: Prefer not to do malloc()/free()s on embedded systems, for the same reasons you said. It's simply unmanageable. Unless you use a lot of return codes/setjmp()s to free the memory all the way back up the stack...
If your system has a watchdog, you could use:
char *x = malloc( some_bytes );
assert(x != NULL);
The implementation of assert() could be something like:
#define assert(condition) \
    if (!(condition)) { while (1) {} }
In case of a failure the assert would spin, the watchdog would trigger, and the system would reset. At restart the system would check the reset reason; if the reset reason was "watchdog reset", the system would go to a safe state.
Update: before entering the while loop, assert could also output an error message, print the stack trace or save some data in non-volatile memory.
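A sketch of such an assert (log_error() is a hypothetical reporting hook you would supply, e.g. writing to a UART or saving the location to non-volatile memory before the watchdog bites):

void log_error(const char *file, int line);   /* hypothetical hook */

#define ASSERT(condition)                         \
    do {                                          \
        if (!(condition)) {                       \
            log_error(__FILE__, __LINE__);        \
            while (1) { /* wait for watchdog */ } \
        }                                         \
    } while (0)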
Is there a standard way that safe error handling is implemented in low memory embedded systems?
Yes, there is an industry de facto way of handling it. It is all rather simple:
For every module in your program you need to have a result type, such as a custom enum, which describes every possible thing that could go wrong with the functions inside that module.
You document every function properly, stating what codes it will return upon error and what code it will return upon success.
You leave all error handling to the caller.
If the caller is another module, it too passes on the error to its own caller. Possibly renames the error into something more suitable, where applicable.
The error handling mechanism is located in main(), at the bottom of the call stack.
This works well together with classic state machines. A typical main would be:
void main (void)
{
    result_t result;   /* the custom result enum described above */

    for (;;)
    {
        serve_watchdog();

        result = state_machine();

        if (result != good)
        {
            error_handler(result);
        }
    }
}
You should not use malloc in bare-metal or RTOS microcontroller applications, not so much for safety reasons, but simply because it doesn't make any sense whatsoever to use it. Apply common sense when programming.
Use setjmp(3) to set a recovery point, and longjmp(3) to jump to it, restoring the stack to what it was at the setjmp point. It won't free malloc'ed memory.
Generally, it is not a good idea to use malloc/free in an embedded program if it can be avoided. For example, a static array may be adequate, or even using alloca() is marginally better.
To minimize stack usage, write the program so that calls are made "in parallel" rather than deeply nested (function calls sub-function, which calls sub-function, which calls sub-function...). I.e. the top-level function calls a sub-function that promptly returns with status info; the top-level function then calls the next sub-function, etc.
The (bad for stack-limited) nested method of program architecture:

top level function
    second level function
        third level function
            fourth level function

should be avoided in embedded systems.
the preferred method of program architecture for embedded systems is:

top level function (the reset event handler)
    (variations in the following depending on whether 'warm' or 'cold' start)
    initialize hardware
    initialize peripherals
    initialize communication I/O
    initialize interrupts
    initialize status info
    enable interrupts
    enter background processing

interrupt handler
    re-enable the interrupt
    using 'scheduler'
        select a foreground function
        trigger dispatch for selected foreground function
    return from interrupt

background processing (see the sketch after the notes below)
    (this can be, and often is, implemented as a 'state' machine rather than a loop)
    loop:
        if status info indicates need to call second level function 1
            second level function 1, which updates status info
        if status info indicates need to call second level function 2
            second level function 2, which updates status info
        etc
    end loop
Note that, as much as possible, there is no 'third level function x'.
Note that the foreground functions must complete before they are again scheduled.
Note: there are lots of other details that I have omitted in the above, like:
- kicking the watchdog
- the other interrupt events
- 'critical' code sections and use of mutexes
- considerations between 'soft real-time' and 'hard real-time'
- context switching
- continuous BIT, commanded BIT, and error handling
- etc
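A minimal sketch of that flag-driven background loop (all names are hypothetical):

#include <stdbool.h>

void second_level_function_1(void);   /* hypothetical */
void second_level_function_2(void);   /* hypothetical */

/* Status flags set from the interrupt-driven 'scheduler'. */
static volatile bool need_task1, need_task2;

void background_processing(void)
{
    for (;;) {
        if (need_task1) {
            need_task1 = false;          /* clear before running the task */
            second_level_function_1();   /* updates status info */
        }
        if (need_task2) {
            need_task2 = false;
            second_level_function_2();
        }
    }
}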

How come a compiler cannot detect if a global variable is changed by another thread?

I was just going through the concepts behind the volatile keyword. I went through this link, which explains why to use the volatile keyword when a program uses an interrupt handler. It gives this example:
int etx_rcvd = FALSE;

void main()
{
    ...
    while (!etx_rcvd)
    {
        // Wait
    }
    ...
}

interrupt void rx_isr(void)
{
    ...
    if (ETX == rx_char)
    {
        etx_rcvd = TRUE;
    }
    ...
}
They are saying that since the compiler cannot know that etx_rcvd is updated in an interrupt handler, its optimizer may assume the variable's value is always FALSE and never re-read it in the while condition. To prevent this situation we use the volatile keyword, which stops the compiler from applying that optimization.

My question is: while compiling, how is the compiler not able to know that etx_rcvd is updated in an interrupt handler? Please help me find the answer; I haven't found a correct one yet.
The compiler cannot analyze all the code or running processes that can modify the memory location of etx_rcvd.

In your example, you mentioned that etx_rcvd is updated in an interrupt handler. That is correct. The interrupt handler is a piece of code launched by the operating system when the CPU receives an interrupt. That piece of code is actually driver code. In the driver code, etx_rcvd may have another name but point to the same memory location.

So in order to know whether etx_rcvd is updated somewhere else, the compiler would need to analyze the libraries' and drivers' code and figure out that they update the exact same memory location that you name etx_rcvd in your code. This cannot be done before execution time.

The same goes for multi-threading. The compiler cannot know a priori whether a certain thread updates the exact memory location used by another thread. For example, if another thread makes a syscall(), the compiler would need to look at the code handling the syscall().
When the CPU receives an interrupt, it stops whatever it's doing (unless it's processing a more important interrupt, in which case it will deal with this one only when the more important one is done), saves certain parameters on the stack and calls the interrupt handler. This means that certain things are not allowed in the interrupt handler itself, because the system is in an unknown state. The solution to this problem is for the interrupt handler to do what needs to be done immediately, usually read something from the hardware or send something to the hardware, and then schedule the handling of the new information at a later time (this is called the "bottom half") and return. The kernel is then guaranteed to call the bottom half as soon as possible -- and when it does, everything allowed in kernel modules will be allowed.
I think when the interrupt fires, whatever is running is stopped and the variable is set to TRUE, no longer satisfying the while condition.

But when you use the volatile keyword, it makes C check the variable's value again.

Of course, I'm not 100% sure about this; I'm open to answers that would change mine.
interrupt is not a C-specified keyword, so whatever is discussed is not C-specified behavior.

Yes, the compiler could see that etx_rcvd is modified inside an interrupt routine, assume that etx_rcvd could therefore change at any time outside the interrupt function, and effectively turn int etx_rcvd into volatile int etx_rcvd.
Now the question is should it do that?
IMO: No, it is not needed.
An interrupt function could modify global variables while the code flow is arranged such that non-interrupt access only happens inside interrupt-protected blocks (sketched below). An optimizing compiler would be hindered by an implied volatile on int etx_rcvd. Code would then need a way to say non_volatile int etx_rcvd to prevent the volatile assumption OP seeks.

C already provides a method to declare variables volatile (add volatile) and non-volatile (do not add volatile). If an interrupt routine could make variables volatile without them being declared so, code would need a new keyword to ensure non-volatility.
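A sketch of that interrupt-protected-block pattern; ENTER_CRITICAL()/EXIT_CRITICAL() are hypothetical macros you would map onto your target's interrupt control (e.g. CMSIS __disable_irq()/__enable_irq() on a Cortex-M):

/* Hypothetical: map these onto your target's interrupt control. */
#define ENTER_CRITICAL()  __disable_irq()
#define EXIT_CRITICAL()   __enable_irq()

int etx_rcvd;   /* deliberately not volatile */

void poll_etx(void)
{
    int snapshot;

    ENTER_CRITICAL();      /* interrupts off: rx_isr() cannot run here */
    snapshot = etx_rcvd;   /* so a plain, non-volatile access is safe  */
    etx_rcvd = 0;
    EXIT_CRITICAL();       /* interrupts back on */

    if (snapshot) {
        /* handle the received ETX */
    }
}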

Atomic disable and restore interrupts from ISR and non-ISR context: may it be different on some platform?

I work with embedded stuff, namely PIC32 Microchip CPUs these days.
I'm familiar with several real-time kernels: AVIX, FreeRTOS, TNKernel, and in all of them we have 2 versions of nearly all functions: one for calling from task, and second one for calling from ISR.
Of course it makes sense for functions that could switch context and/or sleep: obviously, ISR can't sleep, and context switch should be done in different manner. But there are several functions that do not switch context nor sleep: say, it may return system tick count, or set up software timer, etc.
Now, I'm implementing my own kernel: TNeoKernel, which has well-formed code and is carefully tested, and I'm considering providing "universal" functions in some cases: ones that can be called from either task or ISR context. But since all three aforementioned kernels use separate functions, I'm afraid I might be doing something wrong.
Say, in task and ISR context, TNKernel uses different routines for disabling/restoring interrupts, but as far as I see, the only possible difference is that ISR functions may be "compiled out" as an optimization if the target platform doesn't support nested interrupts. But if target platform supports nested interrupts, then disabling/restoring interrupts looks absolutely the same for task and ISR context.
So, my question is: are there platforms on which disabling/restoring interrupts from ISR should be done differently than from non-ISR context?
If there are no such platforms, I'd prefer to go with "universal" functions. If you have any comments on this approach, they are highly appreciated.
UPD: I don't like having two sets of functions because they lead to notable code duplication and complication. Say, I need to provide a function that starts a software timer. Here is what it looks like:
enum TN_RCode _tn_timer_start(struct TN_Timer *timer, TN_Timeout timeout)
{
    /* ... real job is done here ... */
}

/*
 * Function to be called from a task
 */
enum TN_RCode tn_timer_start(struct TN_Timer *timer, TN_Timeout timeout)
{
    TN_INTSAVE_DATA; //-- define the variable to store interrupt status,
                     //   it is used by TN_INT_DIS_SAVE()
                     //   and TN_INT_RESTORE()
    enum TN_RCode rc = TN_RC_OK;

    //-- check that function is called from the right context
    if (!tn_is_task_context()){
        rc = TN_RC_WCONTEXT;
        goto out;
    }

    //-- disable interrupts
    TN_INT_DIS_SAVE();

    //-- perform real job, after all
    rc = _tn_timer_start(timer, timeout);

    //-- restore interrupts state
    TN_INT_RESTORE();

out:
    return rc;
}

/*
 * Function to be called from an ISR
 */
enum TN_RCode tn_timer_istart(struct TN_Timer *timer, TN_Timeout timeout)
{
    TN_INTSAVE_DATA_INT; //-- define the variable to store interrupt status,
                         //   it is used by TN_INT_IDIS_SAVE()
                         //   and TN_INT_IRESTORE()
    enum TN_RCode rc = TN_RC_OK;

    //-- check that function is called from the right context
    if (!tn_is_isr_context()){
        rc = TN_RC_WCONTEXT;
        goto out;
    }

    //-- disable interrupts
    TN_INT_IDIS_SAVE();

    //-- perform real job, after all
    rc = _tn_timer_start(timer, timeout);

    //-- restore interrupts state
    TN_INT_IRESTORE();

out:
    return rc;
}
So we need wrappers like the ones above for nearly every system function. This is an inconvenience for me as a kernel developer, as well as for kernel users.
The only difference is that different macros are used: for task, these are TN_INTSAVE_DATA, TN_INT_DIS_SAVE(), TN_INT_RESTORE(); for interrupts these are TN_INTSAVE_DATA_INT, TN_INT_IDIS_SAVE(), TN_INT_IRESTORE().
For the platforms that support nested interrupts (ARM, PIC32), these macros are identical. For other platforms that don't support nested interrupts, TN_INTSAVE_DATA_INT, TN_INT_IDIS_SAVE() and TN_INT_IRESTORE() are expanded to nothing. So it is a bit of performance optimization, but the cost is too high in my opinion: it's harder to maintain, it's not so convenient to use, and the code size increases.
It's all a matter of design and CPU capabilities. I'm not familiar with any of the PICs but, for example, Freescale (Motorola) MCUs (among many others) have the ability to move the Condition Code Register (CCR) into the accumulator and back. This allows one to save the previous state of the Interrupt Enable/Disable Mask, and restore it at the end, without worrying about bluntly enabling interrupts where they should stay disabled (inside ISRs).
To answer which platform(s) must do it differently inside and outside ISRs, however, would require one to be familiar with all of them, or at least with one that fails this test. If there is a CPU that does not allow saving and restoring the CCR (as mentioned above), one would have no option but to do it differently for each case.
Kernel functions that normally cause scheduling to occur have simpler ISR versions because the scheduler runs on return from interrupt (there is usually an interrupt epilogue required to do that), not from the scheduling function itself.
It is simple enough to create a function that will work in any context, but it adds a small overhead. However the safety afforded by not calling an inappropriate function is probably worth it.
For example:
OSStatus semGive( OSSem sem )
{
    return isInterrupt() ? ISR_SemGive( sem ) : OS_SemGive( sem ) ;
}
The implementation of isInterrupt() is platform dependent, and is discussed at Safely detect, if function is called from an ISR?

Implementing a FIFO mutex in pthreads

I'm trying to implement a binary tree supporting concurrent insertions (which could occur even between nodes), but without having to allocate a global lock or a separate mutex or mutexes for each node. Rather, the quantity of such locks allocated should be on the order of the quantity of threads using the tree.
Consequently, I end up with a type of lock convoy problem. Explained more simply, it's what potentially happens when two or more threads do the following:
1 for(;;) {
2     lock(mutex)
3     do_stuff
4     unlock(mutex)
5 }
That is, if Thread#1 executes instructions 4->5->1->2 all in one "cpu burst" then Thread#2 gets starved from execution.
On the other hand, if there was a FIFO-type locking option for mutexes in pthreads, then such a problem could be avoided. So, is there a way to implement FIFO-type mutex locking in pthreads? Can altering thread priorities accomplish this?
You can implement a fair queuing system where each thread is added to a queue when it blocks, and the first thread on the queue always gets the resource when it becomes available. Such a "fair" ticket lock built on pthreads primitives might look like this:
#include <pthread.h>

typedef struct ticket_lock {
    pthread_cond_t cond;
    pthread_mutex_t mutex;
    unsigned long queue_head, queue_tail;
} ticket_lock_t;

#define TICKET_LOCK_INITIALIZER { PTHREAD_COND_INITIALIZER, PTHREAD_MUTEX_INITIALIZER }

void ticket_lock(ticket_lock_t *ticket)
{
    unsigned long queue_me;

    pthread_mutex_lock(&ticket->mutex);
    queue_me = ticket->queue_tail++;
    while (queue_me != ticket->queue_head)
    {
        pthread_cond_wait(&ticket->cond, &ticket->mutex);
    }
    pthread_mutex_unlock(&ticket->mutex);
}

void ticket_unlock(ticket_lock_t *ticket)
{
    pthread_mutex_lock(&ticket->mutex);
    ticket->queue_head++;
    pthread_cond_broadcast(&ticket->cond);
    pthread_mutex_unlock(&ticket->mutex);
}
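Usage then mirrors an ordinary mutex; a small sketch (do_stuff() is a placeholder):

void do_stuff(void);   /* placeholder for the critical section */

ticket_lock_t lock = TICKET_LOCK_INITIALIZER;

void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        ticket_lock(&lock);     /* threads acquire in FIFO (ticket) order */
        do_stuff();
        ticket_unlock(&lock);
    }
    return NULL;
}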
You could do something like this:

define a "queued lock" that consists of a free/busy flag plus a linked-list of pthread condition variables. Access to the queued_lock is protected by a mutex.

to lock the queued_lock:
- seize the mutex
- check the 'busy' flag
- if not busy: set busy = true; release the mutex; done
- if busy: create a new condition at the end of the queue & wait on it (releasing the mutex)

to unlock:
- seize the mutex
- if no other thread is queued: busy = false; release the mutex; done
- pthread_cond_signal the first waiting condition
- do not clear the 'busy' flag - ownership is passing to the other thread
- release the mutex

when the waiting thread is unblocked by pthread_cond_signal:
- remove our condition var from the head of the queue
- release the mutex

Note that the mutex is locked only while the state of the queued_lock is being altered, not for the whole duration that the queued_lock is held. A sketch follows.
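A minimal sketch of that scheme (error handling elided; each waiter node lives on the waiting thread's stack):

#include <pthread.h>
#include <stdbool.h>

typedef struct waiter {
    pthread_cond_t cond;    /* this thread's private condition variable */
    bool granted;           /* set by the unlocker when ownership passes */
    struct waiter *next;
} waiter_t;

typedef struct queued_lock {
    pthread_mutex_t mutex;  /* protects everything below */
    bool busy;
    waiter_t *head, *tail;  /* FIFO queue of waiting threads */
} queued_lock_t;

#define QUEUED_LOCK_INITIALIZER { PTHREAD_MUTEX_INITIALIZER, false, NULL, NULL }

void queued_lock_acquire(queued_lock_t *ql)
{
    pthread_mutex_lock(&ql->mutex);
    if (!ql->busy) {
        ql->busy = true;                      /* uncontended: take it */
    } else {
        waiter_t me = { .granted = false, .next = NULL };
        pthread_cond_init(&me.cond, NULL);
        if (ql->tail) ql->tail->next = &me;   /* append at end of queue */
        else          ql->head = &me;
        ql->tail = &me;
        while (!me.granted)                   /* guards against spurious wakeups */
            pthread_cond_wait(&me.cond, &ql->mutex);
        ql->head = me.next;                   /* pop ourselves off the head */
        if (!ql->head) ql->tail = NULL;
        pthread_cond_destroy(&me.cond);
        /* 'busy' is still true: ownership was passed directly to us */
    }
    pthread_mutex_unlock(&ql->mutex);
}

void queued_lock_release(queued_lock_t *ql)
{
    pthread_mutex_lock(&ql->mutex);
    if (ql->head) {
        ql->head->granted = true;             /* do not clear 'busy':       */
        pthread_cond_signal(&ql->head->cond); /* ownership passes to waiter */
    } else {
        ql->busy = false;
    }
    pthread_mutex_unlock(&ql->mutex);
}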
You can obtain a fair mutex with the idea sketched by @caf, but using atomic operations to acquire the ticket before taking the actual lock.
#include <pthread.h>
#include <stdint.h>

#if defined(_MSC_VER)
#include <windows.h>
typedef volatile LONG Sync32_t;
#define SyncFetchAndIncrement32(V) (InterlockedIncrement(V) - 1)
#elif (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) > 40100
typedef volatile uint32_t Sync32_t;
#define SyncFetchAndIncrement32(V) __sync_fetch_and_add(V, 1)
#else
#error No atomic operations
#endif

class FairMutex {
private:
    Sync32_t        _nextTicket;
    Sync32_t        _curTicket;
    pthread_mutex_t _mutex;
    pthread_cond_t  _cond;

public:
    inline FairMutex() : _nextTicket(0), _curTicket(0)
    {
        pthread_mutex_init(&_mutex, NULL);
        pthread_cond_init(&_cond, NULL);
    }
    inline ~FairMutex()
    {
        pthread_cond_destroy(&_cond);
        pthread_mutex_destroy(&_mutex);
    }
    inline void lock()
    {
        unsigned long myTicket = SyncFetchAndIncrement32(&_nextTicket);
        pthread_mutex_lock(&_mutex);
        while (_curTicket != myTicket) {
            pthread_cond_wait(&_cond, &_mutex);
        }
    }
    inline void unlock()
    {
        _curTicket++;
        pthread_cond_broadcast(&_cond);
        pthread_mutex_unlock(&_mutex);
    }
};
More broadly, I would not call this a FIFO mutex, because it gives the impression of maintaining an order that is not there in the first place. If your threads call lock() in parallel, they cannot have an order before calling it, so it makes no sense to create a mutex that preserves an order relationship which isn't there.
The example as you post it has no solution. Basically you only have one critical section and there is no place for parallelism.
That said, you can see that it is important to reduce the period your threads hold the mutex to a minimum, just a handful of instructions. This is difficult for insertion into a dynamic data structure such as a tree. The conceptually simplest solution is one read-write lock per tree node.

If you don't want individual locks per tree node, you could have one lock structure per level of the tree. I'd experiment with read-write locks for that. You can use read-locking of the level of the node in hand (plus the next level) while you traverse the tree. Then, when you have found the right node to insert, lock that level for writing.
The solution could be to use atomic operations: no locking, no context switching, no sleeping, and much, much faster than mutexes or condition variables. Atomic ops are not the end-all solution to everything, but we have created a lot of thread-safe versions of common data structures using just atomic ops. They are very fast.

Atomic ops are a series of simple operations like increment, decrement or assignment that are guaranteed to execute atomically in a multi-threaded environment. If two threads hit the op at the same time, the CPU makes sure one thread executes the op at a time. Atomic ops are hardware instructions, so they are fast. "Compare and swap" is particularly useful for thread-safe data structures; in our testing, atomic compare-and-swap is about as fast as a 32-bit integer assignment, maybe 2x as slow. When you consider how much CPU is consumed by mutexes, atomic ops are vastly faster.

It's not trivial to do rotations to balance your tree with atomic operations, but not impossible. I ran into this requirement in the past and cheated by making a thread-safe skiplist instead, since a skiplist can be done very easily with atomic operations. Sorry, I can't give you a copy of our code (my company would fire me), but it's easy enough to do yourself.
How atomic ops make lock-free data structures possible can be visualized with a simple thread-safe linked-list example. To add an item to a global linked list (_pHead) without using locks: first save a copy of _pHead in pOld. I think of these copies as "the state of the world" when executing concurrent ops. Next create a new node, pNew, and set its pNext to the COPY. Then use atomic "compare and swap" to change _pHead to pNew ONLY IF _pHead IS STILL pOld. The atomic op succeeds only if _pHead hasn't changed. If it fails, loop back, take a copy of the new _pHead, and repeat.

If the op succeeds, the rest of the world now sees a new head. If a thread got the old head a nanosecond before, it won't see the new item, but the list is still safe to iterate through. Since we preset pNext to the old head BEFORE we added our new item to the list, if a thread sees the new head a nanosecond after we added it, the list is safe to traverse.
Global stuff:
typedef struct _TList {
    int data;
    struct _TList *pNext;
} TList;

TList *_pHead;
Add to list:
TList *pOld, *pNew;
...
// allocate/fill/whatever to make pNew
...
while (1) {                        // concurrency loop
    pOld = _pHead;                 // copy the state of the world; we operate on the copy
    pNew->pNext = pOld;            // chain the new node to the current head
    if (CAS(&_pHead, pOld, pNew))  // swing the head pointer to the new node
        break;                     // success
}
CAS is shorthand for __sync_bool_compare_and_swap or the like. See how easy? No mutexes, no locks! In the rare event that two threads hit that code at the same time, one simply loops a second time. We only see the second loop when the scheduler swaps a thread out while it is inside the concurrency loop, so it is rare and inconsequential.

Things can be pulled off the head of a linked list in a similar way. You can atomically change more than one value if you use unions, and you can use up to 128-bit atomic ops. We have tested 128-bit ops on 32-bit Red Hat Linux and they are about the same speed as the 32- and 64-bit atomic ops.

You will have to figure out how to use this type of technique with your tree. A btree node has two pointers to child nodes; you can CAS them to change them. The balancing problem is tough. I can see how you could analyze a tree branch before you add something, make a copy of the branch from a certain point, and, when you finish changing the branch, CAS the new one in. This would be a problem for large branches. Maybe balancing can be done "later", when the threads are not fighting over the tree. Maybe you can make it so the tree is still searchable even though you haven't cascaded the rotation all the way; in other words, if thread A added a node and is recursively rotating nodes, thread B can still read or add nodes. Just some ideas. In some cases we make a structure that has version numbers or lock flags in the 32 bits after the 32 bits of pNext, and use 64-bit CAS then (sketched below). Maybe you could make the tree safe to read at all times without locks, but you might have to use the versioning technique on a branch that is being modified.
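A sketch of that pointer-plus-version trick, assuming 32-bit pointers so both halves fit in one 64-bit CAS (written with C11 atomics here; a 64-bit __sync_bool_compare_and_swap works the same way):

#include <stdatomic.h>
#include <stdint.h>

typedef struct _TList {
    int data;
    struct _TList *pNext;
} TList;

/* Pointer and version packed into one CAS-able 64-bit word. */
typedef union {
    uint64_t whole;
    struct {
        uint32_t ptr;       /* assumes 32-bit pointers */
        uint32_t version;   /* bumped on every change to defeat ABA */
    } parts;
} VersionedHead;

_Atomic uint64_t _pHeadV;   /* holds a VersionedHead.whole */

void push(TList *pNew)
{
    VersionedHead oldH, newH;

    oldH.whole = atomic_load(&_pHeadV);
    do {
        pNew->pNext = (TList *)(uintptr_t)oldH.parts.ptr;  /* chain to current head */
        newH.parts.ptr = (uint32_t)(uintptr_t)pNew;
        newH.parts.version = oldH.parts.version + 1;
        /* on failure, oldH.whole is reloaded with the current head value */
    } while (!atomic_compare_exchange_weak(&_pHeadV, &oldH.whole, newH.whole));
}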
Here are a bunch of posts I have made talking about the advantages of atomic ops:
Pthreads and mutexes; locking part of an array
Efficient and fast way for thread argument
Configuration auto reloading with pthreads
Advantages of using condition variables over mutex
single bit manipulation
Is memory allocation in linux non-blocking?
You might take a look at the pthread_mutexattr_setprioceiling function.
int pthread_mutexattr_setprioceiling(
    pthread_mutexattr_t *attr,
    int prioceiling,
    int *oldceiling
);
From the documentation:
pthread_mutexattr_setprioceiling(3THR) sets the priority ceiling attribute of a mutex attribute object.
attr points to a mutex attribute object created by an earlier call to pthread_mutexattr_init().
prioceiling specifies the priority ceiling of initialized mutexes. The ceiling defines the minimum priority level at which the critical section guarded by the mutex is executed. prioceiling will be within the maximum range of priorities defined by SCHED_FIFO. To avoid priority inversion, prioceiling will be set to a priority higher than or equal to the highest priority of all the threads that might lock the particular mutex.
oldceiling contains the old priority ceiling value.
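For what it's worth, POSIX specifies pthread_mutexattr_setprioceiling() with two arguments (attr, prioceiling); the three-argument form above comes from older vendor documentation. A minimal setup sketch (error checking elided; the ceiling value is illustrative):

#include <pthread.h>

pthread_mutex_t fifo_mutex;

void init_prioceiling_mutex(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    /* Priority ceilings require the PRIO_PROTECT protocol. */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);
    pthread_mutexattr_setprioceiling(&attr, 50 /* illustrative */);
    pthread_mutex_init(&fifo_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
}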
