Spin Lock Implementations (OSSpinLock) in C

I am just starting to look into multi-threaded programming and thread safety. I am familiar with busy-waiting and after a bit of research I am now familiar with the theory behind spin locks, so I thought I would have a look at OSSpinLock's implementation on the Mac. It boils down to the following function (defined in objc-os.h):
static inline void ARRSpinLockLock(ARRSpinLock *l)
{
again:
    /* ... Busy-waiting ... */
    thread_switch(THREAD_NULL, SWITCH_OPTION_DEPRESS, 1);
    goto again;
}
(Full implementation here)
After doing a bit of digging, I now have an approximate idea of what thread_switch's parameters do (this site is where I found it). My interpretation of what I have read is that this particular call to thread_switch will switch to the next available thread, and decrease the current thread's priority to an absolute minimum for 1 cycle. 'Eventually' (in CPU time) this thread will become active again and immediately execute the goto again; instruction which starts the busy-waiting all over again.
My question though, is why is this call actually necessary? I found another implementation of a spin-lock (for Windows this time) here and it doesn't include a (Windows-equivalent) thread switching call at all.

You can implement a spin lock in many different ways. If you find another spin lock implementation for Windows you'll see a different algorithm (it may involve SetThreadPriority, Sleep or SwitchToThread).
The default implementation of ARRSpinLockLock is clever enough: after a first spinning cycle it "depresses" the thread priority for a while, which has the following advantages:
it gives more opportunities to the thread that owns the lock to release it;
it wastes less CPU time (and power!) performing NOP or PAUSE.
The Windows implementation you found doesn't do this because the Windows API doesn't offer quite the same opportunity (there is no direct equivalent of thread_switch(), and multiple calls to SetThreadPriority could be less efficient).
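For a sense of what a yielding spin lock could look like on Windows, here is a minimal sketch (not taken from any particular library; the lock type, function names and spin count are illustrative) that spins briefly and then gives up its timeslice with SwitchToThread():
#include <windows.h>

typedef volatile LONG WinSpinLock;

static void win_spin_lock(WinSpinLock *l)
{
    for (;;) {
        // Try to take the lock: swap in 1, we got it if it was 0 before.
        if (InterlockedExchange(l, 1) == 0)
            return;

        // Spin briefly, hoping the owner releases the lock soon.
        int i;
        for (i = 0; i < 1000 && *l != 0; i++)
            YieldProcessor();          // emits a PAUSE-like hint

        // Still busy: give up the rest of our timeslice and retry.
        if (*l != 0 && !SwitchToThread())
            Sleep(0);
    }
}

static void win_spin_unlock(WinSpinLock *l)
{
    InterlockedExchange(l, 0);         // release the lock
}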

I actually don't think they're that different. In the first case:
static inline void ARRSpinLockLock(ARRSpinLock *l)
{
    unsigned y;
again:
    if (__builtin_expect(__sync_lock_test_and_set(l, 1), 0) == 0) {
        return;
    }
    for (y = 1000; y; y--) {
#if defined(__i386__) || defined(__x86_64__)
        asm("pause");
#endif
        if (*l == 0) goto again;
    }
    thread_switch(THREAD_NULL, SWITCH_OPTION_DEPRESS, 1);
    goto again;
}
We try to acquire the lock. If that fails, we spin in the for loop; if the lock becomes available in the meantime we immediately try to reacquire it, and if not we relinquish the CPU.
In the other case:
inline void Enter(void)
{
    int prev_s;
    do
    {
        prev_s = TestAndSet(&m_s, 0);
        if (m_s == 0 && prev_s == 1)
        {
            break;
        }
        // relinquish current timeslice (can only
        // be used when OS available and
        // we do NOT want to 'spin')
        // HWSleep(0);
    }
    while (true);
}
Note the comment below the if, which actually says that we could either spin or relinquish the CPU if the OS gives us that option. In fact the second example seems to just leave that part up to the programmer [insert your preferred way of continuing the code here], so in a sense it's not a complete implementation like the first one.
My take on the whole thing, and I'm commenting on the first snippet, is that they're trying to achieve a balance between being able to get the lock fast (within 1000 iterations) and not hogging the CPU too much (hence we eventually switch if the lock does not become available).

Related

What is a busy loop in C?

I am studying how to write a shell in C, and I have come across a method that uses a "busy loop around the sleep function when implementing the wait command". The loop used is a while(1) loop. I suppose it loops unconditionally, and hence takes up some processing time and space? What exactly is the purpose of a busy loop? Also, if the only objective of a busy loop is to loop unconditionally, couldn't we use any other form of loop, like for(;;), instead of while(1)?
A busy loop is a loop which purposely wastes time waiting for something to happen. Normally, you would want to avoid busy loops at all costs, as they consume CPU time doing nothing and therefore are a waste of resources, but there are rare cases in which they might be needed.
One of those cases is indeed when you need to sleep for a long amount of time and you have things like signal handlers installed that could interrupt sleeping. However, a "sleep busy loop" is hardly a busy loop at all, since almost all the time is spent sleeping.
You can build a busy loop with any loop construct you prefer, after all for, while, do ... while and goto are all interchangeable constructs in C given the appropriate control code.
Here's an example using clock_nanosleep:
// I want to sleep for 10 seconds, but I cannot do it with a single
// syscall as it might get interrupted; I need to keep requesting to
// sleep until the entire 10 seconds have elapsed.
struct timespec requested = { .tv_sec = 10, .tv_nsec = 0 };
struct timespec remaining;
int err;

for (;;) {
    err = clock_nanosleep(CLOCK_MONOTONIC, 0, &requested, &remaining);
    if (err == 0) {
        // We're done sleeping
        break;
    }
    if (err != EINTR) {
        // Some error occurred, check the value of err
        // Handle err somehow
        break;
    }
    // err == EINTR, we did not finish sleeping all the requested time
    // Just keep going...
    requested = remaining;
}
An actual busy loop would look something like the following, where var is supposedly some sort of atomic variable set by somebody else (e.g. another thread):
while (var != 1);

// or equivalent

while (1) {
    if (var == 1)
        break;
}
Needless to say, this is the kind of loop that you want to avoid as it is continuously checking for a condition wasting CPU. A better implementation would be to use signals, pthread condition variables, semaphores, etc. There are usually plenty of different ways to avoid busy looping.
Finally, note that in the above case, as #einpoklum says in the comments, the compiler may "optimize" the entire loop body away by dropping the check for var, unless it has some idea that it might change. A volatile qualifier can help, but it really depends on the scenario, don't take the above code as anything other than a silly example.
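As a concrete illustration of the condition-variable alternative mentioned above, here is a minimal sketch (the variable and function names are made up for the example): instead of spinning, the waiting thread sleeps inside pthread_cond_wait() until the setter wakes it up.
#include <pthread.h>

// Shared state, protected by the mutex.
static int var = 0;
static pthread_mutex_t var_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  var_cond  = PTHREAD_COND_INITIALIZER;

// Called by the waiting thread instead of busy looping.
void wait_for_var(void)
{
    pthread_mutex_lock(&var_mutex);
    while (var != 1)                        // re-check after every wakeup
        pthread_cond_wait(&var_cond, &var_mutex);
    pthread_mutex_unlock(&var_mutex);
}

// Called by the thread that sets the value.
void set_var(void)
{
    pthread_mutex_lock(&var_mutex);
    var = 1;
    pthread_cond_signal(&var_cond);         // wake one waiter
    pthread_mutex_unlock(&var_mutex);
}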

Signalling to threads waiting at a lock that the lock has become irrelevant

I have a hash table implementation in C where each location in the table is a linked list (to handle collisions). These linked lists are inherently thread safe and so no additional thread-safe code needs to be written at the hash table level if the table is a constant size - the hash table is thread-safe.
However, I would like the hash table to dynamically expand as values were added so as to maintain a reasonable access time. For the table to expand though, it needs additional thread-safety.
For the purposes of this question, procedures which can safely occur concurrently are 'benign' and the table resizing procedure (which cannot occur concurrently) is 'critical'. Threads currently using the list are known as 'users'.
My first solution to this was to add 'preamble' and 'postamble' code to all the critical functions, which locks a mutex and then waits until there are no current users before proceeding. Then I added preamble and postamble code to the benign functions to check whether a critical function is waiting, and if so to wait at the same mutex until the critical section is done.
In pseudocode the pre/post-amble functions SHOULD look like:
benignPreamble(table) {
    if (table->criticalIsRunning) {
        waitUntilSignal;
    }
    incrementUserCount(table);
}

benignPostamble(table) {
    decrementUserCount(table);
}

criticalPreamble(table) {
    table->criticalIsRunning = YES;
    waitUntilZero(table->users);
}

criticalPostamble(table) {
    table->criticalIsRunning = NO;
    signalCriticalDone();
}
My actual code is shown at the bottom of this question and uses (perhaps unnecessarily) caf's PriorityLock from this SO question. My implementation, quite frankly, smells awful. What is a better way to handle this situation? At the moment I'm looking for a way to signal to a mutex that it is irrelevant and 'unlock all waiting threads' simultaneously, but I keep thinking there must be a simpler way. I am trying to code it in such a way that any thread-safety mechanisms are 'ignored' if the critical process is not running.
Current Code
void startBenign(HashTable *table) {
    // Ignores lock if critical process can't be running (users >= 1)
    if (table->users == 0) {
        // Blocks if critical process is running
        PriorityLockLockLow(&(table->lock));
        PriorityLockUnlockLow(&(table->lock));
    }
    __sync_add_and_fetch(&(table->users), 1);
}

void endBenign(HashTable *table) {
    // Decrement user count (baseline is 1)
    __sync_sub_and_fetch(&(table->users), 1);
}

int startCritical(HashTable *table) {
    // Get the lock
    PriorityLockLockHigh(&(table->lock));
    // Decrement user count BELOW baseline (1) to hit zero eventually
    __sync_sub_and_fetch(&(table->users), 1);
    // Wait for all concurrent threads to finish
    while (table->users != 0) {
        usleep(1);
    }
    // Once we have zero users (any new ones will be
    // held at the lock) we can proceed.
    return 0;
}

void endCritical(HashTable *table) {
    // Increment back to baseline of 1
    __sync_add_and_fetch(&(table->users), 1);
    // Unlock
    PriorityLockUnlockHigh(&(table->lock));
}
It looks like you're trying to reinvent the reader-writer lock, which I believe pthreads provides as a primitive. Have you tried using that?
More specifically, your benign functions should be taking a "reader" lock, while your critical functions need a "writer" lock. The end result will be that as many benign functions can execute as desired, but when a critical function starts executing it will wait until no benign functions are in process, and will block additional benign functions until it has finished. I think this is what you want.
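A rough sketch of how the benign/critical split might map onto pthread_rwlock_t (assuming the table gains an rwlock member in place of the PriorityLock and user count; names are illustrative):
#include <pthread.h>

typedef struct {
    pthread_rwlock_t rwlock;
    /* ... buckets, size, etc. ... */
} HashTable;

// "Benign" operations take a read lock: any number can run concurrently.
void startBenign(HashTable *table) {
    pthread_rwlock_rdlock(&table->rwlock);
}

void endBenign(HashTable *table) {
    pthread_rwlock_unlock(&table->rwlock);
}

// The "critical" resize takes a write lock: it waits for all readers to
// finish and blocks new ones until the resize is done.
void startCritical(HashTable *table) {
    pthread_rwlock_wrlock(&table->rwlock);
}

void endCritical(HashTable *table) {
    pthread_rwlock_unlock(&table->rwlock);
}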

Any problems in this rwlock implementation?

I just implemented a reader-writer lock in C. I want to limit the number of readers, so I use 'num' to count them. I'm not sure whether this implementation has potential data races or deadlock conditions, so could you help me figure them out, please?
Another question: can I remove the 'spin_lock' in struct _rwlock in some way? Thanks!
#define MAX_READER 16

typedef struct _rwlock *rwlock;
struct _rwlock {
    spin_lock lk;
    uint32_t num;
};

void wr_lock(rwlock lock) {
    while (1) {
        if (lock->num > 0) continue;
        lock(lock->lk);
        lock->num += MAX_READER;
        return;
    }
}

void wr_unlock(rwlock lock) {
    lock->num -= MAX_READER;
    unlock(lock->lk);
}

void rd_lock(rwlock lock) {
    while (1) {
        if (lock->num >= MAX_READER) continue;
        atom_inc(num);
        return;
    }
}

void rd_unlock(rwlock lock) {
    atom_dec(num);
}
Short answer: Yes, there are severe issues here. I don't know what synchronization library you are using, but you are not protecting access to shared data and you will waste tons of CPU cycles on your loops in rd_lock() and wr_lock(). Spin locks should be avoided in virtually all cases (there are exceptions though).
In wr_lock (and similar in rd_lock):
while (1){
if (lock->num > 0) continue;
This is wrong. If you don't somehow synchronize, you aren't guaranteed to see changes from other threads. If this were the only problem you could perhaps acquire the lock and then check the count.
In rd_lock:
atom_inc(num);
This doesn't play well with the non-atomic += and -= in the writer functions, because it can interrupt them. Same for the decrement in rd_unlock.
rd_lock can return while a thread holds the lock as writer -- this isn't the usual semantics of a reader-writer lock, and it means that whatever your rw-lock is supposed to protect, it will not protect it.
If you are using pthreads, then it already has a rwlock. On Windows consider SRWlocks (never used 'em myself). For portable code, build your rwlock using a condition variable (or maybe two -- one for readers and one for writers). That is, insofar as multi-threaded code in C can be portable. C11 has a condition variable, and if there's a pre-C11 threads implementation out there that doesn't, I don't want to have to use it ;-)
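For the portable route, here is a minimal sketch of a reader-writer lock built from one mutex and one condition variable (names are illustrative; a production version would want error checking and some writer-priority logic so that a steady stream of readers cannot starve writers):
#include <pthread.h>

typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t  cond;
    int readers;        // number of active readers
    int writer;         // 1 while a writer holds the lock
} my_rwlock;

void my_rwlock_init(my_rwlock *rw) {
    pthread_mutex_init(&rw->mtx, NULL);
    pthread_cond_init(&rw->cond, NULL);
    rw->readers = 0;
    rw->writer = 0;
}

void my_rd_lock(my_rwlock *rw) {
    pthread_mutex_lock(&rw->mtx);
    while (rw->writer)                      // wait while a writer is active
        pthread_cond_wait(&rw->cond, &rw->mtx);
    rw->readers++;
    pthread_mutex_unlock(&rw->mtx);
}

void my_rd_unlock(my_rwlock *rw) {
    pthread_mutex_lock(&rw->mtx);
    if (--rw->readers == 0)
        pthread_cond_broadcast(&rw->cond);  // last reader out: wake writers
    pthread_mutex_unlock(&rw->mtx);
}

void my_wr_lock(my_rwlock *rw) {
    pthread_mutex_lock(&rw->mtx);
    while (rw->writer || rw->readers > 0)   // wait for exclusive access
        pthread_cond_wait(&rw->cond, &rw->mtx);
    rw->writer = 1;
    pthread_mutex_unlock(&rw->mtx);
}

void my_wr_unlock(my_rwlock *rw) {
    pthread_mutex_lock(&rw->mtx);
    rw->writer = 0;
    pthread_cond_broadcast(&rw->cond);      // wake waiting readers and writers
    pthread_mutex_unlock(&rw->mtx);
}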

Is mutex needed to synchronize a simple flag between pthreads?

Let's imagine that I have a few worker threads such as follows:
while (1) {
    do_something();
    if (flag_isset())
        do_something_else();
}
We have a couple of helper functions for checking and setting a flag:
void flag_set() { global_flag = 1; }
void flag_clear() { global_flag = 0; }
int flag_isset() { return global_flag; }
Thus the threads keep calling do_something() in a busy-loop and in case some other thread sets global_flag the thread also calls do_something_else() (which could for example output progress or debugging information when requested by setting the flag from another thread).
My question is: Do I need to do something special to synchronize access to the global_flag? If yes, what exactly is the minimum work to do the synchronization in a portable way?
I have tried to figure this out by reading many articles but I am still not quite sure of the correct answer... I think it is one of the following:
A: No need to synchronize because setting or clearing the flag does not create race conditions:
We just need to define the flag as volatile to make sure that it is really read from the shared memory every time it is being checked:
volatile int global_flag;
It might not propagate to other CPU cores immediately but will sooner or later, guaranteed.
B: Full synchronization is needed to make sure that changes to the flag are propagated between threads:
Setting the shared flag in one CPU core does not necessarily make it seen by another core. We need to use a mutex to make sure that flag changes are always propagated by invalidating the corresponding cache lines on other CPUs. The code becomes as follows:
volatile int global_flag;
pthread_mutex_t flag_mutex;

void flag_set() { pthread_mutex_lock(&flag_mutex); global_flag = 1; pthread_mutex_unlock(&flag_mutex); }
void flag_clear() { pthread_mutex_lock(&flag_mutex); global_flag = 0; pthread_mutex_unlock(&flag_mutex); }

int flag_isset()
{
    int rc;
    pthread_mutex_lock(&flag_mutex);
    rc = global_flag;
    pthread_mutex_unlock(&flag_mutex);
    return rc;
}
C: Synchronization is needed to make sure that changes to the flag are propagated between threads:
This is the same as B, but instead of using a mutex on both sides (reader and writer) we take it only on the writing side, because the reading logic does not require mutual exclusion; we just need to synchronize (invalidate other caches) when the flag is changed:
volatile int global_flag;
pthread_mutex_t flag_mutex;

void flag_set() { pthread_mutex_lock(&flag_mutex); global_flag = 1; pthread_mutex_unlock(&flag_mutex); }
void flag_clear() { pthread_mutex_lock(&flag_mutex); global_flag = 0; pthread_mutex_unlock(&flag_mutex); }

int flag_isset() { return global_flag; }
This would avoid continuously locking and unlocking the mutex when we know that the flag is rarely changed. We are just using a side-effect of Pthreads mutexes to make sure that the change is propagated.
So, which one?
I think A and B are the obvious choices, B being safer. But how about C?
If C is ok, is there some other way of forcing the flag change to be visible on all CPUs?
There is one somewhat related question: Does guarding a variable with a pthread mutex guarantee it's also not cached? ...but it does not really answer this.
The 'minimum amount of work' is an explicit memory barrier. The syntax depends on your compiler; on GCC you could do:
void flag_set() {
    global_flag = 1;
    __sync_synchronize(global_flag);
}

void flag_clear() {
    global_flag = 0;
    __sync_synchronize(global_flag);
}

int flag_isset() {
    int val;
    // Prevent the read from migrating backwards
    __sync_synchronize(global_flag);
    val = global_flag;
    // and prevent it from being propagated forwards as well
    __sync_synchronize(global_flag);
    return val;
}
These memory barriers accomplish two important goals:
They force a compiler flush. Consider a loop like the following:
for (int i = 0; i < 1000000000; i++) {
    flag_set(); // assume this is inlined
    local_counter += i;
}
Without a barrier, a compiler might choose to optimize this to:
for (int i = 0; i < 1000000000; i++) {
    local_counter += i;
}
flag_set();
Inserting a barrier forces the compiler to write the variable back immediately.
They force the CPU to order its writes and reads. This is not so much an issue with a single flag - most CPU architectures will eventually see a flag that's set without CPU-level barriers. However the order might change. If we have two flags, and on thread A:
// start with only flag A set
flag_set_B();
flag_clear_A();
And on thread B:
a = flag_isset_A();
b = flag_isset_B();
assert(a || b); // can be false!
Some CPU architectures allow these writes to be reordered; you may see both flags being false (ie, the flag A write got moved first). This can be a problem if a flag protects, say, a pointer being valid. Memory barriers force an ordering on writes to protect against these problems.
Note also that on some CPUs, it's possible to use 'acquire-release' barrier semantics to further reduce overhead. Such a distinction does not exist on x86, however, and would require inline assembly on GCC.
A good overview of what memory barriers are and why they are needed can be found in the Linux kernel documentation directory. Finally, note that this code is enough for a single flag, but if you want to synchronize against any other values as well, you must tread very carefully. A lock is usually the simplest way to do things.
You must not cause data races. A data race is undefined behavior and the compiler is allowed to do anything and everything it pleases.
A humorous blog on the topic: http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
Case 1: There is no synchronization on the flag, so anything is allowed to happen. For example, the compiler is allowed to turn
flag_set();
while (weArentBoredLoopingYet())
    doSomethingVeryExpensive();
flag_clear();
into
while (weArentBoredLoopingYet())
    doSomethingVeryExpensive();
flag_set();
flag_clear();
Note: this kind of race is actually very popular. Your mileage may vary. On one hand, a de-facto implementation of pthread_once involves a data race like this. On the other hand, it is undefined behavior. On most versions of gcc you can get away with it, because gcc chooses not to exercise its right to optimize this way in many cases, but it is not "spec" code.
B: full synchronization is the right call. This is simply what you have to do.
C: Only synchronization on the writer could work, if you can prove that no one wants to read it while it is writing. The official definition of a data race (from the C++11 specification) is one thread writing to a variable while another thread can concurrently read or write the same variable. If your readers and writers all run at once, you still have a race case. However, if you can prove that the writer writes once, there is some synchronization, and then the readers all read, then the readers do not need synchronization.
As for caching, the rule is that a mutex lock/unlock synchronizes with all threads that lock/unlock the same mutex. This means you will not see any unusual caching effects (although under the hood, your processor can do spectacular things to make this run faster... it's just obliged to make it look like it wasn't doing anything special). If you don't synchronize, however, you get no guarantees that the other thread doesn't have changes to push that you need!
All of that being said, the question is really how much are you willing to rely on compiler specific behavior. If you want to write proper code, you need to do proper synchronization. If you are willing to rely on the compiler to be kind to you, you can get away with a lot less.
If you have C++11, the easy answer is to use atomic_flag, which is designed to do exactly what you want AND is designed to synchronize correctly for you in most cases.
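Since the question is about C rather than C++, here is what the same idea might look like with C11's <stdatomic.h> (a sketch assuming a C11 compiler; the default sequentially-consistent ordering gives the visibility and ordering guarantees discussed above):
#include <stdatomic.h>
#include <stdbool.h>

atomic_bool global_flag = ATOMIC_VAR_INIT(false);

void flag_set(void)   { atomic_store(&global_flag, true); }
void flag_clear(void) { atomic_store(&global_flag, false); }
int  flag_isset(void) { return atomic_load(&global_flag); }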
For the example you have posted, case A is sufficient provided that ...
Getting and setting the flag takes only one CPU instruction.
do_something_else() is not dependent upon the flag being set during the execution of that routine.
If getting and/or setting the flag takes more than one CPU instruction, then you must use some form of locking.
If do_something_else() is dependent upon the flag being set during the execution of that routine, then you must lock as in case C but the mutex must be locked before calling flag_isset().
Hope this helps.
Assigning incoming jobs to worker threads requires no locking. A typical example is a web server, where requests are caught by a main thread, and this main thread selects a worker. I'll try to explain it with some pseudo code.
main task {
    // do forever
    while (true) {
        // wait for a job
        x = null;
        while (x == null) {
            sleep(some);
            x = grabTheJob();
        }
        // select a worker
        bool found = false;
        for (n = 0; n < NUM_OF_WORKERS; n++) {
            if (workerList[n].getFlag() != AVAILABLE) continue;
            workerList[n].setJob(x);
            workerList[n].setFlag(DO_IT_PLS);
            found = true;
            break;
        }
        if (!found) panic("no free worker task! ouch!");
    } // while forever
} // main task

worker task {
    while (true) {
        while (getFlag() != DO_IT_PLS) sleep(some);
        setFlag(BUSY_DOING_THE_TASK);
        // do it really
        setFlag(AVAILABLE);
    } // while forever
} // worker task
So, if there is one flag which one party sets to A and the other sets to B and C (the main task sets it to DO_IT_PLS, and the worker sets it to BUSY_DOING_THE_TASK and AVAILABLE), there is no conflict. To play it out with a "real-life" example: a teacher gives different tasks to students. The teacher selects a student and gives him/her a task, then looks for the next available student. When a student is done, he/she returns to the pool of available students.
UPDATE: just to clarify, there is only one main() thread and several (a configurable number of) worker threads. As only one instance of main() runs, there is no need to synchronize the selection and launch of the workers.

Ways of implementing a timer in a worker thread in C

I have a worker thread that gets work from a pipe. Something like this:
void *worker(void *param) {
    while (!work_done) {
        read(g_workfds[0], work, sizeof(work));
        do_work(work);
    }
}
I need to implement a 1-second timer in the same thread to do some book-keeping about the work. The following is what I have in mind:
void *worker(void *param) {
    prev_uptime = get_uptime();
    while (!work_done) {
        // set g_workfds[0] as non-block
        now_uptime = get_uptime();
        if (now_uptime - prev_uptime > 1) {
            do_book_keeping();
            prev_uptime = now_uptime;
        }
        n = poll(g_workfds[0], 1000); // Wait for 1 second else timeout
        if (n == 0) // timed out
            continue;
        read(g_workfds[0], work, sizeof(work));
        do_work(work); // This can take more than 1 second also
    }
}
I am using system uptime instead of system time because system time can get changed while this thread is running. I was wondering if there is any other better way to do this. I don't want to consider using another thread. Using alarm() is not an option as it already used by another thread in same process. This is getting implemented in Linux environment.
I agree with most of what webbi wrote in his answer. But there is one issue with his suggestion of using time instead of uptime. If the system time is updated "forward" it will work as intended. But if the system time is set back by say 30 seconds, then there will be no book keeping done for 30 seconds as (now_time - prev_time) will be negative (unless an unsigned type is used, in which case it will work anyway).
An alternative would be to use clock_gettime() with CLOCK_MONOTONIC as clockid ( http://linux.die.net/man/2/clock_gettime ). A bit messy if you don't need smaller time units than seconds.
Also, adding code to detect a backwards clock jump isn't hard either.
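For illustration, a minimal sketch of the clock_gettime() approach (assuming a POSIX system; older glibc versions may need -lrt at link time):
#include <time.h>

// Monotonic seconds since some unspecified starting point;
// unaffected by changes to the system date.
static time_t monotonic_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec;
}

// In the worker loop, get_uptime() could be replaced with monotonic_seconds():
//   now_uptime = monotonic_seconds();
//   if (now_uptime - prev_uptime > 1) { do_book_keeping(); prev_uptime = now_uptime; }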
I have found a better way, but it is Linux-specific, using the timerfd_create() system call. It takes care of system time changes. The following is possible pseudo code:
void *worker(void *param) {
    int timerfd = timerfd_create(CLOCK_MONOTONIC, 0); // monotonic, so not affected by system time changes
    // set timerfd to non-block
    timerfd_settime(timerfd, 1 second timer); // timer starts
    while (!work_done) {
        // set g_workfds[0] as non-block
        n = poll(g_workfds[0] and timerfd, -1); // poll on both pipe and timerfd and wait indefinitely
        if (timerfd is readable)
            do_book_keeping();
        if (g_workfds[0] is readable) {
            read(g_workfds[0], work, sizeof(work));
            do_work(work); // This can take more than 1 second also
        }
    }
}
It seems cleaner, and read() on the timerfd returns the number of timer expirations that have occurred, which is quite useful if do_work() takes a long time, since do_book_keeping() expects to get called every second.
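A more concrete sketch of that idea (Linux-specific; error handling omitted, and the globals g_workfds, work, work_done and the helper functions are assumed to be the ones from the question):
#include <poll.h>
#include <stdint.h>
#include <sys/timerfd.h>
#include <unistd.h>

void *worker(void *param)
{
    // Monotonic clock: not affected by changes to the system date.
    int timerfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);

    // Arm a periodic 1-second timer.
    struct itimerspec its = {
        .it_interval = { .tv_sec = 1, .tv_nsec = 0 },
        .it_value    = { .tv_sec = 1, .tv_nsec = 0 },
    };
    timerfd_settime(timerfd, 0, &its, NULL);

    struct pollfd fds[2] = {
        { .fd = g_workfds[0], .events = POLLIN },
        { .fd = timerfd,      .events = POLLIN },
    };

    while (!work_done) {
        if (poll(fds, 2, -1) <= 0)          // wait for work or the timer
            continue;

        if (fds[1].revents & POLLIN) {
            uint64_t expirations;           // number of ticks since last read
            read(timerfd, &expirations, sizeof(expirations));
            do_book_keeping();
        }
        if (fds[0].revents & POLLIN) {
            read(g_workfds[0], work, sizeof(work));
            do_work(work);                  // may take more than 1 second
        }
    }
    close(timerfd);
    return NULL;
}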
I found some things weird in your code...
poll() takes 3 arguments and you are passing 2; the second argument is the number of entries in the struct array passed as the first parameter, and the third is the timeout.
Reference: http://linux.die.net/man/2/poll
Besides that, the workaround looks fine to me; it's not the best of course, but it's fine without involving another thread or alarm(), etc.
You could use time instead of uptime; it might cause one error if the system date gets changed, but after that it would continue working, as the value gets updated and it will keep waiting for 1 second no matter what the time is.
