Am I using mutex_trylock correctly?

The racers should have an equal chance of winning. When I run the program the results seem correct: both racers win about half the time. But I don't think I am using mutex_trylock correctly. Is it actually doing anything the way I've implemented it? I am new to C, so I don't know a lot about this.
Program Description:
We assume two racers at two diagonally opposite corners of a rectangular region. They have to traverse the roads along the periphery of the region. There are two bridges on two opposite sides of the rectangle. To complete one round of traversal, a racer has to hold the passes for both bridges at the same time. The conditions of the race are:
1) Only one racer can hold a pass at a time.
2) Before starting a round, a racer has to request and get both passes; after finishing that round he has to release them and try anew to get those passes for the next round.
3) Racer 1 (R1) will acquire bridge pass B1 first, then B0. R0 will acquire B0 first, then B1.
4) There is a prefixed maximum number of rounds. Whoever reaches that number first is the winner, and the race stops.
This is how the situation looks before starting.
              B0
 R0 -------- ~ -------------
 |                         |
 |                         |
 |                         |
 |                         |
 --------- ~ ------------- R1
              B1
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>

#define THREAD_NUM 2
#define MAX_ROUNDS 200000
#define TRUE 1
#define FALSE 0

/* mutex locks for each bridge */
pthread_mutex_t B0, B1;
/* racer ID */
int r[THREAD_NUM] = {0, 1};
/* number of rounds completed by each racer */
int numRounds[THREAD_NUM] = {0, 0};

void *racer(void *); /* prototype of racer routine */

int main()
{
    pthread_t tid[THREAD_NUM];
    void *status;
    int i, j;

    /* create 2 threads representing 2 racers */
    for (i = 0; i < THREAD_NUM; i++)
    {
        /* Your code here */
        pthread_create(&tid[i], NULL, racer, &r[i]);
    }

    /* wait for the join of 2 threads */
    for (i = 0; i < THREAD_NUM; i++)
    {
        /* Your code here */
        pthread_join(tid[i], &status);
    }

    printf("\n");
    for (i = 0; i < THREAD_NUM; i++)
        printf("Racer %d finished %d rounds!!\n", i, numRounds[i]);

    if (numRounds[0] >= numRounds[1]) printf("\n RACER-0 WINS.\n\n");
    else printf("\n RACER-1 WINS..\n\n");

    return (0);
}

void *racer(void *arg)
{
    int index = *(int *)arg, NotYet;

    while ((numRounds[0] < MAX_ROUNDS) && (numRounds[1] < MAX_ROUNDS))
    {
        NotYet = TRUE;

        /* RACER 0 tries to get both locks before she makes a round */
        if (index == 0) {
            /* Your code here */
            pthread_mutex_trylock(&B0);
            pthread_mutex_trylock(&B1);
        }

        /* RACER 1 tries to get both locks before she makes a round */
        if (index == 1) {
            /* Your code here */
            pthread_mutex_trylock(&B1);
            pthread_mutex_trylock(&B0);
        }

        numRounds[index]++; /* Make one more round */

        /* unlock both locks */
        pthread_mutex_unlock(&B0);
        pthread_mutex_unlock(&B1);

        /* random yield to another thread */
    }

    printf("racer %d made %d rounds !\n", index, numRounds[index]);
    pthread_exit(0);
}

No, it isn't doing what you want: pthread_mutex_trylock returns 0 on success and a nonzero error code when the mutex is already held, and your code never checks that return value, so a racer counts a round whether or not it actually got the passes. Note that a plain pthread_mutex_lock for both mutexes would deadlock here: when the first thread locks B0 and the second gets scheduled and locks B1, each then blocks forever on the other's mutex. The correct trylock pattern is: if the first mutex is locked but the second cannot be acquired, release the first mutex and loop again. The loop can be made tighter by taking the first mutex with a blocking pthread_mutex_lock and using trylock only for the second.
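A minimal sketch of that pattern for racer 0 (untested; racer 1 swaps B1 and B0, and it assumes B0 and B1 were initialized, e.g. with PTHREAD_MUTEX_INITIALIZER):

/* Block on the first pass; only trylock the second. */
for (;;) {
    pthread_mutex_lock(&B0);                /* blocking lock on the first pass */
    if (pthread_mutex_trylock(&B1) == 0)    /* 0 means we acquired it */
        break;                              /* now holding both passes */
    pthread_mutex_unlock(&B0);              /* didn't get B1: back off and retry */
    sched_yield();                          /* give the other racer a chance; needs <sched.h> */
}

numRounds[index]++;   /* make one round while holding both passes */

pthread_mutex_unlock(&B1);
pthread_mutex_unlock(&B0);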

Related

Why does my program print 14000000 instead of 10000000 using threads?

I wrote a simple C program to make every thread multiply its index by 1000000 and add it to sum. I created 5 threads, so the logical answer would be (0+1+2+3+4)*1000000, which is 10000000, but it prints 14000000 instead. Could anyone help me understand this?
#include <pthread.h>
#include <stdio.h>

typedef struct argument {
    int index;
    int sum;
} arg;

void *fonction(void *arg0) {
    ((arg *) arg0)->sum += ((arg *) arg0)->index * 1000000;
    return NULL;
}

int main() {
    pthread_t thread[5];
    int order[5];
    arg a;

    for (int i = 0; i < 5; i++)
        order[i] = i;
    a.sum = 0;

    for (int i = 0; i < 5; i++) {
        a.index = order[i];
        pthread_create(&thread[i], NULL, fonction, &a);
    }
    for (int i = 0; i < 5; i++)
        pthread_join(thread[i], NULL);

    printf("%d\n", a.sum);
    return 0;
}
It is 14000000 because the behavior is undefined. The results will differ on different machines and under other environmental factors. The undefined behavior is caused by all threads accessing the same object (see the &a given to each thread), which is modified after the first thread is created.
When each thread runs, it accesses the same index (as part of accessing a member of the same object &a). Thus the assumption that the threads will see [0,1,2,3,4] is incorrect: multiple threads likely see the same value of index (e.g. [0,2,4,4,4]¹) when they run. This depends on how the thread-creating loop is scheduled, since it also modifies the shared object.
When each thread updates sum, it has to read and write the same shared memory. This is inherently prone to race conditions and unreliable results. For example, it could be a lack of memory visibility (thread X doesn't see the value updated by thread Y), or it could be a conflicting thread schedule between the read and the write (thread X reads, thread Y reads, thread X writes, thread Y writes), etc.
If you create a new arg object for each thread, both of these problems are avoided. While the sum issue could be fixed with appropriate locking, the index issue can only be fixed by not sharing the object given as the thread input.
// create 5 arg objects, one for each thread
arg a[5];
for (..) {
    a[i].index = i;
    // give a DIFFERENT object to each thread
    pthread_create(.., &a[i]);
}

// after all threads complete
int sum = 0;
for (..) {
    sum += a[i].sum;
}
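For completeness, here is one way the whole fixed program might look; this is an untested sketch that keeps the question's names, gives each thread its own arg object, and sums the per-thread results after joining:

#include <pthread.h>
#include <stdio.h>

typedef struct argument {
    int index;
    int sum;
} arg;

void *fonction(void *arg0) {
    arg *a = arg0;
    a->sum = a->index * 1000000;   /* each thread touches only its own object */
    return NULL;
}

int main(void) {
    pthread_t thread[5];
    arg a[5];

    for (int i = 0; i < 5; i++) {
        a[i].index = i;
        a[i].sum = 0;
        pthread_create(&thread[i], NULL, fonction, &a[i]);  /* different object per thread */
    }

    int sum = 0;
    for (int i = 0; i < 5; i++) {
        pthread_join(thread[i], NULL);   /* join synchronizes with the thread's writes */
        sum += a[i].sum;
    }
    printf("%d\n", sum);   /* always 10000000 */
    return 0;
}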
¹ Even assuming there is no race condition in this particular execution w.r.t. the usage of sum, a sequence in which the threads see the index values [0,2,4,4,4] (which sum to 14, hence 14000000) might look as follows:
a.index <- 0 ; create thread A
thread A reads a.index (0)
a.index <- 1 ; create thread B
a.index <- 2 ; create thread C
thread B reads a.index (2)
a.index <- 3 ; create thread D
a.index <- 4 ; create thread E
thread D reads a.index (4)
thread C reads a.index (4)
thread E reads a.index (4)

Multithreading with mutexes in C and running one thread at a time

I have an array of 100 requests (integers). I want to create 4 threads that all call a function (thread_function), and with this function I want the threads to take the requests one by one:
(thread0->request0,
thread1->request1,
thread2->request2,
thread3->request3,
and then thread0->request4, etc., up to 100), all of this using mutexes.
Here is the code I have written so far:
threadRes = pthread_create(&(threadID[i]), NULL,thread_function, (void *)id_size);
This is inside my main, and it is in a loop that runs 4 times. Now outside my main:
void *thread_function(void *arg) {
    int *val_p = (int *) arg;
    for (i = 0; i < 200; i = i + 2)
    {
        f = false;
        for (j = 0; j < 100; j++)
        {
            if (val_p[i] == cache[j].id)
                f = true;
        }
        if (f == true)
        {
            printf("The request %d has been served.\n", val_p[i]);
        }
        else
        {
            cache[k].id = val_p[i];
            printf("\nCurrent request to be served:%d \n", cache[k].id);
            k++;
        }
    }
}
Here val_p is the array with the requests, and cache is an array of structs that stores the ids (requests).
So now I want mutexes to synchronize my threads. I considered using this inside my main:
pthread_join(threadID[0], NULL);
pthread_join(threadID[1], NULL);
pthread_join(threadID[2], NULL);
pthread_join(threadID[3], NULL);
pthread_mutex_destroy(&mutex);
and inside the function to use:
pthread_mutex_lock(&mutex);
pthread_mutex_unlock(&mutex);
Before I finish, I would like to say that so far my program's result is that the 4 threads serve 100 requests each (400 total), and what I want to achieve is that the 4 threads serve 100 requests total.
Thanks for your time.
You need to use a loop that looks like this:
1. Acquire the lock.
2. See if there's any work to be done. If not, release the lock and terminate.
3. Mark the work that we're going to do as no longer needing to be done.
4. Release the lock.
5. Do the work.
6. (If necessary) Acquire the lock. Mark the work done and/or report results. Release the lock.
7. Go to step 1.
Notice how while holding the lock, the thread discovers what work it should do and then prevents any other thread from taking the same assignment before it releases the lock. Note also that the lock is not held while doing the work so that multiple threads can work concurrently.
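In your case the "work" is the next unserved index in the request array. A sketch of that loop (untested; the globals are illustrative, and serve_request is a hypothetical stand-in for your cache-lookup code):

#include <pthread.h>

/* Shared state, guarded by `mutex`. */
pthread_mutex_t mutex = pthread_mutex_initializer_or_init_in_main; /* e.g. PTHREAD_MUTEX_INITIALIZER */
int requests[100];          /* the 100 request ids            */
int next_request = 0;       /* index of the next unserved one */

void serve_request(int id); /* hypothetical: your cache logic */

void *thread_function(void *arg) {
    (void) arg;
    for (;;) {
        /* Steps 1-4: take the lock, claim one request, release. */
        pthread_mutex_lock(&mutex);
        if (next_request >= 100) {        /* no work left: terminate */
            pthread_mutex_unlock(&mutex);
            break;
        }
        int mine = next_request++;        /* claimed; no other thread can take it */
        pthread_mutex_unlock(&mutex);

        /* Step 5: do the work outside the lock. */
        serve_request(requests[mine]);
    }
    return NULL;
}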
You may want to post more of your code. How the arrays are set up, how the segment is passed to the individual threads, etc.
Note that using printf will perturb the timing of the threads. It does its own mutexing for access to stdout, so it's probably better to no-op it. Or have a set of per-thread log files so the printf calls don't block against one another.
Also, in your thread loop, once you set f to true, you can issue a break as there's no need to scan further.
val_p[i] is loop invariant, so we can fetch that just once at the start of the i loop.
We don't see k and cache, but you'd need to mutex wrap the code that sets these values.
But, that does not protect against races in the for loop. You'd have to wrap the fetch of cache[j].id in a mutex pair inside the loop. You might be okay without the mutex inside the loop on some arches that have good cache snooping (e.g. x86).
You might be better off using stdatomic.h primitives. Here's a version that illustrates that. It compiles but I've not tested it:
#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

_Atomic int k;   // _Atomic so the C11 atomic_* generics are well-defined

#define true  1
#define false 0

struct cache {
    _Atomic int id;
};

struct cache cache[100];

#ifdef DEBUG
#define dbgprt(_fmt...) \
    printf(_fmt)
#else
#define dbgprt(_fmt...) \
    do { } while (0)
#endif

void *
thread_function(void *arg)
{
    int *val_p = arg;
    int i;
    int j;
    int cval;
    _Atomic int *cptr;

    for (i = 0; i < 200; i += 2) {
        int pval = val_p[i];
        int f = false;

        // decide if request has already been served
        for (j = 0; j < 100; j++) {
            cptr = &cache[j].id;
            cval = atomic_load(cptr);
            if (cval == pval) {
                f = true;
                break;
            }
        }
        if (f == true) {
            dbgprt("The request %d has been served.\n", pval);
            continue;
        }

        // increment the global k value [atomically]
        int kold = atomic_load(&k);
        int knew;
        while (1) {
            knew = kold + 1;
            if (atomic_compare_exchange_strong(&k, &kold, knew))
                break;
        }

        // get current cache value
        cptr = &cache[kold].id;
        int oldval = atomic_load(cptr);

        // mark the cache
        // this should never loop because we atomically got our slot with
        // the k value
        while (1) {
            if (atomic_compare_exchange_strong(cptr, &oldval, pval))
                break;
        }
        dbgprt("\nCurrent request to be served:%d\n", pval);
    }

    return (void *) 0;
}
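As a usage note: the dbgprt output above is compiled in only when building with -DDEBUG; otherwise the macro expands to a no-op, which also avoids the printf serialization mentioned earlier.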

Allocate the resource of some threads to a global resource such that there is no concurrency (using only mutexes)

The summary of the problem is the following: given a global resource of size N and M threads each with a resource requirement Xi (i=1..M), synchronize the threads such that a thread is allocated its resources, does its work, and is then deallocated.
The main problem is when there are no resources available and the thread has to wait until enough memory is free. I tried to "block" it with a while statement, but I realized that two threads can pass the while loop together, and the first to be allocated can change the global resource such that the second thread no longer has enough space, even though it has already passed the conditional section.
//piece of pseudocode
...
int MAXRES = 100;

// in thread function
{
    while (thread_res > MAXRES);   // busy-wait; two threads can pass this together

    lock();
    allocate_res();   // MAXRES -= thread_res;
    unlock();

    // do own stuff

    lock();
    deallocate();     // MAXRES += thread_res;
    unlock();
}
To make a robust solution you need something more. As you noted, you need a way to wait for the condition "there are enough resources available for me", and a plain mutex has no such mechanism. In any case, you likely want to bury that in the allocator, so your mainline thread code looks like:
....
n = Allocate()
do stuff
Release(n)
....
and deal with the contention:
int navail = N;

int Acquire(int n) {
    lock();
    while (navail < n) {   /* spin until enough resources are free */
        unlock();
        lock();
    }
    navail -= n;
    unlock();
    return n;
}

void Release(int n) {
    lock();
    navail += n;
    unlock();
}
But such a spin-waiting system may fail because of priorities: if the highest-priority thread is spinning, a thread trying to Release may never get to run. Spinning isn't very elegant and wastes energy. You could put a short nap in the spin, but if you make the nap too short it still wastes energy, and too long it increases latency.
You really want a signalling primitive like semaphores or condvars rather than a locking primitive. With a semaphore, it could look as simple as:
Semaphore *sem;

int Acquire(int n) {
    senter(sem, n);
    return n;
}

int Release(int n) {
    sleave(sem, n);
    return n;
}

void init(int N) {
    sem = screate(N);
}
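If you stay within plain pthreads, which has no counted senter/sleave-style semaphore, the same Acquire/Release pair can be sketched with a mutex plus a condition variable; this is untested, with navail and N as above:

#include <pthread.h>

static int navail;   /* resources currently free, set to N by init() */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t freed = PTHREAD_COND_INITIALIZER;

void init(int N) { navail = N; }

int Acquire(int n) {
    pthread_mutex_lock(&m);
    while (navail < n)                   /* sleep instead of spinning */
        pthread_cond_wait(&freed, &m);   /* atomically releases m while waiting */
    navail -= n;
    pthread_mutex_unlock(&m);
    return n;
}

void Release(int n) {
    pthread_mutex_lock(&m);
    navail += n;
    pthread_cond_broadcast(&freed);      /* broadcast: waiters need different amounts */
    pthread_mutex_unlock(&m);
}

Unlike the spin version, waiters consume no CPU, and the broadcast wakes every waiter so that whichever requests now fit can proceed.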
Update: Revised to use System V semaphores, which provide the ability to specify an arbitrary "checkout" and "checkin" of the semaphore value. The overall logic is the same.
Disclaimer: I have not used System V for a few years; test before using, in case I missed some details.
int semid;

// Call from main
do_init() {
    shmkey = ftok(...);
    semid = semget(shmkey, 1, ...);   // semget (not shmget) for a semaphore set

    // Set up 100 as the maximum resources.
    struct sembuf so;
    so.sem_num = 0;
    so.sem_op = 100;
    so.sem_flg = 0;
    semop(semid, &so, 1);
}

// Thread work
do_thread_work() {
    int N = <number of resources>;
    struct sembuf so;
    so.sem_num = 0;
    so.sem_op = -N;                   // blocks until N can be checked out atomically
    so.sem_flg = 0;
    semop(semid, &so, 1);

    ... Do thread work

    // Return resources
    so.sem_op = +N;
    semop(semid, &so, 1);
}
As per https://pubs.opengroup.org/onlinepubs/009695399/functions/semop.html, this results in an atomic checkout of multiple items.
Note: the ... mark sections related to IPC_NOWAIT and SEM_UNDO, which are not relevant to this case.
If sem_op is a negative integer and the calling process has alter permission, one of the following shall occur:
If semval (see <sys/sem.h>) is greater than or equal to the absolute value of sem_op, the absolute value of sem_op is subtracted from semval. ...
...
If semval is less than the absolute value of sem_op ..., semop() shall increment the semncnt associated with the specified semaphore and suspend execution of the calling thread until one of the following conditions occurs:
The value of semval becomes greater than or equal to the absolute value of sem_op. When this occurs, the value of semncnt associated with the specified semaphore shall be decremented, the absolute value of sem_op shall be subtracted from semval and, ... .

Sieve of Eratosthenes Pthread implementation: thread number doesn't affect computation time

I'm trying to implement a parallel Sieve of Eratosthenes program with Pthreads. I have finished my coding and the program works correctly and as expected, meaning that if I use more than 1 thread, the computation time is less than with the sequential program (only 1 thread used). However, no matter how many extra threads I use, the computation time stays basically the same. For example, if I do the calculation from 1 to 1 billion, the sequential program takes about 21 secs, and the parallel program with 2 threads takes about 14 secs. But it always takes about 14 secs when I use 3, 4, 5, 10, 20, or 50 threads, as I tried. I want to know what causes this and how to solve it. My code is listed below:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <time.h>   /* for time() and time_t */

//The group of arguments passed to a thread
struct thrd_data {
    long id;
    long start;
    long end;   /* the sub-range is from start to end */
};

typedef struct {
    pthread_mutex_t count_lock;    /* mutex semaphore for the barrier */
    pthread_cond_t ok_to_proceed;  /* condition variable for leaving */
    long count;                    /* count of the number of threads who have arrived */
} mylib_barrier_t;

//global variables
bool *GlobalList;        //The list of natural numbers
long Num_Threads;
mylib_barrier_t barrier; /* barrier */

void mylib_barrier_init(mylib_barrier_t *b)
{
    b->count = 0;
    pthread_mutex_init(&(b->count_lock), NULL);
    pthread_cond_init(&(b->ok_to_proceed), NULL);
}

void mylib_barrier(mylib_barrier_t *b, long id)
{
    pthread_mutex_lock(&(b->count_lock));
    b->count++;
    if (b->count == Num_Threads)
    {
        b->count = 0;   /* must be reset for future re-use */
        pthread_cond_broadcast(&(b->ok_to_proceed));
    }
    else
    {
        while (pthread_cond_wait(&(b->ok_to_proceed), &(b->count_lock)) != 0)
            ;
    }
    pthread_mutex_unlock(&(b->count_lock));
}

void mylib_barrier_destroy(mylib_barrier_t *b)
{
    pthread_mutex_destroy(&(b->count_lock));
    pthread_cond_destroy(&(b->ok_to_proceed));
}

void *DoSieve(void *thrd_arg)
{
    struct thrd_data *t_data;
    long i, start, end;
    long k = 2;   //The current prime number in the first loop
    long myid;

    /* Initialize my part of the global array */
    t_data = (struct thrd_data *) thrd_arg;
    myid = t_data->id;
    start = t_data->start;
    end = t_data->end;
    printf("Thread %ld doing look-up from %ld to %ld\n", myid, start, end);

    //First loop: find all prime numbers that are less than sqrt(n)
    while (k * k <= end)
    {
        int flag;
        if (k * k >= start)
            flag = 0;
        else
            flag = 1;

        //Second loop: mark all multiples of the current prime number
        for (i = !flag ? k * k - 1 : start + k - start % k - 1; i <= end; i += k)
            GlobalList[i] = 1;
        i = k;

        //wait for other threads to finish the second loop for the current prime number
        mylib_barrier(&barrier, myid);

        //find the next prime number that is greater than the current one
        while (GlobalList[i] == 1)
            i++;
        k = i + 1;
    }

    //decrement the counter of threads before exit
    pthread_mutex_lock(&barrier.count_lock);
    Num_Threads--;
    if (barrier.count == Num_Threads)
    {
        barrier.count = 0;   /* must be reset for future re-use */
        pthread_cond_broadcast(&(barrier.ok_to_proceed));
    }
    pthread_mutex_unlock(&barrier.count_lock);
    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    long i, n, n_threads;
    long k, nq, nr;
    struct thrd_data *t_arg;
    pthread_t *thread_id;
    pthread_attr_t attr;

    /* Pthreads setup: initialize barrier and explicitly create
       threads in a joinable state (for portability) */
    mylib_barrier_init(&barrier);
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    /* ask the user to enter n and n_threads */
    printf("enter the range n = ");
    scanf("%ld", &n);
    printf("enter the number of threads n_threads = ");
    scanf("%ld", &n_threads);

    time_t start = time(0);   //set initial time

    //Initialize global list
    GlobalList = (bool *) malloc(sizeof(bool) * n);
    for (i = 0; i < n; i++)
        GlobalList[i] = 0;

    /* create arrays of thread ids and thread args */
    thread_id = (pthread_t *) malloc(sizeof(pthread_t) * n_threads);
    t_arg = (struct thrd_data *) malloc(sizeof(struct thrd_data) * n_threads);

    /* distribute load and create threads for computation */
    nq = n / n_threads;
    nr = n % n_threads;
    k = 1;
    Num_Threads = n_threads;
    for (i = 0; i < n_threads; i++) {
        t_arg[i].id = i;
        t_arg[i].start = k;
        if (i < nr)
            k = k + nq + 1;
        else
            k = k + nq;
        t_arg[i].end = k - 1;
        pthread_create(&thread_id[i], &attr, DoSieve, (void *) &t_arg[i]);
    }

    /* Wait for all threads to complete, then print all prime numbers */
    for (i = 0; i < n_threads; i++) {
        pthread_join(thread_id[i], NULL);
    }
    int j = 1;

    //Get the time spent on the computation by all participating threads
    time_t stop = time(0);
    printf("Time to do everything except print = %lu seconds\n", (unsigned long) (stop - start));

    //print the resulting prime numbers
    printf("The prime numbers are listed below:\n");
    for (i = 1; i < n; i++)
    {
        if (GlobalList[i] == 0)
        {
            printf("%ld ", i + 1);
            j++;
        }
        if (j % 15 == 0)
            printf("\n");
    }
    printf("\n");

    // Clean up and exit
    free(GlobalList);
    pthread_attr_destroy(&attr);
    mylib_barrier_destroy(&barrier);   // destroy barrier object
    pthread_exit(NULL);
}
You make a valid observation. More threads doesn't mean more work gets done.
You are running your program on a dual-core CPU. You already saturate the system with 2 threads.
With 1 thread only 1 core will be used. With 2 threads, 2 cores will be used. With, say, 4 threads you will see about the same performance as with 2 threads. Hyper-threading doesn't help, because a logical (HT) core shares the memory system with its physical core.
Here is the output of running
perf stat -d sieve
23879.553188 task-clock (msec) # 1.191 CPUs utilized
3,666 context-switches # 0.154 K/sec
1,470 cpu-migrations # 0.062 K/sec
219,177 page-faults # 0.009 M/sec
76,070,790,848 cycles # 3.186 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
34,500,622,236 instructions # 0.45 insns per cycle
4,172,395,541 branches # 174.727 M/sec
1,020,010 branch-misses # 0.02% of all branches
21,065,385,093 L1-dcache-loads # 882.152 M/sec
1,223,920,596 L1-dcache-load-misses # 5.81% of all L1-dcache hits
69,357,488 LLC-loads # 2.904 M/sec
<not supported> LLC-load-misses:HG
This is the output of an i5-4460 CPU's hardware performance monitor. It tracks some interesting statistics.
Notice the low instructions-per-cycle count: the CPU is executing 0.45 instructions per cycle. Normally you want to see this value > 1.
Update: The key thing to notice is that increasing the number of threads doesn't help. The CPU can only do a finite amount of branching and memory access.
Two observations.
First, if you fix your sieve code then it should run about 25 times as fast as it does now, corresponding roughly to the expected gain from distributing your current code successfully over 32 cores.
Have a look at prime number summing still slow after using sieve where I showed how to sieve the numbers up to 2,000,000,000 in 1.25 seconds in C# of all languages. The article discusses (and benchmarks) each step/technique separately so that you pick what you like and roll a solution that strikes the perfect bang/buck ratio for your needs. Things will be even faster in C/C++ because there you can count on the compiler sweating the small stuff for you (at least with excellent compilers like gcc or VC++).
Second: when sieving large ranges the most important resource is the level 1 cache of the CPU. Everything else plays second fiddle. You can see this also from the benchmarks in my article. To distribute a sieving task across several CPUs, count the L1 caches in your system and assign a sieving job to each cache (1/kth of the range where k is the number of L1 caches). This is a bit of a simplification since you'd normally choose a finer granularity for the size of the work items, but it gives the general idea.
I said 'caches', not 'cores', 'virtual cores' or 'threads' because that is precisely what you need to do: assign the jobs such that each job has its own L1 cache. How that works depends not only on the operating system but also on the specific CPU(s) in your system. If two 'whatevers' share an L1 cache, give a job to only one of the two and ignore the other (or rather, set the affinity for the job such that it can run on any one of the two but nowhere else).
This is easy enough to do with operating system APIs (e.g. Win32) but I don't know enough about pthreads to tell whether it offers the required precision. As a first approximation you could match the number of threads to the suspected number of L1 caches.
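For what it's worth, pthreads on Linux does expose this through the nonportable affinity API. Here is a minimal sketch; which CPU numbers share an L1 cache is machine-specific, so the cpu argument you pass is an assumption you must verify (e.g. with hwloc's lstopo):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single logical CPU.
   Returns 0 on success, an errno value on failure. */
int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}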

Quickly Reacquirable Locks

This is an algorithm that does not use OS synchronization primitives until two or more threads actually contend for the critical section. Even for recursive "locks" taken by the same thread, there is no real lock until a second thread is involved.
http://home.comcast.net/~pjbishop/Dave/QRL-OpLocks-BiasedLocking.pdf
There are two functions:
int qrlgeneric_acquire(qrlgeneric_lock *L, int id);
void qrlgeneric_release(qrlgeneric_lock *L, int acquiredquickly);
qrlgeneric_acquire: called when a thread wants to lock; id is the thread id.
qrlgeneric_release: called when the thread wants to unlock.
Example:
Thread_1, which already holds the lock, calls qrlgeneric_acquire again, so a recursive lock will be performed. At the same time, Thread_2 calls qrlgeneric_acquire, so there will be contention (two threads want to lock, and a real OS sync primitive will be used).
Thread_1 will reach this condition on line 4.
04 if (BIASED(id) == status) // SO: this means this thread already has this lock
05 {
06 L->lockword.h.quicklock = 1;
07 if (BIASED(id) == HIGHWORD(L->lockword.data))
08 return 1;
09 L->lockword.h.quicklock = 0; /* I didn’t get the lock, so be sure */
10 /* not to block the process that did */
11 }
Thread_2 will reach this condition on line 35. CAS is a compare-and-swap atomic operation.
34 unsigned short biaslock = L->lockword.h.quicklock;
35 if (CAS(&L->lockword,
36 MAKEDWORD(biaslock, status),
37 MAKEDWORD(biaslock, REVOKED)))
38 {
39 /* I’m the revoker. Set up the default lock. */
40 /* *** INITIALIZE AND ACQUIRE THE DEFAULT LOCK HERE *** */
41 /* Note: this is an uncontended acquire, so it */
42 /* can be done without use of atomics if this is */
43 /* desirable. */
44 L->lockword.h.status = DEFAULT;
45
46 /* Wait until quicklock is free */
47 while (LOWWORD(L->lockword.data))
48 ;
49 return 0; /* And then it’s mine */
50 }
From the comments on lines 9 and 47, you can see that the statement at line 9 is there to support the statement at line 47, so that Thread_2 doesn't spin there forever.
QUESTION: It seems from those comments on lines 9 and 47 that the two conditions above should never both succeed; otherwise Thread_2 will spin on line 47, because the statement on line 9 will never be executed. THE PROBLEM is that I need help understanding how it is guaranteed that they never both succeed, because I still think it can happen:
1. Thread_1: 06 L->lockword.h.quicklock = 1;
2. Thread_2: 34 unsigned short biaslock = L->lockword.h.quicklock;
3. Thread_1: 07 if (BIASED(id) == HIGHWORD(L->lockword.data))
4. Thread_2: 35 if (CAS(&L->lockword, MAKEDWORD(biaslock, status), MAKEDWORD(biaslock, REVOKED)))
The condition at step 3 succeeds because Thread_2 hasn't changed anything yet. The CAS at step 4 succeeds because nothing has modified the lockword since biaslock was sampled at step 2 (step 3 is only a read).
The result, I think, is that both can succeed, but that means Thread_2 will spin on line 47 until Thread_1 releases the lock. I think this is definitely wrong and shouldn't happen, so I probably don't understand it. Can anyone help?
Whole algorithm:
/* statuses for qrl locks */
#define BIASED(id) ((int)(id) << 2)
#define NEUTRAL 1
#define DEFAULT 2
#define REVOKED 3
#define ISBIASED(status) (0 == ((status) & 3))

/* word manipulation (big-endian versions shown here) */
#define MAKEDWORD(low, high) (((unsigned int)(low) << 16) | (high))
#define HIGHWORD(dword) ((unsigned short)dword)
#define LOWWORD(dword) ((unsigned short)(((unsigned int)(dword)) >> 16))

typedef volatile struct tag_qrlgeneric_lock
{
    volatile union
    {
        volatile struct
        {
            volatile short quicklock;
            volatile short status;
        } h;
        volatile int data;
    } lockword;
    /* *** PLUS WHATEVER FIELDS ARE NEEDED FOR THE DEFAULT LOCK *** */
} qrlgeneric_lock;

int qrlgeneric_acquire(qrlgeneric_lock *L, int id)
{
    int status = L->lockword.h.status;

    /* If the lock's mine, I can reenter by just setting a flag */
    if (BIASED(id) == status)
    {
        L->lockword.h.quicklock = 1;
        if (BIASED(id) == HIGHWORD(L->lockword.data))
            return 1;
        L->lockword.h.quicklock = 0; /* I didn't get the lock, so be sure */
                                     /* not to block the process that did */
    }

    if (DEFAULT != status)
    {
        /* If the lock is unowned, try to claim it */
        if (NEUTRAL == status)
        {
            if (CAS(&L->lockword,            /* By definition, if we saw */
                    MAKEDWORD(0, NEUTRAL),   /* neutral, the lock is unheld */
                    MAKEDWORD(1, BIASED(id))))
            {
                return 1;
            }
            /* If I didn't bias the lock to me, someone else just grabbed
               it. Fall through to the revocation code */
            status = L->lockword.h.status; /* resample */
        }

        /* If someone else owns the lock, revoke them */
        if (ISBIASED(status))
        {
            do
            {
                unsigned short biaslock = L->lockword.h.quicklock;
                if (CAS(&L->lockword,
                        MAKEDWORD(biaslock, status),
                        MAKEDWORD(biaslock, REVOKED)))
                {
                    /* I'm the revoker. Set up the default lock. */
                    /* *** INITIALIZE AND ACQUIRE THE DEFAULT LOCK HERE *** */
                    /* Note: this is an uncontended acquire, so it */
                    /* can be done without use of atomics if this is */
                    /* desirable. */
                    L->lockword.h.status = DEFAULT;

                    /* Wait until quicklock is free */
                    while (LOWWORD(L->lockword.data))
                        ;
                    return 0; /* And then it's mine */
                }
                /* The CAS could have failed and we got here for either of
                   two reasons. First, another process could have done the
                   revoking; in this case we need to fall through to the
                   default path once the other process is finished revoking.
                   Secondly, the bias process could have acquired or released
                   the biaslock field; in this case we need merely retry. */
                status = L->lockword.h.status;
            }
            while (ISBIASED(L->lockword.h.status));
        }

        /* If I get here, the lock has been revoked by someone other
           than me. Wait until they're done revoking, then fall through
           to the default code. */
        while (DEFAULT != L->lockword.h.status)
            ;
    }

    /* Regular default lock from here on */
    assert(DEFAULT == L->lockword.h.status);
    /* *** DO NORMAL (CONTENDED) DEFAULT LOCK ACQUIRE FUNCTION HERE *** */
    return 0;
}

void qrlgeneric_release(qrlgeneric_lock *L, int acquiredquickly)
{
    if (acquiredquickly)
        L->lockword.h.quicklock = 0;
    else
    {
        /* *** DO NORMAL DEFAULT LOCK RELEASE FUNCTION HERE *** */
    }
}
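For orientation, the calling pattern the paper intends looks roughly like this sketch; the initialization to NEUTRAL with quicklock clear is my reading of the status constants, so treat it as an assumption:

qrlgeneric_lock L;

void lock_init(void) {
    L.lockword.h.quicklock = 0;
    L.lockword.h.status = NEUTRAL;   /* assumed initial state: unowned */
}

void critical_section(int myid) {
    int quick = qrlgeneric_acquire(&L, myid);  /* 1 = fast (biased/quick) path */
    /* ... protected work ... */
    qrlgeneric_release(&L, quick);
}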
