Self-written Mutex for 2+ Threads - c

I have written the following code, and so far in all my tests it seems as if I have written a working Mutex for my 4 Threads, but I would like to get someone else's opinion on the validity of my solution.
#include <stdio.h>

typedef struct Mutex {
    int turn;
    int *waiting;
    int num_processes;
} Mutex;

void enterLock(Mutex *lock, int id) {
    int i;
    for (i = 0; i < lock->num_processes; i++) {
        lock->waiting[id] = 1;
        if (i != id && lock->waiting[i])
            i = -1;
        lock->waiting[id] = 0;
    }
    printf("ID %d Entered\n", id);
}

void leaveLock(Mutex *lock, int id) {
    printf("ID %d Left\n", id);
    lock->waiting[id] = 0;
}

void foo(Mutex *lock, int id) {
    enterLock(lock, id);
    // do stuff now that i have access
    leaveLock(lock, id);
}

I feel compelled to write an answer here because the question is a good one, considering that it could help others understand the general problem of mutual exclusion. In your case, you came a long way toward hiding this problem, but you can't avoid it. It boils down to this:
01 /* pseudo-code */
02 if (! mutex.isLocked())
03 mutex.lock();
You always have to expect a thread switch between lines 02 and 03. So there is a possible situation where two threads both find the mutex unlocked and are interrupted right after that ... only to resume later and each lock the mutex individually. You will have two threads entering the critical section at the same time.
What you definitely need to implement reliable mutual exclusion is therefore an atomic operation that tests a condition and at the same time sets a value without any chance to be interrupted meanwhile.
/* pseudo-code */
while (! test_and_lock(mutex));
As soon as this test_and_lock function cannot be interrupted, your implementation is safe. Until C11, C didn't provide anything like this, so implementations of pthreads needed to use e.g. assembly or special compiler intrinsics. With C11, there is finally a "standard" way to write atomic operations like this. For general use, the pthreads library will give you what you need.
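For illustration, here is a minimal C11 sketch of such an atomic test-and-lock built on atomic_flag (treat it as a sketch, not a production mutex: it is a spinlock, so it still busy-waits):
#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void lock(void) {
    /* atomic_flag_test_and_set atomically sets the flag and returns its
     * previous value; we loop until we are the thread that found it clear */
    while (atomic_flag_test_and_set(&lock_flag))
        ;
}

void unlock(void) {
    atomic_flag_clear(&lock_flag);
}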
edit: of course, this is still simplified -- in a multi-processor scenario, you also need to ensure that memory accesses become visible to other cores in the right order (memory barriers), not just that the lock operation itself is atomic.

The problem I see in your code:
The idea behind a mutex is to provide mutual exclusion, meaning that when thread_a is in the critical section, thread_b must wait (in case it also wants to enter) until thread_a leaves.
This waiting part should be implemented in the enterLock function. But what you have is a for loop which might end way before thread_a is done with the critical section, so thread_b could also enter; hence you can't have mutual exclusion.
Way to fix it:
Take a look, for example, at Peterson's algorithm or Dekker's (more complicated). What they did there is called busy waiting, which is basically a while loop that says:
while(i can't enter) { do nothing and wait...}
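For two threads, a hedged C11 sketch of Peterson's algorithm could look like the following (the atomics matter: with plain ints the stores might not become visible in order, as the memory-model answer below stresses):
#include <stdatomic.h>
#include <stdbool.h>

atomic_bool flag[2];
atomic_int turn;

void peterson_lock(int id) {          // id is 0 or 1
    int other = 1 - id;
    atomic_store(&flag[id], true);    // announce interest
    atomic_store(&turn, other);       // politely give the other thread the turn
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
        ;                             // busy wait: "i can't enter, do nothing"
}

void peterson_unlock(int id) {
    atomic_store(&flag[id], false);
}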

You are totally ignoring the topic of memory models. Unless you are on a machine with a sequentially consistent memory model (which none of today's PC CPUs are), your code is incorrect, as a store executed by one thread is not necessarily immediately visible to other CPUs. However, your code seems to assume exactly that.
Bottom line: Use the existing synchronization primitives provided by the OS or a runtime library such as POSIX or the Win32 API, and don't try to be smart and implement this yourself. Unless you have years of experience in parallel programming as well as in-depth knowledge of CPU architecture, chances are quite good that you will end up with an incorrect implementation. And debugging parallel programs can be hell...
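To underline that advice, here is a minimal sketch of the question's foo() built on a pthreads mutex instead of the hand-rolled lock (assuming POSIX; error checking omitted):
#include <pthread.h>
#include <stdio.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void foo(int id) {
    pthread_mutex_lock(&lock);   // blocks until we have exclusive access
    printf("ID %d Entered\n", id);
    // do stuff now that we have access
    printf("ID %d Left\n", id);
    pthread_mutex_unlock(&lock);
}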

After enterLock() returns, the state of the Mutex object is the same as before the function was called. Hence it will not prevent a second thread from entering the same Mutex object even before the first one has released it by calling leaveLock(). There is no mutual exclusion.

Related

How Do I Enforce Write-only Behavior to a Page in Windows?

I'm reading the documentation for the Win32 VirtualAlloc function, and in the protections listed, there is no PAGE_WRITEONLY option that would pair with the PAGE_READONLY option, naturally. Is there any way to obtain such support by the operating system, or will I have to implement this in software somehow, or can I use processor features that may be available for implementing such things in hardware from user code? A software implementation is undesirable for obvious performance reasons.
Now this also introduces an obvious problem: the memory cannot be read, effectively making the writes an expensive NOP sequence, so the question is whether or not I can make a page have different protections from different contexts so that from one context, the page is write-only, but from another context, the page is read-only.
Security is only one small consideration, but in principle, it is for the sake of ensuring consistency of implementation with design which has security as a benefit (no unwanted reading of what should only be written from one context and vice versa). If you only need to write to something (which is obvious in the case of e.g. the output of a procedure, a hardware send buffer [or software model thereof in RAM], etc.), then it is worthwhile to ensure it is only written, and if you only need to read something, then it is worthwhile to ensure it is only read.
Reading your comments, I think you are looking for a lock system where only one thread at a time can write to or read from memory. Is that correct?
You may be looking for the cmpxchg instruction, which is exposed in Windows by the functions InterlockedCompareExchange, InterlockedCompareExchange64 and InterlockedCompareExchange128. These atomically compare a 32/64/128-bit value at a location with an expected value and, if they are equal, store a new value there. You can compare it to the following C code:
if(a==b)
a = c;
The difference between this C example and the cmpxchg instruction is that cmpxchg is one single instruction, while the C example consists of multiple instructions. This means cmpxchg cannot be interrupted, whereas the C example can. If the C example is interrupted after the 'if' statement and before the assignment, another thread can get CPU time and change variable 'a'. This cannot happen with cmpxchg.
This still leaves a problem if the system has multiple cores. To fix that, the lock prefix is used, which synchronizes the operation across all CPUs. It is already applied by the Windows APIs I mentioned above, so you don't need to worry about it.
For every piece of memory you want to lock, you create an integer. You use InterlockedCompareExchange to set this variable to 1, but only if it equals 0. If the return value shows that it wasn't 0, you wait by calling Sleep, and retry until it was. Each thread must set this variable back to 0 when it's done using the memory.
Example:
#include <windows.h>
#include <stdio.h>

LONG volatile lock;

DWORD WINAPI newThread(LPVOID param) {
    int var = (int)(INT_PTR)param;
    // Request the lock: spin until we atomically change it from 0 to 1
    while (InterlockedCompareExchange(&lock, 1, 0) != 0)
        Sleep(1);
    printf("Thread %lx (%d) got the lock, waiting %d ms before releasing the lock.\n",
           GetCurrentThreadId(), var, var * 100);
    // Do whatever you want to do
    Sleep(var * 100);
    printf("Lock released.\n");
    // Unlock
    InterlockedExchange(&lock, 0);
    return 0;
}

int main(void) {
    // Init the lock
    lock = 0;
    for (int i = 0; i < 100; i++)
        CreateThread(0, 0, newThread, (LPVOID)(INT_PTR)i, 0, 0);
    ExitThread(0); // end the main thread but keep the process (and workers) alive
}

Implement semaphore in User Level C

An effective semaphore implementation necessarily requires atomic instructions.
I see several user-level C implementations on the internet implementing semaphores using variables like a count, or a data structure like a queue. But instructions involving ordinary variables do not run as atomic instructions, so how can anyone implement a semaphore in user-level C?
How does a C library (e.g. semaphore.h) implement semaphores?
The answer is almost certainly "it doesn't" - instead it will call into kernel services which provide the necessary atomic operations.
It's not possible in standard C until C11. What you need is, as you said, atomic operations. C11 finally specifies them; see for example stdatomic.h.
If you're on an older version of the standard, you have to either use embedded assembler directly or rely on vendor-specific extensions of your compiler, see for example the GCC atomic builtins. Of course, processors support instructions for memory barriers, compare-and-swap operations etc. They're just not accessible from pure C99 and earlier, because parallel execution wasn't in the scope of the standard.
After reading MartinJames' comment, I should add a clarification here: this only applies if you implement all your threading in user space, because a semaphore must block threads waiting on it, so if the threads are managed by the kernel's scheduler (as is the case with pthreads on Linux, for example), it's necessary to do a syscall. Not in the scope of your question, but atomic operations might still be interesting for implementing e.g. lock-free data structures.
You could implement semaphore operations as simple as:
#include <stdatomic.h>

void sema_post(atomic_uint *value) {
    unsigned old = atomic_load(value);
    /* a failed CAS reloads old, so we simply retry with old + 1 */
    while (!atomic_compare_exchange_weak(value, &old, old + 1));
}

void sema_wait(atomic_uint *value) {
    unsigned old;
    do {
        old = atomic_load(value); /* re-read so a post is noticed while spinning on 0 */
    } while (old == 0 || !atomic_compare_exchange_weak(value, &old, old - 1));
}
It's OK semantically, but it does busy waiting (spinning) in sema_wait. (Note that sema_post is lock-free, although it also may spin.) Instead, it should sleep until value becomes positive. This problem cannot be solved with atomics alone, because all atomic operations are non-blocking; here you need help from the OS kernel. So an efficient semaphore could use a similar algorithm based on atomics but go into the kernel in two cases (see the Linux futex for more details on this approach, and the sketch at the end of this answer):
sema_wait: when it finds value == 0, ask to sleep
sema_post: when it has incremented value from 0 to 1, ask to wake another sleeping thread if any
In general, to implement lock-free (atomics-based) operations on a data structure, it's required that every operation is applicable to any state. For semaphores, wait isn't applicable to value 0.
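A rough, hypothetical sketch of that two-case approach on Linux might look like the following (names are illustrative; real futex-based semaphores are more subtle, e.g. they track the number of waiters):
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void futex_wait(atomic_uint *value, unsigned expected) {
    /* sleeps only if *value still equals expected, avoiding a lost wakeup */
    syscall(SYS_futex, value, FUTEX_WAIT, expected, NULL, NULL, 0);
}

static void futex_wake_one(atomic_uint *value) {
    syscall(SYS_futex, value, FUTEX_WAKE, 1, NULL, NULL, 0);
}

void sema_wait_blocking(atomic_uint *value) {
    for (;;) {
        unsigned old = atomic_load(value);
        if (old == 0) {
            futex_wait(value, 0);  /* case 1: value == 0, ask the kernel to sleep */
            continue;
        }
        if (atomic_compare_exchange_weak(value, &old, old - 1))
            return;
    }
}

void sema_post_blocking(atomic_uint *value) {
    unsigned old = atomic_load(value);
    while (!atomic_compare_exchange_weak(value, &old, old + 1))
        ;
    if (old == 0)                  /* case 2: 0 -> 1 transition, someone may sleep */
        futex_wake_one(value);
}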

Locking Strategy for Millions of Linked Lists + Multithreading (C) [closed]

I have a C program with 16-million-odd linked lists, and 4 worker threads.
No two threads should work on the same linked list at the same time, otherwise they might be modifying it simultaneously, which would be bad.
My initial naive solution was something like this:
int linked_lists_locks[NUM_LINKED_LISTS];
for (i = 0; i < NUM_LINKED_LISTS; i++)
    linked_lists_locks[i] = 0;
then later, in a section executed by each thread as it works:
while (linked_lists_locks[some_list] == 1) {
    /* busy wait */
}
linked_lists_locks[some_list] = 1; // mark it locked
/* work with the list */
linked_lists_locks[some_list] = 0;
However, with 4 threads and ~250,000,000 operations, I quickly got into cases where two threads did the same "is it locked?" check simultaneously and problems ensued. Smart people here would have seen that coming :-)
I've looked at some locking algorithms like Dekker's and Peterson's, but they seem to be more about "lock this section of code", whereas what I'm looking for is "lock this variable". I suspect that if I lock the "work with the list" section of code, everything slows to a crawl, because then only one thread can work at a time (though I haven't tried it). Essentially, each worker's job is limited to doing some math and populating these lists. Cases where two threads want to work on the same list simultaneously are rare, btw - only a few thousand times out of 250M operations - but they do happen.
Is there an algorithm or approach for implementing locks on many variables as opposed to sections of code? This is C (on Linux if that matters) so synchronized array lists, etc. from Java/C#/et al are not available.
It would be useful to know more about how your application is organized, but here are a few ideas about how to approach the problem.
A common solution for "synchronized" objects is to assign a mutex to each object. Before working on an object, the thread needs to acquire the object's mutex; when it is done, it releases the mutex. That's simple and effective, but if you really have 16 million lockable objects, it's a lot of overhead. More seriously, if two tasks really try to work on the same object at the same time, one of them will end up sleeping until the other one releases the lock. If there was something else the tasks might have been doing, the opportunity has been lost.
A simple solution to the first problem -- the overhead of 16 million mutexes -- is to use a small vector of mutexes and a hash function which maps each object to one mutex. If you only have four tasks, and you used a vector of, say, 1024 mutexes, you will occasionally end up with a thread needlessly waiting for another thread, but it won't be very common.
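As a hedged illustration of that striping idea with pthreads (the pool size, the hash, and all names here are illustrative choices, not from the question):
#include <pthread.h>

#define NUM_LOCKS 1024  /* must be a power of two for the mask below */

static pthread_mutex_t lock_pool[NUM_LOCKS]; /* init each with pthread_mutex_init */

static pthread_mutex_t *lock_for_list(size_t list_index) {
    /* cheap multiplicative hash (Knuth), then mask into the pool */
    size_t h = list_index * 2654435761u;
    return &lock_pool[h & (NUM_LOCKS - 1)];
}

void work_on_list(size_t list_index) {
    pthread_mutex_t *m = lock_for_list(list_index);
    pthread_mutex_lock(m);
    /* ... modify the linked list ... */
    pthread_mutex_unlock(m);
}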
If lock contention really turns out to be a problem and it is possible to vary the order in which work is done, a reasonable model is a workqueue. Here, when a thread wants to do something, it takes a task off the workqueue, attempts to lock the task's object (using trylock instead of lock), and if that works, does the task. If the lock fails, it just puts the task back on the workqueue and grabs another one. To avoid workqueue lock contention, it's common for threads to grab a handful of tasks instead of one; each thread then manages its own subqueue. Tuning the various parameters in this solution requires knowing at least a bit about the characteristics of the tasks. (There is a kind of race condition in this solution, but it doesn't matter; it just means that occasionally tasks will be deferred unnecessarily. They should always get executed eventually.) A sketch of this trylock-and-requeue idea follows.
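In this sketch, task_t and the queue helpers pop_task/push_task are hypothetical, assumed only for illustration; pthread_mutex_trylock is the real pthreads call:
#include <pthread.h>

/* hypothetical task type: each task knows which per-object lock guards its list */
typedef struct task {
    pthread_mutex_t *object_lock;
    void (*run)(void *);
    void *arg;
} task_t;

task_t *pop_task(void);      /* assumed helper: take a task off the shared workqueue */
void push_task(task_t *t);   /* assumed helper: put a task back on the workqueue */

void worker_step(void) {
    task_t *t = pop_task();
    if (pthread_mutex_trylock(t->object_lock) == 0) {
        t->run(t->arg);                    /* got the object's lock: do the work */
        pthread_mutex_unlock(t->object_lock);
    } else {
        push_task(t);                      /* contended: defer this task, try another */
    }
}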
You should use an atomic test-and-set operation. Unfortunately, you may need to use an assembly routine if your compiler doesn't have a built-in for that. See this article:
http://en.wikipedia.org/wiki/Test-and-set
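If you're on GCC or Clang, a minimal sketch using the legacy __sync builtins might look like this (other compilers need different intrinsics, so treat the toolchain as an assumption):
static volatile int ts_lock = 0;

void ts_acquire(volatile int *l) {
    while (__sync_lock_test_and_set(l, 1)) /* atomically set to 1, returns old value */
        while (*l)                          /* spin on a plain read to reduce bus traffic */
            ;
}

void ts_release(volatile int *l) {
    __sync_lock_release(l); /* stores 0 with release semantics */
}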
If you are absolutely forced to use this many lists, and you have very few threads, you might not want to lock the lists themselves, but instead let each worker thread claim a single list at a time. In this case you need a structure that stores the number of the list currently held by each thread, and each list must be identified by a unique number.
Since you didn't seem to use any library I'll add some pseudo-code to clarify my idea:
/*
 * list_number: the number of the list you want to lock
 * my_id: the id of the thread trying to lock this list
 * mutex: the mutex used to control locking the lists
 * active_lists: array containing the lists currently held by the threads
 * num_threads: size of the array and also number of threads
 */
void lock_list(int list_number, int my_id, some_mutex *mutex,
               atomic_int *active_lists, size_t num_threads) {
    int ok = 0;
    int i;
    while (true) { // busy wait to claim the lock
        // First check if anyone seems to hold the list we want.
        // Do this in a non-locking way to avoid lock contention.
        while (!ok) {
            ok = 1;
            for (i = 0; i < num_threads; ++i) {
                if (active_lists[i].load() == list_number && i != my_id) {
                    ok = 0;
                    /*
                     * we have to restart - potential to optimize:
                     * at this point, you could defer the work on this list
                     * and do some other work instead
                     */
                    break;
                }
            }
        }
        while (try_to_lock(mutex));
        // Rerun the check to see if anyone has taken the list in the meantime
        // (ok == 1 at this point).
        for (i = 0; i < num_threads; ++i) {
            if (active_lists[i].load() == list_number && i != my_id) {
                ok = 0;
                break;
            }
        }
        // this must not be set from anywhere else!
        if (ok) active_lists[my_id].store(list_number);
        unlock(mutex);
        // if we claimed the list we are done; otherwise start over from the beginning
        if (ok) break;
    }
}
There are a few constraints on the pseudo-types. some_mutex obviously has to be a mutex. What I call atomic_int here must somehow support fetching its latest value from main memory, to prevent you from seeing stale cached values. The same goes for the store: it must not be cached core-locally before being written. Using a regular int together with lfence, sfence and/or mfence may work as well.
There are obviously some trade-offs here, where the main one is probably memory vs speed. This example will create contention at the single mutex used to store which list you have locked, so it will scale poorly with a large number of threads, but well with a large number of lists. If lists are claimed infrequently this would work well even at a larger number of threads. The advantage is that the storage requirement depends mainly on the number of threads. You have to pick a storage type which can hold a number equivalent to the maximum number of lists though.
I am not sure what exactly your scenario is, but recently lock-free lists have also gained some momentum. With the introduction of advanced support for lock-free code in C11 and C++11, there have been a few working (as in not shown to be broken) examples around. Herb Sutter gave a talk on how to do this in C++11. It is C++, but he discusses the relevant points of writing a lock-free singly linked list, which are also true for plain old C. You can also try to find an existing implementation, but you should inspect it carefully because this is kind of bleeding-edge stuff. However, using lock-free lists would eliminate the need to lock at all.

Race condition and mutex

I have 2 questions regarding threads: one is about race conditions and the other is about mutexes.
So the first question :
I've read about race conditions on the Wikipedia page:
http://en.wikipedia.org/wiki/Race_condition
And in the example of a race condition between 2 threads, this is shown:
http://i60.tinypic.com/2vrtuz4.png (a diagram of two threads' read/increase/write operations interleaved one at a time)
Now, so far I believed that threads run in parallel with each other, but judging from this picture it seems I misunderstood how the computer executes actions.
From this picture, only one action is done at a time, and although the threads get switched from time to time and the other thread gets to do some actions, it is still one action at a time done by the computer. Is it really like this? Is there no "real" parallel computing, just one action done at a time at a very fast rate, giving the illusion of parallel computing?
This leads me to my second question about mutex.
I've read that if threads read/write the same memory we need some sort of synchronization mechanism. I've read that normal data types won't do and that we need a mutex.
Let's take for example the following code :
#include <stdio.h>
#include <stdbool.h>
#include <windows.h>
#include <process.h>

bool lock = false;

void increment(void*);
void decrement(void*);

int main()
{
    int n = 5;
    HANDLE hIncrement = (HANDLE)_beginthread(increment, 0, (void*)&n);
    HANDLE hDecrement = (HANDLE)_beginthread(decrement, 0, (void*)&n);
    WaitForSingleObject(hIncrement, 1000 * 500);
    WaitForSingleObject(hDecrement, 1000 * 500);
    return 0;
}

void increment(void *p)
{
    int *n = p;
    for (int i = 0; i < 10; i++)
    {
        while (lock)
        {
        }
        lock = true;
        (*n)++;
        lock = false;
    }
}

void decrement(void *p)
{
    int *n = p;
    for (int i = 0; i < 10; i++)
    {
        while (lock)
        {
        }
        lock = true;
        (*n)--;
        lock = false;
    }
}
Now in my example here, I use the bool lock as my synchronization mechanism to avoid a race condition between the 2 threads over the memory pointed to by n.
Now what I did here obviously won't work, because although I avoided a race condition over the memory pointed to by n between the 2 threads, a new race condition over the bool lock variable may occur.
Let's consider the following sequence of events (A = increment thread, B = decrement thread) :
A gets out of the while loop since lock is false
A gets to set lock to true
B waits in the while loop because lock is set to true
A increment the value pointed by n
A sets lock to false
A gets to the while loop
A gets out of the while loop since lock is false
B gets out of the while loop since lock is false
A sets lock to true
B sets lock to true
and from here we get an unexpected behavior of 2 un-synchronized threads because the bool lock is not race condition proof.
Ok, so far this is my understanding, and the solution to the problem above is that we need a mutex.
I'm fine with that: a data type that will magically be race-condition proof.
I just don't understand how this can work for a mutex when it can't for every other type, and here lies my problem: I want to understand why a mutex works and how this is achieved.
About your first question: whether there are actually several different threads running at once, or whether it is just implemented as fast switching, is a matter of your hardware. Typical PCs these days have several cores (often with more than one hardware thread each), so you have to assume that things actually DO happen at the same time.
But even if you have only a single-core system, things are not quite so easy. This is because the compiler is usually allowed to re-order instructions in order to optimize code. It can also e.g. choose to cache a variable in a CPU register instead of loading it from memory every time you access it, and it also doesn't have to write it back to memory every time you write to that variable. The compiler is allowed to do that as long as the result is the same AS IF it had run your original code in its original order - as long as nobody else is looking closely at what's actually going on, such as a different thread.
And once you actually do have different cores, consider that they all have their own CPU registers and even their own cache. Even if a thread on one core wrote to a certain variable, as long as that core doesn't write its cache back to the shared memory a different core won't see that change.
In short, you have to be very careful in making any assumptions about what happens when two threads access variables at the same time, especially in C/C++. The interactions can be so surprising that I'd say, to stay on the safe side, you should make sure that there are no race conditions in your code, e.g. by always using mutexes for accessing memory that is shared between threads.
Which is where we can neatly segue into the second question: what's so special about mutexes, and how can they work if all basic data types are not threadsafe?
The thing about mutexes is that they are implemented with a lot of knowledge about the system for which they are being used (hardware and operating system), and with either the direct help or a deep knowledge of the compiler itself.
The C language does not give you direct access to all the capabilities of your hardware and operating system, because platforms can be very different from each other. Instead, C focuses on providing a level of abstraction that allows you to compile the same code for many different platforms. The different "basic" data types are just something that the C standard came up with as a set of data types which can in some way be supported on almost any platform - but the actual hardware that your program will be compiled for is usually not limited to those types and operations.
In other words, not everything that you can do with your PC can be expressed in terms of C's ints, bytes, assignments, arithmetic operators and so on. For example, PCs often calculate with 80-bit floating point types which are usually not mapped directly to a C floating point type at all. More to the point of our topic, there are also CPU instructions that influence how multiple CPU cores will work together. Additionally, if you know the CPU, you often know a few things about the behaviour of the basic types that the C standard doesn't guarantee (for example, whether loads and stores of 32-bit integers are atomic). With that extra knowledge, it can become possible to implement mutexes for that particular platform, and it will often require code that is e.g. written directly in assembly language, because the necessary features are not available in plain C.
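To make the contrast concrete, here is a hedged sketch of the question's increment() rewritten with a Win32 CRITICAL_SECTION, a mutex built with exactly this kind of platform knowledge (InitializeCriticalSection(&cs) would have to be called in main before starting the threads):
#include <windows.h>

CRITICAL_SECTION cs; // initialize once with InitializeCriticalSection(&cs)

void increment(void *p)
{
    int *n = p;
    for (int i = 0; i < 10; i++)
    {
        EnterCriticalSection(&cs); // atomic test-and-lock provided by the OS
        (*n)++;
        LeaveCriticalSection(&cs);
    }
}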

How are read/write locks implemented in pthread?

How are they implemented especially in case of pthreads. What pthread synchronization APIs do they use under the hood? A little bit of pseudocode would be appreciated.
I haven't done any pthreads programming for a while, but when I did, I never used POSIX read/write locks. The problem is that most of the time a mutex will suffice: i.e. your critical section is small, and the region isn't so performance-critical that the double barrier is worth worrying about.
In those cases where performance is an issue, using atomic operations (generally available as a compiler extension) is normally a better option (i.e. the extra barrier is the problem, not the size of the critical section).
By the time you eliminate all these cases, you are left with cases where you have specific performance/fairness/rw-bias requirements that require a true rw-lock; and that is when you discover that all the relevant performance/fairness parameters of POSIX rw-lock are undefined and implementation specific. At this point you are generally better off implementing your own so you can ensure the appropriate fairness/rw-bias requirements are met.
The basic algorithm is to keep a count of how many of each are in the critical section, and if a thread isn't allowed access yet, to shunt it off to an appropriate queue to wait. Most of your effort will be in implementing the appropriate fairness/bias between servicing the two queues.
The following C-like pthreads-like pseudo-code illustrates what I'm trying to say.
struct rwlock {
    mutex admin;          // used to serialize access to other admin fields, NOT the critical section.
    int count;            // threads in critical section: +ve for readers, -ve for writers.
    fifoDequeue dequeue;  // acts like a cond_var with fifo behaviour and both append and prepend operations.
    void *data;           // represents the data covered by the critical section.
};

void read(struct rwlock *rw, void (*readAction)(void *)) {
    lock(rw->admin);
    if (rw->count < 0) {
        append(rw->dequeue, rw->admin);
    }
    while (rw->count < 0) {
        prepend(rw->dequeue, rw->admin); // Used to avoid starvation.
    }
    rw->count++;
    // Wake the new head of the dequeue, which may be a reader.
    // If it is a writer it will put itself back on the head of the queue
    // and wait for us to exit.
    signal(rw->dequeue);
    unlock(rw->admin);

    readAction(rw->data);

    lock(rw->admin);
    rw->count--;
    signal(rw->dequeue); // Wake the new head of the dequeue, which is probably a writer.
    unlock(rw->admin);
}

void write(struct rwlock *rw, void *(*writeAction)(void *)) {
    lock(rw->admin);
    if (rw->count != 0) {
        append(rw->dequeue, rw->admin);
    }
    while (rw->count != 0) {
        prepend(rw->dequeue, rw->admin);
    }
    rw->count--;
    // As we only allow one writer in at a time, we don't bother signaling here.
    unlock(rw->admin);

    // NOTE: This is the critical section, but it is not covered by the mutex!
    // The critical section is, rather, covered by the rw-lock itself.
    rw->data = writeAction(rw->data);

    lock(rw->admin);
    rw->count++;
    signal(rw->dequeue);
    unlock(rw->admin);
}
Something like the above code is a starting point for any rwlock implementation. Give some thought to the nature of your problem and replace the dequeue with the appropriate logic that determines which class of thread should be woken up next. It is common to allow a limited number/period of readers to leapfrog writers, or vice versa, depending on the application.
Of course my general preference is to avoid rw-locks altogether; generally by using some combination of atomic operations, mutexes, STM, message-passing, and persistent data-structures. However there are times when what you really need is a rw-lock, and when you do it is useful to know how they work, so I hope this helped.
EDIT - In response to the (very reasonable) question, where do I wait in the pseudo-code above:
I have assumed that the dequeue implementation contains the wait, so that somewhere within append(dequeue, mutex) or prepend(dequeue, mutex) there is a block of code along the lines of:
while (!readyToLeaveQueue()) {
    wait(dequeue->cond_var, mutex);
}
which was why I passed in the relevant mutex to the queue operations.
Each implementation can be different, but normally they have to favor readers by default, due to the POSIX requirement that a thread be able to obtain the read lock on an rwlock multiple times. If they favored writers, then whenever a writer was waiting, a reader would deadlock on its second read-lock attempt, unless the implementation could determine that the reader already holds a read lock; but the only way to determine that is to store a list of all threads holding read locks, which is very inefficient in time and space.
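As a small illustration of that POSIX requirement (a sketch, assuming pthreads):
#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;

void reader(void)
{
    pthread_rwlock_rdlock(&rw);
    // ... perhaps deep inside a helper that also takes the read lock ...
    pthread_rwlock_rdlock(&rw);  // must not deadlock, even if a writer is queued
    pthread_rwlock_unlock(&rw);
    pthread_rwlock_unlock(&rw);
}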
