Say I have a multithreaded application and I run it with the same inputs. Is it enough to instrument every load and store to detect write-write and write-read data races? I mean, from the logged load and store addresses, if we can see which thread did which load and which thread did which store, we can detect write-read and write-write data races by noticing the overlapping addresses. Or am I missing something?
You are missing a lot. As Pubby said, if you see a read, then a write in T1, and later a read, then a write in T2, you can't say anything about the absence of races. You need to know about the locks involved.
You may want to use a tool such as Google's ThreadSanitizer instead.
Update:
But will my approach cover all races or at least some of the races?
Your comments here and on other answers appear to show that you don't understand what a race is.
Your approach may expose some of the races, yes. But it is guaranteed to miss most of them (which makes the exercise futile).
Here is a simple example from Wikipedia that I have slightly modified:
As a simple example let us assume that two threads T1 and T2 each want
to perform arithmetic on the value of a global integer. Ideally, the
following sequence of operations would take place:
Integer i = 0; (memory)
T1 reads the value of i from memory into register1: 0
T1 increments the value of i in register1: (register1 contents) + 1 = 1
T1 stores the value of register1 in memory: 1
T2 reads the value of i from memory into register2: 1
T2 multiplies the value of i in register2: (register2 contents) * 2 = 2
T2 stores the value of register2 in memory: 2
Integer i = 2; (memory)
In the case shown above, the final value of i is 2, as expected.
However, if the two threads run simultaneously without locking or
synchronization, the outcome of the operation could be wrong. The
alternative sequence of operations below demonstrates this scenario:
Integer i = 0; (memory)
T1 reads the value of i from memory into register1: 0
T2 reads the value of i from memory into register2: 0
T1 increments the value of i in register1: (register1 contents) + 1 = 1
T2 multiplies the value of i in register2: (register2 contents) * 2 = 0
T1 stores the value of register1 in memory: 1
T2 stores the value of register2 in memory: 0
Integer i = 0; (memory)
The final value of i is 0 instead of the expected result of 2. This
occurs because the operations in the second case are not
mutually exclusive. Mutually exclusive operations are those that
cannot be interrupted while accessing some resource such as a memory
location. In the first case, T1 was not interrupted while accessing
the variable i, so its operation was mutually exclusive.
All of these operations are atomic. The race condition occurs because this particular order does not have the same semantics as the first. How do you prove the semantics are not the same as the first? Well, you know they are different for this case, but in general you need to check every possible order to determine that you have no race conditions. This is a very hard thing to do, with immense complexity (probably NP-hard), and thus can't be checked reliably.
What happens if a certain order never halts? How do you even know it will never halt in the first place? You're basically left with solving the halting problem, which is an impossible task.
If you're talking about using consecutive reads or writes to determine the race, then observe this:
Integer i = 0; (memory)
T2 reads the value of i from memory into register2: 0
T2 multiplies the value of i in register2: (register2 contents) * 2 = 0
T2 stores the value of register2 in memory: 0
T1 reads the value of i from memory into register1: 0
T1 increments the value of i in register1: (register1 contents) + 1 = 1
T1 stores the value of register1 in memory: 1
Integer i = 1; (memory)
This has the same per-thread read/store pattern as the first, but gives a different result (1 instead of 2).
The most obvious thing you'll learn is that there are several threads using the same memory. That's not necessarily bad in itself.
Good uses would include protection by semaphores, atomic access and mechanisms like RCU or double buffering.
Bad uses would include race conditions, true sharing, and false sharing:
Race conditions mostly stem from ordering issues - if a certain task A writes something at the end of its execution whereas task B needs that value at its start, you better make sure that the read of B only happens after A is completed. Semaphores, signals or similar are a good solution to this. Or run it in the same thread of course.
True sharing means that two or more cores are aggressively reading and writing the same memory address. This slows down the processor, as it will constantly have to send any changes to the caches of the other cores (and to memory, of course). Your approach could catch this, but probably not highlight it.
False sharing is even more complex than true sharing: processor caches do not work on single bytes but on "cache lines" - which hold more than one value. If core A keeps hammering byte 0 of a line whereas core B keeps writing to byte 4, the cache updating will still stall the whole processor.
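To make that concrete, here is a minimal C sketch of the usual mitigation for false sharing: padding per-thread data out to a full cache line so that two cores never write to the same line. The 64-byte line size and all names are assumptions for illustration, not something from the answer above.

#include <stddef.h>

/* Assumed cache-line size; 64 bytes is common on x86 but not universal. */
#define CACHE_LINE 64

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)]; /* keeps the next counter off this line */
};

/* One counter per core: with the padding, core A hammering counters[0].value
   no longer invalidates the cache line holding counters[1].value. */
struct padded_counter counters[4];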
Related
Consider the following test program, compiled and run on an implementation that fully implements C2011 atomics and threads.
#include <stdio.h>
#include <stdatomic.h>
#include <threads.h>
#if !ATOMIC_INT_LOCK_FREE
#error "Program behavior is only interesting if atomic_int is lock-free."
#endif
static atomic_int var;
int inc_print(void *unused)
{
    atomic_fetch_add(&var, 1);
    printf(" %d", atomic_load(&var));
    return 0;
}

int main(void)
{
    thrd_t thrd;
    if (thrd_create(&thrd, inc_print, 0) == thrd_success) {
        inc_print(0);
        thrd_join(&thrd, 0);
    }
    putchar('\n');
    return 0;
}
I have managed to convince myself that all of the following statements are true:
Each thread's atomic_load must observe the increment performed by that same thread, so it cannot read a zero.
Each thread's atomic_load may or may not observe the increment performed by the other thread. (The other thread might not get scheduled at all until after the atomic_load.) Therefore, it can read either a 1 or a 2.
The calls to printf are serialized only against each other. Therefore, if one thread's atomic_load reads a 1 and the other thread's atomic_load reads a 2, either 1 2 or 2 1 may be printed.
It is possible for both atomic_loads to observe the increment performed by the other thread, so the output 2 2 is also possible.
What I'm not sure of, though: Is it possible for neither of the atomic_loads to observe the increment performed by the other thread? That is, is the output 1 1 possible?
Also, does relaxing the memory model change anything?
Your conclusions look correct to me.
The default memory_order_seq_cst guarantees this whole program executes in a sequentially consistent manner, since it's data-race free and doesn't use any non-SC atomics. So the possible results are only interleavings of program-order.
This allows both increments then both loads, but one increment must come after the other, seeing its 1 result and writing a 2. And the load in that thread must come after it, so at least one thread sees a 2. The 1 1 result is impossible, the 2 2 result can happen.
Relaxed atomics don't introduce any new possibilities here; we can obtain the same guarantees from the rules for operations on the same atomic variable that apply regardless of the memory_order parameter.
A consistent modification order for var exists, and the two atomic increments together must add 2 in total. (Atomic RMWs are guaranteed to read the latest value for this reason, to make sure RMWs on the same object are serialized with each other, not both loading a 0 and writing a 1. That wouldn't be an atomic increment.)
A load after a fetch_add in the same thread must see its result or some later value in the modification order. (In C++ this is the write-read coherence guarantee, and sequenced-before (in the same thread) forcing ordering between two operations on the same atomic object. edits welcome with a link to the equivalent language in the C11 standard.)
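For concreteness, here is a sketch of what the relaxed variant of the thread function might look like, reusing var and the headers from the program in the question; by the same-variable coherence rules just described, the 1 1 output remains impossible even here.

int inc_print_relaxed(void *unused)
{
    (void)unused;
    /* Relaxed RMWs on the same object are still serialized in var's
       modification order, so one of the two must read 1 and write 2. */
    atomic_fetch_add_explicit(&var, 1, memory_order_relaxed);
    /* Write-read coherence: this load sees our own increment or later. */
    printf(" %d", atomic_load_explicit(&var, memory_order_relaxed));
    return 0;
}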
And yes, the printfs are ordered independently of the atomic modifications to var. printf effectively locks stdout. If that works like the rules for C11 mtx_lock, it's like an acquire operation on the mutex, so taking the lock can begin before the increment or load complete.
Not that that's relevant; it's not required for the 2 1 output to be possible. Even if locking was seq_cst, you could get to a state where neither printf had started, but all the atomics had finished. With one thread having a 1 and the other a 2 as their temporaries. Then it's just chance which gets to print first.
I am working on FreeRTOS, and I have a variable, say x. Now one and only one task writes to this variable once per second, and other tasks read its value. Do I need to guard the variable with a mutex?
If the variable is 32 bits or smaller, and its value stands alone and is not to be interpreted with regard to any other variable, then you do not need a mutex.
If you have one data item bigger than 32 bits, or else you have multiple items that have to stay together (eg: a light sensor that records both brightness and colour) then you need a mutex so that the readers can't get part of the old data and part of the new data.
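As a sketch of that multi-field case, here is what guarding such a compound value with a FreeRTOS mutex might look like; the struct, field names, and functions are hypothetical, only the semaphore API (xSemaphoreCreateMutex, xSemaphoreTake, xSemaphoreGive) is FreeRTOS's.

#include <stdint.h>
#include "FreeRTOS.h"
#include "semphr.h"

/* Hypothetical compound value: both fields must be read consistently. */
typedef struct { uint32_t brightness; uint32_t colour; } light_sample_t;

static light_sample_t x;
static SemaphoreHandle_t x_mutex;  /* created once with xSemaphoreCreateMutex() */

void writer_update(uint32_t b, uint32_t c)
{
    xSemaphoreTake(x_mutex, portMAX_DELAY);
    x.brightness = b;
    x.colour = c;                  /* both fields change under the lock */
    xSemaphoreGive(x_mutex);
}

light_sample_t reader_snapshot(void)
{
    light_sample_t copy;
    xSemaphoreTake(x_mutex, portMAX_DELAY);
    copy = x;                      /* never part old data, part new data */
    xSemaphoreGive(x_mutex);
    return copy;
}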
I think it's good practice to use a mutex (semaphore) when changing and reading variables that are used in multiple tasks.
Multibyte variables like strings could be changed while being read in another task. It will not happen on every readout, but the likelihood rises with the access frequency.
NB: writing to 32-bit or smaller variables can be safe IF they are accessed only by tasks on the same core (relevant if a dual-core ESP is used).
I think you should use a mutex. Consider this case (r for the reading task, w for the writing task):
If r is called before w, but due to scheduling, w finishes changing the value before r reads it, this makes trouble: r reads the new value but expects the old value.
I was studying OS and synchronization, and I got an idea about dealing with this shared data without synchronizing, but I am not sure if it will work. Here is the code.
Now, the race condition is obviously the increment and decrement on shared data. But what if the integer variable were atomic? I think I read something about this when I was just a beginner in CS, so the question might not be perfect. As far as I remember, it was blocking something to prevent the increment and decrement happening at the same time. Now, I am a bit confused about this, because if atomic variables really worked, there would not be any need to find synchronization methods for simple code like this one.
Note: the code has been removed since it just changed people's focus, and the answer provides enough info.
As it stands, the code is indeed not safe to call concurrently, so there must be some kind of synchronization that prevents this.
Now, concerning the idea to make num_processes atomic: that could work. It wouldn't be a simple substitution though; comparing to the max and incrementing must be done atomically, not in two steps, otherwise you still have a race condition. Specifically, the following sequence of steps must be prevented:
Thread A checks if the limit is reached, which it isn't.
Thread B checks if the limit is reached, which it isn't.
Thread B increments the PID counter.
Thread A increments the PID counter.
Each step in and of itself is atomic, but obviously that didn't help prevent a PID overflow. Instead, the code must check that the counter is not at the limit and increment it atomically. This is also a common task (compare and increment), so you should easily find existing code examples; a sketch follows below.
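A minimal sketch of such a compare-and-increment using C11 atomics; the limit macro and function name are made up for illustration.

#include <stdatomic.h>
#include <stdbool.h>

#define MAX_PROCESSES 4096          /* hypothetical limit */

static atomic_int num_processes;

/* Atomically increment num_processes only if it is below the limit.
   Returns true on success, false if the limit was already reached. */
static bool try_reserve_pid(void)
{
    int cur = atomic_load(&num_processes);
    while (cur < MAX_PROCESSES) {
        /* On failure, compare_exchange updates cur to the value some other
           thread just wrote, and we re-check the limit and retry. */
        if (atomic_compare_exchange_weak(&num_processes, &cur, cur + 1))
            return true;
    }
    return false;
}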
However, I'm pretty sure this isn't all the code involved, and some other code (e.g. in get_processID() or the code that releases a PID) could still require a lock around the whole thing.
For your code, synchronization is not necessary at all, because num_processes is incremented and decremented by only one process, i.e. the parent process. Also, num_processes is not a shared variable here. To create a shared variable, you first have to learn about the shmget() and shmat() functions in UNIX.
A race condition arises when two or more processes want to access shared memory. An operation is atomic if it is executed entirely (i.e. with no switching) or not at all. For example:
Consider the increment operator on shared data. This operator is not atomic, because at the level of machine instructions the operation is performed in several steps:
1. First, load the value of the variable into some register.
2. Add one to the loaded value; the result is placed in some temporary register.
3. Store this result in the memory location pointed to by the variable on which the increment is performed.
As you can see, this operation is done in three steps, so if there is a switch to another process before all three steps complete, it leads to undesired results; a sketch of the decomposition follows below. For more, you can read about race conditions at http://tutorials.jenkov.com/java-concurrency/race-conditions-and-critical-sections.html. The individual load, add, and store instructions are atomic, since each is performed entirely or not at all (barring a power or system failure). So to make the increment operation atomic, we need some synchronization, either using semaphores or monitors; these are software synchronization techniques. I think this topic will be clear to you now.
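As a minimal C sketch of the three-step decomposition described above (the names are illustrative):

int shared;             /* hypothetical shared variable */

void unsafe_inc(void)
{
    int tmp = shared;   /* 1. load the value into a register  */
    tmp = tmp + 1;      /* 2. add one in the register         */
    shared = tmp;       /* 3. store the result back to memory */
    /* A context switch between any two of these steps can lose an
       update when another thread runs the same sequence. */
}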
I have an int array[100] and I want 5 threads to calculate the sum of all array elements.
Each thread iterates through 20 elements within its dedicated range and writes the sum into a global sum variable.
Is a mutex necessary here? There is no synchronization needed since all threads are reading from independent sources.
for (i = offset; i < offset + range; i++) {
    // not used: pthread_mutex_lock(&mutex);
    sum += array[i];
    // not used: pthread_mutex_unlock(&mutex);
}
Can this lead to unpredictable behavior or does the OS actually handle this?
Is it advisable to leave out the mutex in this case? I've noticed that those algorithms run a lot faster without it.
Yes, you need synchronization, because all threads are modifying sum at the same time. Here's an example:
You have an array of 4 elements [a1, a2, a3, a4], two threads t1 and t2, and sum. To begin, let's say t1 gets the value a1 and adds it to sum. But that's not an atomic operation: t1 copies the current value of sum (it's 0) into its local space, call it t1_s, adds a1 to it, and then writes sum = t1_s. But at the same time t2 does the same: it reads the value of sum (which is still 0, because t1 has not completed its operation) into t2_s, adds a3, and writes the result to sum. So sum ends up holding the value of a3 instead of a1 + a3. This is called a data race.
There are multiple solutions to this:
You can use a mutex as you already did in your code, but as you mentioned it can be slow, since mutex locks are expensive and all the other threads have to wait for it.
Create an array (with one slot per thread) to collect the local sums of all threads, and then do a final reduction over this array in one thread. No synchronization needed; see the sketch after this list.
Without the array, calculate a local sum_local for each thread and, at the end, add all these sums to the shared variable sum using a mutex. I guess it will be faster (however, it needs to be checked).
However, as #gavinb mentioned, all of this makes sense only for larger amounts of data.
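A minimal pthreads sketch of the second option above, using the sizes from the question (5 threads, 20 elements each); the names are illustrative.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 5
#define RANGE 20

static int array[100];
static long partial[NUM_THREADS];   /* one slot per thread: no sharing, no mutex */

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    long s = 0;
    for (int i = id * RANGE; i < (id + 1) * RANGE; i++)
        s += array[i];
    partial[id] = s;                /* each thread writes only its own slot */
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];
    for (int i = 0; i < 100; i++)
        array[i] = i;

    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);

    long sum = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(t[i], NULL);
        sum += partial[i];          /* single-threaded reduction after the joins */
    }
    printf("sum = %ld\n", sum);
    return 0;
}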
I have an int array[100] and I want 5 threads to calculate the sum of all array elements. Each thread iterates through 20 elements within its dedicated range and writes the sum into a global sum variable.
First of all, it's worth pointing out that the overhead of this many threads processing this small amount of data would probably not be an advantage. There is a cost to creating threads, serialising access, and waiting for them to finish. With a dataset this small, a well-optimised sequential algorithm is probably faster. It would be an interesting exercise to measure the speedup with a varying number of threads.
Is a mutex necessary here? There is no synchronization needed since all threads are reading from independent sources.
Yes - the reading of the array variable is independent, however updating the sum variable is not, so you would need a mutex to serialise access to sum, according to your description above.
However, this is a very inefficient way of calculating the sum, as each thread will be competing (and waiting, hence wasting time) for access to increment sum. If you calculate intermediate sums for each subset (as #Werkov also mentioned), then wait for them to complete and add the intermediate sums to create the final sum, there will be no contention reading or writing, so you wouldn't need a mutex and each thread could run as quickly as possible. The limiting factor on performance would then likely be memory access pattern and cache behaviour.
Can this lead to unpredictable behavior or does the OS actually handle this?
Yes, definitely. The OS will not handle this for you as it cannot predict how/when you will access different parts of memory, and for what reason. Shared data must be protected between threads whenever any one of them may be writing to the data. So you would almost certainly get the wrong result as threads trip over each other updating sum.
Is it advisable to leave out the mutex in this case? I've noticed that those algorithms run a lot faster without it.
No, definitely not. It might run faster, but it will almost certainly not give you the correct result!
This is the case where it is possible to partition the data in such a way that there aren't dependencies (i.e. reads/writes) across partitions. In your example, there is the dependency on the sum variable, and the mutex is necessary. However, you can have a partial sum accumulator for each thread and then only sum these sub-results, without need of a mutex.
Of course, you needn't do this by hand. There are various implementations of this; for instance, see OpenMP's parallel for and reduction.
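For instance, a minimal sketch of the OpenMP reduction approach for the array-sum example (compile with -fopenmp on GCC/Clang):

#include <stdio.h>

int main(void)
{
    int array[100], sum = 0;
    for (int i = 0; i < 100; i++)
        array[i] = i;

    /* reduction(+:sum) gives each thread a private copy of sum and
       combines the copies at the end of the loop: no mutex needed. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 100; i++)
        sum += array[i];

    printf("sum = %d\n", sum);
    return 0;
}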
I have 2 questions regarding to threads, one is about race condition and the other is about mutex.
So the first question :
I've read about race conditions on the Wikipedia page:
http://en.wikipedia.org/wiki/Race_condition
And in the example of a race condition between 2 threads, this is shown:
http://i60.tinypic.com/2vrtuz4.png
Now, so far I believed that threads work in parallel to each other, but judging from this picture it seems that I misunderstood how actions are carried out by the computer.
From this picture, only one action is done at a time, and although the threads get switched from time to time and the other thread gets to do some actions, it is still one action at a time done by the computer. Is it really like this? Is there no "real" parallel computing, just one action done at a time at a very fast rate, which gives the illusion of parallel computing?
This leads me to my second question about mutex.
I've read that if threads read/write the same memory, we need some sort of synchronization mechanism. I've read that normal data types won't do and that we need a mutex.
Let's take, for example, the following code:
#include <stdio.h>
#include <stdbool.h>
#include <windows.h>
#include <process.h>

bool lock = false;

void increment(void*);
void decrement(void*);

int main()
{
    int n = 5;
    HANDLE hIncrement = (HANDLE)_beginthread(increment, 0, (void*)&n);
    HANDLE hDecrement = (HANDLE)_beginthread(decrement, 0, (void*)&n);
    WaitForSingleObject(hIncrement, 1000 * 500);
    WaitForSingleObject(hDecrement, 1000 * 500);
    return 0;
}

void increment(void *p)
{
    int *n = p;
    for (int i = 0; i < 10; i++)
    {
        while (lock)
        {
        }
        lock = true;
        (*n)++;
        lock = false;
    }
}

void decrement(void *p)
{
    int *n = p;
    for (int i = 0; i < 10; i++)
    {
        while (lock)
        {
        }
        lock = true;
        (*n)--;
        lock = false;
    }
}
Now, in my example here, I use bool lock as my synchronization mechanism to avoid a race condition between the 2 threads over the memory pointed to by n.
Now, what I did here obviously won't work, because although I avoided a race condition over the memory pointed to by n between the 2 threads, a new race condition over the bool lock variable may occur.
Let's consider the following sequence of events (A = increment thread, B = decrement thread):
A gets out of the while loop since lock is false
A gets to set lock to true
B waits in the while loop because lock is set to true
A increment the value pointed by n
A sets lock to false
A gets to the while loop
A gets out of the while loop since lock is false
B gets out of the while loop since lock is false
A sets lock to true
B sets lock to true
and from here we get the unexpected behavior of 2 unsynchronized threads, because the bool lock is not itself race-condition proof.
OK, so far this is my understanding, and the solution to the problem above is a mutex.
I'm fine with that: a data type that will magically be race-condition proof.
I just don't understand how, with a mutex type, this won't happen, whereas with every other type it will. Here lies my problem: I want to understand why a mutex works and how this is achieved.
About your first question: whether there are actually several different threads running at once, or whether it is just implemented as fast switching, is a matter of your hardware. Typical PCs these days have several cores (often with more than one hardware thread each), so you have to assume that things actually DO happen at the same time.
But even if you have only a single-core system, things are not quite so easy. This is because the compiler is usually allowed to re-order instructions in order to optimize code. It can also e.g. choose to cache a variable in a CPU register instead of loading it from memory every time you access it, and it also doesn't have to write it back to memory every time you write to that variable. The compiler is allowed to do that as long as the result is the same AS IF it had run your original code in its original order - as long as nobody else is looking closely at what's actually going on, such as a different thread.
And once you actually do have different cores, consider that they all have their own CPU registers and even their own cache. Even if a thread on one core wrote to a certain variable, as long as that core doesn't write its cache back to the shared memory a different core won't see that change.
In short, you have to be very careful in making any assumptions about what happens when two threads access variables at the same time, especially in C/C++. The interactions can be so surprising that I'd say, to stay on the safe side, you should make sure that there are no race conditions in your code, e.g. by always using mutexes for accessing memory that is shared between threads.
Which is where we can neatly segue into the second question: what's so special about mutexes, and how can they work if all basic data types are not threadsafe?
The thing about mutexes is that they are implemented with a lot of knowledge about the system for which they are being used (hardware and operating system), and with either the direct help or a deep knowledge of the compiler itself.
The C language does not give you direct access to all the capabilities of your hardware and operating system, because platforms can be very different from each other. Instead, C focuses on providing a level of abstraction that allows you to compile the same code for many different platforms. The different "basic" data types are just something that the C standard came up with as a set of data types which can in some way be supported on almost any platform - but the actual hardware that your program will be compiled for is usually not limited to those types and operations.
In other words, not everything that you can do with your PC can be expressed in terms of C's ints, bytes, assignments, arithmetic operators and so on. For example, PCs often calculate with 80-bit floating point types which are usually not mapped directly to a C floating point type at all. More to the point of our topic, there are also CPU instructions that influence how multiple CPU cores will work together. Additionally, if you know the CPU, you often know a few things about the behaviour of the basic types that the C standard doesn't guarantee (for example, whether loads and stores to 32-bit integers are atomic). With that extra knowledge, it can become possible to implement mutexes for that particular platform, and it will often require code that is e.g. written directly in assembly language, because the necessary features are not available in plain C.
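To illustrate the kind of primitive this boils down to, here is a minimal spinlock sketch using C11 atomics, as a portable stand-in for the platform-specific assembly described above; real mutexes additionally block in the OS instead of spinning.

#include <stdatomic.h>

typedef struct { atomic_flag held; } spinlock;  /* initialise held with ATOMIC_FLAG_INIT */

void spin_lock(spinlock *l)
{
    /* test-and-set is a single indivisible read-modify-write: exactly the
       property the hand-rolled bool lock in the question was missing. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;                           /* busy-wait until we flipped the flag */
}

void spin_unlock(spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}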