I have a general question that occurred to me while trying to implement a thread synchronization problem with semaphores. I do not want to go into much (unrelated) detail, so I am only giving the code that I think is important to clarify my question.
sem_t *mysema;
volatile int counter;

struct my_info {
    pthread_t t;
    int id;
};

void *barrier(void *arg)
{
    struct my_info *a = arg;
    int thrid = a->id;

    while (counter > 0) {
        do_work(&mysema[thrid]);
        sem_wait(&mysema[thrid]);
        display_my_work(arg);
        counter--;
        sem_post(&mysema[thrid + 1]);
    }
    return NULL;
}
int main(int argc, char *argv[])
{
    int i;
    int M = atoi(argv[1]);
    struct my_info *tinfo = malloc(M * sizeof(*tinfo));
    mysema = malloc(M * sizeof(*mysema));
    counter = 50;

    /* semaphore initialisations */
    for (i = 0; i < M; i++) {
        sem_init(&mysema[i], 0, 0);
    }
    for (i = 0; i < M; i++) {
        tinfo[i].id = i;
    }
    for (i = 0; i < M; i++) {
        pthread_create(&tinfo[i].t, NULL, barrier, &tinfo[i]);
    }

    /* init wakes up the first semaphore */
    sem_post(&mysema[0]);
.
.
.
We have an array of M semaphores, each initialised to 0, where M is given on the command line by the user.
I know I am done when the M threads, between them, have done the necessary computations 50 times in total.
Each thread blocks itself until the previous thread `sem_post`'s it. The very first thread is woken up by the initial `sem_post` in main.
My question is whether the threads will stop when `counter` reaches 0. Do they all see the same variable `counter`? (It is a global one, initialised in main.)
If thread zero, the very first time, makes `counter = 49`, do all the other threads (threads 1, 2, ..., M-1) see that?
These are different questions:

Do [the threads] all see the same variable `counter`? (It is a global one, initialised in main.)

If thread zero, the very first time, makes `counter = 49`, do all the other threads (threads 1, 2, ..., M-1) see that?
The first is fairly simple: yes. An object declared at file scope and without storage class specifier _Thread_local is a single object whose storage duration is the entire run of the program. Wherever that object's identifier is in-scope and visible, it identifies the same object regardless of which thread is accessing it.
The answer to the second question is more complicated. In a multi-threaded program there is the potential for data races, and the behavior of a program containing a data race is undefined. The volatile qualifier does not protect against these; instead, you need proper synchronization for all accesses to each shared variable, both reads and writes. This can be provided by a semaphore or more often a mutex, among other possibilities.
Your code's decrement of counter may be adequately protected, but I suspect not, on account of the threads using different semaphores. If this allows for multiple different threads to execute the ...
display_my_work(arg);
counter--;
... lines at the same time then you have a data race. Even if your protection is adequate there, however, the read of counter in the while condition clearly is not properly synchronized, and you definitely have a data race there.
One of the common manifestations of the undefined behavior brought on by data races is that threads do not see each other's updates. So not only does your program's undefined behavior generally mean that threads 1 ... M-1 may not see thread 0's update of counter, it also specifically makes such a failure comparatively probable.
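As a minimal sketch of what proper synchronization could look like here, assuming pthreads (the mutex name and loop shape are illustrative, not the only correct structure):

#include <pthread.h>

pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
int counter = 50;   /* volatile is unnecessary once every access is locked */

/* In each thread: both the test and the decrement happen under
 * the same lock, so no access can race with another thread's. */
for (;;) {
    pthread_mutex_lock(&counter_lock);
    if (counter <= 0) {
        pthread_mutex_unlock(&counter_lock);
        break;
    }
    counter--;
    pthread_mutex_unlock(&counter_lock);
    /* ... per-iteration work, display_my_work(), etc. ... */
}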
#include <stdio.h>

int main(void)
{
    for (int i = 0; i < 100; i++)
    {
        int count = 0;
        printf("%d ", ++count);
    }
    return 0;
}
The output of the above program is: 1 1 1 1 1 1..........1
Please take a look at the code above. I declared the variable int count=0 inside the for loop.
To my knowledge, the scope of the variable is the block, so count should be alive only while the for loop executes.
int count=0 executes 100 times, so either it has to create the variable 100 times or it has to give an error (re-declaration of the count variable), but neither happens. What may be the reason?
According to the output, the variable is initialized to zero every time.
Please help me find the reason.
Such simple code can be visualised on http://www.pythontutor.com/c.html for easy understanding.
To answer your question: count gets destroyed when it goes out of scope, that is, at the closing } of the loop. On the next iteration a variable of the same name is created and initialised to 0, and that is what the printf uses.
And if counting is your goal, print i instead of count.
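A quick (and entirely implementation-specific) way to observe what a typical compiler does is to print the variable's address each time around. On most implementations the same stack slot is reused, even though, in the abstract model, a fresh lifetime of count begins every iteration; the standard does not guarantee this output:

#include <stdio.h>

int main(void)
{
    for (int i = 0; i < 3; i++)
    {
        int count = 0;
        /* Typically prints the same address every iteration,
         * although the C standard does not require that. */
        printf("count = %d at %p\n", ++count, (void *)&count);
    }
    return 0;
}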
The C standard describes the C language using an abstract model of a computer. In this model, count is created each time the body of the loop is executed, and it is destroyed when execution of the body ends. By “created” and “destroyed,” we mean that memory is reserved for it and is released, and that the initialization is performed with the reservation.
The C standard does not require compilers to implement this model slavishly. Most compilers will allocate a fixed amount of stack space when the routine starts, with space for count included in this fixed amount, and then count will use that same space in each iteration. Then, if we look at the assembly code generated, we will not see any reservation or release of memory; the stack will be grown and shrunk only once for the whole routine, not grown and shrunk in each loop iteration.
Thus, the answer is twofold:
In C’s abstract model of computing, a new lifetime of count begins and ends in each loop iteration.
In most actual implementations, memory is reserved just once for count, although implementations may also allocate and release memory in each iteration.
However, even if you know your C implementation allocates stack space just once per routine when it can, you should generally think about programs in the C model in this regard. Consider this:
for (int i = 0; i < 100; ++i)
{
    int count = 0;
    // Do some things with count.
    float x = 0;
    // Do some things with x.
}
In this code, the compiler might allocate four bytes of stack space to use for both count and x, to be used for one of them at a time. The routine would grow the stack once, when it starts, including four bytes to use for count and x. In each iteration of the loop, it would use the memory first for count and then for x. This lets us see that the memory is first reserved for count, then released, then reserved for x, then released, and then that repeats in each iteration. The reservations and releases occur conceptually even though there are no instructions to grow and shrink the stack.
Another illuminating example is:
for (int i = 0; i < 100; ++i)
{
    extern int baz(void);
    int a[baz()], b[baz()];
    extern void bar(void *, void *);
    bar(a, b);
}
In this case, the compiler cannot reserve memory for a and b when the routine starts because it does not know how much memory it will need. In each iteration, it must call baz to find how much memory is needed for a and how much for b, and then it must allocate stack space (or other memory) for them. Further, since the sizes may vary from iteration to iteration, it is not possible for both a and b to start in the same place in each iteration—one of them must move to make way for the other. So this code lets us see that a new a and a new b must be created in each iteration.
int count=0 is executing 100 times, then it has to create the variable 100 times
No, it defines the variable count once, then assigns it the value 0 on each of the 100 iterations.
Defining a variable in C does not involve any particular step or code to "create" it (unlike for example in C++, where simply defining a variable may default-construct it). Variable definitions just associate the name with an "entity" that represents the variable internally, and definitions are tied to the scope where they appear.
Assigning a variable is a statement which gets executed during the normal program flow. It usually has "observable effects", otherwise the compiler is allowed to optimize it out entirely.
OP's example can be rewritten in a completely equivalent form as follows.
for (int i = 0; i < 100; i++)
{
    int count;  // definition of variable count - defined once in this {} scope
    count = 0;  // assignment of value 0 to count - executed once per iteration, 100 times total
    printf("%d ", ++count);
}
Eric has it correct. In much shorter form:
Typically, compilers determine at compile time how much memory a function needs and the offsets in the stack for its variables. The actual memory allocation occurs on each function call, and the memory is released on function return.
Further, when you have variables nested within {curly braces}, once execution leaves that brace set the compiler is free to reuse that memory for other variables in the function. There are two reasons I intentionally do this:
1. The variables are large but only needed for a short time, so why make stacks larger than needed? Especially if you need several large temporary structures or arrays at different times. The smaller the scope, the less chance of bugs.
2. If a variable only has a sane value for a limited amount of time, and would be dangerous or buggy to use outside that window, add extra curly braces to limit the scope of access so that improper use generates immediate compiler errors. Using unique names for each variable, even if the compiler doesn't insist on it, keeps the debugger, and your mind, less confused.
Example:
void your_function(int a)
{
    { // limit scope of stack_1
        int stack_1 = 0;
        for (int ii = 0; ii < a; ++ii) { // really limit scope of ii
            stack_1 += some_calculation(ii, a);
        }
        printf("ii=%d\n", ii);           // scope error: ii no longer exists here
        printf("stack_1=%d\n", stack_1); // good
    } // done with stack_1

    {
        int limited_scope_1[10000];
        do_something(a, limited_scope_1);
    }
    {
        float limited_scope_2[10000];
        do_something_else(a, limited_scope_2);
    }
}
A compiler given code like:
void do_something(int, int const *);
...
for (int i = 0; i < 100; i++)
{
    int const j = (i & 1);
    do_something(i, &j);
}
could legitimately replace it with:
void do_something(int, int const *);
...
int const __compiler_generated_0 = 0;
int const __compiler_generated_1 = 1;
for (int i = 0; i < 100; i += 2)
{
    do_something(i, &__compiler_generated_0);
    do_something(i + 1, &__compiler_generated_1);
}
Although a compiler would typically allocate space on the stack for j once, when the function was entered, and then not reuse that storage during the loop (or even the function), meaning that j would have the same address on every iteration of the loop, there is no requirement that the address remain constant. While there typically wouldn't be an advantage to having the address vary between iterations, compilers are allowed to exploit such situations should they arise.
I'm trying to understand the details of the TCB (thread control block) and the differences between per-thread state and shared state. My book has its own implementation of pthreads, so it gives an example with this mini C program (I've not typed the whole thing out):
#include "thread.h"
static void go(int n);
static thread_t threads[NTHREADS];
#define NTHREADS 10
int main(int argh, char **argv) {
int i;
long exitValue;
for (i = 0; i < NTHREADS; i++) {
thread_create(&threads[i]), &go, i);
}
for (i = 0; i < NTHREADS; i++) {
exitValue = thread_join(threads[i]);
}
printf("Main thread done".\n);
return 0;
}
void go(int n) {
printf("Hello from thread %d\n", n);
thread_exit(100 + n);
}
What would the variables i and exitValue (in the main() function) be examples of? They're not shared state, since they're not global variables, but I'm not sure they're per-thread state either. i is used as the argument to the go function when each thread is created, so I'm a bit confused about it. exitValue's scope is limited to main(), so it seems it would just be stored on the process's stack. The int n parameter of go() would be a per-thread variable, because its value is independent for each thread. I don't think I fully understand these concepts, so any help would be appreciated. Thanks!
Short Answer
All of the variables in your example program are automatic variables. Each time one of them comes into scope, storage for it is allocated; when it leaves its scope, it is no longer valid. This concept is independent of whether the variable is shared or not.
Longer Answer
The scope of a variable refers to its lifetime in the program (and also the rules for how it can be accessed). In your program the variables i and exitValue are scoped to the main function. Typically a compiler will allocate space on the stack which is used to store the values for these variables.
The variable n in function go is a parameter to the function, and so it also acts as a local variable in go. Each time go is executed, the compiler allocates space in the stack frame for the variable n (although the compiler may be able to optimize local variables into registers rather than actually allocating stack space). As a parameter, n is initialized with whatever value the function was called with (its actual argument).
To make this more concrete, here is what the values of the variables in the program would be after the first loop has completed 2 iterations (assuming the spawned threads haven't finished executing):
Main thread: i = 2, exitValue = 0
Thread 0: n = 0
Thread 1: n = 1
The thing to note is that there are multiple independent copies of the variable n. And that n gets a copy of the value in i when thread_create is executed, but that the values of i and n are independent after that.
Finally I'm not certain what is supposed to happen with the statement exitValue = thread_join(threads[i]); since this is a variation of pthreads. But what probably happens is that it makes the value available when another thread calls thread_join. So in that way you do get some data sharing between threads, but the sharing is synchronized by the thread_join command.
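The book's thread_join is presumably modeled on POSIX pthread_join, which publishes a thread's exit value to the joiner through an output pointer. A minimal pthreads sketch of the same pattern (the names here are illustrative, not the book's library):

#include <pthread.h>
#include <stdio.h>

static void *go(void *arg) {
    int n = (int)(long)arg;          /* recover the small integer argument */
    printf("Hello from thread %d\n", n);
    return (void *)(long)(100 + n);  /* this thread's exit value */
}

int main(void) {
    pthread_t threads[10];
    for (long i = 0; i < 10; i++)
        pthread_create(&threads[i], NULL, go, (void *)i);
    for (int i = 0; i < 10; i++) {
        void *exitValue;
        pthread_join(threads[i], &exitValue);  /* join hands the value over */
        printf("Thread %d exited with %ld\n", i, (long)exitValue);
    }
    return 0;
}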
They're objects with automatic storage, casually known as "local variables", although the latter is ambiguous, since C and C++ both allow objects with local scope but a single program-wide instance via the static keyword.
Can you clarify why the following code is a safe way to pass parameters into a new thread:
//Listing 5.3 Passing a Value into a Created Thread
for ( int i=0; i<10; i++ )
    pthread_create( &thread, 0, &thread_code, (void *)i );
And the following code isn't:
//Listing 5.4 Erroneous Way of Passing Data to a New Thread
for ( int i=0; i<10; i++ )
    pthread_create( &thread, 0, &thread_code, (void *)&i );
Quote from the book, regarding the code:
It is critical to realize that the child thread can start executing at any point after the call, so the pointer must point to something that still exists and still retains the same value. This rules out passing in pointers to changing variables as well as pointers to information held on the stack (unless the stack is certain to exist until after the child thread has read the value).
A third method, shown below, is also good:
static int args[10];

for ( int i=0; i<10; i++ ) {
    args[i] = i;
    pthread_create( &thread, 0, &thread_code, (void *)&args[i] );
}
If you want the same variable shared across all the threads, make it a local variable in main or, preferably, a static or global variable.
Issues with method 1 and method 2:
Method 1: You are casting an int to void * and then back to int, which is bad, as the sizes of int and void * may differ. If you instead cast the void * to int * and dereference it, it is even worse, and UB. Also read this post. (See the intptr_t sketch after this list.)
Method 2: You are passing the same address to all threads. When i is changed, whether by the main thread or by any of the 10 worker threads, the new value is seen everywhere, which may not be your intention. Moreover, the scope of i ends after the for loop, so you may end up accessing dangling pointers in the threads, which causes UB (undefined behaviour).
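A common middle ground, sketched below, round-trips the integer through intptr_t from <stdint.h>, which at least makes the integer-to-pointer conversion well-behaved in practice on implementations that provide the type (this is an illustration, not the book's code):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static void *thread_code(void *arg) {
    int i = (int)(intptr_t)arg;   /* reverse the cast performed at creation */
    printf("worker got %d\n", i);
    return NULL;
}

int main(void) {
    pthread_t thread;
    for (int i = 0; i < 10; i++) {
        pthread_create( &thread, 0, &thread_code, (void *)(intptr_t)i );
        pthread_join(thread, NULL);  /* joined immediately just to keep the sketch short */
    }
    return 0;
}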
Why is the second example wrong?
As your citation says, you must not pass a pointer to the iteration variable, because it gets changed quickly. You never know when exactly the concurrent thread will use the pointer and dereference it.
// Listing 5.4 Erroneous Way of Passing Data to a New Thread
for ( int i=0; i<10; i++ )
    pthread_create( &thread, 0, &thread_code, (void *)&i );
Imagine the very first call to pthread_create(). It receives a pointer to i and will probably dereference the pointer and read the value. The value is supposed to be 0 at that time, but the main thread (the one with the for loop) may already have changed i from 0 to 1. That is called a race condition, because your program's behaviour depends on whether one thread is faster to change the value or the other is faster to read it.
There's a second race condition as well: the i variable goes out of scope at the end of the loop. If the threads are slow to start or to read the pointer's target, the address on the stack may already have been reused for something else. You must not dereference pointers to variables that no longer exist.
Why doesn't the first example have the same problem?
The first example uses the value of i, not its address. That is good, as pthread_create() will just hold the value and pass it to the thread.
// Listing 5.3 Passing a Value into a Created Thread
for ( int i=0; i<10; i++ )
    pthread_create( &thread, 0, &thread_code, (void *)i );
But pthread_create() only accepts a void * (a generic pointer). The example uses a special trick: it casts the integer value to a pointer value, and the thread function is expected to do the reverse (cast the pointer back to an integer).
This trick is often used to store an integer value where an object is expected, as it avoids having to allocate and deallocate the object. Whether such a technique is good or bad practice is out of scope for a factual answer. It is used in frameworks like GLib, but I guess many programmers will scorn it.
Final notes
The examples in the book are clearly not solutions for real problems but just motivating examples. In actual code you would rarely pass just an integer value, and you might want to join the thread at some point. Therefore, in a simple scenario, you would allocate the thread arguments, fill them in, start the workers, join the workers, retrieve the results, and free the allocations, as sketched below.
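A minimal sketch of that simple scenario, assuming pthreads (the argument struct and all names are illustrative):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct work { int input; int result; };

static void *worker(void *arg) {
    struct work *w = arg;
    w->result = w->input * 2;  /* stand-in for real work */
    return NULL;
}

int main(void) {
    enum { N = 10 };
    pthread_t threads[N];
    struct work *args = malloc(N * sizeof *args);             /* allocate the arguments */
    for (int i = 0; i < N; i++) {
        args[i].input = i;                                    /* fill them in */
        pthread_create(&threads[i], NULL, worker, &args[i]);  /* start the workers */
    }
    for (int i = 0; i < N; i++) {
        pthread_join(threads[i], NULL);                       /* join the workers */
        printf("result[%d] = %d\n", i, args[i].result);       /* retrieve the results */
    }
    free(args);                                               /* free the allocations */
    return 0;
}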
In a more complicated scenario you would communicate with the threads, and therefore you wouldn't be limited to feeding them at their creation and retrieving the results after joining them. You could even let the workers keep running and reuse them whenever you need them.
I have to use two threads: one to do various operations on matrices, and the other to monitor virtual memory at various points in the matrix-operation process. The method is required to use a global state variable flag.
So far I have the following (leaving some out for brevity):
int flag = 0;

int allocate_matrices(int dimension)
{
    while (flag == 0) {} //busy wait while main prints memory state
    int *matrix = (int *) malloc(sizeof(int)*dimension*dimension);
    int *matrix2 = (int *) malloc(sizeof(int)*dimension*dimension);
    flag = 0;
    while (flag == 0) {} //busy wait while main prints memory state
    // more similar actions...
}

int memory_stats()
{
    while (flag == 0)
    { system("top"); flag = 1; }
}

int main()
{
    // threads are created and joined for these two functions
}
As you might expect, the system("top") call happens once, then the matrices are allocated, then the program falls into an infinite loop. It seems apparent to me that this is because the thread assigned to the memory_stats function has already completed its duty, so flag will never be updated again.
Is there an elegant way around this? I know I have to print memory stats four times, so it occurs to me that I could write four while loops in the memory_stats function, with busy waiting contingent on the global flag between each of them, but that seems clunky to me. Any help or pointers would be appreciated.
One of the possible reasons for the hang is that flag is a regular variable and the compiler sees that it is never set to a non-zero value between flag = 0; and while (flag == 0) {}, or inside that while, in allocate_matrices(). So it "thinks" the variable stays 0 and the loop becomes infinite. The compiler is entirely oblivious to your threads.
You could define flag as volatile to prevent the above from happening, but you'll likely run into other issues after adding volatile. For one thing, volatile does not guarantee atomicity of variable modifications.
Another issue is that if the compiler sees an infinite loop with no side effects, it may treat it as undefined behavior, and then anything could happen, or at least not what you think should happen.
You need to use proper synchronization primitives like mutexes.
You can protect it with a mutex. I assume you use pthreads.

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

pthread_mutex_lock(&mutex);
flag = 1;
pthread_mutex_unlock(&mutex);
Here is a very good tutorial about pthreads, mutexes and other stuff: https://computing.llnl.gov/tutorials/pthreads/
Your problem could be solved with a C compiler that follows the latest C standard, C11. C11 has threads, and a data type called atomic_flag that can be used for exactly the kind of spin lock you have in your question.
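A minimal sketch of an atomic_flag spin lock using C11 <stdatomic.h> (the wrapper function names are illustrative):

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

static void spin_lock(void) {
    /* test_and_set returns the previous value; spin until we flip 0 -> 1 */
    while (atomic_flag_test_and_set(&lock))
        ;  /* busy wait */
}

static void spin_unlock(void) {
    atomic_flag_clear(&lock);
}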
First of all, the variable flag needs to be declared volatile, or else the compiler has license to omit reads of it after the first one.
With that out of the way, a sequencer/event counter can be used: one thread may increment the variable when it's odd, the other when it's even. Since one thread always "owns" the variable, and transfers the ownership with the increment, there is no race condition.
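A sketch of that sequencer idea, written here with C11 atomics rather than volatile so the example stays well-defined (the thread bodies are illustrative):

#include <stdatomic.h>

static atomic_int seq = 0;  /* even: stats thread's turn; odd: matrix thread's turn */

/* stats thread: acts only when seq is even, then hands over by incrementing */
static void stats_turn(void) {
    while (atomic_load(&seq) % 2 != 0)
        ;                       /* busy wait for our turn */
    /* print memory stats here */
    atomic_fetch_add(&seq, 1);  /* transfer ownership to the matrix thread */
}

/* matrix thread: acts only when seq is odd */
static void matrix_turn(void) {
    while (atomic_load(&seq) % 2 != 1)
        ;                       /* busy wait for our turn */
    /* allocate and operate on matrices here */
    atomic_fetch_add(&seq, 1);  /* hand the turn back */
}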
Coming from CUDA, I'm interested in how shared memory is read from a thread and how that compares to the read-alignment requirements of CUDA. I'll use the following code as an example:
#include <unistd.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#define THREADS 2
void * threadFun(void * args);
typedef struct {
    float *dataPtr;
    int tIdx,
        dSize;
} t_data;

int main(int argc, char *argv[])
{
    int i,
        sizeData = 5;
    void *status;
    float *data;
    t_data *d;
    pthread_t *threads;
    pthread_attr_t attr;

    data = (float *) malloc(sizeof(float) * sizeData);
    threads = (pthread_t *) malloc(sizeof(pthread_t) * THREADS);
    d = (t_data *) malloc(sizeof(t_data) * THREADS);

    data[0] = 0.0;
    data[1] = 0.1;
    data[2] = 0.2;
    data[3] = 0.3;
    data[4] = 0.4;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    for (i = 0; i < THREADS; i++)
    {
        d[i].tIdx = i;
        d[i].dataPtr = data;
        d[i].dSize = sizeData;
        pthread_create(&threads[i], NULL, threadFun, (void *)(d + i));
    }

    for (i = 0; i < THREADS; i++)
    {
        pthread_join(threads[i], &status);
        if (status) {
            // Error
        }
    }
    return 0;
}

void * threadFun(void *args)
{
    int i;
    t_data *d = (t_data *) args;
    float sumVal = 0.0;

    for (i = 0; i < d->dSize; i++)
        sumVal += d->dataPtr[i] * (d->tIdx + 1);

    printf("Thread %d calculated the value as %-11.11f\n", d->tIdx, sumVal);
    return NULL;
}
In threadFun, the entire pointer d points into shared memory space (I believe). From what I've encountered in documentation, reading from multiple threads is OK. In CUDA, reads need to be coalesced; are there similar alignment restrictions in pthreads? I.e., if I have two threads reading from the same shared address, I'm assuming that somewhere along the line a scheduler has to put one thread ahead of the other. In CUDA this could be a costly operation and should be avoided. Is there a penalty for 'simultaneous' reads from shared memory, and if so, is it so small that it is negligible? I.e., both threads may need to read d->dataPtr[0] simultaneously; I'm assuming that memory reads cannot occur simultaneously. Is this assumption wrong?
Also, I read an article from Intel that said to use a structure of arrays when multithreading; this is consistent with CUDA. If I do this, though, it is almost inevitable that I will need the thread ID, which I believe will require me to use a mutex to lock the thread ID until it is read into the thread's scope. Is this true, or would there be some other way to identify threads?
An article on memory management for multithreaded programs would be appreciated as well.
While your thread data pointer d points into a shared memory space, you are basically dealing with localized thread data unless you increment that pointer to read from or write to an adjoining thread-data element in the array. The value of args is also local to each thread, so in both cases, if you are not incrementing the data pointer itself (i.e., you never call something like d++ so that you point at another thread's memory), no mutex is needed to guard the memory "belonging" to your thread.
As for your thread ID: since you only write that value from the spawning thread, and then read it in the spawned thread, there is no need for a mutex or other synchronization mechanism; you have a single producer and a single consumer for the data. Mutexes and other synchronization mechanisms are only needed when multiple threads read and write the same data location.
CPUs have caches. Reads come from caches, so each CPU/core can read from its own cache as long as the corresponding cache line is in the SHARED state. Writes force the cache line into the EXCLUSIVE state, invalidating the corresponding cache lines on other CPUs.
If you have an array with one member per thread, and there are both reads and writes to that array, you may want to align every member to a cache line, to avoid false sharing.
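A sketch of that per-member padding (the 64-byte line size is an assumption; real line sizes vary by CPU, so check your target before relying on it):

#include <stdalign.h>

#define CACHELINE 64  /* assumed cache line size; actual CPUs vary */

/* Aligning the first member to the line size makes each slot occupy
 * its own cache line, so one thread's writes do not invalidate the
 * line holding another thread's slot. */
struct per_thread {
    alignas(CACHELINE) float partial_sum;
};

static struct per_thread slots[THREADS];  /* THREADS as in the question's code */

Because the struct's alignment becomes 64, its size is padded to 64 as well, so adjacent slots never share a line.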
Memory reads from different threads to the same area aren't a problem on shared-memory systems (writes are another matter; the pertinent unit is the cache line, 64-256 bytes depending on the system).
I don't see any reason why getting the thread ID should be a synchronized operation. You can feed your thread any ID that is meaningful to you; that can be simpler than deriving a meaningful value from an abstract ID.
Coming from CUDA probably makes you think too complicated. POSIX threads are much simpler. Basically, what you are doing should work, as long as you are only reading from the shared array.
Also, don't forget that CUDA is derived from C++ and not from C, so some things might look different from that aspect, too. E.g., in your code, the habit of casting the return of malloc is generally frowned upon by real C programmers, since it can be the source of subtle errors there.