I'm just fiddling around with threads and observing how a race condition occurs while modifying an unprotected global variable. Here's a simple program with 3 threads incrementing a global variable in a tight loop of 100000 iterations:
#include <stdio.h>
#include <pthread.h>

static int global;
pthread_mutex_t lock;

#define LOCK() pthread_mutex_lock(&lock)
#define UNLOCK() pthread_mutex_unlock(&lock)

void *func(void *arg)
{
    int i;
    //LOCK();
    for(i = 0; i < 100000; i++)
    {
        global++;
    }
    //UNLOCK();
    return NULL;
}
int main()
{
    pthread_t tid[3];
    int i;

    pthread_mutex_init(&lock, NULL);
    for(i = 0; i < 3; i++)
    {
        pthread_create(&tid[i], NULL, func, NULL);
    }
    for(i = 0; i < 3; i++)
    {
        pthread_join(tid[i], NULL);
    }
    pthread_mutex_destroy(&lock);

    printf("Global value: %d\n", global);
    return 0;
}
When I compile this with the -g flag and run the binary 5 times, I get this output:
Global value: 300000
Global value: 201567
Global value: 179584
Global value: 105194
Global value: 205161
Which is expected. Classic synchronization issue here. Nothing to see.
But when I compile with the optimization flag -O, I get this output:
Global value: 300000
Global value: 100000
Global value: 100000
Global value: 100000
Global value: 200000
This is the part that doesn't make sense to me. What did GCC optimize so that the threads raced for an entire 1/3 or 2/3 of the total iterations?
Most likely the loop got optimized to read the global variable once, do all the increments in a register, then write the result back. The difference in output then depends on whether the threads' loops overlap in time or not.
I'm learning OpenMP these days and I just came across the "threadprivate" directive. The code snippet below, written by myself, didn't produce the expected result:
// **** File: fun.h **** //
void seed(int x);
int drand();
// ********************* //

// **** File: fun.c **** //
extern int num;

int drand()
{
    num = num + 1;
    return num;
}

void seed(int num_initial)
{
    num = num_initial;
}
// ************************ //

// **** File: main.c **** //
#include "fun.h"
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int num = 0;
#pragma omp threadprivate(num)

int main()
{
    int num_inital = 4;
    seed(num_inital);
    printf("At the beginning, num = %d\n", num); // should num be 4?
    #pragma omp parallel for num_threads(2) schedule(static,1) copyin(num)
    for (int ii = 0; ii < 4; ii++) {
        int my_rank = omp_get_thread_num();
        //printf("Before processing, in thread %d num = %d\n", my_rank, num);
        int num_in_loop = drand();
        printf("Thread %d is processing loop %d: num = %d\n", my_rank, ii, num_in_loop);
    }
    system("pause");
    return 0;
}
// ********************* //
Here are my questions:
Why is the result of printf("At the beginning, num = %d\n", num); num = 0 instead of num = 4?
As for the parallel for loop, multiple executions produce different results one of which is:
Thread 1 is processing loop 1: num = 5
Thread 0 is processing loop 0: num = 6
Thread 1 is processing loop 3: num = 7
Thread 0 is processing loop 2: num = 8
It seems that num is initialized to 4 inside the for loop, which suggests that the num referred to by the copyin clause is equal to 4. Why is the num in printf("At the beginning, num = %d\n", num) different from the one copyin sees?
The OpenMP website says:
In parallel regions, references by the master thread will be to the copy of the variable in the thread that encountered the parallel region.
According to this explanation, Thread 0 (the master thread) should initially contain num = 4. Therefore loop 0's output should always be: Thread 0 is processing loop 0: num = 5. Why is the result above different?
My working environment is Windows 10 with VS2015.
I think the problem is within the fun.c compilation unit: there, the compiler cannot determine that the extern int num; variable is also a thread-local (TLS) one.
I would include the directive #pragma omp threadprivate(num) in this file as well:
// **** File: fun.c **** //
extern int num;
#pragma omp threadprivate(num)

int drand()
{
    num = num + 1;
    return num;
}

void seed(int num_initial)
{
    num = num_initial;
}
// ************************ //
In any case, the toolchain should arguably warn about this mismatch at link time.
The copyin clause is meant to be used with OpenMP teams (e.g. computation on computing accelerators).
Indeed, the OpenMP documentation says:
These clauses support the copying of data values from private or threadprivate variables on one implicit task or thread to the corresponding variables on other implicit tasks or threads in the team.
Thus, in your case, you should rather use the firstprivate clause.
Please note that version 5.0 of the OpenMP documentation you are reading is probably not supported by VS2015 (which implements OpenMP 2.0). I advise you to read an older version compatible with VS2015; otherwise the behavior of the compiled program is likely to be undefined.
I've got the following example; let's say I want each thread to count from 0 to 9.
#include <pthread.h>

void* iterate(void* arg) {
    int i = 0;
    while(i<10) {
        i++;
    }
    pthread_exit(0);
}

int main() {
    int j = 0;
    pthread_t tid[100];
    while(j<100) {
        pthread_create(&tid[j],NULL,iterate,NULL);
        pthread_join(tid[j],NULL);
        j++;
    }
}
Variable i is in a critical section; it will be overwritten multiple times and therefore the threads will fail to count.
int* i=(int*)calloc(1,sizeof(int));
doesn't solve the problem either. I don't want to use a mutex. What is the most common solution to this problem?
As other users are commenting, there are several problems in your example:
Variable i is not shared (it should be a global variable, for instance), nor is it in a critical section (it is local to each thread). To have a critical section you should use locks or transactional memory.
You don't need to create and destroy threads every iteration. Just create a number of threads at the beginning and wait for them to finish (join).
pthread_exit() is not necessary, just return from the thread function (with a value).
A counter is a bad example for threads. It requires atomic operations to avoid overwriting the value of other threads. Actually, a multithreaded counter is a typical example of why atomic accesses are necessary (see this tutorial, for example).
I recommend starting with some tutorials, like this or this.
I also recommend frameworks like OpenMP; they simplify the semantics of multithreaded programs.
EDIT: example of a shared counter and 4 threads.
#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 4

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int counter = 0;

void* iterate(void* arg) {
    int i = 0;
    while(i++ < 10) {
        // enter critical section
        pthread_mutex_lock(&mutex);
        ++counter;
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

int main() {
    int j;
    pthread_t tid[NUM_THREADS];
    for(j = 0; j < NUM_THREADS; ++j)
        pthread_create(&tid[j],NULL,iterate,NULL);
    // let the threads do their magic
    for(j = 0; j < NUM_THREADS; ++j)
        pthread_join(tid[j],NULL);
    printf("%d", counter);
    return 0;
}
I have had a problem with threads for a long time. This code is supposed to have a worker thread increment the value of a shared integer while the main thread prints it out. However, I am not getting the output I expect.
#include <pthread.h>
#include <unistd.h>
#include <stdio.h>

pthread_mutex_t lock;
int shared_data = 0; //shared data

// Often shared data is more complex than just an int.
void* thread_function(void* arg)
{
    int i;
    for (i = 0; i < 10; ++i)
    {
        // Access the shared data here.
        pthread_mutex_lock(&lock);
        shared_data++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t thread;
    int i;
    void* exit_status;

    // Initialize the mutex before trying to use it.
    pthread_mutex_init(&lock, NULL);

    pthread_create(&thread, NULL, thread_function, NULL);

    // Try to use the shared data.
    for (i = 0; i < 10; ++i)
    {
        sleep(1);
        pthread_mutex_lock(&lock);
        printf ("\r for i= %d Shared integer 's value = %d\n", i, shared_data);
        pthread_mutex_unlock(&lock);
    }
    printf("\n");

    pthread_join(thread, &exit_status);

    // Clean up the mutex when we are finished with it.
    pthread_mutex_destroy(&lock);
    return 0;
}
Here is what I expect:
for i=0 Shared Integer 's value = 0
for i=1 Shared Integer 's value = 1
for i=2 Shared Integer 's value = 2
...
for i=10 Shared Integer 's value =10
but the result is:
for i=0 Shared Integer 's value = 0
for i=1 Shared Integer 's value = 10
for i=2 Shared Integer 's value = 10
...
for i=10 Shared Integer 's value =10
So how can I resolve this?
The main thread and your worker thread run concurrently; getting those for loops to interleave perfectly, one step each, is nearly impossible without extra synchronization.
Your output is exactly what you should expect. The time taken to spawn the thread allows the main thread to print before the other thread changes the shared data. Then, the print takes so long that the other thread completely finishes with its loop and increments the shared data to 10 before the main thread can get to its second iteration.
EDIT: my first version was a little hack using condition variables, but they were a bad idea for this. Here is a working version that uses pseudo-atomic variables (mutex-protected flags) and doesn't contain UB :) :
#include <pthread.h>
#include <unistd.h>
#include <stdio.h>

pthread_mutex_t want_incr_mut;
pthread_mutex_t done_incr_mut;
int want_incr = 0;
int done_incr = 0;

int shared_data = 0; //shared data

// Not using atomics, so...
void wait_for_want_increment()
{
    while (1)
    {
        pthread_mutex_lock(&want_incr_mut);
        if (want_incr)
        {
            pthread_mutex_unlock(&want_incr_mut);
            return;
        }
        pthread_mutex_unlock(&want_incr_mut);
    }
}

void wait_for_done_incrementing()
{
    while (1)
    {
        pthread_mutex_lock(&done_incr_mut);
        if (done_incr)
        {
            pthread_mutex_unlock(&done_incr_mut);
            return;
        }
        pthread_mutex_unlock(&done_incr_mut);
    }
}

void done_incrementing()
{
    pthread_mutex_lock(&done_incr_mut);
    done_incr = 1;
    pthread_mutex_lock(&want_incr_mut);
    want_incr = 0;
    pthread_mutex_unlock(&want_incr_mut);
    pthread_mutex_unlock(&done_incr_mut);
}

void want_increment()
{
    pthread_mutex_lock(&want_incr_mut);
    want_incr = 1;
    pthread_mutex_lock(&done_incr_mut);
    done_incr = 0;
    pthread_mutex_unlock(&done_incr_mut);
    pthread_mutex_unlock(&want_incr_mut);
}

// Often shared data is more complex than just an int.
void* thread_function(void* arg)
{
    int i;
    for (i = 0; i < 10; ++i)
    {
        wait_for_want_increment();
        // Access the shared data here.
        shared_data++;
        done_incrementing();
    }
    return NULL;
}

int main(void)
{
    pthread_t thread;
    int i;
    void* exit_status;

    // Initialize the mutexes before trying to use them.
    pthread_mutex_init(&want_incr_mut, NULL);
    pthread_mutex_init(&done_incr_mut, NULL);

    pthread_create(&thread, NULL, thread_function, NULL);

    // Try to use the shared data.
    for (i = 0; i <= 10; ++i)
    {
        printf("\r for i= %d Shared integer 's value = %d\n", i, shared_data);
        if (i == 10) break;
        want_increment();
        wait_for_done_incrementing();
    }
    printf("\n");

    pthread_join(thread, &exit_status);

    // Clean up the mutexes when we are finished with them.
    pthread_mutex_destroy(&want_incr_mut);
    pthread_mutex_destroy(&done_incr_mut);
    return 0;
}
Here, we just tell the worker that we want an increment and wait for it to say it is done before we continue. Meanwhile, the worker waits for us to request an increment and tells us when it is done.
I also changed the main loop to go up to ten because that is what I think you want.
Here is my output:
for i= 0 Shared integer 's value = 0
for i= 1 Shared integer 's value = 1
for i= 2 Shared integer 's value = 2
for i= 3 Shared integer 's value = 3
for i= 4 Shared integer 's value = 4
for i= 5 Shared integer 's value = 5
for i= 6 Shared integer 's value = 6
for i= 7 Shared integer 's value = 7
for i= 8 Shared integer 's value = 8
for i= 9 Shared integer 's value = 9
for i= 10 Shared integer 's value = 10
I wrote this C program:
#include <stdio.h>
#include <pthread.h>

int counter = 0;

void* increment(void* arg)
{
    int maxI = 10000;
    int i;
    for (i = 0; i < maxI; ++i) { counter++; }
    return NULL;
}

int main()
{
    pthread_t thread1_id;
    pthread_t thread2_id;
    pthread_create (&thread1_id,NULL,&increment,NULL);
    pthread_create (&thread2_id,NULL,&increment,NULL);
    pthread_join (thread1_id,NULL);
    pthread_join (thread2_id,NULL);
    printf("counter = %d\n",counter);
    return 0;
}
As a result I get : counter = 10000
Why is that? I would have expected something much bigger, since I am using two threads. How can I correct it?
PS : I am aware that there will be a race condition!
edit : volatile int counter seems to solve the problem :)
Predicting what code with bugs will do is extremely difficult. Most likely, the compiler is optimizing your increment function to keep counter in a register. But you'd have to look at the generated assembly code to be sure.
I am trying to learn how locks work in multithreading. When I execute the following code without the lock, it works fine even though the variable sum is declared as a global variable and multiple threads update it. Could anyone please explain why the threads appear to work perfectly fine on a shared variable without locks?
Here is the code:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 100
#define ARRAYSIZE 1000000
#define ITERATIONS (ARRAYSIZE / NTHREADS)

double sum=0.0, a[ARRAYSIZE];
pthread_mutex_t sum_mutex;

void *do_work(void *tid)
{
    int i, start, *mytid, end;
    double mysum=0.0;

    /* Initialize my part of the global array and keep local sum */
    mytid = (int *) tid;
    start = (*mytid * ITERATIONS);
    end = start + ITERATIONS;
    printf ("Thread %d doing iterations %d to %d\n",*mytid,start,end-1);
    for (i=start; i < end ; i++) {
        a[i] = i * 1.0;
        mysum = mysum + a[i];
    }

    /* Lock the mutex and update the global sum, then exit */
    //pthread_mutex_lock (&sum_mutex); //here I tried not to use locks
    sum = sum + mysum;
    //pthread_mutex_unlock (&sum_mutex);
    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    int i, start, tids[NTHREADS];
    pthread_t threads[NTHREADS];
    pthread_attr_t attr;

    /* Pthreads setup: initialize mutex and explicitly create threads in a
       joinable state (for portability). Pass each thread its loop offset */
    pthread_mutex_init(&sum_mutex, NULL);
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    for (i=0; i<NTHREADS; i++) {
        tids[i] = i;
        pthread_create(&threads[i], &attr, do_work, (void *) &tids[i]);
    }

    /* Wait for all threads to complete then print global sum */
    for (i=0; i<NTHREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    printf ("Done. Sum= %e \n", sum);

    sum=0.0;
    for (i=0;i<ARRAYSIZE;i++){
        a[i] = i*1.0;
        sum = sum + a[i];
    }
    printf("Check Sum= %e\n",sum);

    /* Clean up and exit */
    pthread_attr_destroy(&attr);
    pthread_mutex_destroy(&sum_mutex);
    pthread_exit (NULL);
}
With and without lock I got the same answer!
Done. Sum= 4.999995e+11
Check Sum= 4.999995e+11
UPDATE: Change suggested by user3386109
for (i=start; i < end ; i++) {
    a[i] = i * 1.0;
    //pthread_mutex_lock (&sum_mutex);
    sum = sum + a[i];
    //pthread_mutex_unlock (&sum_mutex);
}
EFFECT :
Done. Sum= 3.878172e+11
Check Sum= 4.999995e+11
Mutexes are used to prevent race conditions, which are undesirable situations that arise when two or more threads access a shared resource. Race conditions such as the one in your code happen when the shared variable sum is accessed by multiple threads: sometimes the accesses interleave in such a way that the result is incorrect, and sometimes the result happens to be correct.
For example, let's say you have two threads, thread A and thread B, both adding 1 to a shared value sum, which starts at 5. If thread A reads sum, then thread B reads sum, then thread A writes a new value, followed by thread B writing a new value, you can get an incorrect result: 6 as opposed to 7. However, it is also possible that thread A reads and then writes a value (specifically 6), followed by thread B reading and writing a value (specifically 7), and then you get the correct result. The point is that some interleavings of operations produce the correct value and some produce an incorrect value. The job of the mutex is to force the interleaving to always be correct.