I wrote this C program:
#include <pthread.h>
#include <stdio.h>

int counter = 0;

void *increment(void *arg)
{
    int maxI = 10000;
    int i;
    for (i = 0; i < maxI; ++i) { counter++; }
    return NULL;
}

int main()
{
    pthread_t thread1_id;
    pthread_t thread2_id;
    pthread_create(&thread1_id, NULL, &increment, NULL);
    pthread_create(&thread2_id, NULL, &increment, NULL);
    pthread_join(thread1_id, NULL);
    pthread_join(thread2_id, NULL);
    printf("counter = %d\n", counter);
    return 0;
}
As a result I get: counter = 10000
Why is that? I would have expected something much bigger, since I am using two threads. How can I correct it?
PS: I am aware that there will be a race condition!
Edit: volatile int counter seems to solve the problem :)
Predicting what buggy code will do is extremely difficult. Most likely, the compiler is optimizing your increment function to keep counter in a register for the whole loop, so each thread only stores its final value of 10000 back and one store wins. But you'd have to look at the generated assembly code to be sure.
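Note that volatile only defeats the register optimization; it does not make counter++ atomic, so the race is still there, merely visible again. To get a deterministic 20000 you have to make the increment atomic. A minimal sketch of one fix, using a pthread mutex (C11 atomics would work just as well; main stays unchanged):

#include <stdio.h>
#include <pthread.h>

int counter = 0;
pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

void *increment(void *arg)
{
    for (int i = 0; i < 10000; ++i) {
        pthread_mutex_lock(&counter_lock);   /* serialize the read-modify-write */
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

With this change the two threads always produce counter = 20000.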
Using OpenMP, I'm trying to parallelize the creation of a dictionary defined as follows.
typedef struct Symbol {
    int usage;
    char character;
} Symbol;

typedef struct SymbolDictionary {
    int charsNr;
    Symbol *symbols;
} SymbolDictionary;
I wrote the following code.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <omp.h>

static const int n = 10;

int main(int argc, char* argv[]) {
    int thread_count = strtol(argv[1], NULL, 10);
    omp_set_dynamic(0);
    omp_set_num_threads(thread_count);
    SymbolDictionary **symbolsDict = calloc(omp_get_max_threads(), sizeof(SymbolDictionary*));
    SymbolDictionary *dict = NULL;
    int count = 0;

    #pragma omp parallel for firstprivate(dict, count) shared(symbolsDict)
    for (int i = 0; i < n; i++) {
        if (count == 0) {
            dict = calloc(1, sizeof(SymbolDictionary));
            dict->charsNr = 0;
            dict->symbols = calloc(n, sizeof(Symbol));
            #pragma omp critical
            symbolsDict[omp_get_thread_num()] = dict;
        }
        dict->symbols[count].usage = i;
        dict->symbols[count].character = 'a' + i;
        ++dict->charsNr;
        ++count;
    }

    if (omp_get_max_threads() > 1) {
        // merge the dictionaries
    }

    for (int j = 0; j < symbolsDict[0]->charsNr; j++)
        printf("symbolsDict[0][%d].character: %c\nsymbolsDict[0][%d].usage: %d\n",
               j,
               symbolsDict[0]->symbols[j].character,
               j,
               symbolsDict[0]->symbols[j].usage);

    for (int i = 0; i < omp_get_max_threads(); i++)
        free(symbolsDict[i]->symbols);
    free(symbolsDict);
    return 0;
}
The code compiles and runs, but I'm not sure how the omp block works or whether I implemented it correctly. In particular, I have to attach dict to symbolsDict at the beginning of the loop, because I don't know when a thread will complete its work. By doing that, different threads will probably write into symbolsDict at the same time, but to different memory. Although each thread uses its own slot and dict should be different for every thread, I'm not sure this is a good way to do it.
I tested the code with different numbers of threads, creating dictionaries of different sizes.
I didn't have any problems, but maybe that was just luck.
I've basically read up on the theory in the documentation, so I'd like to know: did I implement the code correctly? If not, what is incorrect and why?
different threads will probably write into symbolsDict at the same time, but to different memory. Although each thread uses its own slot and dict should be different for every thread, I'm not sure this is a good way to do it.
It isn't a good way but it is safe. A cleaner way would be this:
SymbolDictionary **symbolsDict = calloc(
    omp_get_max_threads(), sizeof(SymbolDictionary*));

#pragma omp parallel
{
    SymbolDictionary *dict = calloc(1, sizeof(SymbolDictionary));
    int count = 0;
    dict->charsNr = 0;
    dict->symbols = calloc(n, sizeof(Symbol));
    symbolsDict[omp_get_thread_num()] = dict;

    #pragma omp for nowait
    for (int i = 0; i < n; i++) {
        dict->symbols[count].usage = i;
        dict->symbols[count].character = 'a' + i;
        ++dict->charsNr;
        ++count;
    }
}
Note that the inner pragma is omp for, not omp parallel for, so it uses the enclosing parallel block to distribute its work. The nowait clause is a small performance improvement: it removes the barrier at the end of the loop, which is redundant here because the loop is the last part of the parallel region and threads already synchronize with each other at the end of the region anyway.
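For completeness, here is a sketch of the merge step the original code leaves as a TODO, using the structs and n from the question. It assumes, as in this example, that the per-thread dictionaries hold disjoint characters, so merging is plain concatenation; a real dictionary would additionally have to combine usage counts for duplicate characters.

// Hypothetical merge step (sketch): concatenate the per-thread dictionaries.
// Needs <string.h> for memcpy, which the question already includes.
SymbolDictionary *merged = calloc(1, sizeof(SymbolDictionary));
merged->symbols = calloc(n, sizeof(Symbol));   // n entries exist in total
for (int t = 0; t < omp_get_max_threads(); t++) {
    SymbolDictionary *d = symbolsDict[t];
    if (d == NULL)                             // a thread may have run no iterations
        continue;
    memcpy(merged->symbols + merged->charsNr, d->symbols,
           d->charsNr * sizeof(Symbol));
    merged->charsNr += d->charsNr;
}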
I'm trying to get a pthread to print a variable declared in a for loop:
#include <stdio.h>
#include <pthread.h>

pthread_t pthread[10];

void *count(void *argv) {
    int index = *(int *)argv;
    printf("%d", index);
    pthread_exit(NULL);
}

int main() {
    for (int i = 0; i < 10; ++i) {
        pthread_create(&pthread[i], NULL, count, (void *)&i);
    }
    return 0;
}
Output I expected:
0123456789 (or not in order, but all 10 numbers)
What I got:
123456789
Why isn't 0 in there?
One problem is that your main() thread exits without waiting for the child threads to finish, which means the child threads may not have time to finish (or potentially even begin) their execution before the process is terminated.
To avoid that problem, you need to call pthread_join() on all of your threads before main() exits, like this:
int main() {
    for (int i = 0; i < 10; ++i) {
        pthread_create(&pthread[i], NULL, count, (void *)&i);
    }
    for (int i = 0; i < 10; ++i) {
        pthread_join(pthread[i], NULL); // won't return until the thread has exited
    }
    return 0;
}
The other problem, as mentioned by 500 in the comments, is that you are passing a pointer to i to the child threads, which they then dereference. Since i is being modified by the main thread's loop while the children read it, this is a data race, and there is no telling which value each child thread will see through that pointer. One way to avoid this is to give each thread its own separate (non-changing) integer to read:
int values[10];
for (int i = 0; i < 10; ++i) {
    values[i] = i;
    pthread_create(&pthread[i], NULL, count, (void *)&values[i]);
}
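Another common idiom (a sketch, assuming the index fits in an intptr_t, which it does here) is to smuggle the value itself inside the pointer argument instead of passing a pointer to a shared variable, so there is nothing left to race on:

#include <stdint.h>   // for intptr_t

void *count(void *argv) {
    int index = (int)(intptr_t)argv;   // recover the value from the pointer bits
    printf("%d", index);
    return NULL;
}

// in main():
for (int i = 0; i < 10; ++i) {
    pthread_create(&pthread[i], NULL, count, (void *)(intptr_t)i);
}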
I've got the following example; let's say I want each thread to count from 0 to 9.
#include <stdio.h>
#include <pthread.h>

void *iterate(void *arg) {
    int i = 0;
    while (i < 10) {
        i++;
    }
    pthread_exit(0);
}

int main() {
    int j = 0;
    pthread_t tid[100];
    while (j < 100) {
        pthread_create(&tid[j], NULL, iterate, NULL);
        pthread_join(tid[j], NULL);
        j++;
    }
}
Variable i is in a critical section; it will be overwritten multiple times, and therefore the threads will fail to count.
int *i = (int *)calloc(1, sizeof(int));
doesn't solve the problem either. I don't want to use a mutex. What is the most common solution to this problem?
As other users have commented, there are several problems in your example:
Variable i is not shared (it would need to be a global variable, for instance), nor is it in a critical section (it is local to each thread). To have a critical section you would use locks or transactional memory.
You don't need to create and destroy threads on every iteration. Just create a number of threads at the beginning and wait for them to finish (join).
pthread_exit() is not necessary; just return from the thread function (with a value).
A counter is a bad example for threads. It requires atomic operations to avoid overwriting values written by other threads. In fact, a multithreaded counter is a typical example of why atomic accesses are necessary (see this tutorial, for example).
I recommend starting with some tutorials, like this one or this one.
I also recommend frameworks like OpenMP; they simplify the semantics of multithreaded programs, as the sketch below shows.
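As an illustration of that last point, a minimal sketch (assuming an OpenMP-enabled compiler, e.g. gcc -fopenmp) of a shared counter as a one-line reduction:

#include <stdio.h>

int main(void)
{
    int counter = 0;
    // reduction(+:counter) gives each thread a private copy of counter
    // and sums the copies at the end of the loop -- no explicit locking.
    #pragma omp parallel for reduction(+:counter)
    for (int i = 0; i < 4 * 10; i++)
        counter++;
    printf("%d\n", counter);   // always prints 40
    return 0;
}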
EDIT: example of a shared counter and 4 threads.
#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 4

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int counter = 0;

void *iterate(void *arg) {
    int i = 0;
    while (i++ < 10) {
        // enter critical section
        pthread_mutex_lock(&mutex);
        ++counter;
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

int main() {
    int j;
    pthread_t tid[NUM_THREADS];
    for (j = 0; j < NUM_THREADS; ++j)
        pthread_create(&tid[j], NULL, iterate, NULL);
    // let the threads do their magic
    for (j = 0; j < NUM_THREADS; ++j)
        pthread_join(tid[j], NULL);
    printf("%d", counter);
    return 0;
}
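Since the answer mentions atomic operations, here is the same program as a sketch using C11 atomics (assuming <stdatomic.h> is available) instead of a mutex:

#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 4

static atomic_int counter = 0;

void *iterate(void *arg) {
    for (int i = 0; i < 10; ++i)
        atomic_fetch_add(&counter, 1);   // one indivisible read-modify-write
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_THREADS];
    for (int j = 0; j < NUM_THREADS; ++j)
        pthread_create(&tid[j], NULL, iterate, NULL);
    for (int j = 0; j < NUM_THREADS; ++j)
        pthread_join(tid[j], NULL);
    printf("%d\n", atomic_load(&counter));   // always NUM_THREADS * 10
    return 0;
}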
I'm just fiddling around with threads and observing how a race condition occurs while modifying an unprotected global variable. It's a simple program with 3 threads incrementing a global variable in a tight loop of 100000 iterations:
#include <stdio.h>
#include <pthread.h>

static int global;
pthread_mutex_t lock;

#define LOCK() pthread_mutex_lock(&lock)
#define UNLOCK() pthread_mutex_unlock(&lock)

void *func(void *arg)
{
    int i;
    //LOCK();
    for (i = 0; i < 100000; i++)
    {
        global++;
    }
    //UNLOCK();
    return NULL;
}

int main()
{
    pthread_t tid[3];
    int i;
    pthread_mutex_init(&lock, NULL);
    for (i = 0; i < 3; i++)
    {
        pthread_create(&tid[i], NULL, func, NULL);
    }
    for (i = 0; i < 3; i++)
    {
        pthread_join(tid[i], NULL);
    }
    pthread_mutex_destroy(&lock);
    printf("Global value: %d\n", global);
    return 0;
}
When I compile this with the -g flag and run the binary 5 times, I get this output:
Global value: 300000
Global value: 201567
Global value: 179584
Global value: 105194
Global value: 205161
Which is expected. Classic synchronization issue here. Nothing to see.
But when I compile with the optimization flag -O, I get this output:
Global value: 300000
Global value: 100000
Global value: 100000
Global value: 100000
Global value: 200000
This is the part that doesn't make sense to me. What did GCC optimize so that the threads raced over an entire 1/3 or 2/3 of the total iterations?
Most likely the loop got optimized to read the global variable once, do all the increments in a register, and then write the result back. The difference in output depends on whether the threads' read-increment-write windows overlap or don't overlap.
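Roughly, the optimized func behaves as if it had been rewritten like this (a sketch of the effect, not actual compiler output):

void *func(void *arg)
{
    int tmp = global;    // one load at loop entry
    tmp += 100000;       // the whole loop collapses into a single add
    global = tmp;        // one store at loop exit
    return NULL;
}

If none of the three load/add/store windows overlap, you get 300000; whenever two threads load the same starting value, one of their 100000 contributions is lost, giving 200000 or 100000.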
I'm new to multithreading and am trying to learn it through a simple program, which adds 1 to n and returns the sum. In the sequential case, main calls the sumFrom1 function twice, for n = 1e4 and 2e4; in the multithreaded case, two threads are created using pthread_create and the two sums are calculated in separate threads. The multithreaded version is much slower than the sequential version (see results below). I ran this on a 12-CPU platform and there is no communication between threads.
Multithreaded:
Thread 1 returns: 0
Thread 2 returns: 0
sum of 1..10000: 50005000
sum of 1..20000: 200010000
time: 156 seconds
Sequential:
sum of 1..10000: 50005000
sum of 1..20000: 200010000
time: 56 seconds
When I add -O2 to the compilation, the multithreaded version (9 s) is faster than the sequential version (11 s), but not by as much as I expected. I can always keep the -O2 flag on, but I'm curious about the low speed of multithreading in the unoptimized case. Should it be slower than the sequential version? If not, what can I do to make it faster?
The code:
#include <stdio.h>
#include <pthread.h>
#include <time.h>

typedef struct my_struct
{
    int n;
    int sum;
} my_struct_t;

void *sumFrom1(void *sit)
{
    my_struct_t *local_sit = (my_struct_t *) sit;
    int i;
    int nsim = 500000; // Loops for consuming time
    int j;
    for (j = 0; j < nsim; j++)
    {
        local_sit->sum = 0;
        for (i = 0; i <= local_sit->n; i++)
            local_sit->sum += i;
    }
}

int main(int argc, char *argv[])
{
    pthread_t thread1;
    pthread_t thread2;
    my_struct_t si1;
    my_struct_t si2;
    int iret1;
    int iret2;
    time_t t1;
    time_t t2;
    si1.n = 10000;
    si2.n = 20000;
    if (argc == 2 && atoi(argv[1]) == 1) // Use "./prog 1" to test the time of the multithreaded version
    {
        t1 = time(0);
        iret1 = pthread_create(&thread1, NULL, sumFrom1, (void *)&si1);
        iret2 = pthread_create(&thread2, NULL, sumFrom1, (void *)&si2);
        pthread_join(thread1, NULL);
        pthread_join(thread2, NULL);
        t2 = time(0);
        printf("Thread 1 returns: %d\n", iret1);
        printf("Thread 2 returns: %d\n", iret2);
        printf("sum of 1..%d: %d\n", si1.n, si1.sum);
        printf("sum of 1..%d: %d\n", si2.n, si2.sum);
        printf("time: %d seconds", t2 - t1);
    }
    else // Use "./prog" to test the time of the sequential version
    {
        t1 = time(0);
        sumFrom1((void *)&si1);
        sumFrom1((void *)&si2);
        t2 = time(0);
        printf("sum of 1..%d: %d\n", si1.n, si1.sum);
        printf("sum of 1..%d: %d\n", si2.n, si2.sum);
        printf("time: %d seconds", t2 - t1);
    }
    return 0;
}
UPDATE1:
After a little googling on "false sharing" (Thanks, #Martin James!), I think it is the main cause. There are (at least) two ways to fix it:
The first way is inserting a buffer zone between the two structs (Thanks, #dasblinkenlight):
my_struct_t si1;
char memHolder[4096];
my_struct_t si2;
Without -O2, the running time decreases from ~156 s to ~38 s.
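A variation on the same idea (a sketch, assuming a C11 compiler; the 64-byte cache-line size is an assumption about the target CPU) is to align each struct to its own cache line instead of sizing a buffer by hand:

#include <stdalign.h>

// Each struct now starts on its own 64-byte cache line, so the two
// threads' stores to si1.sum and si2.sum no longer contend for the
// same line.
alignas(64) my_struct_t si1;
alignas(64) my_struct_t si2;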
The second way is to avoid frequently updating sit->sum, which can be done with a temporary variable in sumFrom1 (as @Jens Gustedt replied):
int sum = 0;
for (int j = 0; j < nsim; j++)
{
    sum = 0;
    for (i = 0; i <= local_sit->n; i++)
        sum += i;
}
local_sit->sum = sum;
Without -O2, the running time decreases from ~156 s to ~35 s or ~109 s (it has two peaks! I don't know why). With -O2, the running time stays at ~8 s.
By modifying your code to
typedef struct my_struct
{
    size_t n;
    size_t sum;
} my_struct_t;

void *sumFrom1(void *sit)
{
    my_struct_t *local_sit = sit;
    size_t nsim = 500000; // Loops for consuming time
    size_t n = local_sit->n;
    size_t sum = 0;
    for (size_t j = 0; j < nsim; j++)
    {
        for (size_t i = 0; i <= n; i++)
            sum += i;
    }
    local_sit->sum = sum;
    return 0;
}
the phenomenon disappears. The problems you had:
Using int as the datatype is completely wrong for such a test: your figures were such that the sum overflowed, and overflow of signed types is undefined behavior. You are lucky that it didn't eat your lunch.
Having the bounds and summation variables behind an indirection buys you additional loads and stores that, with -O0, are really performed as such, with all the implications of false sharing and the like.
Your code also had other errors:
a missing include for atoi
superfluous casts to and from void*
printing a time_t as int
Please compile your code with -Wall before posting.