Array iteration in multiple threads - C

I've got the following example; let's say I want each thread to count from 0 to 9.
void* iterate(void* arg) {
int i = 0;
while(i<10) {
i++;
}
pthread_exit(0);
}
int main() {
int j = 0;
pthread_t tid[100];
while(j<100) {
pthread_create(&tid[j],NULL,iterate,NULL);
pthread_join(tid[j],NULL);
j++;
}
}
Variable i is, as I understand it, in a critical section; it will be overwritten multiple times and therefore the threads will fail to count.
int* i=(int*)calloc(1,sizeof(int));
doesn't solve the problem either. I don't want to use a mutex. What is the most common solution for this problem?

As other users are commenting, there are several problems in your example:
Variable i is not shared (it should be a global variable, for instance), nor in a critical section (it is a local variable to each thread). To have a critical section you should use locks or transactional memory.
You don't need to create and destroy threads every iteration. Just create a number of threads at the beginning and wait for them to finish (join).
pthread_exit() is not necessary, just return from the thread function (with a value).
A counter is a bad example for threads. It requires atomic operations to avoid overwriting the value of other threads. Actually, a multithreaded counter is a typical example of why atomic accesses are necessary (see this tutorial, for example).
I recommend starting with some tutorials, like this or this.
I also recommend frameworks like OpenMP; they simplify the semantics of multithreaded programs.
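For instance, here is a minimal sketch of the counting example written as an OpenMP reduction (my illustration, not part of the original answer; it assumes a compiler with OpenMP support, e.g. gcc -fopenmp):
#include <stdio.h>
#include <omp.h>

int main(void) {
    int counter = 0;
    /* each thread gets a private copy of counter; OpenMP sums the copies at the end */
    #pragma omp parallel num_threads(4) reduction(+:counter)
    {
        int i;
        for (i = 0; i < 10; ++i)
            ++counter;
    }
    printf("%d\n", counter); /* 40 with 4 threads */
    return 0;
}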
EDIT: example of a shared counter and 4 threads.
#include <stdio.h>
#include <pthread.h>
#define NUM_THREADS 4
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int counter = 0;
void* iterate(void* arg) {
int i = 0;
while(i++ < 10) {
// enter critical section
pthread_mutex_lock(&mutex);
++counter;
pthread_mutex_unlock(&mutex);
}
return NULL;
}
int main() {
int j;
pthread_t tid[NUM_THREADS];
for(j = 0; j < NUM_THREADS; ++j)
pthread_create(&tid[j],NULL,iterate,NULL);
// let the threads do their magic
for(j = 0; j < NUM_THREADS; ++j)
pthread_join(tid[j],NULL);
printf("%d", counter);
return 0;
}
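Since the question rules out a mutex and the list above mentions atomic operations, here is a minimal sketch of the same counter using C11 atomics instead (my addition, assuming a compiler that provides <stdatomic.h>):
#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 4

static atomic_int counter = 0;

void* iterate(void* arg) {
    int i;
    for (i = 0; i < 10; ++i)
        atomic_fetch_add(&counter, 1); /* atomic increment, no mutex needed */
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_THREADS];
    int j;
    for (j = 0; j < NUM_THREADS; ++j)
        pthread_create(&tid[j], NULL, iterate, NULL);
    for (j = 0; j < NUM_THREADS; ++j)
        pthread_join(tid[j], NULL);
    printf("%d\n", atomic_load(&counter)); /* prints 40 */
    return 0;
}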

Related

Synchronization of p_threads using globals

I'm pretty new to C, so I'm not sure where to even start digging on my problem.
I'm trying to port Python number-crunching algos to C, and since there's no GIL in C (woohoo), I can change whatever I want in memory from threads, as long as I make sure there are no races.
I did my homework on mutexes; however, I cannot wrap my head around the use of mutexes in the case of continuously running threads accessing the same array over and over.
I'm using pthreads in order to split the workload on a big array a[N].
Number crunching algorithm on array a[N] is additive, so I'm splitting it using a_diff[N_THREADS][N] array, writing changes to be applied to a[N] array from each thread to a_diff[N_THREADS][N] and then merging them all together after each step.
I need to run the crunching on different versions of array a[N], so I pass them via global pointer p (in the MWE, there's only one a[N])
I'm synchronizing threads using another global array SYNC_THREADS[N_THREADS], and I make sure threads quit when I need them to by setting the END_THREADS global (I know, I'm using too many globals - I don't care, the code is ~200 lines). My question is about this synchronization technique - is it safe to do it this way, and what would be a cleaner/better/faster way to achieve that?
MWE:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define N_THREADS 3
#define N 10000000
#define STEPS 3
double a[N]; // main array
double a_diff[N_THREADS][N]; // diffs array
double params[N]; // parameter used for number-crunching
double (*p)[N]; // pointer to array[N]
// structure for bounds for crunching the array
struct bounds {
int lo;
int hi;
int thread_num;
};
struct bounds B[N_THREADS];
int SYNC_THREADS[N_THREADS]; // for syncing threads
int END_THREADS = 0; // signal to terminate threads
static void *crunching(void *arg) {
// multiple threads run number-crunching operations according to assigned low/high bounds
struct bounds *data = (struct bounds *)arg;
int lo = (*data).lo;
int hi = (*data).hi;
int thread_num = (*data).thread_num;
printf("worker %d started for bounds [%d %d] \n", thread_num, lo, hi);
int i;
while (END_THREADS != 1) { // END_THREADS tells threads to terminate
if (SYNC_THREADS[thread_num] == 1) { // SYNC_THREADS allows threads to start number-crunching
printf("worker %d working... \n", thread_num );
for (i = lo; i <= hi; ++i) {
a_diff[thread_num][i] += (*p)[i] * params[i]; // pretend this is an expensive operation...
}
SYNC_THREADS[thread_num] = 0; // thread disables itself until SYNC_THREADS is back to 1
printf("worker %d stopped... \n", thread_num );
}
}
return 0;
}
int i, j, th,s;
double joiner;
int main() {
// pre-fill arrays
for (i = 0; i < N; ++i) {
a[i] = i + 0.5;
params[i] = 0.0;
}
// split workload between workers
int worker_length = N / N_THREADS;
for (i = 0; i < N_THREADS; ++i) {
B[i].thread_num = i;
B[i].lo = i * worker_length;
if (i == N_THREADS - 1) {
B[i].hi = N;
} else {
B[i].hi = i * worker_length + worker_length - 1;
}
}
// pointer to parameters to be passed to worker
struct bounds **data = malloc(N_THREADS * sizeof(struct bounds*));
for (i = 0; i < N_THREADS; i++) {
data[i] = malloc(sizeof(struct bounds));
data[i]->lo = B[i].lo;
data[i]->hi = B[i].hi;
data[i]->thread_num = B[i].thread_num;
}
// create thread objects
pthread_t threads[N_THREADS];
// disallow threads to crunch numbers
for (th = 0; th < N_THREADS; ++th) {
SYNC_THREADS[th] = 0;
}
// launch workers
for(th = 0; th < N_THREADS; th++) {
pthread_create(&threads[th], NULL, crunching, data[th]);
}
// big loop of iterations
for (s = 0; s < STEPS; ++s) {
for (i = 0; i < N; ++i) {
params[i] += 1.0; // adjust parameters
}
// zero diff array
for (i = 0; i < N; ++i) {
for (th = 0; th < N_THREADS; ++th) {
a_diff[th][i] = 0.0;
}
}
p = &a; // pointer to array a
// allow threads to process numbers and wait for threads to complete
for (th = 0; th < N_THREADS; ++th) { SYNC_THREADS[th] = 1; }
// ...here threads started by pthread_create do calculations...
for (th = 0; th < N_THREADS; th++) { while (SYNC_THREADS[th] != 0) {} }
// join results from threads (number-crunching is additive)
for (i = 0; i < N; ++i) {
joiner = 0.0;
for (th = 0; th < N_THREADS; ++th) {
joiner += a_diff[th][i];
}
a[i] += joiner;
}
}
// join workers
END_THREADS = 1;
for(th = 0; th < N_THREADS; th++) {
pthread_join(threads[th], NULL);
}
return 0;
}
I see that workers don't overlap time-wise:
worker 0 started for bounds [0 3333332]
worker 1 started for bounds [3333333 6666665]
worker 2 started for bounds [6666666 10000000]
worker 0 working...
worker 1 working...
worker 2 working...
worker 2 stopped...
worker 0 stopped...
worker 1 stopped...
worker 2 working...
worker 0 working...
worker 1 working...
worker 1 stopped...
worker 0 stopped...
worker 2 stopped...
worker 2 working...
worker 0 working...
worker 1 working...
worker 1 stopped...
worker 2 stopped...
worker 0 stopped...
Process returned 0 (0x0) execution time : 1.505 s
and I make sure workers don't get into each other's workspaces by separating them via the a_diff[thread_num][N] sub-arrays; however, I'm not sure that's always the case and that I'm not introducing hidden races somewhere...
At first I didn't realize what the question was :-)
So, the question is if you're thinking well with your SYNC_THREADS and END_THREADS synchronization mechanism.
Yes!... Almost. The problem is that threads are burning CPU while waiting.
Condition Variables
To make threads wait for an event you have condition variables (pthread_cond). These offer a few useful functions like wait(), signal() and broadcast():
wait(&cond, &m) blocks the calling thread on a given condition variable, releasing the mutex m while it waits. [note 2]
signal(&cond) wakes up one thread waiting on the given condition variable.
broadcast(&cond) wakes up all threads waiting on the given condition variable.
Initially you'd have all the threads waiting [note 1]:
pthread_mutex_lock(&m);
while(!start_threads)
pthread_cond_wait(&cond_start, &m);
pthread_mutex_unlock(&m);
And, when the main thread is ready:
pthread_mutex_lock(&m);
start_threads = 1;
pthread_cond_broadcast(&cond_start);
pthread_mutex_unlock(&m);
Barriers
If you have data dependencies between iterations, you'd want to make sure threads are executing the same iteration at any given moment.
To synchronize threads at the end of each iteration, you'll want to take a look at barriers (pthread_barrier); a small sketch applying one to your main/worker handshake follows the list below:
pthread_barrier_init(&b, NULL, count): initializes a barrier b to synchronize count threads.
pthread_barrier_wait(&b): the calling thread waits here until all count threads have reached the barrier.
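As an illustration (not part of the original answer), here is a rough sketch of how the per-step handshake in your program could look with a single barrier shared by main and the workers; step_barrier and stop are made-up names:
#include <pthread.h>

/* one barrier shared by main + N_THREADS workers, plus a stop flag set by main */
static pthread_barrier_t step_barrier;
static int stop = 0;

static void *crunching(void *arg) {
    for (;;) {
        pthread_barrier_wait(&step_barrier); /* wait for main to release this step */
        if (stop)
            break;
        /* ... crunch this thread's slice into a_diff ... */
        pthread_barrier_wait(&step_barrier); /* tell main this step is finished */
    }
    return NULL;
}

/* In main:
 *   pthread_barrier_init(&step_barrier, NULL, N_THREADS + 1);
 *   create the workers;
 *   for each step: prepare params, zero a_diff,
 *     pthread_barrier_wait(&step_barrier);  // release the workers
 *     pthread_barrier_wait(&step_barrier);  // wait for them to finish
 *     merge a_diff into a;
 *   then: stop = 1; pthread_barrier_wait(&step_barrier);  // let the workers exit
 *   join the workers and pthread_barrier_destroy(&step_barrier).
 */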
Extending functionality of barriers
Sometimes you'll want the last thread reaching a barrier to compute something (e.g. to increment the counter of number of iterations, or to compute some global value, or to check if execution should stop). You have two alternatives
Using pthread_barriers
You'll need to essentially have two barriers:
int rc = pthread_barrier_wait(&b);
if(rc == PTHREAD_BARRIER_SERIAL_THREAD) // exactly one thread gets this return value
if(shouldStop()) stop = 1;
pthread_barrier_wait(&b);
if(stop) return;
Using pthread_conds to implement our own specialized barrier
// generation counts completed barrier episodes (shared, protected by mutex)
pthread_mutex_lock(&mutex);
int my_generation = generation;
remainingThreads--;
// all threads execute this
executedByAllThreads();
if(remainingThreads == 0) {
// reinitialize barrier for the next iteration
remainingThreads = N;
generation++;
// only last thread executes this
if(shouldStop()) stop = 1;
pthread_cond_broadcast(&cond);
} else {
// wait until the last thread of this episode has arrived
// (re-checking the generation also guards against spurious wakeups)
while(my_generation == generation)
pthread_cond_wait(&cond, &mutex);
}
pthread_mutex_unlock(&mutex);
Note 1: why is pthread_cond_wait() inside a while block? It may seem a bit odd. The reason is the existence of spurious wakeups: the function may return even if no signal() or broadcast() was issued. So, to guarantee correctness, it's usual to re-check a predicate variable so that if a thread suddenly wakes up before it should, it goes right back into pthread_cond_wait().
From the manual:
When using condition variables there is always a Boolean predicate involving shared variables associated with each condition wait that is true if the thread should proceed. Spurious wakeups from the pthread_cond_timedwait() or pthread_cond_wait() functions may occur. Since the return from pthread_cond_timedwait() or pthread_cond_wait() does not imply anything about the value of this predicate, the predicate should be re-evaluated upon such return.
(...)
If a signal is delivered to a thread waiting for a condition variable, upon return from the signal handler the thread resumes waiting for the condition variable as if it was not interrupted, or it shall return zero due to spurious wakeup.
Note 2:
As noted by Michael Burr in the comments, you should hold a companion lock whenever you modify the predicate (start_threads) and when you call pthread_cond_wait(). pthread_cond_wait() releases the mutex when called and re-acquires it before it returns.
PS: It's a bit late here; sorry if my text is confusing :-)

GCC optimization effect on thread concurrency

I'm just fiddling around with threads and observing how a race condition occurs while modifying an unprotected global variable. Here is a simple program with 3 threads incrementing a global variable in a tight loop of 100000 iterations:
#include <stdio.h>
#include <pthread.h>
static int global;
pthread_mutex_t lock;
#define LOCK() pthread_mutex_lock(&lock)
#define UNLOCK() pthread_mutex_unlock(&lock)
void *func(void *arg)
{
int i;
//LOCK();
for(i = 0; i < 100000; i++)
{
global++;
}
//UNLOCK();
return NULL;
}
int main()
{
pthread_t tid[3];
int i;
pthread_mutex_init(&lock, NULL);
for(i = 0; i < 3; i++)
{
pthread_create(&tid[i], NULL, func, NULL);
}
for(i = 0; i < 3; i++)
{
pthread_join(tid[i], NULL);
}
pthread_mutex_destroy(&lock);
printf("Global value: %d\n", global);
return 0;
}
When I compile this with the -g flag and run the resulting binary 5 times, I get this output:
Global value: 300000
Global value: 201567
Global value: 179584
Global value: 105194
Global value: 205161
Which is expected. Classic synchronization issue here. Nothing to see.
But when I compile with the optimization flag -O, I get this output:
Global value: 300000
Global value: 100000
Global value: 100000
Global value: 100000
Global value: 200000
This is the part which doesn't make sense to me. What did GCC optimize so that the races now cost an entire 1/3 or 2/3 of the total iterations?
Likely the loop got optimized to read the global variable once, do all the increments in a register, then write the result back. The difference in output depends on whether those read-modify-write windows of the three threads overlap or don't overlap.
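In other words, the optimized function probably behaves roughly like the following sketch (an assumption about what the compiler did; compile with gcc -O -S and read the assembly to confirm):
void *func(void *arg)
{
    /* what -O likely turns the loop into: one load, one add, one store */
    int tmp = global;  /* read the shared variable once */
    tmp += 100000;     /* do all the increments in a register */
    global = tmp;      /* write the total back once */
    return NULL;
}
With three threads each doing one such read-modify-write, the final value comes out as 100000, 200000, or 300000 depending on how those windows overlap.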

global variable not updated pthread

I wrote this C program:
#include <stdio.h>
#include <pthread.h>
int counter = 0;
void* increment(void* arg)
{
int maxI = 10000;
int i;
for (i = 0; i < maxI; ++i) { counter++; }
return NULL;
}
int main()
{
pthread_t thread1_id;
pthread_t thread2_id;
pthread_create (&thread1_id,NULL,&increment,NULL);
pthread_create (&thread2_id,NULL,&increment,NULL);
pthread_join (thread1_id,NULL);
pthread_join (thread2_id,NULL);
printf("counter = %d\n",counter);
return 0;
}
As a result I get: counter = 10000
Why is that? I would have expected something much bigger since I am using two threads. How can I correct it?
PS: I am aware that there will be a race condition!
Edit: volatile int counter seems to solve the problem :)
Predicting what code with bugs will do is extremely difficult. Most likely, the compiler is optimizing your increment function to keep counter in a register. But you'd have to look at the generated assembly code to be sure.
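If the goal is to actually get counter = 20000, one option (my sketch, not something this answer prescribes) is to make each increment a real atomic read-modify-write with GCC's atomic builtin:
void* increment(void* arg)
{
    int i;
    for (i = 0; i < 10000; ++i)
        __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED); /* atomic increment in memory */
    return NULL;
}
Note that volatile only forces the loads and stores to happen on every iteration; it does not make the read-modify-write atomic, so updates can still be lost.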

Why does this code work without a mutex?

I am trying to learn how locks work in multi-threading. When I executed the following code without the lock, it worked fine even though the variable sum is declared as a global variable and multiple threads update it. Could anyone please explain why the threads work perfectly fine on a shared variable without locks here?
Here is the code:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NTHREADS 100
#define ARRAYSIZE 1000000
#define ITERATIONS ARRAYSIZE / NTHREADS
double sum=0.0, a[ARRAYSIZE];
pthread_mutex_t sum_mutex;
void *do_work(void *tid)
{
int i, start, *mytid, end;
double mysum=0.0;
/* Initialize my part of the global array and keep local sum */
mytid = (int *) tid;
start = (*mytid * ITERATIONS);
end = start + ITERATIONS;
printf ("Thread %d doing iterations %d to %d\n",*mytid,start,end-1);
for (i=start; i < end ; i++) {
a[i] = i * 1.0;
mysum = mysum + a[i];
}
/* Lock the mutex and update the global sum, then exit */
//pthread_mutex_lock (&sum_mutex); //here I tried not to use locks
sum = sum + mysum;
//pthread_mutex_unlock (&sum_mutex);
pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
int i, start, tids[NTHREADS];
pthread_t threads[NTHREADS];
pthread_attr_t attr;
/* Pthreads setup: initialize mutex and explicitly create threads in a
joinable state (for portability). Pass each thread its loop offset */
pthread_mutex_init(&sum_mutex, NULL);
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for (i=0; i<NTHREADS; i++) {
tids[i] = i;
pthread_create(&threads[i], &attr, do_work, (void *) &tids[i]);
}
/* Wait for all threads to complete then print global sum */
for (i=0; i<NTHREADS; i++) {
pthread_join(threads[i], NULL);
}
printf ("Done. Sum= %e \n", sum);
sum=0.0;
for (i=0;i<ARRAYSIZE;i++){
a[i] = i*1.0;
sum = sum + a[i]; }
printf("Check Sum= %e\n",sum);
/* Clean up and exit */
pthread_attr_destroy(&attr);
pthread_mutex_destroy(&sum_mutex);
pthread_exit (NULL);
}
With and without lock I got the same answer!
Done. Sum= 4.999995e+11
Check Sum= 4.999995e+11
UPDATE: Change suggested by user3386109
for (i=start; i < end ; i++) {
a[i] = i * 1.0;
//pthread_mutex_lock (&sum_mutex);
sum = sum + a[i];
//pthread_mutex_unlock (&sum_mutex);
}
EFFECT :
Done. Sum= 3.878172e+11
Check Sum= 4.999995e+11
Mutexes are used to prevent race conditions, which are undesirable situations that occur when two or more threads access a shared resource. Race conditions such as the one in your code happen when the shared variable sum is accessed by multiple threads. Sometimes the accesses to the shared variable will be interleaved in such a way that the result is incorrect, and sometimes the result will be correct.
For example, let's say you have two threads, thread A and thread B, both adding 1 to a shared value, sum, which starts at 5. If thread A reads sum, then thread B reads sum, then thread A writes a new value, followed by thread B writing a new value, you can get an incorrect result, 6 as opposed to 7. However, it is also possible that thread A reads and then writes a value (specifically 6), followed by thread B reading and writing a value (specifically 7), and then you get the correct result. The point is that some interleavings of the operations produce the correct value and some produce an incorrect value. The job of the mutex is to force the interleaving to always be correct. In your original code each thread touches sum only once (after accumulating into the local mysum), so a bad interleaving is unlikely; with the per-iteration update suggested above, sum is touched constantly and the race becomes visible.
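To make the two interleavings concrete, here they are spelled out step by step (my illustration, not part of the original answer):
/* shared: int sum = 5;  threads A and B each do: sum = sum + 1;        */
/*                                                                      */
/* lost update (result 6):            correct ordering (result 7):      */
/*   A: reads sum        -> 5           A: reads sum        -> 5        */
/*   B: reads sum        -> 5           A: writes sum = 5+1 -> 6        */
/*   A: writes sum = 5+1 -> 6           B: reads sum        -> 6        */
/*   B: writes sum = 5+1 -> 6 (!)       B: writes sum = 6+1 -> 7        */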

Multithreaded program outputs different results every time it runs

I have been trying to create a multithreaded program that calculates the multiples of 3 and 5 from 1 to 999, but I can't seem to get it right: every time I run it I get a different value. I think it might have to do with the fact that I use a shared variable with 10 threads, but I have no idea how to get around that. The program does work if I calculate the multiples of 3 and 5 from 1 to 9.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <string.h>
#define NUM_THREADS 10
#define MAX 1000
//finds multiples of 3 and 5 and sums up all of the multiples
int main(int argc, char ** argv)
{
omp_set_num_threads(10);//set number of threads to be used in the parallel loop
unsigned int NUMS[1000] = { 0 };
int j = 0;
#pragma omp parallel
{
int ID = omp_get_thread_num();//get thread ID
int i;
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
}
int i = 0;
unsigned int total = 0;
for(i = 0; NUMS[i] != 0; i++)total += NUMS[i];//add up multiples of 3 and 5
printf("Total : %d\n", total);
return 0;
}
"j++" is not an atomic operation.
It means "take the value contained at the storage location called j, use it in the current statement, add one to it, then store it back in the same location it came from".
(That's the simple answer. Optimization and whether or not the value is kept in a register can and will change things even more.)
When you have multiple threads doing that to the same variable all at the same time, you get different and unpredictable results.
You can use per-thread (private) variables to get around that.
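The most direct fix (though, as the next answer points out, paying for synchronization on every iteration is slow) is to protect the shared index; here is a sketch using an OpenMP critical section around the update:
/* inside the parallel for loop, replacing the unprotected NUMS[j++] = i; */
if (i % 5 == 0 || i % 3 == 0)
{
    #pragma omp critical
    NUMS[j++] = i; /* only one thread at a time updates j and NUMS */
}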
In your code j is a shared induction variable. You can't use a shared induction variable efficiently with multiple threads (making it atomic every iteration is not efficient).
You could find a special solution that avoids the induction variable (for example using wheel factorization with seven spokes {0,3,5,6,9,10,12} out of 15), or you could find a general solution using private induction variables, like this:
#pragma omp parallel
{
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
#pragma omp for schedule(static) ordered
for(i=0; i<omp_get_num_threads(); i++) {
#pragma omp ordered
{
memcpy(&NUMS[j], NUMS_local, sizeof *NUMS *k);
j += k;
}
}
}
This solution does not make optimal use of memory however. A better solution would use something like std::vector from C++ which you could implement for example using realloc in C but I'm not going to do that for you.
Edit:
Here is a special solution which does not use shared inductive variables using wheel factorization
int wheel[] = {0,3,5,6,9,10,12};
int n = MAX/15;
#pragma omp parallel for reduction(+:total)
for(int i=0; i<n; i++) {
for(int k=0; k<7; k++) {
NUMS[7*i + k] = 15*i + wheel[k];
total += NUMS[7*i + k];
}
}
//now clean up for MAX not a multiple of 15
int j = n*7;
for(int i=n*15; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS[j++] = i;
total += i;
}
}
Edit: It's possible to do this without a critical section (from the ordered clause). This does memcpy in parallel and also makes better use of memory at least for the shared array.
int *NUMS;
int *prefix;
int total=0, j;
#pragma omp parallel
{
int i;
int nthreads = omp_get_num_threads();
int ithread = omp_get_thread_num();
#pragma omp single
{
prefix = malloc(sizeof *prefix * (nthreads+1));
prefix[0] = 0;
}
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
prefix[ithread+1] = k;
#pragma omp barrier
#pragma omp single
{
for(i=1; i<nthreads+1; i++) prefix[i] += prefix[i-1];
NUMS = malloc(sizeof *NUMS * prefix[nthreads]);
j = prefix[nthreads];
}
memcpy(&NUMS[prefix[ithread]], NUMS_local, sizeof *NUMS *k);
}
free(prefix);
This is a typical thread synchronization issue. All you need to do is use a kernel synchronization object to make the desired operation (incrementing the value of variable j in your case) atomic. That could be a mutex, a semaphore or an event object, depending on the operating system you're working on. But whatever your development environment is, to provide atomicity the fundamental flow should look like the following pseudo-code:
{
lock(kernel_object)
// ...
// do your critical operation (increment your variable j in your case)
// ++j;
// ...
unlock(kernel_object)
}
If you're working on the Windows operating system, there are some special synchronization mechanisms provided by the environment (e.g. InterlockedIncrement or critical sections via InitializeCriticalSection, etc.). If you're working on a Unix/Linux based operating system, you can use mutex or semaphore kernel synchronization objects. Actually, all those synchronization mechanisms stem from the concept of semaphores, invented by Edsger W. Dijkstra in the early 1960s.
Here's some basic examples below:
Linux
#include <pthread.h>
pthread_mutex_t g_mutexObject = PTHREAD_MUTEX_INITIALIZER;
int main(int argc, char* argv[])
{
// ...
pthread_mutex_lock(&g_mutexObject);
++j; // incrementing j atomically
pthread_mutex_unlock(&g_mutexObject);
// ...
pthread_mutex_destroy(&g_mutexObject);
// ...
exit(EXIT_SUCCESS);
}
Windows
#include <Windows.h>
CRITICAL_SECTION g_csObject;
int main(void)
{
// ...
InitializeCriticalSection(&g_csObject);
// ...
EnterCriticalSection(&g_csObject);
++j; // incrementing j atomically
LeaveCriticalSection(&g_csObject);
// ...
DeleteCriticalSection(&g_csObject);
// ...
exit(EXIT_SUCCESS);
}
or just simply:
#include <Windows.h>
LONG volatile g_j; // our little j must be volatile in here now
int main(void)
{
// ...
InterlockedIncrement(&g_j); // incrementing j atomically
// ...
exit(EXIT_SUCCESS);
}
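Since the question uses OpenMP, it's worth adding (my aside, not part of this answer) that OpenMP has its own lightweight construct for exactly this kind of fetch-and-increment:
int idx;
#pragma omp atomic capture
idx = j++;       /* atomically grab the old value of j and increment it */
NUMS[idx] = i;   /* each thread then writes to its own reserved slot */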
The problem you have is that the threads don't necessarily execute in order, so the last thread to write may not have read the latest value, and you end up overwriting data.
There is a way, with the OpenMP options, to have the threads in a loop do a summation when they finish. You have to write something like this to use it:
#pragma omp parallel for reduction(+:sum)
for(k=0;k<num;k++)
{
sum = sum + A[k]*B[k];
}
/* end of the computation */
gettimeofday(&fin,NULL);
All you have to do is write the result into "sum"; this is from some old code I have that does a summation.
The other option you have is the dirty one: somehow make the threads wait and get in order using a call to the OS. This is easier than it looks. This would be one solution:
#pragma omp parallel
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
printf("asdasdasdasdasdasdasdas");
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
But I recommend you read through the OpenMP options fully.

Resources