The idea is to have multiple threads run the same computation on different parts of the array. A computation is composed of several iterations (a while loop), and the threads must be synchronised with one another at the end of every iteration. The number of iterations a given thread takes to perform its computation varies.
Here is what I have come up with so far, and I can't see what I am doing wrong:
pthread_cond_t continue_cond;
pthread_mutex_t waiting_threads_mut;
int num_waiting_threads = 0;
void WorkerProcess(void *args) {
    printf("Thread %d has started\n", args->thread_id);
    int to_process_indices = 1000;
    while (to_process_indices > 0) {
        to_process_indices -= (args->thread_id + 1);
        pthread_mutex_lock(&waiting_threads_mut);
        ++num_waiting_threads;
        pthread_cond_broadcast(&continue_cond);
        // num_waiting_threads % num_threads == 0 => all threads are in computation and have reached the barrier, so continue
        // num_waiting_threads % num_threads == 1 => 1 thread left in computation, so there is no need to wait, so continue
        // num_waiting_threads % num_threads  > 1 => 2+ threads in computation, so they must continue syncing
        while ((num_waiting_threads % args->num_threads) > 1) {
            pthread_cond_wait(&continue_cond, &waiting_threads_mut);
        }
        pthread_mutex_unlock(&waiting_threads_mut);
    }
    printf("Thread %d done with chunk..\n", args->thread_id);
}
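For comparison, the standard POSIX primitive for this kind of per-iteration rendezvous is pthread_barrier_t. Below is a minimal sketch with placeholder names (NUM_THREADS, NUM_ITERATIONS, worker); note that a plain barrier assumes every thread keeps arriving at it on every iteration, so threads that finish their own work early would still have to call pthread_barrier_wait(), which is precisely the situation the modulo test above tries to handle differently:

#include <pthread.h>

#define NUM_THREADS 4        // placeholder
#define NUM_ITERATIONS 100   // placeholder

pthread_barrier_t iter_barrier;   // in main(): pthread_barrier_init(&iter_barrier, NULL, NUM_THREADS);

void *worker(void *arg) {
    for (int iter = 0; iter < NUM_ITERATIONS; iter++) {
        // ...do this thread's share of iteration `iter` (possibly nothing if it finished early)...
        pthread_barrier_wait(&iter_barrier);   // nobody starts iteration iter+1 until all NUM_THREADS arrive
    }
    return NULL;
}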
I need to multiply large numbers using multithreading. The two numbers to be multiplied can have up to 10000 digits each. I have written multiplication code using a single thread, but I am not sure how to multiply when different threads are assigned to different digits.
For example, if the two numbers are 254678 and 378929 and there are 3 threads, I assign two digits to each thread (2,5 -> Thread 1), (4,6 -> Thread 2), (7,8 -> Thread 3), and each of those digits should be multiplied by the digits of the 2nd number, 378929.
When the threads run in parallel, I don't know how to manage the carry variable, since multiple threads will update it at the same time.
input: the array that contains both numbers
i1: index of the last digit of the 1st number
i2: index of the last digit of the 2nd number
for (int i = i1 - 1; i >= 1; i--) {
    int carry = 0;
    int n1 = input[i];
    t2 = 0;
    for (int j = i2 - 1; j > i1 - 1; j--) {
        int n2 = input[j];
        int sum = n1 * n2 + output[t1 + t2] + carry;
        carry = sum / 10;
        output[t1 + t2] = sum % 10;
        t2++;
    }
    if (carry > 0)
        output[t1 + t2] += carry;
    t1++;
}
int main() {
    pthread_t threads[MAX_THREAD];
    for (int i = 0; i < MAX_THREAD; i++)
        pthread_create(&threads[i], NULL, &multiply, (void*)NULL);
    for (int i = 0; i < MAX_THREAD; i++)
        pthread_join(threads[i], NULL);
}
When the threads run in parallel, I don't know how to manage the carry variable, since multiple threads will update it at the same time.
Don't allow multiple threads to update anything at the same time.
Specifically (assuming each digit is a digit in "base 1<<32"), each thread can do something like:
my_accumulator = table_of_per_thread_accumulators[thread_number];

for (int src1_digit_number = startDigit; src1_digit_number >= 0; src1_digit_number -= NUM_THREADS) {
    for (int src2_digit_number = src2_digits - 1; src2_digit_number >= 0; src2_digit_number--) {
        int dest_digit_number = src1_digit_number + src2_digit_number;
        uint32_t n1 = input1[src1_digit_number];
        uint32_t n2 = input2[src2_digit_number];
        uint64_t r = (uint64_t)n1 * n2;
        // propagate the partial product and any carries into this thread's private accumulator
        while (r != 0) {
            uint64_t temp = my_accumulator[dest_digit_number] + (r & 0xFFFFFFFFUL);
            my_accumulator[dest_digit_number++] = temp;  // low 32 bits stay in this digit (32-bit accumulator digits)
            temp >>= 32;                                 // carry out of this digit
            r = (r >> 32) + temp;                        // remaining high part of the product plus the carry
        }
    }
}
Then, when all threads are finished (after some "pthread_join()" calls?) you'd add the numbers from each thread's separate my_accumulator to get the actual result.
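For illustration, a minimal sketch of that final merge, assuming 32-bit digits as in the loop above; num_result_digits, result and NUM_THREADS are placeholder names:

uint64_t carry = 0;
for (int d = 0; d < num_result_digits; d++) {
    uint64_t sum = carry;
    for (int t = 0; t < NUM_THREADS; t++)
        sum += table_of_per_thread_accumulators[t][d];   // per-thread partial digit
    result[d] = (uint32_t)sum;   // low 32 bits of this column
    carry = sum >> 32;           // carry into the next digit
}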
Note 1: In theory you can do some "binary merging of accumulators", such that an odd numbered thread waits for the next lower numbered thread to finish and adds that thread's accumulator to its own; then threads that satisfy "my_thread_number % 4 == 3" wait for the thread numbered "my_thread_number - 2" to finish and add its accumulator to their own; and so on. This is likely to be too complicated and messy to bother with.
Note 2: The alternative (if you use a single output[] that's modified by multiple threads) is to have a mutex to ensure only one thread can modify output[] at the same time (or multiple mutexes to ensure only one thread can modify a piece of output[] at a time); and this will destroy performance so badly that it will be faster to use a single thread.
Note 3: The alternative alternative is to use atomics somehow, or use inline assembly (e.g. "lock add [output+rdi],eax;") so that no mutex is needed. This is still very bad, because the CPUs will be fighting for exclusive access to the same cache line (and you'll ruin performance by having CPUs spend most of their time trying to get exclusive access to the cache line).
I'm trying to count the number of primes up to 10 million, and I have to do it using multiple POSIX threads (so that each thread computes a subset of the 10 million). However, my code does not seem to be checking the isPrime condition correctly. I'm thinking this is due to a race condition. If it is, what can I do to fix this issue?
I've tried using a global integer array with k elements, but since the value of k is not known at compile time, I can't declare it at file scope.
I'm running my code using gcc -pthread:
/*
    Program that spawns off "k" threads.
    k is read in at the command line; each thread will compute
    a subset of the problem domain (check if the number is prime).
    To compile: gcc -pthread lab5_part2.c -o lab5_part2
*/
#include <math.h>
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <stdlib.h>

typedef int bool;
#define FALSE 0
#define TRUE 1
#define N 10000000 // 10 million

int k;              // global variable k will hold the number of threads
int primeCount = 0; // it will hold the number of primes

// returns whether num is prime
bool isPrime(long num) {
    long limit = sqrt(num);
    for (long i = 2; i <= limit; i++) {
        if (num % i == 0) {
            return FALSE;
        }
    }
    return TRUE;
}
// function to use with threads
void* getPrime(void* input) {
    // get the thread id
    long id = (long) input;
    printf("The thread id is: %ld \n", id);
    // how many iterations each thread will have to do
    int numOfIterations = N / k;
    // check the last thread, to make sure it is a whole number
    if (id == k - 1) {
        numOfIterations = N - (numOfIterations * id);
    }
    long startingPoint = (id * numOfIterations);
    long endPoint = (id + 1) * numOfIterations;
    for (long i = startingPoint; i < endPoint; i += 2) {
        if (isPrime(i)) {
            primeCount++;
        }
    }
    // terminate the calling thread
    pthread_exit(NULL);
}
int main(int argc, char** args) {
    // get the number of threads from the command line
    k = atoi(args[1]);
    // make sure it is working
    printf("Number of threads is: %d\n", k);
    struct timespec start, end;
    // start clock
    clock_gettime(CLOCK_REALTIME, &start);
    // create an array of threads to run
    pthread_t* threads = malloc(k * sizeof(pthread_t));
    for (int i = 0; i < k; i++) {
        pthread_create(&threads[i], NULL, getPrime, (void*)(long)i);
    }
    // wait for each thread to finish
    int retval;
    for (int i = 0; i < k; i++) {
        int* result = NULL;
        retval = pthread_join(threads[i], (void**)(&result));
    }
    // get the elapsed time
    clock_gettime(CLOCK_REALTIME, &end);
    double time_spent = (end.tv_sec - start.tv_sec) +
                        (end.tv_nsec - start.tv_nsec) / 1000000000.0f;
    printf("Time taken: %f seconds\n", time_spent);
    printf("%d primes found.\n", primeCount);
}
The current output I am getting (using 2 threads):
Number of threads is: 2
Time taken: 0.038641 seconds
2 primes found.
The counter primeCount is modified by multiple threads, and therefore must be atomic. To fix this using the standard library (which is now supported by POSIX as well), you should #include <stdatomic.h>, declare primeCount as an atomic_int, and increment it with an atomic_fetch_add() or atomic_fetch_add_explicit().
Better yet, if you don’t care about the result until the end, each thread can store its own count in a separate variable, and the main thread can add all the counts together once the threads finish. You will need to create, in the main thread, an atomic counter per thread (so that updates don’t clobber other data in the same cache line), pass each thread a pointer to its output parameter, and then return the partial tally to the main thread through that pointer.
This looks like an exercise that you want to solve yourself, so I won’t write the code for you, but the approach to use would be to declare an array of counters like the array of thread IDs, and pass &counters[i] as the arg parameter of pthread_create() similarly to how you pass &threads[i]. Each thread would need its own counter. At the end of the thread procedure, you would write something like, atomic_store_explicit( (atomic_int*)arg, localTally, memory_order_relaxed );. This should be completely wait-free on all modern architectures.
You might also decide that it’s not worth going to that trouble to avoid a single atomic update per thread, declare primeCount as an atomic_int, and then atomic_fetch_add_explicit( &primeCount, localTally, memory_order_relaxed ); once before the thread procedure terminates.
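As a rough sketch of that last variant only (the loop body is elided, and localTally is just a placeholder name):

#include <stdatomic.h>

atomic_int primeCount = 0;    // shared total, updated once per thread

void* getPrime(void* input) {
    int localTally = 0;       // private to this thread, so no data race while counting
    // ... work out this thread's range from `input` and count primes into localTally ...
    atomic_fetch_add_explicit(&primeCount, localTally, memory_order_relaxed);
    return NULL;
}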
I'm pretty new to C, so I'm not sure where even to start digging about my problem.
I'm trying to port Python number-crunching algos to C, and since there's no GIL in C (woohoo), I can change whatever I want in memory from threads, as long as I make sure there are no races.
I did my homework on mutexes; however, I cannot wrap my head around the use of mutexes in the case of continuously running threads accessing the same array over and over.
I'm using pthreads in order to split the workload on a big array a[N].
The number-crunching algorithm on array a[N] is additive, so I split the work using an a_diff[N_THREADS][N] array: each thread writes the changes to be applied to a[N] into its own row of a_diff[N_THREADS][N], and the rows are all merged together after each step.
I need to run the crunching on different versions of array a[N], so I pass them via the global pointer p (in the MWE, there's only one a[N]).
I'm synchronizing threads using another global array, SYNC_THREADS[N_THREADS], and make sure threads quit when I need them to by setting the END_THREADS global (I know, I'm using too many globals; I don't care, the code is ~200 lines). My question is about this synchronization technique: is it safe to do so, and what is a cleaner/better/faster way to achieve that?
MWE:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define N_THREADS 3
#define N 10000000
#define STEPS 3
double a[N]; // main array
double a_diff[N_THREADS][N]; // diffs array
double params[N]; // parameter used for number-crunching
double (*p)[N]; // pointer to array[N]
// structure for bounds for crunching the array
struct bounds {
    int lo;
    int hi;
    int thread_num;
};
struct bounds B[N_THREADS];
int SYNC_THREADS[N_THREADS]; // for syncing threads
int END_THREADS = 0; // signal to terminate threads
static void *crunching(void *arg) {
    // multiple threads run number-crunching operations according to assigned low/high bounds
    struct bounds *data = (struct bounds *)arg;
    int lo = (*data).lo;
    int hi = (*data).hi;
    int thread_num = (*data).thread_num;
    printf("worker %d started for bounds [%d %d] \n", thread_num, lo, hi);
    int i;
    while (END_THREADS != 1) { // END_THREADS tells threads to terminate
        if (SYNC_THREADS[thread_num] == 1) { // SYNC_THREADS allows threads to start number-crunching
            printf("worker %d working... \n", thread_num);
            for (i = lo; i <= hi; ++i) {
                a_diff[thread_num][i] += (*p)[i] * params[i]; // pretend this is an expensive operation...
            }
            SYNC_THREADS[thread_num] = 0; // thread disables itself until SYNC_THREADS is back to 1
            printf("worker %d stopped... \n", thread_num);
        }
    }
    return 0;
}
int i, j, th, s;
double joiner;

int main() {
    // pre-fill arrays
    for (i = 0; i < N; ++i) {
        a[i] = i + 0.5;
        params[i] = 0.0;
    }
    // split workload between workers
    int worker_length = N / N_THREADS;
    for (i = 0; i < N_THREADS; ++i) {
        B[i].thread_num = i;
        B[i].lo = i * worker_length;
        if (i == N_THREADS - 1) {
            B[i].hi = N;
        } else {
            B[i].hi = i * worker_length + worker_length - 1;
        }
    }
    // pointer to parameters to be passed to worker
    struct bounds **data = malloc(N_THREADS * sizeof(struct bounds*));
    for (i = 0; i < N_THREADS; i++) {
        data[i] = malloc(sizeof(struct bounds));
        data[i]->lo = B[i].lo;
        data[i]->hi = B[i].hi;
        data[i]->thread_num = B[i].thread_num;
    }
    // create thread objects
    pthread_t threads[N_THREADS];
    // disallow threads to crunch numbers
    for (th = 0; th < N_THREADS; ++th) {
        SYNC_THREADS[th] = 0;
    }
    // launch workers
    for (th = 0; th < N_THREADS; th++) {
        pthread_create(&threads[th], NULL, crunching, data[th]);
    }
    // big loop of iterations
    for (s = 0; s < STEPS; ++s) {
        for (i = 0; i < N; ++i) {
            params[i] += 1.0; // adjust parameters
            // zero diff array
            for (i = 0; i < N; ++i) {
                for (th = 0; th < N_THREADS; ++th) {
                    a_diff[th][i] = 0.0;
                }
            }
            p = &a; // pointer to array a
            // allow threads to process numbers and wait for threads to complete
            for (th = 0; th < N_THREADS; ++th) { SYNC_THREADS[th] = 1; }
            // ...here threads started by pthread_create do calculations...
            for (th = 0; th < N_THREADS; th++) { while (SYNC_THREADS[th] != 0) {} }
            // join results from threads (number-crunching is additive)
            for (i = 0; i < N; ++i) {
                joiner = 0.0;
                for (th = 0; th < N_THREADS; ++th) {
                    joiner += a_diff[th][i];
                }
                a[i] += joiner;
            }
        }
    }
    // join workers
    END_THREADS = 1;
    for (th = 0; th < N_THREADS; th++) {
        pthread_join(threads[th], NULL);
    }
    return 0;
}
I see that workers don't overlap time-wise:
worker 0 started for bounds [0 3333332]
worker 1 started for bounds [3333333 6666665]
worker 2 started for bounds [6666666 10000000]
worker 0 working...
worker 1 working...
worker 2 working...
worker 2 stopped...
worker 0 stopped...
worker 1 stopped...
worker 2 working...
worker 0 working...
worker 1 working...
worker 1 stopped...
worker 0 stopped...
worker 2 stopped...
worker 2 working...
worker 0 working...
worker 1 working...
worker 1 stopped...
worker 2 stopped...
worker 0 stopped...
Process returned 0 (0x0) execution time : 1.505 s
and I make sure workers don't get into each other's workspaces by separating them via the a_diff[thread_num][N] sub-arrays; however, I'm not sure that's always the case and that I'm not introducing hidden races somewhere...
I didn't realize at first what the question was :-)
So, the question is whether your SYNC_THREADS and END_THREADS synchronization mechanism is a sound approach.
Yes!... Almost. The problem is that threads are burning CPU while waiting.
Condition variables
To make threads wait for an event you have condition variables (pthread_cond). These offer a few useful functions like wait(), signal() and broadcast():
wait(&cond, &m) blocks a thread on a given condition variable. [note 2]
signal(&cond) wakes up one thread waiting on a given condition variable.
broadcast(&cond) wakes up all threads waiting on a given condition variable.
Initially you'd have all the threads waiting [note 1]:
while (!start_threads)
    pthread_cond_wait(&cond_start, &m);
And, when the main thread is ready:
start_threads = 1;
pthread_cond_broadcast(&cond_start);
Barriers
If you have data dependencies between iterations, you'd want to make sure threads are executing the same iteration at any given moment.
To synchronize threads at the end of each iteration, you'll want to take a look at barriers (pthread_barrier):
pthread_barrier_init(&b, NULL, count): initializes a barrier that will synchronize count threads.
pthread_barrier_wait(&b): the calling thread waits here until all count threads have reached the barrier.
Extending functionality of barriers
Sometimes you'll want the last thread reaching a barrier to compute something (e.g. to increment an iteration counter, to compute some global value, or to check whether execution should stop). You have two alternatives:
Using pthread_barriers
You'll essentially need two barriers:
int rc = pthread_barrier_wait(&b);
// exactly one thread gets PTHREAD_BARRIER_SERIAL_THREAD; all the others get 0
if (rc == PTHREAD_BARRIER_SERIAL_THREAD)
    if (shouldStop()) stop = 1;
pthread_barrier_wait(&b);   // wait again so every thread sees the updated stop flag
if (stop) return;
Using pthread_conds to implement our own specialized barrier
// remainingThreads and generation are shared and protected by mutex;
// generation counts completed rounds so waiters don't miss the wake-up
pthread_mutex_lock(&mutex);
int my_generation = generation;
remainingThreads--;
// all threads execute this
executedByAllThreads();
if (remainingThreads == 0) {
    // reinitialize barrier for the next round
    remainingThreads = N;
    generation++;
    // only the last thread executes this
    if (shouldStop()) stop = 1;
    pthread_cond_broadcast(&cond);
} else {
    // wait until the last thread of this round has arrived
    while (my_generation == generation)
        pthread_cond_wait(&cond, &mutex);
}
pthread_mutex_unlock(&mutex);
Note 1: why is pthread_cond_wait() inside a while loop? It may seem a bit odd. The reason is the existence of spurious wakeups: the function may return even if no signal() or broadcast() was issued. So, to guarantee correctness, it's usual to have a predicate variable that is re-checked, so that if a thread wakes up before it should, it goes right back into pthread_cond_wait().
From the manual:
When using condition variables there is always a Boolean predicate involving shared variables associated with each condition wait that is true if the thread should proceed. Spurious wakeups from the pthread_cond_timedwait() or pthread_cond_wait() functions may occur. Since the return from pthread_cond_timedwait() or pthread_cond_wait() does not imply anything about the value of this predicate, the predicate should be re-evaluated upon such return.
(...)
If a signal is delivered to a thread waiting for a condition variable, upon return from the signal handler the thread resumes waiting for the condition variable as if it was not interrupted, or it shall return zero due to spurious wakeup.
Note 2:
As noted by Michael Burr in the comments, you should hold a companion lock whenever you modify the predicate (start_threads) and when you call pthread_cond_wait(). pthread_cond_wait() releases the mutex when called and re-acquires it when it returns.
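A minimal sketch of the complete pattern for the start_threads example above, with m as the companion mutex of cond_start:

// waiting side: the predicate is only read while holding m
pthread_mutex_lock(&m);
while (!start_threads)
    pthread_cond_wait(&cond_start, &m);   // releases m while blocked, re-acquires it on return
pthread_mutex_unlock(&m);

// signalling side: the predicate is only written while holding m
pthread_mutex_lock(&m);
start_threads = 1;
pthread_cond_broadcast(&cond_start);
pthread_mutex_unlock(&m);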
PS: It's a bit late here; sorry if my text is confusing :-)
This program mimics a web page counter, counting how many visits a web page gets. I just want to ask what is wrong with this code and why its output is different from what I expect:
the counter value is smaller than the number of visits.
#include <stdio.h>
#include <stdlib.h>   // rand(), srand()
#include <time.h>     // time()
#include <unistd.h>
#include <pthread.h>

// repeat 100 times to mimic 100 random visits to the page
#define RPT 100

// web page visit counter
int cnt = 0;
void* counter(void* arg) {
    int cntLocalCopy;
    float r;
    cntLocalCopy = cnt;
    // mimicking the work of the server in serving the page to the browser
    r = rand() % 2000;
    usleep(r);
    cnt = cntLocalCopy + 1;
    return NULL;
}
int main() {
    int i;
    float r;
    pthread_t tid[RPT];
    // seed the random number sequence
    srand(time(NULL));
    for (i = 0; i < RPT; i++) {
        // mimicking the random access to the web page
        r = rand() % 2000;
        usleep(r);
        // a thread to respond to a connect from a browser
        pthread_create(&tid[i], NULL, &counter, NULL);
    }
    // Wait till threads complete.
    for (i = 0; i < RPT; i++) {
        pthread_join(tid[i], NULL);
    }
    // print out the counter value and the number of mimicked visits;
    // the 2 values should be the same if the program is written properly
    printf("cnt=%d, repeat=%d\n", cnt, RPT);
}
This isn't a good idea at all:
cntLocalCopy = cnt;
... sleep
cnt = cntLocalCopy + 1;
As the old value of cnt is read before the sleep, the likelihood of 2 or more threads concurrently reading the same old value of cnt and then sleeping is very high. Because the sleep duration is random, this can even make the counter go backwards: a thread that slept for a long time ends up overwriting a much newer, larger value.
Even if you rearranged the code as follows
... sleep
cntLocalCopy = cnt;
cnt = cntLocalCopy + 1;
or even
++cnt;
Synchronization (an atomic increment or a mutex) will still be needed, as 2 threads could simultaneously read the same old value of cnt; they would both increment it to the same new value, instead of incrementing the value serially. Have a look here for an example.
As StuartLC said, you clearly have a concurrency problem with the variable cnt.
Perhaps you should use a mutex or semaphore to create a critical region around this variable, so that while one thread is reading or writing it, no other thread can access it.
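For illustration, a minimal sketch of that idea applied to the counter() function from the question (one possible fix; an atomic_int from <stdatomic.h> would work just as well):

pthread_mutex_t cnt_mutex = PTHREAD_MUTEX_INITIALIZER;

void* counter(void* arg) {
    usleep(rand() % 2000);             // mimic serving the page, outside the critical region
    pthread_mutex_lock(&cnt_mutex);
    cnt = cnt + 1;                     // only one thread at a time reads and writes cnt
    pthread_mutex_unlock(&cnt_mutex);
    return NULL;
}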
I'm running some tests on a short program I've written. It runs another program which performs some file operations based on the inputs I give it. The whole purpose of this program is to break a large packet of work into smaller packets to increase performance (sending 10 smaller packets to 10 instances of the program instead of waiting for the one larger one to execute; simple divide and conquer).
The problem lies in the fact that, while I believe I have limited the number of threads that will be created, the testing messages I have set up indicate that there are many more threads running than there should be. I'm really uncertain what I did wrong here.
Code snippet:
if (finish != start){
    if (sizeOfBlock != 0){
        num_threads = (finish - start)/sizeOfBlock + 1;
    }
    else{
        num_threads = (finish - start) + 1;
    }
    if (num_threads > 10){ // this should limit threads to 10 at most
        num_threads == 10;
    }
    else if (finish == start){
        num_threads = 1;
    }
}

threads = (pthread_t *) malloc(num_threads * sizeof(pthread_t));

for (i = 0; i < num_threads; i++){
    printf("Creating thread %d\n", i);
    s = pthread_create(&threads[i], NULL, thread, &MaxNum);
    if (s != 0)
        printf("error in pthread_create\n");
    if (s == 0)
        activethreads++;
}

while (activethreads > 0){
    //printf("active threads: %d\n", activethreads);
}

pthread_exit(0);
This code is useless:
if (num_threads > 10){ // this should limit threads to 10 at most
    num_threads == 10
}
num_threads == 10 compares num_threads to 10 and then throws the result away. You want an assignment instead:
if (num_threads > 10){ // this should limit threads to 10 at most
    num_threads = 10;
}
Also, there are numerous ; missing in your code. In the future, please try to provide a self-contained example of code that compiles.