Windows Threading API: Calculate PI value with multiple threads - c

I am currently working on this project where I need to calculate the value of PI...
When I specify only one thread it works perfectly and I get 3.1416[...], but when I specify two or more threads I stop getting the 3.1416 value. This is my code:
#include <stdio.h>
#include <time.h>
#include <windows.h>

//const int numThreads = 1;
//long long num_steps = 100000000;

const int numThreads = 2;
long long num_steps = 50000000;

double x, step, pi, sum = 0.0;
int i;

DWORD WINAPI ValueFunc(LPVOID arg) {
    for (i = 0; i <= num_steps; i++) {
        x = (i + .5) * step;
        sum = sum + 4.0 / (1. + x * x);
    }
    printf("this is %d step\n", i);
    return 0;
}

int main(int argc, char* argv[]) {
    int count;
    clock_t start, stop;

    step = 1. / (double)num_steps;
    start = clock();

    HANDLE hThread[numThreads];
    for (count = 0; count < numThreads; count++) {
        printf("This is thread %d\n", count);
        hThread[count] = CreateThread(NULL, 0, ValueFunc, NULL, 0, NULL);
    }

    WaitForMultipleObjects(numThreads, hThread, TRUE, INFINITE);

    pi = sum * step;
    stop = clock();

    printf("The value of PI is %15.12f\n", pi);
    printf("The time to calculate PI was %f seconds\n", ((double)(stop - start) / 1000.0));
}
I get this wrong output when specifying 2 threads:

It seems that your program, when using two threads, allows both threads to directly manipulate the global/shared variable 'sum' without any synchronization protection.
In other words, both threads can modify 'sum' at the same time, so its value at any point will not be what you expect (i.e., what it would be with only one thread).
Your program needs to implement some form of access synchronization between the two threads, such as semaphores, spin-locks, mutexes, or atomic operations. Implemented properly, these would allow the two (or more) threads to share the single task of calculating PI.

You need to use mutexes to access data shared by multiple threads, or keep the data local to each thread and then collate the answers when all the threads have completed.
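To illustrate the second approach on the Windows code above, here is a minimal sketch (mine, not the poster's). Note the original has a second problem besides the unprotected sum: every thread runs the full 0..num_steps loop over the shared i and x, so the work is never actually divided. The sketch gives each thread its own range and its own partial sum in a small per-thread struct (here called Work, my naming), and the main thread collates the partials after WaitForMultipleObjects returns:

#include <stdio.h>
#include <windows.h>

#define NUM_THREADS 2

static long long num_steps = 50000000;
static double step;

/* Per-thread input/output block: each thread reads its own
   range and writes only its own partial sum. */
typedef struct {
    long long begin, end;
    double partial;
} Work;

static DWORD WINAPI ValueFunc(LPVOID arg) {
    Work *w = (Work *)arg;
    double local = 0.0;               /* thread-local accumulator */
    for (long long i = w->begin; i < w->end; i++) {
        double x = (i + 0.5) * step;
        local += 4.0 / (1.0 + x * x);
    }
    w->partial = local;               /* no other thread touches this slot */
    return 0;
}

int main(void) {
    HANDLE hThread[NUM_THREADS];
    Work work[NUM_THREADS];
    double sum = 0.0;

    step = 1.0 / (double)num_steps;
    long long chunk = num_steps / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        work[t].begin = t * chunk;
        /* the last thread absorbs any remainder */
        work[t].end = (t == NUM_THREADS - 1) ? num_steps : (t + 1) * chunk;
        hThread[t] = CreateThread(NULL, 0, ValueFunc, &work[t], 0, NULL);
    }

    WaitForMultipleObjects(NUM_THREADS, hThread, TRUE, INFINITE);

    for (int t = 0; t < NUM_THREADS; t++)
        sum += work[t].partial;       /* collate after all threads finish */

    printf("The value of PI is %15.12f\n", sum * step);
    return 0;
}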

Related

why is this c code causing a race condition?

I'm trying to count the number of prime numbers up to 10 million, and I have to do it using multiple POSIX threads (so that each thread computes a subset of the 10 million). However, the count my code produces is wrong, and I'm thinking this is due to a race condition. If it is, what can I do to ameliorate the issue?
I've tried using a global integer array with k elements, but since k is not known at compile time the compiler won't let me declare the array at file scope.
I'm running my code using gcc -pthread:
/*
 Program that spawns off "k" threads.
 k is read in at the command line; each thread will compute
 a subset of the problem domain (check if the number is prime).
 to compile: gcc -pthread lab5_part2.c -o lab5_part2
*/

#include <math.h>
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <stdlib.h>

typedef int bool;
#define FALSE 0
#define TRUE 1
#define N 10000000 // 10 Million

int k;              // global variable k will hold the number of threads
int primeCount = 0; // it will hold the number of primes.

// returns whether num is prime
bool isPrime(long num) {
    long limit = sqrt(num);
    for (long i = 2; i <= limit; i++) {
        if (num % i == 0) {
            return FALSE;
        }
    }
    return TRUE;
}

// function to use with threads
void* getPrime(void* input) {
    // get the thread id
    long id = (long) input;
    printf("The thread id is: %ld \n", id);

    // how many iterations each thread will have to do
    int numOfIterations = N / k;
    // check the last thread, to make sure the whole range is covered
    if (id == k - 1) {
        numOfIterations = N - (numOfIterations * id);
    }

    long startingPoint = (id * numOfIterations);
    long endPoint = (id + 1) * numOfIterations;
    for (long i = startingPoint; i < endPoint; i += 2) {
        if (isPrime(i)) {
            primeCount++;
        }
    }
    // terminate calling thread.
    pthread_exit(NULL);
}

int main(int argc, char** args) {
    // get the number of threads from the command line
    k = atoi(args[1]);
    // make sure it is working
    printf("Number of threads is: %d\n", k);

    struct timespec start, end;
    // start clock
    clock_gettime(CLOCK_REALTIME, &start);

    // create an array of threads to run
    pthread_t* threads = malloc(k * sizeof(pthread_t));
    for (int i = 0; i < k; i++) {
        pthread_create(&threads[i], NULL, getPrime, (void*)(long)i);
    }

    // wait for each thread to finish
    int retval;
    for (int i = 0; i < k; i++) {
        int* result = NULL;
        retval = pthread_join(threads[i], (void**)(&result));
    }

    // compute the elapsed time
    clock_gettime(CLOCK_REALTIME, &end);
    double time_spent = (end.tv_sec - start.tv_sec) +
                        (end.tv_nsec - start.tv_nsec) / 1000000000.0;
    printf("Time taken: %f seconds\n", time_spent);
    printf("%d primes found.\n", primeCount);
}
The current output I am getting (using 2 threads):
Number of threads is: 2
Time taken: 0.038641 seconds
2 primes found.
The counter primeCount is modified by multiple threads, and therefore must be atomic. To fix this using the standard library (which is now supported by POSIX as well), you should #include <stdatomic.h>, declare primeCount as an atomic_int, and increment it with an atomic_fetch_add() or atomic_fetch_add_explicit().
Better yet, if you don’t care about the result until the end, each thread can store its own count in a separate variable, and the main thread can add all the counts together once the threads finish. You will need to create, in the main thread, an atomic counter per thread (so that updates don’t clobber other data in the same cache line), pass each thread a pointer to its output parameter, and then return the partial tally to the main thread through that pointer.
This looks like an exercise that you want to solve yourself, so I won’t write the code for you, but the approach to use would be to declare an array of counters like the array of thread IDs, and pass &counters[i] as the arg parameter of pthread_create() similarly to how you pass &threads[i]. Each thread would need its own counter. At the end of the thread procedure, you would write something like atomic_store_explicit( (atomic_int*)arg, localTally, memory_order_relaxed );. This should be completely wait-free on all modern architectures.
You might also decide that it’s not worth going to that trouble just to avoid a single atomic update per thread: declare primeCount as an atomic_int, and then call atomic_fetch_add_explicit( &primeCount, localTally, memory_order_relaxed ); once before the thread procedure terminates.
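For reference, a minimal sketch of that last variant, assuming C11 <stdatomic.h> is available (note the sketch steps by 1 rather than the question's i += 2, which skips half the candidates, and it leans on the same isPrime() and k as the question):

#include <stdatomic.h>
#include <pthread.h>

#define N 10000000
extern int k;              /* number of threads, as in the question */
extern int isPrime(long);  /* same helper as in the question */

atomic_int primeCount;     /* shared counter, now safe to update */

void* getPrime(void* input) {
    long id = (long) input;
    int localTally = 0;    /* thread-private: no contention here */

    long start = id * (long)(N / k);
    long end = (id == k - 1) ? N : start + (N / k);
    for (long i = start; i < end; i++) {
        if (isPrime(i))
            localTally++;
    }

    /* one atomic update per thread instead of one per prime found */
    atomic_fetch_add_explicit(&primeCount, localTally, memory_order_relaxed);
    pthread_exit(NULL);
}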

Serial Execution faster than Parallel Execution with one thread of OpenMP

I am trying to compute the value of pi using the trapezoidal rule of numerical integration. For that I have written a serial code which does the iterations over a given range. To measure the parallel overhead, I have also run the same code with the number of threads set to 1. Now, I have obtained the following graph of execution time versus problem size.
Since we are only creating one thread, I don't think there is much communication overhead involved. So what might be the reason for the slowdown? And as far as I know, a directive is handled at compile time, i.e., if you define a macro it gets expanded before runtime, so am I missing something there? Or is it something totally different from what I have thought?
Below is the serial code
#include <stdio.h>
#include <omp.h>

int main()
{
    FILE *fp = fopen("pi_serial.txt", "a+");
    long num_steps = 1e9;
    double step_size = 1.0 / num_steps;
    long i;
    double sum = 0;

    double start_time = omp_get_wtime();
    for (i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step_size;
        sum += (4.0 / (1.0 + (x * x)));
    }
    sum = sum * step_size;
    double end_time = omp_get_wtime();

    fprintf(fp, "%lf %lf\n", sum, end_time - start_time);
    fclose(fp);
    return 0;
}
And here is the multi-threaded code
#include <stdio.h>
#include <omp.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    FILE* fp = fopen("pi_parallel.txt", "a+");
    omp_set_num_threads(1);
    long num_steps = atol(argv[1]);
    double step_size = 1.0 / num_steps;
    double sum = 0;

    double start_time = omp_get_wtime();
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        double private_sum = 0;
        int i;
        for (i = id; i <= num_steps; i += 1) {
            double x = (i + 0.5) * step_size;
            private_sum += (4.0 / (1.0 + x * x));
        }
        #pragma omp critical
        sum += private_sum;
    }
    sum *= step_size;
    double end_time = omp_get_wtime();

    fprintf(fp, "%lf %lf\n", sum, end_time - start_time);
    fclose(fp);
    return 0;
}
And here is the graph for Execution time
https://www.youtube.com/watch?v=OuzYICZUthM&list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG&index=7
The above video will help in understanding why a serial code might be faster than a parallel code running on one thread.
According to the presenter, the work of setting up the OpenMP environment and creating a thread team in the middle of the program means it is normal for the OpenMP version to run slower than the serial code.
But the main thing is to look at the scalability of your code: how fast is it, compared to the serial version, when running on more than one thread?
If you run the same code on multiple threads and still see no increase in performance, the cause may be false sharing. From what I understand, the mechanism is this: consider two variables that reside in the same cache line. The master thread accesses one of them and modifies it, which invalidates the cache line. If another thread then has to access that cache line, the modified line is first written back to memory and then fetched again before the second thread can modify it. This traffic can noticeably increase execution time.
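When you do scale up the thread count, the usual way to avoid both the critical section and any false sharing on the partial sums is OpenMP's reduction clause: each thread accumulates into a private copy of sum and OpenMP combines the copies at the end. A minimal sketch of the integration loop rewritten that way (my own rewrite, same integral as in the question):

#include <omp.h>

/* Each thread sums into a private copy of `sum`; OpenMP combines
   the private copies once the loop ends, so no critical section
   (and no shared accumulator to false-share) is needed. */
double pi_reduction(long num_steps) {
    double step_size = 1.0 / num_steps;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step_size;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * step_size;
}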
References:
https://docs.oracle.com/cd/E37069_01/html/E37081/aewcy.html
*I don't own the video.

C Threads program

I wrote a program based on the idea of a Riemann sum to find the value of an integral. It uses several threads, but its performance, compared to a sequential program I wrote later, is subpar. Algorithm-wise they are identical except for the threading, so the question is: what's wrong with it? pthread_join is not the cause, I assume, because if one thread finishes sooner than the thread join is currently waiting on, the later join on it will simply return immediately. Is that correct? The free call is probably wrong and there is no error check upon creation of the threads; I'm aware of it, I deleted those along the way while testing various things. Sorry for the bad English, and thanks in advance.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/types.h>
#include <time.h>

int counter = 0;
float sum = 0;
pthread_mutex_t mutx;

float function_res(float);

struct range {
    float left_border;
    int steps;
    float step_range;
};

void *calcRespectiveRange(void *ranges) {
    struct range *rangs = ranges;
    float left_border = rangs->left_border;
    int steps = rangs->steps;
    float step_range = rangs->step_range;
    free(rangs);
    //printf("left: %f steps: %d step range: %f\n", left_border, steps, step_range);

    int i;
    float temp_sum = 0;
    for (i = 0; i < steps; i++) {
        temp_sum += step_range * function_res(left_border);
        left_border += step_range;
    }
    sum += temp_sum;
    pthread_exit(NULL);
}

int main() {
    clock_t begin, end;
    if (pthread_mutex_init(&mutx, NULL) != 0) {
        printf("mutex error\n");
    }

    printf("enter range, amount of steps and threads: \n");
    float left_border, right_border;
    int steps_count;
    int threads_amnt;
    scanf("%f %f %d %d", &left_border, &right_border, &steps_count, &threads_amnt);

    float step_range = (right_border - left_border) / steps_count;
    int i;
    pthread_t tid[threads_amnt];
    float chunk = (right_border - left_border) / threads_amnt;
    int steps_per_thread = steps_count / threads_amnt;

    begin = clock();
    for (i = 0; i < threads_amnt; i++) {
        struct range *ranges;
        ranges = malloc(sizeof(ranges));
        ranges->left_border = i * chunk + left_border;
        ranges->steps = steps_per_thread;
        ranges->step_range = step_range;
        pthread_create(&tid[i], NULL, calcRespectiveRange, (void*) ranges);
    }
    for (i = 0; i < threads_amnt; i++) {
        pthread_join(tid[i], NULL);
    }
    end = clock();

    pthread_mutex_destroy(&mutx);
    printf("\n%f\n", sum);
    double time_spent = (double) (end - begin) / CLOCKS_PER_SEC;
    printf("Time spent: %lf\n", time_spent);
    return(0);
}

float function_res(float lb) {
    return(lb * lb + 4 * lb + 3);
}
Edit: in short - can it be improved to reduce execution time (with mutexes, for example)?
The execution time will be shortened, provided you have multiple hardware threads available.
The problem is in how you measure time: clock returns the processor time used by the program. That means it sums the time consumed by all the threads. If your program uses 2 threads and its linear execution time is 1 second, then each thread has used 1 second of CPU time, and clock will return the equivalent of 2 seconds.
To get the actual time used (on Linux), use gettimeofday. I modified your code by adding
#include <sys/time.h>
and capturing the start time before the loop:
struct timeval tv_start;
gettimeofday( &tv_start, NULL );
and after:
struct timeval tv_end;
gettimeofday( &tv_end, NULL );
and calculating the difference in seconds:
printf("CPU Time: %lf\nTime passed: %lf\n",
time_spent,
((tv_end.tv_sec * 1000*1000.0 + tv_end.tv_usec) -
(tv_start.tv_sec * 1000*1000.0 + tv_start.tv_usec)) / 1000/1000
);
(I also fixed the malloc: malloc(sizeof(ranges)) allocates only the size of a pointer (4 or 8 bytes on a 32/64-bit CPU), whereas malloc(sizeof(struct range)) allocates the 12 bytes the struct actually needs.)
When running with the input parameters 0 1000000000 1000000000 1, that is, 1 billion iterations in 1 thread, the output on my machine is:
CPU Time: 4.352000
Time passed: 4.400006
When running with 0 1000000000 1000000000 2, that is, 1 billion iterations spread over 2 threads (500 million iterations each), the output is:
CPU Time: 4.976000
Time passed: 2.500003
For completeness' sake, I tested it with the input 0 1000000000 1000000000 4:
CPU Time: 8.236000
Time passed: 2.180114
It is a little faster, but not twice as fast as with 2 threads, and it uses double the CPU time. This is because my CPU is a Core i3, a dual core with hyperthreading, and hyperthreads aren't true hardware cores.
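One more thing worth noting: the answer above is only about timing, but sum += temp_sum; in calcRespectiveRange() is itself an unsynchronized update to a shared variable, the same kind of race discussed elsewhere on this page. Since the program already initializes mutx and never uses it, a minimal fix (one lock per thread, so the cost is negligible) would be:

/* at the end of calcRespectiveRange(), replacing the bare sum += temp_sum; */
pthread_mutex_lock(&mutx);    /* serialize the one shared update */
sum += temp_sum;              /* temp_sum itself is thread-private */
pthread_mutex_unlock(&mutx);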

Why does this code work without a mutex?

I am trying to learn how locks work in multi-threading. When I execute the following code without the lock, it works fine even though the variable sum is declared as a global variable and multiple threads update it. Could anyone please explain why the threads work perfectly fine on a shared variable without locks?
Here is the code:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 100
#define ARRAYSIZE 1000000
#define ITERATIONS ARRAYSIZE / NTHREADS

double sum = 0.0, a[ARRAYSIZE];
pthread_mutex_t sum_mutex;

void *do_work(void *tid)
{
    int i, start, *mytid, end;
    double mysum = 0.0;

    /* Initialize my part of the global array and keep local sum */
    mytid = (int *) tid;
    start = (*mytid * ITERATIONS);
    end = start + ITERATIONS;
    printf("Thread %d doing iterations %d to %d\n", *mytid, start, end - 1);

    for (i = start; i < end; i++) {
        a[i] = i * 1.0;
        mysum = mysum + a[i];
    }

    /* Lock the mutex and update the global sum, then exit */
    //pthread_mutex_lock (&sum_mutex); // here I tried not to use locks
    sum = sum + mysum;
    //pthread_mutex_unlock (&sum_mutex);
    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    int i, start, tids[NTHREADS];
    pthread_t threads[NTHREADS];
    pthread_attr_t attr;

    /* Pthreads setup: initialize mutex and explicitly create threads in a
       joinable state (for portability). Pass each thread its loop offset */
    pthread_mutex_init(&sum_mutex, NULL);
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    for (i = 0; i < NTHREADS; i++) {
        tids[i] = i;
        pthread_create(&threads[i], &attr, do_work, (void *) &tids[i]);
    }

    /* Wait for all threads to complete then print global sum */
    for (i = 0; i < NTHREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    printf("Done. Sum= %e \n", sum);

    sum = 0.0;
    for (i = 0; i < ARRAYSIZE; i++) {
        a[i] = i * 1.0;
        sum = sum + a[i];
    }
    printf("Check Sum= %e\n", sum);

    /* Clean up and exit */
    pthread_attr_destroy(&attr);
    pthread_mutex_destroy(&sum_mutex);
    pthread_exit(NULL);
}
With and without the lock I got the same answer!
Done. Sum= 4.999995e+11
Check Sum= 4.999995e+11
UPDATE: Change suggested by user3386109
for (i = start; i < end; i++) {
    a[i] = i * 1.0;
    //pthread_mutex_lock (&sum_mutex);
    sum = sum + a[i];
    //pthread_mutex_unlock (&sum_mutex);
}
EFFECT:
Done. Sum= 3.878172e+11
Check Sum= 4.999995e+11
Mutexes are used to prevent race conditions, which are undesirable situations that arise when two or more threads access a shared resource. Race conditions such as the one in your code happen when the shared variable sum is accessed by multiple threads. Sometimes the accesses will be interleaved in such a way that the result is incorrect, and sometimes the result will be correct.
For example, let's say you have two threads, thread A and thread B, both adding 1 to a shared value, sum, which starts at 5. If thread A reads sum, then thread B reads sum, then thread A writes a new value followed by thread B writing a new value, you can get an incorrect result: 6 as opposed to 7. However, it is also possible that thread A reads and then writes a value (specifically 6) followed by thread B reading and writing a value (specifically 7), and then you get the correct result. The point is that some interleavings of operations produce the correct value and some produce an incorrect value. The job of the mutex is to force the interleaving to always be correct.
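As to why the original code usually "works": each thread performs its unsynchronized sum = sum + mysum only once, after a long local loop, so the threads rarely collide; the suggested change moves the shared update into the hot loop, where collisions become almost certain, as the EFFECT output shows. To see the same lost-update effect in isolation, here is a minimal demonstration of my own (not from the question): two threads each add 1 to a shared counter ten million times, and without the mutex the total almost always falls short of 20 million.

#include <pthread.h>
#include <stdio.h>

#define PER_THREAD 10000000

long counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *bump(void *arg) {
    for (long i = 0; i < PER_THREAD; i++) {
        /* without the lock, two read-modify-write sequences can
           interleave and updates are lost; uncomment to fix */
        //pthread_mutex_lock(&lock);
        counter = counter + 1;
        //pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, NULL);
    pthread_create(&t2, NULL, bump, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* expected 20000000; typically prints less without the mutex */
    printf("counter = %ld\n", counter);
    return 0;
}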

Not sure if I need a mutex or not

I'm new to concurrent programming, so be nice. I have a basic sequential program (which is for homework) and I'm attempting to turn it into a multithreaded program. I'm not sure if I need a lock for my second shared variable. The threads should modify the variable but never read it. The only time count should be read is after the loop which spawns all of my threads has finished distributing keys.
#define ARRAYSIZE 50000

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

void binary_search(int *array, int key, int min, int max);

int count = 0;              // count of intersections
int l_array[ARRAYSIZE * 2]; // array to check for intersection

int main(void)
{
    int r_array[ARRAYSIZE]; // array of keys
    int ix = 0;
    struct timeval start, stop;
    double elapsed;

    for (ix = 0; ix < ARRAYSIZE; ix++)
    {
        r_array[ix] = ix;
    }
    for (ix = 0; ix < ARRAYSIZE * 2; ix++)
    {
        l_array[ix] = ix + 500;
    }

    gettimeofday(&start, NULL);
    for (ix = 0; ix < ARRAYSIZE; ix++)
    {
        // this is where I will spawn off separate threads
        binary_search(l_array, r_array[ix], 0, ARRAYSIZE * 2);
    }
    // wait for all threads to finish computation, then proceed.
    fprintf(stderr, "%d\n", count);
    gettimeofday(&stop, NULL);

    elapsed = ((stop.tv_sec - start.tv_sec) * 1000000 + (stop.tv_usec - start.tv_usec)) / 1000000.0;
    printf("time taken is %f seconds\n", elapsed);
    return 0;
}

void binary_search(int *array, int key, int min, int max)
{
    int mid = 0;
    if (max < min) return;
    else
    {
        mid = (min + max) / 2;
        if (array[mid] > key) return binary_search(array, key, min, mid - 1);
        else if (array[mid] < key) return binary_search(array, key, mid + 1, max);
        else
        {
            // this is where I'm not sure if I need a lock or not
            count++;
            return;
        }
    }
}
As you suspect, count++; requires synchronization. This is actually not something you should try to "get away with" not doing. Sooner or later a second thread will read count after the first thread reads it but before it increments it. Then you will miss a count. It is impossible to predict how often it will happen. It could happen once in a blue moon or thousands of times a second.
Actually, the code as you've written it does both read and modify the variable. If you were to look at the machine code that gets generated for a line like

    count++

you'd see that it consists of something like

    fetch count into a register
    increment the register
    store the register back to count

So yes, you should use a mutex there. (And even if you could get away without doing so, why not take the chance to practice?)
If you simply want accurate increments to count across multiple threads, these types of single-value updates are precisely what the interlocked memory-barrier functions are for.
For this I would use __sync_add_and_fetch if you're using gcc. There are a host of different interlocked operations you can do, most of them platform-specific, so check your documentation. For updating counters like this, however, they can save a heap of hassle. Other examples include InterlockedIncrement under Windows, OSAtomicIncrement32 on OS X, etc.
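For instance, here is a minimal self-contained sketch using the gcc builtin (the four worker threads and the loop count are my own scaffolding, not from the question):

#include <pthread.h>
#include <stdio.h>

int count = 0;

void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        /* atomic read-modify-write: increments count and returns the
           new value; no updates are lost and no mutex is needed */
        __sync_add_and_fetch(&count, 1);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("count = %d\n", count);  /* reliably 4000000 */
    return 0;
}

In the question's code, the equivalent change is replacing count++; inside binary_search() with __sync_add_and_fetch(&count, 1);.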
