I am trying to calculate pi using the BBP method, but my result keeps coming up as 0. The idea is for each thread to compute one part of the sum, and the per-thread results are then added together after the join.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#define NUM_THREADS 20
void *pie_function(void *p); // computes one term of the BBP sum
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER; //creates a mutex variable
double pi=0,p16=1;int k=0;
double sumvalue=0,sum=0;
int main()
{
pthread_t threads[NUM_THREADS]; //creates the number of threads NUM_THREADS
int iret1; //used to ensure that threads are created properly
//pthread_create(thread,attr,start_routine,arg)
int i;
pthread_mutex_init(&mutex1, NULL);
for(i=0;i<NUM_THREADS;i++){
iret1= pthread_create(&threads[i],NULL,pie_function,(void *) i);
if(iret1){
printf("ERROR; return code from pthread_create() is %d\n", iret1);
exit(-1);
}
}
for(i=0;i<NUM_THREADS;i++){
iret1=pthread_join(threads[i],&sumvalue);
if(iret1){
printf("ERROR; return code from pthread_create() is %d\n", iret1);
exit(-1);
}
pi=pi+sumvalue; //my result here keeps returning 0
}
pthread_mutex_destroy(&mutex1);
printf("Main: program completed. Exiting.\n");
printf("The value of pi is : %f\n",pi);
exit(0);
}
void *pie_function(void * p){
int rc;
int k=(int)p;
sumvalue += 1.0/p16 * (4.0/(8* k + 1) - 2.0/(8*k + 4)
- 1.0/(8*k + 5) - 1.0/(8*k+6));
pthread_mutex_lock( &mutex1 ); //locks the shared variables pi and p16
p16 *=16;
rc=pthread_mutex_unlock( &mutex1 );
if(rc){
printf("ERROR; return code from pthread_create() is %d\n", rc);
}
pthread_exit(&sumvalue);
}
For your purposes you don't need a mutex or any other complicated machinery. Just have every thread compute with its own local variables. Provide each thread the address of a double where it receives its k and can return its result, in the same way you already give each thread its own pthread_t variable.
To avoid getting 0 as the output, put pi = pi + sumvalue inside the join for loop.
As written, pi = pi + sumvalue effectively runs only once, while sumvalue is still very small and pi is 0, so the output is 0. Just put pi = pi + sumvalue in the join for loop.
To get a consistent value of pi, make sure of two things:
Keep pi as a global, but make the per-thread sum a local variable and remove the other globals; every global you update from multiple threads becomes a critical section. Update pi like this:
pthread_mutex_lock(&mutex1);
pi += my_sum;
pthread_mutex_unlock(&mutex1);
You can also calculate sumvalue as follows:
void *pie_function(void *rank) {
long my_rank = (long) rank;
//printf("%ld \n",my_rank);
double factor;
double *sumvalue = malloc(sizeof *sumvalue); // heap-allocated so it outlives the thread
*sumvalue = 0.0;
long long i;
long long n = 1000000;
long long my_n = n / thread_count; // thread_count is a global set in main
long long my_first_i = my_n * my_rank;
long long my_last_i = my_first_i + my_n;
if (my_first_i % 2 == 0)
factor = 1.0;
else
factor = -1.0;
for (i = my_first_i; i < my_last_i; i++, factor = -factor)
*sumvalue += 4 * factor / (2 * i + 1);
pthread_exit(sumvalue);
}
So I have a global variable called counter, and I run 4 threads which each increment it a million times, but the result I am getting at the end does not even reach 2 million.
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
int nthread;
int counter=0;
void *f(void *arg)
{
int i = *(int *)arg;
int *p;
for (int c = 0; c < 1000000; c++)
{
counter++;
}
printf(" I am thread %d (out of %d), tid = %ld\n", i, nthread, pthread_self());
p = malloc(sizeof(int));
*p = i * 2;
pthread_exit(p); // return p
}
int main(int argc, char *argv[])
{
pthread_t *tid;
int e, i, *ti;
nthread = 4;
tid = malloc(nthread * sizeof(pthread_t));
ti = malloc(nthread * sizeof(int));
for (i = 0; i < nthread; i++)
{
ti[i] = i;
if ((e = pthread_create(&tid[i], NULL, f, &ti[i])) != 0)
send_error(e, " pthread_create ");
}
for (i = 0; i < nthread; i++)
{
void *r;
if ((e = pthread_join(tid[i], &r)) != 0)
send_error(e, " pthread_join ");
printf(" Return of thread %d = %d\n", i, *(int *)r);
free(r);
}
printf("counter is %d\n",counter);
free(tid);
free(ti);
}
What causes this, and how can I fix it?
PS: if the code doesn't compile for you, replace send_error with printf calls.
The pthreads standard is very clear that you may not access an object in one thread while another thread is, or might be, modifying it. Your code violates this rule.
There are many reasons for this rule, but the most obvious is this:
for (int c = 0; c < 1000000; c++)
{
counter++;
}
You want your compiler to optimize code like this. You want it to keep counter in a register or even eliminate the loop if it can. But without the requirement that you avoid threads overlapping modifications and accesses to the same object, the compiler would have to somehow prove that no other code in any other thread could touch counter while this code was running.
That would result in a huge amount of valuable optimizations being impossible on the 99% of code that doesn't share objects across threads just because the compiler can't prove that accesses might overlap.
It makes much more sense to require code that does have overlapping object access to clearly indicate that they do. And every threading standard provides good ways to do this, including pthreads.
You can use any method to prevent this problem that you like. Using a mutex is the simplest and definitely the one you should learn first.
My goal is to create a program that evaluates the performance gains from increasing the number of threads the program can use. I evaluate the performance by using the Monte Carlo method to calculate pi. Each thread creates one random coordinate (x, y) and checks whether that coordinate is within the circle. If it is, the inCircle counter is incremented. Pi is then calculated as 4 * inCircle / tries. Using pthread_join, there are no performance gains in a problem that should benefit from multiple threads. Is there some way to allow multiple threads to increase a counter without having to wait for each individual thread?
#include <stdio.h>
#include <string.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <stdbool.h>
#include <pthread.h>
#define nPoints 10000000
#define NUM_THREADS 16
int inCircle = 0;
int count = 0;
double x,y;
pthread_mutex_t mutex;
bool isInCircle(double x, double y){
if(x*x+y*y<=1){
return true;
}
else{
return false;
}
}
void *piSlave(){
int myCount = 0;
time_t now;
time(&now);
srand((unsigned int)now);
for(int i = 1; i <= nPoints/NUM_THREADS; i++) {
x = (double)rand() / (double)RAND_MAX;
y = (double)rand() / (double)RAND_MAX;
if(isInCircle(x,y)){
myCount++;
}
}
pthread_mutex_lock(&mutex);
inCircle += myCount;
pthread_mutex_unlock(&mutex);
pthread_exit(0);
}
double piMaster()
{
pthread_t threads[NUM_THREADS];
int rc;
long t;
for(t=0; t<NUM_THREADS; t++){
printf("Creating thread %ld\n", t);
rc = pthread_create(&threads[t], NULL, piSlave, (void *)t);
if (rc){
printf("ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
//pthread_join(threads[t], NULL);
}
//wait(NULL);
return 4.0*inCircle/nPoints;
}
int main()
{
printf("%f\n",piMaster());
return(0);
}
There are a few issues with the code.
Wait for Thread Termination
The piMaster() function should wait for the threads it created. We can do this by simply running pthread_join() in a loop:
for (t = 0; t < NUM_THREADS; t++)
pthread_join(threads[t], NULL);
Avoid Locks
We can simply increase the inCircle counter atomically at the end of the loop, so no locks are needed. The variable must be declared with the _Atomic keyword, as described in the Atomic operations C reference:
_Atomic long inCircle = 0;
void *piSlave(void *arg)
{
[...]
inCircle += myCount;
[...]
}
This will generate the correct CPU instructions to increase the variable atomically. For example, on the x86 architecture a lock prefix appears, as we can confirm in the disassembly:
29 inCircle += myCount;
0x0000000100000bdb <+155>: lock add %rbx,0x46d(%rip) # 0x100001050 <inCircle>
Avoid Slow and Thread-Unsafe rand()
Instead, we can simply scan the whole circle in a loop, as described on the Approximations of Pi Wikipedia page:
for (long x = -RADIUS; x <= RADIUS; x++)
for (long y = -RADIUS; y <= RADIUS; y++)
myCount += isInCircle(x, y);
So here is the code after the changes above:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define RADIUS 10000L
#define NUM_THREADS 10
_Atomic long inCircle = 0;
inline long isInCircle(long x, long y)
{
return x * x + y * y <= RADIUS * RADIUS ? 1 : 0;
}
void *piSlave(void *arg)
{
long myCount = 0;
long tid = (long)arg;
for (long x = -RADIUS + tid; x <= RADIUS + tid; x += NUM_THREADS)
for (long y = -RADIUS; y <= RADIUS; y++)
myCount += isInCircle(x, y);
printf("\tthread %ld count: %ld\n", tid, myCount);
inCircle += myCount;
pthread_exit(0);
}
double piMaster()
{
pthread_t threads[NUM_THREADS];
long t;
for (t = 0; t < NUM_THREADS; t++) {
printf("Creating thread %ld...\n", t);
if (pthread_create(&threads[t], NULL, piSlave, (void *)t)) {
perror("Error creating pthread");
exit(-1);
}
}
for (t = 0; t < NUM_THREADS; t++)
pthread_join(threads[t], NULL);
return (double)inCircle / (RADIUS * RADIUS);
}
int main()
{
printf("Result: %f\n", piMaster());
return (0);
}
And here is the output:
Creating thread 0...
Creating thread 1...
Creating thread 2...
Creating thread 3...
Creating thread 4...
Creating thread 5...
Creating thread 6...
Creating thread 7...
Creating thread 8...
Creating thread 9...
thread 7 count: 31415974
thread 5 count: 31416052
thread 1 count: 31415808
thread 3 count: 31415974
thread 0 count: 31415549
thread 4 count: 31416048
thread 2 count: 31415896
thread 9 count: 31415808
thread 8 count: 31415896
thread 6 count: 31416048
Result: 3.141591
Ok, I'm pretty new into CUDA, and I'm kind of lost, really lost.
I'm trying to calculate pi using the Monte Carlo Method, and at the end I just get one add instead of 50.
I don't want to wrap the kernel call in a do-while on the host, since it's too slow. My issue is that my code doesn't loop; the kernel body executes only once.
I'd also like all the threads to access the same niter and pi, so that when one thread hits the counter limit all the others stop.
#define SEED 35791246
__shared__ int niter;
__shared__ double pi;
__global__ void calcularPi(){
double x;
double y;
int count;
double z;
count = 0;
niter = 0;
//keep looping
do{
niter = niter + 1;
//Generate random number
curandState state;
curand_init(SEED,(int)niter, 0, &state);
x = curand(&state);
y = curand(&state);
z = x*x+y*y;
if (z<=1) count++;
pi =(double)count/niter*4;
}while(niter < 50);
}
int main(void){
float tempoTotal;
//Start timer
clock_t t;
t = clock();
//call kernel
calcularPi<<<1,32>>>();
//wait while kernel finish
cudaDeviceSynchronize();
typeof(pi) piFinal;
cudaMemcpyFromSymbol(&piFinal, "pi", sizeof(piFinal),0, cudaMemcpyDeviceToHost);
typeof(niter) niterFinal;
cudaMemcpyFromSymbol(&niterFinal, "niter", sizeof(niterFinal),0, cudaMemcpyDeviceToHost);
//Ends timer
t = clock() - t;
tempoTotal = ((double)t)/CLOCKS_PER_SEC;
printf("Pi: %g \n", piFinal);
printf("Adds: %d \n", niterFinal);
printf("Total time: %f \n", tempoTotal);
}
There are a variety of issues with your code.
I suggest using proper cuda error checking and run your code with cuda-memcheck to spot any runtime errors. I've omitted proper error checking in my code below for brevity of presentation, but I've run it with cuda-memcheck to indicate no runtime errors.
Your usage of curand() is probably not correct (it returns integers over a large range). For this code to work correctly, you want a floating-point quantity between 0 and 1. The correct call for that is curand_uniform().
Since you want all threads to work on the same values, you must prevent those threads from stepping on each other. One way to do that is to use atomic updates of the variables in question.
It should not be necessary to re-run curand_init on each iteration. Once per thread should be sufficient.
We don't use cudaMemcpy..Symbol operations on __shared__ variables. For convenience, and to preserve something that resembles your original code, I've elected to convert those to __device__ variables.
Here's a modified version of your code that has most of the above issues fixed:
$ cat t978.cu
#include <curand.h>
#include <curand_kernel.h>
#include <stdio.h>
#define ITER_MAX 5000
#define SEED 35791246
__device__ int niter;
__device__ int count;
__global__ void calcularPi(){
double x;
double y;
double z;
int lcount;
curandState state;
curand_init(SEED,threadIdx.x, 0, &state);
//keep looping
do{
lcount = atomicAdd(&niter, 1);
//Generate random number
x = curand_uniform(&state);
y = curand_uniform(&state);
z = x*x+y*y;
if (z<=1) atomicAdd(&count, 1);
}while(lcount < ITER_MAX);
}
int main(void){
float tempoTotal;
//Start timer
clock_t t;
t = clock();
int count_final = 0;
int niter_final = 0;
cudaMemcpyToSymbol(niter, &niter_final, sizeof(int));
cudaMemcpyToSymbol(count, &count_final, sizeof(int));
//call kernel
calcularPi<<<1,32>>>();
//wait while kernel finish
cudaDeviceSynchronize();
cudaMemcpyFromSymbol(&count_final, count, sizeof(int));
cudaMemcpyFromSymbol(&niter_final, niter, sizeof(int));
//Ends timer
double pi = count_final/(double)niter_final*4;
t = clock() - t;
tempoTotal = ((double)t)/CLOCKS_PER_SEC;
printf("Pi: %g \n", pi);
printf("Adds: %d \n", niter_final);
printf("Total time: %f \n", tempoTotal);
}
$ nvcc -o t978 t978.cu -lcurand
$ cuda-memcheck ./t978
========= CUDA-MEMCHECK
Pi: 3.12083
Adds: 5032
Total time: 0.558463
========= ERROR SUMMARY: 0 errors
$
I've modified the iterations to a larger number, but you can use 50 if you want for ITER_MAX.
Note that there are many criticisms that could be levelled against this code. My aim here, since it's clearly a learning exercise, is to point out what the minimum number of changes could be to get a functional code, using the algorithm you've outlined. As just one example, you might want to change your kernel launch config (<<<1,32>>>) to other, larger numbers, in order to more fully utilize the GPU.
EDIT TO QUESTION: Is it possible to have thread-safe access to a bit array? My implementation below seems to require mutex locks, which defeats the purpose of parallelizing.
I've been tasked with creating a parallel implementation of a twin prime generator using pthreads. I decided to use the Sieve of Eratosthenes and to divide the work of marking the multiples of known primes, staggering which multiples each thread gets.
For example, if there are 4 threads:
thread one marks multiples 3, 11, 19, 27...
thread two marks multiples 5, 13, 21, 29...
thread three marks multiples 7, 15, 23, 31...
thread four marks multiples 9, 17, 25, 33...
I skipped the even multiples as well as the even base numbers. I've used a bit array, so I can run it up to INT_MAX. The problem I have is that at a max value of 10 million, the result is off by about 5 numbers compared to a known-good file. The error shrinks as the max value goes down, to about 1 number at a max value of 10000. Anything below that is error-free.
At first I didn't think there was a need for communication between threads. When I saw the results, I added a pthread barrier to let all the threads catch up after each set of multiples. This didn't make any change. Adding a mutex lock around the mark() function did the trick, but that slows everything down.
Here is my code. Hoping someone might see something obvious.
#include <pthread.h>
#include <stdio.h>
#include <sys/times.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>
#include <string.h>
#include <limits.h>
#include <getopt.h>
#define WORDSIZE 32
struct t_data{
int *ba;
unsigned int val;
int num_threads;
int thread_id;
};
pthread_mutex_t mutex_mark;
/* original, unsynchronized version (racy):
void mark( int *ba, unsigned int k )
{
ba[k/32] |= 1 << (k%32);
}
*/
void mark( int *ba, unsigned int k )
{
pthread_mutex_lock(&mutex_mark);
ba[k/32] |= 1 << (k%32);
pthread_mutex_unlock(&mutex_mark);
}
int isMarked( int *ba, unsigned int k ) /* was missing from the listing */
{
return (ba[k/32] >> (k%32)) & 1;
}
void initBa(int **ba, unsigned int val)
{
*ba = calloc((val/WORDSIZE)+1, sizeof(int));
}
void getPrimes(int *ba, unsigned int val)
{
int i, p;
p = -1;
for(i = 3; i<=val; i+=2){
if(!isMarked(ba, i)){
if(++p == 8){
printf(" \n");
p = 0;
}
printf("%9d", i);
}
}
printf("\n");
}
void markTwins(int *ba, unsigned int val)
{
int i;
for(i=3; i<=val; i+=2){
if(!isMarked(ba, i)){
if(isMarked(ba, i+2)){
mark(ba, i);
}
}
}
}
void *setPrimes(void *arg)
{
int *ba, thread_id, num_threads, status;
unsigned int val, i, p, start;
struct t_data *data = (struct t_data*)arg;
ba = data->ba;
thread_id = data->thread_id;
num_threads = data->num_threads;
val = data->val;
start = (2*(thread_id+2))-1; // stagger threads
i=3;
for(i=3; i<=sqrt(val); i+=2){
if(!isMarked(ba, i)){
p=start;
while(i*p <= val){
mark(ba, (i*p));
p += (2*num_threads);
}
}
}
return 0;
}
void usage(char *filename)
{
printf("Usage: \t%s [option] [arg]\n", filename);
printf("\t-q generate #'s internally only\n");
printf("\t-m [size] maximum size twin prime to calculate\n");
printf("\t-c [threads] number of threads\n");
printf("Defaults:\n\toutput results\n\tsize = INT_MAX\n\tthreads = 1\n");
}
int main(int argc, char **argv)
{
int *ba, i, num_threads, opt, output;
unsigned int val;
output = 1;
num_threads = 1;
val = INT_MAX;
while ((opt = getopt(argc, argv, "qm:c:")) != -1){
switch (opt){
case 'q': output = 0;
break;
case 'm': val = atoi(optarg);
break;
case 'c': num_threads = atoi(optarg);
break;
default:
usage(argv[0]);
exit(EXIT_FAILURE);
}
}
struct t_data data[num_threads];
pthread_t thread[num_threads];
pthread_attr_t attr;
pthread_mutex_init(&mutex_mark, NULL);
initBa(&ba, val);
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for(i=0; i < num_threads; i++){
data[i].ba = ba;
data[i].thread_id = i;
data[i].num_threads = num_threads;
data[i].val = val;
if(0 != pthread_create(&thread[i],
&attr,
setPrimes,
(void*)&data[i])){
perror("Cannot create thread");
exit(EXIT_FAILURE);
}
}
for(i = 0; i < num_threads; i++){
pthread_join(thread[i], NULL);
}
markTwins(ba, val);
if(output)
getPrimes(ba, val);
free(ba);
return 0;
}
EDIT: I got rid of the barrier and added a mutex lock to the mark function. The output is accurate now, but with more than one thread it runs slower. Any suggestions on speeding it up?
Your current implementation of mark is correct, but the locking is extremely coarse-grained: there is only one lock for your entire array. This means that your threads are constantly contending for that lock.
One way of improving performance is to make the lock finer-grained: each 'mark' operation only requires exclusive access to a single integer within the array, so you could have a mutex for each array entry:
struct bitarray
{
int *bits;
pthread_mutex_t *locks;
};
struct t_data
{
struct bitarray ba;
unsigned int val;
int num_threads;
int thread_id;
};
void initBa(struct bitarray *ba, unsigned int val)
{
const size_t array_size = val / WORDSIZE + 1;
size_t i;
ba->bits = calloc(array_size, sizeof ba->bits[0]);
ba->locks = calloc(array_size, sizeof ba->locks[0]);
for (i = 0; i < array_size; i++)
{
pthread_mutex_init(&ba->locks[i], NULL);
}
}
void mark(struct bitarray ba, unsigned int k)
{
const unsigned int entry = k / 32;
pthread_mutex_lock(&ba.locks[entry]);
ba.bits[entry] |= 1 << (k%32);
pthread_mutex_unlock(&ba.locks[entry]);
}
Note that your algorithm has a race-condition: consider the example where num_threads = 4, so Thread 0 starts at 3, Thread 1 starts at 5 and Thread 2 starts at 7. It is possible for Thread 2 to execute fully, marking every multiple of 7 and then start again at 15, before Thread 0 or Thread 1 get a chance to mark 15 as a multiple of 3 or 5. Thread 2 will then do useless work, marking every multiple of 15.
Another alternative, if your compiler supports Intel-style atomic builtins, is to use those instead of a lock:
void mark(int *ba, unsigned int k)
{
__sync_or_and_fetch(&ba[k/32], 1U << k % 32);
}
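On a C11 compiler the same lock-free update can also be written portably with <stdatomic.h> rather than the GCC-specific builtin; switching the array element type to atomic_uint is my assumption here, not something the original code does:

```c
#include <stdatomic.h>

/* atomic_fetch_or makes the read-modify-write of the word indivisible,
   so concurrent mark() calls on the same word cannot lose bits. */
void mark(atomic_uint *ba, unsigned int k)
{
    atomic_fetch_or(&ba[k / 32], 1u << (k % 32));
}

int isMarked(atomic_uint *ba, unsigned int k)
{
    return (atomic_load(&ba[k / 32]) >> (k % 32)) & 1u;
}
```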
Your mark() function is not thread-safe: if two threads try to set bits within the same int location, one might overwrite a bit that was just set by the other thread with 0.
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
int main(int argc, char **argv)
{
unsigned long long in = 1;
unsigned long long total = 2;
double tol, change, new, secs, old = 0.0;
struct timeval start, end;
int threads; /* ignored */
if (argc < 3) {
exit(-1);
}
threads = atoi(argv[1]);
tol = atof(argv[2]);
if ((threads < 1) || (tol < 0.0)) {
exit(-1);
}
tol = tol * tol;
srand48(clock());
gettimeofday(&start, NULL);
do
{
double x, y;
x = drand48();
y = drand48();
total++;
if ((x*x + y*y) <= 1.00)
in++;
new = 4.0 * (double)in / (double)total;
change = fabs(new - old);
old = new;
} while (change > tol);
gettimeofday(&end, NULL);
secs = ((double)end.tv_sec - (double)start.tv_sec)
+ ((double)end.tv_usec - (double)start.tv_usec) / 1000000.0;
printf("Found estimate of pi of %.12f in %llu iterations, %.6f seconds.\n",
new, total - 2, secs);
}
The code above is a sequential program that takes an argument for the tolerance of the pi estimate; once the change between the old and new values falls below the tolerance, it exits.
I have to parallelize this program in pthreads. I am not trying to have someone do it for me, but rather to get some pointers and ideas to think about so that I can do it myself. The pthreads program will take the number of threads and the tolerance as its arguments and output the estimate. I am very new to parallel programming and don't really know where to start, so I will take any advice. Thanks.
Each thread should keep its own in and total counts and use a better exit condition. Then add up the in and total values from each thread.
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>
void* Function(void* i);
#define MAX_THREADS 200
unsigned long long total[MAX_THREADS] = {0}; //total points for threads
unsigned long long in[MAX_THREADS] = {0}; //points inside for threads
double tolerance, change, new, old = 0.0;
long thread_num;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
int main(int argc, char **argv)
{
long i;
struct timeval start, end;
double secs;
unsigned long long master_total;
pthread_t* threads;
if (argc != 3){
printf("\nMust pass 2 arguments: (Tolerance) (# of Threads)");
exit(-1);
}
thread_num = atoi ( argv[1]);
tolerance = atof ( argv[2]);
if (( thread_num < 1) || ( tolerance < 0.0) || (thread_num > 200)) {
printf("\nIncorrect tolerance or threads.");
exit (-1);
}
threads = malloc(thread_num*sizeof(pthread_t)); //allocating space for threads
tolerance = tolerance * tolerance;
change = 0.5;
srand48(clock());
gettimeofday (&start, NULL);
for( i = 0; i < thread_num; i++ ){
pthread_create(&threads[i], NULL, Function, (void*)i);
}
for( i = 0; i < thread_num; i++ ){
pthread_join(threads[i], NULL);
}
gettimeofday (&end, NULL);
master_total = 0;
for( i = 0; i < thread_num; i++ ){
master_total = master_total + total[i];
}
secs = (( double)end.tv_sec - (double)start.tv_sec )
+ (( double)end.tv_usec - (double)start.tv_usec )/1000000.0;
printf ( "Estimation of pi is %.12f in %llu iterations , %.6f seconds.\n", new, master_total, secs );
}
//Each thread will calculate it's own points for in and total
//Per 1000 points it will calculate new and old values and compare to tolerance
//If desired accuracy is met the threads will return.
void* Function(void* i){
/*
rc - return code
total[i], in[i] - threads own number of calculated points
my_total, my_in - Each thread calculates global in and total, per 1000 points and calculates tolerance
*/
long my_spot = (long) i;
long rc;
long j;
unsigned long long my_total;
unsigned long long my_in;
do
{
double x, y;
x = drand48();
y = drand48();
total[my_spot]++;
if (( x*x + y*y) <= 1.00){
in[my_spot]++;
}
if(total[my_spot] % 1000 == 0){
while ( j < thread_num){
my_total = my_total + total[j];
my_in = my_in + in[j];
}
my_total = my_total;
//critical section
//old, new, and change are all global
rc = pthread_mutex_lock(&mutex);
new = 4.0 * (double)my_in/( double)my_total;
change = fabs (new - old);
old = new;
rc = pthread_mutex_unlock(&mutex);
}
}while (change > tolerance );
return NULL;
}
This is what I whipped up, but I am getting errors: it just stops. I have the threads break out of the loop and return, then the main thread joins them. Any advice on what I am doing wrong here?
When I run it, it seems that all the threads get stuck when they reach the mutex lock. I have each thread check the change in pi every 1000 points.