test "pthread_create" function behaviour - c

I am new in multi-threaded programming and I have a question about "pthread_create" behaviour
this is the code :
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NTHREADS 10
#define ARRAYSIZE 1000000
#define ITERATIONS ARRAYSIZE / NTHREADS
double sum=0.0, a[ARRAYSIZE];
pthread_mutex_t sum_mutex;
void *do_work(void *tid)
{
int i, k=0,start, *mytid, end;
double mysum=0.0;
mytid = (int *) tid;
start = (*mytid * ITERATIONS);
end = start + ITERATIONS;
printf ("Thread %d doing iterations %d to %d\n",*mytid,start,end-1);
for (i=start; i < end ; i++) {
a[i] = i * 1.0;
mysum = mysum + a[i];
}
sum = sum + mysum;
}
int main(int argc, char *argv[])
{
int i, start, tids[NTHREADS];
pthread_t threads[NTHREADS];
pthread_attr_t attr;
for (i=0; i<NTHREADS; i++) {
tids[i] = i;
pthread_create(&threads[i], NULL/*&attr*/, do_work, (void *) &tids[i]);
}
/* Wait for all threads to complete then print global sum */
/*
for (i=0; i<NTHREADS; i++) {
pthread_join(threads[i], NULL);
}*/
printf ("Done. Sum= %e \n", sum);
sum=0.0;
for (i=0;i<ARRAYSIZE;i++){
a[i] = i*1.0;
sum = sum + a[i]; }
printf("Check Sum= %e\n",sum);
}
the result of execution is :
Thread 1 doing iterations 100000 to 199999
Done. Sum= 0.000000e+00
Thread 0 doing iterations 0 to 99999
Thread 2 doing iterations 200000 to 299999
Thread 3 doing iterations 300000 to 399999
Thread 8 doing iterations 800000 to 899999
Thread 4 doing iterations 400000 to 499999
Thread 5 doing iterations 500000 to 599999
Thread 9 doing iterations 900000 to 999999
Thread 7 doing iterations 700000 to 799999
Thread 6 doing iterations 600000 to 699999
Check Sum= 8.299952e+11
all thread are created and the execution is not sequential (remove pthread_join), but the function do_work is executed in order and depend on thread. it means iterations 0 to 99999 are done by thread 0 and iterations 100000 to 199999 are done by thread 1 etc ...
the question is here why for example iterations 0 to 99999 is not done by thread 2 ?

This is because iteration range is calculated based on thread number from 0 to N in the following line:
start = (*mytid * ITERATIONS);
And you create and pass that number in a loop as such:
for (i=0; i<NTHREADS; i++) {
tids[i] = i;
...
In other words, 2 + N will never be 0 to perform iteration over 0 to 99999 when N is non-negative.

I think you are confused as to what threads are.
Think about each thread as its own program. If you run 10 programs at the same time, they will be running "simultaneously", i.e. instructions of these 10 programs will be interleaved, but within each program all instructions are executed in the deterministic order.
Same thing with threads. You define which numbers each thread will iterate over by passing the thread id argument when creating the thread.

You are sending the address of tids[i]th variable to thread i and printing based on it. In this case Nth thread will always print from N000000 till N999999 without any change in all the trial runs.
mytid = (int *) tid;
start = (*mytid * ITERATIONS);
end = start + ITERATIONS;
For thread 2, it would behave as,
mytid = (int *) tid; // *mytid = 2
start = ( 2 * ITERATIONS); // 2000000
end = ( 2 * ITERATIONS) + ITERATIONS; // 2999999
Thus printing 2000000 to 2999999. So you can't expect thread 2 to print 0 to 99999.
It is not wise to use global variables like sum, which are shared between threads, without any kind of locking mechanisms. They would bring unexpected results. Use of pthread_mutex here would solve this problem.

Related

why is this c code causing a race condition?

I'm trying to count the number of prime numbers up to 10 million and I have to do it using multiple threads using Posix threads(so, that each thread computes a subset of 10 million). However, my code is not checking for the condition IsPrime. I'm thinking this is due to a race condition. If it is what can I do to ameliorate this issue?
I've tried using a global integer array with k elements but since k is not defined it won't let me declare that at the file scope.
I'm running my code using gcc -pthread:
/*
Program that spawns off "k" threads
k is read in at command line each thread will compute
a subset of the problem domain(check if the number is prime)
to compile: gcc -pthread lab5_part2.c -o lab5_part2
*/
#include <math.h>
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <stdlib.h>
typedef int bool;
#define FALSE 0
#define TRUE 1
#define N 10000000 // 10 Million
int k; // global variable k willl hold the number of threads
int primeCount = 0; //it will hold the number of primes.
//returns whether num is prime
bool isPrime(long num) {
long limit = sqrt(num);
for(long i=2; i<=limit; i++) {
if(num % i == 0) {
return FALSE;
}
}
return TRUE;
}
//function to use with threads
void* getPrime(void* input){
//get the thread id
long id = (long) input;
printf("The thread id is: %ld \n", id);
//how many iterations each thread will have to do
int numOfIterations = N/k;
//check the last thread. to make sure is a whole number.
if(id == k-1){
numOfIterations = N - (numOfIterations * id);
}
long startingPoint = (id * numOfIterations);
long endPoint = (id + 1) * numOfIterations;
for(long i = startingPoint; i < endPoint; i +=2){
if(isPrime(i)){
primeCount ++;
}
}
//terminate calling thread.
pthread_exit(NULL);
}
int main(int argc, char** args) {
//get the num of threads from command line
k = atoi(args[1]);
//make sure is working
printf("Number of threads is: %d\n",k );
struct timespec start,end;
//start clock
clock_gettime(CLOCK_REALTIME,&start);
//create an array of threads to run
pthread_t* threads = malloc(k * sizeof(pthread_t));
for(int i = 0; i < k; i++){
pthread_create(&threads[i],NULL,getPrime,(void*)(long)i);
}
//wait for each thread to finish
int retval;
for(int i=0; i < k; i++){
int * result = NULL;
retval = pthread_join(threads[i],(void**)(&result));
}
//get the time time_spent
clock_gettime(CLOCK_REALTIME,&end);
double time_spent = (end.tv_sec - start.tv_sec) +
(end.tv_nsec - start.tv_nsec)/1000000000.0f;
printf("Time tasken: %f seconds\n", time_spent);
printf("%d primes found.\n", primeCount);
}
the current output I am getting: (using the 2 threads)
Number of threads is: 2
Time tasken: 0.038641 seconds
2 primes found.
The counter primeCount is modified by multiple threads, and therefore must be atomic. To fix this using the standard library (which is now supported by POSIX as well), you should #include <stdatomic.h>, declare primeCount as an atomic_int, and increment it with an atomic_fetch_add() or atomic_fetch_add_explicit().
Better yet, if you don’t care about the result until the end, each thread can store its own count in a separate variable, and the main thread can add all the counts together once the threads finish. You will need to create, in the main thread, an atomic counter per thread (so that updates don’t clobber other data in the same cache line), pass each thread a pointer to its output parameter, and then return the partial tally to the main thread through that pointer.
This looks like an exercise that you want to solve yourself, so I won’t write the code for you, but the approach to use would be to declare an array of counters like the array of thread IDs, and pass &counters[i] as the arg parameter of pthread_create() similarly to how you pass &threads[i]. Each thread would need its own counter. At the end of the thread procedure, you would write something like, atomic_store_explicit( (atomic_int*)arg, localTally, memory_order_relaxed );. This should be completely wait-free on all modern architectures.
You might also decide that it’s not worth going to that trouble to avoid a single atomic update per thread, declare primeCount as an atomic_int, and then atomic_fetch_add_explicit( &primeCount, localTally, memory_order_relaxed ); once before the thread procedure terminates.

How to create thread blocks?

I have two questions.
First:
I need to create thread blocks gradually not more then some max value, for example 20.
For example, first 20 thread go, job is finished, only then 20 second thread go, and so on in a loop.
Total number of jobs could be much larger then total number of threads (in our example 20), but total number of threads should not be bigger then our max value (in our example 20).
Second:
Could threads be added continuously? For example, 20 threads go, one thread job is finished, we see that total number of threads is 19 but our max value is 20, so we can create one more thread, and one more thread go :)
So we don't waste a time waiting another threads job to be done and our total threads number is not bigger then our some max value (20 in our example) - sounds cool.
Conclusion:
For total speed I consider the second variant would be much faster and better, and I would be very graceful if you help me with this, but also tell how to do the first variant.
Here is me code (it's not working properly and the result is strange - some_array elements become wrong after eleven step in a loop, something like this: Thread counter = 32748):
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <math.h>
#define num_threads 5 /* total max number of threads */
#define lines 17 /* total jobs to be done */
/* args for thread start function */
typedef struct {
int *words;
} args_struct;
/* thread start function */
void *thread_create(void *args) {
args_struct *actual_args = args;
printf("Thread counter = %d\n", *actual_args->words);
free(actual_args);
}
/* main function */
int main(int argc, char argv[]) {
float block;
int i = 0;
int j = 0;
int g;
int result_code;
int *ptr[num_threads];
int some_array[lines] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17};
pthread_t threads[num_threads];
/* counting how many block we need */
block = ceilf(lines / (double)num_threads);
printf("blocks= %f\n", block);
/* doing ech thread block continuously */
for (g = 1; g <= block; g++) {
//for (i; i < num_threads; ++i) { i < (num_threads * g),
printf("g = %d\n", g);
for (i; i < lines; ++i) {
printf("i= %d\n", i);
/* locate memory to args */
args_struct *args = malloc(sizeof *args);
args->words = &some_array[i];
if(pthread_create(&threads[i], NULL, thread_create, args)) {
free(args);
/* goto error_handler */
}
}
/* wait for each thread to complete */
for (j; j < lines; ++j) {
printf("j= %d\n", j);
result_code = pthread_join(threads[j], (void**)&(ptr[j]));
assert(0 == result_code);
}
}
return 0;
}

OpenMP giving inconsistent order of data

I wrote a code in C using OpenMP, but the first four results when compiling are always in a different order and the run time only shows a value other than 0 about half of the time. Am I missing something? Here is the code:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
void test(int *ptr0, int *ptr1, int n)
{
int sum=0, i;
int id=omp_get_thread_num();
for(i=0;i<16;i++){
sum+=prt0[i]*ptr1[i];
}
printf("Sum = %d from thread %d\n",sum,id);
}
int main()
{
int m=64, n=16, a, b;
int** array0=calloc(m, sizeof(int*));
for(a=0;a<m;a++){
array0[a]=calloc(n, sizeof(int));
}
int** array1=calloc(m, sizeof(int*));
for(b=0;b<m;b++){
array1[b]=calloc(n, sizeof(int));
}
for(i=0;i<16;i++);{
array0[i][n/2]=i;
array1[i][n/2]=i;
}
#pragma omp parallel for schedule(dynamic)
for(i=0;i<n;i++){
test(array0[i],array1[i],n);
}
double start_time, run_time;
start_time=omp_get_wtime();
run_time=omp_get_wtime()-start_time;
printf("Total run time = %.5g seconds\n", run_time);
}
The results are (with different variations of the first 4 lines):
Sum = 0 from thread 3
Sum = 1 from thread 1
Sum = 4 from thread 2
Sum = 9 from thread 3
Sum = 16 from thread 0
Sum = 25 from thread 1
Sum = 36 from thread 2
Sum = 49 from thread 3
Sum = 64 from thread 0
Sum = 81 from thread 1
Sum = 100 from thread 2
Sum = 121 from thread 3
Sum = 144 from thread 0
Sum = 169 from thread 1
Sum = 196 from thread 2
Sum = 225 from thread 3
Total run time = 3.95e-07 seconds
Any suggestions on how I can make the results consistent with sums being 0, 1, 4, 9, etc. and a run time that isn't 0 seconds?
Why do you expect a long duration by requesting current start & end time in row? Didn't you want to measure something? I guess your real target was to get the start time before the loop starts and the end time after all tests are done.
The order of your outputs you can't control because you are out of control how fast your threads work. Simply put the results in an additional array and print contents in your main thread.

Why does this code work without a mutex?

I am trying to learn how locks work in multi-threading. When I execute the following code without lock, it worked fine even though the variable sum is declared as a global variable and multiple threads are updating it. Could anyone please explain why here threads are working perfectly fine on a shared variable without locks?
Here is the code:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NTHREADS 100
#define ARRAYSIZE 1000000
#define ITERATIONS ARRAYSIZE / NTHREADS
double sum=0.0, a[ARRAYSIZE];
pthread_mutex_t sum_mutex;
void *do_work(void *tid)
{
int i, start, *mytid, end;
double mysum=0.0;
/* Initialize my part of the global array and keep local sum */
mytid = (int *) tid;
start = (*mytid * ITERATIONS);
end = start + ITERATIONS;
printf ("Thread %d doing iterations %d to %d\n",*mytid,start,end-1);
for (i=start; i < end ; i++) {
a[i] = i * 1.0;
mysum = mysum + a[i];
}
/* Lock the mutex and update the global sum, then exit */
//pthread_mutex_lock (&sum_mutex); //here I tried not to use locks
sum = sum + mysum;
//pthread_mutex_unlock (&sum_mutex);
pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
int i, start, tids[NTHREADS];
pthread_t threads[NTHREADS];
pthread_attr_t attr;
/* Pthreads setup: initialize mutex and explicitly create threads in a
joinable state (for portability). Pass each thread its loop offset */
pthread_mutex_init(&sum_mutex, NULL);
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for (i=0; i<NTHREADS; i++) {
tids[i] = i;
pthread_create(&threads[i], &attr, do_work, (void *) &tids[i]);
}
/* Wait for all threads to complete then print global sum */
for (i=0; i<NTHREADS; i++) {
pthread_join(threads[i], NULL);
}
printf ("Done. Sum= %e \n", sum);
sum=0.0;
for (i=0;i<ARRAYSIZE;i++){
a[i] = i*1.0;
sum = sum + a[i]; }
printf("Check Sum= %e\n",sum);
/* Clean up and exit */
pthread_attr_destroy(&attr);
pthread_mutex_destroy(&sum_mutex);
pthread_exit (NULL);
}
With and without lock I got the same answer!
Done. Sum= 4.999995e+11
Check Sum= 4.999995e+11
UPDATE: Change suggested by user3386109
for (i=start; i < end ; i++) {
a[i] = i * 1.0;
//pthread_mutex_lock (&sum_mutex);
sum = sum + a[i];
//pthread_mutex_lock (&sum_mutex);
}
EFFECT :
Done. Sum= 3.878172e+11
Check Sum= 4.999995e+11
Mutexes are used to prevent race conditions which are undesirable situations when you have two or more threads accessing a shared resource. Race conditions such as the one in your code happen when the shared variable sum is being accessed by multiple threads. Sometimes the access to the shared variable will be interleaved in such a way that the result is incorrect and sometimes the result will be correct.
For example lets say you have two threads, thread A and thread B both adding 1 to a shared value, sum, which starts at 5. If thread A reads sum and then thread B reads sum and then thread A writes a new value followed by thread B writing a new value you will can an incorrect result, 6 as opposed to 7. However it is also possible than thread A reads and then writes a value (specifically 6) followed by thread B reading and writing a value (specifically 7) and then you get the correct result. The point being that some interleavings of operations result in the correct value and some interleavings result in an incorrect value. The job of the mutex is to force the interleaving to always be correct.

Unix c program to calculate pi using threads

Been working on this assignment for class. Put this code together but its giving me several errors I'm not able to solve.
Code
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
//global variables
int N, T;
double vsum[T];
//pie function
void* pie_runner(void* arg)
{
double *limit_ptr = (double*) arg;
double j = *limit_ptr;
for(int i = (N/T)*j; i<=((N/T)*(j+1)-1); j++)
{
if(i %2 =0)
vsum[j] += 4/((2*j)*(2*j+1)*(2*j+2));
else
vsum[j] -= 4/((2*j)*(2*j+1)*(2*j+2));
}
pthread_exit(0);
}
int main(int argc, char **argv)
{
if(argc != 3) {
printf("Error: Must send it 2 parameters, you sent %s", argc);
exit(1);
}
N = atoi[1];
T = atoi[2];
if(N !> T) {
printf("Error: Number of terms must be greater then number of threads.");
exit(1);
}
for(int p=0; p<T; p++) //initialize array to 0
{
vsum[p] = 0;
}
double pie = 3;
//launch threads
pthread_t tids[T];
for(int i = 0; i<T; i++)
{
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_create(&tids[i], &attr, pie_runner, &i);
}
//wait for threads...
for(int k = 0; k<T; k++)
{
pthread_join(tids[k], NULL);
}
for(int x=0; x<T; x++)
{
pie += vsum[x];
}
printf("pi computed with %d terms in %s threads is %k\n", N, T, pie);
}
One of the problems I'm having is with the array up top. It needs to be a global variable but it keeps telling me it's not a constant, even when I declare it as such.
Any help is appreciated, with the rest of the code also.
**EDIT: After updating the code using the comments below, here is the new code. I have a few errors still there and would appreciate help dealing with them.
1) Warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] int j = (int)arg;
2)Warning: cast to pointer from integer of different size [Wint - to - pointer - cast] pthread_create(.......... , (void*)i);
NEW CODE:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
//global variables
int N, T;
double *vsum;
//pie function
void* pie_runner(void* arg)
{
long j = (long)arg;
//double *limit_ptr = (double*) arg;
//double j = *limit_ptr;
//for(int i = (j-1)*N/T; i < N*(j) /T; i++)
for(int i = (N/T)*(j-1); i < ((N/T)*(j)); i++)
{
if(i % 2 == 0){
vsum[j] += 4.0/((2*j)*(2*j+1)*(2*j+2));
//printf("vsum %lu = %f\n", j, vsum[j]);
}
else{
vsum[j] -= 4.0/((2*j)*(2*j+1)*(2*j+2));
//printf("vsum %lu = %f\n", j, vsum[j]);
}
}
pthread_exit(0);
}
int main(int argc, char **argv)
{
if(argc != 3) {
printf("Error: Must send it 2 parameters, you sent %d\n", argc-1);
exit(1);
}
N = atoi(argv[1]);
T = atoi(argv[2]);
vsum = malloc((T+1) * sizeof(*vsum));
if(vsum == NULL) {
fprintf(stderr, "Memory allocation problem\n");
exit(1);
}
if(N <= T) {
printf("Error: Number of terms must be greater then number of threads.\n");
exit(1);
}
for(int p=1; p<=T; p++) //initialize array to 0
{
vsum[p] = 0;
}
double pie = 3.0;
//launch threads
pthread_t tids[T+1];
for(long i = 1; i<=T; i++)
{
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_create(&tids[i], &attr, pie_runner, (void*)i);
}
//wait for threads...
for(int k = 1; k<=T; k++)
{
pthread_join(tids[k], NULL);
}
for(int x=1; x<=T; x++)
{
pie += vsum[x];
}
printf("pi computed with %d terms in %d threads is %.20f\n", N, T, pie);
//printf("pi computed with %d terms in %d threads is %20f\n", N, T, pie);
free(vsum);
}
Values not working:
./pie1 2 1
pi computed with 2 terms in 1 threads is 3.00000000000000000000
./pie1 3 1
pi computed with 3 terms in 1 threads is 3.16666666666666651864
./pie1 3 2
pi computed with 3 terms in 2 threads is 3.13333333333333330373
./pie1 4 2
pi computed with 4 terms in 2 threads is 3.00000000000000000000
./pie1 4 1
pi computed with 4 terms in 1 threads is 3.00000000000000000000
./pie1 4 3
pi computed with 4 terms in 3 threads is 3.14523809523809516620
./pie1 10 1
pi computed with 10 terms in 1 threads is 3.00000000000000000000
./pie1 10 2
pi computed with 10 terms in 2 threads is 3.13333333333333330373
./pie1 10 3
pi computed with 10 terms in 3 threads is 3.14523809523809516620
./pie1 10 4
pi computed with 10 terms in 4 threads is 3.00000000000000000000
./pie1 10 5
pi computed with 10 terms in 5 threads is 3.00000000000000000000
./pie1 10 6
pi computed with 10 terms in 6 threads is 3.14088134088134074418
./pie1 10 7
pi computed with 10 terms in 7 threads is 3.14207181707181693042
./pie1 10 8
pi computed with 10 terms in 8 threads is 3.14125482360776464574
./pie1 10 9
pi computed with 10 terms in 9 threads is 3.14183961892940200045
./pie1 11 2
pi computed with 11 terms in 2 threads is 3.13333333333333330373
./pie1 11 4
pi computed with 11 terms in 4 threads is 3.00000000000000000000
There are numerous problems with that code. Your specific problem is that, in C, variable length arrays (VLAs) are not permitted at file scope.
So, if you want that array to be dynamic, you will have to declare the pointer to it and allocate it yourself:
int N, T;
double *vsum;
and then, in main() after T has been set:
vsum = malloc (T * sizeof(*vsum));
if (vsum == NULL) {
fprintf (stderr, "Memory allocation problem\n");
exit (1);
}
remembering to free it before exiting (not technically required but good form anyway):
free (vsum);
Among the other problems:
1/ There is no !> operator in C, I suspect the line should be:
if (N > T) {
rather than:
if (N !> T) {
2/ To get the arguments from the command line, change:
N = atoi[1];
T = atoi[2];
into:
N = atoi(argv[1]);
T = atoi(argv[2]);
3/ The comparison operator is ==, not =, so you need to change:
if(i %2 =0)
into:
if (i % 2 == 0)
4/ Your error message about not having enough parameters needs to use %d rather than %s, as argc is an integral type:
printf ("Error: Must send it 2 parameters, you sent %d\n", argc-1);
Ditto for your calculation message at the end (and fixing the %k for the floating point value):
printf ("pi computed with %d terms in %d threads is %.20f\n", N, T, pie);
5/ You pass an integer pointer into your thread function but there are two problems with that.
The first is that you then extract it into a double j, which cannot be used as an array index. If it's an integer being passed in, it should be turned back into an integer.
The second is that there is no guarantee the new thread will extract the value (or even start running its code at all) before the main thread changes that value to start up another thread. You should probably just convert the integer to a void * directly rather than messing about with integer pointers.
To fix both those, use this when creating the thread:
pthread_create (&tids[i], &attr, pie_runner, (void*)i);
and this at the start of the thread function:
int j = (int) arg;
If you get warnings or experience problems with that, it's probably because your integers and pointers are not compatible sizes. In that case, you could try something like:
pthread_create (&tids[i], &attr, pie_runner, (void*)(intptr_t)i);
though I'm not sure that will work any better.
Alternatively (though it's a bit of a kludge), stick with your pointer solution and just make sure there's no possibility of race conditions (by passing a unique pointer per thread).
First, revert the thread function to receiving its value by a pointer:
int j = *((int*) arg);
Then, before you start creating threads, you need to create a thread integer array and, for each thread created, populate and pass the (address of the) correct index of that array:
int tvals[T]; // add this line.
for (int i = 0; i < T; i++) {
tvals[i] = i; // and this one.
pthread_attr_t attr;
pthread_attr_init (&attr);
pthread_create (&tids[i], &attr, pie_runner, &(tvals[i]));
}
That shouldn't be too onerous unless you have so many threads the estra array will be problematic. But, if you have that many threads, you're going to have far greater problems.
6/ Your loop in the thread incorrectly incremented j rather than i. Since this is the same area touched by the following section, I'll correct it there.
7/ The use of integers in what is predominantly a floating point calculation means that you have to arrange your calculations so that they don't truncate divisions, such as 10 / 4 -> 2 where it should be 2.5.
That means the loop in the thread function should be changed as follows (including incrementing i as in previous point):
for (int i = j*N/T; i <= N * (j+1) / T - 1; i++)
if(i % 2 == 0)
vsum[j] += 4.0/((2*j)*(2*j+1)*(2*j+2));
else
vsum[j] -= 4.0/((2*j)*(2*j+1)*(2*j+2));
With all those changes, you get a reasonably sensible result:
$ ./picalc 100 101
pi computed with 100 terms in 101 threads is 3.14159241097198238535
Two problems with that array: The first is that T is not a compile-time constant, which it needs to be if you're programming in C++. The second is that T is initialized to zero, meaning the array will have a size of zero and all indexing of the array will be out of bounds.
You need to allocate the array dynamically once you have read T and know the size. In C you use malloc for that, in C++ you should use std::vector instead.

Resources