OpenMP giving inconsistent order of data - c

I wrote a program in C using OpenMP, but the first four results come out in a different order on every run, and the run time only shows a value other than 0 about half of the time. Am I missing something? Here is the code:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
void test(int *ptr0, int *ptr1, int n)
{
int sum=0, i;
int id=omp_get_thread_num();
for(i=0;i<16;i++){
sum+=ptr0[i]*ptr1[i];
}
printf("Sum = %d from thread %d\n",sum,id);
}
int main()
{
int m=64, n=16, a, b, i;
int** array0=calloc(m, sizeof(int*));
for(a=0;a<m;a++){
array0[a]=calloc(n, sizeof(int));
}
int** array1=calloc(m, sizeof(int*));
for(b=0;b<m;b++){
array1[b]=calloc(n, sizeof(int));
}
for(i=0;i<16;i++){
array0[i][n/2]=i;
array1[i][n/2]=i;
}
#pragma omp parallel for schedule(dynamic)
for(i=0;i<n;i++){
test(array0[i],array1[i],n);
}
double start_time, run_time;
start_time=omp_get_wtime();
run_time=omp_get_wtime()-start_time;
printf("Total run time = %.5g seconds\n", run_time);
}
The results are (with different variations of the first 4 lines):
Sum = 0 from thread 3
Sum = 1 from thread 1
Sum = 4 from thread 2
Sum = 9 from thread 3
Sum = 16 from thread 0
Sum = 25 from thread 1
Sum = 36 from thread 2
Sum = 49 from thread 3
Sum = 64 from thread 0
Sum = 81 from thread 1
Sum = 100 from thread 2
Sum = 121 from thread 3
Sum = 144 from thread 0
Sum = 169 from thread 1
Sum = 196 from thread 2
Sum = 225 from thread 3
Total run time = 3.95e-07 seconds
Any suggestions on how I can make the results consistent with sums being 0, 1, 4, 9, etc. and a run time that isn't 0 seconds?

Why do you expect a measurable duration when you read the start and end times back to back? Nothing runs between the two omp_get_wtime() calls, so there is nothing to measure. I guess your real intent was to take the start time before the loop starts and the end time after all the tests are done.
You can't control the order of the output, because you have no control over how fast each thread runs. Simply store the results in an additional array and print its contents from the main thread.

Related

C, pthreads initialized in loop does not execute assigned function properly despite mutex

I am having trouble debugging my C program where the goal is to create 5 threads and have each of them working on size-2 chunks of an array of length 10. The goal is to get the sum of that array. My actual program is a little less trivial than this as it takes dynamic array sizes and thread counts but I tried simplifying it to this simple problem and it still does not work.
i.e.,
array = {1 2 3 4 5 6 7 8 9 10}
then thread1 works on array[0] and array [1]
and thread2 works on array[2] and array[3]
etc...
thread5 works on array[8] and array[9]
However, when I run my code, I get weird results, even when using a mutex lock.
For example, this is one of my results when running this program.
Thread #1 adding 3 to 0 New sum: 3
Thread #1 adding 4 to 3 New sum: 7
Thread #2 adding 5 to 7 New sum: 12
Thread #2 adding 6 to 12 New sum: 18
Thread #3 adding 7 to 18 New sum: 25
Thread #3 adding 8 to 25 New sum: 33
Thread #4 adding 9 to 33 New sum: 42
Thread #4 adding 9 to 42 New sum: 51
Thread #4 adding 10 to 51 New sum: 61
Thread #4 adding 10 to 61 New sum: 71
Sum: 71
First of all, why are there no tabs before the "New sum" for the first 3 lines? (See my printf call in the calculate_sum function.) And more importantly, why is thread 0 never executing its job and why is thread 4 executing twice?
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
typedef struct {
int start, end, thread_number;
int *data;
} Func_args;
static pthread_mutex_t mutex;
static int sum = 0;
void *calculate_sum(void *args) {
int *arr = ((Func_args *)args)->data;
int i = ((Func_args *)args)->start;
int end = ((Func_args *)args)->end;
int t_id = ((Func_args *)args)->thread_number;
while (i < end) {
pthread_mutex_lock(&mutex);
printf("Thread #%d adding %d to %d\t", t_id, arr[i], sum);
sum += arr[i++];
printf("New sum: %d\n", sum);
pthread_mutex_unlock(&mutex);
}
return NULL;
}
#define NUM_THREAD 5
#define ARRAY_LEN 10
int main(int argc, char **argv) {
int array[ARRAY_LEN] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
pthread_t tid[NUM_THREAD];
int i, pos = 0;
pthread_mutex_init(&mutex, NULL);
for (i = 0; i < NUM_THREAD; i++) {
Func_args args;
args.data = array;
args.thread_number = i;
args.start = pos;
pos += 2;
args.end = pos;
pthread_create(&tid[i], NULL, calculate_sum, &args);
}
for (i = 0; i < NUM_THREAD; i++)
pthread_join(tid[i], NULL);
pthread_mutex_destroy(&mutex);
printf("Sum: %d\n", sum);
return 0;
}
You are passing each thread a pointer to an object that might be destroyed before the thread starts.
args is local, so its lifetime ends when the program leaves the scope it is declared in, that is, at the end of each iteration of the for loop body.
The thread might take a few moments to start up, so if it starts after that, it will access a destroyed object. In practice, the memory will have been reused to store the next thread's values.
You can fix it by dynamically allocating the thread data with malloc (and remembering to free it in the thread, or in main if pthread_create fails).

C Threads program

I wrote a program based on the idea of a Riemann sum to compute the value of an integral. It uses several threads, but its performance, compared to a sequential program I wrote later, is subpar. Algorithm-wise they are identical except for the threading, so the question is: what's wrong with it? I assume pthread_join is not the cause, because if one thread finishes sooner than the thread that join is currently waiting on, the later join for it will simply return immediately. Is that correct? The free call is probably wrong and there is no error check on thread creation; I'm aware of that, I deleted the checks while testing various things. Sorry for my bad English, and thanks in advance.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/types.h>
#include <time.h>
int counter = 0;
float sum = 0;
pthread_mutex_t mutx;
float function_res(float);
struct range {
float left_border;
int steps;
float step_range;
};
void *calcRespectiveRange(void *ranges) {
struct range *rangs = ranges;
float left_border = rangs->left_border;
int steps = rangs->steps;
float step_range = rangs->step_range;
free(rangs);
//printf("left: %f steps: %d step range: %f\n", left_border, steps, step_range);
int i;
float temp_sum = 0;
for(i = 0; i < steps; i++) {
temp_sum += step_range * function_res(left_border);
left_border += step_range;
}
sum += temp_sum;
pthread_exit(NULL);
}
int main() {
clock_t begin, end;
if(pthread_mutex_init(&mutx, NULL) != 0) {
printf("mutex error\n");
}
printf("enter range, amount of steps and threads: \n");
float left_border, right_border;
int steps_count;
int threads_amnt;
scanf("%f %f %d %d", &left_border, &right_border, &steps_count, &threads_amnt);
float step_range = (right_border - left_border) / steps_count;
int i;
pthread_t tid[threads_amnt];
float chunk = (right_border - left_border) / threads_amnt;
int steps_per_thread = steps_count / threads_amnt;
begin = clock();
for(i = 0; i < threads_amnt; i++) {
struct range *ranges;
ranges = malloc(sizeof(ranges));
ranges->left_border = i * chunk + left_border;
ranges->steps = steps_per_thread;
ranges->step_range = step_range;
pthread_create(&tid[i], NULL, calcRespectiveRange, (void*) ranges);
}
for(i = 0; i < threads_amnt; i++) {
pthread_join(tid[i], NULL);
}
end = clock();
pthread_mutex_destroy(&mutx);
printf("\n%f\n", sum);
double time_spent = (double) (end - begin) / CLOCKS_PER_SEC;
printf("Time spent: %lf\n", time_spent);
return(0);
}
float function_res(float lb) {
return(lb * lb + 4 * lb + 3);
}
Edit: in short - can it be improved to reduce execution time (with mutexes, for example)?
The execution time will be shortened, provided you have multiple hardware threads available.
The problem is in how you measure time: clock returns the processor time used by the program. That means it sums the time taken by all the threads. If your program uses 2 threads and its elapsed (wall-clock) execution time is 1 second, each thread has used 1 second of CPU time, and clock will return the equivalent of 2 seconds.
To get the actual time used (on Linux), use gettimeofday. I modified your code by adding
#include <sys/time.h>
and capturing the start time before the loop:
struct timeval tv_start;
gettimeofday( &tv_start, NULL );
and after:
struct timeval tv_end;
gettimeofday( &tv_end, NULL );
and calculating the difference in seconds:
printf("CPU Time: %lf\nTime passed: %lf\n",
time_spent,
((tv_end.tv_sec * 1000*1000.0 + tv_end.tv_usec) -
(tv_start.tv_sec * 1000*1000.0 + tv_start.tv_usec)) / 1000/1000
);
(I also fixed the malloc from malloc(sizeof(ranges)) which allocates the size of a pointer (4 or 8 bytes for 32/64 bit CPU) to malloc(sizeof(struct range)) (12 bytes)).
When running with the input parameters 0 1000000000 1000000000 1, that is, 1 billion iterations in 1 thread, the output on my machine is:
CPU Time: 4.352000
Time passed: 4.400006
When running with 0 1000000000 1000000000 2, that is, 1 billion iterations spread over 2 threads (500 million iterations each), the output is:
CPU Time: 4.976000
Time passed: 2.500003
For completeness sake, I tested it with the input 0 1000000000 1000000000 4:
CPU Time: 8.236000
Time passed: 2.180114
It is a little faster, but not twice as fast as with 2 threads, and it uses double the CPU time. This is because my CPU is a Core i3, a dual-core with hyper-threading: the two logical threads on each core share that core's execution resources, so they do not behave like additional physical cores.

Unix c program to calculate pi using threads

I've been working on this assignment for class. I put this code together, but it's giving me several errors I'm not able to solve.
Code
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
//global variables
int N, T;
double vsum[T];
//pie function
void* pie_runner(void* arg)
{
double *limit_ptr = (double*) arg;
double j = *limit_ptr;
for(int i = (N/T)*j; i<=((N/T)*(j+1)-1); j++)
{
if(i %2 =0)
vsum[j] += 4/((2*j)*(2*j+1)*(2*j+2));
else
vsum[j] -= 4/((2*j)*(2*j+1)*(2*j+2));
}
pthread_exit(0);
}
int main(int argc, char **argv)
{
if(argc != 3) {
printf("Error: Must send it 2 parameters, you sent %s", argc);
exit(1);
}
N = atoi[1];
T = atoi[2];
if(N !> T) {
printf("Error: Number of terms must be greater then number of threads.");
exit(1);
}
for(int p=0; p<T; p++) //initialize array to 0
{
vsum[p] = 0;
}
double pie = 3;
//launch threads
pthread_t tids[T];
for(int i = 0; i<T; i++)
{
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_create(&tids[i], &attr, pie_runner, &i);
}
//wait for threads...
for(int k = 0; k<T; k++)
{
pthread_join(tids[k], NULL);
}
for(int x=0; x<T; x++)
{
pie += vsum[x];
}
printf("pi computed with %d terms in %s threads is %k\n", N, T, pie);
}
One of the problems I'm having is with the array up top. It needs to be a global variable but it keeps telling me it's not a constant, even when I declare it as such.
Any help is appreciated, with the rest of the code also.
EDIT: After updating the code using the comments below, here is the new code. I still have a few warnings and would appreciate help dealing with them.
1) Warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] int j = (int)arg;
2) Warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] pthread_create(.......... , (void*)i);
NEW CODE:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
//global variables
int N, T;
double *vsum;
//pie function
void* pie_runner(void* arg)
{
long j = (long)arg;
//double *limit_ptr = (double*) arg;
//double j = *limit_ptr;
//for(int i = (j-1)*N/T; i < N*(j) /T; i++)
for(int i = (N/T)*(j-1); i < ((N/T)*(j)); i++)
{
if(i % 2 == 0){
vsum[j] += 4.0/((2*j)*(2*j+1)*(2*j+2));
//printf("vsum %lu = %f\n", j, vsum[j]);
}
else{
vsum[j] -= 4.0/((2*j)*(2*j+1)*(2*j+2));
//printf("vsum %lu = %f\n", j, vsum[j]);
}
}
pthread_exit(0);
}
int main(int argc, char **argv)
{
if(argc != 3) {
printf("Error: Must send it 2 parameters, you sent %d\n", argc-1);
exit(1);
}
N = atoi(argv[1]);
T = atoi(argv[2]);
vsum = malloc((T+1) * sizeof(*vsum));
if(vsum == NULL) {
fprintf(stderr, "Memory allocation problem\n");
exit(1);
}
if(N <= T) {
printf("Error: Number of terms must be greater then number of threads.\n");
exit(1);
}
for(int p=1; p<=T; p++) //initialize array to 0
{
vsum[p] = 0;
}
double pie = 3.0;
//launch threads
pthread_t tids[T+1];
for(long i = 1; i<=T; i++)
{
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_create(&tids[i], &attr, pie_runner, (void*)i);
}
//wait for threads...
for(int k = 1; k<=T; k++)
{
pthread_join(tids[k], NULL);
}
for(int x=1; x<=T; x++)
{
pie += vsum[x];
}
printf("pi computed with %d terms in %d threads is %.20f\n", N, T, pie);
//printf("pi computed with %d terms in %d threads is %20f\n", N, T, pie);
free(vsum);
}
Values not working:
./pie1 2 1
pi computed with 2 terms in 1 threads is 3.00000000000000000000
./pie1 3 1
pi computed with 3 terms in 1 threads is 3.16666666666666651864
./pie1 3 2
pi computed with 3 terms in 2 threads is 3.13333333333333330373
./pie1 4 2
pi computed with 4 terms in 2 threads is 3.00000000000000000000
./pie1 4 1
pi computed with 4 terms in 1 threads is 3.00000000000000000000
./pie1 4 3
pi computed with 4 terms in 3 threads is 3.14523809523809516620
./pie1 10 1
pi computed with 10 terms in 1 threads is 3.00000000000000000000
./pie1 10 2
pi computed with 10 terms in 2 threads is 3.13333333333333330373
./pie1 10 3
pi computed with 10 terms in 3 threads is 3.14523809523809516620
./pie1 10 4
pi computed with 10 terms in 4 threads is 3.00000000000000000000
./pie1 10 5
pi computed with 10 terms in 5 threads is 3.00000000000000000000
./pie1 10 6
pi computed with 10 terms in 6 threads is 3.14088134088134074418
./pie1 10 7
pi computed with 10 terms in 7 threads is 3.14207181707181693042
./pie1 10 8
pi computed with 10 terms in 8 threads is 3.14125482360776464574
./pie1 10 9
pi computed with 10 terms in 9 threads is 3.14183961892940200045
./pie1 11 2
pi computed with 11 terms in 2 threads is 3.13333333333333330373
./pie1 11 4
pi computed with 11 terms in 4 threads is 3.00000000000000000000
There are numerous problems with that code. Your specific problem is that, in C, variable length arrays (VLAs) are not permitted at file scope.
So, if you want that array to be dynamic, you will have to declare the pointer to it and allocate it yourself:
int N, T;
double *vsum;
and then, in main() after T has been set:
vsum = malloc (T * sizeof(*vsum));
if (vsum == NULL) {
fprintf (stderr, "Memory allocation problem\n");
exit (1);
}
remembering to free it before exiting (not technically required but good form anyway):
free (vsum);
Among the other problems:
1/ There is no !> operator in C. "Not greater than" is written <=, so I suspect the line should be:
if (N <= T) {
rather than:
if (N !> T) {
2/ To get the arguments from the command line, change:
N = atoi[1];
T = atoi[2];
into:
N = atoi(argv[1]);
T = atoi(argv[2]);
3/ The comparison operator is ==, not =, so you need to change:
if(i %2 =0)
into:
if (i % 2 == 0)
4/ Your error message about not having enough parameters needs to use %d rather than %s, as argc is an integral type:
printf ("Error: Must send it 2 parameters, you sent %d\n", argc-1);
Ditto for your calculation message at the end (and fixing the %k for the floating point value):
printf ("pi computed with %d terms in %d threads is %.20f\n", N, T, pie);
5/ You pass an integer pointer into your thread function but there are two problems with that.
The first is that you then extract it into a double j, which cannot be used as an array index. If it's an integer being passed in, it should be turned back into an integer.
The second is that there is no guarantee the new thread will extract the value (or even start running its code at all) before the main thread changes that value to start up another thread. You should probably just convert the integer to a void * directly rather than messing about with integer pointers.
To fix both those, use this when creating the thread:
pthread_create (&tids[i], &attr, pie_runner, (void*)i);
and this at the start of the thread function:
int j = (int) arg;
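If you get warnings or experience problems with that, it's probably because your integers and pointers are not the same size. In that case, you could try going through intptr_t (from <stdint.h>) in both directions, something like:
pthread_create (&tids[i], &attr, pie_runner, (void*)(intptr_t)i);
together with int j = (int)(intptr_t) arg; in the thread function. That is the usual way to avoid the size-mismatch warnings on common platforms.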
Alternatively (though it's a bit of a kludge), stick with your pointer solution and just make sure there's no possibility of race conditions (by passing a unique pointer per thread).
First, revert the thread function to receiving its value by a pointer:
int j = *((int*) arg);
Then, before you start creating threads, you need to create a thread integer array and, for each thread created, populate and pass the (address of the) correct index of that array:
int tvals[T]; // add this line.
for (int i = 0; i < T; i++) {
tvals[i] = i; // and this one.
pthread_attr_t attr;
pthread_attr_init (&attr);
pthread_create (&tids[i], &attr, pie_runner, &(tvals[i]));
}
That shouldn't be too onerous unless you have so many threads that the extra array becomes problematic. But if you have that many threads, you're going to have far greater problems.
6/ Your loop in the thread incorrectly incremented j rather than i. Since this is the same area touched by the following section, I'll correct it there.
7/ The use of integers in what is predominantly a floating point calculation means that you have to arrange your calculations so that they don't truncate divisions, such as 10 / 4 -> 2 where it should be 2.5.
That means the loop in the thread function should be changed as follows (including incrementing i as in previous point):
for (int i = j*N/T; i <= N * (j+1) / T - 1; i++)
if(i % 2 == 0)
vsum[j] += 4.0/((2*j)*(2*j+1)*(2*j+2));
else
vsum[j] -= 4.0/((2*j)*(2*j+1)*(2*j+2));
With all those changes, you get a reasonably sensible result:
$ ./picalc 100 101
pi computed with 100 terms in 101 threads is 3.14159241097198238535
Two problems with that array: The first is that T is not a compile-time constant, which the size of a file-scope array needs to be in C (and the size of any array in standard C++). The second is that T, being a global, is initialized to zero, so even if the declaration were accepted, the array would have a size of zero and every index into it would be out of bounds.
You need to allocate the array dynamically once you have read T and know the size. In C you use malloc for that, in C++ you should use std::vector instead.

How to assign a specific job to each thread for matrix addition in openmp

I am trying to create a matrix addition program to practice with OpenMP. I have N^2 processors/threads and need to assign work such that each thread computes one entry of the resultant matrix. For example, if I have two matrices A and B of size NxN, then each thread should compute one entry of the resultant matrix C. From some of the beginner tutorials on OpenMP, it seems that the #pragma omp parallel for directive divides the iterations equally among the total number of threads specified. But in the code below only 3 threads are active, and not 9 as I want.
The code I have is as follows:
#include <stdio.h>
#include "omp.h"
void main() {
// omp_set_num_threads(NUM_THREADS);
int i, k;
int N=3;
int A[3][3] = { {1, 2, 3},{ 5, 6, 7}, {8,9,10} };
int B[3][3] = { {1, 2, 3},{ 5, 6, 7}, {8,9,10} };
int C[3][3] ;
omp_set_dynamic(0);
omp_set_num_threads(9);
// printf("Num of threads %i \n", omp_get_max_threads());
#pragma omp parallel for private(i,k) shared(A, B, C, N)
for (i = 0; i< N; i++) {
for (k=0; k< N;k++){
int j = omp_get_thread_num();
C[i][k] = A[i][k] + B[i][k] ;
printf("I m thread %d computing A[%d][%d] and B[%d][%d] = %d \n ", j, i,k, i,k, C[i][k]);
}
}
int n, m;
for (n=0; n<3; n++) {
for ( m=0;m<3;m++){
printf("C[%d][%d] = %d \n",n,m, C[n][m]);
}
}
}
And the output I am getting is:
I m thread 0 computing A[0][0] and B[0][0] = 2
I m thread 1 computing A[1][0] and B[1][0] = 10
I m thread 1 computing A[1][1] and B[1][1] = 12
I m thread 1 computing A[1][2] and B[1][2] = 14
I m thread 0 computing A[0][1] and B[0][1] = 4
I m thread 0 computing A[0][2] and B[0][2] = 6
I m thread 2 computing A[2][0] and B[2][0] = 16
I m thread 2 computing A[2][1] and B[2][1] = 18
I m thread 2 computing A[2][2] and B[2][2] = 20
C[0][0] = 2
C[0][1] = 4
C[0][2] = 6
C[1][0] = 10
C[1][1] = 12
C[1][2] = 14
C[2][0] = 16
C[2][1] = 18
C[2][2] = 20
What I want, though, is for each of the nine threads to compute one entry of the matrix C. Could anyone please help with this? I am new to both C and OpenMP. I am also confused about the exact function of variables in the private clause. For example, if I specify 'i' and 'k' as private, does that mean that each thread will have its own copy of 'i' and 'k' and may therefore run its own iterations of the loop? But that doesn't make sense, since in the above output thread 0 is computing all the row 0 values, and thread 1 all the row 1 values. How is this happening on its own without any specific directive? Thank you for your help!
#pragma omp parallel for applies only to the loop that immediately follows it, here the outer loop. That loop iterates only 3 times (N = 3), so at most 3 threads are given any work.
If you want to use 9 threads, you should collapse the 2d array to 1d, using a single index, let's call it p:
#pragma omp parallel for private(i, k, p) shared(A, B, C, N)
for (p = 0; p < N * N; p++) {
i = p / N;
k = p % N;
C[i][k] = A[i][k] + B[i][k];
}
As stated in George's answer and Timothy's comment, you can also use OpenMP's collapse(2) clause to achieve the same thing.
Another way, if you want to retain the two loops (besides chrk's answer), is to use:
#pragma omp parallel for private(i,k) shared(A, B, C, N) collapse(2)
This way, the iterations of both loops are distributed across the threads.
Right now, only the outer loop is executed in parallel.
That's why you see that, for example, thread 1 calculates all the row 1 values.

test "pthread_create" function behaviour

I am new to multi-threaded programming and I have a question about the behaviour of "pthread_create".
This is the code:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NTHREADS 10
#define ARRAYSIZE 1000000
#define ITERATIONS ARRAYSIZE / NTHREADS
double sum=0.0, a[ARRAYSIZE];
pthread_mutex_t sum_mutex;
void *do_work(void *tid)
{
int i, k=0,start, *mytid, end;
double mysum=0.0;
mytid = (int *) tid;
start = (*mytid * ITERATIONS);
end = start + ITERATIONS;
printf ("Thread %d doing iterations %d to %d\n",*mytid,start,end-1);
for (i=start; i < end ; i++) {
a[i] = i * 1.0;
mysum = mysum + a[i];
}
sum = sum + mysum;
}
int main(int argc, char *argv[])
{
int i, start, tids[NTHREADS];
pthread_t threads[NTHREADS];
pthread_attr_t attr;
for (i=0; i<NTHREADS; i++) {
tids[i] = i;
pthread_create(&threads[i], NULL/*&attr*/, do_work, (void *) &tids[i]);
}
/* Wait for all threads to complete then print global sum */
/*
for (i=0; i<NTHREADS; i++) {
pthread_join(threads[i], NULL);
}*/
printf ("Done. Sum= %e \n", sum);
sum=0.0;
for (i=0;i<ARRAYSIZE;i++){
a[i] = i*1.0;
sum = sum + a[i]; }
printf("Check Sum= %e\n",sum);
}
The result of execution is:
Thread 1 doing iterations 100000 to 199999
Done. Sum= 0.000000e+00
Thread 0 doing iterations 0 to 99999
Thread 2 doing iterations 200000 to 299999
Thread 3 doing iterations 300000 to 399999
Thread 8 doing iterations 800000 to 899999
Thread 4 doing iterations 400000 to 499999
Thread 5 doing iterations 500000 to 599999
Thread 9 doing iterations 900000 to 999999
Thread 7 doing iterations 700000 to 799999
Thread 6 doing iterations 600000 to 699999
Check Sum= 8.299952e+11
All threads are created and the execution is not sequential (I removed pthread_join), but the work in do_work is always assigned in the same way and depends on the thread: iterations 0 to 99999 are done by thread 0, iterations 100000 to 199999 by thread 1, etc.
The question is: why, for example, are iterations 0 to 99999 never done by thread 2?
This is because the iteration range is calculated from the thread number (0 to NTHREADS - 1) in the following line:
start = (*mytid * ITERATIONS);
And you create and pass that number in a loop as such:
for (i=0; i<NTHREADS; i++) {
tids[i] = i;
...
In other words, thread 2 always receives the value 2, so its start index is 2 * ITERATIONS = 200000, which can never be 0. Iterations 0 to 99999 therefore always go to thread 0.
I think you are confused as to what threads are.
Think about each thread as its own program. If you run 10 programs at the same time, they will be running "simultaneously", i.e. instructions of these 10 programs will be interleaved, but within each program all instructions are executed in the deterministic order.
Same thing with threads. You define which numbers each thread will iterate over by passing the thread id argument when creating the thread.
You are sending the address of tids[i] to thread i, and the thread computes and prints its range from that value. So thread N will always print N00000 to N99999 (with ITERATIONS = 100000), on every run.
mytid = (int *) tid;
start = (*mytid * ITERATIONS);
end = start + ITERATIONS;
For thread 2, it would behave as:
mytid = (int *) tid; // *mytid = 2
start = ( 2 * ITERATIONS); // 200000
end = ( 2 * ITERATIONS) + ITERATIONS; // 300000
Thus printing 200000 to 299999. So you can't expect thread 2 to print 0 to 99999.
It is not wise to update global variables like sum, which are shared between threads, without any kind of locking. The unsynchronized sum = sum + mysum is a data race and can give unexpected results. Protecting it with a pthread mutex (for example the sum_mutex that is already declared) would solve this problem.

Resources