Why does using OpenMP make the code slower to run?

I have the following code that uses OpenMP to parallelize a Monte Carlo method. My question is: why does the serial version of the code (monte_carlo_serial) run a lot faster than the parallel version (monte_carlo_parallel)? I am running the code on a machine with 32 cores and get the following result printed to the console:
-bash-4.1$ gcc -fopenmp hello.c ;
-bash-4.1$ ./a.out
Pi (Serial): 3.140856
Time taken 0 seconds 50 milliseconds
Pi (Parallel): 3.132103
Time taken 127 seconds 990 milliseconds
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <time.h>

int niter = 1000000; // number of iterations per FOR loop

int monte_carlo_parallel() {
    double x, y;   // x,y value for the random coordinate
    int i;         // loop counter
    int count = 0; // count holds the number of good coordinates
    double z;      // used to check if x^2+y^2 <= 1
    double pi;     // holds approx value of pi
    int numthreads = 32;

    #pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads)
    {
        srand48((int)time(NULL) ^ omp_get_thread_num()); // give drand48() a seed value
        for (i = 0; i < niter; ++i) // main loop
        {
            x = (double)drand48(); // gets a random x coordinate
            y = (double)drand48(); // gets a random y coordinate
            z = ((x*x) + (y*y));   // checks to see if point is inside unit circle
            if (z <= 1)
            {
                ++count; // if it is, consider it a valid random point
            }
        }
    }
    pi = ((double)count / (double)(niter * numthreads)) * 4.0;
    printf("Pi (Parallel): %f\n", pi);
    return 0;
}

int monte_carlo_serial() {
    double x, y;   // x,y value for the random coordinate
    int i;         // loop counter
    int count = 0; // count holds the number of good coordinates
    double z;      // used to check if x^2+y^2 <= 1
    double pi;     // holds approx value of pi

    srand48((int)time(NULL) ^ omp_get_thread_num()); // give drand48() a seed value
    for (i = 0; i < niter; ++i) // main loop
    {
        x = (double)drand48(); // gets a random x coordinate
        y = (double)drand48(); // gets a random y coordinate
        z = ((x*x) + (y*y));   // checks to see if point is inside unit circle
        if (z <= 1)
        {
            ++count; // if it is, consider it a valid random point
        }
    }
    pi = ((double)count / (double)(niter)) * 4.0;
    printf("Pi (Serial): %f\n", pi);
    return 0;
}

void main() {
    clock_t start = clock(), diff;

    monte_carlo_serial();
    diff = clock() - start;
    int msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("Time taken %d seconds %d milliseconds \n", msec / 1000, msec % 1000);

    start = clock();
    monte_carlo_parallel();
    diff = clock() - start;
    msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("Time taken %d seconds %d milliseconds \n", msec / 1000, msec % 1000);
}

The variable count is shared across all of your spawned threads. Each of them has to lock count to increment it. In addition, if the threads are running on separate CPUs (and there's no possible win if they're not), you pay the cost of sending the value of count from one core to another and back again.
This is a textbook example of cache-line ping-pong, the same effect that makes false sharing so costly. In your serial version count will sit in a register and cost 1 cycle to increment. In the parallel version it will usually not be in cache: you have to tell the other cores to invalidate that cache line, fetch it (L3 is going to take 66 cycles at best), increment it, and store it back. Every time count migrates from one CPU core to another you pay a minimum of ~125 cycles, which is a lot worse than 1. The threads can never truly run in parallel because they all depend on count.
Try to modify your code so that each thread has its own count, then sum the per-thread counts at the end, and you might see a speedup. A sketch of that approach follows.
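Here is a minimal sketch of that idea. It assumes the POSIX erand48() generator, which keeps its state in a caller-supplied buffer, so each thread owns both its RNG state and (via the reduction clause) its own private count:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

int main(void)
{
    const int niter = 1000000;
    const int numthreads = 32;
    long count = 0;

    #pragma omp parallel reduction(+:count) num_threads(numthreads)
    {
        /* per-thread RNG state: no shared globals, no hidden locking */
        unsigned short state[3] = { (unsigned short)time(NULL),
                                    (unsigned short)omp_get_thread_num(),
                                    0x330E };
        int i;
        for (i = 0; i < niter; ++i) {
            double x = erand48(state);
            double y = erand48(state);
            if (x * x + y * y <= 1.0)
                ++count; /* private copy; OpenMP sums the copies at the end */
        }
    }
    printf("Pi: %f\n", 4.0 * (double)count / ((double)niter * numthreads));
    return 0;
}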

Related

Why am I not getting the same estimation of PI using a parallelized (OpenMP) algorithm copied from working code?

The code below is a direct translation from a YouTube video on estimating PI using OpenMP and Monte Carlo. Even with the same inputs I'm not getting their output. In fact, it seems like I get around half their value.
int main() {
    int num; // number of iterations
    printf("Enter number of iterations you want the loop to run for: ");
    scanf_s("%d", &num);

    double x, y, z, pi;
    long long int i;
    int count = 0;
    int num_thread;

    printf("Enter number of threads you want to run to parallelize the process:\t");
    scanf_s("%d", &num_thread);
    printf("\n");

    #pragma omp parallel firstprivate(x,y,z,i) shared(count) num_threads(num_thread)
    {
        srand((int)time(NULL) ^ omp_get_thread_num());
        for (i = 0; i < num; i++) {
            x = (double)rand() / (double)RAND_MAX;
            y = (double)rand() / (double)RAND_MAX;
            z = pow(((x * x) + (y * y)), .5);
            if (z <= 1) {
                count++;
            }
        }
    } // END PRAGMA

    pi = ((double)count / (double)(num * num_thread)) * 4;
    printf("The value of pi obtained is %f\n", pi);
    return 0;
}
I've also used a similar algorithm straight from the Oak Ridge National Laboratory's website (https://www.olcf.ornl.gov/tutorials/monte-carlo-pi/):
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>

int main(int argc, char* argv[])
{
    int niter = 1000000; // number of iterations per FOR loop
    double x, y;         // x,y value for the random coordinate
    int i;               // loop counter
    int count = 0;       // count holds the number of good coordinates
    double z;            // used to check if x^2+y^2 <= 1
    double pi;           // holds approx value of pi
    int numthreads = 16;

    #pragma omp parallel firstprivate(x, y, z, i) shared(count) num_threads(numthreads)
    {
        srandom((int)time(NULL) ^ omp_get_thread_num()); // give random() a seed value
        for (i = 0; i < niter; ++i) // main loop
        {
            x = (double)random() / RAND_MAX; // gets a random x coordinate
            y = (double)random() / RAND_MAX; // gets a random y coordinate
            z = sqrt((x*x) + (y*y));         // checks to see if point is inside unit circle
            if (z <= 1)
            {
                ++count; // if it is, consider it a valid random point
            }
        }
        // print the value of each thread/rank
    }
    pi = ((double)count / (double)(niter * numthreads)) * 4.0;
    printf("Pi: %f\n", pi);
    return 0;
}
And I have the exact same problem, so I think it isn't the code but somehow my machine.
I am running in Visual Studio 2022 on Windows 11, with a 16-core i9-12900KF and 32 GB RAM.
Edit: I forgot to mention I did alter the second algorithm to use srand() and rand() instead.
There are several errors in the code:
As pointed out by @JeromeRichard and @JohnBollinger, rand/srand/random are not thread-safe; you should use a thread-safe alternative.
There is a race condition at the line ++count; (different threads read and write a shared variable). You should use a reduction to avoid it.
The code assumes that you actually run with numthreads threads, but OpenMP does not guarantee that you get all of the threads you requested. If you got roughly PI/2 as a result, the likely cause is this difference between the requested and obtained number of threads. If you use #pragma omp parallel for... before the loop, you do not need any assumptions about the number of threads (i.e., in that case the equation to calculate PI does not contain the number of threads).
A minor comment: you do not need the time-consuming pow function.
Putting it together your code should be something like this:
#pragma omp parallel for reduction(+:count) num_threads(num_thread)
for (long long int i = 0; i < num; i++) {
    const double x = threadsafe_random_number_between_0_1();
    const double y = threadsafe_random_number_between_0_1();
    const double z = x * x + y * y;
    if (z <= 1) {
        count++;
    }
}
double pi = ((double)count / (double)num) * 4.0;
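As for the placeholder threadsafe_random_number_between_0_1(), one possible implementation (a sketch only, assuming POSIX rand_r() from <stdlib.h>, and reusing the variable names from the code above) keeps the RNG state in a per-thread variable:

#pragma omp parallel reduction(+:count) num_threads(num_thread)
{
    /* per-thread RNG state: seeded once, distinct for each thread */
    unsigned int state = (unsigned int)time(NULL) ^ omp_get_thread_num();
    #pragma omp for
    for (long long int i = 0; i < num; i++) {
        const double x = (double)rand_r(&state) / RAND_MAX;
        const double y = (double)rand_r(&state) / RAND_MAX;
        if (x * x + y * y <= 1.0) {
            count++;
        }
    }
}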
One assumption, though I may be wrong: you initialise the random generator with the time, so different threads may end up using the same seed, which results in the same sequence of random numbers in each thread. The estimate will then be really bad, because the Monte Carlo method gives wrong results when the same points are counted multiple times.

Serial Execution faster than Parallel Execution with one thread of OpenMP

I am trying to compute the value of pi using the trapezoidal rule of numerical integration. For that I have written a serial code which iterates over a given range. To measure the parallel overhead, I have run the same code with the number of threads set to 1. I then obtained the following graph of execution time versus problem size.
Since we are only creating one thread, I don't think there is much communication overhead involved. So what might be the reason behind this? And as far as I know, a directive is handled at compile time, i.e., if you define a macro then it gets expanded before runtime, so am I missing something there? Or is it something totally different from what I had thought?
Below is the serial code
#include <stdio.h>
#include <omp.h>

int main()
{
    FILE *fp = fopen("pi_serial.txt", "a+");
    long num_steps = 1e9;
    double step_size = 1.0 / num_steps;
    long i;
    double sum = 0;

    double start_time = omp_get_wtime();
    for (i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step_size;
        sum += (4.0 / (1.0 + (x * x)));
    }
    sum = sum * step_size;
    double end_time = omp_get_wtime();

    fprintf(fp, "%lf %lf\n", sum, end_time - start_time);
    fclose(fp);
    return 0;
}
And here is the multi-threaded code
#include <stdio.h>
#include <omp.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    FILE* fp = fopen("pi_parallel.txt", "a+");
    omp_set_num_threads(1);
    long num_steps = atol(argv[1]);
    double step_size = 1.0 / num_steps;
    double sum = 0;

    double start_time = omp_get_wtime();
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        double private_sum = 0;
        int i;
        for (i = id; i <= num_steps; i += 1) {
            double x = (i + 0.5) * step_size;
            private_sum += (4.0 / (1.0 + x * x));
        }
        #pragma omp critical
        sum += private_sum;
    }
    sum *= step_size;
    double end_time = omp_get_wtime();

    fprintf(fp, "%lf %lf\n", sum, end_time - start_time);
    fclose(fp);
    return 0;
}
And here is the graph of execution time (image not reproduced here).
https://www.youtube.com/watch?v=OuzYICZUthM&list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG&index=7
The above video helps in understanding why a serial code might be faster than a parallel code run with one thread.
According to the presenter, since you are setting up the OpenMP environment and creating a thread in the middle of the program, it is normal for the OpenMP version to run slower than the serial code.
But the main thing is to look at the scalability of your code: how fast is it, compared to the serial version, when running on more than one thread?
If you run the same code on multiple threads and still do not see an increase in performance, it may be due to false sharing. From what I understand: consider two variables that reside in the same cache line. The master thread accesses and modifies one of them, which invalidates the whole cache line. If thread 1 then has to access that line, the modified line is first written back to memory, after which thread 1 fetches it and modifies it in turn. This back-and-forth can increase the execution time considerably.
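For scaling past one thread without contention on the accumulator, here is a minimal sketch using the standard reduction clause instead of the manual critical section (variable names follow the question's code):

#pragma omp parallel for reduction(+:sum)
for (long i = 0; i < num_steps; i++) {
    double x = (i + 0.5) * step_size;
    sum += 4.0 / (1.0 + x * x); /* each thread accumulates a private copy */
}
sum *= step_size;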
References:
https://docs.oracle.com/cd/E37069_01/html/E37081/aewcy.html
*I don't own the video.

C Threads program

I wrote a program based on the idea of a Riemann sum to find the value of an integral. It uses several threads, but the performance of it (the algorithm), compared to a sequential program I wrote later, is subpar. Algorithm-wise they are identical except for the threading, so the question is: what's wrong with it? pthread_join is not the cause, I assume, because if one thread finishes sooner than the thread that join is waiting on, it will simply be skipped later. Is that correct? The free call is probably wrong, and there is no error check upon creation of the threads; I'm aware of that, I deleted them along the way while testing various things. Sorry for bad English and thanks in advance.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/types.h>
#include <time.h>

int counter = 0;
float sum = 0;
pthread_mutex_t mutx;

float function_res(float);

struct range {
    float left_border;
    int steps;
    float step_range;
};

void *calcRespectiveRange(void *ranges) {
    struct range *rangs = ranges;
    float left_border = rangs->left_border;
    int steps = rangs->steps;
    float step_range = rangs->step_range;
    free(rangs);
    //printf("left: %f steps: %d step range: %f\n", left_border, steps, step_range);
    int i;
    float temp_sum = 0;
    for (i = 0; i < steps; i++) {
        temp_sum += step_range * function_res(left_border);
        left_border += step_range;
    }
    sum += temp_sum;
    pthread_exit(NULL);
}

int main() {
    clock_t begin, end;
    if (pthread_mutex_init(&mutx, NULL) != 0) {
        printf("mutex error\n");
    }
    printf("enter range, amount of steps and threads: \n");
    float left_border, right_border;
    int steps_count;
    int threads_amnt;
    scanf("%f %f %d %d", &left_border, &right_border, &steps_count, &threads_amnt);

    float step_range = (right_border - left_border) / steps_count;
    int i;
    pthread_t tid[threads_amnt];
    float chunk = (right_border - left_border) / threads_amnt;
    int steps_per_thread = steps_count / threads_amnt;

    begin = clock();
    for (i = 0; i < threads_amnt; i++) {
        struct range *ranges;
        ranges = malloc(sizeof(ranges));
        ranges->left_border = i * chunk + left_border;
        ranges->steps = steps_per_thread;
        ranges->step_range = step_range;
        pthread_create(&tid[i], NULL, calcRespectiveRange, (void*) ranges);
    }
    for (i = 0; i < threads_amnt; i++) {
        pthread_join(tid[i], NULL);
    }
    end = clock();

    pthread_mutex_destroy(&mutx);
    printf("\n%f\n", sum);
    double time_spent = (double) (end - begin) / CLOCKS_PER_SEC;
    printf("Time spent: %lf\n", time_spent);
    return(0);
}

float function_res(float lb) {
    return(lb * lb + 4 * lb + 3);
}
Edit: in short - can it be improved to reduce execution time (with mutexes, for example)?
The execution time will be shortened, provided you have multiple hardware threads available.
The problem is in how you measure time: clock returns the processor time used by the program. That means it sums the time taken by all the threads. If your program uses 2 threads and its linear execution time is 1 second, then each thread has used 1 second of CPU time, and clock will return the equivalent of 2 seconds.
To get the actual time used (on Linux), use gettimeofday. I modified your code by adding
#include <sys/time.h>
and capturing the start time before the loop:
struct timeval tv_start;
gettimeofday( &tv_start, NULL );
and after:
struct timeval tv_end;
gettimeofday( &tv_end, NULL );
and calculating the difference in seconds:
printf("CPU Time: %lf\nTime passed: %lf\n",
time_spent,
((tv_end.tv_sec * 1000*1000.0 + tv_end.tv_usec) -
(tv_start.tv_sec * 1000*1000.0 + tv_start.tv_usec)) / 1000/1000
);
(I also fixed the malloc: malloc(sizeof(ranges)) allocates the size of a pointer (4 or 8 bytes on a 32/64-bit CPU), whereas malloc(sizeof(struct range)) allocates the 12 bytes actually needed.)
When running with the input parameters 0 1000000000 1000000000 1, that is, 1 billion iterations in 1 thread, the output on my machine is:
CPU Time: 4.352000
Time passed: 4.400006
When running with 0 1000000000 1000000000 2, that is, 1 billion iterations spread over 2 threads (500 million iterations each), the output is:
CPU Time: 4.976000
Time passed: 2.500003
For completeness' sake, I tested it with the input 0 1000000000 1000000000 4:
CPU Time: 8.236000
Time passed: 2.180114
It is a little faster, but not twice as fast as with 2 threads, and it uses double the CPU time. This is because my CPU is a Core i3, a dual-core with hyperthreading, and hyperthreads aren't true hardware threads.
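One remark on the question's edit about mutexes: the unprotected sum += temp_sum; in calcRespectiveRange is a data race between threads. A sketch of the guarded version, reusing the mutx the program already initializes:

pthread_mutex_lock(&mutx);   /* serialize only the final accumulation */
sum += temp_sum;
pthread_mutex_unlock(&mutx);

Since each thread locks exactly once, after its loop, the cost of the lock is negligible.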

Ising 1-Dimensional C - program

I am trying to simulate the 1-D Ising model. This model consists of a chain of spins (100 spins), using Monte Carlo with the Metropolis rule: a spin flip is accepted if the energy of the system (in units of J) goes down, or otherwise with an acceptance probability compared against a random number.
In a correct program, both the energy and the magnetization go to zero, and the results come out as a Gaussian (plots of energy or magnetization versus the number of Monte Carlo steps).
I have done some work, but I think my random generator isn't correct for this, and I don't know how/where to implement the boundary condition: the last spin of the chain is the first one.
I need help to finish it. Any help will be welcome. Thank you.
I am pasting my C program down:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h> // necessary for function time()

#define LENGTH 100 // size of the chain of spins
#define TEMP 2     // temperature in units of J
#define WARM 200   // thermalization (warm-up) steps
#define MCS 20000  // Monte Carlo steps
void start(int spin[])
{
    /* starts with all the spins 1 */
    int i;
    for (i = 0; i < 100; i++)
    {
        spin[i] = 1;
    }
}

double energy(int spin[]) // energy of the chain; J = 1
{
    int i;
    double energyX = 0; // because the beginning energy = -J*sum (until 100) = -100
    for (i = 0; i < 100; i++)
        energyX = energyX - spin[i]*spin[i+1];
    return(energyX);
}

int randnum() {
    int num;
    srand(time(NULL));
    /* srand(time(NULL)) aims to initiate the random number generator
       with the value of the function time(NULL). This is calculated as the
       total of seconds passed since January 1st, 1970 until the present date.
       So, this way, for each execution the value of the "seed" will be different.
    */
    srand(time(NULL));
    // picking one spin randomly, zero to 100
    num = rand() % 100;
    printf("num = %d ", num);
    return num;
}
void montcarlo(int spin[])
{
    int i, j, num;
    double prob;
    double energyA, energyB; // A -> old energy and B -> the new energy
    int rnum1, rnum2;
    prob = exp(-(energyB-energyA)/TEMP);
    energyA = 0;
    energyB = 0;
    for (i = 0; i < 100; i++)
    {
        for (j = 0; j < 100; j++)
        {
            energyA = energy(spin);
            rnum1 = randnum();
            rnum2 = randnum(); // i think they will give me different numbers
            spin[rnum1] = -spin[rnum1]; // flip of the randomly selected spin
            energyB = energyB - spin[j]*spin[j+1];
            if ((energyB-energyA < 0) || ((energyB-energyA > 0) && (rnum2 > prob))) { // using rnum2, not rnum1, to avoid correlation
                spin[rnum1] = spin[rnum1];   // keep the flip
            }
            else if ((energyB-energyA > 0) && (rnum2 < prob))
                spin[rnum1] = -spin[rnum1];  // unflip
        }
    }
}
int Mag_Moment(int spin[]) // this is the magnetic moment
{
    int i;
    int mag;
    for (i = 0; i < 100; i++)
    {
        mag = mag + spin[i];
    }
    return(mag);
}
int main()
{
    // starting the spin chain
    int spin[100]; // the vector goes up to LENGTH=100
    int i, num, j;
    int itime;
    double mag_moment;

    start(spin);
    double energy_chain = 0;
    energy_chain = energy(spin); // that will give me -100 in the beginning
    printf("energy_chain starts with %f", energy_chain); // initially it gives -100

    /* warming: it makes the spins less ordered */
    for (i = 1; i <= WARM; i++)
    {
        itime = i;
        montcarlo(spin);
    }
    printf("Configuration after warming %d \n", itime);
    for (j = 0; j < LENGTH; j++)
    {
        printf("%d", spin[j]);
    }
    printf("\n");
    energy_chain = energy(spin); // new energy after the warming

    /* opening a file to save the values of energy and magnetic moment of the chain */
    FILE *fp;  // declaring the file for the energy
    FILE *fp2; // declaring the file for the mag moment
    fp = fopen("energy_chain.txt", "w");
    fp2 = fopen("mag_moment.txt", "w");
    int pures; // net value of i
    int a;

    /* using Monte Carlo Metropolis for the whole chain */
    for (i = (WARM + 1); i <= MCS; i++)
    {
        itime = i; // saving the i step for the final printf
        pures = i - (WARM + 1);
        montcarlo(spin);
        energy_chain = energy_chain + energy(spin); // the spin chain is modified by void montcarlo
        mag_moment = mag_moment + Mag_Moment(spin);
        a = pures % 10000; // here i select a value to save to a txt file every 10000 steps to produce graphs
        if (a == 0) {
            fprintf(fp, "%.12f\n", energy_chain); // %.12f just to give great precision
            fprintf(fp2, "%.12f\n", mag_moment);
        }
    }
    fclose(fp); // closing the files
    fclose(fp2);

    /* finishing -- printing */
    printf("energy_chain = %.12f\n", energy_chain);
    printf("mag_moment = %.12f \n", mag_moment);
    printf("Temperature = %d,\n Size of the system = 100 \n", TEMP);
    printf("Warm steps = %d, Montcarlo steps = %d \n", WARM, MCS);
    printf("Configuration in time %d \n", itime);
    for (j = 0; j < 100; j++)
    {
        printf("%d", spin[j]);
    }
    printf("\n");
    return 0;
}
You should call srand(time(NULL)); only once in your program. Every time you call it within the same second you will get the same sequence of random numbers, so it is very likely that both calls in randnum will give you the same number.
Just add srand(time(NULL)); at the beginning of main and remove it everywhere else. A sketch of the change follows.
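A minimal sketch of that change (the rest of the program stays as it is):

int main()
{
    srand((unsigned)time(NULL)); /* the only srand call in the program */
    /* ... rest of main unchanged ... */
}

int randnum() {
    /* no srand here any more; just draw the next number */
    return rand() % 100; /* picks one spin, 0 to 99 */
}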
I see a number of bugs in this code, I think. The first is the re-seeding with srand() on every call, which has already been addressed. Many of the loops also go beyond the array bounds, such as:
for (ii = 0; ii < 100; ii++)
{
    energyX = energyX - spin[ii]*spin[ii+1];
}
This will give you spin[99]*spin[100] on the last iteration, which is out of bounds. That pattern is peppered throughout the code. Also, I noticed the probability rnum2 is an int but is compared as if it were a double. I think dividing rnum2 by 100 will give a reasonable probability:
rnum2 = (randnum()/100.0); // i think they will give me different numbers
The initial probability used to decide the spin flip is prob=exp(-(energyB-energyA)/TEMP);, but both energy values are uninitialized at that point; maybe this is intentional, but I think it would be better to just use rand(). The Mag_Moment() function never initializes mag, so you wind up with a return value that is garbage. Can you point me to the algorithm you are trying to reproduce? I'm just curious.
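On the boundary condition the question asks about (the last spin couples back to the first), here is a sketch of the energy function with periodic wrap-around, assuming J = 1 as in the original:

/* Sketch: energy of the ring with periodic boundary conditions.
   spin[(i + 1) % LENGTH] couples the last spin to the first and
   also removes the out-of-bounds read at i == LENGTH - 1. */
double energy(int spin[])
{
    double energyX = 0.0;
    int i;
    for (i = 0; i < LENGTH; i++)
        energyX -= spin[i] * spin[(i + 1) % LENGTH];
    return energyX;
}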

issues timing the thread functions of pthreads in C

So I am having an issue calculating the elapsed time of the thread function for each thread. I need to be able to find the total elapsed time across all of the threads, but it is not working properly (see the output below the code).
#include <unistd.h>
#include <sys/types.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <string.h>
#include <math.h>
#include <time.h>

int numthread;
double x1;
double x2;
double h;
double totalintegral;
int n; // number of trapezoids
int localn;
double gnolock;
double gmute;
double gbusy;
double gsema;

double doTrapRule(double localx1, double localx2, double h, int localn);
double doFunction(double x);
void *threadCalc(void* threadid);

int main(int argc, char * argv[])
{
    int i;
    x1 = 0.0;
    x2 = 20.0;
    n = 200000;
    numthread = 10;
    pthread_t* threads = malloc(numthread*sizeof(pthread_t));
    h = (x2 - x1)/n;
    localn = n/numthread;

    for (i = 0; i < numthread; i++)
    {
        pthread_create(&threads[i], NULL, (void *) &threadCalc, (void*) i);
    }
    for (i = 0; i < numthread; i++)
    {
        pthread_join(threads[i], NULL);
    }

    printf("Trap rule result with %d trap(s) is %f\n", n, totalintegral);
    fflush(stdout);
    printf("no lock completed in %f\n", gnolock);
    exit(0);
}

void *threadCalc(void* threadid)
{
    clock_t start = clock();
    double localx1;
    double localx2;
    double localintegral;
    int cur_thread = (int)threadid;

    localx1 = x1 + cur_thread * localn * h;
    localx2 = localx1 + localn * h;
    localintegral = doTrapRule(localx1, localx2, h, localn);
    totalintegral = totalintegral + localintegral;
    //printf("Trap rule result with %d trap(s) is %f", n, totalintegral);

    clock_t stop = clock();
    double time_elapsed = (long double)(stop - start)/CLOCKS_PER_SEC;
    printf("time elapsed of each thread %f\n", time_elapsed);
    gnolock = gnolock + time_elapsed;
    return NULL;
}

double doTrapRule(double localx1, double localx2, double h, int localn)
{
    // time start here
    double localtrapintegral;
    double tempx1;
    int i;
    localtrapintegral = (doFunction(localx1) + doFunction(localx2)) / 2.0;
    for (i = 1; i <= (localn - 1); i++)
    {
        tempx1 = localx1 + i * h;
        localtrapintegral = localtrapintegral + doFunction(tempx1);
    }
    localtrapintegral = localtrapintegral * h;
    // time end here, add elapsed to global
    return localtrapintegral;
}

double doFunction(double x)
{
    double result;
    result = x*x*x;
    return result;
}
output:
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
time elapsed of each thread 0.010000
time elapsed of each thread 0.010000
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
Trap rule result with 200000 trap(s) is 40000.000001
no lock completed in 0.020000
As you can see, for whatever reason only some of the threads are actually returning a time. I ran this multiple times, and every time only a few threads returned a result. Just as an FYI, gnolock is my variable that stores the total elapsed time. My guess as to why this isn't working is that the decimal point is out of range, but it shouldn't be?
If you call clock() on your system, it has a resolution of 10 ms. So if a process takes 2 ms, then it will usually report a time of 0.00s or 0.01s, depending on a bunch of things which you have no control over.
Use one of the high-resolution clocks instead. You can use clock_gettime with CLOCK_THREAD_CPUTIME_ID or CLOCK_PROCESS_CPUTIME_ID; I believe the resolution of these clocks is several orders of magnitude better than clock()'s.
See man 2 clock_gettime for more information.
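A sketch of what that change could look like inside threadCalc, assuming a Linux system where CLOCK_THREAD_CPUTIME_ID is available (declared in <time.h>):

struct timespec ts_start, ts_stop;
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts_start); /* CPU time of this thread only */

localintegral = doTrapRule(localx1, localx2, h, localn);

clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts_stop);
double time_elapsed = (ts_stop.tv_sec - ts_start.tv_sec)
                    + (ts_stop.tv_nsec - ts_start.tv_nsec) / 1e9; /* nanosecond resolution */

On older glibc you may need to link with -lrt.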
The most likely explanation is that your clock tick is too coarse for the elapsed time you are trying to measure. Mostly the start and stop clock values are the same; occasionally, by chance, a clock tick occurs during your thread's execution and you see one tick. (This is effectively what Dietrich said above.)
As an example of what this means, imagine your thread takes an hour to complete and your clock ticks once a day, at midnight. Mostly, when you run the thread it starts and ends on the same day. But if you happen to run it within an hour of midnight, you will see the start and the stop on different days (1 tick). What you need then is a faster clock, but such a clock might well not be available.
You are using the wrong tool: clock doesn't measure elapsed (wall-clock) time. From the man page:
The clock() function returns an approximation of processor time used by the program.
These are two completely different things. Perhaps your threads simply don't use up much processor time.
