Issues timing the thread functions of pthreads in C

I am having an issue calculating the elapsed time of each thread's function. I need the total elapsed time across all of the threads, but the program is not computing it properly (see the output below the code).
#include <unistd.h>
#include <sys/types.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <string.h>
#include <math.h>
#include <time.h>
int numthread;
double x1;
double x2;
double h;
double totalintegral;
int n; //number of trapezoids
int localn;
double gnolock;
double gmute;
double gbusy;
double gsema;
double doTrapRule(double localx1, double localx2, double h, int localn);
double doFunction(double x);
void *threadCalc(void* threadid);
int main(int argc, char * argv[])
{
int i;
x1 = 0.0;
x2 = 20.0;
n = 200000;
numthread = 10;
pthread_t* threads = malloc(numthread*sizeof(pthread_t));
h = (x2 - x1)/n;
localn = n/numthread;
for(i = 0; i < numthread; i++)
{
pthread_create(&threads[i], NULL, (void *) &threadCalc, (void*) i);
}
for(i = 0; i < numthread; i++)
{
pthread_join(threads[i], NULL);
}
printf("Trap rule result with %d trap(s) is %f\n", n, totalintegral);
fflush(stdout);
printf("no lock completed in %f\n", gnolock);
exit(0);
}
void *threadCalc(void* threadid)
{
clock_t start = clock();
double localx1;
double localx2;
double localintegral;
int cur_thread = (int)threadid;
localx1 = x1 + cur_thread * localn * h;
localx2 = localx1 + localn * h;
localintegral = doTrapRule(localx1, localx2, h, localn);
totalintegral = totalintegral + localintegral;
//printf("Trap rule result with %d trap(s) is %f", n, totalintegral);
clock_t stop = clock();
double time_elapsed = (long double)(stop - start)/CLOCKS_PER_SEC;
printf("time elapsed of each thread %f\n",time_elapsed);
gnolock = gnolock + time_elapsed;
return NULL;
}
double doTrapRule(double localx1, double localx2, double h, int localn)
{
//time start here
double localtrapintegral;
double tempx1;
int i;
localtrapintegral = (doFunction(localx1) + doFunction(localx2)) / 2.0;
for(i = 1; i <= (localn - 1); i++)
{
tempx1 = localx1 + i * h;
localtrapintegral = localtrapintegral + doFunction(tempx1);
}
localtrapintegral = localtrapintegral * h;
//time end here, add elapsed to global
return localtrapintegral;
}
double doFunction(double x)
{
double result;
result = x*x*x;
return result;
}
output:
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
time elapsed of each thread 0.010000
time elapsed of each thread 0.010000
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
Trap rule result with 200000 trap(s) is 40000.000001
no lock completed in 0.020000
As you can see, for whatever reason only some of the threads are actually returning a time. I ran this multiple times, and every time only a few threads returned a result. FYI, gnolock is the variable that stores the total elapsed time. My guess as to why this isn't working is that the decimal value is out of range, but it shouldn't be?

On your system, clock() has a resolution of 10 ms. So if a piece of work takes 2 ms, it will usually be reported as 0.00 s or 0.01 s, depending on a bunch of things which you have no control over.
Use one of the high-resolution clocks instead. You can use clock_gettime with CLOCK_THREAD_CPUTIME_ID or CLOCK_PROCESS_CPUTIME_ID; I believe the resolution of these clocks is several orders of magnitude better than that of clock().
See man 2 clock_gettime for more information.
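A minimal sketch of that suggestion, assuming a POSIX system where CLOCK_THREAD_CPUTIME_ID is supported. The helper names timespec_diff_seconds and thread_cpu_seconds are made up for illustration and are not part of any library:

```c
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime() in <time.h> */
#include <time.h>

/* Difference end - start in seconds, as a double. */
double timespec_diff_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* CPU time consumed so far by the calling thread, in seconds.
 * Returns -1.0 if the clock is not supported on this system. */
double thread_cpu_seconds(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) != 0)
        return -1.0;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}
```

In threadCalc, one could call thread_cpu_seconds() at the top and again just before returning, and add the difference to gnolock. On older glibc versions, clock_gettime may additionally require linking with -lrt.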

The most likely explanation is that your clock tick is too coarse for the elapsed time you are trying to measure. Most of the time, the start and stop clock values are the same. Occasionally, by chance, a clock tick occurs during your thread's execution and you see 1 tick. (This is effectively what Dietrich said above.)
As an example of what this means, imagine your thread takes an hour to complete and your clock ticks once a day, at midnight. Most of the time, the thread starts and ends on the same day. But if you happen to run it within an hour of midnight, you will see the start and stop on different days (1 tick). What you need, then, is a faster clock, but such a clock might well not be available.
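Whether a finer clock is available can be checked at run time. The sketch below assumes a POSIX system with clock_getres; the wrapper name clock_resolution_seconds is hypothetical:

```c
#define _POSIX_C_SOURCE 200809L  /* expose clock_getres() in <time.h> */
#include <time.h>

/* Resolution of the given clock in seconds, or -1.0 if the
 * clock id is not supported. */
double clock_resolution_seconds(clockid_t id)
{
    struct timespec res;
    if (clock_getres(id, &res) != 0)
        return -1.0;
    return (double)res.tv_sec + (double)res.tv_nsec / 1e9;
}
```

Calling clock_resolution_seconds(CLOCK_MONOTONIC) or clock_resolution_seconds(CLOCK_THREAD_CPUTIME_ID) and printing the result with %g shows how small an interval each clock can, in principle, distinguish.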

You are using the wrong tool. clock() doesn't measure elapsed time. From the man page:
The clock() function returns an approximation of processor time used by the program.
These are two completely different things. Perhaps your threads don't use much processor time.
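The difference between processor time and elapsed time is easy to see with a sleep, during which time passes but almost no CPU is used. This is only a sketch, assuming POSIX nanosleep and CLOCK_MONOTONIC; measure_sleep is a made-up name:

```c
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime() and nanosleep() */
#include <time.h>

/* Sleeps for sleep_ms milliseconds and reports how much CPU time
 * (clock()) and how much wall-clock time (CLOCK_MONOTONIC) passed. */
void measure_sleep(long sleep_ms, double *cpu_s, double *wall_s)
{
    struct timespec req;
    struct timespec w0, w1;

    req.tv_sec = sleep_ms / 1000;
    req.tv_nsec = (sleep_ms % 1000) * 1000000L;

    clock_t c0 = clock();
    clock_gettime(CLOCK_MONOTONIC, &w0);
    nanosleep(&req, NULL);                /* the process is idle here */
    clock_gettime(CLOCK_MONOTONIC, &w1);
    clock_t c1 = clock();

    *cpu_s  = (double)(c1 - c0) / CLOCKS_PER_SEC;
    *wall_s = (double)(w1.tv_sec - w0.tv_sec)
            + (double)(w1.tv_nsec - w0.tv_nsec) / 1e9;
}
```

After measure_sleep(100, &cpu, &wall), wall should be roughly 0.1 while cpu stays close to zero, because a sleeping process consumes essentially no processor time.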

Related

Pragma omp parallel overhead

I have a problem with a #pragma omp parallel section in my code.
I have a program which should sort a given array of integers with quicksort using multiple threads. For this, in every step every thread gets assigned a portion of the array, partitions it, and returns how many elements are smaller than a given global pivot. The code executes without errors, but the more threads I tell OpenMP to use, the slower it executes. I added logging for the execution times, and it seems like a huge part of the program's time is spent on OpenMP overhead. The overhead seems to be consistent, so the speed difference is proportional to the size of the array to sort.
Here is the code which is executed in parallel:
void create_count_elems_lower(int lower, int upper, int global_pivot_position, int *block_sizes, int *data,
int *count_elems_lower) {
assert(lower >= 0);
times_function_called++;
int pivot = data[global_pivot_position];
double start = omp_get_wtime();
double wait_start = 0;
double wait_time = 0;
#pragma omp parallel for
for (int i = 0; i < omp_get_max_threads(); ++i) {
double start_thread = omp_get_wtime();
int lower_p = lower;
lower_p += i == 0 ? 0 : block_sizes[i - 1] * i;
count_elems_lower[i] = partition_fixed_pivot(lower_p, lower_p + block_sizes[i], pivot, data) -
lower_p; // - lower_p since it needs to be relative
assert(count_elems_lower[i] >= 0);
double end_thread = omp_get_wtime();
double time_spent = end_thread - start_thread;
time_spent_per_thread_sum += time_spent;
if (max_time_spent_per_thread[i] < time_spent) {
max_time_spent_per_thread[i] = time_spent;
}
if (wait_start == 0) {
wait_start = end_thread;
} else {
double time_waiting = end_thread - wait_start;
if (time_waiting > wait_time) {
wait_time = time_waiting;
}
}
}
double end = omp_get_wtime();
time_spent_in_function += end - start;
time_spent_idling += wait_time;
}
And here the testing function:
printf("Num threads: %d\n", num_threads);
double start = omp_get_wtime();
test_sort_big();
double end = omp_get_wtime();
printf("total: %f\n", end - start);
printf("times function called: %f\n", times_function_called);
printf("time spent in create_count_elems_lower: %f\n", time_spent_in_function);
printf("time spent per thread approx: %f\n", time_spent_per_thread_sum / num_threads);
printf("time spent idling: %f\n", time_spent_idling);
for (int i = 0; i < num_threads; ++i) {
printf("max time spent by thread %d: %f \t", i, max_time_spent_per_thread[i]);
}
The program gets compiled and linked with:
gcc -fopenmp -O3 -c -o tests/tests.o tests/tests.c
gcc -fopenmp -o build_test tests/tests.o array_utils.o datagenerator.o quicksort.o
And the results are:
Num threads: 1
Testing sorting of 10000000 Elements
total: 9.204632
times function called: 10000000.000000
time spent in create_count_elems_lower: 5.914602
time spent per thread approx: 1.610363
time spent idling: 0.000000
max time spent by thread 0: 0.041889
Num threads: 4
Testing sorting of 10000000 Elements
total: 16.955334
times function called: 10000000.000000
time spent in create_count_elems_lower: 12.598185
time spent per thread approx: 0.874607
time spent idling: 2.130419
max time spent by thread 0: 0.016055 max time spent by thread 1: 0.013543 max time spent by thread 2: 0.013532 max time spent by thread 3: 0.018599
I run Fedora 27 64 bit with an Intel® Core™ i7-2760QM CPU @ 2.40GHz × 8
Edit:
As it turned out, the overhead was the problem: the method gets called many times with only one thread. Changing the algorithm to a simple sort when only one thread is available improved the runtime a lot.
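A related way to avoid paying the parallel-region overhead on small inputs is the if clause of OpenMP's parallel directive. The sketch below is not the poster's fix. It is an illustration with a made-up function count_lower and a made-up cutoff PARALLEL_THRESHOLD that would need tuning by measurement:

```c
/* Hypothetical cutoff below which the loop stays serial. */
#define PARALLEL_THRESHOLD 100000

/* Counts the elements of data[0..n) that are smaller than pivot.
 * The if() clause keeps the loop serial for small inputs, so the cost of
 * starting a parallel region is only paid when there is enough work.
 * Without -fopenmp the pragma is ignored and the loop runs serially. */
long count_lower(const int *data, long n, int pivot)
{
    long count = 0;
    #pragma omp parallel for reduction(+:count) if(n > PARALLEL_THRESHOLD)
    for (long i = 0; i < n; ++i) {
        if (data[i] < pivot)
            ++count;
    }
    return count;
}
```

With a function that is called ten million times, as in the output above, most calls would fall below the threshold and run without any thread start-up or barrier cost.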

Sorting 2 arrays using 2 threads takes more time than sorting the 2 arrays one by one

I have 2 unsorted arrays and 2 copies of these arrays. I am using two different threads to sort the first two arrays, then I am sorting the two copies one by one. I thought the threaded version would be faster, but it's not. Why do the threads take more time?
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>
struct thread_data
{
int count;
unsigned int *arr;
};
struct thread_data thread_data_array[2];
void insertionSort(unsigned int arr[], int n)
{
int i, key, j;
for (i = 1; i < n; i++)
{
key = arr[i];
j = i-1;
while (j >= 0 && arr[j] > key)
{
arr[j+1] = arr[j];
j = j-1;
}
arr[j+1] = key;
}
}
void *sortAndMergeArrays(void *threadarg)
{
int count;
unsigned int *arr;
struct thread_data *my_data;
my_data = (struct thread_data *) threadarg;
count = my_data->count;
arr = my_data->arr;
insertionSort(arr, count);
pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
int count, i, rc;
clock_t start, end, total_t;
pthread_t threads[2];
//get the loop count. If loop count is not provided take 10000 as default loop count.
if(argc == 2){
count = atoi(argv[1]);
}
else{
count = 10000;
}
unsigned int arr1[count], arr2[count], copyArr1[count], copyArr2[count];
srand(time(0));
for(i = 0; i<count; i++){
arr1[i] = rand();
arr2[i] = rand();
copyArr1[i] = arr1[i];
copyArr2[i] = arr2[i];
}
start = clock();
for(int t=0; t<2; t++) {
thread_data_array[t].count = count;
if(t==0)
thread_data_array[t].arr = arr1;
else
thread_data_array[t].arr = arr2;
rc = pthread_create(&threads[t], NULL, sortAndMergeArrays, (void *) &thread_data_array[t]);
if (rc) {
printf("ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
}
pthread_join(threads[0], NULL);
pthread_join(threads[1], NULL);
end = clock();
total_t = (double)(end - start);
printf("Total time taken by CPU to sort using threads: %d\n", total_t);
start = clock();
insertionSort(copyArr1, count);
insertionSort(copyArr2, count);
end = clock();
total_t = (double)(end - start);
printf("Total time taken by CPU to sort sequentially: %d\n", total_t);
pthread_exit(NULL);
}
I am using a Linux server to execute the code. First I randomly populate the arrays and copy them to two separate arrays. For the first two arrays I create two threads using pthreads and pass the two arrays to them; each thread uses insertion sort to sort its array. The other two arrays I just sort one by one.
I expected that using threads would reduce the execution time, but it actually takes more time.
Diagnosis
The reason you get practically the same time — and slightly more time from the threaded code than from the sequential code — is that clock() measures CPU time, and the two ways of sorting take almost the same amount of CPU time because they're doing the same job (and the threading number is probably slightly bigger because of the time to setup and tear down threads).
POSIX:
The clock() function shall return the implementation's best approximation to the processor time used by the process since the beginning of an implementation-defined era related only to the process invocation.
BSD (macOS) man page:
The clock() function determines the amount of processor time used since the invocation of the calling process, measured in CLOCKS_PER_SECs of a second.
The amount of CPU time it takes to sort the two arrays is basically the same; the difference is the overhead of threading (more or less).
Revised code
I have a set of functions that can use clock_gettime() instead (code in timer.c and timer.h at GitHub). These measure wall clock time — elapsed time, not CPU time.
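For readers without those files, a wall-clock timer can be sketched directly on top of clock_gettime. The struct and function names below are hypothetical and are not the Clock API from timer.h; the sketch assumes a POSIX system with CLOCK_MONOTONIC:

```c
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime() in <time.h> */
#include <time.h>

/* Hypothetical stopwatch; NOT the Clock type from the answer's timer.h. */
struct stopwatch {
    struct timespec start;
    struct timespec stop;
};

void stopwatch_start(struct stopwatch *sw)
{
    clock_gettime(CLOCK_MONOTONIC, &sw->start);
}

void stopwatch_stop(struct stopwatch *sw)
{
    clock_gettime(CLOCK_MONOTONIC, &sw->stop);
}

/* Elapsed wall-clock time between start and stop, in microseconds. */
long long stopwatch_elapsed_us(const struct stopwatch *sw)
{
    long long sec  = (long long)sw->stop.tv_sec  - (long long)sw->start.tv_sec;
    long long nsec = (long long)sw->stop.tv_nsec - (long long)sw->start.tv_nsec;
    return (sec * 1000000000LL + nsec) / 1000;
}
```

Wrapping the pthread_create/pthread_join section in stopwatch_start and stopwatch_stop gives elapsed time, which can then be compared with the CPU time reported by clock().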
Here's a mildly tweaked version of your code — the substantive changes were changing the type of key in the sort function from int to unsigned int to match the data in the array, and to fix the conversion specification of %d to %ld to match the type identified by GCC as clock_t. I mildly tweaked the argument handling, and the timing messages so that they're consistent in length, and added the elapsed time measurement code:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>
#include "timer.h"
struct thread_data
{
int count;
unsigned int *arr;
};
struct thread_data thread_data_array[2];
static
void insertionSort(unsigned int arr[], int n)
{
for (int i = 1; i < n; i++)
{
unsigned int key = arr[i];
int j = i - 1;
while (j >= 0 && arr[j] > key)
{
arr[j + 1] = arr[j];
j = j - 1;
}
arr[j + 1] = key;
}
}
static
void *sortAndMergeArrays(void *threadarg)
{
int count;
unsigned int *arr;
struct thread_data *my_data;
my_data = (struct thread_data *)threadarg;
count = my_data->count;
arr = my_data->arr;
insertionSort(arr, count);
pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
int count = 10000;
int i, rc;
clock_t start, end, total_t;
pthread_t threads[2];
// get the loop count. If loop count is not provided take 10000 as default loop count.
if (argc == 2)
count = atoi(argv[1]);
unsigned int arr1[count], arr2[count], copyArr1[count], copyArr2[count];
srand(time(0));
for (i = 0; i < count; i++)
{
arr1[i] = rand();
arr2[i] = rand();
copyArr1[i] = arr1[i];
copyArr2[i] = arr2[i];
}
Clock clk;
clk_init(&clk);
start = clock();
clk_start(&clk);
for (int t = 0; t < 2; t++)
{
thread_data_array[t].count = count;
if (t == 0)
thread_data_array[t].arr = arr1;
else
thread_data_array[t].arr = arr2;
rc = pthread_create(&threads[t], NULL, sortAndMergeArrays, (void *)&thread_data_array[t]);
if (rc)
{
printf("ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
}
pthread_join(threads[0], NULL);
pthread_join(threads[1], NULL);
clk_stop(&clk);
end = clock();
char buffer[32];
printf("Elapsed using threads: %s s\n", clk_elapsed_us(&clk, buffer, sizeof(buffer)));
total_t = (double)(end - start);
printf("CPU time using threads: %ld\n", total_t);
start = clock();
clk_start(&clk);
insertionSort(copyArr1, count);
insertionSort(copyArr2, count);
clk_stop(&clk);
end = clock();
printf("Elapsed sequentially: %s s\n", clk_elapsed_us(&clk, buffer, sizeof(buffer)));
total_t = (double)(end - start);
printf("CPU time sequentially: %ld\n", total_t);
return 0;
}
Results
Example runs (program inssortthread23) — run on a MacBook Pro (15" 2016) with 16 GiB RAM and 2.7 GHz Intel Core i7 CPU, running macOS High Sierra 10.13, using GCC 7.2.0 for compilation.
I had routine background programs running — e.g. browser not being actively used, no music or videos playing, no downloads in progress etc. (These things matter for benchmarking.)
$ inssortthread23 100000
Elapsed using threads: 1.060299 s
CPU time using threads: 2099441
Elapsed sequentially: 2.146059 s
CPU time sequentially: 2138465
$ inssortthread23 200000
Elapsed using threads: 4.332935 s
CPU time using threads: 8616953
Elapsed sequentially: 8.496348 s
CPU time sequentially: 8469327
$ inssortthread23 300000
Elapsed using threads: 9.984021 s
CPU time using threads: 19880539
Elapsed sequentially: 20.000900 s
CPU time sequentially: 19959341
$
Conclusions
Here, you can clearly see that:
The elapsed time is approximately twice as long for the non-threaded code as for the threaded code.
The CPU time for the threaded and non-threaded code is almost the same.
The overall time is quadratic in the number of rows sorted.
All of which are very much in line with expectations — once you realize that clock() is measuring CPU time, not elapsed time.
Minor puzzle
You can also see that I'm getting the threaded CPU time as slightly smaller than the CPU time for sequential sorts, some of the time. I don't have an explanation for that — I deem it 'lost in the noise', though the effect is persistent:
$ inssortthread23 100000
Elapsed using threads: 1.051229 s
CPU time using threads: 2081847
Elapsed sequentially: 2.138538 s
CPU time sequentially: 2132083
$ inssortthread23 100000
Elapsed using threads: 1.053656 s
CPU time using threads: 2089886
Elapsed sequentially: 2.128908 s
CPU time sequentially: 2122983
$ inssortthread23 100000
Elapsed using threads: 1.058283 s
CPU time using threads: 2093644
Elapsed sequentially: 2.126402 s
CPU time sequentially: 2120625
$
$ inssortthread23 200000
Elapsed using threads: 4.259660 s
CPU time using threads: 8479978
Elapsed sequentially: 8.872929 s
CPU time sequentially: 8843207
$ inssortthread23 200000
Elapsed using threads: 4.463954 s
CPU time using threads: 8883267
Elapsed sequentially: 8.603401 s
CPU time sequentially: 8580240
$ inssortthread23 200000
Elapsed using threads: 4.227154 s
CPU time using threads: 8411582
Elapsed sequentially: 8.816412 s
CPU time sequentially: 8797965
$

Why does using OpenMP make the code slower to run?

I have the following code that uses OpenMP to parallelize a Monte Carlo method. My question is: why does the serial version of the code (monte_carlo_serial) run a lot faster than the parallel version (monte_carlo_parallel)? I am running the code on a machine with 32 cores and get the following result printed to the console:
-bash-4.1$ gcc -fopenmp hello.c ;
-bash-4.1$ ./a.out
Pi (Serial): 3.140856
Time taken 0 seconds 50 milliseconds
Pi (Parallel): 3.132103
Time taken 127 seconds 990 milliseconds
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <time.h>
int niter = 1000000; //number of iterations per FOR loop
int monte_carlo_parallel() {
double x,y; //x,y value for the random coordinate
int i; //loop counter
int count=0; //Count holds all the number of how many good coordinates
double z; //Used to check if x^2+y^2<=1
double pi; //holds approx value of pi
int numthreads = 32;
#pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads)
{
srand48((int)time(NULL) ^ omp_get_thread_num()); //Give random() a seed value
for (i=0; i<niter; ++i) //main loop
{
x = (double)drand48(); //gets a random x coordinate
y = (double)drand48(); //gets a random y coordinate
z = ((x*x)+(y*y)); //Checks to see if number is inside unit circle
if (z<=1)
{
++count; //if it is, consider it a valid random point
}
}
}
pi = ((double)count/(double)(niter*numthreads))*4.0;
printf("Pi (Parallel): %f\n", pi);
return 0;
}
int monte_carlo_serial(){
double x,y; //x,y value for the random coordinate
int i; //loop counter
int count=0; //Count holds all the number of how many good coordinates
double z; //Used to check if x^2+y^2<=1
double pi; //holds approx value of pi
srand48((int)time(NULL) ^ omp_get_thread_num()); //Give random() a seed value
for (i=0; i<niter; ++i) //main loop
{
x = (double)drand48(); //gets a random x coordinate
y = (double)drand48(); //gets a random y coordinate
z = ((x*x)+(y*y)); //Checks to see if number is inside unit circle
if (z<=1)
{
++count; //if it is, consider it a valid random point
}
}
pi = ((double)count/(double)(niter))*4.0;
printf("Pi (Serial): %f\n", pi);
return 0;
}
void main(){
clock_t start = clock(), diff;
monte_carlo_serial();
diff = clock() - start;
int msec = diff * 1000 / CLOCKS_PER_SEC;
printf("Time taken %d seconds %d milliseconds \n", msec/1000, msec%1000);
start = clock(), diff;
monte_carlo_parallel();
diff = clock() - start;
msec = diff * 1000 / CLOCKS_PER_SEC;
printf("Time taken %d seconds %d milliseconds \n", msec/1000, msec%1000);
}
The variable count is shared across all of your spawned threads. Each of them has to lock count to increment it. In addition, if the threads are running on separate CPUs (and there's no possible win if they're not), you have the cost of sending the value of count from one core to another and back again.
This is a textbook example of false sharing. In your serial version, count will be in a register and cost 1 cycle to increment. In the parallel version it will usually not be in cache: you have to tell the other cores to invalidate that cache line, fetch it (L3 is going to take 66 cycles at best), increment it, and store it back. Every time count migrates from one CPU core to another you pay a minimum cost of ~125 cycles, which is a lot worse than 1. The threads will never be able to run in parallel because they depend on count.
Try to modify your code so that each thread has its own count, then sum all values of count from all the threads at the end and you /might/ see a speedup.
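A minimal sketch of the per-thread idea, under stated assumptions. Note that in the code as posted, count already appears in a reduction clause, which gives every thread its own private copy; the state that remains shared there is the hidden generator state behind drand48(). The sketch therefore also gives each chunk of work its own generator state via erand48(). The function name count_hits and the seeding scheme are made up for illustration:

```c
#define _XOPEN_SOURCE 600   /* expose erand48() in <stdlib.h> */
#include <stdlib.h>

/* Counts random points that fall inside the unit quarter circle.
 * Each chunk owns its erand48() state and its own hit counter, so threads
 * share neither the generator nor the counter; the reduction clause sums
 * the private totals at the end. Without -fopenmp the loop runs serially. */
long count_hits(int nchunks, long iters_per_chunk, unsigned seed)
{
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int t = 0; t < nchunks; ++t) {
        /* Illustrative seeding: mix the caller's seed with the chunk index. */
        unsigned short state[3];
        state[0] = (unsigned short)(seed & 0xFFFFu);
        state[1] = (unsigned short)((seed >> 8) & 0xFFFFu);
        state[2] = (unsigned short)t;

        long hits = 0;
        for (long i = 0; i < iters_per_chunk; ++i) {
            double x = erand48(state);
            double y = erand48(state);
            if (x * x + y * y <= 1.0)
                ++hits;
        }
        total += hits;
    }
    return total;
}
```

The estimate of pi is then 4.0 * count_hits(nchunks, iters, seed) / ((double)nchunks * iters). When comparing serial and parallel timings, wall-clock time (for example omp_get_wtime()) is the fairer measure, since clock() adds up the CPU time of all threads.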

C Threads program

I wrote a program based on the idea of a Riemann sum to compute the value of an integral. It uses several threads, but its performance, compared to a sequential program I wrote later, is subpar. Algorithm-wise they are identical except for the threading, so the question is: what's wrong with it? I assume pthread_join is not the cause, because if one thread finishes sooner than the thread that join is currently waiting on, the later join on it will simply return immediately. Is that correct? The free call is probably wrong and there is no error check on thread creation; I'm aware of that, I deleted it while testing various things. Sorry for my bad English, and thanks in advance.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/types.h>
#include <time.h>
int counter = 0;
float sum = 0;
pthread_mutex_t mutx;
float function_res(float);
struct range {
float left_border;
int steps;
float step_range;
};
void *calcRespectiveRange(void *ranges) {
struct range *rangs = ranges;
float left_border = rangs->left_border;
int steps = rangs->steps;
float step_range = rangs->step_range;
free(rangs);
//printf("left: %f steps: %d step range: %f\n", left_border, steps, step_range);
int i;
float temp_sum = 0;
for(i = 0; i < steps; i++) {
temp_sum += step_range * function_res(left_border);
left_border += step_range;
}
sum += temp_sum;
pthread_exit(NULL);
}
int main() {
clock_t begin, end;
if(pthread_mutex_init(&mutx, NULL) != 0) {
printf("mutex error\n");
}
printf("enter range, amount of steps and threads: \n");
float left_border, right_border;
int steps_count;
int threads_amnt;
scanf("%f %f %d %d", &left_border, &right_border, &steps_count, &threads_amnt);
float step_range = (right_border - left_border) / steps_count;
int i;
pthread_t tid[threads_amnt];
float chunk = (right_border - left_border) / threads_amnt;
int steps_per_thread = steps_count / threads_amnt;
begin = clock();
for(i = 0; i < threads_amnt; i++) {
struct range *ranges;
ranges = malloc(sizeof(ranges));
ranges->left_border = i * chunk + left_border;
ranges->steps = steps_per_thread;
ranges->step_range = step_range;
pthread_create(&tid[i], NULL, calcRespectiveRange, (void*) ranges);
}
for(i = 0; i < threads_amnt; i++) {
pthread_join(tid[i], NULL);
}
end = clock();
pthread_mutex_destroy(&mutx);
printf("\n%f\n", sum);
double time_spent = (double) (end - begin) / CLOCKS_PER_SEC;
printf("Time spent: %lf\n", time_spent);
return(0);
}
float function_res(float lb) {
return(lb * lb + 4 * lb + 3);
}
Edit: in short - can it be improved to reduce execution time (with mutexes, for example)?
The execution time will be shortened, provided you have multiple hardware threads available.
The problem is in how you measure time: clock returns the processor time used by the program. That means it sums the time taken by all the threads. If your program runs 2 threads and each of them is busy for 1 second, clock will report the equivalent of 2 seconds, even though only about 1 second has actually passed.
To get the actual time used (on Linux), use gettimeofday. I modified your code by adding
#include <sys/time.h>
and capturing the start time before the loop:
struct timeval tv_start;
gettimeofday( &tv_start, NULL );
and after:
struct timeval tv_end;
gettimeofday( &tv_end, NULL );
and calculating the difference in seconds:
printf("CPU Time: %lf\nTime passed: %lf\n",
time_spent,
((tv_end.tv_sec * 1000*1000.0 + tv_end.tv_usec) -
(tv_start.tv_sec * 1000*1000.0 + tv_start.tv_usec)) / 1000/1000
);
(I also fixed the malloc from malloc(sizeof(ranges)) which allocates the size of a pointer (4 or 8 bytes for 32/64 bit CPU) to malloc(sizeof(struct range)) (12 bytes)).
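One common way to make this kind of allocation mistake harder to write is to take the size from the pointed-to object rather than naming a type. A small sketch, using the struct range from the question and a made-up helper make_range:

```c
#include <stdlib.h>

struct range {
    float left_border;
    int   steps;
    float step_range;
};

/* Allocates and fills one struct range; returns NULL if malloc fails.
 * sizeof *r is the size of the struct itself, not of the pointer, and it
 * stays correct even if the type of r is changed later. */
struct range *make_range(float left_border, int steps, float step_range)
{
    struct range *r = malloc(sizeof *r);
    if (r == NULL)
        return NULL;
    r->left_border = left_border;
    r->steps = steps;
    r->step_range = step_range;
    return r;
}
```

The caller (here, the thread function) remains responsible for calling free on the returned pointer.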
When running with the input parameters 0 1000000000 1000000000 1, that is, 1 billion iterations in 1 thread, the output on my machine is:
CPU Time: 4.352000
Time passed: 4.400006
When running with 0 1000000000 1000000000 2, that is, 1 billion iterations spread over 2 threads (500 million iterations each), the output is:
CPU Time: 4.976000
Time passed: 2.500003
For completeness sake, I tested it with the input 0 1000000000 1000000000 4:
CPU Time: 8.236000
Time passed: 2.180114
It is a little faster, but not twice as fast as with 2 threads, and it uses double the CPU time. This is because my CPU is a Core i3, a dual-core with hyperthreading: the extra logical processors share the two physical cores rather than being cores of their own.

time execution can show up

#include <conio.h>
#include <stdio.h>
#include<time.h>
double multi();
void main()
{
clrscr();
clock_t start = clock();
for (int i = 0; i < 1000; i++)
{
multi();
//printf("Answer (%d)",s);
}
clock_t end = clock();
float diff;
diff = (float) (end - start) / CLOCKS_PER_SEC;
printf("time execution :%f", diff);
getch();
}
double multi()
{
double a;
a = 5 * 5;
return a;
}
The execution time appears as 0.000000. What is the problem?
Could it be because the time is in the nanosecond range?
The man page for the clock() function says:
The clock() function returns an approximation of processor time used by the program.
Approximation, so it's not going to be exact, it depends on the granularity of your system. So for starters you can check the granularity of clock() on your system with something like:
clock_t start = clock(), end;
/* spin until the value returned by clock() changes */
while ((end = clock()) == start)
    ;
double diff = (double)(end - start) / CLOCKS_PER_SEC;
printf("best time: %f\n", diff);
Doing this, I get 0.001 (which is 1 ms), so for anything that takes less than 1 ms I will get back "0" instead. That's what's happening to you: your code runs faster than the granularity of clock(), so you're getting back the best approximation, which happens to be "0".
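One way around a coarse clock is to time a large batch of iterations and divide by the iteration count. The sketch below is an illustration only: seconds_per_iteration is a made-up name, and the volatile accumulator is there so the compiler does not remove the loop:

```c
#include <time.h>

/* Average CPU time per iteration, in seconds, measured over a whole batch
 * so that the total is (hopefully) much larger than one clock() tick.
 * iterations must be greater than zero. */
double seconds_per_iteration(long iterations)
{
    volatile double sink = 0.0;   /* volatile keeps the loop from being optimized away */
    clock_t start = clock();
    for (long i = 0; i < iterations; ++i)
        sink = sink + 5.0 * 5.0;
    clock_t end = clock();
    (void)sink;
    return ((double)(end - start) / CLOCKS_PER_SEC) / (double)iterations;
}
```

If the result is still 0, the batch is too small for the clock in use, and the iteration count should be increased until the total spans many ticks.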

Resources