I wrote a program based on the Riemann sum to compute the value of an integral. It uses several threads, but its performance, compared to the sequential program I wrote later, is subpar. Algorithm-wise they are identical apart from the threading, so the question is: what's wrong with it? I assume pthread_join is not the cause, because if one thread finishes sooner than the thread that join is currently waiting on, the later join for it will simply return immediately. Is that correct? The free call is probably wrong and there is no error checking when creating threads; I'm aware of that, I removed it while testing various things. Sorry for my bad English and thanks in advance.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/types.h>
#include <time.h>
int counter = 0;
float sum = 0;
pthread_mutex_t mutx;
float function_res(float);
struct range {
float left_border;
int steps;
float step_range;
};
void *calcRespectiveRange(void *ranges) {
struct range *rangs = ranges;
float left_border = rangs->left_border;
int steps = rangs->steps;
float step_range = rangs->step_range;
free(rangs);
//printf("left: %f steps: %d step range: %f\n", left_border, steps, step_range);
int i;
float temp_sum = 0;
for(i = 0; i < steps; i++) {
temp_sum += step_range * function_res(left_border);
left_border += step_range;
}
sum += temp_sum;
pthread_exit(NULL);
}
int main() {
clock_t begin, end;
if(pthread_mutex_init(&mutx, NULL) != 0) {
printf("mutex error\n");
}
printf("enter range, amount of steps and threads: \n");
float left_border, right_border;
int steps_count;
int threads_amnt;
scanf("%f %f %d %d", &left_border, &right_border, &steps_count, &threads_amnt);
float step_range = (right_border - left_border) / steps_count;
int i;
pthread_t tid[threads_amnt];
float chunk = (right_border - left_border) / threads_amnt;
int steps_per_thread = steps_count / threads_amnt;
begin = clock();
for(i = 0; i < threads_amnt; i++) {
struct range *ranges;
ranges = malloc(sizeof(ranges));
ranges->left_border = i * chunk + left_border;
ranges->steps = steps_per_thread;
ranges->step_range = step_range;
pthread_create(&tid[i], NULL, calcRespectiveRange, (void*) ranges);
}
for(i = 0; i < threads_amnt; i++) {
pthread_join(tid[i], NULL);
}
end = clock();
pthread_mutex_destroy(&mutx);
printf("\n%f\n", sum);
double time_spent = (double) (end - begin) / CLOCKS_PER_SEC;
printf("Time spent: %lf\n", time_spent);
return(0);
}
float function_res(float lb) {
return(lb * lb + 4 * lb + 3);
}
Edit: in short - can it be improved to reduce execution time (with mutexes, for example)?
The execution time will be shortened, provided you have multiple hardware threads available.
The problem is in how you measure time: clock returns the processor time used by the program. That means it sums the time taken by all the threads. If your program uses 2 threads, and its linear execution time is 1 second, that means that each thread has used 1 second of CPU time, and clock will return the equivalent of 2 seconds.
To get the actual time used (on Linux), use gettimeofday. I modified your code by adding
#include <sys/time.h>
and capturing the start time before the loop:
struct timeval tv_start;
gettimeofday( &tv_start, NULL );
and after:
struct timeval tv_end;
gettimeofday( &tv_end, NULL );
and calculating the difference in seconds:
printf("CPU Time: %lf\nTime passed: %lf\n",
time_spent,
((tv_end.tv_sec * 1000*1000.0 + tv_end.tv_usec) -
(tv_start.tv_sec * 1000*1000.0 + tv_start.tv_usec)) / 1000/1000
);
(I also fixed the malloc from malloc(sizeof(ranges)) which allocates the size of a pointer (4 or 8 bytes for 32/64 bit CPU) to malloc(sizeof(struct range)) (12 bytes)).
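One way to avoid this class of bug is to size the allocation from the pointed-to object instead of naming a type. A minimal sketch, reusing struct range from the question; the helper name make_range is hypothetical and not part of the original code:

```c
#include <stdlib.h>

struct range {
    float left_border;
    int steps;
    float step_range;
};

/* Allocate and fill one work description for a thread.
 * sizeof *r is the size of the struct itself, not of the pointer,
 * and it stays correct even if the type of r changes later. */
struct range *make_range(float left_border, int steps, float step_range)
{
    struct range *r = malloc(sizeof *r);
    if (r == NULL)
        return NULL;            /* caller must handle allocation failure */
    r->left_border = left_border;
    r->steps = steps;
    r->step_range = step_range;
    return r;
}
```

The thread function can keep calling free on the pointer it receives, as in the question.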
When running with the input parameters 0 1000000000 1000000000 1, that is, 1 billion iterations in 1 thread, the output on my machine is:
CPU Time: 4.352000
Time passed: 4.400006
When running with 0 1000000000 1000000000 2, that is, 1 billion iterations spread over 2 threads (500 million iterations each), the output is:
CPU Time: 4.976000
Time passed: 2.500003
For completeness' sake, I tested it with the input 0 1000000000 1000000000 4:
CPU Time: 8.236000
Time passed: 2.180114
It is a little faster, but not twice as fast as with 2 threads, and it uses almost double the CPU time. This is because my CPU is a Core i3, a dual-core with hyperthreading: the two extra logical threads share each core's execution units, so they aren't true hardware cores.
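The elapsed-time expression used in the printf above can be factored into a small helper. This is only a sketch; the function name timeval_diff_seconds is made up for illustration:

```c
#include <sys/time.h>

/* Seconds elapsed between two gettimeofday() samples.
 * Subtracting field by field before scaling keeps the
 * intermediate values small and the result precise. */
double timeval_diff_seconds(const struct timeval *start,
                            const struct timeval *end)
{
    double seconds = (double)(end->tv_sec - start->tv_sec);
    double micros  = (double)end->tv_usec - (double)start->tv_usec;
    return seconds + micros / 1e6;
}
```

With the variables from the snippets above, the call would be timeval_diff_seconds(&tv_start, &tv_end).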
I have tested OpenMP and MPI parallel implementations of the inner product of two vectors (element values are computed on the fly) and found that OpenMP is slower than MPI.
The MPI code I am using is as follows:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
#include <mpi.h>
int main(int argc, char* argv[])
{
double ttime = -omp_get_wtime();
int np, my_rank;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &np);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
int n = 10000;
int repeat = 10000;
int sublength = (int)(ceil((double)(n) / (double)(np)));
int nstart = my_rank * sublength;
int nend = nstart + sublength;
if (nend >n )
{
nend = n;
sublength = nend - nstart;
}
double dot = 0;
double sum = 1;
int j, k;
double time = -omp_get_wtime();
for (j = 0; j < repeat; j++)
{
double loc_dot = 0;
for (k = 0; k < sublength; k++)
{
double temp = sin((sum+ nstart +k +j)/(double)(n));
loc_dot += (temp * temp);
}
MPI_Allreduce(&loc_dot, &dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
sum += (dot/(double)(n));
}
time += omp_get_wtime();
if (my_rank == 0)
{
ttime += omp_get_wtime();
printf("np = %d sum = %f, loop time = %f sec, total time = %f \n", np, sum, time, ttime);
}
MPI_Finalize();
return 0;
}
I have tried several different implementations with OpenMP.
Here is the version that is not too complicated and is close to the best performance I can achieve.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
int main(int argc, char* argv[])
{
int n = 10000;
int repeat = 10000;
int np = 1;
if (argc > 1)
{
np = atoi(argv[1]);
}
omp_set_num_threads(np);
int nstart =0;
int sublength =n;
double loc_dot = 0;
double sum = 1;
#pragma omp parallel
{
int i, j, k;
double time = -omp_get_wtime();
for (j = 0; j < repeat; j++)
{
#pragma omp for reduction(+: loc_dot)
for (k = 0; k < sublength; k++)
{
double temp = sin((sum+ nstart +k +j)/(double)(n));
loc_dot += (temp * temp);
}
#pragma omp single
{
sum += (loc_dot/(double)(n));
loc_dot =0;
}
}
time += omp_get_wtime();
#pragma omp single nowait
printf("sum = %f, time = %f sec, np = %d\n", sum, time, np);
}
return 0;
}
Here are my test results:
OMP
sum = 6992.953984, time = 0.409850 sec, np = 1
sum = 6992.953984, time = 0.270875 sec, np = 2
sum = 6992.953984, time = 0.186024 sec, np = 4
sum = 6992.953984, time = 0.144010 sec, np = 8
sum = 6992.953984, time = 0.115188 sec, np = 16
sum = 6992.953984, time = 0.195485 sec, np = 32
MPI
sum = 6992.953984, time = 0.381701 sec, np = 1
sum = 6992.953984, time = 0.243513 sec, np = 2
sum = 6992.953984, time = 0.158326 sec, np = 4
sum = 6992.953984, time = 0.102489 sec, np = 8
sum = 6992.953984, time = 0.063975 sec, np = 16
sum = 6992.953984, time = 0.044748 sec, np = 32
Can anyone tell me what I am missing?
thanks!
update:
I have written an acceptable reduce function for OMP. The performance is now close to that of the MPI reduce function. The code is as follows.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
double darr[2][64];
int nreduce=0;
#pragma omp threadprivate(nreduce)
double OMP_Allreduce_dsum(double loc_dot,int tid,int np)
{
darr[nreduce][tid]=loc_dot;
#pragma omp barrier
double dsum =0;
int i;
for (i=0; i<np; i++)
{
dsum += darr[nreduce][i];
}
nreduce=1-nreduce;
return dsum;
}
int main(int argc, char* argv[])
{
int np = 1;
if (argc > 1)
{
np = atoi(argv[1]);
}
omp_set_num_threads(np);
double ttime = -omp_get_wtime();
int n = 10000;
int repeat = 10000;
#pragma omp parallel
{
int tid = omp_get_thread_num();
int sublength = (int)(ceil((double)(n) / (double)(np)));
int nstart = tid * sublength;
int nend = nstart + sublength;
if (nend >n )
{
nend = n;
sublength = nend - nstart;
}
double sum = 1;
double time = -omp_get_wtime();
int j, k;
for (j = 0; j < repeat; j++)
{
double loc_dot = 0;
for (k = 0; k < sublength; k++)
{
double temp = sin((sum+ nstart +k +j)/(double)(n));
loc_dot += (temp * temp);
}
double dot =OMP_Allreduce_dsum(loc_dot,tid,np);
sum +=(dot/(double)(n));
}
time += omp_get_wtime();
#pragma omp master
{
ttime += omp_get_wtime();
printf("np = %d sum = %f, loop time = %f sec, total time = %f \n", np, sum, time, ttime);
}
}
return 0;
}
First of all, this code is very sensitive to synchronization overheads (both software and hardware), which leads to apparently strange behaviors that depend on both the OpenMP runtime implementation and low-level processor operations (e.g. cache/bus effects). Indeed, a full synchronization is required for each iteration of the j-based loop, and the whole loop (10,000 iterations) executes in about 45 ms. This means 4.5 us/iteration. In such a short time, the partial sums spread over 32 cores need to be reduced and broadcast. If each core accumulates its own value in a shared atomic location, taking for example 60 ns per atomic add (a realistic overhead for atomics on scalable Xeon processors), it would take 32 * 60 ns = 1.92 us, since this process is done sequentially on x86 processors so far. This small additional time represents an overhead of 43% on the overall execution time, just for synchronization! Due to contention on atomic variables, timings are often much worse. Moreover, the barriers themselves are expensive (they are often implemented using atomics in OpenMP runtimes, but in a way that could scale a bit better).
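The arithmetic behind these figures, written out. The 45 ms total corresponds to the MPI loop time at np = 32 in the question's results; the 60 ns cost per atomic add is the assumed value stated above, not a measurement:

```latex
\begin{aligned}
t_{\text{iter}} &= \frac{45\ \text{ms}}{10\,000\ \text{iterations}} = 4.5\ \mu\text{s} \\
t_{\text{atomics}} &= 32 \times 60\ \text{ns} = 1.92\ \mu\text{s} \\
\frac{t_{\text{atomics}}}{t_{\text{iter}}} &= \frac{1.92}{4.5} \approx 0.43
\end{aligned}
```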
The first OpenMP implementation was slow because of implicit synchronizations and complex hardware cache effects. Indeed, the omp for reduction directive performs an implicit barrier at the end of its region, as does omp single. The reduction itself can be implemented in several ways. The OpenMP runtime of ICC uses a clever tree-based atomic implementation which should scale quite well (but not perfectly). Moreover, the omp single section will cause some cache-line bouncing. Indeed, the result loc_dot will likely be stored in the cache of the last core that updated it, while the thread executing this section will likely be scheduled on another core. In this case, the processor has to move the cache line from one L2 cache to another (or load the value from the L3 cache directly, depending on the hardware state). The same thing also applies to sum (which tends to move between cores, as the thread executing the section will likely not always be scheduled on the same core). Finally, the sum variable must be broadcast to each core so they can start a new iteration.
The last OpenMP implementation is significantly better since every thread works on its own local data, it uses only one barrier (this synchronization is mandatory given the algorithm), and caches are better used. The accumulation part may not be ideal, as all cores will likely fetch data previously located in all the other L1/L2 caches, causing an all-to-all broadcast pattern. This hardware operation scales poorly, but it should not be fully sequential either.
Note that the last OpenMP implementation suffers from false sharing. Indeed, items of darr are stored contiguously in memory and share the same cache line. As a result, when a thread writes to darr, the associated core requests the cache line and invalidates the copies located on other cores. This causes cache-line bouncing between cores. However, on current x86 processors, cache lines are 64 bytes wide and a double takes 8 bytes, resulting in 8 items per cache line. This typically limits the cache-line bouncing to groups of 8 cores out of the 32. That being said, the item packing has some benefits, as only 4 cache-line fetches are required per core to perform the global accumulation. To prevent false sharing, one can allocate an (8 times) bigger array and reserve some space between items so that 1 item is stored per cache line. The best strategy on your target processor may be to use a tree-based atomic reduction like the one the ICC OpenMP runtime uses. Ideally, the sum reduction and the barrier can be merged together for better performance. This is what the MPI implementation can do internally (MPI_Allreduce).
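A minimal sketch of the padded layout described above, assuming 64-byte cache lines. The names padded_double, padded_darr, store_partial and sum_partials are hypothetical and only mirror darr from the question:

```c
#define CACHE_LINE_BYTES 64   /* assumption: 64-byte lines, as on current x86 */
#define MAX_THREADS 64        /* same capacity as darr in the code above */

/* One partial sum per cache line: the padding keeps neighbouring
 * slots from ever sharing a line. */
struct padded_double {
    double value;
    char pad[CACHE_LINE_BYTES - sizeof(double)];
};

/* Two buffers, as in darr[2][64]; aligned so slot 0 starts on a line boundary. */
static _Alignas(CACHE_LINE_BYTES) struct padded_double padded_darr[2][MAX_THREADS];

/* Store one thread's partial sum in its private slot. */
void store_partial(int buffer, int tid, double value)
{
    padded_darr[buffer][tid].value = value;
}

/* Sum the first np slots of a buffer (call only after the barrier). */
double sum_partials(int buffer, int np)
{
    double total = 0.0;
    for (int i = 0; i < np; i++)
        total += padded_darr[buffer][i].value;
    return total;
}
```

The trade-off is the one mentioned above: writes no longer invalidate other threads' slots, but each thread now touches np cache lines during the accumulation instead of np/8.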
Note that all implementations suffer from very frequent thread synchronization. This is a problem because context switches regularly occur on some cores due to operating-system/hardware events (network, storage device, user, system processes, etc.). One critical issue is frequency scaling on any modern x86 processor: not all cores run at the same frequency, and their frequencies change over time. The slowest thread will slow down all the others because of the barrier. In the worst case, some threads may wait passively, allowing some cores to sleep (C-states) and then take more time to wake up, slowing the others down further depending on the platform configuration.
The takeaway is:
the more synchronized a code is, the lower its scaling and the more challenging its optimization.
I have 2 unsorted arrays and 2 copies of these arrays. I am using two different threads to sort two of the arrays, then I am sorting the other two unsorted arrays one by one. I thought the threaded version would be faster, but it's not. Why do the threads take more time?
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>
struct thread_data
{
int count;
unsigned int *arr;
};
struct thread_data thread_data_array[2];
void insertionSort(unsigned int arr[], int n)
{
int i, key, j;
for (i = 1; i < n; i++)
{
key = arr[i];
j = i-1;
while (j >= 0 && arr[j] > key)
{
arr[j+1] = arr[j];
j = j-1;
}
arr[j+1] = key;
}
}
void *sortAndMergeArrays(void *threadarg)
{
int count;
unsigned int *arr;
struct thread_data *my_data;
my_data = (struct thread_data *) threadarg;
count = my_data->count;
arr = my_data->arr;
insertionSort(arr, count);
pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
int count, i, rc;
clock_t start, end, total_t;
pthread_t threads[2];
//get the loop count. If loop count is not provided take 10000 as default loop count.
if(argc == 2){
count = atoi(argv[1]);
}
else{
count = 10000;
}
unsigned int arr1[count], arr2[count], copyArr1[count], copyArr2[count];
srand(time(0));
for(i = 0; i<count; i++){
arr1[i] = rand();
arr2[i] = rand();
copyArr1[i] = arr1[i];
copyArr2[i] = arr2[i];
}
start = clock();
for(int t=0; t<2; t++) {
thread_data_array[t].count = count;
if(t==0)
thread_data_array[t].arr = arr1;
else
thread_data_array[t].arr = arr2;
rc = pthread_create(&threads[t], NULL, sortAndMergeArrays, (void *) &thread_data_array[t]);
if (rc) {
printf("ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
}
pthread_join(threads[0], NULL);
pthread_join(threads[1], NULL);
end = clock();
total_t = (double)(end - start);
printf("Total time taken by CPU to sort using threads: %d\n", total_t);
start = clock();
insertionSort(copyArr1, count);
insertionSort(copyArr2, count);
end = clock();
total_t = (double)(end - start);
printf("Total time taken by CPU to sort sequentially: %d\n", total_t);
pthread_exit(NULL);
}
I am using a Linux server to execute the code. First I randomly populate the arrays and copy them to two separate arrays. For the first two arrays I create two threads using pthread and pass the two arrays to them; each thread uses insertion sort to sort its array. The other two arrays I just sort one by one.
I expected that using threads would reduce the execution time, but it actually takes more time.
Diagnosis
The reason you get practically the same time — and slightly more time from the threaded code than from the sequential code — is that clock() measures CPU time, and the two ways of sorting take almost the same amount of CPU time because they're doing the same job (and the threading number is probably slightly bigger because of the time to setup and tear down threads).
POSIX specification:
The clock() function shall return the implementation's best approximation to the processor time used by the process since the beginning of an implementation-defined era related only to the process invocation.
BSD (macOS) man page:
The clock() function determines the amount of processor time used since the invocation of the calling process, measured in CLOCKS_PER_SECs of a second.
The amount of CPU time it takes to sort the two arrays is basically the same; the difference is the overhead of threading (more or less).
Revised code
I have a set of functions that can use clock_gettime() instead (code in timer.c and timer.h at GitHub). These measure wall clock time — elapsed time, not CPU time.
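For readers without those files, wall-clock timing with clock_gettime() can be sketched as follows. This is a hypothetical stand-in written for illustration, not the API of timer.h; the struct and function names are invented here:

```c
#define _POSIX_C_SOURCE 200809L   /* exposes clock_gettime under strict -std=c11 */
#include <time.h>

/* Hypothetical stopwatch type; not the Clock type from timer.h. */
struct stopwatch {
    struct timespec start;
    struct timespec stop;
};

void stopwatch_start(struct stopwatch *sw)
{
    /* CLOCK_MONOTONIC is unaffected by changes to the system date. */
    clock_gettime(CLOCK_MONOTONIC, &sw->start);
}

void stopwatch_stop(struct stopwatch *sw)
{
    clock_gettime(CLOCK_MONOTONIC, &sw->stop);
}

/* Elapsed wall-clock time in seconds between start and stop. */
double stopwatch_seconds(const struct stopwatch *sw)
{
    return (double)(sw->stop.tv_sec - sw->start.tv_sec)
         + (double)(sw->stop.tv_nsec - sw->start.tv_nsec) / 1e9;
}
```

Bracketing the pthread_create/pthread_join section with stopwatch_start and stopwatch_stop gives elapsed time, while clock() around the same section keeps reporting CPU time summed over both threads.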
Here's a mildly tweaked version of your code — the substantive changes were changing the type of key in the sort function from int to unsigned int to match the data in the array, and to fix the conversion specification of %d to %ld to match the type identified by GCC as clock_t. I mildly tweaked the argument handling, and the timing messages so that they're consistent in length, and added the elapsed time measurement code:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>
#include "timer.h"
struct thread_data
{
int count;
unsigned int *arr;
};
struct thread_data thread_data_array[2];
static
void insertionSort(unsigned int arr[], int n)
{
for (int i = 1; i < n; i++)
{
unsigned int key = arr[i];
int j = i - 1;
while (j >= 0 && arr[j] > key)
{
arr[j + 1] = arr[j];
j = j - 1;
}
arr[j + 1] = key;
}
}
static
void *sortAndMergeArrays(void *threadarg)
{
int count;
unsigned int *arr;
struct thread_data *my_data;
my_data = (struct thread_data *)threadarg;
count = my_data->count;
arr = my_data->arr;
insertionSort(arr, count);
pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
int count = 10000;
int i, rc;
clock_t start, end, total_t;
pthread_t threads[2];
// get the loop count. If loop count is not provided take 10000 as default loop count.
if (argc == 2)
count = atoi(argv[1]);
unsigned int arr1[count], arr2[count], copyArr1[count], copyArr2[count];
srand(time(0));
for (i = 0; i < count; i++)
{
arr1[i] = rand();
arr2[i] = rand();
copyArr1[i] = arr1[i];
copyArr2[i] = arr2[i];
}
Clock clk;
clk_init(&clk);
start = clock();
clk_start(&clk);
for (int t = 0; t < 2; t++)
{
thread_data_array[t].count = count;
if (t == 0)
thread_data_array[t].arr = arr1;
else
thread_data_array[t].arr = arr2;
rc = pthread_create(&threads[t], NULL, sortAndMergeArrays, (void *)&thread_data_array[t]);
if (rc)
{
printf("ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
}
pthread_join(threads[0], NULL);
pthread_join(threads[1], NULL);
clk_stop(&clk);
end = clock();
char buffer[32];
printf("Elapsed using threads: %s s\n", clk_elapsed_us(&clk, buffer, sizeof(buffer)));
total_t = (double)(end - start);
printf("CPU time using threads: %ld\n", total_t);
start = clock();
clk_start(&clk);
insertionSort(copyArr1, count);
insertionSort(copyArr2, count);
clk_stop(&clk);
end = clock();
printf("Elapsed sequentially: %s s\n", clk_elapsed_us(&clk, buffer, sizeof(buffer)));
total_t = (double)(end - start);
printf("CPU time sequentially: %ld\n", total_t);
return 0;
}
Results
Example runs (program inssortthread23) — run on a MacBook Pro (15" 2016) with 16 GiB RAM and 2.7 GHz Intel Core i7 CPU, running macOS High Sierra 10.13, using GCC 7.2.0 for compilation.
I had routine background programs running — e.g. browser not being actively used, no music or videos playing, no downloads in progress etc. (These things matter for benchmarking.)
$ inssortthread23 100000
Elapsed using threads: 1.060299 s
CPU time using threads: 2099441
Elapsed sequentially: 2.146059 s
CPU time sequentially: 2138465
$ inssortthread23 200000
Elapsed using threads: 4.332935 s
CPU time using threads: 8616953
Elapsed sequentially: 8.496348 s
CPU time sequentially: 8469327
$ inssortthread23 300000
Elapsed using threads: 9.984021 s
CPU time using threads: 19880539
Elapsed sequentially: 20.000900 s
CPU time sequentially: 19959341
$
Conclusions
Here, you can clearly see that:
The elapsed time is approximately twice as long for the non-threaded code as for the threaded code.
The CPU time for the threaded and non-threaded code is almost the same.
The overall time is quadratic in the number of rows sorted.
All of which are very much in line with expectations — once you realize that clock() is measuring CPU time, not elapsed time.
Minor puzzle
You can also see that I'm getting the threaded CPU time as slightly smaller than the CPU time for sequential sorts, some of the time. I don't have an explanation for that — I deem it 'lost in the noise', though the effect is persistent:
$ inssortthread23 100000
Elapsed using threads: 1.051229 s
CPU time using threads: 2081847
Elapsed sequentially: 2.138538 s
CPU time sequentially: 2132083
$ inssortthread23 100000
Elapsed using threads: 1.053656 s
CPU time using threads: 2089886
Elapsed sequentially: 2.128908 s
CPU time sequentially: 2122983
$ inssortthread23 100000
Elapsed using threads: 1.058283 s
CPU time using threads: 2093644
Elapsed sequentially: 2.126402 s
CPU time sequentially: 2120625
$
$ inssortthread23 200000
Elapsed using threads: 4.259660 s
CPU time using threads: 8479978
Elapsed sequentially: 8.872929 s
CPU time sequentially: 8843207
$ inssortthread23 200000
Elapsed using threads: 4.463954 s
CPU time using threads: 8883267
Elapsed sequentially: 8.603401 s
CPU time sequentially: 8580240
$ inssortthread23 200000
Elapsed using threads: 4.227154 s
CPU time using threads: 8411582
Elapsed sequentially: 8.816412 s
CPU time sequentially: 8797965
$
I have the following code that uses OpenMP to parallelize a Monte Carlo method. My question is: why does the serial version of the code (monte_carlo_serial) run a lot faster than the parallel version (monte_carlo_parallel)? I am running the code on a machine with 32 cores and get the following result printed to the console:
-bash-4.1$ gcc -fopenmp hello.c ;
-bash-4.1$ ./a.out
Pi (Serial): 3.140856
Time taken 0 seconds 50 milliseconds
Pi (Parallel): 3.132103
Time taken 127 seconds 990 milliseconds
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <time.h>
int niter = 1000000; //number of iterations per FOR loop
int monte_carlo_parallel() {
double x,y; //x,y value for the random coordinate
int i; //loop counter
int count=0; //Count holds all the number of how many good coordinates
double z; //Used to check if x^2+y^2<=1
double pi; //holds approx value of pi
int numthreads = 32;
#pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads)
{
srand48((int)time(NULL) ^ omp_get_thread_num()); //Give random() a seed value
for (i=0; i<niter; ++i) //main loop
{
x = (double)drand48(); //gets a random x coordinate
y = (double)drand48(); //gets a random y coordinate
z = ((x*x)+(y*y)); //Checks to see if number is inside unit circle
if (z<=1)
{
++count; //if it is, consider it a valid random point
}
}
}
pi = ((double)count/(double)(niter*numthreads))*4.0;
printf("Pi (Parallel): %f\n", pi);
return 0;
}
int monte_carlo_serial(){
double x,y; //x,y value for the random coordinate
int i; //loop counter
int count=0; //Count holds all the number of how many good coordinates
double z; //Used to check if x^2+y^2<=1
double pi; //holds approx value of pi
srand48((int)time(NULL) ^ omp_get_thread_num()); //Give random() a seed value
for (i=0; i<niter; ++i) //main loop
{
x = (double)drand48(); //gets a random x coordinate
y = (double)drand48(); //gets a random y coordinate
z = ((x*x)+(y*y)); //Checks to see if number is inside unit circle
if (z<=1)
{
++count; //if it is, consider it a valid random point
}
}
pi = ((double)count/(double)(niter))*4.0;
printf("Pi (Serial): %f\n", pi);
return 0;
}
void main(){
clock_t start = clock(), diff;
monte_carlo_serial();
diff = clock() - start;
int msec = diff * 1000 / CLOCKS_PER_SEC;
printf("Time taken %d seconds %d milliseconds \n", msec/1000, msec%1000);
start = clock(), diff;
monte_carlo_parallel();
diff = clock() - start;
msec = diff * 1000 / CLOCKS_PER_SEC;
printf("Time taken %d seconds %d milliseconds \n", msec/1000, msec%1000);
}
The variable count is shared across all of your spawned threads. Each of them has to lock count to increment it. In addition, if the threads are running on separate CPUs (and there's no possible win if they're not), you have the cost of sending the value of count from one core to another and back again.
This is a textbook example of contention on a shared cache line, the same mechanism that makes false sharing expensive. In your serial version, count will sit in a register and cost 1 cycle to increment. In the parallel version it will usually not be in cache: you have to tell the other cores to invalidate that cache line, fetch it (L3 is going to take 66 cycles at best), increment it, and store it back. Every time count migrates from one CPU core to another you pay a minimum of ~125 cycles, which is a lot worse than 1. The threads will never be able to run in parallel because they all depend on count.
Try to modify your code so that each thread has its own count, then sum all values of count from all the threads at the end and you /might/ see a speedup.
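A sketch of the per-thread approach suggested above. One caveat, offered as an observation about the question's code and not as a measurement: its reduction(+:count) clause already gives each thread a private count, so another shared resource worth removing is the hidden global state that drand48 updates on every call. The version below uses erand48, which keeps its state in a caller-supplied array. The function names and the seeding scheme are hypothetical.

```c
#define _XOPEN_SOURCE 700   /* exposes erand48 under strict -std=c11 */
#include <stdlib.h>

/* Count random points that land inside the unit quarter circle.
 * All generator state lives in the caller-supplied xsubi array, so
 * concurrent callers with distinct arrays share nothing. */
static long count_hits(long iterations, unsigned short xsubi[3])
{
    long hits = 0;
    for (long i = 0; i < iterations; ++i) {
        double x = erand48(xsubi);
        double y = erand48(xsubi);
        if (x * x + y * y <= 1.0)
            ++hits;
    }
    return hits;
}

/* Estimate pi from `workers` independent random streams. Without
 * -fopenmp the loop simply runs serially; the result is identical. */
double estimate_pi(int workers, long iterations_per_worker)
{
    long total = 0;
#ifdef _OPENMP
    #pragma omp parallel for reduction(+:total)
#endif
    for (int w = 0; w < workers; ++w) {
        /* Hypothetical seeding scheme: one fixed seed per worker. */
        unsigned short state[3] = { 0x330E, (unsigned short)(w + 1), 0x1234 };
        total += count_hits(iterations_per_worker, state);
    }
    return 4.0 * (double)total
         / ((double)workers * (double)iterations_per_worker);
}
```

When timing this, keep in mind that clock() adds up the CPU time of all threads, so a wall-clock timer such as omp_get_wtime() is needed to see any speedup.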
given the code below
#include<time.h>
#include <stdio.h>
#include <stdlib.h>
void firstSequence()
{
int columns = 999999;
int rows = 400000;
int **matrix;
int j;
int counter = 0;
matrix = (int **)malloc(columns*sizeof(int*));
for(j=0;j<columns;j++)
{
matrix[j]=(int*)malloc(rows*sizeof(int));
}
for(counter = 1;counter < columns; counter ++)
{
free(matrix[counter]);
}
}
void secondSequence()
{
int columns = 111;
int rows = 600000;
int **matrix;
int j;
matrix = (int **)malloc(columns*sizeof(int*));
for(j=0;j<columns;j++)
{
matrix[j]=(int*)malloc(rows*sizeof(int));
}
}
int main()
{
long t1;
long t2;
long diff;
t1 = clock();
firstSequence();
t2 = clock();
diff = (t2-t1) * 1000.0 / CLOCKS_PER_SEC;
printf("%f",t2);
t1 = clock();
secondSequence();
t2 = clock();
diff = (t2-t1) * 1000.0 / CLOCKS_PER_SEC;
printf("%f",diff);
return(0);
}
I need to be able to see how long it takes for both sequence one and sequence two to run. However, both times I get 0 as the time elapsed. From looking online I have seen that this can be an issue, but I do not know how to fix it.
You display the time incorrectly, so even if your functions take more than 0ms the call to printf() invokes undefined behaviour.
printf("%f",diff);
%f is used to display doubles. You probably want to use %ld.
If your functions really do take 0 ms to execute, then a simple way to measure the time for one call is to call the function multiple times, enough for the total to be measurable, and then divide the total elapsed time by the number of calls.
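The repeat-and-average idea can be sketched like this. The helper average_cpu_ms and the workload sample_work are hypothetical names used only for illustration:

```c
#include <time.h>

/* Average CPU time of one call to fn, in milliseconds, measured over
 * `repetitions` calls so the total rises above the clock's resolution. */
double average_cpu_ms(void (*fn)(void), long repetitions)
{
    clock_t start = clock();
    for (long i = 0; i < repetitions; i++)
        fn();
    clock_t end = clock();
    double total_ms = (double)(end - start) * 1000.0 / CLOCKS_PER_SEC;
    return total_ms / (double)repetitions;
}

/* Hypothetical workload used only to demonstrate the helper. */
void sample_work(void)
{
    volatile unsigned long sink = 0;   /* volatile keeps the loop from being optimized away */
    for (unsigned long i = 0; i < 1000; i++)
        sink += i;
}
```

Note that the result is a double, so it is printed with %f; the mismatch between %f and a long is what caused the problem in the question.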
clock is not a suitable function for measuring how long a program took to run.
You should use clock_gettime instead; see its documentation for a detailed explanation.
Simple usage:
struct timespec start, end;
clock_gettime(CLOCK_REALTIME, &start);
for(int i = 0; i < 10000; i++) {
f1();
}
clock_gettime(CLOCK_REALTIME, &end);
printf("time elapsed = %f us\n", (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_nsec - start.tv_nsec) / 1e3);
PS: when you are compiling on Linux, remember to link with -lrt (only needed with glibc versions before 2.17).
I am having an issue calculating the elapsed time of the thread function for each thread. I need to find the total elapsed time for all of the threads, but it is not working properly (see the output below the code).
#include <unistd.h>
#include <sys/types.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <string.h>
#include <math.h>
#include <time.h>
int numthread;
double x1;
double x2;
double h;
double totalintegral;
int n; //number of trapezoids
int localn;
double gnolock;
double gmute;
double gbusy;
double gsema;
double doTrapRule(double localx1, double localx2, double h, int localn);
double doFunction(double x);
void *threadCalc(void* threadid);
int main(int argc, char * argv[])
{
int i;
x1 = 0.0;
x2 = 20.0;
n = 200000;
numthread = 10;
pthread_t* threads = malloc(numthread*sizeof(pthread_t));
h = (x2 - x1)/n;
localn = n/numthread;
for(i = 0; i < numthread; i++)
{
pthread_create(&threads[i], NULL, (void *) &threadCalc, (void*) i);
}
for(i = 0; i < numthread; i++)
{
pthread_join(threads[i], NULL);
}
printf("Trap rule result with %d trap(s) is %f\n", n, totalintegral);
fflush(stdout);
printf("no lock completed in %f\n", gnolock);
exit(0);
}
void *threadCalc(void* threadid)
{
clock_t start = clock();
double localx1;
double localx2;
double localintegral;
int cur_thread = (int)threadid;
localx1 = x1 + cur_thread * localn * h;
localx2 = localx1 + localn * h;
localintegral = doTrapRule(localx1, localx2, h, localn);
totalintegral = totalintegral + localintegral;
//printf("Trap rule result with %d trap(s) is %f", n, totalintegral);
clock_t stop = clock();
double time_elapsed = (long double)(stop - start)/CLOCKS_PER_SEC;
printf("time elapsed of each thread %f\n",time_elapsed);
gnolock = gnolock + time_elapsed;
return NULL;
}
double doTrapRule(double localx1, double localx2, double h, int localn)
{
//time start here
double localtrapintegral;
double tempx1;
int i;
localtrapintegral = (doFunction(localx1) + doFunction(localx2)) / 2.0;
for(i = 1; i <= (localn - 1); i++)
{
tempx1 = localx1 + i * h;
localtrapintegral = localtrapintegral + doFunction(tempx1);
}
localtrapintegral = localtrapintegral * h;
//time end here, add elapsed to global
return localtrapintegral;
}
double doFunction(double x)
{
double result;
result = x*x*x;
return result;
}
output:
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
time elapsed of each thread 0.010000
time elapsed of each thread 0.010000
time elapsed of each thread 0.000000
time elapsed of each thread 0.000000
Trap rule result with 200000 trap(s) is 40000.000001
no lock completed in 0.020000
As you can see, for whatever reason only some of the threads are actually returning a time. I ran this multiple times, and every time only a few threads returned a result. Just as an FYI, gnolock is my variable that stores the total amount of time elapsed. My guess as to why this isn't working is that the decimal point is out of range, but it shouldn't be?
If you call clock() on your system, it has a resolution of 10 ms. So if a process takes 2 ms, then it will usually report a time of 0.00s or 0.01s, depending on a bunch of things which you have no control over.
Use one of the high-resolution clocks instead. You can use clock_gettime with CLOCK_THREAD_CPUTIME_ID or CLOCK_PROCESS_CPUTIME_ID; I believe the resolution of these clocks is several orders of magnitude better than that of clock().
See man 2 clock_gettime for more information.
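A minimal sketch of reading the per-thread CPU clock; the helper name thread_cpu_seconds is hypothetical:

```c
#define _POSIX_C_SOURCE 200809L   /* exposes clock_gettime under strict -std=c11 */
#include <time.h>

/* CPU time consumed so far by the calling thread, in seconds.
 * Returns a negative value if the clock is unavailable. */
double thread_cpu_seconds(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) != 0)
        return -1.0;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}
```

In threadCalc, sampling this at the start and at the end and subtracting the two values would replace the pair of clock() calls.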
The most likely explanation is that your clock tick is too coarse for the elapsed time you are trying to measure. Mostly the start and stop readings are the same. Occasionally, by chance, a clock tick occurs during your thread's execution and you see 1 tick. (This is effectively what Dietrich said above.)
As an example of what this means, imagine your thread takes an hour to complete and your clock ticks once a day, at midnight. Mostly when you run the thread it starts and ends on the same day. But if you happen to run it within an hour of midnight, you will see the start and stop on different days (1 tick). What you need then is a faster clock, but such a clock might well not be available.
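To see how coarse the tick actually is on a given system, one can spin until clock() changes and look at the size of the jump. A small sketch; the function name observed_clock_step_seconds is made up for illustration:

```c
#include <time.h>

/* Spin until clock() reports a new value and return the size of that
 * step in seconds: an estimate of the granularity actually observed. */
double observed_clock_step_seconds(void)
{
    clock_t first = clock();
    clock_t next;
    do {
        next = clock();   /* busy-waiting consumes CPU time, so this advances */
    } while (next == first);
    return (double)(next - first) / CLOCKS_PER_SEC;
}
```

If the returned step is around 0.01, any thread that runs for less than 10 ms will mostly report 0 and occasionally 0.01, which matches the output in the question.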
You are using the wrong tool. clock doesn't measure elapsed time; according to its man page:
The clock() function returns an approximation of processor time used by the program.
These are two completely different things. Perhaps your threads don't use up much processor time.