Serial Execution faster than Parallel Execution with one thread of OpenMP - c

I am trying to compute the value of pi using the trapezoidal rule of numerical integration. For that I have written a serial code which iterates over a given range. To measure the parallel overhead, I have run the equivalent OpenMP code with the number of threads set to 1. I have then obtained the following graph of execution time versus problem size.
Since we are only creating one thread, I don't think there is much communication overhead involved. So what might be the reason behind this? And as far as I know, the directive is handled at compile time, i.e., like a macro that gets expanded before runtime, so am I missing something there? Or is it something totally different from what I had thought?
Below is the serial code
#include<stdio.h>
#include<omp.h>
int main()
{
FILE *fp = fopen("pi_serial.txt", "a+");
long num_steps = 1e9;
double step_size = 1.0 / num_steps;
long i;
double sum = 0;
double start_time = omp_get_wtime();
for(i = 0; i< num_steps; i++) {
double x = (i + 0.5) * step_size;
sum += (4.0 / (1.0 + (x * x)));
}
sum = sum * step_size;
double end_time = omp_get_wtime();
fprintf(fp, "%lf %lf\n", sum, end_time - start_time);
fclose(fp);
return 0;
}
And here is the multi-threaded code
#include <stdio.h>
#include <omp.h>
#include <stdlib.h>
int main(int argc, char* argv[])
{
FILE* fp = fopen("pi_parallel.txt", "a+");
omp_set_num_threads(1);
long num_steps = atol(argv[1]);
double step_size = 1.0 / num_steps;
double sum = 0;
double start_time = omp_get_wtime();
#pragma omp parallel
{
int id = omp_get_thread_num();
double private_sum = 0;
int i;
for(i = id; i < num_steps; i += 1){
double x = (i + 0.5) * step_size;
private_sum += (4.0 / (1.0 + x * x));
}
#pragma omp critical
sum += private_sum;
}
sum *= step_size;
double end_time = omp_get_wtime();
fprintf(fp, "%lf %lf\n", sum, end_time - start_time);
fclose(fp);
return 0;
}
And here is the graph of the execution time:

https://www.youtube.com/watch?v=OuzYICZUthM&list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG&index=7
The video above helps explain why a serial code might be faster than a parallel code running on one thread.
According to the presenter, since you are setting up the OpenMP environment and creating a thread in the middle of the program, it is normal for the OpenMP version to run slower than the serial code.
But the main thing is to look at the scalability of your code: how fast is it compared to the serial version when running on more than one thread?
If you run the same code on multiple threads and still do not see an increase in performance, it may be due to false sharing. From what I understand: consider two variables that reside in the same cache line. The master thread accesses and modifies one of the variables, which causes the cache line to be invalidated. If thread 1 then has to access that cache line, the modified line is first written back to memory, and the thread fetches it from memory again before modifying it. This process can increase the execution time.
References:
https://docs.oracle.com/cd/E37069_01/html/E37081/aewcy.html
*I don't own the video.
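One point worth adding about the macro question: an OpenMP #pragma is not expanded like a preprocessor macro. The compiler lowers the parallel region into calls into the OpenMP runtime library, and the thread team is created and synchronized at run time, which is exactly where the one-thread overhead comes from. To judge scalability, the same loop can simply be timed for several thread counts; a minimal sketch (using a reduction clause instead of the original critical section) might look like this:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const long num_steps = 100000000;      /* smaller than 1e9 so the test runs quickly */
    const double step_size = 1.0 / num_steps;

    for (int nthreads = 1; nthreads <= 8; nthreads *= 2) {
        double sum = 0.0;
        double start = omp_get_wtime();
        /* each thread accumulates a private partial sum, combined at the end */
        #pragma omp parallel for reduction(+:sum) num_threads(nthreads)
        for (long i = 0; i < num_steps; i++) {
            double x = (i + 0.5) * step_size;
            sum += 4.0 / (1.0 + x * x);
        }
        double elapsed = omp_get_wtime() - start;
        printf("threads=%d pi=%.10f time=%.4f s\n", nthreads, sum * step_size, elapsed);
    }
    return 0;
}
If the time drops roughly in proportion to the thread count, the gap at one thread is just fixed startup and synchronization cost.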

Related

Openmp pragma barrier for calculating pi in

I have a C program that uses OpenMP, shown below; the program computes pi for a given number of steps. However, I am new to OpenMP, so my knowledge is limited.
I'm attempting to implement a barrier for this program, but I believe one is already implicit, so I'm not sure if I even need to implement it.
Thank you!
#include <omp.h>
#include <stdio.h>
#define NUM_THREADS 4
static long num_steps = 100000000;
double step;
int main()
{
int i;
double start_time, run_time, pi, sum[NUM_THREADS];
omp_set_num_threads(NUM_THREADS);
step = 1.0 / (double)num_steps;
start_time = omp_get_wtime();
#pragma omp parallel
{
int i, id, currentThread;
double x;
id = omp_get_thread_num();
currentThread = omp_get_num_threads();
for (i = id, sum[id] = 0.0; i < num_steps; i = i + currentThread)
{
x = (i + 0.5) * step;
sum[id] = sum[id] + 4.0 / (1.0 + x * x);
}
}
run_time = omp_get_wtime() - start_time;
//we then get the value of pi
for (i = 0, pi = 0.0; i < NUM_THREADS; i++)
{
pi = pi + sum[i] * step;
}
printf("\n pi with %ld steps is %lf \n ", num_steps, pi);
printf("run time = %6.6f seconds\n", run_time);
}
In your case there is no need for an explicit barrier; there is an implicit barrier at the end of the parallel region.
Your code, however, has a performance issue: different threads update adjacent elements of the sum array, which can cause false sharing:
When multiple threads access the same cache line and at least one of them
writes to it, it causes costly invalidation misses and upgrades.
To avoid it you have to make sure that each element of the sum array is located on a different cache line, but there is a simpler solution: use OpenMP's reduction clause. Please check the example suggested by @JeromeRichard. Using reduction, your code should be something like this:
double sum=0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < num_steps; i++)
{
const double x = (i + 0.5) * step;
sum += 4.0 / (1.0 + x * x);
}
Note also that you should use your variables in their minimum required scope.
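Putting the pieces together, the whole program might look roughly like this (a sketch assembled from the snippet above, keeping the original variable names; not a tested drop-in):
#include <omp.h>
#include <stdio.h>
#define NUM_THREADS 4
static long num_steps = 100000000;

int main()
{
    double sum = 0.0;
    double step = 1.0 / (double)num_steps;

    omp_set_num_threads(NUM_THREADS);
    double start_time = omp_get_wtime();

    /* each thread gets a private copy of sum; OpenMP combines them at the end */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++)
    {
        const double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    double pi = sum * step;
    double run_time = omp_get_wtime() - start_time;
    printf("\n pi with %ld steps is %lf \n ", num_steps, pi);
    printf("run time = %6.6f seconds\n", run_time);
    return 0;
}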

I can't understand why this openmp program doesn't scale well

I've just recently started using openmp and can't understand why this program isn't scaling well:
#include <stdio.h>
#include <omp.h>
#define THREAD_NUMBER 4
int main() {
double start, end;
double pi = 0.0;
int actual_threads;
unsigned long numSteps = 1000000000;
double stepSize = 1.0/numSteps;
omp_set_num_threads(THREAD_NUMBER);
start = omp_get_wtime();
#pragma omp parallel for reduction (+:pi)
for (int i = 0; i < numSteps; i++) {
pi += 4.0/(1 + (0.5 + i)*stepSize*(0.5 + i)*stepSize);
}
pi *= stepSize;
end = omp_get_wtime();
printf("runtime: %.10fs\n", end - start);
printf("pi = %.15f\n", pi);
return 0;
}
It's just a basic test program in C for me to get used to the OpenMP directives. When I run it with 1 thread, the parallel section alone takes about 3.5 seconds of runtime. But when I run it with 4 threads it takes about 1.4 seconds. This is horrible scaling; it's not even close to cutting the time by 4, and I can't understand why. I appreciate any help.

Why am I not getting the same estimation of PI using a parallelized (OpenMP) algorithm copied from working code?

The code below is a direct transcription from a YouTube video on estimating pi using OpenMP and Monte Carlo. Even with the same inputs, I'm not getting their output. In fact, I get roughly half their value.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
int main() {
int num; // number of iterations
printf("Enter number of iterations you want the loop to run for: ");
scanf_s("%d", &num);
double x, y, z, pi;
long long int i;
int count = 0;
int num_thread;
printf("Enter number of threads you want to run to parallelize the process:\t");
scanf_s("%d", &num_thread);
printf("\n");
#pragma omp parallel firstprivate(x,y,z,i) shared(count) num_threads(num_thread)
{
srand((int)time(NULL) ^ omp_get_thread_num());
for (i = 0; i < num; i++) {
x = (double)rand() / (double)RAND_MAX;
y = (double)rand() / (double)RAND_MAX;
z = pow(((x * x) + (y * y)), .5);
if (z <= 1) {
count++;
}
}
} // END PRAGMA
pi = ((double)count / (double)(num * num_thread)) * 4;
printf("The value of pi obtained is %f\n", pi);
return 0;
}
I've also used a similar algorithm straight from the Oak Ridge National Laboratory's website (https://www.olcf.ornl.gov/tutorials/monte-carlo-pi/):
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>
int main(int argc, char* argv[])
{
int niter = 1000000; //number of iterations per FOR loop
double x,y; //x,y value for the random coordinate
int i; //loop counter
int count=0; //Count holds all the number of how many good coordinates
double z; //Used to check if x^2+y^2<=1
double pi; //holds approx value of pi
int numthreads = 16;
#pragma omp parallel firstprivate(x, y, z, i) shared(count) num_threads(numthreads)
{
srandom((int)time(NULL) ^ omp_get_thread_num()); //Give random() a seed value
for (i=0; i<niter; ++i) //main loop
{
x = (double)random()/RAND_MAX; //gets a random x coordinate
y = (double)random()/RAND_MAX; //gets a random y coordinate
z = sqrt((x*x)+(y*y)); //Checks to see if number is inside unit circle
if (z<=1)
{
++count; //if it is, consider it a valid random point
}
}
//print the value of each thread/rank
}
pi = ((double)count/(double)(niter*numthreads))*4.0;
printf("Pi: %f\n", pi);
return 0;
}
And I have the exact same problem, so I think it isn't the code but somehow my machine.
I am running Visual Studio 2022 on Windows 11 with a 16-core i9-12900KF and 32 GB of RAM.
Edit: I forgot to mention I did alter the second algorithm to use srand() and rand() instead.
There are many errors in the code:
As pointed out by @JeromeRichard and @JohnBollinger, rand/srand/random are not thread-safe; you should use a thread-safe alternative.
There is a race condition at the line ++count; (different threads read and write a shared variable). You should use a reduction to avoid it.
The code assumes that you get numthreads threads, but OpenMP does not guarantee that you actually obtain all of the threads you requested. If you got roughly PI/2 as a result, the problem is most likely this difference between the requested and obtained number of threads. If you use #pragma omp parallel for... before the loop, you do not need any assumption about the number of threads (i.e. in this case the formula to calculate PI no longer contains the number of threads).
A minor comment is that you do not need to use the time-consuming pow function.
Putting it together your code should be something like this:
#pragma omp parallel for reduction(+:count) num_threads(num_thread)
for (long long int i = 0; i < num; i++) {
const double x = threadsafe_random_number_between_0_1();
const double y = threadsafe_random_number_between_0_1();
const double z = x * x + y * y;
if (z <= 1) {
count++;
}
}
double pi = ((double) count / (double) num ) * 4.0;
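Note that threadsafe_random_number_between_0_1() above is just a placeholder. One possible way to implement it, assuming a POSIX environment where rand_r is available (on MSVC another reentrant generator would have to be substituted), is to pass each thread its own seed explicitly:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

/* Uniform random number in [0, 1), built on rand_r, which takes the seed
   by pointer and is therefore safe to call from several threads at once. */
static double threadsafe_random_number_between_0_1(unsigned int *seed)
{
    return (double)rand_r(seed) / ((double)RAND_MAX + 1.0);
}

int main(void)
{
    const long long num = 100000000;      /* total number of samples */
    long long count = 0;

    #pragma omp parallel
    {
        /* distinct seed per thread so the random streams do not coincide */
        unsigned int seed = (unsigned int)time(NULL) ^ (unsigned int)omp_get_thread_num();

        #pragma omp for reduction(+:count)
        for (long long i = 0; i < num; i++) {
            const double x = threadsafe_random_number_between_0_1(&seed);
            const double y = threadsafe_random_number_between_0_1(&seed);
            if (x * x + y * y <= 1.0)
                count++;
        }
    }

    printf("pi ~= %f\n", 4.0 * (double)count / (double)num);
    return 0;
}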
One assumption, but I may be wrong: you initialise random with the time, so it may happen that different threads use the same time, which can result in the same random numbers being generated, and then the result will be quite poor because you get the same values multiple times. This is a problem with the Monte Carlo method, where duplicated points degrade the result.

Why is OpenMP reduction slower than MPI on a shared memory architecture?

I have tried to test OpenMP and MPI parallel implementations of the inner product of two vectors (element values are computed on the fly) and found that OpenMP is slower than MPI.
The MPI code I am using is the following:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
#include <mpi.h>
int main(int argc, char* argv[])
{
double ttime = -omp_get_wtime();
int np, my_rank;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &np);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
int n = 10000;
int repeat = 10000;
int sublength = (int)(ceil((double)(n) / (double)(np)));
int nstart = my_rank * sublength;
int nend = nstart + sublength;
if (nend >n )
{
nend = n;
sublength = nend - nstart;
}
double dot = 0;
double sum = 1;
int j, k;
double time = -omp_get_wtime();
for (j = 0; j < repeat; j++)
{
double loc_dot = 0;
for (k = 0; k < sublength; k++)
{
double temp = sin((sum+ nstart +k +j)/(double)(n));
loc_dot += (temp * temp);
}
MPI_Allreduce(&loc_dot, &dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
sum += (dot/(double)(n));
}
time += omp_get_wtime();
if (my_rank == 0)
{
ttime += omp_get_wtime();
printf("np = %d sum = %f, loop time = %f sec, total time = %f \n", np, sum, time, ttime);
}
MPI_Finalize();
return 0;
}
I have tried several different implementations with OpenMP.
Here is the version which is not too complicated and is close to the best performance I can achieve:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
int main(int argc, char* argv[])
{
int n = 10000;
int repeat = 10000;
int np = 1;
if (argc > 1)
{
np = atoi(argv[1]);
}
omp_set_num_threads(np);
int nstart =0;
int sublength =n;
double loc_dot = 0;
double sum = 1;
#pragma omp parallel
{
int i, j, k;
double time = -omp_get_wtime();
for (j = 0; j < repeat; j++)
{
#pragma omp for reduction(+: loc_dot)
for (k = 0; k < sublength; k++)
{
double temp = sin((sum+ nstart +k +j)/(double)(n));
loc_dot += (temp * temp);
}
#pragma omp single
{
sum += (loc_dot/(double)(n));
loc_dot =0;
}
}
time += omp_get_wtime();
#pragma omp single nowait
printf("sum = %f, time = %f sec, np = %d\n", sum, time, np);
}
return 0;
}
Here are my test results:
OMP
sum = 6992.953984, time = 0.409850 sec, np = 1
sum = 6992.953984, time = 0.270875 sec, np = 2
sum = 6992.953984, time = 0.186024 sec, np = 4
sum = 6992.953984, time = 0.144010 sec, np = 8
sum = 6992.953984, time = 0.115188 sec, np = 16
sum = 6992.953984, time = 0.195485 sec, np = 32
MPI
sum = 6992.953984, time = 0.381701 sec, np = 1
sum = 6992.953984, time = 0.243513 sec, np = 2
sum = 6992.953984, time = 0.158326 sec, np = 4
sum = 6992.953984, time = 0.102489 sec, np = 8
sum = 6992.953984, time = 0.063975 sec, np = 16
sum = 6992.953984, time = 0.044748 sec, np = 32
Can anyone tell me what I am missing?
thanks!
Update:
I have written an acceptable reduce function for OpenMP. The performance is close to the MPI reduce function now. The code is as follows:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
double darr[2][64];
int nreduce=0;
#pragma omp threadprivate(nreduce)
double OMP_Allreduce_dsum(double loc_dot,int tid,int np)
{
darr[nreduce][tid]=loc_dot;
#pragma omp barrier
double dsum =0;
int i;
for (i=0; i<np; i++)
{
dsum += darr[nreduce][i];
}
nreduce=1-nreduce;
return dsum;
}
int main(int argc, char* argv[])
{
int np = 1;
if (argc > 1)
{
np = atoi(argv[1]);
}
omp_set_num_threads(np);
double ttime = -omp_get_wtime();
int n = 10000;
int repeat = 10000;
#pragma omp parallel
{
int tid = omp_get_thread_num();
int sublength = (int)(ceil((double)(n) / (double)(np)));
int nstart = tid * sublength;
int nend = nstart + sublength;
if (nend >n )
{
nend = n;
sublength = nend - nstart;
}
double sum = 1;
double time = -omp_get_wtime();
int j, k;
for (j = 0; j < repeat; j++)
{
double loc_dot = 0;
for (k = 0; k < sublength; k++)
{
double temp = sin((sum+ nstart +k +j)/(double)(n));
loc_dot += (temp * temp);
}
double dot =OMP_Allreduce_dsum(loc_dot,tid,np);
sum +=(dot/(double)(n));
}
time += omp_get_wtime();
#pragma omp master
{
ttime += omp_get_wtime();
printf("np = %d sum = %f, loop time = %f sec, total time = %f \n", np, sum, time, ttime);
}
}
return 0;
}
First of all, this code is very sensitive to synchronization overheads (both software and hardware), resulting in apparently strange behaviors attributable both to the OpenMP runtime implementation and to low-level processor operations (e.g. cache/bus effects). Indeed, a full synchronization is required for each iteration of the j-based loop, and the whole loop of 10,000 iterations runs in about 45 ms, i.e. roughly 4.5 us per iteration. In such a short time, the partial sums spread over 32 cores need to be reduced and broadcast. If each core accumulates its own value into a shared atomic location, taking for example 60 ns per atomic add (a realistic overhead for atomics on scalable Xeon processors), it would take 32 * 60 ns = 1.92 us, since this process is done sequentially on x86 processors so far. This small additional time alone represents an overhead of about 43% of the overall execution time (1.92 us out of 4.5 us) because of the barriers! Due to contention on atomic variables, timings are often much worse. Moreover, the barriers themselves are expensive (they are often implemented using atomics in OpenMP runtimes, but in a way that can scale a bit better).
The first OpenMP implementation was slow because of implicit synchronizations and complex hardware cache effects. Indeed, the omp for reduction directive performs an implicit barrier at the end of its region, and so does omp single. The reduction itself can be implemented in several ways; the OpenMP runtime of ICC uses a clever tree-based atomic implementation which should scale quite well (but not perfectly). Moreover, the omp single section will cause some cache-line bouncing. Indeed, the result loc_dot will likely be stored in the cache of the last core that updated it, while the thread executing this section will likely be scheduled on another core. In this case, the processor has to move the cache line from one L2 cache to another (or load the value directly from the L3 cache, depending on the hardware state). The same thing also applies to sum (which tends to move between cores, as the thread executing the section will likely not always be scheduled on the same core). Finally, the sum variable must be broadcast to each core so they can start a new iteration.
The last OpenMP implementation is significantly better since every thread works on its own local data, it uses only one barrier (this synchronization is mandatory for the algorithm), and caches are better used. The accumulation part may not be ideal, as all cores will likely fetch data previously located in all the other L1/L2 caches, causing an all-to-all broadcast pattern. This hardware operation scales poorly, but at least it is not sequential.
Note that the last OpenMP implementation suffers from false sharing. Indeed, the items of darr are stored contiguously in memory and share the same cache lines. As a result, when a thread writes into darr, the associated core requests the cache line and invalidates the copies located on the other cores. This causes cache-line bouncing between cores. However, on current x86 processors cache lines are 64 bytes wide and a double takes 8 bytes, so there are 8 items per cache line. This mitigates the cache-line bouncing, typically confining it to groups of 8 of the 32 cores. That being said, the dense packing also has a benefit: only 4 cache-line fetches per core are required to perform the global accumulation. To prevent false sharing, one can allocate an (8 times) bigger array and leave some space between items so that only 1 item is stored per cache line. The best strategy on your target processor may be to use a tree-based atomic reduction like the one the ICC OpenMP runtime uses. Ideally, the sum reduction and the barrier could be merged for better performance. This is what an MPI implementation can do internally (MPI_Allreduce).
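To illustrate the padding idea (my own sketch, assuming 64-byte cache lines, not the answer's code), the reduction buffer can be padded so that each thread's slot sits on its own cache line:
#include <omp.h>

#define CACHE_LINE 64                 /* assumed cache-line size in bytes */

/* One slot per thread, padded to a full cache line so that writes from
   different threads never share a line (no false sharing). */
typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_double;

static padded_double darr_padded[2][64];
static int nreduce_padded = 0;
#pragma omp threadprivate(nreduce_padded)

/* Same double-buffered allreduce as OMP_Allreduce_dsum above,
   but accumulating through the padded array. */
double OMP_Allreduce_dsum_padded(double loc_dot, int tid, int np)
{
    darr_padded[nreduce_padded][tid].value = loc_dot;
    #pragma omp barrier
    double dsum = 0.0;
    for (int i = 0; i < np; i++)
        dsum += darr_padded[nreduce_padded][i].value;
    nreduce_padded = 1 - nreduce_padded;
    return dsum;
}
With padding, each core touches np cache lines during the accumulation instead of roughly np/8, so whether this actually wins depends on the platform, as noted above.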
Note that all the implementations suffer from the very frequent thread synchronization. This is a problem because context switches regularly occur on some cores due to operating-system/hardware events (network, storage devices, users, system processes, etc.). One critical issue is frequency scaling on any modern x86 processor: not all cores will work at the same frequency, and their frequency changes over time. The slowest thread slows down all the others because of the barrier. In the worst case, some threads may wait passively, enabling some cores to sleep (C-states) and then take more time to wake up, slowing the others down further depending on the platform configuration.
The takeaway is:
the more synchronized a code is, the worse it scales and the more challenging its optimization becomes.

C Threads program

I wrote a program based on the idea of a Riemann sum to find the value of an integral. It uses several threads, but its performance (the algorithm), compared to the sequential program I wrote later, is subpar. Algorithm-wise they are identical except for the threading, so the question is: what's wrong with it? pthread_join is not the cause, I assume, because if one thread finishes sooner than the thread the join is currently waiting on, its own join will simply return immediately when we get to it. Is that correct? The free call is probably wrong and there is no error check on thread creation; I'm aware of it, I deleted them along the way while testing various things. Sorry for the bad English and thanks in advance.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/types.h>
#include <time.h>
int counter = 0;
float sum = 0;
pthread_mutex_t mutx;
float function_res(float);
struct range {
float left_border;
int steps;
float step_range;
};
void *calcRespectiveRange(void *ranges) {
struct range *rangs = ranges;
float left_border = rangs->left_border;
int steps = rangs->steps;
float step_range = rangs->step_range;
free(rangs);
//printf("left: %f steps: %d step range: %f\n", left_border, steps, step_range);
int i;
float temp_sum = 0;
for(i = 0; i < steps; i++) {
temp_sum += step_range * function_res(left_border);
left_border += step_range;
}
sum += temp_sum;
pthread_exit(NULL);
}
int main() {
clock_t begin, end;
if(pthread_mutex_init(&mutx, NULL) != 0) {
printf("mutex error\n");
}
printf("enter range, amount of steps and threads: \n");
float left_border, right_border;
int steps_count;
int threads_amnt;
scanf("%f %f %d %d", &left_border, &right_border, &steps_count, &threads_amnt);
float step_range = (right_border - left_border) / steps_count;
int i;
pthread_t tid[threads_amnt];
float chunk = (right_border - left_border) / threads_amnt;
int steps_per_thread = steps_count / threads_amnt;
begin = clock();
for(i = 0; i < threads_amnt; i++) {
struct range *ranges;
ranges = malloc(sizeof(ranges));
ranges->left_border = i * chunk + left_border;
ranges->steps = steps_per_thread;
ranges->step_range = step_range;
pthread_create(&tid[i], NULL, calcRespectiveRange, (void*) ranges);
}
for(i = 0; i < threads_amnt; i++) {
pthread_join(tid[i], NULL);
}
end = clock();
pthread_mutex_destroy(&mutx);
printf("\n%f\n", sum);
double time_spent = (double) (end - begin) / CLOCKS_PER_SEC;
printf("Time spent: %lf\n", time_spent);
return(0);
}
float function_res(float lb) {
return(lb * lb + 4 * lb + 3);
}
Edit: in short - can it be improved to reduce execution time (with mutexes, for example)?
The execution time will be shortened, provided you have multiple hardware threads available.
The problem is in how you measure time: clock returns the processor time used by the program. That means it sums the CPU time consumed by all the threads. If your program uses 2 threads and its wall-clock execution time is 1 second, then each thread has used 1 second of CPU time, and clock will return the equivalent of 2 seconds.
To get the actual time used (on Linux), use gettimeofday. I modified your code by adding
#include <sys/time.h>
and capturing the start time before the loop:
struct timeval tv_start;
gettimeofday( &tv_start, NULL );
and after:
struct timeval tv_end;
gettimeofday( &tv_end, NULL );
and calculating the difference in seconds:
printf("CPU Time: %lf\nTime passed: %lf\n",
time_spent,
((tv_end.tv_sec * 1000*1000.0 + tv_end.tv_usec) -
(tv_start.tv_sec * 1000*1000.0 + tv_start.tv_usec)) / 1000/1000
);
(I also fixed the malloc: malloc(sizeof(ranges)) allocates only the size of a pointer (4 or 8 bytes for a 32/64-bit CPU); it should be malloc(sizeof(struct range)) (12 bytes).)
When running with the input parameters 0 1000000000 1000000000 1, that is, 1 billion iterations in 1 thread, the output on my machine is:
CPU Time: 4.352000
Time passed: 4.400006
When running with 0 1000000000 1000000000 2, that is, 1 billion iterations spread over 2 threads (500 million iterations each), the output is:
CPU Time: 4.976000
Time passed: 2.500003
For completeness sake, I tested it with the input 0 1000000000 1000000000 4:
CPU Time: 8.236000
Time passed: 2.180114
It is a little faster, but not twice as fast as with 2 threads, and it uses double the CPU time. This is because my CPU is a Core i3, a dual-core with hyperthreading; the hyperthreads are not full hardware cores.
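For reference, the wall-clock measurement can be wrapped in a small helper; here is a sketch built on gettimeofday as used above (on Linux, clock_gettime(CLOCK_MONOTONIC) would be a reasonable alternative):
#include <stdio.h>
#include <sys/time.h>

/* Current wall-clock time in seconds, built on gettimeofday. */
static double wall_time(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

int main(void)
{
    double start = wall_time();
    /* ... create and join the worker threads here ... */
    double elapsed = wall_time() - start;
    printf("Time passed: %lf\n", elapsed);
    return 0;
}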
