I have a C program that uses OpenMP, shown below. It computes pi for a given number of steps. I am new to OpenMP, so my knowledge is limited.
I'm attempting to implement a barrier for this program, but I believe one is already implicit, so I'm not sure if I even need to implement it.
Thank you!
#include <omp.h>
#include <stdio.h>
#define NUM_THREADS 4
static long num_steps = 100000000;
double step;
int main()
{
int i;
double start_time, run_time, pi, sum[NUM_THREADS];
omp_set_num_threads(NUM_THREADS);
step = 1.0 / (double)num_steps;
start_time = omp_get_wtime();
#pragma omp parallel
{
int i, id, currentThread;
double x;
id = omp_get_thread_num();
currentThread = omp_get_num_threads();
for (i = id, sum[id] = 0.0; i < num_steps; i = i + currentThread)
{
x = (i + 0.5) * step;
sum[id] = sum[id] + 4.0 / (1.0 + x * x);
}
}
run_time = omp_get_wtime() - start_time;
// then combine the partial sums to get the value of pi
for (i = 0, pi = 0.0; i < NUM_THREADS; i++)
{
pi = pi + sum[i] * step;
}
printf("\n pi with %ld steps is %lf \n ", num_steps, pi);
printf("run time = %6.6f seconds\n", run_time);
}
In your case there is no need for an explicit barrier: there is an implicit barrier at the end of the parallel region.
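To illustrate when an explicit barrier would matter, here is a minimal sketch (not the asker's program) in which the partial sums are combined inside the parallel region. In that variant one thread reads values written by the others, so a #pragma omp barrier is needed before the read. The function name pi_with_barrier, the partial array, and the stub fallbacks for building without OpenMP are assumptions made for illustration.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback stubs so the sketch also builds without OpenMP enabled. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#define NUM_THREADS 4

/* Hypothetical helper: combines the partial sums inside the parallel region. */
double pi_with_barrier(long num_steps)
{
    const double step = 1.0 / (double)num_steps;
    double partial[NUM_THREADS] = {0.0};
    double pi = 0.0;

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        const int id = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();
        double local = 0.0;

        for (long i = id; i < num_steps; i += nthreads) {
            const double x = (i + 0.5) * step;
            local += 4.0 / (1.0 + x * x);
        }
        partial[id] = local * step;

        /* Every partial[] slot must be written before any thread reads them. */
        #pragma omp barrier

        /* One thread adds up the slots; single ends with its own implicit barrier. */
        #pragma omp single
        {
            for (int k = 0; k < nthreads; k++) {
                pi += partial[k];
            }
        }
    }
    return pi;
}
```

In the original program the summation happens after the parallel region, so the implicit barrier at its closing brace already gives the same guarantee.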
Your code, however, has a performance issue. Different threads update adjacent elements of the sum array, which can cause false sharing:
When multiple threads access same cache line and at least one of them
writes to it, it causes costly invalidation misses and upgrades.
To avoid it, you have to make sure that each element of the sum array is located on a different cache line, but there is a simpler solution: use OpenMP's reduction clause. Please check the example suggested by @JeromeRichard. Using reduction, your code should be something like this:
double sum=0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < num_steps; i++)
{
const double x = (i + 0.5) * step;
sum += 4.0 / (1.0 + x * x);
}
Note also that you should use your variables in their minimum required scope.
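For comparison, the cache-line alternative mentioned above could look roughly like the sketch below. It pads each thread's slot so that no two slots land on the same cache line. The 64-byte line size, the padded_double type, and the function name pi_padded are assumptions for illustration; the real line size depends on the CPU. The stubs only exist so the sketch also builds without OpenMP enabled.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback stubs so the sketch also builds without OpenMP enabled. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#define NUM_THREADS 4
#define CACHE_LINE 64 /* assumed cache-line size in bytes */

/* One accumulator per thread, padded so consecutive slots are 64 bytes apart. */
typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_double;

double pi_padded(long num_steps)
{
    const double step = 1.0 / (double)num_steps;
    padded_double sum[NUM_THREADS];
    int nthreads = 1; /* team size actually obtained */

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        const int id = omp_get_thread_num();
        const int n = omp_get_num_threads();
        if (id == 0) {
            nthreads = n; /* only one thread records the team size */
        }
        sum[id].value = 0.0;
        for (long i = id; i < num_steps; i += n) {
            const double x = (i + 0.5) * step;
            sum[id].value += 4.0 / (1.0 + x * x);
        }
    }

    /* The implicit barrier above guarantees all slots are complete here. */
    double pi = 0.0;
    for (int k = 0; k < nthreads; k++) {
        pi += sum[k].value * step;
    }
    return pi;
}
```

The reduction version remains shorter and does not depend on a guessed cache-line size, which is why it is usually the better choice.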
Related
I've just recently started using openmp and can't understand why this program isn't scaling well:
#include <stdio.h>
#include <omp.h>
#define THREAD_NUMBER 4
int main() {
double start, end;
double pi = 0.0;
int actual_threads;
unsigned long numSteps = 1000000000;
double stepSize = 1.0/numSteps;
omp_set_num_threads(THREAD_NUMBER);
start = omp_get_wtime();
#pragma omp parallel for reduction (+:pi)
for (int i = 0; i < numSteps; i++) {
pi += 4.0/(1 + (0.5 + i)*stepSize*(0.5 + i)*stepSize);
}
pi *= stepSize;
end = omp_get_wtime();
printf("runtime: %.10fs\n", end - start);
printf("pi = %.15f\n", pi);
return 0;
}
It's just a basic test program in C for me to get used to the OpenMP directives. When I run it with 1 thread, the parallel section alone takes about 3.5 seconds. With 4 threads it takes about 1.4 seconds. This is poor scaling: it's not even close to cutting the time by 4, and I can't understand why. I appreciate any help.
The code below is a direct translation from a YouTube video on estimating pi using OpenMP and Monte Carlo. Even with the same inputs I'm not getting their output. In fact, I seem to get around half the expected value.
int main() {
int num; // number of iterations
printf("Enter number of iterations you want the loop to run for: ");
scanf_s("%d", &num);
double x, y, z, pi;
long long int i;
int count = 0;
int num_thread;
printf("Enter number of threads you want to run to parallelize the process:\t");
scanf_s("%d", &num_thread);
printf("\n");
#pragma omp parallel firstprivate(x,y,z,i) shared(count) num_threads(num_thread)
{
srand((int)time(NULL) ^ omp_get_thread_num());
for (i = 0; i < num; i++) {
x = (double)rand() / (double)RAND_MAX;
y = (double)rand() / (double)RAND_MAX;
z = pow(((x * x) + (y * y)), .5);
if (z <= 1) {
count++;
}
}
} // END PRAGMA
pi = ((double)count / (double)(num * num_thread)) * 4;
printf("The value of pi obtained is %f\n", pi);
return 0;
}
I've also used a similar algorithm straight from the Oak Ridge National Laboratory's website (https://www.olcf.ornl.gov/tutorials/monte-carlo-pi/):
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>
int main(int argc, char* argv[])
{
int niter = 1000000; //number of iterations per FOR loop
double x,y; //x,y value for the random coordinate
int i; //loop counter
int count=0; //Count holds all the number of how many good coordinates
double z; //Used to check if x^2+y^2<=1
double pi; //holds approx value of pi
int numthreads = 16;
#pragma omp parallel firstprivate(x, y, z, i) shared(count) num_threads(numthreads)
{
srandom((int)time(NULL) ^ omp_get_thread_num()); //Give random() a seed value
for (i=0; i<niter; ++i) //main loop
{
x = (double)random()/RAND_MAX; //gets a random x coordinate
y = (double)random()/RAND_MAX; //gets a random y coordinate
z = sqrt((x*x)+(y*y)); //Checks to see if number is inside unit circle
if (z<=1)
{
++count; //if it is, consider it a valid random point
}
}
//print the value of each thread/rank
}
pi = ((double)count/(double)(niter*numthreads))*4.0;
printf("Pi: %f\n", pi);
return 0;
}
And I have the exact same problem, so I think it isn't the code but somehow my machine.
I am running Visual Studio 2022 on Windows 11 with a 16-core i9-12900KF and 32 GB of RAM.
Edit: I forgot to mention that I did alter the second algorithm to use srand() and rand() instead.
There are many errors in the code:
As pointed out by @JeromeRichard and @JohnBollinger, rand/srand/random are not thread-safe; you should use a thread-safe alternative.
There is a race condition at the line ++count; (different threads read and write a shared variable). You should use a reduction to avoid it.
The code assumes that you use numthreads threads, but OpenMP does not guarantee that you actually get all of the threads you requested. I think that if you got pi/2 as a result, the cause is a difference between the requested and the obtained number of threads. If you use #pragma omp parallel for... before the loop, you do not need any assumption about the number of threads (i.e. in this case the formula for pi does not contain the number of threads).
A minor comment is that you do not need to use the time-consuming pow function.
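As a small sketch of the point about thread counts, the team size can be queried inside the parallel region instead of being assumed. The function name actual_team_size is hypothetical, and the stub only exists so the sketch also builds without OpenMP enabled.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback stub so the sketch also builds without OpenMP enabled. */
static int omp_get_num_threads(void) { return 1; }
#endif

/* Returns how many threads the runtime actually provided for a request. */
int actual_team_size(int requested)
{
    int got = 1;

    #pragma omp parallel num_threads(requested)
    {
        /* Exactly one thread stores the size of the current team. */
        #pragma omp single
        got = omp_get_num_threads();
    }
    return got;
}
```

If the denominator in the original formula used this value rather than the requested count, it would at least match the number of loops that really ran.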
Putting it together your code should be something like this:
#pragma omp parallel for reduction(+:count) num_threads(num_thread)
for (long long int i = 0; i < num; i++) {
const double x = threadsafe_random_number_between_0_1();
const double y = threadsafe_random_number_between_0_1();
const double z = x * x + y * y;
if (z <= 1) {
count++;
}
}
double pi = ((double) count / (double) num ) * 4.0;
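The placeholder threadsafe_random_number_between_0_1() above is not a real library function. One possible way to fill it in, as a sketch only, is to give every thread its own generator state. The version below uses a plain xorshift64 generator; the names next_uniform and monte_carlo_pi, the seeding scheme, and the choice of generator are all assumptions for illustration, not something taken from the question or the tutorial.

```c
#include <stdint.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback stub so the sketch also builds without OpenMP enabled. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* xorshift64 step on caller-owned state; returns a double in [0, 1). */
static double next_uniform(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return (double)(x >> 11) / 9007199254740992.0; /* divide by 2^53 */
}

/* Hypothetical helper: estimates pi from num random points in the unit square. */
double monte_carlo_pi(long long num, uint64_t seed)
{
    long long count = 0;

    #pragma omp parallel reduction(+:count)
    {
        /* Each thread derives a different, private state from the seed. */
        uint64_t state = seed
            + 0x9E3779B97F4A7C15ULL * (uint64_t)(omp_get_thread_num() + 1);
        if (state == 0) {
            state = 1; /* xorshift state must not be zero */
        }

        #pragma omp for
        for (long long i = 0; i < num; i++) {
            const double x = next_uniform(&state);
            const double y = next_uniform(&state);
            if (x * x + y * y <= 1.0) {
                count++;
            }
        }
    }
    return 4.0 * (double)count / (double)num;
}
```

Because the state lives in a variable declared inside the parallel region, no two threads ever touch the same generator, and the total number of points is num regardless of how many threads run.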
One guess, though I may be wrong: you seed the random generator with the time, so different threads may end up using the same seed. They would then generate the same random numbers, and the result would be poor because the same points are counted several times. This is a problem for the Monte Carlo method, where repeated identical points degrade the result.
I am studying this tutorial about OpenMP and I came across this exercise, on page 19. It is a pi calculation algorithm which I have to parallelize:
static long num_steps = 100000;
double step;
void main ()
{
int i;
double x, pi;
double sum = 0.0;
step = 1.0 / (double)num_steps;
for(i = 0; i < num_steps; i++)
{
x = (i + 0.5) * step;
sum = sum + 4.0 / (1.0 + x*x);
}
pi = step * sum;
}
Up to this point, I cannot use #pragma omp parallel for. I can only use:
#pragma omp parallel {}
omp_get_thread_num();
omp_set_num_threads(int);
omp_get_num_threads();
My implementation looks like this :
#define NUM_STEPS 800
int main(int argc, char **argv)
{
int num_steps = NUM_STEPS;
int i;
double x;
double pi;
double step = 1.0 / (double)num_steps;
double sum[num_steps];
for(i = 0; i < num_steps; i++)
{
sum[i] = 0;
}
omp_set_num_threads(num_steps);
#pragma omp parallel
{
x = (omp_get_thread_num() + 0.5) * step;
sum[omp_get_thread_num()] += 4.0 / (1.0 + x * x);
}
double totalSum = 0;
for(i = 0; i < num_steps; i++)
{
totalSum += sum[i];
}
pi = step * totalSum;
printf("Pi: %.5f", pi);
}
Ignoring the problem of using a sum array (the tutorial explains later that a critical section for the sum value should be defined with #pragma omp critical or #pragma omp atomic), the above implementation only works for a limited number of threads (800 in my case), whereas the serial code uses 100000 steps. Is there a way to achieve this with only the aforementioned OpenMP commands, or am I obliged to use #pragma omp parallel for, which hasn't been mentioned yet in the tutorial?
Thanks a lot for your time, I am really trying to grasp the concept of parallelization in C using OpenMP.
You will need to find a way to make your parallel algorithm somewhat independent from the number of threads.
The simplest way is to do something like:
int tid = omp_get_thread_num();
int n_threads = omp_get_num_threads();
for (int i = tid; i < num_steps; i += n_threads) {
// ...
}
This way the work is split across all threads regardless of the number of threads.
If there were 3 threads and 9 steps:
Thread 0 would do steps 0, 3, 6
Thread 1 would do steps 1, 4, 7
Thread 2 would do steps 2, 5, 8
This works but isn't ideal if each thread is accessing data from some shared array. It is better if threads access sections of data nearby for locality purposes.
In that case you can divide the number of steps by the number of threads and give each thread a contiguous set of tasks like so:
int tid = omp_get_thread_num();
int n_threads = omp_get_num_threads();
int steps_per_thread = num_steps / n_threads;
int start = tid * steps_per_thread;
int end = start + steps_per_thread;
for (int i = start; i < end; i++) {
// ...
}
Now the 3 threads performing 9 steps looks like:
Thread 0 does steps 0, 1, 2
Thread 1 does steps 3, 4, 5
Thread 2 does steps 6, 7, 8
This approach is most likely what happens when #pragma omp for is used with the default schedule. In most implementations the iterations are simply divided by the number of threads and each thread is assigned one contiguous section.
So given a set of 2 threads and a 100-iteration for loop, iterations 0-49 would likely go to thread 0 and iterations 50-99 to thread 1.
Note that if the number of iterations does not divide evenly by the number of threads the remainder needs to be handled explicitly.
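Putting the contiguous split and the remainder handling together for the pi exercise, a minimal sketch could look like this. It only uses #pragma omp parallel and the thread-number queries, plus #pragma omp critical for the final addition (which the question says the tutorial introduces later). The name pi_block and the way the remainder is spread over the first threads are assumptions for illustration; the stubs only exist so the sketch also builds without OpenMP enabled.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback stubs so the sketch also builds without OpenMP enabled. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical helper: block distribution that also covers the remainder. */
double pi_block(long num_steps)
{
    const double step = 1.0 / (double)num_steps;
    double total = 0.0;

    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        const int n_threads = omp_get_num_threads();
        const long base = num_steps / n_threads;  /* steps every thread gets */
        const long extra = num_steps % n_threads; /* leftover steps */

        /* The first `extra` threads each take one additional step. */
        const long start = tid * base + (tid < extra ? tid : extra);
        const long end = start + base + (tid < extra ? 1 : 0);

        double local = 0.0; /* private accumulator, no sharing in the loop */
        for (long i = start; i < end; i++) {
            const double x = (i + 0.5) * step;
            local += 4.0 / (1.0 + x * x);
        }

        /* One thread at a time adds its partial sum to the shared total. */
        #pragma omp critical
        total += local;
    }
    return total * step;
}
```

With 3 threads and 10 steps, this assigns steps 0-3 to thread 0, 4-6 to thread 1, and 7-9 to thread 2, so no step is dropped.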
I am trying to compute the value of pi using the trapezoidal rule of numerical integration. For that I have written serial code which iterates over a given range. To measure the parallel overhead, I ran the same code with the number of threads set to 1, and obtained the following graph of execution time versus problem size.
Since we are only creating one thread, I don't think there is much communication overhead involved. So what might be the reason behind this? As far as I know, the directive is processed at compile time, just as a macro is expanded before runtime, so am I missing something there? Or is it something totally different from what I have thought?
Below is the serial code
#include<stdio.h>
#include<omp.h>
int main()
{
FILE *fp = fopen("pi_serial.txt", "a+");
long num_steps = 1e9;
double step_size = 1.0 / num_steps;
long i;
double sum = 0;
double start_time = omp_get_wtime();
for(i = 0; i< num_steps; i++) {
double x = (i + 0.5) * step_size;
sum += (4.0 / (1.0 + (x * x)));
}
sum = sum * step_size;
double end_time = omp_get_wtime();
fprintf(fp, "%lf %lf\n", sum, end_time - start_time);
fclose(fp);
return 0;
}
And here is the multi-threaded code
#include <stdio.h>
#include <omp.h>
#include <stdlib.h>
int main(int argc, char* argv[])
{
FILE* fp = fopen("pi_parallel.txt", "a+");
omp_set_num_threads(1);
long num_steps = atol(argv[1]);
double step_size = 1.0 / num_steps;
double sum = 0;
double start_time = omp_get_wtime();
#pragma omp parallel
{
int id = omp_get_thread_num();
double private_sum = 0;
int i;
for(i = id; i <= num_steps; i += 1){
double x = (i + 0.5) * step_size;
private_sum += (4.0 / (1.0 + x * x));
}
#pragma omp critical
sum += private_sum;
}
sum *= step_size;
double end_time = omp_get_wtime();
fprintf(fp, "%lf %lf\n", sum, end_time - start_time);
fclose(fp);
return 0;
}
And here is the graph for Execution time
https://www.youtube.com/watch?v=OuzYICZUthM&list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG&index=7
The above video will help in understanding why serial code might be faster than parallel code running with one thread.
According to the presenter, because an OpenMP program has to set up the OpenMP environment and create threads in the middle of the program, it is normal for it to run slower than the serial code.
But the main thing is to look at the scalability of your code: how fast is it compared to the serial version when running on more than 1 thread?
If you run the same code on multiple threads and still do not see an increase in performance, it may be due to false sharing. From what I understand: consider two variables that reside in the same cache line. The master thread accesses one of the variables and modifies it, which causes the cache line to be invalidated. If thread 1 then has to access that cache line, the modified line is written to memory, and the thread fetches the line from memory again before modifying it. This process may increase the execution time.
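One way to look at scalability, sketched here under assumptions, is to time the same loop with different thread counts and compare against the one-thread time. The names timed_pi and report_scaling are hypothetical; the fallback based on clock() is only there so the sketch also builds without OpenMP enabled.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
/* Fallback timer so the sketch also builds without OpenMP enabled. */
static double omp_get_wtime(void) { return (double)clock() / CLOCKS_PER_SEC; }
#endif

/* Runs the pi loop with the requested thread count; returns elapsed seconds. */
double timed_pi(int threads, long num_steps, double *pi_out)
{
    const double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    const double t0 = omp_get_wtime();

    #pragma omp parallel for reduction(+:sum) num_threads(threads)
    for (long i = 0; i < num_steps; i++) {
        const double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    *pi_out = sum * step;
    return omp_get_wtime() - t0;
}

/* Prints the speedup relative to one thread for 2, 4 and 8 threads. */
void report_scaling(long num_steps)
{
    double pi;
    const double t1 = timed_pi(1, num_steps, &pi);

    for (int t = 2; t <= 8; t *= 2) {
        const double tn = timed_pi(t, num_steps, &pi);
        printf("%d threads: %.4f s, speedup %.2f\n", t, tn, t1 / tn);
    }
}
```

Calling report_scaling(1000000000) would print one line per thread count; the actual numbers depend entirely on the machine and compiler flags.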
References:
https://docs.oracle.com/cd/E37069_01/html/E37081/aewcy.html
*I don't own the video.
I'd like to get to know OpenMP a bit, because I'd like to parallelize a huge loop. After some reading (SO, Common OMP mistakes, tutorial, etc.), I've taken as a first step the basically working C/MEX code given below (which yields different results for the first test case).
The first test does sum up result values - functions serial, parallel -,
the second takes values from an input array and writes the processed values to an output array - functions serial_a, parallel_a.
My questions are:
Why do the results of the first test differ, i.e. the results of the serial and parallel functions?
Surprisingly, the second test succeeds. My concern is how to handle memory (array locations) that may be read by multiple threads. In the example this should be emulated by sin(a[i]) / cos(a[n-i]+1.0).
Are there some easy rules how to determine which variables to declare as private, shared and reduction?
In both cases int i is outside the pragma, however the second test appears to yield correct results. So is that okay or has i to be moved into the pragma omp parallel region, as being said here?
Any other hints on mistakes you spot?
Code
#include "mex.h"
#include <math.h>
#include <omp.h>
#include <time.h>
double serial(int x)
{
double sum=0;
int i;
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
return sum;
}
double parallel(int x)
{
double sum=0;
int i;
#pragma omp parallel num_threads(6) shared(sum) //default(none)
{
//printf(" I'm thread no. %d\n", omp_get_thread_num());
#pragma omp for private(i, x) reduction(+: sum)
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
}
return sum;
}
void serial_a(double* a, int n, double* y2)
{
int i;
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
void parallel_a(double* a, int n, double* y2)
{
int i;
#pragma omp parallel num_threads(6)
{
#pragma omp for private(i)
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
}
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
double sum, *y1, *y2, *a, s, p;
int x, n, *d;
/* Check for proper number of arguments. */
if(nrhs!=2) {
mexErrMsgTxt("Two inputs required.");
} else if(nlhs>2) {
mexErrMsgTxt("Too many output arguments.");
}
/* Get pointer to first input */
x = (int)mxGetScalar(prhs[0]);
/* Get pointer to second input */
a = mxGetPr(prhs[1]);
d = (int*)mxGetDimensions(prhs[1]);
n = (int)d[1]; // row vector
/* Create space for output */
plhs[0] = mxCreateDoubleMatrix(2,1, mxREAL);
plhs[1] = mxCreateDoubleMatrix(n,2, mxREAL);
/* Get pointer to output array */
y1 = mxGetPr(plhs[0]);
y2 = mxGetPr(plhs[1]);
{ /* Do the calculation */
clock_t tic = clock();
y1[0] = serial(x);
s = (double) clock()-tic;
printf("serial....: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
y1[1] = parallel(x);
p = (double) clock()-tic;
printf("parallel..: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
mexEvalString("drawnow");
tic = clock();
serial_a(a, n, y2);
s = (double) clock()-tic;
printf("serial_a..: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
parallel_a(a, n, &y2[n]);
p = (double) clock()-tic;
printf("parallel_a: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
}
}
Output
>> mex omp1.c
>> [a, b] = omp1(1e8, 1:1e8);
serial....: 13399 ms
parallel..: 2810 ms
ratio.....: 0.21
serial_a..: 12840 ms
parallel_a: 2740 ms
ratio.....: 0.21
>> a(1) == a(2)
ans =
0
>> all(b(:,1) == b(:,2))
ans =
1
System
MATLAB Version: 8.0.0.783 (R2012b)
Operating System: Microsoft Windows 7 Version 6.1 (Build 7601: Service Pack 1)
Microsoft Visual Studio 2005 Version 8.0.50727.867
In your function parallel you have a few mistakes. The reduction should be declared when you use parallel. Private and shared variables should also be declared when you use parallel. But when you do a reduction you should not declare the variable that is being reduced as shared. The reduction will take care of this.
To know what to declare private or shared, ask yourself which variables are being written to. If a variable is not written to, you normally want it to be shared. In your case the variable x does not change, so you should declare it shared; declaring it private, as your code does, gives each thread its own uninitialized copy of x, so the loop body no longer sees the value that was passed in. The variable i, however, does change, so normally you should declare it private. To fix your function you could do
#pragma omp parallel reduction(+:sum) private(i) shared(x)
{
#pragma omp for
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
}
However, OpenMP automatically makes the iteration variable of a parallel for loop private, and variables declared outside of parallel regions are shared by default, so for your parallel function you can simply do
#pragma omp parallel for reduction(+:sum)
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
Notice that the only difference between this and your serial code is the pragma statement. OpenMP is designed so that you don't have to change your code except for pragma statements.
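Two related points can be sketched together. First, default(none) forces every variable's sharing to be written out, which helps when you are unsure what to declare. Second, since the question compares the two results with ==, keep in mind that a reduction adds the terms in a different order than the serial loop, so floating-point sums may differ in the last bits even when the code is correct. The summand below is a simplified stand-in for the sin/cos expression, and the names serial_sum, parallel_sum, and nearly_equal are hypothetical.

```c
/* Serial reference: same loop without any pragma. */
double serial_sum(int x)
{
    double sum = 0.0;
    for (int i = 0; i < x; i++) {
        sum += 1.0 / (1.0 + (double)x * i);
    }
    return sum;
}

/* Parallel version with every data-sharing attribute spelled out. */
double parallel_sum(int x)
{
    double sum = 0.0;
    int i;

    /* default(none) makes the compiler reject any variable left unlisted. */
    #pragma omp parallel for default(none) shared(x) private(i) reduction(+:sum)
    for (i = 0; i < x; i++) {
        sum += 1.0 / (1.0 + (double)x * i);
    }
    return sum;
}

/* Compare with a tolerance instead of ==, since summation order may differ. */
int nearly_equal(double a, double b, double tol)
{
    const double diff = (a > b) ? (a - b) : (b - a);
    return diff <= tol;
}
```

For example, nearly_equal(serial_sum(1000), parallel_sum(1000), 1e-9) is a more meaningful check than comparing the two sums for exact equality.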
When it comes to arrays, as long as each iteration of a parallel for loop acts on a different array element, you don't have to worry about shared and private. So you can write your parallel_a function simply as
#pragma omp parallel for
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
and once again it is the same as your serial_a function except for the pragma statement.
But be careful with assuming iterators are private. Consider the following double loop
for(i=0; i<n; i++) {
for(j=0; j<m; j++) {
//
}
}
If you use #pragma omp parallel for with that, the i iterator will be made private but the j iterator will be shared. This is because parallel for only applies to the outer loop over i, and since j is declared outside the parallel region it is shared by default. In this case you need to explicitly declare j private, like this: #pragma omp parallel for private(j).
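A minimal sketch of that nested-loop case, with j declared outside the loops as in the C89-style code above, could look like this. The function name fill_table and the values written are made up for illustration.

```c
/* Fills an n-by-m row-major table; each (i, j) cell gets the product i * j. */
void fill_table(double *out, int n, int m)
{
    int i, j;

    /* i is private automatically; j must be listed or the threads share it. */
    #pragma omp parallel for private(j)
    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++) {
            out[i * m + j] = (double)i * j;
        }
    }
}
```

If your compiler accepts C99, declaring the inner counter inside the loop (for (int j = 0; ...)) has the same effect, because a variable declared inside the parallel region is private to each thread.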