How to parallelise a code inside a while using OpenMP - c

I am trying to parallelise the heat_plate algorithm, but I am stuck at this bit of code inside my while loop:
while (1)
{
    .....
    .....
    #pragma omp parallel shared(diff, u, w) private(i, j, my_diff)
    {
        my_diff = 0.0;
        #pragma omp for
        for (i = 1; i < M - 1; i++)
        {
            for (j = 1; j < N - 1; j++)
            {
                if (my_diff < fabs(w[i][j] - u[i][j]))
                {
                    my_diff = fabs(w[i][j] - u[i][j]);
                }
            }
        }
        #pragma omp critical
        {
            if (diff < my_diff)
            {
                diff = my_diff;
            }
        }
    }
    ....
    ....
}
Not only can I not get it to work well in parallel, it actually takes longer to finish.
Edit: the program does run in parallel, it just takes longer than the serial version.
Thank you in advance for your help.

In OpenMP this data dependency:
for (i = 1; i < M - 1; i++)
for (j = 1; j < N - 1; j++)
if ( my_diff < fabs (w[i][j] - u[i][j]))
my_diff = fabs (w[i][j] - u[i][j]);
is typically handled with OpenMP's reduction clause, which in your case avoids the critical region after the parallel for and, consequently, improves the overall performance of the parallelisation. If you apply it, your code would look like the following:
#pragma omp parallel shared(u, w) private(i, j)
{
    #pragma omp for reduction(max:diff)
    for (i = 1; i < M - 1; i++)
        for (j = 1; j < N - 1; j++)
            if (diff < fabs(w[i][j] - u[i][j]))
                diff = fabs(w[i][j] - u[i][j]);
}
You can also merge both pragmas into one:
#pragma omp parallel for reduction(max:diff) shared(u, w) private(i, j)
for (i = 1; i < M - 1; i++)
    for (j = 1; j < N - 1; j++)
        if (diff < fabs(w[i][j] - u[i][j]))
            diff = fabs(w[i][j] - u[i][j]);
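One detail worth keeping in mind is that with a max reduction, diff only ever grows, so it has to be reset before the reduction loop on every pass of the while loop. A minimal sketch of how the merged pragma might sit inside the surrounding loop (epsilon is assumed here to be the convergence tolerance used in the elided code):
while (1)
{
    /* ... compute the new estimate w from u ... */
    diff = 0.0;  /* reset each iteration, otherwise the previous maximum is carried over */
    #pragma omp parallel for reduction(max:diff) shared(u, w) private(i, j)
    for (i = 1; i < M - 1; i++)
        for (j = 1; j < N - 1; j++)
            if (diff < fabs(w[i][j] - u[i][j]))
                diff = fabs(w[i][j] - u[i][j]);
    if (diff < epsilon)  /* epsilon: assumed convergence tolerance */
        break;
    /* ... copy w into u for the next iteration ... */
}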

Related

Numbers not randomized after runs

I'm trying to create an OpenMP program that randomizes double arrays and runs the values through the formula: y[i] = (a[i] * b[i]) + c[i] + (d[i] * e[i]) + (f[i] / 2);
If I run the program multiple times I've realised that the y[] values are the same, even though they are supposed to be randomized when the arrays are initialized in the first #pragma omp for. Any ideas as to why this might be happening?
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#define ARRAY_SIZE 10
double randfrom(double min, double max);
double randfrom(double min, double max)
{
    double range = (max - min);
    double div = RAND_MAX / range;
    return min + (rand() / div);
}
int main() {
    int i;
    double a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE], d[ARRAY_SIZE], e[ARRAY_SIZE], f[ARRAY_SIZE], y[ARRAY_SIZE];
    double min, max;
    int imin, imax;
    /* A[10] consists of random numbers in between 1 and 100
       B[10] consists of random numbers in between 10 and 50
       C[10] consists of random numbers in between 1 and 10
       D[10] consists of random numbers in between 1 and 50
       E[10] consists of random numbers in between 1 and 5
       F[10] consists of random numbers in between 10 and 80 */
    srand(time(NULL));
    #pragma omp parallel
    {
        #pragma omp parallel for
        for (i = 0; i < ARRAY_SIZE; i++) {
            a[i] = randfrom(1, 100);
            b[i] = randfrom(10, 50);
            c[i] = randfrom(1, 50);
            d[i] = randfrom(1, 50);
            e[i] = randfrom(1, 5);
            f[i] = randfrom(10, 80);
        }
    }
    printf("This is the parallel Print\n\n\n");
    #pragma omp parallel shared(a,b,c,d,e,f,y) private(i)
    {
        //Y=(A*B)+C+(D*E)+(F/2)
        #pragma omp for schedule(dynamic) nowait
        for (i = 0; i < ARRAY_SIZE; i++) {
            /*printf("A[%d]%.2f",i, a[i]);
            printf("\n\n");
            printf("B[%d]%.2f", i, b[i]);
            printf("\n\n");
            printf("C[%d]%.2f", i, c[i]);
            printf("\n\n");
            printf("D[%d]%.2f", i, d[i]);
            printf("\n\n");
            printf("E[%d]%.2f", i, e[i]);
            printf("\n\n");
            printf("F[%d]%.2f", i, f[i]);
            printf("\n\n");*/
            y[i] = (a[i] * b[i]) + c[i] + (d[i] * e[i]) + (f[i] / 2);
            printf("Y[%d]=%.2f\n", i, y[i]);
        }
    }
    #pragma omp parallel shared(y, min,imin,max,imax) private(i)
    {
        //min
        #pragma omp for schedule(dynamic) nowait
        for (i = 0; i < ARRAY_SIZE; i++) {
            if (i == 0) {
                min = y[i];
                imin = i;
            }
            else {
                if (y[i] < min) {
                    min = y[i];
                    imin = i;
                }
            }
        }
        //max
        #pragma omp for schedule(dynamic) nowait
        for (i = 0; i < ARRAY_SIZE; i++) {
            if (i == 0) {
                max = y[i];
                imax = i;
            }
            else {
                if (y[i] > max) {
                    max = y[i];
                    imax = i;
                }
            }
        }
    }
    printf("min y[%d] = %.2f\nmax y[%d] = %.2f\n", imin, min, imax, max);
    return 0;
}
First of all, I would like to emphasize that OpenMP has significant overheads, so you need a reasonable amount of work in your code, otherwise the overhead is bigger than the gain from parallelization. In your code the amount of work is tiny (ARRAY_SIZE is only 10), so the overhead outweighs any gain and the fastest solution is plain serial code. However, you mentioned that your goal is to learn OpenMP, so I will show you how to do it.
In your previous post's comments @paleonix linked a post (How to generate random numbers in parallel?) which answers your question about random numbers. One of the solutions is to use rand_r, which takes an explicit per-thread seed instead of sharing hidden state between threads.
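As a minimal illustration of that idea (this is only a sketch, not the code from the linked post; fill_random is a hypothetical helper and rand_r is assumed to be available, as it is on POSIX systems):
#include <stdlib.h>
#include <omp.h>

/* Sketch: one independent seed per thread, derived from the thread number. */
void fill_random(double *a, int n)
{
    #pragma omp parallel
    {
        unsigned int seed = 1234u + (unsigned int)omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++) {
            a[i] = (double)rand_r(&seed) / RAND_MAX;  /* uniform value in [0, 1] */
        }
    }
}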
Your code has a data race when searching for the minimum and maximum values of array y. If you only need the minimum/maximum value itself, it is very easy, because you can use a reduction like this:
double max = y[0];
#pragma omp parallel for default(none) shared(y) reduction(max:max)
for (int i = 1; i < ARRAY_SIZE; i++) {
    if (y[i] > max) {
        max = y[i];
    }
}
But in your case you also need the indices of the minimum and maximum values, so it is a bit more complicated. You have to use a critical section to be sure that other threads cannot change the max, min, imax and imin values while you are updating them. It can be done the following way (e.g. for finding the minimum value):
#pragma omp parallel for
for (int i = 0; i < ARRAY_SIZE; i++) {
    if (y[i] < min) {
        #pragma omp critical
        if (y[i] < min) {
            min = y[i];
            imin = i;
        }
    }
}
Note that the if (y[i] < min) appears twice: after the first (unsynchronized) comparison other threads may change the value of min, so inside the critical region you have to check it again before updating min and imin. This is essentially the double-checked locking pattern. You can find the maximum value in exactly the same way.
Always use your variables at their minimum required scope.
It is also recommended to use the default(none) clause in your OpenMP parallel regions, so that you have to explicitly define the sharing attributes of all of your variables.
You can fill the array and find its minimum/maximum values in a single loop and print their values in a different serial loop.
If you set min and max before the loop, you can get rid of the extra comparison if (i == 0) used inside the loop.
Putting it together:
double threadsafe_rand(unsigned int* seed, double min, double max)
{
    double range = (max - min);
    double div = RAND_MAX / range;
    return min + (rand_r(seed) / div);
}
In main:
double min = DBL_MAX;
double max = -DBL_MAX;
#pragma omp parallel default(none) shared(a,b,c,d,e,f,y,imin,imax,min,max)
{
    unsigned int seed = omp_get_thread_num();
    #pragma omp for
    for (int i = 0; i < ARRAY_SIZE; i++) {
        a[i] = threadsafe_rand(&seed, 1, 100);
        b[i] = threadsafe_rand(&seed, 10, 50);
        c[i] = threadsafe_rand(&seed, 1, 10);
        d[i] = threadsafe_rand(&seed, 1, 50);
        e[i] = threadsafe_rand(&seed, 1, 5);
        f[i] = threadsafe_rand(&seed, 10, 80);
        y[i] = (a[i] * b[i]) + c[i] + (d[i] * e[i]) + (f[i] / 2);
        if (y[i] < min) {
            #pragma omp critical
            if (y[i] < min) {
                min = y[i];
                imin = i;
            }
        }
        if (y[i] > max) {
            #pragma omp critical
            if (y[i] > max) {
                max = y[i];
                imax = i;
            }
        }
    }
}
// printout
for (int i = 0; i < ARRAY_SIZE; i++) {
    printf("Y[%d]=%.2f\n", i, y[i]);
}
printf("min y[%d] = %.2f\nmax y[%d] = %.2f\n", imin, min, imax, max);
Update:
I have updated the code according to @Qubit's and @JérômeRichard's suggestions:
I used the 'Really minimal PCG32 code' / (c) 2014 M.E. O'Neill / from https://www.pcg-random.org/download.html. Note that I do not intend to properly handle the seeding of this simple random number generator. If you would like to do so, please use a complete random number generator library.
I have changed the code to use user-defined reductions. Indeed, it makes the code much more efficient, but it is not really beginner friendly. It would require a very long post to explain it, so if you are interested in the details, please read a book about OpenMP.
I have reduced the number of divisions in threadsafe_rand.
The updated code:
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <float.h>
#include <limits.h>
#include <omp.h>
#define ARRAY_SIZE 10

// *Really* minimal PCG32 code / (c) 2014 M.E. O'Neill / pcg-random.org
// Licensed under Apache License 2.0 (NO WARRANTY, etc. see website)
typedef struct { uint64_t state; uint64_t inc; } pcg32_random_t;

inline uint32_t pcg32_random_r(pcg32_random_t* rng)
{
    uint64_t oldstate = rng->state;
    // Advance internal state
    rng->state = oldstate * 6364136223846793005ULL + (rng->inc | 1);
    // Calculate output function (XSH RR), uses old state for max ILP
    uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
    uint32_t rot = oldstate >> 59u;
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}

inline double threadsafe_rand(pcg32_random_t* seed, double min, double max)
{
    const double tmp = 1.0 / UINT32_MAX;
    return min + tmp * (max - min) * pcg32_random_r(seed);
}

struct v {
    double value;
    int i;
};

#pragma omp declare reduction(custom_min: struct v: \
    omp_out = omp_in.value < omp_out.value ? omp_in : omp_out )\
    initializer(omp_priv={DBL_MAX,0} )
#pragma omp declare reduction(custom_max: struct v: \
    omp_out = omp_in.value > omp_out.value ? omp_in : omp_out )\
    initializer(omp_priv={-DBL_MAX,0} )

int main() {
    double a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE], d[ARRAY_SIZE], e[ARRAY_SIZE], f[ARRAY_SIZE], y[ARRAY_SIZE];
    struct v max = {-DBL_MAX, 0};
    struct v min = {DBL_MAX, 0};
    #pragma omp parallel default(none) shared(a,b,c,d,e,f,y) reduction(custom_min:min) reduction(custom_max:max)
    {
        pcg32_random_t seed = {omp_get_thread_num()*7842 + time(NULL)%2299, 1234 + omp_get_thread_num()};
        #pragma omp for
        for (int i = 0; i < ARRAY_SIZE; i++) {
            a[i] = threadsafe_rand(&seed, 1, 100);
            b[i] = threadsafe_rand(&seed, 10, 50);
            c[i] = threadsafe_rand(&seed, 1, 10);
            d[i] = threadsafe_rand(&seed, 1, 50);
            e[i] = threadsafe_rand(&seed, 1, 5);
            f[i] = threadsafe_rand(&seed, 10, 80);
            y[i] = (a[i] * b[i]) + c[i] + (d[i] * e[i]) + (f[i] / 2);
            if (y[i] < min.value) {
                min.value = y[i];
                min.i = i;
            }
            if (y[i] > max.value) {
                max.value = y[i];
                max.i = i;
            }
        }
    }
    // printout
    for (int i = 0; i < ARRAY_SIZE; i++) {
        printf("Y[%d]=%.2f\n", i, y[i]);
    }
    printf("min y[%d] = %.2f\nmax y[%d] = %.2f\n", min.i, min.value, max.i, max.value);
    return 0;
}

How to reduce execution time of FFT in a loop using OpenMP?

I am unable to reduce the execution time of multiple FFTs using OpenMP.
I tried parallelizing the outermost loop, but this degraded the performance:
typedef struct { float r; float i; } cmplx_f32_t;

double src[2*128];
double dst[2*128];
double w[128];
cmplx_f32_t data[128][4][256];

cffti(128, w);
for (k = 0; k < 128; k++)
{
    for (j = 0; j < 4; j++)
    {
        for (i = 0; i < 2*32; i++)
        {
            src[i] = data[i/2][j][k].r;
            src[i+1] = data[i/2][j][k].i;
        }
        cfft2(128, src, dst, w, 1);
    }
}
cffti and cfft2 are as given in the example at https://people.sc.fsu.edu/~jburkardt/c_src/fft_openmp/fft_openmp.html.
If I disable the #pragma omp directives in the fft_openmp.c file, the run time is about 11 ms. If I use the #pragma omp directives, the total execution time is about 220 ms.
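The directive that was tried is not shown; a sketch of what parallelizing the outermost loop presumably looks like is below (assuming k, i and j are declared as plain loop counters and that src and dst are the scratch buffers from the snippet above; they must be private per thread, otherwise the iterations race on them):
/* Presumed attempt: parallelize the outermost k loop.
   src and dst are per-iteration scratch buffers, so each thread needs its own copy. */
#pragma omp parallel for private(i, j, src, dst)
for (k = 0; k < 128; k++)
{
    for (j = 0; j < 4; j++)
    {
        for (i = 0; i < 2*32; i++)
        {
            src[i] = data[i/2][j][k].r;
            src[i+1] = data[i/2][j][k].i;
        }
        cfft2(128, src, dst, w, 1);
    }
}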

Actual difference between 2 ways of equal parallelism using omp threads

I am trying to parallelize my program using OpenMP threads.
What I am doing is the following, and it works perfectly:
#pragma omp parallel num_threads(threadnum) \
    default(none) shared(scoreBoard, nDiag, qlength, dlength) private(nEle, i, si, sj, ai, aj, max)
{
    for (i = 1; i < nDiag; ++i)
    {
        if (i <= qlength && i <= dlength) nEle = i;
        else if (i <= findmax(qlength, dlength)) nEle = findmin(qlength, dlength);
        else nEle = 2*findmin(qlength, dlength) - i + abs(qlength - dlength);
        calcfirstele(%si, %sj);
        #pragma omp for
        for (j = 1; j <= nEle; ++j)
        {
            ai = si - j + 1;
            aj = sj + j - 1
            max = searchmax(ai,aj);
            scoreBoard[ai][aj] = max;
        }
    }
}
But isn't it equivalent to this:
for (i = 1; i < nDiag; ++i)
{
    if (i <= qlength && i <= dlength) nEle = i;
    else if (i <= findmax(qlength, dlength)) nEle = findmin(qlength, dlength);
    else nEle = 2*findmin(qlength, dlength) - i + abs(qlength - dlength);
    calcfirstele(%si, %sj);
    #pragma omp parallel num_threads(threadnum) \
        default(none) shared(scoreBoard) private(nEle, i, si, sj, ai, aj, max)
    #pragma omp for
    for (j = 1; j <= nEle; ++j)
    {
        ai = si - j + 1;
        aj = sj + j - 1
        max = searchmax(ai,aj);
        scoreBoard[ai][aj] = max;
    }
}
Why, when I use the second one, does my program take more time than the serial version, whereas in the first case it runs a lot faster than the serial one? I can't understand the difference between them.
Your second code is wrong and has an undefined behavior.
The reason is that by declaring nEle, si and sj private, you create local (per-thread) versions of these variables without giving them any value. Therefore nEle notably, which is the upper bound of your for loop, can have whatever value happens to be in memory, likely increasing quite dramatically the length of your computation.
In order to fix your code, the snippet you gave should look like this (with a few simplifications, not tested obviously):
for (int i = 1; i < nDiag; ++i) {
    if (i <= qlength && i <= dlength)
        nEle = i;
    else if (i <= findmax(qlength, dlength))
        nEle = findmin(qlength, dlength);
    else
        nEle = 2*findmin(qlength, dlength) - i + abs(qlength - dlength);
    calcfirstele(%si, %sj); // not sure what this is supposed to mean...
    #pragma omp parallel for num_threads(threadnum) private(ai, aj, max)
    for (int j = 1; j <= nEle; ++j) {
        ai = si - j + 1;
        aj = sj + j - 1
        max = searchmax(ai,aj);
        scoreBoard[ai][aj] = max;
    }
}

Parallel inner loop slows down program

I'm trying to optimize a program as an experiment.
When I parallelized the first two outer loops (with "it" and "i") I saw a significant difference in execution time. But when I tried to parallelize the innermost loop, the program became much slower than the sequential one. I also tried using reduction, but the result was the same.
Is this something that I should expect, or did I make a mistake in the parallelization?
When I use the "nowait" clause it runs faster than the other two previous parallelizations.
#pragma omp parallel private(it, i, j) firstprivate(u, sigma, dt, mu)
{
    for (it = 0; it < itime; it++) {
        for (i = 0; i < n; i++) {
            sum = 0.0;
            #pragma omp for schedule(static)
            for (j = 0; j < n; j += 1) {
                sum += sigma[i * n + j] * (u[j] - u[i]);
            }
            #pragma omp atomic write
            uplus[i] = (u[i] + dt * (mu - u[i])) + dt * sum / divide;
            if (u[i] > uth) {
                #pragma omp atomic write
                uplus[i] = 0.0;
                if (it >= ttransient) {
                    #pragma omp atomic
                    omega1[i] += 1.0;
                }
            }
        }
    }
} //omp end
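The reduction variant mentioned above is not shown; it presumably replaced the inner worksharing loop with something along these lines (only a sketch, reusing the variables from the snippet above):
sum = 0.0;
#pragma omp for schedule(static) reduction(+:sum)
for (j = 0; j < n; j += 1) {
    sum += sigma[i * n + j] * (u[j] - u[i]);  /* each thread accumulates into its private copy of sum */
}
/* at the implicit barrier after the loop, the partial sums are combined into the shared sum */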

OpenMP with C program

I am having a hard time using OpenMP with C to parallelize this method. I was wondering if anyone could help and possibly tell me what is wrong with my parallelization.
void blur(float **out, float **in) {
    // assumes "padding" to avoid messy border cases
    int i, j, r, c;
    float tmp, term;
    term = 1.0 / 157.0;
    #pragma omp parallel num_threads(8)
    #pragma omp for private(r,c)
    for (i = 0; i < N-4; i++) {
        for (j = 0; j < N-4; j++) {
            tmp = 0.0;
            for (r = 0; r < 5; r++) {
                for (c = 0; c < 5; c++) {
                    tmp += in[i+r][j+c] * mask[r][c];
                }
            }
            out[i+2][j+2] = term * tmp;
        }
    }
}
You should either declare tmp inside the loop body:
// inside the j loop, instead of tmp = 0.0;
float tmp = 0.0;
or specify tmp as a private variable on the worksharing directive:
#pragma omp for private(r,c,tmp)
Otherwise it is treated as a variable shared among the threads, which causes a data race.
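Putting that together, a corrected version of the function might look like the following sketch. It simply declares tmp and the inner loop counters inside the parallel loop, so they are automatically private (N and mask are assumed to be defined elsewhere, as in the original):
void blur(float **out, float **in) {
    // assumes "padding" to avoid messy border cases
    const float term = 1.0f / 157.0f;
    #pragma omp parallel for num_threads(8)
    for (int i = 0; i < N - 4; i++) {           // i is the worksharing loop variable: private by default
        for (int j = 0; j < N - 4; j++) {
            float tmp = 0.0f;                   // declared inside the loop body: private to each thread
            for (int r = 0; r < 5; r++) {
                for (int c = 0; c < 5; c++) {
                    tmp += in[i + r][j + c] * mask[r][c];
                }
            }
            out[i + 2][j + 2] = term * tmp;
        }
    }
}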
