For each of the following code segments, use OpenMP pragmas to make the loop parallel, or
explain why the code segment is not suitable for parallel execution.
a. for (i = 0; i < sqrt(x); i++) {
       a[i] = 2.3 * i;
       if (i < 10)
           b[i] = a[i];
   }

b. flag = 0;
   for (i = 0; i < n && !flag; i++) {
       a[i] = 2.3 * i;
       if (a[i] < b[i])
           flag = 1;
   }

c. for (i = 0; i < n && !flag; i++)
       a[i] = foo(i);

d. for (i = 0; i < n && !flag; i++) {
       a[i] = foo(i);
       if (a[i] < b[i])
           a[i] = b[i];
   }

e. for (i = 0; i < n && !flag; i++) {
       a[i] = foo(i);
       if (a[i] < b[i])
           break;
   }

f. dotp = 0;
   for (i = 0; i < n; i++)
       dotp += a[i] * b[i];

g. for (i = k; i < 2 * k; i++)
       a[i] = a[i] + a[i - k];

h. for (i = k; i < n; i++) {
       a[i] = c * a[i - k];
   }
Any help regarding the above question would be very welcome, even just a line of thinking.
I will not do your HW, but I will give a hint. When playing around with OpenMP for loops, you should pay attention to the scope of the variables. For example:
#pragma omp parallel for
for (int x = 0; x < width; x++)
{
    for (int y = 0; y < height; y++)
    {
        finalImage[x][y] = RenderPixel(x, y, &sceneData);
    }
}
is OK, since x and y are declared inside the parallel region, so each thread gets its own private copy.
What about
int x, y;
#pragma omp parallel for
for (x = 0; x < width; x++)
{
    for (y = 0; y < height; y++)
    {
        finalImage[x][y] = RenderPixel(x, y, &sceneData);
    }
}
?
Here, we have defined x and y outside of the for loop. x, as the loop variable of the parallel for, is still made private automatically, but now consider y: every thread reads and writes it without any synchronization, so data races occur, and they are very likely to produce wrong results.
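One minimal fix (a sketch of the standard approach, not the only one) is to make y private to each thread with the private clause, or equivalently to move its declaration back into the loop:

int x, y;
#pragma omp parallel for private(y)
for (x = 0; x < width; x++)
{
    for (y = 0; y < height; y++)   /* each thread now has its own y */
    {
        finalImage[x][y] = RenderPixel(x, y, &sceneData);
    }
}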
Read more here and good luck with your HW.
A is a 2D array, n is the matrix size (the matrix is square), and threads is the number of threads the user inputs.
#pragma omp parallel for shared(A,n,k) private(i) schedule(static) num_threads(threads)
for (k = 0; k < n - 1; ++k) {
    // for the vectoriser
    for (i = k + 1; i < n; i++) {
        A[i][k] /= A[k][k];
    }
    for (i = k + 1; i < n; i++) {
        long long int j;
        const double Aik = A[i][k];
        for (j = k + 1; j < n; j++) {
            A[i][j] -= Aik * A[k][j];
        }
    }
}
I tried using collapse, but it failed; the error it showed was: work-sharing region may not be closely nested inside of work-sharing, ‘loop’, ‘critical’, ‘ordered’, ‘master’, explicit ‘task’ or ‘task loop’ region.
After what I thought was a correct parallelization, the time increased as I executed the code with more threads.
I tried using collapse again; this is the output:
26:17: error: collapsed loops not perfectly nested before ‘for’
26 | for(i = k + 1; i < n; i++) {
This is LU decomposition.
I'm trying to write a program that solves the system of equations Ax = B using the Gauss-Jacobi iteration method.
#include <math.h>
#include <stdlib.h>
#include <stdio.h>

int main(void) {
    double **a, *b, *x, *f, eps = 1.e-2, c;
    int n = 3, m = 3, i, j, bool = 1, d = 3;
    /* printf("n=") ; scanf("%d", &n);
       printf("m=") ; scanf("%d", &n) */
    a = malloc(n * sizeof *a);
    for (i = 0; i < n; i++)
        a[i] = (double*)malloc(m * sizeof(double));
    b = malloc(m * sizeof *b);
    x = malloc(m * sizeof *x);
    f = malloc(m * sizeof *f);
    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++) {
            printf("a[%d][%d]=", i, j);
            scanf("%le", &a[i][j]);
            if (fabs(a[i][i]) < 1.e-10) return 0;
        }
        printf("\n");
    }
    printf("\n");
    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++) {
            printf("a[%d][%d]=%le ", i, j, a[i][j]);
        }
        printf("\n");
    }
    for (j = 0; j < m; j++) {
        printf("x[%d]=", j);
        scanf("%le", &x[j]);
    } // initial guess
    printf("\n");
    for (j = 0; j < m; j++) {
        printf("b[%d]=", j);
        scanf("%le", &b[j]);
    }
    printf("\n");
    while (1) {
        bool = 0;
        for (i = 0; i < n; i++) {
            c = 0.0;
            for (j = 0; j < m; j++)
                if (j != i)
                    c += a[i][j] * x[j];
            f[i] = (b[i] - c) / a[i][i];
        }
        for (i = 0; i < m; i++)
            if (fabs(f[i] - x[i]) > eps)
                bool = 1;
        if (bool == 1)
            for (i = 0; i < m; i++)
                x[i] = f[i];
        else if (bool == 0)
            break;
    }
    for (j = 0; j < m; j++)
        printf("%le\n", f[j]);
    return 0;
}
The condition for stopping the loop is that, for every x, the previous approximation minus the current approximation is less than epsilon.
It seems like I did everything according to the algorithm, but the program doesn't work.
Where did I make a mistake?
While not the strictest condition, the usual condition required to guarantee convergence of the Jacobi and Gauss-Seidel methods is diagonal dominance,
abs(a[i][i]) > sum( abs(a[i][j]), j=0...n-1, j!=i)
This test is also easy to implement as a check to run before the iteration.
The larger the relative gap in all these inequalities, the faster the convergence of the method.
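For example, here is a minimal sketch of that check, using the same double ** matrix layout as the question:

#include <math.h>

/* Returns 1 if the n-by-n matrix a is strictly diagonally dominant,
   0 otherwise; run this once before starting the Jacobi iteration. */
int diagonally_dominant(double **a, int n)
{
    for (int i = 0; i < n; i++) {
        double off = 0.0;
        for (int j = 0; j < n; j++)
            if (j != i)
                off += fabs(a[i][j]);
        if (fabs(a[i][i]) <= off)
            return 0;   /* row i violates the inequality */
    }
    return 1;
}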
I have implemented the backward (BW) and forward (FW) substitution algorithms to solve the L and U triangular systems.
The algorithm I implemented runs very fast serially, but I cannot figure out whether this is the best way to parallelize it.
I think I have taken into account every possible data race (on alpha); am I right?
void solveInverse(double **U, double **L, double **P, int rw, int cw) {
    double **inverseA = allocateMatrix(rw, cw);
    double *x = allocateArray(rw);
    double *y = allocateArray(rw);
    double alpha;
    //int i, j, t;

    // Iterate along the columns, so at each iteration we generate a column of the inverse matrix
    for (int j = 0; j < rw; j++) {
        // Lower triangular solve Ly=P
        y[0] = P[0][j];
        #pragma omp parallel for reduction(+:alpha)
        for (int i = 1; i < rw; i++) {
            alpha = 0;
            for (int t = 0; t <= i-1; t++)
                alpha += L[i][t] * y[t];
            y[i] = P[i][j] - alpha;
        }

        // Upper triangular solve Ux=P
        x[rw-1] = y[rw-1] / U[rw-1][rw-1];
        #pragma omp parallel for reduction(+:alpha)
        for (int i = rw-2; (i < rw) && (i >= 0); i--) {
            alpha = 0;
            for (int t = i+1; t < rw; t++)
                alpha += U[i][t] * x[t];
            x[i] = (y[i] - alpha) / U[i][i];
        }

        for (int i = 0; i < rw; i++)
            inverseA[i][j] = x[i];
    }

    freeMemory(inverseA, rw);
    free(x);
    free(y);
}
After a private discussion with the user dreamcrash, we arrived at the solution proposed in his comments: creating a pair of vectors x and y for each thread, so that each thread works independently on a single column.
After a discussion with the OP in the comments (which were removed afterwards), we both came to the conclusion that:
You do not need to reduce the alpha variable, because alpha is reinitialized to zero at each iteration rather than accumulated across iterations. Instead, make the alpha variable private.
#pragma omp parallel for
for (int i = 1; i < rw; i++) {
    double alpha = 0;
    for (int t = 0; t <= i-1; t++)
        alpha += L[i][t] * y[t];
    y[i] = P[i][j] - alpha;
}
and the same applies to the second parallel region as well.
#pragma omp parallel for
for (int i = rw-2; (i < rw) && (i >= 0); i--) {
    double alpha = 0;
    for (int t = i+1; t < rw; t++)
        alpha += U[i][t] * x[t];
    x[i] = (y[i] - alpha) / U[i][i];
}
Instead of having one parallel region per j iteration, you can extract the parallel region to encapsulate the entire outermost loop and use #pragma omp for instead of #pragma omp parallel for. Although this approach reduces the number of parallel regions created from rw to just one, the speedup achieved by this optimization should not be very significant, because an efficient OpenMP implementation will use a pool of threads: the threads are initialized on the first parallel region and reused on subsequent ones, which already saves most of the overhead of creating and destroying threads.
#pragma omp parallel
{
    for (int j = 0; j < rw; j++)
    {
        y[0] = P[0][j];
        #pragma omp for
        for (int i = 1; i < rw; i++) {
            double alpha = 0;
            for (int t = 0; t <= i-1; t++)
                alpha += L[i][t] * y[t];
            y[i] = P[i][j] - alpha;
        }
        x[rw-1] = y[rw-1] / U[rw-1][rw-1];
        #pragma omp for
        for (int i = rw-2; (i < rw) && (i >= 0); i--) {
            double alpha = 0;
            for (int t = i+1; t < rw; t++)
                alpha += U[i][t] * x[t];
            x[i] = (y[i] - alpha) / U[i][i];
        }
        #pragma omp for
        for (int i = 0; i < rw; i++)
            inverseA[i][j] = x[i];
    }
}
I have shown you these code transformations so that you can see some tricks that may be useful in future parallelizations. Unfortunately, as it stands, that parallelization will not work.
Why?
Let us look at the first loop:
#pragma omp parallel for
for (int i = 1; i < rw; i++) {
    double alpha = 0;
    for (int t = 0; t <= i-1; t++)
        alpha += L[i][t] * y[t];
    y[i] = P[i][j] - alpha;
}
there is a loop-carried dependency between y[t] being read in alpha += L[i][t] * y[t]; and y[i] being written in y[i] = P[i][j] - alpha;.
So what you can do instead is parallelize the outermost loop (i.e., assign each column to a thread) and create separate x and y arrays for each thread, so that there are no race conditions during the updates/reads of those arrays.
#pragma omp parallel
{
    double *x = allocateArray(rw);
    double *y = allocateArray(rw);

    #pragma omp for
    for (int j = 0; j < rw; j++)
    {
        y[0] = P[0][j];
        for (int i = 1; i < rw; i++) {
            double alpha = 0;
            for (int t = 0; t <= i-1; t++)
                alpha += L[i][t] * y[t];
            y[i] = P[i][j] - alpha;
        }
        x[rw-1] = y[rw-1] / U[rw-1][rw-1];
        for (int i = rw-2; i >= 0; i--) {
            double alpha = 0;
            for (int t = i+1; t < rw; t++)
                alpha += U[i][t] * x[t];
            x[i] = (y[i] - alpha) / U[i][i];
        }
        for (int i = 0; i < rw; i++)
            inverseA[i][j] = x[i];
    }
    free(x);
    free(y);
}
I've searched for hours and spent many more trying to figure out how to fix this problem. I need to find the inverse of a predefined matrix using
A^-1 = I + (B + B^2 + ... + B^20), where B = I - A.
void invA(double a[][3], double id[][3], double z[][3])
{
    int i, j, n, k;
    double pb[3][3] = {1.,0.,0.,0.,1.,0.,0.,0.,1.};
    double temp[3][3] = {1.,0.,0.,0.,1.,0.,0.,0.,1.};
    double b[3][3];
    temp[i][j] = 0;
    b[i][j] = 0;
    for (i = 0; i < 3; i++)
        for (j = 0; j < 3; j++)
            b[i][j] = id[i][j] - a[i][j];
    for (n = 0; n < 20; n++)            // run the loop 20 times
    {
        for (i = 0; i < 3; i++)         // accumulate pb * b into temp
            for (j = 0; j < 3; j++)
                for (k = 0; k < 3; k++)
                    temp[i][j] += pb[i][k] * b[k][j];
        for (i = 0; i < 3; i++)         // copy temp into pb
            for (j = 0; j < 3; j++)
                pb[i][j] = temp[i][j];
        for (i = 0; i < 3; i++)         // sum the powers of b
            for (j = 0; j < 3; j++)     // to build the inverse
                z[i][j] = z[i][j] + pb[i][j];
    }
}
Matrix a is the defined matrix, id is the identity and z is the inverse (result). I can't seem to figure out where I've gone wrong.
You have a few problems.
First, temp[i][j] = 0; and b[i][j] = 0; at the beginning of the function use the uninitialized variables i and j. The behaviour is undefined, and who knows how temp actually ends up initialized.
Then, temp must be reinitialized to a zero matrix at each iteration. I don't know what exactly your code computes, but it is certainly not a power.
Finally, unless z is initialized to the identity, you are missing the initial term of the series.
All that said, I highly recommend factoring out most of the loops into functions: matAdd() and matMult(). Once they are unit tested, the rest is much simpler.
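For illustration, here is a minimal sketch of those two helpers (hypothetical names taken from the suggestion above; note that out must not alias an input of matMult):

/* out = p * q for 3x3 matrices; out must be a distinct array */
void matMult(double out[3][3], double p[3][3], double q[3][3])
{
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            out[i][j] = 0.0;   /* zeroed on every call */
            for (int k = 0; k < 3; k++)
                out[i][j] += p[i][k] * q[k][j];
        }
}

/* out = p + q for 3x3 matrices; aliasing is harmless here */
void matAdd(double out[3][3], double p[3][3], double q[3][3])
{
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            out[i][j] = p[i][j] + q[i][j];
}

With these, each iteration of the series becomes matMult(temp, pb, b), copy temp into pb, then matAdd(z, z, pb), and the reinitialization bug disappears because matMult zeroes its output before accumulating.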
I have a number-crunching C program which involves a main loop with two conditionals:
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            if (k == i || k == j) continue;
            ... (calculate a, b, c, d, depending on k)
            if (a*a + b*b + c*c < d*d) { break; }
        } //k
    } //j
} //i
The hardware here is the SPE of the Cell processor, where there is a big penalty for branching. So, in order to optimize my program for speed, I need to remove these two conditionals. Do you know of good strategies for this?
For the first one, you could break it into multiple loops, e.g. change:
for (int i = 0; i < 1000; i++)
    for (int j = 0; j < 1000; j++) {
        for (int k = 0; k < 1000; k++) {
            if (k == i || k == j) continue;
            // other code
        }
    }

to:

for (int i = 0; i < 1000; i++)
    for (int j = 0; j < 1000; j++) {
        for (int k = 0; k < min(i, j); k++) {
            // other code
        }
        for (int k = min(i, j) + 1; k < max(i, j); k++) {
            // other code
        }
        for (int k = max(i, j) + 1; k < 1000; k++) {
            // other code
        }
    }
To remove the second, you could store the previous total and use it in the for loop conditions, i.e.:
int left_side = 1, right_side = 0;
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
        left_side = 1; right_side = 0;   // reset for each (i, j) pair, since the break applied only to the k loop
        for (int k = 0; k < min(i, j) && left_side >= right_side; k++) {
            // other code (calculate a, b, c, d)
            left_side = a * a + b * b + c * c;
            right_side = d * d;
        }
        for (int k = min(i, j) + 1; k < max(i, j) && left_side >= right_side; k++) {
            // same as in previous loop
        }
        for (int k = max(i, j) + 1; k < N && left_side >= right_side; k++) {
            // same as in previous loop
        }
    }
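As an aside, if you do want branch-free min and max for those loop bounds, there is a classic two's-complement bit trick (a sketch; whether the comparison itself compiles without a jump depends on the target and compiler):

/* -(a < b) is all-one bits when a < b and zero otherwise,
   so the XOR mask selects between a and b without a jump. */
static inline int bmin(int a, int b) { return b ^ ((a ^ b) & -(a < b)); }
static inline int bmax(int a, int b) { return a ^ ((a ^ b) & -(a < b)); }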
Still, implementing min and max without branching can be tricky, so maybe this version, which avoids them entirely, is better:
int i, j, k,
    left_side = 1, right_side = 0;
for (i = 0; i < N; i++) {
    // this loop covers the case where j < i
    for (j = 0; j < i; j++) {
        left_side = 1; right_side = 0;   // reset for each (i, j) pair
        k = 0;
        for (; k < j && left_side >= right_side; k++) {
            // other code (calculate a, b, c, d)
            left_side = a * a + b * b + c * c;
            right_side = d * d;
        }
        k++; // skip k == j
        for (; k < i && left_side >= right_side; k++) {
            // same as in previous loop
        }
        k++; // skip k == i
        for (; k < N && left_side >= right_side; k++) {
            // same as in previous loop
        }
    }
    j++; // skip j == i
    // and now, j > i
    for (; j < N; j++) {
        left_side = 1; right_side = 0;   // reset for each (i, j) pair
        k = 0;
        for (; k < i && left_side >= right_side; k++) {
            // other code (calculate a, b, c, d)
            left_side = a * a + b * b + c * c;
            right_side = d * d;
        }
        k++; // skip k == i
        for (; k < j && left_side >= right_side; k++) {
            // same as in previous loop
        }
        k++; // skip k == j
        for (; k < N && left_side >= right_side; k++) {
            // same as in previous loop
        }
    }
}
I agree with 'sje397'.
Besides this, you provide too little information about your problem. You say branching is costly, but how often is the branch actually taken? Maybe your problem is that the compiler-generated code branches in the common scenario.
Perhaps you could rearrange your ifs. The implementation of an if is compiler-dependent, but many compilers treat it in a straightforward way, that is: if - common - else - rare (jump).
Then try the following:
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            if (k != i && k != j)
            {
                ... (calculate a, b, c, d)
                if (a*a + b*b + c*c >= d*d)
                {
                    ...
                } else
                    break;
            }
        } //k
    } //j
} //i
EDIT:
Of course, you can drop down to the assembly level to make sure the right code is generated.
I would look first at your calculate code, because that could swamp all these branching issues. Some sampling would find out for sure.
However, it looks like you're doing, for each i,j, a linear search for the first point inside a sphere. Could you have 3 arrays, one for each of the X, Y, and Z axes, and in each array store indexes of all the original points in ascending order by that axis? That could facilitate a nearest-neighbor search. Also, you might be able to use an in-cube test, rather than an in-sphere test, since you're not hunting for the closest point, but only a nearby point.
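For instance, here is a rough sketch of building those per-axis index arrays (assuming, hypothetically, that the points are stored as separate px[], py[], pz[] coordinate arrays; the file-scope key pointer keeps the qsort comparator simple, at the cost of reentrancy):

#include <stdlib.h>

static const double *sort_key;   /* coordinates for the current sort */

static int cmp_by_key(const void *pa, const void *pb)
{
    int ia = *(const int *)pa, ib = *(const int *)pb;
    if (sort_key[ia] < sort_key[ib]) return -1;
    if (sort_key[ia] > sort_key[ib]) return 1;
    return 0;
}

/* Fill idx[0..n-1] with point indexes sorted by one coordinate axis. */
static void sorted_index(const double *coord, int *idx, int n)
{
    for (int i = 0; i < n; i++)
        idx[i] = i;
    sort_key = coord;
    qsort(idx, n, sizeof idx[0], cmp_by_key);
}

/* Usage: sorted_index(px, idx_x, N); and likewise for py and pz. */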
Are you sure you actually need the first if statement? Even if it skips one calculation when k equals i or j, the penalty of checking it on every iteration is very costly. Also, keep in mind that if N is not a compile-time constant, the compiler probably won't be able to unroll the for loops.
Although, if it's a Cell processor, the compiler might even try to vectorize the loops.
If the for loops compile to normal iterative loops, it could be an idea to make them compare against zero instead, as the decrement operation will often perform the comparison for you when it hits zero.
for (i = 0; i < N; i++) {
...can become...
for (i = N; i != 0; i--) {
Although, if "i" is used as an index or a variable in a calculation, you might get performance degradation as you will get cache misses.