Why can't the following loop be parallelized?

double A[N] = { ... }, B[N] = { ... };
for (int i = 1; i < N; i++) {
    A[i] = B[i] - A[i-1];
}
Why can't this loop be parallelized using the #pragma omp parallel for construct?

The reason is plain in the code: to calculate A[i] you need A[i-1] first. There are N-1 iterations, and every one depends on the result of the previous one.
In general, a loop for (int i = 1; i < N; i++) is a candidate for parallelism if it would produce exactly the same result with the header changed to for (int i = N-1; i > 0; i--). The test has nothing to do with reverse order as such; it simply reveals whether each iteration can be executed independently of the others.
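By contrast, a loop whose iterations are independent parallelizes without trouble. A minimal sketch (the body is a variant of the loop above that reads only B, so no iteration depends on another):
double A[N], B[N] = { ... };
#pragma omp parallel for
for (int i = 1; i < N; i++) {
    A[i] = B[i] - B[i-1];   /* each iteration reads only B: no loop-carried dependence */
}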

Related

Parallel sections code with nested loops in openmp

I made this parallel code to split the iterations between the two sections in pairs: first and last, first+1 and last-1, and so on. But I don't know how to improve the code in each of the two parallel sections, because each one has an inner loop and I can't think of any way to simplify it. Thanks.
This isn't about which values are stored in x or y. I use this sections design because the requirement is to execute the iterations from 0 to N in pairs: 0 and N, 1 and N-1, 2 and N-2, and so on. What I would like to know is whether I can optimize the inner loops while maintaining this model.
int x = 0, y = 0, k, i, j, h;
#pragma omp parallel private(i, h) reduction(+:x, y)
{
    #pragma omp sections
    {
        #pragma omp section
        {
            for (i = 0; i < N/2; i++)
            {
                C[i] = 0;
                for (j = 0; j < N; j++)
                {
                    C[i] += MAT[i][j] * B[j];
                }
                x += C[i];
            }
        }
        #pragma omp section
        {
            for (h = N-1; h >= N/2; h--)
            {
                C[h] = 0;
                for (k = 0; k < N; k++)
                {
                    C[h] += MAT[h][k] * B[k];
                }
                y += C[h];
            }
        }
    }
}
x = x + y;
Using sections seems like the wrong approach; a #pragma omp for is more appropriate. Also note that you forgot to declare j (and k) private.
int x = 0, y = 0, i, j;
#pragma omp parallel private(i, j) reduction(+:x, y)
{
    #pragma omp for nowait
    for (i = 0; i < N/2; i++) {
        // local variable to make the life easier on the compiler
        int ci = 0;
        for (j = 0; j < N; j++)
            ci += MAT[i][j] * B[j];
        x += ci;
        C[i] = ci;
    }
    #pragma omp for nowait
    for (i = N/2; i < N; i++) {
        int ci = 0;
        for (j = 0; j < N; j++)
            ci += MAT[i][j] * B[j];
        y += ci;
        C[i] = ci;
    }
}
x = x + y;
Also, I'm not sure but if you just want x as your final output, you can simplify the code even further:
int x = 0, i, j;
#pragma omp parallel for reduction(+:x) private(i, j)
for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
        x += MAT[i][j] * B[j];
The sections construct distributes different tasks to different threads, and each section block marks a different task, so you will not be able to execute the iterations in the order you want. I answered you here:
Distribution of loop iterations between threads with a specific order
But I want to clarify that the requirement for using sections is that each block must be independent of the other blocks.
A section gets only one thread, so you can't make the loops parallel. How about making a parallel loop over all N iterations at the top level, and then inside each iteration using a conditional to decide whether to accumulate into x or y?
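A minimal sketch of that idea, assuming the same N, MAT, B and C as in the question (untested):
int x = 0, y = 0;
#pragma omp parallel for reduction(+:x, y)
for (int i = 0; i < N; i++) {
    int ci = 0;
    for (int j = 0; j < N; j++)
        ci += MAT[i][j] * B[j];
    C[i] = ci;
    if (i < N/2)    /* first half accumulates into x */
        x += ci;
    else            /* second half into y */
        y += ci;
}
x = x + y;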
Although @Homer512's solution looks correct to me too.

Nested loop when inner loop index unequal outer loop index

I need nested loop logic that skips the inner loop iteration whose index equals the outer loop index.
I used an if statement within the inner loop, to the effect of:
for (i=0;i<N;i++)
    for (j=0;j<N;j++)
        if (j!=i)
            ... some code
I believe this gives me the expected results, but is there a less CPU-consuming method that I may not be aware of?
If you can assume that i <= N (which the outer loop guarantees here), you can split the inner loop into 2 separate for loops to reduce the number of tests:
for (i = 0; i < N; i++) {
    for (j = 0; j < i; j++) {
        ... some code
    }
    /* here we have j == i, skip this one */
    j++;
    for (; j < N; j++) {
        ... same code
    }
}
This results in more code but half as many tests on j. Note however that if N is a constant, the compiler might unroll the original inner loop more efficiently. Careful benchmarking is the only way to determine if this solution is worth the effort for your problem, compiler and architecture.
For completeness, this code can be simplified as:
for (i = 0; i < N; i++) {
    for (j = 0; j < i; j++) {
        ... some code
    }
    /* here we have j == i, skip this one */
    while (++j < N) {
        ... same code
    }
}
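If you do benchmark the variants, a minimal harness along these lines works on POSIX systems (a sketch: the data array and the increments are placeholders standing in for "... some code"):
#include <stdio.h>
#include <time.h>

#define N 2000

static int data[N][N];   /* placeholder work target */

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < i; j++)
            data[i][j]++;               /* stands in for "... some code" */
        for (int j = i + 1; j < N; j++)
            data[i][j]++;               /* stands in for "... same code" */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%.6f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}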

Computing entries of a matrix in OpenMP

I am very new to OpenMP, but I am trying to write a simple program that generates the entries of a matrix in parallel; namely, for the N by M matrix A, let A(i,j) = i*j. A minimal example is included below:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int i, j, N, M;
    N = 20;
    M = 20;
    int* A;
    A = (int*) calloc(N*M, sizeof(int));

    // compute entries of A in parallel
    #pragma omp parallel for shared(A)
    for (i = 0; i < N; ++i){
        for (j = 0; j < M; ++j){
            A[i*M + j] = i*j;
        }
    }

    // print parallel results
    for (i = 0; i < N; ++i){
        for (j = 0; j < M; ++j){
            printf("%d ", A[i*M + j]);
        }
        printf("\n");
    }
    free(A);
    return 0;
}
The results are not always correct. In theory, I am only parallelizing the outer loop, and each iteration of the for loop does not modify the entries that the other iterations will modify. But I am not sure how to translate this to OpenMP. When doing a similar procedure for a vector array (i.e. just one for loop), there seems to be no issue, e.g.
#pragma omp parallel for
for (i = 0; i < N; ++i)
{
    v[i] = i*i;
}
Can someone explain to me how to fix this?
The issue in this case is that j is shared between the threads, which messes with the control flow of the inner loop. By default, variables declared outside of a parallel region are shared, whereas variables declared inside of a parallel region are private.
Follow the general rule to declare variables as locally as possible. In the for loop this means:
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
    for (int j = 0; j < M; ++j) {
        A[i*M + j] = i*j;
    }
}
This makes reasoning about your code much easier - and OpenMP code mostly correct by default. (Note A is shared by default because it is defined outside).
Alternatively, you can manually specify private(i,j) shared(A) - this is more explicit and can help beginners. However, it creates redundancy and can also be dangerous: private variables are uninitialized even if they had a valid value outside of the parallel region. Therefore I strongly recommend the implicit default approach unless explicit clauses are necessary for advanced usage.
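For reference, a sketch of that explicit variant applied to the loop from the question (equivalent in effect to the implicit version above; note that i, as the loop variable of the parallel for, would be private anyway, which is the redundancy mentioned):
int i, j;
#pragma omp parallel for private(i, j) shared(A)
for (i = 0; i < N; ++i) {
    for (j = 0; j < M; ++j) {
        A[i*M + j] = i*j;
    }
}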
According to, e.g., this tutorial: http://supercomputingblog.com/openmp/tutorial-parallel-for-loops-with-openmp/ declaring variables outside of the parallelized part is dangerous. This can be defused by explicitly making the loop variable of the inner loop private.
For that, change this
#pragma omp parallel for shared(A)
to
#pragma omp parallel for private(j) shared(A)

nested loops, inner loop parallelization, reusing threads

Disclaimer: the following is just a dummy example for understanding the problem quickly. For a real-world problem, think of anything in dynamic programming.
The problem:
We have an n*m matrix, and we want to copy elements from previous row as in the following code:
for (i = 1; i < n; i++)
    for (j = 0; j < m; j++)
        x[i][j] = x[i-1][j];
Approach:
The outer loop iterations have to be executed in order, so they run sequentially.
The inner loop can be parallelized. We want to minimize the overhead of creating and destroying threads, so the team of threads should be created only once; however, this seems like an impossible task in OpenMP.
#pragma omp parallel private(j)
{
    for (i = 1; i < n; i++)
    {
        #pragma omp for schedule(dynamic)
        for (j = 0; j < m; j++)
            x[i][j] = x[i-1][j];
    }
}
When we apply the ordered option to the outer loop, the code executes sequentially, so there is no performance gain.
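For illustration, the ordered variant would look roughly like this (a sketch; the ordered region forces the loop bodies to run one iteration at a time, which is exactly why there is no gain):
#pragma omp parallel for ordered private(j)
for (i = 1; i < n; i++)
{
    #pragma omp ordered
    for (j = 0; j < m; j++)
        x[i][j] = x[i-1][j];
}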
I am looking for a solution to the scenario above, even if I have to use some workaround.
I am adding my actual code. It is actually slower than the sequential version. Please review:
/* load input */
for (i = 1; i <= n; i++)
    scanf("%d %d", &in[i][W], &in[i][V]);

/* init */
for (i = 0; i <= wc; i++)
    a[0][i] = 0;

/* compute */
#pragma omp parallel private(i, w)
{
    for (i = 1; i <= n; ++i) // 1 000 000
    {
        j = i % 2;            /* note: j and jn are shared; every thread writes the same values */
        jn = j == 1 ? 0 : 1;

        #pragma omp for
        for (w = 0; w <= in[i][W]; w++) // 1000
            a[j][w] = a[jn][w];

        #pragma omp for
        for (w = in[i][W]+1; w <= wc; w++) // 350 000
            a[j][w] = max(a[jn][w], in[i][V] + a[jn][w-in[i][W]]);
    }
}
As for measuring, I am using something like this:
double t;
t = omp_get_wtime();
// ...
t = omp_get_wtime() - t;
To sum up the parallelization in OpenMP for this particular case: it is not worth it.
Why?
The operations in the inner loops are simple. The code was compiled with -O3, so the max() call was probably replaced by the function body.
The overhead of the implicit barriers is probably high enough to cancel out the performance gain, and the overall overhead is high enough to make the parallel code even slower than the sequential code.
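One caveat on the barrier point, as an untested assumption of mine rather than something I measured: the two inner loops write disjoint ranges of row j and only read row jn, so the barrier between them could in principle be dropped with nowait, keeping only the one at the end of the second loop:
#pragma omp for nowait   /* the two loops write disjoint parts of row j */
for (w = 0; w <= in[i][W]; w++)
    a[j][w] = a[jn][w];

#pragma omp for          /* the implicit barrier here is still needed before the next i */
for (w = in[i][W]+1; w <= wc; w++)
    a[j][w] = max(a[jn][w], in[i][V] + a[jn][w-in[i][W]]);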
I also found out that there is no real performance gain from a construct like this:
#pragma omp parallel private(i, j)
{
    for (i = 1; i < n; i++)
    {
        #pragma omp for
        for (j = 0; j < m; j++)
            x[i][j] = x[i-1][j];
    }
}
because its performance is similar to that of this one:
for (i = 1; i < n; i++)
{
    #pragma omp parallel for private(j)
    for (j = 0; j < m; j++)
        x[i][j] = x[i-1][j];
}
thanks to built-in thread reuse in GCC's libgomp, according to this article: http://bisqwit.iki.fi/story/howto/openmp/
Since the outer loop cannot be parallelized (without the ordered option), it looks like there is no way to significantly improve the performance of the program in question using OpenMP. If someone feels I did something wrong and that it is possible, I'll be glad to see and test the solution.

Loop interchange versus Loop tiling

Which of these optimizations is better and in what situation? Why?
Intuitively, I get the feeling that loop tiling will in general be the better optimization. What about the example below?
Assume a cache which can only store about 20 elements in it at any time.
Original Loop:
for(int i = 0; i < 10; i++)
{
    for(int j = 0; j < 1000; j++)
    {
        a[i] += a[i]*b[j];
    }
}
Loop Interchange:
for(int i = 0; i < 1000; i++)
{
    for(int j = 0; j < 10; j++)
    {
        a[j] += a[j]*b[i];
    }
}
Loop Tiling:
for(int k = 0; k < 1000; k += 20)
{
    for(int i = 0; i < 10; i++)
    {
        for(int j = k; j < min(1000, k+20); j++)
        {
            a[i] += a[i]*b[j];
        }
    }
}
The first two loops you show in your question behave about the same. Things would really change in the following two cases:
CASE 1:
for(int i = 0; i < 10; i++)
{
    for(int j = 0; j < 1000; j++)
    {
        b[i] += a[j][i];
    }
}
Here you are accessing the matrix a column by column: a[0][0], a[1][0], a[2][0], ... In C, however, a matrix is stored in memory row by row: a[0][0], a[0][1], a[0][2], ... (the entire first row, followed by the entire second row, and so on). Imagine your cache could only store 5 elements and your matrix were 6x6. The first access, a[0][0], causes a cache miss, and a[0][0] through a[0][4] are brought into the cache. The second access, a[1][0], is not in the cache: another miss, which loads a[1][0] through a[1][4]. The third access, a[2][0], misses yet again. As you can conclude, case 1 is not a good solution for the problem. It causes lots of cache misses that we can avoid by changing the code to the following:
CASE 2:
for(int i = 0; i < 10; i++)
{
    for(int j = 0; j < 1000; j++)
    {
        b[i] += a[i][j];
    }
}
Here, as you can see, we access the matrix in the same order it is stored in memory. Consequently it's much better (faster) than case 1.
About the third piece of code you posted: loop tiling, and also loop unrolling, are optimizations that in most cases the compiler performs automatically. Here's a very interesting post on Stack Overflow explaining these two techniques.
Hope it helps! (sorry about my english, I'm not a native speaker)
