Parallelizing nested loop in OpenMP using #pragma parallel for shared - c

I'm trying to parallelize a code. My code looks like this -
#pragma omp parallel private(i,j,k)
#pragma omp parallel for shared(A)
for(k=0;k<100;k++)
for(i=1;i<1024;i++)
for(j=0;j<1024;j++)
A[i][j+1]=<< some expression involving elements of A[i-1][j-1] >>
On executing this code I'm getting a different result from serial execution of the loops.
I'm unable to understand what I'm doing wrong.
I've also tried the collapse()
#pragma omp parallel private(i,j,k)
#pragma omp parallel for collapse(3) shared(A)
for(k=0;k<100;k++)
for(i=1;i<1024;i++)
for(j=0;j<1024;j++)
A[i][j+1]=<< some expression involving elements of A[][] >>
Another thing I tried was having a #pragma omp parallel for before each loop instead of collapse().
The issue, as I think, is the data dependency. Any idea how to parallelize in case of data dependency?

If this is really your use case, just parallelize the outer loop, k. This should largely suffice for the modest parallelism that you have on common architectures.
If you want more, you'd have to re-write your loops such that you have an inner part that doesn't have the dependency. In your example case this is relatively easy, you'd have to process by "diagonals" (outer loop, sequential) and then inside the diagonals you'd be independent.
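for (size_t d=0; d<nDiag(100); ++d) {
    size_t nPoints = somefunction(d);
    /* "parallel for" splits the iterations among threads; a bare "parallel" would make every thread run the whole loop */
    #pragma omp parallel for
    for (size_t p=0; p<nPoints; ++p) {
        size_t x = coX(p, d);
        size_t y = coY(p, d);
        ... your real code ...
    }
}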
Part of this could be done automatically, but I don't think such tools are readily available in everyday OpenMP implementations yet. This is an active line of research.
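To make the diagonal idea concrete, here is a minimal sketch. It assumes a stand-in recurrence in which each cell depends on its upper and left neighbours (the real expression is elided in the question), and the names N, sweep_serial and sweep_wavefront are made up for illustration. Cells with the same i + j do not depend on each other, so each anti-diagonal can be processed in parallel while the diagonals themselves stay sequential.

```c
#include <assert.h>
#include <stddef.h>

#define N 64  /* hypothetical grid size for this sketch */

/* Serial reference: each cell depends on its upper and left neighbours
   (an assumed stand-in for the elided expression). */
static void sweep_serial(double A[N][N]) {
    assert(A != NULL);
    for (size_t i = 1; i < N; ++i)
        for (size_t j = 1; j < N; ++j)
            A[i][j] = 0.5 * (A[i - 1][j] + A[i][j - 1]);
}

/* Wavefront version: anti-diagonals d = i + j are visited in order,
   and the cells of one anti-diagonal are updated in parallel. */
static void sweep_wavefront(double A[N][N]) {
    assert(A != NULL);
    for (size_t d = 2; d <= 2 * (N - 1); ++d) {
        /* Rows on this diagonal satisfy 1 <= i <= N-1 and 1 <= d-i <= N-1. */
        size_t lo = (d > N - 1) ? d - (N - 1) : 1;
        size_t hi = (d - 1 < N - 1) ? d - 1 : N - 1;
        #pragma omp parallel for
        for (size_t i = lo; i <= hi; ++i) {
            size_t j = d - i;
            A[i][j] = 0.5 * (A[i - 1][j] + A[i][j - 1]);
        }
    }
}
```

Short diagonals near the corners give the threads little to do, so whether this pays off depends on the grid size and on the cost of the per-cell expression.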
Also note the following
int is rarely a good idea for indices, in particular if you access matrices. If you have to compute the absolute position of an entry yourself (and here you might), it overflows easily. int is usually 32 bits wide, and one of those 32 bits is spent on the sign. In C, object sizes are expressed in size_t, which is 64 bits wide on most platforms and in any case the type your platform designer chose for this purpose.
Use local variables for loop indices and other temporaries; as you can see, writing OMP pragmas then becomes much easier. Locality is one key to parallelism. Help yourself and the compiler by expressing this correctly.
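As a small illustration of both points, the sketch below uses size_t indices declared in the loop headers. The function name fill_matrix and the row-major layout are assumptions made for the example, not something from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Fills an n-by-m matrix stored row-major in a (hypothetical layout). */
static void fill_matrix(double *a, size_t n, size_t m) {
    assert(a != NULL);
    /* i and j are declared in the loop headers, so every thread has its
       own copies automatically; no private(i,j) clause is needed. */
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < m; ++j) {
            size_t pos = i * m + j;  /* size_t avoids int overflow for large n*m */
            a[pos] = (double)(i * 10 + j);
        }
    }
}
```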

You're only parallelizing the outer 'k' for loop. Every parallel thread is executing the 'i' and 'j' loops, and they're all writing into the same 'A' result. Since they're all reading and writing the same slots in A, the final result will be non-deterministic.
It's not clear from your problem that any parallelism is possible, since each step seems to depend on every previous step.

Related

How to ensure data synchronization with OpenMP?

When I try to do the math expression from the following code the matrix values are not consistent, how can I fix that?
#pragma omp parallel num_threads (NUM_THREADS)
{
#pragma omp for
for(int i = 1; i < qtdPassos; i++)
{
#pragma omp critical
matriz[i][0] = matriz[i-1][0];
for (int j = 1; j < qtdElementos-1; j++)
{
matriz[i][j] = (matriz[i-1][j-1] + (2 * matriz[i-1][j]) + matriz[i-1][j+1]) / 4; // Xi(t+1) = [Xi-1 ^ (t) + 2 * Xi ^ (t)+ Xi+1 ^ (t)] / 4
}
matriz[i][qtdElementos-1] = matriz[i-1][qtdElementos-1];
}
}
The problem comes from a race condition caused by a loop-carried dependency. The enclosing loop cannot be parallelised (nor the inner loop) since the loop iterations read and write the current and previous rows of matriz. The same applies to the columns.
Note that OpenMP does not check whether the loop can be parallelized (in fact, it theoretically cannot in general). It is your responsibility to check that. Additionally, note that using a critical section for the whole iteration serializes the execution, defeating the purpose of a parallel loop (in fact, it will be slower due to the overhead of the critical section). Note also that #pragma omp critical applies only to the next statement. Protecting the line matriz[i][0] = matriz[i-1][0]; is not enough to avoid the race condition.
I do not think the current code can be (efficiently) parallelised. That being said, if your goal is to implement a 1D/2D stencil, then you can use a double buffering technique (i.e. write to a 2D array that is different from the input array). A similar logic can be applied to a 1D stencil repeated multiple times (which is apparently what you want to do). Note that the results will be different in that case. For the 1D stencil case, this double buffering strategy can fix the dependency issue and enable you to parallelize the inner loop. For the 2D stencil case, the two nested loops can be parallelized.
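A minimal sketch of the double buffering idea for the repeated 1D stencil could look like the following. The function name stencil_steps and the two-buffer layout are assumptions for illustration; the question stores every time step as a matrix row instead. Each step reads only from cur and writes only to next, so the inner loop carries no dependency.

```c
#include <assert.h>
#include <stddef.h>

/* Applies steps iterations of x[j] <- (x[j-1] + 2*x[j] + x[j+1]) / 4 to the
   n values in cur, using next as scratch space. The two boundary values are
   copied unchanged. Returns whichever buffer holds the final result. */
static double *stencil_steps(double *cur, double *next, size_t n, size_t steps) {
    assert(n >= 2);
    for (size_t t = 0; t < steps; ++t) {            /* time steps stay sequential */
        next[0] = cur[0];
        next[n - 1] = cur[n - 1];
        #pragma omp parallel for
        for (size_t j = 1; j < n - 1; ++j)          /* reads cur, writes next */
            next[j] = (cur[j - 1] + 2.0 * cur[j] + cur[j + 1]) / 4.0;
        double *tmp = cur; cur = next; next = tmp;  /* swap the buffer roles */
    }
    return cur;
}
```

The caller has to use the returned pointer, because the result ends up in one buffer or the other depending on whether the number of steps is odd or even.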

OpenMP parallelize grouped array sum using pointers

I want to effectively parallelize the following sum in C:
#pragma omp parallel for num_threads(nth)
for(int i = 0; i < l; ++i) pout[pg[i]] += px[i];
where px is a pointer to a double array x of size l containing some data, pg is a pointer to an integer array g of size l that assigns each data point in x to one of ng groups which occur in a random order, and pout is a pointer to a double array out of size ng which is initialized with zeros and contains the result of summing x over the grouping defined by g.
The code above works, but the performance is not optimal so I wonder if there is something I can do in OpenMP (such as a reduction() clause) to improve the execution. The dimensions l and ng of the arrays, and the number of threads nth are available to me and fixed beforehand. I cannot directly access the arrays, only the pointers are passed to a function which does the parallel sum.
Your code has a data race (at line pout[pg[i]] += ...), you should fix it first, then worry about its performance.
If ng is not too big and you use OpenMP 4.5+, the most efficient solution is using reduction: #pragma omp parallel for num_threads(nth) reduction(+:pout[:ng])
If ng is too big, most probably the best idea is to use a serial version of the program on PCs. Note that your code will be correct if you add #pragma omp atomic before pout[pg[i]] += ..., but its performance is questionable.
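Put together, the reduction variant might look like this sketch. The wrapper name grouped_sum is made up; the parameter names follow the question. It relies on array-section reductions, which need OpenMP 4.5 or later.

```c
#include <assert.h>

/* Adds px[i] to pout[pg[i]] for all i; pout must hold ng doubles and be
   zero-initialised by the caller. */
static void grouped_sum(const double *px, const int *pg, double *pout,
                        int l, int ng, int nth) {
    assert(l >= 0 && ng > 0 && nth > 0);
    /* Every thread accumulates into its own private copy of pout[0..ng-1];
       the copies are added into pout when the loop finishes. */
    #pragma omp parallel for num_threads(nth) reduction(+:pout[:ng])
    for (int i = 0; i < l; ++i)
        pout[pg[i]] += px[i];
}
```

The private copies are typically placed on each thread's stack, which is one reason this approach only suits a moderate ng.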
From your description it sounds like you have a many-to-few mapping. That is a big problem for parallelism because you likely have write conflicts in the target array. Attempts to control with critical sections or locks will probably only slow down the code.
Unless it is prohibitive in memory, I would give each thread a private copy of pout and sum into that, then add those copies together. Now the reading of the source array can be nicely divided up between the threads. If the pout array is not too large, your speedup should be decent.
Here is the crucial bit of code:
#pragma omp parallel shared(sum,threadsum)
{
    int thread = omp_get_thread_num(),
        myfirst = thread*ngroups;
    #pragma omp for
    for ( int i=0; i<biglen; i++ )
        threadsum[ myfirst+indexes[i] ] += 1;
    #pragma omp for
    for ( int igrp=0; igrp<ngroups; igrp++ )
        for ( int t=0; t<nthreads; t++ )
            sum[igrp] += threadsum[ t*ngroups+igrp ];
}
Now for the tricky bit. I'm using an index array of size 100M, but the number of groups is crucial. With 5000 groups I get good speedup, but with only 50, even though I've eliminated things like false sharing, I get pathetic or no speedup. This is not clear to me yet.
Final word: I also coded @Laci's solution of just using a reduction. Testing on 1M groups output: For 2-8 threads the reduction solution is actually faster, but for higher thread counts I win by almost a factor of 2 because the reduction solution repeatedly adds the whole array while I sum it just once, and then in parallel. For smaller numbers of groups the reduction is probably preferred overall.

OpenMP to write concurrently in the same block of memory

Assume to have two 3D arrays: A and B, with different number of elements each. I do some operations with values of A that correspond to some values of B with different indices.
For example : I use A[i][j][k] to calculate some quantities. Since each calculation is independent, I can do this using parallel for with no problem. But the updated value are used to increase the values of some positions of B array.
For example :
A[i][j][k]->C(a number)->B[l][m][n]. But at the same time a lot of writes could occur to B[l][m][n]. I use B[l][m][n]+=c to update B elements. Of course I cannot use OpenMP here because I violate the independence of loops. Nor do I know the indices l,m,n a priori, so I cannot group them into buffered writes. Any good ideas on how to implement it in parallel? Maybe critical or atomic would help me here, but I don't know how.
A simplified version of my problem (1D)
for(int i=0;i<size_A;i++)
{
    //Some code here to update A[i]. Each update is independent.
}
for(int j=0;j<size_A;j++)
{
    //From A[j] B[m],B[m+1] are evaluated
    int m=A[j]/dx;
    double c;//Make some calculations
    B[m] += c;
    B[m+1] += c*c;
}
There are two ways to do this. Either you make each update atomic. This just looks like:
#pragma omp atomic update
B[m] += c;
Or each thread gets a private update-version of B and after the loop, the private copies are all safely added together. This is possible by using:
#pragma omp parallel for reduction(+:B)
for (...)
The latter requires OpenMP 4.5, but there are workarounds for earlier versions.
Which one is better for you depends on the rate of updates and size of the matrix. You have to measure both to be sure.
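For the atomic variant, a sketch of the simplified 1D loop could look like this. The function name scatter_add is invented, and c = A[j] is only a stand-in for the elided calculation. B must be large enough for the highest index m + 1.

```c
#include <assert.h>
#include <stddef.h>

/* Scatter-adds into B: element j of A selects bin m = A[j] / dx.
   The contribution c = A[j] is a hypothetical stand-in for the real
   calculation. */
static void scatter_add(const double *A, size_t size_A, double dx, double *B) {
    assert(dx > 0.0);
    #pragma omp parallel for
    for (size_t j = 0; j < size_A; ++j) {
        size_t m = (size_t)(A[j] / dx);
        double c = A[j];
        /* Each atomic protects exactly one update statement. */
        #pragma omp atomic update
        B[m] += c;
        #pragma omp atomic update
        B[m + 1] += c * c;
    }
}
```

With floating-point values the order of the additions can vary between runs, so results may differ in the last bits from the serial loop.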

how to parallelize code using openmp to add matrix sum with reduction

I want to write parallel code using OpenMP and a reduction to compute the sum of squares of the values of a matrix (X*X). Can I use two nested for loops after #pragma omp parallel for reduction? If not, kindly suggest an alternative.
#pragma omp parallel
{
#pragma omp parallel for reduction(+:SqSumLocal)
for(index=0; index<X; index++)
{
for(i=0; i<X; i++)
{
SqSumLocal = SqSumLocal + pow(InputBuffer[index][i],2);
}
}
}
Solution: Adding int i under #pragma omp parallel solves the problem.
The way you've written it is correct, but not ideal: only the outer loop will be parallelized, and each of the inner loops will be executed on individual threads. If X is large enough (significantly larger than the number of threads) this may be fine. If you want to parallelize both loops, then you should add a collapse(2) clause to the directive. This tells the compiler to merge the two loops into a single loop and execute the whole thing in parallel.
Consider an example where you have 8 threads, and X=4. Without the collapse clause, only four threads will do work: each one will complete the work for one value of index. With the collapse clause, all 8 threads will each do half as much work. (Of course, parallelizing such a trivial amount of work is pointless - this is just an example.)
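A sketch of the collapsed version, under the assumption of a fixed X = 4 and a made-up function name square_sum, might look like this. It uses a single combined directive and multiplies the value by itself instead of calling pow.

```c
#include <assert.h>

#define X 4  /* hypothetical matrix dimension, as in the example above */

/* Returns the sum of the squares of all entries of an X-by-X matrix. */
static double square_sum(double InputBuffer[X][X]) {
    assert(InputBuffer != 0);
    double SqSum = 0.0;
    /* collapse(2) distributes all X*X iterations among the threads; the
       reduction combines the per-thread partial sums at the end. Both loop
       indices are declared in the loop headers and are therefore private. */
    #pragma omp parallel for collapse(2) reduction(+:SqSum)
    for (int index = 0; index < X; index++)
        for (int i = 0; i < X; i++)
            SqSum += InputBuffer[index][i] * InputBuffer[index][i];
    return SqSum;
}
```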

Multi-dimensional nested OpenMP loop

What is the proper way to parallelize a multi-dimensional embarrassingly parallel loop in OpenMP? The number of dimensions is known at compile-time, but which dimensions will be large is not. Any of them may be one, two, or a million. Surely I don't want N omp parallel's for an N-dimensional loop...
Thoughts:
The problem is conceptually simple. Only the outermost 'large' loop needs to be parallelized, but the loop dimensions are unknown at compile-time and may change.
Will dynamically setting omp_set_num_threads(1) and #pragma omp for schedule(static, huge_number) make certain loop parallelizations a no-op? Will this have undesired side-effects/overhead? Feels like a kludge.
The OpenMP Specification (2.10, A.38, A.39) tells the difference between conforming and non-conforming nested parallelism, but doesn't suggest the best approach to this problem.
Re-ordering the loops is possible but may result in a lot of cache-misses. Unrolling is possible but non-trivial. Is there another way?
Here's what I'd like to parallelize:
for(i0=0; i0<n[0]; i0++) {
for(i1=0; i1<n[1]; i1++) {
...
for(iN=0; iN<n[N]; iN++) {
<embarrasingly parallel operations>
}
...
}
}
Thanks!
The collapse clause is probably what you're looking for. It essentially forms a single loop, which is then parallelized, and is designed for exactly these sorts of situations. So you'd do:
#pragma omp parallel for collapse(N)
for(int i0=0; i0<n[0]; i0++) {
for(int i1=0; i1<n[1]; i1++) {
...
for(int iN=0; iN<n[N]; iN++) {
<embarrasingly parallel operations>
}
...
}
}
and be all set.
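As a concrete instance, here is a sketch for three dimensions. The function name scale3d and the contiguous storage layout are assumptions for the example. The argument of collapse has to be a compile-time constant, and the loops must be perfectly nested with bounds that do not change inside the nest.

```c
#include <assert.h>
#include <stddef.h>

/* Doubles every element of an n0 x n1 x n2 array stored contiguously in a. */
static void scale3d(double *a, size_t n0, size_t n1, size_t n2) {
    assert(a != NULL);
    /* The three loops are merged into one iteration space of n0*n1*n2
       iterations, so it does not matter which dimension is the large one. */
    #pragma omp parallel for collapse(3)
    for (size_t i0 = 0; i0 < n0; i0++)
        for (size_t i1 = 0; i1 < n1; i1++)
            for (size_t i2 = 0; i2 < n2; i2++)
                a[(i0 * n1 + i1) * n2 + i2] *= 2.0;
}
```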
