Decomposing a loop in OpenMP

I have several loops which follow this pattern:
do j = ms,mst
   ic = ic + 1
   df = mm(j)*data(ic)
   dff(1:3) = vec(1:3)*df*qm
end do
As you can see, the variable ic is updated at every cycle of j, and the result of ic is used by the variable df. If I used OpenMP's atomic operations here, I would degrade performance. Do you know an efficient way to deal with this kind of loop in OpenMP?

If ic is not changed apart from the increment (i.e. data is an array or a function without side effects), there is a fixed relation between j and ic:
icStart = ic
delta = icStart - ms + 1
do j = ms,mst
   ic = delta + j
   df = mm(j)*data(ic)
   dff(1:3) = vec(1:3)*df*qm
end do
This can easily be parallelized with ic and df made thread-private. You still need to take care of dff: as the code stands, you will get a race condition on it...
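For concreteness, a minimal sketch of that parallelisation (an assumption here: dff is only consumed inside each iteration, so it can be made private too; if its final value is needed after the loop, lastprivate would be required instead):
icStart = ic
delta = icStart - ms + 1
!$omp parallel do private(ic, df, dff)
do j = ms, mst
   ic = delta + j
   df = mm(j)*data(ic)
   dff(1:3) = vec(1:3)*df*qm
end do
!$omp end parallel do
ic = delta + mst   ! restore the value ic would hold after the serial loop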

As you've written your code, the value of ic is increased by 1 at each iteration, just as the value of j is. A straightforward parallelisation of the loop, something like
!$OMP PARALLEL DO
do j = ms,mst
...
will distribute the work across threads giving each of them a discrete set of the values that j takes. Simple static scheduling of a 64-trip loop (with ms==1 and mst==64) across 4 threads will mean that thread 0 gets j = 1..16, thread 1 gets j = 17..32 and so on.
However, without care on your part the values of ic won't get neatly split across threads in this way. It looks to me, from the sample you've provided, as if the behaviour you want is for blocks of values of ic to accompany corresponding blocks of values of j -- they both increase by 1 at each trip round the loop.
Perhaps in the part of the code you haven't shown us ic is set to ms+k where k is some integer. In that case you could simply drop ic from inside the loop and write
!$OMP PARALLEL DO
do j = ms,mst
   df = mm(j)*data(j+k)
   dff(1:3) = vec(1:3)*df*qm
end do
Without knowing more about the relationship between j and ic it's difficult to offer more pertinent advice than this. But the principle remains: rewrite ic as a function of j if you can, and avoid the difficulty inside the parallelised loop.

Related

Accessing boundaries of an array without duplicating lots of code

When I compile my code using -fcheck=all I get a runtime error, since it seems I step out of the bounds of my array dimensions. It comes from the part of my code shown below. I think it is because my loops over i,j only run from -ny to ny and -nx to nx, but I try to use points at i+1, j+1, i-1, j-1, which takes me out of bounds in my arrays. When the loop over j starts at -ny, it needs j-1, so it immediately goes out of bounds trying to access -ny-1. The same happens when j=ny, and likewise at i=-nx and i=nx.
My question is, how can I fix this problem efficiently using minimal code?
I need the array grad(1,i,j) correctly defined on the boundary, and it needs to be defined exactly as on the right-hand side of the equality below; I just don't know an efficient way of doing this. I could explicitly define grad(1,nx,j), grad(1,-nx,j), etc. separately and only loop over i=-nx+1,nx-1 and j=-ny+1,ny-1, but this causes lots of duplicated code, and I have many of these arrays, so I don't think that is the logical/efficient approach. If I do that, I just end up with hundreds of lines of duplicated code that are very hard to debug. Thanks.
integer :: i,j
integer, parameter :: nx = 50, ny = 50
complex, dimension (3,-nx:nx,-ny:ny) :: grad,psi
real, parameter :: h = 0.1
do j = -ny,ny
   do i = -nx,nx
      psi(1,i,j) = sin(i*h)+sin(j*h)
      psi(2,i,j) = sin(i*h)+sin(j*h)
      psi(3,i,j) = sin(i*h)+sin(j*h)
   end do
end do
do j = -ny,ny
   do i = -nx,nx
      grad(1,i,j) = (psi(1,i+1,j)+psi(1,i-1,j)+psi(1,i,j+1)+psi(1,i,j-1)-4*psi(1,i,j))/h**2 &
                  - (psi(2,i+1,j)-psi(2,i,j))*psi(1,i,j)/h &
                  - (psi(3,i,j+1)-psi(3,i,j))*psi(1,i,j)/h &
                  - psi(2,i,j)*(psi(1,i+1,j)-psi(1,i,j))/h &
                  - psi(3,i,j)*(psi(1,i,j+1)-psi(1,i,j))/h
   end do
end do
If I were to do this directly for grad(1,nx,j) and grad(1,-nx,j), it would be given by
do j = -ny+1,ny-1
   grad(1,nx,j) = (psi(1,nx,j)+psi(1,nx-2,j)+psi(1,nx,j+1)+psi(1,nx,j-1)-2*psi(1,nx-1,j)-2*psi(1,nx,j))/h**2 &
                - (psi(2,nx,j)-psi(2,nx-1,j))*psi(1,nx,j)/h &
                - (psi(3,nx,j+1)-psi(3,nx,j))*psi(1,nx,j)/h &
                - psi(2,nx,j)*(psi(1,nx,j)-psi(1,nx-1,j))/h &
                - psi(3,nx,j)*(psi(1,nx,j+1)-psi(1,nx,j))/h
   grad(1,-nx,j) = (psi(1,-nx+2,j)+psi(1,-nx,j)+psi(1,-nx,j+1)+psi(1,-nx,j-1)-2*psi(1,-nx+1,j)-2*psi(1,-nx,j))/h**2 &
                 - (psi(2,-nx+1,j)-psi(2,-nx,j))*psi(1,-nx,j)/h &
                 - (psi(3,-nx,j+1)-psi(3,-nx,j))*psi(1,-nx,j)/h &
                 - psi(2,-nx,j)*(psi(1,-nx+1,j)-psi(1,-nx,j))/h &
                 - psi(3,-nx,j)*(psi(1,-nx,j+1)-psi(1,-nx,j))/h
end do
One possible approach is to use additional index variables for the boundaries, clamped so that the original index never goes out of bounds. I mean something like this:
do j = -ny,ny
   jj = max(min(j, ny-1), -ny+1)
   do i = -nx,nx
      ii = max(min(i, nx-1), -nx+1)
      grad(1,i,j) = (psi(1,ii+1,j)+psi(1,ii-1,j)+psi(1,i,jj+1)+psi(1,i,jj-1)-4*psi(1,i,j))/h**2 &
                  - (psi(2,ii+1,j)-psi(2,ii,j))*psi(1,i,j)/h &
                  - (psi(3,i,jj+1)-psi(3,i,jj))*psi(1,i,j)/h &
                  - psi(2,i,j)*(psi(1,ii+1,j)-psi(1,ii,j))/h &
                  - psi(3,i,j)*(psi(1,i,jj+1)-psi(1,i,jj))/h
   end do
end do
It's hard for me to write proper code because it seems you trimmed part of the original expression in the code you presented in the question, but I hope you understand the idea and can apply it correctly to your logic.
Opinions:
Even though this is what you are asking for (as far as I understand), I would not recommend doing it before profiling and checking whether assigning the boundary conditions manually after a whole-array operation wouldn't be more efficient instead. The extra index calculations on each iteration could impact performance (though arguably less than if conditionals or function calls would). Using "ghost cells", as suggested by @evets, could be even more performant; see the sketch after the next point. You should profile and compare.
I'd also recommend declaring your arrays as dimension(-nx:nx,-ny:ny,3) instead. Fortran stores arrays in column-major order, and since you access values in the neighbourhood along "x" and "y", those values are non-contiguous in memory when the fixed "component" dimension comes first, which can mean fewer cache hits.
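To illustrate the "ghost cells" idea mentioned above: a minimal sketch, not the asker's full code, in which the arrays get one extra layer in each direction and that layer is filled from the boundary condition (the zeros below are only a placeholder for whatever condition actually applies). The stencil loop then needs no conditionals or index clamping:
integer :: i, j
integer, parameter :: nx = 50, ny = 50
real, parameter :: h = 0.1
! one extra "ghost" layer on each side of the physical domain
complex, dimension (3,-nx-1:nx+1,-ny-1:ny+1) :: grad, psi

! ... fill psi on the interior (-nx:nx, -ny:ny) as before ...

! fill the ghost layer from the boundary condition (placeholder zeros)
psi(:,-nx-1,:) = (0.0, 0.0)
psi(:, nx+1,:) = (0.0, 0.0)
psi(:,:,-ny-1) = (0.0, 0.0)
psi(:,:, ny+1) = (0.0, 0.0)

! the full stencil now applies at every physical point, including the boundary
do j = -ny, ny
   do i = -nx, nx
      grad(1,i,j) = (psi(1,i+1,j)+psi(1,i-1,j)+psi(1,i,j+1)+psi(1,i,j-1)-4*psi(1,i,j))/h**2 &
                  - (psi(2,i+1,j)-psi(2,i,j))*psi(1,i,j)/h &
                  - (psi(3,i,j+1)-psi(3,i,j))*psi(1,i,j)/h &
                  - psi(2,i,j)*(psi(1,i+1,j)-psi(1,i,j))/h &
                  - psi(3,i,j)*(psi(1,i,j+1)-psi(1,i,j))/h
   end do
end do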
In somewhat pseudo-code, you can do
do j = -ny, ny
   if (j == -ny) then
      p1jm1 = XXXXX   ! some boundary condition
   else
      p1jm1 = psi(1,i,j-1)
   end if
   if (j == ny) then
      p1jp1 = YYYYY   ! some other boundary condition
   else
      p1jp1 = psi(1,i,j+1)
   end if
   do i = -nx, nx
      grad(1,i,j) = ... term involving p1jm1 ... term involving p1jp1 ...
      ...
   end do
end do
The j-loop isn't bad, in that you are only adding 2*2*ny conditionals. The inner i-loop (with the analogous tests on i) adds 2*2*nx conditionals for each j iteration (i.e. 2*2*ny * 2*2*nx conditionals in total). Note that you need a temporary for each psi reference whose index triplet is unique, i.e., psi(1,i,j+1), psi(1,i,j-1), and psi(3,i,j+1).

Controlling Number of Threads in Parallel Loops & Reducing Overhead

In my Fortran 95 code I have a series of nested DO loops, the whole of which takes a significant amount of time to compute, so I wanted to add parallel functionality with OpenMP (using gfortran -fopenmp to compile/build).
There is one main DO loop, running 1000 times.
There is a sub DO loop within this, running 100 times.
Several other DO loops are nested within this; their number of iterations increases with each iteration of the main DO loop (once on the first pass, up to 1000 times on the last).
Example:
DO a = 1, 1000
   DO b = 1, 100
      DO c = 1, d
         some calculations
      END DO
      DO c = 1, d
         some calculations
      END DO
      DO c = 1, d
         some calculations
      END DO
   END DO
   d = d + 1
END DO
Some of the nested DO loops have to be run in serial as they contain dependencies within themselves (that is to say, each iteration of the loop has a calculation that includes a value from the previous iteration), and cannot easily be parallelised in this instance.
I can easily make the loops without any dependencies run in parallel, as below:
d = 1
DO a = 1, 1000
   DO b = 1, 100
      DO c = 1, d
         some calculations with dependencies
      END DO
      !$OMP PARALLEL
      !$OMP DO
      DO c = 1, d
         some calculations without dependencies
      END DO
      !$OMP END DO
      !$OMP END PARALLEL
      DO c = 1, d
         some calculations with dependencies
      END DO
   END DO
   d = d + 1
END DO
However, I understand that there is significant overhead in opening and closing the parallel region, given that this occurs so many times within the loops. Indeed, the code runs significantly slower than it previously did when run sequentially.
Following this, I figured it made sense to open and close the parallel region on either side of the main loop (thereby paying the overhead only once), and to set the number of threads to either 1 or 4 to control whether sections run sequentially or in parallel, as below:
d = 1
CALL omp_set_num_threads(1)
!$OMP PARALLEL
DO a = 1, 1000
   DO b = 1, 100
      DO c = 1, d
         some calculations with dependencies
      END DO
      CALL omp_set_num_threads(4)
      !$OMP DO
      DO c = 1, d
         some calculations without dependencies
      END DO
      !$OMP END DO
      CALL omp_set_num_threads(1)
      DO c = 1, d
         some calculations with dependencies
      END DO
   END DO
   d = d + 1
END DO
!$OMP END PARALLEL
However, when I set this to run I don't get the speedup that I was expecting from running parallel code. I expect the first few to be slower to account for the overhead, but after a while I expect the parallel code to run faster than the sequential code, which it doesn't. I compared how fast each iteration of the main DO loop ran, for DO a = 1, 50, results below:
Iteration     Serial    Parallel
        1     3.8125      4.0781
        2     5.5781      5.9843
        3     7.4375      7.9218
        4     9.2656      9.7500
      ...
       48    89.0625     94.9531
       49    91.0937     97.3281
       50    92.6406     99.6093
My first thought is that I'm somehow not setting the number of threads correctly.
Questions:
Is there something obviously wrong with how I've structured the parallel code?
Is there a better way to implement what I've done / want to do?
There is indeed something obviously wrong: you have removed all parallelism from your code. Before creating the outermost parallel region, you set its size to one thread, so only a single thread is created to handle whatever code is inside that region. Subsequently calling omp_set_num_threads(4) won't change that: it merely says that the next parallel directive will create 4 threads (unless explicitly requested otherwise). But there is no such new parallel directive, which would have had to be nested within the current one. You only have a work-sharing do directive, which applies to the current enclosing parallel region of a single thread.
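A small sketch of that behaviour (assuming use omp_lib; the nested region only really gets its 4 threads if nested parallelism is enabled):
call omp_set_num_threads(1)
!$omp parallel
print *, omp_get_num_threads()     ! prints 1: the region size was fixed on entry
call omp_set_num_threads(4)        ! affects only the next parallel directive
!$omp parallel
print *, omp_get_num_threads()     ! a new, nested region: 4 threads, if nesting is enabled
!$omp end parallel
!$omp end parallel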
There are two ways of addressing your issue:
Keeping your code as it was: although formally you fork and join your threads upon entry and exit of the parallel region, the OpenMP standard doesn't require that the threads actually be created and destroyed. It even encourages implementations to keep threads alive to reduce the overhead of the parallel directive, and most OpenMP run-time libraries do so. The cost of this simple approach therefore isn't too big.
Using your second approach of hoisting the parallel directive outside of the outermost loop, but creating as many threads as you'll need for the work-sharing (4 here, I believe). Then you enclose whatever has to stay sequential within your parallel region in a single directive. This ensures that no unwanted interaction with the extra threads happens (there is an implicit barrier and a flush of shared variables upon exit) while avoiding parallelism where you don't want it.
This last version would look like this:
d = 1
!$omp parallel num_threads( 4 ) private( a, b, c ) firstprivate( d )
do a = 1, 1000
   do b = 1, 100
      !$omp single
      do c = 1, d
         some calculations with dependencies
      end do
      !$omp end single
      !$omp do
      do c = 1, d
         some calculations without dependencies
      end do
      !$omp end do
      !$omp single
      do c = 1, d
         some calculations with dependencies
      end do
      !$omp end single
   end do
   d = d + 1
end do
!$omp end parallel
Now, whether this version is actually faster than the naive one is up to you to test.
A last remark though: since there are quite a lot of sequential parts in your code, don't expect too much speed-up anyway. Amdahl's law is forever.
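For reference: if a fraction p of the runtime can be parallelised, Amdahl's law bounds the speedup on N threads to S(N) = 1 / ((1 - p) + p/N). With, say, p = 0.5 and N = 4, the best possible speedup is 1 / (0.5 + 0.125) = 1.6, however efficient the parallel part is.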
Nothing is obviously wrong, but if the serial loops take long, your speedup will be limited. Parallel computing may require you to redesign your algorithms.
Instead of setting the number of threads in the loop, use the !$omp master / !$omp end master directives to restrict execution of a block to a single thread; a sketch follows. Add a !$omp barrier if that block may run only once all other threads are done.
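A minimal sketch of that suggestion, applied to the loop nest above (an assumption: as in the previous answer, a, b and c are private and d is firstprivate; note that end master, unlike end single, has no implied barrier, hence the explicit ones):
d = 1
!$omp parallel num_threads( 4 ) private( a, b, c ) firstprivate( d )
do a = 1, 1000
   do b = 1, 100
      !$omp master
      do c = 1, d
         ! some calculations with dependencies
      end do
      !$omp end master
      !$omp barrier      ! end master has no implied barrier
      !$omp do
      do c = 1, d
         ! some calculations without dependencies
      end do
      !$omp end do       ! implied barrier here
      !$omp master
      do c = 1, d
         ! some calculations with dependencies
      end do
      !$omp end master
      !$omp barrier
   end do
   d = d + 1             ! each thread updates its own firstprivate copy in lockstep
end do
!$omp end parallel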

Which method simplifies a specific inner for loop? Using a cell array or another method

I am working on an online sequential algorithm. At time t, I receive a data vector y of length t, and I need to check each element of y. Since the newest observation y(t) affects the judgement for all previous observations, the whole vector y needs to be re-checked even though only one new observation y(t) arrives at each time step. Here is my current solution; of course, the whole observation vector y(1:t) is pre-generated:
for i = 1:length(y) % length(y) = t
   for j = 1:i
      p(i,j) = fct(y(1:j), y(j+1:i));
   end
end
I am thinking that at time t, the only difference in the inner for loop is the segmentation of the data into y(1:j) and y(j+1:t). So I wonder whether constructing a t-by-2 cell array c, with c{j,1} = y(1:j) and c{j,2} = y(j+1:t), could make the process faster. I know cell arrays cause a lot of overhead; is there any better way?

Parallelizing pairwise gravitational force calculation with OpenMP in Fortran 90

I am trying to parallelize the calculation of gravitational forces in my program using OpenMP. The calculation of the distances (R and R2) is no problem but the forces/accelerations (A) come out wrong. I know that it has something to do with race conditions in the summation. I have experimented a bit with atomic and critical constructs but could not find a solution. Also, I'm not sure which variables should be private and why.
Does someone with more experience in using OpenMP have a suggestion on how to correct this in the following code example?
A = 0.0
!$omp parallel do
do i = 1, Nobj
   do j = i + 1, Nobj
      R2(i,j) = (X(j,1) - X(i,1))**2 &
              + (X(j,2) - X(i,2))**2 &
              + (X(j,3) - X(i,3))**2
      R(i,j) = sqrt(R2(i,j))
      do k = 1, 3
         A(i,k) = A(i,k) + ((mass_2_acc(i,j) / R2(i,j)) * ((X(j,k) - X(i,k)) / R(i,j)))
         A(j,k) = A(j,k) + ((mass_2_acc(i,j) / R2(i,j)) * ((X(i,k) - X(j,k)) / R(i,j)))
      enddo
   enddo
   A(i,:) = A(i,:) * G / mass_acc(i)
enddo
!$omp end parallel do
You are modifying A(j,k), and neither j nor k is "local" to a thread: the thread-parallel index is i, so neither of those index ranges is restricted to a particular thread, and all threads update the same A(j,k) elements. Hence the race condition.
Things you can do: split up the R and A calculations, or don't exploit the symmetry when updating A.
Also, Fortran is column-major and you are traversing the outer index first, which is bad for performance.
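For the second option, a minimal sketch (not the asker's full code) that drops the symmetry trick: each thread writes only to its own row A(i,:), so no synchronisation is needed, at the cost of computing every pair twice. It assumes mass_2_acc is symmetric and populated for all (i,j), and uses hypothetical local real scalars r2v and rv in place of the shared R2 and R arrays:
A = 0.0
!$omp parallel do private(j, k, r2v, rv)
do i = 1, Nobj
   do j = 1, Nobj
      if (j == i) cycle
      r2v = (X(j,1) - X(i,1))**2 &
          + (X(j,2) - X(i,2))**2 &
          + (X(j,3) - X(i,3))**2
      rv = sqrt(r2v)
      do k = 1, 3
         ! only row i of A is written, and each i belongs to exactly one thread
         A(i,k) = A(i,k) + (mass_2_acc(i,j) / r2v) * ((X(j,k) - X(i,k)) / rv)
      end do
   end do
   A(i,:) = A(i,:) * G / mass_acc(i)
end do
!$omp end parallel do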

Are there any languages that have a do-until loop?

Is there any programming language that has a do-until loop?
Example:
do
{
    <statements>
}
until (<condition>);
which is basically equivalent to:
do
{
    <statements>
}
while (<negated condition>);
NOTE: I'm looking for post-test loops.
Ruby has until.
i = 0
begin
  puts i
  i += 1
end until i == 5
VBA!
Do-Until-Loop
Do-Loop-Until
Although I think quite a number of people here would doubt whether it is a real language at all... but well, BASIC is how Microsoft started (a weak argument for many, I know).
It is possible in VB.Net
bExitFromLoop = False
Do
    'Executes the following Statement
Loop Until bExitFromLoop
It is also possible in SDF-P on BS2000 (Fujitsu/Siemens Operating System)
/ DECLARE-VARIABLE A
/ DECLARE-VARIABLE SWITCH-1(TYPE=*BOOLEAN)
/ SET-VARIABLE A = 5
/ SET-VARIABLE SWITCH-1 = ON
/ REPEAT
/ A = A + 10
/ IF (A > 50)
/ SET-VARIABLE SWITCH-1 = OFF
/ END-IF
/ UNTIL (SWITCH-1 = OFF)
/ SHOW-VARIABLE A
A = 55
It is also possible in C or C++ using a macro that defines until.
Example (definition):
#define until(cond) while (!(cond))
Example (utilisation):
int i = 0;
do {
    cout << i << "\n";
    i++;
} until (i == 5);
In VB we can find something like:
Reponse = InputBox("Please Enter Pwd")
Do Until Reponse = "Bob-pwr148" ...
Eiffel offers you an until loop.
from
    x := 1
until
    x > 100
loop
    ...
end
There is an "across" loop as well. Both are very powerful and expressive.
The design of this loop has more to offer. There are two more parts to its grammar that will help us resolve two important "correctness" problems.
Endless loop protection.
Iteration failure detection.
Endless Loop Protection
Let's modify our loop code a little by adding a loop variant.
from
    x := 1
    v := 1_000
until
    x > 100
variant
    v
loop
    ...
    v := v - 1
end
The loop variant is (essentially) a count-down variable, but not just any old variable. By using the variant keyword, we are telling the compiler to pay attention to v. Specifically, the compiler is going to generate code that watchdogs the v variable for two conditions:
Does v decrease with each iteration of the loop (are we counting down)? It does no good to use a count-down variable if it is, in fact, not counting down, right? If the loop variant is not counting down (decreasing by any amount), then we throw an exception.
Does v ever drop below zero? If so, then we throw an exception.
Both of these work together through the compiler and variant variable to detect when and if our iterating loop fails to iterate or iterates too many times.
In the example above, our code is telling us that it expects to iterate zero to 1_000 times, but not more. If it iterates more, then we stop the loop, which leaves us to wonder: do we really have cases where we iterate more than 1_000 times, or is something wrong, with our condition failing to become True?
Loop Invariant
Now that we know what a loop variant is, we need to understand what a loop invariant is.
The invariant is a set of one or more Boolean conditions that must hold True after each iteration through the loop. Why do we want these?
Imagine you have 1_000_000 iterations and one of them fails. You don't have time to walk through each iteration, examining it to see whether it is okay or not. So you create a set of one or more conditions that are tested upon completion of each iteration. If one or all of the conditions fail, then you know precisely which iteration (and its deterministic state) is causing the problem!
The loop invariant might look something like:
from
    x := 1
    y := 0
    v := 1_000
invariant
    y = x - 1
until
    x > 100
variant
    v
loop
    ...
    x := x + 1
    y := y + 1
    v := v - 1
end
In the example above, y is trailing x by 1. We expect that after each iteration, y will always be x - 1. So, we create a loop invariant using the invariant keyword that states our Boolean assertion. If y fails to be x - 1, the loop will immediately throw an exception and let us know precisely which iteration has failed to keep the assertion True.
CONCLUSION
Our loop is now quite tight and secure, well guarded against failure (bugs, errors).
