OpenMP in nested do loops

If I have this code,
subroutine min_distance(r,n,k,centroid,distance,indices,distancereg)
integer, intent(out):: n,k
real,dimension(:,:),intent(in),allocatable::centroid
real,dimension(:,:),intent(in),allocatable::r
integer,dimension(:),intent(out),allocatable::indices,distancereg
real ::d_min
integer::y,i_min,j,i
integer,parameter :: data_dim=2
allocate (indices(n))
allocate (distancereg(k))
!cost=0.d0
do j=1,n
i_min = -1
d_min=1.d6
do i=1,k
distance=0.d0
distancereg(i)=0.d0
do y=1,data_dim
distance = distance+abs(r(y,j)-centroid(y,i))
distancereg(i)=distancereg(i)+abs(r(y,j)-centroid(y,i))
end do
if (distance<d_min) then
d_min=distance
i_min=i
end if
end do
if( i_min < 0 ) print*," found error by assigning k-index to particle ",j
indices(j)=i_min
end do
What I want to do is parallelize the distance calculation over k, i.e. assign each value of the k loop to its own thread. For example, if k=3, the distance for the first centroid would be calculated by thread 1, and so on. I have tried omp_nested and omp_ordered, but I still get errors. I would appreciate any advice or guidance.
Thanks

If you want to parallelize a loop (or loop nest), you first have to work out which iterations are independent. In your case, each outer j iteration computes an i_min value that is (1) initialized at the start of that j iteration and (2) written into indices(j). So each i_min calculation is independent and you can make the j loop parallel. (d_min is reset for each j in the same way, so it has to be private to each thread along with i_min and distance.)
If the j loop is long enough that should be enough to get high performance. You might be tempted to look at the next loop over i. It computes a separate distance value for each iteration, so that's again parallel. Except that you update i_min,d_min, so you need to declare that loop a reduction.
However, the two loops are not "perfectly nested", so you can not spread the total i,j iteration space over the threads.
TLDR: your outer j loop can be parallelized.

What about simply:
do j=1,n
distancereg(:)=0.d0
!$OMP PARALLEL DO PRIVATE(y)
do i=1,k
do y=1,data_dim
distancereg(i)=distancereg(i)+abs(r(y,j)-centroid(y,i))
end do
end do
!$OMP END PARALLEL DO
indices(j)=minloc(distancereg,dim=1)
end do
Since you are storing the distances for each i, the search for the minimum value can be postponed until after the loop over i.
Or parallelizing the outer loop (here you don't need to store the distances):
!$OMP PARALLEL DO PRIVATE(i,y,i_min,d_min,distance)
do j=1,n
i_min = -1
d_min=1.d6
do i=1,k
distance=0.d0
do y=1,data_dim
distance = distance+abs(r(y,j)-centroid(y,i))
end do
if (distance<d_min) then
d_min=distance
i_min=i
end if
end do
if( i_min < 0 ) print*," found error by assigning k-index to particle ",j
indices(j)=i_min
end do
!$OMP END PARALLEL DO
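For reference, the outer-loop variant can be sketched as a complete, compilable unit. This is only a sketch under some assumptions that differ from the declarations in the question: n and k are treated as inputs (intent(in)), the arrays are plain assumed-shape arguments rather than allocatable ones, and distancereg is dropped because it is not needed here. The module name nearest_centroid and the test data are made up for illustration.

```fortran
module nearest_centroid
  implicit none
contains
  ! Assign each of the n points r(:, j) to its nearest centroid (L1 distance).
  ! Assumption: n and k are inputs; indices is the only output.
  subroutine min_distance(r, n, k, centroid, indices)
    integer, intent(in)  :: n, k
    real,    intent(in)  :: r(:, :), centroid(:, :)
    integer, intent(out) :: indices(n)
    integer :: i, j, y, i_min
    real    :: d_min, distance

    ! Each j iteration is independent, so the scratch variables are private.
    !$omp parallel do private(i, y, i_min, d_min, distance)
    do j = 1, n
      i_min = -1
      d_min = huge(d_min)
      do i = 1, k
        distance = 0.0
        do y = 1, size(r, 1)
          distance = distance + abs(r(y, j) - centroid(y, i))
        end do
        if (distance < d_min) then
          d_min = distance
          i_min = i
        end if
      end do
      indices(j) = i_min
    end do
    !$omp end parallel do
  end subroutine min_distance
end module nearest_centroid

program demo
  use nearest_centroid
  implicit none
  real    :: r(2, 4), centroid(2, 2)
  integer :: indices(4)

  ! Four 2-D points, one per column: (0,0), (1,0), (9,9), (10,8).
  r = reshape([0.0, 0.0, 1.0, 0.0, 9.0, 9.0, 10.0, 8.0], [2, 4])
  ! Two centroids: (0,0) and (10,10).
  centroid = reshape([0.0, 0.0, 10.0, 10.0], [2, 2])

  call min_distance(r, 4, 2, centroid, indices)
  print *, indices
  if (any(indices /= [1, 1, 2, 2])) error stop 'unexpected assignment'
end program demo
```

With gfortran this would be built with the -fopenmp flag; without it the directives are treated as comments and the code runs serially with the same result.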

Related

Optimizing matrix calculations in for loops in Octave

I imported code from Matlab to Octave and the speed of certain functions seems to have dropped.
I looked into vectorization and could not come up with a solution with my limited knowledge.
What I want to ask is: is there a way to speed this up?
n = 181;
N = 250;
for i=1:n
for j=1:n
par=0;
for k=1:N;
par=par+log2(1+(10.^(matrix1(j,i,matrix2(j,i))./10)./(matrix3(j,i).*double1+double2)));
end
resultingMatrix(j,i)=2.^((1/N).*par)-1;
end
end
Where dimensions are:
matrix1 = 181x181x2,
matrix2 = 181x181 --> containing values either 1 or 2 only,
matrix3 = 181x181,
double1, double2 = just doubles
Here's my testing code, I've completed your code by making some random matrices:
n = 181;
N = 250;
matrix1 = rand(n,n,2);
matrix2 = randi(2,n,n);
matrix3 = rand(n,n);
double1 = 1;
double2 = 1;
tic
for i=1:n
for j=1:n
par=0;
for k=1:N
par=par+log2(1+(10.^(matrix1(j,i,matrix2(j,i))./10)./(matrix3(j,i).*double1+double2)));
end
resultingMatrix(j,i)=2.^((1/N).*par)-1;
end
end
toc
Note that the code inside the loop over k doesn't use k. This makes the loop superfluous. We can easily remove it. The loop does the same computation 250 times, adds up the results, and divides by 250, yielding the value of one of the repeated computations.
Another important thing to do is preallocate resultingMatrix, to avoid it growing with every loop iteration.
This is the resulting code:
tic
resultingMatrix2 = zeros(n,n);
for i=1:n
for j=1:n
par=log2(1+(10.^(matrix1(j,i,matrix2(j,i))./10)./(matrix3(j,i).*double1+double2)));
resultingMatrix2(j,i)=2.^par-1;
end
end
toc
max(abs((resultingMatrix(:)-resultingMatrix2(:))./resultingMatrix(:)))
The last line computes the maximum relative difference. It is 9.9424e-15 in my version of Octave. It will differ depending on the version, the system, and more. This error is the floating-point rounding error. Note that the original code, adding the same value 250 times, and then dividing it by 250, will produce a larger rounding error than the modified code. For example,
x = pi;
t = 0;
for i = 1:N
t = t + x;
end;
t = t / N;
t-x
gives -8.4377e-15, a similar rounding error to what we saw above.
The original code took 81.5 s, the modified code takes only 0.4 s. This is not a gain of vectorization, it is just a gain of preallocation and not needlessly repeating the same computation over and over again.
Next, we can remove the other two loops by vectorizing the operations. The difficult bit here is matrix1(j,i,matrix2(j,i)). We can produce each of the n*n linear indices with (1:n*n).' + (matrix2(:)-1)*(n*n). This is not trivial, I suggest you think about how this computation works. You need to know that linear indices count, starting at 1 for the top-left array element, first down, then right, then along the 3rd dimension. So 1:n*n is simply the linear indices for each of the elements of a 2D array, in order. To each of these we add n*n if we need to access the 2nd element along the 3rd dimension.
We now have the code
tic
index = reshape((1:n*n).' + (matrix2(:)-1)*(n*n), n, n);
par = log2(1+(10.^(matrix1(index)./10)./(matrix3.*double1+double2)));
resultingMatrix3 = 2.^par-1;
toc
max(abs((resultingMatrix(:)-resultingMatrix3(:))./resultingMatrix(:)))
This code produces the exact same result as my previous version, and runs in only 0.013 s, 30 times faster than the non-vectorized code, and 6000 times faster than the original code.

Finding the Complexity of Nested Loops

I'm given the loop pseudocode:
where "to" is equivalent to "<="
sum = 0;
for i = 1 to n
for j = 1 to i^3
for k = 1 to j
sum++
I know the outermost loop runs n times.
Do the two inner loops also run n times, though (making the entire complexity O(n^3))?
Where for instance n = 5
Then:
i = 1 <= 5, then i = 2 <= 5
j = 1 <= 1^3, then j = 2 <= 2^3 = 8
k = 1 <= 1, then k = 2 <= 2
And this would continue n times for each loop, making it n^3?
This seems like a tricky problem, those inner loops are more complex than just n.
The outer loop is n.
The next loop goes to i^3. At the end of the outer loop i will be equal to n. This means that this loop at the end will be at n^3. Averaged over all values of i it runs fewer iterations than that, but only by a constant factor, which we ignore since this is Big O.
The third loop goes to j, but at the end of the previous loop j will be equal to i^3. And we already determined that i^3 was equal to n^3.
So it looks like:
1st loop: n
2nd loop: n^3
3rd loop: n^3
Which looks like it comes to n^7. I'd want someone else to verify this though. Gotta love Big O.
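You can use Sigma notation to explicitly count the number of basic operations in the loop (let sum++ be a basic operation):
$$T(n) = \sum_{i=1}^{n}\sum_{j=1}^{i^3}\sum_{k=1}^{j} 1 = \sum_{i=1}^{n}\sum_{j=1}^{i^3} j = \frac{1}{2}\sum_{i=1}^{n}\left(i^6 + i^3\right) \overset{(i)}{=} \frac{n(n+1)(2n+1)(3n^4+6n^3-3n+1)}{84} + \frac{n^2(n+1)^2}{8} \overset{(ii)}{=} \frac{n^7}{14} + \frac{n^6}{4} + \frac{n^5}{4} + \frac{n^4}{8} + \frac{n^3}{6} + \frac{n^2}{8} + \frac{n}{84}$$
Where
(i): Partial sum formula from Wolfram Alpha.
(ii): Expanding the expression from Wolfram Alpha.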
Hence, the complexity, using Big-O notation, is O(n^7).
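A brute-force count for small n is an easy way to sanity-check the n^7 result. The sketch below translates the pseudocode literally and compares the count with a closed form. For a fixed i the two inner loops run i^3(i^3+1)/2 times, and summing that over i gives the degree-7 polynomial used in the code. The program name and the range of n are arbitrary choices.

```fortran
program count_ops
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer(int64) :: n, i, j, k, total, closed

  do n = 1, 6
    ! Literal translation of the pseudocode: count every sum++.
    total = 0
    do i = 1, n
      do j = 1, i**3
        do k = 1, j
          total = total + 1
        end do
      end do
    end do

    ! Closed form of sum_{i=1..n} i^3 (i^3 + 1) / 2, over the common denominator 168.
    closed = (12*n**7 + 42*n**6 + 42*n**5 + 21*n**4 + 28*n**3 + 21*n**2 + 2*n) / 168

    print *, n, total, closed
    if (total /= closed) error stop 'closed form disagrees with the loop count'
  end do
end program count_ops
```

The leading term of the polynomial is n^7/14, which is where the O(n^7) bound comes from.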

Nested loop in Fortran with OPENMP

My (Fortran) code is very simple. All it does is fill a large array that depends on five (independent!) variables. Here is a brief example
do i = 1, imax
do j = 1, jmax
do k = 1, kmax
array(i,j,k) = ! some function of i,j,k
end do
end do
end do
I would like to use different threads to fill the values of array faster.
I thought the simplest way to achieve that would be to enclose the loop in these commands
!$OMP PARALLEL DO
!$OMP PARALLEL END
However, if I do this I get completely different results from the serial case. I apologize if the question is too simple, but I couldn't really find a proper example to help solve my problem. Can you recommend a solution or provide an example?
I don't know exactly what is happening, but it could be a race condition or just badly written directives. Try this and see if it works
! replace ... with everything the threads must share: array, the loop bounds,
! and any read-only inputs, e.g. shared(array,imax,jmax,kmax)
!$omp parallel do default(private) shared(...)
do i=1,imax
do j=1,jmax
do k=1,kmax
array(i,j,k) = ! some function of i,j,k
end do
end do
end do
!$omp end parallel do
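A self-contained variant that can be compiled and checked against the serial result might look like the sketch below. It makes several assumptions: the array sizes are small named constants, the function f is a made-up stand-in for "some function of i,j,k", and the loop order is swapped so that the first index varies fastest, which suits Fortran's column-major storage. The collapse(3) clause is optional and simply spreads the whole i,j,k iteration space over the threads.

```fortran
program fill_array
  implicit none
  integer, parameter :: imax = 20, jmax = 30, kmax = 40
  real    :: array(imax, jmax, kmax), reference(imax, jmax, kmax)
  integer :: i, j, k

  ! Serial reference fill.
  do k = 1, kmax
    do j = 1, jmax
      do i = 1, imax
        reference(i, j, k) = f(i, j, k)
      end do
    end do
  end do

  ! Parallel fill: every element is written exactly once, so the only
  ! shared variable is the array; all three loop indices are private.
  !$omp parallel do collapse(3) default(none) shared(array) private(i, j, k)
  do k = 1, kmax
    do j = 1, jmax
      do i = 1, imax
        array(i, j, k) = f(i, j, k)
      end do
    end do
  end do
  !$omp end parallel do

  if (any(array /= reference)) error stop 'parallel result differs from serial'
  print *, 'parallel and serial results match'

contains

  ! Hypothetical stand-in for "some function of i,j,k".
  pure real function f(i, j, k)
    integer, intent(in) :: i, j, k
    f = real(i) + 10.0 * real(j) + 100.0 * real(k)
  end function f

end program fill_array
```

Using default(none) instead of default(private) makes the compiler complain about any variable whose sharing was not decided explicitly, which tends to expose the kind of mistake that produces different results from the serial case.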

Start one loop where another one stopped

I have a loop that internally unrolls a sparse matrix vector multiplication. We calculate this using a diagonal approach for the upper right matrix, which leads to a different length for each diagonal.
The unrolling then happens linewise, i.e. I calculate several diagonals at once, until the shortest diagonal reaches the end of the matrix. Then I want to calculate the remaining diagonals with another loop with decreased unrolling length.
This leads to the problem that the second loop needs to start where the first loop has ended. I'm now stumbling upon a construct like the following (very simplified):
do diag=1, nDiagonals-3, 4
! here be dragons
end do
do diag=diag, nDiagonals-2, 3
! here be smaller dragons
end do
In Fortran the do index has to be set in the control clause, in contrast to C where for(;n<m;++n) is a possible loop control clause. But is the construct above with do index=index, upperbound valid? Or are there better approaches for this kind of loop index handling?
I can't see anything syntactically wrong with your code, nor do I think you are doing anything dangerous if legal.
After the end of the first loop diag will have the value it would have if the loop continued for one more iteration. This behaviour is defined by the standard. Given the snippet
do diag = start, stop, stride
! do stuff
end do
at the end of the loop diag has the value start + n*stride, where n is the smallest non-negative integer such that start + n*stride > stop (assuming a positive stride).
So, for a loop such as
do diag = 1,10,3
! do stuff
end do
! now diag == 13
and you can carry on using it to start the next loop as you outline.
What you can't do, in Fortran, is adjust the value of the do-variable inside the loop, the compiler behaves as if it establishes the loop limits at the first encounter with the do statement.
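The hand-over between two loops can be checked with a small sketch. The value of nDiagonals and the visited counter array are invented for the demonstration: the first loop handles blocks of four, and the second picks up the remainder one at a time, starting from whatever value diag was left with.

```fortran
program continue_loop
  implicit none
  integer, parameter :: nDiagonals = 10
  integer :: diag, visited(nDiagonals)

  visited = 0

  ! First loop: process diagonals in blocks of 4 (diag = 1 and diag = 5 here).
  do diag = 1, nDiagonals - 3, 4
    visited(diag:diag+3) = visited(diag:diag+3) + 1
  end do

  ! diag now holds the value of the first iteration that was NOT executed.
  print *, 'diag after first loop:', diag
  if (diag /= 9) error stop 'unexpected do-variable value'

  ! Second loop: start where the first one stopped and handle the remainder.
  do diag = diag, nDiagonals
    visited(diag) = visited(diag) + 1
  end do

  ! Every diagonal must have been processed exactly once.
  if (any(visited /= 1)) error stop 'some diagonal was skipped or repeated'
  print *, 'all diagonals visited once'
end program continue_loop
```

The initial expression of the second do statement is evaluated before diag is assigned, so reusing the variable on both sides is fine.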

DO loop excluding several values in Fortran

How can I achieve this in Fortran?
do i = 1, n Except n/2
Is there a convenient way to do this without using 'if' in the loop?
There are many solutions. Here is one using cycle. It still has an if statement in the loop but doesn't have an if ... end if block.
MyLoop: do i=1, N
if ( i == N/2 ) cycle MyLoop
! use the loop....
write (*, *) i
end do MyLoop
If you have an aversion to conditionals inside loops
do i = 1,(n/2)-1
...
end do
do i = (n/2)+1,n
...
end do
If n is, or may be, odd, you'll need to adjust the stop/start indices for the loops.
Place an if statement inside the loop:
do i=1,n
if (i /= n/2) ...
end do
Alternatively, the forall statement with a mask or the where statement can be used in certain situations.
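As a rough illustration of the masked forms, the sketch below skips index n/2 once with a forall mask and once with a where statement. The arrays a, b and idx and the assigned values are invented for the example. Note that forall is marked obsolescent in Fortran 2018, so this is mainly of interest for existing code.

```fortran
program skip_value
  implicit none
  integer, parameter :: n = 10
  integer :: i, a(n), idx(n)
  real    :: b(n)

  ! forall with a scalar mask: element n/2 is never assigned and stays 0.
  a = 0
  forall (i = 1:n, i /= n/2) a(i) = i * i

  ! where with an array mask: idx holds 1..n, and the assignment is
  ! applied only where the mask is true.
  idx = [(i, i = 1, n)]
  b = 0.0
  where (idx /= n/2) b = real(idx)

  print *, a
  print *, b

  ! With n = 10 the skipped index is 5: sum of squares 385 - 25, sum 55 - 5.
  if (a(n/2) /= 0 .or. sum(a) /= 360) error stop 'forall mask failed'
  if (b(n/2) /= 0.0 .or. sum(b) /= 50.0) error stop 'where mask failed'
end program skip_value
```

Both forms only fit loop bodies that are simple array assignments; a loop body with I/O or other side effects still needs an explicit do loop with cycle or an if.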

Resources