My (Fortran) code is very simple. All it does is fill up a large array that depends on five (independent!) variables. Here is a brief (three-variable) example:
do i = 1, imax
   do j = 1, jmax
      do k = 1, kmax
         array(i,j,k) = ! some function of i,j,k
      end do
   end do
end do
I would like to use different threads to fill the values of array in a faster way.
I thought the simplest way to achieve that would be to enclose the loop in these directives:
!$OMP PARALLEL DO
!$OMP END PARALLEL DO
However, if I do this I get completely different results from the serial case. I apologize if the question is too simple, but I couldn't really find a proper example to help solve my problem. Can you recommend a solution or provide an example?
I don't exactly know what is happening, but it could be a race condition or just badly declared directives. Try this and see if it works:
! replace ... with everything that must be shared: the array being filled
! and the loop bounds, e.g. shared(array, imax, jmax, kmax)
!$omp parallel do default(private) shared(...)
do i = 1, imax
   do j = 1, jmax
      do k = 1, kmax
         array(i,j,k) = ! some function of i,j,k
      end do
   end do
end do
!$omp end parallel do
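For completeness, here is a minimal self-contained sketch of that pattern (a hedged example: the array size and the filled expression are placeholders, and the data-sharing clauses are spelled out explicitly rather than via default(private)):

program fill_array
   implicit none
   integer, parameter :: imax = 100, jmax = 100, kmax = 100
   real :: array(imax, jmax, kmax)
   integer :: i, j, k

   !$omp parallel do default(none) shared(array) private(i, j, k)
   do i = 1, imax
      do j = 1, jmax
         do k = 1, kmax
            array(i, j, k) = real(i + 2*j + 3*k)  ! placeholder for your function of i, j, k
         end do
      end do
   end do
   !$omp end parallel do

   print *, array(imax, jmax, kmax)
end program fill_array

The key point is that the array being filled is shared while the inner loop indices are private; with the defaults of a bare PARALLEL DO, j and k would be shared and the threads would race on them.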
If I have this code,
subroutine min_distance(r, n, k, centroid, distance, indices, distancereg)
   integer, intent(in) :: n, k
   real, dimension(:,:), intent(in), allocatable :: centroid
   real, dimension(:,:), intent(in), allocatable :: r
   integer, dimension(:), intent(out), allocatable :: indices
   real, dimension(:), intent(out), allocatable :: distancereg
   real :: distance
   real :: d_min
   integer :: y, i_min, j, i
   integer, parameter :: data_dim = 2
   allocate (indices(n))
   allocate (distancereg(k))
   !cost=0.d0
   do j = 1, n
      i_min = -1
      d_min = 1.d6
      do i = 1, k
         distance = 0.d0
         distancereg(i) = 0.d0
         do y = 1, data_dim
            distance = distance + abs(r(y,j) - centroid(y,i))
            distancereg(i) = distancereg(i) + abs(r(y,j) - centroid(y,i))
         end do
         if (distance < d_min) then
            d_min = distance
            i_min = i
         end if
      end do
      if (i_min < 0) print *, " found error by assigning k-index to particle ", j
      indices(j) = i_min
   end do
end subroutine min_distance
What I want to do is parallelize the distance calculation over the centroids, i.e. assign each thread its own value of the inner loop index: for example, if k = 3, the distance for i = 1 is calculated by thread 1, and so on. I have tried omp_nested and omp_ordered, but it still shows some errors. I would appreciate any advice or guidance.
Thanks
If you want to parallelize a loop (or loop nest) you first have to ask which iterations are independent. In your case, each outer j iteration computes an i_min value that is 1. initialized at the start of that j iteration, and 2. written into indices(j). So each i_min calculation is independent and you can make the j loop parallel. (You also compute d_min, but it is never used after the loop.)
If the j loop is long enough, that should be enough to get high performance. You might be tempted to also look at the next loop, over i. It computes a separate distance value for each iteration, so that is again parallel. Except that you update i_min and d_min, so you would need to declare that update a reduction.
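For the record, a sketch of what such a reduction could look like with an OpenMP user-defined reduction (this is not from the original post: pair_t, minpair, and best are names invented here, and it assumes OpenMP 4.0 or later):

! in the specification part:
type pair_t
   real    :: d   ! distance
   integer :: i   ! centroid index that produced it
end type pair_t
type(pair_t) :: best

!$omp declare reduction(minpair : pair_t : &
!$omp&   omp_out = merge(omp_in, omp_out, omp_in%d < omp_out%d)) &
!$omp&   initializer(omp_priv = pair_t(huge(1.0), -1))

! in the executable part, for one fixed j:
best = pair_t(huge(1.0), -1)
!$omp parallel do private(y, distance) reduction(minpair : best)
do i = 1, k
   distance = 0.0
   do y = 1, data_dim
      distance = distance + abs(r(y,j) - centroid(y,i))
   end do
   if (distance < best%d) best = pair_t(distance, i)
end do
!$omp end parallel do
indices(j) = best%i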
However, the two loops are not "perfectly nested", so you cannot spread the combined i,j iteration space over the threads (for instance with collapse).
TL;DR: your outer j loop can be parallelized.
What about simply:
do j = 1, n
   distancereg(:) = 0.d0
   !$OMP PARALLEL DO PRIVATE(y)
   do i = 1, k
      do y = 1, data_dim
         distancereg(i) = distancereg(i) + abs(r(y,j) - centroid(y,i))
      end do
   end do
   !$OMP END PARALLEL DO
   indices(j) = minloc(distancereg, dim=1)
end do
Since you are storing the distances for each i, the search for the minimum value can be postponed until after the loop over i.
Or parallelizing the outer loop (here you don't need to store the distances):
!$OMP PARALLEL DO PRIVATE(i, y, i_min, d_min, distance)
do j = 1, n
   i_min = -1
   d_min = 1.d6
   do i = 1, k
      distance = 0.d0
      do y = 1, data_dim
         distance = distance + abs(r(y,j) - centroid(y,i))
      end do
      if (distance < d_min) then
         d_min = distance
         i_min = i
      end if
   end do
   if (i_min < 0) print *, " found error by assigning k-index to particle ", j
   indices(j) = i_min
end do
!$OMP END PARALLEL DO
When I try to parallelize my Fortran 90 program with OpenMP, I get a segmentation fault.
!$OMP PARALLEL DO NUM_THREADS(4) &
!$OMP PRIVATE(numstrain, i)
do irep = 1, nrep
   do i = 1, 10
      PRINT *, numstrain(i)
   end do
end do
!$OMP END PARALLEL DO
I find that if I comment out "PRINT *, numstrain(i)" or remove the OpenMP flags, it works without error. I think this is because a memory access conflict happens when I access numstrain(i) in parallel. I have already declared i and numstrain as private variables. Could someone please give me some idea why this is the case? Thank you so much. :)
UPDATE:
I modified the previous version, and this version prints the correct result.
integer, allocatable :: numstrain(:)
integer :: allocate_status
integer :: n

n = 1000000   ! must come before the directive: PARALLEL DO has to be followed immediately by the do loop
!$OMP PARALLEL DO NUM_THREADS(4) &
!$OMP PRIVATE(numstrain, i)
do irep = 1, nrep
   allocate (numstrain(n), stat = allocate_status)
   do i = 1, 10
      PRINT *, numstrain(i)
   end do
   deallocate (numstrain, stat = allocate_status)
end do
!$OMP END PARALLEL DO
However, if I move the numstrain access to another subroutine called by this subroutine (code attached below): 1. it always runs in one thread; 2. at some point (i = 4 or 5) it returns Segmentation Fault: 11. The value of i at which it returns Segmentation Fault: 11 differs when I use different NUM_THREADS.
integer, allocatable :: numstrain(:)
integer :: allocate_status
integer :: n

n = 1000000
!$OMP PARALLEL DO NUM_THREADS(4) &
!$OMP PRIVATE(numstrain, i)
do irep = 1, nrep
   allocate (numstrain(n), stat = allocate_status)
   call anotherSubroutine(numstrain)
   deallocate (numstrain, stat = allocate_status)
end do
!$OMP END PARALLEL DO
subroutine anotherSubroutine(numstrain)
   integer, allocatable :: numstrain(:)
   integer :: i
   do i = 1, 10
      PRINT *, numstrain(i)
   end do
end subroutine anotherSubroutine
I also tried allocating/deallocating in both the helper subroutine and the main subroutine, and allocating/deallocating only in the helper subroutine. Nothing changed.
The most typical reason for this is that not enough space is available on the stack to hold the private copy of numstrain. Compute and compare the following two values:
the size of the array in bytes
the stack size limit
There are two kinds of stack size limits. The stack size of the main thread is controlled by things like process limits on Unix systems (use ulimit -s to check and modify this limit) or is fixed at link time on Windows (recompilation or binary edit of the executable is necessary in order to change the limit). The stack size of the additional OpenMP threads is controlled by environment variables like the standard OMP_STACKSIZE, or the implementation-specific GOMP_STACKSIZE (GNU/GCC OpenMP) and KMP_STACKSIZE (Intel OpenMP).
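To put a number on the first of those values, a minimal sketch using the Fortran 2008 storage_size intrinsic (the program is illustrative only; n matches the value in your code):

program private_copy_size
   implicit none
   integer :: n
   n = 1000000
   ! storage_size (Fortran 2008) returns the size in bits of a scalar of the given type
   print *, 'bytes per private copy of numstrain:', n * (storage_size(0) / 8)
end program private_copy_size

With 4-byte default integers that is about 4 MB per thread, which can already approach or exceed the default worker-thread stack size of some OpenMP implementations.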
Note that most Fortran OpenMP implementations always put private arrays on the stack, even if you enable compiler options that allocate large arrays on the heap (tested with GNU's gfortran and Intel's ifort).
If you comment out the PRINT statement, you effectively remove the reference to numstrain and the compiler is free to optimise it away, e.g. it can simply not make a private copy of numstrain, so the stack limit is not exceeded.
After the additional information you've provided, one can conclude that stack size is not the culprit. When dealing with private ALLOCATABLE arrays, you should know that:
private copies of unallocated arrays remain unallocated;
private copies of allocated arrays are allocated with the same bounds.
If you do not use numstrain outside of the parallel region, it is fine to do what you've done in your first case, but with some modifications:
integer, allocatable :: numstrain(:)
integer :: allocate_status
integer, parameter :: n = 1000000
interface
   subroutine anotherSubroutine(numstrain)
      integer, allocatable :: numstrain(:)
   end subroutine anotherSubroutine
end interface

!$OMP PARALLEL NUM_THREADS(4) PRIVATE(numstrain, allocate_status)
allocate (numstrain(n), stat = allocate_status)
!$OMP DO
do irep = 1, nrep
   call anotherSubroutine(numstrain)
end do
!$OMP END DO
deallocate (numstrain)
!$OMP END PARALLEL
If you also use numstrain outside of the parallel region, then the allocation and deallocation go outside:
allocate (numstrain(n), stat = allocate_status)
!$OMP PARALLEL DO NUM_THREADS(4) PRIVATE(numstrain)
do irep = 1, nrep
   call anotherSubroutine(numstrain)
end do
!$OMP END PARALLEL DO
deallocate (numstrain)
You should also know that when you call a routine that takes an ALLOCATABLE array as an argument, you have to provide an explicit interface for that routine. You can either write an INTERFACE block, or you can put the called routine in a module and then USE that module - both provide the explicit interface. If you do not provide an explicit interface, the compiler will not pass the array descriptor correctly and the subroutine will fail to access its content.
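For illustration, the module route could look like this (a minimal sketch; the module name strain_mod is invented here):

module strain_mod
   implicit none
contains
   subroutine anotherSubroutine(numstrain)
      integer, allocatable, intent(in) :: numstrain(:)
      integer :: i
      do i = 1, 10
         print *, numstrain(i)
      end do
   end subroutine anotherSubroutine
end module strain_mod

Any caller that contains use strain_mod then sees the explicit interface automatically, with no INTERFACE block needed.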
I am having an issue with private arrays when using the !$OMP TASK construct. Arrays listed as PRIVATE for tasks crash/become corrupted when their bounds are given by input parameters of the subroutine. I am using automatic (stack) arrays to avoid the usual issues with allocatable arrays and !$OMP PARALLEL PRIVATE.
The following simplified code reproduces the issue, and crashes with SIGSEGV:
SUBROUTINE do_work(n_in)
   USE omp_lib
   IMPLICIT NONE
   INTEGER, INTENT(IN) :: n_in
   INTEGER :: i, counter
   REAL, DIMENSION(n_in) :: a
   REAL, DIMENSION(20) :: b

   !$OMP PARALLEL PRIVATE(a,i)
   !$OMP SINGLE
   counter = 1   ! start at 1: b is dimensioned 1..20
   DO WHILE (counter .LE. 20)
      !$OMP TASK FIRSTPRIVATE(counter) PRIVATE(a,i)
      a(:) = 5.0
      DO i = 1, n_in
         a(1) = a(1) + a(i)
      END DO
      b(counter) = a(1)
      !$OMP END TASK
      counter = counter + 1
   END DO
   !$OMP END SINGLE
   !$OMP END PARALLEL
END SUBROUTINE do_work
The issue, however, is cleared away simply by hardcoding the size of array a, i.e. REAL, DIMENSION(5) :: a. It is almost as if the task region is not aware of the array size parameter n_in. However, I have verified n_in both inside and outside of the task construct and outside of the parallel construct. Furthermore, if a is declared as a scalar, it works.
Is my usage of the PRIVATE clauses incorrect or incomplete?
SIDE NOTES:
I've written this simplified code to reproduce the problem. In reality, I am parallelizing a series of linked lists, as you can probably tell from the structure.
Any code calling this subroutine is serial. There is no parallel nesting, recursion, etc.
I have no clue why, but in my case it works fine if I include a random print statement (e.g. print*,'test', print*,a, or even just a bare print*,) somewhere in the parallel region. If I comment it out, I again get a SIGSEGV ... strange. Sorry for this more-or-less answer.
I'm trying to parallelize a code. My code looks like this:
#pragma omp parallel private(i,j,k)
#pragma omp parallel for shared(A)
for(k=0; k<100; k++)
    for(i=1; i<1024; i++)
        for(j=0; j<1024; j++)
            A[i][j+1] = << some expression involving elements of A[i-1][j-1] >>
On executing this code I'm getting a different result from serial execution of the loops.
I'm unable to understand what I'm doing wrong.
I've also tried the collapse() clause:
#pragma omp parallel private(i,j,k)
#pragma omp parallel for collapse(3) shared(A)
for(k=0; k<100; k++)
    for(i=1; i<1024; i++)
        for(j=0; j<1024; j++)
            A[i][j+1] = << some expression involving elements of A[][] >>
Another thing I tried was having a #pragma omp parallel for before each loop instead of collapse().
The issue, I think, is the data dependency. Any idea how to parallelize in the presence of such a data dependency?
If this is really your use case, just parallelize the outer loop, k; this should largely suffice for the modest parallelism that you have on common architectures.
If you want more, you'd have to rewrite your loops so that you have an inner part without the dependency. In your example case this is relatively easy: you'd process by "diagonals" (outer loop, sequential), and then inside each diagonal the points would be independent.
for (size_t d=0; d<nDiag(100); ++d) {
    size_t nPoints = somefunction(d);
#pragma omp parallel for
    for (size_t p=0; p<nPoints; ++p) {
        size_t x = coX(p, d);
        size_t y = coY(p, d);
        ... your real code ...
    }
}
Part of this could be done automatically, but I don't think such tools are readily available in everyday OpenMP implementations. This is an active line of research.
Also note the following:
int is rarely a good idea for indices, in particular if you access matrices. If you have to compute the absolute position of an entry yourself (and you can see here that you might), this overflows easily. int is usually 32 bits wide, and of those 32 you are even wasting one on the sign. In C, object sizes are computed with size_t, most of the time 64 bits wide and in any case the correct type chosen by your platform designer.
Use local variables for loop indices and other temporaries; as you can see, writing OpenMP pragmas becomes much easier then. Locality is one key to parallelism. Help yourself and the compiler by expressing this correctly.
You're only parallelizing the outer 'k' for loop. Every parallel thread is executing the 'i' and 'j' loops, and they're all writing into the same 'A' result. Since they're all reading and writing the same slots in A, the final result will be non-deterministic.
It's not clear from your problem that any parallelism is possible, since each step seems to depend on every previous step.
How can I achieve this in Fortran?
do i = 1, n    ! except i = n/2
Is there a convenient way to do this instead of using an 'if' inside the loop?
There are many solutions. Here is one using cycle. It still has an if statement in the loop but doesn't have an if ... end if block.
MyLoop: do i = 1, N
   if (i == N/2) cycle MyLoop
   ! use the loop....
   write (*, *) i
end do MyLoop
If you have an aversion to conditionals inside loops:
do i = 1, (n/2)-1
   ...
end do
do i = (n/2)+1, n
   ...
end do
If n is, or may be, odd, you'll need to adjust the stop/start indices for the loops.
Place an if statement inside the loop:
do i = 1, n
   if (i /= n/2) ...
end do
Alternatively, the forall statement with a mask, or the where statement, can be used in certain situations.
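For instance, when the loop body is an elementwise array assignment, a masked forall can express the skip directly (a sketch: the array x and the assigned expression are invented here):

program skip_half
   implicit none
   integer, parameter :: n = 10
   real :: x(n)
   integer :: i

   x = 0.0
   ! masked forall: assigns every element except i = n/2
   forall (i = 1:n, i /= n/2) x(i) = real(i)**2
   print *, x
end program skip_half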