Best way to handle large private arrays in openmp parallel region [duplicate]

When I try to parallelize my Fortran 90 program with OpenMP, I get a segmentation fault.
!$OMP PARALLEL DO NUM_THREADS(4) &
!$OMP PRIVATE(numstrain, i)
do irep = 1, nrep
do i=1, 10
PRINT *, numstrain(i)
end do
end do
!$OMP END PARALLEL DO
If I comment out "PRINT *, numstrain(i)" or remove the OpenMP flags, it works without error. I think a memory access conflict happens when I access numstrain(i) in parallel, but I have already declared i and numstrain as private variables. Could someone please explain why this happens? Thank you so much. :)
UPDATE:
I modified the previous version, and this version prints the correct result.
integer, allocatable :: numstrain(:)
integer :: allocate_status
integer :: n
!$OMP PARALLEL DO NUM_THREADS(4) &
!$OMP PRIVATE(numstrain, i)
n = 1000000
do irep = 1, nrep
allocate (numstrain(n), stat = allocate_status)
do i=1, 10
PRINT *, numstrain(i)
end do
deallocate (numstrain, stat = allocate_status)
end do
!$OMP END PARALLEL DO
However, if I move the access to numstrain into another subroutine called by this one (code attached below): 1. It always runs in one thread. 2. At some point (i=4 or 5), it returns Segmentation Fault:11. The value of i at which the fault occurs changes when I use a different NUM_THREADS.
integer, allocatable :: numstrain(:)
integer :: allocate_status
integer :: n
!$OMP PARALLEL DO NUM_THREADS(4) &
!$OMP PRIVATE(numstrain, i)
n = 1000000
do irep = 1, nrep
allocate (numstrain(n), stat = allocate_status)
call anotherSubroutine(numstrain)
deallocate (numstrain, stat = allocate_status)
end do
!$OMP END PARALLEL DO
subroutine anotherSubroutine(numstrain)
integer, allocatable :: numstrain(:)
do i=1, 10
PRINT *, numstrain(i)
end do
end subroutine anotherSubroutine
I also tried allocating/deallocating in both the helper subroutine and the main subroutine, and only in the helper subroutine. Nothing changed.

The most typical reason for this is that not enough space is available on the stack to hold the private copy of numstrain. Compute and compare the following two values:
the size of the array in bytes
the stack size limit
There are two kinds of stack size limits. The stack size of the main thread is controlled by things like process limits on Unix systems (use ulimit -s to check and modify this limit) or is fixed at link time on Windows (recompilation or binary edit of the executable is necessary in order to change the limit). The stack size of the additional OpenMP threads is controlled by environment variables like the standard OMP_STACKSIZE, or the implementation-specific GOMP_STACKSIZE (GNU/GCC OpenMP) and KMP_STACKSIZE (Intel OpenMP).
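To make the comparison concrete, the size of one private copy can be computed inside the program. The following is a minimal sketch; the module and function names (array_size_demo, array_bytes) are made up for illustration, and it assumes a 32-bit integer array like numstrain.

```fortran
module array_size_demo
  use iso_fortran_env, only: int32, int64
  implicit none
contains
  ! Size of a 32-bit integer array in bytes.
  ! storage_size (Fortran 2008) returns the size of one element in bits.
  function array_bytes(x) result(nbytes)
    integer(int32), intent(in) :: x(:)
    integer(int64) :: nbytes
    nbytes = int(size(x), int64) * int(storage_size(x) / 8, int64)
  end function array_bytes
end module array_size_demo
```

For an array of 1,000,000 four-byte integers this gives 4,000,000 bytes (roughly 3.8 MiB) per private copy, which is the number to compare against the output of ulimit -s (reported in KiB by common shells) and against OMP_STACKSIZE.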
Note that most Fortran OpenMP implementations always put private arrays on the stack, no matter if you enable compiler options that allocate large arrays on the heap (tested with GNU's gfortran and Intel's ifort).
If you comment out the PRINT statement, you effectively remove the reference to numstrain and the compiler is free to optimise it out, e.g. it could simply not make a private copy of numstrain, thus the stack limit is not exceeded.
Given the additional information that you've provided, one can conclude that stack size is not the culprit. When dealing with private ALLOCATABLE arrays, you should know that:
private copies of unallocated arrays remain unallocated;
private copies of allocated arrays are allocated with the same bounds.
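The second rule can be checked with a small experiment. This is only a sketch with hypothetical names (private_alloc_demo, check_private_copy, work); it allocates an array before the parallel region and records what a thread sees in its private copy.

```fortran
module private_alloc_demo
  implicit none
contains
  ! Reports the allocation status and size of a private copy as seen
  ! inside a parallel region when the original was allocated beforehand.
  subroutine check_private_copy(is_alloc, n_inside)
    logical, intent(out) :: is_alloc
    integer, intent(out) :: n_inside
    integer, allocatable :: work(:)

    allocate (work(5))
    is_alloc = .false.
    n_inside = -1
    !$omp parallel num_threads(2) private(work)
    !$omp critical
    ! is_alloc and n_inside are shared, so updates are serialised
    is_alloc = allocated(work)
    if (is_alloc) n_inside = size(work)
    !$omp end critical
    !$omp end parallel
    deallocate (work)
  end subroutine check_private_copy
end module private_alloc_demo
```

With the allocate statement removed, the same test would report an unallocated private copy, which is the situation in the question: each thread then has to allocate its own copy before touching it.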
If you do not use numstrain outside of the parallel region, it is fine to do what you've done in your first case, but with some modifications:
integer, allocatable :: numstrain(:)
integer :: allocate_status
integer, parameter :: n = 1000000
interface
subroutine anotherSubroutine(numstrain)
integer, allocatable :: numstrain(:)
end subroutine anotherSubroutine
end interface
!$OMP PARALLEL NUM_THREADS(4) PRIVATE(numstrain, allocate_status)
allocate (numstrain(n), stat = allocate_status)
!$OMP DO
do irep = 1, nrep
call anotherSubroutine(numstrain)
end do
!$OMP END DO
deallocate (numstrain)
!$OMP END PARALLEL
If you also use numstrain outside of the parallel region, then the allocation and deallocation go outside:
allocate (numstrain(n), stat = allocate_status)
!$OMP PARALLEL DO NUM_THREADS(4) PRIVATE(numstrain)
do irep = 1, nrep
call anotherSubroutine(numstrain)
end do
!$OMP END PARALLEL DO
deallocate (numstrain)
You should also know that when you call a routine that takes an ALLOCATABLE array as argument, you have to provide an explicit interface for that routine. You can either write an INTERFACE block or you can put the called routine in a module and then USE that module - both cases would provide the explicit interface. If you do not provide the explicit interface, the compiler would not pass the array correctly and the subroutine would fail to access its content.
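The module route mentioned above could look like the following sketch. The names strain_tools and sum_head are hypothetical stand-ins for the asker's anotherSubroutine; the point is only that any program unit that contains use strain_tools gets the explicit interface automatically.

```fortran
module strain_tools
  implicit none
contains
  ! Hypothetical stand-in for anotherSubroutine: sums the first nhead
  ! elements of an allocatable array passed as an allocatable dummy.
  subroutine sum_head(numstrain, nhead, total)
    integer, allocatable, intent(in) :: numstrain(:)
    integer, intent(in) :: nhead
    integer, intent(out) :: total

    total = 0
    if (allocated(numstrain)) then
      total = sum(numstrain(1:min(nhead, size(numstrain))))
    end if
  end subroutine sum_head
end module strain_tools
```

Because the dummy argument is ALLOCATABLE, the callee can also query allocated(numstrain), which is a cheap guard against the unallocated private copies described earlier.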

Related

Fortran Openmp large array on Eclipse; Program Crash [duplicate]

This question already has answers here:
Why Segmentation fault is happening in this openmp code?
(2 answers)
Closed 6 years ago.
I am using Eclipse with the GNU Fortran compiler to read large arrays for solving a matrix problem. However, I have noticed that I am unable to read all my data into the arrays: project.exe crashes when I add -fopenmp to my compiler settings; otherwise, the program works fine.
program Top_tier
integer, parameter:: n=145894, nz_num=4608168
integer ia(n+1), ja(nz_num)
double precision a(nz_num), rhs(n)
integer i
open (21, file='ia.dat')
do i=1, n+1
read(21,*) ia(i)
enddo
close(21)
open (21, file='a.dat')
do i=1, nz_num
read(21,*) a(i)
enddo
close(21)
open (21, file='ja.dat')
do i=1, nz_num
read(21,*) ja(i)
enddo
close(21)
open (21, file='b.dat')
do i=1, n
read(21,*) rhs(i)
enddo
close(21)
End
While looking for a solution, I found that the most probable cause is the stack size limit, as shown by the fact that the program runs properly if I set nz_num to 26561 or less. A possible solution is to set an environment variable to increase the stack size, but the program does not recognise it when I type "setenv" or "export" OMP_STACKSIZE into the program. Am I doing something wrong? Is there any advice on how I can solve this problem?
Thanks!

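You are allocating a, rhs, ia and ja on the stack, which is why you are running out of stack space in the first place. (With gfortran, -fopenmp implies -frecursive, which places all local arrays on the stack; that is why the crash only appears with that flag.) I would suggest always allocating large arrays on the heap: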
integer, parameter:: n=145894, nz_num=4608168
integer, dimension(:), allocatable :: ia, ja
double precision, dimension(:), allocatable :: a, rhs
integer i
allocate(ia(n+1), ja(nz_num))
allocate(a(nz_num), rhs(n))
! rest of your code...
deallocate(ia, ja)
deallocate(a, rhs)
Instead of directly declaring your four arrays with a fixed size (causing them to be allocated on the stack), you declare them as allocatable and give only their rank. Further down you can then allocate the arrays to the size you want. This size can be chosen at runtime. That means that if you are reading your arrays from a file, you could also store the size of each array at the beginning of the file and use it in your allocate call.
Finally, as always with dynamically allocated memory, don't forget to deallocate them when you don't need them anymore.
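As a rough sketch of the "size stored in the file" idea: the routine below assumes a file layout that is not in the original question (the element count on the first line, then one value per line), and the names vector_io and read_vector are made up for illustration.

```fortran
module vector_io
  implicit none
contains
  ! Reads a vector whose length is stored on the first line of the file.
  ! Assumed layout: count first, then one value per line.
  subroutine read_vector(filename, values, ierr)
    character(len=*), intent(in) :: filename
    double precision, allocatable, intent(out) :: values(:)
    integer, intent(out) :: ierr
    integer :: iu, n, i

    open (newunit=iu, file=filename, status='old', action='read', iostat=ierr)
    if (ierr /= 0) return
    read (iu, *, iostat=ierr) n
    ! Heap allocation sized at runtime; stat= reports failure instead of aborting
    if (ierr == 0) allocate (values(n), stat=ierr)
    if (ierr == 0) then
      do i = 1, n
        read (iu, *, iostat=ierr) values(i)
        if (ierr /= 0) exit
      end do
    end if
    close (iu)
  end subroutine read_vector
end module vector_io
```

A call such as call read_vector('b.dat', rhs, ierr) would then replace the hard-coded parameter n, provided the data files are regenerated with the count on the first line.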
Edit:
And I forgot to say that this doesn't really have anything to do with OpenMP, except that OpenMP threads probably have much smaller stack size limits (in this case only the OpenMP master thread would be affected).

Nested loop in Fortran with OPENMP

My (Fortran) code is very simple. All it does is fill up a large array that depends on five (independent!) variables. Here is a brief example
do i = 1, imax
do j = 1, jmax
do k = 1, kmax
array(i,j,k) = ! some function of i,j,k
end do
end do
end do
I would like to use different threads to fill the values of array faster.
I thought the simplest way to achieve that would be to enclose the loop in these commands
!$OMP PARALLEL DO
!$OMP PARALLEL END
However, if I do this I get completely different results from the serial case. I apologize if the question is too simple, but I couldn't really find a proper example to help solve my problem. Can you recommend a solution or provide an example?

I don't know exactly what is happening, but it could be a race condition or just a bad declaration of the directives. Try this and see if it works
!replace ... with variables that are constants as in shared(a,b,c)
!$omp parallel do default(private) shared(...)
do i=1,imax
do j=1,jmax
do k=1,kmax
array(i,j,k) = ! some function of i,j,k
end do
end do
end do
!$omp end parallel do
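For comparison, here is a self-contained sketch of the same pattern with every variable's sharing spelled out. The names fill_demo and fill_array and the formula i + 10*j + 100*k are placeholders for the asker's "some function of i,j,k". Using default(none) is a design choice, not a requirement: it makes the compiler reject any variable whose sharing was not declared, which helps catch a forgotten temporary inside the loop body (one plausible source of results that differ from the serial run).

```fortran
module fill_demo
  implicit none
contains
  ! Fills a 3-D array in parallel; each element depends only on its indices.
  subroutine fill_array(array)
    real, intent(out) :: array(:, :, :)
    integer :: i, j, k

    ! The outermost loop is split across threads. k is outermost because
    ! Fortran stores arrays column-major, so i should vary fastest.
    !$omp parallel do default(none) shared(array) private(i, j, k)
    do k = 1, size(array, 3)
      do j = 1, size(array, 2)
        do i = 1, size(array, 1)
          array(i, j, k) = real(i + 10*j + 100*k)   ! placeholder formula
        end do
      end do
    end do
    !$omp end parallel do
  end subroutine fill_array
end module fill_demo
```

Compiled without OpenMP support, the directives are plain comments and the routine runs serially, which gives a convenient reference result to compare against.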

Determine assumed-shape array strides at runtime

Is it possible in a modern Fortran compiler such as Intel Fortran to determine array strides at runtime? For example, I may want to perform a Fast Fourier Transform (FFT) on an array section:
program main
complex(8),allocatable::array(:,:)
allocate(array(17, 17))
array = 1.0d0
call fft(array(1:16,1:16))
contains
subroutine fft(a)
use mkl_dfti
implicit none
complex(8),intent(inout)::a(:,:)
type(dfti_descriptor),pointer::desc
integer::stat
stat = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 2, shape(a) )
stat = DftiCommitDescriptor(desc)
stat = DftiComputeForward(desc, a(:,1))
stat = DftiFreeDescriptor(desc)
end subroutine
end program
However, the MKL Dfti* routines need to be explicitly told the array strides.
Looking through reference manuals I have not found any intrinsic functions which return stride information.
A couple of interesting resources are here and here which discuss whether array sections are copied and how Intel Fortran handles arrays internally.
I would rather not restrict myself to the way that Intel currently uses its array descriptors.
How can I figure out the stride information? Note that in general I would want the fft routine (or any similar routine) to not require any additional information about the array to be passed in.
EDIT:
I have verified that an array temporary is not created in this scenario, here is a simpler piece of code which I have checked on Intel(R) Visual Fortran Compiler XE 14.0.2.176 [Intel(R) 64], with optimizations disabled and heap arrays set to 0.
program main
implicit none
real(8),allocatable::a(:,:)
pause
allocate(a(8192,8192))
pause
call random_number(a)
pause
call foo(a(:4096,:4096))
pause
contains
subroutine foo(a)
implicit none
real(8)::a(:,:)
open(unit=16, file='a_sum.txt')
write(16, *) sum(a)
close(16)
end subroutine
end program
Monitoring the memory usage, it is clear that an array temporary is never created.
EDIT 2:
module m_foo
implicit none
contains
subroutine foo(a)
implicit none
real(8),contiguous::a(:,:)
integer::i, j
open(unit=16, file='a_sum.txt')
write(16, *) sum(a)
close(16)
call nointerface(a)
end subroutine
end module
subroutine nointerface(a)
implicit none
real(8)::a(*)
end subroutine
program main
use m_foo
implicit none
integer,parameter::N = 8192
real(8),allocatable::a(:,:)
integer::i, j
real(8)::count
pause
allocate(a(N, N))
pause
call random_number(a)
pause
call foo(a(:N/2,:N/2))
pause
end program
EDIT 3:
The example illustrates what I'm trying to achieve. There is a 16x16 contiguous array, but I only want to transform the upper 4x4 array. The first call simply passes in the array section, but it doesn't return a single one in the upper left corner of the array. The second call sets the appropriate stride and a subsequently contains the correct upper 4x4 array. The stride of the upper 4x4 array with respect to the full 16x16 array is not one.
program main
implicit none
complex(8),allocatable::a(:,:)
allocate(a(16,16))
a = 0.0d0
a(1:4,1:4) = 1.0d0
call fft(a(1:4,1:4))
write(*,*) a(1:4,1:4)
pause
a = 0.0d0
a(1:4,1:4) = 1.0d0
call fft_stride(a(1:4,1:4), 1, 16)
write(*,*) a(1:4,1:4)
pause
contains
subroutine fft(a) !{{{
use mkl_dfti
implicit none
complex(8),intent(inout)::a(:,:)
type(dfti_descriptor),pointer::desc
integer::stat
stat = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 2, shape(a) )
stat = DftiCommitDescriptor(desc)
stat = DftiComputeForward(desc, a(:,1))
stat = DftiFreeDescriptor(desc)
end subroutine !}}}
subroutine fft_stride(a, s1, s2) !{{{
use mkl_dfti
implicit none
complex(8),intent(inout)::a(:,:)
integer::s1, s2
type(dfti_descriptor),pointer::desc
integer::stat
integer::strides(3)
strides = [0, s1, s2]
stat = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 2, shape(a) )
stat = DftiSetValue(desc, DFTI_INPUT_STRIDES, strides)
stat = DftiCommitDescriptor(desc)
stat = DftiComputeForward(desc, a(:,1))
stat = DftiFreeDescriptor(desc)
end subroutine !}}}
end program

I'm guessing you got confused because you worked around the explicit interface of the MKL function DftiComputeForward by giving it a(:,1). This is contiguous and doesn't need an array temporary. It is wrong, however: the low-level routine gets the whole array, which is why it works once you specify strides. Since DftiComputeForward expects an array declared as complex(kind), intent(inout) :: a(*), you can pass the section through an external subroutine.
program ...
call fft(4,4,a(1:4,1:4))
end program
subroutine fft(m,n,a) !{{{
use mkl_dfti
implicit none
complex(8),intent(inout)::a(*)
integer :: m, n
type(dfti_descriptor),pointer::desc
integer::stat
stat = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 2, (/m,n/) )
stat = DftiCommitDescriptor(desc)
stat = DftiComputeForward(desc, a)
stat = DftiFreeDescriptor(desc)
end subroutine !}}}
This will create an array temporary though when going into the subroutine. A more efficient solution is then indeed strides:
program ...
call fft_strided(4,4,a,16)
end program
subroutine fft_strided(m,n,a,lda) !{{{
use mkl_dfti
implicit none
complex(8),intent(inout)::a(*)
integer :: m, n, lda
type(dfti_descriptor),pointer::desc
integer::stat
integer::strides(3)
strides = [0, 1, lda]
stat = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 2, (/m,n/) )
stat = DftiSetValue(desc, DFTI_INPUT_STRIDES, strides)
stat = DftiCommitDescriptor(desc)
stat = DftiComputeForward(desc, a)
stat = DftiFreeDescriptor(desc)
end subroutine !}}}

The routine DftiComputeForward accepts an assumed-size array. If you pass something complicated and non-contiguous, a copy will have to be made when passing it. The compiler can check at run time whether the copy is actually necessary. In any case, for you the stride is always 1, because that is the stride the MKL routine will see.
In your case you pass A(:,something), this is a contiguous section, provided A is contiguous. If A is not contiguous a copy will have to be made. Stride is always 1.
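Whether a given actual argument is contiguous can also be queried from Fortran itself with the Fortran 2008 intrinsic is_contiguous. The sketch below uses hypothetical names (contiguity_demo, section_is_contiguous) and assumes a compiler that implements this intrinsic; it answers the yes/no question only and does not report stride values.

```fortran
module contiguity_demo
  implicit none
contains
  ! True when the actual argument behind the assumed-shape dummy is
  ! contiguous in memory (Fortran 2008 intrinsic is_contiguous).
  function section_is_contiguous(a) result(ok)
    complex(kind(1.0d0)), intent(in) :: a(:, :)
    logical :: ok
    ok = is_contiguous(a)
  end function section_is_contiguous
end module contiguity_demo
```

For a 16x16 array a, the whole array and the section a(:,1:4) are contiguous, while a(1:4,1:4) is not, so a routine could use this test to decide whether a strided descriptor or a copy is needed before calling a library that expects a(*).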

Some of the answers here do not understand the difference between Fortran strides and memory strides (though they are related).
To answer your question for future readers beyond the specific case you have here: there does not appear to be a way to find an array stride solely in Fortran, but it can be done via C using the interoperability features in newer compilers.
You can do this in C:
#include "stdio.h"
size_t c_compute_stride(int * x, int * y)
{
size_t px = (size_t) x;
size_t py = (size_t) y;
size_t d = py-px;
return d;
}
and then call this function from fortran on the first two elements of an array, e.g.:
program main
use iso_c_binding
implicit none
interface
function c_compute_stride(x, y) bind(C, name="c_compute_stride")
use iso_c_binding
integer :: x, y
integer(c_size_t) :: c_compute_stride
end function
end interface
integer, dimension(10) :: a
integer, dimension(10,10) :: b
write(*,*) find_stride(a)
write(*,*) find_stride(b(:,1))
write(*,*) find_stride(b(1,:))
contains
function find_stride(x)
integer, dimension(:) :: x
integer(c_size_t) :: find_stride
find_stride = c_compute_stride(x(1), x(2))
end function
end program
This will print out:
4
4
40

In short: assumed-shape arrays always have stride 1.
A bit longer: When you pass a section of an array to a subroutine which takes an assumed-shape array, as you have here, then the subroutine doesn't know anything about the original size of the array. If you look at the upper- and lower-bounds of the dummy argument in the subroutine, you'll see they will always be the size of the array section and 1.
integer, dimension(10:20) :: array
integer :: i
array = [ (i, i=10,20) ]
call foo(array(10:20:2))
subroutine foo(a)
integer, dimension(:) :: a
integer :: i
print*, lbound(a), ubound(a)
do i=lbound(a,1), ubound(a,1)
print*, a(i)
end do
end subroutine foo
This gives the output:
1 6
10 12 14 16 18 20
So, even when your array indices start at 10, when you pass it (or a section of it), the subroutine thinks the indices start at 1. Similarly, it thinks the stride is 1. You can give a lower bound to the dummy argument:
integer, dimension(10:) :: a
which will make lbound(a) 10 and ubound(a) 15. But it's not possible to give an assumed-shape array a stride.

OpenMP crashes with parameter-defined array bounds

I am having an issue with private arrays when using the !$OMP TASK construct. Arrays listed as PRIVATE for tasks are crashing/becoming corrupted when their bounds are given by input parameters in the subroutine. I am using static arrays to avoid the usual issues with allocatable arrays and !$OMP PARALLEL PRIVATE.
The following simplified code reproduces the issue, and crashes with SIGSEGV:
SUBROUTINE do_work(n_in)
USE omp_lib
IMPLICIT NONE
INTEGER, INTENT(IN) :: n_in
INTEGER :: i, counter
REAL, DIMENSION(n_in) :: a
REAL, DIMENSION(20) :: b
!$OMP PARALLEL PRIVATE(a,i)
!$OMP SINGLE
counter = 0
DO WHILE(counter .LE. 20)
!$OMP TASK FIRSTPRIVATE(counter) PRIVATE(a,i)
a(:) = 5.0
DO i = 1,n_in
a(1) = a(1) + a(i)
END DO
b(counter) = a(1)
!$OMP END TASK
counter = counter + 1
END DO
!$OMP END SINGLE
!$OMP END PARALLEL
END SUBROUTINE do_work
The issue, however, is cleared away simply by hardcoding the size of array a, i.e. REAL, DIMENSION(5) :: a. It is almost as if the task space is not aware of the array size parameter n_in. However, I have verified n_in both inside and outside of the task construct and outside of the parallel construct. Furthermore, if a is declared as a scalar, it works.
Is the usage of PRIVATE clauses incorrect or incomplete?
SIDE NOTES:
I've written this simplified code to reproduce the problem. In reality, I am parallelizing a series of linked lists, as you can probably tell from the structure
Any code calling this subroutine is serial. There is no parallel nesting, recursion, etc.

I have no clue why, but in my case it works fine if I include a random print command (e.g. print*,'test' or print*,a or even only print*, ) somewhere in the parallel region. If I comment it out, I also get a SIGSEGV again ... strange. Sorry for this more-or-less answer.

How to declare an array variable and its size mid-routine in Fortran

I would like to create an array with a dimension based on the number of elements meeting a certain condition in another array. This would require that I initialize an array mid-routine, which Fortran won't let me do.
Is there a way around that?
Example routine:
subroutine example(some_array)
real some_array(50) ! passed array of known dimension
element_count = 0
do i=1,50
if (some_array.gt.0) then
element_count = element_count+1
endif
enddo
real new_array(element_count) ! new array with length based on conditional statement
endsubroutine example

Your question isn't about initializing an array, which involves setting its values.
However, there is a way to do what you want. You even have a choice, depending on how general it's to be.
I'm assuming that the loop computing element_count is meant to test some_array(i).
You can make new_array allocatable:
subroutine example(some_array)
real some_array(50)
real, allocatable :: new_array(:)
allocate(new_array(COUNT(some_array.gt.0)))
end subroutine
Or have it as an automatic object:
subroutine example(some_array)
real some_array(50)
real new_array(COUNT(some_array.gt.0))
end subroutine
This latter works only when your condition is "simple". Further, automatic objects cannot be used in the scope of modules or main programs. The allocatable case is much more general, such as when you want to use the full loop rather than the count intrinsic, or want the variable not as a procedure local variable.
In both of these cases you meet the requirement of having all the declarations before executable statements.
Since Fortran 2008 the block construct allows automatic objects even after executable statements and in the main program:
program example
implicit none
real some_array(50)
some_array = ...
block
real new_array(COUNT(some_array.gt.0))
end block
end program example

Try this
real, dimension(50) :: some_array
real, dimension(:), allocatable :: other_array
integer :: status
...
allocate(other_array(count(some_array>0)),stat=status)
At the end of this sequence of statements other_array will have one element for each element of some_array greater than 0; there is no need to write a loop to count the positive elements of some_array.
Following @AlexanderVogt's advice, do check the status of the allocate statement.
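If the goal is not just to size the new array but also to fill it with the qualifying values, the pack intrinsic combined with automatic allocation on assignment (Fortran 2003) does both in one statement. This is a sketch with hypothetical names (positive_filter, positives); some older compilers need a flag to enable allocation on assignment.

```fortran
module positive_filter
  implicit none
contains
  ! Returns the positive elements of x; the result is allocated
  ! to the right length by the assignment itself.
  function positives(x) result(kept)
    real, intent(in) :: x(:)
    real, allocatable :: kept(:)
    kept = pack(x, x > 0)
  end function positives
end module positive_filter
```

With this approach there is no explicit allocate statement and therefore no status to check; a failed allocation would terminate the program instead.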

You can use allocatable arrays for this task:
subroutine example(some_array)
real :: some_array(50)
real,allocatable :: new_array(:)
integer :: i, element_count, status
element_count = 0
do i=lbound(some_array,1),ubound(some_array,1)
if ( some_array(i) > 0 ) then
element_count = element_count + 1
endif
enddo
allocate( new_array(element_count), stat=status )
if ( status /= 0 ) stop 'cannot allocate memory'
! set values of new_array
end subroutine

You need to use an allocatable array (see this article for more on it). This would change your routine to
subroutine example(input_array,output_array)
real,intent(in) :: input_array(50) ! passed array of known dimension
real, intent(out), allocatable :: output_array(:)
integer :: element_count, i
element_count = 0
do i=1,50
if (input_array(i).gt.0) element_count = element_count+1
enddo
allocate(output_array(element_count))
end subroutine
Note that the intents may not be necessary, but are probably good practice. If you don't want to call a second array, it is possible to create a reallocate subroutine; though this would require the array to already be declared as allocatable.
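One possible shape for such a reallocate routine is sketched below, using the Fortran 2003 intrinsic move_alloc. The names realloc_demo and resize are hypothetical, and zero-filling the new elements is an arbitrary choice made here for illustration.

```fortran
module realloc_demo
  implicit none
contains
  ! Resizes arr to new_size, keeping as many leading elements as fit.
  ! New elements are set to zero (an arbitrary choice for this sketch).
  subroutine resize(arr, new_size)
    real, allocatable, intent(inout) :: arr(:)
    integer, intent(in) :: new_size
    real, allocatable :: tmp(:)
    integer :: nkeep

    allocate (tmp(new_size))
    tmp = 0.0
    if (allocated(arr)) then
      nkeep = min(size(arr), new_size)
      tmp(1:nkeep) = arr(1:nkeep)
    end if
    ! arr takes over tmp's storage; tmp is left unallocated
    call move_alloc(tmp, arr)
  end subroutine resize
end module realloc_demo
```

Since move_alloc transfers the allocation instead of copying it, the data is copied only once (into tmp), and the old storage of arr is released as part of the call.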
