Plain vs. allocatable/pointer arrays, Fortran advice?

I wrote the following contrived matrix-multiplication example just to examine how declaring different types of arrays can affect performance. To my surprise, I found that plain arrays with sizes known at declaration perform worse than both allocatable and pointer arrays. I thought allocatable was only needed for large arrays that don't fit on the stack. Here are the code and the timings with both the gfortran and Intel Fortran compilers, on Windows 10, with compiler flags -Ofast and -fast, respectively.
program matrix_multiply
implicit none
integer, parameter :: n = 1500
real(8) :: a(n,n), b(n,n), c(n,n), aT(n,n) ! plain arrays
integer :: i, j, k, ts, te, count_rate, count_max
real(8) :: tmp
! real(8), allocatable :: A(:,:), B(:,:), C(:,:), aT(:,:) ! allocatable arrays
! allocate ( a(n,n), b(n,n), c(n,n), aT(n,n) )
do i = 1,n
do j = 1,n
a(i,j) = 1.d0/n/n * (i-j) * (i+j)
b(i,j) = 1.d0/n/n * (i-j) * (i+j)
end do
end do
! transpose for cache-friendliness
do i = 1,n
do j = 1,n
aT(j,i) = a(i,j)
end do
end do
call system_clock(ts, count_rate, count_max)
do i = 1,n
do j = 1,n
tmp = 0
do k = 1,n
tmp = tmp + aT(k,i) * b(k,j)
end do
c(i,j) = tmp
end do
end do
call system_clock(te)
print '(4G0)', "Elapsed time: ", real(te-ts)/count_rate,', c_(n/2+1) = ', c(n/2+1,n/2+1)
end program matrix_multiply
The timings are as follows:
! Intel Fortran
! -------------
Elapsed time: 1.546000, c_(n/2+1) = -143.8334 ! Plain Arrays
Elapsed time: 1.417000, c_(n/2+1) = -143.8334 ! Allocatable Arrays
! gfortran:
! -------------
Elapsed time: 1.827999, c_(n/2+1) = -143.8334 ! Plain Arrays
Elapsed time: 1.702999, c_(n/2+1) = -143.8334 ! Allocatable Arrays
My question is why this happens? Do allocatable arrays give the compiler more guarantees to optimize better? What is the best advice in general when dealing with fixed size arrays in Fortran?
At the risk of lengthening the question, here is another example where the Intel Fortran compiler exhibits the same behavior:
program testArrays
implicit none
integer, parameter :: m = 1223, n = 2015
real(8), parameter :: pi = acos(-1.d0)
real(8) :: a(m,n)
real(8), allocatable :: b(:,:)
real(8), pointer :: c(:,:)
integer :: i, sz = min(m, n), t0, t1, count_rate, count_max
allocate( b(m,n), c(m,n) )
call random_seed()
call random_number(a)
call random_number(b)
call random_number(c)
call system_clock(t0, count_rate, count_max)
do i=1,1000
call doit(a,sz)
end do
call system_clock(t1)
print '(4g0)', 'Time plain: ', real(t1-t0)/count_rate, ', sum 3x3 = ', sum( a(1:3,1:3) )
call system_clock(t0)
do i=1,1000
call doit(b,sz)
end do
call system_clock(t1)
print '(4g0)', 'Time alloc: ', real(t1-t0)/count_rate, ', sum 3x3 = ', sum( b(1:3,1:3) )
call system_clock(t0)
do i=1,1000
call doitp(c,sz)
end do
call system_clock(t1)
print '(4g0)', 'Time p.ptr: ', real(t1-t0)/count_rate, ', sum 3x3 = ', sum( c(1:3,1:3) )
contains
subroutine doit(a,sz)
real(8) :: a(:,:)
integer :: sz
a(1:sz,1:sz) = sin(2*pi*a(1:sz,1:sz))/(a(1:sz,1:sz)+1)
end
subroutine doitp(a,sz)
real(8), pointer :: a(:,:)
integer :: sz
a(1:sz,1:sz) = sin(2*pi*a(1:sz,1:sz))/(a(1:sz,1:sz)+1)
end
end program testArrays
ifort timings:
Time plain: 2.857000, sum 3x3 = -.9913536
Time alloc: 2.750000, sum 3x3 = .4471794
Time p.ptr: 2.786000, sum 3x3 = 2.036269
gfortran timings, however, are much longer but follow my expectation:
Time plain: 51.5600014, sum 3x3 = 6.2749456118192093
Time alloc: 54.0300007, sum 3x3 = 6.4144775892064283
Time p.ptr: 54.1900034, sum 3x3 = -2.1546109819149963

To get an idea whether the compiler thinks there is a difference, look at the generated assembly for the procedures. Based on a quick look here, the assembly for the timed section of the two cases for the first example appears to be more or less equivalent, in terms of the work that the processor has to do. This is as expected, because the arrays presented to the timed section are more or less equivalent - they are large, contiguous, not overlapping and with element values only known at runtime.
(Beyond the compiler, there can then be differences due to the way data presents in the various caches at runtime, but that should be similar for both cases as well.)
The main difference between explicit-shape and allocatable arrays is the time it takes to allocate and deallocate the storage for the latter. There are at most four allocations in your first example (so allocation is not likely to be onerous relative to the subsequent calculations), and you don't time that part of the program. Put the allocation/implicit deallocation pair inside a loop, then see how you go.
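A minimal sketch of that experiment, assuming you only want to compare allocating once against allocating on every iteration. The module alloc_cost and both function names are hypothetical, and the per-iteration work is deliberately trivial so that allocation overhead is what dominates. Both functions return the same checksum; timing them is left to the caller.

```fortran
module alloc_cost
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
contains
  ! Hypothetical variant 1: allocate and deallocate the work array on
  ! every iteration, so the allocation cost is paid reps times.
  function sum_with_realloc(n, reps) result(total)
    integer, intent(in) :: n, reps
    real(real64) :: total
    real(real64), allocatable :: work(:)
    integer :: r
    total = 0.0_real64
    do r = 1, reps
      allocate(work(n))
      work = real(r, real64)
      total = total + sum(work)
      deallocate(work)
    end do
  end function sum_with_realloc

  ! Hypothetical variant 2: allocate once and reuse the storage.
  ! The local allocatable is deallocated automatically on return.
  function sum_with_single_alloc(n, reps) result(total)
    integer, intent(in) :: n, reps
    real(real64) :: total
    real(real64), allocatable :: work(:)
    integer :: r
    total = 0.0_real64
    allocate(work(n))
    do r = 1, reps
      work = real(r, real64)
      total = total + sum(work)
    end do
  end function sum_with_single_alloc
end module alloc_cost
```

Bracketing each call with system_clock, as in the question, should show how much of any difference is allocation overhead; with only a handful of allocations, as in the first program, it is expected to be negligible.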
Arrays with the pointer or target attribute may be subject to aliasing, so the compiler may have to do extra work to account for the possibility that their storage overlaps. However, the nature of the expression in the second example (only one array is referenced) is such that the compiler likely knows the extra work is unnecessary in this particular case, so the operations become equivalent again.
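To make the aliasing point concrete, here is a small sketch with two pointers into the same array (module and procedure names are hypothetical). A Fortran array assignment must behave as if the whole right-hand side were evaluated before any element is stored, so when the compiler cannot prove that the pointers do not overlap it has to add an overlap check, a temporary, or a reversed loop instead of a plain forward copy.

```fortran
module alias_demo
  implicit none
contains
  ! Copies a(1:n-1) into a(2:n) through two overlapping pointers.
  ! The result must equal the old values shifted right by one, which a
  ! naive forward element-by-element copy would get wrong.
  subroutine shift_right_via_pointers(a)
    integer, intent(inout), target :: a(:)
    integer, pointer :: dst(:), src(:)
    integer :: n
    n = size(a)
    if (n < 2) return
    dst => a(2:n)
    src => a(1:n-1)
    dst = src
  end subroutine shift_right_via_pointers
end module alias_demo
```

With plain or allocatable arrays that lack the target attribute, two distinct variables cannot overlap, so the compiler needs no such precaution.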
In response to "I thought allocatable was only needed for large arrays that don't fit into the stack" - allocatable is needed (i.e. you have no real choice) when you cannot determine the size or other characteristics of the thing being allocated in the specification part of the procedure responsible for the entirety of the existence of the thing. Even for things not known until runtime, if you can still determine the characteristics in the specification part of the relevant procedure, then automatic variables are an option. (There are no automatic variables in your example though - in the non-allocatable, non-pointer cases, all the characteristics of the arrays are known at compile time.) At a Fortran processor implementation level, which varies between compilers and compile options, automatic variables may require more stack space than is available, and this can cause problems that allocatables may alleviate (or you can just change compiler options).
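A minimal sketch of the automatic-versus-allocatable choice, assuming a work array whose size is only known from a dummy argument. The module and function names are hypothetical. Where the storage lives is implementation-dependent: automatic arrays are typically placed on the stack, while allocate typically obtains heap memory.

```fortran
module work_arrays
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
contains
  ! Automatic array: its size is determined in the specification part
  ! from the dummy argument x (typically stack storage).
  function sum_doubled_auto(x) result(total)
    real(real64), intent(in) :: x(:)
    real(real64) :: total
    real(real64) :: work(size(x))
    work = 2.0_real64 * x
    total = sum(work)
  end function sum_doubled_auto

  ! Allocatable alternative: same result, storage typically on the heap,
  ! which sidesteps stack-size limits for very large arrays.
  function sum_doubled_alloc(x) result(total)
    real(real64), intent(in) :: x(:)
    real(real64) :: total
    real(real64), allocatable :: work(:)
    allocate(work(size(x)))
    work = 2.0_real64 * x
    total = sum(work)
  end function sum_doubled_alloc
end module work_arrays
```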

This is not an answer to why you get what you observe, but rather a report of disagreement with your observations. Your code,
program matrix_multiply
implicit none
integer, parameter :: n = 1500
!real(8) :: a(n,n), b(n,n), c(n,n), aT(n,n) ! plain arrays
integer :: i, j, k, ts, te, count_rate, count_max
real(8) :: tmp
real(8), allocatable :: A(:,:), B(:,:), C(:,:), aT(:,:) ! allocatable arrays
allocate ( a(n,n), b(n,n), c(n,n), aT(n,n) )
do i = 1,n
do j = 1,n
a(i,j) = 1.d0/n/n * (i-j) * (i+j)
b(i,j) = 1.d0/n/n * (i-j) * (i+j)
end do
end do
! transpose for cache-friendliness
do i = 1,n
do j = 1,n
aT(j,i) = a(i,j)
end do
end do
call system_clock(ts, count_rate, count_max)
do i = 1,n
do j = 1,n
tmp = 0
do k = 1,n
tmp = tmp + aT(k,i) * b(k,j)
end do
c(i,j) = tmp
end do
end do
call system_clock(te)
print '(4G0)', "Elapsed time: ", real(te-ts)/count_rate,', c_(n/2+1) = ', c(n/2+1,n/2+1)
end program matrix_multiply
compiled with Intel Fortran compiler 18.0.2 on Windows and optimization flags turned on,
ifort /standard-semantics /F0x1000000000 /O3 /Qip /Qipo /Qunroll /Qunroll-aggressive /inline:all /Ob2 main.f90 -o run.exe
gives, in fact, the opposite of what you observe:
Elapsed time: 1.580000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.560000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.555000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.588000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.551000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.566000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.555000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.634000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.634000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.602000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.623000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.597000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.607000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.617000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.606000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.626000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.614000, c_(n/2+1) = -143.8334 ! allocatable arrays
As you can see, the allocatable arrays are in fact slightly slower on average, which is what I expected and which contradicts your observations. The only source of difference I can see is the optimization flags used, though I am not sure how that could make a difference. You may want to run your tests without optimization and at several optimization levels, and check whether the performance difference is consistent across all of them. For the meaning of the optimization flags used, see Intel's reference page.
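One way to make such comparisons less noisy is to repeat the timed kernel and keep the fastest run. This is a sketch under assumptions: the module and procedure names are hypothetical, the kernel is the one from the question, and 64-bit system_clock arguments are used because they often give a finer clock resolution.

```fortran
module repeat_timing
  use, intrinsic :: iso_fortran_env, only: real64, int64
  implicit none
contains
  ! The kernel from the question: c(i,j) = sum over k of aT(k,i)*b(k,j).
  subroutine multiply_transposed(aT, b, c)
    real(real64), intent(in)  :: aT(:,:), b(:,:)
    real(real64), intent(out) :: c(:,:)
    real(real64) :: tmp
    integer :: i, j, k
    do i = 1, size(c, 1)
      do j = 1, size(c, 2)
        tmp = 0.0_real64
        do k = 1, size(b, 1)
          tmp = tmp + aT(k,i) * b(k,j)
        end do
        c(i,j) = tmp
      end do
    end do
  end subroutine multiply_transposed

  ! Runs the kernel reps times and returns the fastest wall-clock time
  ! in seconds; the minimum is less sensitive to background noise.
  subroutine best_time(aT, b, c, reps, tbest)
    real(real64), intent(in)  :: aT(:,:), b(:,:)
    real(real64), intent(out) :: c(:,:)
    integer, intent(in) :: reps
    real(real64), intent(out) :: tbest
    integer(int64) :: t0, t1, rate
    integer :: r
    tbest = huge(1.0_real64)
    do r = 1, reps
      call system_clock(t0, rate)
      call multiply_transposed(aT, b, c)
      call system_clock(t1)
      tbest = min(tbest, real(t1 - t0, real64) / real(rate, real64))
    end do
  end subroutine best_time
end module repeat_timing
```

Building the same source once per optimization level and comparing the best times, rather than single runs, makes it easier to tell a real difference from run-to-run variation.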
Also, avoid real(8) in variable declarations. The syntax is standard, but the meaning of the kind value 8 is processor-dependent, so the code is non-portable and therefore potentially problematic. A portable way, according to the Fortran standard, is to use the iso_fortran_env intrinsic module, like:
!...
use, intrinsic :: iso_fortran_env, only: real64, int32
integer(int32), parameter :: n=100
real(real64) :: a(n)
!...
This intrinsic module provides the following kind constants (a constant is negative if the processor does not support that size):
int8 ! 8-bit integer
int16 ! 16-bit integer
int32 ! 32-bit integer
int64 ! 64-bit integer
real32 ! 32-bit real
real64 ! 64-bit real
real128 ! 128-bit real
So, for example, if you wanted to declare a complex variable with components of 64-bit kind, you could write:
program complex
use, intrinsic :: iso_fortran_env, only: RK => real64, output_unit
! the intrinsic attribute above is not essential, but recommended, so this would be also valid:
! use iso_fortran_env, only: RK => real64, output_unit
complex(RK) :: z = (1._RK, 2._RK)
write(output_unit,"(*(g0,:,' '))") "Hello World! This is a complex variable:", z
end program complex
which gives:
$gfortran -std=f2008 *.f95 -o main
$main
Hello World! This is a complex variable: 1.0000000000000000 2.0000000000000000
Note that this requires a Fortran 2008-compliant compiler. There are also other entities in iso_fortran_env, such as output_unit, the unit number of the preconnected standard output unit (the same one used by print, or by write with a unit specifier of *), as well as several others, like compiler_version(), compiler_options(), and more.

Related

Segmentation fault with array indexing in Fortran

Let A and I be integer arrays of dimension N. In general, I is a permutation of the integers 1:N. I want to do A(1:N) = A(I(1:N)). For small N this works fine, but I get a segmentation fault when N is large.
Here is an example of what I actually did:
integer N
integer,dimension(:),allocatable::A,I
N = 10000000
allocate(A(N))
allocate(I(N))
A = (/ (i,i=1,N) /)
I = (/ (N-i+1,i=1,N) /)
A(1:N) = A(I(1:N))
Is there a better way to do this?
It seems that A(I(1:N)) is valid syntax, at least in my testing (gfortran 4.8, ifort 16.0, pgfortran 15.10). One problem is that Fortran is case-insensitive, so i and I are the same name, and the array I cannot be used as the implied-do variable as you are doing. Replacing it with j yields a program that runs for me:
program main
implicit none
integer :: N, j
integer, allocatable, dimension(:) :: A, I
! -- Setup
N = 10000000
allocate(A(N),I(N))
A = (/ (j,j=1,N) /)
I = (/ (N-j+1,j=1,N) /)
! -- Main operation
A(1:N) = A(I(1:N))
write(*,*) 'A(1): ', A(1)
write(*,*) 'A(N): ', A(N)
end program main
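As to why you're seeing a segmentation fault, I guess you're running out of memory when the array sizes get huge. Note, though, that with N = 10000000 each default-integer array is only about 40 MB; a common cause of segmentation faults at this size is stack overflow, because array constructors and expressions such as A(I(1:N)) may need array temporaries, and some compilers (Intel Fortran by default, for example) place those temporaries on the stack. If you're still having trouble, I suggest the following.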
Instead of using A(1:N) = A(I(1:N)), you really should be using a loop, such as
! -- Main operation (Anew must be declared, e.g. integer, allocatable :: Anew(:))
allocate(Anew(N))
do j=1,N
Anew(j) = A(I(j))
enddo
A = Anew
This is more readable and easier to debug moving forward.
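The loop can be wrapped in a reusable routine. This is a sketch only: the module and subroutine names are hypothetical, and the copy goes into an explicitly allocated array (typically heap storage) rather than relying on a compiler-generated temporary.

```fortran
module permute
  implicit none
contains
  ! Replaces a(j) by the old value a(perm(j)) for every j.
  subroutine apply_permutation(a, perm)
    integer, intent(inout) :: a(:)
    integer, intent(in)    :: perm(:)
    integer, allocatable   :: anew(:)
    integer :: j
    if (size(perm) /= size(a)) error stop "apply_permutation: size mismatch"
    allocate(anew(size(a)))
    do j = 1, size(a)
      anew(j) = a(perm(j))
    end do
    a = anew
  end subroutine apply_permutation
end module permute
```

For example, with a = [10, 20, 30] and perm = [3, 1, 2], the call leaves a = [30, 10, 20].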

How to do a fftw3 MPI "transposed" 2D transform if possible at all?

Consider a 2D transform of the form L x M (column-major setup), from a complex array src to a real array tgt. Or, in Fortranese,
complex(C_DOUBLE_COMPLEX), pointer :: src(:,:)
real(8), pointer :: tgt(:,:)
The corresponding pointers are
type(C_PTR) :: csrc,ctgt
I would allocate them in the following manner:
! The complex array first
alloc_local = fftw_mpi_local_size_2d(M,L/2+1,MPI_COMM_WORLD,local_M,local_offset1)
csrc = fftw_alloc_complex(alloc_local)
call c_f_pointer(csrc, src, [L/2,local_M])
! Now the real array
alloc_local = fftw_mpi_local_size_2d(2*(L/2+1),M, &
MPI_COMM_WORLD,local_L,local_offset2)
ctgt = fftw_alloc_real(alloc_local)
call c_f_pointer(ctgt, tgt, [M,local_L])
Now, the plan would be created as:
! Create c-->r transform with one transposition left out
plan = fftw_mpi_plan_dft_c2r_2d(M,L,src,tgt, MPI_COMM_WORLD, &
ior(FFTW_MEASURE,FFTW_MPI_TRANSPOSED_OUT))
Finally, the transform would be performed as:
call fftw_mpi_execute_dft_c2r(plan, src, tgt)
However, this prescription does not work: the last call causes a segmentation fault. At first, I thought this might have something to do with how I allocate the src and tgt arrays, but changing the amount of memory allocated to tgt made no difference. So either I am doing something really silly, or this is not possible at all.
EDIT: MINIMAL COMPILABLE EXAMPLE
program trashingfftw
use, intrinsic :: iso_c_binding
use MPI
implicit none
include 'fftw3-mpi.f03'
integer(C_INTPTR_T), parameter :: L = 256
integer(C_INTPTR_T), parameter :: M = 256
type(C_PTR) :: plan, ctgt, csrc
complex(C_DOUBLE_COMPLEX), pointer :: src(:,:)
real(8), pointer :: tgt(:,:)
integer(C_INTPTR_T) :: alloc_local, local_M, &
& local_L,local_offset1,local_offset2
integer :: ierr,id
call mpi_init(ierr)
call mpi_comm_rank(MPI_COMM_WORLD,id,ierr)
call fftw_mpi_init()
alloc_local = fftw_mpi_local_size_2d(M,L/2+1, MPI_COMM_WORLD, &
local_M, local_offset1)
csrc = fftw_alloc_complex(alloc_local)
call c_f_pointer(csrc, src, [L/2,local_M])
alloc_local = fftw_mpi_local_size_2d(2*(L/2+1),M, MPI_COMM_WORLD, &
& local_L, local_offset2)
ctgt = fftw_alloc_real(alloc_local)
call c_f_pointer(ctgt, tgt, [M,local_L])
plan = fftw_mpi_plan_dft_c2r_2d(M,L,src,tgt, MPI_COMM_WORLD, &
ior(FFTW_MEASURE, FFTW_MPI_TRANSPOSED_OUT))
call fftw_mpi_execute_dft_c2r(plan, src, tgt)
call mpi_finalize(ierr)
end program trashingfftw
And the answer is:
For mpi real transforms, there are only two allowed combinations of transpositions and directions:
real to complex transform and FFTW_MPI_TRANSPOSED_OUT
complex to real transform and FFTW_MPI_TRANSPOSED_IN
I found this while digging through the FFTW 3.3.4 source code: see the comment at line 120 of the file "rdft2-problem.c".
EDIT:
MINIMAL COMPILABLE AND WORKING EXAMPLE:
program trashingfftw
use, intrinsic :: iso_c_binding
use MPI
implicit none
include 'fftw3-mpi.f03'
integer(C_INTPTR_T), parameter :: L = 256
integer(C_INTPTR_T), parameter :: M = 256
type(C_PTR) :: plan, ctgt, csrc
complex(C_DOUBLE_COMPLEX), pointer :: src(:,:)
real(8), pointer :: tgt(:,:)
integer(C_INTPTR_T) :: alloc_local, local_M, &
& local_L,local_offset1,local_offset2
integer :: ierr,id
call mpi_init(ierr)
call mpi_comm_rank(MPI_COMM_WORLD,id,ierr)
call fftw_mpi_init()
alloc_local = fftw_mpi_local_size_2d(L/2+1,M, MPI_COMM_WORLD, &
local_l, local_offset1)
print *, id, "alloc complex=",alloc_local, local_l
csrc = fftw_alloc_complex(alloc_local)
call c_f_pointer(csrc, src, [M,local_l])
!Caveat: Must partition the real storage according to complex layout, this is why
! I am using M and L/2+1 instead of M, 2*(L/2+1) as it was done in the original post
alloc_local = fftw_mpi_local_size_2d(M,L/2+1, MPI_COMM_WORLD, &
& local_M, local_offset2)
print *, id, "alloc real=",alloc_local, local_m
! Two reals per complex
ctgt = fftw_alloc_real(2*alloc_local)
! Only the first L are relevant, the rest is just dangling space (see fftw3 docs)
!caveat: since the padding is in the first index, the 2d data is laid out non-contiguously
!(L sensible reals, padding, padding, L sensible reals, padding, padding, ....)
call c_f_pointer(ctgt, tgt, [2*(L/2+1),local_m])
plan = fftw_mpi_plan_dft_c2r_2d(M,L,src,tgt, MPI_COMM_WORLD, &
ior(FFTW_MEASURE, FFTW_MPI_TRANSPOSED_IN))
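! Should be non-null
print *, 'plan created:', c_associated(plan)
src(3,2)=(1.,0)
call fftw_mpi_execute_dft_c2r(plan, src, tgt)
! Release the plan and the FFTW-allocated buffers
call fftw_destroy_plan(plan)
call fftw_free(csrc)
call fftw_free(ctgt)
call mpi_finalize(ierr)
end program trashingfftw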

Determine assumed-shape array strides at runtime

Is it possible in a modern Fortran compiler such as Intel Fortran to determine array strides at runtime? For example, I may want to perform a Fast Fourier Transform (FFT) on an array section:
program main
complex(8),allocatable::array(:,:)
allocate(array(17, 17))
array = 1.0d0
call fft(array(1:16,1:16))
contains
subroutine fft(a)
use mkl_dfti
implicit none
complex(8),intent(inout)::a(:,:)
type(dfti_descriptor),pointer::desc
integer::stat
stat = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 2, shape(a) )
stat = DftiCommitDescriptor(desc)
stat = DftiComputeForward(desc, a(:,1))
stat = DftiFreeDescriptor(desc)
end subroutine
end program
However, the MKL Dfti* routines need to be explicitly told the array strides.
Looking through reference manuals I have not found any intrinsic functions which return stride information.
A couple of interesting resources are here and here which discuss whether array sections are copied and how Intel Fortran handles arrays internally.
I would rather not restrict myself to the way that Intel currently uses its array descriptors.
How can I figure out the stride information? Note that in general I would want the fft routine (or any similar routine) to not require any additional information about the array to be passed in.
EDIT:
I have verified that an array temporary is not created in this scenario, here is a simpler piece of code which I have checked on Intel(R) Visual Fortran Compiler XE 14.0.2.176 [Intel(R) 64], with optimizations disabled and heap arrays set to 0.
program main
implicit none
real(8),allocatable::a(:,:)
pause
allocate(a(8192,8192))
pause
call random_number(a)
pause
call foo(a(:4096,:4096))
pause
contains
subroutine foo(a)
implicit none
real(8)::a(:,:)
open(unit=16, file='a_sum.txt')
write(16, *) sum(a)
close(16)
end subroutine
end program
Monitoring the memory usage, it is clear that an array temporary is never created.
EDIT 2:
module m_foo
implicit none
contains
subroutine foo(a)
implicit none
real(8),contiguous::a(:,:)
integer::i, j
open(unit=16, file='a_sum.txt')
write(16, *) sum(a)
close(16)
call nointerface(a)
end subroutine
end module
subroutine nointerface(a)
implicit none
real(8)::a(*)
end subroutine
program main
use m_foo
implicit none
integer,parameter::N = 8192
real(8),allocatable::a(:,:)
integer::i, j
real(8)::count
pause
allocate(a(N, N))
pause
call random_number(a)
pause
call foo(a(:N/2,:N/2))
pause
end program
EDIT 3:
The example illustrates what I'm trying to achieve. There is a 16x16 contiguous array, but I only want to transform the upper-left 4x4 block. The first call simply passes in the array section, but it doesn't return a single one in the upper-left corner of the array. The second call sets the appropriate strides, and a then contains the correct upper-left 4x4 block. The stride of the upper-left 4x4 block with respect to the full 16x16 array is not one.
program main
implicit none
complex(8),allocatable::a(:,:)
allocate(a(16,16))
a = 0.0d0
a(1:4,1:4) = 1.0d0
call fft(a(1:4,1:4))
write(*,*) a(1:4,1:4)
pause
a = 0.0d0
a(1:4,1:4) = 1.0d0
call fft_stride(a(1:4,1:4), 1, 16)
write(*,*) a(1:4,1:4)
pause
contains
subroutine fft(a) !{{{
use mkl_dfti
implicit none
complex(8),intent(inout)::a(:,:)
type(dfti_descriptor),pointer::desc
integer::stat
stat = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 2, shape(a) )
stat = DftiCommitDescriptor(desc)
stat = DftiComputeForward(desc, a(:,1))
stat = DftiFreeDescriptor(desc)
end subroutine !}}}
subroutine fft_stride(a, s1, s2) !{{{
use mkl_dfti
implicit none
complex(8),intent(inout)::a(:,:)
integer::s1, s2
type(dfti_descriptor),pointer::desc
integer::stat
integer::strides(3)
strides = [0, s1, s2]
stat = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 2, shape(a) )
stat = DftiSetValue(desc, DFTI_INPUT_STRIDES, strides)
stat = DftiCommitDescriptor(desc)
stat = DftiComputeForward(desc, a(:,1))
stat = DftiFreeDescriptor(desc)
end subroutine !}}}
end program
I'm guessing you got confused because you worked around the explicit interface of the MKL function DftiComputeForward by passing it a(:,1). That section is contiguous and doesn't need an array temporary. It is wrong, however: the low-level routine receives the address of the first element and sees the whole underlying array, which is why it works when you specify strides. Since DftiComputeForward expects an assumed-size array, complex(kind), intent(inout) :: a(*), you can work around this by passing the section through an external subroutine with an assumed-size dummy:
program ...
call fft(4,4,a(1:4,1:4))
end program
subroutine fft(m,n,a) !{{{
use mkl_dfti
implicit none
complex(8),intent(inout)::a(*)
integer :: m, n
type(dfti_descriptor),pointer::desc
integer::stat
stat = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 2, (/m,n/) )
stat = DftiCommitDescriptor(desc)
stat = DftiComputeForward(desc, a)
stat = DftiFreeDescriptor(desc)
end subroutine !}}}
This will, however, create an array temporary when entering the subroutine, because the section a(1:4,1:4) is not contiguous. A more efficient solution is then indeed to use strides:
program ...
call fft_strided(4,4,a,16)
end program
subroutine fft_strided(m,n,a,lda) !{{{
use mkl_dfti
implicit none
complex(8),intent(inout)::a(*)
integer :: m, n, lda
type(dfti_descriptor),pointer::desc
integer::stat
integer::strides(3)
strides = [0, 1, lda]
stat = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 2, (/m,n/) )
stat = DftiSetValue(desc, DFTI_INPUT_STRIDES, strides)
stat = DftiCommitDescriptor(desc)
stat = DftiComputeForward(desc, a)
stat = DftiFreeDescriptor(desc)
end subroutine !}}}
The routine DftiComputeForward accepts an assumed-size array. If you pass something complicated and non-contiguous, a copy has to be made when the argument is passed; the compiler can check at run time whether the copy is actually necessary. In any case, the stride for you is always 1, because that is the stride the MKL routine will see.
In your case you pass A(:,something), this is a contiguous section, provided A is contiguous. If A is not contiguous a copy will have to be made. Stride is always 1.
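Whether a given section is contiguous can be checked directly with the Fortran 2008 intrinsic is_contiguous, assuming your compiler implements it. The module and function names below are hypothetical; the function just reports what the intrinsic says about the actual argument behind an assumed-shape dummy.

```fortran
module contiguity_check
  implicit none
contains
  ! True if the actual argument associated with x occupies one
  ! contiguous block of memory (is_contiguous is Fortran 2008).
  logical function section_is_contiguous(x)
    real, intent(in) :: x(:)
    section_is_contiguous = is_contiguous(x)
  end function section_is_contiguous
end module contiguity_check
```

A column such as b(:,1) of a plain 2D array is contiguous, while a row such as b(1,:) is not; a row is the kind of argument that would need a copy when passed to an assumed-size dummy.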
Some of the answers here do not understand the difference between Fortran strides and memory strides (though they are related).
To answer your question for future readers beyond the specific case you have here: there does not appear to be a way to find an array's memory stride solely in Fortran, but it can be done via C using the interoperability features of newer compilers.
You can do this in C:
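#include <stddef.h>
#include <stdint.h>
/* Returns the distance in bytes from x to y (assumes y is at the higher address). */
size_t c_compute_stride(int * x, int * y)
{
uintptr_t px = (uintptr_t) x;
uintptr_t py = (uintptr_t) y;
size_t d = (size_t) (py - px);
return d;
}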
and then call this function from fortran on the first two elements of an array, e.g.:
program main
use iso_c_binding
implicit none
interface
function c_compute_stride(x, y) bind(C, name="c_compute_stride")
use iso_c_binding
integer(c_int) :: x, y
integer(c_size_t) :: c_compute_stride
end function
end interface
integer(c_int), dimension(10) :: a
integer(c_int), dimension(10,10) :: b
write(*,*) find_stride(a)
write(*,*) find_stride(b(:,1))
write(*,*) find_stride(b(1,:))
contains
function find_stride(x)
integer(c_int), dimension(:) :: x
integer(c_size_t) :: find_stride
! x(1) and x(2) are passed by reference, so C receives their actual addresses
find_stride = c_compute_stride(x(1), x(2))
end function
end program
This will print out:
4
4
40
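A variant of the same measurement that avoids the separate C file, while still relying on the C-interoperability module, is sketched below. It assumes that converting a C address to integer(c_intptr_t) with transfer yields its numeric value; that conversion is processor-dependent, although it is common in practice. The module and function names are hypothetical.

```fortran
module byte_stride
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t
  implicit none
contains
  ! Distance in bytes between x(1) and x(2) in the caller's storage.
  ! Assumption: transfer(c_ptr, 0_c_intptr_t) gives the raw address.
  function memory_stride(x) result(stride)
    integer, intent(in), target :: x(:)
    integer(c_intptr_t) :: stride
    if (size(x) < 2) then
      stride = 0
      return
    end if
    stride = transfer(c_loc(x(2)), 0_c_intptr_t) - transfer(c_loc(x(1)), 0_c_intptr_t)
  end function memory_stride
end module byte_stride
```

With 4-byte default integers this should give 4, 4 and 40 for the three cases above, matching the C version.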
In short: in terms of Fortran indexing, assumed-shape arrays always have stride 1.
A bit longer: when you pass a section of an array to a subroutine that takes an assumed-shape array, as you have here, the subroutine doesn't know anything about the original size of the array. If you look at the lower and upper bounds of the dummy argument in the subroutine, you'll see they will always be 1 and the size of the array section.
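program stride_demo
implicit none
integer, dimension(10:20) :: array
integer :: i
array = [ (i, i=10,20) ]
call foo(array(10:20:2))
contains
subroutine foo(a)
integer, dimension(:) :: a
integer :: i
print*, lbound(a), ubound(a)
! the implied-do prints all elements on one line; indices run from 1
print*, (a(i), i=lbound(a,1), ubound(a,1))
end subroutine foo
end program stride_demo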
This gives the output:
1 6
10 12 14 16 18 20
So, even when your array indices start at 10, when you pass it (or a section of it), the subroutine thinks the indices start at 1. Similarly, it thinks the stride is 1. You can give a lower bound to the dummy argument:
integer, dimension(10:) :: a
which will make lbound(a) 10 and ubound(a) 15. But it's not possible to give an assumed-shape array a stride.
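A small sketch of that declared lower bound in action (module and subroutine names are hypothetical). With the strided section array(10:20:2), the dummy runs from 10 to 15, and a(11) is the caller's array(12): the numbering is shifted, but the index step inside the routine is still 1.

```fortran
module bounds_demo
  implicit none
contains
  ! The dummy declares lower bound 10, so the first element is a(10).
  ! Neighbouring elements still differ by 1 in index, even when the
  ! actual argument is a strided section.
  subroutine inspect(a, lo, hi, second)
    integer, intent(in)  :: a(10:)
    integer, intent(out) :: lo, hi, second
    lo = lbound(a, 1)
    hi = ubound(a, 1)
    second = a(lo + 1)
  end subroutine inspect
end module bounds_demo
```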

finding specific indices with pointer array

I am relatively new to Fortran and have been racking my brain over one thing for hours now:
I want to write a subroutine that finds the indexes of specific elements in a real 1D array (given to the routine as input).
I have generated an array with 100 random reals, called arr, and now want to determine the indexes of those elements which are greater than a real value min, which is also passed to the subroutine.
Plus, in the end I would like to have a pointer that I allocate at the end; I was told this would be better than using an array indices containing the indexes once found.
I just couldn't figure out how to solve that. I had the following approach:
SUBROUTINE COMP(arr, min)
real, intent(in) :: arr(:)
real, intent(in) :: min
integer, pointer, dimension(:) :: Indices
integer :: i, j
! now here I need a loop which assigns to each element of the pointer
! array the Indices one after another, i don't know how many indices there
! are to be pointed at
! And I dont know how to manage that the Indices are pointed at one after another,
! like Indices(1) => 4
! Indices(2) => 7
! Indices(3) => 32
! Indices(4) => 69
! ...
! instead of
! Indices(4) => 4
! Indices(7) => 7
! Indices(32) => 32
! Indices(69) => 69
! ...
DO i = 1, size(arr)
IF (arr(i) > min) THEN
???
ENDIF
ENDDO
allocate(Indices)
END SUBROUTINE COMP
If succinctness (rather than performance) floats your boat... consider:
FUNCTION find_indexes_for_specific_elements_in_a_real_1D_array(array, min) &
RESULT(indices)
REAL, INTENT(IN) :: array(:)
REAL, INTENT(IN) :: min
INTEGER, ALLOCATABLE :: indices(:)
INTEGER :: i
indices = PACK([(i,i=1,SIZE(array))], array >= min)
END FUNCTION find_indexes_for_specific_elements_in_a_real_1D_array
[Requires F2003. Procedures that have assumed shape arguments and functions with allocatable results need to have an explicit interface accessible where they are referenced, so all well behaved Fortran programmers put them in a module.]
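Following that advice, a minimal sketch of the PACK approach placed in a module. The module and function names are hypothetical; the dummy is called threshold rather than min so that the intrinsic MIN is not hidden, and the test uses a strict > as in the question.

```fortran
module index_search
  implicit none
contains
  ! Indices (1-based within arr) of the elements strictly greater than
  ! threshold; the result is allocated automatically on assignment.
  function indices_greater(arr, threshold) result(indices)
    real, intent(in) :: arr(:)
    real, intent(in) :: threshold
    integer, allocatable :: indices(:)
    integer :: i
    indices = pack([(i, i = 1, size(arr))], arr > threshold)
  end function indices_greater
end module index_search
```

For example, indices_greater([0.5, 2.0, 0.1, 3.0], 1.0) returns [2, 4], and an array with no qualifying elements gives a zero-sized result.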
A simple way to get the indices of a rank 1 array arr for elements greater than value min is
indices = PACK([(i, i=LBOUND(arr,1), UBOUND(arr,1))], arr.gt.min)
where indices is allocatable, dimension(:). If your compiler doesn't support automatic (re-)allocation, then an ALLOCATE(indices(COUNT(arr.gt.min))) would be needed before that point (with a DEALLOCATE before that if indices is already allocated).
As explanation: the [(i, i=...)] creates an array with the numbers of the indices of the other array, and the PACK intrinsic selects those corresponding to the mask condition.
Note that if you are doing index calculations in a subroutine you have to be careful:
subroutine COMP(arr, min, indices)
real, intent(in) :: arr(:)
real, intent(in) :: min
integer, allocatable, intent(out) :: indices(:)
!...
end subroutine
arr in the subroutine will have lower bound 1 regardless of the bounds of the actual argument (the array passed, which could be, say, VALS(10:109)). You will then also want to pass the lower bound to the subroutine, or address that later.
[Automatic allocation is not an F90 feature, but in F90 one also has to think about allocatable subroutine arguments.]
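One way to handle the lower bound is to pass it in and build the candidate indices from it. This is a sketch under assumptions: the module name, the subroutine name comp_with_offset and the extra argument lb are hypothetical additions, not part of the original COMP.

```fortran
module index_search_lb
  implicit none
contains
  ! Returns, in indices, the positions of elements greater than min_val,
  ! numbered as in the caller: lb is the actual argument's lower bound.
  subroutine comp_with_offset(arr, min_val, indices, lb)
    real, intent(in) :: arr(:)
    real, intent(in) :: min_val
    integer, allocatable, intent(out) :: indices(:)
    integer, intent(in) :: lb
    integer :: i
    indices = pack([(i, i = lb, lb + size(arr) - 1)], arr > min_val)
  end subroutine comp_with_offset
end module index_search_lb
```

Calling it as comp_with_offset(vals, 2.0, idx, lbound(vals, 1)) gives indices that can be used directly as vals(idx).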
I think you're on the right track, but you're ignoring some intrinsic Fortran functions, specifically count, and you aren't returning anything!
subroutine comp(arr, min)
real, intent(in) :: arr(:)
real, intent(in) :: min
! local variables
integer, allocatable :: indices(:)
integer :: i,j, indx
! count counts the number of TRUE elements in the array
indx = count(arr > min)
allocate(indices(indx))
! the variable j here is the counter to increment the index of indices
j=1
do i=1,size(arr)
if(arr(i) > min) then
indices(j) = i
j = j+1
endif
enddo
end subroutine comp
Then you can use the array indices as
do i=1,size(indices)
var = arr(indices(i))
enddo
Note that since you are not returning indices, you will lose all the information found once you return, unless you plan on using it inside the subroutine, in which case you're okay. If you're not using it there, you could define it as a global variable in a module so that other subroutines can see it.
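A sketch of the module-variable option, assuming the count-then-fill loop from this answer. The module name found_indices is hypothetical. A module allocatable keeps its allocation between calls, so any procedure that uses the module can read the result after comp returns.

```fortran
module found_indices
  implicit none
  ! Module-level storage, visible to every procedure that uses the module.
  integer, allocatable :: indices(:)
contains
  subroutine comp(arr, min_val)
    real, intent(in) :: arr(:)
    real, intent(in) :: min_val
    integer :: i, j
    ! Drop any result left over from a previous call.
    if (allocated(indices)) deallocate(indices)
    allocate(indices(count(arr > min_val)))
    j = 0
    do i = 1, size(arr)
      if (arr(i) > min_val) then
        j = j + 1
        indices(j) = i
      end if
    end do
  end subroutine comp
end module found_indices
```

Passing indices back as an allocatable, intent(out) argument, as in the previous answer, avoids the shared state and is usually easier to reason about.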

Large dynamic array Fortran declaration-seg fault

I wish to use dynamic declaration for a large array in Fortran 95 with allocate(matrix(size)), where size = 10^7 and the contents are real*8 numbers. If size < 13*10^6 everything runs smoothly without any error, but if size > 13*10^6 I get a segmentation fault at run time. It is important that I use dynamic declaration, since the size of the array is calculated within the program. I use Mac OS X 64-bit and gfortran 4.6. Can someone help me?
10**7 elements of real*8 is 76 MiB, so should pose no problem (I have successfully allocated several GiB arrays with GFortran, though I don't use OSX). Can you post a self-contained example in order to further analyze your problem?
Here is an example using an array of size 10**8. It worked for me with Mac OS X and gfortran 4.6. Does it work on your computer?
program test_lrg
implicit none
integer, parameter :: DoubleReal_K = selected_real_kind (14)
integer, parameter :: QuadReal_K = selected_real_kind (32)
! 9 digits are needed: N = 10**8 itself has 9 digits
integer, parameter :: RegularInt_K = selected_int_kind (9)
integer, parameter :: VeryLongInt_K = selected_int_kind (18)
real (DoubleReal_K), dimension (:), allocatable :: array
integer (RegularInt_K) :: i
integer (RegularInt_K), parameter :: N = 100000000_RegularInt_K
real (QuadReal_K) :: sum
integer (VeryLongInt_K) :: CalcSum
allocate (array (N))
do i=1, N
array (i) = i
end do
! sum must be initialized before accumulating
sum = 0.0_QuadReal_K
do i=1, N
sum = sum + array (i)
end do
write (*, *) sum
CalcSum = N
CalcSum = ( CalcSum * (CalcSum + 1_VeryLongInt_K) ) / 2_VeryLongInt_K
write (*, *) CalcSum
stop
end program test_lrg
Try compiling with debugging options, such as:
-fimplicit-none -Wall -Wline-truncation -Wcharacter-truncation -Wsurprising -Waliasing -Wimplicit-interface -Wunused-parameter -fwhole-file -fcheck=bounds -fcheck=do -fcheck=mem -fcheck=recursion -std=f2008 -pedantic -fbacktrace
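It can also help to let the allocation itself report failure instead of aborting. A minimal sketch, assuming Fortran 2003 support for errmsg=; the module and subroutine names are hypothetical. A stat= check only catches failures the runtime reports at allocate time; it cannot help if the crash comes from somewhere else, such as a stack overflow caused by an array temporary.

```fortran
module safe_alloc
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
contains
  ! Allocates a(n); on failure sets ok = .false. and leaves the runtime's
  ! message in msg instead of terminating the program.
  subroutine allocate_checked(a, n, ok, msg)
    real(real64), allocatable, intent(out) :: a(:)
    integer, intent(in) :: n
    logical, intent(out) :: ok
    character(len=*), intent(inout) :: msg
    integer :: ierr
    allocate(a(n), stat=ierr, errmsg=msg)
    ok = (ierr == 0)
  end subroutine allocate_checked
end module safe_alloc
```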
