Clever/fast way of array multiplication and summation - arrays

I have to solve a double integral
in my program, that can be translated into the i-,j- loops in the following minimum working example:
program test
implicit none
integer :: i, j, n
double precision, allocatable :: y(:), res(:), C(:,:,:)
n=200
allocate(y(n), res(n), C(n,n,n))
call random_number(y)
call random_number(C)
res = 0.d0
do i=1, n
do j=1, n
res(:) = res(:) + y(i) * y(j) * C(:, j, i)
end do
end do
deallocate(y, res, C)
end program test
I have to solve this integral multiple times per execution, and profiling tells me that it is the bottleneck of my calculation, consuming more than 95% of the execution time.
I was wondering whether there's any possibility to solve this in a more clever, i.e., fast way and maybe get rid of one or both of the loops.
My question is not to optimize the code with compiler flags or parallelization, but whether the double loop is the best practice to tackle the given problem. Usually loops are slow and I try to avoid them. I was thinking that it might be possible to avoid the loops by reshaping or spreading the arrays. But I just don't see it.

If you write the double loop in matrix notation, y(i)*y(j) becomes the dyadic product YY^t, with Y being an n x 1 matrix. With this you can rewrite the loop as (pseudo-code)
do n=1,size(C,1)
res(n) = sum( YY^t * C_n )
enddo
where C_n = C(n,:,:) and * is an element-wise multiplication. Apart from the element-wise calculation you already did, this leaves you with two additional ways of calculating the result:
res(n) = sum( (YY^t) * C_n )
res(n) = sum( Y * (Y^t C_n) )
In both cases, it is beneficial to have contiguous data and re-order the array C:
do i=1,n
C2(:,:,i) = C(i,:,:)
enddo !i
The number of floating point operations is the same for both approaches, and slightly lower than in the original approach. So let's measure the time for all of them...
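As a quick cross-check (a NumPy sketch standing in for the Fortran/BLAS code, not part of the answer's program), all three formulations produce the same result:

```python
import numpy as np

# Check that res(k) = sum_ij y(i)*y(j)*C(k,j,i) equals the two
# matrix formulations discussed above.
rng = np.random.default_rng(0)
n = 20
y = rng.random(n)
C = rng.random((n, n, n))

# original double loop
res_loop = np.zeros(n)
for i in range(n):
    for j in range(n):
        res_loop += y[i] * y[j] * C[:, j, i]

# sum((Y Y^t) * C_k): dyadic product, then an element-wise sum per slice
res_dyadic = np.einsum('ji,kji->k', np.outer(y, y), C)

# sum(Y * (Y^t C_k)): matrix-vector product first, then a dot product
res_mv = np.einsum('j,kji,i->k', y, C, y)
```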
Here are the implementations using BLAS for the matrix operations (and using dot products where applicable):
1. sum( (YY^t) * C_n )
call system_clock(iTime1)
call dgemm('N','N',n,n,1,1.d0,y,n,y,1,0.d0,mat,n)
nn=n*n
do i=1,n
res(i) = ddot( nn, mat, 1, C2(:,:,i), 1 )
enddo !i
2. sum( Y * (Y^t C_n) )
do i=1,n
call dgemm('N','N',1,n,n,1.d0,y,1,C2(:,:,i),n,0.d0,vec,1)
res(i) = ddot( n, y, 1, vec, 1 )
enddo !i
The outcome is as follows:
Orig: 0.111000001
sum((YY^t)C): 0.116999999
sum(Y(Y^tC)): 0.187000006
Your original implementation is the fastest! Why? Most probably due to ideal use of the CPU cache. Fortran compilers are typically very good at optimizing loops, and in the element-wise calculation you simply scale and add vectors, without any matrix operation, which can be executed very efficiently.
So, is there room for improvement? Certainly :) The operation you perform inside the loop is commonly known as axpy: y = a*x + y. It is a standard BLAS subroutine and is usually highly optimized.
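In NumPy terms (a sketch, with a plain array update standing in for the BLAS call), one axpy step of the loop body looks like this:

```python
import numpy as np

# One axpy update: res <- a*x + res, which is what
# daxpy(n, y(i)*y(j), C(:,j,i), 1, res, 1) performs per inner iteration.
n = 5
res = np.zeros(n)
x = np.arange(1.0, n + 1.0)   # stands in for the column C(:, j, i)
a = 2.5                       # stands in for the scalar y(i)*y(j)
res += a * x
```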
Utilizing this leads to
res = 0.d0
do i=1, n
do j=1, n
call daxpy(n, y(i)*y(j), C(:,j,i), 1, res, 1)
end do
end do
and takes
Orig (DAXPY): 0.101000004
Which is roughly 10% faster.
Here is the complete code. All measurements were performed with OpenBLAS and n=500 (to better see the impact):
program test
implicit none
integer :: i, j, n, nn
double precision, allocatable, target :: y(:), res(:), resC(:), C(:,:,:), C2(:,:,:), mat(:,:), vec(:)
integer :: count_rate, iTime1, iTime2
double precision :: ddot
n=500
allocate(y(n), res(n), resC(n), C(n,n,n), C2(n,n,n), mat(n,n), vec(n))
call random_number(y)
call random_number(C)
! Get the count rate
call system_clock(count_rate=count_rate)
! Original Approach
call system_clock(iTime1)
res = 0.d0
do i=1, n
do j=1, n
res(:) = res(:) + y(i) * y(j) * C(:, j, i)
end do
end do
call system_clock(iTime2)
print *,'Orig: ',real(iTime2-iTime1)/real(count_rate)
! Original Approach, DAXPY
call system_clock(iTime1)
resC = 0.d0
do i=1, n
do j=1, n
call daxpy(n, y(i)*y(j), C(:,j,i), 1, resC, 1)
end do
end do
call system_clock(iTime2)
print *,'Orig (DAXPY): ',real(iTime2-iTime1)/real(count_rate)
! print *,maxval( abs(resC-res) )
! Re-order
do i=1,n
C2(:,:,i) = C(i,:,:)
enddo !i
! sum((YY^t)C)
call system_clock(iTime1)
call dgemm('N','N',n,n,1,1.d0,y,n,y,1,0.d0,mat,n)
nn=n*n
do i=1,n
resC(i) = ddot( nn, mat, 1, C2(:,:,i), 1 )
enddo !i
call system_clock(iTime2)
print *,'sum((YY^t)C): ',real(iTime2-iTime1)/real(count_rate)
! print *,maxval( abs(resC-res) )
! sum(Y(Y^tC))
call system_clock(iTime1)
do i=1,n
call dgemm('N','N',1,n,n,1.d0,y,1,C2(:,:,i),n,0.d0,vec,1)
resC(i) = ddot( n, y, 1, vec, 1 )
enddo !i
call system_clock(iTime2)
print *,'sum(Y(Y^tC)): ',real(iTime2-iTime1)/real(count_rate)
! print *,maxval( abs(resC-res) )
end program test


Fortran array input

I haven't done any Fortran programming for years and it seems I'm rather rusty now. So, I won't show you all my failed attempts but will humbly ask you to help me with the following.
I've got the following "input" file
1 5 e 4
A b & 1
c Z ; b
y } " N
t r ' +
It can have more columns and/or rows. I would now like to assign each of these ASCII characters to an array x(i,j) so that I can process them further after ICHAR conversions. In this example i=1,4, j=1,5, but it can be any number, depending on the input file. The simplest example
PROGRAM Example
integer :: i, j
CHARACTER, ALLOCATABLE, DIMENSION(:,:) :: A
READ *, A
ALLOCATE (A(i,j))
PRINT *, A
END PROGRAM Example
compiles (Example.f95) but
cat input | ./Example.f95
does not give any output.
I would greatly appreciate an advice on how to import the afore-mentioned strings into the program as x(i,j) terms of an array.
In Fortran, it's always best to know in advance how big your arrays need to be. I understand that in your case you can't know.
Assuming that your input is at least formatted consistently (i.e. the columns match up and have only a single space between them), I've written code that should be able to read an arbitrary shape. (Not quite arbitrary: it assumes there are fewer than 511 columns.)
It uses two ways:
It reads the whole first line at once (1024 characters, hence the 511-column limit), then calculates the number of columns from its length.
It then allocates an array with a guessed number of rows, and once it notices that the guess was too small, it creates a new allocation with double the number of rows. It then uses the move_alloc command to swap the allocations.
To find when it should end reading the values, it simply checks whether the read returns the IOSTAT_END error code.
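The growth strategy can be sketched outside Fortran as well (a Python sketch under the same assumptions: consistent columns, unknown row count; `read_rows` is a hypothetical helper, not part of the answer's code):

```python
def read_rows(lines, ncol):
    """Grow a row buffer geometrically, copying on each resize --
    the Python analogue of the allocate/move_alloc dance below."""
    nrow_guess = 2
    buf = [None] * nrow_guess
    nrow = 0
    for line in lines:
        if nrow >= nrow_guess:          # guessed too small
            nrow_guess *= 2
            new = [None] * nrow_guess
            new[:nrow] = buf[:nrow]     # copy, then swap (move_alloc)
            buf = new
        buf[nrow] = line.split()[:ncol]
        nrow += 1
    return buf[:nrow]                   # trim to the exact size
```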
Here's the code:
program read_input
use iso_fortran_env, only: IOSTAT_END
implicit none
character, dimension(:,:), allocatable :: A, A_tmp
character(len=1024) :: line ! Assumes that there are never more than 500 or so columns
integer :: i, ncol, nrow, nrow_guess
integer :: ios
character :: iom
! First, read the first line, to see how many columns there are
read(*, '(A)', iostat=ios, iomsg=iom) line
call iocheck('read first line', ios, iom)
ncol = (len_trim(line) + 1) / 2
! Let's first allocate memory for two rows, we can make it more later.
nrow_guess = 2
allocate(A(ncol, nrow_guess))
! Instead of standard input, we're reading from the line we read before.
read(line, '(*(A1,X))', iostat=ios, iomsg=iom) A(:, 1)
call iocheck('read first line into vals', ios, iom)
! Now loop over all the rows
nrow = 1
read_loop: do
if (nrow >= nrow_guess) then ! We have guessed too small.
! This is a bit convoluted, but the best
! way to increase the array shape.
nrow_guess = nrow_guess * 2
allocate(A_tmp(ncol, nrow_guess))
A_tmp(:, 1:nrow_guess/2) = A(:,:)
call move_alloc(A_tmp, A)
end if
read(*, '(*(A1,X))', iostat = ios, iomsg=iom) A(:, nrow+1)
if (ios == IOSTAT_END) exit read_loop ! We're done reading.
call iocheck('read line into vals', ios, iom)
nrow = nrow + 1
end do read_loop
! The last guess was probably too large,
! let's move it to an array of the correct size.
if (nrow < nrow_guess) then
allocate(A_tmp(ncol, nrow))
A_tmp(:,:) = A(:, 1:nrow)
call move_alloc(A_tmp, A)
end if
! To show we have all values, print them out.
do i = 1, nrow
print '(*(X,A))', A(:, i)
end do
contains
! This is a subroutine to check for IO Errors
subroutine iocheck(op, ios, iom)
character(len=*), intent(in) :: op, iom
integer, intent(in) :: ios
if (ios == 0) return
print *, "IO ERROR"
print *, "Operation: ", op
print *, "Message: ", iom
stop 1
end subroutine iocheck
end program read_input
Edited to add
I had trouble with the special characters in your example input file, otherwise I'd just have made a read(*, *) A(:, nrow) -- but that messed the special characters up. That's why I chose the explicit (*(A1, X)) format. Of course that messes up when your characters don't start at the first position in the line.
You need to read the first line and determine how many characters there are in it, then read the entire file to determine the number of lines. Allocate the 2D array to hold the characters, then re-read the file and parse each line into the 2D array. There are more elegant ways of doing this, but here you go:
program foo
implicit none
character(len=:), allocatable :: s
character, allocatable :: a(:,:)
integer fd, i, j, n, nr, nc
!
! Open file for reading
!
open(newunit=fd, file='tmp.dat', status='old', err=9)
!
! Determine number of characters in a row. Assumes all rows
! are of the same length.
!
n = 128
1 if (allocated(s)) then
deallocate(s)
n = 2 * n
end if
allocate(character(len=n) :: s)
read(fd,'(A)') s
if (len_trim(s) == n) then ! buffer may have been too short; re-read this line
backspace(fd)
goto 1
end if
s = adjustl(s)
n = len_trim(s)
deallocate(s)
!
! Allocate a string of the correct length.
!
allocate(character(len=n) :: s)
!
! Count the number of rows
!
rewind(fd)
nr = 0
do
read(fd,*,end=2)
nr = nr + 1
end do
!
! Read file and store individual characters in a(:,:)
!
2 rewind(fd)
nc = n / 2 + 1
allocate(a(nr,nc))
do i = 1, nr
read(fd,'(A)') s
do j = 1, nc
a(i,j) = s(2*j-1:2*j-1)
end do
end do
close(fd)
do i = 1, nr
write(*,'(*(A,1X))') a(i,:)
end do
stop
9 write(*,'(A)') 'Error: cannot open tmp.dat'
end program foo
Apparently, GOTO is verboten here. Here's a more elegant solution.
program foo
implicit none
character, allocatable :: s(:), a(:,:)
integer fd, i, j, n, nr, nc
! Open file for reading
open(newunit=fd, file='tmp.dat', status='old', access='stream', err=9)
inquire(fd, size = n) ! Determine file size.
allocate(s(n)) ! Allocate space
read(fd) s ! Read the entire file
close(fd)
nr = count(ichar(s) == 10) ! Number of rows
nc = (count(ichar(s) /= 32) - nr) / nr ! Number of columns
a = reshape(pack(s, ichar(s) /= 32 .and. ichar(s) /= 10), [nc,nr])
a = transpose(a)
do i = 1, nr
do j = 1, nc
write(*,'(A,1X)',advance='no') a(i,j)
end do
write(*,*)
end do
stop
9 write(*,'(A)') 'Error: cannot open tmp.dat'
end program foo

Plain vs. allocatable/pointer arrays, Fortran advice?

I wrote the following contrived matrix-multiplication example just to examine how declaring different types of arrays affects performance. To my surprise, I found that the performance of plain arrays with sizes known at declaration is inferior to both allocatable and pointer arrays. I thought allocatable was only needed for large arrays that don't fit into the stack. Here is the code, with timings from both the gfortran and Intel Fortran compilers on Windows 10, using the compiler flags -Ofast and -fast, respectively.
program matrix_multiply
implicit none
integer, parameter :: n = 1500
real(8) :: a(n,n), b(n,n), c(n,n), aT(n,n) ! plain arrays
integer :: i, j, k, ts, te, count_rate, count_max
real(8) :: tmp
! real(8), allocatable :: A(:,:), B(:,:), C(:,:), aT(:,:) ! allocatable arrays
! allocate ( a(n,n), b(n,n), c(n,n), aT(n,n) )
do i = 1,n
do j = 1,n
a(i,j) = 1.d0/n/n * (i-j) * (i+j)
b(i,j) = 1.d0/n/n * (i-j) * (i+j)
end do
end do
! transpose for cache-friendliness
do i = 1,n
do j = 1,n
aT(j,i) = a(i,j)
end do
end do
call system_clock(ts, count_rate, count_max)
do i = 1,n
do j = 1,n
tmp = 0
do k = 1,n
tmp = tmp + aT(k,i) * b(k,j)
end do
c(i,j) = tmp
end do
end do
call system_clock(te)
print '(4G0)', "Elapsed time: ", real(te-ts)/count_rate,', c_(n/2+1) = ', c(n/2+1,n/2+1)
end program matrix_multiply
The timings are as follows:
! Intel Fortran
! -------------
Elapsed time: 1.546000, c_(n/2+1) = -143.8334 ! Plain Arrays
Elapsed time: 1.417000, c_(n/2+1) = -143.8334 ! Allocatable Arrays
! gfortran:
! -------------
Elapsed time: 1.827999, c_(n/2+1) = -143.8334 ! Plain Arrays
Elapsed time: 1.702999, c_(n/2+1) = -143.8334 ! Allocatable Arrays
My question is why this happens? Do allocatable arrays give the compiler more guarantees to optimize better? What is the best advice in general when dealing with fixed size arrays in Fortran?
At the risk of lengthening the question, here is another example where Intel Fortran compiler exhibits the same behavior:
program testArrays
implicit none
integer, parameter :: m = 1223, n = 2015
real(8), parameter :: pi = acos(-1.d0)
real(8) :: a(m,n)
real(8), allocatable :: b(:,:)
real(8), pointer :: c(:,:)
integer :: i, sz = min(m, n), t0, t1, count_rate, count_max
allocate( b(m,n), c(m,n) )
call random_seed()
call random_number(a)
call random_number(b)
call random_number(c)
call system_clock(t0, count_rate, count_max)
do i=1,1000
call doit(a,sz)
end do
call system_clock(t1)
print '(4g0)', 'Time plain: ', real(t1-t0)/count_rate, ', sum 3x3 = ', sum( a(1:3,1:3) )
call system_clock(t0)
do i=1,1000
call doit(b,sz)
end do
call system_clock(t1)
print '(4g0)', 'Time alloc: ', real(t1-t0)/count_rate, ', sum 3x3 = ', sum( b(1:3,1:3) )
call system_clock(t0)
do i=1,1000
call doitp(c,sz)
end do
call system_clock(t1)
print '(4g0)', 'Time p.ptr: ', real(t1-t0)/count_rate, ', sum 3x3 = ', sum( c(1:3,1:3) )
contains
subroutine doit(a,sz)
real(8) :: a(:,:)
integer :: sz
a(1:sz,1:sz) = sin(2*pi*a(1:sz,1:sz))/(a(1:sz,1:sz)+1)
end
subroutine doitp(a,sz)
real(8), pointer :: a(:,:)
integer :: sz
a(1:sz,1:sz) = sin(2*pi*a(1:sz,1:sz))/(a(1:sz,1:sz)+1)
end
end program testArrays
ifort timings:
Time plain: 2.857000, sum 3x3 = -.9913536
Time alloc: 2.750000, sum 3x3 = .4471794
Time p.ptr: 2.786000, sum 3x3 = 2.036269
gfortran timings, however, are much longer but follow my expectation:
Time plain: 51.5600014, sum 3x3 = 6.2749456118192093
Time alloc: 54.0300007, sum 3x3 = 6.4144775892064283
Time p.ptr: 54.1900034, sum 3x3 = -2.1546109819149963
To get an idea whether the compiler thinks there is a difference, look at the generated assembly for the procedures. Based on a quick look here, the assembly for the timed section of the two cases for the first example appears to be more or less equivalent, in terms of the work that the processor has to do. This is as expected, because the arrays presented to the timed section are more or less equivalent - they are large, contiguous, not overlapping and with element values only known at runtime.
(Beyond the compiler, there can then be differences due to the way data presents in the various caches at runtime, but that should be similar for both cases as well.)
The main difference between explicit-shape and allocatable arrays is the time it takes to allocate and deallocate the storage for the latter. There are at most four allocations in your first example (so this is unlikely to be onerous relative to the subsequent calculations), and you don't time that part of the program anyway. Stick the allocation/implicit-deallocation pair inside a loop, then see how you go.
Arrays with the pointer or target attribute may be subject to aliasing, so the compiler may have to do extra work to account for the possibility of storage for the arrays overlapping. However the nature of the expression in the second example (only the one array is referenced) is such that the compiler likely knows that there is no need for the extra work in this particular case, and the operations become equivalent again.
In response to "I thought allocatable was only needed for large arrays that don't fit into the stack" - allocatable is needed (i.e. you have no real choice) when you cannot determine the size or other characteristics of the thing being allocated in the specification part of the procedure responsible for the entirety of the existence of the thing. Even for things not known until runtime, if you can still determine the characteristics in the specification part of the relevant procedure, then automatic variables are an option. (There are no automatic variables in your example though - in the non-allocatable, non-pointer cases, all the characteristics of the arrays are known at compile time.) At a Fortran processor implementation level, which varies between compilers and compile options, automatic variables may require more stack space than is available, and this can cause problems that allocatables may alleviate (or you can just change compiler options).
This is not an answer to why you get what you observe, but rather a report of disagreement with your observations. Your code,
program matrix_multiply
implicit none
integer, parameter :: n = 1500
!real(8) :: a(n,n), b(n,n), c(n,n), aT(n,n) ! plain arrays
integer :: i, j, k, ts, te, count_rate, count_max
real(8) :: tmp
real(8), allocatable :: A(:,:), B(:,:), C(:,:), aT(:,:) ! allocatable arrays
allocate ( a(n,n), b(n,n), c(n,n), aT(n,n) )
do i = 1,n
do j = 1,n
a(i,j) = 1.d0/n/n * (i-j) * (i+j)
b(i,j) = 1.d0/n/n * (i-j) * (i+j)
end do
end do
! transpose for cache-friendliness
do i = 1,n
do j = 1,n
aT(j,i) = a(i,j)
end do
end do
call system_clock(ts, count_rate, count_max)
do i = 1,n
do j = 1,n
tmp = 0
do k = 1,n
tmp = tmp + aT(k,i) * b(k,j)
end do
c(i,j) = tmp
end do
end do
call system_clock(te)
print '(4G0)', "Elapsed time: ", real(te-ts)/count_rate,', c_(n/2+1) = ', c(n/2+1,n/2+1)
end program matrix_multiply
compiled with Intel Fortran compiler 18.0.2 on Windows and optimization flags turned on,
ifort /standard-semantics /F0x1000000000 /O3 /Qip /Qipo /Qunroll /Qunroll-aggressive /inline:all /Ob2 main.f90 -o run.exe
gives, in fact, the opposite of what you observe:
Elapsed time: 1.580000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.560000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.555000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.588000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.551000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.566000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.555000, c_(n/2+1) = -143.8334 ! plain arrays
Elapsed time: 1.634000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.634000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.602000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.623000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.597000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.607000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.617000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.606000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.626000, c_(n/2+1) = -143.8334 ! allocatable arrays
Elapsed time: 1.614000, c_(n/2+1) = -143.8334 ! allocatable arrays
As you can see, the allocatable arrays are in fact slightly slower, on average, which is what I expected to see, which also contradicts your observations. The only source of difference that I can see is the optimization flags used, though I am not sure how that could make a difference. Perhaps you'd want to run your tests in multiple different modes of no optimization and with different levels of optimization, and see if you get consistent performance differences in all modes or not. To get more info about the meaning of the optimization flags used, see Intel's reference page.
Also, do not use real(8) in variable declarations. It is non-standard syntax, non-portable, and therefore potentially problematic. A more portable way, consistent with the Fortran standard, is to use the iso_fortran_env intrinsic module, like:
!...
use, intrinsic :: iso_fortran_env, only: real64, int32
integer(int32), parameter :: n=100
real(real64) :: a(n)
!...
This intrinsic module has the following kinds,
int8 ! 8-bit integer
int16 ! 16-bit integer
int32 ! 32-bit integer
int64 ! 64-bit integer
real32 ! 32-bit real
real64 ! 64-bit real
real128 ! 128-bit real
So, for example, if you wanted to declare a complex variable with components of 64-bit kind, you could write:
program complex
use, intrinsic :: iso_fortran_env, only: RK => real64, output_unit
! the intrinsic attribute above is not essential, but recommended, so this would be also valid:
! use iso_fortran_env, only: RK => real64, output_unit
complex(RK) :: z = (1._RK, 2._RK)
write(output_unit,"(*(g0,:,' '))") "Hello World! This is a complex variable:", z
end program complex
which gives:
$gfortran -std=f2008 *.f95 -o main
$main
Hello World! This is a complex variable: 1.0000000000000000 2.0000000000000000
Note that this requires Fortran 2008 compliant compiler. There are also other functions and entities in iso_fortran_env, like output_unit which is the unit number for the preconnected standard output unit (the same one that is used by print or write with a unit specifier of *), as well as several others like compiler_version(), compiler_options(), and more.

Segmentation fault with array indexing in Fortran

Let A and I be arrays of integer type with dimension N. In general, I is a permutation of the integers 1:N. I want to do A(1:N) = A(I(1:N)). For small N this works fine, but I get a segmentation fault when N is large.
Here is an example of what I actually did:
integer N
integer,dimension(:),allocatable::A,I
N = 10000000
allocate(A(N))
allocate(I(N))
A = (/ (i,i=1,N) /)
I = (/ (N-i+1,i=1,N) /)
A(1:N) = A(I(1:N))
Is there a better way to do this?
It seems that A(I(1:N)) is valid syntax, at least in my testing (gfortran 4.8, ifort 16.0, pgfortran 15.10). One problem is that i and I are the same identifier (Fortran is case-insensitive), so the array I cannot also be used as the implied-do index as you are doing. Replacing it with j yields a program that runs for me:
program main
implicit none
integer :: N, j
integer, allocatable, dimension(:) :: A, I
! -- Setup
N = 10000000
allocate(A(N),I(N))
A = (/ (j,j=1,N) /)
I = (/ (N-j+1,j=1,N) /)
! -- Main operation
A(1:N) = A(I(1:N))
write(*,*) 'A(1): ', A(1)
write(*,*) 'A(N): ', A(N)
end program main
As to why you're seeing a segmentation fault, I guess you're running out of memory when the array sizes get huge. If you're still having trouble, though, I suggest the following.
Instead of using A(1:N) = A(I(1:N)), you really should be using a loop, such as
! -- Main operation
do j=1,N
Anew(j) = A(I(j))
enddo
A = Anew
This is more readable and easier to debug moving forward.
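The same out-of-place pattern in a NumPy sketch (an analogy, not the Fortran code: fancy indexing allocates a fresh array, playing the role of Anew):

```python
import numpy as np

N = 10
A = np.arange(1, N + 1)    # A = (/ (j, j=1,N) /)
I = np.arange(N, 0, -1)    # I = (/ (N-j+1, j=1,N) /), a reversal permutation
Anew = A[I - 1]            # explicit temporary; NumPy is 0-based, hence I-1
A = Anew                   # A is now reversed
```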

Passing array of unknown size (subroutine output) to another subroutine

I'm new to Intel MKL. Here's a problem I've come across -- apparently a problem not related to MKL itself, but to the problem of how to declare and pass an array of hitherto unknown size as an output of a subroutine to another subroutine.
I'm trying to use mkl_ddnscsr to convert a matrix to its CSR format suitable for calling by Pardiso:
CALL mkl_ddnscsr(job,Nt,Nt,Adns,Nt,Acsr,ja,ia,info)
CALL PARDISO(pt,1,1,11,13,Nt,Acsr,ia,ja,perm,1,iparm,0,b,x,errr)
Problem is, I have no idea what the lengths of the CSR-form Acsr and the index vector ja will be before calling mkl_ddnscsr. How should one declare Acsr and ja in the main program, or in the subroutine where these two lines are located?
I tried something like
INTERFACE
SUBROUTINE mkl_ddnscsr(job, m, n, Adns, lda, Acsr, ja, ia, info)
IMPLICIT NONE
INTEGER :: job(8)
INTEGER :: m, n, lda, info
INTEGER, ALLOCATABLE :: ja(:)
INTEGER :: ia(m+1)
REAL(KIND=8), ALLOCATABLE :: Acsr(:)
REAL(KIND=8) :: Adns(:)
END SUBROUTINE
END INTERFACE
followed by
INTEGER, ALLOCATABLE :: ja(:)
REAL(KIND=8), ALLOCATABLE :: Acsr(:)
outside the INTERFACE, in the main program. But this configuration gives me the segmentation fault upon running.
On the other hand, if I try something like
INTERFACE
SUBROUTINE mkl_ddnscsr(job, m, n, Adns, lda, Acsr, ja, ia, info)
IMPLICIT NONE
INTEGER :: job(8)
INTEGER :: m, n, lda, info
INTEGER :: ja(:), ia(m+1)
REAL(KIND=8) :: Acsr(:), Adns(:)
END SUBROUTINE
END INTERFACE
and then
INTEGER, DIMENSION(:) :: ja
REAL(KIND=8), DIMENSION(:) :: Acsr
Then ifort would give me the following message:
error #6596: If a deferred-shape array is intended, then the ALLOCATABLE or POINTER attribute is missing; if an assumed-shape array is intended, the array must be a dummy argument.
Anyone got any idea how to work around this? What's the right way to declare ja and Acsr in the main program (or main subroutine) and pass them around?
Note that the subroutines are part of the Intel MKL package, not something I write on my own, so it appears that module would be out of the question.
You can find the interface for mkl_ddnscsr from the manual page, or from the include file mkl_spblas.fi in your MKL install directory (e.g., /path/to/mkl/include/).
INTERFACE
subroutine mkl_ddnscsr ( job, m, n, Adns, lda, Acsr, AJ, AI, info )
integer job(8)
integer m, n, lda, info
integer AJ(*), AI(m+1)
double precision Adns(*), Acsr(*)
end
END INTERFACE
Because this routine has only Fortran77-style dummy arguments (i.e., an explicit-shape array like AI(m+1) or assumed-size arrays like Adns(*)), you can pass any local or allocatable arrays (once allocated on the caller side) as actual arguments. Also, it is not mandatory to write an interface block explicitly, but it is useful to include one (on the caller side) to detect potential interface mismatches.
According to the manual, mkl_ddnscsr (a routine for converting a dense matrix to sparse CSR format) works something like this:
program main
implicit none
! include 'mkl_spblas.fi' !! or mkl.fi (not mandatory but recommended)
integer :: nzmax, nnz, job( 8 ), m, n, lda, info, irow, k
double precision :: A( 10, 20 )
double precision, allocatable :: Asparse(:)
integer, allocatable :: ia(:), ja(:)
A( :, : ) = 0.0d0
A( 2, 3 ) = 23.0d0
A( 2, 7 ) = 27.0d0
A( 5, 4 ) = 54.0d0
A( 9, 9 ) = 99.0d0
!! Give an estimate of the number of non-zeros.
nzmax = 10
!! Or assume that non-zeros occupy at most 2% of A(:,:), for example.
! nzmax = size( A ) / 50
!! Or count the number of non-zeros directly.
! nzmax = count( abs( A ) > 0.0d0 )
print *, "nzmax = ", nzmax
m = size( A, 1 ) !! number of rows
n = size( A, 2 ) !! number of columns
lda = m !! leading dimension of A
allocate( Asparse( nzmax ) )
allocate( ja( nzmax ) ) !! <-> columns(:)
allocate( ia( m + 1 ) ) !! <-> rowIndex(:)
job( 1 ) = 0 !! convert dense to sparse A
job( 2:3 ) = 1 !! use 1-based indices
job( 4 ) = 2 !! use the whole A as input
job( 5 ) = nzmax !! maximum allowed number of non-zeros
job( 6 ) = 1 !! generate Asparse, ia, and ja as output
call mkl_ddnscsr( job, m, n, A, lda, Asparse, ja, ia, info )
if ( info /= 0 ) then
print *, "insufficient nzmax (stopped at ", info, "row)"; stop
endif
nnz = ia( m+1 ) - 1
print *, "number of non-zero elements = ", nnz
do irow = 1, m
!! This loop runs only for rows having nonzero elements.
do k = ia( irow ), ia( irow + 1 ) - 1
print "(2i5, f15.8)", irow, ja( k ), Asparse( k )
enddo
enddo
end program
Compiling with ifort -mkl test.f90 (using ifort 14.0) gives the expected result
nzmax = 10
number of non-zero elements = 4
2 3 23.00000000
2 7 27.00000000
5 4 54.00000000
9 9 99.00000000
As for the determination of nzmax, I think there are at least three ways for this: (1) just use a guess value (as above); (2) assume the fraction of nonzero elements in the whole array; or (3) directly count the number of nonzeros in the dense array. In any case, because we have the exact number of nonzeros as output (nnz), we could re-allocate Asparse and ja to have the exact size (if necessary).
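For readers less familiar with the three-array CSR layout that mkl_ddnscsr produces, here is a small pure-Python sketch of the same conversion (`dense_to_csr` is a hypothetical helper, using 1-based indices to match job(2:3) = 1):

```python
def dense_to_csr(A):
    """Convert a dense matrix (list of rows) to 1-based CSR arrays:
    values, column indices ja, and row pointers ia with ia[m] - 1 == nnz."""
    acsr, ja, ia = [], [], [1]
    for row in A:
        for j, v in enumerate(row, start=1):
            if v != 0.0:
                acsr.append(v)
                ja.append(j)
        ia.append(len(acsr) + 1)   # points one past this row's last entry
    return acsr, ja, ia

A = [[0.0, 23.0, 0.0],
     [0.0,  0.0, 0.0],
     [54.0, 0.0, 99.0]]
acsr, ja, ia = dense_to_csr(A)
nnz = ia[-1] - 1               # exact number of non-zeros, as in the answer
```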
Similarly, you can find the interface for PARDISO from the include file mkl_pardiso.fi or from this (or this) page.

Determine assumed-shape array strides at runtime

Is it possible in a modern Fortran compiler such as Intel Fortran to determine array strides at runtime? For example, I may want to perform a Fast Fourier Transform (FFT) on an array section:
program main
complex(8),allocatable::array(:,:)
allocate(array(17, 17))
array = 1.0d0
call fft(array(1:16,1:16))
contains
subroutine fft(a)
use mkl_dfti
implicit none
complex(8),intent(inout)::a(:,:)
type(dfti_descriptor),pointer::desc
integer::stat
stat = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 2, shape(a) )
stat = DftiCommitDescriptor(desc)
stat = DftiComputeForward(desc, a(:,1))
stat = DftiFreeDescriptor(desc)
end subroutine
end program
However, the MKL Dfti* routines need to be explicitly told the array strides.
Looking through reference manuals I have not found any intrinsic functions which return stride information.
A couple of interesting resources are here and here which discuss whether array sections are copied and how Intel Fortran handles arrays internally.
I would rather not restrict myself to the way that Intel currently uses its array descriptors.
How can I figure out the stride information? Note that in general I would want the fft routine (or any similar routine) to not require any additional information about the array to be passed in.
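For intuition about what a "stride" means here, a NumPy sketch (NumPy, unlike standard Fortran, exposes the byte strides of a view):

```python
import numpy as np

a = np.zeros((17, 17), order='F')  # column-major storage, like a Fortran array
s = a[:16, :16]                    # a non-contiguous section of it
elem = a.itemsize
# Element strides: 1 along the first dimension, 17 along the second --
# the parent's extent, not the section's 16 x 16 shape.
strides = (s.strides[0] // elem, s.strides[1] // elem)
contiguous = s.flags['F_CONTIGUOUS']   # False: stride info (or a copy) is needed
```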
EDIT:
I have verified that an array temporary is not created in this scenario, here is a simpler piece of code which I have checked on Intel(R) Visual Fortran Compiler XE 14.0.2.176 [Intel(R) 64], with optimizations disabled and heap arrays set to 0.
program main
implicit none
real(8),allocatable::a(:,:)
pause
allocate(a(8192,8192))
pause
call random_number(a)
pause
call foo(a(:4096,:4096))
pause
contains
subroutine foo(a)
implicit none
real(8)::a(:,:)
open(unit=16, file='a_sum.txt')
write(16, *) sum(a)
close(16)
end subroutine
end program
Monitoring the memory usage, it is clear that an array temporary is never created.
EDIT 2:
module m_foo
implicit none
contains
subroutine foo(a)
implicit none
real(8),contiguous::a(:,:)
integer::i, j
open(unit=16, file='a_sum.txt')
write(16, *) sum(a)
close(16)
call nointerface(a)
end subroutine
end module
subroutine nointerface(a)
implicit none
real(8)::a(*)
end subroutine
program main
use m_foo
implicit none
integer,parameter::N = 8192
real(8),allocatable::a(:,:)
integer::i, j
real(8)::count
pause
allocate(a(N, N))
pause
call random_number(a)
pause
call foo(a(:N/2,:N/2))
pause
end program
EDIT 3:
The example illustrates what I'm trying to achieve. There is a 16x16 contiguous array, but I only want to transform the upper-left 4x4 block. The first call simply passes in the array section, but it does not return the expected single peak in the upper-left corner of the block. The second call sets the appropriate strides, and a subsequently contains the correct 4x4 result. The stride of the upper 4x4 block with respect to the full 16x16 array is not one.
program main
implicit none
complex(8),allocatable::a(:,:)
allocate(a(16,16))
a = 0.0d0
a(1:4,1:4) = 1.0d0
call fft(a(1:4,1:4))
write(*,*) a(1:4,1:4)
pause
a = 0.0d0
a(1:4,1:4) = 1.0d0
call fft_stride(a(1:4,1:4), 1, 16)
write(*,*) a(1:4,1:4)
pause
contains
subroutine fft(a) !{{{
use mkl_dfti
implicit none
complex(8),intent(inout)::a(:,:)
type(dfti_descriptor),pointer::desc
integer::stat
stat = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 2, shape(a) )
stat = DftiCommitDescriptor(desc)
stat = DftiComputeForward(desc, a(:,1))
stat = DftiFreeDescriptor(desc)
end subroutine !}}}
subroutine fft_stride(a, s1, s2) !{{{
use mkl_dfti
implicit none
complex(8),intent(inout)::a(:,:)
integer::s1, s2
type(dfti_descriptor),pointer::desc
integer::stat
integer::strides(3)
strides = [0, s1, s2]
stat = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 2, shape(a) )
stat = DftiSetValue(desc, DFTI_INPUT_STRIDES, strides)
stat = DftiCommitDescriptor(desc)
stat = DftiComputeForward(desc, a(:,1))
stat = DftiFreeDescriptor(desc)
end subroutine !}}}
end program
I'm guessing you got confused because you worked around the explicit interface of the MKL function DftiComputeForward by giving it a(:,1). This is contiguous and doesn't need an array temporary. It's wrong, however: the low-level routine will get the whole array, and that's why you see that it works once you specify strides. Since DftiComputeForward expects an array declared complex(kind), intent(inout) :: a(*), you can work around it by passing the data through an external subroutine.
program ...
call fft(4,4,a(1:4,1:4))
end program
subroutine fft(m,n,a) !{{{
use mkl_dfti
implicit none
complex(8),intent(inout)::a(*)
integer :: m, n
type(dfti_descriptor),pointer::desc
integer::stat
stat = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 2, (/m,n/) )
stat = DftiCommitDescriptor(desc)
stat = DftiComputeForward(desc, a)
stat = DftiFreeDescriptor(desc)
end subroutine !}}}
This will create an array temporary, though, when going into the subroutine. A more efficient solution is then indeed to use strides:
program ...
call fft_strided(4,4,a,16)
end program
subroutine fft_strided(m,n,a,lda) !{{{
use mkl_dfti
implicit none
complex(8),intent(inout)::a(*)
integer :: m, n, lda
type(dfti_descriptor),pointer::desc
integer::stat
integer::strides(3)
strides = [0, 1, lda]
stat = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 2, (/m,n/) )
stat = DftiSetValue(desc, DFTI_INPUT_STRIDES, strides)
stat = DftiCommitDescriptor(desc)
stat = DftiComputeForward(desc, a)
stat = DftiFreeDescriptor(desc)
end subroutine !}}}
The routine DftiComputeForward accepts an assumed-size array. If you pass it something complicated and non-contiguous, a copy has to be made at the call; the compiler can check at run time whether the copy is actually necessary. In any case, the stride the MKL routine sees is always 1.
In your case you pass a(:,something), which is a contiguous section provided a itself is contiguous. If a is not contiguous, a copy has to be made. Either way, the stride is always 1.
Some of the answers here do not understand the difference between Fortran strides and memory strides (though they are related).
To answer your question for future readers beyond the specific case you have here: there does not appear to be a way to find an array's stride solely in Fortran, but it can be done via C using the interoperability features of newer compilers.
You can do this in C:
#include <stddef.h> /* for size_t */
size_t c_compute_stride(int * x, int * y)
{
size_t px = (size_t) x;
size_t py = (size_t) y;
size_t d = py-px;
return d;
}
and then call this function from fortran on the first two elements of an array, e.g.:
program main
use iso_c_binding
implicit none
interface
function c_compute_stride(x, y) bind(C, name="c_compute_stride")
use iso_c_binding
integer :: x, y
integer(c_size_t) :: c_compute_stride
end function
end interface
integer, dimension(10) :: a
integer, dimension(10,10) :: b
write(*,*) find_stride(a)
write(*,*) find_stride(b(:,1))
write(*,*) find_stride(b(1,:))
contains
function find_stride(x)
integer, dimension(:) :: x
integer(c_size_t) :: find_stride
find_stride = c_compute_stride(x(1), x(2))
end function
end program
This will print out:
4
4
40
In short: assumed-shape arrays always have stride 1.
A bit longer: When you pass a section of an array to a subroutine which takes an assumed-shape array, as you have here, the subroutine doesn't know anything about the original size of the array. If you look at the lower and upper bounds of the dummy argument inside the subroutine, you'll see that they are always 1 and the size of the array section.
integer, dimension(10:20) :: array
integer :: i
array = [ (i, i=10,20) ]
call foo(array(10:20:2))
subroutine foo(a)
integer, dimension(:) :: a
integer :: i
print*, lbound(a), ubound(a)
do i=lbound(a,1), ubound(a,1)
print*, a(i)
end do
end subroutine foo
This gives the output:
1 6
10 12 14 16 18 20
So, even when your array indices start at 10, when you pass it (or a section of it), the subroutine thinks the indices start at 1. Similarly, it thinks the stride is 1. You can give a lower bound to the dummy argument:
integer, dimension(10:) :: a
which will make lbound(a) 10 and ubound(a) 15. But it's not possible to give an assumed-shape array a stride.
