OpenMP parallelization region calls a subroutine which also calls a subroutine - arrays

My code works correctly without parallelization, but gives wrong results with it.
It has a module that contains an array of size 100x100x100:
real(8), dimension(1:100,1:100,1:100) :: array
This module sits at the lowest level of the hierarchy so that any other module can use it. Another module contains a do loop that calls a subroutine:
do i=1,100
do j=1,100
do k=1,100
call some_calculation(i,j,k)
enddo
enddo
enddo
The subroutine some_calculation performs some arithmetic using array(i,j,k) and then updates array(i,j,k); the inputs i,j,k are the indices into the array. But when I parallelize the outermost do loop
!$OMP PARALLEL DO
do i=1,100
do j=1,100
do k=1,100
call some_calculation(i,j,k)
enddo
enddo
enddo
!$OMP END PARALLEL DO
I get different results in array. Does anyone have any idea why? Thank you!

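Your loop indices should be private. OpenMP already makes Fortran do-loop indices inside a parallel construct private by default, but declaring them explicitly does no harm: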
!$OMP PARALLEL DO private (i,j,k)
do i=1,100
do j=1,100
do k=1,100
call some_calculation(i,j,k)
enddo
enddo
enddo
!$OMP END PARALLEL DO
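If the results still differ with private indices, a common cause is scratch data that some_calculation (or a routine it calls) keeps at module level, because module variables are shared between threads. The question does not show the body of some_calculation, so the following is only a sketch under that assumption: the arithmetic and the names grid_sketch, tmp and update_all are hypothetical. It shows the safe pattern, where every temporary is a local variable and each call writes only its own array(i,j,k).

```fortran
module grid_sketch
  implicit none
  real(8), dimension(1:100,1:100,1:100) :: array
contains
  ! Hypothetical stand-in for the question's routine.
  subroutine some_calculation(i, j, k)
    integer, intent(in) :: i, j, k
    real(8) :: tmp   ! local variable: every thread gets its own copy
    tmp = 2.0d0 * array(i,j,k)
    array(i,j,k) = tmp + 1.0d0   ! each call touches only its own element
  end subroutine some_calculation

  subroutine update_all()
    integer :: i, j, k
    !$OMP PARALLEL DO private(i,j,k)
    do i = 1, 100
      do j = 1, 100
        do k = 1, 100
          call some_calculation(i, j, k)
        end do
      end do
    end do
    !$OMP END PARALLEL DO
  end subroutine update_all
end module grid_sketch
```

Had tmp been declared in the specification part of the module instead, all threads would read and write the same tmp and the results could change from run to run.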

Related

Parallel array assignment in Fortran90

I need to speed up a multi-dimensional array assignment with matrix multiplication over some indices. The original loop is
FORALL (ia=1:2,ib=1:2,ic=1:2) U(:,ia,:,ib,ic)=
+ W(ic,ia,ib,1)*matmul(FS(:,ia,:,1),transpose(F(:,ib,:,1)))+
+ W(ic,ia,ib,2)*matmul(FS(:,ia,:,2),transpose(F(:,ib,:,2)))
The matrix multiplication is over the large dimension (>100).
What is the best way to speed this up using OpenMP directives on a workstation with multiple Intel Xeon cores? Say,
!$OMP DO
DO ic=1,2
DO ib=1,2
DO ia=1,2
U(:,ia,:,ib,ic)=
+ W(ic,ia,ib,1)*matmul(FS(:,ia,:,1),transpose(F(:,ib,:,1)))+
+ W(ic,ia,ib,2)*matmul(FS(:,ia,:,2),transpose(F(:,ib,:,2)))
ENDDO
ENDDO
ENDDO
!$OMP END DO
Will this work, or are there better alternatives? Thank you
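One thing worth noting about the snippet above: a bare !$OMP DO only shares work among threads when it sits inside a !$OMP PARALLEL region; on its own it runs on one thread. A minimal free-form sketch using the combined PARALLEL DO directive follows. The array shapes in the comments are inferred from the expression, not stated in the question, and the names u_assign_sketch and build_u are hypothetical. Each (ia,ib,ic) iteration writes a different slice of U, so the iterations are independent.

```fortran
module u_assign_sketch
  implicit none
contains
  subroutine build_u(W, FS, F, U)
    double precision, intent(in)  :: W(2,2,2,2)
    double precision, intent(in)  :: FS(:,:,:,:)   ! assumed shape (n,2,m,2)
    double precision, intent(in)  :: F(:,:,:,:)    ! assumed shape (p,2,m,2)
    double precision, intent(out) :: U(:,:,:,:,:)  ! assumed shape (n,2,p,2,2)
    integer :: ia, ib, ic

    ! collapse(3) merges the three small loops into 8 independent iterations
    !$omp parallel do collapse(3) private(ia, ib, ic) shared(W, FS, F, U)
    do ic = 1, 2
      do ib = 1, 2
        do ia = 1, 2
          U(:,ia,:,ib,ic) = &
            W(ic,ia,ib,1)*matmul(FS(:,ia,:,1), transpose(F(:,ib,:,1))) + &
            W(ic,ia,ib,2)*matmul(FS(:,ia,:,2), transpose(F(:,ib,:,2)))
        end do
      end do
    end do
    !$omp end parallel do
  end subroutine build_u
end module u_assign_sketch
```

With only 8 iterations this cannot use more than 8 threads. Whether it beats a threaded BLAS call for the matrix products depends on the sizes and the machine, so it is worth timing both.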

OpenMP causing different array values in Fortran after each run

I'm running a script that contains one loop to calculate two arrays (alpha_x and alpha_y) using input arrays (X, Y, M_list and zeta_list) from Python. It works fine when I run the Fortran script without OpenMP: the array values are the same as when I do the calculation in Python alone. However, when I add OpenMP support and use multiple cores, the outputs for alpha_x and alpha_y differ on every run! The code is below:
PROGRAM LensingTest1
IMPLICIT NONE
DOUBLE PRECISION,DIMENSION(1,1000)::M_list
DOUBLE PRECISION,DIMENSION(1000,1000)::X,Y,z_m_z_x,z_m_z_y,dist_z_m_z, alpha_x, alpha_y
DOUBLE PRECISION,DIMENSION(2,1000)::zeta_list
REAL::start_time,stop_time
INTEGER::i,j
open(10,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/M_list.dat')
open(9,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/zeta_list.dat')
open(8,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/X.dat')
open(7,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/Y.dat')
read(10,*)M_list
read(9,*)zeta_list
read(8,*)X
read(7,*)Y
call cpu_time(start_time)
!$OMP PARALLEL DO
do i=1,size(M_list,2),1
z_m_z_x = X - zeta_list(1,i)
z_m_z_y = Y - zeta_list(2,i)
dist_z_m_z = z_m_z_x**2 + z_m_z_y**2
alpha_x = alpha_x + (M_list(1,i)* z_m_z_x / dist_z_m_z)
alpha_y = alpha_y + (M_list(1,i)* z_m_z_y / dist_z_m_z)
end do
!$OMP END PARALLEL DO
call cpu_time(stop_time)
print *, "Setup time:", &
stop_time - start_time, "seconds"
open(6,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/alpha_x.dat')
do i =1,1000
write(6,'(1000F14.7)')(alpha_x(i,j), j=1,1000)
end do
open(5,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/alpha_y.dat')
do i =1,1000
write(5,'(1000F14.7)')(alpha_y(i,j), j=1,1000)
end do
stop
END PROGRAM LensingTest1
The only difference is that I add in the !$OMP PARALLEL DO and !$OMP END PARALLEL DO for the OpenMP support. I compile with gfortran -fopenmp script.f90 and then export OMP_NUM_THREADS=4
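The run-to-run differences are consistent with a data race: every thread overwrites the shared whole-array temporaries z_m_z_x, z_m_z_y and dist_z_m_z and accumulates into the shared alpha_x and alpha_y at the same time (which are also never set to zero before the loop). One possible restructuring, offered as a sketch and not as the only fix, is to parallelize over grid points instead of over the masses, so that each thread owns the elements it writes and the temporaries become private scalars. The names lensing_sketch, deflection, dx, dy and d2 are hypothetical.

```fortran
module lensing_sketch
  implicit none
contains
  subroutine deflection(X, Y, M_list, zeta_list, alpha_x, alpha_y)
    double precision, intent(in)  :: X(:,:), Y(:,:)
    double precision, intent(in)  :: M_list(:,:), zeta_list(:,:)
    double precision, intent(out) :: alpha_x(:,:), alpha_y(:,:)
    double precision :: dx, dy, d2
    integer :: i, j, k

    ! Threads split the columns j; no two threads write the same element.
    !$omp parallel do private(i, j, k, dx, dy, d2) &
    !$omp shared(X, Y, M_list, zeta_list, alpha_x, alpha_y)
    do j = 1, size(X, 2)
      do k = 1, size(X, 1)
        alpha_x(k,j) = 0.0d0     ! explicit initialisation
        alpha_y(k,j) = 0.0d0
        do i = 1, size(M_list, 2)
          dx = X(k,j) - zeta_list(1,i)
          dy = Y(k,j) - zeta_list(2,i)
          d2 = dx**2 + dy**2
          alpha_x(k,j) = alpha_x(k,j) + M_list(1,i) * dx / d2
          alpha_y(k,j) = alpha_y(k,j) + M_list(1,i) * dy / d2
        end do
      end do
    end do
    !$omp end parallel do
  end subroutine deflection
end module lensing_sketch
```

Keeping the original loop order is also possible by marking the temporaries private and using reduction(+:alpha_x,alpha_y), at the cost of a private copy of each 1000x1000 array per thread.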

How to have generic subroutine to work in fortran with assumed size array

I have an interface block that defines a generic subroutine whose dummy argument is an assumed-size array (so that it can act on 'the middle' of a passed array, like a C pointer), and it does not compile. Here is a simple example:
module foo
interface sub
module procedure isub
module procedure dsub
end interface
contains
subroutine isub(a,n)
integer, intent(in) :: a(*), n
integer :: i
print*, 'isub'
do i=1,n
print*, a(i)
enddo
end subroutine isub
subroutine dsub(a,n)
real(8), intent(in) :: a(*)
integer, intent(in) :: n
integer :: i
print*, 'dsub'
do i=1,n
print*, a(i)
enddo
end subroutine dsub
end module foo
program test
use foo
implicit none
integer :: ai(4)
real(8) :: ad(4)
ai=(/1,2,3,4/)
ad=(/1.,2.,3.,4./)
call sub(ai,3)
call sub(ad,3)
call isub(ai(2),3)
!call sub(ai(2),3)
end program test
The commented line does not compile, whereas calling the specific subroutine directly with call isub(ai(2),3) is fine (tested with gfortran and ifort). Why, and is it possible to make call sub(ai(2),3) work?
edit: with ifort, it says:
$ ifort overload.f90
overload.f90(37): error #6285: There is no matching specific subroutine for this generic subroutine call. [SUB]
call sub(ai(2),3)
-------^
compilation aborted for overload.f90 (code 1)
Thanks
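You are passing a scalar to a generic procedure whose specific procedures expect an array. Generic resolution matches on type, kind and rank, so a scalar actual argument never matches a rank-1 dummy. Try
call sub(ai(2:4),3)
which passes a rank-1 array section. The direct call isub(ai(2),3) is accepted because, when a specific procedure with an assumed-size dummy is called by name, sequence association lets an array element stand for the array from that element onwards.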
To answer your new question (partly in the comments):
If you restrict yourself to contiguous arrays, you can use call sub(ai(2:4)) without loss of performance using assumed-shape arrays:
subroutine isub(a)
integer,intent(in) :: a(:)
integer :: i
print*, 'isub'
do i=1,size(a)
print*, a(i)
enddo
end subroutine isub
There are no temporary arrays created with ifort or gfortran. You can check for this using:
ifort -check arg_temp_created
gfortran -Warray-temporaries
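Putting the pieces together, the generic interface from the question might look like this with assumed-shape dummies. This is a sketch that extends the isub shown above to dsub in the same way; the module name foo_shape_sketch is hypothetical. Because the array carries its own size, the n argument is no longer needed.

```fortran
module foo_shape_sketch
  implicit none
  interface sub
    module procedure isub
    module procedure dsub
  end interface
contains
  subroutine isub(a)
    integer, intent(in) :: a(:)   ! assumed shape: size travels with the array
    integer :: i
    print *, 'isub'
    do i = 1, size(a)
      print *, a(i)
    end do
  end subroutine isub

  subroutine dsub(a)
    real(8), intent(in) :: a(:)
    integer :: i
    print *, 'dsub'
    do i = 1, size(a)
      print *, a(i)
    end do
  end subroutine dsub
end module foo_shape_sketch
```

With this module, call sub(ai(2:4)) resolves to isub and call sub(ad(2:4)) resolves to dsub, each printing three elements starting at the second one.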

Fastest way to get Inverse Mapping from Values to Indices in fortran

Here's an array A of length N whose values are between 1 and N (no duplicates).
I want to get the array B which satisfies B[A[i]]=i for i in [1,N]
e.g.
for A=[4,2,1,3], I want to get
B=[3,2,4,1]
I've written Fortran code with OpenMP, shown below; array A is provided by another procedure. For N = 1024^3 (~10^9) it takes about 40 seconds, and adding more threads helps little (the time is similar for OMP_NUM_THREADS=1, 4 or 16). It seems OpenMP does not work well for very large N, although it works well for N=10^7.
I wonder whether there is a better algorithm for filling B, or a way to make OpenMP effective.
The code:
subroutine fill_inverse_array(leng, A, B)
use omp_lib
implicit none
integer*4, intent(in) :: leng
integer*4, intent(in) :: A(leng)
integer*4, intent(out) :: B(leng)
integer*4 :: i
!$omp parallel do private(i) firstprivate(leng) shared(A, B)
do i=1,leng
B(A(i))=i
enddo
!$omp end parallel do
end subroutine
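It's a slow day here so I ran some tests. I managed to squeeze out a useful increase in speed by rewriting the expression inside the loop from B(A(i))=i to B(i) = A(A(i)), which I think is a little more cache-friendly because the writes to B become sequential. Be careful, though: the two forms are not equivalent in general. B(i) = A(A(i)) applies the permutation twice, which equals the inverse only when applying A three times gives the identity, as happens for the example A=[4,2,1,3]. For an arbitrary permutation only B(A(i))=i gives the inverse, so read the timings for the A(A(i)) form as a comparison of memory access patterns.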
I used the following code to test various alternatives:
A = random_permutation(length)
CALL system_clock(st1)
B = A(A)
CALL system_clock(nd1)
CALL system_clock(st2)
DO i = 1, length
B(i) = A(A(i))
END DO
CALL system_clock(nd2)
CALL system_clock(st3)
!$omp parallel do shared(A,B,length) private(i)
DO i = 1, length
B(i) = A(A(i))
END DO
!$omp end parallel do
CALL system_clock(nd3)
CALL system_clock(st4)
DO i = 1, length
B(A(i)) = i
END DO
CALL system_clock(nd4)
CALL system_clock(st5)
!$omp parallel do shared(A,B,length) private(i)
DO i = 1, length
B(A(i)) = i
END DO
!$omp end parallel do
CALL system_clock(nd5)
As you can see, there are 5 timed sections in this code. The first is a simple one-line revision of your original code, to provide a baseline. This is followed by an unparallelised and then a parallelised version of your loop, rewritten along the lines I outlined above. Sections 4 and 5 reproduce your original order of operations, first unparallelised, then parallelised.
Over a series of four runs I got the following average times. In all cases I was using arrays of 10**9 elements and 8 threads. I tinkered a little and found that using 16 threads (hyperthreading) gave very little improvement, but that 8 was a definite improvement on fewer. The average timings:
Sec 1: 34.5s
Sec 2: 32.1s
Sec 3: 6.4s
Sec 4: 31.5s
Sec 5: 8.6s
Make of those numbers what you will. As noted above, I suspect that my version is marginally faster than your version because it makes better use of cache.
I'm using Intel Fortran 14.0.1.139 on a 64-bit Windows 7 machine with 10GB RAM. I used the '/O2' option for compiler optimisation.
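Whatever form of the loop is timed, it is worth checking that the result really is the inverse, that is, B(A(i)) == i for every i. A small sketch of such a check, with the hypothetical names inverse_perm_sketch, fill_inverse and is_inverse, and default integers instead of integer*4:

```fortran
module inverse_perm_sketch
  implicit none
contains
  ! Scatter form from the question: B(A(i)) = i.
  ! A is a permutation, so no two iterations write the same element of B.
  subroutine fill_inverse(A, B)
    integer, intent(in)  :: A(:)
    integer, intent(out) :: B(size(A))
    integer :: i
    !$omp parallel do private(i) shared(A, B)
    do i = 1, size(A)
      B(A(i)) = i
    end do
    !$omp end parallel do
  end subroutine fill_inverse

  ! True when B(A(i)) == i for all i.
  logical function is_inverse(A, B)
    integer, intent(in) :: A(:), B(:)
    integer :: i
    is_inverse = all(B(A) == [(i, i = 1, size(A))])
  end function is_inverse
end module inverse_perm_sketch
```

For A=[2,3,4,1], fill_inverse gives B=[4,1,2,3] and the check passes, while A(A)=[3,4,1,2] fails it. The vector subscript B(A) creates a temporary, so for arrays near 10**9 elements an explicit loop would be the more practical check.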

Using an array coming from a module to be used in another subroutine or main program in Fortran

I'd be glad if somebody could help me with this. I'm studying modules in Fortran, and I have a question. Let's say that my module creates a matrix [A(3,3)] that is read from the user's input. Then I'd like to use that matrix in a new subroutine so that I can do an operation with it (for the sake of simplicity, let's say a sum). My code looks like this:
module matrixm
contains
subroutine matrixc
integer i,j
real, DIMENSION(3,3) :: a
do 10 i=1,3
do 20 j=1,3
read(*,*) a(i,j)
20 continue
10 continue
end subroutine matrixc
end module matrixm
program matrix
use matrixm
real, dimension(3,3) :: b,c
integer i,j
call matrixc
b=10.0
c=a+b
write statements here...
end
If the input of A is: 1 2 3 4 5 6 7 8 9 one would expect C[3,3] to be 11 12 13 14 15 16 17 18 19. However, the result shows only a matrix C whose elements are all equal to 10.0. What is the error in my program? And, more importantly, do I understand correctly what a module is for? I have a similar issue in a larger problem I'm working on right now. Thanks.
The problem in your program is one of scope:
You read the data into the matrix a, which is local to your subroutine matrixc. This means that the change is not visible to the program.
The next thing is that the variable a in your program is implicitly typed as a real scalar, so the compiler does not report an error. Adding IMPLICIT NONE would have caught this.
There are two easy solutions:
1: Put the declaration of matrix a in the specification part of your module:
module matrixm
REAL, DIMENSION(3,3) :: a
CONTAINS
subroutine matrixc
integer i,j
do i=1,3
do j=1,3
read(*,*) a(i,j)
end do
end do
end subroutine matrixc
end module matrixm
2: Pass a as an argument to your subroutine and declare it in the main program:
module matrixm
CONTAINS
subroutine matrixc(a)
integer i,j
REAL, DIMENSION(3,3), INTENT(OUT) :: a
do i=1,3
do j=1,3
read(*,*) a(i,j)
end do
end do
end subroutine matrixc
end module matrixm
program matrix
use matrixm
IMPLICIT NONE
real, dimension(3,3) :: a,b,c
integer i,j
call matrixc(a)
b=10.0
c=a+b
write statements here...
end program
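For completeness, here is how solution 1 might look as a module that can be tried without typing nine numbers. This is a sketch: the hypothetical routine fill_matrix stands in for matrixc and fills a with 1 to 9 row by row, in the same order the read loop would. The names matrixm_sketch and fill_matrix are assumptions, not part of the original code.

```fortran
module matrixm_sketch
  implicit none
  ! Module variable: visible in every program unit that uses the module
  real, dimension(3,3) :: a
contains
  ! Hypothetical stand-in for matrixc (which reads a from the user)
  subroutine fill_matrix()
    integer :: i, j
    do i = 1, 3
      do j = 1, 3
        a(i,j) = real(3*(i-1) + j)
      end do
    end do
  end subroutine fill_matrix
end module matrixm_sketch
```

A main program that has use matrixm_sketch and implicit none, calls fill_matrix, sets b=10.0 and computes c=a+b then sees the filled a, so c holds 11 to 19 instead of ten everywhere.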
