I need to speed up a multi-dimensional array assignment with matrix multiplication over some indices. The original loop is
FORALL (ia=1:2,ib=1:2,ic=1:2) U(:,ia,:,ib,ic)=
+ W(ic,ia,ib,1)*matmul(FS(:,ia,:,1),transpose(F(:,ib,:,1)))+
+ W(ic,ia,ib,2)*matmul(FS(:,ia,:,2),transpose(F(:,ib,:,2)))
The matrix multiplications are over the large dimension (size > 100).
What is the best way to increase speed using OpenMP directives on a workstation with multiple Intel Xeon cores? Say,
!$OMP DO
DO ic=1,2
DO ib=1,2
DO ia=1,2
U(:,ia,:,ib,ic)=
+ W(ic,ia,ib,1)*matmul(FS(:,ia,:,1),transpose(F(:,ib,:,1)))+
+ W(ic,ia,ib,2)*matmul(FS(:,ia,:,2),transpose(F(:,ib,:,2)))
ENDDO
ENDDO
ENDDO
!$OMP END DO
Will this work, or are there better alternatives? Thank you
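Note that an !$OMP DO only distributes work if it is inside a parallel region; on its own it is executed by a single thread. A minimal sketch of a self-contained version (untested; each (ia,ib,ic) iteration writes a distinct slice of U, so the iterations are independent):
!$OMP PARALLEL DO COLLAPSE(3) PRIVATE(ia,ib,ic) SHARED(U,W,FS,F)
DO ic = 1, 2
  DO ib = 1, 2
    DO ia = 1, 2
      ! each (ia,ib,ic) writes its own slice of U, so there is no race between iterations
      U(:,ia,:,ib,ic) = W(ic,ia,ib,1)*matmul(FS(:,ia,:,1), transpose(F(:,ib,:,1))) &
                      + W(ic,ia,ib,2)*matmul(FS(:,ia,:,2), transpose(F(:,ib,:,2)))
    ENDDO
  ENDDO
ENDDO
!$OMP END PARALLEL DO
With collapse(3) there are only 8 parallel iterations, so at most 8 threads are useful here; with more cores it may pay off to keep this loop serial and let a threaded BLAS routine (dgemm) parallelize each large matrix product instead.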
Background
I am simulating the motion of N charged particles in molecular dynamics with Fortran90 and OpenMP. The analytical expression of the force applied to each ion i is known and is a function of the position of ion i and of the other ions (r_x, r_y, r_z). I compute the Coulomb interaction between each pair of ions using a parallelised 2-nested do loop. I determine the acceleration (a2_x, a2_y, a2_z) of each ion at the end of the loop (then update velocity and position with velocity-Verlet).
Method
I use the following code in my program to compute the Coulomb forces applied to each ion. I compute the acceleration (a2_x) at the next time step, starting from the position (r_x) at the current time step. It is a 3D problem; I include all the lines, but most of them just repeat the same operations for x, y and z, so on a first read you can consider only the _x variables to see how this works.
I parallelize my loop over C threads; ia and ib are arrays used to split the N ions into C parts. For instance, for C=4 threads and N=16 ions (see edit remarks below):
integer, parameter :: ia(C) = [1,5,9,13]
integer, parameter :: ib(C) = [4,8,12,16]
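For reference, a short sketch of how such bounds could be computed for arbitrary N and C instead of being hard-coded (illustrative only, not part of my program; C and N are assumed to be integer parameters defined elsewhere):
integer :: it, chunk
integer :: ia(C), ib(C)
chunk = (N + C - 1)/C            ! ceiling of N/C
do it = 1, C
   ia(it) = (it - 1)*chunk + 1   ! first ion handled by thread it
   ib(it) = min(it*chunk, N)     ! last ion handled by thread it
end do
For N=16 and C=4 this reproduces ia=[1,5,9,13] and ib=[4,8,12,16].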
Then Coulomb is computed as follows
!$omp parallel default(none) &
!$omp private(im, i,j,rji,r2inv) &
!$omp firstprivate(r_x,r_y,r_z, N, ia, ib) &
!$omp shared(a2_x, a2_y, a2_z)
im = omp_get_thread_num() + 1 ! thread index, 1-based
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
do i = ia(im), ib(im) ! loop over the ions assigned to this thread
do j = 1, N ! loop over all ions
rji(1) = r_x(j) - r_x(i) ! distance between the ion i and j over x
rji(2) = r_y(j) - r_y(i) ! over y
rji(3) = r_z(j) - r_z(i) ! over z
! then compute the inverse square root of distance between the current ion i and the neighbor j
r2inv = 1.d0/dsqrt(rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
r2inv = r2inv * r2inv * r2inv * alpha(1) ! alpha is 1/4.pi.eps0
! computation of the accelerations
a2_x(i) = a2_x(i) - rji(1)*r2inv
a2_y(i) = a2_y(i) - rji(2)*r2inv
a2_z(i) = a2_z(i) - rji(3)*r2inv
enddo
enddo
!$omp end parallel
Problem
I am trying to optimize this time-consuming part of my program. The number of operations is quite high and scales as N². Can you give me your opinion on this code? I have some specific questions.
I have been told I should have the positions r_x, r_y and r_z as private variables, which seems counter-intuitive to me because I want to enter that loop using the previously defined positions of the ions, so I use firstprivate. Is that right?
I am not sure that the parallelisation is optimal regarding the other variables. Shouldn't rji and r2inv be shared? To compute the distance between ions i and j, I go "beyond" threads, if you see what I mean: I need information about ions handled by two different threads.
Is the way I split the ions in the first do loop optimal?
For each ion I loop over all ions, which induces a division by zero when the distance between ion i and itself is computed. To prevent this I add a softening variable with a very small value, so the denominator is never exactly zero. I do this to avoid an if (i==j) test that would be time consuming.
Also, isn't the square root time consuming as well?
For any additional detail feel free to ask.
Edit (Remarks)
My computer has a 10-core Xeon W-2155 CPU and 32 GB of RAM. I intend to simulate around 1000 ions, eventually up to 4000, which requires a lot of time.
I have this Coulomb subroutine among other subroutines that may consume some CPU time. For instance, one routine that may be time consuming is devoted to generating random numbers for each ion, depending on whether it is already excited or not, and applying the correct effect depending on whether it absorbs a photon or not. So that is a lot of RNG calls and branches for each ion.
Edit (Test of the propositions)
Using !$omp do in combination with schedule(dynamic,1), schedule(guided) or schedule(nonmonotonic:dynamic), and/or collapse(2), did not improve the run time; it made it at least three times longer. It has been suggested that the number of elements in my simulations (N) is too low to see a significant improvement. If I ever simulate a much higher number of elements (4096, 8192, ...) I will try those options.
Using !$omp do rather than a home-made ion distribution among cores proved equivalent in terms of run time. Since it is easier to implement, I will keep it.
Replacing the inverse dsqrt by **(-0.5) proved equivalent in terms of run time.
Delaying the square root and combining it with the third power of r2inv was also equivalent, so I replace the whole series of operations by **(-1.5).
Same idea with rji(1)*r2inv: I compute rji*r2inv once and only use the result in the following lines.
Edit 2 (Test with !$omp reduction(+:...))
I have tested the program with the following instructions
original which is the program I present in my question.
!$omp do schedule(dynamic,1)
!$omp reduction(+:a2_x,a2_y,a2_z) with schedule(dynamic,1).
!$omp reduction(+:a2_x,a2_y,a2_z) with schedule(guided) and do i = 2, N; do j = 1, i-1 for the loops (half the work).
for 1024 and 16384 ions. It turns out my original version is still faster for me, but the reduction versions are not as "catastrophic" as the previous tests without reduction.
Version                               N = 1024    N = 16384
original                              84 s        15194 s
schedule(dynamic,1)                   1300 s      not tested
reduction and schedule(dynamic,1)     123 s       24860 s
reduction and schedule(guided) (2)    121 s       24786 s
What is weird is that @PierU still gets a faster computation with reduction, while for me it is not optimal. Where could such a difference come from?
Hypotheses
Does the fact that I have 10 cores make the workload lighter on each core for a given number of ions?
I use double precision; maybe single precision would be faster?
Do you have the AVX-512 instruction set? It has specific hardware to compute the inverse square root much faster (see this article).
The bottleneck is elsewhere in my program. I am aware I should benchmark the Coulomb part alone, but I wanted to test it in the context of my whole program to see if it really shortens the computation time. I have a section with a lot of where constructs and RNG calls; perhaps I should work on this.
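To time only the Coulomb part, a minimal driver sketch using omp_get_wtime (the coulomb interface below is assumed, not taken from my actual program):
program time_coulomb            ! illustrative driver, not part of the original code
  use omp_lib
  implicit none
  integer, parameter :: N = 1024
  double precision :: r_x(N), r_y(N), r_z(N), a2_x(N), a2_y(N), a2_z(N)
  double precision :: t0, t1
  call random_number(r_x); call random_number(r_y); call random_number(r_z)
  a2_x = 0.d0; a2_y = 0.d0; a2_z = 0.d0
  t0 = omp_get_wtime()                               ! wall-clock time, unlike cpu_time
  call coulomb(N, r_x, r_y, r_z, a2_x, a2_y, a2_z)   ! assumed interface of the force routine
  t1 = omp_get_wtime()
  print *, 'Coulomb kernel alone:', t1 - t0, 's'
end program time_coulomb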
Generally speaking, the variables that you only need to read in the parallel region can be shared. However, having firstprivate copies for each thread can give better performance in some cases (the copies can live in the local cache of each core), particularly for variables that are read repeatedly.
definitely not! If you do that, there will be a race condition on these variables
looks ok, but it is generally simpler (and at worst as efficient) to use an !$OMP DO directive instead of manually distributing the work to the different threads
!$OMP DO
do i = 1, N ! loop over all ions
do j = 1, N ! loop over all ions
why not, provided that you are able to choose a softening value that doesn't alter your simulation (this is something that you have to test against the if solution)
It is somewhat, but at some point you cannot avoid an exponentiation. I would delay the sqrt and the division like this:
r2inv = (rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
r2inv = r2inv**(-1.5) * alpha(1) ! alpha is 1/4.pi.eps0
Dividing the work by 2
The forces are symmetrical and can be computed only once for a given (i,j) pair. This also naturally avoids the i==j case and the softening value. A reduction clause is needed on the a2* arrays, as the same elements can be updated by different threads. The workload between iterations is highly unbalanced, though, and a dynamic clause is needed. This is actually a case where manually distributing the iterations to the threads can be more efficient ;) ...
!$omp parallel default(none) &
!$omp private(im, i,j,rji,r2inv) &
!$omp firstprivate(r_x,r_y,r_z, N, ia, ib) &
!$omp reduction(+:a2_x, a2_y, a2_z)
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
!$omp do schedule(dynamic,1)
do i = 1, N-1 ! loop over all ions
do j = i+1, N ! loop over some ions
rji(1) = r_x(j) - r_x(i) ! distance between the ion i and j over x
rji(2) = r_y(j) - r_y(i) ! over y
rji(3) = r_z(j) - r_z(i) ! over z
! squared distance between ion i and ion j, then alpha / r**3
r2inv = (rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3))
r2inv = r2inv**(-1.5) * alpha(1) ! alpha is 1/4.pi.eps0
! computation of the accelerations
rji(:) = rji(:)*r2inv
a2_x(i) = a2_x(i) - rji(1)
a2_y(i) = a2_y(i) - rji(2)
a2_z(i) = a2_z(i) - rji(3)
a2_x(j) = a2_x(j) + rji(1)
a2_y(j) = a2_y(j) + rji(2)
a2_z(j) = a2_z(j) + rji(3)
enddo
enddo
!$omp end do
!$omp end parallel
Alternatively, a guided clause could be used, with the iteration order changed so that the low-workload iterations come first:
!$omp do schedule(guided)
do i = 2, N ! loop over all ions
do j = 1, i-1 ! loop over some ions
TIMING
I have timed the latter code (work divided by 2) on an old Core i5 from 2011 (4 cores). Code compiled with gfortran 12.
Configurations: no OpenMP; OpenMP with 1 thread; 4 threads with no explicit schedule (static by default); schedule(dynamic); schedule(nonmonotonic:dynamic); schedule(guided). The guided schedule was timed with 2 code versions: (1) do i=1,N-1; do j=i+1,N, and (2) do i=2,N; do j=1,i-1.
              N=256    N=1024   N=4096   N=16384   N=65536
no omp        0.0016   0.026    0.41     6.9       116
1 thread      0.0019   0.027    0.48     8.0       118
4 threads     0.0014   0.013    0.20     3.4       55
dynamic       0.0018   0.020    0.29     5.3       84
nonmonotonic  -        -        0.29     5.2       85
guided (1)    0.0014   0.013    0.21     3.7       61
guided (2)    0.0009   0.093    0.13     2.2       38
The guided schedule with the low-workload iterations first wins, and I get some speed-up even for low values of N. It's important to note, however, that the behavior can differ on a different CPU and with a different compiler.
I have also timed the code with do i=1,N; do j=1,N (as the work is balanced between iterations, there is no need for sophisticated schedule clauses):
              N=256    N=1024   N=4096   N=16384   N=65536
no omp        0.0028   0.047    0.72     11.5      183
4 threads     0.0013   0.019    0.25     4.0       71
I did not see how you limit the number of threads; this could be done with a num_threads clause on the !$omp parallel directive (or with omp_set_num_threads).
The following could be effective with schedule(static).
Should "if ( i==j ) cycle" be included ?
Is N = 16 for your code example ?
dimension rjm(3)
!$omp parallel do default(none) &
!$omp private (i,j, rji, r2inv, rjm) &
!$omp shared (a2_x, a2_y, a2_z, r_x,r_y,r_z, N, softening, alpha ) &
!$omp schedule (static)
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
do i = 1, 16 ! loop over all ions (replaces the ia(im)/ib(im) split); is N = 16 ?
rjm = 0
do j = 1, N ! loop over all ions
if ( i==j ) cycle
rji(1) = r_x(j) - r_x(i) ! distance between the ion i and j over x
rji(2) = r_y(j) - r_y(i) ! over y
rji(3) = r_z(j) - r_z(i) ! over z
! then compute the inverse square root of distance between the current ion i and the neighbor j
r2inv = sqrt ( rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening )
r2inv = alpha(1) / ( r2inv * r2inv * r2inv ) ! alpha is 1/4.pi.eps0
! computation of the accelerations
! rjm = rjm + rji*r2inv
rjm(1) = rjm(1) + rji(1)*r2inv
rjm(2) = rjm(2) + rji(2)*r2inv
rjm(3) = rjm(3) + rji(3)*r2inv
end do
a2_x(i) = a2_x(i) - rjm(1)
a2_y(i) = a2_y(i) - rjm(2)
a2_z(i) = a2_z(i) - rjm(3)
end do
!$omp end parallel do
I have no experience of using firstprivate to shift shared variables onto the stack for improved performance. Is it worthwhile?
I'm running a program that contains one loop to calculate two arrays (alpha_x and alpha_y) using input arrays (X, Y, M_list and zeta_list) from Python. They work fine when I run the Fortran program without OpenMP: the array values are the same as if I did the calculation in Python alone. However, when I add OpenMP support and use multiple cores, my output arrays alpha_x and alpha_y have different values each time I run the program! The code is below:
PROGRAM LensingTest1
IMPLICIT NONE
DOUBLE PRECISION,DIMENSION(1,1000)::M_list
DOUBLE PRECISION,DIMENSION(1000,1000)::X,Y,z_m_z_x,z_m_z_y,dist_z_m_z, alpha_x, alpha_y
DOUBLE PRECISION,DIMENSION(2,1000)::zeta_list
REAL::start_time,stop_time
INTEGER::i,j
open(10,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/M_list.dat')
open(9,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/zeta_list.dat')
open(8,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/X.dat')
open(7,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/Y.dat')
read(10,*)M_list
read(9,*)zeta_list
read(8,*)X
read(7,*)Y
call cpu_time(start_time)
!$OMP PARALLEL DO
do i=1,size(M_list,2),1
z_m_z_x = X - zeta_list(1,i)
z_m_z_y = Y - zeta_list(2,i)
dist_z_m_z = z_m_z_x**2 + z_m_z_y**2
alpha_x = alpha_x + (M_list(1,i)* z_m_z_x / dist_z_m_z)
alpha_y = alpha_y + (M_list(1,i)* z_m_z_y / dist_z_m_z)
end do
!$OMP END PARALLEL DO
call cpu_time(stop_time)
print *, "Setup time:", &
stop_time - start_time, "seconds"
open(6,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/alpha_x.dat')
do i =1,1000
write(6,'(1000F14.7)')(alpha_x(i,j), j=1,1000)
end do
open(5,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/alpha_y.dat')
do i =1,1000
write(5,'(1000F14.7)')(alpha_y(i,j), j=1,1000)
end do
stop
END PROGRAM LensingTest1
The only difference is that I add the !$OMP PARALLEL DO and !$OMP END PARALLEL DO directives for OpenMP support. I compile with gfortran -fopenmp script.f90 and then export OMP_NUM_THREADS=4.
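For reference, the changing results come from a race condition: z_m_z_x, z_m_z_y, dist_z_m_z, alpha_x and alpha_y are shared and written by all threads at once. A minimal sketch of one possible fix (untested, keeping the original names) makes the scratch arrays private and accumulates with a reduction:
alpha_x = 0.d0   ! the accumulators must start from zero
alpha_y = 0.d0
!$OMP PARALLEL DO PRIVATE(i, z_m_z_x, z_m_z_y, dist_z_m_z) &
!$OMP REDUCTION(+:alpha_x, alpha_y)
do i = 1, size(M_list,2)
   z_m_z_x    = X - zeta_list(1,i)
   z_m_z_y    = Y - zeta_list(2,i)
   dist_z_m_z = z_m_z_x**2 + z_m_z_y**2
   alpha_x    = alpha_x + (M_list(1,i)*z_m_z_x/dist_z_m_z)
   alpha_y    = alpha_y + (M_list(1,i)*z_m_z_y/dist_z_m_z)
end do
!$OMP END PARALLEL DO
The reduction gives every thread private copies of the two 1000x1000 accumulators (about 8 MB each), and the private scratch arrays may require a larger per-thread stack (OMP_STACKSIZE). Note also that cpu_time adds up CPU time over all threads, so omp_get_wtime() is the better clock for measuring the parallel speed-up.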
I am trying to parallelize the calculation of gravitational forces in my program using OpenMP. The calculation of the distances (R and R2) is no problem but the forces/accelerations (A) come out wrong. I know that it has something to do with race conditions in the summation. I have experimented a bit with atomic and critical constructs but could not find a solution. Also, I'm not sure which variables should be private and why.
Does someone with more experience in using OpenMP have a suggestion on how to correct this in the following code example?
A = 0.0
!$omp parallel do
do i = 1, Nobj
do j = i + 1, Nobj
R2(i,j) = (X(j,1) - X(i,1))**2 &
+ (X(j,2) - X(i,2))**2 &
+ (X(j,3) - X(i,3))**2
R(i,j) = sqrt(R2(i,j))
do k = 1, 3
A(i,k) = A(i,k) + ((mass_2_acc(i,j) / R2(i,j)) * ((X(j,k) - X(i,k)) / R(i,j)))
A(j,k) = A(j,k) + ((mass_2_acc(i,j) / R2(i,j)) * ((X(i,k) - X(j,k)) / R(i,j)))
enddo
enddo
A(i,:) = A(i,:) * G / mass_acc(i)
enddo
!$omp end parallel do
You are modifying A(j,k), but neither j nor k is "local" to a thread, since the thread-parallel index is i. What I mean is that neither of those index ranges is restricted to a particular thread; all threads will update all of A(j,k), hence the race condition.
Things you can do: split up the R and A calculations, or don't use the symmetry to update A.
Also, Fortran is column-major and you are traversing the outer index first, which is bad for performance.
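Another option, in the same spirit as the reduction used in the Coulomb code earlier, is to keep the symmetric update and protect A with a reduction clause; a sketch (untested, variable names taken from the question):
A = 0.0
!$omp parallel do private(i, j, k) reduction(+:A)
do i = 1, Nobj
   do j = i + 1, Nobj
      ! R2(i,j) and R(i,j) are written only by the thread that owns i, so they can stay shared
      R2(i,j) = (X(j,1) - X(i,1))**2 &
              + (X(j,2) - X(i,2))**2 &
              + (X(j,3) - X(i,3))**2
      R(i,j)  = sqrt(R2(i,j))
      do k = 1, 3
         A(i,k) = A(i,k) + (mass_2_acc(i,j)/R2(i,j)) * ((X(j,k) - X(i,k))/R(i,j))
         A(j,k) = A(j,k) + (mass_2_acc(i,j)/R2(i,j)) * ((X(i,k) - X(j,k))/R(i,j))
      end do
   end do
end do
!$omp end parallel do
! the scaling must happen after the reduction has combined all partial sums
do i = 1, Nobj
   A(i,:) = A(i,:) * G / mass_acc(i)
end do
Because the inner loop shrinks as i grows, the iterations are unbalanced, so a schedule(dynamic) or schedule(guided) clause may help here too, as in the Coulomb timings above.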
My code works well without parallelization, but not with parallelization.
My code consists of a module that contains an array of size 100x100x100.
real(8), dimension(1:100,1:100,1:100) :: array
This module is placed at the lowest hierarchical level so it can be used by any other module. Then another module contains a do loop which calls a subroutine.
do i=1,100
do j=1,100
do k=1,100
call some_calculation(i,j,k)
enddo
enddo
enddo
The subroutine some_calculation performs some arithmetic using array(i,j,k) and then updates array(i,j,k). The input values i, j, k correspond to accessing array(i,j,k). But when I parallelize the outermost do loop
!$OMP PARALLEL DO
do i=1,100
do j=1,100
do k=1,100
call some_calculation(i,j,k)
enddo
enddo
enddo
!$OMP END PARALLEL DO
I get different results in array. Does anyone have a clue about this? Thank you!
Your loop indices should be private:
!$OMP PARALLEL DO private (i,j,k)
do i=1,100
do j=1,100
do k=1,100
call some_calculation(i,j,k)
enddo
enddo
enddo
!$OMP END PARALLEL DO
Here's an array A of length N whose values are a permutation of 1 to N (no duplicates).
I want to get the array B which satisfies B(A(i)) = i for i in [1, N].
e.g.
for A=[4,2,1,3], I want to get
B=[3,2,4,1]
I've written a Fortran code with OpenMP, as shown below; array A is given by another procedure. For N = 1024^3 (~10^9) it takes about 40 seconds, and assigning more threads helps little (it takes similar time for OMP_NUM_THREADS = 1, 4 or 16). It seems OpenMP does not work well for very large N (however it works well for N = 10^7).
I wonder if there is a better algorithm to do the assignment to B, or a way to make OpenMP effective here.
the code:
subroutine fill_inverse_array(leng, A, B)
use omp_lib
implicit none
integer*4, intent(in) :: leng
integer*4, intent(in) :: A(leng)
integer*4, intent(out) :: B(leng)
integer*4 :: i
!$omp parallel do private(i) firstprivate(leng) shared(A, B)
do i=1,leng
B(A(i))=i
enddo
!$omp end parallel do
end subroutine
It's a slow day here, so I ran some tests. I managed to squeeze out a useful increase in speed by rewriting the expression inside the loop from B(A(i)) = i to B(i) = A(A(i)) (which gives the same result for the example permutation, though for a general permutation the two only coincide when applying A three times returns the identity). I think this has a positive impact on performance because it is a little more cache-friendly: the writes to B become sequential.
I used the following code to test various alternatives:
A = random_permutation(length)
CALL system_clock(st1)
B = A(A)
CALL system_clock(nd1)
CALL system_clock(st2)
DO i = 1, length
B(i) = A(A(i))
END DO
CALL system_clock(nd2)
CALL system_clock(st3)
!$omp parallel do shared(A,B,length) private(i)
DO i = 1, length
B(i) = A(A(i))
END DO
!$omp end parallel do
CALL system_clock(nd3)
CALL system_clock(st4)
DO i = 1, length
B(A(i)) = i
END DO
CALL system_clock(nd4)
CALL system_clock(st5)
!$omp parallel do shared(A,B,length) private(i)
DO i = 1, length
B(A(i)) = i
END DO
!$omp end parallel do
CALL system_clock(nd5)
As you can see, there are 5 timed sections in this code. The first is a simple one-line revision of your original code, to provide a baseline. This is followed by an unparallelised and then a parallelised version of your loop, rewritten along the lines I outlined above. Sections 4 and 5 reproduce your original order of operations, first unparallelised, then parallelised.
Over a series of four runs I got the following average times. In all cases I was using arrays of 10**9 elements and 8 threads. I tinkered a little and found that using 16 threads (hyperthreading) gave very little improvement, but that 8 was a definite improvement over fewer.
Sec 1: 34.5s
Sec 2: 32.1s
Sec 3: 6.4s
Sec 4: 31.5s
Sec 5: 8.6s
Make of those numbers what you will. As noted above, I suspect that my version is marginally faster than your version because it makes better use of cache.
I'm using Intel Fortran 14.0.1.139 on a 64-bit Windows 7 machine with 10GB RAM. I used the '/O2' option for compiler optimisation.