Parallelizing pairwise gravitational force calculation with OpenMP in Fortran 90 - arrays

I am trying to parallelize the calculation of gravitational forces in my program using OpenMP. The calculation of the distances (R and R2) is no problem but the forces/accelerations (A) come out wrong. I know that it has something to do with race conditions in the summation. I have experimented a bit with atomic and critical constructs but could not find a solution. Also, I'm not sure which variables should be private and why.
Does someone with more experience in using OpenMP have a suggestion on how to correct this in the following code example?
A = 0.0
!$omp parallel do
do i = 1, Nobj
   do j = i + 1, Nobj
      R2(i,j) = (X(j,1) - X(i,1))**2 &
              + (X(j,2) - X(i,2))**2 &
              + (X(j,3) - X(i,3))**2
      R(i,j) = sqrt(R2(i,j))
      do k = 1, 3
         A(i,k) = A(i,k) + ((mass_2_acc(i,j) / R2(i,j)) * ((X(j,k) - X(i,k)) / R(i,j)))
         A(j,k) = A(j,k) + ((mass_2_acc(i,j) / R2(i,j)) * ((X(i,k) - X(j,k)) / R(i,j)))
      enddo
   enddo
   A(i,:) = A(i,:) * G / mass_acc(i)
enddo
!$omp end parallel do

You are modifying A(j,k), but neither j nor k is "local" to a thread, since the thread-parallel index is i. What I mean is that neither of those index ranges is restricted to a particular thread: all threads can update the same A(j,k), hence the race condition.
Things you can do: split up the R and A calculations, or do not use the symmetry to update A.
Also, Fortran is column major and your inner loops vary the second index of the arrays fastest, which is bad for performance.
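As a minimal sketch of the "do not use symmetry" option (assuming mass_2_acc(i,j) is, or can be made, symmetric; dX, r2 and r are local helper variables of my own, not from your code), each thread then only ever writes A(i,:) for its own values of i:
A = 0.0
!$omp parallel do private(j, k, dX, r2, r)
do i = 1, Nobj
   do j = 1, Nobj
      if (j == i) cycle                      ! skip the self-interaction
      dX(:) = X(j,:) - X(i,:)                ! dX is a local real work array of size 3
      r2 = dX(1)**2 + dX(2)**2 + dX(3)**2
      r  = sqrt(r2)
      do k = 1, 3
         A(i,k) = A(i,k) + mass_2_acc(i,j) / r2 * dX(k) / r
      enddo
   enddo
   A(i,:) = A(i,:) * G / mass_acc(i)         ! safe: only the thread owning i writes A(i,:)
enddo
!$omp end parallel do
This does twice the floating-point work but removes the write to A(j,k) entirely. The alternative is to keep the half loop and add reduction(+:A) to the directive, in which case the final scaling by G/mass_acc(i) must move to a separate loop after the parallel region, because A(i,:) is only complete once all threads have finished.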

Related

Optimization of OpenMP parallel do loop in Fortran

Background
I am simulating the motion of N charged particles in molecular dynamics with Fortran 90 and OpenMP. The analytical expression of the force applied to each ion i is known and is a function of the position of ion i and of the other ions (r_x, r_y, r_z). I compute the Coulomb interaction between each pair of ions using a parallelised nested double do loop. I can determine the acceleration (a2_x, a2_y, a2_z) of each ion at the end of the loop (then update velocity and position with velocity-Verlet).
Method
I use the following code in my program to compute the Coulomb forces applied to each ion. I compute the acceleration (a2_x) at the next time step, starting from the position (r_x) at the current time step. It is a 3D problem; I put in all the lines, but most of them just repeat the same thing for x, y and z, so on a first read you can consider only the _x variables to see how it works.
I parallelize my loop over C threads; ia and ib are arrays used to split the N ions into C parts. For instance, for C=4 threads and N=16 ions (see the edit remarks below):
integer, parameter :: ia(C) = [1,5,9,13]
integer, parameter :: ib(C) = [4,8,12,16]
Then the Coulomb interaction is computed as follows:
!$omp parallel default(none) &
!$omp private(im, i, j, rji, r2inv) &
!$omp firstprivate(r_x, r_y, r_z, N, ia, ib) &
!$omp shared(a2_x, a2_y, a2_z)
im = omp_get_thread_num() + 1   ! index of the current thread (1-based)
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
do i = ia(im), ib(im)   ! loop over the ions assigned to this thread
   do j = 1, N          ! loop over all ions
      rji(1) = r_x(j) - r_x(i)   ! distance between ions i and j over x
      rji(2) = r_y(j) - r_y(i)   ! over y
      rji(3) = r_z(j) - r_z(i)   ! over z
      ! then compute the inverse square root of the distance between the current ion i and the neighbour j
      r2inv = 1.d0/dsqrt(rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
      r2inv = r2inv * r2inv * r2inv * alpha(1)   ! alpha is 1/4.pi.eps0
      ! computation of the accelerations
      a2_x(i) = a2_x(i) - rji(1)*r2inv
      a2_y(i) = a2_y(i) - rji(2)*r2inv
      a2_z(i) = a2_z(i) - rji(3)*r2inv
   enddo
enddo
!$omp end parallel
Problem
I am trying to optimize this time-consuming part of my program. The number of operations is quite high and scales quickly with N. Can you give me your opinion on this program? I have some specific questions.
I have been told I should make the positions r_x, r_y and r_z private variables, which seems counter-intuitive to me because I want to enter the loop using the previously computed positions of the ions, so I use firstprivate. Is that right?
I am not sure that the parallelisation is optimal regarding the other variables. Shouldn't rji and r2inv be shared? Because to compute the distance between ions i and j, I go "beyond" threads, if you see what I mean: I need information about ions spread over two different threads.
Is the way I split the ions in the first do loop optimal?
I loop over all ions for each ion, which induces a division by zero when the distance between ion i and itself is computed. To prevent this I have a softening variable set to a very small value so the distance is never exactly zero. I do this to avoid an if (i == j) test that would be time consuming.
Also, isn't the square root time consuming as well?
For any additional detail feel free to ask.
Edit (Remarks)
My computer has a 10-core Xeon W-2155 CPU and 32 GB of RAM. I intend to simulate around 1000 ions, while thinking about 4000, which requires a lot of time.
I have this Coulomb subroutine among other subroutines that may consume CPU time. For instance, one routine that may be time consuming is devoted to generating random numbers for each ion, depending on whether they are already excited or not, and applying the correct effect depending on whether or not they absorb a photon. So that is a lot of RNG and branching for each ion.
Edit (Test of the propositions)
Using !$omp do in combination with schedule(dynamic,1), schedule(guided) or schedule(nonmonotonic:dynamic), and/or collapse(2), did not improve the run time; it made it at least three times longer. It was suggested that the number of elements in my simulations (N) is too low to see a significant improvement. If I ever try to simulate much higher numbers of elements (4096, 8192, ...) I will try those options.
Using !$omp do rather than a home-made ion distribution among cores gave an equivalent run time. Since it is easier to implement, I will keep it.
Replacing the inverse dsqrt by **(-0.5) was equivalent in terms of run time.
Delaying the square root and combining it with the third power of r2inv was also equivalent, so I replaced the whole series of operations by **(-1.5).
Same idea with rji(1)*r2inv: I compute rji*r2inv once and only use the result in the following lines.
Edit 2 (Test with !$omp reduction(+:...))
I have tested the program with the following variants:
original, which is the program I present in my question;
!$omp do schedule(dynamic,1);
!$omp reduction(+:a2_x,a2_y,a2_z) with schedule(dynamic,1);
!$omp reduction(+:a2_x,a2_y,a2_z) with schedule(guided) and do i = 2, N / do j = 1, i-1 for the loop (half the work);
for 1024 and 16384 ions. It turns out my original version is still faster for me, but the reduction versions are not as "catastrophic" as the previous tests without reduction.
Version                            | N = 1024 | N = 16384
original                           | 84 s     | 15194 s
schedule(dynamic,1)                | 1300 s   | not tested
reduction and schedule(dynamic,1)  | 123 s    | 24860 s
reduction and schedule(guided) (2) | 121 s    | 24786 s
What is strange is that @PierU still gets a faster computation with reduction, while for me it is not optimal. Where could such a difference come from?
Hypotheses
Does the fact that I have 10 cores make the workload on each core lighter for a given number of ions?
I use double precision; maybe single precision would be faster?
Do you have the AVX-512 instruction set? It has specific hardware to compute the inverse square root much faster (see this article).
The bottleneck is elsewhere in my program. I am aware I should only benchmark the Coulomb part, but I wanted to test it in the context of my program to see if it really shortens the computation time. I have a section with a lot of where constructs and RNG; perhaps I should work on that.
Generally speaking, the variables that you only need to read in the parallel region can be shared. However, having firstprivate copies for each thread can give better performance in some cases (the copies can sit in the local cache of each core), particularly for variables that are repeatedly read.
Definitely not! If you do that, there will be a race condition on these variables.
Looks OK, but it is generally simpler (and at worst as efficient) to use an !$OMP DO directive instead of manually distributing the work to the different threads:
!$OMP DO
do i = 1, N ! loop over all ions
do j = 1, N ! loop over all ions
Why not, provided that you are able to choose a softening value that doesn't alter your simulation (this is something that you have to test against the if solution).
It is, somewhat, but at some point you cannot avoid an exponentiation. I would delay the sqrt and the division like this:
r2inv = (rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
r2inv = r2inv**(-1.5) * alpha(1) ! alpha is 1/4.pi.eps0
Dividing the work by 2
The forces are symmetric and can be computed only once for a given (i,j) pair. This also naturally avoids the i==j case and the softening value. A reduction clause is needed on the a2_* arrays, as the same elements can be updated by different threads. The workload between iterations is highly unbalanced, though, so a dynamic schedule is needed. This is actually a case where manually distributing the iterations to the threads can be more efficient ;) ...
!$omp parallel default(none) &
!$omp private(im, i, j, rji, r2inv) &
!$omp firstprivate(r_x, r_y, r_z, N, ia, ib) &
!$omp reduction(+:a2_x, a2_y, a2_z)
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
!$omp do schedule(dynamic,1)
do i = 1, N-1        ! loop over all ions
   do j = i+1, N     ! loop over some ions
      rji(1) = r_x(j) - r_x(i)   ! distance between the ion i and j over x
      rji(2) = r_y(j) - r_y(i)   ! over y
      rji(3) = r_z(j) - r_z(i)   ! over z
      ! then compute the inverse square root of distance between the current ion i and the neighbor j
      r2inv = (rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3))
      r2inv = r2inv**(-1.5) * alpha(1)   ! alpha is 1/4.pi.eps0
      ! computation of the accelerations
      rji(:) = rji(:)*r2inv
      a2_x(i) = a2_x(i) - rji(1)
      a2_y(i) = a2_y(i) - rji(2)
      a2_z(i) = a2_z(i) - rji(3)
      a2_x(j) = a2_x(j) + rji(1)
      a2_y(j) = a2_y(j) + rji(2)
      a2_z(j) = a2_z(j) + rji(3)
   enddo
enddo
!$omp end do
!$omp end parallel
Alternatively, a guided clause could be used, with some changes in the iterations to have the low workloads in the first ones:
!$omp do schedule(guided)
do i = 2, N          ! loop over all ions
   do j = 1, i-1     ! loop over some ions
TIMING
I have timed the latter code (work divided by 2) on an old Core i5 from 2011 (4 cores), compiled with gfortran 12.
No OpenMP / OpenMP with 1 thread / 4 threads without an explicit schedule (that is, static by default) / schedule(dynamic) / schedule(nonmonotonic:dynamic) / schedule(guided). guided was timed with 2 code versions: (1) with do i=1,N-1; do j=i+1,N, (2) with do i=2,N; do j=1,i-1.
Version       | N=256  | N=1024 | N=4096 | N=16384 | N=65536
no omp        | 0.0016 | 0.026  | 0.41   | 6.9     | 116
1 thread      | 0.0019 | 0.027  | 0.48   | 8.0     | 118
4 threads     | 0.0014 | 0.013  | 0.20   | 3.4     | 55
dynamic       | 0.0018 | 0.020  | 0.29   | 5.3     | 84
nonmonotonic  | -      | -      | 0.29   | 5.2     | 85
guided (1)    | 0.0014 | 0.013  | 0.21   | 3.7     | 61
guided (2)    | 0.0009 | 0.093  | 0.13   | 2.2     | 38
The guided schedule with the low-workload iterations first wins, and I get some speed-up even for low values of N. It's important to note, however, that the behavior can differ on a different CPU and with a different compiler.
I have also timed the code with do i=1,N; do j=1,N (as the work is balanced between iterations, there is no need for sophisticated schedule clauses):
Version     | N=256  | N=1024 | N=4096 | N=16384 | N=65536
no omp      | 0.0028 | 0.047  | 0.72   | 11.5    | 183
4 threads   | 0.0013 | 0.019  | 0.25   | 4.0     | 71
I did not see how you limited the number of threads. This could be an additional !$omp directive.
The following could be effective with schedule(static).
Should "if ( i==j ) cycle" be included ?
Is N = 16 for your code example ?
dimension rjm(3)
!$omp parallel do default(none) &
!$omp private (i, j, rji, r2inv, rjm) &
!$omp shared (a2_x, a2_y, a2_z, r_x, r_y, r_z, N, softening, alpha) &
!$omp schedule (static)
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
do i = 1, 16   ! loop over ions; ignore ia, ib; is N = 16?
   rjm = 0
   do j = 1, N   ! loop over all ions
      if ( i==j ) cycle
      rji(1) = r_x(j) - r_x(i)   ! distance between the ion i and j over x
      rji(2) = r_y(j) - r_y(i)   ! over y
      rji(3) = r_z(j) - r_z(i)   ! over z
      ! then compute the inverse square root of distance between the current ion i and the neighbor j
      r2inv = sqrt ( rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening )
      r2inv = alpha(1) / ( r2inv * r2inv * r2inv )   ! alpha is 1/4.pi.eps0
      ! computation of the accelerations
      ! rjm = rjm + rji*r2inv
      rjm(1) = rjm(1) + rji(1)*r2inv
      rjm(2) = rjm(2) + rji(2)*r2inv
      rjm(3) = rjm(3) + rji(3)*r2inv
   end do
   a2_x(i) = a2_x(i) - rjm(1)
   a2_y(i) = a2_y(i) - rjm(2)
   a2_z(i) = a2_z(i) - rjm(3)
end do
!$omp end parallel do
I have no experience of using firstprivate to shift shared variables onto the stack for improved performance. Is it worthwhile?

Summing a fortran array with mask

I have a Fortran array a(i,j). I wish to sum it over dimension 2 (that is, over j) with the mask that j is not equal to i, i.e.,
a1 = 0
do j = 1, n
   if (j .ne. i) then
      a1 = a1 + a(i,j)
   endif
enddo
What is the way of doing this using the intrinsic sum function in Fortran? I found the intrinsic to be much faster than the explicit loop.
I thought of trying sum(a(i,:), j.ne.i), but this naturally gives an error. Also, if someone can suggest how to sum only the values of a(i,:) where abs(a(i,j)) is greater than, say, 0.01, it would be helpful.
You can easily avoid any branching for the off-diagonal case, and it should be much faster than creating a mask array and checking the mask. Branching (conditional jumps) is costly, even though branch prediction can be very efficient.
do j = 1, n
   do i = 1, j-1
      a1 = a1 + a(i,j)
   end do
   do i = j+1, n
      a1 = a1 + a(i,j)
   end do
end do
If you need your code to be fast and not short, you should test this kind of approach. In my tests it is much faster.
To answer your last question, you can use the WHERE construct to build a mask. For example,
logical :: y(3,3) = .false.
real :: x(3,3)
x = 1
x(1,1) = 0.1
x(2,2) = 0.1
x(3,3) = 0.1
print *, sum(x)
where (abs(x) > 0.25) y = .true.
print *, sum(x, y)
end
Whether this is better than nested do-loops is questionable.
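For the row-wise sum in the question, the optional mask argument of sum can also be used directly. A minimal sketch, assuming an integer helper array idx (my own name, not from the question) holding the column indices 1..n:
integer :: idx(n)
idx = (/ (j, j = 1, n) /)                                     ! 1, 2, ..., n
a1 = sum(a(i,:), mask = idx /= i)                             ! row i without the diagonal element
a1 = sum(a(i,:), mask = idx /= i .and. abs(a(i,:)) > 0.01)    ! both conditions combined
For the simple j /= i case, sum(a(i,:)) - a(i,i) is simpler still, as the answer below points out.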
I find that summing the whole array and then subtracting the sum of the diagonal elements can be about 2x faster.
a1 = 0
do i = 1, n
   a1 = a1 + a(i,i)
end do
a1 = sum(a) - a1

Do loop with condition - segmentation fault

I am doing a project on particle dynamics and I started by letting a particle (a sphere) fall from a certain height towards a fixed particle in the ground.
Inside a do loop (a time loop, from the initial time to a certain elapsed time, with a certain time step), I use Euler's method to integrate the positions and velocities, and I also calculate the forces (gravitational and elastic) and the collision conditions.
This model will later be generalised to 3, 4, ..., n particles (on the scale of hundreds of thousands), so I am using arrays to index the particles whose positions and velocities I integrate as time goes by. For that, I also put a do loop inside the time loop, over each particle, from 1 to N (the number of particles), and define N as 2 (since in this case I have only two particles). This is where the segmentation fault comes from, since I tell it to calculate three things when I only specified that I have two.
While trying to fix it: if I use i and i+1, then when i = 2, i+1 = 3 will be calculated, but I do not have a third particle. Similarly, if I use i-1 and i instead, then for i = 1 (where the loop starts) i-1 = 0, which makes no sense, since I do not have a "0th" particle. In another attempt, if I change the loop from 1,N to 1,N-1, it stops at i = 1 and nothing is calculated for particle 2. I have also thought about handling my particles in pairs, that is, particles 1 and 2, 2 and 3, 3 and 4, and so on (calculating i AND i+1 at each integration step, making the run time longer, which will cost me a lot later, since these simulations for a large number of particles can take weeks). But if I write output files that way, the file creation will be repeated for all particles except the first and the last (even more time wasted). How can I run it considering only the first and second particles, generalised to any number of particles that I choose?
do t = tmin, tmax, dt
   do i = 1, N
      call contact (xold(i), xold(i+1), r(i), r(i+1))
      call forces (m(i), g, k, r(i), r(i+1), xold(i), xold(i+1))
      call euler (xold(i), xnew(i), vold(i), vnew(i), dt, F(i), m(i))
      write(i, *) "t=", t, "x=", xold(i), "v=", vold(i), "dx=", dx, "force=", F(i)
   end do
end do
I am not quite sure what exactly you would like to achieve. For a particle simulation where each particle interacts with every other particle, you would need a second loop, wouldn't you?
Somewhat like this:
do i = 1, N
   do j = i+1, N
      call contact (xold(i), xold(j), r(i), r(j))
      call forces (m(i), g, k, r(i), r(j), xold(i), xold(j))
   end do
   call euler (xold(i), xnew(i), vold(i), vnew(i), dt, F(i), m(i))
   write(i, *) "t=", t, "x=", xold(i), "v=", vold(i), "dx=", dx, "force=", F(i)
end do
The inner loop will not be executed if i+1 > N, so everything should be fine. For N=2 you would just get one execution with i=1 and j=2.
Edit:
calculating i AND i+1 simultaneously for each integration, making the run time longer - which will cost me a lot of time later, since these simulations for a big number of particles can take weeks
You most likely do not want to do an all-to-all particle simulation for large numbers of particles. Most people use tree algorithms to speed this up considerably. Consider using an existing framework for that, like PEPC.
when i=2, i+1 = 2+1 = 3 will be calculated - but I do not have a third particle. In a similar way, if I put i-1 and i instead, for i = 1 (where the loop starts), i-1 = 0, but that doesn't make sense, since I do not have a "0th" particle
Modulo?
when i = 0, i % 2 = 0, (i % 2) + 1 = 1
when i = 1, i % 2 = 1, (i % 2) + 1 = 2
when i = 2, i % 2 = 0, (i % 2) + 1 = 1
when i = 3, i % 2 = 1, (i % 2) + 1 = 2
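In Fortran the same idea can be written with the mod intrinsic. A minimal sketch, assuming a cyclic "next particle" pairing is really what you want (ip1 is a hypothetical helper variable):
ip1 = mod(i, N) + 1          ! 1 -> 2, 2 -> 3, ..., N -> 1 (wraps around)
call contact (xold(i), xold(ip1), r(i), r(ip1))
call forces (m(i), g, k, r(i), r(ip1), xold(i), xold(ip1))
Whether a wrap-around pairing makes physical sense here depends on your model; for all-pairs interactions, the nested loop in the other answer is the usual approach.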

Decomposing a loop in OpenMP

I have several loops which follow this pattern:
do j = ms, mst
   ic = ic + 1
   df = mm(j)*data(ic)
   dff(1:3) = vec(1:3)*df*qm
end do
As you can see, the variable ic is updated at every iteration of j, and the resulting value of ic is used by the variable df. If I use the OpenMP atomic construct, it could reduce performance. Do you know an efficient way to deal with this kind of loop in OpenMP?
If ic is not changed apart from the increment (i.e. data is an array or a function w/o side-effects), there is a fixed relation between j and ic:
icStart = ic
delta = icStart - ms + 1
do j = ms, mst
   ic = delta + j
   df = mm(j)*data(ic)
   dff(1:3) = vec(1:3)*df*qm
end do
This can easily be parallelized with ic and df being thread-private. You still need to take care of dff, though, as it will give you a race condition as things stand...
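A minimal sketch of that parallelisation, assuming dff is only consumed inside each iteration (otherwise it needs its own reduction or accumulation strategy); lastprivate(ic) gives ic the value it would have after the serial loop:
delta = ic - ms + 1                   ! fixed offset between j and ic
!$omp parallel do private(df, dff) lastprivate(ic)
do j = ms, mst
   ic = delta + j                     ! no cross-iteration dependence any more
   df = mm(j)*data(ic)
   dff(1:3) = vec(1:3)*df*qm
enddo
!$omp end parallel do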
As you've written your code, the value of ic is increased by 1 at each iteration, just as the value of j is. A straightforward parallelisation of the loop, something like
!$OMP PARALLEL DO
do j = ms,mst
...
will distribute the work across threads giving each of them a discrete set of the values that j takes. Simple static scheduling of a 64-trip loop (with ms==1 and mst==64) across 4 threads will mean that thread 0 gets j = 1..16, thread 1 gets j = 17..32 and so on.
However, without care on your part the values of ic won't get neatly split across threads in this way. It looks to me, from the sample you've provided, as if the behaviour you want is for blocks of values of ic to accompany corresponding blocks of values of j -- they both increase by 1 at each trip round the loop.
Perhaps in the part of the code you haven't shown us ic is set to ms+k where k is some integer. In that case you could simply drop ic from inside the loop and write
!$OMP PARALLEL DO
do j = ms, mst
   df = mm(j)*data(j+k)
   dff(1:3) = vec(1:3)*df*qm
end do
Without knowing more about the relationship between j and ic, it's difficult to offer more pertinent advice than this. But the principle remains: rewrite ic as a function of j if you can, and avoid the difficulty inside the parallelised loop.

How to improve the execution time of this function?

Suppose that f(x,y) is a bivariate function as follows:
function [ f ] = f(x,y)
    UN = @(g) 1.6*(1-acos(g)/pi)-0.8;
    f = 1+UN(cos(0.5*pi*x+y));
end
How to improve execution time for function F(N) with the following code:
function [VAL] = F(N)
    x = 0:4/N:4;
    y = 0:2*pi/1000:2*pi;
    VAL = zeros(N+1,3);
    for i = 1:N+1
        val = zeros(1,N+1);
        for j = 1:N+1
            val(j) = trapz(y, f(0,y).*f(x(i),y).*f(x(j),y))/2/pi;
        end
        val = fftshift(fft(val))/N;
        l = (length(val)+1)/2;
        VAL(i,:) = val(l-1:l+1);
    end
    VAL = fftshift(fft(VAL,[],1),1)/N;
    L = (size(VAL,1)+1)/2;
    VAL = VAL(L-1:L+1,:);
end
Note that N=2^p where p>10, so please consider the memory limitations while optimizing the code using ndgrid, arrayfun, etc.
FYI: The code intends to find the central 3-by-3 submatrix of the fftn of
fun = @(a,b) trapz(y, f(0,y).*f(a,y).*f(b,y))/2/pi;
where a,b are in [0,4]. The key idea is that we can save memory using the code above, especially when N is very large. But the execution time is still an issue because of the nested loops. See the figure below for N=2^2.
This is not a full answer, but some possibly helpful hints:
0) The trivial: Are you sure you need numerics? Can't you do the computation analytically?
1) Do not use function handles:
function [ f ] = f(x,y)
f= 1+1.6*(1-acos(cos(0.5*pi*x+y))/pi)-0.8
end
2) Simplify analytically: acos(cos(x)) is the same as abs(mod(x + pi, 2 * pi) - pi), which should compute slightly faster. Or, instead of sampling and then numerically integrating, first integrate analytically and sample the result.
3) The FFT is a very efficient algorithm to compute the full DFT, but you don't need the full DFT. Since you only want the central 3 x 3 coefficients, it might be more efficient to directly apply the DFT definition and evaluate the formula only for those coefficients that you want. That should be both fast and memory-efficient.
4) If you repeatedly do this computation, it might be helpful to precompute DFT coefficients. Here, dftmtx from the Signal Processing toolbox can assist.
5) To get rid of the loops, think about the problem not in the form of computation instructions, but a single matrix operation. If you consider your input N x N matrix as a vector with N² elements, and your output 3 x 3 matrix as a 9-element vector, then the whole operation you apply (numerical integration via trapz and DFT via fft) appears to be a simple linear transform, which it should be possible to express as an N² x 9 matrix.
