Background
I am simulating the motion of N charged particles in molecular dynamics with Fortran90 and OpenMP. The analytical expression of forces applied to each ion i is known and is a function of the position of the ion i and the other ions (r_x,r_y,r_z). I compute the Coulomb interaction between each pair of ion using a parallelised 2-nested do loop. I can determine the acceleration (a2_x,a2_y,a2_z) of each ion at the end of the loop (then update velocity and position with velocity-Verlet).
Method
I use the following code in my program to compute the Coulomb forces applied to each ion. I compute the acceleration (a2_x) at the next time step, starting from the position (r_x) at the current time step. It is a 3D problem, I put all the lines but most of them are just same thing for x, y and z so at first read you can just consider the _x variables to see how this works.
I parallelize my loop over C threads, ia and ib are arrays used to split the N ions into C parts. For instance for C=4 threads and N=16 ions (Se edit remarks below)
integer, parameter :: ia(C) = [1,5,9,13]
integer, parameter :: ib(C) = [4,8,12,16]
Then Coulomb is computed as follows
!$omp parallel default(none) &
!$omp private(im, i,j,rji,r2inv) &
!$omp firstprivate(r_x,r_y,r_z, N, ia, ib) &
!$omp shared(a2_x, a2_y, a2_z)
im = omp_get_thread_num() + 1 ! How much threads
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
do i = ia(im,1), ib(im,1) ! loop over threads
do j = 1, N ! loop over all ions
rji(1) = r_x(j) - r_x(i) ! distance between the ion i and j over x
rji(2) = r_y(j) - r_y(i) ! over y
rji(3) = r_z(j) - r_z(i) ! over z
! then compute the inverse square root of distance between the current ion i and the neighbor j
r2inv = 1.d0/dsqrt(rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
r2inv = r2inv * r2inv * r2inv * alpha(1) ! alpha is 1/4.pi.eps0
! computation of the accelerations
a2_x(i) = a2_x(i) - rji(1)*r2inv
a2_y(i) = a2_y(i) - rji(2)*r2inv
a2_z(i) = a2_z(i) - rji(3)*r2inv
enddo
enddo
!$omp end parallel
Problematics
I am trying to optimize this time consuming part of my program. The number of operation is quite high, scales quickly with N. Can you tell me your opinion on this program ? I have some specific questions.
I have been told I should have the positions r_x, r_y and r_z as private variables, which seems counter-intuitive to me because I want to enter that loop using the previously defined positions of the ions, so i use firstprivate. Is that right ?
I am not sure that the parallelisation is optimal regarding the other variables. Shouldn't rji and r2inv be shared ? Because to compute the distance between ions i and j, I go "beyond" threads, you see what I mean ? I need info between ions spread over two different threads.
Is the way I split the ions in the first do optimal ?
I loop over all ions respectively for each ion, which will induce a division by zero when the distance between ion i and i is computed. To prevent this I have a softening variable defined at very small value so it is not exactly zero. I do this to avoid an if i==i that would be time consuming.
Also the square root is maybe also time consuming ?
For any additional detail feel free to ask.
Edit (Remarks)
My computer have a 10 core CPU Xeon W2155, 32 Go RAM. I intend to render around 1000 ions, while thinking about 4000, which requires a lot of time.
I have this Coulomb subroutine among other subroutine that may consume some CPU time. For instance one routine that may be time consuming is devoted to generating random numbers for each ion depending they are already excited or not, and applying the correct effect whether they absorb or not a photon. So that is a lot of RNG and if for each ion.
Edit (Test of the propositions)
Using !$omp do in combination with schedule(dynamic,1), or schedule(guided) or schedule(nonmonotonic:dynamic) and/or collapse(2) did not improve the run time. It made it at least three time longer. It is suggested the number of element in my simulations (N) is too low to see a significant improve. If I ever try to render much higher number of elements (4096, 8192 ...) I will try those options.
Using !$omp do rather than a home made ion distribution among cores did show equivalent in term of run time. It is easier to implement I will keep this.
Replacing the inverse dsqrt by **(-1/2) showed to be equivalent in term of run time.
Delaying the square root and combining it with the third power of r2inv was also equivalent. So I replace the whole series of operation by **(-1.5).
Same idea with rji(1)*r2inv, I do rji*r2inv before and only use the result in the next lines.
Edit 2 (Test with !$omp reduction(+:...))
I have tested the program with the following instructions
original which is the program I present in my question.
!$omp do schedule(dynamic,1)
!$omp reduction(+:a2_x,a2_y,a2_z) with `schedule(dynamic,1)'.
!$omp reduction(+:a2_x,a2_y,a2_z) with schedule(guided) and do i = 2, N do j = 1, i-1 for the loop (half work).
for 1024 and 16384 ions. Turns out my original version is still faster for me but the reduction version is not as much "catastrophic" as the previous test without reduction.
Version
N = 1024
N = 16384
original
84 s
15194 s
schedule(dynamic,1)
1300 s
not tested
reduction and schedule(dynamic,1)
123 s
24860 s
reduction and schedule(guided) (2)
121 s
24786 s
What is weird is that #PierU still has a faster computation with reduction, while for me it is not optimal. Where should such difference come from ?
Hypothesis
The fact I have a 10 core make the workload so lighter on each core for a given number of ion ?
I use double precision, maybe single precision are faster ?
Do you have AVX-512 instruction set ? It has a specific hardware to compute inverse square root much faster (see this article).
The bottleneck is elsewhere in my program. I am aware I should only test the Coulomb part. But I wanted to test it in the context of my program, see if it really shorten computation time. I have a section with a lot of where and rng perhaps I should work on this.
Generally speaking, the variables that you just need to read in the parallel region can be shared. However, having firstprivate copies for each threads can give better performances in some cases (the copies can be in the local cache of each core), particularly for variables that are repeatedly read.
definitely not! If you do that, there will be a race condition on these variables
looks ok, but it is generally simpler (and at worst as efficient) to use an !$OMP DO directive instead of manually distributing the work to the different threads
!$OMP DO
do i = 1, N ! loop over all ions
do j = 1, N ! loop over all ions
why not, provided that you are able to choose a softening value that doesn't alter your simulation (this is something that you have to test against the if solution)
it is somehow, but at some point you cannot avoid an exponentiation. I would delay the sqrt and the division like this:
r2inv = (rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
r2inv = r2inv**(-1.5) * alpha(1) ! alpha is 1/4.pi.eps0
Dividing the work by 2
The forces are symmetrical, and can be computed only once for a given (i,j) pair. This also naturally avoids the i==j case and the softening value. A reduction clause is needed on the a2* arrays as the same elements can be updated by different threads. The workload between iterations is highly unbalanced, though, and a dynamic clause is needed. This is actually a case were manually distributing the iterations to the threads can be more efficient ;) ...
!$omp parallel default(none) &
!$omp private(im, i,j,rji,r2inv) &
!$omp firstprivate(r_x,r_y,r_z, N, ia, ib) &
!$omp reduction(+:a2_x, a2_y, a2_z)
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
!$omp do schedule(dynamic,1)
do i = 1, N-1 ! loop over all ions
do j = i+1, N ! loop over some ions
rji(1) = r_x(j) - r_x(i) ! distance between the ion i and j over x
rji(2) = r_y(j) - r_y(i) ! over y
rji(3) = r_z(j) - r_z(i) ! over z
! then compute the inverse square root of distance between the current ion i and the neighbor j
r2inv = (rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3))
r2inv = r2inv**(-1.5) * alpha(1) ! alpha is 1/4.pi.eps0
! computation of the accelerations
rji(:) = rji(:)*r2inv
a2_x(i) = a2_x(i) - rji(1)
a2_y(i) = a2_y(i) - rji(2)
a2_z(i) = a2_z(i) - rji(3)
a2_x(j) = a2_x(j) + rji(1)
a2_y(j) = a2_y(j) + rji(2)
a2_z(j) = a2_z(j) + rji(3)
enddo
enddo
!$omp end do
!$omp end parallel
Alternatively, a guided clause could be used, with some changes in the iterations to have the low workloads in the first ones:
!$omp do schedule(guided)
do i = 2, N ! loop over all ions
do j = 1, i-1 ! loop over some ions
TIMING
I have timed the latter code (divided by 2) on a old core i5 from 2011 (4 cores). Code compiled with gfortran 12.
No OpenMP / OpenMP with 1 thread / 4 threads no explicit schedule (that is static by default) / schedule(dynamic) / schedule(nonmonotonic:dynamic) / schedule(guided). guided timed with 2 code versions : (1) with do i=1,N-1; do j=i+1,N, (2) with do i=2,N; do j=1,i-1
N=256
N=1204
N=4096
N=16384
N=65536
no omp
0.0016
0.026
0.41
6.9
116
1 thread
0.0019
0.027
0.48
8.0
118
4 threads
0.0014
0.013
0.20
3.4
55
dynamic
0.0018
0.020
0.29
5.3
84
nonmonotonic
0.29
5.2
85
guided (1)
0.0014
0.013
0.21
3.7
61
guided (2)
0.0009
0.093
0.13
2.2
38
The guided schedule with low workload iterations first wins. And I have some speed-up even for low values of N. It's important to note however that the behavior can differ on a different CPU, and with a different compiler.
I have also timed the code with do i=1,N; do j=1,N (as the work is balanced between iterations there's no need of sophisticated schedule clauses):
N=256
N=1204
N=4096
N=16384
N=65536
no omp
0.0028
0.047
0.72
11.5
183
4 threads
0.0013
0.019
0.25
4.0
71
I did not see how you limited the number of threads. This could be an additional !$omp directive.
The following could be effective with schedule(static).
Should "if ( i==j ) cycle" be included ?
Is N = 16 for your code example ?
dimension rjm(3)
!$omp parallel do default(none) &
!$omp private (i,j, rji, r2inv, rjm) &
!$omp shared (a2_x, a2_y, a2_z, r_x,r_y,r_z, N, softening, alpha ) &
!$omp schedule (static)
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
do i = 1,16 ! loop over threads , ignore ia(im,1), ib(im,1), is N=16 ?
rjm = 0
do j = 1, N ! loop over all ions
if ( i==j ) cycle
rji(1) = r_x(j) - r_x(i) ! distance between the ion i and j over x
rji(2) = r_y(j) - r_y(i) ! over y
rji(3) = r_z(j) - r_z(i) ! over z
! then compute the inverse square root of distance between the current ion i and the neighbor j
r2inv = sqrt ( rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening )
r2inv = alpha(1) / ( r2inv * r2inv * r2inv ) ! alpha is 1/4.pi.eps0
! computation of the accelerations
! rjm = rjm + rji*r2inv
rjm(1) = rjm(1) + rji(1)*r2inv
rjm(2) = rjm(2) + rji(2)*r2inv
rjm(3) = rjm(3) + rji(3)*r2inv
end do
a2_x(i) = a2_x(i) - rjm(1)
a2_y(i) = a2_y(i) - rjm(2)
a2_z(i) = a2_z(i) - rjm(3)
end do
!$omp end parallel do
I have no experience of using firstprivate to shift shared variables into the stack for improved performance. Is it worthwile ?
Related
In my Fortran 95 code I have a series of nested DO loops, the whole of which takes a significant amount of time to compute, so I wanted to add parallel functionality with OpenMP (using gfortran -fopenmp to compile/build).
There is one main DO loop, running 1000 times.
There is a sub DO loop within this, running 100 times.
Several other DO loops are nested within this, the number of iterations increases with each iteration of the DO loop (Once the first time, up to 1000 for the last time).
Example:
DO a = 1, 1000
DO b = 1, 100
DO c = 1, d
some calculations
END DO
DO c = 1, d
some calculations
END DO
DO c = 1, d
some calculations
END DO
END DO
d = d + 1
END DO
Some of the nested DO loops have to be run in serial as they contain dependencies within themselves (that is to say, each iteration of the loop has a calculation that includes a value from the previous iteration), and cannot easily be parallelised in this instance.
I can easily make the loops without any dependencies run in parallel, as below:
d = 1
DO a = 1, 1000
DO b = 1, 100
DO c = 1, d
some calculations with dependencies
END DO
!$OMP PARALLEL
!$OMP DO
DO c = 1, d
some calculations without dependencies
END DO
!$OMP END DO
!$OMP END PARALLEL
DO c = 1, d
some calculations with dependencies
END DO
END DO
d = d + 1
END DO
However I understand that there is significant overhead with opening and closing the parallel threads, given that this occurs so many times within the loops. The code runs significantly slower than it previously did when run sequentially.
Following this, I figured that it made sense to open and close the parallel code either side of the main loop (therefore only applying the overhead once), and setting the number of threads to either 1 or 8 to control whether sections are run sequentially or in parallel, as below:
d = 1
CALL omp_set_num_threads(1)
!$OMP PARALLEL
DO a = 1, 1000
DO b = 1, 100
DO c = 1, d
some calculations with dependencies
END DO
CALL omp_set_num_threads(4)
!$OMP DO
DO c = 1, d
some calculations without dependencies
END DO
!$OMP END DO
CALL omp_set_num_threads(1)
DO c = 1, d
some calculations with dependencies
END DO
END DO
d = d + 1
END DO
!$OMP END PARALLEL
However, when I set this to run I don't get the speedup that I was expecting from running parallel code. I expect the first few to be slower to account for the overhead, but after a while I expect the parallel code to run faster than the sequential code, which it doesn't. I compared how fast each iteration of the main DO loop ran, for DO a = 1, 50, results below:
Iteration Serial Parallel
1 3.8125 4.0781
2 5.5781 5.9843
3 7.4375 7.9218
4 9.2656 9.7500
...
48 89.0625 94.9531
49 91.0937 97.3281
50 92.6406 99.6093
My first thought is that I'm somehow not setting the number of threads correctly.
Questions:
Is there something obviously wrong with how I've structured the parallel code?
Is there a better way to implement what I've done / want to do?
There is indeed something that is obviously wrong: you have removed any parallelism out of your code. Before creating the outermost parallel region, you defined its size to be of one thread. Therefore, only one single thread will be created to handle whatever code is inside this region. Subsequently using omp_set_num_threads(4) won't change that. This call merely says that whichever next parallel directive will create 4 threads (unless explicitly requested otherwise). But there's no such new parallel directive, which would have been here nested within the current one. You only have a work-sharing do directive which applied on the current enclosing parallel region of one unique thread.
There are two ways of addressing your issue:
Keeping your code as it was: although formally, you will fork and join your threads upon entry and exit of the parallel region, the OpenMP standard doesn't request that the threads are created and destroyed. Actually, it even encourages that the threads are kept alive to reduce the overhead of the parallel directive, which is done by most OpenMP run-time libraries. Therefore, the payload of such a simple approach of the problem isn't too big.
Using your second approach of pushing the parallel directive outside of the outermost loop, but creating as many threads as you'll need for the work-sharing (4 here I believe). Then, you enclose whatever has to be sequential within your parallel region with a single directive. This will ensure that no unwanted interaction with the extra threads will happen (implicit barrier and flushing of shared variable upon exit) while avoiding the parallelism where you don't want it.
This last version would look like this:
d = 1
!$omp parallel num_threads( 4 ) private( a, b, c ) firstprivate( d )
do a = 1, 1000
do b = 1, 100
!$omp single
do c = 1, d
some calculations with dependencies
end do
!$omp end single
!$omp do
do c = 1, d
some calculations without dependencies
end do
!$omp end do
!$omp single
do c = 1, d
some calculations with dependencies
end do
!$omp end single
end do
d = d + 1
end do
!$omp end parallel
Now whether this version would be actually faster compared to the naive one, it is up to you to test.
A last remark though: since there are quite a lot of sequential parts in your code, don't expect too much speed-up anyway. Amdahl's law is forever.
Nothing obviously wrong but if the serial loops take long, your speedup will be limited. Doing parallel computing might require to redesign your algorithms.
Instead of setting the number of threads in the loop, use the !$omp master - !$omp end master directives to reduce the execution to a single thread. Add a !$omp barrier if you can run this block only once all other threads are done.
I am trying to parallelize the calculation of gravitational forces in my program using OpenMP. The calculation of the distances (R and R2) is no problem but the forces/accelerations (A) come out wrong. I know that it has something to do with race conditions in the summation. I have experimented a bit with atomic and critical constructs but could not find a solution. Also, I'm not sure which variables should be private and why.
Does someone with more experience in using OpenMP have a suggestion on how to correct this in the following code example?
A = 0.0
!$omp parallel do
do i = 1, Nobj
do j = i + 1, Nobj
R2(i,j) = (X(j,1) - X(i,1))**2 &
+ (X(j,2) - X(i,2))**2 &
+ (X(j,3) - X(i,3))**2
R(i,j) = sqrt(R2(i,j))
do k = 1, 3
A(i,k) = A(i,k) + ((mass_2_acc(i,j) / R2(i,j)) * ((X(j,k) - X(i,k)) / R(i,j)))
A(j,k) = A(j,k) + ((mass_2_acc(i,j) / R2(i,j)) * ((X(i,k) - X(j,k)) / R(i,j)))
enddo
enddo
A(i,:) = A(i,:) * G / mass_acc(i)
enddo
!$omp end parallel do
You are modifying A(j,k) - neither j nor k are "local" to thread as the thread-parallel index is i. What I mean is neither of those index ranges are restricted to a particular thread, all threads will update all A(j,k) hence the race condition.
Things you can do - split up R and A calculations or not use symmetry to update A.
Also, Fortran is column major and you are traversing outer index first which is bad for performance.
I'm trying to optimizing the value N to split arrays up for vectorizing an array so it runs the quickest on different machines. I have some test code below
#example use random values
clear all,
t=rand(1,556790);
inner_freq=rand(8193,6);
N=100; # use N chunks
nn = int32(linspace(1, length(t)+1, N+1))
aa_sig_combined=zeros(size(t));
total_time_so_far=0;
for ii=1:N
tic;
ind = nn(ii):nn(ii+1)-1;
aa_sig_combined(ind) = sum(diag(inner_freq(1:end-1,2)) * cos(2 .* pi .* inner_freq(1:end-1,1) * t(ind)) .+ repmat(inner_freq(1:end-1,3),[1 length(ind)]));
toc
total_time_so_far=total_time_so_far+sum(toc)
end
fprintf('- Complete test in %4.4fsec or %4.4fmins\n',total_time_so_far,total_time_so_far/60);
This takes 162.7963sec or 2.7133mins to complete when N=100 on a 16gig i7 machine running ubuntu
Is there a way to find out what value N should be to get this to run the fastest on different machines?
PS: I'm running Octave 3.8.1 on 16gig i7 ubuntu 14.04 but it will also be running on even a 1 gig raspberry pi 2.
This is the Matlab test script that I used to time each parameter. The return is used to break it after the first iteration as it looks like the rest of the iterations are similar.
%example use random values
clear all;
t=rand(1,556790);
inner_freq=rand(8193,6);
N=100; % use N chunks
nn = int32( linspace(1, length(t)+1, N+1) );
aa_sig_combined=zeros(size(t));
D = diag(inner_freq(1:end-1,2));
for ii=1:N
ind = nn(ii):nn(ii+1)-1;
tic;
cosPara = 2 * pi * A * t(ind);
toc;
cosResult = cos( cosPara );
sumParaA = D * cosResult;
toc;
sumParaB = repmat(inner_freq(1:end-1,3),[1 length(ind)]);
toc;
aa_sig_combined(ind) = sum( sumParaA + sumParaB );
toc;
return;
end
The output is indicated as follows. Note that I have a slow computer.
Elapsed time is 0.156621 seconds.
Elapsed time is 17.384735 seconds.
Elapsed time is 17.922553 seconds.
Elapsed time is 18.452994 seconds.
As you can see, the cos operation is what's taking so long. You are running cos on a 8192x5568 matrix (45,613,056 elements) which makes sense that it takes so long.
If you wish to improve performance, use parfor as it appears each iteration is independent. Assuming you had 100 cores to run your 100 iterations, your script would be done in 17 seconds + parfor overhead.
Within the cos calculation, you might want to look into if another method exists to calculate cos of a value faster and more parallel than the stock method.
Another minor optimization is this line. It ensures that the diag function isn't called within the loop as the diagonal matrix is constant. You don't want a 8192x8192 diagonal matrix to be generated every time! I just stored it outside the loop and it gives a bit of a performance boost as well.
D = diag(inner_freq(1:end-1,2));
Note that I didn't use the Matlab profile as it didn't work for me, but you should use that in the future for more functionalized code.
Suppose that f(x,y) is a bivariate function as follows:
function [ f ] = f(x,y)
UN=(g)1.6*(1-acos(g)/pi)-0.8;
f= 1+UN(cos(0.5*pi*x+y));
end
How to improve execution time for function F(N) with the following code:
function [VAL] = F(N)
x=0:4/N:4;
y=0:2*pi/1000:2*pi;
VAL=zeros(N+1,3);
for i = 1:N+1
val = zeros(1,N+1);
for j = 1:N+1
val(j) = trapz(y,f(0,y).*f(x(i),y).*f(x(j),y))/2/pi;
end
val = fftshift(fft(val))/N;
l = (length(val)+1)/2;
VAL(i,:)= val(l-1:l+1);
end
VAL = fftshift(fft(VAL,[],1),1)/N;
L = (size(VAL,1)+1)/2;
VAL = VAL(L-1:L+1,:);
end
Note that N=2^p where p>10, so please consider the memory limitations while optimizing the code using ndgrid, arrayfun, etc.
FYI: The code intends to find the central 3-by-3 submatrix of the fftn of
fun=#(a,b) trapz(y,f(0,y).*f(a,y).*f(b,y))/2/pi;
where a,b are in [0,4]. The key idea is that we can save memory using the code above specially when N is very large. But the execution time is still an issue because of nested loops. See the figure below for N=2^2:
This is not a full answer, but some possibly helpful hints:
0) The trivial: Are you sure you need numerics? Can't you do the computation analytically?
1) Do not use function handles:
function [ f ] = f(x,y)
f= 1+1.6*(1-acos(cos(0.5*pi*x+y))/pi)-0.8
end
2) Simplify analytically: acos(cos(x)) is the same as abs(mod(x + pi, 2 * pi) - pi), which should compute slightly faster. Or, instead of sampling and then numerically integrating, first integrate analytically and sample the result.
3) The FFT is a very efficient algorithm to compute the full DFT, but you don't need the full DFT. Since you only want the central 3 x 3 coefficients, it might be more efficient to directly apply the DFT definition and evaluate the formula only for those coefficients that you want. That should be both fast and memory-efficient.
4) If you repeatedly do this computation, it might be helpful to precompute DFT coefficients. Here, dftmtx from the Signal Processing toolbox can assist.
5) To get rid of the loops, think about the problem not in the form of computation instructions, but a single matrix operation. If you consider your input N x N matrix as a vector with N² elements, and your output 3 x 3 matrix as a 9-element vector, then the whole operation you apply (numerical integration via trapz and DFT via fft) appears to be a simple linear transform, which it should be possible to express as an N² x 9 matrix.
Here's a array A with length N, and its values are between 1 and N (no duplication).
I want to get the array B which satisfies B[A[i]]=i , for i in [1,N]
e.g.
for A=[4,2,1,3], I want to get
B=[3,2,4,1]
I've writen a fortran code with openmp as showed below, array A is given by other procedure. For N = 1024^3(~10^9), it takes about 40 seconds, and assigning more threads do little help (it takes similar time for OMP_NUM_THREADS=1, 4 or 16). It seens openmp does not work well for very large N. (However it works well for N=10^7)
I wonder if there are other better algorithm to do assignment to B or make openmp valid.
the code:
subroutine fill_inverse_array(leng, A, B)
use omp_lib
implicit none
integer*4 intent(in) :: leng, i
integer*4 intent(in) :: A(leng)
integer*4 intent(out) :: B(leng)
!$omp parallel do private(i) firstprivate(leng) shared(A, B)
do i=1,leng
B(A(i))=i
enddo
!$omp end parallel do
end subroutine
It's a slow day here so I ran some tests. I managed to squeeze out a useful increase in speed by rewriting the expression inside the loop, from B(A(i))=i to the equivalent B(i) = A(A(i)). I think this has a positive impact on performance because it is a little more cache-friendly.
I used the following code to test various alternatives:
A = random_permutation(length)
CALL system_clock(st1)
B = A(A)
CALL system_clock(nd1)
CALL system_clock(st2)
DO i = 1, length
B(i) = A(A(i))
END DO
CALL system_clock(nd2)
CALL system_clock(st3)
!$omp parallel do shared(A,B,length) private(i)
DO i = 1, length
B(i) = A(A(i))
END DO
!$omp end parallel do
CALL system_clock(nd3)
CALL system_clock(st4)
DO i = 1, length
B(A(i)) = i
END DO
CALL system_clock(nd4)
CALL system_clock(st5)
!$omp parallel do shared(A,B,length) private(i)
DO i = 1, length
B(A(i)) = i
END DO
!$omp end parallel do
CALL system_clock(nd5)
As you can see, there are 5 timed sections in this code. The first is a simple one-line revision of your original code, to provide a baseline. This is followed by an unparallelised and then a parallelised version of your loop, rewritten along the lines I outlined above. Sections 4 and 5 reproduce your original order of operations, first unparallelised, then parallelised.
Over a series of four runs I got the following average times. In all cases I was using arrays of 10**9 elements and 8 threads. I tinkered a little and found that using 16 (hyperthreads) gave very little improvement, but that 8 was a definite improvement on fewer. Some average timings
Sec 1: 34.5s
Sec 2: 32.1s
Sec 3: 6.4s
Sec 4: 31.5s
Sec 5: 8.6s
Make of those numbers what you will. As noted above, I suspect that my version is marginally faster than your version because it makes better use of cache.
I'm using Intel Fortran 14.0.1.139 on a 64-bit Windows 7 machine with 10GB RAM. I used the '/O2' option for compiler optimisation.