In my Fortran 95 code I have a series of nested DO loops, the whole of which takes a significant amount of time to compute, so I wanted to add parallel functionality with OpenMP (using gfortran -fopenmp to compile/build).
There is one main DO loop, running 1000 times.
There is a sub DO loop within this, running 100 times.
Several other DO loops are nested within this, the number of iterations increases with each iteration of the DO loop (Once the first time, up to 1000 for the last time).
Example:
DO a = 1, 1000
DO b = 1, 100
DO c = 1, d
some calculations
END DO
DO c = 1, d
some calculations
END DO
DO c = 1, d
some calculations
END DO
END DO
d = d + 1
END DO
Some of the nested DO loops have to be run in serial as they contain dependencies within themselves (that is to say, each iteration of the loop has a calculation that includes a value from the previous iteration), and cannot easily be parallelised in this instance.
I can easily make the loops without any dependencies run in parallel, as below:
d = 1
DO a = 1, 1000
DO b = 1, 100
DO c = 1, d
some calculations with dependencies
END DO
!$OMP PARALLEL
!$OMP DO
DO c = 1, d
some calculations without dependencies
END DO
!$OMP END DO
!$OMP END PARALLEL
DO c = 1, d
some calculations with dependencies
END DO
END DO
d = d + 1
END DO
However I understand that there is significant overhead with opening and closing the parallel threads, given that this occurs so many times within the loops. The code runs significantly slower than it previously did when run sequentially.
Following this, I figured that it made sense to open and close the parallel code either side of the main loop (therefore only applying the overhead once), and setting the number of threads to either 1 or 8 to control whether sections are run sequentially or in parallel, as below:
d = 1
CALL omp_set_num_threads(1)
!$OMP PARALLEL
DO a = 1, 1000
DO b = 1, 100
DO c = 1, d
some calculations with dependencies
END DO
CALL omp_set_num_threads(4)
!$OMP DO
DO c = 1, d
some calculations without dependencies
END DO
!$OMP END DO
CALL omp_set_num_threads(1)
DO c = 1, d
some calculations with dependencies
END DO
END DO
d = d + 1
END DO
!$OMP END PARALLEL
However, when I set this to run I don't get the speedup that I was expecting from running parallel code. I expect the first few to be slower to account for the overhead, but after a while I expect the parallel code to run faster than the sequential code, which it doesn't. I compared how fast each iteration of the main DO loop ran, for DO a = 1, 50, results below:
Iteration Serial Parallel
1 3.8125 4.0781
2 5.5781 5.9843
3 7.4375 7.9218
4 9.2656 9.7500
...
48 89.0625 94.9531
49 91.0937 97.3281
50 92.6406 99.6093
My first thought is that I'm somehow not setting the number of threads correctly.
Questions:
Is there something obviously wrong with how I've structured the parallel code?
Is there a better way to implement what I've done / want to do?
There is indeed something that is obviously wrong: you have removed any parallelism out of your code. Before creating the outermost parallel region, you defined its size to be of one thread. Therefore, only one single thread will be created to handle whatever code is inside this region. Subsequently using omp_set_num_threads(4) won't change that. This call merely says that whichever next parallel directive will create 4 threads (unless explicitly requested otherwise). But there's no such new parallel directive, which would have been here nested within the current one. You only have a work-sharing do directive which applied on the current enclosing parallel region of one unique thread.
There are two ways of addressing your issue:
Keeping your code as it was: although formally, you will fork and join your threads upon entry and exit of the parallel region, the OpenMP standard doesn't request that the threads are created and destroyed. Actually, it even encourages that the threads are kept alive to reduce the overhead of the parallel directive, which is done by most OpenMP run-time libraries. Therefore, the payload of such a simple approach of the problem isn't too big.
Using your second approach of pushing the parallel directive outside of the outermost loop, but creating as many threads as you'll need for the work-sharing (4 here I believe). Then, you enclose whatever has to be sequential within your parallel region with a single directive. This will ensure that no unwanted interaction with the extra threads will happen (implicit barrier and flushing of shared variable upon exit) while avoiding the parallelism where you don't want it.
This last version would look like this:
d = 1
!$omp parallel num_threads( 4 ) private( a, b, c ) firstprivate( d )
do a = 1, 1000
do b = 1, 100
!$omp single
do c = 1, d
some calculations with dependencies
end do
!$omp end single
!$omp do
do c = 1, d
some calculations without dependencies
end do
!$omp end do
!$omp single
do c = 1, d
some calculations with dependencies
end do
!$omp end single
end do
d = d + 1
end do
!$omp end parallel
Now whether this version would be actually faster compared to the naive one, it is up to you to test.
A last remark though: since there are quite a lot of sequential parts in your code, don't expect too much speed-up anyway. Amdahl's law is forever.
Nothing obviously wrong but if the serial loops take long, your speedup will be limited. Doing parallel computing might require to redesign your algorithms.
Instead of setting the number of threads in the loop, use the !$omp master - !$omp end master directives to reduce the execution to a single thread. Add a !$omp barrier if you can run this block only once all other threads are done.
Related
Background
I am simulating the motion of N charged particles in molecular dynamics with Fortran90 and OpenMP. The analytical expression of forces applied to each ion i is known and is a function of the position of the ion i and the other ions (r_x,r_y,r_z). I compute the Coulomb interaction between each pair of ion using a parallelised 2-nested do loop. I can determine the acceleration (a2_x,a2_y,a2_z) of each ion at the end of the loop (then update velocity and position with velocity-Verlet).
Method
I use the following code in my program to compute the Coulomb forces applied to each ion. I compute the acceleration (a2_x) at the next time step, starting from the position (r_x) at the current time step. It is a 3D problem, I put all the lines but most of them are just same thing for x, y and z so at first read you can just consider the _x variables to see how this works.
I parallelize my loop over C threads, ia and ib are arrays used to split the N ions into C parts. For instance for C=4 threads and N=16 ions (Se edit remarks below)
integer, parameter :: ia(C) = [1,5,9,13]
integer, parameter :: ib(C) = [4,8,12,16]
Then Coulomb is computed as follows
!$omp parallel default(none) &
!$omp private(im, i,j,rji,r2inv) &
!$omp firstprivate(r_x,r_y,r_z, N, ia, ib) &
!$omp shared(a2_x, a2_y, a2_z)
im = omp_get_thread_num() + 1 ! How much threads
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
do i = ia(im,1), ib(im,1) ! loop over threads
do j = 1, N ! loop over all ions
rji(1) = r_x(j) - r_x(i) ! distance between the ion i and j over x
rji(2) = r_y(j) - r_y(i) ! over y
rji(3) = r_z(j) - r_z(i) ! over z
! then compute the inverse square root of distance between the current ion i and the neighbor j
r2inv = 1.d0/dsqrt(rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
r2inv = r2inv * r2inv * r2inv * alpha(1) ! alpha is 1/4.pi.eps0
! computation of the accelerations
a2_x(i) = a2_x(i) - rji(1)*r2inv
a2_y(i) = a2_y(i) - rji(2)*r2inv
a2_z(i) = a2_z(i) - rji(3)*r2inv
enddo
enddo
!$omp end parallel
Problematics
I am trying to optimize this time consuming part of my program. The number of operation is quite high, scales quickly with N. Can you tell me your opinion on this program ? I have some specific questions.
I have been told I should have the positions r_x, r_y and r_z as private variables, which seems counter-intuitive to me because I want to enter that loop using the previously defined positions of the ions, so i use firstprivate. Is that right ?
I am not sure that the parallelisation is optimal regarding the other variables. Shouldn't rji and r2inv be shared ? Because to compute the distance between ions i and j, I go "beyond" threads, you see what I mean ? I need info between ions spread over two different threads.
Is the way I split the ions in the first do optimal ?
I loop over all ions respectively for each ion, which will induce a division by zero when the distance between ion i and i is computed. To prevent this I have a softening variable defined at very small value so it is not exactly zero. I do this to avoid an if i==i that would be time consuming.
Also the square root is maybe also time consuming ?
For any additional detail feel free to ask.
Edit (Remarks)
My computer have a 10 core CPU Xeon W2155, 32 Go RAM. I intend to render around 1000 ions, while thinking about 4000, which requires a lot of time.
I have this Coulomb subroutine among other subroutine that may consume some CPU time. For instance one routine that may be time consuming is devoted to generating random numbers for each ion depending they are already excited or not, and applying the correct effect whether they absorb or not a photon. So that is a lot of RNG and if for each ion.
Edit (Test of the propositions)
Using !$omp do in combination with schedule(dynamic,1), or schedule(guided) or schedule(nonmonotonic:dynamic) and/or collapse(2) did not improve the run time. It made it at least three time longer. It is suggested the number of element in my simulations (N) is too low to see a significant improve. If I ever try to render much higher number of elements (4096, 8192 ...) I will try those options.
Using !$omp do rather than a home made ion distribution among cores did show equivalent in term of run time. It is easier to implement I will keep this.
Replacing the inverse dsqrt by **(-1/2) showed to be equivalent in term of run time.
Delaying the square root and combining it with the third power of r2inv was also equivalent. So I replace the whole series of operation by **(-1.5).
Same idea with rji(1)*r2inv, I do rji*r2inv before and only use the result in the next lines.
Edit 2 (Test with !$omp reduction(+:...))
I have tested the program with the following instructions
original which is the program I present in my question.
!$omp do schedule(dynamic,1)
!$omp reduction(+:a2_x,a2_y,a2_z) with `schedule(dynamic,1)'.
!$omp reduction(+:a2_x,a2_y,a2_z) with schedule(guided) and do i = 2, N do j = 1, i-1 for the loop (half work).
for 1024 and 16384 ions. Turns out my original version is still faster for me but the reduction version is not as much "catastrophic" as the previous test without reduction.
Version
N = 1024
N = 16384
original
84 s
15194 s
schedule(dynamic,1)
1300 s
not tested
reduction and schedule(dynamic,1)
123 s
24860 s
reduction and schedule(guided) (2)
121 s
24786 s
What is weird is that #PierU still has a faster computation with reduction, while for me it is not optimal. Where should such difference come from ?
Hypothesis
The fact I have a 10 core make the workload so lighter on each core for a given number of ion ?
I use double precision, maybe single precision are faster ?
Do you have AVX-512 instruction set ? It has a specific hardware to compute inverse square root much faster (see this article).
The bottleneck is elsewhere in my program. I am aware I should only test the Coulomb part. But I wanted to test it in the context of my program, see if it really shorten computation time. I have a section with a lot of where and rng perhaps I should work on this.
Generally speaking, the variables that you just need to read in the parallel region can be shared. However, having firstprivate copies for each threads can give better performances in some cases (the copies can be in the local cache of each core), particularly for variables that are repeatedly read.
definitely not! If you do that, there will be a race condition on these variables
looks ok, but it is generally simpler (and at worst as efficient) to use an !$OMP DO directive instead of manually distributing the work to the different threads
!$OMP DO
do i = 1, N ! loop over all ions
do j = 1, N ! loop over all ions
why not, provided that you are able to choose a softening value that doesn't alter your simulation (this is something that you have to test against the if solution)
it is somehow, but at some point you cannot avoid an exponentiation. I would delay the sqrt and the division like this:
r2inv = (rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
r2inv = r2inv**(-1.5) * alpha(1) ! alpha is 1/4.pi.eps0
Dividing the work by 2
The forces are symmetrical, and can be computed only once for a given (i,j) pair. This also naturally avoids the i==j case and the softening value. A reduction clause is needed on the a2* arrays as the same elements can be updated by different threads. The workload between iterations is highly unbalanced, though, and a dynamic clause is needed. This is actually a case were manually distributing the iterations to the threads can be more efficient ;) ...
!$omp parallel default(none) &
!$omp private(im, i,j,rji,r2inv) &
!$omp firstprivate(r_x,r_y,r_z, N, ia, ib) &
!$omp reduction(+:a2_x, a2_y, a2_z)
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
!$omp do schedule(dynamic,1)
do i = 1, N-1 ! loop over all ions
do j = i+1, N ! loop over some ions
rji(1) = r_x(j) - r_x(i) ! distance between the ion i and j over x
rji(2) = r_y(j) - r_y(i) ! over y
rji(3) = r_z(j) - r_z(i) ! over z
! then compute the inverse square root of distance between the current ion i and the neighbor j
r2inv = (rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3))
r2inv = r2inv**(-1.5) * alpha(1) ! alpha is 1/4.pi.eps0
! computation of the accelerations
rji(:) = rji(:)*r2inv
a2_x(i) = a2_x(i) - rji(1)
a2_y(i) = a2_y(i) - rji(2)
a2_z(i) = a2_z(i) - rji(3)
a2_x(j) = a2_x(j) + rji(1)
a2_y(j) = a2_y(j) + rji(2)
a2_z(j) = a2_z(j) + rji(3)
enddo
enddo
!$omp end do
!$omp end parallel
Alternatively, a guided clause could be used, with some changes in the iterations to have the low workloads in the first ones:
!$omp do schedule(guided)
do i = 2, N ! loop over all ions
do j = 1, i-1 ! loop over some ions
TIMING
I have timed the latter code (divided by 2) on a old core i5 from 2011 (4 cores). Code compiled with gfortran 12.
No OpenMP / OpenMP with 1 thread / 4 threads no explicit schedule (that is static by default) / schedule(dynamic) / schedule(nonmonotonic:dynamic) / schedule(guided). guided timed with 2 code versions : (1) with do i=1,N-1; do j=i+1,N, (2) with do i=2,N; do j=1,i-1
N=256
N=1204
N=4096
N=16384
N=65536
no omp
0.0016
0.026
0.41
6.9
116
1 thread
0.0019
0.027
0.48
8.0
118
4 threads
0.0014
0.013
0.20
3.4
55
dynamic
0.0018
0.020
0.29
5.3
84
nonmonotonic
0.29
5.2
85
guided (1)
0.0014
0.013
0.21
3.7
61
guided (2)
0.0009
0.093
0.13
2.2
38
The guided schedule with low workload iterations first wins. And I have some speed-up even for low values of N. It's important to note however that the behavior can differ on a different CPU, and with a different compiler.
I have also timed the code with do i=1,N; do j=1,N (as the work is balanced between iterations there's no need of sophisticated schedule clauses):
N=256
N=1204
N=4096
N=16384
N=65536
no omp
0.0028
0.047
0.72
11.5
183
4 threads
0.0013
0.019
0.25
4.0
71
I did not see how you limited the number of threads. This could be an additional !$omp directive.
The following could be effective with schedule(static).
Should "if ( i==j ) cycle" be included ?
Is N = 16 for your code example ?
dimension rjm(3)
!$omp parallel do default(none) &
!$omp private (i,j, rji, r2inv, rjm) &
!$omp shared (a2_x, a2_y, a2_z, r_x,r_y,r_z, N, softening, alpha ) &
!$omp schedule (static)
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
do i = 1,16 ! loop over threads , ignore ia(im,1), ib(im,1), is N=16 ?
rjm = 0
do j = 1, N ! loop over all ions
if ( i==j ) cycle
rji(1) = r_x(j) - r_x(i) ! distance between the ion i and j over x
rji(2) = r_y(j) - r_y(i) ! over y
rji(3) = r_z(j) - r_z(i) ! over z
! then compute the inverse square root of distance between the current ion i and the neighbor j
r2inv = sqrt ( rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening )
r2inv = alpha(1) / ( r2inv * r2inv * r2inv ) ! alpha is 1/4.pi.eps0
! computation of the accelerations
! rjm = rjm + rji*r2inv
rjm(1) = rjm(1) + rji(1)*r2inv
rjm(2) = rjm(2) + rji(2)*r2inv
rjm(3) = rjm(3) + rji(3)*r2inv
end do
a2_x(i) = a2_x(i) - rjm(1)
a2_y(i) = a2_y(i) - rjm(2)
a2_z(i) = a2_z(i) - rjm(3)
end do
!$omp end parallel do
I have no experience of using firstprivate to shift shared variables into the stack for improved performance. Is it worthwile ?
I'm running a script that contains one loop to calculate two arrays (alpha_x and alpha_y) using input arrays (X,Y,M_list and zeta_list) from Python. They work fine when I run the Fortran script with no OpenMp. The array values are the same as if I did the calculation in Python alone. However, when I add in OpenMP support and make use of multiple cores, my array outputs for alpha_x and alpha_y are different values after each time I run the script! The code is below:
PROGRAM LensingTest1
IMPLICIT NONE
DOUBLE PRECISION,DIMENSION(1,1000)::M_list
DOUBLE PRECISION,DIMENSION(1000,1000)::X,Y,z_m_z_x,z_m_z_y,dist_z_m_z, alpha_x, alpha_y
DOUBLE PRECISION,DIMENSION(2,1000)::zeta_list
REAL::start_time,stop_time
INTEGER::i,j
open(10,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/M_list.dat')
open(9,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/zeta_list.dat')
open(8,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/X.dat')
open(7,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/Y.dat')
read(10,*)M_list
read(9,*)zeta_list
read(8,*)X
read(7,*)Y
call cpu_time(start_time)
!$OMP PARALLEL DO
do i=1,size(M_list,2),1
z_m_z_x = X - zeta_list(1,i)
z_m_z_y = Y - zeta_list(2,i)
dist_z_m_z = z_m_z_x**2 + z_m_z_y**2
alpha_x = alpha_x + (M_list(1,i)* z_m_z_x / dist_z_m_z)
alpha_y = alpha_y + (M_list(1,i)* z_m_z_y / dist_z_m_z)
end do
!$OMP END PARALLEL DO
call cpu_time(stop_time)
print *, "Setup time:", &
stop_time - start_time, "seconds"
open(6,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/alpha_x.dat')
do i =1,1000
write(6,'(1000F14.7)')(alpha_x(i,j), j=1,1000)
end do
open(5,file='/home/Desktop/Coding/Fortran/Test_Programs/Lensing_Test/alpha_y.dat')
do i =1,1000
write(5,'(1000F14.7)')(alpha_y(i,j), j=1,1000)
end do
stop
END PROGRAM LensingTest1
The only difference is that I add in the !$OMP PARALLEL DO and !$OMP END PARALLEL DO for the OpenMP support. I compile with gfortran -fopenmp script.f90 and then export OMP_NUM_THREADS=4
In order to maximize speed I am trying to vectorize the following (to enable the compiler to vectorize as it deems good):
integer :: i,j
real :: a(4),b(4,5),c(4,5)
!... setting values to a and b...
do i=1,5
do j=1,4
c(j,i)=b(j,i)/a(i)
end do
end do
I have tried the following
c=b/a
but that doesn't work:
error #6366: The shapes of the array expressions do not conform.
My thought was that since you can do a/i, (array / scalar), I was hoping that it was possible to do (2d array / array). To begin with the dimension of b and c were (5,4) and I thought that was the problem, that it needs to conform to the variable with smaller rank on the first ranks, but this didn't seem to be the case. As of now, I am wondering if it is at all possible??? Or do I have to stick with the do loops? (of course I could be satisfied with vectorize the inner loop)
Very happy with any comments or ideas with this.
(I am using ifort 16 on windows)
In case you haven't already got your answer, the seemingly non-vectorized code looks like this:
!Non-vectorized
do i=1,5
do j=1,4
c(j,i) = b(j,i) / a(j)
enddo
enddo
and the seemingly vectorized version like this:
!Vectorized
do i=1,5
c(:,i) = b(:,i) / a(:)
enddo
But the intel compiler vectorizes both of them. To make sure if a certain loop has been vectorized or not, use the flag -qopt-report-phase=vec. This generates the vectorization report about your compiled program and is a neat way of knowing if a certain loop has been vectorized or not.
The generated vectorization report of the above code is as shown:
.... Beginning stuff...
LOOP BEGIN at vect_div.f90(11,5)
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at vect_div.f90(12,5)
remark #15300: LOOP WAS VECTORIZED
LOOP END
LOOP END
LOOP BEGIN at vect_div.f90(18,5)
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at vect_div.f90(19,7)
remark #15300: LOOP WAS VECTORIZED
LOOP END
LOOP END
Here, (11,5), (12,5) etc. are the row and column numbers in my .f90 text file where the do keyword is present. As you may notice, the outer loops are not vectorized and the inner ones are. They are both vectorized without any noticeable difference too.
More detailed reports can be generated by changing the 'n' value in the ifort flag -qopt-report=[n]
I have several loops which follow this pattern:
do j = ms,mst
ic = ic + 1
df = mm(j)*data(ic)
dff(1:3)= vec(1:3)*df*qm
end do
As you can see, the variable ic is updated at every cycle of
j and the result of ic is used by the variable df. If I use
atomic operation of OpenMP I could reduce the performance of OpenMP.
Do you know an efficient way to deal with these kind of loops in
OpenMP?
If ic is not changed apart from the increment (i.e. data is an array or a function w/o side-effects), there is a fixed relation between j and ic:
icStart = ic
delta = icStart - ms + 1
do j = ms,mst
ic = delta + j
df = mm(j)*data(ic)
dff(1:3)= vec(1:3)*df*qm
end do
This can easily be parallelized with ic and df being thread-private. You still need to take care about dff, as you will get a race condition as it is now...
As you've written your code the value of ic is increased by 1 at each iteration, just as the value of ms is. A straightforward parallelisation of the loop, something like
!$OMP PARALLEL DO
do j = ms,mst
...
will distribute the work across threads giving each of them a discrete set of the values that j takes. Simple static scheduling of a 64-trip loop (with ms==1 and mst==64) across 4 threads will mean that thread 0 gets j = 1..16, thread 1 gets j = 17..32 and so on.
However, without care on your part the values of ic won't get neatly split across threads in this way. It looks to me, from the sample you've provided, as if the behaviour you want is for blocks of values of ic to accompany corresponding blocks of values of j -- they both increase by 1 at each trip round the loop.
Perhaps in the part of the code you haven't shown us ic is set to ms+k where k is some integer. In that case you could simply drop ic from inside the loop and write
!$OMP PARALLEL DO
do j = ms,mst
df = mm(j)*data(j+k)
dff(1:3)= vec(1:3)*df*qm
end do
Without knowing more about the relationship between j and ic it's difficult to offer more pertinent advice than this. But the principle remains, rewrite ic as a function of j if you can and avoid difficulties inside the parallelised loop.
Here's a array A with length N, and its values are between 1 and N (no duplication).
I want to get the array B which satisfies B[A[i]]=i , for i in [1,N]
e.g.
for A=[4,2,1,3], I want to get
B=[3,2,4,1]
I've writen a fortran code with openmp as showed below, array A is given by other procedure. For N = 1024^3(~10^9), it takes about 40 seconds, and assigning more threads do little help (it takes similar time for OMP_NUM_THREADS=1, 4 or 16). It seens openmp does not work well for very large N. (However it works well for N=10^7)
I wonder if there are other better algorithm to do assignment to B or make openmp valid.
the code:
subroutine fill_inverse_array(leng, A, B)
use omp_lib
implicit none
integer*4 intent(in) :: leng, i
integer*4 intent(in) :: A(leng)
integer*4 intent(out) :: B(leng)
!$omp parallel do private(i) firstprivate(leng) shared(A, B)
do i=1,leng
B(A(i))=i
enddo
!$omp end parallel do
end subroutine
It's a slow day here so I ran some tests. I managed to squeeze out a useful increase in speed by rewriting the expression inside the loop, from B(A(i))=i to the equivalent B(i) = A(A(i)). I think this has a positive impact on performance because it is a little more cache-friendly.
I used the following code to test various alternatives:
A = random_permutation(length)
CALL system_clock(st1)
B = A(A)
CALL system_clock(nd1)
CALL system_clock(st2)
DO i = 1, length
B(i) = A(A(i))
END DO
CALL system_clock(nd2)
CALL system_clock(st3)
!$omp parallel do shared(A,B,length) private(i)
DO i = 1, length
B(i) = A(A(i))
END DO
!$omp end parallel do
CALL system_clock(nd3)
CALL system_clock(st4)
DO i = 1, length
B(A(i)) = i
END DO
CALL system_clock(nd4)
CALL system_clock(st5)
!$omp parallel do shared(A,B,length) private(i)
DO i = 1, length
B(A(i)) = i
END DO
!$omp end parallel do
CALL system_clock(nd5)
As you can see, there are 5 timed sections in this code. The first is a simple one-line revision of your original code, to provide a baseline. This is followed by an unparallelised and then a parallelised version of your loop, rewritten along the lines I outlined above. Sections 4 and 5 reproduce your original order of operations, first unparallelised, then parallelised.
Over a series of four runs I got the following average times. In all cases I was using arrays of 10**9 elements and 8 threads. I tinkered a little and found that using 16 (hyperthreads) gave very little improvement, but that 8 was a definite improvement on fewer. Some average timings
Sec 1: 34.5s
Sec 2: 32.1s
Sec 3: 6.4s
Sec 4: 31.5s
Sec 5: 8.6s
Make of those numbers what you will. As noted above, I suspect that my version is marginally faster than your version because it makes better use of cache.
I'm using Intel Fortran 14.0.1.139 on a 64-bit Windows 7 machine with 10GB RAM. I used the '/O2' option for compiler optimisation.