I have worked with Fortran for quite a long time, but I have a question for which I can't find a satisfying answer.
If I have two arrays and I want to copy one into the other:
real,dimension(0:100,0:100) :: array1,array2
...
do i=0,100
do j=0,100
array1(i,j) = array2(i,j)
enddo
enddo
But I also noticed that it works just as well if I do it like this:
real,dimension(0:100,0:100) :: array1,array2
...
array1 = array2
And there is a huge difference in computational time! (The second one is much faster!)
If I do it without a loop, could there be a problem? For example, might I be copying only the memory reference rather than the content?
Does it change anything if I do another mathematical step like:
array1 = array2*5
Is there a problem on a different architecture (cluster server) or on a different compiler (gfortran, ifort)?
I have to perform various computational steps on huge amounts of data so the computational time is an issue.
Everything that #Alexander_Vogt said, but also:
do i=0,100
do j=0,100
array1(i,j) = array2(i,j)
enddo
enddo
will always be slower than
do j=0,100
do i=0,100
array1(i,j) = array2(i,j)
enddo
enddo
(Unless the compiler notices it and reorders the loops.)
In Fortran, the first index is the fastest changing. That means that in the second loop, the compiler can load several elements of the array in one go into the lower-level caches to operate on.
If you have multidimensional loops, always have the innermost loop loop over the first index, and so on. (If possible in any way.)
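To see the effect of loop order on your own machine, a minimal timing sketch along these lines may help. The array size `n` and the program name are assumptions chosen for illustration, and an optimising compiler may reorder the first loop nest, in which case the two timings will be close.

```fortran
program loop_order
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 4000          ! hypothetical size, meant to exceed the cache
  real, allocatable :: a(:,:), b(:,:)
  integer :: i, j
  integer(int64) :: t0, t1, t2, rate

  allocate(a(n,n), b(n,n))
  call random_number(b)
  a = 0.0                                 ! touch a once so both timings start equal

  call system_clock(t0, rate)
  ! Inner loop over the SECOND index: strided memory access
  do i = 1, n
    do j = 1, n
      a(i,j) = b(i,j)
    end do
  end do
  call system_clock(t1)
  ! Inner loop over the FIRST index: contiguous memory access
  do j = 1, n
    do i = 1, n
      a(i,j) = b(i,j)
    end do
  end do
  call system_clock(t2)

  print *, 'inner loop over j (strided):    ', real(t1 - t0) / real(rate), ' s'
  print *, 'inner loop over i (contiguous): ', real(t2 - t1) / real(rate), ' s'
  print *, 'checksum: ', sum(a)           ! use the result so the copies are kept
end program loop_order
```

Compiling once with and once without optimisation is a quick way to check whether the compiler reorders the loops for you.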
Fortran is very capable of performing vector operations. Both
array1 = array2
and
array1 = array2*5
are valid operations.
This notation allows the compiler to efficiently parallelize (and/or) optimize the code, as no dependence on the order of the operations exists.
However, these constructs are equivalent to the explicit loops, and it depends on the compiler which one will be faster.
Whether the memory will be copied or not depends on what further is done with the arrays and whether the compiler can optimize that. If there is no performance gain, it is safe to assume the array will be copied.
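On the worry about copying a reference instead of the content: whole-array assignment in Fortran has value semantics, so the two arrays stay independent afterwards. A small sketch (the program name is made up for illustration) that checks this:

```fortran
program copy_semantics
  implicit none
  real, dimension(0:100,0:100) :: array1, array2

  array2 = 1.0
  array1 = array2        ! whole-array assignment: element values are copied
  array2 = 2.0           ! later changes to array2 ...
  print *, array1(0,0)   ! ... do not affect array1
  if (any(array1 /= 1.0)) error stop 'array1 changed'

  array1 = array2*5      ! elementwise expression, same result as looping over (i,j)
  if (any(array1 /= 10.0)) error stop 'unexpected value'
  print *, 'array1 holds its own copy of the data'
end program copy_semantics
```

Whether the compiler physically moves memory is an optimisation detail; the observable result is always that of a copy.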
This is a follow up of this question.
I have an array D(:,:,:) of size NxMxM. Typically, for the problem that I am considering now, it is M=400 and N=600000 (I reshaped the array in order to give the biggest size to the first entry).
Therefore, for each value l of the first entry, D(l,:,:) is an MxM matrix in a certain basis. I need to perform a change of components of this matrix using a basis set vec(:,:), of size MxM, so as to obtain the matrices D_mod(l,:,:).
I think that the easiest way to do it is with:
D_mod=0.0d0
do b=1,M
do a=1,M
do nu=1,M
do mu=1,M
D_mod(:,mu,nu)=D_mod(:,mu,nu)+vec(mu,a)*vec(nu,b)*D(:,a,b)
end do
end do
end do
end do
Is there a way to improve the speed of this calculation (also using LAPACK/BLAS libraries)?
I was considering this approach: reshaping D into a N x M^2 matrix D_re; doing the tensor product vec(:,:) x vec(:,:) and reshaping it in order to obtain an M^2 x M^2 matrix vecsq_re(:,:) (this motivates this question); finally, doing the matrix product of these two matrices with zgemm. However, I am not sure this is a good strategy.
EDIT
I am sorry, I wrote the question too fast and too late. The size can be up to 600000, yes, but I usually adopt strategies to reduce it by a factor of 10 at least. The code is supposed to run on nodes with 100 GB of memory.
As #IanBush has said, your D array is enormous, and you're likely to need some kind of high-memory machine or cluster of machines to process it all at once. However, you probably don't need to process it all at once.
Before we get to that, let's first imagine you don't have any memory issues. For the operation you have described, D looks like an array of N matrices, each of size M*M. You say you have "reshaped the array in order to give the biggest size to the first entry", but for this problem this is the exact opposite of what you want. Fortran is a column-major language, so iterating across the first index of an array is much faster than iterating across the last. In practice, this means that an example triple-loop like
do i=1,N
do j=1,M
do k=1,M
D(i,j,k) = D(i,j,k) +1
enddo
enddo
enddo
will run much slower [1] than the re-ordered triple-loop
do k=1,M
do j=1,M
do i=1,N
D(i,j,k) = D(i,j,k) +1
enddo
enddo
enddo
and so you can immediately speed everything up by transposing D and D_mod from N*M*M arrays to M*M*N arrays and rearranging your loops. You can also speed everything up by replacing your manually-written matrix multiplication with matmul (or BLAS/LAPACK), to give
do i=1,N
D_mod(:,:,i) = matmul(matmul(vec , D(:,:,i)),transpose(vec))
enddo
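As a sanity check that the matmul form computes the same thing as the explicit sum over a and b, here is a small sketch. It assumes real double-precision data and tiny sizes for `M` and `N` (the question's mention of zgemm suggests the real data may be complex, in which case the types would change), and it already uses the M*M*N layout.

```fortran
program basis_change_check
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: M = 4, N = 3       ! tiny hypothetical sizes for the check
  real(dp) :: vec(M,M), D(M,M,N), D_loop(M,M,N), D_mm(M,M,N)
  integer :: i, a, b, mu, nu

  call random_number(vec)
  call random_number(D)

  ! Reference: explicit sum over a and b, with the N index moved last
  D_loop = 0.0_dp
  do i = 1, N
    do b = 1, M
      do a = 1, M
        do nu = 1, M
          do mu = 1, M
            D_loop(mu,nu,i) = D_loop(mu,nu,i) + vec(mu,a)*vec(nu,b)*D(a,b,i)
          end do
        end do
      end do
    end do
  end do

  ! Same transformation written as vec * D * transpose(vec)
  do i = 1, N
    D_mm(:,:,i) = matmul(matmul(vec, D(:,:,i)), transpose(vec))
  end do

  print *, 'max abs difference:', maxval(abs(D_loop - D_mm))
  if (maxval(abs(D_loop - D_mm)) > 1.0e-12_dp) error stop 'results differ'
end program basis_change_check
```

The explicit version costs on the order of M^4 operations per matrix, while two matrix products cost on the order of M^3, which is where most of the gain comes from.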
Now that you're doing matrix multiplication one matrix at a time, you can also find a solution for your memory usage problems: instead of loading everything into memory and trying to do everything at once, just load one D matrix at a time into an M*M array, calculate the corresponding entry of D_mod, and write it out to disk before loading the next matrix.
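A minimal sketch of that one-matrix-at-a-time approach, assuming real double-precision data stored as raw stream files. The file names `D.bin` and `D_mod.bin` and the tiny sizes are made up for illustration, and the first block only exists to create some input so the example is self-contained.

```fortran
program stream_matrices
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: M = 4, N = 5            ! tiny hypothetical sizes
  real(dp) :: vec(M,M), D(M,M), D_mod(M,M)
  integer :: i, uin, uout

  call random_number(vec)

  ! Create a small input file so the sketch is self-contained
  open(newunit=uin, file='D.bin', access='stream', form='unformatted', status='replace')
  do i = 1, N
    call random_number(D)
    write(uin) D
  end do
  close(uin)

  ! Process one M x M matrix at a time: only D and D_mod are ever in memory
  open(newunit=uin,  file='D.bin',     access='stream', form='unformatted', status='old')
  open(newunit=uout, file='D_mod.bin', access='stream', form='unformatted', status='replace')
  do i = 1, N
    read(uin) D
    D_mod = matmul(matmul(vec, D), transpose(vec))
    write(uout) D_mod
  end do
  close(uin)
  close(uout)
  print *, 'processed', N, 'matrices'
end program stream_matrices
```

In practice it may pay to read and write a batch of matrices per I/O call rather than a single one; the best batch size is something to measure.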
[1] If your compiler doesn't just optimise the loop order.
I have some iterative Fortran code which at each integration step produces some output. What is the best practice in terms of speed/accuracy for getting each of these steps saved to disk?
My current approach involves declaring some large array, at each integration step saving the output to a row of the array, and then finally saving a cropped version of the total array to file. A pseudo-example is shown below.
program IO_example
integer, parameter :: dp = selected_real_kind(33,4931)
integer, parameter :: nrows = 1000000, ncols = 6 !array bounds must be constants
real(kind=dp), dimension(nrows,ncols) :: BigDataArray
real(kind=dp), dimension(ncols) :: RowVector
real(kind=dp), dimension(:,:), allocatable :: SmallDataArray
integer :: i !for iterating
i = 1
do while (condition)
!Update RowVector
BigDataArray(i,:) = RowVector
i = i+1
enddo
!First reallocate to create a smaller array
!i is one past the last row written, so only rows 1..i-1 hold data
allocate(SmallDataArray(i-1,ncols))
SmallDataArray = BigDataArray(1:i-1, :)
!Now save
open(unit=10,file='BinaryData',status='replace',form='unformatted')
write(10) SmallDataArray
close(10)
end program IO_example
Now this works fine, but my question is: is this the best way to do this, or is some other approach more favourable? By best I am particularly referring to speed (how much does writing to array and writing to file slow down the code), although accuracy issues are also important (I understand these are avoided by writing in binary unformatted. See this StackOverflow answer).
One potential issue I can foresee is SmallDataArray being larger than the available RAM (especially in quad precision), so that it cannot be written to disk. In addition, the number of iterations could become greater than nrows (in this case I suppose one can just increase nrows, but at what point does this start to impact performance?)
Thanks in advance for any help.
This is probably an extended comment, taking advantage of some formatting, and verges close to an opinion, but there are one or two matters which are amenable to measurement and which you might care to test for yourself.
I'm not sure what role BigDataArray plays in your code, since you don't seem to need all the data in memory after it has been computed. You could probably drop it altogether and simply accumulate results into SmallDataArray. If BigDataArray has 10^6 rows, then maybe give SmallDataArray 10^5 rows, and fill it up 10 times. Or, if you're not certain at the outset how many rows to allocate to Big, then don't, just set Small to 10^5 and fill it up as many times as necessary, exiting when the computation converges.
(And don't get hung up on the numbers I've chosen, the best size for Small is something you probably ought to experiment with.)
Once the code has filled Small write it to file, go back to row 1 and carry on.
If you follow this approach you will eliminate at least a couple of potential performance issues: the repeated allocation of Small (not sure what that's about anyway), and the movement of data when you copy a bunch of rows from Big to Small (which gains you nothing in terms of computation performance and is unnecessary for writing the data to the file).
As you seem to know, the rule when writing data to file (which is very slow computationally) is to write large volumes in one go, but it's difficult to state how large that volume should be without at least some measurements and some testing, so go measure and test.
By dropping Big altogether you remove that burden from the memory while the code runs. And if you do need all of Big at the end of the calculation, you could always read it back in (subject to memory being available of course).
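The fill-and-flush scheme described above might look roughly like the following sketch. The buffer size `nbuf`, the fixed step count `nsteps`, and the use of double precision and stream access are all assumptions made to keep the example short; the placeholder assignment to `RowVector` stands in for the real integration step.

```fortran
program buffered_output
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! double precision here; the question uses quad
  integer, parameter :: ncols = 6
  integer, parameter :: nbuf = 1000        ! hypothetical buffer size; tune by measurement
  integer, parameter :: nsteps = 2500      ! stands in for the unknown iteration count
  real(dp) :: buffer(ncols, nbuf)          ! one step per column, so each step is contiguous
  real(dp) :: RowVector(ncols)
  integer :: step, nfilled, u

  open(newunit=u, file='BinaryData', access='stream', form='unformatted', status='replace')
  nfilled = 0
  do step = 1, nsteps
    RowVector = real(step, dp)             ! placeholder for the real integration step
    nfilled = nfilled + 1
    buffer(:, nfilled) = RowVector
    if (nfilled == nbuf) then              ! buffer full: flush it and start again
      write(u) buffer
      nfilled = 0
    end if
  end do
  if (nfilled > 0) write(u) buffer(:, 1:nfilled)   ! flush the partial last buffer
  close(u)
  print *, 'wrote', nsteps, 'steps'
end program buffered_output
```

Note that the buffer is declared (ncols, nbuf) rather than (nrows, ncols), so the file holds the six values of step 1, then the six values of step 2, and so on; a reader of the file has to expect that layout.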
Finally, let me get some retaliation in first: if your response to this 'answer' is something akin to "Oh, that doesn't answer my real question, it only answers the simplified question I asked but I have all these other issues to consider, would you mind taking a look at these too ..." then you can take it that my response to that is (a) unprintable and (b) boils down to "Yes, I would mind".
I'm trying to check whether a 1D array of integers A contains, at each of its size(A) positions, any of the elements of the set of integers S (also a 1D array), with the general case of size(S) > 1.
The easy and obvious way is to do the following nested loop:
DO i = 1, size(A)
DO j = 1, size(S)
IF(A(i) == S(j)) ** do something **
ENDDO
ENDDO
The problem is that, for large arrays A and S, this process is very inefficient. Is there an intrinsic FORTRAN subroutine or function that does this faster? Or any other method?
I've tried to do the following, but it doesn't want to compile:
DO i = 1, NNODES
IF(A(i) == ANY(S)) ** do something **
ENDDO
The error message that appears is the following: "error #6362: The data types of the argument(s) are invalid." I'm using VS2010 with Intel Parallel Studio 2013.
The expression
A(i) == ANY(S)
has an integer on the lhs and a logical on the rhs. We'll have none of that C-inspired nonsense of regarding those as comparable types in Fortran thank you very much. Actually, it's worse than that, any returns a logical but takes an array of logicals on input, so any(array_of_int) won't compile.
You could try
ANY(S==A(i))
instead. That should give you a compilable solution.
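For illustration, here is a small self-contained sketch of that expression in use. The array contents and the `found` mask are made up for the example.

```fortran
program any_membership
  implicit none
  integer :: A(6) = [3, 8, 1, 9, 4, 8]
  integer :: S(3) = [8, 4, 7]
  logical :: found(6)
  integer :: i

  do i = 1, size(A)
    found(i) = any(S == A(i))   ! S == A(i) is a logical array; any() reduces it
  end do
  print *, found
end program any_membership
```

With these values `found` is true at positions 2, 5 and 6, where A holds 8, 4 and 8.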
Now, as for efficiency, your first snippet is O(n^2). You can do better, asymptotically. Sort both arrays and scan them in tandem, which is O(n + n log n) or something similar. If you need help coding that up, update your question, though I suspect it's already been asked and answered here on SO.
I strongly suspect, and you can check if you care to, that using any inside a single (explicit) loop will also be O(n^2) -- since any has to operate on the most general cases I can't see any realistic alternative to it scanning the array -- another loop in other words.
In addition to High Performance Mark's answer, when you scan the sorted arrays you can, and should, use a binary search algorithm.
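One way to read the binary search suggestion is sketched below: only S is sorted, and each A(i) is looked up in O(log size(S)). This is a variation on the sort-both-and-scan idea rather than the same algorithm. The function name `contains_sorted` is hypothetical, and S is assumed to be sorted already, since Fortran has no intrinsic sort.

```fortran
program sorted_lookup
  implicit none
  integer :: A(6) = [3, 8, 1, 9, 4, 8]
  integer :: S(5) = [2, 4, 7, 8, 11]      ! assumed already sorted ascending
  integer :: i

  do i = 1, size(A)
    if (contains_sorted(S, A(i))) print *, 'A(', i, ') =', A(i), 'is in S'
  end do

contains

  ! Binary search: O(log size(sorted)) per lookup on a sorted array
  logical function contains_sorted(sorted, val) result(found)
    integer, intent(in) :: sorted(:), val
    integer :: lo, hi, mid
    found = .false.
    lo = 1
    hi = size(sorted)
    do while (lo <= hi)
      mid = lo + (hi - lo)/2
      if (sorted(mid) == val) then
        found = .true.
        return
      else if (sorted(mid) < val) then
        lo = mid + 1
      else
        hi = mid - 1
      end if
    end do
  end function contains_sorted
end program sorted_lookup
```

The order of A is untouched here, which matters if the "do something" step depends on the position i.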
I have several 4-dimensional arrays each having different sizes:
array_one(1:2,1:xm,1:ym,1:zm)
where current_step = 1 and previous_step = 2.
In a long loop, with many other operations, I need to shift the current_step values to the previous_step like:
array_one(previous_step,:,:,:) = array_one(current_step,:,:,:)
I know I can do that in a DO loop, but perhaps it is not the most efficient way. Since I have at least 24 such arrays, each having different sizes (i.e. xm,ym,zm), I would need to run separate DO loops for each of them, which could make it slower.
I failed with the following way:
array_one(previous_step,:,:,:) = array_one(current_step,:,:,:)
What is the efficient way for such shifting?
Copy methods
I ran a simple benchmark on my system with 8 different methods to copy the arrays. There were two basic forms of copy I tested:
do k=1,nx
do j=1,nx
do i=1,nx
array(2,i,j,k) = array(1,i,j,k)
end do
end do
end do
and
array(2,:,:,:) = array(1,:,:,:)
For each of these I also tested with the t index as the last array index, e.g.:
array(i,j,k,2) = array(i,j,k,1)
and
array(:,:,:,2) = array(:,:,:,1)
Finally I tested each of these 4 copies as shown serially and with openmp directives, e.g.
!$omp parallel do shared(array) private(i,j,k)
...
!$omp end parallel do
for the do loop copy and with
!$omp parallel workshare shared(array)
...
!$omp end parallel workshare
for the array slice copy.
Each copy was performed 100 times for each of arrays sized 100x100x100x2 up to 1000x1000x1000x2 in increments of 100 (ni=nj=nk for all tested arrays).
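The original test program is not shown, so the following is only a rough sketch of how such a benchmark could be set up for the variant with the t index last. The size `nx`, the repeat count `nrep` and the timing via `system_clock` are assumptions. Without `-fopenmp` the directives are treated as comments and both copies run serially.

```fortran
program arraycopy_sketch
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: nx = 200, nrep = 20     ! hypothetical sizes, not the author's
  real, allocatable :: array(:,:,:,:)
  integer :: i, j, k, rep
  integer(int64) :: t0, t1, t2, rate

  allocate(array(nx,nx,nx,2))                   ! t index last
  call random_number(array)

  call system_clock(t0, rate)
  do rep = 1, nrep
    !$omp parallel do shared(array) private(i,j,k)
    do k = 1, nx
      do j = 1, nx
        do i = 1, nx
          array(i,j,k,2) = array(i,j,k,1)
        end do
      end do
    end do
    !$omp end parallel do
  end do
  call system_clock(t1)
  do rep = 1, nrep
    !$omp parallel workshare shared(array)
    array(:,:,:,2) = array(:,:,:,1)
    !$omp end parallel workshare
  end do
  call system_clock(t2)

  print *, 'do loop copy:     ', real(t1 - t0) / real(rate*nrep), ' s per copy'
  print *, 'array slice copy: ', real(t2 - t1) / real(rate*nrep), ' s per copy'
  print *, 'checksum: ', sum(array(:,:,:,2))
end program arraycopy_sketch
```

Because every repetition copies the same data, an aggressive optimiser may skip some of the work; treat the numbers as indicative only.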
The compiler and compile flags
I tested with gfortran 4.9.1, and compiled my testcase with
gfortran -march=native -fopenmp -O3 -o arraycopy arraycopy.f90
My CPU is an intel i7 990x (6 cores with HT enabled), and native will target the highest instruction set supported by the chip. OpenMP will spawn 12 threads.
The OS is Linux 3.12.13.
Results
Average time per copy is on the y-axis and the array dimension is on the x-axis (e.g. 500 is a 500x500x500x2 or 2x500x500x500 array). The red lines are the do loop copy (dashed is the variation with t index last). The green lines are the array slice copy (dashed is the variation with t index last). For both serial copies the variations with t index first were faster (I did not investigate why) and the array notation copy is faster than the loop. The blue lines are the OpenMP copies with t index first. The black lines are the OpenMP copies with the t index last. The performance of the parallel do and parallel workshare constructs was equivalent.
Discussion
Run your own benchmarks on your own systems with your typical compile flags. The results here are going to be specific to my system including optimization flags, SIMD instructions and OpenMP with 12 threads. This will vary for a system with fewer cores and for CPUs with lesser or greater instruction sets (e.g. a CPU with AVX2 should perform better). These results are also influenced by cache locality, RAM and bus speeds and how my OS scheduler handles hyperthreading.
For my results on my system I would use array slice notation for serial copies and for best performance I would use OpenMP.
In short, when a program issues a memory read operation, say A(i), it will not only read A(i), but instead it will read something like A(i-2), A(i-1), A(i), A(i+1), A(i+2). These values will then be stored in the CPU cache, which is a much faster memory. That is, the CPU will read a chunk of memory and put it into cache for later use. This optimization is based on the fact that it is very likely that your next operation will use some of these surrounding values. If that's the case, the CPU won't need to go and fetch the memory again, which is a very expensive operation (like 100 times more expensive than floating point operations), instead it just needs to look for the value in the cache. This is called data locality.
In Fortran, multidimensional arrays are stored in column-major order. For instance, let's say you have the following 2x2 matrix:
A(1,1)=a11, A(1,2)=a12, A(2,1)=a21, A(2,2)=a22.
The matrix A(1:2,1:2) is stored linearly in memory in this order: a11, a21, a12, a22 (in contrast, in a row-major order like C language, the order would be a11, a12, a21, a22). You can deduce what the order is for higher dimensions.
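A quick way to confirm the storage order is to flatten the matrix with `reshape`, which walks the elements in array element order. In this sketch the integer values simply encode their own indices (11 stands for a11, and so on).

```fortran
program column_major
  implicit none
  integer :: A(2,2)

  ! Values encode their position: 11 is A(1,1), 21 is A(2,1), and so on
  A(1,1) = 11; A(1,2) = 12
  A(2,1) = 21; A(2,2) = 22

  ! reshape to rank 1 walks the elements in array element (storage) order
  print *, reshape(A, [4])
  if (any(reshape(A, [4]) /= [11, 21, 12, 22])) error stop 'not column-major'
end program column_major
```

The printed order is 11, 21, 12, 22: the first index varies fastest.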
In short, Fortran arrays are stored linearly in memory from left to right. If you want to exploit data locality, you need to travel through the array from left to right.
Short answer: I think you should change your structure to (1:xm,1:ym,1:zm,1:2), and if you are going to loop through the array, do it this way:
do h = 1, 2
do i = 1, zm
do j = 1, ym
do k = 1, xm
A(k,j,i,h) = *...something...*
end do
end do
end do
end do
Also, the difference between doing A(:)=B(:) and the equivalent do loop is that A(:)=B(:) is equivalent to a forall statement:
forall(i = 1:n)
A(i) = B(i)
end forall
More here: http://en.wikipedia.org/wiki/Fortran_95_language_features#The_FORALL_Statement_and_Construct
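The practical consequence shows up when the source and destination overlap. In an array assignment (as in forall) the right-hand side is fully evaluated before anything is stored, whereas a do loop stores one element at a time. A small sketch with made-up data:

```fortran
program assignment_vs_loop
  implicit none
  integer :: A(5), B(5), i

  A = [1, 2, 3, 4, 5]
  B = A

  ! Array assignment: the right-hand side is evaluated before anything is stored
  A(2:5) = A(1:4)

  ! Plain do loop: each iteration sees the result of the previous one
  do i = 2, 5
    B(i) = B(i-1)
  end do

  print *, 'array assignment:', A   ! 1 1 2 3 4
  print *, 'do loop:         ', B   ! 1 1 1 1 1
end program assignment_vs_loop
```

When the two sides do not overlap, as in the copy between time levels discussed here, both forms give the same result.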
I am going to analyse and optimize some C-Code and therefore I first have to check, whether the functions I want to optimize are memory-bound or cpu-bound. In general I know, how to do this, but I have some questions about counting Floating Point Operations and analysing the size of data, which is used. Look at the following for-loop, which I want to analyse. The values of the array are doubles (that means 8 Byte each):
for(int j=0 ;j<N;j++){
for(int i=1 ;i<Nt;i++){
matrix[j*Nt+i] = matrix[j*Nt+i-1] * mu + matrix[j*Nt+i]*sigma;
}
}
1) How many floating point operations do you count? I thought about 3*(Nt-1)*N... but do I have to count the operations within the arrays, too (matrix[j*Nt+i], which are 2 more FLOP for this array)?
2) How much data is transferred? 2*((Nt-1)*N)*8 Byte or 3*((Nt-1)*N)*8 Byte? I mean, every entry of the matrix has to be loaded. After the calculation, the new value is saved to that index of the array (so now there is 1 load and 1 store). But this value is used for the next calculation. Is another load operation needed for that, or is this value (matrix[j*Nt+i-1]) already available without a load operation?
Thx a lot!!!
With this type of code, the direct sort of analysis you are proposing to do can be almost completely misleading. The only meaningful information about the performance of the code is actually measuring how fast it runs in practice (benchmarking).
This is because modern compilers and processors are very clever about optimizing code like this, and it will end up executing in a way which is nothing like your straightforward analysis. The compiler will optimize the code, rearranging the individual operations. The processor will itself try to execute the individual sub-operations in parallel and/or pipelined, so that for example computation is occurring while data is being fetched from memory.
It's useful to think about algorithmic complexity, to distinguish between O(n) and O(n²) and so on, but constant factors (like you ask about 2*... or 3*...) are completely moot because they vary in practice depending on lots of details.