Efficient way to compute a tensor change of components in Fortran

This is a follow-up to this question.
I have an array D(:,:,:) of size NxMxM. Typically, for the problem I am considering now, M=400 and N=600000 (I reshaped the array in order to give the largest size to the first index).
Therefore, for each value l of the first index, D(l,:,:) is an MxM matrix in a certain basis. I need to perform a change of components of this matrix using a basis set vec(:,:), of size MxM, so as to obtain the matrices D_mod(l,:,:).
I think that the easiest way to do it is with:
D_mod = 0.0d0
do b = 1, M
   do a = 1, M
      do nu = 1, M
         do mu = 1, M
            D_mod(:,mu,nu) = D_mod(:,mu,nu) + vec(mu,a)*vec(nu,b)*D(:,a,b)
         end do
      end do
   end do
end do
Is there a way to improve the speed of this calculation (possibly using LAPACK/BLAS routines)?
I was considering this approach: reshaping D into an N x M^2 matrix D_re; computing the tensor (Kronecker) product of vec(:,:) with itself and reshaping it to obtain an M^2 x M^2 matrix vecsq_re(:,:) (this motivates this question); and finally computing the matrix product of these two matrices with zgemm. However, I am not sure this is a good strategy.
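For reference, the identity behind this reshape strategy is the standard flatten/Kronecker relation: writing V for vec(:,:) and flatten(X) for the column-stacked form of an MxM matrix X,
flatten(V X V^T) = (V ⊗ V) flatten(X),
so if row l of D_re holds the flattened D(l,:,:), then D_mod_re = D_re * (V ⊗ V)^T, which is exactly the (N x M^2) by (M^2 x M^2) product described above. Note, however, that V ⊗ V has M^4 = 400^4 ≈ 2.6e10 entries, i.e. roughly 400 GB in double complex (about 200 GB in double precision), so forming it explicitly is itself a memory problem.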
EDIT
I am sorry, I wrote the question too fast and too late. The size N can be up to 600000, yes, but I usually adopt strategies to reduce it by at least a factor of 10. The code is supposed to run on nodes with 100 GB of memory.

As @IanBush has said, your D array is enormous, and you're likely to need some kind of high-memory machine or cluster of machines to process it all at once. However, you probably don't need to process it all at once.
Before we get to that, let's first imagine you don't have any memory issues. For the operation you have described, D looks like an array of N matrices, each of size M*M. You say you have "reshaped the array in order to give the biggest size to the first entry", but for this problem this is the exact opposite of what you want. Fortran is a column-major language, so iterating across the first index of an array is much faster than iterating across the last. In practice, this means that an example triple-loop like
do i = 1, N
   do j = 1, M
      do k = 1, M
         D(i,j,k) = D(i,j,k) + 1
      enddo
   enddo
enddo
will run much slower [1] than the re-ordered triple-loop
do k = 1, M
   do j = 1, M
      do i = 1, N
         D(i,j,k) = D(i,j,k) + 1
      enddo
   enddo
enddo
and so you can immediately speed everything up by transposing D and D_mod from N*M*M arrays to M*M*N arrays and rearranging your loops. You can also speed everything up by replacing your manually-written matrix multiplication with matmul (or BLAS/LAPACK), to give
do i = 1, N
   D_mod(:,:,i) = matmul(matmul(vec, D(:,:,i)), transpose(vec))
enddo
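If you would rather call BLAS directly (as the question asks), a minimal sketch of the same per-slice operation with ZGEMM could look like the following; it assumes the arrays are complex(8) (the question mentions zgemm; for real(8) data, dgemm is the analogue), and the scratch array T and the wrapper routine are purely illustrative:
subroutine change_basis(M, N, vec, D, D_mod)
   implicit none
   integer, intent(in)     :: M, N
   complex(8), intent(in)  :: vec(M,M), D(M,M,N)
   complex(8), intent(out) :: D_mod(M,M,N)
   complex(8), allocatable :: T(:,:)
   complex(8), parameter   :: one  = (1.0d0, 0.0d0)
   complex(8), parameter   :: zero = (0.0d0, 0.0d0)
   integer :: i

   allocate(T(M,M))
   do i = 1, N
      ! T = vec * D(:,:,i)
      call zgemm('N', 'N', M, M, M, one, vec, M, D(:,:,i), M, zero, T, M)
      ! D_mod(:,:,i) = T * transpose(vec)  (use 'C' instead of 'T' if a
      ! conjugate transpose is what your change of basis actually needs)
      call zgemm('N', 'T', M, M, M, one, T, M, vec, M, zero, D_mod(:,:,i), M)
   end do
   deallocate(T)
end subroutine change_basis
Whether this beats matmul depends on the compiler and the BLAS implementation; gfortran can also map large matmul calls onto an external BLAS with -fexternal-blas.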
Now that you're doing matrix multiplication one matrix at a time, you can also find a solution for your memory usage problems: instead of loading everything into memory and trying to do everything at once, just load one D matrix at a time into an M*M array, calculate the corresponding entry of D_mod, and write it out to disk before loading the next matrix.
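A minimal sketch of that one-slice-at-a-time idea, assuming the N matrices have already been written to an unformatted stream file (the file names, the complex(8) type and the M x M record layout are illustrative, not from the original post):
program process_slices
   implicit none
   integer, parameter :: M = 400, N = 600000
   complex(8), allocatable :: vec(:,:), Dslice(:,:), Dmod_slice(:,:)
   integer :: i, uin, uout

   allocate(vec(M,M), Dslice(M,M), Dmod_slice(M,M))
   ! vec would be read or computed here
   vec = (0.0d0, 0.0d0)

   open(newunit=uin,  file='D.bin',     form='unformatted', access='stream', status='old')
   open(newunit=uout, file='D_mod.bin', form='unformatted', access='stream', status='replace')
   do i = 1, N
      read(uin) Dslice                    ! one M x M matrix at a time
      Dmod_slice = matmul(matmul(vec, Dslice), transpose(vec))
      write(uout) Dmod_slice
   end do
   close(uin)
   close(uout)
end program process_slices
Only three M x M matrices (a few MB each for M = 400) are ever in memory at once, so this fits comfortably on the 100 GB nodes mentioned in the edit.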
[1] If your compiler doesn't just optimise the loop order.

Related

How to save memory when solving a symmetric (or upper triangular) matrix?

I need to solve a system of linear algebraic equations A.X = B.
The matrix A is double precision with a size of about 33000x33000, and I get an error when I try to allocate it:
Cannot allocate array - overflow on array size calculation.
Since I am using LAPACK dposv with the Intel MKL library, I was wondering if there is a way to somehow pass a smaller matrix to the library function? (Only half of the matrix is needed to solve the system.)
The dposv function only needs the upper or lower triangular part of A. Here are more details about dposv.
Update: Note that the A matrix is N x N and yet dposv takes lda: INTEGER as the leading dimension of a; lda ≥ max(1, n). So maybe there is a way to pass A as a 1D array?
As the error says (Cannot allocate array - overflow on array size calculation), your problem seems to be somewhere else: namely, the limit of the integer type used to compute the array size internally. I am afraid you might not be able to solve that even if you add more memory. You will need to check how the library you are using manages memory internally (possibly MKL, but I don't use MKL so I cannot help) or choose another one.
Explanation: some functions use 4-byte integers to compute the memory size when allocating. That gives a limit of 2^32 bytes, or 4 GB of memory, which is well below your 8 GB array. That assumes unsigned integers; with signed integers, the limit is 2 GB.
Hints if you have limited memory:
If you do not have enough memory (about 4 GB for the matrix alone, since it is triangular) and you do not know the structure of the matrix, then forget about specialized solvers and solve your problem yourself. Solving a system with an upper triangular matrix is a backward substitution: starting with the last row of the solution, you need only one row of the matrix to compute each component of the solution.
Find a way to load your matrix row by row starting with the last row.
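A minimal sketch of that backward substitution for an upper triangular system U x = b; load_row is a hypothetical routine (not part of any library) that fetches one row of U from wherever it is stored, so only a single row is ever held in memory:
subroutine back_substitution(n, b, x)
   implicit none
   integer, intent(in)           :: n
   double precision, intent(in)  :: b(n)
   double precision, intent(out) :: x(n)
   double precision, allocatable :: row(:)
   integer :: i

   allocate(row(n))
   do i = n, 1, -1
      call load_row(i, row)   ! hypothetical: fills row with U(i,1:n)
      x(i) = (b(i) - dot_product(row(i+1:n), x(i+1:n))) / row(i)
   end do
   deallocate(row)
end subroutine back_substitution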
Thanks to mecej4
There are several options to pass a huge matrix using less memory:
Using functions that support Matrix Storage Schemes, e.g. ?pbsv (see the sketch after this list)
Using PARDISO
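For the first option, a minimal sketch using dppsv, LAPACK's packed-storage counterpart of dposv (the ?pbsv named above is the banded variant); only the n*(n+1)/2 elements of the upper triangle are stored, and the identity matrix used here is just a placeholder:
program packed_solve_sketch
   implicit none
   integer, parameter :: n = 4, nrhs = 1
   double precision   :: ap(n*(n+1)/2), b(n, nrhs)
   integer :: info, j

   ! Packed upper storage: AP(i + (j-1)*j/2) = A(i,j) for 1 <= i <= j.
   ap = 0.0d0
   do j = 1, n
      ap(j + (j-1)*j/2) = 1.0d0    ! diagonal entries of the identity
   end do
   b = 1.0d0

   call dppsv('U', n, nrhs, ap, b, n, info)
   if (info /= 0) then
      print *, 'dppsv failed, info =', info
   else
      print *, 'solution:', b(:,1)
   end if
end program packed_solve_sketch
For n = 33000 the packed triangle takes about 4.4 GB instead of roughly 8.7 GB for the full matrix.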

Copying arrays: loops vs. array operations

I have been working with Fortran for quite a long time now, but I have a question to which I can't find a satisfying answer.
If I have two arrays and I want to copy one into the other:
real,dimension(0:100,0:100) :: array1,array2
...
do i = 0, 100
   do j = 0, 100
      array1(i,j) = array2(i,j)
   enddo
enddo
But I also noticed that it works just as well if I do it like this:
real,dimension(0:100,0:100) :: array1,array2
...
array1 = array2
And there is a huge difference in computational time! (The second one is much faster!)
If I do it without a loop, can there be a problem? I don't know, maybe I'm not copying the content but just the memory reference?
Does it change anything if I do another mathematical step like:
array1 = array2*5
Is there a problem on a different architecture (a cluster server) or with a different compiler (gfortran, ifort)?
I have to perform various computational steps on huge amounts of data so the computational time is an issue.
Everything that @Alexander_Vogt said, but also:
do i = 0, 100
   do j = 0, 100
      array1(i,j) = array2(i,j)
   enddo
enddo
will always be slower than
do j = 0, 100
   do i = 0, 100
      array1(i,j) = array2(i,j)
   enddo
enddo
(Unless the compiler notices it and reorders the loops.)
In Fortran, the first index is the fastest changing. That means that in the second loop, the compiler can load several elements of the array into the lower-level caches in one big sweep and operate on them there.
If you have multidimensional loops, always have the innermost loop run over the first index, the next loop over the second index, and so on. (If possible in any way.)
Fortran is very capable of performing vector operations. Both
array1 = array2
and
array1 = array2*5
are valid operations.
This notation allows the compiler to parallelize and/or optimize the code efficiently, as no dependence on the order of the operations exists.
However, these constructs are equivalent to the explicit loops, and it depends on the compiler which one will be faster.
Whether the memory will be copied or not depends on what is done with the arrays afterwards and on whether the compiler can optimize that. If there is no performance gain, it is safe to assume the array will be copied.

efficient shifting of multi-dimensional arrays in Fortran

I have several 4-dimensional arrays each having different sizes:
array_one(1:2,1:xm,1:ym,1:zm)
where current_step = 1 and previous_step = 2.
In a long loop, with many other operations, I need to shift the current_step values to the previous_step like:
array_one(previous_step,:,:,:) = array_one(current_step,:,:,:)
I know I can do that in a DO loop, but perhaps it is not the most efficient way. Since I have at least 24 such arrays, each with different sizes (i.e. xm, ym, zm), I would need to run separate DO loops for each of them, which could make it slower.
I failed with the following way:
array_one(previous_step,:,:,:) = array_one(current_step,:,:,:)
What is the efficient way for such shifting?
Copy methods
I ran a simple benchmark on my system with 8 different methods to copy the arrays. There were two basic forms of copy I tested:
do k = 1, nx
   do j = 1, nx
      do i = 1, nx
         array(2,i,j,k) = array(1,i,j,k)
      end do
   end do
end do
and
array(2,:,:,:) = array(1,:,:,:)
For each of these I also tested with the t index as the last array index, e.g.:
array(i,j,k,2) = array(i,j,k,1)
and
array(:,:,:,2) = array(:,:,:,1)
Finally, I tested each of these 4 copies as shown, both serially and with OpenMP directives, e.g.
!$omp parallel do shared(array) private(i,j,k)
...
!$omp end parallel do
for the do loop copy and with
!$omp parallel workshare shared(array)
...
!$omp end parallel workshare
for the array slice copy.
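For completeness, a minimal self-contained version of the OpenMP do-loop copy (the array name, the extents and the timing are illustrative, not the actual benchmark code):
program omp_copy_sketch
   use omp_lib, only: omp_get_wtime
   implicit none
   integer, parameter :: ni = 100, nj = 100, nk = 100
   real, allocatable  :: array(:,:,:,:)
   double precision   :: t0, t1
   integer :: i, j, k

   allocate(array(ni,nj,nk,2))
   array(:,:,:,1) = 1.0

   t0 = omp_get_wtime()
   !$omp parallel do shared(array) private(i,j,k)
   do k = 1, nk
      do j = 1, nj
         do i = 1, ni
            array(i,j,k,2) = array(i,j,k,1)
         end do
      end do
   end do
   !$omp end parallel do
   t1 = omp_get_wtime()

   print *, 'copy took', t1 - t0, 'seconds'
end program omp_copy_sketch
Compile with something like gfortran -O3 -fopenmp omp_copy_sketch.f90.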
Each copy was performed 100 times for each array size, from 100x100x100x2 up to 1000x1000x1000x2 in increments of 100 (ni=nj=nk for all tested arrays).
The compiler and compile flags
I tested with gfortran 4.9.1, and compiled my testcase with
gfortran -march=native -fopenmp -O3 -o arraycopy arraycopy.f90
My CPU is an Intel i7 990X (6 cores with HT enabled), and -march=native targets the highest instruction set supported by the chip. OpenMP spawns 12 threads.
The OS is Linux 3.12.13.
Results
Average time per copy is on the y-axis and the array dimension is on the x-axis (e.g. 500 is a 500x500x500x2 or 2x500x500x500 array). The red lines are the do loop copy (dashed is the variant with the t index last). The green lines are the array slice copy (dashed is the variant with the t index last). For both serial copies the variants with the t index first were faster (I did not investigate why), and the array notation copy is faster than the loop. The blue lines are the OpenMP copies with the t index first. The black lines are the OpenMP copies with the t index last. The performance of the parallel do and parallel workshare constructs was equivalent.
Discussion
Run your own benchmarks on your own systems with your typical compile flags. The results here are specific to my system, including optimization flags, SIMD instructions, and OpenMP with 12 threads. They will differ on a system with fewer cores or on a CPU with a lesser or greater instruction set (e.g. a CPU with AVX2 should perform better). These results are also influenced by cache locality, RAM and bus speeds, and how my OS scheduler handles hyperthreading.
For my results on my system I would use array slice notation for serial copies and for best performance I would use OpenMP.
In short, when a program issues a memory read operation, say A(i), it will not read only A(i), but rather something like A(i-2), A(i-1), A(i), A(i+1), A(i+2). These values are then stored in the CPU cache, which is a much faster memory. That is, the CPU reads a chunk of memory and puts it into the cache for later use. This optimization is based on the fact that your next operation is very likely to use some of these surrounding values. If that's the case, the CPU won't need to go and fetch the memory again, which is a very expensive operation (roughly 100 times more expensive than a floating point operation); instead it just needs to look up the value in the cache. This is called data locality.
In Fortran, multidimensional arrays are stored in column-major order. For instance, let's say you have the following 2x2 matrix:
A(1,1)=a11, A(1,2)=a12, A(2,1)=a21, A(2,2)=a22.
The matrix A(1:2,1:2) is stored linearly in memory in this order: a11, a21, a12, a22 (in contrast, in a row-major language such as C, the order would be a11, a12, a21, a22). You can deduce the order for higher dimensions.
In short, Fortran arrays are stored linearly in memory from left to right. If you want to exploit data locality, you need to travel through the array from left to right.
Short answer: I think you should change your structure to (1:xm,1:ym,1:zm,1:2), and if you are going to loop through the array, do it this way:
do h = 1, 2
   do i = 1, zm
      do j = 1, ym
         do k = 1, xm
            A(k,j,i,h) = ...  ! something
         end do
      end do
   end do
end do
Also, the difference between doing A(:)=B(:) and the equivalent do loop is that A(:)=B(:) is equivalent to a forall statement:
forall(i = 1:n)
   A(i) = B(i)
end forall
More here: http://en.wikipedia.org/wiki/Fortran_95_language_features#The_FORALL_Statement_and_Construct

Eigen Sparse Matrix

I am trying to multiply two large sparse matrices of sizes 300k x 1000k and 1000k x 300k using Eigen. The matrices are highly sparse, with ~0.01% non-zero entries, but there is no block or other structure in their sparsity.
It turns out that Eigen chokes and ends up taking 55-60 GB of memory. In fact, it makes the final matrix dense, which explains why it takes so much memory.
I have tried multiplying matrices of similar sizes where one of the matrices is diagonal, and the multiplication works fine, with ~2-3 GB of memory.
Any thoughts on what's going wrong?
Even though your matrices are sparse, the result might be completely dense. You can try to remove the smallest entries with (A*B).prune(ref,eps);, where ref is a reference value for what is not a zero and eps is a tolerance. Basically, all entries smaller than ref*eps will be removed during the computation of the product, reducing both the memory usage and the size of the result. An even better option would be to find a way to avoid performing this product at all.

Flexible array size in C

I am coding an MCMC algorithm in C and I have a little problem. The idea of this algorithm is to make inferences about the number of groups in a population. So let us say that we start with k groups, where the first value of k is given by the user or randomly selected. At each step of the algorithm, k can decrease by 1, increase by 1, or stay the same. And I have some variables for each group:
double *mu;
double *lambda;
double **A;
mu and lambda are arrays of k elements and A is a two-dimensional array of size k x N. N also changes at each iteration. I have some data y1, y2, ..., yn, so at each iteration I do some processing, propose new values for the parameters, and decide whether or not to change k.
So far I have tried to use malloc and realloc to deal with all these changes in the dimensions of my parameters, but I have to run this algorithm for, let us say, 100,000 iterations, so at a certain point it crashes. If I start with k=10, in my case it crashes at the third iteration!
So two questions:
Can I use realloc at each iteration, or is that my big mistake? If it is fine, then I suppose I should check my code!
If not, what should I do? Any suggestions?
I would consider not changing your storage on every iteration. realloc carries considerable overhead (in the worst case, it has to copy your entire array every single time).
Can you simply allocate for the maximum dimensions at startup, and then just use less of it? Or, at the very least, only realloc when the storage requirement increases, doubling your capacity each time (thus mimicking how a std::vector operates)?
(By the way, I don't know why your application crashes, as you haven't given us any details, e.g. the error message you get or what you've found by debugging. But I guess you have a bug somewhere!)
