Efficiently transforming NxJ matrix into an N-long array with J numbers

I'm doing some Bayesian analysis using Stan and I'm trying to make my code more efficient.
In my Stan model string, I have a variable that is an NxJ matrix. It is declared this way to make use of quick matrix operations and assignments.
However, in the final modeling step (assigning a distribution), I need to transform this NxJ matrix into an N-long array that contains J real values in each of the array's elements.
In other words, I want the following transformation:
matrix[N,J] x;
vector[J] y[N];
for (i in 1:N)
  for (j in 1:J)
    y[i][j] = x[i,j];
Is there any way to do this in a vectorized way without for loops?
Thank you!!!!

No. Loops are very fast in Stan. The only reason to vectorize for speed is if there are derivatives involved. You could shorten it a bit to
for (n in 1:N)
y[n] = x[n]';
but it wouldn't be any more efficient.
I should qualify this by saying that there is one inefficiency here, which is lack of memory locality. If the matrices are large, they'll be slow to traverse by row because internally they are stored column-major.

Related

Assemble eigen3 sparsematrix from smaller sparsematrices

I am assembling the Jacobian of a coupled multi-physics system. The Jacobian consists of a block matrix on the diagonal for each system and off-diagonal blocks for the coupling.
I find it best to assemble the blocks separately and then sum over them with projection matrices to get the complete Jacobian.
Pseudo-code (where J[i] are the diagonal blocks, C[ij] the couplings, and P the projections to the complete matrix):
// diagonal blocks
J.setZero();
for(int i=0;i<N;++i){
    J+=P[i]*J[i]*P[i].transpose();
}
// off diagonal elements
for(int i=0;i<N;++i){
    for(int j=i+1;j<N;++j){
        J+=P[i]*C[ij]*P[j].transpose();
        J+=P[j]*C[ji]*P[i].transpose();
    }
}
This costs a lot of performance, around 20% of the whole program's run time, which is too much for mere assembly. I have to recalculate the Jacobian every time step since the system is highly nonlinear.
Valgrind indicates that the resource-consuming method is Eigen::internal::assign_sparse_to_sparse, and within it the call to Eigen::SparseMatrix<>::insertBackByOuterInner.
Is there a more efficient way to assemble such a matrix?
(I also had to use P*(J*P.transpose()) instead of P*J*P.transpose() to make the program compile; maybe there is already something wrong there.)
P.S.: NDEBUG and optimizations are turned on.
Edit: by storing P.transpose() in an extra matrix, I get slightly better performance, but the summation still accounts for 15% of the program.
Your code will be much faster if you work in place. First, estimate the number of non-zeros per column in the final matrix and reserve space (if not already done):
int nnz_per_col = ...;
J.reserve(VectorXi::Constant(n_cols, nnz_per_col));
If the number of nnz per column is highly non-uniform, then you can also compute it per column:
VectorXi nnz_per_col(n_cols);
for (int j = 0; j < n_cols; ++j)
    nnz_per_col(j) = ...;
J.reserve(nnz_per_col);
Then manually insert elements:
for each block B[k]
    for each element (i,j)
        J.coeffRef(foo(i),foo(j)) += B[k](i,j);
where foo implements the appropriate mapping of indices.
And for the next iteration, no need to reserve, but you need to set coefficient values to zero while preserving the structure:
J.coeffs().setZero();

Matrix calculations without loops in MATLAB

I have an issue with code performing some array operations. It is too slow because I use loops and the input data are quite big. Loops were the easiest way for me, but now I am looking for something faster than for loops. I tried to optimize or rewrite the code, but without success. I would really appreciate your help.
In my code I have three arrays x1, y1 (coordinates of points in a grid) and g1 (values at the points); for example, their size is 300 x 300. I treat each matrix as a composition of 9 blocks and I do the calculation for the points in the middle one. For example, I start with g1(101,101), but I am using data from g1(1:201,1:201)=g2. I need to calculate the distance from each point of g1(1:201,1:201) to g1(101,101) (the ll matrix), then I calculate nn as in the code, and next I take the value for g1(101,101) from nn and put it in the N array. Then I go to g1(101,102) and so on until g1(200,200), where in this last case g2=g1(99:300,99:300).
As I said, this code is not very efficient, and since I have to use larger arrays than in the example, it takes too much time. I hope I have explained clearly enough what I expect from the code. I was thinking of using arrayfun, but I have never worked with this function, so I don't know how I should use it; it also seems to me it won't handle this. Maybe there are other solutions, but I couldn't find anything appropriate.
tic
x1=randn(300,300);
y1=randn(300,300);
g1=randn(300,300);
m=size(g1,1);
n=size(g1,2);
w=1/3*m;
k=1/3*n;
N=zeros(w,k);
for i=w+1:2*w
    for j=k+1:2*k
        x=x1(i,j);
        y=y1(i,j);
        x2=y1(i-k:i+k,j-w:j+w);
        y2=y1(i-k:i+k,j-w:j+w);
        g2=g1(i-k:i+k,j-w:j+w);
        ll=1./sqrt((x2-x).^2+(y2-y).^2);
        ll(isinf(ll))=0;
        nn=ifft2(fft2(g2).*fft2(ll));
        N(i-w,j-k)=nn(w+1,k+1);
    end
end
czas=toc;
For what it's worth, arrayfun() is just a wrapper for a for loop, so it wouldn't lead to any performance improvements. Also, you probably have a typo in the definition of x2, I'll assume that it depends on x1. Otherwise it would be a superfluous variable. Also, your i<->w/k, j<->k/w pairing seems inconsistent, you should check that as well. Also also, just timing with tic/toc is rarely accurate. When profiling your code, put it in a function and run the timing multiple times, and exclude the variable generation from the timing. Even better: use the built-in profiler.
Disclaimer: this solution will likely not help for your actual problem due to its huge memory need. For your input of 300x300 matrices this works with arrays of size 300x300x100x100, which is usually a no-go. Still, it's here for reference with a smaller input size. I wanted to add a solution based on nlfilter(), but your problem seems to be too convoluted to be able to use that.
As always with vectorization, you can do it faster if you can spare the memory for it. You are trying to work with matrices of size [2*k+1,2*w+1] for each [i,j] index. This calls for 4d arrays, of shape [2*k+1,2*w+1,w,k]. For each element [i,j] you have a matrix with indices [:,:,i,j] to treat together with the corresponding elements of x1 and y1. It also helps that fft2 accepts multidimensional arrays.
Here's what I mean:
tic
x1 = randn(30,30); %// smaller input for tractability
y1 = randn(30,30);
g1 = randn(30,30);
m = size(g1,1);
n = size(g1,2);
w = 1/3*m;
k = 1/3*n;
%// these will be indexed on the fly:
%//x = x1(w+1:2*w,k+1:2*k); %// size [w,k]
%//y = y1(w+1:2*w,k+1:2*k); %// size [w,k]
x2 = zeros(2*k+1,2*w+1,w,k); %// size [2*k+1,2*w+1,w,k]
y2 = zeros(2*k+1,2*w+1,w,k); %// size [2*k+1,2*w+1,w,k]
g2 = zeros(2*k+1,2*w+1,w,k); %// size [2*k+1,2*w+1,w,k]
%// manual definition for now, maybe could be done smarter:
for ii=w+1:2*w %// don't use i and j as variables
    for jj=k+1:2*k %// don't use i and j as variables
        x2(:,:,ii-w,jj-k) = x1(ii-k:ii+k,jj-w:jj+w); %// check w vs k here
        y2(:,:,ii-w,jj-k) = y1(ii-k:ii+k,jj-w:jj+w); %// check w vs k here
        g2(:,:,ii-w,jj-k) = g1(ii-k:ii+k,jj-w:jj+w); %// check w vs k here
    end
end
%// use bsxfun to operate on [2*k+1,2*w+1,w,k] vs [w,k]-sized arrays
%// need to introduce leading singletons with permute() in the latter
%// in order to have shape [1,1,w,k] compatible with the first array
ll = 1./sqrt(bsxfun(@minus,x2,permute(x1(w+1:2*w,k+1:2*k),[3,4,1,2])).^2 ...
    + bsxfun(@minus,y2,permute(y1(w+1:2*w,k+1:2*k),[3,4,1,2])).^2);
ll(isinf(ll)) = 0;
%// compute fft2, operating on [2*k+1,2*w+1,w,k]
%// will return fft2 for each index in the [w,k] subspace
nn = ifft2(fft2(g2).*fft2(ll));
%// we need nn(w+1,k+1,:,:) which is exactly of size [w,k] as needed
N = reshape(nn(w+1,k+1,:,:),[w,k]); %// quicker than squeeze()
N = real(N); %// this solution leaves an imaginary part of around 1e-12
czas=toc;

Vectorizing access to a slice of a three-dimensional matrix in MATLAB

I have a three-dimensional matrix of these sizes, approximately
A = rand(20, 1000, 20);
where the first and third dimensions are always the same length. I want to zero the elements in a main diagonal slice. This does what I mean
for ii = 1:size(A, 1)
A(ii, :, ii) = 0;
end
Is there a vectorized or otherwise faster way to do this? This code runs about 100,000 times, with these approximate sizes, but not the exact same sizes each time.
You can use logical indexing for multiple trailing dimensions while using subscript indexing for all previous dimensions individually. This way you can easily do the operation on a 1000x20x20 matrix. To apply this to your matrix, a permute is required, which might be slow:
n=size(A,3);
A=permute(A,[2,1,3]);
A(:,diag(true(n,1)))=0;
A=permute(A,[2,1,3]);
If it would be possible to permanently swap the dimensions of A in your code and avoid the permute, this would lead to the fastest solution.
Alternatively you can use repmat to expand the index to the dimensions of A
ix=repmat(reshape(diag(true(n,1)),n,1,n),[1,size(A,2),1]);
A(ix)=0;
For matrices of the same size you could keep ix. Not having access to MATLAB right now, I don't know which solution is faster.
You can use bsxfun to build a linear index of the elements to be zeroed:
ind = bsxfun(@plus, (0:size(A,2)-1).'*size(A,1), 1:size(A,1)*size(A,2)+1:numel(A) );
A(ind) = 0;

Run-time efficient transposition of a rectangular matrix of arbitrary size

I am pressed for time to optimize a large piece of C code for speed and I am looking for an algorithm (at best a C "snippet") that transposes a rectangular source matrix u[r][c] of arbitrary size (r rows, c columns) into a target matrix v[s][d] (s = c rows, d = r columns) in a "cache-friendly", i.e. data-locality respecting, way. The typical size of u is around 5000 to 15000 rows by 50 to 500 columns, and it is clear that a row-wise access of elements is very cache-inefficient.
There are many discussions of this topic on the web (including near this thread), but as far as I can see all of them discuss special cases like square matrices, u[r][r], or the definition of a one-dimensional array, e.g. u[r * c], not the above-mentioned "array of arrays" (of equal length) used in my context of Numerical Recipes (background see here).
I would be very thankful for any hint that spares me the "reinvention of the wheel".
Martin
I do not think that array of arrays is much harder to transpose than linear array in general. But if you are going to have 50 columns in each array, that sounds bad: it may be not enough to hide the overhead of pointer dereferencing.
I think that the overall strategy of cache-friendly implementation is the same: process your matrix in tiles, choose size of tiles which performs best according to experiments.
template<int BLOCK>
void TransposeBlocked(Matrix &dst, const Matrix &src) {
    int r = dst.r, c = dst.c;
    assert(r == src.c && c == src.r);
    for (int i = 0; i < r; i += BLOCK)
        for (int j = 0; j < c; j += BLOCK) {
            if (i + BLOCK <= r && j + BLOCK <= c)
                ProcessFullBlock<BLOCK>(dst.data, src.data, i, j);
            else
                ProcessPartialBlock(dst.data, src.data, r, c, i, j, BLOCK);
        }
}
I have tried to optimize the case r = 10000, c = 500 (with float type). On my local machine, 128 x 128 tiles give a 2.5x speedup. I have also tried using SSE to accelerate the transposition, but it does not change the timings significantly. I think that's because the problem is memory bound.
Here are full timings (for 100 launches each) of various implementations on Core2 E4700 2.6GHz:
Trivial: 6.111 sec
Blocked(4): 8.370 sec
Blocked(16): 3.934 sec
Blocked(64): 2.604 sec
Blocked(128): 2.441 sec
Blocked(256): 2.266 sec
BlockedSSE(16): 4.158 sec
BlockedSSE(64): 2.604 sec
BlockedSSE(128): 2.245 sec
BlockedSSE(256): 2.036 sec
Here is the full code used.
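As a rough illustration of the tiling strategy described above (not the code the timings were measured with), here is a minimal plain-C sketch for the array-of-row-pointers layout from the question. The name transpose_tiled and the TILE value are assumptions made for this example; the best tile size has to be found by experiment on the target machine.

```c
#include <stddef.h>

/* Hypothetical tile edge; tune by experiment (the timings above favoured 64 to 256). */
#define TILE 64

/* Transpose src (rows x cols) into dst (cols x rows). Both matrices are
   arrays of row pointers. Work proceeds one TILE x TILE tile at a time so
   that the reads and writes of a tile stay within a small memory region. */
void transpose_tiled(float **dst, float *const *src, size_t rows, size_t cols)
{
    for (size_t i0 = 0; i0 < rows; i0 += TILE) {
        /* clamp the tile at the bottom edge (partial tile) */
        size_t i1 = (i0 + TILE < rows) ? i0 + TILE : rows;
        for (size_t j0 = 0; j0 < cols; j0 += TILE) {
            /* clamp the tile at the right edge (partial tile) */
            size_t j1 = (j0 + TILE < cols) ? j0 + TILE : cols;
            for (size_t i = i0; i < i1; ++i)
                for (size_t j = j0; j < j1; ++j)
                    dst[j][i] = src[i][j];
        }
    }
}
```

Unlike the template version above, this sketch handles full and partial tiles with the same clamped loops, which is simpler but gives the compiler less room to specialise the full-tile case.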
So, I'm guessing you have an array of arrays of floats/doubles. This setup is already very bad for cache performance. The reason is that with a 1-dimensional array the compiler can output code that results in a prefetch operation and (in the case of a very new compiler) produce SIMD/vectorized code. With an array of pointers there's a dereference operation on each step, making a prefetch more difficult. Not to mention there aren't any guarantees on memory alignment.
If this is for an assignment and you have no choice but to write the code from scratch, I'd recommend looking at how CBLAS does it (note that you'll still need your array to be "flattened"). Otherwise, you're much better off using a highly optimized BLAS implementation like OpenBLAS. It's been optimized for nearly a decade and will produce the fastest code for your target processor (tuning for things like cache sizes and vector instruction set).
The tl;dr is that using an array of arrays will result in terrible performance no matter what. Flatten your arrays and make your code nice to read by using a #define to access elements of the array.
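For what it's worth, the "flatten and use a #define" advice could look roughly like the following. This is only a sketch: the macro name AT and the helpers matrix_alloc and row_sum are made up for illustration, and row-major storage is an assumption.

```c
#include <stdlib.h>

/* Hypothetical accessor: element (i, j) of a row-major matrix that has
   ncols columns and is stored in one contiguous block. */
#define AT(m, ncols, i, j) ((m)[(size_t)(i) * (ncols) + (j)])

/* Allocate a rows x cols matrix as a single zero-initialised block.
   Returns NULL on failure; release it with free(). */
double *matrix_alloc(size_t rows, size_t cols)
{
    return calloc(rows * cols, sizeof(double));
}

/* Example use of the accessor: sum of row i, walking contiguous memory. */
double row_sum(const double *m, size_t ncols, size_t i)
{
    double s = 0.0;
    for (size_t j = 0; j < ncols; ++j)
        s += AT(m, ncols, i, j);
    return s;
}
```

With this layout a whole row is one contiguous run of doubles, so there is no per-row pointer to dereference.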

Most efficient way to calculate the exponential of each element of a matrix

I'm migrating from Matlab to C + GSL and I would like to know what's the most efficient way to calculate the matrix B for which:
B[i][j] = exp(A[i][j])
where i in [0, Ny] and j in [0, Nx].
Notice that this is different from matrix exponential:
B = exp(A)
which can be accomplished with some unstable/unsupported code in GSL (linalg.h).
I've just found the brute force solution (couple of 'for' loops), but is there any smarter way to do it?
EDIT
Results from the solution post of Drew Hall
All the results are from a 1024x1024 for(for) loop in which in each iteration two double values (a complex number) are assigned. The time is the averaged time over 100 executions.
Results when taking into account the {Row,Column}-Major mode to store the matrix:
226.56 ms when looping over the row in the inner loop in Row-Major mode (case 1).
223.22 ms when looping over the column in the inner loop in Row-Major mode (case 2).
224.60 ms when using the gsl_matrix_complex_set function provided by GSL (case 3).
Source code for case 1:
for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix->data[2*(i*s_tda + j)] = GSL_REAL(c_value);
        matrix->data[2*(i*s_tda + j)+1] = GSL_IMAG(c_value);
    }
}
Source code for case 2:
for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix->data[2*(j*s_tda + i)] = GSL_REAL(c_value);
        matrix->data[2*(j*s_tda + i)+1] = GSL_IMAG(c_value);
    }
}
Source code for case 3:
for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        gsl_matrix_complex_set(matrix, i, j, c_value);
    }
}
There's no way to avoid iterating over all the elements and calling exp() or equivalent on each one. But there are faster and slower ways to iterate.
In particular, your goal should be to minimize cache misses. Find out if your data is stored in row-major or column-major order, and be sure to arrange your loops such that the inner loop iterates over elements stored contiguously in memory, and the outer loop takes the big stride to the next row (if row major) or column (if column major). Although this seems trivial, it can make a HUGE difference in performance (depending on the size of your matrix).
Once you've handled the cache, your next goal is to remove loop overhead. The first step (if your matrix API supports it) is to go from nested loops (M & N bounds) to a single loop iterating over the underlying data (M*N bound). You'll need to get a raw pointer to the underlying memory block (that is, a double* rather than a double**) to do this.
Finally, throw in some loop unrolling (that is, do 8 or 16 elements for each iteration of the loop) to further reduce the loop overhead, and that's probably about as quick as you can make it. You'll probably need a final switch statement with fall-through to clean up the remainder elements (for when your array size % block size != 0).
No, unless there's some strange mathematical quirk I haven't heard of, you pretty much just have to loop through the elements with two for loops.
If you just want to apply exp to an array of numbers, there's really no shortcut. You gotta call it (Nx * Ny) times. If some of the matrix elements are simple, like 0, or there are repeated elements, some memoization could help.
However, if what you really want is a matrix exponential (which is very useful), the algorithm we rely on is DGPADM. It's in Fortran, but you can use f2c to convert it to C. Here's the paper on it.
Since the contents of the loop (the bit that calculates c_value) haven't been shown, we don't know whether the performance of the code is limited by memory bandwidth or by the CPU. The only way to know for sure is to use a profiler, and a sophisticated one at that. It needs to be able to measure memory latency, i.e. the amount of time the CPU has been idle waiting for data to arrive from RAM.
If you are limited by memory bandwidth, there's not a lot you can do once you're accessing memory sequentially. The CPU and memory work best when data is fetched sequentially. Random accesses hit the throughput as data is more likely to have to be fetched into cache from RAM. You could always try getting faster RAM.
If you're limited by CPU then there are a few more options available to you. Using SIMD is one option, as is hand coding the floating point code (C/C++ compilers aren't great at FPU code for many reasons). If this were me, and the code in the inner loop allows for it, I'd have two pointers into the array, one at the start and a second 4/5ths of the way through it. Each iteration, a SIMD operation would be performed using the first pointer and scalar FPU operations using the second pointer, so that each iteration of the loop does five values. Then, I'd interleave the SIMD instructions with the FPU instructions to mitigate latency costs. This shouldn't affect your caches since (at least on the Pentium) the MMU can stream up to four data streams simultaneously (i.e. prefetch data for you without any prompting or special instructions).
