Assemble an Eigen3 SparseMatrix from smaller sparse matrices

I am assembling the Jacobian of a coupled multiphysics system. The Jacobian consists of a block matrix on the diagonal for each system and off-diagonal blocks for the coupling.
I find it best to assemble the blocks separately and then sum over them with projection matrices to get the complete Jacobian.
Pseudo-code (where J[i] are the diagonal blocks, C[ij] the couplings, and P[i] the projections into the complete matrix):
// diagonal blocks
J.setZero();
for (int i = 0; i < N; ++i) {
    J += P[i] * J[i] * P[i].transpose();
}
// off-diagonal blocks
for (int i = 0; i < N; ++i) {
    for (int j = i+1; j < N; ++j) {
        J += P[i] * C[ij] * P[j].transpose();
        J += P[j] * C[ji] * P[i].transpose();
    }
}
This costs a lot of performance, around 20% of the whole program, which is too much for mere assembly. I have to recompute the Jacobian every time step since the system is highly nonlinear.
Valgrind indicates that the resource-consuming method is Eigen::internal::assign_sparse_to_sparse, and within it the call to Eigen::SparseMatrix<>::insertBackByOuterInner.
Is there a more efficient way to assemble such a matrix?
(I also had to use P[i]*(J[i]*P[i].transpose()) instead of P[i]*J[i]*P[i].transpose() to make the program compile; maybe there is already something wrong there.)
P.S.: NDEBUG and optimizations are turned on.
Edit: by storing P[i].transpose() in an extra matrix, I get somewhat better performance, but the summation still accounts for 15% of the program.

Your code will be much faster by working in place. First, estimate the number of non-zeros per column in the final matrix and reserve space (if not already done):
int nnz_per_col = ...;
J.reserve(VectorXi::Constant(n_cols, nnz_per_col));
If the number of non-zeros per column is highly non-uniform, then you can also compute it per column:
VectorXi nnz_per_col(n_cols);
for each column j
    nnz_per_col(j) = ...;
J.reserve(nnz_per_col);
Then insert the elements manually:
for each block B[k]
    for each element (i, j) of B[k]
        J.coeffRef(foo(i), foo(j)) += B[k](i,j);
where foo implements the appropriate mapping of indices.
And for the next iteration there is no need to reserve again, but you do need to set the coefficient values to zero while preserving the structure:
J.coeffs().setZero();
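Putting the pieces together, here is a minimal sketch of what that in-place assembly could look like for the diagonal blocks, assuming each block k occupies a contiguous index range starting at offset[k] in the global matrix (the offset layout and the function name are illustrative assumptions, not from the original post):
#include <Eigen/Sparse>
#include <cstddef>
#include <vector>

using SpMat = Eigen::SparseMatrix<double>;

// Add each block's entries into J in place; foo(i) from above becomes
// offset[k] + i under the assumed contiguous block layout.
void addDiagonalBlocks(SpMat& J,
                       const std::vector<SpMat>& blocks,
                       const std::vector<int>& offset)
{
    for (std::size_t k = 0; k < blocks.size(); ++k) {
        const SpMat& B = blocks[k];
        for (int outer = 0; outer < B.outerSize(); ++outer)
            for (SpMat::InnerIterator it(B, outer); it; ++it)
                J.coeffRef(offset[k] + it.row(),
                           offset[k] + it.col()) += it.value();
    }
}
After the first assembly, calling J.makeCompressed() puts the matrix in compressed mode, so that J.coeffs().setZero() can reset the flat value array on later time steps while coeffRef only performs lookups, never insertions.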

Related

Lower memory usage when searching for optimal route from one point of matrix to another

I have the following problem:
Tourist wants to create an optimal route for a hike. He has a terrain elevation map at a certain scale: an NxM matrix containing the elevation values at the corresponding terrain points. The tourist wants to make a route from the start point to the end point in such a way that the total change in altitude while passing the route is minimal. The total change in altitude on the route is the sum of the absolute changes in altitude on each segment of the route. For example, if there is a continuous ascent or descent from the starting point of the route to the end point, then such a route is optimal.
For simplicity, let's assume that you can only walk along the lines of an imaginary grid, i.e. from position (i, j), which is not on the edge of the map, you can go to position (i-1, j), (i+1, j), (i, j-1), (i, j+1). You cannot go beyond the edge of the map.
On standard input, non-negative integers N, M, N0, M0, N1, M1 are given. N is the number of rows in the height map, M is the number of columns. Point (N0, M0) is the starting point of the route, point (N1, M1) is the end point. Coordinates are numbered starting from zero, and the two points may coincide. It is known that the total number of matrix elements is limited to 1,100,000.
After these numbers, the height map is entered row by row: first the first row, then the second, and so on. Each height is a non-negative integer not exceeding 10,000.
Print to standard output the total change in elevation while traversing the optimal route.
I came to the conclusion that this is a shortest-path problem on a graph, and wrote a solution accordingly.
But for m = n = 1000 the program eats too much memory (~169 MiB, mostly heap).
Limits are as following:
Time limit: 2 s
Memory limit: 50M
Stack limit: 64M
I also wrote a C++ program doing the same thing with priority_queue (just to check; the problem must be solved in C), but it still needs ~78 MiB (mostly heap).
How should I solve this problem: use another algorithm, optimize the existing C code, or something else?
You can fit a height value into a 16-bit unsigned int (uint16_t, if using C++11). Storing 1.1M of those 2-byte values requires 2.2 MB of memory. So far so good.
Your code builds an in-memory graph with linked lists and lots of pointers, and your heap-based queue also holds a lot of pointers. You can greatly decrease memory usage by relying on a more efficient representation: for example, you could build an NxM array of elements with
struct Cell {
    uint32_t distance; // shortest distance from start, initially INF
    uint16_t height;
    int16_t parent;    // add this to the cell's index to find its parent; 0 for `none`
};
A cell at (row, col) will be at index row + N*col in this array. I strongly recommend building utility functions to find each neighbor; they would return -1 to indicate "out of bounds" and a direct index otherwise. The difference between two indices is then usable as parent.
You can implement a (not very efficient) priority queue in C by calling the standard library's qsort on an array of node indices, sorting them by distance. This cuts a lot of additional overhead, as each pointer in your program probably takes 8 bytes (64-bit pointers).
By avoiding all those pointers, and using a flat in-memory description instead of a pointer-based graph, you should be able to cut memory consumption down to
1.1M cells x 8 bytes/cell = 8.8 MB
1.1M indices in the pq-array x 4 bytes/index = 4.4 MB
for a total of around 16 MB with overheads, well under the stated limits. If it takes too long, you should use a better queue.
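For illustration, here is a minimal sketch of the neighbor-index helpers described above (the function names and exact signatures are assumptions, not from the original answer; the row + N*col layout matches it):
#include <stdint.h>

/* Map (row, col) to a flat index in the Cell array; -1 means out of bounds. */
static int cell_index(int row, int col, int N, int M) {
    if (row < 0 || row >= N || col < 0 || col >= M)
        return -1;
    return row + N * col;
}

/* Fill out[0..3] with the indices of the four grid neighbors of (row, col);
   entries are -1 where the neighbor would fall off the map. */
static void neighbors(int row, int col, int N, int M, int out[4]) {
    out[0] = cell_index(row - 1, col, N, M);
    out[1] = cell_index(row + 1, col, N, M);
    out[2] = cell_index(row, col - 1, N, M);
    out[3] = cell_index(row, col + 1, N, M);
}
The difference between a neighbor's index and the cell's own index is exactly what goes into the parent field.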

Efficiently transforming NxJ matrix into an N-long array with J numbers

I'm doing some Bayesian analysis using Stan and I'm trying to make my code more efficient.
In my Stan model string, I have a variable that is an NxJ matrix. It is declared this way to make use of quick matrix operations and assignments.
However, in the final modeling step (assigning a distribution), I need to transform this NxJ matrix into an N-long array that contains J real values in each of the array's elements.
In other words, I want the following transformation:
matrix[N,J] x;
vector[J] y[N];
for (i in 1:N)
    for (j in 1:J)
        y[i][j] = x[i,j];
Is there any way to do this in a vectorized way without for loops?
Thank you!!!!
No. Loops are very fast in Stan. The only reason to vectorize for speed is if there are derivatives involved. You could shorten it a bit to
for (n in 1:N)
y[n] = x[n]';
but it wouldn't be any more efficient.
I should qualify this by saying that there is one inefficiency here, which is lack of memory locality. If the matrices are large, they'll be slow to traverse by row because internally they are stored column-major.

"Blocking" method to make code cache friendly

Hey, so I'm looking at a matrix shift code and need to make it cache friendly (fewest cache misses possible). The code looks like this:
int i, j, temp;
for (i = 1; i < M; i++) {
    for (j = 0; j < N; j++) {
        temp = A[i][j];
        A[i][j] = A[i-1][j];
        A[i-1][j] = temp;
    }
}
Assume M and N are parameters of the function, with M the number of rows and N the number of columns. To make this more cache friendly, the book gives two problem configurations: a 4x4 matrix with s = 1, E = 2, b = 3, and a 128x128 matrix with s = 5, E = 2, b = 3.
(s = number of set index bits, so S = 2^s is the number of sets; E = number of lines per set; b = number of block bits, so B = 2^b is the block size in bytes.)
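(Worked out for the 128x128 configuration: S = 2^5 = 32 sets, E = 2 lines per set, and B = 2^3 = 8 bytes per block, so the cache holds 32 x 2 x 8 = 512 bytes in total. One row of 128 4-byte ints is also 512 bytes, so A[i][j] and A[i-1][j] always map to the same set, which is the source of the conflict misses that blocking addresses.)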
So using the blocking method, I should access the matrix by block size to avoid misses that force the cache to fetch the data from the next level up. So here is what I assume:
Block size is 9 bytes for each
With the 4x4 matrix, the number of elements that fit evenly on a block is:
blocksize*(number of columns/blocksize) = 9*(4/9) = 4
So if each row will fit on one block, why is it not cache friendly?
With the 128x128 matrix, with the same logic as above, each block will hold (9*(128/9)) = 128.
So obviously after calculating that, this equation is wrong. I'm looking at the code from this page http://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf
Once I reached this point, I knew I was lost, which is where you guys come in! Is it as simple as saying each block holds 9 bytes, and 8 bytes (two integers) are what fits evenly into it? Sorry, this stuff really confuses me; I know I'm all over the place. Just to be clear, these are my concerns:
How do you know how many elements will fit in a block?
Do the number of lines or sets affect this number? If so, how?
Any in depth explanation of the code posted on the linked page.
Really just trying to get a grasp of this.
UPDATE:
Okay so here is where I'm at for the 4x4 matrix.
I can read 8 bytes at a time, which is 2 integers. The original function will have cache misses because C stores arrays in row-major order, so every time it wants A[i-1][j] it will miss and load the block that holds A[i-1][j], which would be either A[i-1][0] and A[i-1][1] or A[i-1][2] and A[i-1][3].
So would the best way be to create another temp variable, save A[i][0] and A[i][1] into temp and temp2, then load A[i-1][0] and A[i-1][1], swap them with temp and temp2, and set the loop to j < 2? For this question it is specifically about the matrices described; I understand this wouldn't work for all sizes.
The solution to this problem was to think of the matrix in column major order rather than row major order.
Hopefully this helps someone in the future. Thanks to @Michael Dorgan for getting me thinking.
End results for 128x128 matrix:
Original: 16218 misses
Optimized: 8196 misses
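For later readers, here is a minimal sketch of a column-strip ("blocked") version of the shift; the strip width W is an illustrative parameter (e.g. W = 2 when an 8-byte block holds two ints), not something given in the original thread:
void shift_rows_blocked(int **A, int M, int N, int W) {
    /* Walk the matrix in vertical strips of W columns, so the blocks of
       rows i and i-1 being swapped stay resident while they are reused. */
    for (int jb = 0; jb < N; jb += W) {
        int jend = (jb + W < N) ? (jb + W) : N;
        for (int i = 1; i < M; i++) {
            for (int j = jb; j < jend; j++) {
                int temp = A[i][j];
                A[i][j] = A[i-1][j];
                A[i-1][j] = temp;
            }
        }
    }
}
With W = 2 on the 128x128 configuration, each block of a strip is missed roughly once instead of twice, which is consistent with the roughly halved miss count reported above.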

Best (fastest) way to find the number most frequently entered in C?

Well, I think the title basically explains my doubt. I will have n numbers to read; these n numbers go from 1 to x, where x is at most 10^5. What is the fastest way (least possible running time) to find out which number was inserted the most times? We know that the most frequent number appears more than half of the time.
What I've tried so far:
// for 1 <= x <= 10^5
int v[100000+1];
// multiple instances; ends when n = 0
while (scanf("%d", &n) && n > 0) {
    zerofill(v);
    for (i = 0; i < n; i++) {
        scanf("%d", &x);
        v[x]++;
        if (v[x] > n/2)
            i = n; // majority found: force loop exit; x holds the answer
    }
    printf("%d\n", x);
}
Zero-filling an array of x positions, then incrementing v[x] while checking whether v[x] is greater than n/2, is not fast enough.
Any idea might help, thank you.
Observation: no need to care about the amount of memory used.
The trivial solution of keeping a counter array is O(n), and you obviously can't do better than that. The fight is then about the constants, and this is where a lot of details come into play, including the exact values of n and x, the kind of processor, the architecture, and so on.
On the other side, this really looks like the "knockout" problem (the Boyer-Moore majority vote), but that algorithm needs two passes over the data and an extra conditional, so in practical terms, on the computers I know, it will most probably be slower than the counter-array solution for a lot of n and x values.
The good point of the knockout solution is that you don't need an upper bound x on the values, and you don't need any extra memory.
If you already know that there is a value with the absolute majority (and you simply need to find that value), then this could do it (but there are two conditionals in the inner loop):
initialize count = 0
loop over all elements
    if count is 0, then set champion = element and count = 1
    else if element != champion, decrement count
    else increment count
At the end of the loop, your champion will be the value with the absolute majority of elements, if such a value is present.
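A minimal C rendering of that single pass (the function name and signature are illustrative):
int majority(const int *a, int n) {
    int champion = 0, count = 0;
    for (int i = 0; i < n; i++) {
        if (count == 0) { champion = a[i]; count = 1; } /* new champion */
        else if (a[i] != champion) count--;             /* knockout */
        else count++;
    }
    return champion; /* meaningful only if a majority element is guaranteed */
}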
But as said before I'd expect a trivial
for (int i = 0, n = size; i < n; i++) {
    if (++count[x[i]] > half) return x[i];
}
to be faster.
EDIT
After your edit, it seems you're really looking for the knockout algorithm, but if you care about speed, that's probably still the wrong question on modern computers (100,000 elements is nothing, even for a nail-sized single chip today).
I think you can create a max-heap of the counts of the numbers you read, and use heapsort to find the count that is greater than n/2.

Most efficient way to calculate the exponential of each element of a matrix

I'm migrating from Matlab to C + GSL and I would like to know the most efficient way to calculate the matrix B for which:
B[i][j] = exp(A[i][j])
where i is in [0, Ny] and j is in [0, Nx].
Notice that this is different from the matrix exponential:
B = exp(A)
which can be accomplished with some unstable/unsupported code in GSL (linalg.h).
I've found the brute-force solution (a couple of for loops), but is there any smarter way to do it?
EDIT
Results from the solution posted by Drew Hall:
All results are from a 1024x1024 for/for loop in which each iteration assigns two double values (a complex number). The time is averaged over 100 executions.
Results when taking into account the {Row,Column}-Major mode to store the matrix:
226.56 ms when looping over the row in the inner loop in Row-Major mode (case 1).
223.22 ms when looping over the column in the inner loop in Row-Major mode (case 2).
224.60 ms when using the gsl_matrix_complex_set function provided by GSL (case 3).
Source code for case 1:
for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix[2*(i*s_tda + j)] = GSL_REAL(c_value);
        matrix[2*(i*s_tda + j)+1] = GSL_IMAG(c_value);
    }
}
Source code for case 2:
for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix->data[2*(j*s_tda + i)] = GSL_REAL(c_value);
        matrix->data[2*(j*s_tda + i)+1] = GSL_IMAG(c_value);
    }
}
Source code for case 3:
for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        gsl_matrix_complex_set(matrix, i, j, c_value);
    }
}
There's no way to avoid iterating over all the elements and calling exp() or equivalent on each one. But there are faster and slower ways to iterate.
In particular, your goal should be to minimize cache misses. Find out whether your data is stored in row-major or column-major order, and arrange your loops so that the inner loop iterates over elements stored contiguously in memory, while the outer loop takes the big stride to the next row (if row-major) or column (if column-major). Although this seems trivial, it can make a HUGE difference in performance (depending on the size of your matrix).
Once you've handled the cache, your next goal is to remove loop overhead. The first step (if your matrix API supports it) is to go from nested loops (M and N bounds) to a single loop iterating over the underlying data (M*N bound). You'll need a raw pointer to the underlying memory block (that is, a double* rather than a double**) to do this.
Finally, throw in some loop unrolling (that is, do 8 or 16 elements per loop iteration) to further reduce the loop overhead, and that's probably about as fast as you can make it. You'll probably need a final switch statement with fall-through to clean up the remainder elements (for when your array size % block size != 0).
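A minimal sketch of the flat loop with 8-way unrolling (it assumes a contiguous buffer of n = Nx*Ny doubles and uses a plain remainder loop rather than the switch; the function name is illustrative):
#include <math.h>
#include <stddef.h>

void exp_inplace(double *data, size_t n)
{
    size_t i = 0;
    /* Main loop: 8 elements per iteration to amortize loop overhead. */
    for (; i + 8 <= n; i += 8) {
        data[i]   = exp(data[i]);
        data[i+1] = exp(data[i+1]);
        data[i+2] = exp(data[i+2]);
        data[i+3] = exp(data[i+3]);
        data[i+4] = exp(data[i+4]);
        data[i+5] = exp(data[i+5]);
        data[i+6] = exp(data[i+6]);
        data[i+7] = exp(data[i+7]);
    }
    /* Remainder loop for the last n % 8 elements. */
    for (; i < n; ++i)
        data[i] = exp(data[i]);
}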
No, unless there's some strange mathematical quirk I haven't heard of, you pretty much just have to loop through the elements with two for loops.
If you just want to apply exp to an array of numbers, there's really no shortcut: you have to call it Nx*Ny times. If some of the matrix elements are simple, like 0, or there are repeated elements, some memoization could help.
However, if what you really want is a matrix exponential (which is very useful), the algorithm we rely on is DGPADM (part of EXPOKIT). It's in Fortran, but you can use f2c to convert it to C. Here's the paper on it.
Since the contents of the loop haven't been shown (the bit that calculates c_value), we don't know whether the performance of the code is limited by memory bandwidth or by the CPU. The only way to know for sure is to use a profiler, and a sophisticated one at that: it needs to be able to measure memory latency, i.e. the amount of time the CPU spends idle waiting for data to arrive from RAM.
If you are limited by memory bandwidth, there's not a lot you can do once you're accessing memory sequentially. The CPU and memory work best when data is fetched sequentially; random accesses hurt throughput because data is more likely to have to be fetched into cache from RAM. You could always try getting faster RAM.
If you're limited by the CPU, then there are a few more options available to you. Using SIMD is one, as is hand-coding the floating-point code (C/C++ compilers aren't great at FPU code, for many reasons). If this were me, and the code in the inner loop allowed it, I'd keep two pointers into the array: one at the start and a second four-fifths of the way through it. Each iteration, a SIMD operation would be performed using the first pointer and scalar FPU operations using the second pointer, so that each iteration of the loop processes five values. Then I'd interleave the SIMD instructions with the FPU instructions to mitigate latency costs. This shouldn't affect your caches, since (at least on the Pentium) the MMU can stream up to four data streams simultaneously (i.e. prefetch data for you without any prompting or special instructions).
