Linear indexing of Matlab matrices in MEX file - c

I have an NxN symmetric block matrix F of the following form
F_11 F_12 F_13 ... F_1N
F_21 F_22 F_23 ... F_2N
 .    .    .        .
 .    .    .        .
F_N1 F_N2 F_N3 ... F_NN
with each submatrix F_IJ of size m x m.
This matrix is created in MATLAB and will be used in a C program, so the values are stored in a vector column-wise (e.g. the vector will be of the form (F_11_11, F_11_21, F_11_31, ..., F_11_m1, F_21_11, ..., F_NN_(m-1)m, F_NN_mm)).
My question is the following: for readability I would like to define in C a way to access the values of F, given the indices (I,J) of the submatrix and the indices (i,j) of the value within that submatrix. How can I link the linear indexing of the matrix to the (I,J,i,j) indices?

I assume all indices are zero-based, as usual in C/C++. If you want to use MATLAB-style one-based indices, subtract one from each index.
I didn't check it, but I guess your index should be...
int idx = I*m+J*N*m*m+i+j*N*m;
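For readability you could wrap that formula in a small helper; here is a minimal sketch (mine, not part of the original answer), assuming the full (N*m) x (N*m) matrix is stored column-major exactly as MATLAB hands it to the MEX file:

#include <stddef.h>

/* linear index of element (i,j) of submatrix (I,J) in a column-major
   (N*m) x (N*m) matrix; all indices zero-based */
size_t block_index(size_t I, size_t J, size_t i, size_t j, size_t N, size_t m)
{
    size_t row = I*m + i;        /* row in the full matrix    */
    size_t col = J*m + j;        /* column in the full matrix */
    return col*(N*m) + row;      /* column-major linear index */
}

Expanding col*(N*m) + row gives exactly the formula above.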

You can write a function that calculates the index. Note that in C, indices start at 0.
size_t index_of_2d(size_t x, size_t y, size_t n) {
    return x + y*n;
}

size_t index_of_4d(size_t I, size_t J, size_t N, size_t i, size_t j, size_t m) {
    // note: this assumes each m x m submatrix occupies one contiguous block of
    // m*m values (submatrices and elements both ordered column-major)
    size_t submatrix = index_of_2d(I, J, N) * m * m; // scale the index in the super matrix by the size of the submatrix
    return submatrix + index_of_2d(i, j, m);
}

Related

Use a dope vector to access arbitrary axial slices of a multidimensional array?

I'm building a suite of functions to work with a multidimensional-array data structure and I want to be able to define arbitrary slices of the arrays so I can implement a generalized inner product of two arbitrary matrices (aka Tensors or n-d arrays).
An APL paper I read (I honestly can't find which -- I've read so many) defines the matrix product on left-matrix X with dimensions A;B;C;D;E;F and right-matrix Y with dimensions G;H;I;J;K where F==G as
Z <- X +.× Y
Z[A;B;C;D;E;H;I;J;K] <- +/ X[A;B;C;D;E;*] × Y[*;H;I;J;K]
where +/ is sum of, and × applies element-by-element to two vectors of the same length.
So I need "row" slices of the left and "column" slices of the right. I can of course take a transpose and then a "row" slice to simulate a "column" slice, but I'd rather do it more elegantly.
Wikipedia's article on slicing leads to a stub about dope vectors which seem to be the miracle cure I'm looking for, but there's not a lot there to go on.
How do I use a dope vector to implement arbitrary slicing?
(Much later I did notice Stride of an array which has some details.)
Definition
General array slicing can be implemented (whether or not built into the language) by referencing every array through a dope vector or descriptor — a record that contains the address of the first array element, and then the range of each index and the corresponding coefficient in the indexing formula. This technique also allows immediate array transposition, index reversal, subsampling, etc. For languages like C, where the indices always start at zero, the dope vector of an array with d indices has at least 1 + 2d parameters.
http://en.wikipedia.org/wiki/Array_slicing#Details
That's a dense paragraph, but it's actually all in there. So we need a data structure like this:
struct {
    TYPE *data;  //address of first array element
    int rank;    //number of dimensions
    int *dims;   //size of each dimension
    int *weight; //corresponding coefficient in the indexing formula
};
Where TYPE is whatever the element type is, the field of the matrices. For simplicity and concreteness, we'll just use int. For my own purposes, I've devised an encoding of various types into integer handles so int does the job for me, YMMV.
typedef struct arr {
    int rank;
    int *dims;
    int *weight;
    int *data;
} *arr;
All of the pointer members can be assigned locations within the
same allocated block of memory as the struct itself (which we will
call the header). But by replacing the earlier use of offsets
and struct-hackery, the implementation of algorithms can be made
independent of that actual memory layout within (or without) the
block.
The basic memory layout for a self-contained array object is
rank dims weight data
dims[0] dims[1] ... dims[rank-1]
weight[0] weight[1] ... weight[rank-1]
data[0] data[1] ... data[ product(dims)-1 ]
An indirect array sharing data (whole array or 1 or more row-slices)
will have the following memory layout
rank dims weight data
dims[0] dims[1] ... dims[rank-1]
weight[0] weight[1] ... weight[rank-1]
//no data! it's somewhere else!
And an indirect array containing an orthogonal slice along
another axis will have the same layout as a basic indirect array,
but with dims and weight suitably modified.
The access formula for an element with indices (i0 i1 ... iN)
is
a->data[ i0*a->weight[0] + i1*a->weight[1] + ...
+ iN*a->weight[N] ]
, assuming each index i[j] is between [ 0 ... dims[j] ).
In the weight vector for a normally laid-out row-major array, each element is the product of all lower dimensions.
for (int i=0; i<rank; i++)
weight[i] = product(dims[i+1 .. rank-1]);
So for a 3×4×5 array, the metadata would be
{ .rank=3, .dims=(int[]){3,4,5}, .weight=(int[]){4*5, 5, 1} }
or for a 7×6×5×4×3×2 array, the metadata would be
{ .rank=6, .dims={7,6,5,4,3,2}, .weight={720, 120, 24, 6, 2, 1} }
Construction
So, to create one of these, we need the same helper function from the previous question to compute the size from a list of dimensions.
/* multiply together rank integers in dims array */
int productdims(int rank, int *dims){
    int i,z=1;
    for(i=0; i<rank; i++)
        z *= dims[i];
    return z;
}
To allocate, simply malloc enough memory and set the pointers to the appropriate places.
/* create array given rank and int[] dims */
arr arraya(int rank, int dims[]){
    int datasz;
    int i;
    int x;
    arr z;
    datasz = productdims(rank,dims);
    z = malloc(sizeof(struct arr)
             + (rank+rank+datasz)*sizeof(int));
    z->rank = rank;
    z->dims = (int *)(z + 1);   /* dims, weight and data live just past the header */
    z->weight = z->dims + rank;
    z->data = z->weight + rank;
    memmove(z->dims,dims,rank*sizeof(int));
    for(x=1, i=rank-1; i>=0; i--){
        z->weight[i] = x;
        x *= z->dims[i];
    }
    return z;
}
And using the same trick from the previous answer, we can make a variable-argument interface to make usage easier.
/* load rank integers from va_list into int[] dims */
void loaddimsv(int rank, int dims[], va_list ap){
    int i;
    for (i=0; i<rank; i++){
        dims[i] = va_arg(ap,int);
    }
}
/* create a new array with specified rank and dimensions */
arr (array)(int rank, ...){
    va_list ap;
    //int *dims=calloc(rank,sizeof(int));
    int dims[rank];
    arr z;
    va_start(ap,rank);
    loaddimsv(rank,dims,ap);
    va_end(ap);
    z = arraya(rank,dims);
    //free(dims);
    return z;
}
And even automatically produce the rank argument by counting the other arguments using the awesome powers of ppnarg.
#define array(...) (array)(PP_NARG(__VA_ARGS__),__VA_ARGS__) /* create a new array with specified dimensions */
Now constructing one of these is very easy.
arr a = array(2,3,4); // create a dynamic [2][3][4] array
Accessing elements
An element is retrieved with a function call to elema which multiplies each index by the corresponding weight, sums them, and indexes the data pointer. We return a pointer to the element so it can be read or modified by the caller.
/* access element of a indexed by int[] */
int *elema(arr a, int *ind){
    int idx = 0;
    int i;
    for (i=0; i<a->rank; i++){
        idx += ind[i] * a->weight[i];
    }
    return a->data + idx;
}
The same varargs trick can make an easier interface elem(a,i,j,k).
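That wrapper is not shown here; a sketch of how it could look, following the same pattern as (array) above (my own guess, not the original code):

/* access element of a indexed by varargs: elem(a,i,j,k,...) */
int *(elem)(arr a, ...){
    va_list ap;
    int ind[a->rank];
    va_start(ap,a);
    loaddimsv(a->rank, ind, ap);   /* reuse the dims loader for indices */
    va_end(ap);
    return elema(a, ind);
}
/* e.g.  *elem(a, 1, 2, 0) = 42; */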
Axial Slices
To take a slice, first we need a way of specifying which dimensions to extract and which to collapse. If we just need to select a single index or all elements from a dimension, then the slice function can take rank ints as arguments and interpret -1 as selecting the whole dimension, or a value in 0..dims[i]-1 as selecting a single index.
/* take a computed slice of a following spec[] instructions
   if spec[i] >= 0 and spec[i] < a->dims[i], then spec[i] selects
   that index from dimension i.
   if spec[i] == -1, then spec[i] selects the entire dimension i.
 */
arr slicea(arr a, int spec[]){
    int i,j;
    int rank;
    for (i=0,rank=0; i<a->rank; i++)
        rank += spec[i]==-1;
    int dims[rank];
    int weight[rank];
    for (i=0,j=0; i<rank; i++,j++){
        while (spec[j]!=-1) j++;
        if (j>=a->rank) break;
        dims[i] = a->dims[j];
        weight[i] = a->weight[j];
    }
    arr z = casta(a->data, rank, dims);
    memcpy(z->weight,weight,rank*sizeof(int));
    for (j=0; j<a->rank; j++){
        if (spec[j]!=-1)
            z->data += spec[j] * a->weight[j];
    }
    return z;
}
All the dimensions and weights corresponding to the -1s in the argument array are collected and used to create a new array header. All arguments >= 0 are multiplied by their associated weight and added to the data pointer, offsetting the pointer to the correct element.
In terms of the array access formula, we're treating it as a polynomial.
offset = constant + sum_i=0,n( weight[i] * index[i] )
So for any dimension from which we're selecting a single element (+ all lower dimensions), we factor-out the now-constant index and add it to the constant term in the formula (which in our C representation is the data pointer itself).
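For example, slicing a 3x4 array might look like this (an illustrative fragment of mine, not from the original answer):

arr a = arraya(2, (int[]){3,4});
arr row1 = slicea(a, (int[]){ 1, -1});   /* row 1: rank 1, dims {4}, weight {1} */
arr col2 = slicea(a, (int[]){-1,  2});   /* column 2: rank 1, dims {3}, weight {4} */
/* row1->data and col2->data point into a's block: free only the headers */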
The helper function casta creates the new array header with shared data. slicea of course changes the weight values, but by calculating weights itself, casta becomes a more generally usable function. It can even be used to construct a dynamic array structure that operates directly on a regular C-style multidimensional array, thus casting.
/* create an array header to access existing data in multidimensional layout */
arr casta(int *data, int rank, int dims[]){
    int i,x;
    arr z = malloc(sizeof(struct arr)
                 + (rank+rank)*sizeof(int));
    z->rank = rank;
    z->dims = (int *)(z + 1);
    z->weight = z->dims + rank;
    z->data = data;
    memmove(z->dims,dims,rank*sizeof(int));
    for(x=1, i=rank-1; i>=0; i--){
        z->weight[i] = x;
        x *= z->dims[i];
    }
    return z;
}
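For instance, casta can wrap an ordinary C 2D array directly (an illustrative fragment of mine, not from the original answer):

int raw[3][4] = {{0,1,2,3},{4,5,6,7},{8,9,10,11}};
arr a = casta(&raw[0][0], 2, (int[]){3,4});
printf("%d\n", *elema(a, (int[]){1,2}));   /* prints 6, i.e. raw[1][2] */
free(a);                                   /* frees only the header */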
Transposes
The dope vector can also be used to implement transposes. The order of the dimensions (and indices) can be changed.
Remember that this is not a normal 'transpose' like everybody else
does. We don't rearrange the data at all. This is more
properly called a 'dope-vector pseudo-transpose'.
Instead of changing the data, we just change the
constants in the access formula, rearranging the
coefficients of the polynomial. In a real sense, this
is just an application of the commutativity and
associativity of addition.
So for concreteness, assume the data is arranged
sequentially starting at hypothetical address 500.
500: 0
501: 1
502: 2
503: 3
504: 4
505: 5
506: 6
if a is rank 2, dims {1, 7}, weight {7, 1}, then the
sum of the indices multiplied by the associated weights
added to the initial pointer (500) yield the appropriate
addresses for each element
a[0][0] == *(500+0*7+0*1)
a[0][1] == *(500+0*7+1*1)
a[0][2] == *(500+0*7+2*1)
a[0][3] == *(500+0*7+3*1)
a[0][4] == *(500+0*7+4*1)
a[0][5] == *(500+0*7+5*1)
a[0][6] == *(500+0*7+6*1)
So the dope-vector pseudo-transpose rearranges the
weights and dimensions to match the new ordering of
indices, but the sum remains the same. The initial
pointer remains the same. The data does not move.
b[0][0] == *(500+0*1+0*7)
b[1][0] == *(500+1*1+0*7)
b[2][0] == *(500+2*1+0*7)
b[3][0] == *(500+3*1+0*7)
b[4][0] == *(500+4*1+0*7)
b[5][0] == *(500+5*1+0*7)
b[6][0] == *(500+6*1+0*7)
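A sketch of such a pseudo-transpose using the structures above (my own illustration, not part of the original answer): it reverses the order of the dimensions and weights while sharing the data.

arr transposea(arr a){
    int i;
    int dims[a->rank];
    for (i=0; i<a->rank; i++)
        dims[i] = a->dims[a->rank-1-i];
    arr z = casta(a->data, a->rank, dims);     /* shares a's data */
    for (i=0; i<a->rank; i++)
        z->weight[i] = a->weight[a->rank-1-i]; /* permute the coefficients */
    return z;
}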
Inner Product (aka Matrix Multiplication)
So, by using general slices or transpose+"row"-slices (which are easier), generalized inner product can be implemented.
First we need the two helper functions for applying a binary operation to two vectors producing a vector result, and reducing a vector with a binary operation producing a scalar result.
As in the previous question we'll pass in the operator, so the same function can be used with many different operators. For the style here, I'm passing operators as single characters, so there's already an indirect mapping from C operators to our function's operators. This is an x-macro table.
#define OPERATORS(_) \
/* f F id */ \
_('+',+,0) \
_('*',*,1) \
_('=',==,1) \
/**/
#define binop(X,F,Y) (binop)(X,*#F,Y)
arr (binop)(arr x, char f, arr y); /* perform binary operation F upon corresponding elements of vectors X and Y */
The extra element in the table is for the reduce function for the case of a null vector argument. In that case, reduce should return the operator's identity element, 0 for +, 1 for *.
#define reduce(F,X) (reduce)(*#F,X)
int (reduce)(char f, arr a); /* perform binary operation F upon adjacent elements of vector X, right to left,
reducing vector to a single value */
So the binop does a loop and a switch on the operator character.
/* perform binary operation F upon corresponding elements of vectors X and Y */
#define BINOP(f,F,id) case f: *elem(z,i) = *elem(x,i) F *elem(y,i); break;
arr (binop)(arr x, char f, arr y){
    arr z = copy(x);
    int n = x->dims[0];
    int i;
    for (i=0; i<n; i++){
        switch(f){
            OPERATORS(BINOP)
        }
    }
    return z;
}
#undef BINOP
The reduce function first presets the result to the operator's identity element; then, if the vector is non-empty, it sets the result to the last element and loops backwards over the remaining elements.
/* perform binary operation F upon adjacent elements of vector X, right to left,
   reducing vector to a single value */
#define REDID(f,F,id) case f: x = id; break;
#define REDOP(f,F,id) case f: x = *elem(a,i) F x; break;
int (reduce)(char f, arr a){
    int n = a->dims[0];
    int x;
    int i;
    switch(f){
        OPERATORS(REDID)
    }
    if (n) {
        x = *elem(a,n-1);
        for (i=n-2; i>=0; i--){
            switch(f){
                OPERATORS(REDOP)
            }
        }
    }
    return x;
}
#undef REDID
#undef REDOP
And with these tools, inner product can be implemented in a higher-level manner.
/* perform a (2D) matrix multiplication upon rows of x and columns of y
using operations F and G.
Z = X F.G Y
Z[i,j] = F/ X[i,*] G Y'[j,*]
more generally,
perform an inner product on arguments of compatible dimension.
Z = X[A;B;C;D;E;F] +.* Y[G;H;I;J;K] |(F = G)
Z[A;B;C;D;E;H;I;J;K] = +/ X[A;B;C;D;E;*] * Y[*;H;I;J;K]
*/
arr (matmul)(arr x, char f, char g, arr y){
    int i,j;
    arr xdims = cast(x->dims,1,x->rank);
    arr ydims = cast(y->dims,1,y->rank);
    xdims->dims[0]--;
    ydims->dims[0]--;
    ydims->data++;
    arr z = arraya(x->rank+y->rank-2, catv(xdims,ydims)->data);
    int datasz = productdims(z->rank,z->dims);
    int k = y->dims[0];
    arr xs = NULL;
    arr ys = NULL;
    for (i=0; i<datasz; i++){
        int idx[x->rank+y->rank];
        vector_index(i,z->dims,z->rank,idx);
        int *xdex = idx;
        int *ydex = idx+x->rank-1;
        memmove(ydex+1,ydex,(y->rank-1)*sizeof(int));
        xdex[x->rank-1] = -1;
        free(xs);
        free(ys);
        xs = slicea(x,xdex);
        ys = slicea(y,ydex);
        z->data[i] = (reduce)(f,(binop)(xs,g,ys));
    }
    free(xs);
    free(ys);
    free(xdims);
    free(ydims);
    return z;
}
This function also uses the function cast, which presents a varargs interface to casta.
/* create an array header to access existing data in multidimensional layout */
arr cast(int *data, int rank, ...){
    va_list ap;
    int dims[rank];
    va_start(ap,rank);
    loaddimsv(rank,dims,ap);
    va_end(ap);
    return casta(data, rank, dims);
}
And it also uses vector_index to convert a 1D index into an nD vector of indices.
/* compute vector index list for ravel index ind */
int *vector_index(int ind, int *dims, int n, int *vec){
    int i, t=ind, *z=vec;
    for (i=0; i<n; i++){
        z[n-1-i] = t % dims[n-1-i];
        t /= dims[n-1-i];
    }
    return z;
}
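Putting the pieces together, usage might look like this (an illustrative fragment of mine; it assumes the supporting functions such as copy and catv from the github file are available):

/* C = A +.* B : an ordinary (2x3)(3x2) matrix product */
arr A = arraya(2, (int[]){2,3});
arr B = arraya(2, (int[]){3,2});
for (int t=0; t<6; t++){ A->data[t] = t+1; B->data[t] = t+1; }
arr C = (matmul)(A, '+', '*', B);
printf("%d\n", *elema(C, (int[]){0,0}));   /* 1*1 + 2*3 + 3*5 = 22 */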
The full code is in the github file, along with additional supporting functions.
This Q/A pair is part of a series of related issues which arose in implementing inca, an interpreter for the APL language written in C. Others: How can I work with dynamically-allocated arbitrary-dimensional arrays?, and How to pass a C math operator (+-*/%) into a function result=mathfunc(x,+,y);?. Some of this material was previously posted to comp.lang.c. More background in comp.lang.apl.

How to get the indices of top N values of an array?

I have a big array that contains numbers. Is there a way to find the indices of the top n values? Is there any library function in C for this?
example:
an array: {1,2,6,5,3}
the indices of the top 2 numbers are: {2,3}
If by top n you mean the n-th highest (or lowest) number in the array, you may want to look at the QuickSelect algorithm. Unfortunately there is no C library function I am aware of that implements it but Wikipedia should give you a good starting point.
QuickSelect is O(N) on average; if O(N log N) plus some overhead is fine as well, you can qsort and take the n-th element.
Edit (in response to the example): Getting all the indices of the top n in a single batch is straightforward with both approaches. QuickSelect leaves them all on one side of the final pivot.
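For the qsort route, one sketch (mine, not from the answer above) is to sort an array of indices by their values and read off the first n. The file-scope pointer is a workaround for qsort having no context argument, and the names here are arbitrary:

#include <stdlib.h>

static const int *g_vals;                 /* values being ranked */

static int cmp_desc(const void *pa, const void *pb){
    int ia = *(const int *)pa, ib = *(const int *)pb;
    return (g_vals[ib] > g_vals[ia]) - (g_vals[ib] < g_vals[ia]);
}

/* fill top[0..n-1] with the indices of the n largest values of arr[0..N-1] */
void top_n_qsort(const int *arr, size_t N, size_t *top, size_t n){
    int *idx = malloc(N * sizeof *idx);
    for (size_t i=0; i<N; i++) idx[i] = (int)i;
    g_vals = arr;
    qsort(idx, N, sizeof *idx, cmp_desc);
    for (size_t i=0; i<n && i<N; i++) top[i] = (size_t)idx[i];
    free(idx);
}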
So you want the top n numbers in a big array of N numbers. There is a straightforward algorithm which is O(N*n). If n is small (as it seems to be in your case) this is good enough.
size_t top_elems(int *arr, size_t N, size_t *top, size_t n) {
    /*
       insert into top[0],...,top[n-1] the indices of the n largest elements
       of arr[0],...,arr[N-1]
    */
    size_t top_count = 0;
    size_t i;
    for (i=0; i<N; ++i) {
        // invariant: arr[top[0]] >= arr[top[1]] >= ... >= arr[top[top_count-1]]
        // are the indices of the top_count largest values in arr[0],...,arr[i-1]
        // top_count = min(i,n);
        size_t k;
        for (k=top_count; k>0 && arr[i]>arr[top[k-1]]; k--);
        // i should be inserted in position k
        if (k>=n) continue; // element arr[i] is not in the top n
        // shift elements from k to top_count
        size_t j=top_count;
        if (j>n-1) { // top array is already full
            j=n-1;
        } else { // increase top array
            top_count++;
        }
        for (; j>k; j--) {
            top[j]=top[j-1];
        }
        // insert i
        top[k] = i;
    }
    return top_count;
}
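For the example from the question, usage would be (illustrative):

int a[] = {1, 2, 6, 5, 3};
size_t top[2];
size_t cnt = top_elems(a, 5, top, 2);
/* cnt == 2, top[0] == 2, top[1] == 3 : the indices of 6 and 5 */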

Matrix operations using code vectorization

I have written a function to do the transpose of a 4x4 matrix, but I do not know how to extend the code to an m x n matrix.
Where can I find maybe some sample code on matrix operations with SSE? product, transpose, inverse, etc?
This is the code of transpose 4x4:
void transpose(float* src, int n) {
    __m128 row0, row1, row2, row3;
    __m128 tmp1;
    tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));
    row1 = _mm_loadh_pi(_mm_loadl_pi(row1, (__m64*)(src+8)), (__m64*)(src+12));
    row0 = _mm_shuffle_ps(tmp1, row1, 0x88);
    row1 = _mm_shuffle_ps(row1, tmp1, 0xDD);
    tmp1 = _mm_movelh_ps(tmp1, row1);
    row1 = _mm_movehl_ps(tmp1, row1);
    tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src+ 2)), (__m64*)(src+ 6));
    row3 = _mm_loadh_pi(_mm_loadl_pi(row3, (__m64*)(src+10)), (__m64*)(src+14));
    row2 = _mm_shuffle_ps(tmp1, row3, 0x88);
    row3 = _mm_shuffle_ps(row3, tmp1, 0xDD);
    tmp1 = _mm_movelh_ps(tmp1, row3);
    row3 = _mm_movehl_ps(tmp1, row3);
    _mm_store_ps(src, row0);
    _mm_store_ps(src+4, row1);
    _mm_store_ps(src+8, row2);
    _mm_store_ps(src+12, row3);
}
I'm not sure how to do an in-place transpose for arbitrary matrices using SIMD efficiently, but I do know how to do it out-of-place. Let me describe how to do both.
In place transpose
For an in-place transpose you should see Agner Fog's Optimizing software in C++ manual, section 9.10 "Cache contentions in large data structures", example 9.5a. For certain matrix sizes you will see a large drop in performance due to cache aliasing. See table 9.1 for examples, and the question Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?. Agner gives a way to fix this using loop tiling (similar to what Paul R described) in Example 9.5b.
Out of place transpose
See my answer here (the one with the most votes) What is the fastest way to transpose a matrix in C++?. I have not looked into this in ages but let me just repeat my code here:
inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
    __m128 row1 = _mm_load_ps(&A[0*lda]);
    __m128 row2 = _mm_load_ps(&A[1*lda]);
    __m128 row3 = _mm_load_ps(&A[2*lda]);
    __m128 row4 = _mm_load_ps(&A[3*lda]);
    _MM_TRANSPOSE4_PS(row1, row2, row3, row4);
    _mm_store_ps(&B[0*ldb], row1);
    _mm_store_ps(&B[1*ldb], row2);
    _mm_store_ps(&B[2*ldb], row3);
    _mm_store_ps(&B[3*ldb], row4);
}

inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            int max_i2 = i+block_size < n ? i + block_size : n;
            int max_j2 = j+block_size < m ? j + block_size : m;
            for(int i2=i; i2<max_i2; i2+=4) {
                for(int j2=j; j2<max_j2; j2+=4) {
                    transpose4x4_SSE(&A[i2*lda + j2], &B[j2*ldb + i2], lda, ldb);
                }
            }
        }
    }
}
Here is one general approach you can use for transposing an NxN matrix using tiling. You could even use your existing 4x4 transpose and work with a 4x4 tile size:
for each 4x4 block in the matrix with top left indices r, c
    if block is on diagonal (i.e. if r == c)
        get block a = 4x4 block at r, c
        transpose block a
        store block a at r, c
    else if block is above diagonal (i.e. if r < c)
        get block a = 4x4 block at r, c
        get block b = 4x4 block at c, r
        transpose block a
        transpose block b
        store transposed block a at c, r
        store transposed block b at r, c
    else // block is below diagonal
        do nothing
    endif
endfor
Obviously N needs to be a multiple of 4 for this to work, otherwise you will need to do some additional housekeeping.
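For concreteness, here is a rough C sketch of that scheme (mine, not part of the answer), reusing the transpose4x4_SSE kernel from above with lda = N for the source tile and ldb = 4 for a temporary. It assumes A is 16-byte aligned and N is a multiple of 4, so every 4x4 tile is aligned as well:

#include <xmmintrin.h>

void transpose_inplace_tiled(float *A, int N){
    _Alignas(16) float ta[16], tb[16];
    for (int r = 0; r < N; r += 4){
        /* diagonal tile: transpose into a temporary, then copy back */
        transpose4x4_SSE(&A[r*N + r], ta, N, 4);
        for (int k = 0; k < 4; k++)
            _mm_store_ps(&A[(r+k)*N + r], _mm_load_ps(&ta[4*k]));
        /* tiles above the diagonal: transpose both and swap them */
        for (int c = r + 4; c < N; c += 4){
            transpose4x4_SSE(&A[r*N + c], ta, N, 4);   /* lands at (c,r) */
            transpose4x4_SSE(&A[c*N + r], tb, N, 4);   /* lands at (r,c) */
            for (int k = 0; k < 4; k++){
                _mm_store_ps(&A[(c+k)*N + r], _mm_load_ps(&ta[4*k]));
                _mm_store_ps(&A[(r+k)*N + c], _mm_load_ps(&tb[4*k]));
            }
        }
    }
}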
As mentioned above in the comments, an MxN in-place transpose is hard to do - you need to either use an additional temporary matrix (which effectively makes it a not-in-place transpose) or use the method described here, but this will be much harder to vectorize with SIMD.

Symmetric Matrix Inversion in C using CBLAS/LAPACK

I am writing an algorithm in C that requires matrix and vector multiplications. I have a matrix Q (W x W) which is created by multiplying the transpose of a vector J (1 x W) with itself and adding the identity matrix I scaled by a scalar a.
Q = [(J^T) * J + aI].
I then have to multiply the inverse of Q with vector G to get vector M.
M = (Q^(-1)) * G.
I am using cblas and clapack to develop my algorithm. When matrix Q is populated using random numbers (type float) and inverted using the routines sgetrf_ and sgetri_, the calculated inverse is correct.
But when matrix Q is symmetric, which is the case when you multiply (J^T) * J, the calculated inverse is wrong!
I am aware of the row-major (in C) and column-major (in FORTRAN) format of arrays while calling lapack routines from C, but for a symmetrical matrix this should not be a problem as A^T = A.
I have attached my C function code for matrix inversion below.
I am sure there is a better way to solve this. Can anyone help me with this?
A solution using cblas would be great...
Thanks.
void InverseMatrix_R(float *Matrix, int W)
{
    int LDA = W;
    int IPIV[W];
    int ERR_INFO;
    int LWORK = W * W;
    float Workspace[LWORK];
    // - Compute the LU factorization of a M by N matrix A
    sgetrf_(&W, &W, Matrix, &LDA, IPIV, &ERR_INFO);
    // - Generate inverse of the matrix given its LU decomposition
    sgetri_(&W, Matrix, &LDA, IPIV, Workspace, &LWORK, &ERR_INFO);
    // - Display the inverted matrix
    PrintMatrix(Matrix, W, W);
}

void PrintMatrix(float* Matrix, int row, int colm)
{
    int i,k;
    for (i=0; i < row; i++)
    {
        for (k=0; k < colm; k++)
        {
            printf("%g, ", Matrix[i*colm + k]);
        }
        printf("\n");
    }
}
I don't know BLAS or LAPACK, so I have no idea what may cause this behaviour.
But, for matrices of the given form, calculating the inverse is quite easy. The important fact for this is
(J^T*J)^2 = (J^T*J)*(J^T*J) = J^T*(J*J^T)*J = <J|J> * (J^T*J)
where <u|v> denotes the inner product (if the components are real - the canonical bilinear form for complex components, but then you'd probably consider not the transpose but the conjugate transpose, and you'd be back at the inner product).
Generalising,
(J^T*J)^n = (<J|J>)^(n-1) * (J^T*J), for n >= 1.
Let us denote the symmetric square matrix (J^T*J) by S and the scalar <J|J> by q. Then, for general a != 0 of sufficiently large absolute value (|a| > q):
(a*I + S)^(-1) = 1/a * (I + a^(-1)*S)^(-1)
               = 1/a * (I + sum_{k>0} (-1)^k * a^(-k) * S^k)
               = 1/a * (I + (sum_{k>0} (-1)^k * a^(-k) * q^(k-1)) * S)
               = 1/a * (I - 1/(a+q)*S)
               = 1/a*I - 1/(a*(a+q))*S
That formula holds (by analyticity) for all a except a = 0 and a = -q, as can be verified by calculating
(a*I + S) * (1/a*I - 1/(a*(a+q))*S) = I + 1/a*S - 1/(a+q)*S - 1/(a*(a+q))*S^2
                                    = I + 1/a*S - 1/(a+q)*S - q/(a*(a+q))*S
                                    = I + ((a+q) - a - q)/(a*(a+q))*S
                                    = I
using S^2 = q*S.
That calculation is also much simpler and more efficient than first finding the LU decomposition.
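Applied to the original problem, that closed form lets you compute M = Q^(-1) * G without any factorization. Here is a sketch (mine, with an arbitrary function name) using cblas_sdot, where q = <J|J>:

#include <cblas.h>

/* M = (a*I + J^T*J)^(-1) * G  =  (1/a)*G - (<J|G> / (a*(a+q))) * J^T */
void SolveRankOneUpdate(const float *J, const float *G, float *M, int W, float a)
{
    float q  = cblas_sdot(W, J, 1, J, 1);   /* <J|J> */
    float jg = cblas_sdot(W, J, 1, G, 1);   /* <J|G> */
    float c  = jg / (a * (a + q));
    for (int i = 0; i < W; i++)
        M[i] = G[i] / a - c * J[i];
}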
You may want to try Armadillo, which is an easy to use C++ wrapper for LAPACK. It provides several inverse related functions:
inv(), general inverse, with an optional speedup for symmetric positive definite matrices
pinv(), pseudo-inverse
solve(), solve a system of linear equations (that can be over- or under-determined), without doing the actual inverse
An example for 3x3 matrix inversion; see sgetri.f for more details.
//__CLPK_integer is typedef of int
//__CLPK_real is typedef of float
__CLPK_integer ipiv[3];
{
    //Compute LU lower upper factorization of matrix
    __CLPK_integer m=3;
    __CLPK_integer n=3;
    __CLPK_real *a=(float *)this->m1;
    __CLPK_integer lda=3;
    __CLPK_integer info;
    sgetrf_(&m, &n, a, &lda, ipiv, &info);
}
{
    //compute inverse of a matrix
    __CLPK_integer n=3;
    __CLPK_real *a=(float *)this->m1;
    __CLPK_integer lda=3;
    __CLPK_real work[3];
    __CLPK_integer lwork=3;
    __CLPK_integer info;
    sgetri_(&n, a, &lda, ipiv, work, &lwork, &info);
}

How to find a subset of a 2d array by selecting rows and columns in c?

In an m-by-n array stored as double *d (column-major), what is the fastest way of selecting a range of rows and/or columns:
double *filter(double *mat, int m, int n, int rows[], int cols[]);
invoked as:
double *B;
int rows[]= {1,3,5}; int cols[]={2,4};
B = filter(A, 5, 4, rows, cols);
which is expected to return a 3-by-2 subset of A consisting of elements (1,2), (1,4), (3,2)...
C provides no native support for this, so you'll have to find a library that supports it, or define it by hand.
pseudocode:
a = length of rows   // no built-in C functionality for this either:
b = length of cols   // use a sentinel, or pass the lengths to the function
nmat = allocate sizeof(double)*a*b bytes on the heap
for (i=0; i<a; ++i)
    for (j=0; j<b; ++j)
        // Note the column-major storage convention here (bloody FORTRAN!)
        nmat[i + j*a] = mat[ rows[i] + cols[j]*m ];
return nmat
// caller is responsible for freeing the allocated memory
I've commented on the two big problems you face.
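A concrete version of that pseudocode might look like the following (a sketch under the stated assumptions: the source is m x n column-major, and the lengths of rows[] and cols[] are passed in explicitly, which differs from the prototype in the question):

#include <stdlib.h>

double *filter(const double *mat, int m, int n,
               const int rows[], int nrows,
               const int cols[], int ncols)
{
    (void)n;  /* only m (the leading dimension) matters for column-major indexing */
    double *out = malloc((size_t)nrows * (size_t)ncols * sizeof *out);
    if (!out) return NULL;
    for (int j = 0; j < ncols; ++j)          /* output is also column-major */
        for (int i = 0; i < nrows; ++i)
            out[i + j*nrows] = mat[ rows[i] + cols[j]*m ];
    return out;  /* caller frees */
}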
I'm pretty sure that m and n should be the dimensions of the new array (i.e. m should be the number of items in rows and n the number of items in cols), not the dimensions of the source array as seems to be the case in your example. Reasons: a) you don't need to know the dimensions of the source array if you already know which indices to pick from it (and are presumably allowed to assume that those will be valid), and b) if you don't know the sizes of rows and cols, they are useless, as you can't iterate over them.
So under that assumption the algorithm could look like this:
Allocate memory for an m x n array
For each i in 0..m, j in 0..n
B[i,j] = A[ rows[i], cols[j] ];
