Matrix operations using code vectorization - c

I have written a function to transpose a 4x4 matrix, but I do not know how to extend the code to an m x n matrix.
Where can I find sample code for matrix operations with SSE (product, transpose, inverse, etc.)?
This is the 4x4 transpose code:
void transpose(float* src, int n) {
__m128 row0, row1, row2, row3;
__m128 tmp1;
tmp1=_mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));
row1=_mm_loadh_pi(_mm_loadl_pi(row1, (__m64*)(src+8)), (__m64*)(src+12));
row0=_mm_shuffle_ps(tmp1, row1, 0x88);
row1=_mm_shuffle_ps(row1, tmp1, 0xDD);
tmp1=_mm_movelh_ps(tmp1, row1);
row1=_mm_movehl_ps(tmp1, row1);
tmp1=_mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src+ 2)), (__m64*)(src+ 6));
row3= _mm_loadh_pi(_mm_loadl_pi(row3, (__m64*)(src+10)), (__m64*)(src+14));
row2=_mm_shuffle_ps(tmp1, row3, 0x88);
row3=_mm_shuffle_ps(row3, tmp1, 0xDD);
tmp1=_mm_movelh_ps(tmp1, row3);
row3=_mm_movehl_ps(tmp1, row3);
_mm_store_ps(src, row0);
_mm_store_ps(src+4, row1);
_mm_store_ps(src+8, row2);
_mm_store_ps(src+12, row3);
}

I'm not sure how to do an in-place transpose for arbitrary matrices efficiently using SIMD, but I do know how to do it out of place. Let me describe how to do both.
In place transpose
For in-place transpose you should see Agner Fog's Optimizing software in C++ manual. See section 9.10 "Cache contentions in large data structures", example 9.5a. For certain matrix sizes you will see a large drop in performance due to cache aliasing. See table 9.1 for examples, and the question "Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?". Agner gives a way to fix this using loop tiling (similar to what Paul R described) in Example 9.5b.
Out of place transpose
See my answer (the one with the most votes) to "What is the fastest way to transpose a matrix in C++?". I have not looked into this in ages, but let me just repeat my code here:
inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
__m128 row1 = _mm_load_ps(&A[0*lda]);
__m128 row2 = _mm_load_ps(&A[1*lda]);
__m128 row3 = _mm_load_ps(&A[2*lda]);
__m128 row4 = _mm_load_ps(&A[3*lda]);
_MM_TRANSPOSE4_PS(row1, row2, row3, row4);
_mm_store_ps(&B[0*ldb], row1);
_mm_store_ps(&B[1*ldb], row2);
_mm_store_ps(&B[2*ldb], row3);
_mm_store_ps(&B[3*ldb], row4);
}
inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb ,const int block_size) {
#pragma omp parallel for
for(int i=0; i<n; i+=block_size) {
for(int j=0; j<m; j+=block_size) {
int max_i2 = i+block_size < n ? i + block_size : n;
int max_j2 = j+block_size < m ? j + block_size : m;
for(int i2=i; i2<max_i2; i2+=4) {
for(int j2=j; j2<max_j2; j2+=4) {
transpose4x4_SSE(&A[i2*lda +j2], &B[j2*ldb + i2], lda, ldb);
}
}
}
}
}

Here is one general approach you can use for transposing an NxN matrix using tiling. You could even use your existing 4x4 transpose and work with a 4x4 tile size:
for each 4x4 block in the matrix with top left indices r, c
    if block is on diagonal (i.e. if r == c)
        get block a = 4x4 block at r, c
        transpose block a
        store block a at r, c
    else if block is above diagonal (i.e. if r < c)
        get block a = 4x4 block at r, c
        get block b = 4x4 block at c, r
        transpose block a
        transpose block b
        store transposed block a at c, r
        store transposed block b at r, c
    else // block is below diagonal
        do nothing
    endif
endfor
Obviously N needs to be a multiple of 4 for this to work, otherwise you will need to do some additional housekeeping.
As mentioned above in the comments, an MxN in-place transpose is hard to do - you need to either use an additional temporary matrix (which effectively makes it a not-in-place transpose) or use the method described here, but this will be much harder to vectorize with SIMD.
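The tiling scheme in this answer can be sketched in plain C as follows. This is only an illustration under some assumptions: the matrix is square, row-major, and N is a multiple of 4. The names transpose_inplace_tiled, load_transposed_4x4 and store_4x4 are made up for the example, and the scalar 4x4 helper is the place where an SSE 4x4 transpose would be substituted.

```c
#include <assert.h>
#include <stddef.h>

/* Read the 4x4 block at (r, c) of a row-major n x n matrix and write its
   transpose into dst (a row-major 4x4 scratch buffer). */
static void load_transposed_4x4(const float *m, size_t n, size_t r, size_t c, float dst[16])
{
    for (size_t i = 0; i < 4; ++i)
        for (size_t j = 0; j < 4; ++j)
            dst[j * 4 + i] = m[(r + i) * n + (c + j)];
}

/* Write a row-major 4x4 scratch buffer to the block at (r, c). */
static void store_4x4(float *m, size_t n, size_t r, size_t c, const float src[16])
{
    for (size_t i = 0; i < 4; ++i)
        for (size_t j = 0; j < 4; ++j)
            m[(r + i) * n + (c + j)] = src[i * 4 + j];
}

/* In-place transpose of a row-major n x n matrix using 4x4 tiles. */
void transpose_inplace_tiled(float *m, size_t n)
{
    float a[16], b[16];
    assert(n % 4 == 0); /* assumption: no edge handling in this sketch */

    for (size_t r = 0; r < n; r += 4) {
        /* start at c = r: blocks below the diagonal are handled by their mirror */
        for (size_t c = r; c < n; c += 4) {
            load_transposed_4x4(m, n, r, c, a);
            if (r == c) {
                store_4x4(m, n, r, c, a);          /* diagonal block */
            } else {
                load_transposed_4x4(m, n, c, r, b);
                store_4x4(m, n, c, r, a);          /* swap the two transposed blocks */
                store_4x4(m, n, r, c, b);
            }
        }
    }
}
```

Both blocks of a pair are read into scratch buffers before either is written back, so nothing is overwritten before it has been read.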

Related

Linear indexing of Matlab matrices in MEX file

I have a NxN symmetric matrix F of the following form
F_11 F_12 F_13 ... F_1N
F_21 ...
F_31
.
.
.
F_N1 F_N2 F_N3 ... F_NN
with each submatrices F_IJ of size m x m.
This matrix is created in MATLAB and will be used in a C program, so the values are stored column-wise in a vector, e.g. the vector will be of the form (F_11_11,F_11_21,F_11_31,...F_11_m1,F_21_11,...F_NN_(m-1)m,F_NN_mm).
My question is the following: For readability I would like to define in C a way to access the values of F, given the indices (I,J) of the location of the first submatrix, and the indices (i,j) of the location of value in the submatrix. How can I link the linear indexing of the matrix to the (I,J,i,j) indices?
I assume all indices to be zero based, as usual in C/C++. If you want to use Matlab style one based indices, subtract one from each index.
I didn't check it, but I guess your index should be...
int idx = I*m+J*N*m*m+i+j*N*m;
You can write a function that calculates the index. Note that in C, indices start at 0.
size_t index_of_2d(size_t x, size_t y, size_t n) {
return x + y*n;
}
size_t index_of_4d(size_t I, size_t J, size_t N, size_t i, size_t j, size_t m) {
size_t submatrix = index_of_2d(I, J, N) * m * m; // scale the index in super matrix by the size of the submatrix
return submatrix + index_of_2d(i, j, m);
}
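Note that index_of_4d above assumes each submatrix is stored contiguously. The vector in the question looks like the full (N*m) x (N*m) matrix stored column-major instead (F_21_11 comes straight after F_11_m1). Under that assumption, a possible helper goes through the global row and column. The name block_index is made up for this sketch, and all indices are zero-based.

```c
#include <assert.h>
#include <stddef.h>

/* Linear index of element (i, j) of submatrix (I, J), assuming the full
   (N*m) x (N*m) matrix is stored column-major. All indices are zero-based. */
size_t block_index(size_t I, size_t J, size_t i, size_t j, size_t N, size_t m)
{
    assert(I < N && J < N && i < m && j < m);

    size_t row = I * m + i;   /* row in the full matrix */
    size_t col = J * m + j;   /* column in the full matrix */
    size_t ld  = N * m;       /* number of rows of the full matrix */

    return row + col * ld;
}
```

Expanding the return value gives I*m + i + J*N*m*m + j*N*m, which is the same expression as the one-line formula in the first answer.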

How do I extract a vector from a column-major matrix in C?

Coming from a MATLAB background, I have often used the fancy matrix manipulation commands such as vec = matrix(:,1) for extracting, e.g., the first column of matrix as a vector.
Porting some code to C with the need to interface it with FORTRAN and MATLAB now has me store matrices in single-dimensional arrays with column-major order.
So basically, I am using the macro
#define SUB2IND_2D(s1, s2, i1, i2) ((s1)*(i2) + (i1)) /* outer parentheses keep the macro safe inside larger expressions */
and the loops
for(size_t r=0; r<ROWS; ++r)
{
for(size_t c=0; c<COLS; ++c)
{
size_t index = SUB2IND_2D(ROWS,COLS,r,c);
// do something with matrix[index] here
}
}
in order to access the respective matrix. Now, my question is: How can I efficiently extract a column or row vector from matrix within this framework in C, just like I would do in MATLAB using matrix(:,1) or matrix(1,:) or similar?
Let's say you want to extract column number 2 and give it the name ex_col:
double ex_col[ROWS]; // assumes matrix holds doubles; use the element type of matrix
for (size_t x=0; x<ROWS; x++)
{
    size_t index = SUB2IND_2D(ROWS, COLS, x, 2); // fix column to 2 and extract all rows
    ex_col[x] = matrix[index];
}
Now you can generalize it to a function.
It's a little unclear what you're trying to achieve. vec(r,c) will give you access to a specific element. Otherwise you answered your own question: vec(:,c) extracts a column and vec(r,:) extracts a row when you run your loop.
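As a rough sketch of the generalization: in column-major storage a column is contiguous, so it can be copied with memcpy, while a row has to be gathered with a stride of ROWS elements. The function names and the double element type are assumptions for this example, not something from the question.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy column c of a column-major rows x cols matrix into out[rows].
   A column occupies rows consecutive elements, so one memcpy is enough. */
void extract_column(const double *matrix, size_t rows, size_t cols, size_t c, double *out)
{
    assert(c < cols);
    memcpy(out, matrix + c * rows, rows * sizeof *out);
}

/* Copy row r of a column-major rows x cols matrix into out[cols].
   Consecutive row elements are rows entries apart in memory. */
void extract_row(const double *matrix, size_t rows, size_t cols, size_t r, double *out)
{
    assert(r < rows);
    for (size_t c = 0; c < cols; ++c)
        out[c] = matrix[c * rows + r];
}
```

If the column is only read, no copy is needed at all: matrix + c * rows can be used directly as a pointer to the column.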

covariance matrix gsl

I am trying to calculate the Mahalanobis distance between two vectors a and b. Eventually, I will be using this as a distance measure in statistical algorithms. I am using gsl to implement them. The formula for the Mahalanobis distance is sqrt((a-b)'c^-1(a-b)), where c is the covariance matrix. According to this gsl documentation, the covariance function takes in two data sets and returns one covariance value. I am not sure how to calculate the covariance matrix using that.
Any help is appreciated.
Thanks.
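I think you need to understand the calculation of a covariance matrix first. Second, here's some sample code to get you started. It assumes each column of A holds one variable, so both loops run over the columns (size2):
/* a and b are gsl_vector_view; C is a size2 x size2 gsl_matrix */
for (i = 0; i < A->size2; i++) {
    for (j = i; j < A->size2; j++) {
        a = gsl_matrix_column (A, i);
        b = gsl_matrix_column (A, j);
        double cov = gsl_stats_covariance(a.vector.data, a.vector.stride, b.vector.data, b.vector.stride, a.vector.size);
        gsl_matrix_set (C, i, j, cov);
        gsl_matrix_set (C, j, i, cov); /* the covariance matrix is symmetric */
    }
}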
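To make the calculation itself explicit, here is a minimal sketch without GSL. It assumes the data is stored row-major with one observation per row and one variable per column, and it uses the 1/(n-1) sample normalisation (which, as far as I know, is what gsl_stats_covariance uses too). The name covariance_matrix and the storage layout are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>

/* data: n observations (rows) by d variables (columns), row-major.
   cov:  d x d output matrix, row-major. */
void covariance_matrix(const double *data, size_t n, size_t d, double *cov)
{
    assert(n >= 2); /* the 1/(n-1) normalisation needs at least two observations */

    for (size_t i = 0; i < d; ++i) {
        for (size_t j = i; j < d; ++j) {
            double mean_i = 0.0, mean_j = 0.0, sum = 0.0;

            /* means of variables i and j */
            for (size_t k = 0; k < n; ++k) {
                mean_i += data[k * d + i];
                mean_j += data[k * d + j];
            }
            mean_i /= (double)n;
            mean_j /= (double)n;

            /* sum of products of deviations from the means */
            for (size_t k = 0; k < n; ++k)
                sum += (data[k * d + i] - mean_i) * (data[k * d + j] - mean_j);

            /* fill both triangles: the matrix is symmetric */
            cov[i * d + j] = cov[j * d + i] = sum / (double)(n - 1);
        }
    }
}
```

Recomputing the means for every pair keeps the sketch short; for many variables it would be cheaper to compute each mean once up front.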

Implementing matrix multiplication with openCL / C

I understand the theory of matrix multiplication, I just have two questions about this particular kernel implementation:
For reference, num_rows = 32. The matrix B (b_mat) has been transposed before by another kernel, so as I understand it we're dot-ting row vectors together.
1) why do we need to use the param "vectors_per_row" and thus the inner loop? I thought we could just do sum += dot(row of A, row of B), and it seems like this param is splitting up the row into smaller portions (why?).
2) I don't understand the address offset for a_mat and c_mat, i.e. a_mat += start; c_mat += start*4;
__kernel void matrix_mult(__global float4 *a_mat,
__global float4 *b_mat, __global float *c_mat) {
float sum;
int num_rows = get_global_size(0);
int vectors_per_row = num_rows/4;
int start = get_global_id(0) * vectors_per_row;
a_mat += start;
c_mat += start*4;
for(int i=0; i<num_rows; i++) {
sum = 0.0f;
for(int j=0; j<vectors_per_row; j++) {
sum += dot(a_mat[j],
b_mat[i*vectors_per_row + j]);
}
c_mat[i] = sum;
}
}
Your matrix is composed of an array of float4s. A float4 is a vector of 4 floats. This is where the 4 comes from. dot only works with the built-in types, so you have to do it on the float4.
c_mat is of type float, which is why it has start*4 and a_mat has start. The offset is because the code is split up across several (potentially hundreds) of threads. Each thread is only calculating a small part of the multiply operation. start is simply where the thread starts computing. This is what the get_global_id(0) is for. It essentially gets your thread id. Technically it's the thread index of the first dimension, but it appears you only have one thread dimension, so here you can just think of it as thread id.
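To see the indexing without OpenCL, here is a rough plain-C emulation of the kernel in which the outer loop plays the role of the work-items. It is only a sketch: the function name is made up, the float4 values are modelled as groups of 4 consecutive floats, and n is assumed to be a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* a:  n x n matrix A, row-major.
   bt: B already transposed, n x n, row-major.
   c:  receives A*B, n x n, row-major. */
void matrix_mult_emulated(const float *a, const float *bt, float *c, size_t n)
{
    assert(n % 4 == 0);
    size_t vectors_per_row = n / 4;                 /* float4s per row */

    for (size_t gid = 0; gid < n; ++gid) {          /* one iteration per work-item */
        size_t start = gid * vectors_per_row;       /* row offset in float4 units */
        const float *a_row = a + start * 4;         /* same address as (float4 *)a_mat + start */
        float *c_row = c + start * 4;               /* c_mat is float *, hence start*4 */

        for (size_t i = 0; i < n; ++i) {
            float sum = 0.0f;
            for (size_t j = 0; j < vectors_per_row; ++j) {
                const float *va = a_row + 4 * j;
                const float *vb = bt + 4 * (i * vectors_per_row + j);
                /* what dot() does on two float4 values */
                sum += va[0] * vb[0] + va[1] * vb[1] + va[2] * vb[2] + va[3] * vb[3];
            }
            c_row[i] = sum;
        }
    }
}
```

Both offsets point at the start of row gid. They differ by a factor of 4 only because a_mat is indexed in float4 units and c_mat in float units. The inner loop is there because dot() handles 4 floats at a time, so a row of num_rows floats takes num_rows/4 calls.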

Symmetric Matrix Inversion in C using CBLAS/LAPACK

I am writing an algorithm in C that requires Matrix and Vector multiplications. I have a matrix Q (W x W) which is created by multiplying the transpose of a vector J(1 x W) with itself and adding Identity matrix I, scaled using scalar a.
Q = [(J^T) * J + aI].
I then have to multiply the inverse of Q with vector G to get vector M.
M = (Q^(-1)) * G.
I am using cblas and clapack to develop my algorithm. When matrix Q is populated using random numbers (type float) and inverted using the routines sgetrf_ and sgetri_ , the calculated inverse is correct.
But when matrix Q is symmetrical, which is the case when you multiply (J^T) x J, the calculated inverse is wrong!!.
I am aware of the row-major (in C) and column-major (in FORTRAN) format of arrays while calling lapack routines from C, but for a symmetrical matrix this should not be a problem as A^T = A.
I have attached my C function code for matrix inversion below.
I am sure there is a better way to solve this. Can anyone help me with this?
A solution using cblas would be great...
Thanks.
void InverseMatrix_R(float *Matrix, int W)
{
int LDA = W;
int IPIV[W];
int ERR_INFO;
int LWORK = W * W;
float Workspace[LWORK];
// - Compute the LU factorization of a M by N matrix A
sgetrf_(&W, &W, Matrix, &LDA, IPIV, &ERR_INFO);
// - Generate inverse of the matrix given its LU decomposition
sgetri_(&W, Matrix, &LDA, IPIV, Workspace, &LWORK, &ERR_INFO);
// - Display the Inverted matrix
PrintMatrix(Matrix, W, W);
}
void PrintMatrix(float* Matrix, int row, int colm)
{
int i,k;
for (i =0; i < row; i++)
{
for (k = 0; k < colm; k++)
{
printf("%g, ",Matrix[i*colm + k]);
}
printf("\n");
}
}
I don't know BLAS or LAPACK, so I have no idea what may cause this behaviour.
But, for matrices of the given form, calculating the inverse is quite easy. The important fact for this is
(J^T*J)^2 = (J^T*J)*(J^T*J) = J^T*(J*J^T)*J = <J|J> * (J^T*J)
where <u|v> denotes the inner product (if the components are real - the canonical bilinear form for complex components, but then you'd probably consider not the transpose but the conjugate transpose, and you'd be back at the inner product).
Generalising,
(J^T*J)^n = (<J|J>)^(n-1) * (J^T*J), for n >= 1.
Let us denote the symmetric square matrix (J^T*J) by S and the scalar <J|J> by q. Then, for general a != 0 of sufficiently large absolute value (|a| > q):
(a*I + S)^(-1) = 1/a * (I + a^(-1)*S)^(-1)
               = 1/a * (I + ∑_{k>0} (-1)^k * a^(-k) * S^k)
               = 1/a * (I + (∑_{k>0} (-1)^k * a^(-k) * q^(k-1)) * S)
               = 1/a * (I - 1/(a+q)*S)
               = 1/a*I - 1/(a*(a+q))*S
That formula holds (by analyticity) for all a except a = 0 and a = -q, as can be verified by calculating
(a*I + S) * (1/a*I - 1/(a*(a+q))*S) = I + 1/a*S - 1/(a+q)*S - 1/(a*(a+q))*S^2
                                    = I + 1/a*S - 1/(a+q)*S - q/(a*(a+q))*S
                                    = I + ((a+q) - a - q)/(a*(a+q))*S
                                    = I
using S^2 = q*S.
That calculation is also much simpler and more efficient than first finding the LU decomposition.
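Applied to the question's M = Q^(-1) * G, the formula gives M = G/a - (<J|G> / (a*(a+q))) * J^T, because S*G = J^T*(J*G). A minimal C sketch of that, with a made-up function name and assuming a != 0 and a != -q, could look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Solve (J^T*J + a*I) * M = G for M without forming or inverting Q.
   J, G and M are vectors of length w. */
void solve_rank_one_shifted(const float *J, const float *G, float a, size_t w, float *M)
{
    float q = 0.0f;   /* <J|J> */
    float jg = 0.0f;  /* <J|G> */

    for (size_t k = 0; k < w; ++k) {
        q  += J[k] * J[k];
        jg += J[k] * G[k];
    }
    assert(a != 0.0f && a + q != 0.0f); /* the formula is undefined otherwise */

    float scale = jg / (a * (a + q));
    for (size_t k = 0; k < w; ++k)
        M[k] = G[k] / a - scale * J[k];
}
```

This needs O(W) work and no extra storage, compared with O(W^3) for a general LU-based inverse.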
You may want to try Armadillo, which is an easy to use C++ wrapper for LAPACK. It provides several inverse related functions:
inv(), general inverse, with an optional speedup for symmetric positive definite matrices
pinv(), pseudo-inverse
solve(), solve a system of linear equations (that can be over- or under-determined), without doing the actual inverse
Here is an example of 3x3 matrix inversion; see sgetri.f for more details.
//__CLPK_integer is typedef of int
//__CLPK_real is typedef of float
__CLPK_integer ipiv[3];
{
//Compute LU lower upper factorization of matrix
__CLPK_integer m=3;
__CLPK_integer n=3;
__CLPK_real *a=(float *)this->m1;
__CLPK_integer lda=3;
__CLPK_integer info;
sgetrf_(&m, &n, a, &lda, ipiv, &info);
}
{
//compute inverse of a matrix
__CLPK_integer n=3;
__CLPK_real *a=(float *)this->m1;
__CLPK_integer lda=3;
__CLPK_real work[3];
__CLPK_integer lwork=3;
__CLPK_integer info;
sgetri_(&n, a, &lda, ipiv, work, &lwork, &info);
}

Resources