Is there any way to optimize matrix multiplication in C?

My code is as follows. In the main function I call the Mat_product function about 223,440 times, which takes 179 ns, 23% of the whole runtime.
struct Matrix_SE3 {
    float R[3][3];
    double P[3]; //here i need use double type.
};

struct Matrix_SE3 Mat_product(struct Matrix_SE3 A, struct Matrix_SE3 B) {
    struct Matrix_SE3 result = { { { 0, 0, 0 }, { 0, 0, 0 }, { 0, 0, 0 } },
                                 { 0, 0, 0 } };
    for (int i = 0; i < 3; i++) {
        result.P[i] += A.P[i];
        for (int j = 0; j < 3; j++) {
            result.P[i] += A.R[i][j] * B.P[j];
            for (int k = 0; k < 3; k++)
                result.R[i][j] += A.R[i][k] * B.R[k][j];
        }
    }
    return result;
}
Here $R$ is the rotation matrix and $P$ represents the position; the function multiplies two special Euclidean group $SE(3)$ matrices and returns an $SE(3)$ matrix.
Maybe this is a duplicate of Optimized matrix multiplication in C; the difference is that my code uses a struct to describe the matrix. Does that affect the efficiency of the calculation?

Not sure what the P and R parts in your code are, but you should never use the ijk ordering for matrix multiplication.
Because of row-major storage, accessing B.R[k][j] in your inner loop means many accesses lead to a cache miss, reducing performance significantly, even with your small matrices.
The proper way to perform matrix multiplication is to iterate in the ikj order.
for (int i = 0; i < 3; i++) {
    double r;
    // translation part: result.P = A.P + A.R * B.P (done once per row)
    result.P[i] += A.P[i];
    for (int j = 0; j < 3; j++)
        result.P[i] += A.R[i][j] * B.P[j];
    // rotation part in ikj order: rows of B.R are traversed contiguously
    for (int k = 0; k < 3; k++) {
        r = A.R[i][k];
        for (int j = 0; j < 3; j++)
            result.R[i][j] += r * B.R[k][j];
    }
}
All accesses will then be performed in row-major order and will benefit from the cache behavior.
And do not forget to use -O3 optimization. Most compilers will use SSE/AVX instructions to optimize the code.
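One point the answer does not address is the struct question from the post: the struct itself does not change the arithmetic, but passing Matrix_SE3 by value copies roughly 60 bytes per argument on every call. Whether that matters depends on inlining, so the following pointer-based variant (the name Mat_product_ptr is hypothetical) is a sketch to profile, not a guaranteed win:
void Mat_product_ptr(const struct Matrix_SE3 *A, const struct Matrix_SE3 *B,
                     struct Matrix_SE3 *out)
{
    for (int i = 0; i < 3; i++) {
        // translation part: out->P = A->P + A->R * B->P
        out->P[i] = A->P[i];
        for (int j = 0; j < 3; j++) {
            out->P[i] += A->R[i][j] * B->P[j];
            out->R[i][j] = 0.0f;            // clear before accumulating
        }
        // rotation part in ikj order
        for (int k = 0; k < 3; k++) {
            float r = A->R[i][k];
            for (int j = 0; j < 3; j++)
                out->R[i][j] += r * B->R[k][j];
        }
    }
}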

Related

How do I multiply two dynamic matrices in C?

I'm trying to multiply two dynamic matrices by passing them through a function. I'm getting a segmentation fault during the multiplication.
The matrices are being passed through a function. The items in the arguments are correct because I had to use them for a different operation in this project. I have a feeling that I messed up with the pointers, but I'm pretty new to C and I'm not sure where I messed up.
double** multiplyMatrices(
    double** a,
    const uint32_t a_rows,
    const uint32_t a_cols,
    double** b,
    const uint32_t b_cols)
{
    uint32_t i = 0;
    uint32_t j = 0;
    uint32_t k = 0;
    double** c;

    //allocate memory to matrix c
    c = (double **)malloc(sizeof(double *) * a_rows);
    for (i = 0; i < a_rows; i++) {
        *(c + i) = (double *)malloc(sizeof(double) * b_cols);
    }
    //clear matrix c
    for (i = 0; i < a_rows; i++) {
        for (j = 0; j < a_cols; j++) {
            *c[j] = 0;
        }
    }
    i = 0;
    //multiplication
    while (j = 0, i < a_rows) {
        while (k = 0, j < b_cols) {
            while (k < a_cols) {
                //following line is where i'm getting the segmentation fault
                *(*(c + (i * b_cols)) + j) += (*(*(a + (i * a_cols)) + k)) * (*(*(b + (k * b_cols)) + j));
                k++;
            }
            j++;
        }
        i++;
    }
    return c;
}
The obvious mistake is that you dereference c + i * b_cols while c is an array of pointers of size a_rows. So likely c + i * b_cols is outside of the area that you previously allocated with malloc().
I would suggest to simplify the matrix representation using a single array of double with the size equal to the total number of elements, i.e. rows * cols.
For example:
double *c;
c = malloc(sizeof(double) * a_rows * b_cols);
This not only has better overall performance, but simplifies the code. You would then have to "linearise" the offset inside your unidimensional array to convert from bi-dimensional matrix coordinates. For example:
c[i * b_cols + j] = ...
Of course, the other two matrices need to be allocated, filled and accessed in a similar manner.
For code clarity, I would also replace the while statements by for statements with the actual variable that they loop on. For example:
for (i = 0; i < a_rows; i++)
    for (j = 0; j < b_cols; j++)
        for (k = 0; k < a_cols; k++)
You can (ab)use the C language in many ways, but the trick is to make it more clear for you in the first place.
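Putting those suggestions together, a minimal sketch of the flat-array multiply (the function name and exact signature here are illustrative, not taken from the question):
#include <stdint.h>
#include <stdlib.h>

/* Row-major, flat storage: element (i, j) of an r x c matrix lives at
   index i * c + j. calloc gives a zero-initialized result, so no
   separate clearing loop is needed. */
double *multiply_matrices_flat(const double *a, uint32_t a_rows, uint32_t a_cols,
                               const double *b, uint32_t b_cols)
{
    double *c = calloc((size_t)a_rows * b_cols, sizeof(double));
    if (c == NULL)
        return NULL;

    for (uint32_t i = 0; i < a_rows; i++)
        for (uint32_t j = 0; j < b_cols; j++)
            for (uint32_t k = 0; k < a_cols; k++)
                c[i * b_cols + j] += a[i * a_cols + k] * b[k * b_cols + j];

    return c;
}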

Speed up matrix-matrix multiplication using SSE vector instructions

I'm having some trouble vectorizing some C code using SSE vector instructions. The code I have to vectorize is:
#define N 1000

void matrix_mul(int mat1[N][N], int mat2[N][N], int result[N][N])
{
    int i, j, k;
    for (i = 0; i < N; ++i)
    {
        for (j = 0; j < N; ++j)
        {
            for (k = 0; k < N; ++k)
            {
                result[i][k] += mat1[i][j] * mat2[j][k];
            }
        }
    }
}
Here is what I got so far:
void matrix_mul_sse(int mat1[N][N], int mat2[N][N], int result[N][N])
{
    int i, j, k; int* l;
    __m128i v1, v2, v3;
    v3 = _mm_setzero_si128();
    for (i = 0; i < N; ++i)
    {
        for (j = 0; j < N; j += 4)
        {
            for (k = 0; k < N; k += 4)
            {
                v1 = _mm_set1_epi32(mat1[i][j]);
                v2 = _mm_loadu_si128((__m128i*)&mat2[j][k]);
                v3 = _mm_add_epi32(v3, _mm_mul_epi32(v1, v2));
                _mm_storeu_si128((__m128i*)&result[i][k], v3);
                v3 = _mm_setzero_si128();
            }
        }
    }
}
After execution I got the wrong result. I know the reason is the load from memory into v2. I loop through mat1 in row-major order, so I need to load mat2[0][0], mat2[1][0], mat2[2][0], mat2[3][0], ... but what is actually loaded is mat2[0][0], mat2[0][1], mat2[0][2], mat2[0][3], ... because mat2 is stored in memory in row-major order. I tried to fix this problem but without any improvement.
Can anyone help me, please?
Below is a fixed version of your implementation:
void matrix_mul_sse(int mat1[N][N], int mat2[N][N], int result[N][N])
{
    int i, j, k;
    __m128i v1, v2, v3, v4;
    for (i = 0; i < N; ++i)
    {
        for (j = 0; j < N; ++j) // 'j' must be incremented by 1
        {
            // read mat1 here because it does not use 'k' index
            v1 = _mm_set1_epi32(mat1[i][j]);
            for (k = 0; k < N; k += 4)
            {
                v2 = _mm_loadu_si128((const __m128i*)&mat2[j][k]);
                // read what's in the result array first as we will need to add it later to our calculations
                v3 = _mm_loadu_si128((const __m128i*)&result[i][k]);
                // use _mm_mullo_epi32 here instead of _mm_mul_epi32 and add it to the previous result
                v4 = _mm_add_epi32(v3, _mm_mullo_epi32(v1, v2));
                // store the result
                _mm_storeu_si128((__m128i*)&result[i][k], v4);
            }
        }
    }
}
In short _mm_mullo_epi32 (requires SSE4.1) produces 4 x int32 results as opposed to _mm_mul_epi32 which does 2 x int64 results. If you cannot use SSE4.1 then have a look at the answer here for an alternative SSE2 solution.
Full description from the Intel Intrinsics Guide:
_mm_mullo_epi32: Multiply the packed 32-bit integers in a and b, producing intermediate 64-bit integers, and store the low 32 bits of the intermediate integers in dst.
_mm_mul_epi32: Multiply the low 32-bit integers from each packed 64-bit element in a and b, and store the signed 64-bit results in dst.
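The SSE2 alternative referred to above is usually built from _mm_mul_epu32 plus shuffles; a minimal sketch (not necessarily identical to the linked answer) looks like this:
#include <emmintrin.h>  // SSE2

// Emulate _mm_mullo_epi32 with SSE2 only: multiply lanes 0/2 and 1/3
// separately with _mm_mul_epu32 (the low 32 bits of a 32x32 product are
// the same for signed and unsigned operands), then pack the four low
// halves back into one vector.
static inline __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epu32(a, b);                       // products of lanes 0 and 2
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));       // products of lanes 1 and 3
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
                              _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0)));
}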
I kinda changed around your code to make the addressing explicit [ it helps in this case ].
#define N 100
This is a stub for the vector unit's multiply-and-accumulate operation; you should be able to replace NV with whatever width your vector unit has, and put the relevant opcodes in here.
#define NV 8

int Vmacc(int *A, int *B) {
    int i = 0;
    int x = 0;
    for (i = 0; i < NV; i++) {
        x += *A++ * *B++;
    }
    return x;
}
This multiply has two notable variations from the norm:
1. It caches the column vector into a contiguous one.
2. It attempts to push slices of the multiply-accumulate into a vector-like function.
Even without using the vector unit, this takes half the time of the naive version just because of better cache/prefetch utilization.
void mm2(int *A, int *B, int n, int *C) {
    int c, r;
    int stride = 0;
    int cache[N];
    for (c = 0; c < n; c++) {
        /* cache column c: */
        for (r = 0; r < n; r++) {
            cache[r] = B[c + r*n];
        }
        for (r = 0; r < n; r++) {
            int k = 0;
            int x = 0;
            int *Av = A + r*n;
            for (k = 0; k + NV - 1 < n; k += NV) {
                x += Vmacc(Av + k, cache + k);
            }
            while (k < n) {
                x += Av[k] * cache[k];
                k++;
            }
            C[r*n + c] = x;
        }
    }
}
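A hypothetical usage sketch, assuming the definitions above (N, Vmacc, mm2) and flat, row-major storage:
int A[N * N], B[N * N], C[N * N];
/* ... fill A and B ... */
mm2(A, B, N, C);   /* C[r*N + c] = row r of A dotted with column c of B */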

Optimising divide operation inside Jacobi relaxation

I am trying to optimise the divide operation from the Jacobi relaxation formula.
I am also profiling it using perf.
Here is my code:
for (int l = 0; l < iter; l++) {
    for (i = 1; i < height; i++) {
        for (j = 1; j < width; j++) {
            for (k = 1; k < length; k++) {
                float val = 0.0f;
                // Do the Jacobi additions here
                // From profiling, fastest is to fetch k+/-1, j, i
                // Slowest is to fetch k, j, i+/-1
                // Scale with dimensions of the array
                val -= dim * array[k][j][i];
                // Want to optimise this
                val /= 6.0; // profiling shows this as the slowest op
                // Some code here to put the result into the output array
            }
        }
    }
}
The size of the 3D array can be from 100x100x100 up to 1000x1000x1000.
I've tried multiplying by 1.0f/6.0f instead, but this does not seem to make a difference. The array is a 3D array of floats.
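For context, a sketch of that reciprocal-multiply attempt (the surrounding loop and the names dim and array are taken from the snippet above; inv6 is illustrative):
const float inv6 = 1.0f / 6.0f;   /* hoisted out of all loops */
/* ... inside the innermost loop ... */
float val = 0.0f;
/* Jacobi additions as above */
val -= dim * array[k][j][i];
val *= inv6;   /* multiply instead of divide; also note that 6.0 in the
                  original is a double constant, so the division there
                  pays an extra float<->double conversion */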

Improve performance of a construction of p-values matrix for a permutation test

I have an R implementation of a permutation test for the distributional comparison between two populations of functions. We have p univariate p-values.
The bottleneck is the construction of a matrix which contains all the possible contiguous p-values.
The last row of the matrix of p-values contain all the univariate p-values.
The penultimate row contains all the bivariate p-values in this order:
p_val_c(1,2), p_val_c(2,3), ..., p_val_c(p, 1)
...
The elements of the first row all coincide; the shared value is the p-value of the global test: p_val_c(1,...,p) = p_val_c(2,...,p,1) = ... = p_val_c(p,1,...,p-1).
For computational reasons, I decided to implement this component in C and call it from R with .C.
Here is the code. The only important part is the definition of the function Build_pval_asymm_matrix.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <time.h>

void Build_pval_asymm_matrix(int *p, int *B, double *pval,
                             double *L,
                             double *pval_asymm_matrix);

// Function used for sorting the vector T_temp with qsort
int cmp(const void *x, const void *y);

int main() {
    int B = 1000; // number of Conditional Monte Carlo (CMC) runs
    int p = 100;  // number of univariate tests

    // Generate fictitious data: univariate p-values pval and matrix L.
    // The j-th column of L is the empirical survival function of the
    // test statistic associated with the j-th coefficient of the basis
    // expansion. The dimension of L is B * p.

    // Generate pval
    double pval[p];
    memset(pval, 0, sizeof(pval)); // initialize all elements to 0
    for (int i = 0; i < p; i++) {
        pval[i] = (double)rand() / (double)RAND_MAX;
    }

    // Construct L
    double L[B * p];
    // Initialize the elements of L to 0
    memset(L, 0, sizeof(L));
    // Array used to construct the columns of L
    double temp_array[B];
    memset(temp_array, 0, sizeof(temp_array));
    for (int i = 0; i < B; i++) {
        temp_array[i] = (double)(i + 1) / (double)B;
    }
    for (int iter_coeff = 0; iter_coeff < p; iter_coeff++) {
        // Shuffle temp_array
        if (B > 1) {
            for (int k = 0; k < B - 1; k++) {
                int j = rand() % B;
                double t = temp_array[j];
                temp_array[j] = temp_array[k];
                temp_array[k] = t;
            }
        }
        for (int i = 0; i < B; i++) {
            L[iter_coeff + p * i] = temp_array[i];
        }
    }

    double pval_asymm_matrix[p * p];
    memset(pval_asymm_matrix, 0, sizeof(pval_asymm_matrix));

    // Construct the asymmetric matrix of p-values
    clock_t start, end;
    double cpu_time_used;
    start = clock();
    Build_pval_asymm_matrix(&p, &B, pval, L, pval_asymm_matrix);
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("TOTAL CPU time used: %f\n", cpu_time_used);
    return 0;
}
void Build_pval_asymm_matrix(int *p, int *B, double *pval,
                             double *L,
                             double *pval_asymm_matrix) {
    int nbasis = *p, iter_CMC = *B;
    // Scalar output of the Fisher combining function applied to the
    // univariate p-values
    double T0_temp = 0;
    // Vector output of the Fisher combining function applied to a set of
    // columns of L
    double T_temp[iter_CMC];
    memset(T_temp, 0, sizeof(T_temp));
    // Counter for elements of T_temp greater than or equal to T0_temp
    int count = 0;
    // Indexes for columns of L
    int inf = 0, sup = 0;

    // The last row of pval_asymm_matrix contains the univariate p-values
    for (int i = 0; i < nbasis; i++) {
        pval_asymm_matrix[i + nbasis * (nbasis - 1)] = pval[i];
    }

    // Construct the rows from bottom to top
    for (int row = nbasis - 2; row >= 0; row--) {
        for (int col = 0; col <= row; col++) {
            T0_temp = 0;
            memset(T_temp, 0, sizeof(T_temp));
            inf = col;
            sup = (nbasis - row) + col - 1;
            // Fisher combining function applied to
            // p-values pval[inf:sup]
            for (int k = inf; k <= sup; k++) {
                T0_temp += log(pval[k]);
            }
            T0_temp *= -2;
            // Fisher combining function applied to
            // columns inf:sup of matrix L
            for (int k = 0; k < iter_CMC; k++) {
                for (int l = inf; l <= sup; l++) {
                    T_temp[k] += log(L[l + nbasis * k]);
                }
                T_temp[k] *= -2;
            }
            // Sort the vector T_temp
            qsort(T_temp, iter_CMC, sizeof(double), cmp);
            // Count the number of elements of T_temp less than T0_temp
            int h = 0;
            while (h < iter_CMC && T_temp[h] < T0_temp) {
                h++;
            }
            // Number of elements of T_temp greater than or equal to T0_temp
            count = iter_CMC - h;
            pval_asymm_matrix[col + nbasis * row] = (double)count / (double)iter_CMC;
        }

        // Auxiliary variables for columns of L inf:nbasis-1 and 0:sup
        int aux_first = 0, aux_second = 0;
        int num_col_needed = 0;
        for (int col = row + 1; col < nbasis; col++) {
            T0_temp = 0;
            memset(T_temp, 0, sizeof(T_temp));
            inf = col;
            sup = ((nbasis - row) + col) % nbasis - 1;
            // Useful indexes
            num_col_needed = nbasis - inf + sup + 1;
            int index_needed[num_col_needed];
            memset(index_needed, -1, num_col_needed * sizeof(int));
            aux_first = inf;
            for (int i = 0; i < nbasis - inf; i++) {
                index_needed[i] = aux_first;
                aux_first++;
            }
            aux_second = 0;
            for (int j = 0; j < sup + 1; j++) {
                index_needed[j + nbasis - inf] = aux_second;
                aux_second++;
            }
            // Fisher combining function applied to p-values
            // pval[inf:nbasis-1] and pval[0:sup]
            for (int k = 0; k < num_col_needed; k++) {
                T0_temp += log(pval[index_needed[k]]);
            }
            T0_temp *= -2;
            // Fisher combining function applied to columns inf:nbasis-1
            // and 0:sup of matrix L
            for (int k = 0; k < iter_CMC; k++) {
                for (int l = 0; l < num_col_needed; l++) {
                    T_temp[k] += log(L[index_needed[l] + nbasis * k]);
                }
                T_temp[k] *= -2;
            }
            // Sort the vector T_temp
            qsort(T_temp, iter_CMC, sizeof(double), cmp);
            // Count the number of elements of T_temp less than T0_temp
            int h = 0;
            while (h < iter_CMC && T_temp[h] < T0_temp) {
                h++;
            }
            // Number of elements of T_temp greater than or equal to T0_temp
            count = iter_CMC - h;
            pval_asymm_matrix[col + nbasis * row] = (double)count / (double)iter_CMC;
        } // end for over col from row + 1 to nbasis - 1
    } // end for over rows of the asymmetric p-values matrix except the last row
}
int cmp(const void *x, const void *y)
{
    double xx = *(double *)x, yy = *(double *)y;
    if (xx < yy) return -1;
    if (xx > yy) return 1;
    return 0;
}
Here are the execution times in seconds, measured in R:
time_original_function
user system elapsed
79.726 1.980 112.817
time_function_double_for
user system elapsed
79.013 1.666 89.411
time_c_function
user system elapsed
47.920 0.024 56.096
The first measure was obtained using an equivalent R function with duplication of the vector pval and matrix L.
What I wanted to ask for is some suggestions on how to decrease the execution time of the C function, for simulation purposes. The last time I used C was five years ago, so there is certainly room for improvement. For instance, I sort the vector T_temp with qsort so that I can then count, in linear time with a while loop, the number of elements of T_temp greater than or equal to T0_temp. Maybe this task could be done in a more efficient way. Thanks in advance!
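For reference, the sort-free variant the question is hinting at is just a linear scan over T_temp (names as in the code above); whether it is worth doing is addressed in the answer below:
// Count elements of T_temp greater than or equal to T0_temp directly:
// one O(B) pass instead of qsort (O(B log B)) followed by a scan.
int count = 0;
for (int k = 0; k < iter_CMC; k++) {
    if (T_temp[k] >= T0_temp)
        count++;
}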
I reduced the input size p to 50 to avoid waiting on it (I don't have such a fast machine) -- keeping p as is and reducing B to 100 has a similar effect -- and profiling showed that ~7.5 out of the ~8 seconds used to compute this was spent in the log function.
qsort doesn't even show up as a real hotspot. This test seems to headbutt the machine more in terms of micro-efficiency than anything else.
So unless your compiler has a vastly faster implementation of log than I do, my first suggestion is to find a fast log implementation if you can afford some accuracy loss (there are ones out there that can compute log over an order of magnitude faster with precision loss in the range of ~3% or so).
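As an illustration only (not necessarily the implementation the answer has in mind), the classic bit-twiddling log2 approximation is in roughly that speed/accuracy range:
#include <stdint.h>

/* Rough natural-log approximation based on the IEEE-754 single-precision
   bit layout: reinterpret the bits as an integer, scale to get an
   approximate log2, then convert to base e. Accuracy is only a few
   percent, matching the trade-off described above. */
static inline float fast_log(float x) {
    union { float f; uint32_t i; } u = { x };
    float log2_approx = (float)u.i * 1.1920928955078125e-7f /* 2^-23 */
                        - 126.94269504f;
    return log2_approx * 0.6931471805599453f;   /* times ln(2) */
}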
If you cannot have precision loss and accuracy is critical, then I'd suggest trying to memoize the values you use for log if you can and store them into a lookup table.
Update
I tried the latter approach.
// Create a memoized table of log values.
double log_cache[B * p];
for (int j=0, num=B*p; j < num; ++j)
log_cache[j] = log(L[j]);
Using malloc might be better here, as we're pushing rather large data to the stack and could risk overflows.
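A sketch of that heap-allocated variant (same layout and indexing, just malloc/free):
// Heap-allocated memo table; indexed exactly like L.
double *log_cache = malloc(sizeof(double) * (size_t)B * (size_t)p);
for (int j = 0, num = B * p; j < num; ++j)
    log_cache[j] = log(L[j]);
/* ... use it as below, then free(log_cache) when done ... */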
Then pass it into Build_pval_asymm_matrix.
Replace these:
T_temp[k] += log(L[l + nbasis * k]);
...
T_temp[k] += log(L[index_needed[l] + nbasis * k]);
With these:
T_temp[k] += log_cache[l + nbasis * k];
...
T_temp[k] += log_cache[index_needed[l] + nbasis * k];
This improved the times for me from ~8 seconds to ~5.3 seconds, but we've exchanged the computational overhead of log for memory overhead which isn't that much better (in fact, it rarely is but calling log for double-precision floats is apparently quite expensive, enough to make this exchange worthwhile). The next iteration, if you want more speed, and it is very possible, involves looking into cache efficiency.
For this kind of huge matrix stuff, focusing on memory layouts and access patterns can work wonders.

Multiplying two arrays in C

I'm trying to multiply two multidimensional arrays to form a matrix. I have this function. This should work in theory. However, I am just getting 0s and large/awkward numbers. Can someone help me with this?
int **matrix_mult( int **a, int **b, int nr1, int nc1, int nc2 )
{
    int **c;
    int i, j, k, l;

    c = malloc(sizeof(int *) * nr1);
    if (c == NULL) {
        printf("Insuff memm");
    }
    for (l = 0; l < nr1; l++) {
        c[l] = malloc(sizeof(int) * nc1);
        if (c[l] == NULL) {
            printf("Insuff memm");
        }
    } //for loop
    for (i = 0; i < nr1; i++) {
        for (j = 0; j < nc2; j++) {
            for (k = 0; k < nc1; k++) {
                c[i][j] = (a[i][k]) * (b[k][j]);
            }
        }
    }
    return( c );
}
Are you doing mathematical matrix multiplication? If so, shouldn't it be:
for (i = 0; i < nr1; i++)
{
    for (k = 0; k < nc2; k++)
        c[i][k] = 0;
    for (j = 0; j < nc1; j++)
    {
        for (k = 0; k < nc2; k++)
        {
            c[i][k] += (a[i][j]) * (b[j][k]);
        }
    }
}
My full and final solution, tested to produce sensible results (I didn't actually do all the calculations manually to check them) and without any sensible niceties such as checking that the memory allocations work, is:
int **matrix_mult(int **a, int **b, int nr1, int nc1, int nc2)
{
    int **c;
    int i, j, k;

    c = malloc(sizeof(int *) * nr1);
    for (i = 0; i < nr1; i++)
    {
        c[i] = malloc(sizeof(int) * nc2);
        for (k = 0; k < nc2; k++)
        {
            c[i][k] = 0;
            for (j = 0; j < nc1; j++)
            {
                c[i][k] += (a[i][j]) * (b[j][k]);
            }
        }
    }
    return c;
}
There were a few typos in the core of the for loop in my original answer, mostly due to my being misled by a different answer. These have been corrected for posterity.
If you change c[i][j] = (a[i][k]) * (b[k][j]); to c[i][j] += (a[i][k]) * (b[k][j]); in your code then it will work just fine provided that
nr1 is number of rows of matrix a
nc1 is the number of columns of the matrix a
nc2 is the number of columns of the matrix b
Just be sure that the matrix c is initialized with zeroes. You can simply use calloc instead of malloc when allocating space, or memset the allocated array after the call to malloc; a small sketch follows below.
One more tip: avoid using the letter l as a loop variable when accessing array elements. When tired, you will have a hard time noticing errors involving l vs 1.
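A minimal sketch of the calloc route (note the row length nc2, the number of columns of b, which is what the multiplication loops index with j):
/* Zero-initialized allocation of the result matrix. */
c = malloc(sizeof(int *) * nr1);
for (i = 0; i < nr1; i++) {
    c[i] = calloc(nc2, sizeof(int));
}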
