DGEMM and DGEMV give different results - C

I want to implement the following equation in C:
C[l,q,m] = A[m,q,k] * B[k,l]
where the repeated index k is being summed over.
I implemented this in three ways:
Naive implementation with loops
Using the BLAS routine DGEMV (matrix-vector multiplication)
Using the BLAS routine DGEMM (matrix-matrix multiplication)
This is the minimal non-working code:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <cblas.h>
int main(void)
{
const size_t n = 3;
const size_t n2 = n*n;
const size_t n3 = n*n*n;
/* Fill rank 3 tensor with random numbers */
double a[n3];
for (size_t i = 0; i < n3; i++) {
a[i] = (double) rand() / RAND_MAX;
}
/* Fill matrix with random numbers */
double b[n2];
for (size_t i = 0; i < n2; i++) {
b[i] = (double) rand() / RAND_MAX;
}
/* All loops */
double c_exact[n3];
memset(c_exact, 0, n3 * sizeof(double));
for (size_t l = 0; l < n; l++) {
for (size_t q = 0; q < n; q++) {
for (size_t m = 0; m < n; m++) {
for (size_t k = 0; k < n; k++) {
c_exact[l*n2+q*n+m] += a[m*n2+q*n+k] * b[k*n+l];
}
}
}
}
/* Matrix-vector */
double c_mv[n3];
memset(c_mv, 0, n3 * sizeof(double));
for (size_t m = 0; m < n; m++) {
for (size_t l = 0; l < n; l++) {
cblas_dgemv(
CblasRowMajor, CblasNoTrans, n, n, 1.0, &a[m*n2],
n, &b[l], n, 0.0, &c_mv[l*n2+m], n);
}
}
/* Matrix-matrix */
double c_mm[n3];
memset(c_mm, 0, n3 * sizeof(double));
for (size_t m = 0; m < n; m++) {
cblas_dgemm(
CblasRowMajor, CblasTrans, CblasTrans, n, n, n, 1.0, b, n,
&a[m*n2], n, 0.0, &c_mm[m], n2);
}
/* Compute difference */
double diff_mv = 0.0;
double diff_mm = 0.0;
for (size_t idx = 0; idx < n3; idx++) {
diff_mv += c_mv[idx] - c_exact[idx];
diff_mm += c_mm[idx] - c_exact[idx];
}
printf("Difference matrix-vector: %e\n", diff_mv);
printf("Difference matrix-matrix: %e\n", diff_mm);
}
And this is the output:
Difference matrix-vector: 0.000000e+00
Difference matrix-matrix: -1.188678e+01
i.e. the DGEMV implementation is correct, but the DGEMM one is not, and I really don't understand why. I switched the order of the multiplication (matrix-matrix multiplication is non-commutative) and transposed both factors to get the order C[l,q,m] instead of C[q,l,m], but I also tried it without switching/transposing and it still does not work.
Can anyone please help?
Thanks.
edit: I thought about it a bit and I suspect I'm trying to do something that DGEMM does not support. Namely, I try to insert a submatrix into C[:,:,m], which means that neither the leading nor the trailing index of the output is contiguous in memory. DGEMM lets me set the parameter LDC, which in this case needs to be n^2, but it has no way of knowing that the second index is also non-contiguous, with a stride of n (and there is no parameter to tell it). So why does DGEMM not support a second stride parameter for the trailing dimension?
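One possible workaround (a sketch building on the code above; c_mm2 and tmp are new names, not part of the original): let DGEMM write each product into a contiguous n x n scratch buffer with ldc = n, then scatter that buffer into C with a small copy loop that applies both strides:
/* Matrix-matrix into a contiguous scratch buffer, then scatter */
double c_mm2[n3];
double tmp[n2];                       /* one n x n product, row-major */
memset(c_mm2, 0, n3 * sizeof(double));
for (size_t m = 0; m < n; m++) {
    /* tmp[l*n+q] = sum_k B[k,l] * A[m,q,k] = C[l,q,m] */
    cblas_dgemm(
        CblasRowMajor, CblasTrans, CblasTrans, n, n, n, 1.0, b, n,
        &a[m*n2], n, 0.0, tmp, n);
    /* scatter: C[l,q,m] has stride n2 in l and stride n in q */
    for (size_t l = 0; l < n; l++)
        for (size_t q = 0; q < n; q++)
            c_mm2[l*n2 + q*n + m] = tmp[l*n + q];
}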

Related

I am using the wrong indexing in one of the loops but can't figure out which one. I have made the changes which were suggested.

#include<stdio.h>
#include<math.h>
#include<stdlib.h>
const int N = 3;
void LUBKSB(double b[], double a[N][N], int N, int *indx)
{
int i, ii, ip, j;
double sum;
ii = 0;
for(i=0;i<N;i++)
{
ip = indx[i];
sum = b[ip];
b[ip] = b[i];
if (ii)
{
for(j = ii;j<i-1;j++)
{
sum = sum - a[i][j] * b[j];
}
}
else if(sum)
{
ii = i;
}
b[i] = sum;
}
for(i=N-1;i>=0;i--)
{
sum = b[i];
for (j = i; j<N;j++)
{
sum = sum - a[i][j] * b[j];
}
b[i] = sum/a[i][i];
}
for (i=0;i<N;i++)
{
printf("b[%d]: %lf \n",i,b[i]);
}
}
void ludecmp(double a[][3], int N)
{
int i, imax, j, k;
double big, dum, sum, temp, d;
double *vv = (double *) malloc(N * sizeof(double));
int *indx = (int *) malloc(N * sizeof(double));
double TINY = 0.000000001;
double b[3] = {2*M_PI,5*M_PI,-8*M_PI};
d = 1.0;
for(i=0;i<N;i++)
{
big = 0.0;
for(j=0;j<N;j++)
{
temp = fabs(a[i][j]);
if (temp > big)
{
big = temp;
}
}
if (big == 0.0)
{
printf("Singular matrix\n");
exit(1);
}
vv[i] = 1.0/big;
}
for(j=0;j<N;j++)
{
for(i=0;i<j-1;i++)
{
sum = a[i][j];
for(int k=0;k<i-1;k++)
{
sum = sum - (a[i][k] * a[k][j]);
}
a[i][j] = sum;
}
big = 0.0;
for(i=j;i<N;i++)
{
sum = a[i][j];
for(k=0;k<j-1;k++)
{
sum = sum - a[i][k] * a[k][j];
}
a[i][j] =sum;
dum = vv[i] * fabs(a[i][j]);
if(dum >= big)
{
big = dum;
imax = i;
}
}
if(j != imax)
{
for(k=0;k<N;k++)
{
dum = a[imax][k];
a[imax][k] = a[j][k];
a[j][k] = dum;
}
d = -d;
vv[imax] = vv[j];
}
indx[j] = imax;
if (a[j][j] == 0)
{
a[j][j] = TINY;
}
if (j != N)
{
dum = 1.0/a[j][j];
for(i = j; i<N; i++)
{
a[i][j] = a[i][j] * dum;
}
}
}
LUBKSB(b,a,N,indx);
free(vv);
free(indx);
}
int main()
{
int N, i, j;
N = 3;
double a[3][3] = { 1, 2, -1, 6, -5, 4, -9, 8, -7};
ludecmp(a,N);
}
I am using these algorithms to find the LU decomposition of a matrix and to solve A.x = b.
Algorithm 1 (ludecmp): Given an N x N matrix A = {a_ij}, i,j = 1..N, the routine replaces it by the LU decomposition of a rowwise permutation of itself. "a" and "N" are input; "a" is also output, modified to contain the LU decomposition. {indx_i}, i = 1..N, is an output vector that records the row permutation effected by the partial pivoting; "d" is output and takes the value +/-1 depending on whether the number of row interchanges was even or odd. This routine is used in combination with algorithm 2 to solve linear equations or invert a matrix.
Algorithm 2 (LUBKSB): Solves the set of N linear equations A.x = b. The matrix {a_ij} is here the LU decomposition of the original matrix A, obtained from algorithm 1. The vector {indx_i} is input as the permutation vector returned by algorithm 1. The vector {b_i} is input as the right-hand-side vector B and returns with the solution vector X. The inputs {a_ij}, N, and {indx_i} are not modified by this algorithm.
There are a number of problems with your code:
In your for-loops, i <= N should be i < N and i = N should be i = N - 1.
The absolute value of a double is returned by fabs, not abs.
The call to exit should be exit(1) or exit(EXIT_FAILURE).
Two of your functions lack a return statement.
You should also free the memory you have allocated with the function free. When you compile a C program you should also enable all warnings.
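To make the first three points concrete, here is a small self-contained helper (hypothetical, not taken from the code above) that scans one row the way the scaling loop in ludecmp does:
#include <math.h>
#include <stdlib.h>

/* Returns the largest absolute value in a row of length n.
   Illustrates the 0-based bound (i < n), fabs() for doubles,
   and exit() with an argument. */
static double row_max_abs(const double *row, int n)
{
    double big = 0.0;
    for (int i = 0; i < n; i++) {   /* i < n: valid indices are 0 .. n-1 */
        double t = fabs(row[i]);    /* fabs, not abs, for doubles */
        if (t > big)
            big = t;
    }
    if (big == 0.0) {
        /* singular matrix: exit takes an argument, e.g. EXIT_FAILURE */
        exit(EXIT_FAILURE);
    }
    return big;
}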

Implementing LU factorization with partial pivoting in C using only one matrix

I have designed the following C function in order to compute the PA = LU factorization, using only one matrix to store and compute the data:
double plupmc(int n, double **c, int *p, double tol) {
int i, j, k, pivot_ind = 0, temp_ind;
int ii, jj;
double pivot, *temp_row;
for (j = 0; j < n-1; ++j) {
pivot = 0.;
for (i = j; i < n; ++i)
if (fabs(c[i][j]) > fabs(pivot)) {
pivot = c[i][j];
pivot_ind = i;
}
temp_row = c[j];
c[j] = c[pivot_ind];
c[pivot_ind] = temp_row;
temp_ind = p[j];
p[j] = p[pivot_ind];
p[pivot_ind] = temp_ind;
for (k = j+1; k < n; ++k) {
c[k][j] /= c[j][j];
c[k][k] -= c[k][j]*c[j][k];
}
}
return 0.;
}
where n is the order of the matrix, c is a pointer to the matrix and p is a pointer to a vector storing the permutations done when partial pivoting the system. The variable tol is not relevant for now. The program works by storing in c both the lower and upper triangular parts of the factorization: U corresponds to the upper triangular part of c, and L corresponds to the strictly lower triangular part of c, with 1's added on the diagonal. From what I have been able to test, the part of the program corresponding to partial pivoting is working properly; however, the algorithm used to compute the entries of the matrix is not giving the expected results, and I cannot see why. For instance, if I try to compute the LU factorization of the matrix
1. 2. 3.
4. 5. 6.
7. 8. 9.
I get

l:  1.      0.      0.        u:  7.  8.  9.
    0.143   1.      0.            0.  2.  1.714*
    0.571   0.214*  1.            0.  0.  5.663*
the product of which does not correspond to any permutation of the matrix c. In fact, the wrong entries seem to be the ones marked with a star.
I would appreciate any suggestion to fix this problem.
I found the problem with your code: there was a little conceptual error in the way you were normalizing the rows while computing the actual decomposition:
for (k = j+1; k < n; ++k) {
c[k][j] /= c[j][j];
c[k][k] -= c[k][j]*c[j][k];
}
became:
for (k = j+1; k < n; ++k) {
temp=c[k][j]/=c[j][j];
for(int q=j+1;q<n;q++){
c[k][q] -= temp*c[j][q];
}
}
which returns the result:
7.000000 8.000000 9.000000
0.142857 0.857143 1.714286
0.571429 0.500000 -0.000000
If you have any questions I am happy to help.
Full implementation here:
#include<stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
double plupmc(int n, double **c, int *p, double tol) {
int i, j, k, pivot_ind = 0, temp_ind;
int ii, jj;
double *vv=calloc(n,sizeof(double));
double pivot, *temp_row;
double temp;
for (j = 0; j < n; ++j) {
pivot = 0;
for (i = j; i < n; ++i)
if (fabs(c[i][j]) > fabs(pivot)) {
pivot = c[i][j];
pivot_ind = i;
}
temp_row = c[j];
c[j] = c[pivot_ind];
c[pivot_ind] = temp_row;
temp_ind = p[j];
p[j] = p[pivot_ind];
p[pivot_ind] = temp_ind;
for (k = j+1; k < n; ++k) {
temp=c[k][j]/=c[j][j];
for(int q=j+1;q<n;q++){
c[k][q] -= temp*c[j][q];
}
}
for(int q=0;q<n;q++){
for(int l=0;l<n;l++){
printf("%lf ",c[q][l]);
}
printf("\n");
}
}
return 0.;
}
int main() {
double **x;
x=calloc(3,sizeof(double*));
for(int i=0;i<3;i++){
x[i]=calloc(3,sizeof(double));
}
memcpy(x[0],(double[]){1,2,3},3*sizeof(double));
memcpy(x[1],(double[]){4,5,6},3*sizeof(double));
memcpy(x[2],(double[]){7,8,9},3*sizeof(double));
int *p=calloc(3,sizeof(int));
memcpy(p,(int[]){0,1,2},3*sizeof(int));
plupmc(3,x,p,1);
for(int i=0;i<3;i++){
free(x[i]);
}
free(p);
free(x);
}
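As a sanity check (a hypothetical helper, not part of the answer above; it assumes you keep a copy of the original matrix before calling plupmc), you can rebuild L and U from the single matrix c and compare L*U with the row-permuted original:
#include <stdio.h>

/* Rebuilds L (unit diagonal + strictly lower part of c) and U (upper part
   of c) and prints (L*U)[i][j] next to the permuted original orig[p[i]][j]. */
void check_plu(int n, double **c, int *p, double **orig)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                double lik = (k < i) ? c[i][k] : (k == i ? 1.0 : 0.0);
                double ukj = (k <= j) ? c[k][j] : 0.0;
                sum += lik * ukj;
            }
            printf("%9.5f (expected %9.5f)   ", sum, orig[p[i]][j]);
        }
        printf("\n");
    }
}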

Improve performance of a construction of p-values matrix for a permutation test

I used an R code which implements a permutation test for the distributional comparison between two populations of functions. We have p univariate p-values.
The bottleneck is the construction of a matrix which contains all the possible CONTIGUOUS p-values.
The last row of the matrix of p-values contain all the univariate p-values.
The penultimate row contains all the bivariate p-values in this order:
p_val_c(1,2), p_val_c(2,3), ..., p_val_c(p, 1)
...
The elements of the first row are all coincident, and the associated value is the p-value of the global test: p_val_c(1,...,p) = p_val_c(2,...,p,1) = ... = p_val_c(p,1,...,p-1).
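For example, with p = 3 the matrix of p-values would look like this (using the p_val_c notation above):

p_val_c(1,2,3)  p_val_c(2,3,1)  p_val_c(3,1,2)   (first row: all equal to the global p-value)
p_val_c(1,2)    p_val_c(2,3)    p_val_c(3,1)     (bivariate p-values of contiguous pairs)
p_val_c(1)      p_val_c(2)      p_val_c(3)       (last row: univariate p-values)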
For computational reasons, I have decided to implement this component in C and use it in R with .C.
Here is the code. The only important part is the definition of the function Build_pval_asymm_matrix.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <time.h>
void Build_pval_asymm_matrix(int * p, int * B, double * pval,
double * L,
double * pval_asymm_matrix);
// Function used for the sorting of vector T_temp with qsort
int cmp(const void *x, const void *y);
int main() {
int B = 1000; // number Conditional Monte Carlo (CMC) runs
int p = 100; // number univariate tests
// Generate fictitiously data univariate p-values pval and matrix L.
// The j-th column of L is the empirical survival
// function of the statistics test associated to the j-th coefficient
// of the basis expansion. The dimension of L is B * p.
// Generate pval
double pval[p];
memset(pval, 0, sizeof(pval)); // initialize all elements to 0
for (int i = 0; i < p; i++) {
pval[i] = (double)rand() / (double)RAND_MAX;
}
// Construct L
double L[B * p];
// Inizialize to 0 the elements of L
memset(L, 0, sizeof(L));
// Array used to construct the columns of L
double temp_array[B];
memset(temp_array, 0, sizeof(temp_array));
for(int i = 0; i < B; i++) {
temp_array[i] = (double) (i + 1) / (double) B;
}
for (int iter_coeff=0; iter_coeff < p; iter_coeff++) {
// Shuffle temp_array
if (B > 1) {
for (int k = 0; k < B - 1; k++)
{
int j = rand() % B;
double t = temp_array[j];
temp_array[j] = temp_array[k];
temp_array[k] = t;
}
}
for (int i=0; i<B; i++) {
L[iter_coeff + p * i] = temp_array[i];
}
}
double pval_asymm_matrix[p * p];
memset(pval_asymm_matrix, 0, sizeof(pval_asymm_matrix));
// Construct the asymmetric matrix of p-values
clock_t start, end;
double cpu_time_used;
start = clock();
Build_pval_asymm_matrix(&p, &B, pval, L, pval_asymm_matrix);
end = clock();
cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("TOTAL CPU time used: %f\n", cpu_time_used);
return 0;
}
void Build_pval_asymm_matrix(int * p, int * B, double * pval,
double * L,
double * pval_asymm_matrix) {
int nbasis = *p, iter_CMC = *B;
// Scalar output fisher combining function applied on univariate
// p-values
double T0_temp = 0;
// Vector output fisher combining function applied on a set of
//columns of L
double T_temp[iter_CMC];
memset(T_temp, 0, sizeof(T_temp));
// Counter for elements of T_temp greater than or equal to T0_temp
int count = 0;
// Indexes for columns of L
int inf = 0, sup = 0;
// The last row of matrice_pval_asymm contains the univariate p-values
for(int i = 0; i < nbasis; i++) {
pval_asymm_matrix[i + nbasis * (nbasis - 1)] = pval[i];
}
// Construct the rows from bottom to up
for (int row = nbasis - 2; row >= 0; row--) {
for (int col = 0; col <= row; col++) {
T0_temp = 0;
memset(T_temp, 0, sizeof(T_temp));
inf = col;
sup = (nbasis - row) + col - 1;
// Combining function Fisher applied on
// p-values pval[inf:sup]
for (int k = inf; k <= sup; k++) {
T0_temp += log(pval[k]);
}
T0_temp *= -2;
// Combining function Fisher applied
// on columns inf:sup of matrix L
for (int k = 0; k < iter_CMC; k++) {
for (int l = inf; l <= sup; l++) {
T_temp[k] += log(L[l + nbasis * k]);
}
T_temp[k] *= -2;
}
// Sort the vector T_temp
qsort(T_temp, iter_CMC, sizeof(double), cmp);
// Count the number of elements of T_temp less than T0_temp
int h = 0;
while (h < iter_CMC && T_temp[h] < T0_temp) {
h++;
}
// Number of elements of T_temp greater than or equal to T0_temp
count = iter_CMC - h;
pval_asymm_matrix[col + nbasis * row] = (double) count / (double)iter_CMC;
}
// auxiliary variable for columns of L inf:nbasis-1 and 1:sup
int aux_first = 0, aux_second = 0;
int num_col_needed = 0;
for (int col = row + 1; col < nbasis; col++) {
T0_temp = 0;
memset(T_temp, 0, sizeof(T_temp));
inf = col;
sup = ((nbasis - row) + col) % nbasis - 1;
// Useful indexes
num_col_needed = nbasis - inf + sup + 1;
int index_needed[num_col_needed];
memset(index_needed, -1, num_col_needed * sizeof(int));
aux_first = inf;
for (int i = 0; i < nbasis - inf; i++) {
index_needed[i] = aux_first;
aux_first++;
}
aux_second = 0;
for (int j = 0; j < sup + 1; j++) {
index_needed[j + nbasis - inf] = aux_second;
aux_second++;
}
// Combining function Fisher applied on p-values
// pval[inf:p-1] and pval[0:sup]
for (int k = 0; k < num_col_needed; k++) {
T0_temp += log(pval[index_needed[k]]);
}
T0_temp *= -2;
// Combining function Fisher applied on columns inf:p-1 and 0:sup-1
// of matrix L
for (int k = 0; k < iter_CMC; k++) {
for (int l = 0; l < num_col_needed; l++) {
T_temp[k] += log(L[index_needed[l] + nbasis * k]);
}
T_temp[k] *= -2;
}
// Sort the vector T_temp
qsort(T_temp, iter_CMC, sizeof(double), cmp);
// Count the number of elements of T_temp less than T0_temp
int h = 0;
while (h < iter_CMC && T_temp[h] < T0_temp) {
h++;
}
// Number of elements of T_temp greater than or equal to T0_temp
count = iter_CMC - h;
pval_asymm_matrix[col + nbasis * row] = (double) count / (double)iter_CMC;
} // end for over col from row + 1 to nbasis - 1
} // end for over rows of asymm p-values matrix except the last row
}
int cmp(const void *x, const void *y)
{
double xx = *(double*)x, yy = *(double*)y;
if (xx < yy) return -1;
if (xx > yy) return 1;
return 0;
}
Here the times of execution in seconds measured in R:
time_original_function
user system elapsed
79.726 1.980 112.817
time_function_double_for
user system elapsed
79.013 1.666 89.411
time_c_function
user system elapsed
47.920 0.024 56.096
The first measure was obtained using an equivalent R function with duplication of the vector pval and matrix L.
What I wanted to ask for is some suggestions to decrease the execution time of the C function for simulation purposes. The last time I used C was five years ago, so there is surely room for improvement. For instance, I sort the vector T_temp with qsort so that I can then count, in linear time with a while loop, the number of elements of T_temp greater than or equal to T0_temp. Maybe this task could be done in a more efficient way. Thanks in advance!
I reduced the input size p to 50 to avoid waiting on it (I don't have such a fast machine) -- keeping p as is and reducing B to 100 has a similar effect -- and profiling showed that ~7.5 out of the ~8 seconds used to compute this was spent in the log function.
qsort doesn't even show up as a real hotspot. This test seems to headbutt the machine more in terms of micro-efficiency than anything else.
So unless your compiler has a vastly faster implementation of log than I do, my first suggestion is to find a fast log implementation if you can afford some accuracy loss (there are ones out there that can compute log over an order of magnitude faster with precision loss in the range of ~3% or so).
If you cannot tolerate precision loss and accuracy is critical, then I'd suggest trying to memoize the values you use for log, if you can, and storing them in a lookup table.
Update
I tried the latter approach.
// Create a memoized table of log values.
double log_cache[B * p];
for (int j=0, num=B*p; j < num; ++j)
log_cache[j] = log(L[j]);
Using malloc might be better here, as we're pushing rather large data to the stack and could risk overflows.
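For example, a heap-allocated version of the same cache (a sketch using the B, p, and L already defined in main):
/* heap-allocated log cache: same contents as above, no large stack array */
double *log_cache = malloc((size_t)B * (size_t)p * sizeof *log_cache);
if (log_cache == NULL)
    exit(EXIT_FAILURE);
for (int j = 0, num = B * p; j < num; ++j)
    log_cache[j] = log(L[j]);
/* ... use it exactly like the array version, then free(log_cache); */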
Then pass it into Build_pval_asymm_matrix.
Replace these:
T_temp[k] += log(L[l + nbasis * k]);
...
T_temp[k] += log(L[index_needed[l] + nbasis * k]);
With these:
T_temp[k] += log_cache[l + nbasis * k];
...
T_temp[k] += log_cache[index_needed[l] + nbasis * k];
This improved the times for me from ~8 seconds to ~5.3 seconds, but we've exchanged the computational overhead of log for memory overhead, which isn't that much better (in fact, it rarely is, but calling log on double-precision floats is apparently expensive enough to make this exchange worthwhile). The next iteration, if you want more speed (and it is very possible), involves looking into cache efficiency.
For this kind of huge matrix stuff, focusing on memory layouts and access patterns can work wonders.
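As a minor side note on the qsort step the question mentions (the profile above suggests it is not the hotspot): the count of elements greater than or equal to T0_temp can be obtained with a single linear pass, no sort needed. A sketch, reusing the variable names from Build_pval_asymm_matrix, that would replace the qsort/while block in both column loops:
/* drop-in replacement for the qsort + while scan: O(B) instead of O(B log B) */
count = 0;
for (int k = 0; k < iter_CMC; k++) {
    if (T_temp[k] >= T0_temp)
        count++;
}
pval_asymm_matrix[col + nbasis * row] = (double) count / (double) iter_CMC;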

Parallelize this algorithm to make it faster

As a challenge, I was asked to make a parallel algorithm for inverting a matrix. I mainly looked at this paper and this SO question while doing research for it.
Before I tried to write my own code, I stumbled across someone else's implementation.
I come from an Objective-C background, so I immediately thought of using GCD for this task. I also came across something called POSIX, which seems more low-level and might be apt for this task if GCD won't work; I don't know.
My naive attempt at parallelizing this was just to replace every for loop with a dispatch_apply, which worked (the product of the original and the inverse produces the identity matrix). However, it just slowed things down significantly (about 20x slower at a glance). I see that there are SO questions on GCD and for loops, but I'm mainly interested in what a better approach to this could be, not links to those answers, which I've already read. Is it possible that the problem is the way I'm creating the dispatch queue, or maybe the fact that I'm only using one dispatch queue?
#include <stdio.h>
#include <dispatch/dispatch.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define PARALLEL true
void invertMatrixNonParallel(double **matrix, long n);
void invertMatrixParallel(double **matrix, long n, dispatch_queue_t q);
void invertMatrixParallel(double **matrix, long n, dispatch_queue_t q)
{
__block double r;
__block long temp;
dispatch_apply(n, q, ^(size_t i) {
dispatch_apply(n, q, ^(size_t j) {
matrix[i][j + n] = (j == i) ? 1 : 0;
});
});
/* using gauss-jordan elimination */
dispatch_apply(n, q, ^(size_t j) {
temp=j;
/* finding maximum jth column element in last (n-j) rows */
dispatch_apply(n - j - 1, q, ^(size_t i) {
if (matrix[i + j + 1][j] > matrix[temp][j])
{
temp = i + j + 1;
}
});
/* swapping row which has maximum jth column element */
if(temp!=j)
{
double *row = matrix[j];
matrix[j] = matrix[temp];
matrix[temp] = row;
}
/* performing row operations to form required identity matrix out of the input matrix */
dispatch_apply(n, q, ^(size_t i) {
r = matrix[i][j];
if (i == j)
{
dispatch_apply(2 * n, q, ^(size_t k) {
matrix[i][k]/=r ;
});
}
else
{
dispatch_apply(2 * n, q, ^(size_t k) {
matrix[i][k]-=(matrix[j][k]/matrix[j][j])*r ;
});
}
});
});
}
void invertMatrixNonParallel(double **matrix, long n)
{
double temporary, r;
long i, j, k, temp;
for (i = 0; i < n; ++i)
{
for (j = n; j < n * 2; ++j)
{
matrix[i][j] = (j == i + n) ? 1 : 0;
}
}
/* using gauss-jordan elimination */
for(j=0; j<n; j++)
{
temp=j;
/* finding maximum jth column element in last (n-j) rows */
for(i=j+1; i<n; i++)
if(matrix[i][j]>matrix[temp][j])
temp=i;
/* swapping row which has maximum jth column element */
if(temp!=j)
{
for(k=0; k<2*n; k++)
{
temporary=matrix[j][k] ;
matrix[j][k]=matrix[temp][k] ;
matrix[temp][k]=temporary ;
}
}
/* performing row operations to form required identity matrix out of the input matrix */
for(i=0; i<n; i++)
{
if(i!=j)
{
r=matrix[i][j];
for(k=0; k<2*n; k++)
matrix[i][k]-=(matrix[j][k]/matrix[j][j])*r ;
}
else
{
r=matrix[i][j];
for(k=0; k<2*n; k++)
matrix[i][k]/=r ;
}
}
}
}
#pragma mark - Main
int main(int argc, const char * argv[])
{
long i, j, k;
const long n = 5;
const double range = 10.0;
__block double **matrix = malloc(sizeof(double *) * n);
__block double **invertedMatrix = malloc(sizeof(double *) * n);
for (i = 0; i < n; ++i)
{
matrix[i] = malloc(sizeof(double) * n);
invertedMatrix[i] = malloc(sizeof(double) * n * 2);
for (j = 0; j < n; ++j)
{
matrix[i][j] = drand48() * range;
invertedMatrix[i][j] = matrix[i][j];
}
}
clock_t t;
#if PARALLEL
dispatch_queue_t q1 = dispatch_queue_create("com.example.queue1", DISPATCH_QUEUE_CONCURRENT);
t = clock();
invertMatrixParallel(invertedMatrix, n, q1);
#else
t = clock();
invertMatrixNonParallel(invertedMatrix, n);
#endif
t = clock() - t;
double time_taken = ((double)t * 1000)/CLOCKS_PER_SEC; // in milliseconds
printf("\n%s took %f milliseconds to execute \n\n", (PARALLEL == true) ? "Parallel" : "Non-Parallel", time_taken);
printf("Here's the product of the inverted matrix and the original matrix\n");
double product[n][n];
for (i = 0; i < n; ++i)
{
for (j = 0; j < n; ++j)
{
double sum = 0;
for (k = 0; k < n; ++k)
{
sum += matrix[i][k] * invertedMatrix[k][j + n];
}
product[i][j] = sum;
}
}
// should print the identity matrix
for (i = 0; i < n; ++i)
{
for (j = 0; j < n; ++j)
{
printf("%5.2f%s", product[i][j], (j < n - 1) ? ", " : "\n");
}
}
return 0;
}
Output for parallel:
Parallel took 0.098000 milliseconds to execute
For nonparallel:
Non-Parallel took 0.004000 milliseconds to execute
For both:
Here's the product of the inverted matrix and the original matrix
1.00, -0.00, -0.00, 0.00, -0.00
0.00, 1.00, 0.00, 0.00, 0.00
0.00, -0.00, 1.00, -0.00, 0.00
-0.00, -0.00, -0.00, 1.00, 0.00
0.00, 0.00, 0.00, 0.00, 1.00
Please, no answers that are just links, I'm only using SO as a last resort.
0) As already mentioned in the comments, you need a bigger matrix. Creating parallel threads has some overhead, so you can't get a faster parallel version of something that takes almost no time to begin with. Even if you do manage to achieve better performance for a small matrix, it's hard to measure it exactly.
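One more measurement caveat (a sketch, using the variable names from the question's main): clock() measures CPU time, which for a multithreaded run accumulates across all worker threads, so the parallel version can look worse by that metric even when wall-clock time improves. Wall-clock timing, e.g. with gettimeofday, gives a fairer comparison:
#include <sys/time.h>

/* wall-clock timing around the call, here shown for the parallel branch */
struct timeval t0, t1;
gettimeofday(&t0, NULL);
invertMatrixParallel(invertedMatrix, n, q1);
gettimeofday(&t1, NULL);
double wall_ms = (t1.tv_sec - t0.tv_sec) * 1000.0
               + (t1.tv_usec - t0.tv_usec) / 1000.0;
printf("wall time: %f milliseconds\n", wall_ms);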
1)
dispatch_apply(n, q, ^(size_t i) {
dispatch_apply(n, q, ^(size_t j) {
matrix[i][j + n] = (j == i) ? 1 : 0;
});
});
There is not much sense in parallelizing every nested loop. There is no point in adding every single operation to the dispatch queue one by one, because each dispatch still has some overhead, so it's better to submit nontrivial blocks of work.
dispatch_apply(n, q, ^(size_t i) {
for (size_t j = 0; j < n; ++j) {
matrix[i][j + n] = (j == i) ? 1 : 0;
}
});
is enough.
2) You need to learn about thread safety and understand your algorithm well, or you may run into unpredictable and non-reproducible misbehavior in your application. I'm not sure there are many loops here that can be parallelized efficiently and really safely, except for the initialization mentioned above and the one marked with /* performing row operations to form required identity matrix out of the input matrix */
So you probably need to find some specific parallel matrix inversion algorithm.
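To illustrate the thread-safety point on the one loop that is safe to parallelize: a sketch (same augmented double ** layout and queue q as in invertMatrixParallel; the pivot search uses fabs, the usual partial-pivoting criterion) that normalizes the pivot row serially and then eliminates the other rows with dispatch_apply, so each block writes only its own row and keeps its scale factor in a block-local variable instead of a shared __block one:
/* The pivot loop itself stays serial; only the per-row elimination is parallel. */
for (long j = 0; j < n; j++) {
    /* pivot search and row swap stay serial, as in invertMatrixNonParallel() */
    long piv_row = j;
    for (long i = j + 1; i < n; i++)
        if (fabs(matrix[i][j]) > fabs(matrix[piv_row][j]))
            piv_row = i;
    if (piv_row != j) {
        double *tmp = matrix[j];
        matrix[j] = matrix[piv_row];
        matrix[piv_row] = tmp;
    }

    double piv = matrix[j][j];
    for (long k = 0; k < 2 * n; ++k)        /* 1) normalize the pivot row */
        matrix[j][k] /= piv;

    dispatch_apply(n, q, ^(size_t i) {      /* 2) eliminate the other rows in parallel */
        if ((long)i == j)
            return;
        double r = matrix[i][j];            /* block-local: no shared __block state */
        for (long k = 0; k < 2 * n; ++k)
            matrix[i][k] -= matrix[j][k] * r;   /* pivot row is already normalized */
    });
}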

Cholesky Factorization in C?

I am implementing the Cholesky Method in C but the program quits when it arrives at this point.
After the answers: Now it runs, thanks to the answers of devnull & piotruś, but it doesn't give me the right answer.
/* Ax=b
*This algorithm does:
* A = U * U'
* with
* U := lower left triangle matrix
* U' := the transposed form of U.
*/
double** cholesky(double **A, int N) //this gives me the U (edited)
{
int i, j, k;
double sum, **p, **U;
U=(double**)malloc(N*sizeof(double*));
for(p=U; p<U+N; p++)
*p=(double*)malloc(N*sizeof(double));
for (j=0; j<N; j++) {
sum = A[j][j];
for (k=0; k<(j-1); k++) sum -= U[k][j]*U[k][j];
U[j][j] = sqrt(sum);
for (i=j; i<N; i++) {
sum = A[j][i];
for (k=0; k<(j-1); k++)
sum -= U[k][j]*U[k][i];
U[j][i] = sum/U[j][j];
}
}
return U;
}
am I doing something wrong here?
double** cholesky(double **A, int N)
in this function you assume the array length is N. This means the last valid index of the array is N-1, not N. Change the code to:
for (j = 0; j < N; ++j)
and the rest similarly.
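To spell out "the rest similarly": a sketch of the loop body with fully 0-based bounds (in particular, the 1-based bound k <= j-1 from the textbook version becomes k < j with 0-based indices, not k < j-1; this assumes the strictly lower triangle of U is zero-initialized, e.g. with calloc):
for (j = 0; j < N; j++) {
    sum = A[j][j];
    for (k = 0; k < j; k++)            /* 1-based "k <= j-1" -> 0-based "k < j" */
        sum -= U[k][j] * U[k][j];
    U[j][j] = sqrt(sum);
    for (i = j + 1; i < N; i++) {      /* entries strictly to the right of the diagonal */
        sum = A[j][i];
        for (k = 0; k < j; k++)
            sum -= U[k][j] * U[k][i];
        U[j][i] = sum / U[j][j];
    }
}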
