Parallelize this algorithm to make it faster - c

As a challenge, I was asked to make a parallel algorithm for inverting a matrix. I mainly looked at this paper and this SO question while doing research for it.
Before I tried to write my own code, I stumbled across someone else's implementation.
I come from an Objective-C background, so I immediately thought of using GCD for this task. I also came across something called POSIX, which seems more low-level and might be apt for this task if GCD won't work; I don't know.
My naive attempt for parallelizing this was just to replace every for-loop with a dispatch_apply, which worked (the product of the original and the inverse produces the identity matrix). However, that just slowed things down significantly (about 20x as slow at a glance). I see that there are SO questions on GCD and for-loops, but I'm mainly interested in what a better approach could be to this, not links to those answers which I've already read. Is it possible that the problem is the way I'm creating the dispatch queue, or maybe the fact that I'm only using one dispatch queue?
#include <stdio.h>
#include <dispatch/dispatch.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define PARALLEL true
void invertMatrixNonParallel(double **matrix, long n);
void invertMatrixParallel(double **matrix, long n, dispatch_queue_t q);
void invertMatrixParallel(double **matrix, long n, dispatch_queue_t q)
{
__block double r;
__block long temp;
dispatch_apply(n, q, ^(size_t i) {
dispatch_apply(n, q, ^(size_t j) {
matrix[i][j + n] = (j == i) ? 1 : 0;
});
});
/* using gauss-jordan elimination */
dispatch_apply(n, q, ^(size_t j) {
temp=j;
/* finding maximum jth column element in last (n-j) rows */
dispatch_apply(n - j - 1, q, ^(size_t i) {
if (matrix[i + j + 1][j] > matrix[temp][j])
{
temp = i + j + 1;
}
});
/* swapping row which has maximum jth column element */
if(temp!=j)
{
double *row = matrix[j];
matrix[j] = matrix[temp];
matrix[temp] = row;
}
/* performing row operations to form required identity matrix out of the input matrix */
dispatch_apply(n, q, ^(size_t i) {
r = matrix[i][j];
if (i == j)
{
dispatch_apply(2 * n, q, ^(size_t k) {
matrix[i][k]/=r ;
});
}
else
{
dispatch_apply(2 * n, q, ^(size_t k) {
matrix[i][k]-=(matrix[j][k]/matrix[j][j])*r ;
});
}
});
});
}
void invertMatrixNonParallel(double **matrix, long n)
{
double temporary, r;
long i, j, k, temp;
for (i = 0; i < n; ++i)
{
for (j = n; j < n * 2; ++j)
{
matrix[i][j] = (j == i + n) ? 1 : 0;
}
}
/* using gauss-jordan elimination */
for(j=0; j<n; j++)
{
temp=j;
/* finding maximum jth column element in last (n-j) rows */
for(i=j+1; i<n; i++)
if(matrix[i][j]>matrix[temp][j])
temp=i;
/* swapping row which has maximum jth column element */
if(temp!=j)
{
for(k=0; k<2*n; k++)
{
temporary=matrix[j][k] ;
matrix[j][k]=matrix[temp][k] ;
matrix[temp][k]=temporary ;
}
}
/* performing row operations to form required identity matrix out of the input matrix */
for(i=0; i<n; i++)
{
if(i!=j)
{
r=matrix[i][j];
for(k=0; k<2*n; k++)
matrix[i][k]-=(matrix[j][k]/matrix[j][j])*r ;
}
else
{
r=matrix[i][j];
for(k=0; k<2*n; k++)
matrix[i][k]/=r ;
}
}
}
}
#pragma mark - Main
int main(int argc, const char * argv[])
{
long i, j, k;
const long n = 5;
const double range = 10.0;
__block double **matrix;
__block double **invertedMatrix; /* allocated once, just below */
matrix = malloc(sizeof(double *) * n);
invertedMatrix = malloc(sizeof(double *) * n);
for (i = 0; i < n; ++i)
{
matrix[i] = malloc(sizeof(double) * n);
invertedMatrix[i] = malloc(sizeof(double) * n * 2);
for (j = 0; j < n; ++j)
{
matrix[i][j] = drand48() * range;
invertedMatrix[i][j] = matrix[i][j];
}
}
clock_t t;
#if PARALLEL
dispatch_queue_t q1 = dispatch_queue_create("com.example.queue1", DISPATCH_QUEUE_CONCURRENT);
t = clock();
invertMatrixParallel(invertedMatrix, n, q1);
#else
t = clock();
invertMatrixNonParallel(invertedMatrix, n);
#endif
t = clock() - t;
double time_taken = ((double)t * 1000)/CLOCKS_PER_SEC; // in milliseconds
printf("\n%s took %f milliseconds to execute \n\n", (PARALLEL == true) ? "Parallel" : "Non-Parallel", time_taken);
printf("Here's the product of the inverted matrix and the original matrix\n");
double product[n][n];
for (i = 0; i < n; ++i)
{
for (j = 0; j < n; ++j)
{
double sum = 0;
for (k = 0; k < n; ++k)
{
sum += matrix[i][k] * invertedMatrix[k][j + n];
}
product[i][j] = sum;
}
}
// should print the identity matrix
for (i = 0; i < n; ++i)
{
for (j = 0; j < n; ++j)
{
printf("%5.2f%s", product[i][j], (j < n - 1) ? ", " : "\n");
}
}
return 0;
}
Output for parallel:
Parallel took 0.098000 milliseconds to execute
For nonparallel:
Non-Parallel took 0.004000 milliseconds to execute
For both:
Here's the product of the inverted matrix and the original matrix
1.00, -0.00, -0.00, 0.00, -0.00
0.00, 1.00, 0.00, 0.00, 0.00
0.00, -0.00, 1.00, -0.00, 0.00
-0.00, -0.00, -0.00, 1.00, 0.00
0.00, 0.00, 0.00, 0.00, 1.00
Please, no answers that are just links, I'm only using SO as a last resort.

0) As already mentioned in the comments, you need a bigger matrix. Creating parallel threads has some overhead, so a parallel version cannot be faster when the whole computation takes so little time. Even if you manage to get better performance on a small matrix, it is hard to measure precisely.
1)
dispatch_apply(n, q, ^(size_t i) {
dispatch_apply(n, q, ^(size_t j) {
matrix[i][j + n] = (j == i) ? 1 : 0;
});
});
There is not much sense in parallelizing every nested loop. Adding every single operation to the dispatch queue one by one still incurs overhead, so it is better to submit nontrivial blocks.
dispatch_apply(n, q, ^(size_t i) {
/* one block per row; the inner loop stays sequential */
for (size_t j = 0; j < n; ++j) {
matrix[i][j + n] = (j == i) ? 1 : 0;
}
});
is enough.
2) You need to learn about thread safety and understand your algorithm well, or you may run into unpredictable and non-reproducible misbehavior in your application. I'm not sure there are many loops that can be parallelized efficiently and really safely, apart from the initialization mentioned above and the one marked with /* performing row operations to form required identity matrix out of the input matrix */
So you probably need to find a dedicated parallel matrix inversion algorithm.
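As a rough illustration of point 2, here is a minimal sketch that parallelizes only the row-operations loop. It uses an OpenMP pragma as a stand-in for dispatch_apply (an assumption of this sketch, not something from the question; without OpenMP enabled the pragma is ignored and the loop runs sequentially). The function name invert_rows_parallel and the 1e-12 singularity threshold are hypothetical. The pivot row is normalized before the parallel loop, so each iteration writes only its own row and reads only the finished pivot row. Unlike the question's code, the pivot search compares absolute values.

```c
#include <math.h>
#include <stddef.h>

/* Inverts the n x n matrix stored in the left half of each 2n-wide row.
 * The inverse ends up in columns n..2n-1. Returns 0 on success, -1 if a
 * pivot is numerically zero (threshold is an arbitrary assumption). */
int invert_rows_parallel(double **matrix, long n)
{
    /* cheap setup: append the identity matrix */
    for (long i = 0; i < n; i++)
        for (long j = 0; j < n; j++)
            matrix[i][j + n] = (i == j) ? 1.0 : 0.0;

    for (long j = 0; j < n; j++) {
        /* sequential: find the row with the largest |entry| in column j */
        long best = j;
        for (long i = j + 1; i < n; i++)
            if (fabs(matrix[i][j]) > fabs(matrix[best][j]))
                best = i;
        if (fabs(matrix[best][j]) < 1e-12)
            return -1;
        double *row = matrix[j];
        matrix[j] = matrix[best];
        matrix[best] = row;

        /* sequential: normalize the pivot row before any other row reads it */
        double pivot = matrix[j][j];
        for (long k = 0; k < 2 * n; k++)
            matrix[j][k] /= pivot;

        /* parallel: iteration i writes only row i and reads only row j */
        #pragma omp parallel for
        for (long i = 0; i < n; i++) {
            if (i == j)
                continue;
            double r = matrix[i][j];
            for (long k = 0; k < 2 * n; k++)
                matrix[i][k] -= matrix[j][k] * r;
        }
    }
    return 0;
}
```

The outer loop over j stays sequential because each elimination step depends on the previous one. For a 5 x 5 matrix the parallel overhead would still likely dominate, as point 0 notes.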

Related

I am using wrong indexing in one of the loops but can't figure out which one. I have made the changes which were suggested

#include<stdio.h>
#include<math.h>
#include<stdlib.h>
const int N = 3;
void LUBKSB(double b[], double a[N][N], int N, int *indx)
{
int i, ii, ip, j;
double sum;
ii = 0;
for(i=0;i<N;i++)
{
ip = indx[i];
sum = b[ip];
b[ip] = b[i];
if (ii)
{
for(j = ii;j<i-1;j++)
{
sum = sum - a[i][j] * b[j];
}
}
else if(sum)
{
ii = i;
}
b[i] = sum;
}
for(i=N-1;i>=0;i--)
{
sum = b[i];
for (j = i; j<N;j++)
{
sum = sum - a[i][j] * b[j];
}
b[i] = sum/a[i][i];
}
for (i=0;i<N;i++)
{
printf("b[%d]: %lf \n",i,b[i]);
}
}
void ludecmp(double a[][3], int N)
{
int i, imax, j, k;
double big, dum, sum, temp, d;
double *vv = (double *) malloc(N * sizeof(double));
int *indx = (int *) malloc(N * sizeof(double));
double TINY = 0.000000001;
double b[3] = {2*M_PI,5*M_PI,-8*M_PI};
d = 1.0;
for(i=0;i<N;i++)
{
big = 0.0;
for(j=0;j<N;j++)
{
temp = fabs(a[i][j]);
if (temp > big)
{
big = temp;
}
}
if (big == 0.0)
{
printf("Singular matrix\n");
exit(1);
}
vv[i] = 1.0/big;
}
for(j=0;j<N;j++)
{
for(i=0;i<j-1;i++)
{
sum = a[i][j];
for(int k=0;k<i-1;k++)
{
sum = sum - (a[i][k] * a[k][j]);
}
a[i][j] = sum;
}
big = 0.0;
for(i=j;i<N;i++)
{
sum = a[i][j];
for(k=0;k<j-1;k++)
{
sum = sum - a[i][k] * a[k][j];
}
a[i][j] =sum;
dum = vv[i] * fabs(a[i][j]);
if(dum >= big)
{
big = dum;
imax = i;
}
}
if(j != imax)
{
for(k=0;k<N;k++)
{
dum = a[imax][k];
a[imax][k] = a[j][k];
a[j][k] = dum;
}
d = -d;
vv[imax] = vv[j];
}
indx[j] = imax;
if (a[j][j] == 0)
{
a[j][j] = TINY;
}
if (j != N)
{
dum = 1.0/a[j][j];
for(i = j; i<N; i++)
{
a[i][j] = a[i][j] * dum;
}
}
}
LUBKSB(b,a,N,indx);
free(vv);
free(indx);
}
int main()
{
int N, i, j;
N = 3;
double a[3][3] = { 1, 2, -1, 6, -5, 4, -9, 8, -7};
ludecmp(a,N);
}
I am using these algorithms to find the LU decomposition of a matrix and to solve A.x = b:
Given an N × N matrix A denoted as $\{a_{i,j}\}_{i,j=1}^{N,N}$, the routine replaces it by the LU decomposition of a rowwise permutation of itself. “a” and “N” are input. “a” is also output, modified to apply the LU decomposition; $\{\mathrm{indx}_i\}_{i=1}^{N}$ is an output vector that records the row permutation effected by the partial pivoting; “d” is output and adopts ±1 depending on whether the number of row interchanges was even or odd. This routine is used in combination with algorithm 2 to solve linear equations or invert a matrix.
Solves the set of N linear equations A . x = b. Matrix $\{a_{i,j}\}_{i,j=1}^{N,N}$ is actually the LU decomposition of the original matrix A, obtained from algorithm 1. Vector $\{\mathrm{indx}_i\}_{i=1}^{N}$ is input as the permutation vector returned by algorithm 1. Vector $\{b_i\}_{i=1}^{N}$ is input as the right-hand side vector B but returns with the solution vector X. Inputs $\{a_{i,j}\}_{i,j=1}^{N,N}$, N, and $\{\mathrm{indx}_i\}_{i=1}^{N}$ are not modified in this algorithm.
There are a number of problems with your code:
In your for-loops, i <= N should be i < N and i = N should be i = N - 1.
The absolute value of a double is returned by fabs, not abs.
The statement exit should be exit(1) or exit(EXIT_FAILURE).
Two of your functions lack a return statement.
You should also free the memory you have allocated with the function free. When you compile a C program you should also enable all warnings.
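To make the indexing point concrete, here is a minimal sketch of an in-place LU decomposition and solve written with 0-based loops throughout. It is a plain elimination with explicit row swaps, not a line-by-line fix of the routines in the question; the names lu_decompose and lu_solve, the fixed size N, and the 1e-12 singularity threshold are assumptions made for this example.

```c
#include <math.h>

#define N 3

/* In-place LU decomposition with partial pivoting, 0-based indices.
 * On return, a holds L below the diagonal (unit diagonal implied) and U on
 * and above it. perm[i] is the original index of the row now in position i.
 * Returns 0 on success, -1 if the matrix is numerically singular. */
int lu_decompose(double a[N][N], int perm[N])
{
    for (int i = 0; i < N; i++)
        perm[i] = i;
    for (int j = 0; j < N; j++) {
        /* pick the largest |entry| in column j, rows j..N-1 */
        int imax = j;
        for (int i = j + 1; i < N; i++)
            if (fabs(a[i][j]) > fabs(a[imax][j]))
                imax = i;
        if (fabs(a[imax][j]) < 1e-12)
            return -1;
        if (imax != j) {
            for (int k = 0; k < N; k++) {
                double t = a[j][k];
                a[j][k] = a[imax][k];
                a[imax][k] = t;
            }
            int t = perm[j];
            perm[j] = perm[imax];
            perm[imax] = t;
        }
        /* eliminate below the pivot, storing the multipliers in place */
        for (int i = j + 1; i < N; i++) {
            a[i][j] /= a[j][j];
            for (int k = j + 1; k < N; k++)
                a[i][k] -= a[i][j] * a[j][k];
        }
    }
    return 0;
}

/* Solves A x = b from the output of lu_decompose; b is left untouched. */
void lu_solve(double a[N][N], const int perm[N], const double b[N], double x[N])
{
    /* forward substitution on the permuted right-hand side: j runs 0..i-1 */
    for (int i = 0; i < N; i++) {
        double sum = b[perm[i]];
        for (int j = 0; j < i; j++)
            sum -= a[i][j] * x[j];
        x[i] = sum;
    }
    /* back substitution: j runs i+1..N-1, then divide by the diagonal */
    for (int i = N - 1; i >= 0; i--) {
        double sum = x[i];
        for (int j = i + 1; j < N; j++)
            sum -= a[i][j] * x[j];
        x[i] = sum / a[i][i];
    }
}
```

Note the loop bounds: the inner sums run over j < i and j > i, never j < i - 1 or j >= i, which is the kind of off-by-one that appears when 1-based pseudocode is converted to C.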

Implementing LU factorization with partial pivoting in C using only one matrix

I have designed the following C function in order to compute the PA = LU factorization, using only one matrix to store and compute the data:
double plupmc(int n, double **c, int *p, double tol) {
int i, j, k, pivot_ind = 0, temp_ind;
int ii, jj;
double pivot, *temp_row;
for (j = 0; j < n-1; ++j) {
pivot = 0.;
for (i = j; i < n; ++i)
if (fabs(c[i][j]) > fabs(pivot)) {
pivot = c[i][j];
pivot_ind = i;
}
temp_row = c[j];
c[j] = c[pivot_ind];
c[pivot_ind] = temp_row;
temp_ind = p[j];
p[j] = p[pivot_ind];
p[pivot_ind] = temp_ind;
for (k = j+1; k < n; ++k) {
c[k][j] /= c[j][j];
c[k][k] -= c[k][j]*c[j][k];
}
}
return 0.;
}
where n is the order of the matrix, c is a pointer to the matrix and p is a pointer to a vector storing the permutations done when partial pivoting the system. The variable tol is not relevant for now. The program works storing in c both the lower and upper triangular parts of the factorization, where U corresponds to the upper triangular part of c and L corresponds to the strictly lower triangular part of c, adding 1's in the diagonal. For what I have been able to test, the part of the program corresponding to partial pivoting is working properly, however, the algorithm used to compute the entries of the matrix is not giving the expected results, and I cannot see why. For instance, if I try to compute the LU factorization of the matrix
1. 2. 3.
4. 5. 6.
7. 8. 9.
I get
l :
1. 0. 0.
0.143 1. 0.
0.571 0.214* 1.
u :
7. 8. 9.
0. 2. 1.714*
0. 0. 5.663*
the product of which does not correspond to any permutation of the matrix c. In fact, the wrong entries seem to be the ones marked with a star.
I would appreciate any suggestion to fix this problem.
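I found the problem with your code: there was a small conceptual error in the way you were normalizing the row while computing the actual decomposition. After dividing by the pivot, the multiplier has to be applied to every remaining entry of row k, not only to the diagonal entry: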
for (k = j+1; k < n; ++k) {
c[k][j] /= c[j][j];
c[k][k] -= c[k][j]*c[j][k];
}
became:
for (k = j+1; k < n; ++k) {
temp=c[k][j]/=c[j][j];
for(int q=j+1;q<n;q++){
c[k][q] -= temp*c[j][q];
}
}
which returns the result:
7.000000 8.000000 9.000000
0.142857 0.857143 1.714286
0.571429 0.500000 -0.000000
If you have any questions I am happy to help.
Full implementation here:
#include<stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
double plupmc(int n, double **c, int *p, double tol) {
int i, j, k, pivot_ind = 0, temp_ind;
int ii, jj;
double *vv=calloc(n,sizeof(double));
double pivot, *temp_row;
double temp;
for (j = 0; j < n; ++j) {
pivot = 0;
for (i = j; i < n; ++i)
if (fabs(c[i][j]) > fabs(pivot)) {
pivot = c[i][j];
pivot_ind = i;
}
temp_row = c[j];
c[j] = c[pivot_ind];
c[pivot_ind] = temp_row;
temp_ind = p[j];
p[j] = p[pivot_ind];
p[pivot_ind] = temp_ind;
for (k = j+1; k < n; ++k) {
temp=c[k][j]/=c[j][j];
for(int q=j+1;q<n;q++){
c[k][q] -= temp*c[j][q];
}
}
for(int q=0;q<n;q++){
for(int l=0;l<n;l++){
printf("%lf ",c[q][l]);
}
printf("\n");
}
}
return 0.;
}
int main() {
double **x;
x=calloc(3,sizeof(double *));
for(int i=0;i<3;i++){
x[i]=calloc(3,sizeof(double));
}
memcpy(x[0],(double[]){1,2,3},3*sizeof(double));
memcpy(x[1],(double[]){4,5,6},3*sizeof(double));
memcpy(x[2],(double[]){7,8,9},3*sizeof(double));
int *p=calloc(3,sizeof(int));
memcpy(p,(int[]){0,1,2},3*sizeof(int));
plupmc(3,x,p,1);
for(int i=0;i<3;i++){
free(x[i]);
}
free(p);
free(x);
}
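One way to check a packed factorization like the one above is to multiply L and U back together and compare against the permuted original. The helper below is a minimal sketch of that check; the name plu_residual and the extra orig argument (an untouched copy of the input matrix) are assumptions for illustration, and it relies on p starting as the identity permutation and being swapped alongside the rows, as in the code above.

```c
#include <math.h>

/* Largest absolute entry of L*U - P*A. The matrix c packs L (strictly lower
 * part, unit diagonal implied) and U (upper part including the diagonal);
 * row i of P*A is row p[i] of the original matrix orig. */
double plu_residual(int n, double **c, const int *p, double **orig)
{
    double worst = 0.0;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            /* L[i][k] is zero for k > i and U[k][j] is zero for k > j */
            for (int k = 0; k <= i && k <= j; ++k) {
                double l_ik = (k == i) ? 1.0 : c[i][k];
                sum += l_ik * c[k][j];
            }
            double err = fabs(sum - orig[p[i]][j]);
            if (err > worst)
                worst = err;
        }
    }
    return worst;
}
```

For the 3 x 3 example, tracing the two row swaps gives p = {2, 0, 1}, and the residual for the corrected output should be at rounding-error level.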

DGEMM and DGEMV give different results

I want to implement the following equation in C:
C[l,q,m] = A[m,q,k] * B[k,l]
where the repeated index k is being summed over.
I implemented this in three ways:
Naive implementation with loops
Using the BLAS routine DGEMV (matrix-vector multiplication)
Using the BLAS routine DGEMM (matrix-matrix multiplication)
This is the minimal not-working code:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <cblas.h>
int main(void)
{
const size_t n = 3;
const size_t n2 = n*n;
const size_t n3 = n*n*n;
/* Fill rank 3 tensor with random numbers */
double a[n3];
for (size_t i = 0; i < n3; i++) {
a[i] = (double) rand() / RAND_MAX;
}
/* Fill matrix with random numbers */
double b[n2];
for (size_t i = 0; i < n2; i++) {
b[i] = (double) rand() / RAND_MAX;
}
/* All loops */
double c_exact[n3];
memset(c_exact, 0, n3 * sizeof(double));
for (size_t l = 0; l < n; l++) {
for (size_t q = 0; q < n; q++) {
for (size_t m = 0; m < n; m++) {
for (size_t k = 0; k < n; k++) {
c_exact[l*n2+q*n+m] += a[m*n2+q*n+k] * b[k*n+l];
}
}
}
}
/* Matrix-vector */
double c_mv[n3];
memset(c_mv, 0, n3 * sizeof(double));
for (size_t m = 0; m < n; m++) {
for (size_t l = 0; l < n; l++) {
cblas_dgemv(
CblasRowMajor, CblasNoTrans, n, n, 1.0, &a[m*n2],
n, &b[l], n, 0.0, &c_mv[l*n2+m], n);
}
}
/* Matrix-matrix */
double c_mm[n3];
memset(c_mm, 0, n3 * sizeof(double));
for (size_t m = 0; m < n; m++) {
cblas_dgemm(
CblasRowMajor, CblasTrans, CblasTrans, n, n, n, 1.0, b, n,
&a[m*n2], n, 0.0, &c_mm[m], n2);
}
/* Compute difference */
double diff_mv = 0.0;
double diff_mm = 0.0;
for (size_t idx = 0; idx < n3; idx++) {
diff_mv += c_mv[idx] - c_exact[idx];
diff_mm += c_mm[idx] - c_exact[idx];
}
printf("Difference matrix-vector: %e\n", diff_mv);
printf("Difference matrix-matrix: %e\n", diff_mm);
}
And this the output:
Difference matrix-vector: 0.000000e+00
Difference matrix-matrix: -1.188678e+01
i.e. the DGEMV implementation is correct, but the DGEMM one is not, and I really don't understand why. I switched the order of the multiplication (matrix-matrix multiplication is non-commutative) and transposed both operands to get the index order C[l,q,m] instead of C[q,l,m], but I also tried it without switching/transposing and it does not work either.
Can anyone please help?
Thanks.
edit: I thought about it a bit and feel like I'm trying to do something that DGEMM does not support. Namely, I try to insert a submatrix into C[:,:,m], which means that neither the leading nor the trailing index of that submatrix is contiguous in memory. DGEMM allows me to set the parameter LDC, which in this case needs to be n^2, but it does not know that the second index is also non-contiguous, with a stride of n (and there is no parameter to tell it?). So why does DGEMM not support a second parameter for the stride of the trailing dimension?
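The suspicion in the edit matches the GEMM interface: LDC only sets the distance between consecutive rows of the output, and elements within a row are taken to be adjacent. One possible workaround, sketched below under that assumption, is to compute each slab into a contiguous n x n scratch buffer and then scatter it into C[:,:,m] with explicit strides. The helper slab_product is a plain-loop stand-in for the cblas_dgemm call so that the example needs no BLAS library; the names slab_product, contract, and tmp are hypothetical.

```c
#include <stddef.h>

/* Stand-in for one GEMM call: out[l*n + q] = sum_k b[k*n + l] * a_m[q*n + k],
 * i.e. out = B^T * A_m^T, stored contiguously in row-major order. */
static void slab_product(size_t n, const double *a_m, const double *b,
                         double *out)
{
    for (size_t l = 0; l < n; l++) {
        for (size_t q = 0; q < n; q++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += b[k * n + l] * a_m[q * n + k];
            out[l * n + q] = sum;
        }
    }
}

/* C[l,q,m] = sum_k A[m,q,k] * B[k,l].
 * tmp is caller-provided scratch space for n*n doubles. */
void contract(size_t n, const double *a, const double *b, double *c,
              double *tmp)
{
    const size_t n2 = n * n;
    for (size_t m = 0; m < n; m++) {
        /* contiguous result for this m (where the GEMM call would go) */
        slab_product(n, &a[m * n2], b, tmp);
        /* scatter: row stride n2, column stride n, offset m */
        for (size_t l = 0; l < n; l++)
            for (size_t q = 0; q < n; q++)
                c[l * n2 + q * n + m] = tmp[l * n + q];
    }
}
```

The scatter costs an extra pass over n^2 elements per slab, which is small next to the n^3 work of the product itself.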

matrix determinant in c using Gauss elimination Core dumped error

I'm trying to make a simple console application in C which will calculate the determinant of a matrix using Gaussian elimination. After a lot of tests I found out that my program is not working because of a core dumped error. After 2 days of editing and undoing, I could not find the problem.
Any help is more than welcomed.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int recherche_pivot(int k, int n, float *A)
{
int i, j;
if (A[((k - 1) * n + k) - 1] != 0)
{
return k;
}
else
{ //parcours du reste de la colonne
for (i = k + 1; i <= n; i++)
{
if (A[((k - 1) * n + i) - 1] != 0)
{
return i;
}
}
return -1;
}
}
void fois(int n, float p, int i, float * A, float *b, float * x)
{
int a;
for (a = 1; a <= n; a++)
{
x[a - 1] = A[((i - 1) * n + a) - 1] * p;
}
x[n] = b[i - 1] * p;
}
void afficher_system(int n, float * X, float *b)
{
int i, j;
for (i = 1; i <= n; i++)
{
for (j = 1; j <= n; j++)
printf("%f ", X[((i - 1) * n + j) - 1]);
printf(" | %f", b[i - 1]);
printf("\n\n");
}
printf("\n\n\n\n");
}
void saisirmatrice(int n, float *A)
{
int i, j;
for (i = 1; i <= n; i++)
for (j = 1; j <= n; j++)
scanf("%f", &A[((i - 1) * n + j) - 1]);
}
void affichermatrice(int n, float *A)
{
int i, j;
for (i = 1; i <= n; i++)
for (j = 1; j <= n; j++)
printf("A[%d][%d] = %f\n", i, j, A[((i - 1) * n + j) - 1]);
}
void elemination(int n, int k, float *b, float *A)
{
int i, l, j;
float * L, piv;
L = (float *) malloc((n) * sizeof(float));
for (i = k + 1; i <= n; i++)
{
piv = -1 * (A[((i - 1) * n + k) - 1] / A[((k - 1) * n + k) - 1]);
fois(n, piv, k, A, b, L);
//afficher_vecteur(n,L);
for (j = 1; j <= n; j++)
{
A[((i - 1) * n + j) - 1] = A[((i - 1) * n + j) - 1] + L[j - 1];
}
b[i - 1] = b[i - 1] + L[n];
afficher_system(n, A, b);
}
}
void permutter(int n, float * A, int i, int j, float * b)
{
int a;
float t[n + 1];
for (a = 1; a <= n; a++)
{
t[a - 1] = A[((i - 1) * n + a) - 1];
A[((i - 1) * n + a) - 1] = A[((j - 1) * n + a) - 1];
A[((j - 1) * n + a) - 1] = t[a - 1];
}
t[n] = b[i - 1];
b[i - 1] = b[j - 1];
b[j - 1] = t[n];
}
void main()
{
float * A, det, *L, *R, *b, s;
int i, j, i0, n, k, stop = 0;
printf("Veuillez donner la taille de la matrice");
scanf("%d", &n);
A = (float *) malloc(sizeof(float) * (n * n));
L = (float*) malloc(n * sizeof(float));
R = (float*) malloc(n * sizeof(float));
b = (float*) malloc(n * sizeof(float));
printf("Veuillez remplir la matrice");
saisirmatrice(n, A);
det = 1;
stop = 0;
k = 1;
do
{
do
{
i0 = recherche_pivot(k, n, A);
if (i0 == k)
{
//Elémination
elemination(n, k, b, A);
k++;
}
else if (i0 == -1)
{
stop = 1;
}
else
{ //cas ou ligne pivot=i0 != k
//permutation
det = -det;
permutter(n, A, k, i0, b);
//elemination
elemination(n, k, b, A);
//afficher_matrice(n,A);
k++;
}
} while ((k <= n) && (stop == 0));
} while (stop == 1 || k == n);
for (i = 1; i < n; i++)
{
det = det * A[((i - 1) * n + i) - 1];
}
printf("Le determinant est :%f", det);
free(A);
free(L);
free(R);
free(b);
}
There are many problems in the above code. Since arrays are zero-indexed in C, you should count the rows and columns of your matrices starting from zero, instead of counting from 1 and then attempting to convert when array-indexing. There is no need to cast the result of malloc(), and it is better to use an identifier rather than an explicit type as the argument for the sizeof operator:
A = malloc(sizeof(*A) * n * n);
You allocate space for L and R in main(), and then never use these pointers until the end of the program when they are freed. Then you allocate for L within the elemination() function; but you never free this memory, so you have a memory leak. You also allocate space for b in main(), but you don't store any values in b before passing it to the elemination() function. This is bound to cause problems.
There is no need for dynamic allocation here in the first place; I suggest using a variable length array to store the elements of the matrix. These have been available since C99, and will allow you to avoid all of the allocation issues.
There is a problem in the recherche_pivot() function, where you compare:
if(A[((k - 1) * n + i) - 1] != 0) {}
This is a problem because the array element is a floating point value which is the result of arithmetic operations; this value should not be directly compared with 0. I suggest selecting an appropriate DELTA value to represent a zero range, and instead comparing:
#define DELTA 0.000001
...
if (fabs(A[((k - 1) * n + i) - 1]) < DELTA) {}
In the permutter() function you use an array, float t[n];, to hold temporary values. But an array is unnecessary here since you don't need to save these temporary values after the swap; instead just use float t;. Further, when you interchange the values in b[], you use t[n] to store the temporary value, but this is out of bounds.
The elemination() function should probably iterate over all of the rows (excepting the kth row), rather than starting from the kth row, or it should start at the k+1th row. As it is, the kth row is used to eliminate itself. Finally, the actual algorithm that you use to perform the Gaussian elimination in main() is broken. Among other things, the call permutter(n, A, k, i0, b); swaps the kth row with the i0th row, but i0 is the pivot column of the kth row. This makes no sense.
It actually looks like you want to do more than just calculate determinants with this code, since you have b, which is the constant vector of a linear system. This is not needed for the task alluded to in the title of your question. Also, it appears that your code gives a result of 1 for any 1X1 determinant. This is incorrect; it should be the value of the single number in this case.
The Gaussian elimination method for calculating the determinant requires that you keep track of how many row-interchanges are performed, and that you keep a running product of any factors by which individual rows are multiplied. Adding a multiple of one row to another row to replace that row does not change the value of the determinant, and this is the operation used in the reduce() function below. The final result is the product of the diagonal entries in the reduced matrix, multiplied by -1 once for every row-interchange operation, divided by the product of all of the factors used to scale individual rows. In this case, there are no such factors, so the result is simply the product of the diagonal elements of the reduced matrix, with the sign correction. This is the method used by the code posted in the original question.
There were so many issues here that I just wrote a fresh program that implements this algorithm. I think that it is close, at least in spirit, to what you were trying to accomplish. I did add some input validation for the size of the matrix, checking to be sure that the user inputs a positive number, and prompting for re-entry if the input is bad. The input loop that fills the matrix would benefit from similar input validation. Also note that the input size is stored in a signed int, to allow checks for negative input, and a successful input is cast and stored in a variable of type size_t, which is an unsigned integer type guaranteed to hold any array index. This is the correct type to use when indexing arrays, and you will note that size_t is used throughout the program.
#include <stdio.h>
#include <math.h>
#include <stdbool.h>
#define DELTA 0.000001
void show_matrix(size_t mx_sz, double mx[mx_sz][mx_sz]);
void interchange(size_t r1, size_t r2, size_t mx_sz, double mx[mx_sz][mx_sz]);
void reduce(double factor, size_t r1, size_t r2,
size_t mx_sz, double mx[mx_sz][mx_sz]);
size_t get_pivot(size_t row, size_t mx_sz, double mx[mx_sz][mx_sz]);
double find_det(size_t mx_sz, double mx[mx_sz][mx_sz]);
int main(void)
{
size_t n;
int read_val, c;
printf("Enter size of matrix: ");
while (scanf("%d", &read_val) != 1 || read_val < 1) {
while ((c = getchar()) != '\n' && c != EOF) {
continue; // discard extra characters
}
printf("Enter size of matrix: ");
}
n = (size_t) read_val;
double matrix[n][n];
printf("Enter matrix elements:\n");
for (size_t i = 0; i < n; i++) {
for (size_t j = 0; j < n; j++) {
scanf("%lf", &matrix[i][j]);
}
}
printf("You entered:\n");
show_matrix(n, matrix);
putchar('\n');
double result = find_det(n, matrix);
show_matrix(n, matrix);
putchar('\n');
printf("Determinant: %f\n", result);
return 0;
}
void show_matrix(size_t n, double mx[n][n])
{
for (size_t i = 0; i < n; i++) {
for (size_t j = 0; j < n; j++) {
printf("%7.2f", mx[i][j]);
}
putchar('\n');
}
}
/* interchange rows r1 and r2 */
void interchange(size_t r1, size_t r2, size_t mx_sz, double mx[mx_sz][mx_sz])
{
double temp;
for (size_t j = 0; j < mx_sz; j++) {
temp = mx[r1][j];
mx[r1][j] = mx[r2][j];
mx[r2][j] = temp;
}
}
/* add factor * row r1 to row r2 to replace row r2 */
void reduce(double factor, size_t r1, size_t r2,
size_t mx_sz, double mx[mx_sz][mx_sz])
{
for (size_t j = 0; j < mx_sz; j++) {
mx[r2][j] += (factor * mx[r1][j]);
}
}
/* returns pivot column, or mx_sz if there is no pivot */
size_t get_pivot(size_t row, size_t mx_sz, double mx[mx_sz][mx_sz])
{
size_t j = 0;
while (j < mx_sz && fabs(mx[row][j]) < DELTA) {
++j;
}
return j;
}
double find_det(size_t mx_sz, double mx[mx_sz][mx_sz])
{
size_t pivot1, pivot2;
size_t row;
double factor;
bool finished = false;
double result = 1.0;
while (!finished) {
finished = true;
row = 1;
while (row < mx_sz) {
// determinant is zero if there is a zero row
if ((pivot1 = get_pivot(row - 1, mx_sz, mx)) == mx_sz ||
(pivot2 = get_pivot(row, mx_sz, mx)) == mx_sz) {
return 0.0;
}
if (pivot1 == pivot2) {
factor = -mx[row][pivot1] / mx[row - 1][pivot1];
reduce(factor, row - 1, row, mx_sz, mx);
finished = false;
} else if (pivot2 < pivot1) {
interchange(row - 1, row, mx_sz, mx);
result = -result;
finished = false;
}
++row;
}
}
for (size_t j = 0; j < mx_sz; j++) {
result *= mx[j][j];
}
return result;
}
Sample session:
Enter size of matrix: oops
Enter size of matrix: 0
Enter size of matrix: -1
Enter size of matrix: 3
Enter matrix elements:
0 1 3
1 2 0
0 3 4
You entered:
0.00 1.00 3.00
1.00 2.00 0.00
0.00 3.00 4.00
1.00 2.00 0.00
-0.00 -3.00 -9.00
0.00 0.00 -5.00
Determinant: 5.000000

Improve performance of a construction of p-values matrix for a permutation test

I used an R code which implements a permutation test for the distributional comparison between two populations of functions. We have p univariate p-values.
The bottleneck is the construction of a matrix which contains all the possible CONTIGUOUS p-values.
The last row of the matrix of p-values contain all the univariate p-values.
The penultimate row contains all the bivariate p-values in this order:
p_val_c(1,2), p_val_c(2,3), ..., p_val_c(p, 1)
...
The elements of the first row are coincident and the value associated is the p-value of the global test p_val_c(1,...,p)=p_val_c(2,...,p,1)=...=pval(p,1,...,p-1).
For computational reasons, I have decided to implement this component in C and use it in R with .C.
Here is the code. The only important part is the definition of the function Build_pval_asymm_matrix.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <time.h>
void Build_pval_asymm_matrix(int * p, int * B, double * pval,
double * L,
double * pval_asymm_matrix);
// Function used for the sorting of vector T_temp with qsort
int cmp(const void *x, const void *y);
int main() {
int B = 1000; // number Conditional Monte Carlo (CMC) runs
int p = 100; // number univariate tests
// Generate fictitiously data univariate p-values pval and matrix L.
// The j-th column of L is the empirical survival
// function of the statistics test associated to the j-th coefficient
// of the basis expansion. The dimension of L is B * p.
// Generate pval
double pval[p];
memset(pval, 0, sizeof(pval)); // initialize all elements to 0
for (int i = 0; i < p; i++) {
pval[i] = (double)rand() / (double)RAND_MAX;
}
// Construct L
double L[B * p];
// Initialize to 0 the elements of L
memset(L, 0, sizeof(L));
// Array used to construct the columns of L
double temp_array[B];
memset(temp_array, 0, sizeof(temp_array));
for(int i = 0; i < B; i++) {
temp_array[i] = (double) (i + 1) / (double) B;
}
for (int iter_coeff=0; iter_coeff < p; iter_coeff++) {
// Shuffle temp_array
if (B > 1) {
for (int k = 0; k < B - 1; k++)
{
int j = rand() % B;
double t = temp_array[j];
temp_array[j] = temp_array[k];
temp_array[k] = t;
}
}
for (int i=0; i<B; i++) {
L[iter_coeff + p * i] = temp_array[i];
}
}
double pval_asymm_matrix[p * p];
memset(pval_asymm_matrix, 0, sizeof(pval_asymm_matrix));
// Construct the asymmetric matrix of p-values
clock_t start, end;
double cpu_time_used;
start = clock();
Build_pval_asymm_matrix(&p, &B, pval, L, pval_asymm_matrix);
end = clock();
cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("TOTAL CPU time used: %f\n", cpu_time_used);
return 0;
}
void Build_pval_asymm_matrix(int * p, int * B, double * pval,
double * L,
double * pval_asymm_matrix) {
int nbasis = *p, iter_CMC = *B;
// Scalar output fisher combining function applied on univariate
// p-values
double T0_temp = 0;
// Vector output fisher combining function applied on a set of
//columns of L
double T_temp[iter_CMC];
memset(T_temp, 0, sizeof(T_temp));
// Counter for elements of T_temp greater than or equal to T0_temp
int count = 0;
// Indexes for columns of L
int inf = 0, sup = 0;
// The last row of matrice_pval_asymm contains the univariate p-values
for(int i = 0; i < nbasis; i++) {
pval_asymm_matrix[i + nbasis * (nbasis - 1)] = pval[i];
}
// Construct the rows from bottom to up
for (int row = nbasis - 2; row >= 0; row--) {
for (int col = 0; col <= row; col++) {
T0_temp = 0;
memset(T_temp, 0, sizeof(T_temp));
inf = col;
sup = (nbasis - row) + col - 1;
// Combining function Fisher applied on
// p-values pval[inf:sup]
for (int k = inf; k <= sup; k++) {
T0_temp += log(pval[k]);
}
T0_temp *= -2;
// Combining function Fisher applied
// on columns inf:sup of matrix L
for (int k = 0; k < iter_CMC; k++) {
for (int l = inf; l <= sup; l++) {
T_temp[k] += log(L[l + nbasis * k]);
}
T_temp[k] *= -2;
}
// Sort the vector T_temp
qsort(T_temp, iter_CMC, sizeof(double), cmp);
// Count the number of elements of T_temp less than T0_temp
int h = 0;
while (h < iter_CMC && T_temp[h] < T0_temp) {
h++;
}
// Number of elements of T_temp greater than or equal to T0_temp
count = iter_CMC - h;
pval_asymm_matrix[col + nbasis * row] = (double) count / (double)iter_CMC;
}
// auxiliary variable for columns of L inf:nbasis-1 and 1:sup
int aux_first = 0, aux_second = 0;
int num_col_needed = 0;
for (int col = row + 1; col < nbasis; col++) {
T0_temp = 0;
memset(T_temp, 0, sizeof(T_temp));
inf = col;
sup = ((nbasis - row) + col) % nbasis - 1;
// Useful indexes
num_col_needed = nbasis - inf + sup + 1;
int index_needed[num_col_needed];
memset(index_needed, -1, num_col_needed * sizeof(int));
aux_first = inf;
for (int i = 0; i < nbasis - inf; i++) {
index_needed[i] = aux_first;
aux_first++;
}
aux_second = 0;
for (int j = 0; j < sup + 1; j++) {
index_needed[j + nbasis - inf] = aux_second;
aux_second++;
}
// Combining function Fisher applied on p-values
// pval[inf:p-1] and pval[0:sup-1]
for (int k = 0; k < num_col_needed; k++) {
T0_temp += log(pval[index_needed[k]]);
}
T0_temp *= -2;
// Combining function Fisher applied on columns inf:p-1 and 0:sup-1
// of matrix L
for (int k = 0; k < iter_CMC; k++) {
for (int l = 0; l < num_col_needed; l++) {
T_temp[k] += log(L[index_needed[l] + nbasis * k]);
}
T_temp[k] *= -2;
}
// Sort the vector T_temp
qsort(T_temp, iter_CMC, sizeof(double), cmp);
// Count the number of elements of T_temp less than T0_temp
int h = 0;
while (h < iter_CMC && T_temp[h] < T0_temp) {
h++;
}
// Number of elements of T_temp greater than or equal to T0_temp
count = iter_CMC - h;
pval_asymm_matrix[col + nbasis * row] = (double) count / (double)iter_CMC;
} // end for over col from row + 1 to nbasis - 1
} // end for over rows of asymm p-values matrix except the last row
}
int cmp(const void *x, const void *y)
{
double xx = *(double*)x, yy = *(double*)y;
if (xx < yy) return -1;
if (xx > yy) return 1;
return 0;
}
Here are the execution times in seconds, measured in R:
time_original_function
user system elapsed
79.726 1.980 112.817
time_function_double_for
user system elapsed
79.013 1.666 89.411
time_c_function
user system elapsed
47.920 0.024 56.096
The first measure was obtained using an equivalent R function with duplication of the vector pval and matrix L.
What I wanted to ask for is some suggestions to decrease the execution time of the C function for simulation purposes. The last time I used C was five years ago, so there is room for improvement. For instance, I sort the vector T_temp with qsort so that a while loop can then count, in linear time, the number of elements of T_temp greater than or equal to T0_temp. Maybe this task could be done in a more efficient way. Thanks in advance!!
I reduced the input size p to 50 to avoid waiting on it (I don't have such a fast machine); keeping p as is and reducing B to 100 has a similar effect. Profiling showed that ~7.5 out of the ~8 seconds used to compute this was spent in the log function.
qsort doesn't even show up as a real hotspot. This test seems to headbutt the machine more in terms of micro-efficiency than anything else.
So unless your compiler has a vastly faster implementation of log than I do, my first suggestion is to find a fast log implementation if you can afford some accuracy loss (there are ones out there that can compute log over an order of magnitude faster with precision loss in the range of ~3% or so).
If you cannot have precision loss and accuracy is critical, then I'd suggest trying to memoize the values you use for log if you can and store them into a lookup table.
Update
I tried the latter approach.
// Create a memoized table of log values.
double log_cache[B * p];
for (int j=0, num=B*p; j < num; ++j)
log_cache[j] = log(L[j]);
Using malloc might be better here, as we're pushing rather large data to the stack and could risk overflows.
Then pass it into Build_pval_asymm_matrix.
Replace these:
T_temp[k] += log(L[l + nbasis * k]);
...
T_temp[k] += log(L[index_needed[l] + nbasis * k]);
With these:
T_temp[k] += log_cache[l + nbasis * k];
...
T_temp[k] += log_cache[index_needed[l] + nbasis * k];
This improved the times for me from ~8 seconds to ~5.3 seconds, but we've exchanged the computational overhead of log for memory overhead which isn't that much better (in fact, it rarely is but calling log for double-precision floats is apparently quite expensive, enough to make this exchange worthwhile). The next iteration, if you want more speed, and it is very possible, involves looking into cache efficiency.
For this kind of huge matrix stuff, focusing on memory layouts and access patterns can work wonders.

Resources