Batched QR decomposition using CUBLAS - c

I have been trying to perform a QR decomposition of many small matrices in parallel with CUDA.
I therefore used the cublasDgeqrfBatched function in Cublas. I couldn't find a working example of the above fuction and I found some ambiguity in the documentation for calling it.
In fact, I tried to test cublasDgeqrfBatched on the example in the Householder reflections section in Wikipedia as this same method is being used by cublasDgeqrfBatched. The 2 input small matrices are identical and are the following:
A= 12 -51 4
6 167 -68
-4 24 -41
According to the documentation, Aarray is an array of pointers to matrices with dimensions mxn and TauArray is an array of pointers to vectors of dimension of at least max (1, min(m, n).
cublasDgeqrfBatched performs the QR factorization of each Aarray[i] for
i =0, ...,batchSize-1
Each matrix Q[i] is stored in the lower part of each Aarray[i]
I used the following code to call this function:
#include "cuda_runtime.h"
#include "device_launch_paraMeters.h"
#include<stdlib.h>
#include<stdio.h>
#include<assert.h>
#include <cublas.h>
#include "cublas_v2.h"
#include "Utilities.cuh"
#include <helper_cuda.h>
/********/
/* MAIN */
/********/
int main(){
//mxn: size of Array[i]
const int m = 3;
const int n = 3;
double h_A[3*3*2]={12, -51, 4, 6, 167, -68, -4, 24, -41, 12, -51, 4, 6, 167, -68, -4, 24, -41};// two 3x3 identical matrices for test
const int batchSize=2;//2 small matrices
const int ltau=3; //ltau = max(1,min(m,n))
// --- CUBLAS initialization
cublasHandle_t cublas_handle;
cublasStatus_t stat;
cublasSafeCall(cublasCreate(&cublas_handle));
// --- CUDA batched QR initialization
double *d_A, *d_TAU;
checkCudaErrors(cudaMalloc((void**)&d_A, m*n*batchSize*sizeof(double)));
checkCudaErrors(cudaMalloc((void**)&d_TAU, ltau*batchSize*sizeof(double)));
checkCudaErrors(cudaMemcpy(d_A,h_A,m*n*batchSize*sizeof(double),cudaMemcpyHostToDevice));
double *d_Aarray[batchSize],*d_TauArray[batchSize];
for (int i = 0; i < batchSize; i++)
{
d_Aarray[i] = d_A+ i*m*n;
d_TauArray[i] = d_TAU + i*ltau;
}
int lda=3;
int info;
stat=cublasDgeqrfBatched(cublas_handle, m, n, d_Aarray, lda, d_TauArray, &info, batchSize);
if (stat != CUBLAS_STATUS_SUCCESS)
printf("\n cublasDgeqrfBatched failed");
double *A0,*A1;
A0=(double*)malloc(m*n*batchSize*sizeof(double));
A1=(double*)malloc(m*n*sizeof(double));
checkCudaErrors(cudaMemcpy(A0,d_Aarray[0],m*n*sizeof(double),cudaMemcpyDeviceToHost));
checkCudaErrors(cudaMemcpy(A1,d_Aarray[1],m*n*sizeof(double),cudaMemcpyDeviceToHost));
}
But, got an error "CUDA error batched_QR/kernel.cu:64 code=4(cudaErrorLaunchFailure) "cudaMemcpy(A0,d_Aarray[0],m*n*sizeof(double),cudaMemcpyDeviceToHost)"
I think there is an error in the use of pointers but I can't correct it. Where is the problem please?
Edit:
to make d_Aarray and d_TauArray device arrays as talonmies proposed, I added the following:
double *d_A, *d_TAU;
checkCudaErrors(cudaMalloc((void**)&d_A, m*n*batchSize*sizeof(*d_A)));
checkCudaErrors(cudaMalloc((void**)&d_TAU, ltau*batchSize*sizeof(*d_TAU)));
checkCudaErrors(cudaMemcpy(d_A,h_A,m*n*batchSize*sizeof(double),cudaMemcpyHostToDevice));
checkCudaErrors(cudaMemset(d_TAU, 0, ltau*batchSize* sizeof(*d_TAU)));
But always the same error when copying the result back to host.

I think there is an error in the use of pointers
You are correct. The arrays of device pointers you are passing to cublasDgeqrfBatched are host arrays not device arrays:
double *d_Aarray[batchSize],*d_TauArray[batchSize];
for (int i = 0; i < batchSize; i++)
{
d_Aarray[i] = d_A+ i*m*n;
d_TauArray[i] = d_TAU + i*ltau;
}
You must copy d_Aarray and d_TauArray to the device and pass the address of the device copies to cublasDgeqrfBatched for this to work correctly. Something like this:
double *d_Aarray[batchSize],*d_TauArray[batchSize];
for (int i = 0; i < batchSize; i++)
{
d_Aarray[i] = d_A+ i*m*n;
d_TauArray[i] = d_TAU + i*ltau;
}
double ** d_Aarray_, ** d_TauArray_;
cudaMalloc((void **)&d_Aarray_, sizeof(d_Aarray));
cudaMalloc((void **)&d_TauArray_, sizeof(d_TauArray));
cudaMemcpy(d_Aarray_, d_Aarray, sizeof(d_Aarray), cudaMemcpyHostToDevice);
cudaMemcpy(d_TauArray_, d_TauArray, sizeof(d_TauArray), cudaMemcpyHostToDevice);
stat = cublasDgeqrfBatched(cublas_handle, m, n, d_Aarray_, lda, d_TauArray_, &info, batchSize)
[disclaimer: written in Browser]
Here d_Aarray_ and d_TauArray_ are device memory copies of d_Aarray and d_TauArray.

Related

CUDA: Finding Minimum Value in 100x100 Matrix

i just learned GPU programming and now i have a task to find a minimum value from 100x100 matrix by doing parallel at CUDA. i have try this code, but it's not showing the answer, instead of showing my initiate value hmin = 9999999.can anyone give me the right code? oh, the code is in C lang.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define size (100*100)
//Kernel Functions & Variable
__global__ void FindMin(int* mat[100][100],int* kmin){
int b=blockIdx.x+threadIdx.x*blockDim.x;
int k=blockIdx.y+threadIdx.y*blockDim.y;
if(mat[b][k] < kmin){
kmin = mat[b][k];
}
}
int main(int argc, char *argv[]) {
//Declare Variabel
int i,j,hmaks=0,hmin=9999999,hsumin,hsumax; //Host Variable
int *da[100][100],*dmin,*dmaks,*dsumin,*dsumax; // Device Variable
FILE *baca; //for opening txt file
char buf[4]; //used for fscanf
int ha[100][100],b; //matrix shall be filled by "b"
//1: Read txt File
baca=fopen("MatrixTubes1.txt","r");
if (!baca){
printf("Hey, it's not even exist"); //Checking File, is it there?
}
i=0;j=0; //Matrix index initialization
if(!feof(baca)){ //if not end of file then do
for(i = 0; i < 100; i++){
for(j = 0; j < 100; j++){
fscanf(baca,"%s",buf); //read max 4 char
b=atoi(buf); //parsing from string to integer
ha[i][j]=b; //save it to my matrix
}
}
}
fclose(baca);
//all file has been read
//time to close the file
//Sesi 2: Allocation data di GPU
cudaMalloc((void **)&da, size*sizeof(int));
cudaMalloc((void **)&dmin, sizeof(int));
cudaMalloc((void **)&dmaks, sizeof(int));
cudaMalloc((void **)&dsumin, sizeof(int));
cudaMalloc((void **)&dsumax, sizeof(int));
//Sesi 3: Copy data to Device
cudaMemcpy(da, &ha, size*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dmin, &hmin, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dmaks, &hmaks, sizeof(int), cudaMemcpyHostToDevice);
//Sesi 4: Call Kernel
FindMin<<<100,100,1>>>(da,dmin);
//5: Copy from Device to Host
cudaMemcpy(&hmin, dmin, sizeof(int), cudaMemcpyDeviceToHost);
//6: Print that value
printf("Minimum Value = %i \n",hmin);
system("pause"); return 0;
}
this is my result
Minimum Value = 9999999
Press any key to continue . . .
I saw a few issues in your code.
As mentioned in the comments from MayurK, you got the indexing wrong.
Also as MayurK said, you are comparing two pointers and not the values they point to.
You kernel invocation code asks for 100 x 100 x 1 grid, with each block containing just 1 thread. This is very bad in terms of efficiency. Also, because of this, your b and k will only range from 0 to 99, as the threadIdx.x will always be zero.
Finally, all threads will be running in parallel, resulting in a race condition in kmin = mat[b][k] (which should be *kmin by the way). When you fixed the indexing problem, all threads in the same block will write to the location in global memory at same time. You should use atomicMin() or a parallel reduction for finding the minimum value in parallel.

Sending 2D array to Cuda Kernel

I'm having a bit of trouble understanding how to send a 2D array to Cuda. I have a program that parses a large file with a 30 data points on each line. I read about 10 rows at a time and then create a matrix for each line and items(so in my example of 10 rows with 30 data points, it would be int list[10][30]; My goal is to send this array to my kernal and have each block process a row(I have gotten this to work perfectly in normal C, but Cuda has been a bit more challenging).
Here's what I'm doing so far but no luck(note: sizeofbucket = rows, and sizeOfBucketsHoldings = items in row...I know I should win a award for odd variable names):
int list[sizeOfBuckets][sizeOfBucketsHoldings]; //this is created at the start of the file and I can confirmed its filled with the correct data
#define sizeOfBuckets 10 //size of buckets before sending to process list
#define sizeOfBucketsHoldings 30
//Cuda part
//define device variables
int *dev_current_list[sizeOfBuckets][sizeOfBucketsHoldings];
//time to malloc the 2D array on device
size_t pitch;
cudaMallocPitch((int**)&dev_current_list, (size_t *)&pitch, sizeOfBucketsHoldings * sizeof(int), sizeOfBuckets);
//copy data from host to device
cudaMemcpy2D( dev_current_list, pitch, list, sizeOfBuckets * sizeof(int), sizeOfBuckets * sizeof(int), sizeOfBucketsHoldings * sizeof(int),cudaMemcpyHostToDevice );
process_list<<<count,1>>> (sizeOfBuckets, sizeOfBucketsHoldings, dev_current_list, pitch);
//free memory of device
cudaFree( dev_current_list );
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, int pitch) {
int tid = blockIdx.x;
for (int r = 0; r < sizeOfBuckets; ++r) {
int* row = (int*)((char*)current_list + r * pitch);
for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
int element = row[c];
}
}
The error I'm getting is:
main.cu(266): error: argument of type "int *(*)[30]" is incompatible with parameter of type "int *"
1 error detected in the compilation of "/tmp/tmpxft_00003f32_00000000-4_main.cpp1.ii".
line 266 is the kernel call process_list<<<count,1>>> (count, countListItem, dev_current_list, pitch); I think the problem is I am trying to create my array in my function as int * but how else can I create it? In my pure C code, I use int current_list[num_of_rows][num_items_in_row] which works but I can't get the same outcome to work in Cuda.
My end goal is simple I just want to get each block to process each row(sizeOfBuckets) and then have it loop through all items in that row(sizeOfBucketHoldings). I orginally just did a normal cudamalloc and cudaMemcpy but it wasn't working so I looked around and found out about MallocPitch and 2dcopy(both of which were not in my cuda by example book) and I have been trying to study examples but they seem to be giving me the same error(I'm currently reading the CUDA_C programming guide found this idea on page22 but still no luck). Any ideas? or suggestions of where to look?
Edit:
To test this, I just want to add the value of each row together(I copied the logic from the cuda by example array addition example).
My kernel:
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, size_t pitch, int *total) {
//TODO: we need to flip the list as well
int tid = blockIdx.x;
for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
total[tid] = total + current_list[tid][c];
}
}
Here's how I declare the total array in my main:
int *dev_total;
cudaMalloc( (void**)&dev_total, sizeOfBuckets * sizeof(int) );
You have some mistakes in your code.
Then you copy host array to device you should pass one dimensional host pointer.See the function signature.
You don't need to allocate static 2D array for device memory. It creates static array in host memory then you recreate it as device array. Keep in mind it must be one dimensional array, too. See this function signature.
This example should help you with memory allocation:
__global__ void process_list(int sizeOfBucketsHoldings, int* total, int* current_list, int pitch)
{
int tid = blockIdx.x;
total[tid] = 0;
for (int c = 0; c < sizeOfBucketsHoldings; ++c)
{
total[tid] += *((int*)((char*)current_list + tid * pitch) + c);
}
}
int main()
{
size_t sizeOfBuckets = 10;
size_t sizeOfBucketsHoldings = 30;
size_t width = sizeOfBucketsHoldings * sizeof(int);//ned to be in bytes
size_t height = sizeOfBuckets;
int* list = new int [sizeOfBuckets * sizeOfBucketsHoldings];// one dimensional
for (int i = 0; i < sizeOfBuckets; i++)
for (int j = 0; j < sizeOfBucketsHoldings; j++)
list[i *sizeOfBucketsHoldings + j] = i;
size_t pitch_h = sizeOfBucketsHoldings * sizeof(int);// always in bytes
int* dev_current_list;
size_t pitch_d;
cudaMallocPitch((int**)&dev_current_list, &pitch_d, width, height);
int *test;
cudaMalloc((void**)&test, sizeOfBuckets * sizeof(int));
int* h_test = new int[sizeOfBuckets];
cudaMemcpy2D(dev_current_list, pitch_d, list, pitch_h, width, height, cudaMemcpyHostToDevice);
process_list<<<10, 1>>>(sizeOfBucketsHoldings, test, dev_current_list, pitch_d);
cudaDeviceSynchronize();
cudaMemcpy(h_test, test, sizeOfBuckets * sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < sizeOfBuckets; i++)
printf("%d %d\n", i , h_test[i]);
return 0;
}
To access your 2D array in kernel you should use pattern base_addr + y * pitch_d + x.
WARNING: the pitvh allways in bytes. You need to cast your pointer to byte*.

2d array in C with negative indices

I am writing a C-program where I need 2D-arrays (dynamically allocated) with negative indices or where the index does not start at zero. So for an array[i][j] the row-index i should take values from e.g. 1 to 3 and the column-index j should take values from e.g. -1 to 9.
For this purpose I created the following program, here the variable columns_start is set to zero, so just the row-index is shifted and this works really fine.
But when I assign other values than zero to the variable columns_start, I get the message (from valgrind) that the command "free(array[i]);" is invalid.
So my questions are:
Why it is invalid to free the memory that I allocated just before?
How do I have to modify my program to shift the column-index?
Thank you for your help.
#include <stdio.h>
#include <stdlib.h>
main()
{
int **array, **array2;
int rows_end, rows_start, columns_end, columns_start, i, j;
rows_start = 1;
rows_end = 3;
columns_start = 0;
columns_end = 9;
array = malloc((rows_end-rows_start+1) * sizeof(int *));
for(i = 0; i <= (rows_end-rows_start); i++) {
array[i] = malloc((columns_end-columns_start+1) * sizeof(int));
}
array2 = array-rows_start; //shifting row-index
for(i = rows_start; i <= rows_end; i++) {
array2[i] = array[i-rows_start]-columns_start; //shifting column-index
}
for(i = rows_start; i <= rows_end; i++) {
for(j = columns_start; j <= columns_end; j++) {
array2[i][j] = i+j; //writing stuff into array
printf("%i %i %d\n",i, j, array2[i][j]);
}
}
for(i = 0; i <= (rows_end-rows_start); i++) {
free(array[i]);
}
free(array);
}
When you shift column indexes, you assign new values to original array of columns: in
array2[i] = array[i-rows_start]-columns_start;
array2[i] and array[i=rows_start] are the same memory cell as array2 is initialized with array-rows_start.
So deallocation of memory requires reverse shift. Try the following:
free(array[i] + columns_start);
IMHO, such modification of array indexes gives no benefit, while complicating program logic and leading to errors. Try to modify indexes on the fly in single loop.
#include <stdio.h>
#include <stdlib.h>
int main(void) {
int a[] = { -1, 41, 42, 43 };
int *b;//you will always read the data via this pointer
b = &a[1];// 1 is becoming the "zero pivot"
printf("zero: %d\n", b[0]);
printf("-1: %d\n", b[-1]);
return EXIT_SUCCESS;
}
If you don't need just a contiguous block, then you may be better off with hash tables instead.
As far as I can see, your free and malloc looks good. But your shifting doesn't make sense. Why don't you just add an offset in your array instead of using array2:
int maxNegValue = 10;
int myNegValue = -6;
array[x][myNegValue+maxNegValue] = ...;
this way, you're always in the positive range.
For malloc: you acquire (maxNegValue + maxPosValue) * sizeof(...)
Ok I understand now, that you need free(array.. + offset); even using your shifting stuff.. that's probably not what you want. If you don't need a very fast implementation I'd suggest to use a struct containing the offset and an array. Then create a function having this struct and x/y as arguments to allow access to the array.
I don't know why valgrind would complain about that free statement, but there seems to be a lot of pointer juggling going on so it doesn't surprise me that you get this problem in the first place. For instance, one thing which caught my eye is:
array2 = array-rows_start;
This will make array2[0] dereference memory which you didn't allocate. I fear it's just a matter of time until you get the offset calcuations wrong and run into this problem.
One one comment you wrote
but im my program I need a lot of these arrays with all different beginning indices, so I hope to find a more elegant solution instead of defining two offsets for every array.
I think I'd hide all this in a matrix helper struct (+ functions) so that you don't have to clutter your code with all the offsets. Consider this in some matrix.h header:
struct matrix; /* opaque type */
/* Allocates a matrix with the given dimensions, sample invocation might be:
*
* struct matrix *m;
* matrix_alloc( &m, -2, 14, -9, 33 );
*/
void matrix_alloc( struct matrix **m, int minRow, int maxRow, int minCol, int maxCol );
/* Releases resources allocated by the given matrix, e.g.:
*
* struct matrix *m;
* ...
* matrix_free( m );
*/
void matrix_free( struct matrix *m );
/* Get/Set the value of some elment in the matrix; takes logicaly (potentially negative)
* coordinates and translates them to zero-based coordinates internally, e.g.:
*
* struct matrix *m;
* ...
* int val = matrix_get( m, 9, -7 );
*/
int matrix_get( struct matrix *m, int row, int col );
void matrix_set( struct matrix *m, int row, int col, int val );
And here's how an implementation might look like (this would be matrix.c):
struct matrix {
int minRow, maxRow, minCol, maxCol;
int **elem;
};
void matrix_alloc( struct matrix **m, int minCol, int maxCol, int minRow, int maxRow ) {
int numRows = maxRow - minRow;
int numCols = maxCol - minCol;
*m = malloc( sizeof( struct matrix ) );
*elem = malloc( numRows * sizeof( *elem ) );
for ( int i = 0; i < numRows; ++i )
*elem = malloc( numCols * sizeof( int ) );
/* setting other fields of the matrix omitted for brevity */
}
void matrix_free( struct matrix *m ) {
/* omitted for brevity */
}
int matrix_get( struct matrix *m, int col, int row ) {
return m->elem[row - m->minRow][col - m->minCol];
}
void matrix_set( struct matrix *m, int col, int row, int val ) {
m->elem[row - m->minRow][col - m->minCol] = val;
}
This way you only need to get this stuff right once, in a central place. The rest of your program doesn't have to deal with raw arrays but rather the struct matrix type.

C File Input/Trapezoid Rule Program

Little bit of a 2 parter. First of all im trying to do this in all c. First of all I'll go ahead and post my program
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <string.h>
double f(double x);
void Trap(double a, double b, int n, double* integral_p);
int main(int argc, char* argv[]) {
double integral=0.0; //Integral Result
double a=6, b=10; //Left and Right Points
int n; //Number of Trapezoids (Higher=more accurate)
int degree;
if (argc != 3) {
printf("Error: Invalid Command Line arguements, format:./trapezoid N filename");
exit(0);
}
n = atoi(argv[2]);
FILE *fp = fopen( argv[1], "r" );
# pragma omp parallel
Trap(a, b, n, &integral);
printf("With n = %d trapezoids....\n", n);
printf("of the integral from %f to %f = %.15e\n",a, b, integral);
return 0;
}
double f(double x) {
double return_val;
return_val = pow(3.0*x,5)+pow(2.5*x,4)+pow(-1.5*x,3)+pow(0*x,2)+pow(1.7*x,1)+4;
return return_val;
}
void Trap(double a, double b, int n, double* integral_p) {
double h, x, my_integral;
double local_a, local_b;
int i, local_n;
int my_rank = omp_get_thread_num();
int thread_count = omp_get_num_threads();
h = (b-a)/n;
local_n = n/thread_count;
local_a = a + my_rank*local_n*h;
local_b = local_a + local_n*h;
my_integral = (f(local_a) + f(local_b))/2.0;
for (i = 1; i <= local_n-1; i++) {
x = local_a + i*h;
my_integral += f(x);
}
my_integral = my_integral*h;
# pragma omp critical
*integral_p += my_integral;
}
As you can see, it calculates trapezoidal rule given an interval.
First of all it DOES work, if you hardcode the values and the function. But I need to read from a file in the format of
5
3.0 2.5 -1.5 0.0 1.7 4.0
6 10
Which means:
It is of degree 5 (no more than 50 ever)
3.0x^5 +2.5x^4 −1.5x^3 +1.7x+4 is the polynomial (we skip ^2 since it's 0)
and the Interval is from 6 to 10
My main concern is the f(x) function which I have hardcoded. I have NO IDEA how to make it take up to 50 besides literally typing out 50 POWS and reading in the values to see what they could be.......Anyone else have any ideas perhaps?
Also what would be the best way to read in the file? fgetc? Im not really sure when it comes to reading in C input (especially since everything i read in is an INT, is there some way to convert them?)
For a large degree polynomial, would something like this work?
double f(double x, double coeff[], int nCoeff)
{
double return_val = 0.0;
int exponent = nCoeff-1;
int i;
for(i=0; i<nCoeff-1; ++i, --exponent)
{
return_val = pow(coeff[i]*x, exponent) + return_val;
}
/* add on the final constant, 4, in our example */
return return_val + coeff[nCoeff-1];
}
In your example, you would call it like:
sampleCall()
{
double coefficients[] = {3.0, 2.5, -1.5, 0, 1.7, 4};
/* This expresses 3x^5 + 2.5x^4 + (-1.5x)^3 + 0x^2 + 1.7x + 4 */
my_integral = f(x, coefficients, 6);
}
By passing an array of coefficients (the exponents are assumed), you don't have to deal with variadic arguments. The hardest part is constructing the array, and that is pretty simple.
It should go without saying, if you put the coefficients array and number-of-coefficients into global variables, then the signature of f(x) doesn't need to change:
double f(double x)
{
// access glbl_coeff and glbl_NumOfCoeffs, instead of parameters
}
For you f() function consider making it variadic (varargs is another name)
http://www.gnu.org/s/libc/manual/html_node/Variadic-Functions.html
This way you could pass the function 1 arg telling it how many "pows" you want, with each susequent argument being a double value. Is this what you are asking for with the f() function part of your question?

Computing the inverse of a matrix using lapack in C

I would like to be able to compute the inverse of a general NxN matrix in C/C++ using lapack.
My understanding is that the way to do an inversion in lapack is by using the dgetri function, however, I can't figure out what all of its arguments are supposed to be.
Here is the code I have:
void dgetri_(int* N, double* A, int* lda, int* IPIV, double* WORK, int* lwork, int* INFO);
int main(){
double M [9] = {
1,2,3,
4,5,6,
7,8,9
};
return 0;
}
How would you complete it to obtain the inverse of the 3x3 matrix M using dgetri_?
Here is the working code for computing the inverse of a matrix using lapack in C/C++:
#include <cstdio>
extern "C" {
// LU decomoposition of a general matrix
void dgetrf_(int* M, int *N, double* A, int* lda, int* IPIV, int* INFO);
// generate inverse of a matrix given its LU decomposition
void dgetri_(int* N, double* A, int* lda, int* IPIV, double* WORK, int* lwork, int* INFO);
}
void inverse(double* A, int N)
{
int *IPIV = new int[N];
int LWORK = N*N;
double *WORK = new double[LWORK];
int INFO;
dgetrf_(&N,&N,A,&N,IPIV,&INFO);
dgetri_(&N,A,&N,IPIV,WORK,&LWORK,&INFO);
delete[] IPIV;
delete[] WORK;
}
int main(){
double A [2*2] = {
1,2,
3,4
};
inverse(A, 2);
printf("%f %f\n", A[0], A[1]);
printf("%f %f\n", A[2], A[3]);
return 0;
}
First, M has to be a two-dimensional array, like double M[3][3]. Your array is, mathematically speaking, a 1x9 vector, which is not invertible.
N is a pointer to an int for the
order of the matrix - in this case,
N=3.
A is a pointer to the LU
factorization of the matrix, which
you can get by running the LAPACK
routine dgetrf.
LDA is an integer for the "leading
element" of the matrix, which lets
you pick out a subset of a bigger
matrix if you want to just invert a
little piece. If you want to invert
the whole matrix, LDA should just be
equal to N.
IPIV is the pivot indices of the
matrix, in other words, it's a list
of instructions of what rows to swap
in order to invert the matrix. IPIV
should be generated by the LAPACK
routine dgetrf.
LWORK and WORK are the "workspaces"
used by LAPACK. If you are inverting
the whole matrix, LWORK should be an
int equal to N^2, and WORK should be
a double array with LWORK elements.
INFO is just a status variable to
tell you whether the operation
completed successfully. Since not all
matrices are invertible, I would
recommend that you send this to some
sort of error-checking system. INFO=0 for successful operation, INFO=-i if the i'th argument had an incorrect input value, and INFO > 0 if the matrix is not invertible.
So, for your code, I would do something like this:
int main(){
double M[3][3] = { {1 , 2 , 3},
{4 , 5 , 6},
{7 , 8 , 9}}
double pivotArray[3]; //since our matrix has three rows
int errorHandler;
double lapackWorkspace[9];
// dgetrf(M,N,A,LDA,IPIV,INFO) means invert LDA columns of an M by N matrix
// called A, sending the pivot indices to IPIV, and spitting error
// information to INFO.
// also don't forget (like I did) that when you pass a two-dimensional array
// to a function you need to specify the number of "rows"
dgetrf_(3,3,M[3][],3,pivotArray[3],&errorHandler);
//some sort of error check
dgetri_(3,M[3][],3,pivotArray[3],9,lapackWorkspace,&errorHandler);
//another error check
}
Here is a working version of the above using OpenBlas interface to LAPACKE.
Link with openblas library (LAPACKE is already contained)
#include <stdio.h>
#include "cblas.h"
#include "lapacke.h"
// inplace inverse n x n matrix A.
// matrix A is Column Major (i.e. firts line, second line ... *not* C[][] order)
// returns:
// ret = 0 on success
// ret < 0 illegal argument value
// ret > 0 singular matrix
lapack_int matInv(double *A, unsigned n)
{
int ipiv[n+1];
lapack_int ret;
ret = LAPACKE_dgetrf(LAPACK_COL_MAJOR,
n,
n,
A,
n,
ipiv);
if (ret !=0)
return ret;
ret = LAPACKE_dgetri(LAPACK_COL_MAJOR,
n,
A,
n,
ipiv);
return ret;
}
int main()
{
double A[] = {
0.378589, 0.971711, 0.016087, 0.037668, 0.312398,
0.756377, 0.345708, 0.922947, 0.846671, 0.856103,
0.732510, 0.108942, 0.476969, 0.398254, 0.507045,
0.162608, 0.227770, 0.533074, 0.807075, 0.180335,
0.517006, 0.315992, 0.914848, 0.460825, 0.731980
};
for (int i=0; i<25; i++) {
if ((i%5) == 0) putchar('\n');
printf("%+12.8f ",A[i]);
}
putchar('\n');
matInv(A,5);
for (int i=0; i<25; i++) {
if ((i%5) == 0) putchar('\n');
printf("%+12.8f ",A[i]);
}
putchar('\n');
}
Example:
% g++ -I [OpenBlas path]/include/ example.cpp [OpenBlas path]/lib/libopenblas.a
% a.out
+0.37858900 +0.97171100 +0.01608700 +0.03766800 +0.31239800
+0.75637700 +0.34570800 +0.92294700 +0.84667100 +0.85610300
+0.73251000 +0.10894200 +0.47696900 +0.39825400 +0.50704500
+0.16260800 +0.22777000 +0.53307400 +0.80707500 +0.18033500
+0.51700600 +0.31599200 +0.91484800 +0.46082500 +0.73198000
+0.24335255 -2.67946180 +3.57538817 +0.83711880 +0.34704217
+1.02790497 -1.05086895 -0.07468137 +0.71041070 +0.66708313
-0.21087237 -4.47765165 +1.73958308 +1.73999641 +3.69324020
-0.14100897 +2.34977565 -0.93725915 +0.47383541 -2.15554470
-0.26329660 +6.46315378 -4.07721533 -3.37094863 -2.42580445
Here is a working version of Spencer Nelson's example above. One mystery about it is that the input matrix is in row-major order, even though it appears to call the underlying fortran routine dgetri. I am led to believe that all the underlying fortran routines require column-major order, but I am no expert on LAPACK, in fact, I'm using this example to help me learn it. But, that one mystery aside:
The input matrix in the example is singular. LAPACK tries to tell you that by returning a 3 in the errorHandler. I changed the 9 in that matrix to a 19, getting an errorHandler of 0 signalling success, and compared the result to that from Mathematica. The comparison was also successful and confirmed that the matrix in the example should be in row-major order, as presented.
Here is the working code:
#include <stdio.h>
#include <stddef.h>
#include <lapacke.h>
int main() {
int N = 3;
int NN = 9;
double M[3][3] = { {1 , 2 , 3},
{4 , 5 , 6},
{7 , 8 , 9} };
int pivotArray[3]; //since our matrix has three rows
int errorHandler;
double lapackWorkspace[9];
// dgetrf(M,N,A,LDA,IPIV,INFO) means invert LDA columns of an M by N matrix
// called A, sending the pivot indices to IPIV, and spitting error information
// to INFO. also don't forget (like I did) that when you pass a two-dimensional
// array to a function you need to specify the number of "rows"
dgetrf_(&N, &N, M[0], &N, pivotArray, &errorHandler);
printf ("dgetrf eh, %d, should be zero\n", errorHandler);
dgetri_(&N, M[0], &N, pivotArray, lapackWorkspace, &NN, &errorHandler);
printf ("dgetri eh, %d, should be zero\n", errorHandler);
for (size_t row = 0; row < N; ++row)
{ for (size_t col = 0; col < N; ++col)
{ printf ("%g", M[row][col]);
if (N-1 != col)
{ printf (", "); } }
if (N-1 != row)
{ printf ("\n"); } }
return 0; }
I built and ran it as follows on a Mac:
gcc main.c -llapacke -llapack
./a.out
I did an nm on the LAPACKE library and found the following:
liblapacke.a(lapacke_dgetri.o):
U _LAPACKE_dge_nancheck
0000000000000000 T _LAPACKE_dgetri
U _LAPACKE_dgetri_work
U _LAPACKE_xerbla
U _free
U _malloc
liblapacke.a(lapacke_dgetri_work.o):
U _LAPACKE_dge_trans
0000000000000000 T _LAPACKE_dgetri_work
U _LAPACKE_xerbla
U _dgetri_
U _free
U _malloc
and it looks like there is a LAPACKE [sic] wrapper that would presumably relieve us of having to take addresses everywhere for fortran's convenience, but I am probably not going to get around to trying it because I have a way forward.
EDIT
Here is a working version that bypasses LAPACKE [sic], using LAPACK fortran routines directly. I do not understand why a row-major input produces correct results, but I confirmed it again in Mathematica.
#include <stdio.h>
#include <stddef.h>
int main() {
int N = 3;
int NN = 9;
double M[3][3] = { {1 , 2 , 3},
{4 , 5 , 6},
{7 , 8 , 19} };
int pivotArray[3]; //since our matrix has three rows
int errorHandler;
double lapackWorkspace[9];
/* from http://www.netlib.no/netlib/lapack/double/dgetrf.f
SUBROUTINE DGETRF( M, N, A, LDA, IPIV, INFO )
*
* -- LAPACK routine (version 3.1) --
* Univ. of Tennessee, Univ. of California Berkeley and NAG Ltd..
* November 2006
*
* .. Scalar Arguments ..
INTEGER INFO, LDA, M, N
* ..
* .. Array Arguments ..
INTEGER IPIV( * )
DOUBLE PRECISION A( LDA, * )
*/
extern void dgetrf_ (int * m, int * n, double * A, int * LDA, int * IPIV,
int * INFO);
/* from http://www.netlib.no/netlib/lapack/double/dgetri.f
SUBROUTINE DGETRI( N, A, LDA, IPIV, WORK, LWORK, INFO )
*
* -- LAPACK routine (version 3.1) --
* Univ. of Tennessee, Univ. of California Berkeley and NAG Ltd..
* November 2006
*
* .. Scalar Arguments ..
INTEGER INFO, LDA, LWORK, N
* ..
* .. Array Arguments ..
INTEGER IPIV( * )
DOUBLE PRECISION A( LDA, * ), WORK( * )
*/
extern void dgetri_ (int * n, double * A, int * LDA, int * IPIV,
double * WORK, int * LWORK, int * INFO);
// dgetrf(M,N,A,LDA,IPIV,INFO) means invert LDA columns of an M by N matrix
// called A, sending the pivot indices to IPIV, and spitting error information
// to INFO. also don't forget (like I did) that when you pass a two-dimensional
// array to a function you need to specify the number of "rows"
dgetrf_(&N, &N, M[0], &N, pivotArray, &errorHandler);
printf ("dgetrf eh, %d, should be zero\n", errorHandler);
dgetri_(&N, M[0], &N, pivotArray, lapackWorkspace, &NN, &errorHandler);
printf ("dgetri eh, %d, should be zero\n", errorHandler);
for (size_t row = 0; row < N; ++row)
{ for (size_t col = 0; col < N; ++col)
{ printf ("%g", M[row][col]);
if (N-1 != col)
{ printf (", "); } }
if (N-1 != row)
{ printf ("\n"); } }
return 0; }
built and run like this:
$ gcc foo.c -llapack
$ ./a.out
dgetrf eh, 0, should be zero
dgetri eh, 0, should be zero
-1.56667, 0.466667, 0.1
1.13333, 0.0666667, -0.2
0.1, -0.2, 0.1
EDIT
The mystery no longer appears to be a mystery. I think the computations are being done in column-major order, as they must, but I am both inputting and printing the matrices as if they were in row-major order. I have two bugs that cancel each other out so things look row-ish even though they're column-ish.

Resources