I am trying for the first time to use LAPACK from C to diagonalize a matrix, and I am stuck.
I have been trying to modify this example http://rcabreral.blogspot.co.uk/2010/05/eigenvalues-clapack.html from zgeev to dgeev. I have looked at the DGEEV input parameters, http://www.netlib.org/lapack/explore-html/d9/d28/dgeev_8f.html, but it seems I don't understand them well enough.
Hence, the code below produces:
** On entry to DGEEV parameter number 9 had an illegal value
EDIT: The error occurs in the call to dgeev spanning lines 48 through 53.
EDIT: Note that the arguments differ from the specification here
http://www.netlib.org/lapack/explore-html/d9/d28/dgeev_8f.html
in that they have been translated to pointers. That is necessary when calling these Fortran routines from C, as explained here:
http://www.physics.orst.edu/~rubin/nacphy/lapack/cprogp.html
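For example, a sketch of that convention (using the BLAS routine dscal, which follows the same rule; it is separate from my code below):

/* Fortran routine DSCAL(N, DA, DX, INCX) computes x := da * x. From C,
   every argument is passed by address, and the symbol gets a trailing
   underscore under the usual gfortran/f2c naming convention. */
extern void dscal_(int *n, double *da, double *dx, int *incx);

void demo(void) {
    double x[3] = {1., 2., 3.};
    int n = 3, inc = 1;
    double a = 2.0;
    dscal_(&n, &a, x, &inc);   /* x is now {2., 4., 6.} */
}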
#include <stdio.h>
#include <math.h>
#include <complex.h>
#include <stdlib.h>
//.........................................................................
void dgeTranspose(double *Transposed, double *M, int n) {
    int i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            Transposed[i + n*j] = M[i*n + j];
}
//.........................................................................
// MatrixComplexEigensystem: computes the eigenvectors and eigenValues of input matrix A
// The eigenvectors are stored in columns
//.........................................................................
void MatrixComplexEigensystem(double *eigenvectorsVR, double *eigenvaluesW, double *A, int N) {
    int i;
    double *AT = (double *) malloc(N*N*sizeof(double));

    dgeTranspose(AT, A, N);

    char JOBVL = 'N';   // Do not compute left eigenvectors
    char JOBVR = 'V';   // Compute right eigenvectors

    double VL[1];
    int LDVL = 1;
    int LDVR = N;
    int LWORK = 4*N;
    double *WORK = (double *)malloc(LWORK*sizeof(double));
    double *RWORK = (double *)malloc(2*N*sizeof(double));
    int INFO;

    dgeev_(&JOBVL, &JOBVR, &N, AT, &N, eigenvaluesW,
           VL, &LDVL,
           eigenvectorsVR, &LDVR,
           WORK,
           &LWORK, RWORK, &INFO);

    dgeTranspose(AT, eigenvectorsVR, N);
    for (i = 0; i < N*N; i++) eigenvectorsVR[i] = AT[i];

    free(WORK);
    free(RWORK);
    free(AT);
}
int main() {
    int i, j;
    const int N = 3;
    double A[] = { 1.+I, 2., 3, 4., 5.+I, 6., 7., 8., 9.+I };
    double eigenVectors[N*N];
    double eigenValues[N];

    MatrixComplexEigensystem(eigenVectors, eigenValues, A, N);

    printf("\nEigenvectors\n");
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) printf("%e", eigenVectors[i*N + j]);
        printf("\n");
    }
    printf("\nEigenvalues \n");
    for (i = 0; i < N; i++) printf("%e", eigenValues[i]);
    printf("\n------------------------------------------------------------\n");
    return 0;
}
You cannot port directly from zgeev to dgeev. zgeev takes a complex matrix and computes complex eigenvalues, while dgeev takes a real matrix whose eigenvalues may still be complex. To stay in real arithmetic, LAPACK uses two arrays, WR and WI, for the real and imaginary parts of each eigenvalue. Also, dgeev takes no RWORK argument; that belongs to zgeev. Because your call passes a single eigenvalue array where dgeev expects both WR and WI, every later argument is shifted by one position, which is why LAPACK reports an illegal value for parameter 9 (LDVL).
So note that the dgeev definition is
void dgeev_(char* JOBVL, char* JOBVR, int* N, double* A, int* LDA, double* WR, double* WI, double* VL, int* LDVL, double* VR, int* LDVR, double* WORK, int* LWORK, int* INFO);
My suggestion for your example is to remove:
#include <complex.h>
remove the I's from the matrix of doubles:
double A[] = { 1. , 2. , 3 , 4. , 5. , 6. , 7., 8., 9.};
then double the size of the eigenvalues vector:
double eigenValues[2*N];
and call dgeev using WR and WI, dropping RWORK:
double *eigenvaluesWR = eigenvaluesW;
double *eigenvaluesWI = eigenvaluesW + N;

dgeev_(&JOBVL, &JOBVR, &N, AT, &N,
       eigenvaluesWR, eigenvaluesWI,
       VL, &LDVL,
       eigenvectorsVR, &LDVR,
       WORK, &LWORK, &INFO);
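For reference, here is a complete minimal program putting those changes together. It is a sketch, not tested against your setup: it assumes a Fortran LAPACK with the usual trailing-underscore symbol names (link with -llapack) and prints each eigenvalue as WR[k] + i*WI[k].

#include <stdio.h>

// Fortran LAPACK prototype: every argument is passed by pointer.
extern void dgeev_(char *JOBVL, char *JOBVR, int *N, double *A, int *LDA,
                   double *WR, double *WI, double *VL, int *LDVL,
                   double *VR, int *LDVR, double *WORK, int *LWORK,
                   int *INFO);

int main() {
    int N = 3;
    // LAPACK reads A in column-major order: column 1 first, then 2, 3.
    double A[9] = { 1., 4., 7.,   2., 5., 8.,   3., 6., 9. };
    double WR[3], WI[3], VR[9], VL[1], WORK[12];
    int LDVL = 1, LDVR = 3, LWORK = 12, INFO;   // LWORK = 4*N
    char JOBVL = 'N', JOBVR = 'V';
    int k;

    dgeev_(&JOBVL, &JOBVR, &N, A, &N, WR, WI, VL, &LDVL,
           VR, &LDVR, WORK, &LWORK, &INFO);

    if (INFO != 0) { printf("dgeev failed: INFO = %d\n", INFO); return 1; }
    for (k = 0; k < N; k++)
        printf("lambda_%d = %g %+gi\n", k, WR[k], WI[k]);
    return 0;
}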
I am writing a simple code for the addition of the elements of two matrices A and B; the code is quite simple and is inspired by the example given in chapter 2 of the CUDA C Programming Guide.
#include <stdio.h>
#include <stdlib.h>
#define N 2
__global__ void MatAdd(int A[][N], int B[][N], int C[][N]) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}
int main() {
    int A[N][N] = {{1,2},{3,4}};
    int B[N][N] = {{5,6},{7,8}};
    int C[N][N] = {{0,0},{0,0}};
    int (*pA)[N], (*pB)[N], (*pC)[N];

    cudaMalloc((void**)&pA, (N*N)*sizeof(int));
    cudaMalloc((void**)&pB, (N*N)*sizeof(int));
    cudaMalloc((void**)&pC, (N*N)*sizeof(int));

    cudaMemcpy(pA, A, (N*N)*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(pB, B, (N*N)*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(pC, C, (N*N)*sizeof(int), cudaMemcpyHostToDevice);

    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);

    cudaMemcpy(C, pC, (N*N)*sizeof(int), cudaMemcpyDeviceToHost);

    int i, j;
    printf("C = \n");
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            printf("%d ", C[i][j]);
        }
        printf("\n");
    }

    cudaFree(pA);
    cudaFree(pB);
    cudaFree(pC);
    printf("\n");
    return 0;
}
When I run it, I keep getting the initial matrix C = [0 0; 0 0] instead of the element-wise addition of the two matrices A and B. I have previously done another example on the addition of the elements of two arrays and it seemed to work fine; however, this time I don't know why it does not work.
I believe there's something wrong with the cudaMalloc commands, but I don't really know what else it could be.
Any ideas?
Using MatAdd<<<numBlocks,threadsPerBlock>>>(pA,pB,pC); instead of MatAdd<<<numBlocks,threadsPerBlock>>>(A,B,C); solves the problem.
The reason is that A, B, and C are allocated on the CPU, while pA, pB, and pC are allocated on the GPU, using cudaMalloc(). Once pA, pB, and pC are allocated, the values are sent from the CPU to the GPU by cudaMemcpy(pA, A, (N*N)*sizeof(int), cudaMemcpyHostToDevice);
Then the addition is performed on the GPU, that is, with pA, pB, and pC. To use printf, the result pC is sent from the GPU to the CPU via cudaMemcpy(C, pC, (N*N)*sizeof(int), cudaMemcpyDeviceToHost);
Think as if the CPU cannot see pA and the GPU cannot see A.
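A bug like this is also easy to catch with error checking around the launch. A minimal sketch of the corrected sequence (same names as the code above; the checks are my addition, not part of the original program):

MatAdd<<<numBlocks, threadsPerBlock>>>(pA, pB, pC);   // device pointers

// A kernel handed host pointers tends to fail without any visible
// symptom; querying the error state after the launch and after a
// synchronize makes the failure visible.
cudaError_t err = cudaGetLastError();
if (err == cudaSuccess) err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    fprintf(stderr, "MatAdd failed: %s\n", cudaGetErrorString(err));

cudaMemcpy(C, pC, (N*N)*sizeof(int), cudaMemcpyDeviceToHost);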
I am very new to CUDA programming and was reading the 'CUDA C Programming Guide' provided by NVIDIA.
(http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf)
On page 25 it has the following C code that does matrix multiplication. Can you please tell me how I can make that code run on two devices (if I have two NVIDIA CUDA-capable cards installed in my computer)? Can you please show me with an example?
// Thread block size
#define BLOCK_SIZE 16

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;
// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col, float value)
{
    A.elements[row * A.stride + col] = value;
}
// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width = BLOCK_SIZE;
    Asub.height = BLOCK_SIZE;
    Asub.stride = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row + BLOCK_SIZE * col];
    return Asub;
}
// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);
// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = d_A.stride = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = d_B.stride = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = d_C.stride = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}
// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;

    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;

    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub.
    // Multiply each pair of sub-matrices together
    // and accumulate the results.
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m)
    {
        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load Asub and Bsub from device memory to shared memory;
        // each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();

        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write Csub to device memory;
    // each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}
There is no "automatic" way to run a CUDA kernel on multiple GPUs.
You will need to devise a way to decompose the matrix multiplication problem into independent operations that can run concurrently, one per GPU. As a simple example:
C = A.B is equivalent to C = [A].[B1|B2] = [A.B1|A.B2], where B1 and B2 are suitably sized matrices containing the columns of the matrix B and | denotes columnwise concatenation. You can calculate A.B1 and A.B2 as separate matrix multiplication operations, and then perform the concatenation when copying the resulting submatrices back to host memory.
Once you have a suitable decomposition scheme, you then implement it using the standard multi-GPU facilities in the CUDA 4.x API. For a great overview of multi-GPU programming using the CUDA APIs, I recommend watching Paulius Micikevicius' excellent talk from GTC 2012, which is available as a streaming video and PDF here.
The basics are described in the CUDA C Programming Guide under section 3.2.6.
Basically, you can set which GPU your current host thread operates on by calling cudaSetDevice(). Still, you have to write your own code to decompose your routines across multiple GPUs; a sketch follows below.
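To make that concrete, here is a minimal sketch of my own (not from the guide). Since the guide stores matrices in row-major order, the row-wise analogue of the column split above, C = [A1;A2].B = [A1.B;A2.B], keeps each half contiguous in host memory, so the guide's MatMul() host routine can be reused unchanged, once per device. It assumes exactly two GPUs and A.height divisible by 2*BLOCK_SIZE:

void MatMulTwoGPUs(const Matrix A, const Matrix B, Matrix C)
{
    int halfRows = A.height / 2;
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);            // route subsequent CUDA calls to 'dev'

        // Views onto contiguous row blocks of the host matrices
        Matrix Ahalf = A, Chalf = C;
        Ahalf.height = Chalf.height = halfRows;
        Ahalf.elements = A.elements + dev * halfRows * A.width;
        Chalf.elements = C.elements + dev * halfRows * C.width;

        MatMul(Ahalf, B, Chalf);       // the guide's single-GPU host code
    }
}

Note that as written the two halves run back-to-back rather than concurrently, because MatMul()'s cudaMemcpy calls are synchronous; real overlap needs streams or one host thread per device, which is exactly what the talk above covers.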
I would like to be able to compute the inverse of a general NxN matrix in C/C++ using lapack.
My understanding is that the way to do an inversion in lapack is by using the dgetri function, however, I can't figure out what all of its arguments are supposed to be.
Here is the code I have:
void dgetri_(int* N, double* A, int* lda, int* IPIV, double* WORK, int* lwork, int* INFO);

int main() {
    double M[9] = {
        1, 2, 3,
        4, 5, 6,
        7, 8, 9
    };
    return 0;
}
How would you complete it to obtain the inverse of the 3x3 matrix M using dgetri_?
Here is the working code for computing the inverse of a matrix using LAPACK in C/C++:
#include <cstdio>

extern "C" {
    // LU decomposition of a general matrix
    void dgetrf_(int* M, int* N, double* A, int* lda, int* IPIV, int* INFO);

    // generate inverse of a matrix given its LU decomposition
    void dgetri_(int* N, double* A, int* lda, int* IPIV, double* WORK, int* lwork, int* INFO);
}

void inverse(double* A, int N)
{
    int *IPIV = new int[N];
    int LWORK = N*N;
    double *WORK = new double[LWORK];
    int INFO;

    dgetrf_(&N, &N, A, &N, IPIV, &INFO);
    dgetri_(&N, A, &N, IPIV, WORK, &LWORK, &INFO);

    delete[] IPIV;
    delete[] WORK;
}

int main() {
    double A[2*2] = {
        1, 2,
        3, 4
    };
    inverse(A, 2);
    printf("%f %f\n", A[0], A[1]);
    printf("%f %f\n", A[2], A[3]);
    return 0;
}
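For reference, it builds against the reference LAPACK on a typical Linux setup like this (a sketch; library names and flags vary by distribution):

g++ inverse.cpp -llapack -o inverse
./inverse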
First, M has to be a two-dimensional array, like double M[3][3]. Your array is, mathematically speaking, a 1x9 vector, which is not invertible.
N is a pointer to an int for the order of the matrix; in this case, N=3.

A is a pointer to the LU factorization of the matrix, which you can get by running the LAPACK routine dgetrf.

LDA is an integer for the leading dimension of the matrix, which lets you pick out a subset of a bigger matrix if you want to just invert a little piece. If you want to invert the whole matrix, LDA should just be equal to N.

IPIV is the pivot indices of the matrix; in other words, it's a list of instructions of what rows to swap in order to invert the matrix. IPIV should be generated by the LAPACK routine dgetrf.

LWORK and WORK are the "workspaces" used by LAPACK. If you are inverting the whole matrix, LWORK should be an int equal to N^2, and WORK should be a double array with LWORK elements.

INFO is just a status variable to tell you whether the operation completed successfully. Since not all matrices are invertible, I would recommend that you send this to some sort of error-checking system. INFO = 0 for a successful operation, INFO = -i if the i'th argument had an incorrect input value, and INFO > 0 if the matrix is not invertible.
So, for your code, I would do something like this:
// Prototypes for the Fortran routines (all arguments by address):
extern void dgetrf_(int* M, int* N, double* A, int* LDA, int* IPIV, int* INFO);
extern void dgetri_(int* N, double* A, int* LDA, int* IPIV, double* WORK, int* LWORK, int* INFO);

int main() {
    double M[3][3] = { {1, 2, 3},
                       {4, 5, 6},
                       {7, 8, 9} };
    int N = 3;
    int pivotArray[3];      // since our matrix has three rows
    int errorHandler;
    double lapackWorkspace[9];
    int LWORK = 9;

    // dgetrf computes the LU factorization of an M-by-N matrix A,
    // sending the pivot indices to IPIV and error information to INFO
    dgetrf_(&N, &N, M[0], &N, pivotArray, &errorHandler);
    // some sort of error check
    dgetri_(&N, M[0], &N, pivotArray, lapackWorkspace, &LWORK, &errorHandler);
    // another error check
    return 0;
}
Here is a working version of the above using the OpenBLAS interface to LAPACKE.
Link with the OpenBLAS library (LAPACKE is already included):
#include <stdio.h>
#include "cblas.h"
#include "lapacke.h"

// inplace inverse of the n x n matrix A.
// matrix A is column-major (i.e. first column, then second column, ...,
// *not* C[][] row order)
// returns:
//   ret = 0 on success
//   ret < 0 illegal argument value
//   ret > 0 singular matrix
lapack_int matInv(double *A, unsigned n)
{
    int ipiv[n+1];
    lapack_int ret;

    ret = LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, A, n, ipiv);
    if (ret != 0)
        return ret;

    ret = LAPACKE_dgetri(LAPACK_COL_MAJOR, n, A, n, ipiv);
    return ret;
}
int main()
{
    double A[] = {
        0.378589, 0.971711, 0.016087, 0.037668, 0.312398,
        0.756377, 0.345708, 0.922947, 0.846671, 0.856103,
        0.732510, 0.108942, 0.476969, 0.398254, 0.507045,
        0.162608, 0.227770, 0.533074, 0.807075, 0.180335,
        0.517006, 0.315992, 0.914848, 0.460825, 0.731980
    };

    for (int i = 0; i < 25; i++) {
        if ((i % 5) == 0) putchar('\n');
        printf("%+12.8f ", A[i]);
    }
    putchar('\n');

    matInv(A, 5);

    for (int i = 0; i < 25; i++) {
        if ((i % 5) == 0) putchar('\n');
        printf("%+12.8f ", A[i]);
    }
    putchar('\n');
}
Example:
% g++ -I [OpenBlas path]/include/ example.cpp [OpenBlas path]/lib/libopenblas.a
% a.out
+0.37858900 +0.97171100 +0.01608700 +0.03766800 +0.31239800
+0.75637700 +0.34570800 +0.92294700 +0.84667100 +0.85610300
+0.73251000 +0.10894200 +0.47696900 +0.39825400 +0.50704500
+0.16260800 +0.22777000 +0.53307400 +0.80707500 +0.18033500
+0.51700600 +0.31599200 +0.91484800 +0.46082500 +0.73198000
+0.24335255 -2.67946180 +3.57538817 +0.83711880 +0.34704217
+1.02790497 -1.05086895 -0.07468137 +0.71041070 +0.66708313
-0.21087237 -4.47765165 +1.73958308 +1.73999641 +3.69324020
-0.14100897 +2.34977565 -0.93725915 +0.47383541 -2.15554470
-0.26329660 +6.46315378 -4.07721533 -3.37094863 -2.42580445
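Since cblas.h is already linked in, a quick sanity check is to multiply the original matrix by the computed inverse and confirm the product is close to the identity. A sketch (Aorig stands for a copy of A saved before matInv() ran, a name introduced here for illustration; add #include <stdlib.h> for malloc):

void checkInverse(const double *Aorig, const double *Ainv, int n)
{
    // C = Aorig * Ainv in column-major order; should print ~identity
    double *C = (double *)malloc(n * n * sizeof(double));
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, Aorig, n, Ainv, n, 0.0, C, n);
    for (int i = 0; i < n * n; i++) {
        if ((i % n) == 0) putchar('\n');
        printf("%+12.8f ", C[i]);
    }
    putchar('\n');
    free(C);
}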
Here is a working version of Spencer Nelson's example above. One mystery about it is that the input matrix is in row-major order, even though it appears to call the underlying Fortran routine dgetri. I am led to believe that all the underlying Fortran routines require column-major order, but I am no expert on LAPACK; in fact, I'm using this example to help me learn it. But, that one mystery aside:
The input matrix in the example is singular. LAPACK tries to tell you that by returning a 3 in the errorHandler. I changed the 9 in that matrix to a 19, getting an errorHandler of 0 signalling success, and compared the result to that from Mathematica. The comparison was also successful and confirmed that the matrix in the example should be in row-major order, as presented.
Here is the working code:
#include <stdio.h>
#include <stddef.h>
#include <lapacke.h>

int main() {
    int N = 3;
    int NN = 9;
    double M[3][3] = { {1, 2, 3},
                       {4, 5, 6},
                       {7, 8, 9} };
    int pivotArray[3];      // since our matrix has three rows
    int errorHandler;
    double lapackWorkspace[9];

    // dgetrf(M,N,A,LDA,IPIV,INFO) computes the LU factorization of an
    // M-by-N matrix A, sending the pivot indices to IPIV and error
    // information to INFO. Also don't forget (like I did) that a
    // two-dimensional array is passed as a pointer to its first row, M[0].
    dgetrf_(&N, &N, M[0], &N, pivotArray, &errorHandler);
    printf("dgetrf eh, %d, should be zero\n", errorHandler);

    dgetri_(&N, M[0], &N, pivotArray, lapackWorkspace, &NN, &errorHandler);
    printf("dgetri eh, %d, should be zero\n", errorHandler);

    for (size_t row = 0; row < N; ++row) {
        for (size_t col = 0; col < N; ++col) {
            printf("%g", M[row][col]);
            if (N-1 != col) { printf(", "); }
        }
        if (N-1 != row) { printf("\n"); }
    }
    return 0;
}
I built and ran it as follows on a Mac:
gcc main.c -llapacke -llapack
./a.out
I did an nm on the LAPACKE library and found the following:
liblapacke.a(lapacke_dgetri.o):
U _LAPACKE_dge_nancheck
0000000000000000 T _LAPACKE_dgetri
U _LAPACKE_dgetri_work
U _LAPACKE_xerbla
U _free
U _malloc
liblapacke.a(lapacke_dgetri_work.o):
U _LAPACKE_dge_trans
0000000000000000 T _LAPACKE_dgetri_work
U _LAPACKE_xerbla
U _dgetri_
U _free
U _malloc
and it looks like there is a LAPACKE [sic] wrapper that would presumably relieve us of having to take addresses everywhere for Fortran's convenience, but I am probably not going to get around to trying it because I have a way forward.
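For anyone who does want to try that path, here is a minimal, untested sketch of the same inversion through the LAPACKE wrappers. They take plain values instead of pointers, and they accept row-major input directly, which would sidestep the transpose mystery below:

#include <lapacke.h>

// returns 0 on success, > 0 if the matrix is singular
int invert3x3(double M[3][3]) {
    int pivots[3];
    int err = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, 3, 3, M[0], 3, pivots);
    if (err == 0)
        err = LAPACKE_dgetri(LAPACK_ROW_MAJOR, 3, M[0], 3, pivots);
    return err;
}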
EDIT
Here is a working version that bypasses LAPACKE [sic], using the LAPACK Fortran routines directly. I do not understand why a row-major input produces correct results, but I confirmed it again in Mathematica.
#include <stdio.h>
#include <stddef.h>
int main() {
    int N = 3;
    int NN = 9;
    double M[3][3] = { {1, 2, 3},
                       {4, 5, 6},
                       {7, 8, 19} };
    int pivotArray[3];      // since our matrix has three rows
    int errorHandler;
    double lapackWorkspace[9];
/* from http://www.netlib.no/netlib/lapack/double/dgetrf.f
SUBROUTINE DGETRF( M, N, A, LDA, IPIV, INFO )
*
* -- LAPACK routine (version 3.1) --
* Univ. of Tennessee, Univ. of California Berkeley and NAG Ltd..
* November 2006
*
* .. Scalar Arguments ..
INTEGER INFO, LDA, M, N
* ..
* .. Array Arguments ..
INTEGER IPIV( * )
DOUBLE PRECISION A( LDA, * )
*/
    extern void dgetrf_(int *m, int *n, double *A, int *LDA, int *IPIV,
                        int *INFO);
/* from http://www.netlib.no/netlib/lapack/double/dgetri.f
SUBROUTINE DGETRI( N, A, LDA, IPIV, WORK, LWORK, INFO )
*
* -- LAPACK routine (version 3.1) --
* Univ. of Tennessee, Univ. of California Berkeley and NAG Ltd..
* November 2006
*
* .. Scalar Arguments ..
INTEGER INFO, LDA, LWORK, N
* ..
* .. Array Arguments ..
INTEGER IPIV( * )
DOUBLE PRECISION A( LDA, * ), WORK( * )
*/
    extern void dgetri_(int *n, double *A, int *LDA, int *IPIV,
                        double *WORK, int *LWORK, int *INFO);
    // dgetrf(M,N,A,LDA,IPIV,INFO) computes the LU factorization of an
    // M-by-N matrix A, sending the pivot indices to IPIV and error
    // information to INFO. Also don't forget (like I did) that a
    // two-dimensional array is passed as a pointer to its first row, M[0].
    dgetrf_(&N, &N, M[0], &N, pivotArray, &errorHandler);
    printf("dgetrf eh, %d, should be zero\n", errorHandler);

    dgetri_(&N, M[0], &N, pivotArray, lapackWorkspace, &NN, &errorHandler);
    printf("dgetri eh, %d, should be zero\n", errorHandler);

    for (size_t row = 0; row < N; ++row) {
        for (size_t col = 0; col < N; ++col) {
            printf("%g", M[row][col]);
            if (N-1 != col) { printf(", "); }
        }
        if (N-1 != row) { printf("\n"); }
    }
    return 0;
}
built and run like this:
$ gcc foo.c -llapack
$ ./a.out
dgetrf eh, 0, should be zero
dgetri eh, 0, should be zero
-1.56667, 0.466667, 0.1
1.13333, 0.0666667, -0.2
0.1, -0.2, 0.1
EDIT
The mystery no longer appears to be a mystery. The computations are being done in column-major order, as they must be, but I am both inputting and printing the matrices as if they were in row-major order. The two bugs cancel each other out: LAPACK reads my row-major layout as the transpose M^T, computes (M^T)^-1 = (M^-1)^T, and my row-major printing transposes the result back, so what gets printed is M^-1 after all. Things look row-ish even though they're column-ish.