I was trying to create a sparse distributed matrix with SuperLU, but I'm running into trouble.
Based on the SuperLU documentation, I'm using the following function:
void dCreate_CompRowLoc_Matrix_dist(SuperMatrix *A, int m, int n,
                                    int nnz_loc, int m_loc, int fst_row,
                                    double *nzval, int *colind, int *rowptr,
                                    Stype_t stype, Dtype_t dtype, Mtype_t mtype);
but it seems that, whatever I pass to it, a segmentation fault happens.
I've tried passing a very simple 2x2 matrix with only 2 nonzeros, running with 1 process, which means something like:
m = 2;
n = 2;
nnz_loc = 2;
m_loc = 2;
fst_row = 0;
nzval[0] = 1.0;
nzval[1] = 2.0;
colind[0] = 0;
colind[1] = 1;
rowptr[0] = 0;
rowptr[1] = 1;
rowptr[2] = 2;
...SLU_NC, SLU_D, SLU_GE
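In full, the call I'm attempting looks roughly like this (a simplified single-process sketch; build_matrix is just an illustrative wrapper name, and the arrays are heap-allocated in my real code):

#include <stdlib.h>
#include "superlu_ddefs.h"   /* SuperLU_DIST definitions */

/* Sketch of the call I'm attempting on 1 MPI process; error checking omitted. */
void build_matrix(SuperMatrix *A)
{
    int m = 2, n = 2, nnz_loc = 2, m_loc = 2, fst_row = 0;

    double *nzval  = (double *) malloc(nnz_loc * sizeof(double));
    int    *colind = (int *)    malloc(nnz_loc * sizeof(int));
    int    *rowptr = (int *)    malloc((m_loc + 1) * sizeof(int));

    nzval[0]  = 1.0;  nzval[1]  = 2.0;
    colind[0] = 0;    colind[1] = 1;
    rowptr[0] = 0;    rowptr[1] = 1;  rowptr[2] = 2;

    /* Stype/Dtype/Mtype exactly as listed above */
    dCreate_CompRowLoc_Matrix_dist(A, m, n, nnz_loc, m_loc, fst_row,
                                   nzval, colind, rowptr,
                                   SLU_NC, SLU_D, SLU_GE);
}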
I still get a segmentation fault error.
I assume I'm not fully understanding how the function is meant to be used.
Can anyone help me with this? (Or, if more information is needed, please let me know.)
many thanks
Simone
Update/1
Just as further information, I've noticed that the local row matrices are created correctly (with the correct position for each element/row), but they do not seem to be "collapsed" into a global SuperMatrix A.
For my project, I've written a naive C implementation of direct 3D convolution with periodic padding on the input. Unfortunately, since I'm new to C, the performance isn't great. Here's the code:
int mod(int a, int b)
{
    // calculate mod to get the correct index with periodic padding
    int r = a % b;
    return r < 0 ? r + b : r;
}

void convolve3D(const double *image, const double *kernel, const int imageDimX, const int imageDimY, const int imageDimZ, const int kernelDimX, const int kernelDimY, const int kernelDimZ, double *result)
{
    int imageSize = imageDimX * imageDimY * imageDimZ;
    int kernelSize = kernelDimX * kernelDimY * kernelDimZ;
    int i, j, k, l, m, n;
    int kernelCenterX = (kernelDimX - 1) / 2;
    int kernelCenterY = (kernelDimY - 1) / 2;
    int kernelCenterZ = (kernelDimZ - 1) / 2;
    int xShift, yShift, zShift;
    int outIndex, outI, outJ, outK;
    int imageIndex = 0, kernelIndex = 0;

    // Loop through each voxel
    for (k = 0; k < imageDimZ; k++){
        for (j = 0; j < imageDimY; j++) {
            for (i = 0; i < imageDimX; i++) {
                kernelIndex = 0;
                // for each voxel, loop through each kernel coefficient
                for (n = 0; n < kernelDimZ; n++){
                    for (m = 0; m < kernelDimY; m++) {
                        for (l = 0; l < kernelDimX; l++) {
                            // find the index of the corresponding voxel in the output image
                            xShift = l - kernelCenterX;
                            yShift = m - kernelCenterY;
                            zShift = n - kernelCenterZ;
                            outI = mod((i - xShift), imageDimX);
                            outJ = mod((j - yShift), imageDimY);
                            outK = mod((k - zShift), imageDimZ);
                            outIndex = outK * imageDimX * imageDimY + outJ * imageDimX + outI;
                            // calculate and add
                            result[outIndex] += kernel[kernelIndex] * image[imageIndex];
                            kernelIndex++;
                        }
                    }
                }
                imageIndex++;
            }
        }
    }
}
By convention, all the matrices (image, kernel, result) are stored in column-major fashion, and that's why I loop through them in this order, so that neighboring accesses are closer in memory (I've heard this helps).
I know the implementation is very naive, but since it's written in C, I was hoping the performance would be good; instead it's a little disappointing. I tested it with an image of size 100^3 and a kernel of size 10^3 (about 10^9 multiply-add operations in total), and it took ~7 s, which I believe is way below the capability of a typical CPU.
If possible, could you guys help me optimize this routine?
I'm open to anything that could help, with just a few things I'd ask you to consider:
The problem I'm working with can be big (e.g. an image of size 200 by 200 by 200 with a kernel of size 50 by 50 by 50, or even larger). I understand that one way of optimizing this is to convert the problem into a matrix multiplication and use a BLAS GEMM routine, but I'm afraid memory could not hold such a big matrix.
Due to the nature of the problem, I would prefer direct convolution to FFT-based convolution, since my model was developed with direct convolution in mind, and my impression of FFT convolution is that it gives slightly different results than direct convolution, especially for rapidly changing images, a discrepancy I'm trying to avoid.
That said, I'm in no way an expert in this, so if you have a great implementation based on FFT convolution, and/or my impression of FFT convolution is totally biased, I would really appreciate your help.
The input images are assumed to be periodic, so periodic padding is necessary
I understand that utilizing BLAS/SIMD or other lower-level approaches would definitely help a lot here, but since I'm a newbie I don't really know where to start... I would really appreciate it if you could point me in the right direction, if you have experience with these libraries.
Thanks a lot for your help, and please let me know if you need more info about the nature of the problem
As a first step, replace your mod ((i - xShift), imageDimX) with something like this:
inline int clamp( int x, int size )
{
    if( x < 0 ) return x + size;
    if( x >= size ) return x - size;
    return x;
}
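The three wrapped index computations in the inner loop then become, for example:

outI = clamp(i - xShift, imageDimX);
outJ = clamp(j - yShift, imageDimY);
outK = clamp(k - zShift, imageDimZ);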
These branches are very predictable because they yield the same result for long runs of consecutive elements, and integer modulo is relatively slow.
Now, the next step (ordered by cost/benefit) is parallelizing. If you have any modern C++ compiler, just enable OpenMP somewhere in the project settings. After that you need 2 changes.
Decorate your outermost loop with something like this: #pragma omp parallel for schedule(guided)
Move your function-level variables inside that loop. This also means you'll have to compute the initial imageIndex from your k, for each iteration.
Next option: rework your code so you only write each output value once. Compute the final value in your innermost 3 loops, reading from random locations in both image and kernel, and only write the result once. When you have that result[outIndex] += in the inner loop, the CPU stalls waiting for the data from memory. When you accumulate in a variable that's a register, not memory, there's no access latency.
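Combining that write-once rework with the OpenMP change above, the loop nest might look roughly like this (a sketch only; it gathers from the input instead of scattering to the output, which flips the sign of the shift, and it reuses clamp and the variable names from the question):

#pragma omp parallel for schedule(guided)
for (int k = 0; k < imageDimZ; k++) {
    for (int j = 0; j < imageDimY; j++) {
        for (int i = 0; i < imageDimX; i++) {
            double acc = 0.0;                      // accumulate in a register
            int kernelIndex = 0;
            for (int n = 0; n < kernelDimZ; n++) {
                for (int m = 0; m < kernelDimY; m++) {
                    for (int l = 0; l < kernelDimX; l++) {
                        // gather: read the input voxel that would have scattered here
                        int inI = clamp(i + (l - kernelCenterX), imageDimX);
                        int inJ = clamp(j + (m - kernelCenterY), imageDimY);
                        int inK = clamp(k + (n - kernelCenterZ), imageDimZ);
                        acc += kernel[kernelIndex++] *
                               image[(size_t)inK * imageDimX * imageDimY + inJ * imageDimX + inI];
                    }
                }
            }
            result[(size_t)k * imageDimX * imageDimY + j * imageDimX + i] = acc;  // single write
        }
    }
}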
SIMD is the most complicated optimization here. In short, you'll need the maximum FMA width your hardware has (if you have AVX and need double precision, that width is 4), and you'll also need multiple independent accumulators in your 3 innermost loops, to avoid hitting the latency as opposed to saturating the throughput. Here's my answer to a much easier problem as an example of what I mean.
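To illustrate just the multiple-accumulator idea, on a plain dot product (which is roughly what one kernel row of the inner loop boils down to, ignoring the wrap-around indexing):

// Four independent partial sums break the dependency chain of a single "sum +=",
// so the multiply-add units stay busy; the compiler can also map this to SIMD/FMA.
static double dot4(const double *a, const double *b, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          // remainder
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}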
I am implementing the Guo-Hall algorithm for a microcontroller. The problem is that, due to its architecture, I cannot use OpenCV. I have the algorithm working fine except for one problem. In the following code a struct is passed through the thinning iteration; the struct contains both the 2D array and a boolean indicating whether or not a change was made to the array.
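For reference, the struct is essentially this (field names as used in the code below; bool comes from <stdbool.h>):

#include <stdbool.h>

struct IterRet {
    int *i;    /* flattened 2D image data */
    bool b;    /* true if the iteration changed the image */
};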
int* thinning(int* it, int x, int y)
{
    for(int i = 0; i < x*y; ++i)
        it[i] /= 255;

    struct IterRet base;
    base.i = it;
    base.b = false;

    do
    {
        base = thinningIteration(base, x, y, 0);
        base = thinningIteration(base, x, y, 1);
    }
    while (base.b);

    for(int i = 0; i < x*y; ++i)
        base.i[i] *= 255;

    return base.i;
}
When I change the while condition to while (0), a single iteration passes and the matrix is properly returned.
When I leave the while loop as is, it goes on indefinitely.
I have narrowed the problem down to the fact that base is reset after each run of the do-while loop.
What would cause this? I can give more code if this is too narrow of a view for it.
I ran your code as it is; it did not go on indefinitely, but ran through once and stopped. However, there are two places where I made a suggested change. It's really just a readability/style thing, not something that will change the behavior of your code in this case.
See commented and replacement lines below.
In thinningIteration()
struct IterRet thinningIteration(struct IterRet it, int x, int y, int iter)
{
    //int* marker = malloc(x*y* sizeof *marker);
    int* marker = malloc(x*y* sizeof(int));
In main()
//int* src = malloc( sizeof *src * x * y);
int* src = malloc( sizeof (int) * x * y);
Unfortunately, these edits do not address the main issue you asked about, but again, running the code did not exhibit the behavior you described.
If you can add more detail about the issues you observed, please leave a comment, and if I can, I will attempt to help.
I noticed strange (incorrect) behavior after compiling and executing a CUDA program, and was able to isolate it to the following minimal example. First I define an export-to-CSV function for integer arrays (just for debugging convenience):
#include <stdio.h>
#include <stdlib.h>

void int1DExportCSV(int *ptr, int n){
    FILE *f;
    f = fopen("1D IntOutput.CSV", "w");
    int i = 0;
    for (i = 0; i < n-1; i++){
        fprintf(f, "%i,", ptr[i]);
    }
    fprintf(f, "%i", ptr[n-1]);
    fclose(f);
}
Then I defined a kernel function which increases a certain element of an input array by one:
__global__ void kernel(int *ptr){
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + gridDim.x * y;
    ptr[offset] += 1;
}
The main function allocates a vector of zeros called a, allocates an empty array b, and allocates a device copy of a called dev_a:
#define DIM 64

int main(void){
    int *a;
    a = (int*)malloc(DIM*DIM*sizeof(int));
    int i;
    for(i = 0; i < DIM*DIM; i++){
        a[i] = 0;
    }
    int *b;
    b = (int*)malloc(DIM*DIM*sizeof(int));
    int *dev_a;
    cudaMalloc( (void**)&dev_a, sizeof(int)*DIM*DIM );
    cudaMemcpy( dev_a, a, DIM*DIM*sizeof(int), cudaMemcpyHostToDevice );
Then I feed dev_a into a DIM-by-DIM-by-DIM grid of blocks, each with DIM threads, copy the results back, and export them to CSV:
    dim3 blocks(DIM,DIM,DIM);
    kernel<<<blocks,DIM>>>(dev_a);
    cudaMemcpy( b, dev_a, sizeof(int)*DIM*DIM, cudaMemcpyDeviceToHost );
    cudaFree(dev_a);
    int1DExportCSV(b, DIM*DIM);
}
The resulting CSV file is DIM*DIM entries long, and is filled with DIM's. The length is correct, but it should be filled with DIM*DIM's, since I am essentially launching a DIM*DIM*DIM*DIM hypercube of threads, in which the last two dimensions are all devoted to incrementing a unique element of the device array dev_a by one.
My first reaction was to suspect that the ptr[offset] += 1 step might be the culprit, since multiple threads are potentially executing this step at exactly the same time, so each thread might be updating an old copy of ptr while unaware that a bunch of other threads are doing the same. However, I don't know enough about the "taboos of CUDA" to tell whether this is a reasonable guess or not.
Hardware problems are (to the best of my knowledge) not an issue; I am using a GTX560 Ti, so launching a 3-dimensional grid of blocks is allowed, and my thread count per block is 64, well below the maximum of 1024 imposed by the Fermi architecture.
Am I making a simple mistake? Or is there a subtle error in my example?
Additionally, I noticed that when I increase DIM to 256, the resulting array appears to be filled with random integers between 290 and 430! I am completely baffled by this behavior.
No, it's not safe. The threads in a block are stepping on each other.
Your threads in each threadblock are all updating the same location in memory:
ptr[offset] += 1;
offset is the same for every thread in the block:
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + gridDim.x * y;
That is a no-no. The results are undefined.
Instead use atomics:
atomicAdd(ptr+offset, 1);
or a parallel reduction method of some sort.
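Applied to the kernel in the question, only the increment line changes (sketch):

__global__ void kernel(int *ptr){
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + gridDim.x * y;
    atomicAdd(ptr + offset, 1);   // well-defined even when many threads hit the same offset
}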
I'm trying to optimize some of my C code, which is a lot bigger than the snippet below. Coming from Python, I wonder whether you can simply multiply an entire array by a number, like I do below.
Evidently, it does not work the way I do it below. Is there any other way that achieves the same thing, or do I have to step through the entire array as in the for loop?
void main()
{
    int i;
    float data[] = {1.,2.,3.,4.,5.};

    //this fails
    data *= 5.0;

    //this works
    for(i = 0; i < 5; i++) data[i] *= 5.0;
}
There is no shortcut; you have to step through each element of the array.
Note however that in your example, you may achieve a speedup by using int rather than float for both your data and multiplier.
If you want, you can do this through BLAS (Basic Linear Algebra Subprograms), which is optimised. BLAS is not part of the C standard; it is a package you have to install yourself.
Sample code to achieve what you want:
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main () {
    int limit = 10;
    float *a = calloc( limit, sizeof(float));
    for ( int i = 0; i < limit ; i++){
        a[i] = i;
    }

    cblas_sscal( limit , 0.5f, a, 1);

    for ( int i = 0; i < limit ; i++){
        printf("%3f, " , a[i]);
    }
    printf("\n");
}
The function names are not obvious, but once you read the naming conventions you can start to guess what the BLAS functions do. sscal() can be split into s for single precision and scal for scale, which means this function works on floats. The same function for double precision is called dscal().
If you need to scale a vector by a constant and add it to another vector, BLAS has a function for that too:
saxpy(), which breaks down as s a x p y: single precision (float), a*x plus y, i.e. y[i] += a * x[i].
As you might guess there is a daxpy() too which works on doubles.
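For example, to compute y = 5*x + y on float arrays (a sketch; x and y are assumed to be allocated and filled elsewhere):

#include <cblas.h>

/* y[i] += 5.0f * x[i] for i = 0..n-1, with unit strides */
void scale_add(int n, const float *x, float *y)
{
    cblas_saxpy(n, 5.0f, x, 1, y, 1);
}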
I'm afraid that, in C, you will have to use for(i = 0; i < 5; i++) data[i] *= 5.0;.
Python allows for so many more "shortcuts"; however, in C, you have to access each element and then manipulate those values.
Using the for-loop would be the shortest way to accomplish what you're trying to do to the array.
EDIT: If you have a large amount of data, there are more efficient (in terms of running time) ways to multiply each value by 5. Check out loop tiling, for example.
data *= 5.0;
Here data is the address of the array, which is a constant, so you cannot assign to it.
If you want to multiply the first value in that array, use the dereference operator * as below:
*data *= 5.0;
I am reprogramming a piece of MATLAB code in mex (using C). So far my C version is about twice as fast as the MATLAB code. Now I have three questions, all related to the code below:
How can I speed up this code more?
Do you see any problems with this code? I ask because I don't know mex very well and I am also not a C guru ;-) ... I am aware that there should be some checks in the code (for example whether realloc actually succeeds), but I left them out for the sake of simplicity for the moment.
Is it possible that MATLAB optimizes so well that I really can't get much more than twice as fast in C?
The code should be more or less platform independent (Windows, Linux, Unix, Mac, different hardware), so I don't want to use assembler or specific linear algebra libraries. That's why I programmed the stuff myself...
#include <mex.h>
#include <math.h>
#include <matrix.h>

void mexFunction(
        int nlhs, mxArray *plhs[],
        int nrhs, const mxArray *prhs[])
{
    double epsilon = ((double)(mxGetScalar(prhs[0])));
    int strengthDim = ((int)(mxGetScalar(prhs[1])));
    int lenPartMat = ((int)(mxGetScalar(prhs[2])));
    int numParts = ((int)(mxGetScalar(prhs[3])));
    double *partMat = mxGetPr(prhs[4]);
    const mxArray* verletListCells = prhs[5];
    mxArray *verletList;

    double *pseSum = (double *) malloc(numParts * sizeof(double));
    for(int i = 0; i < numParts; i++) pseSum[i] = 0.0;

    float *tempVar = NULL;

    for(int i = 0; i < numParts; i++)
    {
        verletList = mxGetCell(verletListCells,i);
        int numberVerlet = mxGetM(verletList);

        tempVar = (float *) realloc(tempVar, numberVerlet * sizeof(float) * 2);

        for(int a = 0; a < numberVerlet; a++)
        {
            tempVar[a*2] = partMat[((int) (*(mxGetPr(verletList) + a))) - 1] - partMat[i];
            tempVar[a*2 + 1] = partMat[((int) (*(mxGetPr(verletList) + a))) - 1 + lenPartMat] - partMat[i + lenPartMat];
            tempVar[a*2] = pow(tempVar[a*2],2);
            tempVar[a*2 + 1] = pow(tempVar[a*2 + 1],2);
            tempVar[a*2] = tempVar[a*2] + tempVar[a*2 + 1];
            tempVar[a*2] = sqrt(tempVar[a*2]);
            tempVar[a*2] = 4.0/(pow(epsilon,2) * M_PI) * exp(-(pow((tempVar[a*2]/epsilon),2)));
            pseSum[i] = pseSum[i] + ((partMat[((int) (*(mxGetPr(verletList) + a))) - 1 + 2*lenPartMat] - partMat[i + (2 * lenPartMat)]) * tempVar[a*2]);
        }
    }

    plhs[0] = mxCreateDoubleMatrix(numParts,1,mxREAL);

    for(int a = 0; a < numParts; a++)
    {
        *(mxGetPr(plhs[0]) + a) = pseSum[a];
    }

    free(tempVar);
    free(pseSum);
}
So this is the improved version, which is about 12 times faster than the MATLAB version. The conversion is still eating up a lot of time, but I'm leaving that aside for now, because I have to change something in MATLAB for it. So let's first focus on the remaining C code. Do you see any more potential in the following code?
#include <mex.h>
#include <math.h>
#include <matrix.h>

void mexFunction(
        int nlhs, mxArray *plhs[],
        int nrhs, const mxArray *prhs[])
{
    double epsilon = ((double)(mxGetScalar(prhs[0])));
    int strengthDim = ((int)(mxGetScalar(prhs[1])));
    int lenPartMat = ((int)(mxGetScalar(prhs[2])));
    double *partMat = mxGetPr(prhs[3]);
    const mxArray* verletListCells = prhs[4];
    int numParts = mxGetM(verletListCells);
    mxArray *verletList;

    plhs[0] = mxCreateDoubleMatrix(numParts,1,mxREAL);
    double *pseSum = mxGetPr(plhs[0]);

    double epsilonSquared = epsilon*epsilon;
    double preConst = 4.0/((epsilonSquared) * M_PI);
    int numberVerlet = 0;
    double tempVar[2];

    for(int i = 0; i < numParts; i++)
    {
        verletList = mxGetCell(verletListCells,i);
        double *verletListPtr = mxGetPr(verletList);
        numberVerlet = mxGetM(verletList);

        for(int a = 0; a < numberVerlet; a++)
        {
            int adress = ((int) (*(verletListPtr + a))) - 1;

            tempVar[0] = partMat[adress] - partMat[i];
            tempVar[1] = partMat[adress + lenPartMat] - partMat[i + lenPartMat];
            tempVar[0] = tempVar[0]*tempVar[0] + tempVar[1]*tempVar[1];
            tempVar[0] = preConst * exp(-(tempVar[0]/epsilonSquared));

            pseSum[i] += (partMat[adress + 2*lenPartMat] - partMat[i + 2*lenPartMat]) * tempVar[0];
        }
    }
}
You do not need to allocate pseSum locally and then copy the data to the output later. You can simply allocate a MATLAB object and get the pointer to its memory:
plhs[0] = mxCreateDoubleMatrix(numParts,1,mxREAL);
pseSum = mxGetPr(plhs[0]);
Thus you will not have to initialize pseSum to 0, because MATLAB already does it in mxCreateDoubleMatrix.
Remove all the mxGetPr calls from the inner loop and assign their results to variables beforehand.
Instead of casting doubles to ints consider using int32 or uint32 arrays in MATLAB. Casting double to int is expensive. The internal loop computations would look like
tempVar[a*2] = partMat[somevar[a] - 1] - partMat[i];
You use such constructs in your code
((int) (*(mxGetPr(verletList) + a)))
You do it because verletList is a 'double' array (that is the default in MATLAB), which holds integer values. Instead, you should use an integer array. Before you call your mex file, type in MATLAB:
verletList = int32(verletList);
Then you will not need the type cast to int above. You will simply write
((int*)mxGetData(verletList))[a]
or better yet, assign earlier
somevar = (int*)mxGetData(verletList);
and later write
somevar[a]
precompute 4.0/(pow(epsilon,2) * M_PI) before all loops! That is one expensive constant.
pow((tempVar[a*2]/epsilon), 2) is simply tempVar[a*2]^2/epsilon^2. You calculate sqrt(tempVar[a*2]) just before. Why square it again now?
Generally do not use pow(x, 2). Just write x*x
I would add some sanity checks on the parameters, especially if you demand integers. Either use MATLABs int32/uint32 type, or check that what you get actually is an integer.
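For example, a minimal check before grabbing the data pointer might look like this (sketch):

/* Refuse anything that is not int32, instead of silently casting doubles. */
if (!mxIsInt32(verletList))
    mexErrMsgTxt("verletList must be an int32 array");
int *somevar = (int *) mxGetData(verletList);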
Edit, for the new code:
Compute -1/epsilonSquared before the loops and then compute exp(minvepssq*tempVar[0]). Note that the result might differ slightly; it depends on what you need, but if you don't care about the exact order of operations, do it.
Define a register variable pseSum_r and use it to sum the results in the inner loop; after the loop assign it to pseSum[i] (see the combined sketch after this list). If you want more fun, you can write the result to memory using an SSE streaming store (the _mm_stream_pd compiler intrinsic).
do remove double to int cast
most likely irrelevant, but try to change tempVar[0/1] to normal variables. Irrelevant, because the compiler should do that for you. But again, an array is not needed here.
parallelise the external loop with OpenMP. Trivial (at least the simplest version without thinking about data layout for NUMA architectures) since there is no dependence between the iterations.
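Putting the precomputed constant, the register accumulator, and the OpenMP pragma together, the hot loop of the new code might look roughly like this (a sketch only, keeping the names from the question):

double minvepssq = -1.0 / epsilonSquared;       /* precomputed once, outside all loops */

#pragma omp parallel for schedule(dynamic)      /* iterations have varying cost */
for (int i = 0; i < numParts; i++)
{
    /* note: only read-only mx accessors are used inside the parallel region */
    mxArray *verletList = mxGetCell(verletListCells, i);
    double *verletListPtr = mxGetPr(verletList);
    int numberVerlet = mxGetM(verletList);

    double pseSum_r = 0.0;                      /* register accumulator for this particle */
    for (int a = 0; a < numberVerlet; a++)
    {
        int adress = ((int) verletListPtr[a]) - 1;
        double dx = partMat[adress] - partMat[i];
        double dy = partMat[adress + lenPartMat] - partMat[i + lenPartMat];
        double r2 = dx*dx + dy*dy;
        double w  = preConst * exp(minvepssq * r2);
        pseSum_r += (partMat[adress + 2*lenPartMat] - partMat[i + 2*lenPartMat]) * w;
    }
    pseSum[i] = pseSum_r;                       /* single write per particle */
}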
Can you estimate ahead of time what the maximum size of tempVar will be, and allocate memory for it before the loop instead of using realloc? Reallocating memory is a time-consuming operation, and if your numParts is large this could have a huge impact. Take a look at this question.
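If the maximum is not known analytically, one cheap option is a first pass over the cell array (a sketch, using the names from the first version of the code):

/* Scan once for the longest verlet list, then allocate tempVar a single time. */
int maxVerlet = 0;
for (int i = 0; i < numParts; i++)
{
    int len = mxGetM(mxGetCell(verletListCells, i));
    if (len > maxVerlet) maxVerlet = len;
}
float *tempVar = (float *) malloc((size_t)maxVerlet * 2 * sizeof(float));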