Can eigen supports inconsistent input and output types? - eigen3

For example, I want to use Eigen to do matrix multiply. But the type of input matrix is int16_t, and the type of output is int32_t. So it causes a compiler error.
Showing Recent Issues
/Eigen/src/Core/AssignEvaluator.h:834:3: Static_assert failed due to requirement 'Eigen::internal::has_ReturnType::Scalar, assign_op > >::value' "YOU_MIXED_DIFFERENT_NUMERIC_TYPES__YOU_NEED_TO_USE_THE_CAST_METHOD_OF_MATRIXBASE_TO_CAST_NUMERIC_TYPES_EXPLICITLY"
Below is the test code:
#include <iostream>
typedef Eigen::Matrix<int16_t, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> MatX16;
typedef Eigen::Matrix<int32_t, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> MatX32;
#define MAP_MATRIX(name, ptr, M, N) Eigen::Map<MatX32> name(ptr, M, N)
#define MAP_CONST_MATRIX(name, ptr, M, N) Eigen::Map<const MatX16> name(ptr, M, N)
int main(int argc, const char * argv[]) {
int M, N, K;
M = 10; N = 10; K = 10;
// eigen int16xint16 = int32
int16_t lhs[100] = {1};
int16_t rhs[100] = {2};
int32_t res[100] = {0};
MAP_CONST_MATRIX(eA, lhs, M, K);
MAP_CONST_MATRIX(eB, rhs, K, N);
MAP_MATRIX(eC, res, M, N);
eC = eA * eB;
return 0;
}

The product of two int16 matrices will be an int16 again. You can cast the result to int32:
eC = (eA * eB).cast<int32_t>();
However, what you probably actually want, is to cast the original factors to int32. Additionally, you can tell Eigen that eC will not alias with either eA or eB:
eC.noalias() = eA.cast<int32_t>() * eB.cast<int32_t>();
Note that casting is not vectorized (yet), so you probably get sub-optimal code with this. Your compiler might be smart enough to partially auto-vectorize the product, though.

Related

Is _mm256_store_ps() function is atomic ? while using alongside openmp

I am trying to create a simple program that uses Intel's AVX technology and perform vector multiplication and addition. Here I am using Open MP alongside this. But it is getting segmentation fault due to the function call _mm256_store_ps().
I have tried with OpenMP atomic features like atomic, critical, etc so that if this function is atomic in nature and multiple cores are attempting to execute at the same time, but it is not working.
#include<stdio.h>
#include<time.h>
#include<stdlib.h>
#include<immintrin.h>
#include<omp.h>
#define N 64
__m256 multiply_and_add_intel(__m256 a, __m256 b, __m256 c) {
return _mm256_add_ps(_mm256_mul_ps(a, b),c);
}
void multiply_and_add_intel_total_omp(const float* a, const float* b, const float* c, float* d)
{
__m256 a_intel, b_intel, c_intel, d_intel;
#pragma omp parallel for private(a_intel,b_intel,c_intel,d_intel)
for(long i=0; i<N; i=i+8) {
a_intel = _mm256_loadu_ps(&a[i]);
b_intel = _mm256_loadu_ps(&b[i]);
c_intel = _mm256_loadu_ps(&c[i]);
d_intel = multiply_and_add_intel(a_intel, b_intel, c_intel);
_mm256_store_ps(&d[i],d_intel);
}
}
int main()
{
srand(time(NULL));
float * a = (float *) malloc(sizeof(float) * N);
float * b = (float *) malloc(sizeof(float) * N);
float * c = (float *) malloc(sizeof(float) * N);
float * d_intel_avx_omp = (float *)malloc(sizeof(float) * N);
int i;
for(i=0;i<N;i++)
{
a[i] = (float)(rand()%10);
b[i] = (float)(rand()%10);
c[i] = (float)(rand()%10);
}
double time_t = omp_get_wtime();
multiply_and_add_intel_total_omp(a,b,c,d_intel_avx_omp);
time_t = omp_get_wtime() - time_t;
printf("\nTime taken to calculate with AVX2 and OMP : %0.5lf\n",time_t);
}
free(a);
free(b);
free(c);
free(d_intel_avx_omp);
return 0;
}
I expect that I will get d = a * b + c but it is showing segmentation fault. I have tried to perform the same task without OpenMP and it working errorless. Please let me know if there is any compatibility issue or I am missing any part.
gcc version 7.3.0
Intel® Core™ i3-3110M Processor
OS Ubuntu 18.04
Open MP 4.5, I have executed the command $ echo |cpp -fopenmp -dM |grep -i open and it showed #define _OPENMP 201511
Command to compile, gcc first_int.c -mavx -fopenmp
** UPDATE **
As per the discussions and suggestions, the new code is,
float * a = (float *) aligned_alloc(N, sizeof(float) * N);
float * b = (float *) aligned_alloc(N, sizeof(float) * N);
float * c = (float *) aligned_alloc(N, sizeof(float) * N);
float * d_intel_avx_omp = (float *)aligned_alloc(N, sizeof(float) * N);
This working without perfectly.
Just a note, I was trying to compare general calculations, avx calculation and avx+openmp calculation. This is the result I got,
Time taken to calculate without AVX : 0.00037
Time taken to calculate with AVX : 0.00024
Time taken to calculate with AVX and OMP : 0.00019
N = 50000
The documentation for _mm256_store_ps says:
Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
You can use _mm256_storeu_si256 instead for unaligned stores.
A better option is to align all your arrays on a 32-byte boundary (for 256-bit avx registers) and use aligned load and stores for maximum performance because unaligned loads/stores crossing a cache line boundary incur performance penalty.
Use std::aligned_alloc (or C11 aligned_alloc, memalign, posix_memalign, whatever you have available) instead of malloc(size), e.g.:
float* allocate_aligned(size_t n) {
constexpr size_t alignment = alignof(__m256);
return static_cast<float*>(aligned_alloc(alignment, sizeof(float) * n));
}
// ...
float* a = allocate_aligned(N);
float* b = allocate_aligned(N);
float* c = allocate_aligned(N);
float* d_intel_avx_omp = allocate_aligned(N);
In C++-17 new can allocate with alignment:
float* allocate_aligned(size_t n) {
constexpr auto alignment = std::align_val_t{alignof(__m256)};
return new(alignment) float[n];
}
Alternatively, use Vc: portable, zero-overhead C++ types for explicitly data-parallel programming that aligns heap-allocated SIMD vectors for you:
#include <cstdio>
#include <memory>
#include <chrono>
#include <Vc/Vc>
Vc::float_v random_float_v() {
alignas(Vc::VectorAlignment) float t[Vc::float_v::Size];
for(unsigned i = 0; i < Vc::float_v::Size; ++i)
t[i] = std::rand() % 10;
return Vc::float_v(t, Vc::Aligned);
}
unsigned reverse_crc32(void const* vbegin, void const* vend) {
unsigned const* begin = reinterpret_cast<unsigned const*>(vbegin);
unsigned const* end = reinterpret_cast<unsigned const*>(vend);
unsigned r = 0;
while(begin != end)
r = __builtin_ia32_crc32si(r, *--end);
return r;
}
int main() {
constexpr size_t N = 65536;
constexpr size_t M = N / Vc::float_v::Size;
std::unique_ptr<Vc::float_v[]> a(new Vc::float_v[M]);
std::unique_ptr<Vc::float_v[]> b(new Vc::float_v[M]);
std::unique_ptr<Vc::float_v[]> c(new Vc::float_v[M]);
std::unique_ptr<Vc::float_v[]> d_intel_avx_omp(new Vc::float_v[M]);
for(unsigned i = 0; i < M; ++i) {
a[i] = random_float_v();
b[i] = random_float_v();
c[i] = random_float_v();
}
auto t0 = std::chrono::high_resolution_clock::now();
for(unsigned i = 0; i < M; ++i)
d_intel_avx_omp[i] = a[i] * b[i] + c[i];
auto t1 = std::chrono::high_resolution_clock::now();
double seconds = std::chrono::duration_cast<std::chrono::duration<double>>(t1 - t0).count();
unsigned crc = reverse_crc32(d_intel_avx_omp.get(), d_intel_avx_omp.get() + M); // Make sure d_intel_avx_omp isn't optimized out.
std::printf("crc: %u, time: %.09f seconds\n", crc, seconds);
}
Parallel version:
#include <tbb/parallel_for.h>
// ...
auto t0 = std::chrono::high_resolution_clock::now();
tbb::parallel_for(size_t{0}, M, [&](unsigned i) {
d_intel_avx_omp[i] = a[i] * b[i] + c[i];
});
auto t1 = std::chrono::high_resolution_clock::now();
You must use aligned memory for these intrinsics. Change your malloc(...) to aligned_alloc(sizeof(float) * 8, ...) (C11).
This is completely unrelated to atomics. You are working on entirely separate pieces of data (even on different cache lines), so there is no need for any protection.

C flat array using matlab syntax

I'd like to use my own type matrix in C using a syntax matlab-like to access it. Is there a solution using the preprocessor? Thanks. (The following code doesn't work).
include <stdio.h>
#define array(i,j) array.p[i*array.nrows+j] //???????????
typedef struct
{
unsigned int nrows;
unsigned int ncols;
float* p;
} matrix;
int main()
{
unsigned int i=4,j=5;
float v=154;
matrix a;
a.p=(float*) malloc(10*sizeof(float));
array(i,j)=v;
return 0;
}
You would have to pass in the name of your array to the macro, but yes, you could do something like that.
Just as an FYI, MATLAB order is more generally known as "column major". C order is more generally known as "row major order".
I have taken the liberty of correcting 1) your memory allocation and 2) your initialization of the dimensions since they are necessary for the macro to work properly:
include <stdio.h>
#define INDEX(mat, i, j) (mat).p[(i) * (mat).nrows + (j)]
typedef struct
{
unsigned int nrows;
unsigned int ncols;
float *p;
} matrix;
int main()
{
unsigned int i = 4, j = 5;
float v = 154;
matrix a = {i, j, NULL};
a.p = malloc(i * j * sizeof(float));
INDEX(a, i - 1, j - 1) = v;
return 0;
}
Here the order is column major, but the index is still zero-based. I have highlighted this by accessing index [i - 1, j - 1] instead of [i, j]. If you want one-based indexing to really conform to MATLAB's way of doing things, you can change your macro to this:
#define INDEX(mat, i, j) (mat).p[((i) - 1) * (mat).nrows + (j) - 1]
Then in main, you could do:
INDEX(a, i, j) = v;
Maybe this approach could help you
#include <stdio.h>
#define array(arr,i,j) (arr.p[ i * arr.nrows + j ]) //???????????
typedef struct
{
unsigned int nrows;
unsigned int ncols;
float* p;
} matrix;
int main()
{
unsigned int i=4,j=5;
float v=154;
matrix a;
a.p=(float*) malloc(10*sizeof(float));
array(a,i,j)=v;
return 0;
}
If you will try to do: a(i,j) it will try to make a call to "a".
thank you for your help.
I fixed some trivial errors in the initial code.
I see your answers and I suppose that a matlab syntax cannot be implemented using a generic #define rule in the preprocessor for several structs.
Instead, a #define rule must be written for each specific struct as the following code.
Thank you so much again.
#include <stdio.h>
#define a(i,j) a.p[i*a.nrows+j]
typedef struct
{
unsigned int nrows;
unsigned int ncols;
float* p;
} matrix;
int main()
{
unsigned int i=4,j=5;
float v=154;
matrix a;
a.nrows=10;
a.ncols=10;
a.p=(float *) malloc(100*sizeof(float));
a(i,j)=v;
return 0;
}

C malloc of pointer to pointer inside function not giving correct size

So I am now rewriting my fortran code in C (to use CUDA), and apparently I do not understand how to properly use malloc and pointers. I am trying to make the main function just calls to other functions, which need to malloc arrays that will then be used inside other functions. So, I am passing pointers of pointers to them as per this post: C Programming: malloc() inside another function
But the right amount of memory is not being allocated so I get segmentation faults. Here is the code:
#include <stdio.h>
#include <stdlib.h>
//#include <cuda.h>
#include <math.h>
//#include "cublas.h"
//datatype to match FORTRAN complex type
typedef float real;
typedef struct{
int nx;
int ny;
int nz;
int sz;
int tz;
} states;
void set_SPB(real **,int,states **,states **,int **);
//void set_SPB();
int find_minimum(int a[], int n,int start);
const real hc =197.32697,pi=3.1415927;
int main(){
int nmax = 2, A = 28;
real *etemp, *fock;
int *Ndex,*lookup,*lookup_a;
states *channel,*SPB;
//!generates the single particle basis to be used
set_SPB(&etemp,nmax,&SPB,&channel,&Ndex);
free(etemp);
free(Ndex);
free(SPB);
return 0;
}
void set_SPB(real **etemp,int nmax,states **SPB,states **channel,int **Ndex){
int tot_orbs = (2*nmax+1)*(2*nmax+1)*(2*nmax+1)*4;
int D = tot_orbs/4;
int Nalpha = (2*nmax+1)*(2*nmax+1)*(2*nmax+1)*9;
real E;
*etemp = (real*)malloc(D);
*Ndex = (int*)malloc(D*3);
*SPB = (states*)malloc(tot_orbs);
printf("orbits without spin degeneracy %d \n",D);
printf("size of etemp %ld \n",sizeof(*etemp)/sizeof(*etemp[0]));
return;
int i = 0;
for(int nx =-nmax;nx<=nmax;nx++){
for(int ny =-nmax;ny<=nmax;ny++){
for(int nz =-nmax;nz<=nmax;nz++){
E = 0.5*4.0*pi*pi*(nx*nx+ny*ny+nz*nz);
//printf("%d\n",i);
*etemp[i] = E;
*Ndex[0*D+i] =nx;
*Ndex[1*D+i] = ny;
*Ndex[2*D+i] = nz;
i+=1;
}
}
}
return;
}
Also I am not sure exactly if my assignments of the arrays are correct.
Specifically the print to find the number of elements of that have been allocated always gives 2, when it should be D = 125.
I cannot believe that float and int take only 1 byte in your environment.
Multiply the size to be allocated by size of their elements.
*etemp = malloc(sizeof(**etemp) * D);
*Ndex = malloc(sizeof(**Ndex) * D*3);
*SPB = malloc(sizeof(**SPB) * tot_orbs); /* not sure because this is not used */
Note that they say you shouldn't cast the result of malloc() in C.
Also note that [] operator has higher precedence than * operator, so you have to use parentheses to use the arrays.
(*etemp)[i] = E;
(*Ndex)[0*D+i] =nx;
(*Ndex)[1*D+i] = ny;
(*Ndex)[2*D+i] = nz;

lapack dgels_ segmentation fault 11

I am trying to use LAPACK's dgels_ in C to solve a linear least squares problem. I have to read the matrix A (assumed to have full rank and m>=n) and a vector b from 2 text files. I can easily compile my code, but when i try to run it I get a "segmentation fault 11", but I can't really see why. It is my first time using LAPACK so I don't know if maybe I am using the dgels_ function wrong?? The way I get it the solution x will be overwritten in the vector b? :
lssolve.c:
#include <stdlib.h>
#include <stdio.h>
#include "linalg.h"
/* C prototype for LAPACK routine DGELS */
void dgels_(const char * trans, const int * m, const int * n, const int *
nrhs, double * A, const int * lda, double * B, const int * ldb, double * work,
int * lwork,int * info);
int main(int argc, char * argv[]) {
vector_t * b_t = NULL;
matrix_t * A_t = NULL;
char trans = 'N';
int m, n, nrhs, mb, lda, ldb, info, lwork;
double optwork;
double * work;
// we read the matrix A and the vector b:
b_t = read_vector("b.txt");
A_t = read_matrix("A.txt");
m = A_t-> m; //number of rows in A
n = A_t-> n; //number of columns in A
nrhs = 1; //number of columns in B (will always be 1, since we read b_t with read_vector)
mb = b_t -> n; //number of rows in B
if (mb != m ) { //end program if A and B doesn't have the same number of rows
free(A_t);
free(b_t);
fprintf(stderr, "Sorry, but the matrix A and the vector b have incompatible dimensions. Good Bye!\n");
exit(EXIT_FAILURE);
}
//We make A and B into the wanted input form for the dgels_-function:
double * B = b_t -> v;
double ** A = A_t ->A;
lda = m;
ldb = mb;
//we calculate the optimal size of the work array:
lwork = -1;
dgels_(&trans, &n, &m, &nrhs, *A, &lda, B, &ldb, &optwork, &lwork, &info);
lwork = (int)optwork;
//we allocate space for the work array:
work = (double*)malloc( lwork*sizeof(double));
//solving the least squares problem:
dgels_(&trans, &n, &m, &nrhs, *A, &lda, B, &ldb, work, &lwork, &info);
//Check whether there was an successful exit:
if (info > 0){
fprintf(stderr, "Sorry, but illegal arguments were used, and therefore a least square solution cannot be computes. Good Bye!\n");
exit(EXIT_FAILURE);
} else if(info < 0){
fprintf(stderr, "Sorry, but A doesn't have full rank, and therefore a least square solution cannot be computed. Good Bye!\n");
exit(EXIT_FAILURE);
}
//Saving the least square problem as a vector_t:
vector_t * x = NULL;
x->n = mb;
x->v = B;
print_vector(x);
//Free memory
free_vector(b_t);
free_matrix(A_t);
free_vector(x);
return(EXIT_SUCCESS);
}
I am using the functions read_matrix, read_vector, print_vector, print_matrix and free_vector, which is why I use the struct vector_t and matrix_t:
typedef struct vector {
unsigned long n; /* length of vector */
double * v; /* pointer to array of length n */
} vector_t;
typedef struct matrix {
unsigned long m; /* number of rows */
unsigned long n; /* number of columns */
double ** A; /* pointer to two-dimensional array */
} matrix_t;
I don't think that anything is wrong with read_vector and read_matrix because I can easily do this and use print_vector or print_matrix before I do all of the other operations.
You dereference a NULL pointer here, causing the segfault:
//Saving the least square problem as a vector_t:
vector_t * x = NULL;
x->n = mb;
x->v = B;
Maybe you should use/create a new vector_t instead of just a pointer to a vector_t?

How do I parallelize this triple loop in an efficient way?

I'm trying to parallelize a function which takes as input three arrays (x, y, and prb) and one scalar, and outputs three arrays (P1, Pt1, and Px).
The original c code is here (the outlier and E are inconsequential):
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#define max(A, B) ((A) > (B) ? (A) : (B))
#define min(A, B) ((A) < (B) ? (A) : (B))
void cpd_comp(
double* x,
double* y,
double* prb,
double* sigma2,
double* outlier,
double* P1,
double* Pt1,
double* Px,
double* E,
int N,
int M,
int D
)
{
int n, m, d;
double ksig, diff, razn, outlier_tmp, sp;
double *P, *temp_x;
P = (double*) calloc(M, sizeof(double));
temp_x = (double*) calloc(D, sizeof(double));
ksig = -2.0 * *sigma2;
for (n=0; n < N; n++) {
sp=0;
for (m=0; m < M; m++) {
razn=0;
for (d=0; d < D; d++) {
diff=*(x+n+d*N)-*(y+m+d*M); diff=diff*diff;
razn+=diff;
}
*(P+m)=exp(razn/ksig) ;
sp+=*(P+m);
}
*(Pt1+n)=*(prb+n);
for (d=0; d < D; d++) {
*(temp_x+d)=*(x+n+d*N)/ sp;
}
for (m=0; m < M; m++) {
*(P1+m)+=((*(P+m)/ sp) **(prb+n));
for (d=0; d < D; d++) {
*(Px+m+d*M)+= (*(temp_x+d)**(P+m)**(prb+n));
}
}
*E += -log(sp);
}
*E +=D*N*log(*sigma2)/2;
free((void*)P);
free((void*)temp_x);
return;
}
Here is my attempt at parallelizing it:
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>
/*headers*/
void cpd_comp(
float * x, //Points to register [N*D]
float * y, //Points to be registered [M*D]
float * prb, //Vector of probabilities [N]
float * sigma2, //Square of sigma
float ** P1, //P1, output, [M]
float ** Pt1, //Pt1, output, [N]
float ** Px, //Px, output, [M*3]
int N, //Number of points, i.e. rows, in x
int M //Number of points, i.e. rows, in
);
__global__ void d_computeP(
float * P,
float * P1,
float * Px,
float * ProbabilityMatrix,
float * x,
float * y,
float * prb,
float ksig,
const int N,
const int M);
__global__ void d_sumP(
float * sp,
float * P1timessp,
float * Pxtimessp,
float * P1,
float * Px,
const int N,
const int M);
/*implementations*/
void cpd_comp(
float * x, //Points to register [N*D]
float * y, //Points to be registered [M*D]
float * prb, //Vector of probabilities [N]
float * sigma2, //Scalar
float ** P1, //P1, output, [M]
float ** Pt1, //Pt1, output, [N]
float ** Px, //Px, output, [M*3]
int N, //Number of points, i.e. rows, in x
int M //Number of points, i.e. rows, in y
){
//X is generatedPointPos
//Y is points
float
*P,
*P1timessp,
*Pxtimessp,
ksig = -2.0 * (*sigma2),
*h_sumofP = new float[N], //sum of P, on host
*d_sumofP; //sum of P, on device
cudaMalloc((void**)&P, sizeof(float)*M*N);
cudaMalloc((void**)&P1timessp,sizeof(float)*M*N);
cudaMalloc((void**)&Pxtimessp,sizeof(float)*M*N*3);
cudaMalloc((void**)&d_sumofP, sizeof(float)*N);
cudaMalloc((void**)P1, sizeof(float)*M);
cudaMalloc((void**)Px, sizeof(float)*M*3);
cudaMalloc((void**)Pt1, sizeof(float)*N);
d_computeP<<<dim3(N,M/1024+1),M>1024?1024:M>>>(P,P1timessp,Pxtimessp,NULL,x,y,prb,ksig,N,M);
for(int n=0; n<N; n++){
thrust::device_ptr<float>dev_ptr(P);
h_sumofP[n] = thrust::reduce(dev_ptr+M*n,dev_ptr+M*(n+1),0.0f,thrust::plus<float>());
}
cudaMemcpy(d_sumofP,h_sumofP,sizeof(float)*N,cudaMemcpyHostToDevice);
d_sumP<<<M/1024+1,M>1024?1024:M>>>(d_sumofP,P1timessp,Pxtimessp,*P1,*Px,N,M);
cudaMemcpy(*Pt1,prb,sizeof(float)*N,cudaMemcpyDeviceToDevice);
cudaFree(P);
cudaFree(P1timessp);
cudaFree(Pxtimessp);
cudaFree(d_sumofP);
delete[]h_sumofP;
}
/*kernels*/
__global__ void d_computeP(
float * P,
float * P1,
float * Px,
float * ProbabilityMatrix,
float * x,
float * y,
float * prb,
float ksig,
const int N,
const int M){
//thread configuration: <<<dim3(N,M/1024+1),1024>>>
int m = threadIdx.x+blockIdx.y*blockDim.x;
int n = blockIdx.x;
if(m>=M || n>=N) return;
float
x1 = x[3*n],
x2 = x[3*n+1],
x3 = x[3*n+2],
diff1 = x1 - y[3*m],
diff2 = x2 - y[3*m+1],
diff3 = x3 - y[3*m+2],
razn = diff1*diff1+diff2*diff2+diff3*diff3,
Pm = __expf(razn/ksig), //fast exponentiation
prbn = prb[n];
P[M*n+m] = Pm;
__syncthreads();
P1[N*m+n] = Pm*prbn;
Px[3*(N*m+n)+0] = x1*Pm*prbn;
Px[3*(N*m+n)+1] = x2*Pm*prbn;
Px[3*(N*m+n)+2] = x3*Pm*prbn;
}
__global__ void d_sumP(
float * sp,
float * P1timessp,
float * Pxtimessp,
float * P1,
float * Px,
const int N,
const int M){
//computes P1 and Px
//thread configuration: <<<M/1024+1,1024>>>
int m = threadIdx.x+blockIdx.x*blockDim.x;
if(m>=M) return;
float
P1m = 0,
Pxm1 = 0,
Pxm2 = 0,
Pxm3 = 0;
for(int n=0; n<N; n++){
float spn = 1/sp[n];
P1m += P1timessp[N*m+n]*spn;
Pxm1 += Pxtimessp[3*(N*m+n)+0]*spn;
Pxm2 += Pxtimessp[3*(N*m+n)+1]*spn;
Pxm3 += Pxtimessp[3*(N*m+n)+2]*spn;
}
P1[m] = P1m;
Px[3*m+0] = Pxm1;
Px[3*m+1] = Pxm2;
Px[3*m+2] = Pxm3;
}
However, to my horror, it runs much, much slower than the original version. How do I make it run faster? Please explain things thoroughly since I am very new to CUDA and parallel programming and have no experience in algorithms.
Do note that the c version has column-major ordering and the CUDA version has row-major. I have done several tests to make sure that the result is correct. It's just extremely slow and takes up a LOT of memory.
Any help is greatly appreciated!
EDIT: More information: N and M are on the order of a few thousand (say, 300-3000) and D is always 3. The CUDA version expects arrays to be device memory, except for variables prefixed with h_.
Before trying any CUDA-specific optimizations, profile your code to see where time is being spent.
Try and arrange your array reads/writes so that each CUDA thread uses a strided access pattern. For example, currently you have
int m = threadIdx.x+blockIdx.y*blockDim.x;
int n = blockIdx.x;
if(m>=M || n>=N) return;
diff1 = x1 - y[3*m],
diff2 = x2 - y[3*m+1],
diff3 = x3 - y[3*m+2],
So thread 1 will read from y[0],y[1],y[2] etc. Instead, rearrange your data so that thread 1 reads from y[0],y[M],y[2*M] and thread 2 reads from y[1],y[M+1],y[2*M+1] etc. You should follow this access pattern for other arrays.
Also, you may want to consider whether you can avoid the use of __syncthreads(). I don't quite follow why it's necessary in this algorithm, it might be worth removing it to see if it improves performance ( even if it produces incorrect results ).
The key to good CUDA performance is almost always to make as near to optimal memory access as possible. Your memory access pattern looks very similar to matrix multiplication. I would start with a good CUDA matrix multiplication implementation, being sure to understand why it's implemented the way it is, and then modify that to suit your needs.

Resources