Matrix Multiplication CUDA [duplicate] - c

This question already has an answer here:
CUDA Matrix Multiplication write to wrong memory location
(1 answer)
Closed 2 years ago.
I have been reading through several websites and even used NVIDIA's code as a guide, but I am still getting the wrong answer. main asks the user for the size, displays A and B, and then displays the resulting matrix C. However, say I run a 2x2 matrix for both A and B; this is my sample output:
Matrix A
0.000000 8.000000
2.000000 2.000000
Matrix B
3.000000 1.000000
5.000000 7.000000
Matrix C (Results)
0.000000 9.000000
7.000000 4.000000
But that's incorrect. It should be:
40.000 56.000
16.000 16.000
I changed it from decimals to whole numbers so that it would be easier to check, and I found that the result is wrong. I do not understand why it would be incorrect, especially since I took the kernel straight from their code sample.
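(For a small case like this, a plain host-side reference multiply makes checking easy. The helper below is only a sketch under the assumption of square, row-major n x n matrices like the ones printed above; it is not part of the original program.)

// Reference C = A * B on the CPU, for square row-major n x n matrices.
void matMulRef(const float* A, const float* B, float* C, int n)
{
    for (int row = 0; row < n; ++row)
        for (int col = 0; col < n; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[row * n + k] * B[k * n + col];
            C[row * n + col] = sum;
        }
}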
#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_

#include <stdio.h>

// Thread block size
#define BLOCK_SIZE 16
#define TILE_SIZE  16

// CUDA Kernel
__global__ void matrixMul(float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    float Csub = 0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
    {
        // Declaration of the shared memory array As used to store the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Declaration of the shared memory array Bs used to store the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding computation is done
        // before loading two new sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to device memory; each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}

#endif // #ifndef _MATRIXMUL_KERNEL_H_
host code:
//perform the calculation
//setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(c.colSize / threads.x, c.rowSize / threads.y);
// execute the kernel
matrixMul<<< grid, threads >>>(deviceMatrixC, deviceMatrixA, deviceMatrixB, a.colSize, b.colSize);
Thanks for your help,
Dan

The code you are using implicitly requires that the matrix dimensions are round multiples of the block size (16x16 in this case). The inner-product calculation processes a tile width at a time without checking for out-of-bounds memory access. For this reason, 2x2 matrices will not work.
If you try running the kernel with a 16x16 input (for example, zero-padding your 2x2 case to 16x16), you should be able to confirm the result.
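For example, a minimal host-side sketch of that zero-padding idea (a hypothetical helper, assuming row-major storage: pad A and B up to a multiple of BLOCK_SIZE, run the kernel on the padded copies, and read the top-left n x n block of the padded C):

#include <string.h>   // memset, memcpy

// Copy an n x n matrix into the top-left corner of a padded x padded buffer
// that is zero everywhere else (padded must be a multiple of BLOCK_SIZE).
void padMatrix(const float* src, float* dst, int n, int padded)
{
    memset(dst, 0, (size_t)padded * padded * sizeof(float));
    for (int row = 0; row < n; ++row)
        memcpy(&dst[row * padded], &src[row * n], n * sizeof(float));
}
// Because the extra rows and columns are zero, the top-left n x n block of
// (A_padded * B_padded) equals A * B, so the 2x2 result can be read back
// from the first two rows of the padded C.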

Related

"no operator found" when asigning Sparse matrices results to sparse matrices

I have a function that implements a minimization algorithm. I haven't included all the variables, just the matrices, to illustrate the types:
typedef Eigen::SparseMatrix<double> SpMat;
typedef Eigen::VectorXd Vec;
int lm_solver(void (*f_dz)(Vec* x_, int m, Vec* dz_, SpMat* W_),
void (*f_H)(Vec* x_, SpMat* jac_,int n_, int m_),
Vec* x, int nx, int mm, int nnz,
double tol=1e-9, int max_iter = 100){
SpMat A(mm, nx);
SpMat H1(mm, nx);
SpMat H2(mm, nx);
SpMat H(mm, nx);
SpMat W(mm, mm);
Vec rhs(nx);
Vec dz(nx);
Vec dx(nx);
Vec a(1);
Vec b(1);
double f, f_prev, lbmda, rho, nu, tau;
bool updateH, converged;
int iter_;
// reserve matrices memory
H.reserve(nnz);
W.reserve(mm);
while (!converged && iter_ < max_iter){
// get the system matrices
if (updateH){ // if the Jacobian computation is not locked...
f_dz(x, mm, &dz, &W); // Residual increment (z-h(x)) vector creation or update: fill dz and W
f_H(x, &H, nx, mm); // Jacobian matrix creation or update: fill H
// Start forming the auxiliary matrices of A
H1 = H.transpose() * W;
H2 = H1 * H;
}
// set the first value of lmbda
if (iter_ == 1)
lbmda = tau * H2.diagonal().maxCoeff();
// form the system matrix A = H^t·W·H + lambda·I
A = H2 + lbmda * Idn;
// form the right hand side: H^t·W·dz
rhs = H1 * dz;
// Solve the increment: dx = solve(A, rhs);
solver.compute(A);
dx = solver.solve(rhs);
// calculate the objective function: Least squares function
a = 0.5 * dz * W * dz; //vector x matrix x vector -> vector of 1 element
f = a.coeffRef(0);
// calculate the gain ratio
b = 0.5 * dx * (lbmda * dx - rhs); //vector x matrix x vector -> vector of 1 element
rho = (f_prev - f) / b.coeffRef(0);
}
return 0;
}
The process does the following:
Declare sparse matrices (SpMat)
Reserve memory for the matrices
Call external functions to fill H, dz and W
Do matrix multiplications and store the results into intermediate matrices that are sparse too
This function is the only function in a .h file, which is compiled into a static library (.lib).
When I compile the static library alone, it compiles flawlessly.
However, when I use the library from another project, I get the following error:
error: C2679: binary '=' : no operator found which takes a right-hand operand of type 'const Eigen::CwiseBinaryOp' (or there is no acceptable conversion)
\eigen\src/Core/Matrix.h(206): could be 'Eigen::Matrix<_Scalar,_Rows,_Cols> &Eigen::Matrix<_Scalar,_Rows,_Cols>::operator =(const Eigen::Matrix<_Scalar,_Rows,_Cols> &)'
with
[
_Scalar=double,
_Rows=-1,
_Cols=1
]
d:\proyectos\proyectos_i+d\ingrid\eigen\eigen_3_3_3\eigen\src/Core/Matrix.h(281): or 'Eigen::Matrix<_Scalar,_Rows,_Cols> &Eigen::Matrix<_Scalar,_Rows,_Cols>::operator =(Eigen::Matrix<_Scalar,_Rows,_Cols> &&)'
with
[
_Scalar=double,
_Rows=-1,
_Cols=1
]
while trying to match the argument list '(Vec, const Eigen::CwiseBinaryOp)'
This error flags the lines:
H1 = H.transpose() * W;
H2 = H1 * H;
rhs = H1 * dz;
b = 0.5 * dx * (lbmda * dx - rhs);
a = 0.5 * dz * W * dz;
I take from this that I cannot store the result of sparse matrix multiplications in a new sparse matrix, but I don't know how to solve it.
(I'm using Eigen 3.3.3)
I can't tell exactly which lines cause your error, but it looks like it comes from the calculation of a and b. You can't multiply a column vector by another column vector without transposing the first one, e.g.
b = 0.5 * dx.transpose() * (lbmda * dx - rhs);
However, this is actually a dot product, so you should just write
double b = 0.5 * dx.dot(lbmda * dx - rhs);
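Putting that together, a sketch of how the two scalar quantities could be computed (reusing the variable names from the question; this only illustrates the dot-product fix, not the rest of the solver):

// Both quantities are scalars (v^T * M * v and v^T * w), so keep them as doubles.
Vec Wdz = W * dz;                               // sparse * dense -> dense vector
double f_val = 0.5 * dz.dot(Wdz);               // objective: 0.5 * dz^T * W * dz
double b_val = 0.5 * dx.dot(lbmda * dx - rhs);  // gain-ratio denominator
rho = (f_prev - f_val) / b_val;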
The problem was that I wrote all the function bodies in the .h file.
By putting the body of the function in the .cpp, everything compiled fine.
This dichotomy of .h and .cpp is what annoys me the most about C++.
Anyway, for future reference.
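The split described above looks roughly like this (hypothetical file names; the key detail is that default arguments stay on the declaration in the header):

// lm_solver.h -- declaration only, included by library and client code
#pragma once
#include <Eigen/Sparse>
#include <Eigen/Dense>
typedef Eigen::SparseMatrix<double> SpMat;
typedef Eigen::VectorXd Vec;
int lm_solver(void (*f_dz)(Vec* x_, int m, Vec* dz_, SpMat* W_),
              void (*f_H)(Vec* x_, SpMat* jac_, int n_, int m_),
              Vec* x, int nx, int mm, int nnz,
              double tol = 1e-9, int max_iter = 100);

// lm_solver.cpp -- definition, compiled once into the static library
#include "lm_solver.h"
int lm_solver(void (*f_dz)(Vec* x_, int m, Vec* dz_, SpMat* W_),
              void (*f_H)(Vec* x_, SpMat* jac_, int n_, int m_),
              Vec* x, int nx, int mm, int nnz,
              double tol, int max_iter)
{
    // ... body exactly as in the question ...
    return 0;
}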

How to perform reduction on a huge 2D matrix along the row direction using cuda? (max value and max value's index for each row)

I'm trying to implement a reduction along the row direction of a 2D matrix. I'm starting from code I found on Stack Overflow (thanks a lot Robert!):
thrust::max_element slow in comparison cublasIsamax - More efficient implementation?
The above link shows a custom kernel that performs the reduction on a single row. It splits the input row across many blocks of 1024 threads each, and it works very well.
For the 2D case, everything is the same except that there is now a y grid dimension, and each block's y dimension is still 1. The problem is that when I try to write data into shared memory within each block (inside the "max_idx_kernel_reduction_within_block" kernel in the code), it takes very long: more than (# of rows) x (time it takes to reduce one row), at which point I would rather just run a for loop. I know I have a lot of elements, but I was expecting something faster than that.
I don't think the memory access pattern is an issue, but I heard that the TOTAL amount of shared memory might be the limitation?? See: CUDA: Is coalesced global memory access faster than shared memory? Also, does allocating a large shared memory array slow down the program?
Any suggestions to make my code faster (the first kernel is the bottleneck)? Thank you very much, very much appreciated!!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <iostream>
#include <cuda_runtime.h>
#define NCOLS 163317 // number of columns
#define NROWS 8 // number of rows
#define nTPB 1024 // Threads per Block. nTPB should be a power-of-2
#define MAX_BLOCKS_X ((NCOLS/nTPB)+1) // # of blocks I will launch
#define MIN(a,b) ((a>b)?b:a)
#define FLOAT_MIN -1.0f // lowest anticipated number of the data. Values in array will be compared with this and updated with the larger one
#define IDX2F(i,j,ld) ((j-1) * (ld) + ((i) - 1)) // 1 based indexing
#define IDX2C(i,j,ld) ((j) * (ld) + i) // 0 based indexing
__device__ volatile float blk_vals[NROWS][MAX_BLOCKS_X];
__device__ volatile int blk_idxs[NROWS][MAX_BLOCKS_X];
// blk_vals and blk_idxs are the results obtained from reduction within each block.
// after 1st reduction, each row will have blk_vals[MAX_BLOCKS_X] array and blk_idxs[MAX_BLOCKS_X]
// these will be passed to the 2nd kernel
__global__ void max_idx_kernel_reduction_within_block(const float *data, const int xSize, const int ySize){ // first kernel. Reduction within blocks
__shared__ volatile float vals[nTPB]; // Total amount of shared memory per block: 49152 bytes (50 KB). 1024 gives ~ 4KB for single.
__shared__ volatile int idxs[nTPB]; // ~ 4 KB for single, when nTPB is 1024. each block will have both indices and values
int idx = threadIdx.x+blockDim.x * blockIdx.x; // idx in the x direction
float my_val = FLOAT_MIN; // lowest possible number
int my_idx = -1; // to check whether you DID perform the kernel. Again, it's the idx in the x dir.
// sweep from global memory
while (idx < xSize){ // this ensures you don't go out the size of the array's x direction
if (data[IDX2C(blockIdx.y,idx,ySize)] > my_val) {my_val = data[IDX2C(blockIdx.y,idx,ySize)]; my_idx = idx;}
// compare with my_val, and put the bigger value into my_val for next comparison. my_idx is 0 index based
idx += blockDim.x*gridDim.x;}
// until here takes about 6 ms !! very fast!!
// populate shared memory: takes ~ 270 ms
vals[threadIdx.x] = my_val; // put the computed max value for each thread into the shared memory. -> this is the bottleneck!!
idxs[threadIdx.x] = my_idx; // do this for index as well -> this is also slow!!
__syncthreads();
// sweep in shared memory
for (int i = (nTPB>>1); i > 0; i>>=1){
if (threadIdx.x < i) // the first half threads of the block
if (vals[threadIdx.x] < vals[threadIdx.x + i]) {vals[threadIdx.x] = vals[threadIdx.x+i]; idxs[threadIdx.x] = idxs[threadIdx.x+i]; }
// the above is comparing shared memory of threadIdx.x with shared memory of threadIdx.x + i.
// then puts the larger value into shared memory of threadIdx.x
__syncthreads();} // so now in each block, shared memory's first element (index 0) is the max value and max value index
// perform block-level reduction
if (!threadIdx.x){ // at the shared memory, only the first element (index 0) (actually 2 elements in the first index. max value, and max value index) is what we need.
blk_vals[blockIdx.y][blockIdx.x] = vals[0]; // For each window (single x row), the first elements of the blocks are stored into the blk_vals[windowNumber][:]
// remember, this is a global variable.
blk_idxs[blockIdx.y][blockIdx.x] = idxs[0]; // and the max value index
__syncthreads();
}
}
// originally the following kernel was in the 1st kernel, performed by the last block. So just use one block for this.
__global__ void max_idx_kernel_final(int *result_maxInd, float *result_maxVal){
__shared__ volatile float vals[nTPB]; // Total amount of shared memory per block: 49152 bytes (50 KB). 1024 gives ~ 4KB for single.
__shared__ volatile int idxs[nTPB]; // ~ 4 KB for single, when nTPB is 1024. each block will have these variables!! (vals and idxs)
int idx = threadIdx.x;
float my_val = FLOAT_MIN;
int my_idx = -1; // remember, these are local variables, so each thread has this variable. This local variable is independent from other thread's local variable
while (idx < MAX_BLOCKS_X ){ // ?? confused whether it should be gridDim.x (actual # of blocks launched) or MAX_BLOCKS_X (# of elements in x dir of the global array blk_vals)
if (blk_vals[blockIdx.y][idx] > my_val)
{my_val = blk_vals[blockIdx.y][idx]; my_idx = blk_idxs[blockIdx.y][idx]; }
idx += blockDim.x;} // all threads in this single block (single in the x dir) are working, so you should loop over blockDim.x.
// Imagine where gridDim.x (# of blocks) is huge so that you need to loop over to get the max value and index
// After this, each thread in the block has a local variable (max value and max value index).
// So far it was sort of a reduction, but instead of pairing values we just looped over the blk_vals and blk_idxs
// populate shared memory
vals[threadIdx.x] = my_val; // This is now shared memory. This is because reduction requires comparison between different elements
idxs[threadIdx.x] = my_idx; // my_idx value is 0 based. This is done for all blocks (in the y direction)
__syncthreads();
// Now the final task is to do reduction for all threads in our single block (single block in the x dir, NROWS blocks in the y dir)!
// sweep in shared memory
for (int i = (nTPB>>1); i > 0; i>>=1) {
if (threadIdx.x < i) // the first half threads of the block
if (vals[threadIdx.x] < vals[threadIdx.x + i]) {vals[threadIdx.x] = vals[threadIdx.x+i]; idxs[threadIdx.x] = idxs[threadIdx.x+i]; }
__syncthreads();} // now all the results are in threadIdx.x == 0 for each block (there are NROWS blocks in the y dir)
// 0th thread. the results are in shared memory, not the local memory, so any thread could do the following. We just selected the 0th thread for no reason. If several threads try to do this, that would be a problem, since we'll have to wait for them
if(!threadIdx.x){
result_maxInd[blockIdx.y] = idxs[0]; // the final result for each row goes into the corresponding position (blockIdx.y)
result_maxVal[blockIdx.y] = vals[0];
}
}
int main(){
dim3 grids(MAX_BLOCKS_X, NROWS);
dim3 threads(nTPB,1);
dim3 grids2(1,NROWS);
dim3 threads2(nTPB);
float *d_vector, *h_vector;
h_vector = (float*)malloc(NROWS * NCOLS * sizeof(float));
for (int j = 1; j <= NCOLS; j++) {
for (int i = 1; i <= NROWS; i++) {
h_vector[IDX2F(i,j,NROWS)] = (float) (rand()/(float)RAND_MAX);
}
}
h_vector[IDX2F(2,5,NROWS)] = 10; // create definite max element
cudaMalloc(&d_vector, NROWS * NCOLS * sizeof(float));
cudaMemcpy(d_vector, h_vector, NROWS * NCOLS * sizeof(float), cudaMemcpyHostToDevice);
//d_vector is a pointer on the device pointing to the beginning of the vector, containing nrElements floats.
int *max_index;
float *max_val;
int *d_max_index;
float *d_max_val;
max_index = (int*)malloc(NROWS * sizeof(int));
max_val = (float*)malloc(NROWS * sizeof(float));
cudaMalloc((void**)&d_max_index, NROWS * sizeof(int));
cudaMalloc((void**)&d_max_val, NROWS * sizeof(float));
max_idx_kernel_reduction_within_block<<<grids, threads>>>(d_vector, NCOLS, NROWS);
max_idx_kernel_final<<<grids2,threads2>>>(d_max_index, d_max_val);
cudaMemcpy(max_index, d_max_index, NROWS * sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(max_val, d_max_val, NROWS * sizeof(float), cudaMemcpyDeviceToHost);
for(int z=0;z<20;z++)
printf("%d ",max_index[z]);
printf("\n\n\n");
for(int z=0;z<20;z++)
printf("%f ",max_val[z]);
return 0;
}
Your code has various issues:
You should use proper cuda error checking. That's just a standard boiler-plate statement I make. I don't think your code was producing any runtime errors.
You should validate your results. I seriously doubt the code was producing sensible results. The reasons why will become evident. If you want to prove this to yourself, modify your data initialization to something that is obviously and easily verifiable, such as I have shown, without making any other changes, and you will see that your program produces errors.
In your kernel, you're not indexing through the arrays correctly. Perhaps you don't understand the IDX2C and IDX2F macros. They are hurting you in two ways: you don't understand the pattern in which they index through your array, and they are killing your coalescing out of global memory (because of the way you have used them). Whenever we have an index that is a function of threadIdx.x and threadIdx.y (or blockIdx.y in this case), it's desirable, to maintain proper coalescing among adjacent threads, that the component based on threadIdx.x not be multiplied by any scaling factor. But the way you are passing parameters to IDX2C in your kernel, you are breaking this rule (and also disrupting your desired row-wise access pattern). So for now, let's get rid of those macros, as they are confusing the issue.
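To make that concrete (using the same names as the kernel, and recalling that IDX2C(i,j,ld) expands to j*ld + i):

// Original access, IDX2C(blockIdx.y, idx, ySize), expands to:
float v_uncoalesced = data[idx * ySize + blockIdx.y];  // adjacent threads are ySize floats apart: uncoalesced, column-wise walk
// Row-wise, coalesced access (what the modified code below uses):
float v_coalesced = data[blockIdx.y * xSize + idx];    // adjacent threads touch adjacent floats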
This is an illegal use of __syncthreads():
if (!threadIdx.x){ // at the shared memory, only the first element (index 0) (actually 2 elements in the first index. max value, and max value index) is what we need.
blk_vals[blockIdx.y][blockIdx.x] = vals[0]; // For each window (single x row), the first elements of the blocks are stored into the blk_vals[windowNumber][:]
// remember, this is a global variable.
blk_idxs[blockIdx.y][blockIdx.x] = idxs[0]; // and the max value index
__syncthreads();
}
It's illegal to use it in conditional code unless the condition evaluates the same for every thread in the block. It's entirely unneeded there, so we'll just delete it.
Your printouts were indexing up through 20 instead of NROWS.
With the above changes, your code goes from being broken (producing incorrect results) to being fixed, and execution time for the kernels on my system goes from 0.929ms to 0.656ms. I attribute all of this to the coalescing fix in item 3 above.
When I profile the before and after cases with nvprof --metrics gld_efficiency ..., it shows 12.5% efficiency with your original code and 53% efficiency with the changes. Here's the modified code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <iostream>
#define NCOLS 163317 // number of columns
#define NROWS 8 // number of rows
#define nTPB 1024 // Threads per Block. nTPB should be a power-of-2
#define MAX_BLOCKS_X ((NCOLS/nTPB)+1) // # of blocks I will launch
#define FLOAT_MIN -1.0f // lowest anticipated number of the data. Values in array will be compared with this and updated with the larger one
__device__ volatile float blk_vals[NROWS][MAX_BLOCKS_X];
__device__ volatile int blk_idxs[NROWS][MAX_BLOCKS_X];
// blk_vals and blk_idxs are the results obtained from reduction within each block.
// after 1st reduction, each row will have blk_vals[MAX_BLOCKS_X] array and blk_idxs[MAX_BLOCKS_X]
// these will be passed to the 2nd kernel
__global__ void max_idx_kernel_reduction_within_block(const float *data, const int xSize, const int ySize){ // first kernel. Reduction within blocks
__shared__ volatile float vals[nTPB]; // Total amount of shared memory per block: 49152 bytes (50 KB). 1024 gives ~ 4KB for single.
__shared__ volatile int idxs[nTPB]; // ~ 4 KB for single, when nTPB is 1024. each block will have both indices and values
int idx = threadIdx.x+blockDim.x * blockIdx.x; // idx in the x direction
int idy = blockIdx.y;
float my_val = FLOAT_MIN; // lowest possible number
int my_idx = -1; // to check whether you DID perform the kernel. Again, it's the idx in the x dir.
// sweep from global memory
while (idx < xSize){ // this ensures you don't go out the size of the array's x direction
float temp = data[idy*xSize+idx];
if (temp > my_val) {my_val = temp; my_idx = idx;}
// compare with my_val, and put the bigger value into my_val for next comparison. my_idx is 0 index based
idx += blockDim.x*gridDim.x;}
// until here takes about 6 ms !! very fast!!
// populate shared memory: takes ~ 270 ms
vals[threadIdx.x] = my_val; // put the computed max value for each thread into the shared memory. -> this is the bottleneck!!
idxs[threadIdx.x] = my_idx; // do this for index as well -> this is also slow!!
__syncthreads();
// sweep in shared memory
for (int i = (nTPB>>1); i > 0; i>>=1){
if (threadIdx.x < i) // the first half threads of the block
if (vals[threadIdx.x] < vals[threadIdx.x + i]) {vals[threadIdx.x] = vals[threadIdx.x+i]; idxs[threadIdx.x] = idxs[threadIdx.x+i]; }
// the above is comparing shared memory of threadIdx.x with shared memory of threadIdx.x + i.
// then puts the larger value into shared memory of threadIdx.x
__syncthreads();} // so now in each block, shared memory's first element (index 0) is the max value and max value index
// perform block-level reduction
if (!threadIdx.x){ // at the shared memory, only the first element (index 0) (actually 2 elements in the first index. max value, and max value index) is what we need.
blk_vals[blockIdx.y][blockIdx.x] = vals[0]; // For each window (single x row), the first elements of the blocks are stored into the blk_vals[windowNumber][:]
// remember, this is a global variable.
blk_idxs[blockIdx.y][blockIdx.x] = idxs[0]; // and the max value index
__syncthreads();
}
}
// originally the following kernel was in the 1st kernel, performed by the last block. So just use one block for this.
__global__ void max_idx_kernel_final(int *result_maxInd, float *result_maxVal){
__shared__ volatile float vals[nTPB]; // Total amount of shared memory per block: 49152 bytes (50 KB). 1024 gives ~ 4KB for single.
__shared__ volatile int idxs[nTPB]; // ~ 4 KB for single, when nTPB is 1024. each block will have these variables!! (vals and idxs)
int idx = threadIdx.x;
int idy = blockIdx.y;
float my_val = FLOAT_MIN;
int my_idx = -1; // remember, these are local variables, so each thread has this variable. This local variable is independent from other thread's local variable
while (idx < MAX_BLOCKS_X ){ // ?? confused whether it should be gridDim.x (actual # of blocks launched) or MAX_BLOCKS_X (# of elements in x dir of the global array blk_vals)
float temp = blk_vals[idy][idx];
if (temp > my_val)
{my_val = temp; my_idx = blk_idxs[idy][idx]; }
idx += blockDim.x;} // all threads in this single block (single in the x dir) are working, so you should loop over blockDim.x.
// Imagine where gridDim.x (# of blocks) is huge so that you need to loop over to get the max value and index
// After this, each thread in the block has a local variable (max value and max value index).
// So far it was sort of a reduction, but instead of pairing values we just looped over the blk_vals and blk_idxs
// populate shared memory
idx = threadIdx.x;
vals[idx] = my_val; // This is now shared memory. This is because reduction requires comparison between different elements
idxs[idx] = my_idx; // my_idx value is 0 based. This is done for all blocks (in the y direction)
__syncthreads();
// Now the final task is to do reduction for all threads in our single block (single block in the x dir, NROWS blocks in the y dir)!
// sweep in shared memory
for (int i = (nTPB>>1); i > 0; i>>=1) {
if (idx < i) // the first half threads of the block
if (vals[idx] < vals[idx + i]) {vals[idx] = vals[idx+i]; idxs[idx] = idxs[idx+i]; }
__syncthreads();} // now all the results are in threadIdx.x == 0 for each block (there are NROWS blocks in the y dir)
// 0th thread. the results are in shared memory, not the local memory, so any thread could do the following. We just selected the 0th thread for no reason. If several threads try to do this, that would be a problem, since we'll have to wait for them
if(!threadIdx.x){
result_maxInd[idy] = idxs[0]; // the final result for each row goes into the corresponding position (blockIdx.y)
result_maxVal[idy] = vals[0];
}
}
int main(){
dim3 grids(MAX_BLOCKS_X, NROWS);
dim3 threads(nTPB,1);
dim3 grids2(1,NROWS);
dim3 threads2(nTPB);
float *d_vector, *h_vector;
h_vector = (float*)malloc(NROWS * NCOLS * sizeof(float));
memset(h_vector, 0, NROWS*NCOLS*sizeof(float));
for (int i = 0; i < NROWS; i++)
h_vector[i*NCOLS + i] = 10.0f; // create definite max element per row
cudaMalloc(&d_vector, NROWS * NCOLS * sizeof(float));
cudaMemcpy(d_vector, h_vector, NROWS * NCOLS * sizeof(float), cudaMemcpyHostToDevice);
//d_vector is a pointer on the device pointing to the beginning of the vector, containing nrElements floats.
int *max_index;
float *max_val;
int *d_max_index;
float *d_max_val;
max_index = (int*)malloc(NROWS * sizeof(int));
max_val = (float*)malloc(NROWS * sizeof(float));
cudaMalloc((void**)&d_max_index, NROWS * sizeof(int));
cudaMalloc((void**)&d_max_val, NROWS * sizeof(float));
cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start);
max_idx_kernel_reduction_within_block<<<grids, threads>>>(d_vector, NCOLS, NROWS);
max_idx_kernel_final<<<grids2,threads2>>>(d_max_index, d_max_val);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float et;
cudaEventElapsedTime(&et, start, stop);
printf("elapsed time: %fms\n", et);
cudaMemcpy(max_index, d_max_index, NROWS * sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(max_val, d_max_val, NROWS * sizeof(float), cudaMemcpyDeviceToHost);
for(int z=0;z<NROWS;z++)
printf("%d ",max_index[z]);
printf("\n\n\n");
for(int z=0;z<NROWS;z++)
printf("%f ",max_val[z]);
printf("\n");
return 0;
}

Very low Gflops on array element-wise product-sum operation with CUDA

I have two 3D arrays: signals S(Q,C,M) and filters F(Q,C,K). Q contains transforms (FFT/DHT), C is the number of channels, and each Q*C slice is a filter. M and K are the numbers of signals and filters.
Now I need to perform the following operation: apply each filter to each signal, i.e. element-wise multiplication of the Q*C slices. There are M*K such (signal, filter) pairs, and each pair from S and F is multiplied. In Matlab form it would be Z(:,:,i,j) = S(:,:,i) .* F(:,:,j).
Z has dimension Q*C*K*M; it looks like an outer product over the last dimension. After that, I need to sum over all channels, resulting in a Q*K*M array. There is no need to save the intermediate result Z.
I have written the following CUDA kernel, but it only reaches <20 GFlop/s. Launch parameters: Q=1024, threadsPerBlock = Q, blocksPerGrid = (K, M).
#define C 50
#define M 100
#define K 500
__global__ void corr5Ker(float *X, float *W, float *Z, int nChan) {
    // Block index
    int bk = blockIdx.x;
    int bm = blockIdx.y;
    // Thread index
    int tx = threadIdx.x;
    // Calc offsets
    int xBegin = 1024 * nChan * bm;
    int xStep = 1024;
    int xEnd = 1024 * nChan * (bm + 1);
    int wBegin = 1024 * nChan * bk;
    int wStep = 1024;
    float rC = 0;
    // Conv
    for (int ix = xBegin, iw = wBegin; ix < xEnd; ix += xStep, iw += wStep) {
        rC += X[ix + tx] * W[iw + tx];
    }
    __syncthreads();
    int threadId = (bk + bm * gridDim.x) * 1024 + tx;
    Z[threadId] = rC;
}
I use Q*C*M*K to calculate the Flops, and the timing only includes kernel time. I also tested element-wise matrix addition and multiplication with simple linear kernels; if the data is large enough, those can reach about 600 GFlop/s. The above operation is only slightly more complicated, so it should not be as low as 20 GFlop/s. Where am I going wrong?
Edit 1
I have corrected my code for measuring the matrix addition, and that code only reaches 6 GFlop/s. I tried saxpy as well, which gave the same result. So it is now clear that memory bandwidth is what matters.
I also rewrote the above kernel to use more registers, which gives around 50 GFlop/s. Now it is reasonable.
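A rough back-of-the-envelope check supports this (assuming no data reuse between blocks): each (k, m) block reads Q*C floats from X and Q*C floats from W, i.e. about 2*1024*50*4 B = 410 KB, so all K*M = 50,000 blocks move roughly 20 GB, while the operation count (counted as Q*C*M*K, as above) is about 2.6e9. That is only on the order of 0.1 flop per byte, so with a few hundred GB/s of memory bandwidth, a few tens of GFlop/s is about the ceiling for this access pattern.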
First of all, the performance of a CUDA kernel is closely tied to the compute capability of the device and its hardware. With that said:
With 1024 threads per block, you can probably fit only a small number of blocks on each SM, because of the number of registers each thread needs. Shared memory is not a restriction in this case, since you are not using any.
Using shared memory, you should see a performance improvement.
Point 1 explains what you observed:
if the data dimension is large enough, it can reach about 600 Gflops/s.
Increasing the size of the data lets the kernel run more blocks, which means the device can hide long-latency operations by switching to other resident warps.
My advice, ordered by increasing complexity, is:
Profile your application and see what is happening.
Reduce the number of threads per block. That should increase the number of blocks per SM and improve the performance you get (a sketch of this is shown below).
Use shared memory.
If you post a minimal functional code, I could tell you more!
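As a rough illustration of the second suggestion, here is one way the kernel could be reorganized. This is only a sketch of the threads-per-block change, assuming the same data layout and the same (K, M) grid as the kernel in the question; it is not tested against the poster's data:

// Hypothetical variant: 256 threads per block; each thread covers 4 of the
// 1024 q-indices, so fewer registers are live per block and more blocks can
// be resident per SM. Launch as corr5Ker_256<<<dim3(K, M), 256>>>(X, W, Z, C);
__global__ void corr5Ker_256(const float *X, const float *W, float *Z, int nChan) {
    int bk = blockIdx.x;                              // filter index
    int bm = blockIdx.y;                              // signal index
    for (int q = threadIdx.x; q < 1024; q += blockDim.x) {
        float rC = 0.0f;
        const float *x = X + 1024 * nChan * bm + q;   // signal bm, transform bin q
        const float *w = W + 1024 * nChan * bk + q;   // filter bk, transform bin q
        for (int c = 0; c < nChan; ++c)
            rC += x[c * 1024] * w[c * 1024];          // sum over channels
        Z[(bk + bm * gridDim.x) * 1024 + q] = rC;     // same output layout as before
    }
}

The arithmetic is unchanged; only the mapping of q-indices to threads differs, so more blocks can be resident per SM.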

How to copy a flattened 2D array from Global Memory to Shared Memory in CUDA

I have a kernel receiving a flattened 2D array, and I would like to copy one line of the array at a time into shared memory. My kernel looks like the following:
__global__ void searchKMP(char *test, size_t pitch_test, int ittNbr){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int strideId = tid * 50;
    const int m = 50;
    __shared__ char s_test[m];
    int j;
    // this loops over the number of lines in my 2D array
    for(int k=0; k<ittNbr; k++){
        // this loop stores one line of my flattened array (it treats 1 line at a time) into shared memory
        if(threadIdx.x==0){
            for(int n=0; n<50; ++n){
                s_test[n] = *(((char*)test + k * pitch_test) + n);
            }
        }
        __syncthreads();
        j=0;
        // this loop processes my shared memory array against another 1D array
        for(int i=strideId; i<(strideId+50); i++){
            ...dosomething...
            (increment x if a condition is met)
            ...dosomething...
        }
        __syncthreads();
        if(x!=0)
            cache[0]+=x;
        ...dosomething...
    }
}
However, when I verify the value of x, it varies from run to run, or varies with the number of threads. For example, 10 blocks of 500 threads returns 9, while 20 blocks of 250 threads returns 7 or 6 depending on the execution. I wonder if the problem comes from the flattened 2D array copied into shared memory, or if something is done wrong in this bit of code.
It looks like your array in shared memory has 20 elements:
int m = 20;
__shared__ char s_test[m];
But in your inner loop you are trying to write 50 elements:
for(int n =0; n<50; ++n){
s_test[n] = *(((char*)test + k * pitch_test) + n);
I don't know if this is specifically the problem you were looking for, but that looks like it won't work.
Shared memory is shared across all threads in the same block.
It is not very clear why you need shared memory here or what you are doing:
in your code, all threads in the block write the same values to shared memory many times, which is redundant.
The common way to work with shared memory is something like this:
if(threadIdx.x < m)
s_test[threadIdx.x] = *(global_mem_pointer + threadIdx.x);
__syncthreads();
All threads in the block write their own value "at the same moment", and after __syncthreads() the shared memory is filled with what you need and is visible to all threads in the block.
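Applied to the 50-character rows in the question, that pattern would look roughly like this (a sketch assuming blockDim.x >= 50 and the same pitched layout as above):

// Cooperative load: the first 50 threads each copy one byte of row k,
// instead of thread 0 copying all 50 bytes in a serial loop.
if (threadIdx.x < 50)
    s_test[threadIdx.x] = *((char*)test + k * pitch_test + threadIdx.x);
__syncthreads();   // row k is now visible to every thread in the block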

Implementing matrix multiplication with OpenCL / C

I understand the theory of matrix multiplication; I just have two questions about this particular kernel implementation:
For reference, num_rows = 32. The matrix B (b_mat) has been transposed beforehand by another kernel, so as I understand it we're dotting row vectors together.
1) Why do we need the parameter "vectors_per_row" and thus the inner loop? I thought we could just do sum += dot(row of A, row of B), and it seems like this parameter splits the row into smaller portions (why?).
2) I don't understand the address offset for a_mat and b_mat, i.e. a_mat += start; b_mat += start*4;
__kernel void matrix_mult(__global float4 *a_mat,
                          __global float4 *b_mat, __global float *c_mat) {
    float sum;
    int num_rows = get_global_size(0);
    int vectors_per_row = num_rows/4;
    int start = get_global_id(0) * vectors_per_row;
    a_mat += start;
    c_mat += start*4;
    for(int i=0; i<num_rows; i++) {
        sum = 0.0f;
        for(int j=0; j<vectors_per_row; j++) {
            sum += dot(a_mat[j],
                       b_mat[i*vectors_per_row + j]);
        }
        c_mat[i] = sum;
    }
}
Your matrices are stored as arrays of float4s. A float4 is a vector of 4 floats; this is where the 4 comes from. dot() only works with the built-in vector types, so you have to do it on float4s.
c_mat is of type float, which is why it uses start*4 while a_mat uses start. The offset is there because the work is split up across several (potentially hundreds of) threads; each thread calculates only a small part of the multiply operation, and start is simply where that thread starts computing. This is what get_global_id(0) is for: it essentially gets your thread id. Technically it's the thread index of the first dimension, but since you appear to have only one thread dimension, you can just think of it as the thread id.
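To put numbers on that: with num_rows = 32 as in the question, vectors_per_row = 32/4 = 8, so work-item id starts at float4 index start = id*8 of a_mat, i.e. the first of the 8 float4s that make up row id of A. c_mat is indexed in plain floats, so the same row of the output starts at start*4 = id*32, which is exactly where row id of C begins; the outer loop then writes that row's 32 dot products.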
