CUDA matrix multiplication program fails for arrays larger than 30 x 30

This is a program for matrix multiplication on the CUDA architecture.
The code works fine when the array size is 30 x 30, but produces a series of 0's when the size is larger.
I am using a standard EC2 GPU instance for CUDA, hosted on a Linux machine. Can anybody figure out the reason?
#include <stdio.h>
#define SIZE 30
__global__ void matrix_multiply(float *input1, float *input2, float *output, int dimension) {
    int input1_index = threadIdx.x / dimension * dimension;
    int input2_index = threadIdx.x % dimension;
    int i = 0;
    for (i = 0; i < dimension; i++) {
        output[threadIdx.x] += input1[input1_index + i] * input2[input2_index + i * dimension];
    }
}

int main() {
    int i, j, natural_number = 1;
    float input1[SIZE][SIZE], input2[SIZE][SIZE], result[SIZE][SIZE] = {0};
    float *c_input1, *c_input2, *c_result;

    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            input1[i][j] = input2[i][j] = natural_number++;
        }
    }

    cudaMalloc((void**)&c_input1, sizeof(input1));
    cudaMalloc((void**)&c_input2, sizeof(input2));
    cudaMalloc((void**)&c_result, sizeof(result));

    cudaMemcpy(c_input1, input1, sizeof(input1), cudaMemcpyHostToDevice);
    cudaMemcpy(c_input2, input2, sizeof(input2), cudaMemcpyHostToDevice);
    cudaMemcpy(c_result, result, sizeof(result), cudaMemcpyHostToDevice);

    matrix_multiply<<<1, SIZE * SIZE>>>(c_input1, c_input2, c_result, SIZE);

    if (cudaGetLastError() != cudaSuccess) {
        printf("%s\n", cudaGetErrorString(cudaGetLastError()));
    }

    cudaMemcpy(result, c_result, sizeof(result), cudaMemcpyDeviceToHost);

    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            printf("%.2f ", result[i][j]);
        }
        printf("\n");
    }

    cudaFree(c_input1);
    cudaFree(c_input2);
    cudaFree(c_result);
    return 0;
}

You probably have a maximum of 1024 threads per block on your GPU. 30 x 30 = 900, so that should be OK, but e.g. 40 x 40 would result in a kernel launch failure (take-home message: always check for errors!).
You probably want to consider organizing your data differently, e.g. SIZE blocks of SIZE threads and then call the kernel as:
matrix_multiply<<<SIZE, SIZE>>>(c_input1,c_input2,c_result,SIZE);
(Obviously you'll need to modify your array indexing within the kernel code, e.g. use the block index as the row and the thread index as the column.)
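As a rough sketch (keeping the parameter names from the question and the same row-major layout), the re-indexed kernel could look something like this:
// One block per row, one thread per column: blockIdx.x selects the output row,
// threadIdx.x the output column. Launched as matrix_multiply<<<SIZE, SIZE>>>(...).
__global__ void matrix_multiply(float *input1, float *input2, float *output, int dimension) {
    int row = blockIdx.x;    // 0 .. dimension-1
    int col = threadIdx.x;   // 0 .. dimension-1
    float sum = 0.0f;
    for (int i = 0; i < dimension; i++)
        sum += input1[row * dimension + i] * input2[i * dimension + col];
    output[row * dimension + col] = sum;
}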

You are invoking the kernel with a single block of SIZE * SIZE threads:
matrix_multiply<<<1, SIZE * SIZE>>>(c_input1,c_input2,c_result,SIZE);
A single block cannot exceed the per-block thread limit (1024 on most current GPUs), so for larger matrices there are not enough threads and the launch fails.

Related

Non-deterministic CUDA C kernel

I'm still a beginner with CUDA and I have been trying to write a simple kernel to perform a parallel prime sieve on the GPU. Originally I had written my code in C but I wanted to investigate the speed up on a GPU so I rewrote it:
41.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#define B 1024
#define T 256
#define N (B*T)
#define checkCudaErrors(error) {\
    if (error != cudaSuccess) {\
        printf("CUDA Error - %s:%d: '%s'\n", __FILE__, __LINE__, cudaGetErrorString(error));\
        exit(1);\
    }\
}\

__global__ void prime_sieve(int *primes) {
    unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    primes[i] = i;
    primes[0] = primes[1] = 0;
    if (i > 1 && i < N) {
        for (int j = 2; j < N/2; j++) {
            if (i*j < N) {
                primes[i*j] = 0;
            }
        }
    }
}

int main() {
    int *h_primes = (int*) malloc(N * sizeof(int));
    int *d_primes;
    checkCudaErrors(cudaMalloc((void**)&d_primes, N*sizeof(int)));
    checkCudaErrors(cudaMemcpy(d_primes, h_primes, N*sizeof(int), cudaMemcpyHostToDevice));
    prime_sieve<<<B,T>>>(d_primes);
    checkCudaErrors(cudaMemcpy(h_primes, d_primes, N*sizeof(int), cudaMemcpyDeviceToHost));
    checkCudaErrors(cudaFree(d_primes));
    int size = 0;
    int total = 0;
    for (int i = 2; i < N; i++) {
        if (h_primes[i]) {
            size++;
        }
        total++;
    }
    printf("\n");
    printf("Length = %d\tPrimes = %d\n", total, size);
    free(h_primes);
    return 0;
}
I run the program on Ubuntu 16.04 (4.4.0-83-generic) and compile it using nvcc 41.cu -o 41.o -arch=sm_30 with nvcc version 8.0.61. The program runs on a GeForce GTX 780 Ti, but every time it runs it produces non-deterministic results:
Length = 262142 Primes = 49477
Length = 262142 Primes = 49486
Length = 262142 Primes = 49596
Length = 262142 Primes = 49589
There were no errors reported back. At first I thought it was a race condition, but cuda-memcheck didn't report any hazards for racecheck, initcheck or synccheck, and I couldn't think of any problems with my assumptions. Could this be a synchronisation problem?
This non-deterministic behaviour only occurs when I increase the block size and thread count as seen in the code. When I tried a block size and thread count of, say, 16, there were no problems (as far as I could tell). It seems that not all threads get the chance to execute? I was planning to run this on very large array sizes (< 1 billion integers) but I am stuck at this point.
What am I doing wrong here?
There is a giant race-condition
So primes[i] > 0 means prime, while primes[i] = 0 means composite.
primes[i] = i; is executed as the first update on primes by each thread. Keep this in mind.
Now let's see what happens when thread 16 executes. It sets primes[16] = 16 and then zeroes all multiples of 16, something like the following:
primes[32] = primes[48] = .... = primes[k*16] = 0
Imagine that thread 48 gets scheduled just after thread 16 has completed its job (or once j > 3 in thread 16's loop).
Thread 48 sets primes[48] = 48. You have lost the update made by thread 16.
That is a race condition.
When coding in CUDA you should make sure that the correctness of your code does not depend on a particular scheduling of warps.
You should treat the order of execution as non-deterministic.
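One way to remove the race (a sketch, not necessarily the fastest fix) is to split initialization and sieving into two kernel launches, so no thread can write primes[k] = k after another thread has already cleared that entry. N, B and T are the constants from the question:
__global__ void init_primes(int *primes) {
    unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N) primes[i] = (i > 1) ? i : 0;   // 0 and 1 are not prime
}

__global__ void sieve(int *primes) {
    unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i > 1 && i < N) {
        for (unsigned int j = 2; i * j < N; j++)
            primes[i * j] = 0;                // only clears entries, never re-initializes
    }
}

// Host side:
//   init_primes<<<B, T>>>(d_primes);
//   sieve<<<B, T>>>(d_primes);
Consecutive launches on the same stream execute in order, which provides the device-wide synchronization between the two phases. Several threads may still clear the same entry, but they all write the same value (0), so the result is deterministic.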

A lot of 0's received when using cudaMemcpy()

I've just started to learn CUDA and I wanted to fill an array (a 2D array represented as a 1D array) with random numbers. I followed other posts in order to generate the random numbers, but I don't know whether the problem is in the number generation, in copying the memory back from the device, or something else. The problem is that, even though I have tried to fill each cell of the array with the id of the thread attending it (in order to see the results after copying into host memory), I receive an array that is filled with 0 in every position after recovering the data with cudaMemcpy().
I'm programming in Visual Studio 2013, with CUDA 7.5, an i5 2500K processor and a GTX 960 graphics card.
Here are main and the kernel where I try to fill the array, along with the cuRAND initialization. If you need to see anything else, just tell me.
__global__ void setup_cuRand(curandState * state, unsigned long seed)
{
    int id = threadIdx.x;
    curand_init(seed, id, 0, &state[id]);
}

__global__ void poblar(int * adn, curandState * state){
    curandState localState = state[threadIdx.x];
    int random = curand(&localState);
    adn[threadIdx.x] = random;
    // It doesn't matter if I use the following instruction instead, the result is still a lot of 0's
    //adn[threadIdx.x] = threadIdx.x;
}

int main()
{
    const int adnLength = NUMCROMOSOMAS * SIZECROMOSOMAS; // 256 * 128 (32,768)
    const size_t adnSize = adnLength * sizeof(int);
    int adnCPU[adnLength];
    int * adnDevice;

    cudaError_t error = cudaSetDevice(0);
    if (error != cudaSuccess)
        exit(-EXIT_FAILURE);

    curandState * randState;
    error = cudaMalloc(&randState, adnLength * sizeof(curandState));
    if (error != cudaSuccess){
        cudaFree(randState);
        exit(-EXIT_FAILURE);
    }

    // Here cuRAND is initialized
    setup_cuRand<<<1, adnLength>>>(randState, unsigned(time(NULL)));

    error = cudaMalloc((void **)&adnDevice, adnSize);
    if (error == cudaErrorMemoryAllocation){// cudaSuccess){
        cudaFree(adnDevice);
        cudaFree(randState);
        printf("\n error");
        exit(-EXIT_FAILURE);
    }

    poblar<<<1, adnLength>>>(adnDevice, randState);
    error = cudaMemcpy(adnCPU, adnDevice, adnSize, cudaMemcpyDeviceToHost);

    // After here, for every i, adnCPU[i] is 0 and I cannot figure out what is wrong
    if (error == cudaSuccess){
        for (int i = 0; i < NUMCROMOSOMAS; i++){
            for (int j = 0; j < SIZECROMOSOMAS; j++){
                printf("%i,", adnCPU[(i*SIZECROMOSOMAS) + j]);
            }
            printf("\n");
        }
    }
    return 0;
}
EDIT after the answer solved it: one peculiarity beyond the answer given is that I needed a lower number of threads (half that quantity worked for me) in order to seed the random numbers correctly with cuRAND. For some reason I could create the threads fine, but I couldn't seed the pseudo-random number generator.
The maximum number of threads per block is 1024 on your hardware, hence you cannot schedule a launch with adnLength threads in a single block if adnLength is larger than 1024.
The error you are getting is most probably a launch-configuration error, and it is returned by cudaPeekAtLastError() because it occurs before any GPU work, right after the triple angle-bracket call. Indeed, cudaMemcpy may not return it, even though it does return errors from previous asynchronous calls.
The error that may occur here is cudaErrorLaunchOutOfResources.
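As an illustrative sketch (threadsPerBlock = 256 is an arbitrary choice, and the kernels would then have to compute a global index from blockIdx.x and blockDim.x rather than using threadIdx.x alone):
// Spread adnLength work items over several blocks so no block exceeds 1024 threads,
// and check the launch-configuration error right after the <<<...>>> call.
const int threadsPerBlock = 256;
const int blocks = (adnLength + threadsPerBlock - 1) / threadsPerBlock;

setup_cuRand<<<blocks, threadsPerBlock>>>(randState, unsigned(time(NULL)));
error = cudaPeekAtLastError();   // reports invalid launch configurations immediately
if (error != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(error));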

How to measure overall performance of parallel programs (with papi)

I asked myself what would be the best way to measure the performance (in FLOPS) of a parallel program. I read about PAPI_flops. This seems to work fine for a serial program, but I don't know how I can measure the overall performance of a parallel program.
I would like to measure the performance of a BLAS/LAPACK function, in my example below gemm. But I also want to measure other functions, especially functions where the number of operations is not known. (In the case of gemm the ops are known (ops(gemm) = 2*n^3), so I could calculate the performance as a function of the number of operations and the execution time.) The library (I am using Intel MKL) spawns the threads automatically, so I can't measure the performance of each thread individually and then reduce it.
This is my example:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "mkl.h"
#include "omp.h"
#include "papi.h"
int main(int argc, char *argv[])
{
    int i, j, l, k, n, m, idx, iter;
    int mat, mat_min, mat_max;
    int threads;
    double *A, *B, *C;
    double alpha = 1.0, beta = 0.0;
    float rtime1, rtime2, ptime1, ptime2, mflops;
    long long flpops;

    #pragma omp parallel
    {
        #pragma omp master
        threads = omp_get_num_threads();
    }

    if(argc < 4){
        printf("pass me 3 arguments!\n");
        return( -1 );
    }
    else
    {
        mat_min = atoi(argv[1]);
        mat_max = atoi(argv[2]);
        iter = atoi(argv[3]);
    }

    m = mat_max; n = mat_max; k = mat_max;
    printf (" Initializing data for matrix multiplication C=A*B for matrix \n"
            " A(%ix%i) and matrix B(%ix%i)\n\n", m, k, k, n);
    A = (double *) malloc( m*k * sizeof(double) );
    B = (double *) malloc( k*n * sizeof(double) );
    C = (double *) malloc( m*n * sizeof(double) );

    printf (" Initializing matrix data \n\n");
    for (i = 0; i < (m*k); i++)
        A[i] = (double)(i+1);
    for (i = 0; i < (k*n); i++)
        B[i] = (double)(-i-1);
    memset(C, 0, m*n*sizeof(double));

    // actual measurement
    for(mat = mat_min; mat <= mat_max; mat += 5)
    {
        m = mat; n = mat; k = mat;
        for( idx = -1; idx < iter; idx++ ){
            PAPI_flops( &rtime1, &ptime1, &flpops, &mflops );
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        m, n, k, alpha, A, k, B, n, beta, C, n);
            PAPI_flops( &rtime2, &ptime2, &flpops, &mflops );
        }
        printf("%d threads: %d in %f sec, %f MFLOPS\n", threads, mat, rtime2-rtime1, mflops);
        fflush(stdout);
    }
    printf("Done\n"); fflush(stdout);

    free(A);
    free(B);
    free(C);
    return 0;
}
This is one output (for matrix size 200):
1 threads: 200 in 0.001459 sec, 5570.258789 MFLOPS
2 threads: 200 in 0.000785 sec, 5254.993652 MFLOPS
4 threads: 200 in 0.000423 sec, 4919.640137 MFLOPS
8 threads: 200 in 0.000264 sec, 3894.036865 MFLOPS
We can see from the execution time that the gemm function scales. But the FLOPS I am measuring reflect only the performance of thread 0.
My question is: How can I measure the overall performance? I am grateful for any input.
First, I'm just curious: why do you need the FLOPS? Don't you just care how much time is taken, or perhaps the time taken in comparison to other BLAS libraries?
PAPI is thread-based, so it is not much help on its own here.
What I would do is measure around the function call and see how the time changes with the number of threads it spawns. It should not spawn more threads than physical cores (HT is no good here). Then, if the matrix is big enough, and the machine is not loaded, the time should simply divide by the number of threads. E.g., 10 seconds over 4 cores should become 2.5 seconds.
Other than that, there are 2 things you can do to really measure it:
1. Use whatever you use now, but inject your start/end measurement code around the BLAS code. One way to do that (on Linux) is by pre-loading a lib that defines pthread_start and using your own functions that call the originals but do some extra measurements. Another way is to override the function pointer when the process is already running (a trampoline). On Linux it's in the GOT/PLT; on Windows it's more complicated - look for a library.
2. Use oprofile, or some other profiler, to report the number of instructions executed in the time you care about. Or, better yet, report the number of floating-point instructions executed. A little problem with this is that SSE instructions multiply or add 2 or more doubles at a time, so you'd have to account for that. I guess you can assume they always use the maximum possible number of operands.
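As a minimal illustration of measuring around the whole call (a sketch that assumes the operation count is known, as it is for gemm; omp_get_wtime() is used simply because omp.h is already included in the question's code):
// Time the complete multithreaded cblas_dgemm call with a wall clock and derive
// the FLOP rate from the known operation count instead of per-thread counters.
double t0 = omp_get_wtime();
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
            m, n, k, alpha, A, k, B, n, beta, C, n);
double t1 = omp_get_wtime();

double gflops = 2.0 * (double)m * (double)n * (double)k / (t1 - t0) / 1e9; // ops(gemm) = 2*m*n*k
printf("%d threads: %f s, %f GFLOPS (all threads)\n", threads, t1 - t0, gflops);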

Reduce matrix rows with CUDA

Windows 7, NVidia GeForce 425M.
I wrote a simple CUDA code which calculates the row sums of a matrix.
The matrix has uni-dimensional representation (pointer to a float).
The serial version of code is below (it has 2 loops, as expected):
void serial_rowSum (float* m, float* output, int nrow, int ncol) {
    float sum;
    for (int i = 0 ; i < nrow ; i++) {
        sum = 0;
        for (int j = 0 ; j < ncol ; j++)
            sum += m[i*ncol+j];
        output[i] = sum;
    }
}
Inside the CUDA code, I call the kernel function sweeping the matrix by rows. Below, the kernel call snippet:
dim3 threadsPerBlock((unsigned int) nThreadsPerBlock); // has to be multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow/(float) nThreadsPerBlock));
kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);
and the kernel function which performs the parallel sum of the rows (still has 1 loop):
__global__ void kernel_rowSum(float *m, float *s, int nrow, int ncol) {
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowIdx < nrow) {
        float sum = 0;
        for (int k = 0 ; k < ncol ; k++)
            sum += m[rowIdx*ncol+k];
        s[rowIdx] = sum;
    }
}
So far so good. The serial and parallel (CUDA) results are equal.
The whole point is that the CUDA version takes almost twice the time of the serial one to compute, even if I change the nThreadsPerBlock parameter: I tested nThreadsPerBlock from 32 to 1024 (maximum number of threads per block allowed for my card).
IMO, the matrix dimension is big enough to justify parallelization: 90,000 x 1,000.
Below, I report the time elapsed for the serial and parallel versions using different nThreadsPerBlock. Time reported in msec over an average of 100 samples:
Matrix: nrow = 90000 x ncol = 1000
Serial: Average Time Elapsed per Sample in msec (100 samples): 289.18.
CUDA (32 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 497.11.
CUDA (1024 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 699.66.
Just in case, the version with 32/1024 nThreadsPerBlock is the fastest/slowest one.
I understand that there is a kind of overhead when copying from Host to Device and the other way around, but maybe the slowness is because I am not implementing the fastest code.
Since I am far from being a CUDA expert:
Am I coding the fastest version for this task? How could I improve my code?
Can I get rid of the loop in the kernel function?
Any thoughts appreciated.
EDIT 1
Although I describe a standard rowSum, I am actually interested in the AND/OR operation over rows which have {0;1} values, i.e. rowAND/rowOR. Because of that, I can't exploit the cuBLAS "multiply by a column vector of 1's" trick, as suggested by some commenters.
EDIT 2
As suggested by other users and endorsed here:
FORGET ABOUT TRYING TO WRITE YOUR OWN FUNCTIONS, use Thrust library instead and the magic comes.
Since you mentioned you need a general reduction algorithm rather than sum only, I will try to give 3 approaches here. The kernel approach may have the highest performance, the Thrust approach is easiest to implement, and the cuBLAS approach works only with sum but has good performance.
Kernel Approach
Here's a very good doc introducing how to optimize a standard parallel reduction. A standard reduction can be divided into 2 stages.
Multiple thread blocks each reduces one part of the data;
One thread block reduces from result of stage 1 to the final 1 element.
For your multi-reduction (reducing the rows of mat) problem, only stage 1 is enough. The idea is to reduce 1 row per thread block; a sketch of that idea follows below. For further considerations like multiple rows per thread block or 1 row per multiple thread blocks, you can refer to the paper provided by @Novak. This may improve the performance further, especially for badly shaped matrices.
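A rough sketch of the one-row-per-block idea (sum is used as the example operation; for the OP's case the += would become an AND/OR; it assumes blockDim.x is a power of two, and d_m / d_out are placeholder device pointers):
__global__ void row_reduce(const float *m, float *out, int nrow, int ncol)
{
    extern __shared__ float sdata[];
    int row = blockIdx.x;          // one thread block per row
    int tid = threadIdx.x;

    // Each thread accumulates a strided slice of its block's row.
    float acc = 0.0f;
    for (int col = tid; col < ncol; col += blockDim.x)
        acc += m[row * ncol + col];
    sdata[tid] = acc;
    __syncthreads();

    // Standard shared-memory tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[row] = sdata[0];
}

// Launch, e.g.: row_reduce<<<nrow, 256, 256 * sizeof(float)>>>(d_m, d_out, nrow, ncol);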
Thrust Approach
A general multi-reduction can be done with thrust::reduce_by_key in a few minutes. You can find some discussion here: Determining the least element and its position in each matrix column with CUDA Thrust.
However, thrust::reduce_by_key does not assume that each row has the same length, so you will pay a performance penalty. Another post, How to normalize matrix columns in CUDA with max performance?, gives a profiling comparison between thrust::reduce_by_key and the cuBLAS approach on row sums. It may give you a basic understanding of the performance.
cuBLAS Approach
The sum of the rows/columns of a matrix A can be seen as a matrix-vector multiplication where the elements of the vector are all ones. It can be represented by the following MATLAB code:
y = A * ones(size(A,2),1);
where y is the sum of rows of A.
The cuBLAS library provides a high-performance matrix-vector multiplication function, cublas<t>gemv(), for this operation.
Timing results show that this routine is only 10~50% slower than simply reading all the elements of A once, which can be seen as the theoretical upper limit of the performance for this operation.
Reducing the rows of a matrix can be solved by using CUDA Thrust in three ways (they may not be the only ones, but addressing that point is out of scope). As also recognized by the same OP, using CUDA Thrust is preferable for this kind of problem. An approach using cuBLAS is also possible.
APPROACH #1 - reduce_by_key
This is the approach suggested at this Thrust example page. It includes a variant using make_discard_iterator.
APPROACH #2 - transform
This is the approach suggested by Robert Crovella at CUDA Thrust: reduce_by_key on only some values in an array, based off values in a “key” array.
APPROACH #3 - inclusive_scan_by_key
This is the approach suggested by Eric at How to normalize matrix columns in CUDA with max performance?.
APPROACH #4 - cublas<t>gemv
It uses cuBLAS gemv to multiply the relevant matrix by a column of 1's.
THE FULL CODE
Here is the code condensing the four approaches. The Utilities.cu and Utilities.cuh files are maintained here and omitted. The TimingGPU.cu and TimingGPU.cuh files are maintained here and are omitted as well.
#include <cublas_v2.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>
#include <stdio.h>
#include <iostream>
#include "Utilities.cuh"
#include "TimingGPU.cuh"
// --- Required for approach #2
__device__ float *vals;
/**************************************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX - NEEDED FOR APPROACH #1 */
/**************************************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {
T Ncols; // --- Number of columns
__host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}
__host__ __device__ T operator()(T i) { return i / Ncols; }
};
/******************************************/
/* ROW_REDUCTION - NEEDED FOR APPROACH #2 */
/******************************************/
struct row_reduction {
const int Ncols; // --- Number of columns
row_reduction(int _Ncols) : Ncols(_Ncols) {}
__device__ float operator()(float& x, int& y ) {
float temp = 0.f;
for (int i = 0; i<Ncols; i++)
temp += vals[i + (y*Ncols)];
return temp;
}
};
/**************************/
/* NEEDED FOR APPROACH #3 */
/**************************/
template<typename T>
struct MulC: public thrust::unary_function<T, T>
{
T C;
__host__ __device__ MulC(T c) : C(c) { }
__host__ __device__ T operator()(T x) { return x * C; }
};
/********/
/* MAIN */
/********/
int main()
{
const int Nrows = 5; // --- Number of rows
const int Ncols = 8; // --- Number of columns
// --- Random uniform integer distribution between 10 and 99
thrust::default_random_engine rng;
thrust::uniform_int_distribution<int> dist(10, 99);
// --- Matrix allocation and initialization
thrust::device_vector<float> d_matrix(Nrows * Ncols);
for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);
TimingGPU timerGPU;
/***************/
/* APPROACH #1 */
/***************/
timerGPU.StartCounter();
// --- Allocate space for row sums and indices
thrust::device_vector<float> d_row_sums(Nrows);
thrust::device_vector<int> d_row_indices(Nrows);
// --- Compute row sums by summing values with equal row indices
//thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)),
// thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
// d_matrix.begin(),
// d_row_indices.begin(),
// d_row_sums.begin(),
// thrust::equal_to<int>(),
// thrust::plus<float>());
thrust::reduce_by_key(
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
d_matrix.begin(),
thrust::make_discard_iterator(),
d_row_sums.begin());
printf("Timing for approach #1 = %f\n", timerGPU.GetCounter());
// --- Print result
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "] = " << d_row_sums[i] << "\n";
}
/***************/
/* APPROACH #2 */
/***************/
timerGPU.StartCounter();
thrust::device_vector<float> d_row_sums_2(Nrows, 0);
float *s_vals = thrust::raw_pointer_cast(&d_matrix[0]);
gpuErrchk(cudaMemcpyToSymbol(vals, &s_vals, sizeof(float *)));
thrust::transform(d_row_sums_2.begin(), d_row_sums_2.end(), thrust::counting_iterator<int>(0), d_row_sums_2.begin(), row_reduction(Ncols));
printf("Timing for approach #2 = %f\n", timerGPU.GetCounter());
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "] = " << d_row_sums_2[i] << "\n";
}
/***************/
/* APPROACH #3 */
/***************/
timerGPU.StartCounter();
thrust::device_vector<float> d_row_sums_3(Nrows, 0);
thrust::device_vector<float> d_temp(Nrows * Ncols);
thrust::inclusive_scan_by_key(
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
d_matrix.begin(),
d_temp.begin());
thrust::copy(
thrust::make_permutation_iterator(
d_temp.begin() + Ncols - 1,
thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))),
thrust::make_permutation_iterator(
d_temp.begin() + Ncols - 1,
thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))) + Nrows,
d_row_sums_3.begin());
printf("Timing for approach #3 = %f\n", timerGPU.GetCounter());
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "] = " << d_row_sums_3[i] << "\n";
}
/***************/
/* APPROACH #4 */
/***************/
cublasHandle_t handle;
timerGPU.StartCounter();
cublasSafeCall(cublasCreate(&handle));
thrust::device_vector<float> d_row_sums_4(Nrows);
thrust::device_vector<float> d_ones(Ncols, 1.f);
float alpha = 1.f;
float beta = 0.f;
cublasSafeCall(cublasSgemv(handle, CUBLAS_OP_T, Ncols, Nrows, &alpha, thrust::raw_pointer_cast(d_matrix.data()), Ncols,
thrust::raw_pointer_cast(d_ones.data()), 1, &beta, thrust::raw_pointer_cast(d_row_sums_4.data()), 1));
printf("Timing for approach #4 = %f\n", timerGPU.GetCounter());
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "] = " << d_row_sums_4[i] << "\n";
}
return 0;
}
TIMING RESULTS (tested on a Kepler K20c)
Matrix size     #1      #1-v2   #2      #3      #4      #4 (no plan)
100  x 100      0.63    1.00    0.10    0.18    139.4   0.098
1000 x 1000     1.25    1.12    3.25    1.04    101.3   0.12
5000 x 5000     8.38    15.3    16.05   13.8    111.3   1.14
100  x 5000     1.25    1.52    2.92    1.75    101.2   0.40
5000 x 100      1.35    1.99    0.37    1.74    139.2   0.14
It seems that approaches #1 and #3 outperform approach #2, except in the cases of small numbers of columns. The best approach, however, is approach #4, which is significantly more convenient than the others, provided that the time needed to create the plan can be amortized during the computation.
If this is the extent (summing the rows) of the operations you need to do with this data, I wouldn't expect a sizable benefit from the GPU. You have exactly one arithmetic operation per data element, and for that you are paying the cost of transferring that data element to the GPU. And beyond a certain problem size (whatever it takes to keep the machine busy) you get no added benefit from larger problem sizes, because the arithmetic intensity is O(n).
So this isn't a particularly exciting problem to solve on the GPU.
But as talonmies has indicated, you have a coalescing problem in the way you have crafted it, which will further slow things down. Let's take a look at a small example:
     C1  C2  C3  C4
R1   11  12  13  14
R2   21  22  23  24
R3   31  32  33  34
R4   41  42  43  44
Above is a simple pictorial example of a small portion of your matrix. The machine data storage is such that elements (11), (12), (13), and (14) are stored in adjacent memory locations.
For coalesced access, we want an access pattern such that adjacent memory locations are requested from the same instruction, executed across the warp.
We need to think about execution of your code from the standpoint of a warp, that is 32 threads executing in lock-step. What is your code doing? Which elements is it retrieving (asking for) at each step/instruction? Let's take a look at this line of code:
sum+=m[rowIdx*ncol+k];
Adjacent threads in the warp have adjacent (i.e. consecutive) values for rowIdx as you have created that variable. So when k = 0, which data element is being asked for by each thread when we try to retrieve the value m[rowIdx*ncol+k] ?
In block 0, thread 0 has a rowIdx of 0. Thread 1 has a rowIdx of 1, etc. So the values being asked for by each thread at this instruction are:
Thread:   Memory Location:   Matrix Element:
   0        m[0]                (11)
   1        m[ncol]             (21)
   2        m[2*ncol]           (31)
   3        m[3*ncol]           (41)
But this is not coalesced access! Elements (11), (21), etc. are not adjacent in memory. For coalesced access, we would like that Matrix Element row to read like this:
Thread:   Memory Location:   Matrix Element:
   0        m[?]                (11)
   1        m[?]                (12)
   2        m[?]                (13)
   3        m[?]                (14)
If you then work backwards to determine what the value of ? should be, you will come up with an instruction something like this:
sum+=m[k*ncol+rowIdx];
This will give coalesced access, but it will not give you the correct answer, because we are now summing matrix columns instead of matrix rows. We can fix this by re-organizing your data storage to be in column-major order rather than row-major order. (You should be able to google that for ideas, right?) Conceptually, this is equivalent to transposing your matrix m. Whether this is convenient for you to do or not is outside the scope of your question, as I see it, and not really a CUDA issue. It may be a simple thing for you to do as you are creating the matrix on the host or transferring the matrix from host to device. But in summary, I don't know of a way to sum the matrix rows with 100% coalesced access, if the matrix is stored in row-major order. (You could resort to a sequence of row-reductions but that looks painful to me.)
It's not uncommon, when we are thinking about ways to accelerate code on the GPU, to consider re-organizing our data storage to facilitate the GPU. This is one example.
And, yes, what I'm outlining here still retains a loop in the kernel.
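For illustration, a sketch of the coalesced variant of the question's kernel, assuming the matrix has been stored (or transposed) in column-major order so that element (row, col) lives at m[col * nrow + row]:
__global__ void kernel_rowSum_colmajor(const float *m, float *s, int nrow, int ncol) {
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowIdx < nrow) {
        float sum = 0.0f;
        for (int k = 0; k < ncol; k++)
            sum += m[k * nrow + rowIdx];  // adjacent threads read adjacent addresses
        s[rowIdx] = sum;
    }
}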
As an additional comment, I would suggest timing the data copy portions, and kernel (compute) portions separately. I can't tell from your question whether you are timing just the kernel or the entire (GPU) operation, including the data copies. If you time the data copies separately, you may discover that just the data copy time exceeds your CPU time. Any effort put into optimizing your CUDA code will not affect the data copy time. This might be a useful data point before you spend much time on this.
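A minimal sketch of that separate timing with CUDA events (h_m and h_output are hypothetical host buffers; the other names come from the question):
cudaEvent_t t0, t1, t2, t3;
cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2); cudaEventCreate(&t3);

cudaEventRecord(t0);
cudaMemcpy(d_m, h_m, (size_t)nrow * ncol * sizeof(float), cudaMemcpyHostToDevice);
cudaEventRecord(t1);
kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);
cudaEventRecord(t2);
cudaMemcpy(h_output, d_output, nrow * sizeof(float), cudaMemcpyDeviceToHost);
cudaEventRecord(t3);
cudaEventSynchronize(t3);

float msH2D, msKernel, msD2H;
cudaEventElapsedTime(&msH2D,    t0, t1);  // host-to-device copy
cudaEventElapsedTime(&msKernel, t1, t2);  // kernel only
cudaEventElapsedTime(&msD2H,    t2, t3);  // device-to-host copy
printf("H2D %.3f ms, kernel %.3f ms, D2H %.3f ms\n", msH2D, msKernel, msD2H);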

How to dynamically allocate arrays inside a kernel?

I need to dynamically allocate some arrays inside the kernel function. How can I do that?
My code is something like that:
__global__ func(float *grid_d, int n, int nn){
    int i, j;
    float x[n], y[nn];
    //Do some really cool and heavy computations here that takes hours.
}
But that will not work. If this were host code I could use malloc. cudaMalloc needs a pointer on the host and another on the device, and inside the kernel function I don't have the host pointer.
So, what should I do?
If it takes too long (a few seconds) to allocate all the arrays (I need about 4 of size n and 5 of size nn), that won't be a problem, since the kernel will probably run for at least 20 minutes.
Dynamic memory allocation is only supported on compute capability 2.x and newer hardware. You can use either the C++ new keyword or malloc in the kernel, so your example could become:
__global__ func(float *grid_d, int n, int nn){
    int i, j;
    float *x = new float[n], *y = new float[nn];
}
This allocates memory on a local memory runtime heap which has the lifetime of the context, so make sure you free the memory after the kernel finishes running if your intention is not to use the memory again. You should also note that runtime heap memory cannot be accessed directly from the host APIs, so you cannot pass a pointer allocated inside a kernel as an argument to cudaMemcpy, for example.
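One practical note, as a sketch: the device-side heap used by in-kernel new/malloc defaults to a modest size (8 MB), so if the combined per-thread allocations exceed that, the limit has to be raised before the first kernel launch. Here nThreads, numBlocks and threadsPerBlock are hypothetical launch parameters:
// Illustrative only: size the device heap for nThreads threads, each allocating
// 4 arrays of n floats and 5 arrays of nn floats (as in the question), then launch.
size_t heapBytes = (size_t)nThreads * (4 * n + 5 * nn) * sizeof(float);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes);

func<<<numBlocks, threadsPerBlock>>>(grid_d, n, nn);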
@talonmies answered your question on how to dynamically allocate memory within a kernel. This is intended as a supplemental answer, addressing the performance of __device__ malloc() and an alternative you might want to consider.
Allocating memory dynamically in the kernel can be tempting because it allows GPU code to look more like CPU code. But it can seriously affect performance. I wrote a self contained test and have included it below. The test launches some 2.6 million threads. Each thread populates 16 integers of global memory with some values derived from the thread index, then sums up the values and returns the sum.
The test implements two approaches. The first approach uses __device__ malloc() and the second approach uses memory that is allocated before the kernel runs.
On my 2.0 device, the kernel runs in 1500ms when using __device__ malloc() and 27ms when using pre-allocated memory. In other words, the test takes 56x longer to run when memory is allocated dynamically within the kernel. The time includes the outer loop cudaMalloc() / cudaFree(), which is not part of the kernel. If the same kernel is launched many times with the same number of threads, as is often the case, the cost of the cudaMalloc() / cudaFree() is amortized over all the kernel launches. That brings the difference even higher, to around 60x.
Speculating, I think that the performance hit is in part caused by implicit serialization. The GPU must probably serialize all simultaneous calls to __device__ malloc() in order to provide separate chunks of memory to each caller.
The version that does not use __device__ malloc() allocates all the GPU memory before running the kernel. A pointer to the memory is passed to the kernel. Each thread calculates an index into the previously allocated memory instead of using a __device__ malloc().
The potential issue with allocating memory up front is that, if only some threads need to allocate memory, and it is not known which threads those are, it will be necessary to allocate memory for all the threads. If there is not enough memory for that, it might be more efficient to reduce the number of threads per kernel call than to use __device__ malloc(). Other workarounds would probably end up reimplementing what __device__ malloc() is doing in the background, and would see a similar performance hit.
Test the performance of __device__ malloc():
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
const int N_ITEMS(16);
#define USE_DYNAMIC_MALLOC
__global__ void test_malloc(int* totals)
{
    int tx(blockIdx.x * blockDim.x + threadIdx.x);
    int* s(new int[N_ITEMS]);
    for (int i(0); i < N_ITEMS; ++i) {
        s[i] = tx * i;
    }
    int total(0);
    for (int i(0); i < N_ITEMS; ++i) {
        total += s[i];
    }
    totals[tx] = total;
    delete[] s;
}

__global__ void test_malloc_2(int* items, int* totals)
{
    int tx(blockIdx.x * blockDim.x + threadIdx.x);
    int* s(items + tx * N_ITEMS);
    for (int i(0); i < N_ITEMS; ++i) {
        s[i] = tx * i;
    }
    int total(0);
    for (int i(0); i < N_ITEMS; ++i) {
        total += s[i];
    }
    totals[tx] = total;
}

int main()
{
    cudaError_t cuda_status;
    cudaSetDevice(0);

    int blocks_per_launch(1024 * 10);
    int threads_per_block(256);
    int threads_per_launch(blocks_per_launch * threads_per_block);

    int* totals_d;
    cudaMalloc((void**)&totals_d, threads_per_launch * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaDeviceSynchronize();
    cudaEventRecord(start, 0);

#ifdef USE_DYNAMIC_MALLOC
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, threads_per_launch * N_ITEMS * sizeof(int));
    test_malloc<<<blocks_per_launch, threads_per_block>>>(totals_d);
#else
    int* items_d;
    cudaMalloc((void**)&items_d, threads_per_launch * sizeof(int) * N_ITEMS);
    test_malloc_2<<<blocks_per_launch, threads_per_block>>>(items_d, totals_d);
    cudaFree(items_d);
#endif

    cuda_status = cudaDeviceSynchronize();
    if (cuda_status != cudaSuccess) {
        printf("Error: %d\n", cuda_status);
        exit(1);
    }

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("Elapsed: %f\n", elapsedTime);

    int* totals_h(new int[threads_per_launch]);
    cuda_status = cudaMemcpy(totals_h, totals_d, threads_per_launch * sizeof(int), cudaMemcpyDeviceToHost);
    if (cuda_status != cudaSuccess) {
        printf("Error: %d\n", cuda_status);
        exit(1);
    }

    for (int i(0); i < 10; ++i) {
        printf("%d ", totals_h[i]);
    }
    printf("\n");

    cudaFree(totals_d);
    delete[] totals_h;

    return cuda_status;
}
Output:
C:\rd\projects\test_cuda_malloc\Release>test_cuda_malloc.exe
Elapsed: 27.311169
0 120 240 360 480 600 720 840 960 1080
C:\rd\projects\test_cuda_malloc\Release>test_cuda_malloc.exe
Elapsed: 1516.711914
0 120 240 360 480 600 720 840 960 1080
If the values of n and nn are known before the kernel is called, then why not cudaMalloc the memory on the host side and pass the device memory pointer to the kernel?
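A sketch of that suggestion (x_d, y_d, numBlocks and threadsPerBlock are hypothetical names, and the kernel's signature would have to be extended to accept the extra pointers; for per-thread workspace you would allocate one slice per thread and index by the global thread id, as in the pre-allocation answer above):
float *x_d, *y_d;
cudaMalloc(&x_d, n  * sizeof(float));   // allocated from the host, lives in device memory
cudaMalloc(&y_d, nn * sizeof(float));

func<<<numBlocks, threadsPerBlock>>>(grid_d, x_d, y_d, n, nn);

cudaFree(x_d);
cudaFree(y_d);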
I ran an experiment based on the concepts in @rogerdahl's post. Assumptions:
4MB of memory allocated in 64B chunks.
1 GPU block with 32 threads (one warp) in that block
Run on a P100
The malloc+free calls local to the GPU seemed to be much faster than the cudaMalloc + cudaFree calls. The program's output:
Starting timer for cuda malloc timer
Stopping timer for cuda malloc timer
timer for cuda malloc timer took 1.169631s
Starting timer for device malloc timer
Stopping timer for device malloc timer
timer for device malloc timer took 0.029794s
I'm leaving out the code for timer.h and timer.cpp, but here's the code for the test itself:
#include "cuda_runtime.h"
#include <stdio.h>
#include <thrust/system/cuda/error.h>
#include "timer.h"
static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)
const int BLOCK_COUNT = 1;
const int THREADS_PER_BLOCK = 32;
const int ITERATIONS = 1 << 12;
const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);
const int ARRAY_SIZE = 64;
void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err) {
if (err == cudaSuccess)
return;
std::cerr << statement<<" returned " << cudaGetErrorString(err) << "("<<err<< ") at "<<file<<":"<<line << std::endl;
exit (1);
}
__global__ void mallocai() {
for (int i = 0; i < ITERATIONS_PER_BLOCKTHREAD; ++i) {
int * foo;
foo = (int *) malloc(sizeof(int) * ARRAY_SIZE);
free(foo);
}
}
int main() {
Timer cuda_malloc_timer("cuda malloc timer");
for (int i = 0; i < ITERATIONS; ++ i) {
if (i == 1) cuda_malloc_timer.start(); // let it warm up one cycle
int * foo;
cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE);
cudaFree(foo);
}
cuda_malloc_timer.stop_and_report();
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
Timer device_malloc_timer("device malloc timer");
device_malloc_timer.start();
mallocai<<<BLOCK_COUNT, THREADS_PER_BLOCK>>>();
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
device_malloc_timer.stop_and_report();
}
If you find mistakes, please let me know in the comments, and I'll try to fix them.
And I ran it again with everything larger:
const int BLOCK_COUNT = 56;
const int THREADS_PER_BLOCK = 1024;
const int ITERATIONS = 1 << 18;
const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);
const int ARRAY_SIZE = 1024;
And cudaMalloc was still slower by a lot:
Starting timer for cuda malloc timer
Stopping timer for cuda malloc timer
timer for cuda malloc timer took 74.878016s
Starting timer for device malloc timer
Stopping timer for device malloc timer
timer for device malloc timer took 0.167331s
Maybe you should test
cudaMalloc(&foo,sizeof(int) * ARRAY_SIZE * ITERATIONS);
cudaFree(foo);
instead of:
for (int i = 0; i < ITERATIONS; ++i) {
    if (i == 1) cuda_malloc_timer.start(); // let it warm up one cycle
    int * foo;
    cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE);
    cudaFree(foo);
}
