I wish to copy a dynamically allocated array, pointed to by a member of one struct, into another struct. The struct looks like this:
typedef struct COORD3D
{
int x,y,z;
}
COORD3D;
typedef struct structName
{
double *volume;
COORD3D size;
// .. some other vars
}
structName;
I wish to do this inside a function where I pass in the address of an empty instance of the struct and the address of the struct with the data I wish to copy. Currently I do this serially via:
void foo(structName *dest, structName *source)
{
// .. some other work
int size = source->size.x * source->size.y * source->size.z;
dest->volume = (double*)malloc(size*sizeof(double));
int i;
for(i=0;i<size;i++)
dest->volume[i] = source->volume[i];
}
I want to do this in CUDA to speed up the process (as the array is very large, ~12 million elements).
I have tried the following; however, although the code compiles and runs, I get incorrect results stored in the array (they seem to be very large random numbers):
void foo(structName *dest, structName *source)
{
// .. some other work
int size = source->size.x * source->size.y * source->size.z;
dest->volume = (double*)malloc(size*sizeof(double));
// Device Pointers
double *DEVICE_SOURCE, *DEVICE_DEST;
// Declare memory on GPU
cudaMalloc(&DEVICE_DEST,size);
cudaMalloc(&DEVICE_SOURCE,size);
// Copy Source to GPU
cudaMemcpy(DEVICE_SOURCE,source->volume,size,
cudaMemcpyHostToDevice);
// Setup Blocks/Grids
dim3 dimGrid(ceil(source->size.x/10.0),
ceil(source->size.y/10.0),
ceil(source->size.z/10.0));
dim3 dimBlock(10,10,10);
// Run CUDA Kernel
copyVol<<<dimGrid,dimBlock>>> (DEVICE_SOURCE,
DEVICE_DEST,
source->size.x,
source->size.y,
source->size.z);
// Copy Constructed Array back to Host
cudaMemcpy(dest->volume,DEVICE_DEST,size,
cudaMemcpyDeviceToHost);
}
The Kernel looks like this:
__global__ void copyVol(double *source, double *dest,
int x, int y, int z)
{
int posX = blockIdx.x * blockDim.x + threadIdx.x;
int posY = blockIdx.y * blockDim.y + threadIdx.y;
int posZ = blockIdx.z * blockDim.z + threadIdx.z;
if (posX < x && posY < y && posZ < z)
{
dest[posX+(posY*x)+(posZ*y*x)] =
source[posX+(posY*x)+(posZ*y*x)];
}
}
Can anyone tell me where I am going wrong?
I am risking a wrong answer, but have you left out the size of the data type?
cudaMalloc(&DEVICE_DEST,size);
should be
cudaMalloc(&DEVICE_DEST,size*sizeof(double));
Also
cudaMemcpy(DEVICE_SOURCE,source->volume,size, cudaMemcpyHostToDevice);
should be
cudaMemcpy(DEVICE_SOURCE,source->volume,size*sizeof(double), cudaMemcpyHostToDevice);
and so on.
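To make the fix concrete, here is a hedged sketch of the corrected host function, reusing the kernel, struct, and launch configuration from the question; the only changes are the byte counts passed to the CUDA calls and freeing the device buffers afterwards:
void foo(structName *dest, structName *source)
{
    // .. some other work
    int size = source->size.x * source->size.y * source->size.z;
    size_t bytes = (size_t)size * sizeof(double);   // CUDA sizes are in bytes, not elements
    dest->volume = (double*)malloc(bytes);
    // Device pointers
    double *DEVICE_SOURCE, *DEVICE_DEST;
    cudaMalloc(&DEVICE_DEST, bytes);
    cudaMalloc(&DEVICE_SOURCE, bytes);
    // Copy source to GPU
    cudaMemcpy(DEVICE_SOURCE, source->volume, bytes, cudaMemcpyHostToDevice);
    // Setup blocks/grid (unchanged from the question)
    dim3 dimBlock(10, 10, 10);
    dim3 dimGrid(ceil(source->size.x / 10.0),
                 ceil(source->size.y / 10.0),
                 ceil(source->size.z / 10.0));
    // Run CUDA kernel
    copyVol<<<dimGrid, dimBlock>>>(DEVICE_SOURCE, DEVICE_DEST,
                                   source->size.x, source->size.y, source->size.z);
    // Copy constructed array back to host
    cudaMemcpy(dest->volume, DEVICE_DEST, bytes, cudaMemcpyDeviceToHost);
    cudaFree(DEVICE_SOURCE);
    cudaFree(DEVICE_DEST);
}
Checking the return codes of the CUDA calls (or cudaGetLastError() after the launch) would also have flagged the undersized allocations.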
Related
I need to define a struct with two semaphores and three (at the least), or maybe more, arrays as members, whose sizes are variables. Indicative example (not the right syntax, but to give contextual meaning; lr is a typedef for double):
int nx = 400,ny = 400,nz = 400;
struct ShMem {
sem_t prod;
sem_t cons;
lr u_x[nx+1][ny+2][nz+2];
lr u_y[nx+2][ny+1][nz+2];
lr u_z[nx+2][ny+2][nz+2];
};
What I need to do is make the struct ShMem a shared memory block between two programs, a producer and a consumer, which compute into and read from this memory block with the help of the semaphores present in the struct.
Since the array sizes are variables and will only be defined at runtime, how do I get a 3-dimensional variable-length array?
Comment:
If, let's say, I have nx, ny and nz #defined to 400, I follow these steps (already tested):
#define nx (400)
#define ny (400)
#define nz (400)
struct ShMem {
sem_t prod;
sem_t cons;
lr u_x[nx+1][ny+2][nz+2];
lr u_y[nx+2][ny+1][nz+2];
lr u_z[nx+2][ny+2][nz+2];
};
...
// shared memory allocation
ShmID = shmget(ShmKEY, sizeof(struct ShMem), IPC_CREAT|0666);
...
An additional requirement is that for the application I do need those arrays as 3D arrays, so that I can index them as u_x[i][j][k], where i, j, k are indices in the x-, y- and z-direction respectively.
Edit after Lundin's and Felix's solutions:
CONSTRAINT: u_x, u_y and u_z need to be 3D arrays/*** pointers that are accessed as u_x[i][j][k]. This can't be changed since this is legacy code; the arrays need to be set up so that the sanctity of that access is maintained. Everywhere in the code they are accessed like that.
As already discussed in the comments, C doesn't support something like that. So, you will have to build it yourself. A simple example using a macro to make the "3D access" inside the structure readable could look like this:
#include <stdlib.h>
typedef int lr;
struct foo {
size_t ny;
size_t nz;
lr *u_y;
lr *u_z;
lr u_x[];
};
#define val(o, a, x, y, z) ((o).a[(o).ny * (o).nz * (x) + (o).nz * (y) + (z)])
struct foo *foo_create(size_t nx, size_t ny, size_t nz)
{
size_t arrsize = nx * ny * nz;
struct foo *obj = malloc(sizeof *obj + 3 * arrsize * sizeof *(obj->u_x));
if (!obj) return 0;
obj->ny = ny;
obj->nz = nz;
obj->u_y = obj->u_x + arrsize;
obj->u_z = obj->u_y + arrsize;
return obj;
}
int main(void)
{
struct foo *myFoo = foo_create(10, 10, 10);
// set u_y[9][5][2] in *myFoo to 42:
val(*myFoo, u_y, 9, 5, 2) = 42;
free(myFoo);
}
This uses the single flexible array member (FAM) at the end of the struct that C supports, so you can allocate such a struct in a single block. To place it in shared memory, just replace the malloc() and use the same calculation for the size.
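For example, a minimal sketch of the shared-memory variant, assuming the same System V shmget/shmat calls the question already uses (the key, permission flags and error handling here are placeholders):
#include <sys/ipc.h>
#include <sys/shm.h>
struct foo *foo_create_shm(key_t key, size_t nx, size_t ny, size_t nz, int *shmid_out)
{
    size_t arrsize = nx * ny * nz;
    /* same size calculation as the malloc() version */
    size_t total = sizeof(struct foo) + 3 * arrsize * sizeof(lr);
    int shmid = shmget(key, total, IPC_CREAT | 0666);
    if (shmid < 0) return 0;
    struct foo *obj = shmat(shmid, 0, 0);
    if (obj == (void *)-1) return 0;
    obj->ny = ny;
    obj->nz = nz;
    obj->u_y = obj->u_x + arrsize;   /* second array starts right after the first */
    obj->u_z = obj->u_y + arrsize;
    *shmid_out = shmid;
    return obj;
}
Note that u_y and u_z are stored as absolute pointers, so the consumer must either attach the segment at the same address or recompute those two pointers after its own shmat(); alternatively, store element offsets instead of pointers.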
You have to build something like this
struct ShMem {
int some_stuff_here;
size_t x[3];
size_t y[3];
size_t z[3];
int array[];
};
And then ignore that the flexible array member type is a plain int array. Instead do something like
size_t size = sizeof( int[x1][y1][z1] ) +
sizeof( int[x2][y2][z2] ) +
sizeof( int[x3][y3][z3] );
struct ShMem *shmem = malloc(sizeof *shmem + size);
And then when accessing, you use an array pointer type instead of the int[]. The code turns a bit nasty to read:
for(size_t i=0; i<3; i++)
{
typedef int(*arr_t)[shmem->y[i]][shmem->z[i]]; // desired array pointer type
int some_offset = ... // calculate based on previously used x y z
arr_t arr = (arr_t)(shmem->array + some_offset); // cast to the array pointer type
for(size_t x=0; x<shmem->x[i]; x++)
{
for(size_t y=0; y<shmem->y[i]; y++)
{
for(size_t z=0; z<shmem->z[i]; z++)
{
arr[x][y][z] = something;
}
}
}
}
This is actually well-defined behavior since the data allocated with malloc doesn't have an effective type until you access it.
"some_offset" in the above example could be a counter variable or something stored inside the struct itself.
Let's say we have a struct of 3 integers which is not aligned:
struct data {
int x;
int y;
int z;
};
I pass an array of this struct to a kernel. I'm aware that I should pass a struct of arrays instead of an array of structs, but that is not important for this question.
The 32 threads inside a warp access memory in a coalesced manner (elements i to i + 31), which amounts to 384 bytes in total. 384 bytes is a multiple of the L1 cache line size (128 bytes), which means three memory transactions of 128 bytes each.
Now if we have an aligned struct:
struct __align__(16) aligned_data {
int x;
int y;
int z;
};
If the access pattern remains the same as in the previous example, it would fetch 512 bytes of memory, i.e. 4 memory transactions of 128 bytes each.
So does this mean the first example is more efficient, or is the second one still more efficient even though it fetches more memory?
The only real way to answer a question like this is by benchmarking. And if you do, you may not get the same answer depending on your hardware. When I run this:
#define NITER (128)
struct data {
int x;
int y;
int z;
};
struct __align__(16) aligned_data {
int x;
int y;
int z;
};
template<typename T, int niter>
__global__
void kernel(T *in, int *out, int dowrite=0)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int nthreads = blockDim.x * gridDim.x;
int oval = 0;
#pragma unroll
for(int i=0; i<niter; ++i,tid+=nthreads) {
T val = in[tid];
oval += val.x + val.y + val.z;
}
if (dowrite) {
out[tid] = oval;
}
}
template __global__ void kernel<data,NITER>(data *, int*, int);
template __global__ void kernel<aligned_data,NITER>(aligned_data *, int*, int);
int main()
{
const int bs = 512;
const int nb = 32;
const int nvals = bs * nb * NITER;
data *d_; cudaMalloc((void **)&d_, sizeof(data) * size_t(nvals));
aligned_data *ad_; cudaMalloc((void **)&ad_, sizeof(aligned_data) * size_t(nvals));
for(int i=0; i<10; ++i) {
kernel<data,NITER><<<nb, bs>>>(d_, (int *)0, 0);
kernel<aligned_data,NITER><<<nb, bs>>>(ad_, (int *)0, 0);
cudaDeviceSynchronize();
}
cudaDeviceReset();
return 0;
}
I see that the aligned structure version gives higher overall performance on a compute capability 5.2 device:
Time(%) Time Calls Avg Min Max Name
52.71% 2.3995ms 10 239.95us 238.10us 241.79us void kernel<data, int=128>(data*, int*, int)
47.29% 2.1529ms 10 215.29us 214.91us 215.51us void kernel<aligned_data, int=128>(aligned_data*, int*, int)
In this case I would assume that the roughly 10% improvement is down to the lower number of load instructions issued. In the unaligned case the compiler issues three 32-bit loads to fetch the structure, whereas in the aligned case it issues a single 128-bit load. The reduction in instructions seems to offset the 25% of wasted net memory bandwidth. On other hardware, with a different ratio of memory instruction throughput to bandwidth, the result might well be different.
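One way to see where the extra traffic comes from is simply to print the structure sizes; a small sketch (plain host code, compiled with nvcc so that __align__ is available) reproduces the 384-byte vs. 512-byte per-warp figures from the question:
#include <cstdio>
struct data { int x; int y; int z; };
struct __align__(16) aligned_data { int x; int y; int z; };
int main()
{
    // 32 threads per warp, one element each: bytes requested per fully coalesced access
    printf("sizeof(data)         = %zu -> %zu bytes per warp\n",
           sizeof(data), 32 * sizeof(data));                  // 12 -> 384
    printf("sizeof(aligned_data) = %zu -> %zu bytes per warp\n",
           sizeof(aligned_data), 32 * sizeof(aligned_data));  // 16 -> 512
    return 0;
}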
I am trying to create a 3D grid for my OpenCL/GL fluid. The problem I'm having is that for some reason my grid initialization function does not work properly. Here is my *.h and *.c setup and (at the end) the call in main:
(grid.h):
#if RunGPU
#define make_float3(x,y,z) (float3)(x,y,z)
#define make_int3(i,j,k) (int3)(i,j,k)
#else
typedef struct i3{
int i,j,k;
} int3;
typedef struct f3{
float x,y,z;
} float3;
#define __global
#define make_float3(x,y,z) {x , y , z}
#define make_int3(x,y,z) {x , y ,z}
#endif
typedef struct grid3 * grid3_t; // u,v,w
typedef struct grid * grid_t; // p
struct grid3 {
__global float3* values_;
__global float * H_;
__global float * h_;
int dimx_;
int dimy_;
int dimz_;
} ;
struct grid {
__global float * values_;
int dimx_;
int dimy_;
int dimz_;
};
void grid3_init(grid3_t grid,__global float3* vel,__global float* H,__global float *h, int X, int Y, int Z);
(grid.c):
void grid3_init(grid3_t grid,__global float3* val,__global float* H,__global float *h, int X, int Y, int Z){
grid->values_ = val;
grid->H_ = H;
grid->h_ = h;
grid->dimx_ = X;
grid->dimy_ = Y;
grid->dimz_ = Z;
}
In main I'm initializing my grid like so:
int main(int argc, char** argv)
{
const int size3d = Bx*(By+2)*Bz;
const int size2d = Bx*Bz;
float3 * velocities = (float3*)malloc(size3d*sizeof(float3));
float * H = (float*)malloc(size2d*sizeof(float));
float * h = (float*)malloc(size2d*sizeof(float));
for(int i = 0; i < size3d; i++){
float3 tmp = make_float3(0.f,0.f,0.f);
velocities[i] = tmp;
if(i < size2d){
H[i] = 1;
h[i] = 2;
}
}
grid3_t theGrid;
grid3_init(theGrid, velocities, H, h, Bx, By, Bz); // <- ERROR OCCURS HERE
}
The error I'm getting is at runtime: "Run-Time Check Failure #3 - The variable 'theGrid' is being used without being initialized". But isn't that precisely the job of grid3_init?
As I'm trying to write code that works on both the host and the GPU, I have to sacrifice the use of classes and work strictly with structs, which I have less experience with.
At this point I don't really know what to google either; I appreciate any help I can get.
struct grid3 theGrid;
grid3_init(&theGrid, velocities, H, h, Bx, By, Bz);
You need to create a grid3 instance and pass its address to grid3_init. Your existing code just passes an uninitialized pointer.
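If the rest of the code really wants a grid3_t (which is a pointer typedef), an assumed alternative is to allocate the struct on the heap before initializing it, for example:
grid3_t theGrid = (grid3_t)malloc(sizeof *theGrid);   /* allocate the struct the pointer will refer to */
if (theGrid == NULL) { /* handle allocation failure */ }
grid3_init(theGrid, velocities, H, h, Bx, By, Bz);
/* ... use theGrid ... */
free(theGrid);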
I've got a Jacobi implementation on CUDA, but the problem is:
I assign threads this way:
#define imin(a,b) (a < b ? a : b)
int dimBlocks, dimThreads;
dimThreads = 256;
dimBlocks = imin(32, (dimThreads + dim - 1)/dimThreads);
But if I use 32 threads it's faster than using 256 threads or more...
I've got these results:
Sequential times:
9900 5.882000
9900 6.071000
Parallel times:
9900 1.341000 //using 32
9900 1.626000 //using 256
Where 9900 is matrix WIDTH... And we can see the following:
5.882 / 1.34 = 4.39
6.07 / 1.62 = 3.74
So are 32 threads more efficient than 256?
Sorry, I don't know if I should post the code (since it is a bit long); if you request it I will do so.
EDIT:
//Based on doubletony algorithm
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "Jacobi.cuh"
#include "thrust\host_vector.h"
#include "thrust\device_vector.h"
#include "thrust\extrema.h"
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <ctime>
#define imin(a,b) (a < b ? a : b)
// name OF FUNCTION: __copy_vector
// PURPOSE:
// The function will copy a vector.
//
// PARAMETERS:
// name type value/reference description
// ---------------------------------------------------------------------
// source double* value vector to be copied
// dest double* reference vector copied
// dim int value vector dimension
// RETURN VALUE:
// name type description
// ---------------------------------------------------------------------
//
__global__ void __copy_vector(double *source, double *dest, const int dim)
{
int tIdx = blockDim.x * blockIdx.x + threadIdx.x;
while(tIdx < dim){
dest[tIdx] = source[tIdx];
tIdx += gridDim.x * blockDim.x;
}
}
// name OF FUNCTION: __Jacobi_sum
// PURPOSE:
// The function will execute matrix vector multiplication
//
// PARAMETERS:
// name type value/reference description
// ---------------------------------------------------------------------
// A double* value A
// B double* value B
// C double* reference A*B
// dim int value vector dimension
// RETURN VALUE:
// name type description
// ---------------------------------------------------------------------
//
__global__ void __Jacobi_sum(const double *A,
const double *B,
double *resul,
const int dim)
{
int tIdx = blockIdx.x * blockDim.x + threadIdx.x;
while(tIdx < dim){
resul[tIdx] = 0;
for(int i = 0; i < dim; i++)
if(tIdx != i)
resul[tIdx] += A[tIdx * dim + i] * B[i];
tIdx += gridDim.x * blockDim.x;
}
__syncthreads;
}
// name OF FUNCTION: __substract
// PURPOSE:
// The function will execute A-B=resul
//
// PARAMETERS:
// name type value/reference description
// ---------------------------------------------------------------------
// A double* value A
// B double* value B
// C double* reference A-B
// dim int value vector dimension
// RETURN VALUE:
// name type description
// ---------------------------------------------------------------------
//
__global__ void __substract(const double *A,
const double *B,
double *C,
const int dim)
{
int tIdx = blockIdx.x * blockDim.x + threadIdx.x;
while(tIdx < dim){
C[tIdx] = A[tIdx] - B[tIdx];
tIdx += gridDim.x * blockDim.x;
}
}
// name OF FUNCTION: __divide
// PURPOSE:
// The function will execute the jacobi division, that is,
// (B-sum)/A[i,i]
//
// PARAMETERS:
// name type value/reference description
// ---------------------------------------------------------------------
// A double* value A
// B double* reference (B-sum)/A[i,i]
// dim int value vector dimension
// RETURN VALUE:
// name type description
// ---------------------------------------------------------------------
//
__global__ void __divide(const double *A, double *B, const int dim)
{
int tIdx = blockIdx.x * blockDim.x + threadIdx.x;
while(tIdx < dim){
//if(A[tIdx * dim + tIdx] != 0)
B[tIdx] /= A[tIdx * dim + tIdx];
tIdx += blockDim.x * gridDim.x;
}
}
// name OF FUNCTION: __absolute
// PURPOSE:
// The function will calculate the absolute value for each
// number in an array
//
// PARAMETERS:
// name type value/reference description
// ---------------------------------------------------------------------
// A double* reference |A[i]|
// dim int value vector dimension
// RETURN VALUE:
// name type description
// ---------------------------------------------------------------------
//
__global__ void __absolute(double *A, const int dim)
{
int tIdx = blockIdx.x * blockDim.x + threadIdx.x;
while(tIdx < dim){
if(A[tIdx] < 0)
A[tIdx] = -A[tIdx];
tIdx += blockDim.x * gridDim.x;
}
}
// name OF FUNCTION: Jacobi_Cuda
// PURPOSE:
// The function will calculate a X solution for a linear system
// using Jacobi's Method.
//
// PARAMETERS:
// name type value/reference description
// ---------------------------------------------------------------------
// Matrix_A double* value Matrix A(coefficients)
// Vector_B double* value Vector B
// Vector_X double* reference Solution
// dim int value Matrix Dimension
// e double value Error allowed
// maxIter int value Maximum iterations allowed
// RETURN VALUE:
// name type description
// ---------------------------------------------------------------------
//
void Jacobi_Cuda(const double *Matrix_A,
const double *Vector_B,
double *Vector_X,
const int dim,
const double e,
const int maxIter,
double *t)
{
/** Host variables **/
int iter = 0; // iter counter
double err = 1; // error between X^k and X^k-1
double *tmp; // temporary for thrust norm
double *norm; // Vector norm
tmp = (double *) malloc(sizeof(double) * dim);
norm = (double *) malloc(sizeof(double));
int dimBlocks, dimThreads;
dimThreads = 64;
dimBlocks = imin(32, (dim + dimThreads - 1)/dimThreads);
/** ************** **/
/** Device variables **/
double *d_Matrix_A, *d_Vector_B, *d_Vector_X, *d_Vector_Y, *d_Vector_Resul;
cudaMalloc((void**)&d_Matrix_A, sizeof(double) * dim * dim);
cudaMalloc((void**)&d_Vector_B, sizeof(double) * dim);
cudaMalloc((void**)&d_Vector_X, sizeof(double) * dim);
cudaMalloc((void**)&d_Vector_Y, sizeof(double) * dim);
cudaMalloc((void**)&d_Vector_Resul, sizeof(double) * dim);
/** **************** **/
/** Initialize **/
cudaMemcpy(d_Matrix_A, Matrix_A, sizeof(double) * dim * dim,
cudaMemcpyHostToDevice);
cudaMemcpy(d_Vector_B, Vector_B, sizeof(double) * dim, cudaMemcpyHostToDevice);
cudaMemcpy(d_Vector_X, Vector_X, sizeof(double) * dim, cudaMemcpyHostToDevice);
/** ********** **/
clock_t start,finish;
double totaltime;
start = clock();
/** Jacobi **/
while(err > e && iter < maxIter){
__copy_vector<<<dimBlocks, dimThreads>>>(d_Vector_X, d_Vector_Y, dim);
__Jacobi_sum<<<dimBlocks, dimThreads>>>(d_Matrix_A, d_Vector_Y,
d_Vector_Resul, dim);
__substract<<<dimBlocks, dimThreads>>>(d_Vector_B, d_Vector_Resul,
d_Vector_X, dim);
__divide<<<dimBlocks, dimThreads>>>(d_Matrix_A, d_Vector_X, dim);
__substract<<<dimBlocks, dimThreads>>>(d_Vector_Y, d_Vector_X,
d_Vector_Resul, dim);
__absolute<<<dimBlocks, dimThreads>>>(d_Vector_Resul, dim);
cudaMemcpy(tmp, d_Vector_Resul, sizeof(double) * dim, cudaMemcpyDeviceToHost);
double *t = thrust::max_element(tmp, tmp + dim); //vector norm
err = *t;
iter++;
}
finish = clock();
totaltime=(double)(finish-start)/CLOCKS_PER_SEC;
*t = totaltime;
cudaMemcpy(Vector_X, d_Vector_X, sizeof(double) * dim,
cudaMemcpyDeviceToHost);
if(iter == maxIter)
puts("Jacobi has reached maxIter!");
/** ****** **/
/** Free memory **/
cudaFree(d_Matrix_A);
cudaFree(d_Vector_B);
cudaFree(d_Vector_X);
cudaFree(d_Vector_Y);
cudaFree(d_Vector_Resul);
free(tmp);
free(norm);
/** *********** **/
}
It depends on your algorithm. Some algorithms are by definition non-parallelizable (calculating the Fibonacci series, for example). But here's a parallelizable Jacobi algorithm courtesy of Brown. Note that systems of equations CAN be solved either in serial or in parallel; it's just a matter of writing the code.
In short, it's impossible to know whether or not more threads = more speed unless you show us (or at least explain) the algorithm. As far as thread synchronization goes, CUDA is very (very) good at normalizing synchronization costs so (if your algorithm is proper), more threads should almost always yield more speed.
Fewer threads might be more efficient if the workload is small enough that the overhead of managing many threads outweighs the gains.
...but without seeing your code it's hard to say. Personally I'm more inclined to believe it's just a bug in your code.
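If you want to measure this rather than guess, one option (a sketch only, reusing the names from your Jacobi_Cuda function) is to time a single kernel with CUDA events for a few launch configurations:
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for (int threads = 32; threads <= 256; threads *= 2) {
    int blocks = imin(32, (dim + threads - 1) / threads);

    cudaEventRecord(start);
    __Jacobi_sum<<<blocks, threads>>>(d_Matrix_A, d_Vector_Y, d_Vector_Resul, dim);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait for the kernel, then read the elapsed time

    cudaEventElapsedTime(&ms, start, stop);
    printf("%3d threads/block, %2d blocks: %.3f ms\n", threads, blocks, ms);
}

cudaEventDestroy(start);
cudaEventDestroy(stop);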
I'm having a bit of trouble understanding how to send a 2D array to CUDA. I have a program that parses a large file with 30 data points on each line. I read about 10 rows at a time and then create a matrix of lines and items (so in my example of 10 rows with 30 data points it would be int list[10][30];). My goal is to send this array to my kernel and have each block process a row (I have gotten this to work perfectly in normal C, but CUDA has been a bit more challenging).
Here's what I'm doing so far, but no luck (note: sizeOfBuckets = rows and sizeOfBucketsHoldings = items in a row... I know I should win an award for odd variable names):
int list[sizeOfBuckets][sizeOfBucketsHoldings]; //this is created at the start of the file and I can confirmed its filled with the correct data
#define sizeOfBuckets 10 //size of buckets before sending to process list
#define sizeOfBucketsHoldings 30
//Cuda part
//define device variables
int *dev_current_list[sizeOfBuckets][sizeOfBucketsHoldings];
//time to malloc the 2D array on device
size_t pitch;
cudaMallocPitch((int**)&dev_current_list, (size_t *)&pitch, sizeOfBucketsHoldings * sizeof(int), sizeOfBuckets);
//copy data from host to device
cudaMemcpy2D( dev_current_list, pitch, list, sizeOfBuckets * sizeof(int), sizeOfBuckets * sizeof(int), sizeOfBucketsHoldings * sizeof(int),cudaMemcpyHostToDevice );
process_list<<<count,1>>> (sizeOfBuckets, sizeOfBucketsHoldings, dev_current_list, pitch);
//free memory of device
cudaFree( dev_current_list );
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, int pitch) {
int tid = blockIdx.x;
for (int r = 0; r < sizeOfBuckets; ++r) {
int* row = (int*)((char*)current_list + r * pitch);
for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
int element = row[c];
}
}
}
The error I'm getting is:
main.cu(266): error: argument of type "int *(*)[30]" is incompatible with parameter of type "int *"
1 error detected in the compilation of "/tmp/tmpxft_00003f32_00000000-4_main.cpp1.ii".
Line 266 is the kernel call process_list<<<count,1>>> (count, countListItem, dev_current_list, pitch);. I think the problem is that I am trying to create my array in my function as int *, but how else can I create it? In my pure C code I use int current_list[num_of_rows][num_items_in_row], which works, but I can't get the same approach to work in CUDA.
My end goal is simple: I just want each block to process a row (sizeOfBuckets) and then loop through all items in that row (sizeOfBucketsHoldings). I originally just did a normal cudaMalloc and cudaMemcpy, but it wasn't working, so I looked around and found out about cudaMallocPitch and cudaMemcpy2D (both of which were not in my CUDA by Example book). I have been trying to study examples, but they seem to give me the same error (I'm currently reading the CUDA C Programming Guide, which is where I found this idea, but still no luck). Any ideas, or suggestions of where to look?
Edit:
To test this, I just want to add the values of each row together (I copied the logic from the CUDA by Example array addition example).
My kernel:
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, size_t pitch, int *total) {
//TODO: we need to flip the list as well
int tid = blockIdx.x;
for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
total[tid] = total + current_list[tid][c];
}
}
Here's how I declare the total array in my main:
int *dev_total;
cudaMalloc( (void**)&dev_total, sizeOfBuckets * sizeof(int) );
You have some mistakes in your code.
When you copy the host array to the device, you should pass a one-dimensional host pointer. See the function signature.
You don't need to declare a static 2D array for the device memory. That creates a static array in host memory, and then you recreate it as a device array. Keep in mind it must be a one-dimensional pointer too. See this function signature.
This example should help you with memory allocation:
__global__ void process_list(int sizeOfBucketsHoldings, int* total, int* current_list, int pitch)
{
int tid = blockIdx.x;
total[tid] = 0;
for (int c = 0; c < sizeOfBucketsHoldings; ++c)
{
total[tid] += *((int*)((char*)current_list + tid * pitch) + c);
}
}
int main()
{
size_t sizeOfBuckets = 10;
size_t sizeOfBucketsHoldings = 30;
size_t width = sizeOfBucketsHoldings * sizeof(int);//ned to be in bytes
size_t height = sizeOfBuckets;
int* list = new int [sizeOfBuckets * sizeOfBucketsHoldings];// one dimensional
for (int i = 0; i < sizeOfBuckets; i++)
for (int j = 0; j < sizeOfBucketsHoldings; j++)
list[i *sizeOfBucketsHoldings + j] = i;
size_t pitch_h = sizeOfBucketsHoldings * sizeof(int);// always in bytes
int* dev_current_list;
size_t pitch_d;
cudaMallocPitch((void**)&dev_current_list, &pitch_d, width, height);
int *test;
cudaMalloc((void**)&test, sizeOfBuckets * sizeof(int));
int* h_test = new int[sizeOfBuckets];
cudaMemcpy2D(dev_current_list, pitch_d, list, pitch_h, width, height, cudaMemcpyHostToDevice);
process_list<<<10, 1>>>(sizeOfBucketsHoldings, test, dev_current_list, pitch_d);
cudaDeviceSynchronize();
cudaMemcpy(h_test, test, sizeOfBuckets * sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < sizeOfBuckets; i++)
printf("%d %d\n", i , h_test[i]);
return 0;
}
To access your 2D array in the kernel you should use the pattern base_addr + y * pitch_d + x.
WARNING: the pitch is always in bytes. You need to cast your pointer to a byte pointer (e.g. char*).
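As a small sketch of that warning, a device-side read of one element from the pitched allocation (using the kernel's current_list and pitch parameters; row and col are placeholder indices) looks like this:
// the row stride (pitch) is in bytes, so do the row arithmetic on a char* ...
int *row_ptr = (int*)((char*)current_list + row * pitch);
// ... while the column offset is in elements
int element = row_ptr[col];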