CUDA, dynamic array + array. malloc and copy - arrays

So I have been stuck on this problem for a while. My struct looks like this:
typedef struct
{
int _size;
int _dim[DIMENSIONS];
float *_data;
}matrix;
Now the problem for me is how to malloc and memcpy. This is how I'm doing it:
matrix * d_in;
matrix * d_out;
const int THREADS_BYTES = sizeof(int) + sizeof(int)*DIMENSIONS + sizeof(float)*h_A->_size;
cudaMalloc((void **) &d_in, THREADS_BYTES);
cudaMemcpy(d_in, h_A, THREADS_BYTES, cudaMemcpyHostToDevice);
EDIT: this is how I allocated h_A:
matrix A; // = (matrix*)malloc(sizeof(matrix));
A._dim[0] = 40;
A._dim[1] = 60;
A._size = A._dim[0]*A._dim[1];
A._data = (float*)malloc(A._size*sizeof(float));
matrix *h_A = &A;
Where h_A is a matrix I allocated. I call my kernel like this:
DeviceComp<<<gridSize, blockSize>>>(d_out, d_in);
However, inside my kernel I cannot read anything through the _data pointer; only the _dim array and the _size variable are accessible.

This is a common problem. When you called malloc on the host (for h_A->_data), you allocated host memory, which is not accessible from the device.
This answer describes in some detail what is going on and how to fix it.
In your case, something like this should work:
matrix A; // = (matrix*)malloc(sizeof(matrix));
A._dim[0] = 40;
A._dim[1] = 60;
A._size = A._dim[0]*A._dim[1];
A._data = (float*)malloc(A._size*sizeof(float));
matrix *h_A = &A;
float *d_data;
cudaMalloc((void **) &d_data, A._size*sizeof(float));
matrix * d_in;
matrix * d_out;
// Copy only the struct itself; the element buffer lives in its own allocation (d_data).
cudaMalloc((void **) &d_in, sizeof(matrix));
cudaMemcpy(d_in, h_A, sizeof(matrix), cudaMemcpyHostToDevice);
// Overwrite the host pointer stored in the device struct with the device pointer.
cudaMemcpy(&(d_in->_data), &d_data, sizeof(float *), cudaMemcpyHostToDevice);
Note that this doesn't actually copy the data area from the host copy of A to the device copy. It simply makes a device-accessible data area, equal in size to the host data area. If you also want to copy the contents, that requires another cudaMemcpy operation, from h_A->_data to d_data.
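For anyone who wants to see the pattern without a GPU, here is a host-only analogy in plain C. It is only a sketch: malloc and memcpy stand in for cudaMalloc and cudaMemcpy, the helper name matrix_deep_copy is made up for illustration, and DIMENSIONS is assumed to be 2 because the question does not show its value. The three steps (copy the struct, allocate a separate buffer, patch the pointer) are the same ones the device version needs.

```c
#include <stdlib.h>
#include <string.h>

#define DIMENSIONS 2  /* assumed value; the question does not show it */

typedef struct {
    int _size;
    int _dim[DIMENSIONS];
    float *_data;
} matrix;

/* Hypothetical helper: copy the struct, then give the copy its own buffer
   and patch the pointer. Returns NULL if an allocation fails. */
matrix *matrix_deep_copy(const matrix *src)
{
    matrix *dst = malloc(sizeof *dst);
    if (dst == NULL)
        return NULL;

    /* Step 1: shallow copy. dst->_data still points at src's buffer. */
    memcpy(dst, src, sizeof *dst);

    /* Step 2: separate buffer for the elements, filled from the source. */
    float *buf = malloc((size_t)src->_size * sizeof *buf);
    if (buf == NULL) {
        free(dst);
        return NULL;
    }
    memcpy(buf, src->_data, (size_t)src->_size * sizeof *buf);

    /* Step 3: patch the pointer inside the copy. */
    dst->_data = buf;
    return dst;
}
```

Note that sizeof(matrix) covers only the two int fields and the pointer, never the elements the pointer refers to, which is why the struct and the buffer need separate allocations.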

Related

CUDA copying an array of arrays filled with data, from host to device

I've been looking for a way to transfer a filled array of arrays from host to device in CUDA.
What I have:
A global array of arrays that is filled with data, which I need to copy to the device for kernel execution.
The arrays in the array have different lengths.
I have a function to initialize the array and its values:
double** weights; // globally defined in host
int init_weigths(){
weights = (double**) malloc(sizeof(double*) * SIZE);
for (int i = 0; i < SIZE; i++) {
weights[i] = (double*) malloc(sizeof(double) * getSize(i));
for (int j = 0; j < getSize(i); j++){
weights[i][j] = get_value(i,j);
}
}
}
My (not working) solution:
I've put together a solution from other answers found on the Internet, but none of them worked. I think that's because my array of arrays is already filled with data, and because the contained arrays have variable lengths.
My solution throws an "invalid argument" error in all cudaMemcpy calls, and in the second and later cudaMalloc calls, as reported by cudaGetLastError().
The solution is this one:
double** d_weights;
int init_cuda_weight(){
cudaMalloc((void **) &d_weights, sizeof(double*) * SIZE);
double** temp_d_ptrs = (double**) malloc(sizeof(double*) * SIZE);
// temp array of device pointers
for (int i = 0; i < SIZE; i++){
cudaMalloc((void**) &temp_d_ptrs[getSize(i)],
sizeof(double) * getSize(i));
// ERROR CHECK WITH cudaGetLastError(); doesn't throw any errors at first.
cudaMemcpy(temp_d_ptrs[getSize(i)], weights[getSize(i)], sizeof(double) * getSize(i), cudaMemcpyHostToDevice);
// ERROR CHECK WITH cudaGetLastError(); throws "invalid argument" from here on.
}
cudaMemcpy(d_weights, temp_d_ptrs, sizeof(double*) * SIZE,
cudaMemcpyHostToDevice);
}
As additional information, I've simplified the code a bit. The arrays contained in the array of arrays have different lengths (i.e. SIZE2 isn't constant), which is why I'm not flattening to a 1D array.
What is wrong with this implementation? Any ideas to achieve the copy?
Edit2:
The original code I wrote was OK. I edited the code to include the error I had and included the correct answer (code) below.
The mistake is that I used the array's total size, getSize(i), as the index for the allocations and copies. It was a naive error hidden by the complexity and verbosity of the real code.
The correct solution is:
double** d_weights;
int init_cuda_weight(){
cudaMalloc((void **) &d_weights, sizeof(double*) * SIZE);
double** temp_d_ptrs = (double**) malloc(sizeof(double*) * SIZE);
// temp array of device pointers
for (int i = 0; i < SIZE; i++){
cudaMalloc((void**) &temp_d_ptrs[i],
sizeof(double) * getSize(i));
// ERROR CHECK WITH cudaGetLastError()
cudaMemcpy(temp_d_ptrs[i], weights[i], sizeof(double) * getSize(i), cudaMemcpyHostToDevice);
// ERROR CHECK WITH cudaGetLastError()
}
cudaMemcpy(d_weights, temp_d_ptrs, sizeof(double*) * SIZE,
cudaMemcpyHostToDevice);
}
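As an aside that is not from the thread: rows of different lengths can still be flattened if you keep a table of row offsets next to the data. The sketch below is host-only C, and flatten_rows, lengths and offsets are hypothetical names introduced for illustration. After packing, weights[i][j] becomes flat[offsets[i] + j], and both flat and offsets could each be sent to the device with a single cudaMemcpy instead of one call per row.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: packs `count` rows of differing lengths into one
   contiguous buffer. offsets must have room for count + 1 entries;
   offsets[i] is where row i starts and offsets[count] is the total number
   of elements. Returns NULL on allocation failure. */
double *flatten_rows(double *const *rows, const size_t *lengths, size_t count,
                     size_t *offsets)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        offsets[i] = total;
        total += lengths[i];
    }
    offsets[count] = total;

    /* Allocate at least one element so an empty input is not mistaken
       for an allocation failure. */
    double *flat = malloc((total ? total : 1) * sizeof *flat);
    if (flat == NULL)
        return NULL;

    for (size_t i = 0; i < count; i++)
        memcpy(flat + offsets[i], rows[i], lengths[i] * sizeof *flat);
    return flat;
}
```

Whether this is preferable to the pointer-table approach above depends on how the kernel walks the data; it mainly saves allocations and transfers.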

cuda : Is shared memory always helpful?

When I read the programming guide, I got the feeling that shared memory will always improve the performance, but it seems not.
I have two functions:
const int Ntimes=1;
__global__ void testgl(float *A, float *C, int numElements){
int ti = threadIdx.x;
int b0 = blockDim.x*blockIdx.x;
if (b0+ti < numElements){
for(int i=0;i<Ntimes;i++){
A[b0+ti]=A[b0+ti]*A[b0+ti]*10-2*A[b0+ti]+1;
}
C[b0+ti] = A[b0+ti]*A[b0+ti];
}
}
__global__ void testsh(float *A, float *C, int numElements){
int ti = threadIdx.x;
int b0 = blockDim.x*blockIdx.x;
__shared__ float a[1024];
if (b0+ti < numElements){
a[ti]=A[b0+ti];
}
__syncthreads();
if (b0+ti < numElements){
for(int i=0;i<Ntimes;i++){
a[ti]=a[ti]*a[ti]*10-2*a[ti]+1;
}
C[b0+ti] = a[ti]*a[ti];
}
}
int main(void){
int numElements = 500000;
size_t size = numElements * sizeof(float);
// Allocate the host input
float *h_A = (float *)malloc(size);
float *h_B = (float *)malloc(size);
// Allocate the host output
float *h_C = (float *)malloc(size);
float *h_D = (float *)malloc(size);
// Initialize the host input
for (int i = 0; i < numElements; i++){
h_A[i] = rand()/(float)RAND_MAX;
h_B[i] = h_A[i];
}
// Allocate the device input
float *d_A = NULL; cudaMalloc((void **)&d_A, size);
float *d_B = NULL; cudaMalloc((void **)&d_B, size);
float *d_C = NULL; cudaMalloc((void **)&d_C, size);
float *d_D = NULL; cudaMalloc((void **)&d_D, size);
//Copy to Device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
// Launch both test kernels
int threadsPerBlock = 1024;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
testgl<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_C, numElements);
testsh<<<blocksPerGrid, threadsPerBlock>>>(d_B, d_D, numElements);
// Copy the device results to the host
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
cudaMemcpy(h_D, d_D, size, cudaMemcpyDeviceToHost);
// Free device global memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaFree(d_D);
// Free host memory
free(h_A);
free(h_B);
free(h_C);
free(h_D);
// Reset the device and exit
cudaDeviceReset();
return 0;
}
If Ntimes is set to be 1, testgl costs 49us, and testsh costs 97us.
If Ntimes is set to be 100, testgl costs 9.7ms, and testsh costs 8.9ms.
I do not know why it's more than 100 times longer.
So it seems the shared memory helps only when we want to do a lot of things in device, is that right?
The card used here is GTX680.
Thanks in advance.
"shared memory will always improve the performance"
That's not true. It depends on the algorithm. If the kernel's global memory accesses are perfectly coalesced and each location is accessed only once, shared memory may not help. But if you are implementing, say, a matrix multiplication, where partial sums need to be held, it will be useful.
It also helps when the kernel accesses the same memory location more than once, because shared memory is on-chip and its latency is roughly 100 times lower than that of global memory.
Once you have determined that a kernel is bandwidth limited, that is a good point to consider whether shared memory could improve performance. It is also a good idea to use the occupancy calculator to check whether the shared memory usage will affect occupancy.
"shared memory helps only when we want to do a lot of things in device?"
Partially, yes. Shared memory helps when we want to do a lot of things on the device.
In the kernel above, since you access global memory more than once, it should help. It would help if you could provide a complete reproducer so the code can be analyzed, along with the details of the card you are running on.

sending struct array to cuda kernel

I'm working on a project and I have to send a struct array to a CUDA kernel. The struct also contains an array. To test it I have written a simple program.
struct Point {
short x;
short *y;
};
my kernel code:
__global__ void addKernel(Point *a, Point *b, Point *c)
{
int i = threadIdx.x;
c[i].x = a[i].x + b[i].x;
for (int j = 0; j<4; j++){
c[i].y[j] = a[i].y[j] + a[i].y[j];
}
}
my main code:
int main()
{
const int arraySize = 4;
const int arraySize2 = 4;
short *ya, *yb, *yc;
short *dev_ya, *dev_yb, *dev_yc;
Point *a;
Point *b;
Point *c;
Point *dev_a;
Point *dev_b;
Point *dev_c;
size_t sizeInside = sizeof(short) * arraySize2;
ya = (short *)malloc(sizeof(short) * arraySize2);
yb = (short *)malloc(sizeof(short) * arraySize2);
yc = (short *)malloc(sizeof(short) * arraySize2);
ya[0] = 1; ya[1] =2; ya[2]=3; ya[3]=4;
yb[0] = 2; yb[1] =3; yb[2]=4; yb[3]=5;
size_t sizeGeneral = (sizeInside+sizeof(short)) * arraySize;
a = (Point *)malloc( sizeGeneral );
b = (Point *)malloc( sizeGeneral );
c = (Point *)malloc( sizeGeneral );
a[0].x = 2; a[0].y = ya;
a[1].x = 2; a[1].y = ya;
a[2].x = 2; a[2].y = ya;
a[3].x = 2; a[3].y = ya;
b[0].x = 4; b[0].y = yb;
b[1].x = 4; b[1].y = yb;
b[2].x = 4; b[2].y = yb;
b[3].x = 4; b[3].y = yb;
cudaMalloc((void**)&dev_a, sizeGeneral);
cudaMalloc((void**)&dev_b, sizeGeneral);
cudaMalloc((void**)&dev_c, sizeGeneral);
cudaMemcpy(dev_a, a, sizeGeneral, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, sizeGeneral, cudaMemcpyHostToDevice);
addKernel<<<1, 4>>>(dev_a, dev_b, dev_c);
cudaError_t err = cudaMemcpy(c, dev_c, sizeGeneral, cudaMemcpyDeviceToHost);
printf("{%d-->%d,%d,%d,%d} \n err= %d",c[0].x,c[0].y[0],c[1].y[1],c[1].y[2],c[2].y[3], err);
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
return 0;
}
It seems the CUDA kernel is not working. I can access the struct's 'x' member, but I cannot access the 'y' array. What can I do to access the 'y' array? Thanks in advance.
When you copy this struct to the device, you copy a short and a pointer to short, and that pointer still refers to host memory, not device memory. This is crucial. For a simple type such as short this works, because the value itself is copied along with the struct. So you have moved x and the pointer y to the device, but not the area y points to. You have to do that manually, by allocating device memory for it and updating the pointer y to point to that device memory.
You are not passing the array to the device. You can either make the array part of the struct, by defining it like this:
struct Point {
short normalVal;
short inStructArr[4];
};
Or copy the array into device memory and update the pointer in the struct.
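To illustrate the first option, here is a small host-only sketch in plain C (PointInline, Y_LEN and copy_points are names made up for this example). With the array stored inline, a single flat copy of count * sizeof(struct) bytes carries the elements along, and no pointer needs fixing afterwards. On the device, the same size expression would presumably go into cudaMalloc and cudaMemcpy.

```c
#include <string.h>

#define Y_LEN 4  /* matches the fixed length used in the question's kernel */

/* Variant of the question's struct with the array stored inline. */
typedef struct {
    short x;
    short y[Y_LEN];
} PointInline;

/* Copies `count` structs with one flat memcpy. Because y is part of the
   struct, its elements travel with it. */
void copy_points(PointInline *dst, const PointInline *src, size_t count)
{
    memcpy(dst, src, count * sizeof *src);
}
```

The trade-off is that the array length is fixed at compile time, so this only fits when every struct holds the same number of elements.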

Sending 2D array to Cuda Kernel

I'm having a bit of trouble understanding how to send a 2D array to CUDA. I have a program that parses a large file with 30 data points on each line. I read about 10 rows at a time and then create a matrix of rows and items (so in my example of 10 rows with 30 data points, it would be int list[10][30];). My goal is to send this array to my kernel and have each block process a row (I have gotten this to work perfectly in normal C, but CUDA has been a bit more challenging).
Here's what I'm doing so far, but no luck (note: sizeOfBuckets = rows, and sizeOfBucketsHoldings = items in a row... I know I should win an award for odd variable names):
int list[sizeOfBuckets][sizeOfBucketsHoldings]; //this is created at the start of the file and I can confirmed its filled with the correct data
#define sizeOfBuckets 10 //size of buckets before sending to process list
#define sizeOfBucketsHoldings 30
//Cuda part
//define device variables
int *dev_current_list[sizeOfBuckets][sizeOfBucketsHoldings];
//time to malloc the 2D array on device
size_t pitch;
cudaMallocPitch((int**)&dev_current_list, (size_t *)&pitch, sizeOfBucketsHoldings * sizeof(int), sizeOfBuckets);
//copy data from host to device
cudaMemcpy2D( dev_current_list, pitch, list, sizeOfBuckets * sizeof(int), sizeOfBuckets * sizeof(int), sizeOfBucketsHoldings * sizeof(int),cudaMemcpyHostToDevice );
process_list<<<count,1>>> (sizeOfBuckets, sizeOfBucketsHoldings, dev_current_list, pitch);
//free memory of device
cudaFree( dev_current_list );
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, int pitch) {
int tid = blockIdx.x;
for (int r = 0; r < sizeOfBuckets; ++r) {
int* row = (int*)((char*)current_list + r * pitch);
for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
int element = row[c];
}
}
The error I'm getting is:
main.cu(266): error: argument of type "int *(*)[30]" is incompatible with parameter of type "int *"
1 error detected in the compilation of "/tmp/tmpxft_00003f32_00000000-4_main.cpp1.ii".
Line 266 is the kernel call process_list<<<count,1>>> (count, countListItem, dev_current_list, pitch); I think the problem is that I am trying to receive my array in my function as int *, but how else can I declare it? In my pure C code, I use int current_list[num_of_rows][num_items_in_row], which works, but I can't get the same thing to work in CUDA.
My end goal is simple: I just want each block to process one row (sizeOfBuckets) and loop through all items in that row (sizeOfBucketsHoldings). I originally just did a normal cudaMalloc and cudaMemcpy, but it wasn't working, so I looked around and found out about cudaMallocPitch and cudaMemcpy2D (neither of which is in my CUDA by Example book). I have been trying to study examples, but they seem to give me the same error (I'm currently reading the CUDA C Programming Guide and found this idea on page 22, but still no luck). Any ideas, or suggestions of where to look?
Edit:
To test this, I just want to add the values of each row together (I copied the logic from the CUDA by Example array addition example).
My kernel:
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, size_t pitch, int *total) {
//TODO: we need to flip the list as well
int tid = blockIdx.x;
for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
total[tid] = total + current_list[tid][c];
}
}
Here's how I declare the total array in my main:
int *dev_total;
cudaMalloc( (void**)&dev_total, sizeOfBuckets * sizeof(int) );
You have some mistakes in your code.
When you copy the host array to the device, you should pass a one-dimensional host pointer. See the function signature.
You don't need to declare a static 2D array for device memory. That creates a static array in host memory, which you then try to allocate again as a device array. Keep in mind that it must be a one-dimensional array, too. See this function signature.
This example should help you with memory allocation:
__global__ void process_list(int sizeOfBucketsHoldings, int* total, int* current_list, size_t pitch)
{
int tid = blockIdx.x;
total[tid] = 0;
for (int c = 0; c < sizeOfBucketsHoldings; ++c)
{
total[tid] += *((int*)((char*)current_list + tid * pitch) + c);
}
}
int main()
{
size_t sizeOfBuckets = 10;
size_t sizeOfBucketsHoldings = 30;
size_t width = sizeOfBucketsHoldings * sizeof(int); // needs to be in bytes
size_t height = sizeOfBuckets;
int* list = new int [sizeOfBuckets * sizeOfBucketsHoldings];// one dimensional
for (int i = 0; i < sizeOfBuckets; i++)
for (int j = 0; j < sizeOfBucketsHoldings; j++)
list[i *sizeOfBucketsHoldings + j] = i;
size_t pitch_h = sizeOfBucketsHoldings * sizeof(int);// always in bytes
int* dev_current_list;
size_t pitch_d;
cudaMallocPitch((int**)&dev_current_list, &pitch_d, width, height);
int *test;
cudaMalloc((void**)&test, sizeOfBuckets * sizeof(int));
int* h_test = new int[sizeOfBuckets];
cudaMemcpy2D(dev_current_list, pitch_d, list, pitch_h, width, height, cudaMemcpyHostToDevice);
process_list<<<10, 1>>>(sizeOfBucketsHoldings, test, dev_current_list, pitch_d);
cudaDeviceSynchronize();
cudaMemcpy(h_test, test, sizeOfBuckets * sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < sizeOfBuckets; i++)
printf("%d %d\n", i , h_test[i]);
return 0;
}
To access element (x, y) of your 2D array in the kernel, compute the row start as (char*)base_addr + y * pitch_d, cast it back to int*, and index it with x.
WARNING: the pitch is always in bytes, so you need to cast your pointer to char* before adding it.
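The byte-based addressing can be tried out on the host without a GPU. The following is a minimal sketch in plain C; pitched_row and sum_row are hypothetical helper names, and the buffer is assumed to be laid out the way cudaMallocPitch would lay it out, that is, each row padded to pitch bytes, with pitch a multiple of sizeof(int).

```c
#include <stddef.h>

/* Returns a pointer to the start of row `row` in a pitched buffer.
   `pitch` is the distance between consecutive rows in BYTES, so the
   arithmetic is done on a char pointer before casting back to int. */
static int *pitched_row(void *base, size_t pitch, size_t row)
{
    return (int *)((char *)base + row * pitch);
}

/* Sums the first `width` ints of one row, ignoring any padding after them. */
int sum_row(void *base, size_t pitch, size_t row, size_t width)
{
    const int *r = pitched_row(base, pitch, row);
    int total = 0;
    for (size_t c = 0; c < width; c++)
        total += r[c];
    return total;
}
```

Adding the pitch to an int* directly would scale it by sizeof(int) and land several rows too far, which is the mistake the warning above is about.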

malloc causing a SIGSEGV:Segmentation fault

typedef struct Matrix
{
double * matrix;
int sizex;
int sizey;
}Matrix;
int nn = 257;
Matrix * g = (Matrix *)malloc(sizeof(Matrix *));
g->matrix = malloc(sizeof(double) * nn * nn);
g->sizex = nn;
g->sizey = nn;
This code gives an error when it gets to g->matrix = malloc(sizeof(double) * nn * nn);
Does anyone see a problem with it?
Edit: the problem turned out to be an access to unallocated memory at a point before the allocation shown; that was causing the SIGSEGV (segmentation fault).
You need to pass malloc the size of the Matrix, not the size of a pointer to the Matrix.
Change
Matrix * g = (Matrix *)malloc(sizeof(Matrix *));
                                           ^^
to
Matrix * g = (Matrix *)malloc(sizeof(Matrix));
Also, you must always check the return value of malloc and make sure the allocation succeeded before you use the allocated memory.
I'm guessing you're using some ancient 16-bit compiler, probably Turbo C. Junk it and get gcc, either djgpp if you want to build DOS programs or mingw or cygwin if you want to build Windows programs.
Assuming I'm right, 257*257 overflows the maximum addressable size, 65536, not to mention what happens when you multiply it by 8.
Edit: OP changed the question after I wrote this so it may be completely off. If so I'll delete it.
Matrix * g = (Matrix *)malloc(sizeof(Matrix *));
should be
Matrix * g = (Matrix *)malloc(sizeof(Matrix));
You're only allocating the size of a pointer to Matrix, rather than a Matrix itself.
You do not allocate memory for a Matrix object, you are allocating for a pointer.
Change your first malloc call to do this:
Matrix * g = malloc(sizeof(*g));
I prefer this style as you do not need explicit pointer cast from void * in C, and you are allowed to do sizeof on the variable's underlying type. This could save you headaches just in case you change the type of g (or for future code).
Similarly for style:
g->matrix = malloc(sizeof(double) * nn * nn);
should be:
g->matrix = malloc(sizeof(*(g->matrix)) * nn * nn);
The operand of your sizeof should not be a pointer type.
Matrix * g = (Matrix *)malloc(sizeof(Matrix *));
This reserves enough heap space for a pointer to a Matrix, but you want enough heap space for the Matrix itself. Try:
Matrix* g = (Matrix*)malloc(sizeof(Matrix));
For a complete, working program:
#include <stdlib.h>
#include <stdio.h>
typedef struct Matrix {
double * matrix;
int sizex;
int sizey;
} Matrix;
int main()
{
int nn = 257;
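Matrix * g = (Matrix *)malloc(sizeof(Matrix));
if (g == NULL)
{
printf("g = malloc() failed\n");
return 1;
}
g->matrix = malloc(sizeof(double) * nn * nn);
if (g->matrix == NULL)
{
printf("g->matrix = malloc() failed\n");
free(g);
return 1;
}
g->sizex = nn;
g->sizey = nn;
// %p expects a void *, so cast the pointers explicitly
printf("g %p, g->matrix %p, g->sizex %d, g->sizey %d\n",
(void *)g, (void *)g->matrix, g->sizex, g->sizey);
free(g->matrix);
free(g);
return 0;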
}
Output on my Linux box:
g 0x8822008, g->matrix 0xf6ea6008, g->sizex 257, g->sizey 257
I'll just add the advice not to cast malloc's return value. It is unnecessary, clutters the code, and might lead to unwanted behavior (as described here: http://c-faq.com/malloc/mallocnocast.html).
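Pulling the suggestions from these answers together (sizeof on the pointed-to object, no cast, checking every allocation), a helper could look roughly like the sketch below. matrix_create and matrix_destroy are hypothetical names, not part of the original code, and the sketch assumes a C compiler, since C++ would require the cast.

```c
#include <stdlib.h>

typedef struct Matrix {
    double *matrix;
    int sizex;
    int sizey;
} Matrix;

/* Hypothetical helper: allocates an n-by-n Matrix, or returns NULL on failure. */
Matrix *matrix_create(int n)
{
    if (n <= 0)
        return NULL;

    Matrix *g = malloc(sizeof *g);  /* size of a Matrix, not of a pointer */
    if (g == NULL)
        return NULL;

    g->matrix = malloc(sizeof *g->matrix * (size_t)n * (size_t)n);
    if (g->matrix == NULL) {
        free(g);
        return NULL;
    }
    g->sizex = n;
    g->sizey = n;
    return g;
}

/* Releases the element buffer and then the struct itself. */
void matrix_destroy(Matrix *g)
{
    if (g != NULL) {
        free(g->matrix);
        free(g);
    }
}
```

Because the sizes are written as sizeof *g and sizeof *g->matrix, they stay correct if the types of g or its matrix member change later.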

Resources