CUDA: retrieving a 3D array - arrays

I am having trouble figuring out how to retrieve a 3D array from the GPU.
I want to allocate the memory for the 3D array in the host code, call the kernel where the array is populated, and then retrieve the 3D array in the host code into a return variable of the mexFunction (host code).
I have made several attempts at it; here is my latest code. The results are all '0's where they should be '7'. Can anyone tell me where I'm going wrong? It might have something to do with the 3D parameters; I don't think I fully understand that part.
simulate3DArrays.cpp
/* Device code */
__global__ void simulate3DArrays(cudaPitchedPtr devPitchedPtr,
                                 int width,
                                 int height,
                                 int depth)
{
    /* One thread per depth slice. */
    int threadId = (blockIdx.x * blockDim.x) + threadIdx.x;
    size_t pitch = devPitchedPtr.pitch;

    for (int widthIndex = 0; widthIndex < width; widthIndex++) {
        for (int heightIndex = 0; heightIndex < height; heightIndex++) {
            /* slice offset + row offset + column offset */
            *((double*)(((char*)devPitchedPtr.ptr + threadId * pitch * height) + heightIndex * pitch) + widthIndex) = 7.0;
        }
    }
}
mexFunction.cu
/* Host code */
#include <stdio.h>
#include "mex.h"
/* Kernel function */
#include "simulate3DArrays.cpp"

/* Define some constants. */
#define width  5
#define height 9
#define depth  6

void displayMemoryAvailability(mxArray **MatlabMemory);

void mexFunction(int nlhs,
                 mxArray *plhs[],
                 int nrhs,
                 mxArray *prhs[])
{
    double *output;
    mwSize ndim3 = 3;
    mwSize dims3[] = {height, width, depth};

    plhs[0] = mxCreateNumericArray(ndim3, dims3, mxDOUBLE_CLASS, mxREAL);
    output = mxGetPr(plhs[0]);

    cudaExtent extent = make_cudaExtent(width * sizeof(double), height, depth);
    cudaPitchedPtr devicePointer;
    cudaMalloc3D(&devicePointer, extent);

    simulate3DArrays<<<1,depth>>>(devicePointer, width, height, depth);

    cudaMemcpy3DParms deviceOuput = { 0 };
    deviceOuput.srcPtr.ptr = devicePointer.ptr;
    deviceOuput.srcPtr.pitch = devicePointer.pitch;
    deviceOuput.srcPtr.xsize = width;
    deviceOuput.srcPtr.ysize = height;

    deviceOuput.dstPtr.ptr = output;
    deviceOuput.dstPtr.pitch = devicePointer.pitch;
    deviceOuput.dstPtr.xsize = width;
    deviceOuput.dstPtr.ysize = height;

    deviceOuput.kind = cudaMemcpyDeviceToHost;

    /* copy the 3D array back to 'output' */
    cudaMemcpy3D(&deviceOuput);

    return;
} /* End mexFunction */

The basic problem appears to be that you are instructing cudaMemcpy3D to copy zero bytes, because you have not set a non-zero extent to tell the API the size of the transfer.
Your transfer could probably be as simple as:
cudaMemcpy3DParms deviceOuput = { 0 };
deviceOuput.srcPtr = devicePointer;
deviceOuput.dstPtr.ptr = output;
deviceOuput.extent = extent;
cudaMemcpy3D(&deviceOuput);
I can't comment on whether the MEX interface you are using is correct, but the kernel looks superficially correct, and without compiling and running your code under MATLAB (which I cannot do) I don't see anything else obviously wrong.
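For reference, a slightly fuller version of the corrected transfer might look like this (an untested sketch that reuses the extent, devicePointer and output variables from the question; make_cudaPitchedPtr describes the tightly packed host buffer, and error checking is omitted):
/* Sketch only: copy the pitched device allocation back into the MATLAB output buffer. */
cudaMemcpy3DParms params = { 0 };
params.srcPtr = devicePointer;                       /* pitched device allocation  */
params.dstPtr = make_cudaPitchedPtr(output,          /* tightly packed host buffer */
                                    width * sizeof(double), width, height);
params.extent = extent;                              /* non-zero size of the copy  */
params.kind   = cudaMemcpyDeviceToHost;
cudaMemcpy3D(&params);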

Related

Returning a large variable vs. setting it using a pointer supplied in arguments

I'm interested in what practices are common for setting or returning large structures generated inside C functions. What is the best and safest way to do so? I can come up with 3 flavors of returning the generated structures. Do they all perform the same actions memory-wise, or is one more efficient than the others? Do things change when overwriting existing values? For example, when one changes a pointer, does the old associated value get garbage-collected automatically?
// Returning the instance
Image new_Image(const int height, const int width, const int depth) {
    Image out;
    out.width = width;
    out.height = height;
    out.depth = depth;
    out.pixels = (float*) calloc((height*width*depth), sizeof(float));
    return out;
}
Image image = new_Image(100,100,3);

// OR return a new pointer.
Image *new_Image(const int height, const int width, const int depth) {
    Image out;
    out.width = width;
    out.height = height;
    out.depth = depth;
    out.pixels = (float*) calloc((height*width*depth), sizeof(float));
    return &out;
}
Image *image;
image = new_Image(100,100,3);

// OR init outside function and populate in function. For cleanliness though I'd like as much of the image generating part to be done in the function.
Image *new_Image(Image *out, const int height, const int width, const int depth) {
    out.width = width;
    out.height = height;
    out.depth = depth;
    out.pixels = (float*) calloc((height*width*depth), sizeof(float));
}
Image *image = (Image*) malloc(sizeof(Image));
new_Image(image, 100,100,3);
Image new_Image(const int height, const int width, const int depth)
Safe, but you return the whole structure by value, which is not very efficient: most implementations will do it through the stack, and the stack, especially on small embedded systems, is very limited in size. It is not recursion-friendly either (a lot of stack is consumed on every function call).
Image *new_Image(const int height, const int width, const int depth) {
Image out; - undefined behaviour, as you return a pointer to a local variable, which ceases to exist when you leave the function.
Image *new_Image(Image *out, const int height, const int width, const int depth) is safe if you use objects defined or allocated outside the function. BTW, you forgot to return the pointer.
The option you did not mention in your question:
Image *new_Image(const int height, const int width, const int depth) {
    Image *out = malloc(sizeof(*out));
    /* malloc result tests */
    out -> width = width;
    out -> height = height;
    out -> depth = depth;
    out -> pixels = calloc((height*width*depth), sizeof(float));
    /* calloc result tests */
    return out;
}
You do not check the results of your memory allocations. That has to be done.
This function is also wrong:
Image *new_Image(Image *out, const int height, const int width, const int depth) {
    out.width = width;
    out.height = height;
    out.depth = depth;
    out.pixels = (float*) calloc((height*width*depth), sizeof(float));
}
It should be:
Image *new_Image(Image *out, const int height, const int width, const int depth) {
    out -> width = width;
    out -> height = height;
    out -> depth = depth;
    out -> pixels = calloc((height*width*depth), sizeof(float));
    return out;
}
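As a minimal sketch of the allocation checks mentioned above (the policy of returning NULL on failure is an assumption, not the only option):
#include <stdlib.h>

typedef struct Image {
    int width, height, depth;
    float *pixels;
} Image;

Image *new_Image(const int height, const int width, const int depth) {
    Image *out = malloc(sizeof(*out));
    if (out == NULL)
        return NULL;                      /* report failure to the caller */

    out->width  = width;
    out->height = height;
    out->depth  = depth;
    out->pixels = calloc((size_t)height * width * depth, sizeof(float));
    if (out->pixels == NULL) {            /* clean up the partially built object */
        free(out);
        return NULL;
    }
    return out;
}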
You do not need to cast the results of the malloc family of functions. The cast used to be considered dangerous because, under older versions of the language standard, you would not get any warning if you forgot to include <stdlib.h>. Nowadays compilers emit warnings if you call a function without a prototype.
If you compile your code with a C++ compiler, use command-line options that tell the compiler the code is C (for example, the -x c option of gcc or g++).

OpenAcc error with copyin and copyout

General Information
NOTE: I am also fairly new to C and OpenACC.
Hi, I am trying to develop an image-blurring program, but first I wanted to see if I could parallelize the for loops and copyin/copyout my values.
The problem I am currently facing is when I try to copyin and copyout my data and output variables. The error looks to be a buffer overflow (I have also googled it, and that is what people have said), but I am not sure how I should go about fixing it. I think I am doing something wrong with the pointers, but I am not sure.
Thanks so much in advance; if you think I missed some information, please let me know and I can provide it.
Question
I would like to confirm what the error actually is.
How should I go about fixing the issue?
Is there anything I should look into so I can fix this kind of issue myself in the future?
Error
FATAL ERROR: variable in data clause is partially present on the device: name=output
file:/nfs/u50/singhn8/4F03/A3/main.c ProcessImageACC line:48
output lives at 0x7ffca75f6288 size 16 not present
Present table dump for device[1]: NVIDIA Tesla GPU 1, compute capability 3.5
host:0x7fe98eaf9010 device:0xb05dc0000 size:2073600 presentcount:1 line:47 name:(null)
host:0x7fe98f0e8010 device:0xb05bc0000 size:2073600 presentcount:1 line:47 name:(null)
host:0x7ffca75f6158 device:0xb05ac0400 size:4 presentcount:1 line:47 name:filterRad
host:0x7ffca75f615c device:0xb05ac0000 size:4 presentcount:1 line:47 name:row
host:0x7ffca75f6208 device:0xb05ac0200 size:4 presentcount:1 line:47 name:col
host:0x7ffca75f6280 device:0xb05ac0600 size:16 presentcount:1 line:48 name:data
Program Definition
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

// ================================================
// ppmFile.h
// ================================================
#include <sys/types.h>

typedef struct Image
{
    int width;
    int height;
    unsigned char *data;
} Image;

Image* ImageCreate(int width, int height);
Image* ImageRead(char *filename);
void ImageWrite(Image *image, char *filename);
int ImageWidth(Image *image);
int ImageHeight(Image *image);
void ImageClear(Image *image, unsigned char red, unsigned char green, unsigned char blue);
void ImageSetPixel(Image *image, int x, int y, int chan, unsigned char val);
unsigned char ImageGetPixel(Image *image, int x, int y, int chan);
Blur Filter Function
// ================================================
// The Blur Filter
// ================================================
void ProcessImageACC(Image **data, int filterRad, Image **output) {
    int row = (*data)->height;
    int col = (*data)->width;

    #pragma acc data copyin(row, col, filterRad, (*data)->data[0:row * col]) copyout((*output)->data[0:row * col])
    #pragma acc kernels
    {
        #pragma acc loop independent
        for (int j = 0; j < row; j++) {
            #pragma acc loop independent
            for (int i = 0; i < col; i++) {
                (*output)->data[j * row + i] = (*data)->data[j * row + i];
            }
        }
    }
}
Main Function
// ================================================
// Main Program
// ================================================
int main(int argc, char *argv[]) {
    // vars used for processing:
    Image *data, *result;
    int dataSize;
    int filterRadius = atoi(argv[1]);

    // ===read the data===
    data = ImageRead(argv[2]);

    // ===send data to nodes===
    // send data size in bytes
    dataSize = sizeof(unsigned char) * data->width * data->height * 3;

    // ===process the image===
    // allocate space to store result
    result = (Image *)malloc(sizeof(Image));
    result->data = (unsigned char *)malloc(dataSize);
    result->width = data->width;
    result->height = data->height;

    // initialize all to 0
    for (int i = 0; i < (result->width * result->height * 3); i++) {
        result->data[i] = 0;
    }

    // apply the filter
    ProcessImageACC(&data, filterRadius, &result);

    // ===save the data back===
    ImageWrite(result, argv[3]);
    return 0;
}
The problem here is that, in addition to the data arrays, the output and data pointers need to be copied over as well. From the compiler feedback messages, you can see the compiler implicitly copying them over.
% pgcc -c image.c -ta=tesla:cc70 -Minfo=accel
ProcessImageACC:
46, Generating copyout(output->->data[:col*row])
Generating copyin(data->->data[:col*row],col,filterRad,row)
47, Generating implicit copyout(output[:1])
Generating implicit copyin(data[:1])
50, Loop is parallelizable
52, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
50, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
52, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
Now you might be able to get this to work by using unstructured data regions to create both the data and the pointers, and then "attach" the pointers to the arrays (i.e. fill in the values of the device pointers with the addresses of the device data arrays).
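A rough sketch of that unstructured data region approach might look like the following (untested; it assumes OpenACC 2.6 pointer-attachment semantics, where placing both the structs and their member arrays on the device attaches the device pointers, and it keeps the indexing from the original code):
/* Hypothetical sketch only. */
Image *in  = *data;
Image *out = *output;
int n = row * col;

#pragma acc enter data copyin(in[0:1], out[0:1])
#pragma acc enter data copyin(in->data[0:n]) create(out->data[0:n])

#pragma acc kernels present(in, out)
{
    #pragma acc loop independent
    for (int j = 0; j < row; j++) {
        #pragma acc loop independent
        for (int i = 0; i < col; i++) {
            out->data[j * row + i] = in->data[j * row + i];
        }
    }
}

#pragma acc exit data copyout(out->data[0:n]) delete(in->data[0:n])
#pragma acc exit data delete(in[0:1], out[0:1])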
Though an easier option is to create temp arrays to point to the data, and then copy the data to the device. This will also increase the performance of your code (both on the GPU and CPU) since it eliminates the extra levels of indirection.
void ProcessImageACC(Image **data, int filterRad, Image **output) {
    int row = (*data)->height;
    int col = (*data)->width;
    unsigned char *ddata, *odata;

    odata = (*output)->data;
    ddata = (*data)->data;

    #pragma acc data copyin(ddata[0:row * col]) copyout(odata[0:row * col])
    #pragma acc kernels
    {
        #pragma acc loop independent
        for (int j = 0; j < row; j++) {
            #pragma acc loop independent
            for (int i = 0; i < col; i++) {
                odata[j * row + i] = ddata[j * row + i];
            }
        }
    }
}
Note that scalars are firstprivate by default so there's no need to add the row, col, and filterRad variables in the data clause.

Copy Array of pointers inside a struct using CUDA

I wish to copy an array of pointers from one struct to another. The struct looks like this:
typedef struct COORD3D
{
    int x, y, z;
} COORD3D;

typedef struct structName
{
    double *volume;
    COORD3D size;
    // .. some other vars
} structName;
I wish to do this inside a function where I pass in the address of an empty instance of the struct and the address of the struct with the data I wish to copy. Currently I do this serially via:
void foo(structName *dest, structName *source)
{
    // .. some other work
    int size = source->size.x * source->size.y * source->size.z;
    dest->volume = (double*)malloc(size*sizeof(double));

    int i;
    for (i = 0; i < size; i++)
        dest->volume[i] = source->volume[i];
}
I want to do this in CUDA to speed up the process, as the array is very large (~12 million elements).
I have tried the following; however, although the code compiles and runs, I get incorrect results stored in the array (they seem to be very large random numbers).
void foo(structName *dest, structName *source)
{
    // .. some other work
    int size = source->size.x * source->size.y * source->size.z;
    dest->volume = (double*)malloc(size*sizeof(double));

    // Device Pointers
    double *DEVICE_SOURCE, *DEVICE_DEST;

    // Declare memory on GPU
    cudaMalloc(&DEVICE_DEST,size);
    cudaMalloc(&DEVICE_SOURCE,size);

    // Copy Source to GPU
    cudaMemcpy(DEVICE_SOURCE,source->volume,size,
               cudaMemcpyHostToDevice);

    // Setup Blocks/Grids
    dim3 dimGrid(ceil(source->size.x/10.0),
                 ceil(source->size.y/10.0),
                 ceil(source->size.z/10.0));
    dim3 dimBlock(10,10,10);

    // Run CUDA Kernel
    copyVol<<<dimGrid,dimBlock>>> (DEVICE_SOURCE,
                                   DEVICE_DEST,
                                   source->size.x,
                                   source->size.y,
                                   source->size.z);

    // Copy Constructed Array back to Host
    cudaMemcpy(dest->volume,DEVICE_DEST,size,
               cudaMemcpyDeviceToHost);
}
The Kernel looks like this:
__global__ void copyVol(double *source, double *dest,
                        int x, int y, int z)
{
    int posX = blockIdx.x * blockDim.x + threadIdx.x;
    int posY = blockIdx.y * blockDim.y + threadIdx.y;
    int posZ = blockIdx.z * blockDim.z + threadIdx.z;

    if (posX < x && posY < y && posZ < z)
    {
        dest[posX+(posY*x)+(posZ*y*x)] =
            source[posX+(posY*x)+(posZ*y*x)];
    }
}
Can anyone tell me where I am going wrong?
I am risking a wrong answer, but have you left out the size of the data type?
cudaMalloc(&DEVICE_DEST,size);
should be
cudaMalloc(&DEVICE_DEST,size*sizeof(double));
Also
cudaMemcpy(DEVICE_SOURCE,source->volume,size, cudaMemcpyHostToDevice);
should be
cudaMemcpy(DEVICE_SOURCE,source->volume,size*sizeof(double), cudaMemcpyHostToDevice);
and so on.
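Put differently, every byte count handed to the CUDA API should include the element size. A minimal sketch of the corrected calls (error checking omitted, names as in the question):
size_t bytes = (size_t)size * sizeof(double);   /* total size in bytes, not elements */

cudaMalloc(&DEVICE_DEST,   bytes);
cudaMalloc(&DEVICE_SOURCE, bytes);

cudaMemcpy(DEVICE_SOURCE, source->volume, bytes, cudaMemcpyHostToDevice);

copyVol<<<dimGrid, dimBlock>>>(DEVICE_SOURCE, DEVICE_DEST,
                               source->size.x, source->size.y, source->size.z);

cudaMemcpy(dest->volume, DEVICE_DEST, bytes, cudaMemcpyDeviceToHost);

cudaFree(DEVICE_SOURCE);
cudaFree(DEVICE_DEST);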

Passing back arrays through mex

I've been at this for a couple of days now, have tried every variation I can think of, and looked at countless examples. I just can't get it working.
I'm trying to make a mexFunction to call from MATLAB. This mexFunction calls into another C function I have, let's call it retrieveValues, which returns an array and the length of that array. I need to return both of those back to the MATLAB function, which, as I understand it, means I need to put them in the plhs array.
I call my mexFunction from matlab like this:
[foofooArray, foofooCount] = getFoo();
Which, to my understanding, means that nlhs = 2, plhs is an array of length 2, nrhs = 0, and prhs is just a pointer.
Here's my code for the mexFunction:
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray* prhs[])
{
    foo* fooArray
    int fooCount

    plhs = mxCreateNumericMatrix(1, 2, mxUINT64_CLASS, mxREAL);
    //feels like I shouldn't need this

    retrieveValues(&fooArray, &fooCount);

    plhs[0] = fooArray;
    plhs[1] = fooCount;
}
Running the MATLAB program gets me "One or more output arguments not assigned during call".
I've tested and confirmed that the values are being returned from retrieveValues correctly.
You are correct that the plhs = mxCreateNumericMatrix(...) is not needed. Also, note that nlhs is the number of left-hand-sides you supply in MATLAB - so in your case, you're calling it with 2 left-hand-sides. Here's how to return trivial scalar values:
plhs[0] = mxCreateDoubleScalar(2);
plhs[1] = mxCreateDoubleScalar(3);
To handle your actual return values, you'll need to do something to copy the values out of foo and into a newly-created mxArray. For example, if your function returned doubles, you might do this:
double * values;
int numValues;
myFcn(&values, &numValues);
/* Build a 1 x numValues real double matrix for return to MATLAB */
plhs[0] = mxCreateDoubleMatrix(1, numValues, mxREAL);
/* Copy from 'values' into the data part of plhs[0] */
memcpy(mxGetPr(plhs[0]), values, numValues * sizeof(double));
EDIT Of course someone somewhere needs to de-allocate values in both my example and yours.
EDIT 2 Complete executable example code:
#include <string.h>
#include <stdlib.h>
#include "mex.h"

void doStuff(double ** data, int * numData) {
    *numData = 7;
    *data = (double *) malloc(*numData * sizeof(double));
    for (int idx = 0; idx < *numData; ++idx) {
        (*data)[idx] = idx;
    }
}

void mexFunction( int nlhs, mxArray * plhs[],
                  int nrhs, const mxArray * prhs[] ) {
    double * data;
    int numData;
    doStuff(&data, &numData);

    plhs[0] = mxCreateDoubleMatrix(1, numData, mxREAL);
    memcpy(mxGetPr(plhs[0]), data, numData * sizeof(double));
    free(data);

    plhs[1] = mxCreateDoubleScalar(numData);
}
Here is an example:
testarr.cpp
#include "mex.h"
#include <stdlib.h>
#include <string.h>
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray* prhs[])
{
// validate number of arguments
if (nrhs != 0 || nlhs > 2) {
mexErrMsgTxt("Wrong number of arguments");
}
// create C-array (or you can recieve this array from a function)
int len = 5;
double *arr = (double*) malloc(len*sizeof(double));
for(int i=0; i<len; i++) {
arr[i] = 10.0 * i;
}
// return outputs from MEX-function
plhs[0] = mxCreateDoubleMatrix(1, len, mxREAL);
memcpy(mxGetPr(plhs[0]), arr, len*sizeof(double));
if (nlhs > 1) {
plhs[1] = mxCreateDoubleScalar(len);
}
// dellocate heap space
free(arr);
}
MATLAB:
>> mex -largeArrayDims testarr.cpp
>> [a,n] = testarr
a =
0 10 20 30 40
n =
5

Sending 2D array to Cuda Kernel

I'm having a bit of trouble understanding how to send a 2D array to a CUDA kernel. I have a program that parses a large file with 30 data points on each line. I read about 10 rows at a time and then create a matrix for each line and its items (so in my example of 10 rows with 30 data points, it would be int list[10][30]). My goal is to send this array to my kernel and have each block process a row (I have gotten this to work perfectly in normal C, but CUDA has been a bit more challenging).
Here's what I'm doing so far, but with no luck (note: sizeOfBuckets = rows and sizeOfBucketsHoldings = items in a row... I know I should win an award for odd variable names):
int list[sizeOfBuckets][sizeOfBucketsHoldings]; //this is created at the start of the file and I can confirm it's filled with the correct data
#define sizeOfBuckets 10 //size of buckets before sending to process list
#define sizeOfBucketsHoldings 30

//Cuda part
//define device variables
int *dev_current_list[sizeOfBuckets][sizeOfBucketsHoldings];

//time to malloc the 2D array on device
size_t pitch;
cudaMallocPitch((int**)&dev_current_list, (size_t *)&pitch, sizeOfBucketsHoldings * sizeof(int), sizeOfBuckets);

//copy data from host to device
cudaMemcpy2D( dev_current_list, pitch, list, sizeOfBuckets * sizeof(int), sizeOfBuckets * sizeof(int), sizeOfBucketsHoldings * sizeof(int), cudaMemcpyHostToDevice );

process_list<<<count,1>>> (sizeOfBuckets, sizeOfBucketsHoldings, dev_current_list, pitch);

//free memory of device
cudaFree( dev_current_list );

__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, int pitch) {
    int tid = blockIdx.x;
    for (int r = 0; r < sizeOfBuckets; ++r) {
        int* row = (int*)((char*)current_list + r * pitch);
        for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
            int element = row[c];
        }
    }
}
The error I'm getting is:
main.cu(266): error: argument of type "int *(*)[30]" is incompatible with parameter of type "int *"
1 error detected in the compilation of "/tmp/tmpxft_00003f32_00000000-4_main.cpp1.ii".
Line 266 is the kernel call process_list<<<count,1>>> (count, countListItem, dev_current_list, pitch);. I think the problem is that I am trying to create my array in my function as int *, but how else can I create it? In my pure C code I use int current_list[num_of_rows][num_items_in_row], which works, but I can't get the same outcome to work in CUDA.
My end goal is simple: I just want each block to process one row (sizeOfBuckets) and then loop through all items in that row (sizeOfBucketsHoldings). I originally just did a normal cudaMalloc and cudaMemcpy, but it wasn't working, so I looked around and found out about cudaMallocPitch and cudaMemcpy2D (neither of which is in my CUDA by Example book), and I have been trying to study examples, but they seem to give me the same error (I'm currently reading the CUDA C Programming Guide; I found this idea on page 22, but still no luck). Any ideas? Or suggestions of where to look?
Edit:
To test this, I just want to add the values of each row together (I copied the logic from the CUDA by Example array-addition example).
My kernel:
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, size_t pitch, int *total) {
    //TODO: we need to flip the list as well
    int tid = blockIdx.x;
    for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
        total[tid] = total + current_list[tid][c];
    }
}
Here's how I declare the total array in my main:
int *dev_total;
cudaMalloc( (void**)&dev_total, sizeOfBuckets * sizeof(int) );
You have some mistakes in your code.
When you copy the host array to the device, you should pass a one-dimensional host pointer. See the function signature.
You don't need to allocate a static 2D array for the device memory. That creates a static array in host memory, which you then recreate as a device array. Keep in mind that it must be a one-dimensional array, too. See this function signature.
This example should help you with memory allocation:
__global__ void process_list(int sizeOfBucketsHoldings, int* total, int* current_list, int pitch)
{
    int tid = blockIdx.x;
    total[tid] = 0;
    for (int c = 0; c < sizeOfBucketsHoldings; ++c)
    {
        total[tid] += *((int*)((char*)current_list + tid * pitch) + c);
    }
}

int main()
{
    size_t sizeOfBuckets = 10;
    size_t sizeOfBucketsHoldings = 30;
    size_t width = sizeOfBucketsHoldings * sizeof(int); // needs to be in bytes
    size_t height = sizeOfBuckets;

    int* list = new int [sizeOfBuckets * sizeOfBucketsHoldings]; // one dimensional
    for (int i = 0; i < sizeOfBuckets; i++)
        for (int j = 0; j < sizeOfBucketsHoldings; j++)
            list[i * sizeOfBucketsHoldings + j] = i;

    size_t pitch_h = sizeOfBucketsHoldings * sizeof(int); // always in bytes

    int* dev_current_list;
    size_t pitch_d;
    cudaMallocPitch((int**)&dev_current_list, &pitch_d, width, height);

    int *test;
    cudaMalloc((void**)&test, sizeOfBuckets * sizeof(int));
    int* h_test = new int[sizeOfBuckets];

    cudaMemcpy2D(dev_current_list, pitch_d, list, pitch_h, width, height, cudaMemcpyHostToDevice);

    process_list<<<10, 1>>>(sizeOfBucketsHoldings, test, dev_current_list, pitch_d);
    cudaDeviceSynchronize();

    cudaMemcpy(h_test, test, sizeOfBuckets * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < sizeOfBuckets; i++)
        printf("%d %d\n", i, h_test[i]);

    return 0;
}
To access your 2D array in the kernel you should use the pattern base_addr + y * pitch_d + x.
WARNING: the pitch is always in bytes. You need to cast your pointer to a byte-sized type (char*) before applying it.
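For example, a small device-side helper capturing that pattern might look like this (a sketch; the int element type matches the question):
/* Return a pointer to row y of a pitched 2D allocation.
   The pitch is in bytes, so the offset is computed on a char* first. */
__device__ int* pitched_row(int* base, size_t pitch, int y)
{
    return (int*)((char*)base + (size_t)y * pitch);
}

/* Element (x, y) is then pitched_row(base, pitch, y)[x]. */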
