CUDA Pointer Dereferencing Issue - C

I am developing a program using the CUDA SDK and an NVIDIA 9600 card with 1 GB of memory. In this program:
0) A kernel receives a pointer to a 2D int array of size 3000x6 in its input arguments.
1) The kernel has to sort it on up to 3 levels (1st, 2nd & 3rd column).
2) For this purpose, the kernel declares an array of int pointers of size 3000.
3) The kernel then populates the pointer array with pointers to the locations of the input array in sorted order.
4) Finally, the kernel copies the input array into an output array by dereferencing the pointer array.
This last step fails and halts the PC.
Q1) What are the guidelines for dereferencing pointers in CUDA to fetch the contents of memory? Even the smallest array, 20x2, does not work correctly, while the same code works outside CUDA device memory (i.e. in a standard C program).
Q2) Isn't it supposed to work the same way as in standard C with the '*' operator, or is there some CUDA API to be used for it?

I just started looking into CUDA, but I literally just read this out of a book, and it sounds like it directly applies to you:
"You can pass pointers allocated with cudaMalloc() to functions that execute on the device. (kernels, right?)
You can use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device. (kernels again)
You can pass pointers allocated with cudaMalloc() to functions that execute on the host. (regular C code)
You CANNOT use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host."
^^ from "CUDA by Example" by Jason Sanders and Edward Kandrot, published by Addison-Wesley, yadda yadda no plagiarism here.
Since you are dereferencing inside the kernel, maybe the converse of the last rule is also true: you cannot use pointers allocated by the host (with malloc()) to read or write memory from code that executes on the device.
Edit: I also just noticed a function called cudaMemcpy.
Looks like you would need to declare the 3000-element int array twice in host code: once by calling malloc, once by calling cudaMalloc. Pass the CUDA one to the kernel, as well as the input array to be sorted. Then after calling the kernel function:
cudaMemcpy(malloced_array, cudaMallocedArray, 3000*sizeof(int), cudaMemcpyDeviceToHost);
I literally just started looking into this like I said, though, so maybe there's a better solution.
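Putting those pieces together, I think a minimal round trip would look something like this (the negate kernel and the sizes are placeholders I made up, not your actual sort):

#include <stdio.h>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the sort: each thread negates one
// element so the round trip is observable.
__global__ void negate(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = -data[i];
}

int main(void)
{
    const int n = 3000 * 6;
    int *host_array = (int *)malloc(n * sizeof(int));   // host copy
    for (int i = 0; i < n; ++i)
        host_array[i] = i;

    int *device_array;
    cudaMalloc(&device_array, n * sizeof(int));         // device copy

    // Host -> device, run the kernel on device memory, then device -> host.
    cudaMemcpy(device_array, host_array, n * sizeof(int), cudaMemcpyHostToDevice);
    negate<<<(n + 255) / 256, 256>>>(device_array, n);
    cudaMemcpy(host_array, device_array, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("host_array[1] = %d\n", host_array[1]);      // prints -1
    cudaFree(device_array);
    free(host_array);
    return 0;
}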

CUDA code can use pointers in exactly the same manner as host code (e.g. dereference with * or [], normal pointer arithmetic and so on). However, it is important to remember that the location being accessed (i.e. the location to which the pointer points) must be visible to the GPU.
If you allocate host memory, using malloc() or std::vector for example, then that memory will not be visible to the GPU: it is host memory, not device memory. To allocate device memory you should use cudaMalloc(); pointers to memory allocated with cudaMalloc() can be freely accessed from the device, but not from the host.
To copy data between the two, use cudaMemcpy().
When you get more advanced, the lines can be blurred a little: using "mapped memory" it is possible to allow the GPU to access parts of host memory, but this must be handled in a particular way; see the CUDA Programming Guide for more information.
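As a rough sketch, the mapped-memory route looks something like this (assuming a device that supports host mapping; the cudaSetDeviceFlags call is needed on older toolkits before any other CUDA operation):

#include <cuda_runtime.h>

int main(void)
{
    // Allow the runtime to map page-locked host allocations into the
    // GPU's address space (must precede other CUDA calls on old toolkits).
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int *host_ptr, *device_ptr;
    // Page-locked host memory, flagged as mappable by the GPU.
    cudaHostAlloc((void **)&host_ptr, 20 * 2 * sizeof(int), cudaHostAllocMapped);
    // The device-side alias for the same allocation, usable inside kernels.
    cudaHostGetDevicePointer((void **)&device_ptr, host_ptr, 0);

    host_ptr[0] = 42;  // written by the host; a kernel given device_ptr
                       // would see the same bytes, with no cudaMemcpy

    cudaFreeHost(host_ptr);
    return 0;
}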
I'd strongly suggest you look at the CUDA SDK samples to see how all this works. Start with the vectorAdd sample perhaps, and any that are specific to your domain of expertise. Matrix multiplication and transpose are probably easy to digest too.
All the documentation, the toolkit and the code samples (SDK) are available on the CUDA developer web site.

Related

Pointer Array shared between OpenCL kernels

Is it possible to share an array of pointers between multiple kernels in OpenCL? If so, how would I go about implementing it? If I am not completely mistaken - which may well be the case - the only way of sharing things between kernels would be a shared cl_mem; however, I also think these cannot contain pointers.
This is not possible in OpenCL 1.x because host and device have completely separate memory spaces, so a buffer containing host pointers makes no sense on the device side.
However, OpenCL 2.0 supports Shared Virtual Memory (SVM) and so memory containing pointers is legal because the host and device share an address space. There are three different levels of granularity though, which will limit what you can have those pointers point to. In the coarsest case they can only refer to locations within the same buffer or other SVM buffers currently owned by the device. Yes, cl_mem is still the way to pass in a buffer to a kernel, but in OpenCL 2.0 with SVM that buffer may contain pointers.
Edit/Addition: OP points out they just want to share pointers between kernels. If these are just device pointers, then you can store them in the buffer in one kernel and read them from the buffer in another kernel. They can only refer to __global, not __local memory. And without SVM they can't be used on the host. The host will of course need to allocate the buffer and pass it to both kernels for their use. As far as the host is concerned, it's just opaque memory. Only the kernels know they are __global pointers.
I ran into a similar problem, but I managed to get around it by using a simple pointer structure. I have doubts about the claim that buffers change their position in memory; perhaps that is true in some special cases, but it definitely cannot happen while the kernel is working with the buffer. I have not tested it on different video cards, but on NVIDIA (OpenCL 1.2) it works perfectly, so I can access data from an array that was not even passed as an argument to the kernel:
// point_dataT is the payload struct living in a different buffer; a minimal
// definition is given here so the snippet compiles on its own.
typedef struct { uint id; } point_dataT;
typedef struct
{
    __global volatile point_dataT *point; // pointer to a struct in a different buffer
} pointerBufT;
__kernel void tester(__global pointerBufT *pointer_buf)
{
    size_t x = get_global_id(0), y = get_global_id(1);
    // Retrieving information from an array not passed to the kernel.
    printf("Test id: %u\n", pointer_buf[x + y * get_global_size(0)].point->id);
}
I know that this is a late reply, but for some reason I have only come across negative answers to similar questions, or suggestions to use indexes instead of pointers, while a structure with a pointer inside works great.

CUDA device pointers

Quick question about the standard CUDA memory allocation model:
double* x_device;
cudaMalloc(&x_device, myArraySize);
The variable x_device is a pointer-to-double. After I call cudaMalloc, does x_device now point to a memory location on the cuda device? So, in other words, *x_device would result in a segfault because we can't directly access device memory from the host.
Incidental question: the compiler doesn't complain when I don't use (void**)&x_device. Is this required? I sometimes see it in examples, sometimes not.
Thanks!
You are right: cudaMalloc allocates memory on the device. You can't use this pointer directly on the host, only as an argument to functions like cudaMemcpy and to kernel calls.
More recent CUDA versions support unified memory addressing: there you can use cudaMallocManaged to allocate memory that is directly accessible from both the host and the device through the same pointer.
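A minimal sketch of the managed-memory variant (the triple kernel is just for illustration):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void triple(double *x) { *x *= 3.0; }

int main(void)
{
    double *x;
    cudaMallocManaged(&x, sizeof(double)); // one pointer, valid on host and device
    *x = 14.0;                             // host write, no cudaMemcpy needed
    triple<<<1, 1>>>(x);
    cudaDeviceSynchronize();               // wait before the host touches x again
    printf("%f\n", *x);                    // prints 42.000000
    cudaFree(x);
    return 0;
}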
For the second question: C++ doesn't allow implicit conversions between pointer types such as double** and void**, so against the plain C API signature cudaMalloc(void**, size_t) the explicit cast (void**)&x_device is required. The reason your compiler doesn't complain without it is that the CUDA runtime headers also provide a templated C++ overload of cudaMalloc that accepts any T** directly.
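In code, using the declarations from the question:

double *x_device;
cudaMalloc((void **)&x_device, myArraySize); // C-style: matches cudaMalloc(void**, size_t)
cudaMalloc(&x_device, myArraySize);          // C++: resolved by the templated overload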

malloc in an embedded system without an operating system

This query is regarding allocation of memory using malloc.
Generally we say that malloc allocates memory from the heap.
Now say I have a plain embedded system (no operating system), with a normal program loaded in which I call malloc.
In this case where is the memory allocated from ?
malloc() is a function that is usually implemented by the runtime library. You are right: if you are running on top of an operating system, then malloc will sometimes (but not every time) trigger a system call that makes the OS map some memory into your program's address space.
If your program runs without an operating system, then you can think of your program as being the operating system. You have access to all addresses, meaning you can simply assign an address to a pointer and then dereference that pointer to read or write.
Of course you have to make sure that other parts of your program don't use the same memory, so you write your own memory manager:
To put it simply, you set aside a range of addresses which your "memory manager" uses to record which address ranges are already in use (the data structure stored there can be as simple as a linked list or much more complex). You then write a function, called e.g. malloc(), which forms the functional part of your memory manager: it looks into that data structure for a free range of addresses at least as long as the argument specifies and returns a pointer to it.
Now, if every function in your program calls your malloc() instead of writing to arbitrary addresses, you've done the first step. You can also write a free() function which looks up the pointer it is given in the same data structure and marks that range as unused (in the naive linked list it would merge two links).
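A deliberately naive sketch of such a memory manager (first-fit over a static pool; my_malloc/my_free are made-up names, and alignment, thread safety, and exact-fit handling are ignored):

#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 4096

typedef struct block {
    size_t size;          /* payload size in bytes         */
    int    used;          /* 1 if allocated, 0 if free     */
    struct block *next;   /* next block in the linked list */
} block_t;

static uint8_t pool[POOL_SIZE];   /* the set-aside address range */
static block_t *head = NULL;

void *my_malloc(size_t size)
{
    if (head == NULL) {           /* lazy init: one big free block */
        head = (block_t *)pool;
        head->size = POOL_SIZE - sizeof(block_t);
        head->used = 0;
        head->next = NULL;
    }
    for (block_t *b = head; b != NULL; b = b->next) {
        if (!b->used && b->size >= size + sizeof(block_t)) {
            /* split: carve the request off the front of this free block */
            block_t *rest = (block_t *)((uint8_t *)(b + 1) + size);
            rest->size = b->size - size - sizeof(block_t);
            rest->used = 0;
            rest->next = b->next;
            b->size = size;
            b->used = 1;
            b->next = rest;
            return b + 1;         /* payload starts right after the header */
        }
    }
    return NULL;                  /* out of memory */
}

void my_free(void *p)
{
    if (p == NULL)
        return;
    block_t *b = (block_t *)p - 1;
    b->used = 0;
    if (b->next != NULL && !b->next->used) {  /* merge with next free block */
        b->size += sizeof(block_t) + b->next->size;
        b->next = b->next->next;
    }
}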
The only real answer is "Wherever your compiler/library-implementation puts it".
In the embedded system I use, there is no heap, since we haven't written one.
From the heap, as you say. The difference is that the heap is not provided by the OS. Your application's linker script will no doubt include an allocation for the heap; the run-time library will manage this.
In the case of the Newlib C library, often used in GCC-based embedded systems not running an OS (or at least not running Linux), the library has a stub syscall function called sbrk(). It is the responsibility of the developer to implement sbrk(), which must provide more memory to the heap manager on request. Typically it merely increments a pointer and returns a pointer to the start of the new block; thereafter the library's heap manager manages and maintains the new block, which may or may not be contiguous with previous blocks. The previous link includes an example implementation.
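The classic shape of such a stub looks roughly like this (the _end and _heap_limit symbols are assumed to come from your linker script; check your toolchain):

#include <errno.h>
#include <stddef.h>

extern char _end;            /* end of .bss, defined by the linker script */
extern char _heap_limit;     /* hypothetical top-of-heap symbol           */

static char *heap_end = NULL;

void *_sbrk(ptrdiff_t incr)
{
    if (heap_end == NULL)
        heap_end = &_end;    /* heap starts right after the program image */

    if (heap_end + incr > &_heap_limit) {
        errno = ENOMEM;      /* refuse to grow past the reserved region   */
        return (void *)-1;
    }

    char *prev = heap_end;   /* hand the caller the block just grown      */
    heap_end += incr;
    return prev;
}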

What is the most memory efficient way to return large arrays in a C structure from a Matlab C mex function?

I have a program which creates a C structure containing large arrays of various basic data types (ints, doubles, etc.). What is the most memory-efficient way for me to return this data to MATLAB from a C MEX function, while also ensuring that all of the memory deallocation is carefully taken care of? I would ideally like to return the whole structure, but methods for returning each array individually are also acceptable.
You may also assume I understand the basics of writing MEX functions and returning arguments by copying the data to an array pointed to by the plhs pointer. As I understand it, this creates a duplicate of the memory, i.e. requires double the memory; correct me if this is incorrect.
My question has now been answered on another forum here. Below is the answer given:
"You cannot mix native C/C++ memory (i.e., local stack variables or allocated variables with malloc & cousins) into an mxArray for returning to the MATLAB workspace. That will eventually lead to crashing MATLAB when it tries to free this memory. So you are stuck with duplicating this memory. As I see it your options are:
1) Rewrite your code to create your C/C++ structure using MATLAB API functions mxMalloc & cousins instead of native C/C++ functions malloc & friends. Then this memory could be directly attached to an mxArray struct for returning to the MATLAB workspace ... no duplication or deallocation would be required.
2) Create your MATLAB struct piecemeal with mxMalloc & cousins as you deallocate the C/C++ memory piecemeal. This would still require you to duplicate the largest block temporarily, but saves you from duplicating everything in memory at the same time.
3) Ignore what I said about mixing native C/C++ memory and MATLAB API memory. Play games with hacking into the mxArray to mix them, keep shared data copies of them inside the mex routine to prevent MATLAB from attempting to free the memory. This is very tricky and is not recommended since you can easily leak memory and/or crash MATLAB if you don't manage everything correctly.
It doesn't save you any significant amount of memory returning several individual variables to MATLAB vs returning a struct or cell array, so just return whatever is easier to create and manage based on your intended use." -James Tursa
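As a sketch of option 1 from the quoted answer, using the classic separate-complex mx API (the fill loop stands in for the real computation):

#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    const mwSize n = 3000;

    /* Allocate with mxMalloc so MATLAB's memory manager owns the block. */
    double *data = (double *)mxMalloc(n * sizeof(double));
    for (mwSize i = 0; i < n; ++i)
        data[i] = (double)i;            /* stand-in for the real computation */

    /* Create an empty double array and attach the buffer: no copy is made,
       and MATLAB takes over deallocation when the variable is cleared. */
    plhs[0] = mxCreateNumericMatrix(0, 0, mxDOUBLE_CLASS, mxREAL);
    mxSetData(plhs[0], data);
    mxSetM(plhs[0], n);
    mxSetN(plhs[0], 1);
}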

Array of pointer in CUDA

Is it possible to pass an array of pointers to a CUDA kernel?
I am looking for something like this:
__global__ void Kernel(int **arr)
{
    int *temp = arr[blockDim.x];
    temp[blockIdx.x] = blockIdx.x;
}
How can I allocate CUDA memory for such a structure?
Memory allocation for such an array is not a problem; you do it with cudaMalloc(sizeof(int*)*SIZE). Writing correct values into it is the trickier part. The only way to change values in device memory from a host function is to copy them from host memory (cudaMemcpy() or cudaMemcpyToSymbol()), so to write device pointers into device memory you need those device pointers in host memory first. That is exactly what cudaMalloc() gives you: the pointer it stores in a host variable is a device address, so you can collect one such pointer per row in a host array and cudaMemcpy that table into the device-side int** array (see the sketch below). It works, but it costs one allocation per row plus an extra copy of the pointer table, which makes arrays of pointers inconvenient.
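A sketch of that setup (sizes arbitrary, error checking omitted):

#include <cuda_runtime.h>

#define ROWS 4
#define COLS 8

__global__ void Kernel(int **arr)
{
    // Each block writes into its own row through the pointer table.
    arr[blockIdx.x][threadIdx.x] = threadIdx.x;
}

int main(void)
{
    int *host_table[ROWS];   // host array holding device pointers
    int **device_table;      // device copy of that array

    // One device allocation per row; the returned addresses are device
    // addresses even though the variables holding them live on the host.
    for (int r = 0; r < ROWS; ++r)
        cudaMalloc(&host_table[r], COLS * sizeof(int));

    // Copy the pointer table itself into device memory.
    cudaMalloc(&device_table, ROWS * sizeof(int *));
    cudaMemcpy(device_table, host_table, ROWS * sizeof(int *),
               cudaMemcpyHostToDevice);

    Kernel<<<ROWS, COLS>>>(device_table);
    cudaDeviceSynchronize();

    for (int r = 0; r < ROWS; ++r)
        cudaFree(host_table[r]);
    cudaFree(device_table);
    return 0;
}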
I suggest using indexes instead of pointers; it is much better. Basically, if your array of indexes contains {4,3,0,1,2}, it means that the first element refers to index 4, the second to index 3, and so on. If you want to refer to multiple arrays, pick an indexing rule, fill the index array according to it, and use the same rule when accessing memory from the kernel.
I'm doing some image processing work in CUDA currently, and I recommend that you just allocate a linear memory buffer and use an indexing scheme rather than dealing with arrays of pointers. It's way, way simpler in my experience. My 2c.
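For example, with a single linear buffer indexed by row and column (dimensions arbitrary):

#include <cuda_runtime.h>

#define ROWS 4
#define COLS 8

__global__ void Kernel(int *arr, int cols)
{
    // Row/column arithmetic replaces the pointer table entirely.
    arr[blockIdx.x * cols + threadIdx.x] = threadIdx.x;
}

int main(void)
{
    int *buf;
    cudaMalloc(&buf, ROWS * COLS * sizeof(int)); // one allocation covers all rows
    Kernel<<<ROWS, COLS>>>(buf, COLS);
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}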
