There are several strings like
std::string first, second, third; ...
My plan was to collect their addresses into a char* array:
char *addresses[] = {&first[0], &second[0], &third[0]}; ...
and pass the char **addresses to the OpenCL kernel.
There are several problems and questions:
The main issue is that I cannot pass an array of pointers.
Is there any good way to use many, many strings from the kernel code without copying them, but leave them in shared memory?
I'm using NVIDIA on Windows, so I can only use OpenCL 1.2.
I cannot concatenate the strings because they come from different structures...
EDIT:
According to the first answer, if I have this (example):
char *p;
cl_mem cmHostString = clCreateBuffer(myDev.getcxGPUContext(), CL_MEM_ALLOC_HOST_PTR, BUFFER_SIZE, NULL, &oclErr);
oclErr = clEnqueueWriteBuffer(myDev.getCqCommandQueue(), cmHostString, CL_TRUE, 0, BUFFER_SIZE, p, 0, NULL, NULL);
Do I need to copy each element of my char array from host memory to another part of host memory (where the new address is hidden from the host)? That does not seem logical to me. Why can't I use the same address? I could directly access the host memory from the GPU device and use it.
Is there any good way to use many, many strings from the kernel code without copying them, but leave them in shared memory?
Not in OpenCL 1.2. The Shared Virtual Memory concept has been available since OpenCL 2.0, which isn't supported by NVIDIA yet. You will need to either switch to a GPU that supports OpenCL 2.0, or, for OpenCL 1.2, copy your strings into a contiguous array of characters and pass that (a copy) to the kernel.
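The "contiguous array of characters" approach can be sketched on the host side like this; the struct and function names are illustrative, not from the original post. The character buffer and the offset table can then each be passed to the kernel as an ordinary cl_mem buffer:

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Pack several std::strings into one contiguous, NUL-separated character
// buffer plus an offset table indexing the start of each string.
struct PackedStrings {
    std::vector<char> data;       // all characters, back to back
    std::vector<size_t> offsets;  // start of each string inside `data`
};

PackedStrings pack(const std::vector<std::string>& strings) {
    PackedStrings p;
    for (const auto& s : strings) {
        p.offsets.push_back(p.data.size());
        p.data.insert(p.data.end(), s.begin(), s.end());
        p.data.push_back('\0');  // keep strings NUL-terminated for the kernel
    }
    return p;
}
```

Inside the kernel, string i then starts at `data + offsets[i]`, so no pointers ever cross the host/device boundary.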
EDIT: Responding to your edit - you can use:
the CL_MEM_ALLOC_HOST_PTR flag to create an empty buffer of the required size, then map that buffer using clEnqueueMapBuffer, fill it through the pointer returned by the mapping, and finally unmap it using clEnqueueUnmapMemObject;
the CL_MEM_USE_HOST_PTR flag to create a buffer of the required size, passing it the pointer to your array of characters.
In my experience a buffer created with the CL_MEM_USE_HOST_PTR flag is usually slightly faster; I think whether the data is really copied under the hood depends on the implementation. But to use it you need to have your array of characters already prepared on the host.
You basically need to benchmark and see what is faster. Also, don't concentrate too much on data copying: transfers usually run at GB/s rates, which is small compared to how long the kernel takes to run (depending, of course, on what's in the kernel).
Related
Is it possible to share an array of pointers between multiple kernels in OpenCL? If so, how would I go about implementing it? If I am not completely mistaken (which may well be the case), the only way of sharing things between kernels would be a shared cl_mem; however, I also think these cannot contain pointers.
This is not possible in OpenCL 1.x because host and device have completely separate memory spaces, so a buffer containing host pointers makes no sense on the device side.
However, OpenCL 2.0 supports Shared Virtual Memory (SVM) and so memory containing pointers is legal because the host and device share an address space. There are three different levels of granularity though, which will limit what you can have those pointers point to. In the coarsest case they can only refer to locations within the same buffer or other SVM buffers currently owned by the device. Yes, cl_mem is still the way to pass in a buffer to a kernel, but in OpenCL 2.0 with SVM that buffer may contain pointers.
Edit/Addition: OP points out they just want to share pointers between kernels. If these are just device pointers, then you can store them in the buffer in one kernel and read them from the buffer in another kernel. They can only refer to __global, not __local memory. And without SVM they can't be used on the host. The host will of course need to allocate the buffer and pass it to both kernels for their use. As far as the host is concerned, it's just opaque memory. Only the kernels know they are __global pointers.
I ran into a similar problem, but I managed to get around it by using a simple pointer structure. I have doubts about the claim that buffers change their position in memory; perhaps that is true in some special cases, but it definitely cannot happen while a kernel is working with the buffer. I have not tested this on different video cards, but on NVIDIA (CL 1.2) it works perfectly, so I can access data from an array that was not even passed as an argument to the kernel.
typedef struct
{
    __global volatile point_dataT* point; // pointer to a struct in a different buffer
} pointerBufT;

__kernel void tester(__global pointerBufT* pointer_buf)
{
    size_t gid = get_global_id(0);
    // Retrieving information from an array not passed to the kernel:
    printf("Test id: %u\n", pointer_buf[gid].point->id);
}
I know that this is a late reply, but for some reason I have only come across negative answers to similar questions, or suggestions to use indexes instead of pointers, while a structure with a pointer inside works great.
After investigating the reason why my program was crashing, I found that I was hitting the maximum buffer size, which is 512 MB for me (CL_DEVICE_MAX_MEM_ALLOC_SIZE).
In my case, here are the parameters.
P = 146 (interpolation factor)
num_items = 918144 (number of samples)
sizeof(float) -> 4
So my clCreateBuffer looks something like this:
output = clCreateBuffer(
context,
CL_MEM_READ_ONLY,
num_items * P * sizeof(float),
NULL,
&status);
When the above values are multiplied together and divided by (1024*1024), you get around 511 MB, which is under the threshold. Increase any of the parameters and it crashes, because the allocation will exceed that 512 MB limit.
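The arithmetic from the question can be checked at compile time; in this sketch the 512 MB ceiling stands in for the poster's CL_DEVICE_MAX_MEM_ALLOC_SIZE, and the parameter names mirror the post:

```cpp
#include <cassert>
#include <cstddef>

// Parameters from the question.
constexpr size_t P = 146;             // interpolation factor
constexpr size_t num_items = 918144;  // number of samples
constexpr size_t buffer_bytes = num_items * P * sizeof(float);
constexpr size_t max_alloc = 512ull * 1024 * 1024;  // 512 MB limit

// ~511.4 MB: just barely under the device's maximum allocation.
static_assert(buffer_bytes < max_alloc, "fits under the 512 MB limit");
// One more interpolation step and the single allocation no longer fits.
static_assert(num_items * (P + 1) * sizeof(float) > max_alloc,
              "P + 1 exceeds the limit");
```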
My question is: how can I implement the code so that I can do my calculations in block-sized pieces instead of storing everything in memory and passing that massive chunk of data to the kernel? In reality, the number of samples could easily grow past 5 million, and I definitely will not have enough memory to store all those values.
I'm just not sure how to pass small sets of values into my kernel, as the values go through three steps before producing an output.
First is an interpolation kernel, then the values go to a lowpass filter kernel and then to a kernel that does decimation. After that the values are written to an output array. If further details of the program are needed for the sake of the problem I can add more.
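One way to structure the three-stage pipeline above is to drive it block by block from the host. A minimal sketch of the control flow, with trivial placeholder transforms standing in for the real interpolation/lowpass/decimation kernels (in the real program each stage would be a clEnqueueNDRangeKernel call on a block-sized cl_mem buffer):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Placeholder "kernels" so the chunking logic is self-contained;
// the real stages would run on the device.
static void interpolate(std::vector<float>& block) { for (auto& x : block) x *= 2.0f; }
static void lowpass(std::vector<float>& block)     { for (auto& x : block) x += 1.0f; }
static void decimate(std::vector<float>& block)    { /* keeps every sample here */ }

// Process the input in fixed-size blocks so no single device buffer
// ever has to hold the whole data set.
std::vector<float> process_in_blocks(const std::vector<float>& input,
                                     size_t block_size) {
    std::vector<float> output;
    for (size_t start = 0; start < input.size(); start += block_size) {
        size_t end = std::min(start + block_size, input.size());
        std::vector<float> block(input.begin() + start, input.begin() + end);
        interpolate(block);  // stage 1
        lowpass(block);      // stage 2
        decimate(block);     // stage 3
        output.insert(output.end(), block.begin(), block.end());
    }
    return output;
}
```

Real interpolation and decimation change the block length and filters need overlap between adjacent blocks, so the block boundaries would need care; this only shows the host-side loop structure.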
UPDATE
Not sure what the expected answer is here; if anyone has a reason I would love to hear it and potentially accept it as the valid answer. I don't work with OpenCL anymore, so I don't have the setup to verify.
Looking at the OpenCL specification and clCreateBuffer I would say the solution here is allowing use of host memory by adding CL_MEM_USE_HOST_PTR to flags (or whatever suits your use case). Paragraphs from CL_MEM_USE_HOST_PTR:
This flag is valid only if host_ptr is not NULL. If specified, it indicates that the application wants the OpenCL implementation to use memory referenced by host_ptr as the storage bits for the memory object.
The contents of the memory pointed to by host_ptr at the time of the clCreateBuffer call define the initial contents of the buffer object.
OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.
What this means is that the driver will pass memory between host and device in the most efficient way it can. It is basically what you propose yourself in the comments, except it is already built into the driver, activated with a single flag, and probably more efficient than anything you could come up with.
I want to create a local array inside my OpenCL kernel, whose size depends on a parameter of the kernel. It seems that's not allowed - at least with AMD APP.
Is your experience different? Perhaps it's just APP? Or is there some rationale here?
Edit: I would now suggest variable length arrays should be allowed in CPU-side code too, and it was an unfortunate call by the C standard committee; but the question stands.
You can dynamically allocate the size of a local block. You need to take it as a parameter to your kernel, and define its size when you call clSetKernelArg.
definition example:
__kernel void kernelName(__local float* myLocalFloats, ...)
host code:
clSetKernelArg(kernel, 0, myLocalFloatCount * sizeof(float), NULL); // <-- set the size to the correct number of bytes to allocate, but use NULL for the data.
Make sure you know what the limit for local memory is on your device before you do this. Call clGetDeviceInfo, and poll for the 'CL_DEVICE_LOCAL_MEM_SIZE' value.
Not sure why people are saying you can't do this, as it is something many people do with OpenCL (yes, I understand it's not exactly the same, but it works well enough for many cases).
Since OpenCL kernels are compiled at runtime from source text, you can simply set the size to whatever you want and then recompile your kernel. This obviously won't be ideal in cases where the sizes vary hugely, but usually I compile several different sizes at startup and then just call the correct one as needed (in your case, based on the kernel argument). If I encounter a new size I don't have a kernel for, I compile it right then and cache the kernel in case it comes up again.
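The recompile-per-size trick boils down to assembling the kernel source string with the size baked in before handing it to clCreateProgramWithSource / clBuildProgram. A minimal sketch (the kernel body and names are illustrative):

```cpp
#include <string>

// Build OpenCL C source with the local array size substituted in as a
// #define. The returned string would be passed to
// clCreateProgramWithSource and then clBuildProgram.
std::string make_kernel_source(int local_size) {
    return "#define LOCAL_SIZE " + std::to_string(local_size) + "\n"
           "__kernel void k(__global float* out) {\n"
           "    __local float scratch[LOCAL_SIZE];\n"
           "    // ... use scratch ...\n"
           "}\n";
}
```

Each distinct size yields a distinct program, so caching the built cl_kernel objects keyed by size avoids recompiling on every launch.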
I believe I may be over-thinking this problem a bit... I've got a text file located on my filesystem which I am parsing at boot and storing the results into an array of structs. I need to copy this array from user space into kernel space (copy_from_user), and must have this data accessible by the kernel at any time. The data in kernel space will need to be accessed by the Sockets.c file. Is there a special place to store an array within kernel space, or can I simply add a reference to the array in Sockets.c? My C is a bit rusty...
Thanks for any advice.
I believe there are two main parts in your problem:
Passing the data from userspace to kernelspace
Storing the data in the kernelspace
For the first issue, I would suggest using a Netlink socket, rather than the more traditional system call (read/write/ioctl) interface. Netlink sockets allow configuration data to be passed to the kernel using a socket-like interface, which is significantly simpler and safer to use.
Your program should perform all the input parsing and validation and then pass the data to the kernel, preferably in a more structured form (e.g. entry-by-entry) than a massive data blob.
Unless you are interested in high throughput (megabytes of data per second), the netlink interface is fine. The following links provide an explanation, as well as an example:
http://en.wikipedia.org/wiki/Netlink
http://www.linuxjournal.com/article/7356
http://linux-net.osdl.org/index.php/Generic_Netlink_HOWTO
http://www.kernel.org/doc/Documentation/connector/
As far as the array storage goes, if you plan on storing more than 128KB of data you will have to use vmalloc() to allocate the space, otherwise kmalloc() is preferred. You should read the related chapter of the Linux Device Drivers book:
http://lwn.net/images/pdf/LDD3/ch08.pdf
Please note that buffers allocated with vmalloc() are not suitable for DMA to/from devices, since the memory pages are not contiguous. You might also want to consider a more complex data structure like a list if you do not know how many entries you will have beforehand.
As for accessing the storage globally, you can do it as with any C program:
In a header file included by all .c files that you need to access the data put something like:
extern struct my_struct *unique_name_that_will_not_conflict_with_other_symbols;
The extern keyword indicates that this declares a variable that is defined in another source file. This makes the pointer accessible to all C files that include this header.
Then in a C file, preferably the one with the rest of your code (if one exists):
struct my_struct *unique_name_that_will_not_conflict_with_other_symbols = NULL;
This is the actual definition of the variable declared in the header file.
PS: If you are going to work with the Linux kernel, you really need to brush up on your C. Otherwise you will be in for some very frustrating moments and you WILL end up sorry and sore.
PS2: You will also save a lot of time if you at least skim through the whole Linux Device Drivers book. Despite its name and its relative age, it has a lot of information that is both current and important when writing any code for the Linux Kernel.
You can just define an extern pointer somewhere in the kernel (say, in the sockets.c file where you're going to use it). Initialise it to NULL, and include a declaration for it in some appropriate header file.
In the part of the code that does the copy_from_user(), allocate space for the array using kmalloc() and store the address in the pointer. Copy the data into it. You'll also want a mutex to be locked around access to the array.
The memory allocated by kmalloc() will persist until freed with kfree().
Your question is basic and vague enough that I recommend you work through some of the exercises in this book. The whole of chapter 8 is dedicated to allocating kernel memory.
Initializing the array as a global variable in your kernel module will keep it accessible for as long as the kernel is running, i.e. until your system shuts down.
I am developing a program using the CUDA SDK and a 9600 NVIDIA card with 1 GB of memory. In this program:
0) A kernel receives a pointer to a 2D int array of size 3000x6 in its input arguments.
1) The kernel has to sort it up to 3 levels (1st, 2nd & 3rd column).
2) For this purpose, the kernel declares an array of int pointers of size 3000.
3) The kernel then populates the pointer array with pointers to the locations of the input array in sorted order.
4) Finally, the kernel copies the input array into an output array by dereferencing the pointer array.
This last step fails and it halts the PC.
Q1) What are the guidelines for pointer dereferencing in CUDA to fetch the contents of memory? Even the smallest array of 20x2 does not work correctly; the same code works outside CUDA device memory (i.e. in a standard C program).
Q2) Isn't it supposed to work the same as in standard C using the '*' operator, or is there some CUDA API to be used for it?
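For reference, the host-side (standard C/C++) version of steps 0–4 that the question says works might look like the sketch below, with sizes reduced for the demo. The device version fails when the pointer table holds host addresses instead of addresses inside cudaMalloc'd memory, which is the usual cause of this kind of crash:

```cpp
#include <algorithm>
#include <vector>

// Build an array of row pointers into the input matrix, sort the
// pointers by the first three columns, then gather the rows into the
// output by dereferencing the sorted pointers.
void sort_rows(const int* input, int rows, int cols, int* output) {
    std::vector<const int*> ptrs(rows);
    for (int i = 0; i < rows; ++i)
        ptrs[i] = input + i * cols;                 // steps 2-3: pointer table
    std::sort(ptrs.begin(), ptrs.end(), [](const int* a, const int* b) {
        if (a[0] != b[0]) return a[0] < b[0];       // 1st column
        if (a[1] != b[1]) return a[1] < b[1];       // 2nd column
        return a[2] < b[2];                         // 3rd column
    });
    for (int i = 0; i < rows; ++i)                  // step 4: gather rows
        std::copy(ptrs[i], ptrs[i] + cols, output + i * cols);
}
```

On the device, every pointer stored in the table must point into device memory; mixing in a single host address makes the dereference in step 4 undefined.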
I just started looking into CUDA, but I literally just read this in a book. It sounds like it applies directly to you.
"You can pass pointers allocated with cudaMalloc() to functions that execute on the device (kernels, right?).
You can use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device (kernels again).
You can pass pointers allocated with cudaMalloc() to functions that execute on the host (regular C code).
You CANNOT use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host."
^^ from "CUDA by Example" by Jason Sanders and Edward Kandrot, published by Addison-Wesley (yadda yadda, no plagiarism here).
Since you are dereferencing inside the kernel, maybe the converse of the last rule is also true: you cannot use pointers allocated by the host to read or write memory from code that executes on the device.
Edit: I also just noticed a function called cudaMemcpy
Looks like you would need to declare the 3000-int array twice in host code: once by calling malloc, and once by calling cudaMalloc. Pass the CUDA one to the kernel along with the input array to be sorted. Then after calling the kernel function:
cudaMemcpy(malloced_array, cudaMallocedArray, 3000 * sizeof(int), cudaMemcpyDeviceToHost);
I literally just started looking into this, like I said, so maybe there's a better solution.
CUDA code can use pointers in exactly the same manner as host code (e.g. dereference with * or [], normal pointer arithmetic and so on). However it is important to remember that the location being accessed (i.e. the location to which the pointer points) must be visible to the GPU.
If you allocate host memory, using malloc() or std::vector for example, then that memory will not be visible to the GPU, it is host memory not device memory. To allocate device memory you should use cudaMalloc() - pointers to memory allocated using cudaMalloc() can be freely accessed from the device but not from the host.
To copy data between the two, use cudaMemcpy().
When you get more advanced the lines can be blurred a little, using "mapped memory" it is possible to allow the GPU to access parts of host memory but this must be handled in a particular way, see the CUDA Programming Guide for more information.
I'd strongly suggest you look at the CUDA SDK samples to see how all this works. Start with the vectorAdd sample perhaps, and any that are specific to your domain of expertise. Matrix multiplication and transpose are probably easy to digest too.
All the documentation, the toolkit and the code samples (SDK) are available on the CUDA developer web site.