Is it possible to share an array of pointers between multiple kernels in OpenCL? If so, how would I go about implementing it? If I am not mistaken (which may well be the case), the only way to share things between kernels is a shared cl_mem, but I also believe a cl_mem cannot contain pointers.
This is not possible in OpenCL 1.x because host and device have completely separate memory spaces, so a buffer containing host pointers makes no sense on the device side.
However, OpenCL 2.0 supports Shared Virtual Memory (SVM) and so memory containing pointers is legal because the host and device share an address space. There are three different levels of granularity though, which will limit what you can have those pointers point to. In the coarsest case they can only refer to locations within the same buffer or other SVM buffers currently owned by the device. Yes, cl_mem is still the way to pass in a buffer to a kernel, but in OpenCL 2.0 with SVM that buffer may contain pointers.
Edit/Addition: OP points out they just want to share pointers between kernels. If these are just device pointers, then you can store them in the buffer in one kernel and read them from the buffer in another kernel. They can only refer to __global, not __local memory. And without SVM they can't be used on the host. The host will of course need to allocate the buffer and pass it to both kernels for their use. As far as the host is concerned, it's just opaque memory. Only the kernels know they are __global pointers.
I ran into a similar problem, but I managed to get around it by using a simple pointer structure. I have doubts about the claim that buffers change their position in memory; perhaps that is true in some special cases, but it certainly cannot happen while a kernel is working with the buffer. I have not tested this on different video cards, but on NVIDIA (OpenCL 1.2) it works perfectly, so I can access data from an array that was not even passed as an argument to the kernel.
typedef struct
{
    __global volatile point_dataT *point; // pointer to a struct in a different buffer
} pointerBufT;

__kernel void tester(__global pointerBufT *pointer_buf)
{
    const size_t x = get_global_id(0);
    const size_t y = get_global_id(1);
    // Retrieving information from an array not passed to the kernel
    printf("Test id: %u\n", pointer_buf[x + y * get_global_size(0)].point->id);
}
I know this is a late reply, but for some reason I have only come across negative answers to similar questions, or suggestions to use indices instead of pointers, while a structure with a pointer inside works great.
Related
I'm trying to write a simple soft CPU in C that will work on an imaginary machine for an embedded application. I'm new to this, so bear with me.
I've been trying to do this in an IDE, but run into an issue where I need to malloc the memory and am not getting a consistent memory address for allocating my registers, so I'm unable to run tests and debug. On an actual piece of hardware, I understand that the documentation would give me the addresses of specific registers, main memory, and hard disk memory, correct? I'd like to be able to define macros for my registers that I can then pass around to read/write, but this seems impossible without static memory addresses.
So it seems like I need a good way to allocate a static chunk of memory with static addresses, either in an IDE or on my own machine with a text editor. What would be the best way to do this? For reference, I'm using Cloud9 IDE but can't figure out how to do it in this platform.
Thanks!
You should do something like uint8_t* const address_space = calloc( memory_size, sizeof(uint8_t) );, check the return value of course, and then make all your machine addresses indices into the array, like address_space[dest] = register[src];. If your emulated CPU can handle data of different sizes or has less strict alignment restrictions than your host CPU, you would need to use memcpy() or pointer casts to transfer data.
Your debugger will understand expressions like address_space[i] whether address_space is statically or dynamically allocated, but you can statically allocate it if you know the exact size in advance, such as to emulate a machine with 16-bit addresses that always has exactly 65,536 bytes of RAM.
There are several strings like
std::string first, second, third; ...
My plan was to collect their addresses into a char* array:
char *addresses[] = {&first[0], &second[0], &third[0]}; ...
and pass the char **addresses to the OpenCL kernel.
There are several problems or questions:
The main issue is that I cannot pass array of pointers.
Is there any good way to use many-many strings from the kernel code without copying them but leave them in the shared memory?
I'm using NVIDIA on Windows. So, I can use only OpenCL 1.2 version.
I cannot concatenate the strings because they come from different structures...
EDIT:
According to the first answer, if I have this (example):
char *p;
cl_mem cmHostString = clCreateBuffer(myDev.getcxGPUContext(), CL_MEM_ALLOC_HOST_PTR, BUFFER_SIZE, NULL, &oclErr);
oclErr = clEnqueueWriteBuffer(myDev.getCqCommandQueue(), cmHostString, CL_TRUE, 0, BUFFER_SIZE, p, 0, NULL, NULL);
Do I need to copy each element of my char array from host memory to another part of host memory (where the new address is hidden from the host)? That does not seem logical to me. Why can I not use the same address? I could directly access the host memory from the GPU device and use it.
Is there any good way to use many-many strings from the kernel code without copying them but leave them in the shared memory?
Not in OpenCL 1.2. The Shared Virtual Memory concept is only available since OpenCL 2.0, which NVIDIA does not support yet. You will need to either switch to a GPU that supports OpenCL 2.0 or, for OpenCL 1.2, copy your strings into a contiguous array of characters and pass (copy) that to the kernel.
EDIT: Responding to your edit - you can use either:
the CL_MEM_ALLOC_HOST_PTR flag to create an empty buffer of the required size, then map that buffer using clEnqueueMapBuffer and fill it through the pointer returned from mapping. After that, unmap the buffer using clEnqueueUnmapMemObject.
the CL_MEM_USE_HOST_PTR flag to create a buffer of the required size, passing in the pointer to your array of characters.
From my experience, a buffer created with the CL_MEM_USE_HOST_PTR flag is usually slightly faster; whether the data is really copied under the hood depends, I think, on the implementation. But to use it you need to have your array of characters prepared on the host first.
You basically need to benchmark and see what is faster. Also, don't concentrate too much on data copying: copy times are usually tiny (transfers run at GB/sec) compared to how long the kernel takes to run (which depends, of course, on what is in the kernel).
I'm currently experimenting with IPC via mmap on Unix.
So far, I'm able to map a 10 MB sparse file into RAM and access it, reading and writing, from two separate processes. Awesome :)
Now, I'm typecasting the address of the memory segment returned by mmap to char*, so I can use it as a plain old C string.
Now, my real question digs a bit deeper. I have quite a lot of experience with higher-level programming (Ruby, Java), but have never done larger projects in C or ASM.
I want to use the mapped memory as an address space for variable allocation. I don't know whether this is possible or makes any sense at all. I'm thinking of some kind of hash-map-like data structure that lives purely in the shared segment. This would allow some interesting experiments with IPC, even with other languages like Ruby over FFI.
Now, a regular implementation of a hash map would quite often use something like malloc, but that would allocate memory outside the shared space.
I hope you understand my thoughts, although my English is not the best!
Thank you in advance
Jakob
By and large, you can treat the memory returned by mmap like memory returned by malloc. However, since the memory may be shared between multiple "unrelated" processes, with independent calls to mmap, the starting address for each may be different. Thus, any data structure you build inside the shared memory should not use direct pointers.
Use offsets from the initial map address instead of pointers. The data structure then computes the right pointer value by adding the offset to the starting address of the mmap region.
The data structure would be built inside the single call to mmap. If you need to grow the data structure, you have to extend the mmap region itself. This can be done with mremap, or by manually calling munmap and mmap again after the backing file has been extended.
I believe I may be over-thinking this problem a bit... I've got a text file located on my filesystem which I am parsing at boot and storing the results into an array of structs. I need to copy this array from user space into kernel space (copy_from_user), and must have this data accessible by the kernel at any time. The data in kernel space will need to be accessed by the Sockets.c file. Is there a special place to store an array within kernel space, or can I simply add a reference to the array in Sockets.c? My C is a bit rusty...
Thanks for any advice.
I believe there are two main parts in your problem:
Passing the data from userspace to kernelspace
Storing the data in the kernelspace
For the first issue, I would suggest using a Netlink socket, rather than the more traditional system call (read/write/ioctl) interface. Netlink sockets allow configuration data to be passed to the kernel using a socket-like interface, which is significantly simpler and safer to use.
Your program should perform all the input parsing and validation and then pass the data to the kernel, preferably in a more structured form (e.g. entry-by-entry) than a massive data blob.
Unless you are interested in high throughput (megabytes of data per second), the netlink interface is fine. The following links provide an explanation, as well as an example:
http://en.wikipedia.org/wiki/Netlink
http://www.linuxjournal.com/article/7356
http://linux-net.osdl.org/index.php/Generic_Netlink_HOWTO
http://www.kernel.org/doc/Documentation/connector/
As far as the array storage goes, if you plan on storing more than 128KB of data you will have to use vmalloc() to allocate the space, otherwise kmalloc() is preferred. You should read the related chapter of the Linux Device Drivers book:
http://lwn.net/images/pdf/LDD3/ch08.pdf
Please note that buffers allocated with vmalloc() are not suitable for DMA to/from devices, since the memory pages are not contiguous. You might also want to consider a more complex data structure like a list if you do not know how many entries you will have beforehand.
As for accessing the storage globally, you can do it as with any C program:
In a header file included by all .c files that you need to access the data put something like:
extern struct my_struct *unique_name_that_will_not_conflict_with_other_symbols;
The extern keyword indicates that this declares a variable that is defined in another source file. This makes the pointer accessible to all C files that include this header.
Then, in a C file, preferably the one with the rest of your code (if one exists):
struct my_struct *unique_name_that_will_not_conflict_with_other_symbols = NULL;
This is the actual definition of the variable declared in the header file.
PS: If you are going to work with the Linux kernel, you really need to brush up on your C. Otherwise you will be in for some very frustrating moments and you WILL end up sorry and sore.
PS2: You will also save a lot of time if you at least skim through the whole Linux Device Drivers book. Despite its name and its relative age, it has a lot of information that is both current and important when writing any code for the Linux Kernel.
You can just define an extern pointer somewhere in the kernel (say, in the sockets.c file where you're going to use it). Initialise it to NULL, and include a declaration for it in some appropriate header file.
In the part of the code that does the copy_from_user(), allocate space for the array using kmalloc() and store the address in the pointer. Copy the data into it. You'll also want a mutex to be locked around access to the array.
The memory allocated by kmalloc() will persist until freed with kfree().
Your question is basic and vague enough that I recommend you work through some of the exercises in this book. The whole of chapter 8 is dedicated to allocating kernel memory.
Initializing the array as a global variable in your kernel module will make it accessible for as long as the kernel is running, i.e. until your system shuts down.
I am developing a program using the CUDA SDK and a 9600 1 GB NVIDIA card. In this program:
0)A kernel passes a pointer of 2D int array of size 3000x6 in its input arguments.
1)The kernel has to sort it on up to 3 levels (1st, 2nd & 3rd column).
2)For this purpose, the kernel declares an array of int pointers of size 3000.
3)The kernel then populates the pointer array with the pointers pointing to the locations of input array in sorted order.
4)Finally the kernel copies the input array in an output array by dereferencing the pointers array.
This last step fails and halts the PC.
Q1)What are the guidelines for pointer dereferencing in CUDA to fetch the contents of memory? Even the smallest array, 20x2, does not work correctly; the same code works outside CUDA device memory (i.e., in a standard C program).
Q2)Isn't it supposed to work the same as in standard C using the '*' operator, or is there some CUDA API to be used for it?
I just started looking into cuda, but I literally just read this out of a book. It sounds like it directly applies to you.
"You can pass pointers allocated with cudaMalloc() to functions that execute on the device. (kernels, right?)
You can use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device. (kernels again)
You can pass pointers allocated with cudaMalloc to functions that execute on the host. (regular C code)
You CANNOT use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host."
^^ from "Cuda by Example" by Jason Sanders and Edward Kandrot published by Addison-Wesley yadda yadda no plagiarism here.
Since you are dereferencing inside the kernel, maybe the opposite of the last rule is also true, i.e. you cannot use pointers allocated by the host to read or write memory from code that executes on the device.
Edit: I also just noticed a function called cudaMemcpy
Looks like you would need to declare the 3000-int array twice in host code: once by calling malloc, once by calling cudaMalloc. Pass the CUDA one to the kernel, as well as the input array to be sorted. Then, after calling the kernel function:
cudaMemcpy(malloced_array, cudaMallocedArray, 3000*sizeof(int), cudaMemcpyDeviceToHost);
I literally just started looking into this, like I said, so maybe there's a better solution.
CUDA code can use pointers in exactly the same manner as host code (e.g. dereference with * or [], normal pointer arithmetic, and so on). However, it is important to consider that the location being accessed (i.e. the location to which the pointer points) must be visible to the GPU.
If you allocate host memory, using malloc() or std::vector for example, then that memory will not be visible to the GPU, it is host memory not device memory. To allocate device memory you should use cudaMalloc() - pointers to memory allocated using cudaMalloc() can be freely accessed from the device but not from the host.
To copy data between the two, use cudaMemcpy().
When you get more advanced the lines can be blurred a little: using "mapped memory" it is possible to allow the GPU to access parts of host memory, but this must be handled in a particular way; see the CUDA Programming Guide for more information.
I'd strongly suggest you look at the CUDA SDK samples to see how all this works. Start with the vectorAdd sample perhaps, and any that are specific to your domain of expertise. Matrix multiplication and transpose are probably easy to digest too.
All the documentation, the toolkit and the code samples (SDK) are available on the CUDA developer web site.