I am having trouble using the __constant qualifier in my OpenCL kernels. My platform is Snow Leopard.
I have tried initializing a CL read-only memory object on the GPU, copying my constant array from host into it. Then I set the kernel argument just as with __global memory arguments, but this does not work as it should but I see no error or warnings. I have also tried using the data directly in the clSetKernelArg function as with float and int types, it works neither.
Do I make any mistakes or is there something wrong with Apple's implementation? I would like to see any working examples how this is done, both OpenCL (gpu) and host code.
I doubt there is something so fundamental wrong with Apple's implementation. I used the following OpenCL Hello World Example application to get my head around the basics.
In this example I replaced the __global float* input with __constant float* input and it worked fine. You also need to make sure your buffer is CL_MEM_READ_ONLY, using something like clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * count, NULL, NULL).
From reading the spec, I think __constant => __global + CL_MEM_READ_ONLY.
I'm running Snow Leopard on MBP 15".
There are some bugs with the way Apple's OpenCL compiler handles __constant variables on the GPU. If the compiler log says something like
OpenCL Build Error : Compiler build log:
Error while compiling the ptx module: CLH_ERROR_NO_BINARY_FOR_GPU
PTX Info log:
PTX Error log:
then I had the same error as you, and filed a bug on it. The folks at Apple marked it as a duplicate (of rdar://7217974 apparently) so I assume it's a known problem and they're working on it.
"From reading the spec, I think __constant => __global + CL_MEM_READ_ONLY."
Not really, when you specify _constant instead of __global, you are saying to your device to save this data in a different portion of memory. In some devices, its true that can be the same, but others couldn't be. On NVIDIA cards, for instance, you've only 64kb of _constant memory and loads of mb for __global. The advantage on __constants is that in NVIDIA devices, it is cached:)
You can query your device: (example of my device query)
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 128 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 255 MByte
CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
Related
I have to write a Stub for:
extern ECAN1MSGBUF ecan1msgBuf __attribute__((space(dma)));
Can someone explain to me what makes this call, how it works and how I can write / use a stub for a test program? I have the hardware not at home and must write a test, but the XCode announces as a warning: unknown attributes space ignored. Otherwise I work on the MPLabX compiler / debugger with access to the hardware. There is not the problem, of course.
DMA space on dspics is dual ported RAM that can be accessed without competing for memory bandwidth with the ALU (the actual CPU).
However, in some dspicE's (*) , DMA space is beyond the 32kb mark which needs EDS addressing. If so, you might want to view the sample code I posted about dspice CAN at http://www.microchip.com/forums/m790729.aspx#792226
Note that you can also use non dma space memory, the dma space memory is just more optimal.
(*) the ones with 56k memory, which are generally the 512KB flash parts for the GP and MU series.
I'm trying to write a simple soft CPU in C that will work on an imaginary machine for an embedded application. I'm new to this, so bear with me.
I've been trying to do this in an IDE, but run into an issue where I need to malloc the memory and am not getting a consistent memory address for allocating my registers, so I'm unable to run tests and debug. On an actual piece of hardware, I understand that the documentation would give me the addresses of specific registers, main memory, and hard disk memory, correct? I'd like to be able to define macros for my registers that I can then pass around to read/write, but this seems impossible without static memory addresses.
So it seems like I need a good way to allocate a static chunk of memory with static addresses, either in an IDE or on my own machine with a text editor. What would be the best way to do this? For reference, I'm using Cloud9 IDE but can't figure out how to do it in this platform.
Thanks!
You should do something like uint8_t* const address_space = calloc( memory_size, sizeof(uint8_t) );, check the return value of course, and then make all your machine addresses indices into the array, like address_space[dest] = register[src];. If your emulated CPU can handle data of different sizes or has less strict alignment restrictions than your host CPU, you would need to use memcpy() or pointer casts to transfer data.
Your debugger will understand expressions like address_space[i] whether address_space is statically or dynamically allocated, but you can statically allocate it if you know the exact size in advance, such as to emulate a machine with 16-bit addresses that always has exactly 65,536 bytes of RAM.
Is it possible to share an array of pointers between multiple kernels in OpenCL. If so, how would I go about implementing it? If I am not completely mistaken - which may though be the case - the only way of sharing things between kernels would be a shared cl_mem, however I also think these cannot contain pointers.
This is not possible in OpenCL 1.x because host and device have completely separate memory spaces, so a buffer containing host pointers makes no sense on the device side.
However, OpenCL 2.0 supports Shared Virtual Memory (SVM) and so memory containing pointers is legal because the host and device share an address space. There are three different levels of granularity though, which will limit what you can have those pointers point to. In the coarsest case they can only refer to locations within the same buffer or other SVM buffers currently owned by the device. Yes, cl_mem is still the way to pass in a buffer to a kernel, but in OpenCL 2.0 with SVM that buffer may contain pointers.
Edit/Addition: OP points out they just want to share pointers between kernels. If these are just device pointers, then you can store them in the buffer in one kernel and read them from the buffer in another kernel. They can only refer to __global, not __local memory. And without SVM they can't be used on the host. The host will of course need to allocate the buffer and pass it to both kernels for their use. As far as the host is concerned, it's just opaque memory. Only the kernels know they are __global pointers.
I ran into a similar problem, but I managed to get around it by using a simple pointer structure.I have doubts about the fact that someone says that buffers change their position in memory,perhaps this is true for some special cases.But this definitely cannot happen while the kernel is working with it. I have not tested it on different video cards, but on nvidia(cl 1.2) it works perfectly, so I can access data from an array that was not even passed as an argument into the kernel.
typedef struct
{
__global volatile point_dataT* point;//pointer to another struct in different buffer
} pointerBufT;
__kernel void tester(__global pointerBufT * pointer_buf){
printf("Test id: %u\n",pointer_buf[coord.x+coord.y*img_width].point->id);//Retrieving information from an array not passed to the kernel
}
I know that this is a late reply, but for some reason I have only come across negative answers to similar questions, or a suggestion to use indexes instead of pointers. While a structure with a pointer inside works great.
I want to create a local array inside my OpenCL kernel, whose size depends on a parameter of the kernel. It seems that's not allowed - at least with AMD APP.
Is your experience different? Perhaps it's just the APP? Or is is there some rationale here?
Edit: I would now suggest variable length arrays should be allowed in CPU-side code too, and it was an unfortunate call by the C standard committee; but the question stands.
You can dynamically allocate the size of a local block. You need to take it as a parameter to your kernel, and define its size when you call clSetKernelArg.
definition example:
__kernel void kernelName(__local float* myLocalFloats, ...)
host code:
clSetKernelArg(kernel, 0, myLocalFloatCount * sizeof(float), NULL); // <-- set the size to the correct number of bytes to allocate, but use NULL for the data.
Make sure you know what the limit for local memory is on your device before you do this. Call clGetDeviceInfo, and poll for the 'CL_DEVICE_LOCAL_MEM_SIZE' value.
Not sure why people are saying you can't do this as it is something many people do with OpenCL (Yes, I understand it's not exactly the same but it works well enough for many cases).
Since OpenCL kernels are compiled at runtime and, just like text, you can just simply set the size to whatever size you want and then recompile your kernel. This obviously won't be perfect in cases where you have huge variability in sizes but usually I compile several different sizes at startup and then just call the correct one as needed (in your case based on the kernel argument). If I get a new size I don't have a kernel for I will compile it right then and cache the kernel in case it comes up again.
I have some software that I have working on a redhat system with icc and it is working fine. When I ported the code to an IRIX system running with MIPS then I get some calculations that come out as "nan" when there should definitely be values there.
I don't have any good debuggers on the non-redhat system, but I have tracked down that some of my arrays are getting "nan" sporadically in them and that is causing my dot product calculation to come back as "nan."
Seeing as how I can't track it down with a debugger, I am thinking that the problem may be with a memcpy. Are there any issues with the MIPS compiler memcpy() function with dynamically allocated arrays? I am basically using
memcpy(to, from, n*sizeof(double));
And I can't really prove it, but I think this may be the issue. Is there some workaround? Perhaps sme data is misaligned? How do I fix that?
I'd be surprised if your problem came from a bug in memcpy. It may be an alignment issue: are your doubles sufficiently aligned? (They will be if you only store them in double or double[] objects or through double* pointers but might not be if you move them around via void* pointers). X86 platforms are more tolerant to misalignment than most.
Did you try compiling your code with gcc at a high warning level? (Gcc is available just about everywhere that's not a microcontroller or mainframe. It may produce slower code but better diagnostics than the “native” compiler.)
Of course, it could always be a buffer overflow or other memory management problem in some unrelated part of the code that just happened not to cause any visible bug on your original platform.
If you can't get a access to a good debugger, try at least printf'ing stuff in key places.
Is it possible for the memory regions to and from to overlap? memcpy isn't required to handle overlapping memory regions. If this is your problem then the solution is as simple as using memmove instead.
Is sizeof() definitely supported?