OpenCL - clCreateBuffer size error. Possible workarounds? - c

After investigating the reason why my program was crashing, I found that I was hitting the maximum for a buffer size, which is 512 MB for me (CL_DEVICE_MAX_MEM_ALLOC_SIZE).
In my case, here are the parameters.
P = 146 (interpolation factor)
num_items = 918144 (number of samples)
sizeof(float) -> 4
So my clCreateBuffer looks something like this:
output = clCreateBuffer(
    context,
    CL_MEM_READ_ONLY,
    num_items * P * sizeof(float),
    NULL,
    &status);
When those are multiplied together and divided by (1024 x 1024), you get around 511 MB, which is just under the threshold. Increase any of the parameters and it crashes, because the size then exceeds the 512 MB limit.
My question is, how can I implement the code in a way where I can use block sizes to do my calculations instead of storing everything in memory and passing that massive chunk of data to the kernel? In reality, the number of samples I have could easily grow to over 5 million, and I definitely will not have enough memory to store all those values.
I'm just not sure how to pass small sets of values into my kernel, as I have three steps that the values go through before getting an output.
First is an interpolation kernel, then the values go to a lowpass filter kernel and then to a kernel that does decimation. After that the values are written to an output array. If further details of the program are needed for the sake of the problem I can add more.
UPDATE
Not sure what the expected answer is here; if anyone has a reason I would love to hear it and potentially accept it as the valid answer. I don't work with OpenCL anymore, so I don't have the setup to verify.

Looking at the OpenCL specification and clCreateBuffer, I would say the solution here is to allow use of host memory by adding CL_MEM_USE_HOST_PTR to the flags (or whatever suits your use case). Paragraphs from CL_MEM_USE_HOST_PTR:
This flag is valid only if host_ptr is not NULL. If specified, it indicates that the application wants the OpenCL implementation to use memory referenced by host_ptr as the storage bits for the memory object.
The contents of the memory pointed to by host_ptr at the time of the clCreateBuffer call define the initial contents of the buffer object.
OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.
What this means is that the driver will pass memory between host and device in the most efficient way it can. This is basically what you propose yourself in the comments, except it is already built into the driver, activated with a single flag, and probably more efficient than anything you could come up with yourself.
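As a rough sketch, the change to the question's call might look like this (CL_MEM_READ_WRITE and the malloc'd host_data are my assumptions, added just to make the fragment self-contained):
/* Back the buffer with host memory and let the driver manage transfers.
 * host_data must stay valid for as long as the buffer is in use. */
float *host_data = malloc((size_t)num_items * P * sizeof(float));
output = clCreateBuffer(
    context,
    CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, /* host_ptr must be non-NULL */
    (size_t)num_items * P * sizeof(float),
    host_data,
    &status);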

Related

Process strings from OpenCL kernel

There are several strings like
std::string first, second, third; ...
My plan was to collect their addresses into a char* array:
char *addresses[] = {&first[0], &second[0], &third[0]}; ...
and pass the char **addresses to the OpenCL kernel.
There are several problems or questions:
The main issue is that I cannot pass an array of pointers.
Is there any good way to use many-many strings from the kernel code without copying them but leave them in the shared memory?
I'm using NVIDIA on Windows. So, I can use only OpenCL 1.2 version.
I cannot concatenate the strings because they come from different structures...
EDIT:
According to the first answer, if I have this (example):
char *p;
cl_mem cmHostString = clCreateBuffer(myDev.getcxGPUContext(), CL_MEM_ALLOC_HOST_PTR, BUFFER_SIZE, NULL, &oclErr);
oclErr = clEnqueueWriteBuffer(myDev.getCqCommandQueue(), cmHostString, CL_TRUE, 0, BUFFER_SIZE, p, 0, NULL, NULL);
Do I need to copy each element of my char array from host memory to another part of host memory (with the new address hidden from the host)? That does not seem logical to me. Why can't I use the same address? I could directly access the host memory from the GPU device and use it.
Is there any good way to use many-many strings from the kernel code without copying them but leave them in the shared memory?
Not in OpenCL 1.2. The Shared Virtual Memory concept is available since OpenCL 2.0, which isn't supported by NVIDIA yet. You will need to either switch to a GPU that supports OpenCL 2.0 or, for OpenCL 1.2, copy your strings into a contiguous array of characters and pass that (a copy) to the kernel.
EDIT: Responding to your edit - you can use:
the CL_MEM_ALLOC_HOST_PTR flag to create an empty buffer of the required size, then map that buffer using clEnqueueMapBuffer, fill it through the pointer returned from mapping, and finally unmap it using clEnqueueUnmapMemObject;
the CL_MEM_USE_HOST_PTR flag to create a buffer of the required size, passing in the pointer to your array of characters.
From my experience, a buffer created using the CL_MEM_USE_HOST_PTR flag is usually slightly faster; I think whether the data is really copied under the hood depends on the implementation. But to use it you need to have your array of characters already prepared on the host.
You basically need to benchmark and see what is faster. Also, don't concentrate too much on data copying; transfer times are usually tiny (they run at GB/s) compared to how long it takes to run the kernel (which depends, of course, on what's in the kernel).
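A rough sketch of both options (ctx, queue, and the packed chars array are assumed to exist; error checking omitted):
/* Option 1: CL_MEM_ALLOC_HOST_PTR, then map, fill in place, unmap. */
cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                            BUFFER_SIZE, NULL, &err);
char *mapped = (char *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                          0, BUFFER_SIZE, 0, NULL, NULL, &err);
memcpy(mapped, chars, BUFFER_SIZE);  /* chars = strings packed into one array */
clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);

/* Option 2: CL_MEM_USE_HOST_PTR, handing OpenCL your existing array. */
cl_mem buf2 = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                             BUFFER_SIZE, chars, &err);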

variable length array declaration not allowed in OpenCL - why?

I want to create a local array inside my OpenCL kernel, whose size depends on a parameter of the kernel. It seems that's not allowed - at least with AMD APP.
Is your experience different? Perhaps it's just the APP? Or is there some rationale here?
Edit: I would now suggest variable length arrays should be allowed in CPU-side code too, and it was an unfortunate call by the C standard committee; but the question stands.
You can dynamically allocate the size of a local block. You need to take it as a parameter to your kernel, and define its size when you call clSetKernelArg.
definition example:
__kernel void kernelName(__local float* myLocalFloats, ...)
host code:
clSetKernelArg(kernel, 0, myLocalFloatCount * sizeof(float), NULL); // <-- set the size to the correct number of bytes to allocate, but use NULL for the data.
Make sure you know what the limit for local memory is on your device before you do this. Call clGetDeviceInfo, and poll for the 'CL_DEVICE_LOCAL_MEM_SIZE' value.
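For example, a small sketch of that query (device is assumed to be your cl_device_id):
/* Query the device's local memory limit before sizing the __local arg. */
cl_ulong local_mem_size = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(local_mem_size), &local_mem_size, NULL);
if (myLocalFloatCount * sizeof(float) > local_mem_size) {
    /* Requested block won't fit: shrink it or split the work. */
}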
Not sure why people are saying you can't do this, as it is something many people do with OpenCL (yes, I understand it's not exactly the same, but it works well enough for many cases).
Since OpenCL kernels are compiled at runtime from source text, you can simply set the size to whatever you want and then recompile your kernel. This obviously won't be perfect in cases where you have huge variability in sizes, but usually I compile several different sizes at startup and then just call the correct one as needed (in your case, based on the kernel argument). If I get a new size I don't have a kernel for, I compile it right then and cache the kernel in case it comes up again.
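A sketch of that approach (program, device, and needed_size are assumed to exist; the kernel source would then declare something like float scratch[BLOCK_SIZE];):
/* Bake the array size into the kernel as a macro at build time. */
char options[64];
snprintf(options, sizeof(options), "-DBLOCK_SIZE=%zu", needed_size);
clBuildProgram(program, 1, &device, options, NULL, NULL);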

How to handle multiple pages buffers and scatterlists for Linux Crypto API?

I am facing some trouble processing large buffers. Since I had only tested my code on quite small buffers (no larger than PAGE_SIZE), I had not run into this before. The code simply encrypts or decrypts a buffer.
Currently, the code just sets one scatterlist object with the sg_set_buf() call for the source and the destination buffers. But it appears that, when doing things this simple, the encryption does not occur if the buffer size exceeds PAGE_SIZE.
Obviously, I can bypass the problem by allocating a smaller buffer that fits in a single page and processing the larger buffer "progressively" with appropriate memcpy() calls. But this is ugly, and both time- and resource-consuming...
I was wondering if there is a way to handle scatterlist objects nicely for this kind of buffer?
EDIT : I forgot to say I already went through this question.
Other EDIT: In fact, I have much the same issue as user173586. The thing is, I cannot know in advance whether the buffer handed to me was allocated with vmalloc() or kmalloc().
To determine this, I just have to check whether the given address is in the range [VMALLOC_START, VMALLOC_END]. Once that is done, I still have to set up the scatterlist objects properly (here is the hard part).
I know I can retrieve the page corresponding to a vmalloc()-ed buffer with vmalloc_to_page(). At this point, I have a struct page object corresponding to the address I gave. I do not know how I can get the offset in the corresponding page.
How can I know the page object's "validity"? I mean, which area of the page is actually used by the vmalloc()-ed buffer? At first glance, it seems I need to retrieve each page used for the buffer and set up a scatterlist entry for it, but I have no idea how I should do that.
(Any insight on the inner vmalloc() functioning can help. My current knowledge about this can be inferred from this article)
The key is that kmalloc will always return space in physically contiguous pages but vmalloc will not. You can detect a kmalloc'd buffer with virt_addr_valid. If that fails, you have to iterate through all the pages in the buffer and create separate scatterlist entries. You can see an example by looking at this recent commit to click, though this code reuses the same, single scatterlist. It is also possible to allocate (len/PAGE_SIZE) + 1 scatterlists (the last one might not actually get used) and fill them out the same way.
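A sketch of that page-by-page approach (buf and len are assumed to be the vmalloc()'d buffer and its size; error handling omitted):
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* One scatterlist entry per page touched by the buffer. */
unsigned int nents = DIV_ROUND_UP(offset_in_page(buf) + len, PAGE_SIZE);
struct scatterlist *sg = kmalloc_array(nents, sizeof(*sg), GFP_KERNEL);
char *p = buf;
size_t left = len;
unsigned int i;

sg_init_table(sg, nents);
for (i = 0; i < nents; i++) {
        unsigned int off = offset_in_page(p);
        unsigned int chunk = min_t(size_t, left, PAGE_SIZE - off);

        /* vmalloc pages are not physically contiguous, so each page
         * gets its own entry, located via vmalloc_to_page(). */
        sg_set_page(&sg[i], vmalloc_to_page(p), chunk, off);
        p += chunk;
        left -= chunk;
}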

Best place to put constant memory which is known before kernel launch and never changed

I have an array of integers whose size is known before the kernel launch but not at compile time. The upper bound on the size is around 10000 float3 elements (I guess that means 10000 * 3 * 4 bytes = ~120 KB).
All threads scan linearly through (at most) all of the elements in the array.
You could check the size at runtime, then use cudaMemcpyToSymbol if it fits, or otherwise use texture or global memory. This is slightly messy; you will need some parameter to tell the kernel where the data is. As always, test actual performance. Different access patterns can have drastically different speeds in different types of memory.
Another thought is to take a step back and look at the algorithm again. There are often ways of dividing the problem differently to get the constant table to always fit into constant memory.
If all threads in a warp access the same elements at the same time then you should probably consider using constant memory, since this is not only cached, but it also has a broadcast capability whereby all threads can read the same address in a single cycle.
You could calculate the free constant memory after compiling your kernels and allocate it statically.
__constant__ int c[ALL_I_CAN_ALLOCATE];
Then, copy your data to constant memory using cudaMemcpyToSymbol().
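A rough sketch of that pattern (the 60 KB capacity is an assumption; most CUDA devices expose 64 KB of constant memory, part of which may be in use elsewhere):
/* Statically reserve what plausibly fits, fill only what's needed. */
#define ALL_I_CAN_ALLOCATE (60 * 1024 / sizeof(int))
__constant__ int c[ALL_I_CAN_ALLOCATE];

cudaError_t upload(const int *host_data, size_t n)
{
    if (n > ALL_I_CAN_ALLOCATE)          /* too big: fall back to global memory */
        return cudaErrorInvalidValue;
    return cudaMemcpyToSymbol(c, host_data, n * sizeof(int));
}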
I think this might answer your question, but your requirement for constant memory exceeds the limits of the GPU (most CUDA devices offer 64 KB of constant memory, less than the ~120 KB needed).
I'd recommend other approaches, e.g. using shared memory, which can broadcast data if all threads in a half-warp read from the same location.

What is a good size for medium sized memory allocations?

For a serializing system, I need to allocate buffers to write data into. The size needed is not known in advance, so the basic pattern is to malloc N bytes and use realloc if more is needed. The size of N would be large enough to accommodate most objects, making reallocation rare.
This made me think that there is probably an optimal initial number of bytes that malloc can satisfy more easily than others. I'm guessing somewhere close to the page size, although not necessarily exactly that, if malloc needs some room for housekeeping.
Now, I'm sure it is a useless optimization, and if it really mattered, I could use a pool, but I'm curious; I can't be the first programmer to think "give me whatever chunk of bytes is easiest to allocate" as a start. Is there a way to determine this?
Any answer for this that specifically applies to modern GCC/G++ and/or linux will be accepted.
From reading this wiki page it seems that your answer would vary wildly depending on the implementation of malloc you're using and the OS. Reading the bit on OpenBSD's malloc is particularly interesting. It sounds like you want to look at mmap, too, but at a guess I'd say the default page size (4096 bytes?) is what allocations are optimised for.
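If you want to see what a given glibc actually grants, malloc_usable_size (a GNU extension in <malloc.h>, so this is specific to the question's GCC/Linux constraint) reports the usable size of a block; a quick probe might look like this:
#include <malloc.h>   /* glibc-specific */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Probe how much usable space malloc really hands back for a few
     * request sizes near a page; the overshoot is "free" slack. */
    size_t sizes[] = { 4000, 4080, 4096, 8192 };
    for (size_t i = 0; i < sizeof sizes / sizeof *sizes; i++) {
        void *p = malloc(sizes[i]);
        printf("requested %zu, usable %zu\n", sizes[i], malloc_usable_size(p));
        free(p);
    }
    return 0;
}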
My suggestion to you would be to find appropriate malloc/realloc/free source code so that you can implement your own "malloc_first" alongside the others in the same source module (using the same memory structures), which simply allocates and returns the first available block greater than or equal to a passed minimum_bytes parameter. If 0 is passed, you'll get the first block, period.
An appropriate declaration could be
void *malloc_first (size_t minimum_bytes, size_t *actual_bytes);
How doable such an undertaking would be, I don't know. I suggest you attempt it using Linux, where all the source code is available.
The way it's done in similar cases is for the first malloc to allocate some significant but not too large chunk, which would suit most cases (as you described), and every subsequent realloc call to double the requested size.
So, if at first you allocate 100, next time you'll realloc 200, then 400, 800 and so on. In this way the chances of subsequent reallocation will be lower after each time you do it.
If memory serves me right, that's how std::vector behaves.
after edit
The optimal initial allocation size would be the one that covers most of your cases on one side, but isn't too wasteful on the other. If your average case is 50 but it can spike to 500, you'll want to allocate 50 initially and then double or triple (or multiply by 10) on every subsequent realloc, so that you can reach 500 in 1-3 reallocs while any further reallocs remain unlikely and infrequent. So it depends on your usage patterns, basically.
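A minimal sketch of that doubling pattern in C (INITIAL and the buf_t layout are placeholders):
#include <stdlib.h>
#include <string.h>

#define INITIAL 4096   /* pick whatever covers your common case */

typedef struct { char *data; size_t len, cap; } buf_t;

int buf_append(buf_t *b, const void *src, size_t n)
{
    if (b->len + n > b->cap) {
        size_t cap = b->cap ? b->cap : INITIAL;
        while (cap < b->len + n)
            cap *= 2;                  /* geometric growth keeps reallocs rare */
        char *p = realloc(b->data, cap);
        if (!p)
            return -1;
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}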
