Best place to put constant data which is known before kernel launch and never changed

I have an array whose size is known before the kernel launch, but not at compile time. The upper bound on the size is around 10000 float3 elements (I guess that means 10000 * 3 * 4 bytes, so roughly 120 KB).
All threads scan linearly through (at most) all of the elements in the array.

You could check the size at runtime, then use cudaMemcpyToSymbol if it fits in constant memory, or otherwise use texture or global memory. This is slightly messy: you will need some parameter to tell the kernel where the data is. As always, test actual performance. Different access patterns can have drastically different speeds in different types of memory.
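A minimal sketch of that runtime dispatch, assuming a fixed __constant__ reservation; all names (scanKernel, cTable, MAX_CONST_ELEMS) are illustrative:

#define MAX_CONST_ELEMS 4096                /* 4096 * 12 B = 48 KB, under the 64 KB bank */
__constant__ float3 cTable[MAX_CONST_ELEMS];

__global__ void scanKernel(const float3 *gTable, int n, int useConst, float *out)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        /* same code path, two possible sources */
        float3 v = useConst ? cTable[i] : gTable[i];
        acc += v.x + v.y + v.z;             /* stand-in for the real per-element work */
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

/* Host side (error checking omitted): pick the memory space at launch time. */
void launchScan(const float3 *hostTable, int n, float3 *devTable, float *devOut,
                dim3 grid, dim3 block)
{
    if (n <= MAX_CONST_ELEMS) {
        cudaMemcpyToSymbol(cTable, hostTable, n * sizeof(float3));
        scanKernel<<<grid, block>>>(NULL, n, 1, devOut);
    } else {
        scanKernel<<<grid, block>>>(devTable, n, 0, devOut);
    }
}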
Another thought is to take a step back and look at the algorithm again. There are often ways of dividing the problem differently to get the constant table to always fit into constant memory.

If all threads in a warp access the same elements at the same time then you should probably consider using constant memory, since this is not only cached, but it also has a broadcast capability whereby all threads can read the same address in a single cycle.

You could calculate the free constant memory after compiling your kernels and allocate it statically:
__constant__ int c[ALL_I_CAN_ALLOCATE];
Then, copy your data to constant memory using cudaMemcpyToSymbol().
I think this might answer your question, but your requirement for constant memory exceeds the limits of the GPU.
I'd recommend other approaches, e.g. using shared memory, which can broadcast data if all threads in a half-warp read from the same location.
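As a rough illustration of the shared-memory route, one could stage the table in tiles so that all threads read the same shared location together (names and the TILE size are assumptions, sized for a Fermi-era 48 KB shared memory budget):

#define TILE 1024                           /* 1024 * 12 B = 12 KB of shared memory */

__global__ void scanShared(const float3 *table, int n, float *out)
{
    __shared__ float3 tile[TILE];
    float acc = 0.0f;

    for (int base = 0; base < n; base += TILE) {
        int chunk = min(TILE, n - base);

        /* Cooperative load of one tile into shared memory. */
        for (int i = threadIdx.x; i < chunk; i += blockDim.x)
            tile[i] = table[base + i];
        __syncthreads();

        /* All threads read the same tile[j] at the same time:
           shared memory broadcast, no bank conflicts. */
        for (int j = 0; j < chunk; ++j) {
            float3 v = tile[j];
            acc += v.x + v.y + v.z;         /* stand-in for the real work */
        }
        __syncthreads();
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}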

Related

OpenCL - clCreateBuffer size error. Possible work arounds?

After investigating why my program was crashing, I found that I was hitting the maximum buffer size, which is 512 MB for me (CL_DEVICE_MAX_MEM_ALLOC_SIZE).
In my case, here are the parameters.
P = 146 (interpolation factor)
num_items = 918144 (number of samples)
sizeof(float) -> 4
So my clCreateBuffer looks something like this:
output = clCreateBuffer(context,
                        CL_MEM_READ_ONLY,
                        num_items * P * sizeof(float),
                        NULL,
                        &status);
When the above is multiplied together and divided by (1024 x 1024), you get around 511 MB, which is just under the threshold. Increase any of the parameters by one and it crashes, because the allocation exceeds that 512 MB limit.
My question is: how can I implement the code so that I can use block sizes to do my calculations, instead of storing everything in memory and passing that massive chunk of data to the kernel? In reality, the number of samples I have could easily grow to over 5 million, and I definitely will not have enough memory to store all those values.
I'm just not sure how to pass small sets of values into my kernel, as the values go through three steps before producing an output: first an interpolation kernel, then a lowpass filter kernel, and then a kernel that does decimation. After that the values are written to an output array. I can add further details of the program if needed for the sake of the problem.
UPDATE
Not sure what the expected answer is here; if anyone has a reason I would love to hear it and potentially accept it as the valid answer. I don't work with OpenCL anymore, so I don't have the setup to verify.
Looking at the OpenCL specification and clCreateBuffer, I would say the solution here is to allow use of host memory by adding CL_MEM_USE_HOST_PTR to the flags (or whatever suits your use case). Paragraphs from CL_MEM_USE_HOST_PTR:
This flag is valid only if host_ptr is not NULL. If specified, it indicates that the application wants the OpenCL implementation to use memory referenced by host_ptr as the storage bits for the memory object.

The contents of the memory pointed to by host_ptr at the time of the clCreateBuffer call define the initial contents of the buffer object.

OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.
What this means is that the driver will pass memory between host and device in the most efficient way it can. Basically, this is what you propose yourself in the comments, except it is already built into the driver, activated with a single flag, and probably more efficient than anything you could come up with yourself.
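A minimal sketch of that change against the buffer creation above (host_data and the surrounding setup are assumed):

/* Hypothetical sketch: back the buffer with host memory so the driver
   manages host/device transfers. host_ptr must be non-NULL with this flag. */
float *host_data = malloc((size_t)num_items * P * sizeof(float));
/* ... fill host_data ... */
cl_mem output = clCreateBuffer(context,
                               CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                               (size_t)num_items * P * sizeof(float),
                               host_data,
                               &status);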

malloc and other associated functions

I have an array named 'ArrayA' and it is full of ints, but I want to add another 5 cells to the end of the array every time a condition is met. How would I do this? (The internet is not being very helpful.)
If this is a static array, you will have to create a new one with more space and copy the data yourself. If it was allocated with malloc(), as the title to your question suggests, then you can use realloc() to do this more-or-less automatically. Note that the address of your array will, in general, have changed.
It is precisely because of the need for "dynamic" arrays that grow (and shrink) as needed, that languages like C++ introduced vectors. They do the management under the covers.
You need the realloc function.
Also note that adding 5 cells at a time is not the best solution for performance.
It is best to double the size of your array every time it needs to grow.
Use two variables, one for the size (the number of integers in use) and one for the capacity (the allocated size of the array).
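A minimal sketch of that size/capacity pattern, with illustrative names:

#include <stdlib.h>

typedef struct {
    int *data;
    size_t size;      /* number of ints in use */
    size_t capacity;  /* number of ints allocated */
} IntArray;

/* Append one value, doubling the capacity when full. Returns 0 on success. */
int int_array_push(IntArray *a, int value)
{
    if (a->size == a->capacity) {
        size_t new_cap = a->capacity ? a->capacity * 2 : 8;
        int *p = realloc(a->data, new_cap * sizeof *p);
        if (!p)
            return -1;        /* realloc can fail; the old block is still valid */
        a->data = p;
        a->capacity = new_cap;
    }
    a->data[a->size++] = value;
    return 0;
}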
In a modern OS it is generally safe to assume that if you allocate a lot of memory that you don't use then it will not actually consume physical RAM, but only exist as virtual mappings. The OS will provide physical RAM as soon as a page (today generally in chunks of 4Kb) is used for the first time.
You can specifically enforce this behavior by using mmap to create a large anonymous mapping (MAP_PRIVATE | MAP_ANONYMOUS), e.g. as large as you ever intend to hold at maximum. On modern x64 systems, virtual mappings can be up to 64 TB large. The memory is logically available to your program, but in practice pages will be added to it as you start using them.
realloc, as described by the other posters, is the naive way to resize a malloc allocation, but make sure that realloc was successful. It can fail!
Problems with memory arise when you touch memory once, then stop using it without ever deallocating it. In contrast, allocated but untouched memory generally does not consume resources other than VM table entries.
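A rough sketch of the anonymous-mapping approach on Linux/BSD (the reservation size is an arbitrary example):

#include <sys/mman.h>

/* Reserve a large range of virtual address space up front; the OS commits
   physical pages lazily, on the first touch of each page. */
size_t max_bytes = (size_t)1 << 34;   /* 16 GB of virtual space, for example */
int *arr = mmap(NULL, max_bytes, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (arr == MAP_FAILED) {
    /* handle the error; nothing was reserved */
}
/* arr[i] can now be used; physical RAM is consumed only for touched pages. */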

Constant memory usage in CUDA code

I cannot figure out by myself what the best way is to ensure that the memory used in my kernel is constant. There is a similar question at http://stackoverflow...r-pleasant-way.
I am working with GTX580 and compiling only for 2.0 capability. My kernel looks like
__global__ void Foo(const int *src, float *result) {...}
I execute the following code on host:
cudaMalloc(&src, size);
cudaMemcpy(src, hostSrc, size, cudaMemcpyHostToDevice);
Foo<<<...>>>(src, result);
the alternative way is to add
__constant__ int src[size];
to .cu file, remove src pointer from the kernel and execute
cudaMemcpyToSymbol("src", hostSrc, size, 0, cudaMemcpyHostToDevice);
Foo<<<...>>>(result);
Are these two ways equivalent, or does the first one not guarantee the usage of constant memory instead of global memory? size changes dynamically, so the second way is not handy in my case.
The second way is the only way to ensure that the array is compiled into CUDA constant memory and accessed correctly via the constant memory cache. But you should ask yourself how the contents of that array will be accessed within a block of threads. If every thread accesses the array uniformly, then there will be a performance advantage in using constant memory, because there is a broadcast mechanism from the constant memory cache (it also saves global memory bandwidth, because constant memory is stored in off-chip DRAM and the cache reduces the DRAM transaction count). But if access is random, then accesses can be serialised, which will negatively affect performance.
Typical things which might be good fits for __constant__ memory are model coefficients, weights, and other constant values which need to be set at runtime. On Fermi GPUs, the kernel argument list is stored in constant memory, for example. But if the contents are accessed non-uniformly, or the type or size of members isn't constant from call to call, then normal global memory is preferable.
Also keep in mind that there is a limit of 64 KB of constant memory per GPU context, so it is not practical to store very large amounts of data in constant memory. If you need a lot of read-only storage with a cache, it might be worth binding the data to a texture and seeing what the performance is like. On pre-Fermi cards, it usually yields a handy performance gain; on Fermi, the results can be less predictable compared to global memory, because of the improved cache layout in that architecture.
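For the texture suggestion, a minimal sketch using the legacy texture reference API of that era (names are illustrative):

// Hypothetical sketch: read the array through the texture cache instead.
texture<int, 1, cudaReadModeElementType> srcTex;

__global__ void FooTex(float *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        result[i] = (float)tex1Dfetch(srcTex, i);  // cached read-only fetch
}

// Host side, after cudaMalloc/cudaMemcpy of devSrc:
// cudaBindTexture(NULL, srcTex, devSrc, size);
// FooTex<<<grid, block>>>(result, n);
// cudaUnbindTexture(srcTex);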
The first method only guarantees that the memory is constant inside the function Foo (via the const qualifier). The two are not equivalent: the second guarantees the data lives in constant memory after its initialisation. If you need dynamic sizing, then you need to use something similar to the first way.

Is it slow to do a lot of mallocs and frees on iPhone?

I have an array of point data (for particles) which constantly changes size. To adapt to the changing size, I use code like the following to create correctly sized buffers at about 60 Hz.
free(points);
points = malloc(sizeof(point3D) * pointCount);
Is this acceptable, or is there another way I could be doing this? Could this cause my app to slow down or cause memory thrashing? When I run it under Instruments in the simulator it doesn't look particularly bad, but I know the simulator is different from the device.
EDIT: At the time of writing, one could not test on device without a developer license. I did not have a license and could not profile on device.
Allocating memory is fast relative to some things and slow relative to others. The average Objective-C program does a lot more than 60 allocations per second. For allocations of a few million bytes, malloc + free should take less than a thousandth of a second. Compared to arithmetic operations that's slow, but compared to other things it's fast.
Whether it's fast enough in your case is a question for testing. It's certainly possible to do 60 Hz memory allocations on the iPhone — the processor runs at 600 MHz.
This certainly does seem like a good candidate for reusing the memory, though. Keep track of the size of the pool and allocate more if you need more. Not allocating memory is always faster than allocating it.
Try starting with an estimated particle count and malloc-ing an array of that size. Then, if your particle count needs to increase, use realloc to re-size the existing buffer. That way, you minimize the amount of allocate/free operations that you are doing.
If you want to make sure that you don't waste memory, you can also keep a record of the last 100 (or so) particle counts. If the max particle count out of that set is less than (let's say) 75% of your current buffer size, then resize the buffer down to fit that smaller particle count.
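A rough sketch of that grow-and-shrink policy (thresholds and variable names are illustrative):

/* Hypothetical sketch, run once per frame: grow on demand, shrink when the
   recent peak usage stays well below capacity. */
if (pointCount > capacity) {
    point3D *p = realloc(points, sizeof(point3D) * pointCount);
    if (p) { points = p; capacity = pointCount; }
}
if (pointCount > recentPeak)
    recentPeak = pointCount;
if (++framesSinceCheck >= 100) {            /* review every ~100 frames */
    if (recentPeak > 0 && recentPeak < capacity * 3 / 4) {
        point3D *p = realloc(points, sizeof(point3D) * recentPeak);
        if (p) { points = p; capacity = recentPeak; }
    }
    framesSinceCheck = 0;
    recentPeak = 0;
}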
I'll add another answer that's more direct to the point of the original question. Most of the answers prior to this one (including my own) are very likely premature optimizations.
I have iPhone apps that do many thousands of mallocs and frees per second, and they don't even show up in a profile of the app.
So the answer to the original question is no.
You don't need to remalloc unless the number of particles increases (or you handled a memory warning in the interim). Just keep the last malloc'd size around for comparison.
As hotpaw2 mentioned, if you need to optimise you could perhaps do so by only allocating if you need more space, i.e.:
particleCount = [particles count];
if (particleCount > allocatedParticleCount) {
    if (vertices) {
        free(vertices);
    }
    if (textures) {
        free(textures);
    }
    vertices = malloc(sizeof(point3D) * 4 * particleCount);
    textures = malloc(sizeof(point2D) * 4 * particleCount);
    allocatedParticleCount = particleCount;
}
...having initialised allocatedParticleCount to 0 on instantiation of your object.
P.S. Don't forget to free your buffers when your object is destroyed. Consider using an .mm file and C++/Boost's shared_array for both vertices and textures; you would then not require the above free statements either.
In that case, you'd probably want to keep that memory around and just reassign it.

Does initialization of a 2D array in a C program waste too much time?

I am writing a C program which has to use a 2D array to store previously processed data for later use.
The size of this 2D array is 33x33: matrix[33][33].
I define it as a global variable, so it will be initialized only once. Does this definition cost a lot of time when the program is running? I ask because I found my program has become slower than the previous version that did not use this matrix to store data.
Additional:
I initialize this matrix as a global variable like this:
int map[33][33];
In one function, A, I need to store all 33x33 data values into this matrix.
In another function, B, I fetch a 3x3 small matrix from map[33][33] for my next step of processing.
These 2 steps are repeated about 8000 times. So, will this affect the program's running efficiency?
Or, my other guess is that the program has become slower because of a couple of if-else branch statements that were added to the program recently.
How were you doing it before? The only problem I can think of is that extracting a 3x3 sub-matrix from a 33x33 integer matrix is going to cause you caching issues every time you extract the sub-matrix.
On most modern machines the cache line is 64 bytes in size. That's enough for 8 elements of the matrix. So for each extra line of the 3x3 sub-matrix you will be performing a new cache line fetch. If the matrix is hammered very regularly then it will probably sit mostly in the level 2 cache (or maybe even the level 1 if it's big enough), but if you are doing lots of other data calculations in between each sub-matrix fetch then you will incur 3 expensive cache line fetches each time you grab the sub-matrix.
However, even then it's unlikely you'd see a HUGE difference in performance. As stated elsewhere, we need to see before and after code to be able to hazard a guess at why performance has got worse ...
Simplifying slightly, there are three kinds of variables in C: static, automatic, and dynamic.
Static variables exist throughout the lifetime of the program, and include both global variables and local variables declared using static. They are either initialized to zeroes (the default) or explicitly initialized. If they are zeroes, the linker places them in a fresh memory page that the operating system initializes to zeroes (this takes a tiny amount of time). If they are explicitly initialized, the linker puts the data into a memory area in the executable and the operating system loads it from there (this requires reading the data from disk into memory).
Automatic variables are allocated on the stack, and if they are initialized, this happens every time they are allocated. (If not, they have no defined value, perhaps a random one, and so initialization takes no time.)
Dynamic variables are allocated using malloc, and you have to initialize them yourself, and that again takes a bit of time.
It is highly probable that your slowdown is not caused by the initialization. To make sure, you should measure it by profiling your program and seeing where the time is spent. Unfortunately, profiling may be difficult for initialization done by the compiler/linker/operating system, especially for the parts that happen before your program starts executing.
If you want to measure how much time it takes to initialize your array, you could write a dummy program that does nothing but includes the array.
However, since 33*33 is a fairly small number, either your matrix items are very large, your computer is very slow, or your 33 is larger than mine.
No, there is no difference in runtime between initializing an array once (with whatever method) and not initializing it.
If you found a difference between your 2 versions, that must be due to differences in the implementation of the algorithm (or a different algorithm).
Well, I wouldn't expect it to (something like that should take much less than a second), but an easy way to find out would be to simply put a print statement at the start of main().
That way you can see if global, static variable initialization is really causing this. Is there anything else in your program that you've changed lately?
EDIT: One way to get a clearer idea of what's taking so long would be to use a debugger like GDB or a profiler like gprof.
If your program accesses the matrix a lot while running (even if it's not being updated at all), the calculation of an element's address involves a multiply by 33. Doing a lot of this could slow your program down.
How did your previous program version store the data if not in matrix? How were you able to read a sub-matrix if you did not have the big matrix?
Many answers talk about the time spent for initializing. But I don't think that was the question. Anyway, on modern processors, initializing such a small array takes just a few microseconds. And it is only done once, at program start.
If you need to fetch a sub-matrix from any position, there is probably no faster method than using a static 2D array. However, depending on processor architecture, accessing the array could be faster if the array dimensions (or just the last dimension) are power of 2 (e.g. 32, 64 etc.) since this would allow using shift instead of multiply.
If the accessed sub-matrices do not overlap (i.e. you would only access indexes 0, 3, 6, etc.), then using a 3-dimensional or 4-dimensional array could speed up the access:
int map[11][11][3][3];
This makes each sub-matrix a contiguous block of memory, which can be copied with a single block copy command.
Further, it may fit in a single cache line.
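A quick illustration of that layout (the helper name is made up):

#include <string.h>

int map[11][11][3][3];  /* 11x11 tiles, each a contiguous 3x3 block */

/* Copy tile (ti, tj) out in one contiguous memcpy: 9 ints, no strided reads. */
void fetch_block(int ti, int tj, int out[3][3])
{
    memcpy(out, map[ti][tj], sizeof map[ti][tj]);
}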
Theoretically, using an N-dimensional array shouldn't make a performance difference, as all of them resolve into a contiguous memory reservation by the compiler.
int _1D[1089];
int _2D[33][33];
int _3D[3][11][33];
should give similar allocation/deallocation speed.
You need to benchmark your program. If you don't need the initialization, don't make the variable static, or (maybe) allocate it yourself from the heap using malloc():
mystery_type *matrix;
matrix = malloc(33 * 33 * sizeof *matrix);
