Combine several OpenCL buffers into a single large buffer - c

I have a 2D array that I have split up into several 1D arrays, each of which I made into an OpenCL buffer. Sometimes I need a kernel to take the entire 2D array, but since its size is determined at runtime I cannot simply declare one kernel argument per 1D array (and there can be over 1000 1D arrays). I am hoping there is some way I can take the 1D arrays backing the OpenCL buffers and combine them into one large buffer that holds the entire data set, and pass that to my kernel. Right now the only way I can see of doing this is to read the data from the 1D buffers back into my program, arrange it into one giant 1D array, and write the new buffer back to my compute device. This seems like it will be extremely slow; is there any other way?

Here are a couple of ideas (though I admit they are not ideal).
Instead of copying the buffers back to your program and building new buffers from them, you can use clEnqueueCopyBuffer() (or clEnqueueCopyBufferRect(), depending on your situation) to copy data from one buffer to another. I believe (but I wouldn't swear to it) that how this copy is performed is implementation dependent, but it seems that a buffer residing in device memory could be copied to another buffer in device memory without crossing the bus back to host memory.
Of course (if I understand correctly), copying is not really what you wanted anyway. How about clCreateSubBuffer()? This function creates a new buffer that simply points to a sub-section of an existing buffer (without making a copy of its own). To do this, from my understanding of what you've described, you would invert your layout: make one large buffer for the whole 2D array, then create a number of lightweight 1D sub-buffers that point to regions of that memory.
In this way, you can pass the buffer that represents the whole 2D array when necessary, but just pass one or more 1D sub-buffers when that is all that is required.

I tested clCreateSubBuffer (with its matching release) and saw that it was slower than a copy, though faster than create/release, but ...
:(
System:
OpenCL 1.1 AMD-APP-SDK-v2.5 (684.212) FULL_PROFILE
Radeon 5870

Related

Shrink memory of an array of pointers, possible?

I am having difficulty finding a possible solution, so I decided to post my question. I am writing a program in C, and:
I am generating a huge array containing a lot of pointers to ints; it is allocated dynamically and filled during runtime, so beforehand I don't know which pointers will be added or how many. The problem is that there are just too many of them, so I need to shrink the space somehow.
Is there any package or tool available which could encode my entries or change their representation so that I save space?
Another question: I also thought about writing my information to a file. Is the file then kept in memory the whole time, or only when I open it again?
It seems like you are looking for a simple dynamic array (the abstract data type "dynamic array", that is). There are many implementations of this out there; one is GLib's GArray. You can simply start with a small dynamic array and push new items onto the back, just as you would with a vector in C++ or Java. You will only allocate the memory you need.
If you have to (or want to) do it manually, the usual method is to store the capacity and the size of the allocated array along with the pointer in a struct, and call realloc() from within push_back() whenever you need more space. Usually you should grow the array by a factor of 1.3 to 1.4, but a factor of 2 will do if you're not expecting a HUGE array. If you remove elements and the size drops below a certain threshold (e.g. capacity/2), you shrink the array again with realloc().

Fortran memory management and subroutines/functions

At the moment I am working on code for numerical simulations in Fortran 95. My platform is Windows, and I use the MSVC environment with the Intel Fortran compiler.
This code, like many in this field, builds a system of equations to be solved. Numerically, this happens by storing a square matrix and a vector of known values. To optimize memory, the matrices are stored in a convenient form, like the compressed sparse row (CSR) format or similar, so that zero values are not stored.
Given this brief introduction, here are my doubts.
Since at compiling time I do not know the dimension of my arrays, I just declare them as:
REAL, DIMENSION(:), ALLOCATABLE :: myArray
and once I retrieve the dimension of such a vector, I call
ALLOCATE(myArray(N)) where N is the number of elements that I want to allocate
Still, the memory holds no values at this point, but a check is done so the allocation does not overflow memory. Is that right?
Now, as it is filled with values, the occupied space ramps up. The structure of a Fortran array, for both a 1D vector and a multi-dimensional array, is to fill, in column order, a space equal to the number of values. That is, if we have a 2D array of dimension 1000x1000, it will be stored in 1M contiguous cells ordered by column (first the first column is stored, then the second, and so on).
If this is true, and the data layout is therefore the same, is the access time to a particular value the only difference between a multi-dimensional array and a 1D vector?
Does the RESHAPE command then change only the way the program "sees" the array?
The array I need for my purposes is defined in a module shared by every subroutine/function. In particular, a subroutine allocates and fills it. Back in the main program there is no problem with it, since I display some statistics about it to the user. Say we allocated 400M REAL*4 values, about 1.5GB of memory used.
However, once I get into another subroutine, the program stops with forrtl: severe (170): Program Exception - Stack Overflow. I ran out of memory. But how can that be, if the matrix is already allocated and I did not allocate anything more? Note that the subroutine uses the same module, so the variables are already declared; my RAM still has about 1.3GB free; and the stop occurs at the first line of the subroutine.
Do subroutines (and functions) duplicate the data? I thought Fortran passed the address of my variables in that case, avoiding copies and working directly on the values.
Finally, like many of you, I enjoyed the C++ standard library functions such as vector::push_back. Fortran has no such routines, but some very useful intrinsics are still there: masking an array using WHERE, COUNT or MERGE can help you handle some operations effectively.
However, they are very slow when my matrix has more than 1M entries. In that case even a sequential search-and-substitute is faster than creating a mask or using WHERE. How can that be? Aren't they multithreaded?
Thank you in advance for your patience!! All suggestions are very welcome!!
Comment space is limited, so I am posting this as an answer. Obviously you are running out of stack space, not out of memory in general. The stack size of the main thread on Windows is fixed at link time (the default is 1 MiB), and any larger stack allocation will result in a stack overflow. This can happen for many reasons, but mainly:
the subroutine that you call uses big stack arrays (e.g. non-ALLOCATABLE local arrays);
you pass a non-contiguous array subsection to the subroutine, e.g. myArray(1:10:2), and you don't have an explicit interface for that subroutine. In this case the compiler makes a temporary, most likely stack-based, copy of the data being passed, which can exhaust the stack space and trigger the exception.
I would guess the first point is the one relevant to your case, since the exception occurs when you enter the subroutine (probably in the prologue, where stack space for all local variables is reserved). You could instruct Intel Fortran to enable heap arrays in the project settings and see if that helps (I am not sure whether the Windows version enables heap arrays by default).
Without even a single line of your code shown, it is quite hard to guess the source of the problem and solve it.

Combining two buffers into one

I need to have two buffers (A and B), and when either buffer is full it needs to write its contents to the "merged" buffer, C. Using memcpy seems to be too slow for this operation, as noted below in my question. Any insight?
I haven't tried it, but I've been told that memcpy will not work. This is an embedded system: two buffers of different sizes that, when full, dump to a common buffer C, which is bigger than the other two. Not sure why I got downvoted.
Edit: buffers A and B will be written to again before C has been completely emptied.
The memcpy is taking too long and the common buffer C is getting overrun.
memcpy is pretty much the fastest way to copy memory. It's frequently a compiler intrinsic and is highly optimized. If it's too slow, you're probably going to have to find another way to speed your program up.
I'd expect that copying memory faster is not the lowest-hanging fruit in a program.
Some other opportunities could be to copy less memory or copy less often. See if you can profile your program to analyze its performance and find where the biggest opportunities are.
Edit: With your edit it sounds like the problem is that there's not enough time to deal with the data between the moment you notice it needs handling and the moment more data comes in. A solution in this case could be, as one of the commenters noted, to have additional buffers that you can flip between, so that you have time to handle the data in one while another fills up.
The only way to merge two buffers without memcpy is to link them, like a linked list of buffer fragments (or an array of fragments).
Consider that a buffer does not always have to be contiguous. I've done a lot of work with 600 dpi images, which means very large buffers. If you can break them up into a sequence of smaller fragments, that helps reduce fragmentation as well as unnecessary copying due to buffer growth.
In some cases buffers must be contiguous because your API or microcontroller mandates it; for example, the Windows bitmap functions require contiguous memory. You could try the C realloc function, but internally it may amount to malloc+memcpy+free. Either way, as others have said, memcpy is supposed to be the fastest possible way of copying contiguous buffers.
If the buffer must be contiguous, you can reserve a large address space and commit it on demand. The implementation depends on the platform; on Win32, for example, VirtualAlloc can do this. It gives you a very large contiguous address range of which only a portion is actually allocated (committed), and you commit further pages as the buffer grows. This trick requires virtual memory, which may not be available on a microcontroller.

Use an array of pointers to structs, or just an array of structs?

I'm working on an FFT algorithm in C for a microcontroller, and am having trouble deciding whether to store the real and imaginary parts of the input data in a plain array of structs, or to use an array of pointers to structs. I face conflicting requirements: the code has to run in a tiny amount of memory, and yet be as fast as possible. I believe the array of pointers to structs has somewhat larger memory overhead, but there's a line in my code basically like the following:
for (uint8_t i = 0; i < RECORD_SIZE; i++)
{
    uint8_t decimateValue = fft_decimate(i);
    fftData[i]->realPart = fftTempData[decimateValue]->realPart;
    fftData[i]->imPart = fftTempData[decimateValue]->imPart;
}
I'm thinking that if I use an array of pointers to structs as above, the compiled code will be faster, since it just reshuffles the pointers rather than actually copying all the data between the two data structures, as an array-of-structs implementation would. I'm willing to sacrifice some extra memory if this section of code runs as fast as possible. Thanks for any advice.
Every time you access data through an array of pointers, you have two memory accesses. This often comes with a pipeline stall, even on microcontrollers (unless it's a really small microcontroller with no pipeline).
Then you have to consider the size of the data. How big is a pointer? 2 bytes? 4 bytes? How big are the structs? 4 bytes? 8 bytes?
If the struct is twice as big as a pointer, shuffling the data will be half as expensive with pointers. However, reading or modifying the data in any other way will be more expensive. So it depends on what your program does. If you spend a lot of time reading the data and only a little time shuffling it, optimize for reading the data. Other people have it right -- profile. Make sure to profile on your microcontroller, not on your workstation.
If your structs are very small, it will actually be faster to have an array of structs and shuffle them around. If your structs are large, this specific action will be faster if you are only shuffling around pointers.
Wait a minute... on second glance, it appears that your code is not shuffling pointers: you are assigning to fields of the structs those pointers reference, so in effect you are still moving the structs themselves, not the pointers. This is slower than moving pointers, and also slower than just moving structs, since it has to dereference the pointers and then still move the struct data anyway.
You're right: the array of pointers will be faster, but there will be an overhead in memory usage. If you have the memory to spare for the pointers, use them.
First: It depends. Profile.
Cache locality is going to reign here. I expect the structs to be very small (representing complex numbers?). In FFT I'd expect a lot more gain from storing the real and imaginary parts in separate arrays.
You could then split the load between CPU cores.
If it is about larger chunks (say 1024 sample blocks), I strongly suspect that shuffling pointers is way more efficient. It will also allow you to - much more easily - work on the same (readonly) data from several threads. Moving memory around is a certain way to invalidate a lot of iterators, and usually you want tasks (i.e. threads) to work on a subrange of your data, i.e.: all they have is an iterator subrange.

Does initialization of a 2D array in a C program waste too much time?

I am writing a C program which has to use a 2D array to store previously processed data for later use.
The size of this 2D array is 33x33: matrix[33][33].
I define it as a global variable, so it is initialized only once. Does this definition cost a lot of time when the program is running? I ask because my program has become slower than the previous version, which did not use this matrix to store data.
Additional:
I initialize this matrix as a global variable like this:
int map[33][33];
In one function, A, I need to store all 33x33 values into this matrix.
In another function, B, I fetch a 3x3 sub-matrix from map[33][33] for my next processing step.
The above 2 steps are repeated about 8000 times. So, will this affect the program's running efficiency?
Or is my other guess right: that the program became slower because of a couple of if-else branch statements that were recently added?
How were you doing it before? The only problem I can think of is that extracting a 3x3 sub-matrix from a 33x33 integer matrix is going to cause caching issues every time you extract it.
On most modern machines the cache line is 64 bytes in size. That's enough for 16 4-byte int elements of the matrix, so each row of the 3x3 sub-matrix can trigger a separate cache-line fetch. If the matrix is hammered very regularly it will probably sit mostly in the level-2 cache (or maybe even level 1, if that's big enough), but if you do lots of other data calculations between sub-matrix fetches, you will pay up to three expensive cache-line fetches each time you grab the sub-matrix.
However, even then it's unlikely you'd see a HUGE difference in performance. As stated elsewhere, we need to see before-and-after code to hazard a guess at why performance has got worse ...
Simplifying slightly, there are three kinds of variables in C: static, automatic, and dynamic.
Static variables exist throughout the lifetime of the program, and include both global variables and local variables declared with static. They are either initialized to zeroes (the default) or have explicit initializers. Zero-initialized data is placed in pages that the operating system hands over freshly zeroed (this takes a tiny amount of time); explicitly initialized data is stored by the linker in the executable and loaded from there by the operating system (this requires reading the data from disk into memory).
Automatic variables are allocated on the stack, and if they have initializers, initialization happens every time they are created. (If not, they have an indeterminate value, and initialization takes no time.)
Dynamic variables are allocated using malloc, and you have to initialize them yourself, which again takes a bit of time.
It is highly probable that your slowdown is not caused by the initialization. To make sure, measure it by profiling your program and seeing where the time is spent. Unfortunately, profiling may be difficult for initialization done by the compiler/linker/operating system, especially the parts that happen before your program starts executing.
If you want to measure how much time it takes to initialize your array, you could write a dummy program that does nothing but includes the array.
However, since 33*33 is a fairly small number, either your matrix items are very large, your computer is very slow, or your 33 is larger than mine.
No, there is no difference in runtime between initializing an array once (with whatever method) and not initializing it.
If you found a difference between your 2 versions, that must be due to differences in the implementation of the algorithm (or a different algorithm).
Well, I wouldn't expect it to (something like that should take much less than a second), but an easy way to find out would be to simply put a print statement at the start of main().
That way you can see if global, static variable initialization is really causing this. Is there anything else in your program that you've changed lately?
EDIT: One way to get a clearer idea of what's taking so long would be to use a debugger like GDB or a profiler like gprof.
If your program accesses the matrix a lot while running (even if it is not updated at all), the address calculation for an element involves a multiply by 33. Doing a lot of this could slow your program down.
How did your previous program version store the data if not in matrix? How were you able to read a sub-matrix if you did not have the big matrix?
Many answers talk about the time spent on initialization, but I don't think that was the question. Anyway, on modern processors initializing such a small array takes just a few microseconds, and it is only done once, at program start.
If you need to fetch a sub-matrix from an arbitrary position, there is probably no faster method than a static 2D array. However, depending on the processor architecture, access could be faster if the array dimensions (or just the last dimension) were a power of 2 (e.g. 32, 64, etc.), since that would allow a shift instead of a multiply.
If the accessed sub-matrices do not overlap (i.e. you only access indexes 0, 3, 6, etc.), then using a 3- or 4-dimensional array could speed up the access:
int map[11][11][3][3];
This makes each sub-matrix a contiguous block of memory, which can be copied with a single block copy command.
Further, it may fit in single cache line.
Theoretically, using an N-dimensional array should make no performance difference, since all of them resolve to a contiguous memory reservation by the compiler:
int _1D[1089];
int _2D[33][33];
int _3D[3][11][33];
should give similar allocation/deallocation speed.
You need to benchmark your program. If you don't need the initialization, don't make the variable static, or (maybe) allocate it yourself from the heap using malloc():
mystery_type *matrix;
matrix = malloc(33 * 33 * sizeof *matrix);
