Is it possible to pass an array of pointers to a cuda kernel?
i am looking for something like this:
__global__ void Kernel(int **arr)
{
int *temp = arr[blockDim.x];
temp[blockIdx.x] = blockIdx.x;
}
How can i allocate cuda memory for such structure?
Memory allocation for such array is not a problem, you'll do this by cudaMalloc(sizeof(void*)*SIZE). However, writing correct values into it is main problem. Only way to change values in device memory from host function is actually copying information from host memory to device memory (cudaMemcpy() or cudaMemcpyToSymbol()). Thus, to write device pointers into device memory, we must have pointer to device memory in host memory, which I don't think is possible. (pointer which is stored in host variables allocated by cudaMalloc() isn't actual pointer in device memory). So, the only way to write correct values in the array is from kernel, which makes array of pointers unconvenient.
I suggest using indexes instead of pointers, it is much better. Basically if in your array of indexes you have written {4,3,0,1,2} it means that first element points to some array in index 4, second one - to the 3rd element and so on. If you want to point multiple arrays, you should make indexing by some rule, in which you will fill the array of indexes and in which you will access memory from kernel.
I'm doing some image processing work in CUDA currently, and I recommend that you just allocate a linear memory buffer and use an indexing scheme rather than dealing with arrays of pointers. It's way, way simpler in my experience. My 2c.
Related
I'm making an Nondeterministic Finite Automata (NFA). An NFA has a set of states, I need four array with the same size(number of states in the NFA) to record temporary information about states during the simulation of the NFA.
However, different NFAs has different number of states, thus the size of arrays varies with different NFAs.
Using C I have come up with three ways to deal with the memory allocation:
Use a very big fix-sized array.
malloc memroy dynamically every time the function is called, and before the function finishes, free the allocated memory
Use malloc but not allocate memory on every function call, use four static pointer variables, and one static int arraySize to record allocated array size, for the first time when the function is called, allocate arrays of size NFA->numStates and assign to each of the static pointers, assign NFA->numStates to static int arraySize; for the second time, if NFA->numStates is less than or equal to arraySize, no memory allocation; if NFA->numStates is greater than arraySize, free previous memory and realloc arrays of size NFA->numStates.
Method 1 uses fix-sized array, which means when the input NFA->numStates is greater than the hard-coded array size, the function will not work.
Method 2 is adaptable, but need to allocate memory every time the function is called, not so efficient ?
Method 3 is adaptable too, but is complicated, and is not thread safe ?
Suggestions or other alternatives ?
How about option 2 but with alloca(), i.e. allocate on the stack? It's much (much) faster than malloc(), and automatically de-allocates when you leave the scope which sounds like what you want.
Of course, it's not standard or portable, so it might not be applicable for you.
Failing that, a big fixed-size array seems easy enough and doesn't use more memory.
Variable-Length Arrays (VLAs) have been available since C99. I'd be impressed if you are still working with an implementation that does not support such a thing. VLAs work like regular arrays, except that the size you use to declare them is determined at runtime. That seems to be what you're looking for.
As you are dealing with arrays of arbitrary size I suggest you to implement simple linked list structure that at initialization will hold an array of predefined size and can be extended to hold additional chunks of memory. This will be an abstraction over contiguous memory space. You just need to store the current size of this structure in parent container.
#include <stddef.h>
struct contmem
{
size_t size;
struct memchunk
{
void *data;
struct memchunk *next, *prev;
} head;
}
So, you can extend this structure when you have to and store information by iterating over elements of linked list. This is similar to what happens inside lists from standard C++ library and differs from straight memory reallocation.
When I write linear algebra programs in C++, I use the Armadillo library. It is based on templates, which provide me a way to define vectors of any length that don't necessarily require additional memory allocation, since they are statically assigned an appropriate memory buffer at compile-time. When I use arma::Col<double>::fixed<3> the compiler creates a "new type" on the fly so that the vector contains a buffer of exactly 3 doubles.
Now I'm working on a linear algebra program in C, and I'm using the GNU Scientific Library (GSL). In order to instantiate a 3D vector, I do: gsl_vector_alloc(3) that returns a gsl_vector*. The problem is that this operation causes the dynamic allocation of a small portions of memory, and this happens millions of times during program runtime. My program is wasting a lot of resources to perform tens of millions of malloc/free operations.
I've inspected the internal structure of gsl_vector:
typedef struct
{
size_t size;
size_t stride;
double * data;
gsl_block * block;
int owner;
} gsl_vector;
For the library to work correctly, data should point to the first element of the vector, usually inside a gsl_block structure like this:
typedef struct
{
size_t size;
double * data;
} gsl_block;
which contains another data pointer. So, for instantiating a simple 3D vector, this sequence of mallocs happen:
A gsl_vector structure is malloc'd (something around 40 bytes on x86_64).
A gsl_block structure is malloc'd (16 bytes) and the block pointer of the gsl_vector is set to the memory address of the gsl_block just allocated
An array of 3 doubles is malloc'd and its memory address is assigned to both data pointers (the one in gsl_block and the one in gsl_vector).
I obtained a 40% performance gain by removing two mallocs. I created my custom gsl_vector creation routine, that allocates an array of 3 doubles and sets the data pointer of the gsl_vector to the address of this array. Then I return a gsl_vector (not a pointer).
But doing so, I still get millions of malloc(3 * sizeof(double)) operations.
I didn't manage to "embed" the array of 3 doubles inside the gsl_vector struct, since if the data pointer points to something which is inside the struct itself (hacky!), then the pointer is no longer valid when the vector is copied elsewhere!
Do you have any ideas (apart switching to C++ or rolling my own linear algebra library)? I am open to any suggestion.
It looks to me as if you are misunderstanding the purpose of the gls_block data structure. It seems to me that you should just use it to allocate a large chunk of data in an gsl_block data structure and then split that chunk up for the use in multiple gsl_vector. If you'd allocate all your gsl_vector in one go by allocating an array of them, you are almost there. You just have two calls to malloc and some bookkeeping during initialization.
What this would impose on you, is that you'd have to think precisely in advance which gsl_vector you need. But that is the "price" to pay when you use a language that has no builtin garbage collection. If you invest in that, most of the times it has the advantage of structuring your code, you'll probably learn a lot on how you could organize your computation.
C is a bit too primitive to do this properly.
If you need to use a function from GSL, you might be able to still use C++ Armadillo vectors and matrices with it.
For example, you can obtain a pointer to the memory used by a vector or matrix via the .memptr() member function. This also works for fixed size matrices/vectors.
Alternatively, you can tell Armadillo to use an already allocated block of memory, by giving it a pointer during vector or matrix construction.
From what I understands, when I create a static array, say, int[] array = new int[N];, the run time actually looks for N*4 bytes of memory whose addresses are also continuous. right?
So what if the run time can't find continuous memory addresses?
for example, if my memory is 128MB, and in my application N = 25M, which means I need 100MB memory for my array. Is it possible for this creation of array to fail? Is it possible, the 100MB of memory in need can't be located because there are too many memory fragments?
thanks
Yes it can fail. In that case an OutOfMemoryException will be thrown. An easy way to test this is the following:
int[] array = new int[int.MaxValue];
(This assumes C#, behavior in Java will be similar)
If we are talking abut C++ (but is the same in general) arrays are contiguous, meaning that the memory has consecutive addresses, i.e. it's contiguous in virtual address space. It need not be contiguous in physical address space(programmers never sees an actual address of an array element, just a reference to the array and the means to index it).
Anyway if there is not memory available you will get an exception(it is not a contiguous matter)
The following question is in regards to C programming. I am using Microchip C30 compiler (because I know someone will ask)
What is the difference between having a structure which contains several other structures vs a structure which contains several pointers to other structures? Does one make for faster code execution than the other? Does one technique use more or less memory? Does the memory get allocated at the same time in both cases?
If I use the following code does memory automatically get allocated for the subStruct?
// Header file...
typedef struct{
int a;
subStruct * b;
} mainStruct;
typedef struct{
int c;
int d;
}subStruct;
extern mainStruct myMainStruct;
// source file...
mainStruct myMainStruct;
int main(void)
{
//...
{
If you use a pointer, you have to allocate the memory yourself. If you use a substructure, you can allocate the entire thing in one go, either using malloc or on the stack.
What you need depends on your use case:
Pointers will give you smaller struct's
Substructures provide better locality of reference
A pointer may point to either a single struct or the first member in an array of them, while substructures are self-documenting: there's always one of them unless you use an array explicitly
Pointers take up some space, for the pointer itself + overhead from extra memory allocations
And no, it doesn't matter which compiler you use :)
Memory for pointers doesn't get automatically allocated, but when you contain whole structure in your struct, it does.
Also - with pointers you are likely to have fragmented memory - each pointed part of tructure could be in other part of memory.
But with poniters you can share the same substructures across many structs (but this makes changing and deleting them later harder).
Memory for a pointer wouldn't be automatically allocated. You would need to run:
myMainStruct.b=malloc(sizeof(*myMainStruct.b));
In terms of performance, there is likely a small hit to going from one structure to another via the pointer.
As far as speed goes, it varies. Generally, including structs, rather than pointers, will be faster, because the CPU doesn't have to dereference the pointer for every member access. However, if some of the members aren't used very often, and the sub-struct's size is massive, the structure might not fit in the cache and this can slow down your code quite a bit.
Using pointers will use /slightly/ more memory (but only the size of the pointers themselves) than the direct approach.
Usually with pointers to sub-structs you'll allocate the sub-structs separately, but you can write some kind of initialization function which abstracts all the allocation out to "the same time." In your code, memory is allocated for myMainStruct on the stack, but the b member will be garbage. You need to call malloc to allocate heap memory for b, or create a subStruct object on the stack and point myMainStruct.b to it.
What is the difference between having a structure which contains several other structures vs a structure which contains several pointers to other structures?
In the first case what you have is essentially one big structure in contiguous memory. In the "pointers to structures" case your master structure just contains the addresses to the sub-structures which are allocated separately.
Does one make for faster code execution than the other?
The difference should be negligible, but pointers method is will be slightly slower. This is because you must dereference the pointer with each access to the substructure.
Does one technique use more or less memory?
The pointer method uses number_of_pointers * sizeof(void*) more memory. sizeof(void*) will be 4 for 32-bit and 8 for 64-bit.
Does the memory get allocated at the same time in both cases?
No, you need to go through each pointer in your master struct and allocate memory for the sub-structs via malloc().
Conclusion
The pointers add a layer of indirection to the code, which is useful for switching out the sub-structs or having more than one pointer point to the same sub-struct. Having different master-structs pointing to common sub-structs in particular could save quite a bit of memory and allocation time.
I am developing a program using cuda sdk and 9600 1 GB NVidia Card . In
this program
0)A kernel passes a pointer of 2D int array of size 3000x6 in its input arguments.
1)The kenel has to sort it upto 3 levels (1st, 2nd & 3rd Column).
2)For this purpose, the kernel declares an array of int pointers of size 3000.
3)The kernel then populates the pointer array with the pointers pointing to the locations of input array in sorted order.
4)Finally the kernel copies the input array in an output array by dereferencing the pointers array.
This last step Fails an it halts the PC.
Q1)What are the guidelines of pointer de-referncing in cuda to fetch the contents of memory ?
, even a smallest array of 20x2 is not working correctly . the same code works outside cuda device memory ( ie, on standard C program )
Q2)Isn't it supposed to work the same as we do in standard C using '*' operator or there is some cudaapi to be used for it.?
I just started looking into cuda, but I literally just read this out of a book. It sounds like it directly applies to you.
"You can pass pointers allocated with cudaMalloc() to functions that execute on the device.(kernals, right?)
You can use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device .(kernals again)
You can pass pointers allocated with cudaMalloc to functions that execute on the host. (regular C code)
You CANNOT use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host."
^^ from "Cuda by Example" by Jason Sanders and Edward Kandrot published by Addison-Wesley yadda yadda no plagiarism here.
Since you are dereferencing inside the kernal, maybe the opposite of the last rule is also true. i.e. you cannot use pointers allocated by the host to read or write memory from code that executes on the device.
Edit: I also just noticed a function called cudaMemcpy
Looks like you would need to declare the 3000 int array twice in host code. One by calling malloc, the other by calling cudaMalloc. Pass the cuda one to the kernal as well as the input array to be sorted. Then after calling the kernal function:
cudaMemcpy(malloced_array, cudaMallocedArray, 3000*sizeof(int), cudaMemcpyDeviceToHost)
I literally just started looking into this like I said though so maybe theres a better solution.
CUDA code can use pointers in exactly the same manner as host code (e.g. dereference with * or [], normal pointer arithmetic and so on). However it is important to consider the location being accessed (i.e. the location to which the pointer points) must be visible to the GPU.
If you allocate host memory, using malloc() or std::vector for example, then that memory will not be visible to the GPU, it is host memory not device memory. To allocate device memory you should use cudaMalloc() - pointers to memory allocated using cudaMalloc() can be freely accessed from the device but not from the host.
To copy data between the two, use cudaMemcpy().
When you get more advanced the lines can be blurred a little, using "mapped memory" it is possible to allow the GPU to access parts of host memory but this must be handled in a particular way, see the CUDA Programming Guide for more information.
I'd strongly suggest you look at the CUDA SDK samples to see how all this works. Start with the vectorAdd sample perhaps, and any that are specific to your domain of expertise. Matrix multiplication and transpose are probably easy to digest too.
All the documentation, the toolkit and the code samples (SDK) are available on the CUDA developer web site.