CUDA: How to copy a 3D array from host to device?

I want to learn how to copy a 3-dimensional array from host memory to device memory.
Let's say I have a 3D array which contains data, for example
int host_data[256][256][256];
I want to copy that data to dev_data (a device array) in such a way that
host_data[x][y][z] == dev_data[x][y][z];
How can I do it? And how am I supposed to access the dev_data array on the device?
A simple example would be very helpful.

The common way is to flatten the array (make it one-dimensional). Then you'll have to do some calculations to map an (x, y, z) triple to a single number - its position in the flattened one-dimensional array.
Example 2D:
int data[256][256];
int *flattened = (int *)data;
data[x][y] == flattened[x * 256 + y];
Example 3D:
int data[256][256][256];
int *flattened = (int *)data;
data[x][y][z] == flattened[x * 256 * 256 + y * 256 + z];
or use a wrapper:
__host__ __device__ inline int index(const int x, const int y, const int z) {
    return x * 256 * 256 + y * 256 + z;
}
Knowing that, you can allocate a linear array with cudaMalloc, as usual, then use an index function to access the corresponding element in device code.
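A minimal sketch of that approach (not from the original answer; the kernel name touch and helper name idx3 are made up for illustration, and error checking is omitted):
#include <cuda_runtime.h>

#define DIM 256

/* same (x, y, z) -> linear offset mapping as the index() wrapper above */
__host__ __device__ inline int idx3(int x, int y, int z) {
    return x * DIM * DIM + y * DIM + z;
}

__global__ void touch(int *dev_data) {
    /* device code accesses the flattened array through the same mapping */
    dev_data[idx3(0, 1, 2)] = 42;
}

int main(void) {
    static int host_data[DIM][DIM][DIM]; /* static: 64 MB would overflow the stack */
    int *dev_data = NULL;
    cudaMalloc(&dev_data, sizeof(host_data)); /* one linear device allocation */
    cudaMemcpy(dev_data, host_data, sizeof(host_data), cudaMemcpyHostToDevice);
    touch<<<1, 1>>>(dev_data);
    cudaMemcpy(host_data, dev_data, sizeof(host_data), cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
    return 0;
}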
Update:
The author of this question claims to have found a better solution (at least for 2D); you might want to have a look.

For fixed dimensions (e.g. [256][256][256]), let the compiler do the work for you and follow this example. This is attractive because we need only do a single cudaMalloc/cudaMemcpy to transfer the data, using a single pointer. If you must have variable dimensions, it's better to think about alternate ways to handle this due to the complexity, but you may wish to look at this example (referring to the second example code that I posted). Please be advised that this method is considerably more complicated and harder to follow. I recommend not using it if you can avoid it.
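As a hedged sketch of the fixed-dimension idea (the linked examples are not reproduced here, and the kernel below is made up for illustration): declaring the device pointer as a pointer to a 2D array type lets the compiler generate the three-subscript indexing arithmetic for you.
#include <cuda_runtime.h>

#define DIM 256

__global__ void kernel(int (*dev_data)[DIM][DIM]) {
    /* plain three-subscript access; the compiler computes the offsets */
    dev_data[1][2][3] = 7;
}

int main(void) {
    static int host_data[DIM][DIM][DIM];
    int (*dev_data)[DIM][DIM]; /* pointer to DIM x DIM slices */
    cudaMalloc(&dev_data, sizeof(host_data)); /* single allocation, single pointer */
    cudaMemcpy(dev_data, host_data, sizeof(host_data), cudaMemcpyHostToDevice);
    kernel<<<1, 1>>>(dev_data);
    cudaMemcpy(host_data, dev_data, sizeof(host_data), cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
    return 0;
}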
Edit: If you're willing to flatten your array, the answer provided by @Ixanezis is recommended, and is commonly used. My answer is based on the assumption that you really want to access the array using 3 subscripts, both on the host and the device. As pointed out in the other answer, however, you can simulate 3-subscript access using a macro or function that calculates offsets into a 1-D array.

Related

Seg faulting with 4D arrays & initializing dynamic arrays

I ran into a bit of a problem with a tetris program I'm currently writing in C.
I am trying to use a 4D multi-dimensional array, e.g.
uint8_t shape[7][4][4][4]
but I keep getting seg faults when I try that. I've read around and it seems that I'm using up all the stack memory with this kind of array (all I'm doing is filling the array with 0s and 1s to depict a shape, so I'm not inputting a ridiculously high number or anything).
Here is a version of it (on pastebin, because as you can imagine it's very ugly and long).
If I make the array smaller it seems to work, but I'd rather not shrink it, as theoretically each "shape" represents a rotation as well.
https://pastebin.com/57JVMN20
I've read that you should use dynamic arrays so they end up on the heap, but then I run into the issue of how to initialize a dynamic array in the way linked above. It seems like it would be a headache, as I would have to go through loops and specifically handle each shape.
I would also be grateful if anybody would let me pick their brain about dynamic arrays: how best to go about them, and whether it's even worth using normal arrays at all.
I have not understood why you use 4D arrays to store shapes for a tetris game, and I agree with bolov's comment that such an array should not overflow the stack (7 * 4 * 4 * 4 * 1 = 448 bytes), so you should probably check the other code you wrote.
Now, to your question on how to manage 4D (N-dimensional) dynamically sized arrays. You can do this in two ways:
The first way consists of creating an array of (N-1)-dimensional arrays. If N = 2 (a table), you end up with a "linearized" version of the table (a normal array) whose size is equal to R * C, where R is the number of rows and C the number of columns. Inductively speaking, you can do the very same thing for N-dimensional arrays without too much effort. This method has some drawbacks, though:
You need to know beforehand all the dimensions except one (the "last" one), and all the dimensions are fixed. Back to the N = 2 example: if you use this method on a table of C columns and R rows, you can change the number of rows by allocating C * sizeof(<your_array_type>) more bytes at the end of the preallocated space, but not the number of columns (not without rebuilding the entire linearized array). Moreover, different rows must have the same number of columns C (you cannot have a 2D array that looks like a triangle when drawn on paper, just to make things clear).
You need to carefully manage the indices: you cannot simply write my_array[row][column]; instead, you must access the array with my_array[row * C + column]. If N is not 2, then this formula gets... interesting
You can use N-1 levels of pointer arrays. That's my favourite solution because it does not have any of the drawbacks of the previous one, although you need to manage pointers to pointers to pointers to ... to pointers to a type (but that's what you do when you access my_array[7][4][4][4]).
Solution 1
Let's say you want to build an N-Dimensional array in C using the first solution.
You know the length of each dimension of the array up to the (N-1)-th (let's call them d_1, d_2, ..., d_(N-1)). We can build this inductively:
We know how to build a dynamic 1-dimensional array
Supposing we know how to build an (N-1)-dimensional array, we show that we can build an N-dimensional array by putting each (N-1)-dimensional array we have available into a 1-dimensional array, thus increasing the available dimensions by 1.
Let's also assume that the data type that the arrays must hold is called T.
Let's suppose we want to create an array with R (N-1)-dimensional arrays inside it. For that we need to know the size of each (N-1)-dimensional array, so we need to calculate it.
For N = 1 the size is just sizeof(T)
For N = 2 the size is d_1 * sizeof(T)
For N = 3 the size is d_2 * d_1 * sizeof(T)
You can easily prove by induction that the number of bytes required to store R (N-1)-dimensional arrays is R * (d_1 * d_2 * ... * d_(N-1) * sizeof(T)). And that's done.
Now, we need to access a random element inside this massive N-dimensional array. Let's say we want to access the item with indices (i_1, i_2, ..., i_N). For this we are going to repeat the inductive reasoning:
For N = 1, the index of the i_1 element is just my_array[i_1]
For N = 2, the index of the (i_1, i_2) element can be calculated by noting that every d_1 elements a new row begins, so the element is my_array[i_1 * d_1 + i_2].
For N = 3, we can repeat the same process and end up with the element my_array[d_1 * ((i_1 * d_2) + i_2) + i_3]
And so on.
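As a concrete sketch of Solution 1 for N = 3 (the helper name offset3 is made up; d1 is the innermost length, d2 the middle one, R the outermost count, and error handling is minimal):
#include <stdint.h>
#include <stdlib.h>

/* (i1, i2, i3) -> position in the linear block, matching the N = 3 formula above */
static size_t offset3(size_t d1, size_t d2, size_t i1, size_t i2, size_t i3) {
    return d1 * (i1 * d2 + i2) + i3;
}

int main(void) {
    size_t R = 7, d2 = 4, d1 = 4;
    uint8_t *shape = malloc(R * d2 * d1 * sizeof *shape); /* one heap allocation */
    if (shape == NULL)
        return 1;
    shape[offset3(d1, d2, 6, 3, 2)] = 1; /* plays the role of shape[6][3][2] = 1 */
    free(shape);
    return 0;
}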
Solution 2
The second solution wastes a bit more memory, but it's more straightforward, both to understand and to implement.
Let's just stick to the N = 2 case so that we can reason more easily. Imagine having a table: split it row by row and place each row in its own memory slot. Now, a row is a 1-dimensional array, and to make a 2-dimensional array we only need an ordered array of references to each row, as the following drawing shows (the last row is the R-th row):
+------+
| R1 -------> [1,2,3,4]
|------|
| R2 -------> [2,4,6,8]
|------|
| R3 -------> [3,6,9,12]
|------|
| .... |
|------|
| RR -------> [R, 2*R, 3*R, 4*R]
+------+
In order to do that, you need to first allocate the references array (R elements long) and then iterate through it, assigning to each entry a pointer to a newly allocated memory area of size d_1.
We can easily extend this to N dimensions. Simply build the array of R pointers and, for each entry in this array, allocate a new 1-dimensional array of size d_(N-1); do the same for each newly created array until you get down to arrays of size d_1.
Notice how you can easily access each element by simply using the expression my_array[i_1][i_2][i_3]...[i_N].
For example, let's suppose N = 3 and T is uint8_t, and that d_1, d_2 and d_3 are known (and initialized) in the following code:
#include <stdint.h>
#include <stdlib.h>

size_t d1 = 5, d2 = 7, d3 = 3;
uint8_t ***my_array;
my_array = malloc(d1 * sizeof(uint8_t **));
for (size_t x = 0; x < d1; x++) {
    my_array[x] = malloc(d2 * sizeof(uint8_t *));
    for (size_t y = 0; y < d2; y++) {
        my_array[x][y] = malloc(d3 * sizeof(uint8_t));
    }
}
// Accessing a random element
size_t x1 = 2, y1 = 6, z1 = 1;
my_array[x1][y1][z1] = 32;
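A matching cleanup sketch, continuing the snippet above (assuming every malloc succeeded; per-allocation error handling is omitted):
/* free in reverse order of allocation */
for (size_t x = 0; x < d1; x++) {
    for (size_t y = 0; y < d2; y++)
        free(my_array[x][y]);
    free(my_array[x]);
}
free(my_array);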
I hope this helps. Please feel free to comment if you have questions.

Optimising C for performance vs memory optimisation using multidimensional arrays

I am struggling to decide between two optimisations for building a numerical solver for the Poisson equation.
Essentially, I have a two-dimensional array of which I require n doubles in the first row, n/2 in the second, n/4 in the third, and so on...
Now my difficulty is deciding whether or not to use a contiguous 2D array grid[m][n], which for a large n would have many unused zeroes but would probably reduce the chance of a cache miss. The other, more memory-efficient method would be to dynamically allocate an array of pointers to arrays of decreasing size. This is considerably more efficient in terms of memory storage, but would it potentially hinder performance?
I don't think I clearly understand the trade-offs in this situation. Could anybody help?
For reference, I made a plot of the memory requirements in each case.
There is no hard and fast answer to this one. If your algorithm needs more memory than you expect to be given, then you need to find one which is possibly slower but fits within your constraints.
Beyond that, the only option is to implement both and then compare their performance. If saving memory results in a 10% slowdown, is that acceptable for your use? If the version using more memory is 50% faster but only runs on the biggest computers, will it be used? These are the questions that we have to grapple with in computer science. But you can only weigh them once you have numbers; otherwise you are just guessing, and a fair amount of the time our intuition when it comes to optimizations is not correct.
Build a custom array that will follow the rules you have set.
The implementation will use a simple 1D contiguous array. You will need a function that returns the start of a row within that array. Something like this:
int* Get(int* array, int n, int row) // might contain logical errors
{
    int pos = 0;
    while (row--)
    {
        pos += n;
        n /= 2;
    }
    return array + pos;
}
Here n is the same n you described; it is halved (rounding down) on every iteration.
You will have to call this function only once per entire row.
This function will never take more than O(log n) time, but if you want you can replace it with a single expression: http://en.wikipedia.org/wiki/Geometric_series#Formula
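A hedged sketch of that closed form (assuming n is a power of two so the halvings are exact, and that row stays within the pyramid; the function name get_row is made up):
/* rows have sizes n, n/2, n/4, ..., so by the geometric series formula
   the start of row r is n + n/2 + ... + n/2^(r-1) = 2n - n/2^(r-1) */
int* get_row(int* array, int n, int row)
{
    int pos = (row == 0) ? 0 : 2 * n - (n >> (row - 1));
    return array + pos;
}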
You could use a single array and just calculate the offset yourself:
size_t get_offset(int n, int row, int column) {
    size_t offset = column;
    while (row--) {
        offset += n;
        n /= 2; /* each row is half the size of the previous one */
    }
    return offset;
}
double *array = calloc(get_offset(n, 64, 0), sizeof(double));
access via
array[get_offset(n, row, column)]

How should an RPG tile-based map be represented?

I have a tile-based RPG system where a specific tile type is represented by a string (e.g. Grass = "g", Dirt = "d"). The problem is that I do not know how to represent a map (a group of tiles gathered in a specific order) in a way where each tile can be accessed by its x/y coordinates efficiently. Should the maps be represented in array format:
map[0].coords[x][y] = "g";
Or perhaps in some other way?
It depends on what language you are using, but a 2-dimensional array is usually an efficient way to do this.
Accessing elements in an array is usually quick because the position of a given element in memory can be calculated from the array indexes provided, without having to iterate over other elements. Other data structures (e.g. linked lists) are much slower for this type of retrieval.
A few things, dependent on the language:
1: If possible, use constant integers for terrain types. Constants use less memory and are quicker to reference/retrieve, and the same goes for integers over strings.
2: A two-dimensional array would probably be the most efficient way of doing it.
An example
CONST(INT) GRASS = 1;
CONST(INT) DIRT = 2;
CONST(INT) SNOW = 3;
// assuming map is an array containing objects, and coords is a 2d
// array of said object:
map[0].coords[x,y] = GRASS;
A two-dimensional array is fine.
You can also use a one-dimensional array. Here's a snippet of Java code I have lying around:
char [] cells = new char[WORLD_WIDTH * WORLD_HEIGHT];
public char get(int x, int y) {
return cells[x + y * WORLD_WIDTH];
}
public void set(int x, int y, char c) {
cells[x + y * WORLD_WIDTH] = c;
}
Suppose your world is 10 by 10 tiles, then the first row is in cells[0] to cells[9], the last row in cells[90] to cells[99], and so on.
Of course, you may want to add additional checks to ensure that the x and y parameters are valid.

Matlab array of structs: Fast assignment

Is there any way to "vector" assign an array of structs?
Currently I can
edges(1000000) = struct('weight',1.0); % This really does not assign the value; I checked on 2009A.
for i=1:1000000; edges(i).weight=1.0; end;
But that is slow; I want to do something more like
edges(:).weight=[rand(1000000,1)]; % with or without the square brackets
Any ideas/suggestions to vectorize this assignment so that it will be faster?
Thanks in advance.
This is much faster than deal or a loop (at least on my system):
N=10000;
edge(N) = struct('weight',1.0); % initialize the array
values = rand(1,N); % set the values as a vector
W = mat2cell(values, 1,ones(1,N)); % convert values to a cell
[edge(:).weight] = W{:};
Using curly braces on the right gives a comma-separated list of all the values in W (i.e. N outputs), and using square braces on the left assigns those N outputs to the N values in edge(:).weight.
You can try using the Matlab function deal, but I found it requires tweaking the input a little (see this question: In Matlab, for a multiple input function, how to use a single input as multiple inputs?); maybe there is something simpler.
n=100000;
edges(n)=struct('weight',1.0);
m=mat2cell(rand(n,1),ones(n,1),1);
[edges(:).weight]=deal(m{:});
Also, I found that this is not nearly as fast as the for loop on my computer (~0.35s for deal versus ~0.05s for the loop), presumably because of the call to mat2cell. The difference in speed is reduced if you use this more than once, but it stays in favor of the for loop.
You could simply write:
edges = struct('weight', num2cell(rand(1000000,1)));
Is there something requiring you to particularly use a struct in this way?
Consider replacing your array of structs with simply a separate array for each member of the struct.
weights = rand(1, 1000);
If you have a struct member which is an array, you can make an extra dimension:
matrices = rand(3, 3, 1000);
If you just want to keep things neat, you could put these arrays into a struct:
edges.weights = weights;
edges.matrices = matrices;
But if you need to keep an array of structs, I think you can do
[edges.weight] = rand(1, 1000);
The reason that the structs in your example don't get initialized properly is that the syntax you're using only addresses the very last element in the struct array. For a nonexistent array, the rest of them get implicitly filled in with structs that have the default value [] in all their fields.
To make this behavior clear, try doing a short array with clear edges; edges(1:3) = struct('weight',1.0) and looking at each of edges(1), edges(2), and edges(3). The edges(3) element has 1.0 in its weight like you want; the others have [].
The syntax for efficiently initializing an array of structs is one of these.
% Using repmat and full assignment
edges = repmat(struct('weight', 1.0), [1 1000]);
% Using indexing
% NOTE: Only correct if variable is uninitialized!!!
edges(1:1000) = struct('weight', 1.0); % QUESTIONABLE
Note the 1:1000 instead of just 1000 when indexing in to the uninitialized edges array.
There's a problem with the edges(1:1000) form: if edges is already initialized, this syntax will just update the values of selected elements. If edges has more than 1000 elements, the others will be left unchanged, and your code will be buggy. Or if edges is a different type, you could get an error or weird behavior depending on its existing datatype. To be safe, you need to do clear edges before initializing using the indexing syntax. So it's better to just do full assignment with the repmat form.
BUT: Regardless of how you initialize it, an array-of-structs like this is always going to be inherently slow to work with for larger data sets. You can't do real "vectorized" operations on it because your primitive arrays are all broken up into separate mxArrays inside each struct element. That includes the field assignment in your question; it is not possible to vectorize that. Instead, you should switch to a struct-of-arrays like Brian L's answer suggests.
You can use a reverse struct (a scalar struct whose field holds the whole array) and then do all operations without any errors,
like this
x.E(1)=3;
x.E(2)=8;
x.E(3)=5;
and then operations like the following
x.E
ans =
3 8 5
or like this
x.E(1:2)=2
x =
E: [2 2 5]
or maybe this
x.E(1:3)=[2,3,4]*5
x =
E: [10 15 20]
It is really faster than a for loop, and you do not need other big functions that slow your program down.

Ideal data structure for mapping integers to integers?

I won't go into details, but I'm attempting to implement an algorithm similar to the Boyer-Moore-Horspool algorithm, only using hex color values instead of characters (i.e., there is a much greater range).
Following the example on Wikipedia, I originally had this:
size_t jump_table[0xFFFFFF + 1];
memset(jump_table, default_value, sizeof(jump_table));
However, 0xFFFFFF is obviously a huge number and this quickly causes C to seg-fault (but not stack-overflow, disappointingly).
Basically, what I need is an efficient associative array mapping integers to integers. I was considering using a hash table, but having a malloc'd struct for each entry just seems like overkill to me (I also do not need hashes generated, as each key is a unique integer and there can be no duplicate entries).
Does anyone have any alternatives to suggest? Am I being overly pragmatic about this?
Update
For those interested, I ended up using a hash table via the uthash library.
0xffffff is rather too large to put on the stack on most systems, but you absolutely can malloc a buffer of that size (at least on current computers; not so much on a smartphone). Whether or not you should do it for this task is a separate issue.
Edit: Based on the comment, if you expect the common case to have a relatively small number of entries other than the "this color doesn't appear in the input" skip value, you should probably just go ahead and use a hash map (obviously only storing values that actually appear in the input).
(ignore earlier discussion of other data structures, which was based on an incorrect recollection of the algorithm under discussion -- you want to use a hash table)
If the array you were going to make (of size 0xFFFFFF) was going to be sparse you could try making a smaller array to act as a simple hash table, with the size being 0xFFFFFF / N and the hash function being hexValue / N (or hexValue % (0xFFFFFF / N)). You'll have to be creative to handle collisions though.
This is the only way I can foresee getting out of mallocing structs.
You can malloc(3) a block of 0xFFFFFF + 1 size_t entries on the heap (for simplicity) and address it as you would an array.
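A minimal sketch of that heap allocation (the function name make_jump_table is made up; default_value stands for the skip value from the question, and the block is roughly 64-128 MB depending on sizeof(size_t)):
#include <stdlib.h>

size_t *make_jump_table(size_t default_value) {
    size_t count = 0xFFFFFF + 1; /* 16,777,216 entries */
    size_t *jump_table = malloc(count * sizeof *jump_table);
    if (jump_table != NULL)
        for (size_t i = 0; i < count; i++)
            jump_table[i] = default_value;
    return jump_table; /* NULL if the allocation failed */
}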
As for the stack overflow: the program receives a SIGSEGV, which can be the result of a stack overflow, accessing illegal memory, writing to a read-only segment, etc. They are all abstracted under the same error message, "Segmentation fault".
But why don't you use a higher-level language like Python that supports associative arrays?
At possibly the cost of some speed, you could try modifying the algorithm to find only matches that are aligned to some boundary (every three or four symbols), then perform the search at byte level.
You could create a sparse array of sorts which has "pages", like this (this example uses 256 "pages", so the uppermost byte of the index is the page number):
int *pages[256];
/* call this first to make sure all of the pages start out NULL! */
void init_pages(void) {
    for (int i = 0; i < 256; ++i) {
        pages[i] = NULL;
    }
}

int get_value(int index) {
    if (pages[index / 0x10000] == NULL) {
        pages[index / 0x10000] = calloc(0x10000, sizeof(int)); /* calloc so the new page starts zeroed */
    }
    return pages[index / 0x10000][index % 0x10000];
}

void set_value(int index, int value) {
    if (pages[index / 0x10000] == NULL) {
        pages[index / 0x10000] = calloc(0x10000, sizeof(int)); /* calloc so the new page starts zeroed */
    }
    pages[index / 0x10000][index % 0x10000] = value;
}
This will allocate a page the first time it is touched, whether by a read or a write.
To avoid the overhead of malloc you can use a hash table where the entries in the table are your structs, assuming they are small. In your case a pair of integers should suffice, with a special value to indicate emptiness of a slot in the table.
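A rough sketch of such a table (all names are made up; it assumes a power-of-two size, 24-bit keys so UINT32_MAX is free to act as the "empty" marker, and that the table never fills completely, since a full table would make the probing loops spin forever):
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE (1u << 16)  /* power of two so we can mask instead of mod */
#define EMPTY_KEY UINT32_MAX   /* sentinel: cannot collide with 24-bit color keys */

typedef struct { uint32_t key; uint32_t value; } entry;

static entry table[TABLE_SIZE]; /* entries live in the table itself: no per-entry malloc */

void table_init(void) {
    for (size_t i = 0; i < TABLE_SIZE; i++)
        table[i].key = EMPTY_KEY;
}

void table_put(uint32_t key, uint32_t value) {
    size_t i = key & (TABLE_SIZE - 1);
    while (table[i].key != EMPTY_KEY && table[i].key != key)
        i = (i + 1) & (TABLE_SIZE - 1); /* linear probing */
    table[i].key = key;
    table[i].value = value;
}

uint32_t table_get(uint32_t key, uint32_t default_value) {
    for (size_t i = key & (TABLE_SIZE - 1); table[i].key != EMPTY_KEY;
         i = (i + 1) & (TABLE_SIZE - 1)) {
        if (table[i].key == key)
            return table[i].value;
    }
    return default_value; /* absent keys fall back to the skip value */
}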
How many values are there in your output space, i.e. how many different values do you map to in the range 0 - 0xFFFFFF?
Using randomized universal hashing you can come up with a collision-free hash function with a table no bigger than 2 times the number of values in your output space (for a static table).
