I ran into a big of a problem with a tetris program I'm writing currently in C.
I am trying to use a 4D multi-dimensional array e.g.
uint8_t shape[7][4][4][4]
but I keep getting seg faults when I try that, I've read around and it seems to be that I'm using up all the stack memory with this kind of array (all I'm doing is filling the array with 0s and 1s to depict a shape so I'm not inputting a ridiculously high number or something).
Here is a version of it (on pastebin because as you can imagine its very ugly and long).
If I make the array smaller it seems to work but I'm trying to avoid a way around it as theoretically each "shape" represents a rotation as well.
https://pastebin.com/57JVMN20
I've read that you should use dynamic arrays so they end up on the heap but then I run into the issue how someone would initialize a dynamic array in such a way as linked above. It seems like it would be a headache as I would have to go through loops and specifically handle each shape?
I would also be grateful for anybody to let me pick their brain on dynamic arrays how best to go about them and if it's even worth doing normal arrays at all.
Even though I have not understood why do you use 4D arrays to store shapes for a tetris game, and I agree with bolov's comment that such an array should not overflow the stack (7*4*4*4*1 = 448 bytes), so you should probably check other code you wrote.
Now, to your question on how to manage 4D (N-Dimensional)dynamically sized arrays. You can do this in two ways:
The first way consists in creating an array of (N-1)-Dimensional arrays. If N = 2 (a table) you end up with a "linearized" version of the table (a normal array) which dimension is equal to R * C where R is the number of rows and C the number of columns. Inductively speaking, you can do the very same thing for N-Dimensional arrays without too much effort. This method has some drawbacks though:
You need to know beforehand all the dimensions except one (the "latest") and all the dimensions are fixed. Back to the N = 2 example: if you use this method on a table of C columns and R rows, you can change the number of rows by allocating C * sizeof(<your_array_type>) more bytes at the end of the preallocated space, but not the number of columns (not without rebuilding the entire linearized array). Moreover, different rows must have the same number of columns C (you cannot have a 2D array that looks like a triangle when drawn on paper, just to get things clear).
You need to carefully manage the indicies: you cannot simply write my_array[row][column], instead you must access that array with my_array[row*C + column]. If N is not 2, then this formula gets... interesting
You can use N-1 arrays of pointers. That's my favourite solution because it does not have any of the drawbacks from the previous solution, although you need to manage pointers to pointers to pointers to .... to pointers to a type (but that's what you do when you access my_array[7][4][4][4].
Solution 1
Let's say you want to build an N-Dimensional array in C using the first solution.
You know the length of each dimension of the array up to the (N-1)-th (let's call them d_1, d_2, ..., d_(N-1)). We can build this inductively:
We know how to build a dynamic 1-dimensional array
Supposing we know how to build a (N-1)-dimensional array, we show that we can build a N-Dimensional array by putting each (N-1)-dimensional array we have available in a 1-Dimensional array, thus increasing the available dimensions by 1.
Let's also assume that the data type that the arrays must hold is called T.
Let's suppose we want to create an array with R (N-1)-dimensional arrays inside it. For that we need to know the size of each (N-1)-dimensional array, so we need to calculate it.
For N = 1 the size is just sizeof(T)
For N = 2 the size is d_1 * sizeof(T)
For N = 3 the size is d_2 * d_1 * sizeof(T)
You can easily inductively prove that the number of bytes required to store R (N-1)-dimensional arrays is R*(d_1 * d_2 * ... * d_(n-1) * sizeof(T)). And that's done.
Now, we need to access a random element inside this massive N-dimensional array. Let's say we want to access the item with indicies (i_1, i_2, ..., i_N). For this we are going to repeat the inductive reasoning:
For N = 1, the index of the i_1 element is just my_array[i_1]
For N = 2, the index of the (i_1, i_2) element can be calculated by thinking that each d_1 elements, a new array begins, so the element is my_array[i_1 * d_1 + i_2].
For N = 3, we can repeat the same process and end up having the element my_array[d_2 * ((i_1 * d_1) + i_2) + i_3]
And so on.
Solution 2
The second solution wastes a bit more memory, but it's more straightforward, both to understand and to implement.
Let's just stick to the N = 2 case so that we can think better. Imagine to have a table and to split it row by row and to place each row in its own memory slot. Now, a row is a 1-dimensional array, and to make a 2-dimensional array we only need to be able to have an ordered array with references to each row. Something like the following drawing shows (the last row is the R-th row):
+------+
| R1 -------> [1,2,3,4]
|------|
| R2 -------> [2,4,6,8]
|------|
| R3 -------> [3,6,9,12]
|------|
| .... |
|------|
| RR -------> [R, 2*R, 3*R, 4*R]
+------+
In order to do that, you need to first allocate the references array (R elements long) and then, iterate through this array and assign to each entry the pointer to a newly allocated memory area of size d_1.
We can easily extend this for N dimensions. Simply build a R dimensional array and, for each entry in this array, allocate a new 1-Dimensional array of size d_(N-1) and do the same for the newly created array until you get to the array with size d_1.
Notice how you can easily access each element by simply using the expression my_array[i_1][i_2][i_3]...[i_N].
For example, let's suppose N = 3 and T is uint8_t and that d_1, d_2 and d_3 are known (and not uninitialized) in the following code:
size_t d1 = 5, d2 = 7, d3 = 3;
int ***my_array;
my_array = malloc(d1 * sizeof(int**));
for(size_t x = 0; x<d1; x++){
my_array[x] = malloc(d2 * sizeof(int*));
for (size_t y = 0; y < d2; y++){
my_array[x][y] = malloc(d3 * sizeof(int));
}
}
//Accessing a random element
size_t x1 = 2, y1 = 6, z1 = 1;
my_array[x1][y1][z1] = 32;
I hope this helps. Please feel free to comment if you have questions.
Related
In julia, one can pre-allocate an array of a given type and dims with
A = Array{<type>}(undef,<dims>)
example for a 10x10 matrix of floats
A = Array{Float64,2}(undef,10,10)
However, for array of array pre-allocation, it does not seem to be possible to provide a pre-allocation for the underlying arrays.
For instance, if I want to initialize a vector of n matrices of complex floats I can only figure this syntax,
A = Vector{Array{ComplexF64,2}}(undef, n)
but how could I preallocate the size of each Array in the vector, except with a loop afterwards ? I tried e.g.
A = Vector{Array{ComplexF64,2}(undef,10,10)}(undef, n)
which obviously does not work.
Remember that "allocate" means "give me a contiguous chunk of memory, of size exactly blah". For an array of arrays, which is really a contiguous chunk of pointers to other contiguous chunks, this doesn't really make sense in general as a combined operation -- the latter chunks might just totally differ.
However, by stating your problem, you make clear that you actually have more structural information: you know that you have n 10x10 arrays. This really is a 3D array, conceptually:
A = Array{Float64}(undef, n, 10, 10)
At that point, you can just take slices, or better: views along the first axis, if you need an array of them:
[#view A[i, :, :] for i in axes(A, 1)]
This is a length n array of AbstractArrays that in all respects behave like the individual 10x10 arrays you wanted.
In the cases like you have described you need to use comprehension:
a = [Matrix{ComplexF64}(undef, 2,3) for _ in 1:4]
This allocates a Vector of Arrays. In Julia's comprehension you can iterate over more dimensions so higher dimensionality is also available.
I am trying to turn a 3 dimensional array "upside-down" as:
I have tried the inverse function, but if we look at the inverse operation in mathematical terms, it gives us another result. I need to turn without changing the data in the array. How to do that?
To split the 3-dimensional array (A x B x C) into A-sub-array (2d, B x C) I have used squeeze: k=squeeze(array(n,:,:)). Now I have a 2-dimensional array of size B x C. How to put it back together (in 3 dimensional array)?
You can use permute() to change the order of dimensions, which can be used as a multidimensional transpose.
Putting 2D matrices into a 3D one then is a simple indexing operation. Read more in indexing here.
A = rand(10,10,10);
B = permute(A, [ 3 2 1 ]); % Permute he order of dimensions
mat1 = rand(10,10);
mat2 = rand(10,10);
mat_both(:,:,2) = mat2; % Stack 2D matrices along the third dimension
mat_both(:,:,1) = mat1;
mat_both = cat(3,mat1, mat2); % Stacks along the third dimension in a faster way
Note that this has probably been answered before, but now in a way that I could understand it.
I'm trying to understand how pointers and arrays work together. I thought I understood both how and why 1D and 2D arrays could be accessed with pointers until I tried to wrap my head around 3D arrays (just to be clear, I understand 3D arrays, just not how to access them with pointers).
So let's say I have an array x[5] and I want to access the element x[1] using pointers. In that case, I can use *(&x[0]+1), because x is the same as &x[0] and *(x + 1) can be used to access a[1].
Now, for a 2D array, let's say it is declared as x[5][3].
To access x[1][2], I can use *(&x[0][0]+3*1+2). I can kind of understand this. We start at memory location x[0][0], then add 3 * 1 to get from x[0][0] to x[1][0], and then add 2 to get from x[1][0] to x[1][2].
The above makes sense to me until I get to a three dimensional array. Let's assume that x is declared as x[3][4][5], how would I get to, let's say element x[2][1][3] using the same logic(if possible)? Also, how can I apply this logic to higher dimensional arrays as well?
Any advice on this is greatly appreciated!
Supposing you have the following array:
//x[i][j][k]
int x[2][3][5] = { {{0,1,2,3,4},
{5,6,7,8,9},
{10,11,12,13,14}},
{{ 88,1,2,3,4 },
{ 5,6,7,8,9},
{ 10,11,12,77,14}}
};
Elements of 3D array are arranged in series in the memory like this:
{0,1,2,3,4},{5,6,7,8,9},{10,11,12,13,14},{ 88,1,2,3,4 },{ 5,6,7,8,9},{ 10,11,12,77,14}
Obviously, If you hava the address of the first element you can access the others by adding an offset. The address of the first element is **x, and the address of the next one is **x+1 and so forth. To access the elements of the x array look at the following example:
int D3_width = sizeof(x)/sizeof(x[0]); //2
int D2_width = sizeof(x[0]) / sizeof(x[0][0]); //3
int D1_width = sizeof(x[0][0]) / sizeof(int); //5
int i = 0, j = 2, k = 3;
cout << *(**x + i*D2_width*D1_width + j*D1_width + k);
output:13
You can use the same approach for other multi-dimensional arrays.
The formula is tested and works perfectly.
For a two-dimensional array you can skip through indexes of rows with pointer arithmetic as you were doing:
First index of first row + (length of row * desired Row)
If you are making your way up the dimensions, this expression will start to get rather large.
For 3D:
First index of first plane + (size of plane * desired plane)
Note that the plane is a 2-dimensional structure.
Now moving into 4D:
First index of first cube + (size of cube * desired cube )
Hope you get the idea.
I have the above loop running on the above variables:
A is a 2d array of size mxn.
mask is a 1d logical array of size 1xn
results is a 1d array of size 1xn
B is a vector of the form mx1
C is a mxm matrix, m is the same as the above.
Edit: expanded foo(x) into the function.
here is the code:
temp = (B.'*C*B);
for k = 1:n
x = A(:,k);
if(mask(k) == 1)
result(k) = (B.'*C*x)^2 / (temp*(x.'*C*x)); %returns scalar
end
end
take note, I am already successfully using the above code as a parfor loop instead of for. I was hoping you would be able to suggest some way to use meshgrid or the sort to yield better performance improvement. I don't think I have RAM problems so a solution can also be expensive memory wise.
Many thanks.
try this:
result=(B.'*C*A).^2./diag(temp*(A.'*C*A))'.*mask;
This vectorization via matrix multiplication will also make sure that result is a 1xn vector. In the code you provided there can be a case where the last elements in mask are zeros, in this case your code will truncate result to a smaller length, whereas, in the answer it'll keep these elements zero.
If your foo admits matrix input, you could do:
result = zeros(1,n); % preallocate result with zeros
mask = logical(mask); % make mask logical type
result(mask) = foo(A(mask),:); % compute foo for all selected columns
My code is being slowed down by a my 4D arrays access in global memory.
I am using PGI compiler 2010.
The 4D array I am accessing is read only from the device and the size is known at run time.
I wanted to allocate to the texture memory and found that my PGI version does not support texture. As the size is known only at run time, it is not possible to use constant memory too.
Only One dimension is known at compile time like this MyFourD(100, x,y,z) where x,y,z are user input.
My first idea is about pointers but not familiar with pointer fortran.
If you have experience how to deal with such a situation, I will appreciate your help. Because only this makes my codes 5times slower than expected
Following is a sample code of what I am trying to do
int i,j,k
i = (blockIdx%x-1) * blockDim%x + threadIdx%x-1
j = (blockIdx%y-1) * blockDim%y + threadIdx%y-1
do k = 0, 100
regvalue1 = somevalue1
regvalue2 = somevalue2
regvalue3 = somevalue3
d_value(i,j,k)=d_value(i,j,k)
& +myFourdArray(10,i,j,k)*regvalue1
& +myFourdArray(32,i,j,k)*regvalue2
& +myFourdArray(45,i,j,k)*regvalue3
end do
Best regards,
I believe the answer from #Alexander Vogt is on the right track - I would think about re-ordering the array storage. But I would try it like this:
int i,j,k
i = (blockIdx%x-1) * blockDim%x + threadIdx%x-1
j = (blockIdx%y-1) * blockDim%y + threadIdx%y-1
do k = 0, 100
regvalue1 = somevalue1
regvalue2 = somevalue2
regvalue3 = somevalue3
d_value(i,j,k)=d_value(i,j,k)
& +myFourdArray(i,j,k,10)*regvalue1
& +myFourdArray(i,j,k,32)*regvalue2
& +myFourdArray(i,j,k,45)*regvalue3
end do
Note that the only change is to myFourdArray, there is no need for a change in data ordering in the d_value array.
The crux of this change is that we are allowing adjacent threads to access adjacent elements in myFourdArray and so we are allowing for coalesced access. Your original formulation forced adjacent threads to access elements that were separated by the length of the first dimension, and so did not allow for useful coalescing.
Whether in CUDA C or CUDA Fortran, threads are grouped in X first, then Y and then Z dimensions. So the rapidly varying thread subscript is X first. Therefore, in matrix access, we want this rapidly varying subscript to show up in the index that is also rapidly varying.
In Fortran this index is the first of a multiple-subscripted array.
In C, this index is the last of a multiple-subscripted array.
Your original code followed this convention for d_value by placing the X thread index (i) in the first array subscript position. But it broke this convention for myFourdArray by putting a constant in the first array subscript position. Thus your access to myFourdArray are noticeably slower.
When there is a loop in the code, we also don't want to place the loop variable first (for Fortran, or last for C) (i.e. k, in this case, as Alexander Vogt did) because doing that will also break coalescing. For each iteration of the loop, we have multiple threads executing in lockstep, and those threads should all access adjacent elements. This is facilitated by having the X thread indexed subscript (e.g. i) first (for Fortran, or last for C).
You could invert the indexing, i.e. let the first dimension change the Fastest. Fortran is column major!
do k = 0, 100
regvalue1 = somevalue1
regvalue2 = somevalue2
regvalue3 = somevalue3
d_value(k,i,j)=d_value(k,i,j) + &
myFourdArray(k,i,j,10)*regvalue1 + &
myFourdArray(k,i,j,32)*regvalue2 + &
myFourdArray(k,i,j,45)*regvalue3
end do
If the last (in the original case second) dimension is always fixed (and not too large), consider individual arrays instead.
In my experience, pointers do not change much in terms of speed-up when applied to large arrays. What you could try is strip-mining to optimize your loops in terms of cache access, but I do not know the compile option to enable this with the PGI compiler.
Ah, ok it is a simple directive:
!$acc do vector
do k=...
enddo