Stack representation of multidimensional array - C

Can someone explain the row-wise and column-wise representation of a 2-dimensional array on a stack? My teacher said that if we have the following matrix:
a00 a01 a02
a10 a11 a12
a20 a21 a22
Column-wise representation:   Row-wise representation:
a00                           a00
a10                           a01
a20                           a02
a01                           a10
a11                           a11
a21                           a12
a02                           a20
a12                           a21
a22                           a22
Whereas I only know about the representation of a multidimensional array in memory:
a00, then a01, then a02, then a10, and so on (in increasing order of addresses).
I raised this question in class: what is the difference between the stack representation and the memory representation of multidimensional arrays? She said we are doing 2-D arrays here, not pointers. What kind of answer is that? Please explain this to me.
She also gave some formulae to calculate the address of any element of a 2-D array in the row representation and the column representation on the stack. I didn't understand them.
Location(A[j,k]) = Base_address(A) + W(M(k-1)+(j-1))

You said,
Whereas I only know about the representation of a multidimensional array in memory: a00, then a01, then a02, then a10, and so on (in increasing order of addresses).
In C/C++, multidimensional arrays are stored using the row representation.
IIRC, in FORTRAN, multidimensional arrays are stored using the column representation.
In C, you can define a 2D array as:
int a[10][3];
When you pass the array to a function, it decays to a pointer of type int (*)[3].
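For illustration, a minimal sketch (the function name f is my own) of what that decayed parameter type looks like in use:

#include <stdio.h>

void f(int (*p)[3], int rows)     /* what `int a[10][3]` decays to */
{
    (void)rows;                   /* a real function would loop over rows */
    printf("%d\n", p[1][2]);      /* ordinary 2-D indexing still works */
}

int main(void)
{
    int a[10][3] = { {0, 1, 2}, {3, 4, 5} };
    f(a, 10);                     /* prints 5 */
    return 0;
}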
Disclaimer: My FORTRAN is rusty, so pardon any use of incorrect syntax
In FORTRAN, you can define a 2D array as:
INTEGER A(10, 3)
When you pass the array to a function, the argument type in the function looks like:
INTEGER A(10, *)
The differences in syntax make it more natural for multidimensional arrays in C to be represented by rows, while in FORTRAN it seems natural for them to be represented by columns.
You also said:
Location(A[j,k]) = Base_address(A) + W(M(k-1)+(j-1))
It seems you are using 1-based indexing. I'm not sure what W and M stand for, but the formula is consistent with W being the element width in bytes and M the number of rows (i.e. a column representation).
Let's say you have ROW rows, COL columns, and elements that are W bytes wide.
If you have the row representation:
Location(A[j,k]) = Base_address(A) + W*((j-1)*COL + (k-1))
If you have the column representation:
Location(A[j,k]) = Base_address(A) + W*((k-1)*ROW + (j-1))
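A minimal C sketch (using 0-based indices and taking W = sizeof(int)) that evaluates both formulas; the row-major one matches the address the compiler computes for &a[j][k], since row-major is C's actual layout:

#include <stdio.h>

#define ROWS 3
#define COLS 3

int main(void)
{
    int a[ROWS][COLS];
    int j = 1, k = 2;                     /* 0-based row j, column k */
    char *base = (char *)&a[0][0];
    size_t W = sizeof(int);

    printf("row-major:    %p\n", (void *)(base + W*(j*COLS + k)));
    printf("&a[j][k]:     %p\n", (void *)&a[j][k]);      /* same as above */
    printf("column-major: %p\n", (void *)(base + W*(k*ROWS + j)));
    return 0;
}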

Here's a better representation of your 2D array in RAM:
Column-wise representation:
Chip1  Chip2  Chip3
a00    a01    a02
a10    a11    a12
a20    a21    a22
Row-wise representation:
Chip1  Chip2  Chip3
a00    a10    a20
a01    a11    a21
a02    a12    a22

Related

SIMD Intel Instruction Sets for 2D Matrix

I am developing high-performance algorithms based on the Intel instruction sets (AVX, FMA, ...). My algorithms (my kernels) work pretty well when the data is stored sequentially. However, now I am facing a big problem and I haven't found a workaround or solution for it.
Consider the following 2D matrix:
#include <immintrin.h>

int x, y;
x = y = 4096;
/* NB: 4096*4096 floats is 64 MiB; a real program would heap-allocate this */
float data[x*y] __attribute__((aligned(32)));
float buffer[y] __attribute__((aligned(32)));

/* simple test data */
for (int i = 0; i < x; i++)
    for (int j = 0; j < y; j++)
        data[y*i + j] = y*i + j;   /* 0,1,2,...,4095 | 4096,4097,...,8191 | ... */

/* 1) Extract the columns out of the matrix */
__m256i vindex = _mm256_set_epi32(7*y, 6*y, 5*y, 4*y, 3*y, 2*y, y, 0);
__m256 vec;
for (int i = 0; i < x; i += 8)
{
    /* gather data[(i+0)*y] .. data[(i+7)*y]: 8 rows of column 0 */
    vec = _mm256_i32gather_ps(&data[i*y], vindex, 4);
    _mm256_store_ps(&buffer[i], vec);
}

/* 2) Perform functions */
fft(buffer, x);

/* 3) write back buffer into matrix */
/* strided write??? ... */
I want to find a very efficient way to do the following:
1. Extract the columns out of the matrix: col1 = 0, 4096, 8192, ...; col2 = 1, 4097, 8193, ... I tried it with gather_ps, which is really slow.
2. Perform my highly efficient algorithms on the extracted columns.
3. Store the columns back into the matrix. Is there any special trick for that?
How can you read and write with a stride (e.g. 4096) using the Intel instruction sets?
Or is there any memory manipulation option to get the columns out of the matrix?
Thank you!
[For row-major data, SIMD access to a row is fast, but access to a column is slow.]
Yes, that is the nature of the x86-64 and similar architectures. Accessing consecutive data in memory is fast, but accessing scattered data (whether randomly or in a regular pattern) is slow. It is a consequence of having processor caches.
There are two basic approaches: copy the data into a new order that facilitates better access patterns, or do the computations in an order that allows better access patterns.
No, there are no rules of thumb or golden tricks that make it all just work. In fact, even comparing different implementations is tricky, because there are so many complex interactions (from cache latencies to operation interleaving to cache and memory access patterns) that results depend heavily on the particular hardware and the dataset at hand.
Let's look at the typical example case, matrix-matrix multiplication. Let's say we multiply two 5×5 matrices (c = a × b), using standard C row-major data order:
c00 c01 c02 c03 c04     a00 a01 a02 a03 a04     b00 b01 b02 b03 b04
c05 c06 c07 c08 c09     a05 a06 a07 a08 a09     b05 b06 b07 b08 b09
c10 c11 c12 c13 c14  =  a10 a11 a12 a13 a14  ×  b10 b11 b12 b13 b14
c15 c16 c17 c18 c19     a15 a16 a17 a18 a19     b15 b16 b17 b18 b19
c20 c21 c22 c23 c24     a20 a21 a22 a23 a24     b20 b21 b22 b23 b24
If we write the result as vertical SIMD vector registers with five components, we have
c00     a00   b00     a01   b05     a02   b10     a03   b15     a04   b20
c01     a00   b01     a01   b06     a02   b11     a03   b16     a04   b21
c02  =  a00 × b02  +  a01 × b07  +  a02 × b12  +  a03 × b17  +  a04 × b22
c03     a00   b03     a01   b08     a02   b13     a03   b18     a04   b23
c04     a00   b04     a01   b09     a02   b14     a03   b19     a04   b24

c05     a05   b00     a06   b05     a07   b10     a08   b15     a09   b20
c06     a05   b01     a06   b06     a07   b11     a08   b16     a09   b21
c07  =  a05 × b02  +  a06 × b07  +  a07 × b12  +  a08 × b17  +  a09 × b22
c08     a05   b03     a06   b08     a07   b13     a08   b18     a09   b23
c09     a05   b04     a06   b09     a07   b14     a08   b19     a09   b24
and so on. In other words, if c has the same order as b, we can use SIMD registers with consecutive memory contents for both c and b, and only gather a. Furthermore, the SIMD registers for a have all components the same value.
Note, however, that the b registers repeat for all five rows of c. So, it might be better to initialize c to zero, then do the additions with products having the same b SIMD registers:
c00     a00   b00      c05     a05   b00      c10     a10   b00      c15     a15   b00      c20     a20   b00
c01     a00   b01      c06     a05   b01      c11     a10   b01      c16     a15   b01      c21     a20   b01
c02 +=  a00 × b02,     c07 +=  a05 × b02,     c12 +=  a10 × b02,     c17 +=  a15 × b02,     c22 +=  a20 × b02
c03     a00   b03      c08     a05   b03      c13     a10   b03      c18     a15   b03      c23     a20   b03
c04     a00   b04      c09     a05   b04      c14     a10   b04      c19     a15   b04      c24     a20   b04
If we transposed a first, then the SIMD vector registers for a would also get their values from consecutive memory locations. In fact, if a is large enough, linearizing the memory access pattern for a gives a large enough speed boost that it is faster to do a transpose copy first (using uint32_t for floats, and uint64_t for doubles; i.e. not using SIMD or floating point at all for the transpose, just copying the storage in transposed order).
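As a concrete illustration, here is a minimal sketch of such a transpose copy under the stated assumptions (square n-by-n row-major storage, destination distinct from the source, float payloads moved as uint32_t):

#include <stddef.h>
#include <stdint.h>

/* Copy the n x n matrix src into dst in transposed order.
   No SIMD, no floating point: the 32-bit payloads are moved as integers. */
void transpose_copy(uint32_t *restrict dst, const uint32_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j*n + i] = src[i*n + j];
}

A cache-blocked version (copying small tiles at a time) would normally be used for large n, but even this naive loop shows the idea.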
Note that the situation with column-major data order, i.e. data order transposed compared to above, is very similar. There is deep symmetry here. For example, if c and b have the same data order, and a the opposite data order, you can SIMD-vectorize the matrix product efficiently without having to copy any data. Only the summing differs, as that depends on the data order, and matrix multiplication is not commutative (a×b != b×a).
Obviously, a major wrinkle is that the SIMD vector registers have a fixed size, so instead of using a complete row as a register as in the example above, you can only use partial rows. (If the number of columns in the result is not a multiple of the SIMD register width, you have that partial vector to worry about as well.)
SSE and AVX have a relatively large number of registers (8, 16, or 32, depending on the set of extensions used), and depending on the particular processor type, might be able to perform some vector operations simultaneously, or at least with fewer latencies if unrelated vector operations are interleaved. So, even the choice of how wide a chunk to operate at once, and whether that chunk is like an extended vector, or more like a block submatrix, is up to discussion, testing, and comparison.
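To make that scheme concrete, here is a hedged sketch (my own, not the answer's code) of the "broadcast an element of a, use consecutive loads for b and c" approach. It assumes AVX2 with FMA, row-major float matrices, n a multiple of 8, and c already initialized to zero:

#include <immintrin.h>

void matmul_broadcast_a(float *c, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            __m256 av = _mm256_set1_ps(a[i*n + k]);        /* broadcast a(i,k) */
            for (int j = 0; j < n; j += 8) {
                __m256 bv = _mm256_loadu_ps(&b[k*n + j]);  /* consecutive b */
                __m256 cv = _mm256_loadu_ps(&c[i*n + j]);  /* consecutive c */
                cv = _mm256_fmadd_ps(av, bv, cv);  /* c(i,j..j+7) += a(i,k)*b(k,j..j+7) */
                _mm256_storeu_ps(&c[i*n + j], cv);
            }
        }
}

A real implementation would keep several c vectors in registers across the k loop instead of reloading and restoring them every iteration; this form just mirrors the register picture drawn above.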
So how do I do the matrix-matrix multiplication most efficiently using SIMD?
Like I said, that depends on the data set. No easy answers, I'm afraid.
The main parameters (for choosing the most efficient approach) are the sizes and memory ordering of the multiplicand and result matrices.
It gets even more interesting, if you calculate the product of more than two matrices of different sizes. This is because the number of operations then depends on the order of the products.
Why are you so discouraging?
I'm not, actually. All of the above means that not too many people can handle this kind of complexity and stay sane and productive, so there are a lot of undiscovered approaches, and a lot to gain in real-world performance.
Even if we ignore the SIMD intrinsics compilers provide (<x86intrin.h> in this case), we can apply the logic above when designing internal data structures, so that the C compiler we use has the best opportunities for vectorizing the calculations for us. (They're not very good at it yet, though. Like I said, complex stuff. Some like Fortran better than C, because its expressions and rules make it easier for Fortran compilers to optimize and vectorize them.)
If this were simple or easy, the solutions would be well known by now. They aren't, because this is not. But that does not mean it is impossible or out of our reach; all it means is that smart enough developers haven't yet put enough effort into unraveling it.
If you can run your algorithms over 8 (or 16¹) columns in parallel, one regular AVX load can grab 8 columns of data into one vector. Then another load can grab the next row from all those columns.
This has the advantage that you never need to shuffle within a vector; everything is pure vertical and you have consecutive elements of each column in different vectors.
If this was a reduction like summing a column, you'd produce 8 results in parallel. If you're updating the columns as you go, then you're writing vectors of results for 8 columns at once, instead of a vector of 8 elements of one column.
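For example, such a reduction could look like this minimal sketch (the function name and parameters are my own; it assumes row-major float data and a row stride given in floats):

#include <immintrin.h>
#include <stddef.h>

/* Sum columns 0..7 of a row-major matrix: one contiguous 8-float load per
   row, no shuffles and no gathers. The 8 column sums land in sums[0..7]. */
void column_sums8(const float *data, size_t rows, size_t stride, float *sums)
{
    __m256 acc = _mm256_setzero_ps();
    for (size_t r = 0; r < rows; r++)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + r*stride));
    _mm256_storeu_ps(sums, acc);
}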
Footnote 1: 16 float columns = 64 bytes = 1 full cache line = two AVX vectors or one AVX512 vector. Reading / writing full cache lines at a time is a lot better than striding down one column at a time, although it is usually worse than accessing consecutive cache lines. Especially if your stride is larger than a 4k page, HW prefetching might not lock on to it very well.
Obviously make sure your data is aligned by 64 for this, with the row stride a multiple of 64 bytes, too. Pad the ends of rows if you need to.
Doing only 1 AVX vector (half a cache line) at a time would be bad if the first row will be evicted from L1d before you loop back to read the 2nd 32-byte vector of columns 8..15.
Other caveats:
4k aliasing can be a problem: A store and then a load from addresses that are a multiple of 4kiB apart aren't detected as non-overlapping right away, so the load is blocked by the store. This can massively reduce the amount of parallelism the CPU can exploit.
4k strides can also lead to conflict misses in the cache, if you're touching lots of lines that alias to the same set. So updating data in place might still have cache misses for the stores, because lines could be evicted after loading and processing, before the store is ready to commit. This is most likely to be a problem if your row stride is a large power of 2. If that ends up being a problem, maybe allocate more memory in that case and pad your rows with unused elements at the end, so the storage format never has a large power of 2 row stride.
Adjacent-line prefetching in L2 cache (Intel CPUs) may try to fill in the pair of every line you touch, if there's spare bandwidth. This could end up evicting useful data, especially if you're close to aliasing and/or L2 capacity. But if you're not pushing into those limits, it's probably a good thing and will help when you loop over the next 16 columns.
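As a sketch of that padding advice (the helper name and the 1024-float threshold are my own choices, not a known recipe):

#include <stddef.h>

/* Row stride in floats: a multiple of 16 floats (64 bytes = 1 cache line),
   but nudged off large powers of 2 to dodge 4k aliasing and conflict misses. */
size_t pad_row_stride(size_t cols)
{
    size_t stride = (cols + 15) & ~(size_t)15;   /* round up to 16 floats */
    if (stride >= 1024 && (stride & (stride - 1)) == 0)
        stride += 16;                            /* e.g. 4096 -> 4112 */
    return stride;
}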
The data is stored one row after another in memory. Since C does not really distinguish between an array and a matrix, you can collect the elements of one column with:

/* LENGTH is the row width; rowcount is the number of rows */
for (int i = 0; i < rowcount; i++)
    column[i] = data[i*LENGTH + desired_column];

You can now store the values or, even better, their addresses (&data[i*LENGTH + desired_column]) and hand those to your worker function. If you work through the addresses, the values change in place in the matrix, so you don't need to write them back.

Clever way of adding an array to a longer array at particular indices in Fortran?

I have two (1d) arrays, a long one A (size m) and a shorter one B (size n). I want to update the long array by adding each element of the short array at a particular index.
Schematically the arrays are structured like this,
A = [a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 ... am]
B = [ b1 b2 b3 b4 b5 b6 b7 b8 b9 ... bn ]
and I want to update A by adding the corresponding elements of B.
The most straightforward way is to have some index array indarray (same size as B) which tells us which index of A corresponds to B(i):
Option 1
do i = 1, size(B)
   A(indarray(i)) = A(indarray(i)) + B(i)
end do
However, there is an organization to this problem which I feel should allow for some better performance:
(1) There should be no barrier to doing this in a vectorized way, i.e. the updates for each i are independent and can be done in any order.
(2) There is no need to jump back and forth in array A. The machine should know to just loop once through the arrays, only updating A where necessary.
(3) There should be no need for any temporary arrays.
What is the best way to do this in Fortran?
Option 2
One way might be using PACK, UNPACK, and a boolean mask M (same size as A) that serves the same purpose as indarray:
A = [a1  a2  a3  a4  a5  a6  a7  a8  a9  a10 a11 a12 a13 a14 ... am]
B = [    b1  b2  b3      b4  b5          b6      b7  b8  b9  ... bn]
M = [.   T   T   T   .   T   T   .   .   T   .   T   T   T   T   . ]
(where T represents .true. and . is .false.).
And the code would just be
A = UNPACK(PACK(A, M) + B, M, A)
This is very concise and maybe satisfies (1) and sort of (2) (it seems to do two loops through the arrays instead of just one). But I fear the machine will create a few temporary arrays in the process, which seems unnecessary.
Option 3
What about using where with UNPACK?
where (M)
   A = A + UNPACK(B, M, 0.0d0)
end where
This seems about the same as option 2 (two loops and maybe creates temporary arrays). It also has to fill the M=.false. elements of the UNPACK'd array with 0's which seems like a total waste.
Option 4
In my situation the .true. elements of the mask will usually come in contiguous blocks (i.e. a few trues in a row, then a bunch of falses, then another block of trues, etc.). Maybe this could lead to something similar to option 1. Let's say there are K of these .true. blocks. I would need an array indstart (of size K) giving the index into A of the start of each true block, and an array blocksize (of size K) with the length of each true block.
j = 1
do i = 1, size(indstart)
   i0 = indstart(i)
   i1 = i0 + blocksize(i) - 1
   A(i0:i1) = A(i0:i1) + B(j:j+blocksize(i)-1)
   j = j + blocksize(i)
end do
At least this only does one pass through the arrays. This code is more explicit about the fact that there's no jumping back and forth within the arrays. But I don't think the compiler will be able to figure that out (blocksize could contain negative values, for example), so this option probably won't vectorize.
--
Any thoughts on a nice way to do this? In my situation the arrays indarray, M, indstart, and blocksize would be created once but the adding operation must be done many times for different arrays A and B (though these arrays will have constant sizes). The where statement seems like it could be relevant.

Array 2D to 1D conversion and confusion

I am confused about converting a 2D array into a 1D array.
I want to write the 8 neighboring elements of "a11" (which is at (1,1)) in terms of width, rows, and cols, without using a for loop.
       |<---Width--->|
            cols
       ______________
      | a00  a01  a02
rows  | a10  a11  a12
      | a20  a21  a22
I tried it this way:
a00 = pSrc[(cols-1)+ (rows - 1)*width];
a02 = pSrc[(cols-1)+ (rows + 1)*width];
a10 = pSrc[cols+ (rows -1)*width];
a12 = pSrc[cols+ (rows +1)*width];
a20 = pSrc[(cols+1)+ (rows - 1)*width];
a22 = pSrc[(cols+1)+ (rows + 1)*width];
a01 = pSrc[(cols-1)+ (rows )*width];
a21 = pSrc[(cols+1)+ (rows )*width];
But I think I made a mistake somewhere. Can anyone help me with this?
It isn't clear how pSrc is defined since you don't show its definition. However, your code is consistent with it being declared as a 1D array:
int pSrc[9]; // Or a larger dimension
Your code can sensibly be written so it is more uniformly laid out:
a00 = pSrc[(cols-1) + (rows-1)*width];
a01 = pSrc[(cols-1) + (rows+0)*width];
a02 = pSrc[(cols-1) + (rows+1)*width];
a10 = pSrc[(cols+0) + (rows-1)*width];
a12 = pSrc[(cols+0) + (rows+1)*width];
a20 = pSrc[(cols+1) + (rows-1)*width];
a21 = pSrc[(cols+1) + (rows+0)*width];
a22 = pSrc[(cols+1) + (rows+1)*width];
The +0 will be ignored by even the most simple-minded compiler, almost certainly without even turning the optimizer on, but it makes the code much easier to read. I also resequenced the entries so the row above is listed first, then the middle row, and then the bottom row. Again, it makes it easier to see the patterns.
It is then clear that you are using 'rows' and 'cols' backwards. You actually need:
a00 = pSrc[(cols-1) + (rows-1)*width];
a01 = pSrc[(cols+0) + (rows-1)*width];
a02 = pSrc[(cols+1) + (rows-1)*width];
a10 = pSrc[(cols-1) + (rows+0)*width];
a12 = pSrc[(cols+1) + (rows+0)*width];
a20 = pSrc[(cols-1) + (rows+1)*width];
a21 = pSrc[(cols+0) + (rows+1)*width];
a22 = pSrc[(cols+1) + (rows+1)*width];
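To check the corrected indexing, here is a minimal compilable sketch (the concrete matrix values and the 3x3 size are my own test data):

#include <stdio.h>

int main(void)
{
    enum { width = 3 };
    int pSrc[width * 3] = { 10, 11, 12,
                            20, 21, 22,
                            30, 31, 32 };
    int rows = 1, cols = 1;                     /* center element a11 */
    int a00 = pSrc[(cols-1) + (rows-1)*width];  /* top-left     */
    int a01 = pSrc[(cols+0) + (rows-1)*width];  /* top          */
    int a02 = pSrc[(cols+1) + (rows-1)*width];  /* top-right    */
    int a10 = pSrc[(cols-1) + (rows+0)*width];  /* left         */
    int a12 = pSrc[(cols+1) + (rows+0)*width];  /* right        */
    int a20 = pSrc[(cols-1) + (rows+1)*width];  /* bottom-left  */
    int a21 = pSrc[(cols+0) + (rows+1)*width];  /* bottom       */
    int a22 = pSrc[(cols+1) + (rows+1)*width];  /* bottom-right */
    printf("%d %d %d\n%d  .  %d\n%d %d %d\n",
           a00, a01, a02, a10, a12, a20, a21, a22);
    return 0;
}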

Why does giving only the column size work, but giving only the row size not work, in 2-D array initialisation?

This works:
int a[][2] = {
    {2, 4},
    {6, 8}
};
but this shows an error:
int a[2][] = {
    {2, 4},
    {6, 8}
};
Why does giving only the column size show no error, while giving only the row size gives an error?
In C, you can omit only the length of the first dimension. For a 1-D array, you can write either
int oneD_array[2] = {1,2};
or
int oneD_array[] = {1,2};
In the case of a 2-D array, both of
int twoD_array[2][2] = { {2,4}, {6,8} };
and
int twoD_array[][2] = { {2,4}, {6,8} };
are valid.
But the latter declaration is valid only if an initializer is present; otherwise it is an error.
The compiler uses the length of the initializer to determine how long the array is, so the number of rows can be deduced. The length of a row (the number of columns) cannot be determined this way, and without it the compiler cannot calculate the address of any element. Knowing the number of rows and columns, the compiler calculates the address of each element using the array equation:
address(element) = address(first element) + (row number * number of columns + column number) * sizeof(type)
A detailed look at the array equation:
A 2D array in C is treated as a 1D array whose elements are 1D arrays (the rows).
For example, a 4x3 array of T (where T is some data type) may be declared by: T mat[4][3], and described by the following scheme:
                   +-----+-----+-----+
mat == mat[0] ---> | a00 | a01 | a02 |
                   +-----+-----+-----+
                   +-----+-----+-----+
       mat[1] ---> | a10 | a11 | a12 |
                   +-----+-----+-----+
                   +-----+-----+-----+
       mat[2] ---> | a20 | a21 | a22 |
                   +-----+-----+-----+
                   +-----+-----+-----+
       mat[3] ---> | a30 | a31 | a32 |
                   +-----+-----+-----+
The array elements are stored in memory row after row, so the array equation for element mat[i][j] of type T, with n columns per row (here n = 3), is:

address(mat[i][j]) = address(mat[0][0]) + (i * n + j) * size(T)

which can be regrouped as:

address(mat[i][j]) = address(mat[0][0]) + i * n * size(T) + j * size(T)
                   = address(mat[0][0]) + i * size(row of T) + j * size(T)
The compiler has to convert the array into a linear structure (i.e. memory addresses). It does this by multiplying the row number by the width of a row (the number of columns), then adding the column number you are interested in. Note that this calculation requires the number of columns to be known; the number of rows the compiler can simply count.
So memory offset = row number * number of columns + column number. There is no getting away from the fact that the number of columns is a compile-time requirement.
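A quick way to convince yourself of the array equation is to let the compiler verify it; a small hedged example with a 4x3 matrix (the variable names are my own):

#include <stdio.h>

int main(void)
{
    int mat[4][3];
    int i = 2, j = 1, n = 3;                    /* n = number of columns */
    int *base = &mat[0][0];

    /* pointer arithmetic on int* already scales by sizeof(int), so the
       equation reduces to: &mat[i][j] == base + i*n + j */
    printf("%d\n", &mat[i][j] == base + i*n + j);   /* prints 1 */
    return 0;
}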

Genetic algorithms theoretical question

I'm currently reading "Artificial Intelligence: A Modern Approach" (Russell+Norvig) and "Machine Learning" (Mitchell) - and trying to learn basics of AINN.
In order to understand a few basic things, I have two 'greenhorn' questions:
Q1: In a genetic algorithm given the two parents A and B with the chromosomes 001110 and 101101, respectively, which of the following offspring could have resulted from a one-point crossover?
a: 001101
b: 001110
Q2: Which of the above offspring could have resulted from a two-point crossover? and why?
Please advise.
It is not possible to find the parents if you do not know the inverse-crossover function (so that A×B => (a,b) and (any a) => (A,B)).
Usually the 1-point crossover function is:
a = A1 + B2
b = B1 + A2
where X1 and X2 are the segments of chromosome X before and after the crossover point, and + is concatenation.
Even if you know a and b, you cannot solve the system (a system of 2 equations with 4 variables).
If you know any 2 parts of A and/or B, then it can be solved (a system of 2 equations with 2 variables). This is the case for your question, as you provide both A and B.
Generally a crossover function does not have an inverse function, and you just need to find the solution logically or, if you know the parents, perform the crossover and compare.
So to make a generic formula we would need to know 2 things:
1. The crossover function.
2. The inverse-crossover function.
The 2nd one is not usually defined in GAs, as it is not required.
Now, I'll just answer your questions.
Q1: In a genetic algorithm given the two parents A and B with the chromosomes 001110 and 101101, respectively, which of the following offspring could have resulted from a one-point crossover?
Looking at a and b, I can see the crossover point is here:
     1    2
A:  00 | 1110
B:  10 | 1101
Usually the crossover is done using this formula:
a = A1 + B2
b = B1 + A2
so that possible children are:
a: 00 | 1101
b: 10 | 1110
which excludes option b from the question.
So the answer to Q1 is that the resulting child is a: 001101, assuming the given crossover function.
Q2: Which of the above offspring could have resulted from a two-point crossover? And why?
Looking at a and b, I can see the crossover points can be here:
     1    2    3
A:  00 | 11 | 10
B:  10 | 11 | 01
The usual formula for a 2-point crossover is:
a = A1 + B2 + A3
b = B1 + A2 + B3
So the children would be:
a = 00 | 11 | 10
b = 10 | 11 | 01
Comparing them to the options you asked about (lowercase a and b), we can state the answer:
Q2 answer: neither a nor b could be the result of a 2-point crossover of A and B, according to the given crossover function.
Again, it is not possible to answer your questions without knowing the crossover function.
The functions I provided are common in GAs, but you can invent many variants that would answer the question differently (see the comment below):
One-point crossover is when you make one join, taking one segment from each parent; two-point crossover is when you make two joins, i.e. two segments from one parent and one from the other.
See crossover (wikipedia) for further info.
Regarding Q1, (a) could have been produced by a one-point crossover, taking bits 0-3 from parent A and bits 4-5 from parent B. (b) could not, unless your crossover algorithm allows for null contributions, i.e. parent contributions of null weight. In that case, parent A could contribute its full chromosome (bits 0-5) and parent B would contribute nil, yielding (b).
Regarding Q2, both (a) and (b) are possible. There are a few combinations to test; too tedious to write them all out, but you can do the work with pen and paper. For instance, cutting after bits 3 and 4 gives A1 + B2 + A3 = 001 + 1 + 10 = 001110 = (b), while (a) needs the second cut at the very end of the chromosome (a null third segment): 0011 + 01 = 001101. :-)
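If the pen-and-paper check gets tedious, a short brute-force program settles it. A minimal sketch in C (the helper name and loop bounds are my own), enumerating every cut position, including the boundary cut that leaves a null third segment:

#include <stdio.h>
#include <string.h>

/* child = A[0..p-1] + B[p..q-1] + A[q..n-1].
   q == n leaves the third segment empty, i.e. a one-point crossover. */
static void crossover(const char *A, const char *B, int p, int q, char *out)
{
    size_t n = strlen(A);
    memcpy(out, A, (size_t)p);
    memcpy(out + p, B + p, (size_t)(q - p));
    memcpy(out + q, A + q, n - (size_t)q);
    out[n] = '\0';
}

int main(void)
{
    const char *A = "001110", *B = "101101";
    char child[7];
    for (int p = 1; p < 6; p++) {               /* one-point crossovers */
        crossover(A, B, p, 6, child);
        printf("1-point after bit %d: %s\n", p, child);
    }
    for (int p = 1; p < 6; p++)                 /* two-point crossovers */
        for (int q = p + 1; q <= 6; q++) {      /* q == 6: null third segment */
            crossover(A, B, p, q, child);
            printf("2-point cuts %d,%d: %s\n", p, q, child);
        }
    return 0;
}

Scanning the output for 001101 and 001110 confirms the discussion above.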
