SIMD Intel Instruction Sets for 2D Matrix - c

I am developing high-performance algorithms based on the Intel instruction sets (AVX, FMA, ...). My algorithms (my kernels) work pretty well when the data is stored sequentially. However, now I am facing a big problem and I haven't found a workaround or solution for it. See the 2D matrix below:
int x, y; x = y = 4096;
int i, j;
float data[x*y] __attribute__((aligned(32)));
float buffer[y] __attribute__((aligned(32)));

/* simple test data */
for (i = 0; i < x; i++)
    for (j = 0; j < y; j++)
        data[y*i+j] = y*i+j; // 0,1,2,3,...,4095 | 4096,4097,...,8191 | ...

/* 1) Extract a column (here column 0) out of the matrix */
__m256i vindex; __m256 vec;
vindex = _mm256_set_epi32(7*y, 6*y, 5*y, 4*y, 3*y, 2*y, y, 0);
for (i = 0; i < x; i += 8)
{
    vec = _mm256_i32gather_ps(&data[i*y], vindex, 4);
    _mm256_store_ps(&buffer[i], vec);
}

/* 2) Perform functions */
fft(buffer, x);

/* 3) write the buffer back into the matrix */
/* strided write??? ... */
I want to find a very efficient way to do the following:
Extract the columns out of the matrix: col1 = 0, 4096, 8192, ... col2 = 1, 4097, 8193, ...
I tried it with _mm256_i32gather_ps, which is really slow.
Perform my highly efficient algorithms on the extracted columns...
Store back the columns into the matrix:
Is there any special trick for that?
How can you read and write with a stride (e.g. 4096) using the Intel instruction sets?
Or is there any memory manipulation option to get the columns out of the matrix?
Thank you!

[For row-major data, SIMD access to a row is fast, but access to a column is slow]
Yes, that is the nature of the x86-64 and similar architectures. Accessing consecutive data in memory is fast, but accessing scattered data (whether randomly or in a regular pattern) is slow. It is a consequence of having processor caches.
There are two basic approaches: copy the data into a new order that facilitates better access patterns, or do the computations in an order that allows better access patterns.
No, there are no rules of thumb or golden tricks that make it all just work. In fact, even comparing different implementations is tricky, because there are so many complex interactions (from cache latencies to operation interleaving, to cache and memory access patterns) that results are heavily dependent on the particular hardware and the dataset at hand.
Let's look at the typical example case, matrix-matrix multiplication. Let's say we multiply two 5×5 matrices (c = a × b), using standard C row-major data order:
c00 c01 c02 c03 c04 a00 a01 a02 a03 a04 b00 b01 b02 b03 b04
c05 c06 c07 c08 c09 a05 a06 a07 a08 a09 b05 b06 b07 b08 b09
c10 c11 c12 c13 c14 = a10 a11 a12 a13 a14 × b10 b11 b12 b13 b14
c15 c16 c17 c18 c19 a15 a16 a17 a18 a19 b15 b16 b17 b18 b19
c20 c21 c22 c23 c24 a20 a21 a22 a23 a24 b20 b21 b22 b23 b24
If we write the result as vertical SIMD vector registers with five components, we have
c00 a00 b00 a01 b05 a02 b10 a03 b15 a04 b20
c01 a00 b01 a01 b06 a02 b11 a03 b16 a04 b21
c02 = a00 × b02 + a01 × b07 + a02 × b12 + a03 × b17 + a04 × b22
c03 a00 b03 a01 b08 a02 b13 a03 b18 a04 b23
c04 a00 b04 a01 b09 a02 b14 a03 b19 a04 b24
c05 a05 b00 a06 b05 a07 b10 a08 b15 a09 b20
c06 a05 b01 a06 b06 a07 b11 a08 b16 a09 b21
c07 = a05 × b02 + a06 × b07 + a07 × b12 + a08 × b17 + a09 × b22
c08 a05 b03 a06 b08 a07 b13 a08 b18 a09 b23
c09 a05 b04 a06 b09 a07 b14 a08 b19 a09 b24
and so on. In other words, if c has the same order as b, we can use SIMD registers with consecutive memory contents for both c and b, and only gather a. Furthermore, the SIMD registers for a have all components the same value.
Note, however, that the b registers repeat for all five rows of c. So, it might be better to initialize c to zero, then do the additions with products having the same b SIMD registers:
c00 a00 b00 c05 a05 b00 c10 a10 b00 c15 a15 b00 c20 a20 b00
c01 a00 b01 c06 a05 b01 c11 a10 b01 c16 a15 b01 c21 a20 b01
c02 += a00 × b02, c07 += a05 × b02, c12 += a10 × b02, c17 += a15 × b02, c22 += a20 × b02
c03 a00 b03 c08 a05 b03 c13 a10 b03 c18 a15 b03 c23 a20 b03
c04 a00 b04 c09 a05 b04 c14 a10 b04 c19 a15 b04 c24 a20 b04
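A minimal sketch of that accumulation pattern with AVX2/FMA intrinsics, assuming row-major float storage and a column count that is a multiple of 8 (the function and parameter names are illustrative, not from the question):

#include <immintrin.h>

static void matmul_rowmajor(float *c, const float *a, const float *b,
                            int rows, int inner, int cols)
{
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j += 8)                      /* initialize c row to zero */
            _mm256_storeu_ps(&c[i*cols + j], _mm256_setzero_ps());
        for (int k = 0; k < inner; k++) {
            __m256 a_ik = _mm256_set1_ps(a[i*inner + k]);      /* broadcast one a element */
            for (int j = 0; j < cols; j += 8) {
                __m256 b_kj = _mm256_loadu_ps(&b[k*cols + j]); /* consecutive b elements */
                __m256 c_ij = _mm256_loadu_ps(&c[i*cols + j]);
                _mm256_storeu_ps(&c[i*cols + j], _mm256_fmadd_ps(a_ik, b_kj, c_ij));
            }
        }
    }
}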
If we transposed a first, then the SIMD vector registers for a would also get their values from consecutive memory locations. In fact, if a is large enough, linearizing the memory access pattern for a too gives a large enough speed boost that it is faster to do a transpose copy first (using uint32_t for floats, and uint64_t for doubles; i.e. not using SIMD or floating point at all for the transpose, but just copying the storage in transpose order).
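Such a transpose copy might look like the following sketch, which moves floats as uint32_t so neither SIMD nor the floating-point unit is involved (the dimensions and names are illustrative; pass a float buffer through memcpy or a union to respect aliasing rules):

#include <stdint.h>
#include <stddef.h>

/* dst is cols x rows, src is rows x cols; dst[j][i] = src[i][j] */
static void transpose_copy(uint32_t *dst, const uint32_t *src, size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            dst[j*rows + i] = src[i*cols + j];
}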
Note that the situation with column-major data order, i.e. data order transposed compared to above, is very similar. There is deep symmetry here. For example, if c and b have the same data order, and a the opposite data order, you can SIMD-vectorize the matrix product efficiently without having to copy any data. Only the summing differs, as that depends on the data order, and matrix multiplication is not commutative (a×b != b×a).
Obviously, a major wrinkle is that the SIMD vector registers have a fixed size, so instead of using a complete row as a register as in the example above, you can only use partial rows. (If the number of columns in the result is not a multiple of the SIMD register width, you have that partial vector to worry about as well.)
SSE and AVX have a relatively large number of registers (8, 16, or 32, depending on the set of extensions used), and depending on the particular processor type, may be able to perform some vector operations simultaneously, or at least hide latencies better if unrelated vector operations are interleaved. So even the choice of how wide a chunk to operate on at once, and whether that chunk is like an extended vector or more like a block submatrix, is up for discussion, testing, and comparison.
So how do I do the matrix-matrix multiplication most efficiently using SIMD?
Like I said, that depends on the data set. No easy answers, I'm afraid.
The main parameters (for choosing the most efficient approach) are the sizes and memory ordering of the multiplicand and result matrices.
It gets even more interesting, if you calculate the product of more than two matrices of different sizes. This is because the number of operations then depends on the order of the products.
Why are you so discouraging?
I'm not, actually. All of the above means that not too many people can handle this kind of complexity and stay sane and productive, so there are a lot of undiscovered approaches, and a lot to gain in real-world performance.
Even if we ignore the SIMD intrinsics compilers provide (<x86intrin.h> in this case), we can apply the logic above when designing internal data structures, so that the C compiler we use has the best opportunities for vectorizing the calculations for us. (They're not very good at it yet, though. Like I said, complex stuff. Some like Fortran better than C, because its expressions and rules make it easier for Fortran compilers to optimize and vectorize them.)
If this were simple or easy, the solutions would be well known by now. They aren't, because it is not. But that does not mean it is impossible or out of our reach; it only means that smart enough developers haven't yet put enough effort into unraveling it.

If you can run your algorithms over 8 (or 16, see footnote 1) columns in parallel, one regular AVX load can grab 8 columns of data into one vector. Then another load can grab the next row from all those columns.
This has the advantage that you never need to shuffle within a vector; everything is pure vertical and you have consecutive elements of each column in different vectors.
If this was a reduction like summing a column, you'd produce 8 results in parallel. If you're updating the columns as you go, then you're writing vectors of results for 8 columns at once, instead of a vector of 8 elements of one column.
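As a minimal sketch of that vertical layout, assuming a row-major float matrix whose column count is a multiple of 8 (the reduction shown here, a per-column sum, and the names are illustrative):

#include <immintrin.h>
#include <stddef.h>

static void column_sums(const float *data, float *sums, size_t rows, size_t cols)
{
    for (size_t j = 0; j < cols; j += 8) {              /* 8 columns at a time */
        __m256 acc = _mm256_setzero_ps();
        for (size_t i = 0; i < rows; i++)               /* walk down those columns */
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(&data[i*cols + j]));
        _mm256_storeu_ps(&sums[j], acc);                /* 8 column sums in one store */
    }
}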
Footnote 1: 16 float columns = 64 bytes = 1 full cache line = two AVX vectors or one AVX-512 vector. Reading/writing full cache lines at a time is a lot better than striding down one column at a time, although it is usually worse than accessing consecutive cache lines. Especially if your stride is larger than a 4k page, HW prefetching might not lock on to it very well.
Obviously make sure your data is aligned by 64 for this, with the row stride a multiple of 64 bytes, too. Pad the ends of rows if you need to.
Doing only 1 AVX vector (half a cache line) at a time would be bad if the first row will be evicted from L1d before you loop back to read the 2nd 32-byte vector of columns 8..15.
Other caveats:
4k aliasing can be a problem: A store and then a load from addresses that are a multiple of 4kiB apart aren't detected as non-overlapping right away, so the load is blocked by the store. This can massively reduce the amount of parallelism the CPU can exploit.
4k strides can also lead to conflict misses in the cache, if you're touching lots of lines that alias to the same set. So updating data in place might still have cache misses for the stores, because lines could be evicted after loading and processing, before the store is ready to commit. This is most likely to be a problem if your row stride is a large power of 2. If that ends up being a problem, maybe allocate more memory in that case and pad your rows with unused elements at the end, so the storage format never has a large power of 2 row stride.
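One way to apply that padding advice is to round the row stride up to a whole number of cache lines and nudge it off large powers of two when allocating; the exact policy below is an assumption, not taken from the answer:

#include <stdlib.h>

static float *alloc_padded(size_t rows, size_t cols, size_t *stride_out)
{
    size_t stride = (cols + 15) & ~(size_t)15;          /* multiple of 16 floats = 64 bytes */
    if (stride >= 1024 && (stride & (stride - 1)) == 0)
        stride += 16;                                   /* avoid a large power-of-2 row stride */
    *stride_out = stride;
    return aligned_alloc(64, rows * stride * sizeof(float));
}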
Adjacent-line prefetching in the L2 cache (on Intel CPUs) may try to complete the 128-byte-aligned pair of every line you touch, if there's spare bandwidth. This could end up evicting useful data, especially if you're close to aliasing and/or L2 capacity. But if you're not pushing into those limits, it's probably a good thing and will help when you loop over the next 16 columns.

The data should be stored one row after the other in memory.
Since C does not really care whether it is an array or a matrix, you can access the elements of a column with
for (int i = 0; i < rowcount; i++)
    column[i] = data[i*LENGTH + desired_column];   /* or store &data[i*LENGTH + desired_column] */
You can now store the values, or even better the addresses, and hand them to your worker function. If you use the addresses, the worker function changes the values in the matrix in place, so you don't need to write them back.

Related

Mapping specific bits in input bytes to specific bits in output word

Background: Given some input bytes B0, B1, B2, B3 and B4, I want to extract selected bits from these 5 bytes and generate an output word.
For example, denoting the nth bit of Bi as Bi[n], I want to be able to write a mapping f : (B0, B1, B2, B3, B4) → B2[4] B3[5] B3[4] B3[3] B3[2] B3[1] B0[5] B0[3] B0[1]. So f(0b11001, 0b01100, 0b10101, 0b10011, 0b11111) would return 0b010011101.
An expression in C that might do this exact example would be
((B2 & 8) << 5) | (B3 << 3) | ((B0 & 16) >> 2) | ((B0 & 4) >> 1) | (B0 & 1)
using naive bitmasks and bitshifts.
Question: Is there any way to simplify such an expression to minimize the number of operations that need to be carried out?
For example, I note that B3 is copied in its entirety to some of the bits of the output, so I put it in place using B3 << 3 instead of masking and shifting individual bits. The first thing I thought of were Karnaugh maps since they came in handy in simplifying Boolean expressions, but I realised that since I am extracting and placing individual bits in different parts of a byte there is no simplification possible using Boolean algebra.
Reasoning: The reason why I want to do this is to be able to light the LEDs in a programmer-friendly manner on the BBC micro:bit. I want B0 to B4 to represent which LEDs are on in the physical 5x5 arrangement, but electronically these LEDs are wired in a complex 3x9 configuration. More information on the LEDs can be found here.
Typically a pattern would be stored in memory according to the physical 3x9 arrangement so as to be able to output the pattern to the LEDs in a single instruction, but I want to be able to map a 5x5 pattern to the 3x9 pattern programmatically. However, an expression as shown above would require 5 load instructions, 9 bitwise AND/OR operations and 4 logical shifts, which is at least 9 times more inefficient than the normal method.
First consider how much each bit needs to be shifted (rather than merely its final position). You can then execute the required shift with one operation for multiple bits, for those groups of input bits where the shift is the same. For example, (B3 & 31) << 3. You might also be able to eliminate the "masking" (done with the bitwise AND, &) if the masked-out bits get shifted out anyway.
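A minimal sketch of that grouping idea, using the question's 1-indexed bit numbering (the function name is illustrative):

unsigned map_bits(unsigned B0, unsigned B2, unsigned B3)
{
    unsigned out = 0;
    out |= (B2 & 0x08) << 5;   /* B2[4]    -> output bit 9                            */
    out |= (B3 & 0x1F) << 3;   /* B3[5..1] -> output bits 8..4: five bits, one shift  */
    out |= (B0 & 0x10) >> 2;   /* B0[5]    -> output bit 3                            */
    out |= (B0 & 0x04) >> 1;   /* B0[3]    -> output bit 2                            */
    out |= (B0 & 0x01);        /* B0[1]    -> output bit 1                            */
    return out;
}

With the example inputs from the question, this returns 0b010011101.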

Clever way of adding an array to longer array at particular indices in Fortran?

I have two (1d) arrays, a long one A (size m) and a shorter one B (size n). I want to update the long array by adding each element of the short array at a particular index.
Schematically the arrays are structured like this,
A = [a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 ... am]
B = [ b1 b2 b3 b4 b5 b6 b7 b8 b9 ... bn ]
and I want to update A by adding the corresponding elements of B.
The most straightforward way is to have some index array indarray (same size as B) which tells us which index of A corresponds to B(i):
Option 1
do i = 1, size(B)
A(indarray(i)) = A(indarray(i)) + B(i)
end do
However, there is an organization to this problem which I feel should allow for some better performance:
There should be no barrier to doing this in a vectorized way, i.e. the updates for each i are independent and can be done in any order.
There is no need to jump back and forth in array A. The machine should know to just loop once through the arrays only updating A where necessary.
There should be no need for any temporary arrays.
What is the best way to do this in Fortran?
Option 2
One way might be using PACK, UNPACK, and a boolean mask M (same size as A) that serves the same purpose as indarray:
A = [a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 ... am]
B = [ b1 b2 b3 b4 b5 b6 b7 b8 b9 ... bn ]
M = [. T T T . T T . . T . T T T T . ]
(where T represents .true. and . is .false.).
And the code would just be
A = UNPACK(PACK(A, M) + B, M, A)
This is very concise and maybe satisfies (1) and sort of (2) (seems to do two loops through the arrays instead of just one). But I fear the machine will be creating a few temporary arrays in the process which seems unnecessary.
Option 3
What about using where with UNPACK?
where (M)
   A = A + UNPACK(B, M, 0.0d0)
end where
This seems about the same as option 2 (two loops and maybe creates temporary arrays). It also has to fill the M=.false. elements of the UNPACK'd array with 0's which seems like a total waste.
Option 4
In my situation the .true. elements of the mask will usually be in contiguous blocks (i.e. a few trues in a row, then a bunch of falses, then another block of trues, etc.). Maybe this could lead to something similar to option 1. Let's say there are K of these .true. blocks. I would need an array indstart (of size K) giving the index into A of the start of each true block, and an array blocksize (size K) with the length of each true block.
j = 1
do i = 1, size(indstart)
i0 = indstart(i)
i1 = i0 + blocksize(i) - 1
A(i0:i1) = A(i0:i1) + B(j:j+blocksize(i)-1)
j = j + blocksize(i)
end do
At least this only does one pass through the arrays. This code is also more explicit about the fact that there's no jumping back and forth within the arrays. But I don't think the compiler will be able to figure that out (blocksize could have negative values, for example), so this option probably won't vectorize.
--
Any thoughts on a nice way to do this? In my situation the arrays indarray, M, indstart, and blocksize would be created once but the adding operation must be done many times for different arrays A and B (though these arrays will have constant sizes). The where statement seems like it could be relevant.

stack representation of multidimensional array

Can someone explain to me the row- and column-wise representation of a 2-dimensional array on the stack? My teacher said that if we have the following matrix:
a00 a01 a02
a10 a11 a12
a20 a21 a22
Column wise representation: Row Wise representation:
a00 a00
a10 a01
a20 a02
a01 a10
a11 a11
a21 a12
a02 a20
a12 a21
a22 a22
Whereas I only know about the representation of a multidimensional array in memory:
a00, then a01, then a02, then a10, and so on (in increasing order of addresses)
I raised this question in class: what is the difference between the stack representation and the memory representation of multidimensional arrays? She said we are doing 2-D arrays here, not pointers. What kind of answer is that? Please explain this to me.
She also gave a formula to calculate the address of any element of a 2-D array in row or column representation on the stack. I didn't understand it.
Location(A[j,k]) = Base_address(A) + W(M(k-1)+(j-1))
You said,
Whereas I only know about the representation of a multidimensional array in memory: a00, then a01, then a02, then a10, and so on (in increasing order of addresses)
In C/C++, multidimensional arrays are stored using the row representation.
IIRC, in FORTRAN, multidimensional arrays are stored using the column representation.
In C, you can define a 2D array as:
int a[10][3];
When you pass the array to a function, it decays to a pointer of type int (*)[3].
Disclaimer: My FORTRAN is rusty, so pardon any use of incorrect syntax
In FORTRAN, you can define a 2D array as:
INTEGER A(10, 3)
When you pass the array to a function, the argument type in the function looks like
INTEGER A(10, *)
The differences in syntax make it more natural for multidimensional arrays in C to be represented by rows, while in FORTRAN it seems natural for them to be represented by columns.
You also said:
Location(A[j,k]) = Base_address(A) + W(M(k-1)+(j-1))
It seems you are using 1-based indexing. I am not sure what W and M stand for.
Let's say you have ROW number of rows and COL number of columns.
If you have row representation:
Location(A[j,k]) = Base_address(A) + (j-1)*COL + (k-1)
If you have column representation:
Location(A[j,k]) = Base_address(A) + (k-1)*ROW + (j-1)
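As a tiny illustration of those two formulas in C, with 1-based j for the row and k for the column and an element width of 1 (the 3x3 matrix and names are just for the example):

#include <stdio.h>

#define ROW 3
#define COL 3

int row_major_offset(int j, int k)    { return (j - 1) * COL + (k - 1); }
int column_major_offset(int j, int k) { return (k - 1) * ROW + (j - 1); }

int main(void)
{
    /* element a12, i.e. row 2, column 3 in 1-based terms */
    printf("row-major offset:    %d\n", row_major_offset(2, 3));      /* prints 5 */
    printf("column-major offset: %d\n", column_major_offset(2, 3));   /* prints 7 */
    return 0;
}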
Here's a better representation of your 2D array in the RAM:
Column wise representation:
Chip1 Chip2 Chip3
a00 a01 a02
a10 a11 a12
a20 a21 a22
Row Wise representation:
Chip1 Chip2 Chip3
a00 a10 a20
a01 a11 a21
a02 a12 a22

inverse a number

As we go up the musical scale the note frequency increases:
#define A4 440 // These are the frequencies of the notes in hertz
#define AS4 466
#define B4 494
#define C5 523
#define CS5 554
#define D5 587
I am generating the tones mechanically, I tell a step motor to step, delay, step, delay etc etc very quickly.
The longer the delay between steps, the lower the note. Is there some smart maths I could use to invert the frequencies so that as I climb up the scale the numbers come out lower and lower?
This way I could use the frequencies to help calculate the correct delay to generate a note.
So what you're saying is you want the numbers to represent the time between steps rather than a frequency?
440 Hz means 440 cycles/second. What you want is the number of seconds/cycle (i.e. time between steps). That's just 1 / <frequency>. That means all you have to do is define your values as 1/440, 1/466, etc. (or, if you want the values to be milliseconds, 1000/440, 1000/466 etc.)
If that is too fast (or doesn't match the actual notes), you can multiply each value by a scale factor and the relationships between the audible tones should remain the same.
For example, let's say that you empirically discover that for your machine to make an "A4" tone, the delay between steps is 10 milliseconds. To figure out the scale factor, solve for x:
x / 440 = 10
x = 4400
So define scale = 4400, and define each of your notes as scale / 440, scale / 466 etc.
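A minimal sketch of that idea (the 4400 figure is the example from this answer; note that integer division truncates, so use floating point if you need more precision):

#define SCALE 4400

#define A4_DELAY  (SCALE / 440)   /* 10 ms */
#define AS4_DELAY (SCALE / 466)   /*  9 ms after integer truncation */
#define B4_DELAY  (SCALE / 494)   /*  8 ms */
#define C5_DELAY  (SCALE / 523)   /*  8 ms */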
Yes, that sounds possible! Let's have a look... (some of this you will know but I'll post it anyway)
In what's called an equal tempered scale, you can calculate Hertz values by multiplying by the twelfth root of two for every semitone you go up. There are 12 semitones in a whole octave, and multiplying by this value twelve times doubles the frequency, which raises the tone by an octave.
So, if you wanted to calculate descending semitone frequencies from e.g. A 440, you can calculate double x = pow(2.0, 1.0/12.0) (assuming C), and then repeatedly divide by that value (remember to do the divisions as doubles not ints :) ) and then you'll get your descending scale.
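For example, a small sketch of that descending calculation, starting from A 440 and also printing the corresponding step delay in milliseconds:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double semitone = pow(2.0, 1.0 / 12.0);   /* twelfth root of two */
    double freq = 440.0;                      /* A4 */
    for (int i = 0; i < 12; i++) {
        printf("%7.2f Hz  (delay %.3f ms)\n", freq, 1000.0 / freq);
        freq /= semitone;                     /* one semitone down */
    }
    return 0;
}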
Aside: If you want to do a major scale rather than a chromatic (semitone) scale, this is the pattern of tones and semitones to use: (e.g. in C Major - using T for Tone, S for semitone)
C [T] D [T] E [S] F [T] G [T] A [T] B [S] C

Genetics algorithms theoretical question

I'm currently reading "Artificial Intelligence: A Modern Approach" (Russell+Norvig) and "Machine Learning" (Mitchell) - and trying to learn basics of AINN.
In order to understand a few basic things, I have two 'greenhorn' questions:
Q1: In a genetic algorithm given the two parents A and B with the chromosomes 001110 and 101101, respectively, which of the following offspring could have resulted from a one-point crossover?
a: 001101
b: 001110
Q2: Which of the above offspring could have resulted from a two-point crossover? and why?
Please advise.
It is not possible to find parents if you do not know the inverse-crossover function (so that AxB => (a,b) & (any a) => (A,B)).
Usually the 1-point crossover function is:
a = A1 + B2
b = B1 + A2
Even if you know a and b you cannot solve the system (system of 2 equations with 4 variables).
If you know any 2 parts of either A and/or B, then it can be solved (a system of 2 equations with 2 variables). This is the case for your question, as you provide both A and B.
Generally, the crossover function does not have an inverse function, and you just need to find the solution logically or, if you know the parents, perform the crossover and compare.
So to make a generic formula for you we should know 2 things:
Crossover function.
Inverse-crossover function.
The 2nd one is not usually used in GAs as it is not required.
Now, I'll just answer your questions.
Q1: In a genetic algorithm given the two parents A and B with the chromosomes 001110 and 101101, respectively, which of the following offspring could have resulted from a one-point crossover?
Looking at the a and b I can see the crossover point is here:
1 2
A: 00 | 1110
B: 10 | 1101
Usually the crossover is done using this formula:
a = A1 + B2
b = B1 + A2
so that possible children are:
a: 00 | 1101
b: 10 | 1110
which excludes option b from the question.
So the answer to Q1 is that the resulting child is a: 001101, assuming the given crossover function.
Q2: Which of the above offspring could have resulted from a two-point crossover? And why?
Looking at the a and b I can see the crossover points can be here:
1 2 3
A: 00 | 11 | 10
B: 10 | 11 | 01
Usual formula for 2-point crossover is:
a = A1 + B2 + A3
b = B1 + A2 + B3
So the children would be:
a = 00 | 11 | 10
b = 10 | 11 | 01
Comparing them to the options you asked about (lowercase a and b), we can state the answer:
Q2. A: Neither a nor b could be the result of a 2-point crossover of A×B according to the given crossover function.
Again it is not possible to answer your questions without knowing the crossover function.
The functions I provided are common in GAs, but you can invent many others that could also answer the question (see the comment below):
One-point crossover is when you make one join from each parent; two-point crossover is when you make two joins, i.e. two segments from one parent and one from the other.
See crossover (wikipedia) for further info.
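As a small sketch, here is the one-point crossover convention used above (a = A1 + B2, b = B1 + A2) applied to the question's chromosomes as character strings; the function and variable names are illustrative:

#include <stdio.h>
#include <string.h>

static void one_point(const char *A, const char *B, size_t cut, char *a, char *b)
{
    size_t len = strlen(A);
    memcpy(a, A, cut); memcpy(a + cut, B + cut, len - cut); a[len] = '\0';
    memcpy(b, B, cut); memcpy(b + cut, A + cut, len - cut); b[len] = '\0';
}

int main(void)
{
    char a[7], b[7];
    one_point("001110", "101101", 2, a, b);   /* crossover point after bit 2, as above */
    printf("a = %s, b = %s\n", a, b);         /* prints a = 001101, b = 101110 */
    return 0;
}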
Regarding Q1, (a) could have been produced by a one-point crossover, taking the first three bits from parent A and the last three bits from parent B. (b) could not, unless your crossover algorithm allows for null contributions, i.e. parent contributions of null weight. In that case, parent A could contribute its full chromosome (bits 0-5) and parent B would contribute nil, yielding (b).
Regarding Q2, both (a) and (b) are possible. There are a few combinations to test; too tedious to write, but you can do the work with pen and paper. :-)
