3-D Loop comparison in 7-pt Stencil - c

I carry out a 7-pt stencil update on two 3-D domains. The first one is 258x130x258 and the second one is 130x258x258. Both of them have the same number of elements being updated. In C they are represented as contiguous arrays: a1[258][130][258] and x1[130][258][258]. Simply stated, their x- and y-dimensions are exchanged, but the z-dimension (the fastest-changing index) is the same.
Loop 1:
for (i = 1; i <= 256; i++)
    for (j = 1; j <= 128; j++)
        for (k = 1; k <= 256; k++)
            a1[i][j][k] = alpha * b1[i][j][k] + (Omega_6) * (b1[i-1][j][k] + b1[i+1][j][k] +
                                                             b1[i][j-1][k] + b1[i][j+1][k] +
                                                             b1[i][j][k-1] + b1[i][j][k+1] +
                                                             c1[i][j][k] * H);
Loop 2:
for (i = 1; i <= 128; i++)
    for (j = 1; j <= 256; j++)
        for (k = 1; k <= 256; k++)
            x1[i][j][k] = alpha * y1[i][j][k] + (Omega_6) * (y1[i-1][j][k] + y1[i+1][j][k] +
                                                             y1[i][j-1][k] + y1[i][j+1][k] +
                                                             y1[i][j][k-1] + y1[i][j][k+1] +
                                                             z1[i][j][k] * H);
a1, b1, c1 all have the same dimensions, and x1, y1, z1 have the same dimensions. alpha and Omega_6 are constants. Loop 1 runs 0.5 seconds faster than Loop 2. Why does this happen?
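For reference, here is the address arithmetic implied by the two declarations (my own sketch, not part of the original question; offsets are in elements):

#include <stddef.h>

size_t idx1(size_t i, size_t j, size_t k) { return (i*130 + j)*258 + k; } /* layout of a1, b1, c1 */
size_t idx2(size_t i, size_t j, size_t k) { return (i*258 + j)*258 + k; } /* layout of x1, y1, z1 */
/* In Loop 1 the i-1/i+1 neighbours are 130*258 = 33540 elements apart,  */
/* while in Loop 2 they are 258*258 = 66564 elements apart; the j and k  */
/* strides (258 and 1) are identical in both loops.                      */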

Related

Intrinsics load matrix

I'm learning intrinsics. I don't know how to load a matrix correctly. I want to do matrix multiplication.
This is my code:
int i, j, k;
__m128 mat2values = _mm_setzero_ps();
__m128 mat1values = _mm_setzero_ps();
__m128 r = _mm_setzero_ps();
for (i = 0; i < N; ++i)
{
    for (j = 0; j < N - 3; j += 4)
    {
        for (k = 0; k < N - 3; k += 4)
        {
            mat1values = _mm_load_ps(&mat1[i][k]);
            mat2values = _mm_load_ps(&mat2[k][j]);
            r = _mm_add_ps(r, _mm_mul_ps(mat1values, mat2values));
        }
        result[i][j] = r.m128_f32[0] + r.m128_f32[1] + r.m128_f32[2] + r.m128_f32[3];
        for (; k < N; k++)
            result[i][j] += mat1[i][j] * mat2[k][j];
    }
}
When debugging, result still holds all 0 values after the loop.
Are you sure the expression
_mm_load_ps(mat1[i][k])
yields the correct memory address as a float*?
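For comparison, here is a minimal sketch of a working SSE matrix multiplication that sidesteps the horizontal-sum approach entirely: it broadcasts one element of mat1 and accumulates four result columns at a time. It assumes N is a multiple of 4, row-major float arrays, and 16-byte-aligned rows (otherwise use _mm_loadu_ps/_mm_storeu_ps); it is an illustration, not the original poster's code.

#include <xmmintrin.h>

void matmul_sse(int N, float mat1[][N], float mat2[][N], float result[][N])
{
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; j += 4) {
            __m128 acc = _mm_setzero_ps();           /* accumulates result[i][j..j+3] */
            for (int k = 0; k < N; ++k) {
                __m128 a = _mm_set1_ps(mat1[i][k]);  /* broadcast one element of row i of mat1 */
                __m128 b = _mm_load_ps(&mat2[k][j]); /* four consecutive elements of row k of mat2 */
                acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
            }
            _mm_store_ps(&result[i][j], acc);
        }
    }
}

Note that the accumulator is reset for every output group, which also avoids the problem of r carrying values across iterations of j in the original code.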

Divide an array into subarrays so that sum of product of their length and XOR is minimum

We have an array of n numbers. We need to divide it into M subarrays such that the total cost is minimum.
Cost = (XOR of subarray) * (length of subarray), summed over all subarrays.
Eg:
array = [11,11,11,24,26,100]
M = 3
OUTPUT => 119
Explanation:
Dividing into subarrays as => [11], [11,11,24,26], [100]
As 11*1 + (11^11^24^26)*4 + 100*1 = 119 is the minimum value.
Eg2: array = [12,12]
M = 1
output: 0
As [12,12] is the only way, and (12^12)*2 = 0.
You can solve this problem by using dynamic programming.
Let's define dp[i][j]: the minimum cost for solving this problem when you only have the first i elements of the array and you want to split (partition) them into j subarrays.
dp[i][j] = min over k in [1, i] of ( dp[k-1][j-1] + (a[k] ^ a[k+1] ^ ... ^ a[i]) * (i - k + 1) ), i.e. the cost of the last subarray plus the cost of partitioning the rest of the array into j-1 subarrays.
This is my solution which runs in O(m * n^2):
#include <bits/stdc++.h>
using namespace std;

const int MAXN = 1000 + 10;
const int MAXM = 1000 + 10;
const long long INF = 1e18 + 10;

int n, m, a[MAXN];
long long dp[MAXN][MAXM];

int main() {
    cin >> n >> m;
    for (int i = 1; i <= n; i++) {
        cin >> a[i];
    }
    // start of initialization
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= m; j++)
            dp[i][j] = INF;
    dp[0][0] = 0;
    // end of initialization
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int last_subarray_xor = 0, last_subarray_length = 0;
            for (int k = i; k >= 1; k--) {
                last_subarray_xor ^= a[k];
                last_subarray_length = i - k + 1;
                dp[i][j] = min(dp[i][j], dp[k - 1][j - 1] + (long long)last_subarray_xor * (long long)last_subarray_length);
            }
        }
    }
    cout << dp[n][m] << endl;
    return 0;
}
Sample input:
6 3
11 11 11 24 26 100
Sample output:
119
One of the simplest classic dynamic programming problems is "0-1 Knapsack", which is described on Wikipedia.

Maximum sized square sub-matrix

I have a matrix of size N*M filled with 0's and 1's.
For each query K, I have to answer the maximum size of a square sub-matrix in which min(number of 1's, number of 0's) = K, where 1 <= K <= 10^9. For example, consider the following matrix of size 8*8:
10000000
01000000
00000000
00000000
00000000
00000000
00000000
00000000
k= 1 answer= 7
k=2 answer= 8
k=0 answer= 6
k=1001 answer= 8
I understood that for k=1 the sub-matrix (1,1) to (7,7) works, and for k=2 the largest square sub-matrix is the original matrix itself.
For k=1, we have to generate all the 7*7 square sub-matrices, find their min(no. of 1's, no. of 0's), and then take the minimum of all those as the answer.
I am not able to generate all the square sub-matrices. Can anyone help me with that? Also, if a shorter way is available, that would be good as well, because this approach takes a lot of time.
Is this an interview question? This problem is very similar to that of the maximum submatrix sum (https://www.geeksforgeeks.org/maximum-sum-rectangle-in-a-2d-matrix-dp-27/), whose DP solution you should be able to adapt for this.
EDIT:
The following is O(n^3) time, O(n^2) memory.
The important piece to realize is that the area of D = entire area - B - C + A:
| A B |
| C D |
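In prefix-sum terms (restating the diagram with the dp array built below): the number of ones in the square with top-left corner (i, j) and bottom-right corner (i2, j2) is dp[i2][j2] - dp[i-1][j2] - dp[i2][j-1] + dp[i-1][j-1], where the dp[i-1][...] and dp[...][j-1] terms are dropped when i or j is 0.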
#include <stdlib.h>
#include <stdio.h>

void create_dp(int **matrix, int **dp, int row, int col) {
    dp[0][0] = matrix[0][0];
    for (int i = 1; i < row; ++i)
        dp[i][0] = matrix[i][0] + dp[i - 1][0];
    for (int j = 1; j < col; ++j)
        dp[0][j] = matrix[0][j] + dp[0][j - 1];
    for (int i = 1; i < row; ++i)
        for (int j = 1; j < col; ++j)
            dp[i][j] = dp[i - 1][j] + dp[i][j - 1] + matrix[i][j] - dp[i - 1][j - 1];
}

int min(int x, int y) {
    if (x > y) return y;
    return x;
}

int max_square_submatrix(int **matrix, int row, int col, int query) {
    // the value dp[i][j] is the sum of all values in matrix up to i, j
    // i.e. dp[1][1] = matrix[0][0] + matrix[1][0] + matrix[0][1] + matrix[1][1]
    int **dp = malloc(sizeof(int*) * row);
    for (int i = 0; i < row; ++i) dp[i] = malloc(sizeof(int) * col);
    create_dp(matrix, dp, row, col);
    int global_max_size = 0;
    // go through all squares in matrix
    for (int i = 0; i < row; ++i) {
        for (int j = 0; j < col; ++j) {
            // begin creating square matrices
            // this is the largest size a square matrix could have
            int max_size = min(row - i, col - j) - 1;
            for (; max_size >= 0; --max_size) {
                // you need to see above diagram in order to visualize this step
                int num_ones = dp[i + max_size][j + max_size];
                if (i > 0 && j > 0)
                    num_ones += -dp[i + max_size][j - 1] - dp[i - 1][j + max_size] + dp[i - 1][j - 1];
                else if (j > 0)
                    num_ones += -dp[i + max_size][j - 1];
                else if (i > 0)
                    num_ones += -dp[i - 1][j + max_size];
                if (num_ones <= query) break;
            }
            if (global_max_size < max_size + 1) global_max_size = max_size + 1;
        }
    }
    // free dp memory
    for (int i = 0; i < row; ++i) free(dp[i]);
    free(dp);
    return global_max_size;
}

int main() {
#define N 8
#define M 8
    int **matrix = malloc(sizeof(int*) * N);
    for (int i = 0; i < N; ++i) matrix[i] = malloc(sizeof(int) * M);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < M; ++j)
            matrix[i][j] = 0;
    matrix[0][0] = matrix[1][1] = 1;
    printf("%d\n", max_square_submatrix(matrix, 8, 8, 1));
    printf("%d\n", max_square_submatrix(matrix, 8, 8, 2));
    printf("%d\n", max_square_submatrix(matrix, 8, 8, 0));
    printf("%d\n", max_square_submatrix(matrix, 8, 8, 1001));
}

Optimization of C code

For an assignment in a course called High Performance Computing, I am required to optimize the following code fragment:
int foobar(int a, int b, int N)
{
    int i, j, k, x, y;
    x = 0;
    y = 0;
    k = 256;
    for (i = 0; i <= N; i++) {
        for (j = i + 1; j <= N; j++) {
            x = x + 4*(2*i+j)*(i+2*k);
            if (i > j){
                y = y + 8*(i-j);
            } else {
                y = y + 8*(j-i);
            }
        }
    }
    return x;
}
Using some recommendations, I managed to optimize the code (or at least I think so), applying techniques such as:
Constant Propagation
Algebraic Simplification
Copy Propagation
Common Subexpression Elimination
Dead Code Elimination
Loop Invariant Removal
Bitwise shifts instead of multiplication, as they are less expensive.
Here's my code:
int foobar(int a, int b, int N) {
    int i, j, x, y, t;
    x = 0;
    y = 0;
    for (i = 0; i <= N; i++) {
        t = i + 512;
        for (j = i + 1; j <= N; j++) {
            x = x + ((i<<3) + (j<<2))*t;
        }
    }
    return x;
}
According to my instructor, well-optimized code should have fewer or less costly instructions at the assembly level, and should therefore run in less time than the original code, i.e. the estimate is:
execution time = instruction count * cycles per instruction
When I generate assembly code using the command gcc -o code_opt.s -S foobar.c, the generated code has many more lines than the original despite the optimizations I made, and the run time is lower, but not by as much as I expected. What am I doing wrong?
I do not paste the assembly code, as both versions are very long. I'm calling the function foobar from main and measuring the execution time using the time command in Linux:
int main () {
    int a, b, N;
    scanf ("%d %d %d", &a, &b, &N);
    printf ("%d\n", foobar (a, b, N));
    return 0;
}
Initially:
for (i = 0; i <= N; i++) {
    for (j = i + 1; j <= N; j++) {
        x = x + 4*(2*i+j)*(i+2*k);
        if (i > j){
            y = y + 8*(i-j);
        } else {
            y = y + 8*(j-i);
        }
    }
}
Removing y calculations:
for (i = 0; i <= N; i++) {
    for (j = i + 1; j <= N; j++) {
        x = x + 4*(2*i+j)*(i+2*k);
    }
}
Splitting i, j, k:
for (i = 0; i <= N; i++) {
    for (j = i + 1; j <= N; j++) {
        x = x + 8*i*i + 16*i*k ;   // multiple of 1 (no j)
        x = x + (4*i + 8*k)*j ;    // multiple of j
    }
}
Moving them externally (and removing the loop that runs N-i times):
for (i = 0; i <= N; i++) {
    x = x + (8*i*i + 16*i*k) * (N-i) ;
    x = x + (4*i + 8*k) * ((N*N+N)/2 - (i*i+i)/2) ;
}
Rewriting:
for (i = 0; i <= N; i++) {
    x = x + ( 8*k*(N*N+N)/2 ) ;
    x = x + i * ( 16*k*N + 4*(N*N+N)/2 + 8*k*(-1/2) ) ;
    x = x + i*i * ( 8*N + 16*k*(-1) + 4*(-1/2) + 8*k*(-1/2) );
    x = x + i*i*i * ( 8*(-1) + 4*(-1/2) ) ;
}
Rewriting, recalculating:
for (i = 0; i <= N; i++) {
    x = x + 4*k*(N*N+N) ;                        // multiple of 1
    x = x + i * ( 16*k*N + 2*(N*N+N) - 4*k ) ;   // multiple of i
    x = x + i*i * ( 8*N - 20*k - 2 ) ;           // multiple of i^2
    x = x + i*i*i * ( -10 ) ;                    // multiple of i^3
}
Another move to external (and removal of the i loop):
x = x + ( 4*k*(N*N+N) ) * (N+1) ;
x = x + ( 16*k*N + 2*(N*N+N) - 4*k ) * ((N*(N+1))/2) ;
x = x + ( 8*N - 20*k - 2 ) * ((N*(N+1)*(2*N+1))/6);
x = x + (-10) * ((N*N*(N+1)*(N+1))/4) ;
Both the above loop removals use the summation formulas:
Sum(1,   i = 0..n) = n+1
Sum(i^1, i = 0..n) = n(n + 1)/2
Sum(i^2, i = 0..n) = n(n + 1)(2n + 1)/6
Sum(i^3, i = 0..n) = n^2(n + 1)^2/4
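For convenience, the four terms above can be wrapped into a single function (my sketch; it uses long long and hard-codes k = 256 to sidestep the int overflow the original code runs into for large N):

long long foobar_closed(long long N) {
    const long long k = 256;
    long long x = 0;
    x += ( 4*k*(N*N+N) ) * (N+1);
    x += ( 16*k*N + 2*(N*N+N) - 4*k ) * ((N*(N+1))/2);
    x += ( 8*N - 20*k - 2 ) * ((N*(N+1)*(2*N+1))/6);
    x += (-10) * ((N*N*(N+1)*(N+1))/4);
    return x;
}

For small N this matches the original double loop exactly; for example, N = 1 gives 2048 from both.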
y does not affect the final result of the code - removed:
int foobar(int a, int b, int N)
{
    int i, j, k, x, y;
    x = 0;
    //y = 0;
    k = 256;
    for (i = 0; i <= N; i++) {
        for (j = i + 1; j <= N; j++) {
            x = x + 4*(2*i+j)*(i+2*k);
            //if (i > j){
            //    y = y + 8*(i-j);
            //}else{
            //    y = y + 8*(j-i);
            //}
        }
    }
    return x;
}
k is simply a constant:
int foobar(int a, int b, int N)
{
    int i, j, x;
    x = 0;
    for (i = 0; i <= N; i++) {
        for (j = i + 1; j <= N; j++) {
            x = x + 4*(2*i+j)*(i+2*256);
        }
    }
    return x;
}
The inner expression can be transformed to: x += 8*i*i + 4096*i + 4*i*j + 2048*j. Use math to push all of them to the outer loop: x += 8*i*i*(N-i) + 4096*i*(N-i) + 2*i*(N-i)*(N+i+1) + 1024*(N-i)*(N+i+1).
You can expand the above expression and apply the sum of squares and sum of cubes formulas to obtain a closed-form expression, which should run faster than the doubly nested loop. I leave it as an exercise for you. As a result, i and j will also be removed.
a and b should also be removed if possible, since a and b are supplied as arguments but never used in your code.
Sum of squares and sum of cubes formulas:
Sum(x^2, x = 1..n) = n(n + 1)(2n + 1)/6
Sum(x^3, x = 1..n) = n^2(n + 1)^2/4
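For concreteness, the intermediate single-loop form described above would look something like this (my sketch; deriving the fully closed form from it is the exercise left to the reader):

int foobar(int a, int b, int N) {
    long long x = 0;   /* wider accumulator: the per-i terms overflow int quickly */
    for (long long i = 0; i <= N; i++)
        x += 8*i*i*(N-i) + 4096*i*(N-i)
           + 2*i*(N-i)*(N+i+1) + 1024*(N-i)*(N+i+1);
    return (int)x;
}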
This function is equivalent to the following formula, which contains only 4 integer multiplications and 1 integer division:
x = N * (N + 1) * (N * (7 * N + 8187) - 2050) / 6;
To get this, I simply typed the sum calculated by your nested loops into Wolfram Alpha:
sum (sum (8*i*i+4096*i+4*i*j+2048*j), j=i+1..N), i=0..N
Here is the direct link to the solution. Think before coding. Sometimes your brain can optimize code better than any compiler.
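As a quick sanity check (my addition, not part of the answer), the formula can be compared against the nested loops for small N; long long is used to avoid the int overflow the original code would hit:

#include <stdio.h>

int main(void) {
    for (long long N = 0; N <= 100; N++) {
        long long loops = 0;
        for (long long i = 0; i <= N; i++)
            for (long long j = i + 1; j <= N; j++)
                loops += 8*i*i + 4096*i + 4*i*j + 2048*j;   /* expanded summand from above */
        long long formula = N * (N + 1) * (N * (7*N + 8187) - 2050) / 6;
        if (loops != formula)
            printf("mismatch at N=%lld: %lld vs %lld\n", N, loops, formula);
    }
    return 0;
}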
Briefly scanning the first routine, the first thing you notice is that expressions involving "y" are completely unused and can be eliminated (as you did). This further permits eliminating the if/else (as you did).
What remains is the two for loops and the messy expression. Factoring out the pieces of that expression that do not depend on j is the next step. You removed one such expression, but (i<<3) (i.e., i * 8) remains in the inner loop and can be removed as well.
Pascal's answer reminded me that you can use a loop stride optimization. First move (i<<3) * t out of the inner loop (call it i1), then calculate, when initializing the loop, a value j1 that equals (i<<2) * t. On each iteration increment j1 by 4 * t (which is a pre-calculated constant). Replace your inner expression with x = x + i1 + j1;.
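A minimal sketch of that stride idea, using the i1 and j1 names suggested above (my illustration of the described transformation, not tested against the original assignment):

int foobar(int a, int b, int N) {
    int x = 0;
    for (int i = 0; i <= N; i++) {
        int t  = i + 512;
        int i1 = (i << 3) * t;        /* hoisted: does not depend on j */
        int t4 = t << 2;              /* the pre-calculated constant 4 * t */
        int j1 = (i + 1) * t4;        /* (j << 2) * t for the first value of j */
        for (int j = i + 1; j <= N; j++) {
            x = x + i1 + j1;
            j1 += t4;                 /* advance j1 instead of re-multiplying */
        }
    }
    return x;
}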
One suspects that there may be some way to combine the two loops into one, with a stride, but I'm not seeing it offhand.
A few other things I can see. You don't need y, so you can remove its declaration and initialisation.
Also, the values passed in for a and b aren't actually used, so you could use these as local variables instead of x and t.
Also, rather than adding i to 512 each time through you can note that t starts at 512 and increments by 1 each iteration.
int foobar(int a, int b, int N) {
    int i, j;
    a = 0;
    b = 512;
    for (i = 0; i <= N; i++, b++) {
        for (j = i + 1; j <= N; j++) {
            a = a + ((i<<3) + (j<<2))*b;
        }
    }
    return a;
}
Once you get to this point you can also observe that, aside from initialising j, i and j are each only used in a single multiple - i<<3 and j<<2. We can code this directly in the loop logic, thus:
int foobar(int a, int b, int N) {
    int i, j, iLimit, jLimit;
    a = 0;
    b = 512;
    iLimit = N << 3;
    jLimit = N << 2;
    for (i = 0; i <= iLimit; i += 8) {
        for (j = (i >> 1) + 4; j <= jLimit; j += 4) {
            a = a + (i + j)*b;
        }
        b++;
    }
    return a;
}
OK... so here is my solution, along with inline comments to explain what I did and how.
int foobar(int N)
{   // We eliminate unused arguments
    int x = 0, i = 0, i2 = 0, j, k, z;
    // We only iterate up to N on the outer loop, since the
    // last iteration doesn't do anything useful. Also we keep
    // track of '2*i' (which is used throughout the code) by a
    // second variable 'i2' which we increment by two in every
    // iteration, essentially converting multiplication into addition.
    while (i < N)
    {
        // We hoist the calculation '4 * (i+2*k)' out of the loop
        // since k is a literal constant and 'i' is a constant during
        // the inner loop. We could convert the multiplication by 2
        // into a left shift, but hey, let's not go *crazy*!
        //
        // (4 * (i+2*k)) <=>
        // (4 * i) + (4 * 2 * k) <=>
        // (2 * i2) + (8 * k) <=>
        // (2 * i2) + (8 * 256) <=>
        // (2 * i2) + 2048
        k = (2 * i2) + 2048;
        // We have now converted the expression:
        // x = x + 4*(2*i+j)*(i+2*k);
        //
        // into the expression:
        // x = x + (i2 + j) * k;
        //
        // Counterintuitively we now *expand* the formula into:
        // x = x + (i2 * k) + (j * k);
        //
        // Now observe that (i2 * k) is a constant inside the inner
        // loop which we can calculate only once here. Also observe
        // that it is simply added into x a total of (N - i) times, so
        // we take advantage of the abelian nature of addition
        // to hoist it completely out of the loop.
        x = x + (i2 * k) * (N - i);
        // Observe that inside this loop we calculate (j * k) repeatedly,
        // and that j is just an increasing counter. So now instead of
        // doing numerous multiplications, let's break the operation into
        // two parts: a multiplication, which we hoist out of the inner
        // loop, and additions, which we continue performing in the inner
        // loop.
        z = i * k;
        for (j = i + 1; j <= N; j++)
        {
            z = z + k;
            x = x + z;
        }
        i++;
        i2 += 2;
    }
    return x;
}
The code, without any of the explanations, boils down to this:
int foobar(int N)
{
    int x = 0, i = 0, i2 = 0, j, k, z;
    while (i < N)
    {
        k = (2 * i2) + 2048;
        x = x + (i2 * k) * (N - i);
        z = i * k;
        for (j = i + 1; j <= N; j++)
        {
            z = z + k;
            x = x + z;
        }
        i++;
        i2 += 2;
    }
    return x;
}
I hope this helps.
int foobar(int N)  // unused arguments a and b removed
{
    int i, j, x = 0;                       // unused variable and operations removed to save stack space and machine cycles
    for (i = N; i--; )                     // avoid an unnecessary comparison condition
        for (j = N + 1; --j > i; )
            x += (((i<<1)+j)*(i+512)<<2);  // use shifts instead of multiplies to save machine cycles
    return x;
}

Cache utilization in matrix transpose in c

This code transposes a matrix in four ways. The first does sequential writes and non-sequential reads. The second is the opposite. The next two are the same, but with cache-skipping (streaming) writes. What seems to happen is that sequential writes are faster, and skipping the cache is faster. What I don't understand is: if the cache is being skipped, why are sequential writes still faster?
QueryPerformanceCounter(&before);
for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
        tmp[i][j] = mul2[j][i];
QueryPerformanceCounter(&after);
printf("Transpose 1:\t%ld\n", after.QuadPart - before.QuadPart);

QueryPerformanceCounter(&before);
for (j = 0; j < N; ++j)
    for (i = 0; i < N; ++i)
        tmp[i][j] = mul2[j][i];
QueryPerformanceCounter(&after);
printf("Transpose 2:\t%ld\n", after.QuadPart - before.QuadPart);

QueryPerformanceCounter(&before);
for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
        _mm_stream_si32(&tmp[i][j], mul2[j][i]);
QueryPerformanceCounter(&after);
printf("Transpose 3:\t%ld\n", after.QuadPart - before.QuadPart);

QueryPerformanceCounter(&before);
for (j = 0; j < N; ++j)
    for (i = 0; i < N; ++i)
        _mm_stream_si32(&tmp[i][j], mul2[j][i]);
QueryPerformanceCounter(&after);
printf("Transpose 4:\t%ld\n", after.QuadPart - before.QuadPart);
EDIT: The output is
Transpose 1: 47603
Transpose 2: 92449
Transpose 3: 38340
Transpose 4: 69597
The CPU has a write-combining buffer that combines writes to a cache line so that they happen in one burst. In this case (the cache being skipped for sequential writes), this write-combining buffer acts as a one-line cache, which makes the results very similar to the cache not being skipped.
To be exact, when the cache is skipped, writes still happen in bursts to memory.
See write-combining logic behavior here.
You could try a non-linear memory layout for the matrix to improve cache utilization. With 4x4 tiles of 32-bit floats, one could do the transpose with only a single access to each cache line. As a bonus, tile transposes can be done easily with _MM_TRANSPOSE4_PS.
Transposing a very large matrix is still a very memory-intensive operation. It will still be heavily bandwidth-limited, but at least cache line utilization is near optimal. I don't know if the performance could be optimized further. My testing shows that a few-years-old laptop manages to transpose a 16k*16k matrix (1 GB of memory) in about 300 ms.
I also tried to use _mm_stream_sd, but it actually makes performance worse for some reason. I don't understand non-temporal memory writes well enough to have a practical guess as to why performance would drop with _mm_stream_ps. A possible reason is of course that the cache line is already in L1 cache, ready for the write operation.
The really important part of a non-linear matrix layout, though, would be the possibility of avoiding the transpose completely and simply running the multiplication in a tile-friendly order. But I only have transpose code, which I'm using to improve my knowledge of cache management in algorithms.
I haven't yet tried to test whether prefetching would improve memory bandwidth usage. The current code runs at about 0.5 instructions per cycle (good cache-friendly code runs at around 2 instructions per cycle on this CPU), which leaves a lot of free cycles for prefetch instructions, allowing even quite complex calculations to optimize prefetch timing at runtime.
Example code from my transpose benchmark follows.
#include <xmmintrin.h>
#include <algorithm>
#include <utility>

#define MATSIZE 16384
#define align(val, a) (val + (a - val % a))
#define tilewidth 4
typedef float matrix[align(MATSIZE, tilewidth) * MATSIZE] __attribute__((aligned(64)));

float &index(matrix m, unsigned i, unsigned j)
{
    /* tiled address calculation */
    /* a single cache line is used for each 4x4 sub matrix (64 bytes = 4*4*sizeof(float)) */
    /* tiles are arranged linearly from top to bottom */
    /*
     * eg: 16x16 matrix tile positions:
     *  t1 t5 t9  t13
     *  t2 t6 t10 t14
     *  t3 t7 t11 t15
     *  t4 t8 t12 t16
     */
    const unsigned tilestride = tilewidth * MATSIZE;
    const unsigned comp0 = i % tilewidth;   /* i inside tile is least significant part */
    const unsigned comp1 = j * tilewidth;   /* next part is j multiplied by tile width */
    const unsigned comp2 = i / tilewidth * tilestride;
    const unsigned add = comp0 + comp1 + comp2;
    return m[add];
}

/* Get start of tile reference */
float &tile(matrix m, unsigned i, unsigned j)
{
    const unsigned tilestride = tilewidth * MATSIZE;
    const unsigned comp1 = j * tilewidth;   /* next part is j multiplied by tile width */
    const unsigned comp2 = i / tilewidth * tilestride;
    return m[comp1 + comp2];
}
template<bool diagonal>
static void doswap(matrix mat, unsigned i, unsigned j)
{
    /* special path to swap whole tile at once */
    union {
        float *fs;
        __m128 *mm;
    } src, dst;
    src.fs = &tile(mat, i, j);
    dst.fs = &tile(mat, j, i);
    if (!diagonal) {
        __m128 srcrow0 = src.mm[0];
        __m128 srcrow1 = src.mm[1];
        __m128 srcrow2 = src.mm[2];
        __m128 srcrow3 = src.mm[3];
        __m128 dstrow0 = dst.mm[0];
        __m128 dstrow1 = dst.mm[1];
        __m128 dstrow2 = dst.mm[2];
        __m128 dstrow3 = dst.mm[3];
        _MM_TRANSPOSE4_PS(srcrow0, srcrow1, srcrow2, srcrow3);
        _MM_TRANSPOSE4_PS(dstrow0, dstrow1, dstrow2, dstrow3);
#if STREAMWRITE == 1
        _mm_stream_ps(src.fs + 0, dstrow0);
        _mm_stream_ps(src.fs + 4, dstrow1);
        _mm_stream_ps(src.fs + 8, dstrow2);
        _mm_stream_ps(src.fs + 12, dstrow3);
        _mm_stream_ps(dst.fs + 0, srcrow0);
        _mm_stream_ps(dst.fs + 4, srcrow1);
        _mm_stream_ps(dst.fs + 8, srcrow2);
        _mm_stream_ps(dst.fs + 12, srcrow3);
#else
        src.mm[0] = dstrow0;
        src.mm[1] = dstrow1;
        src.mm[2] = dstrow2;
        src.mm[3] = dstrow3;
        dst.mm[0] = srcrow0;
        dst.mm[1] = srcrow1;
        dst.mm[2] = srcrow2;
        dst.mm[3] = srcrow3;
#endif
    } else {
        __m128 srcrow0 = src.mm[0];
        __m128 srcrow1 = src.mm[1];
        __m128 srcrow2 = src.mm[2];
        __m128 srcrow3 = src.mm[3];
        _MM_TRANSPOSE4_PS(srcrow0, srcrow1, srcrow2, srcrow3);
#if STREAMWRITE == 1
        _mm_stream_ps(src.fs + 0, srcrow0);
        _mm_stream_ps(src.fs + 4, srcrow1);
        _mm_stream_ps(src.fs + 8, srcrow2);
        _mm_stream_ps(src.fs + 12, srcrow3);
#else
        src.mm[0] = srcrow0;
        src.mm[1] = srcrow1;
        src.mm[2] = srcrow2;
        src.mm[3] = srcrow3;
#endif
    }
}
static void transpose(matrix mat)
{
    const unsigned xstep = 256;
    const unsigned ystep = 256;
    const unsigned istep = 4;
    const unsigned jstep = 4;
    unsigned x1, y1, i, j;
    /* need to increment x check for y limit to allow unrolled inner loop
     * access entries close to diagonal axis
     */
    for (x1 = 0; x1 < MATSIZE - xstep + 1 && MATSIZE > xstep && xstep; x1 += xstep)
        for (y1 = 0; y1 < std::min(MATSIZE - ystep + 1, x1 + 1); y1 += ystep)
            for ( i = x1 ; i < x1 + xstep; i += istep ) {
                for ( j = y1 ; j < std::min(y1 + ystep, i); j += jstep )
                {
                    doswap<false>(mat, i, j);
                }
                if (i == j && j < (y1 + ystep))
                    doswap<true>(mat, i, j);
            }
    for ( i = 0 ; i < x1; i += istep ) {
        for ( j = y1 ; j < std::min(MATSIZE - jstep + 1, i); j += jstep )
        {
            doswap<false>(mat, i, j);
        }
        if (i == j)
            doswap<true>(mat, i, j);
    }
    for ( i = x1 ; i < MATSIZE - istep + 1; i += istep ) {
        for ( j = y1 ; j < std::min(MATSIZE - jstep + 1, i); j += jstep )
        {
            doswap<false>(mat, i, j);
        }
        if (i == j)
            doswap<true>(mat, i, j);
    }
    x1 = MATSIZE - MATSIZE % istep;
    y1 = MATSIZE - MATSIZE % jstep;
    for ( i = x1 ; i < MATSIZE; i++ )
        for ( j = 0 ; j < std::min((unsigned)MATSIZE, i); j++ )
            std::swap(index(mat, i, j+0), index(mat, j+0, i));
    for ( i = 0; i < x1; i++ )
        for ( j = y1 ; j < std::min((unsigned)MATSIZE, i) ; j++ )
            std::swap(index(mat, i, j+0), index(mat, j+0, i));
}
