For an assignment in a course called High Performance Computing, I was required to optimize the following code fragment:
int foobar(int a, int b, int N)
{
int i, j, k, x, y;
x = 0;
y = 0;
k = 256;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
if (i > j){
y = y + 8*(i-j);
}else{
y = y + 8*(j-i);
}
}
}
return x;
}
Following some recommendations, I managed to optimize the code (or at least I think so) using techniques such as:
Constant Propagation
Algebraic Simplification
Copy Propagation
Common Subexpression Elimination
Dead Code Elimination
Loop Invariant Removal
Bitwise shifts instead of multiplication, as they are less expensive.
Here's my code:
int foobar(int a, int b, int N) {
int i, j, x, y, t;
x = 0;
y = 0;
for (i = 0; i <= N; i++) {
t = i + 512;
for (j = i + 1; j <= N; j++) {
x = x + ((i<<3) + (j<<2))*t;
}
}
return x;
}
According to my instructor, well-optimized code should have fewer or less costly instructions at the assembly level, and should therefore run in less time than the original code. That is, the calculation is:
execution time = instruction count * cycles per instruction
When I generate assembly code using the command: gcc -o code_opt.s -S foobar.c,
the generated code has many more lines than the original despite the optimizations I made, and the run time is lower than the original's, but not by as much as I expected. What am I doing wrong?
I am not pasting the assembly code, as both listings are very long. Instead, I am calling the function "foobar" from main and measuring the execution time using the time command in Linux:
int main () {
int a,b,N;
scanf ("%d %d %d",&a,&b,&N);
printf ("%d\n",foobar (a,b,N));
return 0;
}
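To attribute the time to the foobar call itself rather than to the whole process (which also includes scanf and printf), an in-program timer can be used instead. This is only a sketch and assumes a POSIX system where clock_gettime is available:
#include <stdio.h>
#include <time.h>
int foobar(int a, int b, int N); /* the function under test, defined as above */
int main(void) {
    int a, b, N;
    struct timespec t0, t1;
    scanf("%d %d %d", &a, &b, &N);
    clock_gettime(CLOCK_MONOTONIC, &t0);   /* start just before the call */
    int result = foobar(a, b, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);   /* stop right after it returns */
    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d\n%.6f s\n", result, seconds);
    return 0;
}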
Initially:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
if (i > j){
y = y + 8*(i-j);
}else{
y = y + 8*(j-i);
}
}
}
Removing y calculations:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
}
}
Splitting i, j, k:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 8*i*i + 16*i*k ; // multiple of 1 (no j)
x = x + (4*i + 8*k)*j ; // multiple of j
}
}
Moving them out of the inner loop (and removing that loop, which runs N-i times):
for (i = 0; i <= N; i++) {
x = x + (8*i*i + 16*i*k) * (N-i) ;
x = x + (4*i + 8*k) * ((N*N+N)/2 - (i*i+i)/2) ;
}
Rewriting:
for (i = 0; i <= N; i++) {
x = x + ( 8*k*(N*N+N)/2 ) ;
x = x + i * ( 16*k*N + 4*(N*N+N)/2 + 8*k*(-1/2) ) ;
x = x + i*i * ( 8*N + 16*k*(-1) + 4*(-1/2) + 8*k*(-1/2) );
x = x + i*i*i * ( 8*(-1) + 4*(-1/2) ) ;
}
Rewriting and recalculating:
for (i = 0; i <= N; i++) {
x = x + 4*k*(N*N+N) ; // multiple of 1
x = x + i * ( 16*k*N + 2*(N*N+N) - 4*k ) ; // multiple of i
x = x + i*i * ( 8*N - 20*k - 2 ) ; // multiple of i^2
x = x + i*i*i * ( -10 ) ; // multiple of i^3
}
Moving the coefficients out again (and removing the i loop):
x = x + ( 4*k*(N*N+N) ) * (N+1) ;
x = x + ( 16*k*N + 2*(N*N+N) - 4*k ) * ((N*(N+1))/2) ;
x = x + ( 8*N - 20*k - 2 ) * ((N*(N+1)*(2*N+1))/6);
x = x + (-10) * ((N*N*(N+1)*(N+1))/4) ;
Both the above loop removals use the summation formulas:
Sum(1, i = 0..n) = n+1
Sum(i, i = 0..n) = n(n + 1)/2
Sum(i^2, i = 0..n) = n(n + 1)(2n + 1)/6
Sum(i^3, i = 0..n) = n^2 (n + 1)^2 / 4
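Putting the four terms above together, a sketch of the fully reduced function could look like the following (k = 256 is substituted; the long long intermediates are my own precaution against overflow and are not part of the derivation):
int foobar(int a, int b, int N)
{
    long long n = N, k = 256, x = 0;   /* assumes N >= 0 */
    x += (4*k*(n*n + n)) * (n + 1);
    x += (16*k*n + 2*(n*n + n) - 4*k) * ((n*(n + 1)) / 2);
    x += (8*n - 20*k - 2) * ((n*(n + 1)*(2*n + 1)) / 6);
    x += (-10) * ((n*n*(n + 1)*(n + 1)) / 4);
    return (int)x;
}
All four divisions are exact for integer n, so no rounding is lost.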
y does not affect the final result of the code - removed:
int foobar(int a, int b, int N)
{
int i, j, k, x, y;
x = 0;
//y = 0;
k = 256;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
//if (i > j){
// y = y + 8*(i-j);
//}else{
// y = y + 8*(j-i);
//}
}
}
return x;
}
k is simply a constant:
int foobar(int a, int b, int N)
{
int i, j, x;
x = 0;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*256);
}
}
return x;
}
The inner expression can be transformed to: x += 8*i*i + 4096*i + 4*i*j + 2048*j. Use math to push all of them to the outer loop: x += 8*i*i*(N-i) + 4096*i*(N-i) + 2*i*(N-i)*(N+i+1) + 1024*(N-i)*(N+i+1).
You can expand the above expression and apply the sum-of-squares and sum-of-cubes formulas to obtain a closed-form expression, which should run faster than the doubly nested loop. I leave that as an exercise to you. As a result, i and j will also be removed.
a and b should also be removed if possible, since they are supplied as arguments but never used in your code.
Sum of squares and sum of cubes formulas:
Sum(x^2, x = 1..n) = n(n + 1)(2n + 1)/6
Sum(x^3, x = 1..n) = n^2 (n + 1)^2 / 4
This function is equivalent to the following formula, which contains only 4 integer multiplications and 1 integer division:
x = N * (N + 1) * (N * (7 * N + 8187) - 2050) / 6;
To get this, I simply typed the sum calculated by your nested loops into Wolfram Alpha:
sum (sum (8*i*i+4096*i+4*i*j+2048*j), j=i+1..N), i=0..N
Here is the direct link to the solution. Think before coding. Sometimes your brain can optimize code better than any compiler.
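As a quick sanity check (my own addition, not part of the answer), the closed form can be compared against the original nested loops for small N, using 64-bit arithmetic so the reference sum does not overflow:
#include <stdio.h>
/* reference: the original nested loops, with k = 256 substituted */
static long long foobar_ref(long long N) {
    long long x = 0;
    for (long long i = 0; i <= N; i++)
        for (long long j = i + 1; j <= N; j++)
            x += 4 * (2*i + j) * (i + 512);
    return x;
}
/* the closed form obtained from Wolfram Alpha */
static long long foobar_closed(long long N) {
    return N * (N + 1) * (N * (7 * N + 8187) - 2050) / 6;
}
int main(void) {
    for (long long N = 0; N <= 200; N++)
        if (foobar_ref(N) != foobar_closed(N))
            printf("mismatch at N=%lld\n", N);
    return 0;
}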
Briefly scanning the first routine, the first thing you notice is that expressions involving "y" are completely unused and can be eliminated (as you did). This further permits eliminating the if/else (as you did).
What remains is the two for loops and the messy expression. Factoring out the pieces of that expression that do not depend on j is the next step. You removed one such expression, but (i<<3) (ie, i * 8) remains in the inner loop, and can be removed.
Pascal's answer reminded me that you can use a loop stride optimization. First move (i<<3) * t out of the inner loop (call it i1), then calculate, when initializing the loop, a value j1 that equals (i<<2) * t. On each iteration increment j1 by 4 * t (which is a pre-calculated constant). Replace your inner expression with x = x + i1 + j1;.
One suspects that there may be some way to combine the two loops into one, with a stride, but I'm not seeing it offhand.
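A sketch of that stride idea, keeping the names i1 and j1 from the description above (step holds the pre-calculated constant 4 * t):
int foobar(int a, int b, int N) {
    int i, j, x, t;
    x = 0;
    for (i = 0; i <= N; i++) {
        t = i + 512;
        int i1 = (i << 3) * t;   /* loop-invariant part of the sum */
        int j1 = (i << 2) * t;   /* becomes (j << 2) * t after the first increment */
        int step = 4 * t;        /* pre-calculated constant */
        for (j = i + 1; j <= N; j++) {
            j1 += step;          /* strength reduction: no multiply left in the inner loop */
            x = x + i1 + j1;
        }
    }
    return x;
}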
A few other things I can see. You don't need y, so you can remove its declaration and initialisation.
Also, the values passed in for a and b aren't actually used, so you could use these as local variables instead of x and t.
Also, rather than adding i to 512 each time through you can note that t starts at 512 and increments by 1 each iteration.
int foobar(int a, int b, int N) {
int i, j;
a = 0;
b = 512;
for (i = 0; i <= N; i++, b++) {
for (j = i + 1; j <= N; j++) {
a = a + ((i<<3) + (j<<2))*b;
}
}
return a;
}
Once you get to this point you can also observe that, aside from initialising j, i and j are each only used in a single multiple - i<<3 and j<<2. We can code this directly in the loop logic, thus:
int foobar(int a, int b, int N) {
int i, j, iLimit, jLimit;
a = 0;
b = 512;
iLimit = N << 3;
jLimit = N << 2;
for (i = 0; i <= iLimit; i+=8) {
for (j = (i >> 1) + 4; j <= jLimit; j+=4) {
a = a + (i + j)*b;
}
b++;
}
return a;
}
OK... so here is my solution, along with inline comments to explain what I did and how.
int foobar(int N)
{ // We eliminate unused arguments
int x = 0, i = 0, i2 = 0, j, k, z;
// We only iterate up to N on the outer loop, since the
// last iteration doesn't do anything useful. Also we keep
// track of '2*i' (which is used throughout the code) by a
// second variable 'i2' which we increment by two in every
// iteration, essentially converting multiplication into addition.
while(i < N)
{
// We hoist the calculation '4 * (i+2*k)' out of the loop
// since k is a literal constant and 'i' is a constant during
// the inner loop. We could convert the multiplication by 2
// into a left shift, but hey, let's not go *crazy*!
//
// (4 * (i+2*k)) <=>
// (4 * i) + (4 * 2 * k) <=>
// (2 * i2) + (8 * k) <=>
// (2 * i2) + (8 * 512) <=>
// (2 * i2) + 2048
k = (2 * i2) + 2048;
// We have now converted the expression:
// x = x + 4*(2*i+j)*(i+2*k);
//
// into the expression:
// x = x + (i2 + j) * k;
//
// Counterintuitively we now *expand* the formula into:
// x = x + (i2 * k) + (j * k);
//
// Now observe that (i2 * k) is a constant inside the inner
// loop which we can calculate only once here. Also observe
// that it is simply added into x a total of (N - i) times, so
// we take advantage of the abelian nature of addition
// to hoist it completely out of the loop
x = x + (i2 * k) * (N - i);
// Observe that inside this loop we calculate (j * k) repeatedly,
// and that j is just an increasing counter. So now instead of
// doing numerous multiplications, let's break the operation into
// two parts: a multiplication, which we hoist out of the inner
// loop and additions which we continue performing in the inner
// loop.
z = i * k;
for (j = i + 1; j <= N; j++)
{
z = z + k;
x = x + z;
}
i++;
i2 += 2;
}
return x;
}
The code, without any of the explanations, boils down to this:
int foobar(int N)
{
int x = 0, i = 0, i2 = 0, j, k, z;
while(i < N)
{
k = (2 * i2) + 2048;
x = x + (i2 * k) * (N - i);
z = i * k;
for (j = i + 1; j <= N; j++)
{
z = z + k;
x = x + z;
}
i++;
i2 += 2;
}
return x;
}
I hope this helps.
int foobar(int N) // Avoid passing unused arguments
{
    int i, j, x=0; // Remove unused variables and operations to save stack space and machine cycles
    for (i = N; i--; ) // Count down; the i == N iteration does no useful work, so skip it and its comparison
        for (j = N+1; --j>i; )
            x += (((i<<1)+j)*(i+512)<<2); // Use shifts instead of multiplication to save machine cycles
return x;
}
Related
I'm looking for a fast way to compute the maximal n such that C(n, k) <= x for given k and x.
In my context n <= n' for some known constant n', let's say 1000. k is either 1, 2 or 3, and x is chosen at random from 0 ... C(n', k).
My current approach is to compute the binomial coefficient iteratively, starting from a_0 = C(k, k) = 1. The next coefficient a_1 = C(k+1, k) can be computed as a_1 = a_0 * (k+1) / 1, and so on.
The current C code looks like this
#include <stdint.h>
#include <stdio.h>
uint32_t max_bc(const uint32_t a, const uint32_t n, const uint32_t k) {
uint32_t tmp = 1;
int ctr = 0;
uint32_t c = k, d = 1;
while(tmp <= a && ctr < n) {
c += 1;
tmp = tmp*c/d;
ctr += 1;
d += 1;
}
return ctr + k - 1;
}
int main() {
const uint32_t n = 10, w = 2;
for (uint32_t a = 0; a < 10 /*bc(n, w)*/; a++) {
const uint32_t b = max_bc(a, n, w);
printf("%d %d\n", a, b);
}
}
which outputs
0 1
1 2
2 2
3 3
4 3
5 3
6 4
7 4
8 4
9 4
So I'm looking for a bit trick or something to get around the while loop and speed up my application. That's because the while loop gets executed at worst n-k times. Precomputation is not an option, because this code is part of a bigger algorithm which uses a lot of memory.
Thanks to @Aleksei.
This is my solution:
template<typename T, const uint32_t k>
inline T opt_max_bc(const T a, const uint32_t n) {
if constexpr(k == 1) {
return n - k - a;
}
if constexpr (k == 2) {
const uint32_t t = __builtin_floor((double)(__builtin_sqrt(8 * a + 1) + 1)/2.);
return n - t - 1;
}
if constexpr (k == 3) {
if (a == 1)
return n-k-1;
float x = a;
float t1 = sqrtf(729.f * x * x);
float t2 = cbrtf(3.f * t1 + 81.f * x);
float t3 = t2 / 2.09f;
float ctr2 = t3;
int ctr = int(ctr2);
return n - ctr - k;
}
if constexpr (k == 4) {
const float x = a;
const float t1 = __builtin_floorf(__builtin_sqrtf(24.f * x + 1.f));
const float t2 = __builtin_floorf(__builtin_sqrtf(4.f * t1 + 5.f));
uint32_t ctr = (t2 + 3.f)/ 2.f - 3;
return n - ctr - k;
}
// will never happen
return -1;
}
If k is really limited to just 1, 2 or 3, you can use different methods depending on k:
k == 1: C(n, 1) = n <= x, so the answer is simply x.
k == 2: C(n, 2) = n * (n - 1) / 2 <= x. You can solve the equation n * (n - 1) / 2 = x; the positive solution is n = 1/2 (sqrt(8x + 1) + 1), so the answer to the initial question should be floor( 1/2 (sqrt(8x + 1) + 1) ).
k == 3: C(n, 3) = n(n-1)(n-2)/6 <= x. There is no nice closed-form solution, but the formula for the number of combinations is straightforward, so you can use a binary search to find the answer (a sketch follows below).
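For the k == 3 case, a minimal binary-search sketch could look like this (the function name and the upper bound hi are mine; it returns the largest n with C(n, 3) <= x and assumes x >= 1 so that n = 3 qualifies):
#include <stdint.h>
static uint32_t max_n_choose_3(uint64_t x, uint32_t hi) {
    uint32_t lo = 3, ans = 3;
    while (lo <= hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        /* C(mid, 3) = mid*(mid-1)*(mid-2)/6; 64-bit arithmetic avoids overflow for the n' <= 1000 range */
        uint64_t c = (uint64_t)mid * (mid - 1) / 2 * (mid - 2) / 3;
        if (c <= x) { ans = mid; lo = mid + 1; }
        else        { hi = mid - 1; }
    }
    return ans;
}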
Given the code below, is it possible to modify it so that there's a single set of M "random" numbers for x and y that will "restart" at the beginning of the set for every iteration of i?
What I know I can do is pre-generate arrays for x and y of length M, but I cannot use arrays because of limited memory. I was thinking of using random numbers with seeds somehow but haven't been able to figure it out.
double sampleNormal()
{
double u = ((double) rand() / (RAND_MAX)) * 2 - 1;
double v = ((double) rand() / (RAND_MAX)) * 2 - 1;
double r = u * u + v * v;
if (r == 0 || r > 1) return sampleNormal();
double c = sqrt(-2 * log(r) / r);
return u * c;
}
...
double x = 0;
double y = 0;
double a = 0;
double f = 100e6;
double t = 0;
double fsamp = 2e9;
for(int i = 0; i < N; i++)
{
for(int j = 0; j < M; j++)
{
x = sampleNormal();
y = sampleNormal();
t = j/fsamp;
a = x*cos(2*pi*f*t)+y*sin(2*pi*f*t);
}
}
that will "restart" at the beginning of the set for every iteration of i
was thinking of using random numbers with seeds somehow but haven't been able to figure it out.
Code could abuse srand()
// Get some state info from rand() for later.
unsigned start = rand();
start = start*(RAND_MAX + 1u) + rand();
for(int i = 0; i < N; i++) {
// Initialize the random number generator to produce the same sequence.
srand(42); // Use your favorite constant.
for(int j = 0; j < M; j++) {
x = sampleNormal();
y = sampleNormal();
t = j/fsamp;
a = x*cos(2*pi*f*t)+y*sin(2*pi*f*t);
}
}
// Re-seed so other calls to `rand()` are not so predictable.
srand(start);
I am trying to apply dynamic programming to the following problem:
"A robot is located in the top-left corner of an m x n grid. The robot can only move down or right at any point in time. The robot is trying to reach the bottom-right corner of the grid. How many unique paths are there?"
I have a recursive solution to this which I think works fine. However, it is slow:
int uniquePaths(int m, int n)
{
if (m==1 || n==1)
{
return 1;
}
else
{
return (uniquePaths(m,n-1)+uniquePaths(m-1,n));
}
}
I can see that it would be useful if we were able to save the outputs of the uniquePaths calls, since many will be done more than once. One idea I have on how to achieve this is to create an m x n array and store the outputs in there. However, this would mean I would need to pass the array into my recursive function, and I think for this problem I am only allowed two integer inputs. Is there a simple way to apply this?
You don't need to input the array as a function argument. It can be a local variable.
The naive way: using a recursive function
If you really want to use a recursive function, you can declare the array in uniquePaths, then call another function which will use the array and do the calculations.
#include <stdlib.h>
int uniquePaths_helper(int *grid, int m, int n, int i, int j);
int uniquePaths(int m, int n)
{
    int *grid = malloc(m * n * sizeof(int));
    int k, result;
    for (k = 0; k < m * n; ++k)
    {
        grid[k] = 0;
    }
    result = uniquePaths_helper(grid, m, n, 0, 0);
    free(grid);
    return result;
}
int uniquePaths_helper(int *grid, int m, int n, int i, int j)
{
if (grid[i * m + j] == 0)
{
if (i == n - 1 || j == m - 1)
{
grid[i * m + j] = 1;
}
else
{
grid[i * m + j] = (
uniquePaths_helper(grid,m,n, i+1, j)
+ uniquePaths_helper(grid,m,n, i, j+1)
);
}
}
return grid[i * m + j];
}
Being smarter: filling the array in the correct order
In the previous solution there was a lot of overhead, because we had to initialise the array with default values, and then at every recursive call we needed to check whether the value had already been stored in the array or still needed to be calculated.
You can shortcut all that. Fill the array directly using your formula on the cells of grid rather than on the arguments of recursive calls.
The formula is: grid[i * m + j] == grid[(i+1) * m + j] + grid[i * m + j+1].
The only tricky part is finding out in which order to fill the array, so that this formula can be written as a simple assignment, replacing == with =.
Since the value in a cell only depends on values with higher i and j indices, we can simply fill the array backwards:
int uniquePaths(int m, int n)
{
    int *grid = malloc(m * n * sizeof(int));
    int i, j, result;
    for (i = 0; i < n; ++i)
        grid[i * m + m-1] = 1;
    for (j = 0; j < m; ++j)
        grid[(n-1) * m + j] = 1;
    for (i = n - 2; i >= 0; --i)
    {
        for (j = m - 2; j >= 0; --j)
        {
            grid[i * m + j] = grid[(i+1) * m + j] + grid[i * m + j+1];
        }
    }
    result = grid[0];
    free(grid);
    return result;
}
int mystery(int n) {
int s = 0;
int tmp = n+1;
for (int i; i<=n; i++) {
s = tmp + i;
tmp = s;
}
return s;
}
How can I determine what this function does? Also, can this function be improved with respect to its running time?
There is some superfluous code in the above; s is completely unnecessary. Re-writing it without it makes it clearer.
int mystery(int n) {
int tmp = n + 1;
for (int i = 1; i<=n; i++) {
tmp += i;
}
return tmp;
}
What it does is
Set tmp to n + 1
Add 1, then 2, then 3, and so on, up to n
This currently has a running time of O(n). However, it turns out that there is a constant time formula for 1 + 2 + 3 + ... + N. We can use this to create the following, which is constant time.
int mystery(int n) {
int triangleNumber = (n * (n + 1)) / 2;
return triangleNumber + n + 1;
}
This code transposes a matrix in four ways. The first does sequential writes and non-sequential reads. The second is the opposite. The next two are the same, but with cache-skipping (non-temporal) writes. What seems to happen is that sequential writes are faster, and skipping the cache is faster. What I don't understand is: if the cache is being skipped, why are sequential writes still faster?
QueryPerformanceCounter(&before);
for (i = 0; i < N; ++i)
for (j = 0; j < N; ++j)
tmp[i][j] = mul2[j][i];
QueryPerformanceCounter(&after);
printf("Transpose 1:\t%ld\n", after.QuadPart - before.QuadPart);
QueryPerformanceCounter(&before);
for (j = 0; j < N; ++j)
for (i = 0; i < N; ++i)
tmp[i][j] = mul2[j][i];
QueryPerformanceCounter(&after);
printf("Transpose 2:\t%ld\n", after.QuadPart - before.QuadPart);
QueryPerformanceCounter(&before);
for (i = 0; i < N; ++i)
for (j = 0; j < N; ++j)
_mm_stream_si32(&tmp[i][j], mul2[j][i]);
QueryPerformanceCounter(&after);
printf("Transpose 3:\t%ld\n", after.QuadPart - before.QuadPart);
QueryPerformanceCounter(&before);
for (j = 0; j < N; ++j)
for (i = 0; i < N; ++i)
_mm_stream_si32(&tmp[i][j], mul2[j][i]);
QueryPerformanceCounter(&after);
printf("Transpose 4:\t%ld\n", after.QuadPart - before.QuadPart);
EDIT: The output is
Transpose 1: 47603
Transpose 2: 92449
Transpose 3: 38340
Transpose 4: 69597
The CPU has a write-combining buffer that combines writes to a cache line so that they happen in one burst. In this case (the cache being skipped for sequential writes), this write-combining buffer acts as a one-line cache, which makes the results very similar to the cache not being skipped.
To be exact, when the cache is skipped, writes still happen to memory in bursts.
See write-combining logic behavior here.
You could try a non-linear memory layout for the matrix to improve cache utilization. With 4x4 32-bit float tiles, the transpose can be done with only a single access to each cache line. As a bonus, tile transposes can be done easily with _MM_TRANSPOSE4_PS.
Transposing a very large matrix is still a very memory-intensive operation. It will still be heavily bandwidth limited, but at least the cache line usage is near optimal. I don't know if the performance could be optimized further. My testing shows that a few-years-old laptop manages to transpose a 16k*16k matrix (1 GB of memory) in about 300 ms.
I also tried to use _mm_stream_ps, but it actually makes performance worse for some reason. I don't understand non-temporal memory writes well enough to have any practical guess as to why performance would drop with _mm_stream_ps. A possible reason is, of course, that the cache line is already in L1 cache, ready for the write operation.
But the really important part of a non-linear matrix layout would be the possibility to avoid the transpose completely and simply run the multiplication in tile-friendly order. I only have the transpose code, though, which I'm using to improve my knowledge about cache management in algorithms.
I haven't yet tried to test whether prefetching would improve memory bandwidth usage. The current code runs at about 0.5 instructions per cycle (good cache-friendly code runs at around 2 instructions per cycle on this CPU), which leaves a lot of free cycles for prefetch instructions, allowing even quite complex calculations to optimize prefetch timing at runtime.
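For the prefetching idea mentioned above, a hypothetical hint could be issued for an upcoming tile before it is swapped. The helper below is my own sketch, not part of the benchmark, and whether it actually helps would have to be measured:
#include <xmmintrin.h>
/* Hint one upcoming 4x4 float tile (exactly one 64-byte cache line in this
   layout) into the cache hierarchy, a few tiles ahead of the swap. */
static inline void prefetch_tile(const float *tile_start)
{
    _mm_prefetch((const char *)tile_start, _MM_HINT_T0);
}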
example code from my transpose benchmark test follows.
#define MATSIZE 16384
#define align(val, a) (val + (a - val % a))
#define tilewidth 4
typedef float matrix[align(MATSIZE,tilewidth)*MATSIZE] __attribute__((aligned(64)));
float &index(matrix m, unsigned i, unsigned j)
{
    /* tiled address calculation */
    /* a single cache line holds one 4x4 sub-matrix (64 bytes = 4*4*sizeof(float)) */
/* tiles are arranged linearly from top to bottom */
/*
* eg: 16x16 matrix tile positions:
* t1 t5 t9 t13
* t2 t6 t10 t14
* t3 t7 t11 t15
* t4 t8 t12 t16
*/
const unsigned tilestride = tilewidth * MATSIZE;
const unsigned comp0 = i % tilewidth; /* i inside tile is least significant part */
const unsigned comp1 = j * tilewidth; /* next part is j multiplied by tile width */
const unsigned comp2 = i / tilewidth * tilestride;
const unsigned add = comp0 + comp1 + comp2;
return m[add];
}
/* Get start of tile reference */
float &tile(matrix m, unsigned i, unsigned j)
{
const unsigned tilestride = tilewidth * MATSIZE;
const unsigned comp1 = j * tilewidth; /* next part is j multiplied by tile width */
const unsigned comp2 = i / tilewidth * tilestride;
return m[comp1 + comp2];
}
template<bool diagonal>
static void doswap(matrix mat, unsigned i, unsigned j)
{
/* special path to swap whole tile at once */
union {
float *fs;
__m128 *mm;
} src, dst;
src.fs = &tile(mat, i, j);
dst.fs = &tile(mat, j, i);
if (!diagonal) {
__m128 srcrow0 = src.mm[0];
__m128 srcrow1 = src.mm[1];
__m128 srcrow2 = src.mm[2];
__m128 srcrow3 = src.mm[3];
__m128 dstrow0 = dst.mm[0];
__m128 dstrow1 = dst.mm[1];
__m128 dstrow2 = dst.mm[2];
__m128 dstrow3 = dst.mm[3];
_MM_TRANSPOSE4_PS(srcrow0, srcrow1, srcrow2, srcrow3);
_MM_TRANSPOSE4_PS(dstrow0, dstrow1, dstrow2, dstrow3);
#if STREAMWRITE == 1
_mm_stream_ps(src.fs + 0, dstrow0);
_mm_stream_ps(src.fs + 4, dstrow1);
_mm_stream_ps(src.fs + 8, dstrow2);
_mm_stream_ps(src.fs + 12, dstrow3);
_mm_stream_ps(dst.fs + 0, srcrow0);
_mm_stream_ps(dst.fs + 4, srcrow1);
_mm_stream_ps(dst.fs + 8, srcrow2);
_mm_stream_ps(dst.fs + 12, srcrow3);
#else
src.mm[0] = dstrow0;
src.mm[1] = dstrow1;
src.mm[2] = dstrow2;
src.mm[3] = dstrow3;
dst.mm[0] = srcrow0;
dst.mm[1] = srcrow1;
dst.mm[2] = srcrow2;
dst.mm[3] = srcrow3;
#endif
} else {
__m128 srcrow0 = src.mm[0];
__m128 srcrow1 = src.mm[1];
__m128 srcrow2 = src.mm[2];
__m128 srcrow3 = src.mm[3];
_MM_TRANSPOSE4_PS(srcrow0, srcrow1, srcrow2, srcrow3);
#if STREAMWRITE == 1
_mm_stream_ps(src.fs + 0, srcrow0);
_mm_stream_ps(src.fs + 4, srcrow1);
_mm_stream_ps(src.fs + 8, srcrow2);
_mm_stream_ps(src.fs + 12, srcrow3);
#else
src.mm[0] = srcrow0;
src.mm[1] = srcrow1;
src.mm[2] = srcrow2;
src.mm[3] = srcrow3;
#endif
}
}
static void transpose(matrix mat)
{
const unsigned xstep = 256;
const unsigned ystep = 256;
const unsigned istep = 4;
const unsigned jstep = 4;
unsigned x1, y1, i, j;
/* need to increment x check for y limit to allow unrolled inner loop
* access entries close to diagonal axis
*/
for (x1 = 0; x1 < MATSIZE - xstep + 1 && MATSIZE > xstep && xstep; x1 += xstep)
for (y1 = 0; y1 < std::min(MATSIZE - ystep + 1, x1 + 1); y1 += ystep)
for ( i = x1 ; i < x1 + xstep; i += istep ) {
for ( j = y1 ; j < std::min(y1 + ystep, i); j+= jstep )
{
doswap<false>(mat, i, j);
}
if (i == j && j < (y1 + ystep))
doswap<true>(mat, i, j);
}
for ( i = 0 ; i < x1; i += istep ) {
for ( j = y1 ; j < std::min(MATSIZE - jstep + 1, i); j+= jstep )
{
doswap<false>(mat, i, j);
}
if (i == j)
doswap<true>(mat, i, j);
}
for ( i = x1 ; i < MATSIZE - istep + 1; i += istep ) {
for ( j = y1 ; j < std::min(MATSIZE - jstep + 1, i); j+= jstep )
{
doswap<false>(mat, i, j);
}
if (i == j)
doswap<true>(mat, i, j);
}
x1 = MATSIZE - MATSIZE % istep;
y1 = MATSIZE - MATSIZE % jstep;
for ( i = x1 ; i < MATSIZE; i++ )
for ( j = 0 ; j < std::min((unsigned)MATSIZE, i); j++ )
std::swap(index(mat, i, j+0), index(mat, j+0, i));
for ( i = 0; i < x1; i++ )
for ( j = y1 ; j < std::min((unsigned)MATSIZE, i) ; j++ )
std::swap(index(mat, i, j+0), index(mat, j+0, i));
}