I'm trying to figure out a suitable way to apply a row-wise permutation of a matrix using SIMD intrinsics (mainly AVX/AVX2 and AVX-512).
The problem is basically calculating R = PX, where P is a (sparse) permutation matrix with only one nonzero element per column. This allows one to represent P as a vector p, where p[i] is the row index of the nonzero value in column i. The code below shows a simple loop to achieve this:
// R and X are 2-D matrices with the same shape (m, n)
for (size_t i = 0; i < m; ++i) {
    for (size_t j = 0; j < n; ++j) {
        R[p[i]][j] += X[i][j];
    }
}
I assume it all boils down to gather, but before spending a long time implementing various approaches, I would love to know what you folks think about this and which approach is most suitable for tackling it.
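One observation worth testing before reaching for gather intrinsics: since each output row comes from exactly one input row, the per-element scatter can be turned into whole-row copies by inverting the permutation once. The sketch below assumes p is a true permutation (every row index appears exactly once) and row-major storage; the inner loop is then unit-stride, which compilers auto-vectorize without any gather/scatter instructions.

```c
#include <stddef.h>

/* Sketch, assuming p is a bijective permutation of 0..m-1.
 * Instead of scattering row i of X to row p[i] of R, invert p once and
 * then accumulate whole rows contiguously; the inner loop is a plain
 * unit-stride add that auto-vectorizes cleanly. */
void permute_rows_add(float *R, const float *X, const size_t *p,
                      size_t m, size_t n, size_t *pinv /* scratch, length m */)
{
    for (size_t i = 0; i < m; ++i)
        pinv[p[i]] = i;              /* pinv[r] = source row for output row r */
    for (size_t r = 0; r < m; ++r) {
        const float *src = X + pinv[r] * n;
        float *dst = R + r * n;
        for (size_t j = 0; j < n; ++j)  /* contiguous: SIMD-friendly */
            dst[j] += src[j];
    }
}
```

Gathers only start to pay off if the permutation moves individual elements within rows; for row granularity, contiguous loads and stores should beat them.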
Isn't it strange that none of the compilers use AVX-512 for this?
https://godbolt.org/z/ox9nfjh8d
Why is it that gcc doesn't do register blocking? I see that clang does a better job; is this common?
I'm filling a lower triangular matrix like this:
for (i = 0; i < size; i++) {
    for (j = 0; j <= i; j++)
        l[i][j] = j + 1;
}
And I want to calculate the order of the code in Big O notation, but I'm really bad at this. If it were a regular matrix it would be O(n²), but in this case I'm not sure if it's O(n log n) or something like that.
Typically (but not always) one loop nested in another will cause O(N²).
Think about it: for each value of i, the inner loop executes i + 1 times, and the outer loop executes size times.
The total is 1 + 2 + ... + N = N(N+1)/2, roughly half of N², which is still O(N²).
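A quick way to convince yourself of the N(N+1)/2 count is to literally count the iterations; this small sketch mirrors the loop structure from the question and returns the number of times the body runs:

```c
#include <stddef.h>

/* Count how many times the triangular-fill loop body executes for a
 * given size. The count equals size * (size + 1) / 2, i.e. Theta(size^2). */
size_t triangular_fill_steps(size_t size)
{
    size_t steps = 0;
    for (size_t i = 0; i < size; i++)
        for (size_t j = 0; j <= i; j++)
            steps++;
    return steps;
}
```

For size = 100 this returns 5050 = 100·101/2, about half of 100² = 10000, confirming the constant factor of 1/2 that Big O discards.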
I am unsure why this code evaluates to O(A*B).
void printUnorderedPairs(int[] arrayA, int[] arrayB) {
    for (int i = 0; i < arrayA.length; i++) {
        for (int j = 0; j < arrayB.length; j++) {
            for (int k = 0; k < 100000; k++) {
                System.out.println(arrayA[i] + "," + arrayB[j]);
            }
        }
    }
}
Sure, more precisely it's O(100000*AB), and we would drop the 100000, making it O(AB). But what if arrayA had a length of 2? Wouldn't the 100000 iterations be more significant? Is it just because we know the final loop is constant (and its value is shown) that we don't count it? What if we knew all of the arrays' sizes?
Can anyone explain why we would not say it's O(ABC)? What would be the runtime if I made the code this:
int[] arrayA = new int[20];
int[] arrayB = new int[500];
int[] arrayC = new int[100000];

void printUnorderedPairs(int[] arrayA, int[] arrayB) {
    for (int i = 0; i < arrayA.length; i++) {
        for (int j = 0; j < arrayB.length; j++) {
            for (int k = 0; k < arrayC.length; k++) {
                System.out.println(arrayA[i] + "," + arrayB[j]);
            }
        }
    }
}
If the running time (or number of execution steps, or number of times println gets called, or whatever you are assessing with your Big O notation) is O(AB), it means that the running time approaches being linearly proportional to AB as AB grows without bound (approaches infinity). It is literally a limit to infinity, in terms of calculus.
Big O is not concerned with what happens for any finite number of iterations. It's about what the limiting behaviour of the function is as its free variables approach infinity. Sure, for small values of A there could very well be a constant term that dominates execution time. But as A approaches infinity, all those other factors become insignificant.
Consider a polynomial like Ax^3 + Bx^2 + Cx + D. It will be proportional to x^3 as x grows to infinity, regardless of the magnitude of A, B, C, or D. B can be Graham's number for all Big O cares; infinity is still way bigger than any big finite number you pick, and therefore the x^3 term dominates.
So first, considering what if A were 2 is not really in the spirit of AB approaching infinity. Any number you can fit on a whiteboard basically rounds down to zero.
And second, remember that proportional to AB means equal to AB times some constant, and it doesn't matter what that constant is. It is fine if the constant happens to be 100000. Saying something is proportional to 2N is the same as saying it is proportional to N, or any other number times N. So O(2N) is the same as O(N). By convention we always simplify when using Big O notation to drop any constant factors, so we would always write O(N) and never O(2N). For the same reason, we would write O(AB) and not O(100000AB).
And finally, we don't say O(ABC) only because "C" (the number of iterations of your inner loop in your question) happens to be a constant, which happens to equal 100000. That's why we say it's O(AB) and not O(ABC): C is not a free variable; it's hard-coded to 100000. If the size of B were not expected to change (were constant for whatever reason), then you could say that it is simply O(A). But if you allow B to grow without bound, then the limit is O(AB), and if you also allow C to grow without bound, then the limit is O(ABC). You get to decide which numbers are constants and which variables are free variables depending on the context of your analysis.
You can read more about Big O notation at Wikipedia.
Appreciate that the for loops in i and j are independent of each other, so their running time is O(A*B). The inner loop in k runs a fixed number of iterations, 100000, and is also independent of the two outer loops, so we get O(100000*A*B). But since the k loop is just a constant (non-variable) penalty, we are still left with O(A*B) for the overall complexity.
If you were to run the inner loop in k from 0 to C, then you could write O(A*B*C) for the complexity, and that would be valid as well.
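To make the distinction concrete, here is a small counting sketch (in C, mirroring the Java loops) that returns the exact number of times the loop body runs. With the inner bound fixed, the count is a constant multiple of A*B; only when the inner bound is a free variable C does the product A*B*C appear in the Big O:

```c
#include <stddef.h>

/* Count total loop-body executions for the triple loop: a_len * b_len * c_len.
 * With c_len fixed (e.g. 100000) this is a constant times A*B, hence O(A*B);
 * if c_len is allowed to vary, the complexity becomes O(A*B*C). */
size_t pair_print_steps(size_t a_len, size_t b_len, size_t c_len)
{
    size_t steps = 0;
    for (size_t i = 0; i < a_len; i++)
        for (size_t j = 0; j < b_len; j++)
            for (size_t k = 0; k < c_len; k++)
                steps++;
    return steps;
}
```

Doubling a_len or b_len doubles the count, while the fixed c_len only scales it by a constant that Big O drops.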
Often A*B is just lumped together and treated as a single input size N.
If there were some guarantee that A and B were always of roughly the same length, then one could argue that it's really O(N^2).
Any sort of constant doesn't really matter in order notation, because for really, really large values of A and B the constant becomes of negligible importance.
void printUnorderedPairs(int[] arrayA, int[] arrayB) {
    for (int i = 0; i < arrayA.length; i++) {
        for (int j = 0; j < arrayB.length; j++) {
            for (int k = 0; k < 100000; k++) {
                System.out.println(arrayA[i] + "," + arrayB[j]);
            }
        }
    }
}
This code evaluates to O(AB), because the innermost loop runs a constant number of times. Of course, its run time is proportional to 100000*AB. We never care about constant values here, because once the variables grow huge, say 10^10000, the constants can be easily ignored.
In the second snippet, we say it's O(1), because all the arrays have constant length, so we can calculate the run time without any variable.
In an N×N matrix, what would be the best way of obtaining the sum of the 8 elements surrounding a given element?
We've been doing it the brute-force way, just checking with a lot of if statements, but I was wondering if there could be a more clever way of doing this.
The problem we face is the borders of the matrix, since we cannot find a way that looks more subtle than the original bunch of if (i > 0 && j > 0) {...}.
Assuming the matrix has been initialized and you only need the sums for elements whose eight neighbours all exist, you can save time by restricting the double loop to those interior elements.
For an N x N matrix, the following covers all the elements satisfying that condition:
for (i = 1; i < N - 1; i++)
{
    for (j = 1; j < N - 1; j++)
    {
        // YOUR CODE
    }
}
I could not understand the optimized matrix chain multiplication (using DP) code example given in my algorithms book.
int MatrixChainOrder(int p[], int n)
{
    /* For simplicity of the program, one extra row and one extra column
       are allocated in m[][]. The 0th row and 0th column of m[][] are
       not used. */
    int m[n][n];
    int i, j, k, L, q;

    /* m[i][j] = minimum number of scalar multiplications needed to compute
       the matrix A[i]A[i+1]...A[j] = A[i..j], where the dimension of A[i]
       is p[i-1] x p[i] */

    // cost is zero when multiplying one matrix
    for (i = 1; i < n; i++)
        m[i][i] = 0;

    // L is chain length
    for (L = 2; L < n; L++)
    {
        for (i = 1; i < n - L + 1; i++)
        {
            j = i + L - 1;
            m[i][j] = INT_MAX;
            for (k = i; k <= j - 1; k++)
            {
                // q = cost/scalar multiplications
                q = m[i][k] + m[k+1][j] + p[i-1]*p[k]*p[j];
                if (q < m[i][j])
                    m[i][j] = q;
            }
        }
    }
    return m[1][n-1];
}
Why does the first loop start from 2?
Why is j set to i+L-1, and why does i only run up to n-L+1?
I understood the recurrence relation, but could not understand why the loops are set up like this.
EDIT:
What is the way to get the parenthesization order after the DP?
In bottom-up DP we solve the smallest possible cases first (we solve every smallest case). Now, looking at the recurrence (m[i][j] represents the cost to parenthesize the chain from i to j),
we can see that every larger subproblem needs the solutions of subproblems of smaller length. To solve a chain of length n, we need the costs of parenthesizing every sub-chain of length less than n. This leads us to solve the problem lengthwise. (Note that L in the outer loop represents the length of the segment whose cost we are trying to optimize.)
First we solve all the subproblems of length 1; their cost is always 0 (no multiplication required).
Now, your question about L = 2 -> L = n:
we vary the length from 2 to n just to solve the subproblems in order.
i is the starting point of each sub-interval, ranging over all positions where an interval of length L can begin.
Naturally, j represents the end of the sub-interval: j = i+L-1 (because we know the starting point and the length, we can figure out where the sub-interval ends).
L iterates over the length of a chain; clearly, a chain cannot be 1 piece long. i iterates over the beginning of the chain. If the first piece is i, then the last piece will be i+L-1, which is j. (Try to imagine a chain and count.) The condition in the loop makes sure that, for any value of i, the last piece does not exceed the maximum length n.
In short, those are limitations that keep the indices within the given boundaries.
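As for the EDIT question, one common way to recover the parenthesization (a sketch, not the book's code) is to record, in a companion table s[i][j], the split point k that achieved the minimum for each subproblem, and then print the order recursively. The names `matrix_chain_order`, `print_order`, and `MAXN` below are my own:

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

#define MAXN 16

static int s[MAXN][MAXN];   /* s[i][j] = best split point k for A[i..j] */

/* Same DP as in the question, but also fills s so the order can be recovered. */
int matrix_chain_order(const int p[], int n)
{
    int m[MAXN][MAXN];
    for (int i = 1; i < n; i++)
        m[i][i] = 0;
    for (int L = 2; L < n; L++) {
        for (int i = 1; i < n - L + 1; i++) {
            int j = i + L - 1;
            m[i][j] = INT_MAX;
            for (int k = i; k <= j - 1; k++) {
                int q = m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j];
                if (q < m[i][j]) {
                    m[i][j] = q;
                    s[i][j] = k;   /* remember where the optimal split was */
                }
            }
        }
    }
    return m[1][n - 1];
}

/* Append the optimal parenthesization of A[i..j] to out. */
void print_order(int i, int j, char *out)
{
    if (i == j) {
        char buf[8];
        sprintf(buf, "A%d", i);
        strcat(out, buf);
    } else {
        strcat(out, "(");
        print_order(i, s[i][j], out);
        print_order(s[i][j] + 1, j, out);
        strcat(out, ")");
    }
}
```

For example, with p = {10, 20, 30, 40} (three matrices), the minimum cost is 18000 and the recovered order is ((A1A2)A3).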