Simple explanation of PCA to reduce dimensionality of dataset

I know that PCA does not tell you which features of a dataset are the most significant, but which combinations of features keep the most variance.
How could you use the fact that PCA rotates the dataset in such a way that it has the most variance along the first dimension, second most along second, and so on to reduce the dimensionality of the dataset?
More specifically, how are the first N eigenvectors used to transform the feature vectors into a lower-dimensional representation that keeps most of the variance?

Let X be an N x d matrix where each row X_{n,:} is a vector from the dataset.
Then X'X is the covariance matrix (up to a factor of 1/N, and assuming each column of X has had its mean subtracted), and an eigendecomposition gives X'X=UDU' where U is a d x d matrix of eigenvectors with U'U=I and D is a d x d diagonal matrix of eigenvalues.
The form of the eigendecomposition means that U'X'XU=U'UDU'U=D, which means that if you transform your dataset by U then the new dataset, XU, will have a diagonal covariance matrix.
If the eigenvalues are ordered from largest to smallest, this also means that the average squared value of the first transformed feature (given, up to the same factor of 1/N, by the expression U_1'X'XU_1=\sum_n (\sum_d U_{d,1} X_{n,d})^2, where U_1 is the first column of U) will be larger than that of the second, the second larger than the third, etc.
If we order the features of a dataset from largest to smallest average squared value, then if we just get rid of the features with small average squared values (and the large values are much larger than the small ones), then we haven't lost much information. Concretely, keeping only the first k columns of U gives the N x k reduced dataset XU_{:,1:k}. That is the concept.
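The steps above can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available; the function name `pca_reduce` and the parameter `k` (number of components kept) are made up for this example and do not come from the answer.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the k eigenvectors with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)            # center each feature (column)
    C = Xc.T @ Xc                      # X'X, the covariance matrix up to 1/N
    eigvals, U = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]  # reorder from largest to smallest
    eigvals, U = eigvals[order], U[:, order]
    Uk = U[:, :k]                      # first k eigenvectors, shape d x k
    Z = Xc @ Uk                        # reduced dataset, shape N x k
    return Z, Uk, eigvals

# Example: 3 features with very different spreads, reduced to 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])
Z, Uk, eigvals = pca_reduce(X, 2)
kept_fraction = eigvals[:2].sum() / eigvals.sum()  # share of variance retained
```

The ratio `kept_fraction` is one way to judge how many components to keep: if it is close to 1, the discarded directions carried little variance.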

Related

Finding max of difference + element for an array

We have an array consisting of each entry as a tuple of two integers. Let the array be A = [(a1, b1), (a2, b2), .... , (an, bn)]. Now we have multiple queries where we are given an integer x, we need to find the maximum value of ai + |x - bi| for 1 <= i <= n.
I understand this can be easily achieved in O(n) time complexity for each query but I am looking for something faster than that, probably O(log n) for each query. I can preprocess the array in O(n) time, but the queries should be done faster than O(n).
Any kind of help would be appreciated.

It seems to be way too easy to over-think this.
For n = 1, the function is v-shaped with a minimum of a1 at b1, with slopes of -1 and 1, respectively - let's call these values ac and bc (for combined).
For an additional pair (ai, bi), one of the pairs may dominate the other (|bc - bi| ≤ |ac - ai|), which may then be ignored.
Otherwise, the falling slope of the combination will be from the pair with the larger b, the rising slope from the other.
The minimum will be between the individual b, closer to the b of the pair with the larger a, the distance being half the difference between the (absolute value of the) "coordinate" differences, the minimum value that amount higher.
The main catch is that neither needs to be an integer - the only alternative being exactly in the middle between two integers.
(Ending up with the falling slope from max ai + bi, and the rising slope of max ai - bi.)
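The closing remark can be turned into a short sketch. Since ai + |x - bi| is the larger of (ai + bi) - x and (ai - bi) + x, the maximum over all i only depends on max(ai + bi) and max(ai - bi). The function names `preprocess` and `query` below are illustrative choices, not part of the answer.

```python
def preprocess(pairs):
    """O(n) pass that keeps the two constants describing the combined V shape."""
    max_sum = max(a + b for a, b in pairs)    # falling branch: max_sum - x
    max_diff = max(a - b for a, b in pairs)   # rising branch: max_diff + x
    return max_sum, max_diff

def query(envelope, x):
    """O(1) evaluation of max_i (a_i + |x - b_i|)."""
    max_sum, max_diff = envelope
    return max(max_sum - x, max_diff + x)

pairs = [(1, 5), (3, 2), (0, -4)]
envelope = preprocess(pairs)
answers = [query(envelope, x) for x in range(-10, 11)]
```

Under these assumptions each query costs constant time after the linear preprocessing pass, which is better than the O(log n) asked for.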

Compact storage coefficients of a multivariate polynomial

The setup
I am writing code for dealing with polynomials of degree n over a d-dimensional variable x and ran into a problem that others have likely faced in the past. Such a polynomial can be characterized by coefficients c(alpha) corresponding to x^alpha, where alpha is a length-d multi-index specifying the powers to which the d variables must be raised.
The dimension and order are completely general, but known at compile time, and could be easily as high as n = 30 and d = 10, though probably not at the same time. The coefficients are dense, in the sense that most coefficients are non-zero.
The number of coefficients required to specify such a polynomial is n + d choose n, which in high dimensions is much less than n^d coefficients that could fill a cube of side length n. As a result, in my situation I have to store the coefficients rather compactly. This is at a price, because retrieving a coefficient for a given multi-index alpha requires knowing its location.
The question
Is there a (straightforward) function mapping a d-dimensional multi-index alpha to a position in an array of length (n + d) choose n?

Ordering combinations
A well-known way to order combinations can be found on this wikipedia page. Very briefly you order the combinations lexically so you can easily count the number of lower combinations. An explanation can be found in the sections Ordering combinations and Place of a combination in the ordering.
Precomputing the binomial coefficients will speed up the index calculation.
Associating monomials with combinations
If we can now associate each monomial with a combination we can effectively order them with the method above. Since each coefficient corresponds with such a monomial this would provide the answer you're looking for. Luckily if
alpha = (a[1], a[2], ..., a[d])
then the combination you're looking for is
combination = (a[1] + 0, a[1] + a[2] + 1, ..., a[1] + a[2] + ... + a[d] + d - 1)
The index can then readily be calculated with the formula from the wikipedia page.
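A minimal sketch of the mapping follows. It assumes the ranking formula meant here is the one from the combinatorial number system, where a strictly increasing combination (c_1, ..., c_d) has rank C(c_1, 1) + C(c_2, 2) + ... + C(c_d, d). The function names are made up for illustration, and indices are 0-based.

```python
from itertools import product
from math import comb

def multi_index_to_combination(alpha):
    """Map alpha = (a1, ..., ad) to (a1 + 0, a1 + a2 + 1, ..., a1 + ... + ad + d - 1)."""
    combo, running = [], 0
    for k, a in enumerate(alpha):
        running += a
        combo.append(running + k)
    return combo

def combination_rank(combo):
    """Rank of a strictly increasing combination (assumed combinatorial number system)."""
    return sum(comb(c, k) for k, c in enumerate(combo, start=1))

def coefficient_index(alpha):
    """Position of the coefficient c(alpha) in the compact array."""
    return combination_rank(multi_index_to_combination(alpha))

# Every multi-index of total degree <= n should get a distinct slot.
n, d = 3, 3
alphas = [a for a in product(range(n + 1), repeat=d) if sum(a) <= n]
indices = sorted(coefficient_index(a) for a in alphas)
```

In a compiled implementation the binomial coefficients would be tabulated once, as the answer suggests, so that each lookup costs d additions.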

A better, more object oriented solution, would be to create Monomial and Polynomial classes. The Polynomial class would encapsulate a collection of Monomials. That way you can easily model a pathological case like
y(x) = 1.0 + x^50
using just two terms rather than 51.
Another solution would be a map/dictionary where the key was the exponent and the value is the coefficient. That would only require two entries for my pathological case. You're in business if you have a C/C++ hash map.
Personally, I don't think doing it the naive way with arrays is so terrible, even with a polynomial containing 1000 terms. RAM is cheap; that array won't make or break you.

Eigen: How can I compute the inner product of a row vector from a sparse matrix and a dense vector? What is the efficiency?

SparseMatrix SM;
MatrixXd f;
SM is an m*n sparse matrix (0.18% nonzero), and f is an n*1 column vector.
I want to get the ith row vector of SM and take its inner product with f. How should I write the code?
I am also worried about the efficiency, as many redundant zeros may be involved in the computation.

If SM is a column-major matrix, then indexing one of its rows is very inefficient and essentially a no-go if performance matters. If SM is row-major, then you can simply do SM.row(i).dot(f), and the cost will be on the order of the number of nonzeros in SM.row(i).
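To see why the row-major case is cheap, here is a plain-Python sketch of compressed row storage. It is not Eigen's API; the three arrays `values`, `col_indices` and `row_ptr` are the usual compressed-row layout, and their names are chosen for this illustration.

```python
def csr_row_dot(values, col_indices, row_ptr, i, f):
    """Inner product of row i of a row-major sparse matrix with dense vector f.

    Only the stored nonzeros of row i are visited, so the cost is
    proportional to the number of nonzeros in that row.
    """
    start, end = row_ptr[i], row_ptr[i + 1]
    return sum(values[k] * f[col_indices[k]] for k in range(start, end))

# The 3 x 4 matrix [[0, 2, 0, 0], [1, 0, 0, 3], [0, 0, 0, 0]] in compressed row form.
values = [2.0, 1.0, 3.0]
col_indices = [1, 0, 3]
row_ptr = [0, 1, 3, 3]
f = [1.0, 2.0, 3.0, 4.0]
row_products = [csr_row_dot(values, col_indices, row_ptr, i, f) for i in range(3)]
```

With column-major storage the nonzeros of one row are scattered across all columns, so extracting a row means searching every column, which is why that case is slow.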

Generating random vector that's linearly independent of a set of vectors

I'm trying to come up with an algorithm that will allow me to generate a random N-dimensional real-valued vector that's linearly independent with respect to a set of already-generated vectors. I don't want to force them to be orthogonal, only linearly independent. I know Gram-Schmidt exists for the orthogonalization problem, but is there a weaker form that only gives you linearly independent vectors?

Step 1. Generate random vector vr.
Step 2. Copy vr to vo and update as follows: for every already generated vector v in v1, v2... vn, subtract the projection of vo on v.
The result is a random vector orthogonal to the subspace spanned by v1, v2... vn, provided those vectors are mutually orthogonal (if they are not, orthogonalize them first, e.g. with Gram-Schmidt). If v1, v2... vn already span the whole space, then the result is the zero vector, of course :)
The decision of whether the initial vector was linearly independent can be made by comparing the norm of vo to the norm of vr. Vectors that are not linearly independent will have a vo-norm which is zero or nearly zero (numerical precision issues may make it a small nonzero number on the order of a few times epsilon; the threshold can be tuned in an application-dependent way).
Pseudocode:
vr = random_vector()
vo = vr
for v in (v1, v2, ... vn):
    # subtract the projection of vo onto v
    vo = vo - dot( vo, v ) / dot( v, v ) * v
if norm(vo) < k1 * norm(vr):
    # this vector was mostly contained in the spanned subspace
else:
    # linearly independent, go ahead and use
Here k1 is a very small number, 1e-8 to 1e-10 perhaps?
You can also go by the angle between vr and the subspace: in that case, calculate it as theta = arcsin(norm(vo) / norm(vr)). Angles substantially different from zero correspond to linearly independent vectors.
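A runnable sketch of this procedure is given below, in plain Python. It assumes Gaussian random components and keeps an orthonormal basis of the accepted vectors on the side, so that subtracting projections really does remove everything inside the spanned subspace even though the returned vectors themselves are not orthogonal. The helper names are illustrative.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def residual(vr, ortho_basis):
    """Part of vr orthogonal to span(ortho_basis); the basis must be orthonormal."""
    vo = list(vr)
    for q in ortho_basis:
        c = dot(vo, q)
        vo = [a - c * b for a, b in zip(vo, q)]
    return vo

def random_independent_vectors(count, dim, k1=1e-8, rng=None):
    """Draw `count` random vectors in `dim` dimensions, rejecting near-dependent ones."""
    if count > dim:
        raise ValueError("at most dim vectors can be linearly independent")
    rng = rng or random.Random()
    vectors, basis = [], []
    while len(vectors) < count:
        vr = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vo = residual(vr, basis)
        if norm(vo) < k1 * norm(vr):
            continue                      # mostly inside the spanned subspace, retry
        vectors.append(vr)                # keep the original, non-orthogonal vector
        scale = norm(vo)
        basis.append([a / scale for a in vo])
    return vectors

vecs = random_independent_vectors(3, 3, rng=random.Random(0))
```

For random real-valued vectors the rejection branch is almost never taken; it matters mainly when the candidates come from a restricted or discrete distribution.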

A somewhat OTT scheme is to generate an NxN non-singular matrix, and use its columns (or rows) as the N linearly independent vectors.
To generate a non-singular matrix one could generate its SVD and multiply up. In more detail:
a/ generate a 'random' NxN orthogonal matrix U
b/ generate a 'random' NxN diagonal matrix S with positive numbers in the diagonal
c/ generate a 'random' NxN orthogonal matrix V
d/ compute
M = U*S*V'
To generate a 'random' orthogonal matrix U, one can use the fact that every orthogonal matrix can be written as a product of Householder reflectors, that is, of matrices of the form
H(v) = I - 2*v*v'/(v'*v)
where v is a non zero random vector.
So one could
initialise U to I
for i = 1..N:
    generate a nonzero vector v
    update: U := H(v)*U
Note that if all these matrix multiplications become burdensome, one could write a special routine to do the update of U. Applying H(v) to a vector u is O(N):
u -> u - 2*(v'*u)/(v'*v) * v
and so applying H(v) to U can be done in O(N squared) rather than O(N cubed).
One advantage of this scheme is that one has some control over 'how linearly independent' the vectors are. The product of the diagonal elements of S is (up to sign) the determinant of M, so that if this product is 'very small' the vectors are 'almost' linearly dependent.
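The scheme can be sketched in plain Python as follows. The Gaussian sampling of v and the range [0.5, 2.0] for the diagonal of S are assumptions made for this example, as are the function names; the answer does not specify them.

```python
import random

def apply_householder(v, U):
    """Return H(v) * U with H(v) = I - 2 v v' / (v' v), using O(N^2) operations."""
    n = len(v)
    vv = sum(x * x for x in v)
    # w = v' * U, a row vector of length n
    w = [sum(v[i] * U[i][j] for i in range(n)) for j in range(n)]
    return [[U[i][j] - 2.0 * v[i] * w[j] / vv for j in range(n)] for i in range(n)]

def random_orthogonal(n, rng):
    """Product of n Householder reflectors built from random vectors."""
    U = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]   # nonzero with probability 1
        U = apply_householder(v, U)
    return U

def random_nonsingular(n, rng, s_min=0.5, s_max=2.0):
    """M = U * S * V' with positive diagonal S, so |det M| is the product of S."""
    U = random_orthogonal(n, rng)
    V = random_orthogonal(n, rng)
    s = [rng.uniform(s_min, s_max) for _ in range(n)]
    return [[sum(U[i][k] * s[k] * V[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

rng = random.Random(1)
U = random_orthogonal(4, rng)
M = random_nonsingular(2, rng)   # its columns are two linearly independent vectors
```

Choosing `s_min` well away from zero keeps the columns of M comfortably independent, which is the control over conditioning mentioned above.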

efficient methods to do summation

Are there any efficient techniques to do the following summation?
Given a finite set A containing n integers, A={X1,X2,…,Xn}. There are n subsets of A, denoted by A1, A2, ... , An. We want to calculate the sum of the elements of each subset. Are there any efficient techniques?
(Note that n is typically larger than the average size of all the subsets of A.)
For example, if A={1,2,3,4,5,6,7,9}, A1={1,3,4,5} , A2={2,3,4} , A3= ... . A naive way of computing the sums for A1 and A2 needs 5 additions:
Sum(A1)=1+3+4+5=13
Sum(A2)=2+3+4=9
...
Now, if we compute 3+4 first and record its result 7, we only need 4 additions (one for 3+4, two for A1, one for A2):
Sum(A1)=1+7+5=13
Sum(A2)=2+7=9
...
What about the general case? Are there any efficient methods to speed up the calculation? Thanks!

For some choices of subsets there are ways to speed up the computation, if you don't mind doing some (potentially expensive) precomputation, but not for all. For instance, suppose your subsets are {1,2}, {2,3}, {3,4}, {4,5}, ..., {n-1,n}, {n,1}; then the naive approach uses one arithmetic operation per subset, and you obviously can't do better than that. On the other hand, if your subsets are {1}, {1,2}, {1,2,3}, {1,2,3,4}, ..., {1,2,...,n} then you can get by with n-1 arithmetic ops, whereas the naive approach is much worse.
Here's one way to do the precomputation. It will not always find optimal results. For each ordered pair of subsets (X, Y), define the transition cost from X to Y to be min(size of symmetric difference of X and Y, size of Y - 1). (The symmetric difference of X and Y is the set of things that are in X or Y but not both.) So the transition cost is the number of arithmetic operations you need to do to compute the sum of Y's elements, given the sum of X's. Add the empty set to your list of subsets, and compute a minimum-cost directed spanning tree using Edmonds' algorithm (http://en.wikipedia.org/wiki/Edmonds%27_algorithm) or one of the faster but more complicated variations on that theme. Now make sure that when your spanning tree has an edge X -> Y you compute X before Y. (This is a "topological sort" and can be done efficiently.)
This will give distinctly suboptimal results when, e.g., you have {1,2}, {3,4}, {1,2,3,4}, {5,6}, {7,8}, {5,6,7,8}. After deciding your order of operations using the procedure above you could then do an optimization pass where you find cheaper ways to evaluate each set's sum given the sums already computed, and this will probably give fairly decent results in practice.
I suspect, but have made no attempt to prove, that finding an optimal procedure for a given set of subsets is NP-hard or worse. (It is certainly computable; the set of possible computations you might do is finite. But, on the face of it, it may be awfully expensive; potentially you might be keeping track of about 2^n partial sums, be adding any one of them to any other at each step, and have up to about n^2 steps, for a super-naive cost of (2^2n)^(n^2) = 2^(2n^3) operations to try every possibility.)
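The transition cost used in the precomputation above can be sketched as follows. Only the cost function and a small check on the nested-subsets example are shown; the spanning-tree step (Edmonds' algorithm) is left out. The function name and the clamp to zero for an empty target set are choices made for this illustration.

```python
def transition_cost(x, y):
    """Operations needed to obtain sum(y) when sum(x) is already known.

    Either patch sum(x) with one operation per element of the symmetric
    difference, or add up y from scratch with len(y) - 1 additions.
    """
    from_scratch = max(len(y) - 1, 0)   # clamp so an empty y costs 0, not -1
    return min(len(x ^ y), from_scratch)

# Nested subsets {1}, {1,2}, ..., {1,...,5}: following the chain costs n - 1 in total.
n = 5
chain = [frozenset()] + [frozenset(range(1, k + 1)) for k in range(1, n + 1)]
chain_cost = sum(transition_cost(chain[k], chain[k + 1]) for k in range(n))

# Cost of adding every subset up independently, for comparison.
naive_cost = sum(max(len(s) - 1, 0) for s in chain)
```

These pairwise costs would be the edge weights of the directed graph handed to the spanning-tree computation.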

Assuming that 'addition' isn't simply an ADD operation but instead some very intensive function involving two integer operands, then an obvious approach would be to cache the results.
You could achieve that via a suitable data structure, for example a key-value dictionary containing keys formed by the two operands and the answers as the value.
But as you specified C in the question, then the simplest approach would be an n by n array of integers, where the solution to x + y is stored at array[x][y].
You can then repeatedly iterate over the subsets, and for each pair of operands you check the appropriate position in the array. If no value is present then it must be calculated and placed in the array. The value then replaces the two operands in the subset and you iterate.
If the operation is commutative then the operands should be sorted prior to looking up the array (i.e. so that the first index is always the smallest of the two operands) as this will maximise "cache" hits.

A common optimization technique is to pre-compute intermediate results. In your case, you might pre-compute all sums with 2 summands from A and store them in a lookup table. This will result in |A|*(|A|+1)/2 table entries, where |A| is the cardinality of A.
In order to compute the element sum of Ai, you:
look up the sum of the first two elements of Ai and save it in tmp
while there is an element x left in Ai:
    look up the sum of tmp and x and save it in tmp
In order to compute the element sum of A1 = {1,3,4,5} from your example, you do the following:
lookup(1,3) = 4
lookup(4,4) = 8
lookup(8,5) = 13
Note that computing the sum of any given Ai doesn't require summation, since all the work has already been conducted while pre-computing the lookup table.
If you store the lookup table in a hash table, then lookup() is in O(1).
Possible optimizations to this approach:
construct the lookup table while computing the summation results; hence, you only compute those summations that you actually need. Your lookup table is now a cache.
if your addition operation is commutative, you can save half of your cache size by storing only those summations where the smaller summand comes first. Then modify lookup() such that lookup(a,b) = lookup(b,a) if a > b.
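The two optimizations can be combined in a short sketch: a cache that is filled on demand and stores each pair with the smaller operand first. Filling the table on demand also covers intermediate values such as 8, which is not itself an element of A in the example. The names `make_cached_op` and `subset_sum` are illustrative, and the operation is assumed to be commutative.

```python
def make_cached_op(op):
    """Wrap a commutative two-operand operation with an on-demand lookup table."""
    cache = {}

    def lookup(a, b):
        key = (a, b) if a <= b else (b, a)   # smaller operand first
        if key not in cache:
            cache[key] = op(*key)            # computed once, reused afterwards
        return cache[key]

    return lookup, cache

def subset_sum(elements, lookup):
    """Fold the elements left to right, going through the cache for every pair."""
    it = iter(elements)
    tmp = next(it)
    for x in it:
        tmp = lookup(tmp, x)
    return tmp

lookup, cache = make_cached_op(lambda a, b: a + b)
sum_a1 = subset_sum([1, 3, 4, 5], lookup)
sum_a2 = subset_sum([2, 3, 4], lookup)
entries_after_first_pass = len(cache)
sum_a1_again = subset_sum([1, 3, 4, 5], lookup)   # served entirely from the cache
```

Whether this pays off depends on the operation being much more expensive than a dictionary lookup, which is the premise of this answer.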

Assuming summation is a time-consuming operation, you can find the LCS (longest common subsequence) of every pair of subsets (assuming they are sorted, as mentioned in the comments; if they are not, sort them first). Then calculate the sum of the longest LCS over all pairs, replace its elements in the affected arrays with that sum, update the LCSs, and continue this way until there is no LCS with more than one number. This is certainly not optimal, but it is better than the naive algorithm (fewer summations). You can, however, use backtracking to find the best solution.
E.g. for your sample input:
A1={1,3,4,5} , A2={2,3,4}
LCS(A1,A2) = {3,4} ==> 7 ==> replace it:
A1={1,5,7}, A2={2,7} ==> LCS = {7}, the maximum LCS length is 1, so calculate the sums.
You can still improve on this by calculating the sum of two random numbers, then taking the LCS again, ...

No. There is no efficient technique,
because this is an NP-complete problem, and no efficient solutions are known for such problems.
Why is it NP-complete?
We could use an algorithm for this problem to solve the set cover problem, just by adding an extra set containing all elements.
Example:
We have the sets of elements
A1={1,2}, A2={2,3}, A3 = {3,4}
We want to solve the set cover problem.
We add to these sets the set containing all elements:
A4 = {1,2,3,4}
We use the algorithm that John Smith is asking for and check which sets the solution for A4 is built from.
We have solved an NP-complete problem.
