I have a simple machine learning question:
I have n (~110) elements and a matrix of all the pairwise distances. I would like to choose the 10 elements that are farthest apart. That is, I want to
Maximize:
Choose 10 different elements.
Return min distance over (all pairings within the 10).
My distance metric is symmetric and respects the triangle inequality.
What kind of algorithm can I use? My first instinct is to do the following:
1. Cluster the n elements into 20 clusters.
2. Replace each cluster with just the element of that cluster that is furthest from the mean element of the original n.
3. Use brute force to solve the problem on the remaining 20 candidates. Luckily, 20 choose 10 is only 184,756.
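The brute-force step could be sketched roughly as follows in Python. This is only an illustration: best_subset is a name made up here, dist is assumed to be a full symmetric matrix (list of lists), and candidates holds the indices that survive the clustering step.

```python
from itertools import combinations

def best_subset(dist, candidates, k):
    """Return (score, subset) maximizing the minimum pairwise distance.

    dist is assumed to be a symmetric matrix indexed as dist[a][b];
    k must be at least 2 so that at least one pair exists.
    """
    best_score, best = float("-inf"), None
    for subset in combinations(candidates, k):
        # Objective from the question: smallest distance among all pairs.
        score = min(dist[a][b] for a, b in combinations(subset, 2))
        if score > best_score:
            best_score, best = score, subset
    return best_score, best
```

With 20 candidates and k = 10 this evaluates 184,756 subsets of 45 pairs each, which is slow-ish but feasible in pure Python.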
Edit: thanks to etarion's insightful comment, changed "Return sum of (distances)" to "Return min distance" in the optimization problem statement.
Here's how you might approach this combinatorial optimization problem by taking the convex relaxation.
Let D be an upper triangular matrix with your distances on the upper triangle. I.e. where i < j, D_i,j is the distance between elements i and j. (Presumably, you'll have zeros on the diagonal, as well.)
Then your objective is to maximize x'*D*x, where x is binary valued with 10 elements set to 1 and the rest to 0. (Setting the ith entry in x to 1 is analogous to selecting the ith element as one of your 10 elements.)
The "standard" convex optimization thing to do with a combinatorial problem like this is to relax the constraints such that x need not be discrete valued. Doing so gives us the following problem:
maximize y'*D*y
subject to: 0 <= y_i <= 1 for all i, 1'*y = 10
This is (morally) a quadratic program. (If we replace D with D + D', it'll become a bona fide quadratic program and the y you get out should be no different.) You can use an off-the-shelf QP solver, or just plug it into the convex optimization solver of your choice (e.g. cvx).
The y you get out need not be (and probably won't be) a binary vector, but you can convert the scalar values to discrete ones in a bunch of ways. (The simplest is probably to let x be 1 in the 10 entries where y_i is highest, but you might need to do something a little more complicated.) In any case, y'*D*y with the y you get out does give you an upper bound for the optimal value of x'*D*x, so if the x you construct from y has x'*D*x very close to y'*D*y, you can be pretty happy with your approximation.
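A minimal sketch of the simplest rounding step, in plain Python. The helper names round_top_k and quad_form are made up for illustration, and y is assumed to be whatever vector your solver returned; nothing here solves the relaxed problem itself.

```python
def quad_form(D, v):
    """Evaluate v' * D * v for a square matrix D given as a list of lists."""
    n = len(v)
    return sum(D[i][j] * v[i] * v[j] for i in range(n) for j in range(n))

def round_top_k(y, k):
    """Set x to 1 in the k entries where y is largest, 0 elsewhere."""
    top = sorted(range(len(y)), key=lambda i: y[i], reverse=True)[:k]
    x = [0] * len(y)
    for i in top:
        x[i] = 1
    return x
```

Comparing quad_form(D, x) against quad_form(D, y) for the solver's y then gives the gap mentioned above.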
Let me know if any of this is unclear, notation or otherwise.
Nice question.
I'm not sure if it can be solved exactly in an efficient manner, and your clustering-based solution seems reasonable. Another direction to look at would be local search methods such as simulated annealing and hill climbing.
Here's an obvious baseline I would compare any other solution against:
Repeat 100 times:
Greedily select the datapoint whose removal decreases the objective function the least and remove it.
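One way this baseline might look in Python, assuming the min-pairwise-distance objective from the edited question. The names greedy_removal and min_pairwise are invented for this sketch, and dist is assumed to be a full symmetric matrix.

```python
from itertools import combinations

def min_pairwise(dist, items):
    """Objective: the smallest distance among all pairs in items."""
    return min(dist[a][b] for a, b in combinations(items, 2))

def greedy_removal(dist, k):
    """Drop one point at a time until k remain (k must be at least 2)."""
    keep = list(range(len(dist)))
    while len(keep) > k:
        # Try dropping each point; keep the drop that leaves the best objective.
        drop = max(
            keep,
            key=lambda p: min_pairwise(dist, [q for q in keep if q != p]),
        )
        keep.remove(drop)
    return keep
```

For n around 110 this recomputes the objective from scratch at every step, which is wasteful but should still finish in reasonable time; caching the pairwise minima would be the obvious refinement.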
Say I have some array of length n where arr[k] represents how much of object k I want. I also have some arbitrary number of arrays which I can sum integer multiples of in any combination - my goal being to minimise the sum of the absolute differences across each element.
So as a dumb example if my target was [2,1] and my options were A = [2,3] and B = [0,1], then I could take A - 2B and have a cost of 0
I’m wondering if there is an efficient algorithm for approximating something like this? It has a weird knapsack-y flavour to it; is it maybe just intractable for large n? It doesn’t seem very amenable to DP methods.
This is the (NP-hard) closest vector problem. There's an algorithm due to Fincke and Pohst ("Improved methods for calculating vectors of short length in a lattice, including a complexity analysis"), but I haven't personally worked with it.
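To make the objective concrete, here is a tiny exhaustive search in Python over a small coefficient range. It is not the Fincke and Pohst method and it scales exponentially in the number of option arrays; closest_combination and max_coeff are names assumed for this sketch.

```python
from itertools import product

def closest_combination(target, options, max_coeff=3):
    """Try every integer coefficient vector in [-max_coeff, max_coeff]
    and return (cost, coeffs) with the smallest sum of absolute differences."""
    best_cost, best_coeffs = None, None
    coeff_range = range(-max_coeff, max_coeff + 1)
    for coeffs in product(coeff_range, repeat=len(options)):
        # Integer combination of the option arrays, element by element.
        combo = [
            sum(c * opt[k] for c, opt in zip(coeffs, options))
            for k in range(len(target))
        ]
        cost = sum(abs(t - v) for t, v in zip(target, combo))
        if best_cost is None or cost < best_cost:
            best_cost, best_coeffs = cost, coeffs
    return best_cost, best_coeffs
```

On the example from the question, target [2, 1] with A = [2, 3] and B = [0, 1], this finds the coefficients (1, -2), i.e. A - 2B, with cost 0.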
I'm trying to find the (numerical) curvature at specific points. I have data stored in an array, and I essentially want to find the local curvature at every separate point. I've searched around, and found three different implementations for this in MATLAB: diff, gradient, and del2.
If my array's name is arr I have tried the following implementations:
curvature = diff(diff(arr));
curvature = diff(arr,2);
curvature = gradient(gradient(arr));
curvature = del2(arr);
The first two seem to output the same values. This makes sense, because they're essentially the same implementation. However, the gradient and del2 implementations give different values from each other and from diff.
I can't figure out from the documentation precisely how the implementations work. My guess is that some of them are some type of two-sided derivative, and some of them are not two-sided derivatives. Another thing that confuses me is that my current implementations use only the data from arr. arr is my y-axis data, the x-axis essentially being time. Do these functions default to a stepsize of 1 or something like that?
If it helps, I want an implementation that takes the curvature at the current point using only previous array elements. For context, my data is such that a curvature calculation based on data in the future of the current point wouldn't be useful for my purposes.
tl;dr I need a rigorous curvature at a point implementation that uses only data to the left of the point.
Edit: I kind of better understand what's going on based on this, thanks to the answers below. This is what I'm referring to:
gradient calculates the central difference for interior data points. For example, consider a matrix with unit-spaced data, A, that has horizontal gradient G = gradient(A). The interior gradient values, G(:,j), are
G(:,j) = 0.5*(A(:,j+1) - A(:,j-1));
The subscript j varies between 2 and N-1, with N = size(A,2).
Even so, I still want to know how to do a "lefthand" computation.
diff is simply the difference between adjacent elements in arr, which is exactly why you lose 1 element each time you apply diff. For example, 10 elements in an array have only 9 differences.
gradient and del2 are for derivatives. Of course, you can use diff to approximate a derivative by dividing the differences by the step size. Usually the steps are equally spaced, but they do not have to be. This also answers your question of why x is not used in the calculation: without a spacing argument, these functions assume a step of 1. And it's okay if your x is not uniformly spaced.
So why does gradient give us an array of the same length as the original array? The manual explains clearly how the boundary is handled:
The gradient values along the edges of the matrix are calculated with single-sided differences, so that
G(:,1) = A(:,2) - A(:,1);
G(:,N) = A(:,N) - A(:,N-1);
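For a 1-D array, the behaviour quoted from the manual can be mimicked in a few lines of Python. This is a sketch of the formulas quoted above, not MATLAB's actual implementation; gradient_1d and the spacing parameter h are names chosen here.

```python
def gradient_1d(a, h=1.0):
    """Central differences in the interior, single-sided at both edges.

    Assumes len(a) >= 2 and a uniform spacing h between samples.
    """
    n = len(a)
    g = [0.0] * n
    g[0] = (a[1] - a[0]) / h        # forward difference at the left edge
    g[-1] = (a[-1] - a[-2]) / h     # backward difference at the right edge
    for j in range(1, n - 1):
        g[j] = 0.5 * (a[j + 1] - a[j - 1]) / h  # central difference
    return g
```

Note that every interior value uses the sample to the right of the current point, which is why this is not suitable when only past data may be used.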
Applying gradient twice and using del2 do not necessarily give the same result, although the two are highly correlated. It all comes down to how the 2nd-order derivative is approximated: the former approximates it by taking the 1st derivative twice, while the latter approximates the 2nd derivative directly. Please refer to the help manual, where the formulas are documented.
Okay, do you really want curvature or the 2nd derivative for each point on arr? They are very different. https://en.wikipedia.org/wiki/Curvature#Precise_definition
You can use diff to get the 2nd derivative from the left. Since diff subtracts each element from the one after it, e.g. x(2)-x(1), you can first flip x from left to right, then use diff. Some code like,
x=fliplr(x);
first=diff(x)./h;
second=diff(first)./h;
where h is the spacing between samples. Notice I use ./, which indicates that h can be an array of matching length (i.e. non-uniform spacing).
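If the goal is a second difference at index k that touches only y[k], y[k-1] and y[k-2], a direct backward formula may be simpler than flipping. A rough Python sketch, assuming uniform spacing h (backward_second_difference is a name made up here):

```python
def backward_second_difference(y, h=1.0):
    """Second difference at each index k using only y[k], y[k-1], y[k-2].

    The first two entries are None because they lack enough history.
    Assumes uniformly spaced samples with spacing h.
    """
    out = [None, None]
    for k in range(2, len(y)):
        out.append((y[k] - 2 * y[k - 1] + y[k - 2]) / (h * h))
    return out
```

The numbers are the same as in the central formula, only attributed to the newest sample, so as an estimate of the second derivative at k it is first-order accurate in h.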
I'm studying the Ising model, and I'm trying to efficiently compute a function H(σ) where σ is the current state of an LxL lattice (that is, σ_ij ∈ {+1, -1} for i,j ∈ {1,2,...,L}). To compute H for a particular σ, I need to perform the following calculation:
H(σ) = -J Σ_⟨i j⟩ σ_i σ_j
where ⟨i j⟩ indicates that sites σ_i and σ_j are nearest neighbors and (suppose) J is a constant.
A couple of questions:
Should I store my state σ as an LxL matrix or as a list of length L^2? Is one better than the other for memory access in RAM (which I guess depends on the way I'm accessing elements...)?
In either case, how can I best compute H?
Really I think this boils down to how can I access (and manipulate) the neighbors of every state most efficiently.
Some thoughts:
I see that if I loop through each element in the list or matrix that I'll be double counting, so is there a "best" way to return the unique neighbors?
Is there a better data structure that I'm not thinking of?
Your question is a bit broad and a bit confusing for me, so excuse me if my answer is not the one you are looking for, but I hope it will help (a bit).
An array is faster than a list when it comes to indexing. A matrix is a 2D array with N rows and M columns (where N and M are both L for you).
That means that you first access a[i] and then a[i][j].
However, you can avoid this double access, by emulating a 2D array with a 1D array. In that case, if you want to access element a[i][j] in your matrix, you would now do, a[i * L + j].
That way you load only once, at the cost of a multiplication and an addition to compute the index, which may still be faster in some cases.
Now as for the Nearest Neighbor question, it seems that you are using a square-lattice Ising model, which means that you are working in 2 dimensions.
A very efficient data structure for Nearest Neighbor Search in low dimensions is the kd-tree. The construction of that tree takes O(n log n) time, where n is the size of your dataset.
Now you should think if it's worth it to build such a data structure.
PS: There is a plethora of libraries implementing the kd-tree, such as CGAL.
I encountered this problem during one of my school assignments and I think the solution depends on which programming language you are using.
In terms of efficiency, there is no better way than to write a for loop that sums the neighbours (which are the 4 points (i±1, j) and (i, j±1) for a given (i, j)). However, when SIMD (SSE etc.) instructions are available, you can re-express this as a convolution with the 2D kernel {0 1 0; 1 0 1; 0 1 0}, so if you use a numerical library that exploits SIMD you can obtain a significant performance increase. You can see an example implementation here: https://github.com/zawlin/cs5340/blob/master/a1_code/denoiseIsingGibbs.py
Note that in this case the performance improvement is huge, because evaluating it in plain Python would require an expensive for loop.
In terms of work, there is in fact some waste, namely the unnecessary multiplications and additions with the zeros at the corners and centre of the kernel. So whether you see a performance improvement depends quite a bit on your programming environment (if you are already in C/C++, it can be difficult, and you may need to use MKL etc. to obtain a good improvement).
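For reference, a plain-loop version in Python that avoids double counting by visiting only the right and down neighbour of each site. Several things are assumed here rather than taken from the question: the energy has the usual form H = -J * (sum over nearest-neighbour pairs of σ_i σ_j), the boundaries are periodic, the state is stored as a flat list indexed by i * L + j, and ising_energy is just a name for this sketch.

```python
def ising_energy(spins, L, J=1.0):
    """Energy of an L x L lattice stored as a flat list of +1/-1 values.

    Each bond is counted once via the right and down neighbours.
    Assumes periodic boundaries and L > 2 (for L = 2 the wrap-around
    would count the same bond twice).
    """
    total = 0
    for i in range(L):
        for j in range(L):
            s = spins[i * L + j]
            right = spins[i * L + (j + 1) % L]
            down = spins[((i + 1) % L) * L + j]
            total += s * (right + down)
    return -J * total
```

The same right-and-down trick works for a 2D array; the flat layout only changes how the index is computed.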
This is an algorithm question.
Given 1 million points, each of which has x and y coordinates that are floating point numbers.
Find the 10 closest points for the given point as fast as possible.
The closeness can be measured as Euclidean distance on a plane or other kind of distance on a globe. I prefer binary search due to the large number of points.
My idea:
save the points in a database
1. Amplify x by a large integer, e.g. 10^4, cut off the decimal part, and then amplify the resulting integer by 10^4 again.
2. Amplify y by a smaller integer, e.g. 10^2, and cut off the decimal part.
3. Sum the results from steps 1 and 2; we call the sum associate_value.
4. Repeat steps 1 to 3 for each point in the database.
E.g.
x = 12.3456789, y = 98.7654321
x times 10^4 = 123456.789, cut off to 123456, and then times 10^4 to get 1234560000
y times 10^2 = 9876.54321, cut off to 9876
Sum them, get 1234560000 + 9876 = 1234569876
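Written out in Python, the steps with the scale factors from this worked example (10^4 for x, 10^2 for y) might look like the sketch below. The function name is borrowed from the column name; everything else is an assumption about how the idea would be coded.

```python
def associate_value(x, y):
    """Pack a 2-D point into one integer, following the worked example."""
    x_part = int(x * 10**4) * 10**4   # keep 4 decimals of x, shift left 4 digits
    y_part = int(y * 10**2)           # keep 2 decimals of y
    return x_part + y_part
```

One thing this makes visible: the y part only stays inside the low four digits while 0 <= y < 100, and because x occupies the high digits, points with similar y but different x end up far apart in the index.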
In this way, I transform 2-d data to 1-d data. In the database, each point is associated with an integer (associate_value). The integer column can be set as index in the database for fast search.
For a given point (x, y), I perform step 1 - 3 for it and then find the points in the database such that their associate_value is close to the given point associate_value.
e.g.
x = 59.469797 , y = 96.4976416
their associated value is 5946979649
Then in the database, I search the associate_values that are close to 5946979649, for example, 5946979649 + 50 , 5946979649 - 50 and also 5946979649 + 50000000 , 5946979649 - 500000000. This can be done by index-search in database.
In this way, I can find a group of points that are close to the given point. I can reduce the search space greatly. Then, I can use Euclidean or other distance formula to find the closest points.
I am not sure about the efficiency of the algorithm, especially the process of generating the associate_values.
Does my idea work? Are there any better ideas?
Thanks
Your idea seems like it may work, but I would be concerned about degenerate cases (for example, if no points fall in your specified ranges, though maybe that's not possible given the constraints). Either way, since you asked for other ideas, here's my stab at it: store all of your points in a quadtree. Then just walk down the quadtree until you have a sufficiently small group to search through. Since the points are fixed, the cost of building the quadtree is paid only once, and each lookup should be logarithmic in the number of points you have.
You can do better and just concatenate the binary values of the x and y coordinates bit by bit (that is, interleave their bits). Instead of a straight line, this orders the points along a z-curve. Then you can compute the upper bounds with the most significant bits. The z-curve is often used in mapping applications: http://msdn.microsoft.com/en-us/library/bb259689.aspx.
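A small Python sketch of computing such a z-curve key by bit interleaving. It assumes the coordinates have already been scaled and truncated to non-negative integers that fit in the given number of bits; interleave_bits is a name chosen for illustration.

```python
def interleave_bits(x, y, bits=16):
    """Build a z-order key: bit b of x goes to position 2b,
    bit b of y goes to position 2b + 1."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (2 * b)
        code |= ((y >> b) & 1) << (2 * b + 1)
    return code
```

Points that share their high-order bits in both coordinates share a key prefix, which is what makes range queries on the key useful for narrowing down candidates.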
The way I read your algorithm, you are picking out the values that lie along a line with a slope of -1 through your point, i.e. if your point is 2,2 you would look at points 1,3 0,4 and -1,5 and likely miss closer points. Most algorithms to solve this are O(n), which isn't terribly bad.
A simple algorithm to solve this problem is to keep a priority queue of the closest ten points, along with the distance to the furthest of those ten, as you iterate over the set. If a point's x or y offset from the query is not within that furthest distance, discard it immediately. Otherwise, calculate its distance with whatever distance measurement you're using and see whether it gets inserted into the queue. If so, update your furthest-of-the-top-ten threshold and continue iterating.
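The scan described above could be sketched like this in Python, using heapq as the priority queue and squared Euclidean distance. The function name k_nearest and the tuple representation of points are assumptions made for the example.

```python
import heapq

def k_nearest(points, query, k=10):
    """Single pass over points, keeping the k closest to query.

    The heap stores (-squared_distance, point), so heap[0] is always
    the furthest of the current best k.
    """
    qx, qy = query
    heap = []
    for px, py in points:
        dx, dy = px - qx, py - qy
        if len(heap) == k:
            worst = -heap[0][0]
            # Cheap rejection: one coordinate alone is already too far.
            if dx * dx >= worst or dy * dy >= worst:
                continue
        d2 = dx * dx + dy * dy
        if len(heap) < k:
            heapq.heappush(heap, (-d2, (px, py)))
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, (-d2, (px, py)))
    # Largest negated distance first, i.e. nearest point first.
    return [p for _, p in sorted(heap, reverse=True)]
```

Each point costs at most O(log k) heap work, so the whole pass stays O(n) for fixed k.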
If your points are pre-sorted on one of the axes, you can further optimize the algorithm by starting at the point matching the query on that axis and radiating outward until the difference along that axis is greater than the distance to your tenth-closest point. I did not include sorting in the description in the paragraph above because sorting is O(n log n), which is slower than O(n). If you are doing this multiple times on the same set, then it could be beneficial to sort it.
I have an array (arr) of elements, and a function (f) that takes 2 elements and returns a number.
I need a permutation of the array, such that f(arr[i], arr[i+1]) is as little as possible for each i in arr. (and it should loop, ie. it should also minimize f(arr[arr.length - 1], arr[0]))
Also, f works sort of like a distance, so f(a,b) == f(b,a)
I don't need the optimum solution if it's too inefficient, but one that works reasonably well and is fast, since I need to calculate them pretty much in realtime (I don't know what the length of arr is, but I think it could be something around 30).
What does "such that f(arr[i], arr[i+1]) is as little as possible for each i in arr" mean? Do you want to minimize the sum? Do you want to minimize the largest of those? Do you want to minimize f(arr[0],arr[1]) first, then among all solutions that minimize this, pick the one that minimizes f(arr[1],arr[2]), and so on?
If you want to minimize the sum, this is exactly the Traveling Salesman Problem in its full generality (well, "metric TSP", maybe, if your f's indeed form a metric). There are clever optimizations to the naive solution that will give you the exact optimum and run in reasonable time for about n=30; you could use one of those, or one of the heuristics that give you approximations.
If you want to minimize the maximum, it is a simpler problem although still NP-hard: you can do binary search on the answer; for a particular value d, draw edges for pairs which have f(x,y)
If you want to minimize it lexicographically, it's trivial: pick the pair with the shortest distance and put it as arr[0],arr[1], then pick arr[2] that is closest to arr[1], and so on.
Depending on where your f(,)s are coming from, this might be a much easier problem than TSP; it would be useful for you to mention that as well.
You're not entirely clear what you're optimizing - the sum of the f(a[i],a[i+1]) values, the max of them, or something else?
In any event, with your speed limitations, greedy is probably your best bet - pick an element to make a[0] (it doesn't matter which due to the wraparound), then choose each successive element a[i+1] to be the one that minimizes f(a[i],a[i+1]).
That's going to be O(n^2), but with 30 items, unless this is in an inner loop or something, that will be fine. If your f() really is associative and commutative, then you might be able to do it in O(n log n). Clearly no faster, by reduction to sorting.
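A short Python sketch of that greedy ordering. greedy_cycle is a made-up name, f is whatever pairwise function you have, and arr is assumed to be non-empty.

```python
def greedy_cycle(arr, f):
    """Start from arr[0], then repeatedly append the remaining element
    that is closest (under f) to the last one placed."""
    order = [arr[0]]
    remaining = list(arr[1:])
    while remaining:
        nxt = min(remaining, key=lambda e: f(order[-1], e))
        remaining.remove(nxt)
        order.append(nxt)
    return order
```

Note that nothing in this loop looks at the closing pair f(order[-1], order[0]), so the wraparound cost can turn out large; a local improvement pass afterwards would be one way to address that.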
I don't think the problem is well-defined in this form:
Let's instead define n functions g_i : Perms -> Reals
g_i(p) = f(a^p[i], a^p[i+1]), and wrap around when i+1 > n
To say you want to minimize f over all permutations really implies you can pick a value of i and minimize g_i over all permutations, but for any p which minimizes g_i, a related but different permutation minimizes g_j (just conjugate the permutation). So it makes no sense to speak of minimizing f over permutations for each i.
Unless we know something more about the structure of f(x,y) this is an NP-hard problem. Given a graph G and any vertices x,y let f(x,y) be 1 if there is no edge and 0 if there is an edge. What the problem asks is an ordering of the vertices so that the maximum f(arr[i],arr[i+1]) value is minimized. Since for this function it can only be 0 or 1, returning a 0 is equivalent to finding a Hamiltonian path in G and 1 is saying that no such path exists.
The function would have to have some sort of structure that disallows this example for it to be tractable.