i am not sure to have understood why the elbow method is an approximate right way to determine a value of epsilon for DBSCAN algorithm. For instance, in the example below:
I considered the distance from the 5-th nearest neighbors and the points are arranged from the one with the minimum 5th-neighbor distance to the one that is at most distance from the 5th-neighbor.
I considered euclidean distane for the plot.
So i know that point 0-20, for instance, are the ones that are the most close to their 5-th nearest neighbor, then the points in the elbow are the one at intermediate distance from their 5-th nearest neighbor, and so they have a medium density. Then we reach point of very low density, far from their 5-th closest neighbor.
But I can't understand why it is reasonable to choose as the value of epsilon the distance between the k-th closest neigbor of points in the elbow.
Thanks for the help.
From the paper dbscan: Fast Density-Based Clustering with R (page 11)
To find a suitable value for eps, we can plot the points’ kNN
distances (i.e., the distance of each point to its k-th nearest
neighbor) in decreasing order and look for a knee in the plot. The
idea behind this heuristic is that points located inside of clusters
will have a small k-nearest neighbor distance, because they are close
to other points in the same cluster, while noise points are more
isolated and will have a rather large kNN distance.
We need a cutoff to decide what k-NN distance is considered "small" and what is "large." The knee heuristic identifies such a cutoff for epsilon as the k-NN distance where the distance starts to increase rapidly.
Related
I have multiple faces in 3D space creating cells. All these faces lie within a predefined cube (e.g. of size 100x100x100).
Every face is convex and defined by a set of corner points and a normal vector. Every cell is convex. The cells are result of 3d voronoi tessellation, and I know the initial seed points of the cells.
Now for every integer coordinate I want the smallest distance to any face.
My current solution uses this answer https://math.stackexchange.com/questions/544946/determine-if-projection-of-3d-point-onto-plane-is-within-a-triangle/544947 and calculates for every point for every face for every possible triple of this faces points the projection of the point to the triangle created by the triple, checks if the projection is inside the triangle. If this is the case I return the distance between projection and original point. If not I calculate the distance from the point to every possible line segment defined by two points of a face. Then I choose the smallest distance. I repeat this for every point.
This is quite slow and clumsy. I would much rather calculate all points that lie on (or almost lie on) a face and then with these calculate the smallest distance to all neighbour points and repeat this.
I have found this Get all points within a Triangle but am not sure how to apply it to 3D space.
Are there any techniques or algorithms to do this efficiently?
Since we're working with a Voronoi tessellation, we can simplify the current algorithm. Given a grid point p, it belongs to the cell of some site q. Take the minimum over each neighboring site r of the distance from p to the plane that is the perpendicular bisector of qr. We don't need to worry whether the closest point s on the plane belongs to the face between q and r; if not, the segment ps intersects some other face of the cell, which is necessarily closer.
Actually it doesn't even matter if we loop r over some sites that are not neighbors. So if you don't have access to a point location subroutine, or it's slow, we can use a fast nearest neighbors algorithm. Given the grid point p, we know that q is the closest site. Find the second closest site r and compute the distance d(p, bisector(qr)) as above. Now we can prune the sites that are too far away from q (for every other site s, we have d(p, bisector(qs)) ≥ d(q, s)/2 − d(p, q), so we can prune s unless d(q, s) ≤ 2 (d(p, bisector(qr)) + d(p, q))) and keep going until we have either considered or pruned every other site. To do pruning in the best possible way requires access to the guts of the nearest neighbor algorithm; I know that it slots right into the best-first depth-first search of a kd-tree or a cover tree.
I have a problem where a maze is given and every cell has a number. At any cell, you can jump up/down/left/right as many cells as specified in the current cell. What could be a good admissible heuristic to reach the goal using A*?
I have tried Manhattan/Euclidean distance, but I believe they overestimate the cost.
A simple lower-bound would be the Manhatten distance divided by the size of the largest jump on the map (rounded down). This should also be a consistent heuristic.
I know this may be a duplicate, but it seems like a variation on the 'Closest pair of Points' algorithm.
Given a Set of N points (x, y) in the unit square and a distance d, find all pair of points such that the distance between them is at most d.
For large N the brute force method is not an option. Besides the 'sweep line' and 'divide and conquer' methods, is there a simpler solution? These pair of points are the edges of an undirected graph, that i need to traverse it and say if it's connected or not (which i already did using DFS, but when N = 1 million it never finishes!).
Any pseudocode, comments or ideas are welcome,
Thanks!
EDIT: I found this on Sedgewick book (i'm looking at the code right now):
Program 3.18 uses a two-dimensional array of linked lists to improve the running time of Program 3.7 by a factor of about 1/d2 when N is sufficiently large. It divides the unit
square up into a grid of equal-sized smaller squares. Then, for each square, it builds a linked list of all the
points that fall into that square. The two-dimensional array provides the capability to access immediately
the set of points close to a given point; the linked lists provide the flexibility to store the points where
they may fall without our having to know ahead of time how many points fall into each grid square.
We are really looking for points that are inside of a circle of center (x,y) and radius d.
The square that encloses circle is a square of center (x,y) and sides 2d. Any point out of this square does not need to be checked, it's out. So, a point a (xa, ya) is out if abs(xa - x) > d or abs (ya -yb) > d.
Same for the square that is enclosed by that circle is a square of center (x,y) and diagonals 2d. Any point out of this square does not need to be checked, it's in. So, a point a (xa, ya) is in if abs(xa - x) < (d * 1.412) or abs(ya -yb) < (d * 1.412).
Those two easy rules combined reduce a lot the number of points to be checked. If we sort the point by their x, filter the possible points, sort by their y, we come to the ones we really need to calculate the full distance.
For any given point, you can use a Manhattan distance (x-delta plus y-delta) heuristic to filter out most of the points that are not within the distance "d" - filter out any points whose Manhattan distance is greater than (sqrt(2) * d), then run the expensive-and-precise distance test on the remaining points.
I have a points with coordinates EPSG Projection 3059 - LKS92 / Latvia TM. I need to compute the distance in meters between two points.
It is easy to compute euclidean distance between two points, but I am not sure if the resulting distance is expressed in meters?
PROJCS["LKS92 / Latvia TM",
UNIT["metre",1,
PARAMETER["scale_factor",0.9996],
The unit is 1 meter, but should the scale factor taken into account? Maybe 1 unit in this system is not 1m but 0.9996 meters?
No, the scale factor in a map projection is integrated in the projection calculations. The purpose of the factor is to minimise overall distortion as the easting from the central meridian (24°E) increases. 0.9996 is a common factor when transverse mercator (TM) is involved.
Once projected, the resultant coordinate is in metres (in this case), no further scaling is needed, and you can use a simple Pythagorean hypotenuse calculation to determine a distance.
I have the following geometry problem:
You are given a circle with the center in origin - C(0, 0), and radius 1. Inside the circle are given N points which represent the centers of N different circles. You are asked to find the minimum radius of the small circles (the radius of all the circles are equal) in order to cover all the boundary of the large circle.
The number of circles is: 3 ≤ N ≤ 10000 and the problem has to be solved with a precision of P decimals where 1 ≤ P ≤ 6.
For example:
N = 3 and P = 4
and the coordinates:
(0.193, 0.722)
(-0.158, -0.438)
(-0.068, 0.00)
The radius of the small circles is: 1.0686.
I have the following idea but my problem is implementing it. The idea consists of a binary search to find the radius and for each value given by the binary search to try and find all the intersection point between the small circles and the large one. Each intersection will have as result an arc. The next step is to 'project' the coordinates of the arcs on to the X axis and Y axis, the result being a number of intervals. If the reunions of the intervals from the X and the Y axis have as result the interval [-1, 1] on each axis, it means that all the circle is covered.
In order to avoid precision problems I thought of searching between 0 and 2×10P, and also taking the radius as 10P, thus eliminating the figures after the comma, but my problem is figuring out how to simulate the intersection of the circles and afterwards how to see if the reunion of the resulting intervals form the interval [-1, 1].
Any suggestions are welcomed!
Each point in your set has to cover the the intersection of its cell in the point-set's voronoi diagram and the test-circle around the origin.
To find the radius, start by computing the voronoi diagram of your point set. Now "close" this voronoi diagram by intersecting all infinite edges with your target-circle. Then for each point in your set, check the distance to all the points of its "closed" voronoi cell. The maximum should be your solution.
It shouldn't matter that the cells get closed by an arc instead of a straight line by the test-circle until your solution radius gets greater than 1 (because then the "small" circles will arc stronger). In that case, you also have to check the furthest point from the cell center to that arc.
I might be missing something, but it seems that you only need to find the maximal minimal distance between a point in the circle and the given points.
That is, if you consider the set of all points on the circle, and take the minimal distance between each point to one of the given points, and then take the maximal values of all these - you have found your radius.
This is, of course, not an algorithm, as there are uncountably many points.
I think what I'll do would be along the line of:
Find the minimal distance between the circumference and the set of points, this is your initial radius R.
Check if the entire circle was covered, like so:
For any two points whose distance from each other is more than 2R, check if the entire segment was covered (for each point, check if the circle around it intersects, and if so, remove that segment and keep going). That should take about o(N^3) (you iterate over all of the points for each pair of points). If I'm correct (though I didn't formally prove it) the circle is covered iff all of the segments are covered.
Of all the segment which weren't covered, take the long one, and add half it's length to R.
Repeat.
This algorithm will never cover the circle per se, but it's easy to prove that it exponentially converges to a full cover, so it should be able to find the needed radius with arbitrary accuracy within a reasonable amount of iterations.
Hope that helps.