I am implementing DBSCAN on a dataset. First I sorted the data, then computed the distance from each point to its nearest neighbors, and plotted those minimum distances.
This should give the elbow curve used to judge the density of the data points and pick a minimum-distance (eps) value. But the curve I get looks different from the sample example I found on the internet. Is my curve OK? I have taken eps = 0.7; is this value correct?
[Figure: my k-distance curve]
[Figure: sample elbow curve from the internet]
I have multiple faces in 3D space creating cells. All these faces lie within a predefined cube (e.g. of size 100x100x100).
Every face is convex and defined by a set of corner points and a normal vector. Every cell is convex. The cells are the result of a 3D Voronoi tessellation, and I know the initial seed points of the cells.
Now for every integer coordinate I want the smallest distance to any face.
My current solution uses this answer https://math.stackexchange.com/questions/544946/determine-if-projection-of-3d-point-onto-plane-is-within-a-triangle/544947. For every point, for every face, and for every possible triple of that face's corner points, it projects the point onto the triangle formed by the triple and checks whether the projection lies inside the triangle. If so, I return the distance between the projection and the original point. If not, I compute the distance from the point to every line segment defined by two points of the face. Then I take the smallest distance, and I repeat this for every point.
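For reference, a minimal NumPy sketch of that per-triangle step (function and variable names are mine, not from the linked answer): project the point onto the triangle's plane, test the projection with barycentric coordinates, and fall back to the edges otherwise.

```python
import numpy as np

def point_triangle_distance(p, a, b, c):
    """Distance from 3D point p to triangle (a, b, c)."""
    ab, ac, ap = b - a, c - a, p - a
    n = np.cross(ab, ac)
    n /= np.linalg.norm(n)                      # assumes a non-degenerate triangle
    proj = p - np.dot(ap, n) * n                # projection of p onto the plane
    # barycentric coordinates of the projection
    d00, d01, d11 = ab @ ab, ab @ ac, ac @ ac
    d20, d21 = (proj - a) @ ab, (proj - a) @ ac
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    if v >= 0 and w >= 0 and v + w <= 1:        # projection falls inside the triangle
        return np.linalg.norm(p - proj)
    def seg_dist(s, e):                         # otherwise: closest point is on an edge
        t = np.clip((p - s) @ (e - s) / ((e - s) @ (e - s)), 0.0, 1.0)
        return np.linalg.norm(p - (s + t * (e - s)))
    return min(seg_dist(a, b), seg_dist(b, c), seg_dist(c, a))
```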
This is quite slow and clumsy. I would much rather calculate all points that lie on (or almost on) a face, and then from these calculate the smallest distance to all neighbouring points, and repeat.
I have found Get all points within a Triangle, but am not sure how to apply it to 3D space.
Are there any techniques or algorithms to do this efficiently?
Since we're working with a Voronoi tessellation, we can simplify the current algorithm. Given a grid point p, it belongs to the cell of some site q. Take the minimum over each neighboring site r of the distance from p to the plane that is the perpendicular bisector of qr. We don't need to worry whether the closest point s on the plane belongs to the face between q and r; if not, the segment ps intersects some other face of the cell, which is necessarily closer.
Actually it doesn't even matter if we loop r over some sites that are not neighbors. So if you don't have access to a point location subroutine, or it's slow, we can use a fast nearest neighbors algorithm. Given the grid point p, we know that q is the closest site. Find the second closest site r and compute the distance d(p, bisector(qr)) as above. Now we can prune the sites that are too far away from q (for every other site s, we have d(p, bisector(qs)) ≥ d(q, s)/2 − d(p, q), so we can prune s unless d(q, s) ≤ 2 (d(p, bisector(qr)) + d(p, q))) and keep going until we have either considered or pruned every other site. To do pruning in the best possible way requires access to the guts of the nearest neighbor algorithm; I know that it slots right into the best-first depth-first search of a kd-tree or a cover tree.
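A sketch of that loop, assuming the sites sit in a SciPy kd-tree (a real implementation would pull neighbors from the tree incrementally instead of querying them all up front). It uses the identity d(p, bisector(q, s)) = (d(p, s)² − d(p, q)²) / (2 d(q, s)), which follows from expanding the dot products, so no planes need to be constructed explicitly:

```python
import numpy as np
from scipy.spatial import cKDTree

def distance_to_nearest_face(p, sites, tree):
    dists, idx = tree.query(p, k=len(sites))    # sites ordered by distance to p
    d_pq = dists[0]                             # q = sites[idx[0]] owns p's cell
    best = np.inf
    for d_ps, i in zip(dists[1:], idx[1:]):
        # d(q, s) >= d(p, s) - d(p, q), so once the pruning condition
        # d(q, s) > 2 (best + d(p, q)) holds here it holds for all
        # later (more distant) sites and we can stop
        if d_ps - d_pq > 2.0 * (best + d_pq):
            break
        d_qs = np.linalg.norm(sites[i] - sites[idx[0]])
        best = min(best, (d_ps**2 - d_pq**2) / (2.0 * d_qs))
    return best

sites = np.random.rand(200, 3) * 100            # stand-in seed points
tree = cKDTree(sites)
print(distance_to_nearest_face(np.array([50.0, 50.0, 50.0]), sites, tree))
```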
I am not sure I understand why the elbow method is an approximately right way to determine a value of epsilon for the DBSCAN algorithm. For instance, in the example below:
I considered the distance to the 5th nearest neighbor, and the points are arranged from the one with the smallest 5th-neighbor distance to the one with the largest.
I used the Euclidean distance for the plot.
So I know that points 0-20, for instance, are the ones closest to their 5th nearest neighbor; the points in the elbow are at an intermediate distance from their 5th nearest neighbor, so they have medium density; and beyond that we reach points of very low density, far from their 5th closest neighbor.
But I can't understand why it is reasonable to choose as the value of epsilon the distance to the k-th closest neighbor of the points in the elbow.
Thanks for the help.
From the paper dbscan: Fast Density-Based Clustering with R (page 11):
To find a suitable value for eps, we can plot the points' kNN distances (i.e., the distance of each point to its k-th nearest neighbor) in decreasing order and look for a knee in the plot. The idea behind this heuristic is that points located inside of clusters will have a small k-nearest neighbor distance, because they are close to other points in the same cluster, while noise points are more isolated and will have a rather large kNN distance.
We need a cutoff to decide what k-NN distance counts as "small" and what counts as "large." The knee heuristic picks as epsilon the k-NN distance at which the distances start to increase rapidly.
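As a minimal sketch of the heuristic in scikit-learn terms (here k = 4, i.e. minPts = 5; the dataset is a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(500, 2)             # stand-in for the real dataset
k = 4                                  # k = minPts - 1

nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)             # column 0 is each point itself
kdist = np.sort(dist[:, k])[::-1]      # k-NN distances in decreasing order

plt.plot(kdist)                        # read eps off at the knee
plt.xlabel("points sorted by k-NN distance")
plt.ylabel(f"{k}-NN distance")
plt.show()
```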
I have a Hilbert curve index based on this algorithm. I take two to four values (latitude, longitude, time in Unix format, and an ID code) and create a 1D Hilbert curve.
I'm looking for a way to use this data to create a bounding-box query (i.e. "find all IDs within this rectangle").
I'd like to do this without decoding the 1D Hilbert code back into its constituent parts. It seems easier to do with a Morton/Z-order curve, but I was wondering about the locality preservation.
My question is: if I created a 2D Hilbert curve range (i.e. I converted the corners of the box into Hilbert values, so x1y1 → hilbertvalue1 and x2y2 → hilbertvalue2), would all the values in that range decode to points within the box?
E.g. if I converted (1,2) and (20,30) into Hilbert values and then searched for all values between hilbertvalue1 and hilbertvalue2, would everything I get decode to coordinates between (1,2) and (20,30), or would I have to perform additional transformations?
An additional problem is crafting a range when you have more than 2 dimensions. I have the ability to convert in and out of Hilbert curves, but how can I make sure that even 4D values have a latitude and longitude that fall within the same rectangle/bounding box?
Thanks.
My question is: if I created a 2D Hilbert curve range (i.e. I converted the corners of the box into Hilbert values, so x1y1 → hilbertvalue1 and x2y2 → hilbertvalue2), would all the values in that range decode to points within the box?
The answer is no. This is part of the challenge of using a Hilbert index. [Figure: an example Hilbert curve with numbered cells] You'll notice that a point on the curve can have a higher index than the vertices of a box containing it. The light blue box is an example where the vertices have indices 117, 122, 133, and 138, yet inside (though on the border) is the value 143.
One simple approach is brute force: visit every cell in the search region and calculate the index of each. Then compile a list of index ranges to use in a query. You might join some ranges and filter later as a performance optimization, based on benchmarks (lots of small range queries might take longer than querying fewer, larger ranges followed by a filter). I'd like to see something more elegant than this but have yet to find it.
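A sketch of this brute-force approach, using the classic bit-twiddling xy-to-index conversion (the side length must be a power of two; all names here are mine):

```python
def xy2d(side, x, y):
    """Hilbert index of cell (x, y) on a side x side grid (side a power of 2)."""
    d = 0
    s = side // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                          # rotate/flip the quadrant
            if rx == 1:
                x, y = side - 1 - x, side - 1 - y
            x, y = y, x
        s //= 2
    return d

def query_ranges(side, x0, y0, x1, y1):
    """Merge the indices of every cell in the box into contiguous ranges."""
    idx = sorted(xy2d(side, x, y)
                 for x in range(x0, x1 + 1) for y in range(y0, y1 + 1))
    ranges = [[idx[0], idx[0]]]
    for d in idx[1:]:
        if d == ranges[-1][1] + 1:
            ranges[-1][1] = d                # extend the current run
        else:
            ranges.append([d, d])            # start a new range
    return ranges

print(query_ranges(8, 1, 2, 3, 5))           # small 8x8 example
```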
UPDATE: I've worked out something more elegant than the brute-force technique; the details (and a Java library) are at https://github.com/davidmoten/hilbert-curve. In short, the endpoints of the ranges that exactly cover a search box all lie on the perimeter of the region. If you sort all the Hilbert curve values on the perimeter of the region and start with the smallest value, you can pair up all the ranges by testing whether the next point on the curve stays on the perimeter, leaves the box, or is inside the box.
An additional problem is crafting a range when you have more than 2 dimensions. I have the ability to convert in and out of Hilbert curves, but how can I make sure that even 4D values have a latitude and longitude that fall within the same rectangle/bounding box?
The perimeter technique described above works for any number of dimensions (but of course becomes more expensive!).
For 2 dimensions you can treat the curve as a base-4 number (quadkey) and search from left to right.
What is the difference between geodist(sfield,x,y) and dist(2,x,y,a,b) in Apache Solr for geospatial searches?
dist(2,x,y,0,0) calculates the Euclidean distance between (0,0) and (x,y) for each document, i.e. the distance between two vectors (points) in an n-dimensional space.
I was using the geodist() distance function for geospatial searches on my website, but its response time was large, so I did a POC (proof of concept) with different distance functions and found that dist(2,x,y,0,0) takes roughly half the time. I want to understand the reason behind this and the algorithms both functions use to calculate the distance.
I also have to put together a comparison matrix of the two to pass the findings on.
The main difference is that geodist() is intended to work with spatial field types.
Most spatial implementations are based on Lucene's Points API, which is a BKD index. This field type is strictly limited to coordinates in lat/lon decimal degrees. Behind the scenes, latitude and longitude are indexed as separate numbers. Four main field types are available for spatial search:
LatLonPointSpatialField
LatLonType (now deprecated) and its non-geodetic twin PointType
SpatialRecursivePrefixTreeFieldType (RPT for short), including RptWithGeometrySpatialField, a derivative
BBoxField (for areas, 4 instances of another field type referred to by numberType)
In geodist(sfield, x, y), sfield is a spatial field type representing a (lat,lon) point, so the direct equivalent using dist() would be dist(2, sfieldX, sfieldY, x, y), with sfieldX and sfieldY being respectively the lat and lon coordinates of sfield.
Using dist(power, a, b, ...) you can't query a spatial field type. In order to perform the same spatial search, you would have to specify every dimension of each point separately. That requires 2 indexed fields (or at least 2 values per field) for 2 dimensions, 3 for 3D, and so on. It makes a huge difference because you would have to index every coordinate of each point separately.
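To make the contrast concrete, here is a hedged sketch of the two query styles over Solr's HTTP API; the core name and the field names (location, x, y) are placeholders, not from the original setup:

```python
import requests

SOLR = "http://localhost:8983/solr/mycore/select"

# geodist(): great-circle distance against a spatial field
geo = requests.get(SOLR, params={
    "q": "*:*",
    "sfield": "location",              # spatial field, e.g. LatLonPointSpatialField
    "pt": "45.15,-93.85",              # query point as lat,lon
    "sort": "geodist() asc",
    "fl": "id,d:geodist()",
})

# dist(2, ...): plain Euclidean distance over two separate numeric fields
euc = requests.get(SOLR, params={
    "q": "*:*",
    "sort": "dist(2, x, y, 45.15, -93.85) asc",
    "fl": "id,d:dist(2, x, y, 45.15, -93.85)",
})
```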
Besides, you can also use geodist() as is with the BBoxField field type, which indexes a single rectangle per document field and supports searching via a bounding box. To do the same with dist() you would have to compute the center point of each box and pass its coordinates as function arguments, so it would be too much hassle to get the same result if you want to use an area as a parameter.
Lastly, LatLonPointSpatialField, for example, does distance calculations based on the Haversine formula (great circle), and BBoxField does it a little faster because the rectangular shape is cheaper to compute. It's true that dist() may be even faster, but remember that it requires more fields to be indexed, a lot of preprocessing at query time to yield the same calculated distance, and, as mentioned by Mats, it wouldn't take the earth's curvature into account.
A Euclidean distance doesn't account for the curvature of the earth. If you're only sorting by distance, the behavior can be acceptable, but only if your hits are within a small geographical area (how much a unit is worth in meters changes greatly as you get closer to the poles).
There's an extensive, good answer explaining the difference between a Euclidean distance and a proper geographical distance (usually calculated using haversine) on the GIS Stack Exchange:
Although at small scales any smooth surface looks like a plane, the accuracy of the Pythagorean formula depends on the coordinates used. When those coordinates are latitude and longitude on a sphere (or ellipsoid), we can expect that
Distances along lines of longitude will be reasonably accurate.
Distances along the Equator will be reasonably accurate.
All other distances will be erroneous, in rough proportion to the differences in latitude and longitude.
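A quick way to see the effect (a sketch): compare the haversine distance with the naive "degrees as a plane" distance for the same one-degree step at different latitudes:

```python
from math import radians, sin, cos, asin, sqrt, hypot

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance on a sphere of radius ~6371 km
    phi1, phi2 = radians(lat1), radians(lat2)
    a = (sin(radians(lat2 - lat1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

def euclid_deg(lat1, lon1, lat2, lon2):
    # naive Pythagorean distance, treating degrees as planar units
    return hypot(lat2 - lat1, lon2 - lon1)

print(euclid_deg(0, 0, 0, 1), haversine_km(0, 0, 0, 1))      # 1.0 deg, ~111 km
print(euclid_deg(80, 0, 80, 1), haversine_km(80, 0, 80, 1))  # 1.0 deg, ~19 km
```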
I have the following geometry problem:
You are given a circle centered at the origin, C(0, 0), with radius 1. Inside the circle you are given N points, the centers of N smaller circles. You are asked to find the minimum radius of the small circles (all of them have the same radius) such that they cover the entire boundary of the large circle.
The number of circles is 3 ≤ N ≤ 10000, and the problem has to be solved with a precision of P decimals, where 1 ≤ P ≤ 6.
For example:
N = 3 and P = 4
and the coordinates:
(0.193, 0.722)
(-0.158, -0.438)
(-0.068, 0.00)
The radius of the small circles is: 1.0686.
I have the following idea, but my problem is implementing it. The idea is a binary search on the radius: for each candidate value, find all the intersection points between the small circles and the large one. Each intersection yields an arc. The next step is to project the arcs onto the X and Y axes, which gives a set of intervals on each axis. If the union of the intervals on the X axis and the union on the Y axis each equal the interval [-1, 1], the whole circle is covered.
To avoid precision problems I thought of searching over integers between 0 and 2×10^P, effectively scaling the radius by 10^P and eliminating the digits after the decimal point. But my problem is figuring out how to compute the circle intersections, and afterwards how to check whether the union of the resulting intervals forms the interval [-1, 1].
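The interval-union part at least is straightforward; here is a minimal sketch (the circle-circle intersections and the binary search are left out, and the names are mine):

```python
def covers(intervals, lo=-1.0, hi=1.0, tol=1e-9):
    """True if the union of [a, b] intervals covers [lo, hi] up to tol."""
    reach = lo
    for a, b in sorted(intervals):
        if a > reach + tol:        # a gap before the next interval starts
            return False
        reach = max(reach, b)
    return reach >= hi - tol

print(covers([(-1.0, -0.2), (-0.3, 0.5), (0.4, 1.0)]))  # True
print(covers([(-1.0, -0.2), (0.1, 1.0)]))               # False: gap at (-0.2, 0.1)
```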
Any suggestions are welcomed!
Each point in your set has to cover the intersection of its cell in the point set's Voronoi diagram with the test circle around the origin.
To find the radius, start by computing the Voronoi diagram of your point set. Now "close" this Voronoi diagram by intersecting all infinite edges with your target circle. Then, for each point in your set, check the distance to all the vertices of its "closed" Voronoi cell. The maximum of these should be your solution.
It shouldn't matter that the cells get closed by an arc of the test circle instead of a straight line, at least until your solution radius grows greater than 1 (because up to then the "small" circles curve more strongly than the arc). In that case, you also have to check the distance from the cell's center to the farthest point of that arc.
I might be missing something, but it seems that you only need to find the maximal minimal distance between a point on the circle and the given points.
That is, if you consider the set of all points on the circle's boundary, take the minimal distance from each such point to one of the given points, and then take the maximum of all these values, you have found your radius.
This is, of course, not an algorithm, as there are uncountably many points.
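Numerically, though, a dense sampling of the boundary approximates this formulation well; a sketch (the sample count is an arbitrary choice, refine it until P decimals are stable):

```python
import numpy as np
from scipy.spatial import cKDTree

def cover_radius(sites, samples=1_000_000):
    t = np.linspace(0.0, 2.0 * np.pi, samples, endpoint=False)
    boundary = np.column_stack((np.cos(t), np.sin(t)))  # points on the unit circle
    d, _ = cKDTree(sites).query(boundary)               # nearest given point each
    return d.max()                                      # maximal minimal distance

sites = [(0.193, 0.722), (-0.158, -0.438), (-0.068, 0.00)]
print(round(cover_radius(sites), 4))  # ~1.068 for the example above
```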
I think what I'd do would be along these lines:
Find the minimal distance between the circumference and the set of points; this is your initial radius R.
Check whether the entire circle is covered, like so:
For any two points whose distance from each other is more than 2R, check whether the entire segment between them is covered (for each point, check whether the circle around it intersects the segment; if so, remove that part and keep going). That should take about O(N^3) (you iterate over all of the points for each pair of points). If I'm correct (though I didn't formally prove it), the circle is covered iff all of the segments are covered.
Of all the segments which weren't covered, take the longest one and add half its length to R.
Repeat.
This algorithm will never cover the circle exactly, but it's easy to prove that it converges exponentially to a full cover, so it should be able to find the needed radius with arbitrary accuracy within a reasonable number of iterations.
Hope that helps.