Finding geohashes of certain length within radius from a point - geohashing

I have points with a given latlong and a distance around them - e.g. { 40.6826048,-74.0288632 : 20 miles, 51.5007825,-0.1258957 : 100 miles}. If I pick a fixed geohash length (say one equal to roughly a 1x1 mile cell), how can I find all the geohash entries of that length that are within the given radius of each point?
To add some background - the reason I want to do that is so I can save a cache keyed by the geohash id with a value of the list of points for which the given geohash is within radius (and also matches some custom eligibility rules). Then I can do a quick lookup for a user's location geohash to find all the eligible points around them.

This is how I would try to do it:
Input: Point of interest(lat, long), Query Radius
Step 1: Find the 'MINIMUM' BOUNDING RECTANGLE(MBR) which completely contains the QUERY CIRCLE
Step 2: To create the minimum bounding rectangle, first calculate its minimum and maximum lat, long using the input parameters. Please refer to sections 3.1 and 3.3 of Computing the Minimum and Maximum Latitude Longitude – the Correct Way
Step 3: Using (minLat, minLon) and (maxLat, maxLon), calculate the four corners of the MBR: NorthWest(maxLat, minLon), SouthWest(minLat, minLon), SouthEast(minLat, maxLon), NorthEast(maxLat, maxLon)
Step 4: Calculate the GeoHash of all four corners of MBR
Ex: for a point in NYC, say (40.75798, -73.991516), distance: 800 meters and GeoHash length: 12
NorthWest : dr5ruj4477kd
SouthWest : dr5ru46ne2ux
SouthEast : dr5ru6ryw0cp
NorthEast : dr5rumpfq534
Step 5: From these GeoHashes, take the common prefix: dr5ru
This gives you the coarser GeoHash that completely contains our MBR, and hence the query region. In other words, all points indexed by dr5ru fall under its 32 children, the GeoHashes dr5ru0 through dr5ruz
Final Step:
To find the exact cells (GeoHashes) that correspond to our query circle (the square MBR, to be precise), we should pick from these 32 GeoHashes by laying them out as a 4x8 matrix in a 2D array.
In our example we get dr5ru + J, M, H, K, 5, 7, 4, 6. All these GeoHashes represent points that are within 800 meters of the central query point, except for a few that could not be avoided because we considered the MBR instead of a perfect circle.
Important: note the use of a 4 x 8 grid for the GeoHash. The orientation alternates with each character along the length of the GeoHash: for odd lengths it is 8 x 4, and for even lengths it is the transpose, 4 x 8. In our case we are one character inside dr5ru (5 + 1, the 6th resolution), hence we use 4 x 8.
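Putting the steps together, here is a rough Python sketch of the fixed-length variant the question asks for. It flood-fills outward from the centre cell instead of slicing the 4x8 matrix, assumes the python-geohash package (any encode/decode/neighbors implementation would do), and treats a cell as inside if its centre falls within the radius:

import geohash  # assumed: the python-geohash package
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    # great-circle distance in metres
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def cover(lat, lon, radius_m, length=6):
    # flood-fill outwards from the centre cell via its neighbours
    start = geohash.encode(lat, lon, length)
    seen, frontier, inside = {start}, [start], set()
    while frontier:
        gh = frontier.pop()
        cell_lat, cell_lon = geohash.decode(gh)
        if haversine_m(lat, lon, cell_lat, cell_lon) > radius_m:
            continue  # centre of this cell is outside the circle
        inside.add(gh)
        for n in geohash.neighbors(gh):
            if n not in seen:
                seen.add(n)
                frontier.append(n)
    return inside

print(sorted(cover(40.75798, -73.991516, 800)))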

Have a look at this -> ProximityHash.
ProximityHash generates a set of geohashes that cover a circular area, given the center coordinates and the radius. It also has an additional option to use GeoRaptor that creates the best combination of geohashes across various levels to represent the circle, starting from the highest level and iterating till the optimal blend is brewed. Result accuracy remains the same as that of the starting geohash level, but data size reduces considerably, thereby improving speed and performance.
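A minimal usage sketch; the create_geohash signature below is taken from the project README as I recall it, so double-check it against the version you install:

import proximityhash

# centre lat, lon, radius in metres, geohash precision
hashes = proximityhash.create_geohash(40.75798, -73.991516, 800, 6)
print(hashes)  # the set of level-6 geohashes covering the circle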

Related

Finding groups of clusters in binary array

I have a binary image from which I have extracted all the pixels and written them to a txt file. I am trying to find how many clusters there are, and where there are clusters of 25 or more 1's in the array.
DBSCAN, Euclidean distances.
db_scan = DBSCAN(eps=1, min_samples=25, metric='euclidean', metric_params=None, algorithm='auto').fit(im_bw)
I expect to find the i, j location of the center of each cluster. I also expect to find the number of clusters, but it says I only have 1.
It probably says 0 clusters, only noise, because within a radius of 1 pixel you won't find 25 pixels.
Note that you need to make sure you choose the right representation and don't make false assumptions about what the algorithm does... For example, it does not produce centers.
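A sketch of what the right representation looks like here: cluster the coordinates of the foreground pixels, not the raw image rows (the file name and eps value below are made up for illustration):

import numpy as np
from sklearn.cluster import DBSCAN

im_bw = np.loadtxt("image.txt")            # hypothetical file name
coords = np.argwhere(im_bw == 1)           # (i, j) coordinates of the 1-pixels

# eps must be large enough that 25 pixels can fall inside one neighbourhood;
# eps=1 only reaches the four orthogonal neighbours
labels = DBSCAN(eps=2, min_samples=25).fit_predict(coords)

for k in set(labels) - {-1}:               # -1 is DBSCAN's noise label
    centre = coords[labels == k].mean(axis=0)   # DBSCAN has no centers; the mean is a stand-in
    print(k, centre)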

Search the closest points for a given point in 1 million points

This is an algorithm question.
Given 1 million points, each with x and y coordinates that are floating point numbers.
Find the 10 closest points for the given point as fast as possible.
The closeness can be measured as Euclidean distance on a plane or other kind of distance on a globe. I prefer binary search due to the large number of points.
My idea:
save the points in a database
1. Amplify x by a large integer, e.g. 10^4, cut off the decimal part, and then amplify the integer part by 10^4 again.
2. Amplify y by a smaller factor, e.g. 10^2, and cut off the decimal part.
3. Sum the results of steps 1 and 2; call the sum the associate_value.
4. Repeat 1 to 3 for each number in the database.
E.g.
x = 12.3456789 , y = 98.7654321
x times 10^4 = 123456.789, truncated to 123456, then times 10^4 to get 1234560000
y times 10^2 = 9876.54321, truncated to 9876
Sum them, get 1234560000 + 9876 = 1234569876
In this way, I transform 2-d data to 1-d data. In the database, each point is associated with an integer (associate_value). The integer column can be set as an index in the database for fast search.
For a given point (x, y), I perform steps 1 - 3 for it and then find the points in the database whose associate_value is close to the given point's associate_value.
e.g.
x = 59.469797 , y = 96.4976416
their associated value is 5946979649
Then in the database, I search for associate_values that are close to 5946979649, for example between 5946979649 - 50 and 5946979649 + 50, and also between 5946979649 - 50000000 and 5946979649 + 50000000. This can be done by an index search in the database.
In this way, I can find a group of points that are close to the given point. I can reduce the search space greatly. Then, I can use Euclidean or other distance formula to find the closest points.
I am not sure about the efficiency of the algorithm, especially the process of generating associate_values.
Does my idea work? Any better ideas?
Thanks
Your idea seems like it may work, but I would be concerned with degenerate cases (like if no points are in your specified ranges, but maybe that's not possible given the constraints). Either way, since you asked for other ideas, here's my stab at it: store all of your points in a quad tree. Then just walk down the quad tree until you have a sufficiently small group to search through. Since the points are fixed, creating the quad tree is a one-time cost, and each query should be logarithmic in the number of points you have.
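A sketch of the same idea using SciPy's k-d tree (SciPy ships no quadtree, but the build-once-then-query behaviour is the same):

import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(1_000_000, 2)      # 1 million random (x, y) pairs
tree = cKDTree(points)                      # built once up front

dist, idx = tree.query([0.5, 0.5], k=10)   # the 10 nearest neighbours
print(points[idx], dist)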
You can do better and just interleave the binary values of the x and y coordinates. Instead of along a straight line, this orders the points along a Z-curve. You can then compute upper bounds from the most significant bits. The Z-curve is often used in mapping applications: http://msdn.microsoft.com/en-us/library/bb259689.aspx.
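A sketch of the bit interleaving (my own illustration of the Z-curve idea, reusing the question's 10^4 quantisation; works for non-negative coordinates):

def morton(x, y, scale=10**4, bits=32):
    # quantise as in the question, then interleave the bits
    xi, yi = int(x * scale), int(y * scale)
    code = 0
    for b in range(bits):
        code |= ((xi >> b) & 1) << (2 * b + 1)  # x bits on odd positions
        code |= ((yi >> b) & 1) << (2 * b)      # y bits on even positions
    return code

print(morton(12.3456789, 98.7654321))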
The way I read your algorithm, you are selecting the values along a line with a slope of -1 that are similar to your point, i.e. if your point is (2,2) you would look at points (1,3), (0,4) and (-1,5) and likely miss closer points. Most algorithms to solve this are O(n), which isn't terribly bad.
A simple algorithm to solve this problem is to keep a priority queue of the closest ten, and a measurement of the furthest distance among those ten points, as you iterate over the set. If the x or y difference alone is greater than that furthest distance, discard the point immediately. Otherwise calculate the distance with whatever measurement you're using and see if the point gets inserted into the queue. If so, update your furthest-of-the-top-ten threshold and continue iterating.
If your points are pre-sorted on one of the axes, you can further optimize the algorithm by starting at the matching point on that axis and radiating outward until the difference exceeds the distance to your tenth-closest point. I did not include sorting in the description above because sorting is O(n log n), which is slower than O(n). If you are doing this multiple times on the same set, though, it could be beneficial to sort it.
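A sketch of that single-pass scan with the cheap per-axis rejection (heapq is a min-heap, so the ten best are kept as negated squared distances):

import heapq

def ten_closest(points, q):
    heap = []                                   # entries are (-dist2, point)
    for (x, y) in points:
        dx, dy = x - q[0], y - q[1]
        worst = -heap[0][0] if len(heap) == 10 else float("inf")
        if dx * dx > worst or dy * dy > worst:  # cheap per-axis rejection
            continue
        d2 = dx * dx + dy * dy
        if len(heap) < 10:
            heapq.heappush(heap, (-d2, (x, y)))
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, (-d2, (x, y)))
    return [p for _, p in sorted(heap, reverse=True)]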

geohash string length and accuracy

The longer the geohash string, the more accurate it is. But is there any direct relationship, e.g. does a length of 7 provide 100 meter accuracy?
That is, if two geohashes (and either of their bounding boxes) have their first 7 characters matching, should they be within about 100 meters of each other?
I am using geohashes to find all nearby locations for a given geohash, along with their distances.
Also, is there any direct way to calculate the distance between two geohashes? (One way is to decode them to lat/lng and then calculate the distance.)
Thanks
Saw a lot of confusion around geohashing so I am posting my understanding so far.
The principle behind geohash is very simple, you can create your own version.
For instance, consider the following geo-point:
156.34234534, -23.343423345
In the above example, 156 represents degrees, the 2 digits after the decimal (34) represent the decimal minute, and the rest (234534) represents seconds.
If you remember school geography, the circumference of the earth at the equator is about 40,000 km, and the number of degrees around the earth (latitudes or longitudes) is 360. So at the widest point each degree of latitude and longitude spans about 110 km (40,000/360).
So if you encode the above coordinates as "156-23" (including the negative sign), this will give you a (110km x 110km) box.
You can go on and increase the precision,
The first digit of the minute (156.3-23.3) will give you a (10km x 10km) box (each minute span equals about 1 km).
Extend this to include the first digit of the seconds and you get a (100m x 100m) box;
each extra digit adds another degree of precision.
Geohashing is just the way to represent the above figure in an encoded form. You can happily use the above format as well!
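A tiny sketch of that truncation scheme (purely illustrative; the coordinate values are taken from the example above):

def crude_hash(lat, lon, digits=0):
    # digits=0 -> "156-23" (~110 km box); each extra digit shrinks the box
    fmt = f"{{:.{digits}f}}"
    return fmt.format(lat) + fmt.format(lon)

print(crude_hash(156.34234534, -23.343423345, 0))  # 156-23
print(crude_hash(156.34234534, -23.343423345, 1))  # 156.3-23.3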
Was curious about this myself.
If it's any good to anyone, I put together a spreadsheet here.
Not 100% sure it's right - feel free to comment if you find a problem.
Judging by the graph below, using 6 to 10 digits gives accuracy from ~1 km down to ~1 m at 60 degrees latitude.
Here are the formulas for height and width in degrees of a geohash of length n characters:
First define this function:
parity(n) = 0 if n is even otherwise 1
Then
height = 180 / 2^((5n - parity(n))/2) degrees
width = 180 / 2^((5n + parity(n) - 2)/2) degrees
Note that this is the height and width in degrees only. To convert this to metres requires that you know where on the earth the hash is.
Code for this in java is at http://github.com/davidmoten/geo.
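Transcribed into Python for quick checks (my own transcription of the formulas above, not code from that library):

def geohash_cell_degrees(n):
    # height and width in degrees of a geohash cell of n characters
    parity = n % 2
    height = 180 / 2 ** ((5 * n - parity) / 2)
    width = 180 / 2 ** ((5 * n + parity - 2) / 2)
    return height, width

print(geohash_cell_degrees(6))  # about (0.0055, 0.0110) degrees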
Also any directway to calculate distance between two geo-hash? (one way is to decode them to lat/lng, and then calculate distance)
That is what you should do. Think of a geohash as just another representation of a latitude and longitude, the same way a pair of printed decimal numbers is. If I gave you a pair of lat & lon strings, you would parse them to numbers (in your programming language of choice), and then do the math. It's no different with geohashes -- decode to lat & lon, then do the math.
Be very careful with any reasoning you are attempting to do with inferring closeness based on the length of the common prefix between a pair of points. If there is a long common prefix, then they are close, but the converse is not true! -- i.e. two points with no common prefix could be a millimeter apart.
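A decode-then-haversine sketch, assuming the python-geohash package (the two hashes are corners from the NYC example in the first question):

import geohash  # assumed: the python-geohash package
from math import radians, sin, cos, asin, sqrt

def geohash_distance_m(gh1, gh2):
    (lat1, lon1), (lat2, lon2) = geohash.decode(gh1), geohash.decode(gh2)
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

print(geohash_distance_m("dr5ruj4477kd", "dr5ru46ne2ux"))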
Here is an equation (in pseudocode) that can approximate the optimal Geohash length for a latitude/longitude pair having a certain precision:
geohash_length = FLOOR( LOG_2(5000000 / precision_in_meters) / 2.5 + 1 )
if geohash_length > 12 then geohash_length = 12
if geohash_length < 1 then geohash_length = 1
I've used it to create the optimal Geohash from data received from the gpsd daemon, which also provides precision information via the epx and epy values.
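The same formula in Python, clamping included (a direct transcription of the pseudocode above):

import math

def optimal_geohash_length(precision_in_meters):
    n = math.floor(math.log2(5000000 / precision_in_meters) / 2.5 + 1)
    return max(1, min(12, n))

print(optimal_geohash_length(100))  # optimal length for ~100 m precision -> 7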

Is the Leptonica implementation of 'Modified Median Cut' not using the median at all?

I'm playing around a bit with image processing and decided to read up on how color quantization worked and after a bit of reading I found the Modified Median Cut Quantization algorithm.
I've been reading the code of the C implementation in the Leptonica library and came across something I thought was a bit odd.
Now I want to stress that I am far from an expert in this area, nor am I a math-head, so I am predicting that this all comes down to me not understanding all of it, and not that the implementation of the algorithm is wrong.
The algorithm states that the vbox should be split along the largest axis, using the following logic:
The largest axis is divided by locating the bin with the median pixel
(by population), selecting the longer side, and dividing in the center
of that side. We could have simply put the bin with the median pixel
in the shorter side, but in the early stages of subdivision, this
tends to put low density clusters (that are not considered in the
subdivision) in the same vbox as part of a high density cluster that
will outvote it in median vbox color, even with future median-based
subdivisions. The algorithm used here is particularly important in
early subdivisions, and is useful for giving visible but low
population color clusters their own vbox. This has little effect on
the subdivision of high density clusters, which ultimately will have
roughly equal population in their vboxes.
For the sake of the argument, let's assume that we have a vbox that we are in the process of splitting and that the red axis is the largest. In the Leptonica algorithm, on line 01297, the code appears to do the following
Iterate over all the possible green and blue variations of the red color
For each iteration it adds the population it finds to the total number of pixels along the red axis
For each red value it sums up the population of the current red and the previous ones, thus storing an accumulated value for each red
note: when I say 'red' I mean each point along the axis that is covered by the iteration, the actual color may not be red but contains a certain amount of red
So for the sake of illustration, assume we have 9 "bins" along the red axis and that they have the following populations
4 8 20 16 1 9 12 8 8
After the iteration of all red bins, the partialsum array will contain the following count for the bins mentioned above
4 12 32 48 49 58 70 78 86
And total would have a value of 86
Once that's done it's time to perform the actual median cut and for the red axis this is performed on line 01346
It iterates over the bins and checks their accumulated sums. And here's the part that throws me off relative to the description of the algorithm: it looks for the first bin that has a value greater than total/2.
Wouldn't total/2 mean that it is looking for a bin whose value is greater than the average value, and not the median? The median of the above partial sums would be 49.
The use of 43 versus 49 could potentially have a huge impact on how the boxes are split, even though the algorithm then proceeds by moving to the center of the larger side of where the matched value was.
Another thing that puzzles me a bit is that the paper specifies that the bin with the median value should be located, but does not mention how to proceed if there is an even number of bins. The median would then be the result of (a+b)/2, and it's not guaranteed that any of the bins contains that population count. So this is what makes me think that there are some approximations going on that are negligible because of how the split actually takes place at the center of the larger side of the selected bin.
Sorry if it got a bit long-winded, but I wanted to be as thorough as I could because it's been driving me nuts for a couple of days now ;)
In the 9-bin example, 49 is the number of pixels in the first 5 bins. 49 is the median number in the set of 9 partial sums, but we want the median pixel in the set of 86 pixels, which is 43 (or 44), and it resides in the 4th bin.
Inspection of the modified median cut algorithm in colorquant2.c of leptonica shows that the actual cut location for the 3d box does not necessarily occur adjacent to the bin containing the median pixel. The reasons for this are explained in the function medianCutApply(). This is one of the "modifications" to Paul Heckbert's original method. The other significant modification is to make the decision of which 3d box to cut next based on a combination of both population and the product (population * volume), thus permitting splitting of large but sparsely populated regions of color space.
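A few lines make the distinction concrete, using the 9-bin example from the question (total/2 locates the bin holding the median pixel, the 43rd of 86, which sits in the 4th bin):

bins = [4, 8, 20, 16, 1, 9, 12, 8, 8]
total = sum(bins)                     # 86
partial = 0
for i, population in enumerate(bins):
    partial += population
    if partial > total // 2:          # first bin past the 43rd pixel
        print(i + 1, partial)         # -> 4 48 (the 4th bin, partial sum 48)
        break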
I do not know the algo, but I would assume your array contains the population of each red; let's explain this with an example:
Assume you have four gradations of red: A,B,C and D
And you have the following sequence of red values:
AABDCADBBBAAA
To find the median, you would have to sort them according to red value and take the middle:
      median
      v
AAAAAABBBBCDD
Now let's use their approach:
A:6 => 6
B:4 => 10
C:1 => 11
D:2 => 13
13/2 = 6.5 => B
I think the mismatch happened because you are counting the population; the average color would be:
(6*A+4*B+1*C+2*D)/13
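The same count-based median in a few lines of Python (my own rendering of the example above):

counts = {'A': 6, 'B': 4, 'C': 1, 'D': 2}   # populations per red bin
total = sum(counts.values())                # 13
running = 0
for colour, n in counts.items():
    running += n
    if running >= (total + 1) // 2:         # first bin reaching the 7th element
        print(colour)                       # -> B
        break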

Given centers, find minimum radius for set of circles such that they fully cover another

I have the following geometry problem:
You are given a circle centered at the origin, C(0, 0), with radius 1. Inside the circle you are given N points, which represent the centers of N smaller circles. You are asked to find the minimum radius of the small circles (the radii of all the small circles are equal) such that they cover the entire boundary of the large circle.
The number of circles is 3 ≤ N ≤ 10000, and the problem has to be solved with a precision of P decimals, where 1 ≤ P ≤ 6.
For example:
N = 3 and P = 4
and the coordinates:
(0.193, 0.722)
(-0.158, -0.438)
(-0.068, 0.00)
The radius of the small circles is: 1.0686.
I have the following idea, but my problem is implementing it. The idea is a binary search on the radius; for each candidate value, find all the intersection points between the small circles and the large one. Each intersection yields an arc. The next step is to 'project' the endpoints of the arcs onto the X and Y axes, producing a set of intervals. If the unions of the intervals on the X and Y axes each cover the interval [-1, 1], the whole boundary is covered.
To avoid precision problems I thought of searching between 0 and 2×10^P, taking the large radius as 10^P, thus eliminating the digits after the decimal point; but my problem is figuring out how to compute the circle intersections and then check whether the union of the resulting intervals forms the interval [-1, 1].
Any suggestions are welcomed!
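For what it's worth, here is a sketch of the coverage test using angular intervals on the boundary instead of X/Y projections (a simplification of the idea above, not the poster's exact method): only the boundary needs covering, each small circle cuts one arc out of it, so a binary search plus an interval sweep suffices.

from math import acos, atan2, hypot, pi

def boundary_covered(centers, r):
    arcs = []                              # angle intervals of the covered boundary
    for x, y in centers:
        d = hypot(x, y)
        if d == 0:
            if r >= 1:
                return True                # a circle centred at the origin covers it all
            continue
        c = (d * d + 1 - r * r) / (2 * d)  # cos of the arc's half-angle (law of cosines)
        if c > 1:
            continue                       # this circle misses the boundary entirely
        if c <= -1:
            return True                    # this circle covers the whole boundary
        a, w = atan2(y, x) % (2 * pi), acos(c)
        lo, hi = a - w, a + w
        if lo < 0:                         # split arcs wrapping past angle 0
            arcs += [(lo + 2 * pi, 2 * pi), (0.0, hi)]
        elif hi > 2 * pi:
            arcs += [(lo, 2 * pi), (0.0, hi - 2 * pi)]
        else:
            arcs.append((lo, hi))
    arcs.sort()
    reach = 0.0                            # sweep: the union of arcs must reach 2*pi
    for lo, hi in arcs:
        if lo > reach + 1e-12:
            return False                   # gap in coverage
        reach = max(reach, hi)
    return reach >= 2 * pi - 1e-12

def min_radius(centers, p=4):
    lo, hi = 0.0, 2.0                      # radius 2 always suffices inside the unit disk
    while hi - lo > 10 ** (-(p + 2)):      # binary search on the radius
        mid = (lo + hi) / 2
        if boundary_covered(centers, mid):
            hi = mid
        else:
            lo = mid
    return round(hi, p)

print(min_radius([(0.193, 0.722), (-0.158, -0.438), (-0.068, 0.0)]))  # compare with the example's 1.0686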
Each point in your set has to cover the intersection of its cell in the point set's Voronoi diagram with the test circle around the origin.
To find the radius, start by computing the Voronoi diagram of your point set. Now "close" this Voronoi diagram by intersecting all infinite edges with your target circle. Then, for each point in your set, check the distance to all the vertices of its "closed" Voronoi cell. The maximum should be your solution.
It shouldn't matter that the cells get closed by an arc of the test circle instead of a straight line, until your solution radius exceeds 1 (because then the "small" circles curve more strongly than the boundary). In that case, you also have to check the distance from the cell's center to the farthest point of that arc.
I might be missing something, but it seems that you only need to find the maximal minimal distance between a point on the circle's boundary and the given points.
That is, if you consider the set of all points on the circle, and take the minimal distance between each point to one of the given points, and then take the maximal values of all these - you have found your radius.
This is, of course, not an algorithm, as there are uncountably many points.
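A discretised version of that observation, close enough for P decimals if you sample finely (purely a sketch):

from math import cos, sin, pi, hypot

def radius_by_sampling(centers, samples=200_000):
    worst = 0.0
    for k in range(samples):
        t = 2 * pi * k / samples
        bx, by = cos(t), sin(t)                          # a point on the boundary
        nearest = min(hypot(bx - x, by - y) for x, y in centers)
        worst = max(worst, nearest)                      # the hardest point to cover
    return worst

pts = [(0.193, 0.722), (-0.158, -0.438), (-0.068, 0.0)]
print(round(radius_by_sampling(pts), 4))                 # compare with the example's 1.0686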
What I would do is along these lines:
Find the minimal distance between the circumference and the set of points; this is your initial radius R.
Check whether the entire circle is covered, like so:
For any two points whose distance from each other is more than 2R, check whether the entire segment between them is covered (for each point, check whether the circle around it intersects the segment, and if so, remove that part and keep going). That should take about O(N^3) (you iterate over all of the points for each pair of points). If I'm correct (though I didn't formally prove it), the circle is covered iff all of the segments are covered.
Of all the segments which weren't covered, take the longest one and add half its length to R.
Repeat.
This algorithm will never cover the circle per se, but it's easy to prove that it converges exponentially to a full cover, so it should be able to find the needed radius with arbitrary accuracy within a reasonable number of iterations.
Hope that helps.
