How to determine the parameters of DBSCAN? - dbscan

I tried the kdist plot as introduced in one paper I read and used the knee distance to set the epsilon. However,the results were not satisfying.
I use the WEKA to implement DBSCAN, but it always returns only one cluster.
Can anyone please give me some advice?

This can happen if the k-dist plot has more than 1 knee (this can happen when the dataset contains clusters having different density, and the outcome you have obtained arise when the high density clusters are nested inside the low density ones).
The solution is to search the next knee and re-apply the algorithm on the core points you already have found.

Related

AI of spaceship's propulsion: land a 3D ship at position=0 and angle=0

This is a very difficult problem about how to maneuver a spaceship that can both translate and rotate in 3D, for a space game.
The spaceship has n jets placing in various positions and directions.
Transformation of i-th jet relative to the CM of spaceship is constant = Ti.
Transformation is a tuple of position and orientation (quaternion or matrix 3x3 or, less preferable, Euler angles).
A transformation can also be denoted by a single matrix 4x4.
In other words, all jet are glued to the ship and cannot rotate.
A jet can exert force to the spaceship only in direction of its axis (green).
As a result of glue, the axis rotated along with the spaceship.
All jets can exert force (vector,Fi) at a certain magnitude (scalar,fi) :
i-th jet can exert force (Fi= axis x fi) only within range min_i<= fi <=max_i.
Both min_i and max_i are constant with known value.
To be clear, unit of min_i,fi,max_i is Newton.
Ex. If the range doesn't cover 0, it means that the jet can't be turned off.
The spaceship's mass = m and inertia tensor = I.
The spaceship's current transformation = Tran0, velocity = V0, angularVelocity = W0.
The spaceship physic body follows well-known physic rules :-
Torque=r x F
F=ma
angularAcceleration = I^-1 x Torque
linearAcceleration = m^-1 x F
I is different for each direction, but for the sake of simplicity, it has the same value for every direction (sphere-like). Thus, I can be thought as a scalar instead of matrix 3x3.
Question
How to control all jets (all fi) to land the ship with position=0 and angle=0?
Math-like specification: Find function of fi(time) that take minimum time to reach position=(0,0,0), orient=identity with final angularVelocity and velocity = zero.
More specifically, what are names of technique or related algorithms to solve this problem?
My research (1 dimension)
If the universe is 1D (thus, no rotation), the problem will be easy to solve.
( Thank Gavin Lock, https://stackoverflow.com/a/40359322/3577745 )
First, find the value MIN_BURN=sum{min_i}/m and MAX_BURN=sum{max_i}/m.
Second, think in opposite way, assume that x=0 (position) and v=0 at t=0,
then create two parabolas with x''=MIN_BURN and x''=MAX_BURN.
(The 2nd derivative is assumed to be constant for a period of time, so it is parabola.)
The only remaining work is to join two parabolas together.
The red dash line is where them join.
In the period of time that x''=MAX_BURN, all fi=max_i.
In the period of time that x''=MIN_BURN, all fi=min_i.
It works really well for 1D, but in 3D, the problem is far more harder.
Note:
Just a rough guide pointing me to a correct direction is really appreciated.
I don't need a perfect AI, e.g. it can take a little more time than optimum.
I think about it for more than 1 week, still find no clue.
Other attempts / opinions
I don't think machine learning like neural network is appropriate for this case.
Boundary-constrained-least-square-optimisation may be useful but I don't know how to fit my two hyper-parabola to that form of problem.
This may be solved by using many iterations, but how?
I have searched NASA's website, but not find anything useful.
The feature may exist in "Space Engineer" game.
Commented by Logman: Knowledge in mechanical engineering may help.
Commented by AndyG: It is a motion planning problem with nonholonomic constraints. It could be solved by Rapidly exploring random tree (RRTs), theory around Lyapunov equation, and Linear quadratic regulator.
Commented by John Coleman: This seems more like optimal control than AI.
Edit: "Near-0 assumption" (optional)
In most case, AI (to be designed) run continuously (i.e. called every time-step).
Thus, with the AI's tuning, Tran0 is usually near-identity, V0 and W0 are usually not so different from 0, e.g. |Seta0|<30 degree,|W0|<5 degree per time-step .
I think that AI based on this assumption would work OK in most case. Although not perfect, it can be considered as a correct solution (I started to think that without this assumption, this question might be too hard).
I faintly feel that this assumption may enable some tricks that use some "linear"-approximation.
The 2nd Alternative Question - "Tune 12 Variables" (easier)
The above question might also be viewed as followed :-
I want to tune all six values and six values' (1st-derivative) to be 0, using lowest amount of time-steps.
Here is a table show a possible situation that AI can face:-
The Multiplier table stores inertia^-1 * r and mass^-1 from the original question.
The Multiplier and Range are constant.
Each timestep, the AI will be asked to pick a tuple of values fi that must be in the range [min_i,max_i] for every i+1-th jet.
Ex. From the table, AI can pick (f0=1,f1=0.1,f2=-1).
Then, the caller will use fi to multiply with the Multiplier table to get values''.
Px'' = f0*0.2+f1*0.0+f2*0.7
Py'' = f0*0.3-f1*0.9-f2*0.6
Pz'' = ....................
SetaX''= ....................
SetaY''= ....................
SetaZ''= f0*0.0+f1*0.0+f2*5.0
After that, the caller will update all values' with formula values' += values''.
Px' += Px''
.................
SetaZ' += SetaZ''
Finally, the caller will update all values with formula values += values'.
Px += Px'
.................
SetaZ += SetaZ'
AI will be asked only once for each time-step.
The objective of AI is to return tuples of fi (can be different for different time-step), to make Px,Py,Pz,SetaX,SetaY,SetaZ,Px',Py',Pz',SetaX',SetaY',SetaZ' = 0 (or very near),
by using least amount of time-steps as possible.
I hope providing another view of the problem will make it easier.
It is not the exact same problem, but I feel that a solution that can solve this version can bring me very close to the answer of the original question.
An answer for this alternate question can be very useful.
The 3rd Alternative Question - "Tune 6 Variables" (easiest)
This is a lossy simplified version of the previous alternative.
The only difference is that the world is now 2D, Fi is also 2D (x,y).
Thus I have to tune only Px,Py,SetaZ,Px',Py',SetaZ'=0, by using least amount of time-steps as possible.
An answer to this easiest alternative question can be considered useful.
I'll try to keep this short and sweet.
One approach that is often used to solve these problems in simulation is a Rapidly-Exploring Random Tree. To give at least a little credibility to my post, I'll admit I studied these, and motion planning was my research lab's area of expertise (probabilistic motion planning).
The canonical paper to read on these is Steven LaValle's Rapidly-exploring random trees: A new tool for path planning, and there have been a million papers published since that all improve on it in some way.
First I'll cover the most basic description of an RRT, and then I'll describe how it changes when you have dynamical constraints. I'll leave fiddling with it afterwards up to you:
Terminology
"Spaces"
The state of your spaceship can be described by its 3-dimension position (x, y, z) and its 3-dimensional rotation (alpha, beta, gamma) (I use those greek names because those are the Euler angles).
state space is all possible positions and rotations your spaceship can inhabit. Of course this is infinite.
collision space are all of the "invalid" states. i.e. realistically impossible positions. These are states where your spaceship is in collision with some obstacle (With other bodies this would also include collision with itself, for example planning for a length of chain). Abbreviated as C-Space.
free space is anything that is not collision space.
General Approach (no dynamics constraints)
For a body without dynamical constraints the approach is fairly straightforward:
Sample a state
Find nearest neighbors to that state
Attempt to plan a route between the neighbors and the state
I'll briefly discuss each step
Sampling a state
Sampling a state in the most basic case means choosing at random values for each entry in your state space. If we did this with your space ship, we'd randomly sample for x, y, z, alpha, beta, gamma across all of their possible values (uniform random sampling).
Of course way more of your space is obstacle space than free space typically (because you usually confine your object in question to some "environment" you want to move about inside of). So what is very common to do is to take the bounding cube of your environment and sample positions within it (x, y, z), and now we have a lot higher chance to sample in the free space.
In an RRT, you'll sample randomly most of the time. But with some probability you will actually choose your next sample to be your goal state (play with it, start with 0.05). This is because you need to periodically test to see if a path from start to goal is available.
Finding nearest neighbors to a sampled state
You chose some fixed integer > 0. Let's call that integer k. Your k nearest neighbors are nearby in state space. That means you have some distance metric that can tell you how far away states are from each other. The most basic distance metric is Euclidean distance, which only accounts for physical distance and doesn't care about rotational angles (because in the simplest case you can rotate 360 degrees in a single timestep).
Initially you'll only have your starting position, so it will be the only candidate in the nearest neighbor list.
Planning a route between states
This is called local planning. In a real-world scenario you know where you're going, and along the way you need to dodge other people and moving objects. We won't worry about those things here. In our planning world we assume the universe is static but for us.
What's most common is to assume some linear interpolation between the sampled state and its nearest neighbor. The neighbor (i.e. a node already in the tree) is moved along this linear interpolation bit by bit until it either reaches the sampled configuration, or it travels some maximum distance (recall your distance metric).
What's going on here is that your tree is growing towards the sample. When I say that you step "bit by bit" I mean you define some "delta" (a really small value) and move along the linear interpolation that much each timestep. At each point you check to see if you the new state is in collision with some obstacle. If you hit an obstacle, you keep the last valid configuration as part of the tree (don't forget to store the edge somehow!) So what you'll need for a local planner is:
Collision checking
how to "interpolate" between two states (for your problem you don't need to worry about this because we'll do something different).
A physics simulation for timestepping (Euler integration is quite common, but less stable than something like Runge-Kutta. Fortunately you already have a physics model!
Modification for dynamical constraints
Of course if we assume you can linearly interpolate between states, we'll violate the physics you've defined for your spaceship. So we modify the RRT as follows:
Instead of sampling random states, we sample random controls and apply said controls for a fixed time period (or until collision).
Before, when we sampled random states, what we were really doing was choosing a direction (in state space) to move. Now that we have constraints, we randomly sample our controls, which is effectively the same thing, except we're guaranteed not to violate our constraints.
After you apply your control for a fixed time interval (or until collision), you add a node to the tree, with the control stored on the edge. Your tree will grow very fast to explore the space. This control application replaces linear interpolation between tree states and sampled states.
Sampling the controls
You have n jets that individually have some min and max force they can apply. Sample within that min and max force for each jet.
Which node(s) do I apply my controls to?
Well you can choose at random, or your can bias the selection to choose nodes that are nearest to your goal state (need the distance metric). This biasing will try to grow nodes closer to the goal over time.
Now, with this approach, you're unlikely to exactly reach your goal, so you need to define some definition of "close enough". That is, you will use your distance metric to find nearest neighbors to your goal state, and then test them for "close enough". This "close enough" metric can be different than your distance metric, or not. If you're using Euclidean distance, but it's very important that you goal configuration is also rotated properly, then you may want to modify the "close enough" metric to look at angle differences.
What is "close enough" is entirely up to you. Also something for you to tune, and there are a million papers that try to get you a lot closer in the first place.
Conclusion
This random sampling may sound ridiculous, but your tree will grow to explore your free space very quickly. See some youtube videos on RRT for path planning. We can't guarantee something called "probabilistic completeness" with dynamical constraints, but it's usually "good enough". Sometimes it'll be possible that a solution does not exist, so you'll need to put some logic in there to stop growing the tree after a while (20,000 samples for example)
More Resources:
Start with these, and then start looking into their citations, and then start looking into who is citing them.
Kinodynamic RRT*
RRT-Connect
This is not an answer, but it's too long to place as a comment.
First of all, a real solution will involve both linear programming (for multivariate optimization with constraints that will be used in many of the substeps) as well as techniques used in trajectory optimization and/or control theory. This is a very complex problem and if you can solve it, you could have a job at any company of your choosing. The only thing that could make this problem worse would be friction (drag) effects or external body gravitation effects. A real solution would also ideally use Verlet integration or 4th order Runge Kutta, which offer improvements over the Euler integration you've implemented here.
Secondly, I believe your "2nd Alternative Version" of your question above has omitted the rotational influence on the positional displacement vector you add into the position at each timestep. While the jet axes all remain fixed relative to the frame of reference of the ship, they do not remain fixed relative to the global coordinate system you are using to land the ship (at global coordinate [0, 0, 0]). Therefore the [Px', Py', Pz'] vector (calculated from the ship's frame of reference) must undergo appropriate rotation in all 3 dimensions prior to being applied to the global position coordinates.
Thirdly, there are some implicit assumptions you failed to specify. For example, one dimension should be defined as the "landing depth" dimension and negative coordinate values should be prohibited (unless you accept a fiery crash). I developed a mockup model for this in which I assumed z dimension to be the landing dimension. This problem is very sensitive to initial state and the constraints placed on the jets. All of my attempts using your example initial conditions above failed to land. For example, in my mockup (without the 3d displacement vector rotation noted above), the jet constraints only allow for rotation in one direction on the z-axis. So if aZ becomes negative at any time (which is often the case) the ship is actually forced to complete another full rotation on that axis before it can even try to approach zero degrees again. Also, without the 3d displacement vector rotation, you will find that Px will only go negative using your example initial conditions and constraints, and the ship is forced to either crash or diverge farther and farther onto the negative x-axis as it attempts to maneuver. The only way to solve this is to truly incorporate rotation or allow for sufficient positive and negative jet forces.
However, even when I relaxed your min/max force constraints, I was unable to get my mockup to land successfully, demonstrating how complex planning will probably be required here. Unless it is possible to completely formulate this problem in linear programming space, I believe you will need to incorporate advanced planning or stochastic decision trees that are "smart" enough to continually use rotational methods to reorient the most flexible jets onto the currently most necessary axes.
Lastly, as I noted in the comments section, "On May 14, 2015, the source code for Space Engineers was made freely available on GitHub to the public." If you believe that game already contains this logic, that should be your starting place. However, I suspect you are bound to be disappointed. Most space game landing sequences simply take control of the ship and do not simulate "real" force vectors. Once you take control of a 3-d model, it is very easy to predetermine a 3d spline with rotation that will allow the ship to land softly and with perfect bearing at the predetermined time. Why would any game programmer go through this level of work for a landing sequence? This sort of logic could control ICBM missiles or planetary rover re-entry vehicles and it is simply overkill IMHO for a game (unless the very purpose of the game is to see if you can land a damaged spaceship with arbitrary jets and constraints without crashing).
I can introduce another technique into the mix of (awesome) answers proposed.
It lies more in AI, and provides close-to-optimal solutions. It's called Machine Learning, more specifically Q-Learning. It's surprisingly easy to implement but hard to get right.
The advantage is that the learning can be done offline, so the algorithm can then be super fast when used.
You could do the learning when the ship is built or when something happens to it (thruster destruction, large chunks torn away...).
Optimality
I observed you're looking for near-optimal solutions. Your method with parabolas is good for optimal control. What you did is this:
Observe the state of the system.
For every state (coming in too fast, too slow, heading away, closing in etc.) you devised an action (apply a strategy) that will bring the system into a state closer to the goal.
Repeat
This is pretty much intractable for a human in 3D (too many cases, will drive you nuts) however a machine may learn where to split the parabolas in every dimensions, and devise an optimal strategy by itself.
THe Q-learning works very similarly to us:
Observe the (secretized) state of the system
Select an action based on a strategy
If this action brought the system into a desirable state (closer to the goal), mark the action/initial state as more desirable
Repeat
Discretize your system's state.
For each state, have a map intialized quasi-randomly, which maps every state to an Action (this is the strategy). Also assign a desirability to each state (initially, zero everywhere and 1000000 to the target state (X=0, V=0).
Your state would be your 3 positions, 3 angles, 3translation speed, and three rotation speed.
Your actions can be any combination of thrusters
Training
Train the AI (offline phase):
Generate many diverse situations
Apply the strategy
Evaluate the new state
Let the algo (see links above) reinforce the selected strategies' desirability value.
Live usage in the game
After some time, a global strategy for navigation emerges. You then store it, and during your game loop you simply sample your strategy and apply it to each situation as they come up.
The strategy may still learn during this phase, but probably more slowly (because it happens real-time). (Btw, I dream of a game where the AI would learn from every user's feedback so we could collectively train it ^^)
Try this in a simple 1D problem, it devises a strategy remarkably quickly (a few seconds).
In 2D I believe excellent results could be obtained in an hour.
For 3D... You're looking at overnight computations. There's a few thing to try and accelerate the process:
Try to never 'forget' previous computations, and feed them as an initial 'best guess' strategy. Save it to a file!
You might drop some states (like ship roll maybe?) without losing much navigation optimality but increasing computation speed greatly. Maybe change referentials so the ship is always on the X-axis, this way you'll drop x&y dimensions!
States more frequently encountered will have a reliable and very optimal strategy. Maybe normalize the state to make your ship state always close to a 'standard' state?
Typically rotation speeds intervals may be bounded safely (you don't want a ship tumbling wildely, so the strategy will always be to "un-wind" that speed). Of course rotation angles are additionally bounded.
You can also probably discretize non-linearly the positions because farther away from the objective, precision won't affect the strategy much.
For these kind of problems there are two techniques available: bruteforce search and heuristics. Bruteforce means to recognize the problem as a blackbox with input and output parameters and the aim is to get the right input parameters for winning the game. To program such a bruteforce search, the gamephysics runs in a simulation loop (physics simulation) and via stochastic search (minimax, alpha-beta-prunning) every possibility is tried out. The disadvantage of bruteforce search is the high cpu consumption.
The other techniques utilizes knowledge about the game. Knowledge about motion primitives and about evaluation. This knowledge is programmed with normal computerlanguages like C++ or Java. The disadvantage of this idea is, that it is often difficult to grasp the knowledge.
The best practice for solving spaceship navigation is to combine both ideas into a hybrid system. For programming sourcecode for this concrete problem I estimate that nearly 2000 lines of code are necessary. These kind of problems are normaly done within huge projects with many programmers and takes about 6 months.

Fast spatial data structure for nearest neighbor search amongst non-uniformly sized hyperspheres

Given a k-dimensional continuous (euclidean) space filled with rather unpredictably moving/growing/shrinking  hyperspheres I need to repeatedly find the hypersphere whose surface is nearest to a given coordinate. If some hyperspheres are of the same distance to my coordinate, then the biggest hypersphere wins. (The total count of hyperspheres is guaranteed to stay the same over time.)
My first thought was to use a KDTree but it won't take the hyperspheres' non-uniform volumes into account.
So I looked further and found BVH (Bounding Volume Hierarchies) and BIH (Bounding Interval Hierarchies), which seem to do the trick. At least in 2-/3-dimensional space. However while finding quite a bit of info and visualizations on BVHs I could barely find anything on BIHs.
My basic requirement is a k-dimensional spatial data structure that takes volume into account and is either super fast to build (off-line) or dynamic with barely any unbalancing.
Given my requirements above, which data structure would you go with? Any other ones I didn't even mention?
Edit 1: Forgot to mention: hypershperes are allowed (actually highly expected) to overlap!
Edit 2: Looks like instead of "distance" (and "negative distance" in particular) my described metric matches the power of a point much better.
I'd expect a QuadTree/Octree/generalized to 2^K-tree for your dimensionality of K would do the trick; these recursively partition space, and presumably you can stop when a K-subcube (or K-rectangular brick if the splits aren't even) does not contain a hypersphere, or contains one or more hyperspheres such that partitioning doesn't separate any, or alternatively contains the center of just a single hypersphere (probably easier).
Inserting and deleting entities in such trees is fast, so a hypersphere changing size just causes a delete/insert pair of operations. (I suspect you can optimize this if your sphere size changes by local additional recursive partition if the sphere gets smaller, or local K-block merging if it grows).
I haven't worked with them, but you might also consider binary space partitions. These let you use binary trees instead of k-trees to partition your space. I understand that KDTrees are a special case of this.
But in any case I thought the insertion/deletion algorithms for 2^K trees and/or BSP/KDTrees was well understood and fast. So hypersphere size changes cause deletion/insertion operations but those are fast. So I don't understand your objection to KD-trees.
I think the performance of all these are asymptotically the same.
I would use the R*Tree extension for SQLite. A table would normally have 1 or 2 dimensional data. SQL queries can combine multiple tables to search in higher dimensions.
The formulation with negative distance is a little weird. Distance is positive in geometry, so there may not be much helpful theory to use.
A different formulation that uses only positive distances may be helpful. Read about hyperbolic spaces. This might help to provide ideas for other ways to describe distance.

Help--100% accuracy with LibSVM?

Nominally a good problem to have, but I'm pretty sure it is because something funny is going on...
As context, I'm working on a problem in the facial expression/recognition space, so getting 100% accuracy seems incredibly implausible (not that it would be plausible in most applications...). I'm guessing there is either some consistent bias in the data set that it making it overly easy for an SVM to pull out the answer, =or=, more likely, I've done something wrong on the SVM side.
I'm looking for suggestions to help understand what is going on--is it me (=my usage of LibSVM)? Or is it the data?
The details:
About ~2500 labeled data vectors/instances (transformed video frames of individuals--<20 individual persons total), binary classification problem. ~900 features/instance. Unbalanced data set at about a 1:4 ratio.
Ran subset.py to separate the data into test (500 instances) and train (remaining).
Ran "svm-train -t 0 ". (Note: apparently no need for '-w1 1 -w-1 4'...)
Ran svm-predict on the test file. Accuracy=100%!
Things tried:
Checked about 10 times over that I'm not training & testing on the same data files, through some inadvertent command-line argument error
re-ran subset.py (even with -s 1) multiple times and did train/test only multiple different data sets (in case I randomly upon the most magical train/test pa
ran a simple diff-like check to confirm that the test file is not a subset of the training data
svm-scale on the data has no effect on accuracy (accuracy=100%). (Although the number of support vectors does drop from nSV=127, bSV=64 to nBSV=72, bSV=0.)
((weird)) using the default RBF kernel (vice linear -- i.e., removing '-t 0') results in accuracy going to garbage(?!)
(sanity check) running svm-predict using a model trained on a scaled data set against an unscaled data set results in accuracy = 80% (i.e., it always guesses the dominant class). This is strictly a sanity check to make sure that somehow svm-predict is nominally acting right on my machine.
Tentative conclusion?:
Something with the data is wacked--somehow, within the data set, there is a subtle, experimenter-driven effect that the SVM is picking up on.
(This doesn't, on first pass, explain why the RBF kernel gives garbage results, however.)
Would greatly appreciate any suggestions on a) how to fix my usage of LibSVM (if that is actually the problem) or b) determine what subtle experimenter-bias in the data LibSVM is picking up on.
Two other ideas:
Make sure you're not training and testing on the same data. This sounds kind of dumb, but in computer vision applications you should take care that: make sure you're not repeating data (say two frames of the same video fall on different folds), you're not training and testing on the same individual, etc. It is more subtle than it sounds.
Make sure you search for gamma and C parameters for the RBF kernel. There are good theoretical (asymptotic) results that justify that a linear classifier is just a degenerate RBF classifier. So you should just look for a good (C, gamma) pair.
Notwithstanding that the devil is in the details, here are three simple tests you could try:
Quickie (~2 minutes): Run the data through a decision tree algorithm. This is available in Matlab via classregtree, or you can load into R and use rpart. This could tell you if one or just a few features happen to give a perfect separation.
Not-so-quickie (~10-60 minutes, depending on your infrastructure): Iteratively split the features (i.e. from 900 to 2 sets of 450), train, and test. If one of the subsets gives you perfect classification, split it again. It would take fewer than 10 such splits to find out where the problem variables are. If it happens to "break" with many variables remaining (or even in the first split), select a different random subset of features, shave off fewer variables at a time, etc. It can't possibly need all 900 to split the data.
Deeper analysis (minutes to several hours): try permutations of labels. If you can permute all of them and still get perfect separation, you have some problem in your train/test setup. If you select increasingly larger subsets to permute (or, if going in the other direction, to leave static), you can see where you begin to lose separability. Alternatively, consider decreasing your training set size and if you get separability even with a very small training set, then something is weird.
Method #1 is fast & should be insightful. There are some other methods I could recommend, but #1 and #2 are easy and it would be odd if they don't give any insights.

google app engine : query geohash

I have a db.StringProperty() of geohash, by given a hashcode, how do I find the closer 10 result?
I tried below but doesn't seem to be right
pois = POI.all().filter('geohash <', h_latlng).order('-geohash').fetch(10)
A geohash cannot accomplish the task to find the n-nearest results. You can find the contents of any square region by prefix. But to find a reliable result containing the n-nearest you need to fetch at least 9 prefixes, making it a quite expensive query. Complicating the matter is that prefixes of the 9 squares need to be calculated.
IMO this problem is currently a hard problem to solve efficiently on app-engine. So far, I am on it since a year and have not found a sophisticated and fast solution. A Relational DB with geo index or 2 inequalities will perform such tasks better and faster. But I am interested in good solutions, too. :-)
Citation David Troy:
Geohash also has the property that as
the number of digits decreases (from
the right), accuracy degrades. This
property can be used to do bounding
box searches, as points near to one
another will share similar Geohash
prefixes.
However, because a given point may
appear at the edge of a given Geohash
bounding box, it is necessary to
generate a list of Geohash values in
order to perform a true proximity
search around a point. Because the
Geohash algorithm uses a base-32
numbering system, it is possible to
derive the Geohash values surrounding
any other given Geohash value using a
simple lookup table.
See: https://github.com/davetroy/geohash-js

Similarity between line strings

I have a number of tracks recorded by a GPS, which more formally can be described as a number of line strings.
Now, some of the recorded tracks might be recordings of the same route, but because of inaccurasies in the GPS system, the fact that the recordings were made on separate occasions and that they might have been recorded travelling at different speeds, they won't match up perfectly, but still look close enough when viewed on a map by a human to determine that it's actually the same route that has been recorded.
I want to find an algorithm that calculates the similarity between two line strings. I have come up with some home grown methods to do this, but would like to know if this is a problem that's already has good algorithms to solve it.
How would you calculate the similarity, given that similar means represents the same path on a map?
Edit: For those unsure of what I'm talking about, please look at this link for a definition of what a line string is: http://msdn.microsoft.com/en-us/library/bb895372.aspx - I'm not asking about character strings.
Compute the Fréchet distance on each pair of tracks. The distance can be used to gauge the similarity of your tracks.
Math alert: Fréchet was a pioneer in the field of metric space which is relevant to your problem.
I would add a buffer around the first line based on the estimated probable error, and then determine if the second line fits entirely within the buffer.
To determine "same route," create the minimal set of normalized path vectors, calculate the total power differences and compare the total to a quality measure.
Normalize the GPS waypoints on total path length,
walk the vectors of the paths together, creating a new set of path vectors for each path based upon the shortest vector at each waypoint,
calculate the total power differences between endpoints of each vector in the normalized paths weighting for vector length, and
compare against a quality measure.
Tune the power of the differences (start with, say, squared differences) and the quality measure (say as a percent of the total power differences) visually. This algorithm produces a continuous quality measure of the path match as well as a binary result (Are the paths the same?)
Paul Tomblin said: I would add a buffer
around the first line based on the
estimated probable error, and then
determine if the second line fits
entirely within the buffer.
You could modify the algorithm as the normalized vector endpoints are compared. You could determine if any endpoint difference was above a certain size (implementing Paul's buffer idea) or perhaps, if the endpoints were outside the "buffer," use that fact to ignore that endpoint difference, allowing a comparison ignoring side trips.
You could walk along each point (Pa) of LineString A and measure the distance from Pa to the nearest line-segment of LineString B, averaging each of these distances.
This is not a quick or perfect method, but should be able to give use a useful number and is pretty quick to implement.
Do the line strings start and finish at similar points, or are they of very different extents?
If you consider a single line string to be a sequence of [x,y] points (or [x,y,z] points), then you could compute the similarity between each pair of line strings using the Needleman-Wunsch algorithm. As described in the referenced Wikipedia article, the Needleman-Wunsch algorithm requires a "similarity matrix" which defines the distance between a pair of points. However, it would be easy to use a function instead of a matrix. In your case you could simply use the 2D Euclidean distance function (or a 3D Euclidean function if your points have elevation) to provide the distance between each pair of points.
I actually side with the person (Aaron F) who said that you might be interested in the Levenshtein distance problem (and cited this). His answer seems to me to be the best so far.
More specifically, Levenshtein distance (also called edit distance), does not measure strictly the character-by-character distance, but also allows you to perform insertions and deletions. The best algorithm for this distance measure can be computed in quadratic time (pretty slow if your strings are long), but the computational biologists have pretty good heuristics for this, that might be of interest to you on their own. Check out BLAST and FASTA.
In your problem, it seems that you are dealing with differences between strings of numbers, and you care about the numbers. If you give more information, I might be able to direct you to the right variant of BLAST/FASTA/etc for your purposes. In any case, you might consider adapting BLAST and FASTA for your needs. They're quite simple.
1: http://en.wikipedia.org/wiki/Levenshtein_distance, http://www.nist.gov/dads/HTML/Levenshtein.html

Resources