Understanding the R-tree Quadratic Split example - database

Can someone explain to me how the distance for the quadratic split is being calculated in this example? If you could also suggest more examples, that would be really helpful. Thank you.

I think those quadratic split distances shown are considering the squares to be 1x1, not 10x5. The idea is to find how much space would be wasted in a bounding box that covered the two rectangles - for example, a bounding box covering R1 and R2 would be 4x2, area 8. Subtract 2 for the area of R1 and R2 - the wasted space is 6. Choose as seeds the two rectangles you would least want together, i.e. the two with the greatest wasted space.
This is explained in the original paper: http://pages.cs.wisc.edu/~cs764-1/rtree.pdf.
You can make up your own split algorithm. How good it is will affect the efficiency of the R-tree, but not the correctness.
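For concreteness, here is a minimal Python sketch of that seed-picking step (PickSeeds in Guttman's quadratic split), assuming rectangles are given as (xmin, ymin, xmax, ymax) tuples; the sample rectangles are made up so that R1 and R2 reproduce the wasted-space-of-6 example above:

    from itertools import combinations

    def area(r):
        # Area of a rectangle given as (xmin, ymin, xmax, ymax).
        return (r[2] - r[0]) * (r[3] - r[1])

    def mbr(a, b):
        # Minimum bounding rectangle covering rectangles a and b.
        return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

    def pick_seeds(rects):
        # Return the pair of rectangles whose covering box wastes the most space;
        # these become the seeds of the two groups.
        def waste(pair):
            a, b = pair
            return area(mbr(a, b)) - area(a) - area(b)
        return max(combinations(rects, 2), key=waste)

    # R1 and R2 are 1x1 squares whose covering box is 4x2 (area 8),
    # so the wasted space is 8 - 1 - 1 = 6.
    r1 = (0, 0, 1, 1)
    r2 = (3, 1, 4, 2)
    print(pick_seeds([r1, r2, (0, 1, 1, 2)]))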

Related

Proper Heuristic Mechanism For Hill Climbing

The following problem is an exam exercise I found from an Artificial Intelligence course.
"Suggest a heuristic mechanism that allows this problem to be solved, using the Hill-Climbing algorithm. (S=Start point, F=Final point/goal). No diagonal movement is allowed."
Since it's obvious that Manhattan distance or Euclidean distance will send the robot to (3,4), and no backtracking is allowed, what is a possible solution (heuristic mechanism) to this problem?
EDIT: To make the problem clearer, I've marked some of the Manhattan distances on the board:
It would be obvious that, using Manhattan distance, the robot's next move would be to (3,4), since it has a heuristic value of 2 - HC will choose that and get stuck forever. The aim is to avoid ever taking that path by finding a proper heuristic.
I thought of the obstructions as being hot, and that heat rises. I make the net cost of a cell the sum of the Manhattan metric distance to F plus a heat-penalty. Thus there is an attractive force drawing the robot towards F as well as a repelling force which forces it away from the obstructions.
There are two types of heat penalties:
1) It is very bad to touch an obstruction. Look at the 2 or 3 neighboring cells in the row immediately below a given cell. Add 15 if the cell directly below the given cell is an obstruction, and 10 for each diagonal neighbor in that row that is an obstruction.
2) For cells not in direct contact with the obstructions, the heat is more diffuse. I calculate it as 6 times the average number of obstruction blocks below the cell, taken over its column and its neighboring columns.
The following shows the result of combining this all, as well as the path taken from S to F:
A crucial point is the way that the averaging causes the robot to turn left rather than right when it hits the top row. The unheated columns towards the left make that the cooler direction. It is interesting to note how all cells (with the possible exception of the two at the upper-right corner) are drawn to F by this heuristic.
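If it helps, here is a rough Python sketch of that cost function, assuming the board is a list of strings with '#' marking obstructions and rows indexed top to bottom; the 15/10/6 weights come from the description above, everything else is my own reading of it:

    def net_cost(board, cell, goal):
        # Manhattan distance to the goal plus a "heat" penalty that repels
        # the robot from obstructions (sketch of the heuristic described above).
        r, c = cell
        gr, gc = goal
        rows, cols = len(board), len(board[0])
        cost = abs(r - gr) + abs(c - gc)  # attractive force towards F

        def blocked(rr, cc):
            return 0 <= rr < rows and 0 <= cc < cols and board[rr][cc] == '#'

        # Penalty 1: contact heat from the row immediately below.
        contact = 15 * blocked(r + 1, c)                               # directly below
        contact += 10 * sum(blocked(r + 1, c + dc) for dc in (-1, 1))  # diagonals below
        if contact:
            return cost + contact

        # Penalty 2: diffuse heat -- 6 times the average number of obstruction
        # cells below, taken over this column and its two neighbours.
        below = [sum(blocked(rr, cc) for rr in range(r + 1, rows))
                 for cc in (c - 1, c, c + 1) if 0 <= cc < cols]
        return cost + 6 * sum(below) / len(below)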

Fastest way to find minimum distance of one point to points on a curve

I'm looking for a fast solution for the following problem:
I have a fixed point (let's say the upper right on the white measurement line) and need to find the closest point on a curve made of equally spaced points (the lower curve). Additionally, I do this for every point on the upper curve to draw the distances between the curves with different colours (three levels: below minimum [red], between minimum and maximum [orange] and above maximum [green]).
My current solution is a tradeoff: I take the fixed point, iterate through an arbitrary interval (e.g. 50 units to the left and right of the fixed point) and calculate the distance of each pair. This saves some CPU power, but it is neither elegant nor accurate, since I could miss a minimum distance outside my chosen interval.
Any proposals for a faster algorithm?
Edit: Equally spaced means all points have the same distance on the x-axis; this is true for both curves. Also, I do not need to interpolate between the points, as that would be too time consuming.
Rather than an arbitrary distance, you could perhaps iterate until "out of range".
In your example, suppose you start with the point on the upper curve at the top-right of your line. Then, dropping vertically downwards, you get a distance of (by my eye) about 200um.
Now you can move right from here testing points until the horizontal distance is 200um. Beyond that, it's impossible to get a distance less than 200um.
Moving left, the distance goes down until you find the 150um minimum, then starts rising again. Once you're 150um to the left of your upper point, again, it's impossible to beat the minimum you've found.
If you'd gone left first, you wouldn't have had to go so far right, so as an optimization either follow the direction in which the distance falls, or else work out from the middle in both directions at once.
I don't know how many um 50 units is, so this might be slower or faster than what you have. It does avoid the risk of missing a lower value, though.
Since you're doing lots of tests against the same set of points on the lower curve, you can probably improve on this by ignoring the fact that the points form a curve at all. Stick them all in a k-d tree or similar, and search that repeatedly. This is called a nearest neighbor search.
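For example, a small sketch with SciPy's cKDTree (the curve data here is made up):

    import numpy as np
    from scipy.spatial import cKDTree

    # Hypothetical equally spaced curves: lower_curve is searched, upper_curve queries it.
    x = np.arange(0.0, 500.0)
    lower_curve = np.column_stack([x, 100.0 * np.sin(x / 50.0)])
    upper_curve = np.column_stack([x, 100.0 * np.sin(x / 50.0) + 180.0])

    tree = cKDTree(lower_curve)                   # build once
    distances, indices = tree.query(upper_curve)  # nearest lower point for every upper point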
It may help to identify this problem as a nearest neighbour search problem. That link includes a good discussion about the various algorithms that are used for this. If you are OK with using C++ rather than straight C, ANN looks like a good library for this.
It also looks as though this question has been asked before.
We can label the top curve y=t(x) and the bottom curve y=b(x). Label the closest-function x_b=c(x_t). We know that the closest-function is weakly monotone non-decreasing as two shortest paths never cross each other.
If you know that the distance function d(x_t,x_b) has only one local minimum for every fixed x_t (this happens if the curve is "smooth enough"), then you can save time by "walking" the curve:
- start with x_t=0, x_b=0
- while x_t <= x_max
-- find the closest x_b by local search
(increment x_b while the distance is decreasing)
-- add {x_t, x_b} to the result set
-- increment x_t
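A runnable sketch of that walk, assuming both curves are lists of (x, y) tuples ordered by x:

    import math

    def walk_closest(top, bottom):
        # For each point on the top curve, find its closest point on the bottom
        # curve by a local search that never moves the bottom index backwards
        # (valid under the single-local-minimum assumption above).
        result = []
        j = 0
        for t in top:
            d = lambda p: math.hypot(t[0] - p[0], t[1] - p[1])
            while j + 1 < len(bottom) and d(bottom[j + 1]) <= d(bottom[j]):
                j += 1
            result.append((j, d(bottom[j])))
        return result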
If you expect x_b to be smooth enough, but you cannot assume that and you want an exact result:
Walk the curve in both directions. Where the results agree, they are correct. Where they disagree, run a complete search between the two results (the leftmost and the rightmost local minima). Sample the "ambiguous block" in binary-division order to allow the most pruning due to the monotonicity.
As a middle ground:
Walk the curve in both directions. If the results disagree, choose the better of the two. If you can guarantee at most two local minima for each fixed x_t, this produces the optimal solution. There are still some pathological cases where the optimal solution is not found; these contain a local minimum that is flanked by two other local minima that are both worse than it. I dare say it is uncommon to find a case where the solution is far from optimal (assuming a smooth y=b(x)).

object / shape / piece fitting

I've been thinking for a few days about the best solution for this but can't seem to get the right idea on how to do this.
I have pieces (objects) and I want to fit them in the smallest possible space.
What I'm ultimately looking for is something like this
http://i.stack.imgur.com/Yg09E.gif
But a simpler version that just calculates the best possible fit of two lines (stripes) would already do for now,
like the lines (stripes) on the right:
http://i.stack.imgur.com/HijMo.jpg
What I have is 2 arrays of points (vertices) on an xy plane representing two lines (stripes), and I'd like to arrange them in such a manner that there is 10 or 20 mm of space between the closest points of the two.
I was thinking of looking at the first half of the array and finding the highest point, then looking at the second half and finding its highest point, and then comparing the two,
but that doesn't really seem to be a proper solution.
And I can't really imagine that a program fitting shapes as in the first image is even possible using such methods.
Can anyone guide me in the right direction?
Well, this is really possible.
All you would have to do is build area and distance functions. You might need different algorithms for different kinds of shapes.
For the ones you have provided in the first picture, it is difficult to calculate the area, so you will probably have to work with distances between vertices. You also need to add a condition to make sure that the outlines of the shapes do not coincide at any point.
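For the simpler two-stripe case from the question, here is one possible sketch. It assumes both stripes are sampled at the same x positions and are stacked vertically, and it measures vertical clearance rather than true closest distance, so treat it as a starting point only:

    def stack_offset(lower, upper, clearance=10.0):
        # How far to shift `upper` upwards so that, at every shared x position,
        # it clears `lower` by at least `clearance` mm (with equality somewhere).
        # Each stripe is a list of (x, y) points sampled at the same x values.
        return clearance + max(ly - uy for (_, ly), (_, uy) in zip(lower, upper))

    # Hypothetical stripes sampled every 1 mm:
    lower = [(x, 0.5 * x) for x in range(100)]
    upper = [(x, 0.4 * x) for x in range(100)]
    dy = stack_offset(lower, upper, clearance=10.0)
    packed_upper = [(x, y + dy) for x, y in upper]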

smallest "scrabble board" containing every word of a list

I am looking for an algorithm able to build an array (2D) of letters from which I could extract each word of a given list.
Like in Scrabble, words can cross each other, and be horizontal, vertical or diagonal. Of course there are some obvious solutions, but the goal here is to make the array as small as possible, which also means maximizing the number of crossings.
I have thought of a machine learning method using a large set of scrabble grids, either made by humans or computers, but I am sure there is a cleaner way of doing it.
Thanks for your help.
PS: that is for an art project, no kidding.
That would be quite some algorithm. I suspect the solution will involve some sort of recursion.
Let's say you have a grid G0 to start with, with all squares blank, and that f(G0) is the optimised completed grid.
Then I would try:
For each possible position of the first word
- set G1 = the grid with this word in this position and all other squares blank
- work out G1
Go on to next position
To work out G1, you could call f(G1) recursively.
If you had a large grid and a lot of words, this would take forever to run, as it's a wasteful algorithm, but with a typical Scrabble board I should think it would be quick enough on a laptop.
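A brute-force Python sketch of that recursion, workable only for a handful of short words; it tries every position and direction (horizontal, vertical, diagonal, no reversals) for each word and keeps the grid with the smallest bounding box:

    from itertools import product

    def place(grid, word, r, c, dr, dc):
        # Try to write `word` into `grid` starting at (r, c), stepping by (dr, dc).
        # `grid` maps (row, col) -> letter; return the new grid, or None on conflict.
        new = dict(grid)
        for i, ch in enumerate(word):
            pos = (r + i * dr, c + i * dc)
            if new.get(pos, ch) != ch:
                return None
            new[pos] = ch
        return new

    def bbox_area(grid):
        rows = [r for r, _ in grid]
        cols = [c for _, c in grid]
        return (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)

    def best_grid(grid, words, best=None):
        # Recurse over every placement of the next word; keep the smallest result.
        if not words:
            a = bbox_area(grid)
            return (a, grid) if best is None or a < best[0] else best
        word, rest = words[0], words[1:]
        rows = [r for r, _ in grid] or [0]
        cols = [c for _, c in grid] or [0]
        n = len(word)
        for r, c in product(range(min(rows) - n, max(rows) + n + 1),
                            range(min(cols) - n, max(cols) + n + 1)):
            for step in ((0, 1), (1, 0), (1, 1)):  # horizontal, vertical, diagonal
                new = place(grid, word, r, c, *step)
                if new is not None:
                    best = best_grid(new, rest, best)
        return best

    area, grid = best_grid({}, ["CAT", "TEA"])
    print(area)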

About curse of dimensionality

My question is about this topic I've been reading about a bit. Basically my understanding is that in higher dimensions all points end up being very close to each other.
The doubt I have is whether this means that calculating distances the usual way (Euclidean, for instance) is valid or not. If it were still valid, it would mean that when comparing vectors in high dimensions, the two most similar wouldn't differ much from a third one, even if that third one were completely unrelated.
Is this correct? Then in this case, how would you be able to tell whether you have a match or not?
Basically the distance measurement is still correct; however, it becomes meaningless when you have "real world" data, which is noisy.
The effect we are talking about here is that a high distance between two points in one dimension gets quickly overshadowed by small distances in all the other dimensions. That's why, in the end, all points end up with roughly the same distance. There is a good illustration for this:
Say we want to classify data based on their value in each dimension, and each dimension has a range of 0..1. We divide each dimension once: values in [0, 0.5) are positive, values in [0.5, 1] are negative. With this rule, in 3 dimensions, 12.5% of the space is covered. In 5 dimensions, it is only 3.1%. In 10 dimensions, it is less than 0.1%.
So in each dimension we still allow half of the overall value range! Which is quite a lot. But all of it ends up in 0.1% of the total space -- the differences between these data points are huge in each dimension, but negligible over the whole space.
You can go further and say in each dimension you cut only 10% of the range. So you allow values in [0, 0.9). You still end up with less than 35% of the whole space covered in 10 dimensions. In 50 dimensions, it is 0.5%. So you see, wide ranges of data in each dimension are crammed into a very small portion of your search space.
That's why you need dimensionality reduction, where you basically disregard differences on less informative axes.
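The percentages above are just 0.5^d and 0.9^d, which a couple of lines of Python confirm:

    # Fraction of the unit hypercube kept when keeping a fixed fraction of each axis.
    for d in (3, 5, 10, 50):
        print(d, f"{0.5 ** d:.2%}", f"{0.9 ** d:.2%}")
    # 0.5^3 = 12.50%, 0.5^10 = 0.10%, 0.9^10 = 34.87%, 0.9^50 = 0.52%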
Here is a simple explanation in layman's terms, illustrated below.
Suppose you have some data features x1 and x2 (you can assume they are blood pressure and blood sugar levels) and you want to perform K-nearest neighbor classification. If we plot the data in 2D, we can easily see that the data nicely group together; each point has some close neighbors that we can use for our calculations.
Now let's say we decide to consider a new third feature x3 (say age) for our analysis.
Case (b) shows a situation where all of our previous data come from people of the same age. You can see that they are all located at the same level along the age (x3) axis.
Now we can quickly see that if we want to consider age for our classification, there is a lot of empty space along the age (x3) axis.
The data that we currently have only cover a single value of age. What happens if we want to make a prediction for someone who has a different age (red dot)?
As you can see, there are not enough data points close to this point to calculate the distance and find some neighbors. So, if we want to have good predictions with this new third feature, we have to go and gather more data from people of different ages to fill the empty space along the age axis.
Case (c) essentially shows the same concept. Here, assume our initial data were gathered from people of different ages (i.e. we did not care about age in our previous two-feature classification task and might have assumed that this feature has no effect on our classification).
In this case, then, our 2D data come from people of different ages (the third feature). Now, what happens to our relatively closely located 2D data if we plot them in 3D? We can see that they are now more distant from each other (more sparse) in our new higher-dimensional space (3D). As a result, finding the neighbors becomes harder, since we don't have enough data for different values along our new third feature.
You can imagine that as we add more dimensions, the data become more and more spread apart. (In other words, we need more and more data if we want to avoid sparsity.)
