My question is about the Motivation for implementing a heuristic O(n) algorithm.
There are some generic solutions to this algorithmic problem of
generating the minimum number of operations to transform one tree into
another. However, the state of the art algorithms have a complexity in
the order of O(n^3) where n is the number of elements in the tree.
Why does transforming one tree into another have a complexity of O(n^3)?
If we used this in React, displaying 1000 elements would require in
the order of one billion comparisons. This is far too expensive.
Instead, React implements a heuristic O(n) algorithm based on two assumptions:
Two elements of different types will produce different trees.
The developer can hint at which child elements may be stable across different renders with a key prop.
Can you elaborate on what is heuristic in React's implementation?
Do the assumptions make it O(n) in an average case?
There are pretty good reasons why React's diff algorithm is the way it is, but the documented "motivation" doesn't really make enough sense to be the whole story.
Firstly, while it's certainly true that an optimal "tree diff" takes O(N^3) time, a "tree diff" algorithm is not the single best alternative to what React actually does, and in fact doesn't really fit well into React's rendering process at all. This is mostly because, in the worst case, rendering a React component produces a list (not a tree) of React elements that needs to be matched up against a list of preexisting components.
There is no new tree when the diff is performed, because the new list needs to be matched against the preexisting tree before the children of the new elements are rendered. In fact, the results of the diff are required to decide whether or not to re-render the children at all.
So... In matching up these lists, we might compare the React diff against the standard Longest Common Subsequence (LCS) algorithm, which is an O(N^2) algorithm. That's still pretty slow, and there is a performance argument to be made. If LCS were as fast as the React diff, then it would have a place in the rendering process for sure.
But not only is LCS kind of slow, it also doesn't do the right thing. When React is matching the list of new elements up against the old tree, it is deciding whether each element is a new component or just a prop update to a pre-existing component. LCS could find the largest possible matching of element types, but the largest possible matching isn't necessarily what the developer wants.
So, the problem with LCS (or a tree diff, if you really want to push the point), is not just that it's slow, but that it's slow and the answer it provides is still just a guess at the developer's intent. Slow algorithms just aren't worth it when they still make mistakes.
There are a lot of other fast algorithms, though, that React developers could have used that would be more accurate in most cases, but then the question is "Is it worth it?" Generally, the answer is "no", because no algorithm can do a really good job of guessing a developer's intent, and guessing the developer's intent is actually not necessary.
When it's important to a developer that their new elements are properly matched up to their preexisting components so they don't have to re-render, the developer should make sure that this is the case. It's very easy: just provide key props when rendering a list. Developers should pretty much always do this when rendering a list, so that the component matching can be perfect and they don't have to settle for any kind of guess.
React will produce a warning if you don't put in key props where they are required to make the matching unambiguous, which is far more helpful than a better diff. When you see it you should fix your components by adding the proper key props, and then the matching will be perfect and it doesn't matter that there are other algorithms that could do a better job on badly written components.
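To make the key idea concrete, here is a minimal Python sketch (not React's actual reconciler; the element structure and the function name are assumptions for illustration) of how key-based matching pairs a new list of children against the old one in a single linear pass, falling back to position when no key is given:

    # Illustrative sketch only -- not React's actual implementation.
    # Each "element" is modeled here as a dict with a type and an optional key.

    def match_children(old_children, new_children):
        # Index the old children by key, then decide update vs. mount in one O(n) pass.
        old_by_key = {el["key"]: el for el in old_children if el.get("key") is not None}
        decisions = []
        for index, new_el in enumerate(new_children):
            if new_el.get("key") is not None:
                old_el = old_by_key.get(new_el["key"])
            else:
                old_el = old_children[index] if index < len(old_children) else None
            if old_el is not None and old_el["type"] == new_el["type"]:
                decisions.append(("update", old_el, new_el))   # same type: keep the component, update props
            else:
                decisions.append(("mount", None, new_el))      # no match or different type: fresh mount
        return decisions

    old = [{"type": "li", "key": "a"}, {"type": "li", "key": "b"}]
    new = [{"type": "li", "key": "b"}, {"type": "li", "key": "a"}]
    print(match_children(old, new))  # both entries are updates, despite the reorder

Without keys, the same reorder would be matched purely by position, so component state would end up attached to the wrong items; with keys, the matching is unambiguous and still linear.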
The transformation is based only on these assumptions:
Two elements of different types will produce different trees.
The developer can hint at which child elements may be stable across different renders with a key prop.
The whole tree is not re-rendered if keys do not change or no new elements are added.
This management is based on assumptions drawn from experience, which is why it is a heuristic.
Current state-of-the-art diffing algorithms have a complexity of O(n^3) (where n is the number of nodes) to find the minimum number of transform operations needed to transform one tree into another. But this complexity, as the React docs mention, can be too high, and in usual cases it is not needed.
That is why React uses a heuristic that on average takes O(n) (linear time).
Two elements of different types will produce different trees.
The developer can hint at which child elements may be stable across different renders with a key prop.
Being a heuristic means that there are cases where the diff may produce more transformations than necessary (it is not optimal in general). However, in the usual, frequently occurring cases the two algorithms (optimal and heuristic) can produce exactly the same results (with the heuristic taking less time), or the differences between them have minimal impact on performance.
PS:
Why does transforming one tree into another have a complexity of O(n^3)?
In order to answer this question one has to look into state-of-the-art algorithms. But in general the answer is that many comparisons (between nodes and their children) have to be made in order to find the minimum number of necessary transformations.
Related
I think I understand the basic concept of simulated annealing. It's basically adding random solutions to cover a better area of the search space at the beginning then slowly reducing the randomness as the algorithm continues running.
I'm a little confused on how I would implement this into my genetic algorithm.
Can anyone give me a simple explanation of what I need to do and clarify that my understand of how simulated annealing works is correct?
When constructing a new generation of individuals in a genetic algorithm, there are three random aspects to it:
Matching parent individuals to parent individuals, with preference according to their proportional fitness,
Choosing the crossover point, and,
Mutating the offspring.
There's not much you can do about the second one, since that's typically a uniform random distribution. You could conceivably try to add some random factor to the roulette wheel as you're selecting your parent individuals, and then slowly decrease that randomness. But that goes against the spirit of the genetic algorithm and (more importantly) I don't think it will do much good. I think it would hurt, actually.
That leaves the third factor-- change the mutation rate from high mutation to low mutation as the generations go by.
It's really not any more complicated than that.
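For illustration, here is a minimal Python sketch of that idea: the per-gene mutation probability starts high and decays exponentially as generations pass. The schedule, constants, and bit-string encoding are assumptions made for the example, not part of any particular GA library.

    import math
    import random

    # Illustrative sketch: an annealing-style mutation schedule for a genetic algorithm.
    # The decay constant, rates, and bit-string representation are assumed for the example.

    def mutation_rate(generation, start=0.30, end=0.01, decay=0.05):
        # Exponentially decay the per-gene mutation probability as generations pass.
        return end + (start - end) * math.exp(-decay * generation)

    def mutate(individual, rate):
        # Flip each bit independently with probability `rate`.
        return [1 - gene if random.random() < rate else gene for gene in individual]

    individual = [0, 1, 1, 0, 1]
    for generation in (0, 50, 200):
        rate = mutation_rate(generation)
        print(generation, round(rate, 3), mutate(individual, rate))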
What are the relevant differences, in terms of performance and use cases, between simulated annealing (with beam search) and genetic algorithms?
I know that SA can be thought as GA where the population size is only one, but I don't know the key difference between the two.
Also, I am trying to think of a situation where SA will outperform GA or GA will outperform SA. Just one simple example which will help me understand will be enough.
Well, strictly speaking, these two things--simulated annealing (SA) and genetic algorithms (GA)--are neither algorithms, nor is their purpose 'data mining'.
Both are meta-heuristics--a couple of levels above 'algorithm' on the abstraction scale. In other words, both terms refer to high-level metaphors--one borrowed from metallurgy and the other from evolutionary biology. In the meta-heuristic taxonomy, SA is a single-state method and GA is a population method (in a sub-class along with PSO, ACO, et al, usually referred to as biologically-inspired meta-heuristics).
These two meta-heuristics are used to solve optimization problems, particularly (though not exclusively) in combinatorial optimization (aka constraint-satisfaction programming). Combinatorial optimization refers to optimization by selecting from among a set of discrete items--in other words, there is no continuous function to minimize. The knapsack problem, traveling salesman problem, cutting stock problem--are all combinatorial optimization problems.
The connection to data mining is that the core of many (most?) supervised Machine Learning (ML) algorithms is the solution of an optimization problem--(Multi-Layer Perceptron and Support Vector Machines for instance).
Any solution technique to solve c/o problems, regardless of the algorithm, will consist essentially of these steps (which are typically coded as a single block within a recursive loop):
(i) encode the domain-specific details in a cost function (it's the step-wise minimization of the value returned from this function that constitutes a 'solution' to the c/o problem);
(ii) evaluate the cost function passing in an initial 'guess' (to begin iteration);
(iii) based on the value returned from the cost function, generate a subsequent candidate solution (or more than one, depending on the meta-heuristic) to the cost function;
(iv) evaluate each candidate solution by passing it in an argument set to the cost function;
(v) repeat steps (iii) and (iv) until either some convergence criterion is satisfied or a maximum number of iterations is reached.
Meta-heuristics are directed to step (iii) above; hence, SA and GA differ in how they generate candidate solutions for evaluation by the cost function. In other words, that's the place to look to understand how these two meta-heuristics differ.
Informally, the essence of an algorithm directed to solution of combinatorial optimization is how it handles a candidate solution whose value returned from the cost function is worse than the current best candidate solution (the one that returns the lowest value from the cost function). The simplest way for an optimization algorithm to handle such a candidate solution is to reject it outright--that's what the hill climbing algorithm does. But by doing this, simple hill climbing will always miss a better solution separated from the current solution by a hill. Put another way, a sophisticated optimization algorithm has to include a technique for (temporarily) accepting a candidate solution worse than (i.e., uphill from) the current best solution because an even better solution than the current one might lie along a path through that worse solution.
So how do SA and GA generate candidate solutions?
The essence of SA is usually expressed in terms of the probability that a higher-cost candidate solution will be accepted (the entire expression inside the parentheses is the exponent):
p = e^(-(highCost - lowCost)/temperature)
Or in Python:
p = math.exp(-(hiCost - loCost) / T)
The 'temperature' term is a variable whose value decays during progress of the optimization--and therefore, the probability that SA will accept a worse solution decreases as iteration number increases.
Put another way, when the algorithm begins iterating, T is very large, which as you can see, causes the algorithm to move to every newly created candidate solution, whether better or worse than the current best solution--i.e., it is doing a random walk in the solution space. As iteration number increases (i.e., as the temperature cools) the algorithm's search of the solution space becomes less permissive, until at T = 0, the behavior is identical to a simple hill-climbing algorithm (i.e., only solutions better than the current best solution are accepted).
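For concreteness, here is a minimal Python sketch of this acceptance rule with a geometric cooling schedule; the cost function, neighbor move, and cooling constants are placeholder assumptions for the example.

    import math
    import random

    # Illustrative simulated annealing loop. The cost function, neighbor move,
    # and cooling schedule below are assumptions for the example only.

    def anneal(initial, cost, neighbor, t_start=10.0, t_end=1e-3, cooling=0.95):
        current, current_cost = initial, cost(initial)
        best, best_cost = current, current_cost
        temperature = t_start
        while temperature > t_end:
            candidate = neighbor(current)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept better solutions; accept worse ones with probability
            # exp(-delta / T), which shrinks as the temperature cools.
            if delta <= 0 or random.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
            temperature *= cooling
        return best, best_cost

    # Toy usage: minimize (x - 3)^2 over the integers by stepping +/- 1.
    print(anneal(0, lambda x: (x - 3) ** 2, lambda x: x + random.choice([-1, 1])))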
Genetic Algorithms are very different. For one thing--and this is a big thing--a GA generates not a single candidate solution but an entire 'population' of them. It works like this: GA calls the cost function on each member (candidate solution) of the population. It then ranks them, from best to worst, ordered by the value returned from the cost function ('best' has the lowest value). From these ranked values (and their corresponding candidate solutions) the next population is created. New members of the population are created in essentially one of three ways. The first is usually referred to as 'elitism' and in practice usually refers to just taking the highest ranked candidate solutions and passing them straight through--unmodified--to the next generation. The other two ways that new members of the population are created are usually referred to as 'mutation' and 'crossover'. Mutation usually involves a change in one element in a candidate solution vector from the current population to create a solution vector in the new population, e.g., [4, 5, 1, 0, 2] => [4, 5, 2, 0, 2]. The result of the crossover operation is like what would happen if vectors could have sex--i.e., a new child vector whose elements are composed of some from each of two parents.
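Here is a minimal Python sketch of those three operations (elitism, crossover, mutation) on integer solution vectors; the toy cost function, rates, and population size are assumptions chosen only for illustration.

    import random

    # Illustrative GA generation step. The toy cost function (distance to a
    # target vector), rates, and sizes are assumptions for the example.

    TARGET = [4, 5, 1, 0, 2]
    cost = lambda v: sum(abs(a - b) for a, b in zip(v, TARGET))

    def mutate(vector):
        # Change one element of the vector, e.g. [4, 5, 1, 0, 2] -> [4, 5, 2, 0, 2].
        child = vector[:]
        child[random.randrange(len(child))] = random.randint(0, 5)
        return child

    def crossover(mother, father):
        # Splice the two parents at a random point to form a child vector.
        point = random.randrange(1, len(mother))
        return mother[:point] + father[point:]

    def next_generation(population, elite=2):
        ranked = sorted(population, key=cost)   # best (lowest cost) first
        new_population = ranked[:elite]         # elitism: pass the best through unmodified
        while len(new_population) < len(population):
            mother, father = random.sample(ranked[: len(ranked) // 2], 2)
            new_population.append(mutate(crossover(mother, father)))
        return new_population

    population = [[random.randint(0, 5) for _ in range(5)] for _ in range(10)]
    for _ in range(50):
        population = next_generation(population)
    best = min(population, key=cost)
    print(best, cost(best))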
So those are the algorithmic differences between GA and SA. What about the differences in performance?
In practice (my observations are limited to combinatorial optimization problems), GA nearly always beats SA (returns a lower 'best' value from the cost function--i.e., a value closer to the solution space's global minimum), but at a higher computation cost. As far as I am aware, the textbooks and technical publications recite the same conclusion on solution quality.
But here's the thing: GA is inherently parallelizable; what's more, it's trivial to parallelize because the individual "search agents" comprising each population do not need to exchange messages--i.e., they work independently of each other. Obviously that means GA computation can be distributed, which means in practice you can get much better results (closer to the global minimum) and better performance (execution speed).
In what circumstances might SA outperform GA? The general scenario, I think, would be those optimization problems having a small solution space, so that the results from SA and GA are practically the same, yet the execution context (e.g., hundreds of similar problems run in batch mode) favors the faster algorithm (which should always be SA).
It is really difficult to compare the two since they were inspired by different domains.
A Genetic Algorithm maintains a population of possible solutions and, at each step, selects pairs of possible solutions, combines them (crossover), and applies some random changes (mutation). The algorithm is based on the idea of "survival of the fittest", where the selection process is done according to a fitness criterion (usually in optimization problems it is simply the value of the objective function evaluated at the current solution). The crossover is done in the hope that two good solutions, when combined, might give an even better solution.
On the other hand, Simulated Annealing only tracks one solution in the space of possible solutions, and at each iteration considers whether to move to a neighboring solution or stay in the current one according to some probability (which decays over time). This is different from a heuristic search (say greedy search) in that it doesn't suffer from the problem of local optima, since it can get unstuck from cases where all neighboring solutions are worse than the current one.
I'm far from an expert on these algorithms, but I'll try and help out.
I think the biggest difference between the two is the idea of crossover in GA and so any example of a learning task that is better suited to GA than SA is going to hinge on what crossover means in that situation and how it is implemented.
The idea of crossover is that you can meaningfully combine two solutions to produce a better one. I think this only makes sense if the solutions to a problem are structured in some way. I could imagine, for example, in multi-class classification taking two (or many) classifiers that are good at classifying a particular class and combining them by voting to make a much better classifier. Another example might be Genetic Programming, where the solution can be expressed as a tree, but I find it hard to come up with a good example where you could combine two programs to create a better one.
I think it's difficult to come up with a compelling case for one over the other because they really are quite similar algorithms, perhaps having been developed from very different starting points.
Suppose I build up a decision tree A using ID3 from a training and validation set, but A is unpruned.
At the same time, I have another decision tree B, also built with ID3 from the same training and validation set, but B is pruned.
Now I test both A and B on a future unlabeled test set: is it always the case that the pruned tree will perform better?
Any idea is welcome, thanks.
I think we need to make the distinction clearer: pruned trees always perform better on the validation set, but not necessarily so on the testing set (in fact, a pruned tree also has equal or worse performance on the training set). I am assuming that the pruning is done after the tree is built (i.e., post-pruning).
Remember that the whole reason of using a validation set is to avoid overfitting over the training dataset, and the key point here is generalization: we want a model (decision tree) that generalizes beyond the instances that have been provided at "training time" to new unseen examples.
Pruning is supposed to improve classification by preventing overfitting. Since pruning will only occur if it improves classification rates on the validation set, a pruned tree will perform as well or better than an un-pruned tree during validation.
Bad pruning can lead to wrong results. Although a reduced decision tree size is often desired, you usually aim for better results when pruning. Therefore, how you prune is the crux of the matter.
I agree with the first answer by @AMRO. Post-pruning is the most common approach for decision tree pruning and it is done after the tree is built. But pre-pruning can also be done. In pre-pruning, a tree is pruned by halting its construction early, using a specified threshold value -- for example, by deciding not to split the subset of training tuples at a given node.
Then that node becomes a leaf. This leaf may hold the most frequent class among the subset of tuples or the probability of those tuples.
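A minimal Python sketch of such a pre-pruning check follows; the thresholds and the impurity-based stopping rule are assumptions for illustration rather than part of the ID3 specification.

    from collections import Counter

    # Illustrative pre-pruning: stop splitting when a node holds too few
    # examples or is already pure enough. Threshold values are assumed.

    MIN_SAMPLES = 5
    MIN_IMPURITY = 0.10

    def impurity(labels):
        # Gini impurity of the label multiset at a node.
        total = len(labels)
        return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

    def should_stop(labels):
        # Pre-pruning rule: make this node a leaf instead of splitting further.
        return len(labels) < MIN_SAMPLES or impurity(labels) < MIN_IMPURITY

    def leaf_label(labels):
        # The leaf predicts the most frequent class among its training tuples.
        return Counter(labels).most_common(1)[0][0]

    labels_at_node = ["yes", "yes", "yes", "no"]
    print(should_stop(labels_at_node), leaf_label(labels_at_node))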
I have thought of some heuristics for a big (higher dimensions) tic-tac-toe game. How do I check which of them are actually consistent?
What is meant by consistency anyways?
Heuristics produce some sort of cost value for a given state. Consistency in this context means that the estimate for a state is less than or equal to the cost of moving to the next state plus the estimate for that new state. If this weren't true, it would imply that - if the heuristic were accurate - transitioning from one state to the next could incur negative cost, which is typically impossible or incorrect.
This is intuitive to see in pathfinding, since you expect every step along the path to take some time, so the estimate at one step can exceed the estimate at the following step by at most the cost of that step. It's probably a bit more complex for tic-tac-toe since you probably have to arbitrarily decide what constitutes a 'cost' in your system. If your heuristic can go both up and down as a result of playing a move - e.g. because you encode good moves with positive numbers and bad moves with negative numbers - then your heuristic cannot be consistent.
However, lack of a consistent heuristic is not always a problem. You may not be guaranteed of reaching an optimal solution without one, but it may still speed up the search compared to a brute force state search.
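As a rough sanity check of the definition above, here is a Python sketch that tests the consistency inequality h(s) <= cost(s, s') + h(s') over a set of sampled transitions; the heuristic, cost function, and successor generator are placeholders you would replace with your own game's definitions.

    # Illustrative consistency check. `heuristic`, `cost`, and `successors`
    # are placeholders for whatever your game actually defines.

    def is_consistent(heuristic, cost, successors, states):
        # Verify h(s) <= cost(s, s') + h(s') for every sampled transition.
        for state in states:
            for next_state in successors(state):
                if heuristic(state) > cost(state, next_state) + heuristic(next_state):
                    return False
        return True

    # Toy usage on a 1-D "walk to 10" puzzle: distance-to-goal is consistent
    # when every move costs 1.
    states = range(10)
    print(is_consistent(lambda s: 10 - s,
                        lambda s, t: 1,
                        lambda s: [s + 1],
                        states))  # True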
EDITED: This answer confused admissibility and consistency. I have corrected it to refer to admissibility, but the original question was about consistency, and this answer does not fully answer the question.
You could do it analytically, by distinguishing all different cases and thereby proving that your heuristic is indeed admissible.
For informed search, a heuristic is admissible for a search problem (say, the search for the best move in a game) if and only if it never overestimates the 'distance' to a suitable goal state.
EXAMPLE: Search for the shortest route to a target city via a network of highways between cities. Here, one could use the Euclidean distance as a heuristic: the length of a straight line to the goal is always shorter than or equal to the length of the best possible route.
Admissibility is required by algorithms like A*, which are then guaranteed to be optimal (i.e., they will find the best 'route' to a goal state if one exists).
I would recommend looking the topic up in an AI textbook.
Among the known limitations of Joe Celko's nested sets (modified pre-order traversal) is marked degradation in performance as the tree grows to a large size.
Vadim Tropashko proposed nested intervals, and provides examples and theory explanation in this paper: http://arxiv.org/html/cs.DB/0401014
Is this a viable solution? Are there any working examples (in any language) abstracted away from the native DB layer?
While I've seen examples for nested sets, I haven't seen much for nested intervals, although in theory it shouldn't be difficult to convert from one to the other. Instead of doing a pre-order traversal to label the nodes, do a breadth-first recursion. The trick is to work out the most efficient way of labelling n children of a node. Since the node between a/b and c/d is (a+c)/(b+d), an ill-conditioned insert (for instance, inserting the children left to right) runs the risk of creating the same exponential growth in the index values as, for instance, using a full materialized path. It is not difficult to counteract this effect - create the new indexes one at a time, inserting each at the location that produces the lowest resulting denominator.
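A small Python sketch of that labelling rule follows, using the mediant of two bounding fractions to place each new child and choosing the gap whose mediant has the smallest denominator; the Fraction-based representation is an assumption for illustration rather than the exact scheme from Tropashko's paper.

    from fractions import Fraction

    # Illustrative nested-interval labelling. A child inserted between the
    # bounds a/b and c/d receives the mediant (a+c)/(b+d).

    def mediant(left, right):
        return Fraction(left.numerator + right.numerator,
                        left.denominator + right.denominator)

    def insert_child(labels, lower=Fraction(0, 1), upper=Fraction(1, 1)):
        # Choose the gap whose mediant has the smallest denominator, which
        # counteracts the exponential growth caused by left-to-right inserts.
        bounds = [lower] + sorted(labels) + [upper]
        candidates = [mediant(a, b) for a, b in zip(bounds, bounds[1:])]
        best = min(candidates, key=lambda f: f.denominator)
        labels.append(best)
        return best

    labels = []
    for _ in range(5):
        print(insert_child(labels))   # 1/2, 1/3, 2/3, 1/4, 3/4 -- denominators stay small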
As far as performance degradation goes, much depends on the operations you intend to do. There are still some operations that will require a complete relabeling of the entire tree - the nested set and nested interval methods both work best for structures that seldom change. If you are doing a lot of structure changes to the hierarchy, the 'standard' parent-child table structure may be easier to work with. Remember too that some operations (such as counting the number of descendants) are far easier with the integer labeling of nested sets than with the interval methods.
I have written a gem that abstracts away all the computations of nested intervals for use with Rails' ActiveRecord: https://github.com/clyfe/acts_as_nested_interval/ It is used in production on several systems.