Say I build a decision tree A with ID3 from a training and validation set, but A is unpruned.
At the same time, I have another decision tree B, also built with ID3 from the same training and validation set, but B is pruned.
Now I test both A and B on a future unlabeled test set: is it always the case that the pruned tree will perform better?
Any idea is welcome, thanks.
I think we need to make the distinction clearer: a pruned tree always performs at least as well on the validation set, but not necessarily on the test set (in fact, it performs equally well or worse on the training set). I am assuming that the pruning is done after the tree is built (i.e. post-pruning).
Remember that the whole reason for using a validation set is to avoid overfitting the training data. The key point here is generalization: we want a model (decision tree) that generalizes beyond the instances provided at "training time" to new unseen examples.
Pruning is supposed to improve classification by preventing overfitting. Since pruning will only occur if it improves classification rates on the validation set, a pruned tree will perform as well or better than an un-pruned tree during validation.
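To see this concretely, here is a minimal sketch using scikit-learn (CART trees with cost-complexity post-pruning rather than ID3; dataset and split sizes are arbitrary choices). The pruning level is chosen as the one that scores best on a held-out validation set, so by construction the pruned tree cannot do worse than the unpruned one on validation, while the test-set comparison can go either way:

```python
# Hedged sketch: scikit-learn's CART trees rather than ID3, with
# cost-complexity (post-)pruning. The dataset and splits are arbitrary
# choices for illustration.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, random_state=0)

# A: unpruned tree grown to full depth on the training data.
unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# B: candidate pruning levels; keep the tree that does best on validation.
alphas = unpruned.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
pruned = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in alphas),
    key=lambda t: t.score(X_val, y_val),
)

# By construction the pruned tree is at least as good on validation
# (alpha = 0 reproduces the unpruned tree), but the test comparison
# can go either way.
print("validation:", unpruned.score(X_val, y_val), pruned.score(X_val, y_val))
print("test:      ", unpruned.score(X_test, y_test), pruned.score(X_test, y_test))
```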
Bad pruning can lead to wrong results. Although a reduced decision tree size is often desired, you usually aim for better results when pruning. Therefore, how you prune is the crux of the matter.
I agree with the first answer by @AMRO. Post-pruning is the most common approach for decision tree pruning, and it is done after the tree is built. But pre-pruning can also be done: the construction of the tree is halted early, using a specified threshold value, for example by deciding not to split the subset of training tuples at a given node.
That node then becomes a leaf. The leaf may hold the most frequent class among the subset of tuples, or the class probability distribution of those tuples.
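As a rough illustration of pre-pruning (again with scikit-learn's CART rather than ID3; the threshold values below are arbitrary), growth is halted early by the constructor's threshold parameters, and each unsplit node becomes a leaf that stores the class proportions of its tuples:

```python
# Hedged sketch of pre-pruning: growth stops early once the thresholds
# below are hit, and each unsplit node becomes a leaf.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # stop growing below this depth
    min_samples_split=20,        # don't split a node with fewer than 20 tuples
    min_impurity_decrease=0.01,  # don't split unless impurity drops enough
    random_state=0,
).fit(X, y)

# The leaves hold the most frequent class (predict) and the class
# proportions of their tuples (predict_proba).
print(pre_pruned.predict(X[:5]))
print(pre_pruned.predict_proba(X[:5]))
```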
I am studying informed search algorithms, and for New Bidirectional A* Search (NBA*), I know that the space complexity is O(b^d), where d is the depth of the shallowest goal node and b is the branching factor. I have tried to find out its time complexity, but I haven't been able to find any exact information about it in online resources. Is the exact time complexity of NBA* Search unknown, and how does it differ from the original Bidirectional A*? Any insights are appreciated.
If you assume a specific model of your problem (e.g. a graph growing uniformly in both directions with unit edge costs and an exponentially growing number of states), then most bidirectional search algorithms perform O(b^(d/2)) node expansions and thus take O(b^(d/2)) time. But this simple model doesn't actually describe most real-world problems.
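A quick back-of-the-envelope comparison under that simple uniform model (the values of b and d are arbitrary example numbers):

```python
# Back-of-the-envelope comparison under the simple uniform model above;
# b and d are arbitrary example values.
b, d = 10, 8
unidirectional = b ** d                # ~b^d expansions for plain (unidirectional) search
bidirectional = 2 * b ** (d // 2)      # two frontiers that meet in the middle
print(unidirectional, bidirectional)   # 100000000 vs 20000
```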
Given this, I would not recommend putting significant effort into studying New Bidirectional A*.
The state of the art in bidirectional search has changed massively in the last few years. The current algorithm with the best theoretical guarantees is NBS - Near-Optimal Bidirectional Search. The algorithm finds optimal paths and is near-optimal in node expansions. That is, NBS is guaranteed to do no more than 2x the necessary expansions of the best possible algorithm (given reasonable theoretical assumptions, such as using the same heuristic). All other algorithms (including A*) can do arbitrarily worse than NBS.
Other variants of NBS, such as DVCBS, have been proposed; they follow the same basic structure and perform well in practice, but do not have the same guarantees.
My question is about the Motivation for implementing a heuristic O(n) algorithm.
There are some generic solutions to this algorithmic problem of generating the minimum number of operations to transform one tree into another. However, the state of the art algorithms have a complexity in the order of O(n^3) where n is the number of elements in the tree.
Why does transforming one tree into another have a complexity of O(n^3)?
If we used this in React, displaying 1000 elements would require in the order of one billion comparisons. This is far too expensive.
Instead, React implements a heuristic O(n) algorithm based on two assumptions:
Two elements of different types will produce different trees.
The developer can hint at which child elements may be stable across different renders with a key prop.
Can you elaborate on what is heuristic about React's implementation?
Do the assumptions make it O(n) in the average case?
There are pretty good reasons why React's diff algorithm is the way it is, but the documented "motivation" doesn't really make enough sense to be the real truth.
Firstly, while it's certainly true that an optimal "tree diff" takes O(N^3) time, a "tree diff" algorithm is not the single best alternative to what React actually does, and in fact doesn't really fit well into React's rendering process at all. This is mostly because, in the worst case, rendering a React component produces a list (not a tree) of React elements that needs to be matched up against a list of preexisting components.
There is no new tree when the diff is performed, because the new list needs to be matched against the preexisting tree before the children of the new elements are rendered. In fact, the results of the diff are required to decide whether or not to re-render the children at all.
So... In matching up these lists, we might compare the React diff against the standard Longest-Common-Subsequence algorithm, which is an O(N^2) algorithm. That's still pretty slow, and there is a performance argument to be made. If LCS were as fast as the React diff, then it would have a place in the rendering process for sure.
But not only is LCS kind of slow, it also doesn't do the right thing. When React is matching the list of new elements up against the old tree, it is deciding whether each element is a new component or just a prop update to a pre-existing component. LCS could find the largest possible matching of element types, but the largest possible matching isn't necessarily what the developer wants.
So, the problem with LCS (or a tree diff, if you really want to push the point), is not just that it's slow, but that it's slow and the answer it provides is still just a guess at the developer's intent. Slow algorithms just aren't worth it when they still make mistakes.
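For reference, this is the textbook LCS dynamic program that the comparison above refers to; the strings stand in for element types, and the point is simply that it fills an N-by-M table, hence the roughly O(N^2) cost. It is not what React does:

```python
# Textbook LCS length via dynamic programming, only to make the cost
# concrete: every cell of an N-by-M table is filled, hence roughly
# O(N^2) work for similar-length lists. This is not what React does.
def lcs_length(old, new):
    n, m = len(old), len(new)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if old[i - 1] == new[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]

print(lcs_length(["div", "span", "ul"], ["div", "ul", "p"]))  # 2
```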
There are a lot of other fast algorithms, though, that React developers could have used that would be more accurate in most cases, but then the question is "Is it worth it?" Generally, the answer is "no", because no algorithm can do a really good job of guessing a developer's intent, and guessing the developer's intent is actually not necessary.
When it's important to a developer that his new elements are properly matched up to his preexisting components so they don't have to rerender, then the developer should make sure that this is the case. It's very easy -- he just has to provide key props when he renders a list. Developers should pretty much always do this when rendering a list, so that the component matching can be perfect, and they don't have to settle for any kind of guess.
React will produce a warning if you don't put in key props where they are required to make the matching unambiguous, which is far more helpful than a better diff. When you see it you should fix your components by adding the proper key props, and then the matching will be perfect and it doesn't matter that there are other algorithms that could do a better job on badly written components.
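To make the keyed matching concrete, here is a minimal Python sketch (illustrative only, not React's actual JavaScript implementation; the "type"/"key"/"props" field names merely mimic React's element shape). With keys, matching is one dictionary lookup per new element, a single O(n) pass, and there is no guessing about the developer's intent:

```python
# Illustrative sketch of keyed child matching, not React's source code.
def reconcile_children(old_children, new_elements):
    old_by_key = {child["key"]: child for child in old_children}
    plan = []
    for element in new_elements:
        match = old_by_key.get(element["key"])
        if match is not None and match["type"] == element["type"]:
            plan.append(("update", match, element["props"]))  # reuse the component
        else:
            plan.append(("mount", element))                   # new component
    return plan

old = [{"key": "a", "type": "li", "props": {}}, {"key": "b", "type": "li", "props": {}}]
new = [{"key": "b", "type": "li", "props": {"done": True}}, {"key": "c", "type": "li", "props": {}}]
print(reconcile_children(old, new))
```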
The transformation is based only on these assumptions:
Two elements of different types will produce different trees.
The developer can hint at which child elements may be stable across different renders with a key prop.
The whole tree is not re-rendered if keys do not change or no new elements are added.
This approach is based on assumptions drawn from experience, which is what makes it a heuristic.
Current state-of-the-art diffing algorithms have a complexity of O(n^3) (where n is the number of nodes) to find the minimum number of transform operations needed to turn one tree into another. But this complexity, as the React docs mention, can be too high, and in typical cases it is not needed.
That is why React uses a heuristic that on average takes O(n) (linear time).
Two elements of different types will produce different trees.
The developer can hint at which child elements may be stable across different renders with a key prop.
Being a heuristic means there are cases where the diff may produce more transformations than necessary (it is not optimal in general). However, in the usual, frequently occurring cases the two algorithms (optimal and heuristic) produce exactly the same results (with the heuristic taking less time), or the differences between them have minimal impact on performance.
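A minimal sketch of the first assumption, written in plain Python rather than React's actual code: when the element type changes, the whole subtree is replaced without any further comparison; otherwise props are updated and the children are visited once, which is what keeps the overall diff linear:

```python
# Plain-Python sketch of the first assumption, not React's implementation.
from itertools import zip_longest

def diff_node(old, new, ops):
    if old is None:
        ops.append(("mount", new))
    elif new is None:
        ops.append(("unmount", old))
    elif old["type"] != new["type"]:
        ops.append(("replace_subtree", old, new))   # different types => different trees
    else:
        ops.append(("update_props", old, new["props"]))
        # Every node is visited once, so the whole diff is a single O(n) pass.
        for old_child, new_child in zip_longest(old["children"], new["children"]):
            diff_node(old_child, new_child, ops)

ops = []
diff_node(
    {"type": "div", "props": {}, "children": [{"type": "span", "props": {}, "children": []}]},
    {"type": "div", "props": {"id": "x"}, "children": [{"type": "p", "props": {}, "children": []}]},
    ops,
)
print(ops)
```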
PS:
Why does transforming one tree into another have a complexity of O(n^3)?
In order to answer this question one has to look into state-of-the-art algorithms. But in general the answer is that many comparisons (between nodes and their children) have to be made in order to find the minimum number of necessary transformations.
After studying decision trees for a while, I noticed there is a technique called boosting. I see that, in normal cases, it will improve the accuracy of the decision tree.
So I am just wondering: why don't we simply incorporate boosting into every decision tree we build? Since we currently leave boosting out as a separate technique, I wonder: are there any disadvantages of using boosting rather than just a single decision tree?
Thanks for helping me out here!
Boosting is a technique that can go on top of any learning algorithm. It is most effective when the original classifier you built performs just barely above random. If your decision tree is already pretty good, boosting may not make much difference, but it does carry a performance penalty -- if you run boosting for 100 iterations you'll have to train and store 100 decision trees.
Usually people do boost with decision stumps (decision trees with just one node) and get results as good as boosting with full decision trees.
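For a quick comparison, here is a hedged scikit-learn sketch (dataset and iteration count are arbitrary). AdaBoostClassifier's default base learner is a depth-1 tree, i.e. a stump, and 100 rounds means training and storing 100 stumps, which is the cost mentioned above:

```python
# Hedged sketch: boosting stumps vs a single tree; values are arbitrary.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
boosted_stumps = AdaBoostClassifier(n_estimators=100, random_state=0)  # default base learner is a stump

print("single tree:   ", cross_val_score(single_tree, X, y).mean())
print("boosted stumps:", cross_val_score(boosted_stumps, X, y).mean())
```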
I've done some experiments with boosting and found it to be fairly robust, better than a single-tree classifier, but also slower (I used 10 iterations), and not as good as some of the simpler learners (to be fair, it was an extremely noisy dataset).
There are several disadvantages to boosting:
1. It is hard to implement.
2. It needs more extensive training, with larger training sets, than a single decision tree does.
3. Worst of all, boosting algorithms require a threshold value, which in most cases is not easy to figure out because it requires extensive trial-and-error testing, and the whole performance of the boosting algorithm depends on this threshold.
In a rule system, or any reasoning system that deduces facts via forward-chaining inference rules, how would you prune "unnecessary" branches? I'm not sure what the formal terminology is, but I'm just trying to understand how people are able to limit their train-of-thought when reasoning over problems, whereas all semantic reasoners I've seen appear unable to do this.
For example, in John McCarthy's paper An Example for Natural Language Understanding and the AI Problems It Raises, he describes potential problems in getting a program to intelligently answer questions about a news article in the New York Times. In section 4, "The Need For Nonmonotonic Reasoning", he discusses the use of Occam's Razor to restrict the inclusion of facts when reasoning about the story. The sample story he uses is one about robbers who victimize a furniture store owner.
If a program were asked to form a "minimal completion" of the story in predicate calculus, it might need to include facts not directly mentioned in the original story. However, it would also need some way of knowing when to limit its chain of deduction, so as not to include irrelevant details. For example, it might want to include the exact number of police involved in the case, which the article omits, but it won't want to include the fact that each police officer has a mother.
Good Question.
From your question I think what you refer to as 'pruning' is a model-building step performed ex ante--i.e., limiting the inputs available to the algorithm that builds the model. The term 'pruning' when used in Machine Learning refers to something different--an ex post step, performed after model construction, that operates on the model itself and not on the available inputs. (There could be a second meaning of the term 'pruning' in the ML domain, but I'm not aware of it.) In other words, pruning is indeed literally a technique to "limit its chain of deduction" as you put it, but it does so ex post, by excising components of a complete (working) model, and not by limiting the inputs used to create that model.
On the other hand, isolating or limiting the inputs available for model construction--which is what I think you might have had in mind--is indeed a key Machine Learning theme; it's clearly a factor responsible for the superior performance of many of the more recent ML algorithms--for instance, Support Vector Machines (the insight that underlies SVM is construction of the maximum-margin hyperplane from only a small subset of the data, i.e., the 'support vectors') and Multivariate Adaptive Regression Splines (a regression technique in which no attempt is made to fit the data by "drawing a single continuous curve through it"; instead, discrete sections of the data are fit, one by one, using a bounded linear equation for each portion, i.e., the 'splines', so the preliminary step of optimally partitioning the data is obviously the crux of this algorithm).
First, what problem is solved by pruning?
At least w/r/t specific ML algorithms I have actually coded and used--Decision Trees, MARS, and Neural Networks--pruning is performed on an initially over-fit model (a model that fits the training data so closely that it is unable to generalize, i.e., accurately predict new instances). In each case, pruning involves removing marginal nodes (DT, NN) or terms in the regression equation (MARS) one by one.
Second, why is pruning necessary/desirable?
Isn't it better to just set the convergence/splitting criteria accurately? That won't always help. Pruning works from "the bottom up"; the model is constructed from the top down, so tuning the model (to achieve the same benefit as pruning) eliminates not just one or more decision nodes but also the child nodes that would hang below them (like trimming a tree closer to the trunk). So eliminating a marginal node might also eliminate one or more strong nodes subordinate to that marginal node--but the modeler would never know that, because the tuning eliminated further node creation at that marginal node. Pruning works from the other direction--from the most subordinate (lowest-level) child nodes upward in the direction of the root node.
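Here is a small, library-free sketch of that bottom-up behaviour for a decision tree (the dict-based node layout and the toy validation data are invented for illustration, and reduced-error pruning against a validation set is just one of several possible pruning criteria): children are visited before their parent, and a node is collapsed into a leaf only if validation accuracy does not drop:

```python
# Library-free sketch of bottom-up (reduced-error style) pruning on a toy
# tree; node layout and data are invented for illustration only.
def predict(node, x):
    while node.get("children"):
        node = node["children"][0 if x[node["feature"]] <= node["threshold"] else 1]
    return node["majority_class"]

def accuracy(root, X, y):
    return sum(predict(root, xi) == yi for xi, yi in zip(X, y)) / len(y)

def prune_bottom_up(node, root, X_val, y_val):
    if not node.get("children"):
        return
    for child in node["children"]:                # visit the lowest nodes first
        prune_bottom_up(child, root, X_val, y_val)
    before = accuracy(root, X_val, y_val)
    saved = node.pop("children")                  # tentatively collapse to a leaf
    if accuracy(root, X_val, y_val) < before:
        node["children"] = saved                  # collapsing hurt: undo it

tree = {"feature": 0, "threshold": 0.5, "majority_class": 0, "children": [
    {"majority_class": 0},
    {"feature": 1, "threshold": 0.5, "majority_class": 1, "children": [
        {"majority_class": 1}, {"majority_class": 0}]},
]}
X_val, y_val = [[0.2, 0.1], [0.8, 0.3], [0.9, 0.9]], [0, 1, 1]
prune_bottom_up(tree, tree, X_val, y_val)
print(tree)   # the noisy lower split is collapsed; the useful root split survives
```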
Among the known limitations of Joe Celko's nested sets (modified pre-order traversal) is marked degradation in performance as the tree grows to a large size.
Vadim Tropashko proposed nested intervals, and provides examples and a theoretical explanation in this paper: http://arxiv.org/html/cs.DB/0401014
Is this a viable solution? Are there any working examples (in any language) abstracted away from the native DB layer?
While I've seen examples for nested sets, I haven't seen much for nested intervals, although in theory it shouldn't be difficult to convert from one to the other. Instead of doing a pre-order traversal to label the nodes, do a breadth-first recursion. The trick is to work out the most efficient way of labelling the n children of a node. Since the node between a/b and c/d is (a+c)/(b+d), an ill-conditioned insert (for instance, inserting the children left to right) runs the risk of creating the same exponential growth in the index values as, for instance, using a full materialized path. It is not difficult to counteract this effect - create the new indexes one at a time, inserting each at the location that produces the lowest resulting denominator.
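A small sketch of that labelling idea (the names and structure are mine, not from Tropashko's paper): the mediant of a/b and c/d is (a+c)/(b+d), and inserting each new child at the gap whose mediant has the smallest denominator keeps the labels from growing exponentially:

```python
# Sketch of mediant labelling for nested intervals; illustrative only.
from fractions import Fraction

def mediant(left, right):
    return Fraction(left.numerator + right.numerator,
                    left.denominator + right.denominator)

def next_child_label(existing, left_bound, right_bound):
    # Pick the gap whose mediant has the smallest denominator.
    points = [left_bound] + sorted(existing) + [right_bound]
    candidates = [mediant(a, b) for a, b in zip(points, points[1:])]
    return min(candidates, key=lambda f: f.denominator)

# Label three children of the root interval (0/1, 1/1), one at a time.
children = []
for _ in range(3):
    children.append(next_child_label(children, Fraction(0, 1), Fraction(1, 1)))
print(children)   # 1/2, 1/3, 2/3 -- denominators stay small
```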
As far as performance degradation goes, much depends on the operations you intend to do. There are still some operations that will require a complete relabeling of the entire tree - the nested set and nested interval methods both work best for structures that seldom change. If you are doing a lot of structural changes to the hierarchy, the 'standard' parent-child table structure may be easier to work with. Remember too that some operations (such as counting descendants) are far easier with the integer labeling of nested sets than with the interval methods.
I have written a gem that abstracts away all the computations of nested intervals for use with Rails's ActiveRecord: https://github.com/clyfe/acts_as_nested_interval/ It is used in production on several systems.