After studying decision trees for a while, I noticed there is a technique called boosting. In typical cases, it seems to improve the accuracy of a decision tree.
So I am wondering: why don't we simply incorporate boosting into every decision tree we build? Since boosting is currently treated as a separate technique, are there any disadvantages to using boosting compared with using a single decision tree?
Thanks for helping me out here!
Boosting is a technique that can go on top of any learning algorithm. It is most effective when the original classifier you built performs just barely above random. If your decision tree is pretty good already, boosting may not make much difference, but it still carries a performance penalty: if you run boosting for 100 iterations, you'll have to train and store 100 decision trees.
Usually people boost decision stumps (decision trees with just one split node) and get results as good as boosting full decision trees.
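To make the cost and the stump idea concrete, here is a minimal sketch of AdaBoost over decision stumps in plain Python. It assumes labels in {-1, +1} and numeric features; the function names (`train_stump`, `adaboost`, `predict`, `accuracy`) and the toy data are illustrative, not taken from any library.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                # The stump predicts pol when the feature is below thr, else -pol.
                preds = [pol if row[f] < thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best

def stump_predict(stump, row):
    _, f, thr, pol = stump
    return pol if row[f] < thr else -pol

def adaboost(X, y, rounds):
    """Train `rounds` stumps; one stump is trained and stored per iteration."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error so the logarithm below stays finite.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

def accuracy(ensemble, X, y):
    return sum(predict(ensemble, r) == t for r, t in zip(X, y)) / len(X)

# Toy 1-D data with a +, -, + pattern that no single stump can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
single = adaboost(X, y, rounds=1)
boosted = adaboost(X, y, rounds=3)
print(accuracy(single, X, y), accuracy(boosted, X, y))
```

On this toy data a single stump cannot classify all six points, while three boosted stumps can. The ensemble list grows by one stump per round, which is the training and storage cost mentioned above.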
I've done some experiments with boosting and found it to be fairly robust and better than a single tree classifier, but also slower (I used 10 iterations) and not as good as some of the simpler learners (to be fair, it was an extremely noisy dataset).
There are several disadvantages to boosting:
1- it is hard to implement
2- it needs more extensive training, with more training data, than a single decision tree does
3- worst of all, all boosting algorithms require a threshold value, which in most cases is not easy to figure out: finding it takes extensive trial and error, and the whole performance of the boosting algorithm depends on this threshold
I am studying informed search algorithms, and for New Bidirectional A* (NBA*) Search, I know that the space complexity is O(b^d), where d is the depth of the shallowest goal node and b is the branching factor. I have tried to find out what its time complexity is, but I haven't been able to find any exact information about it in online resources. Is the exact time complexity of NBA* Search unknown, and how does it differ from the original Bidirectional A*? Any insights are appreciated.
If you have a specific model of your problem (e.g. a graph that grows uniformly in both directions, with unit edge costs and the number of states growing exponentially), then most bidirectional search algorithms require O(b^(d/2)) node expansions and O(b^(d/2)) time. But this simple model doesn't actually model most real-world problems.
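As a rough illustration of what that idealized model implies, the sketch below compares the leading-order expansion counts of a unidirectional search (about b^d) with two frontiers meeting in the middle (about 2 * b^(d/2)). The function names are made up for this example, d is assumed to be even, and constant factors and lower-order terms are ignored.

```python
def unidirectional_expansions(b, d):
    """Leading-order count for one search tree of depth d: b^d."""
    return b ** d

def bidirectional_expansions(b, d):
    """Two trees of depth d/2, one from each end: 2 * b^(d/2). Assumes even d."""
    return 2 * b ** (d // 2)

# Branching factor 10, solution depth 6: 1,000,000 versus 2,000 expansions.
print(unidirectional_expansions(10, 6), bidirectional_expansions(10, 6))
```

These numbers only hold under the uniform-growth assumption stated above.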
Given this, I would not recommend putting significant effort into studying New Bidirectional A*.
The state of the art in bidirectional search has changed massively in the last few years. The current algorithm with the best theoretical guarantees is NBS (Near-Optimal Bidirectional Search). The algorithm finds optimal paths and is near-optimal in node expansions. That is, NBS is guaranteed to do no more than twice the necessary expansions of the best possible algorithm (given reasonable theoretical assumptions, such as using the same heuristic). All other algorithms (including A*) can do arbitrarily worse than NBS.
Variants of NBS, such as DVCBS, have also been proposed. They follow the same basic structure and do not have the same guarantees, but they perform well in practice.
In my Intro to AI class, we have been studying:
Uninformed Search (e.g. Depth-First Search)
Informed Search (e.g. A* Search)
Constraint Satisfaction Problems (e.g. Hill Climbing)
Adversarial Search (e.g. Minimax)
In general terms, why would we use, for example, Depth-First Search instead of a more complex algorithm such as A* Search? In other words, why choose simple and limited algorithms when we can choose complex ones?
The main reason is efficiency. Some algorithms take much more time/memory than others.
Some algorithms won't work well in certain situations. For example, if there are local maxima, Hill Climbing won't work very well.
If you expect most paths to lead to destination, you can use Depth First, which could be much faster than A*.
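The local-maximum problem with Hill Climbing can be seen in a small sketch. It assumes a one-dimensional landscape stored as a list, where the neighbours of an index are the positions directly to its left and right; `hill_climb` and `landscape` are illustrative names.

```python
def hill_climb(values, start):
    """Greedy ascent: move to the best neighbour until no neighbour is higher."""
    i = start
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
        best = max(neighbours, key=lambda j: values[j])
        if values[best] <= values[i]:
            return i  # no neighbour improves on the current position
        i = best

# Local maximum at index 2 (value 5), global maximum at index 6 (value 9).
landscape = [1, 3, 5, 4, 2, 6, 9, 7]
print(hill_climb(landscape, 0))  # stops at index 2, the local maximum
print(hill_climb(landscape, 4))  # reaches index 6, the global maximum
```

Which peak the search ends on depends entirely on the starting point, which is why the algorithm is a poor fit for landscapes with many local maxima.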
In Artificial Intelligence - A Modern Approach 3rd Edition, I came across an interesting quote stating:
"As yet there is no good understanding of how to combine the two kinds of algorithms [Goal directed reasoning / planning and heuristic search] into a robust and efficient system" (Russell pg 189)
Why is this so? Why is it hard to combine goal oriented planning with a heuristic search? Wouldn't reinforcement learning solve this?
The term “Goal directed reasoning” was used in the 1980s for a backtracking search technique. It was sometimes called backward reasoning or top-down search, which all mean the same thing. It describes how the algorithm traverses the state space or, more specifically, the order in which the states in the graph are visited. Newer literature no longer explains this aspect of a planner in detail, because a graph search algorithm is nothing special: it simply means putting the nodes on a stack and traversing them.
In contrast, the term “heuristic search” means replacing a brute-force solver with a knowledge-based approach. Heuristic search amounts to not traversing the whole graph, but finding a domain-specific strategy that leaves out most of it. And indeed, it is hard to combine backtracking with heuristics; this approach would be called grounding. If a grounded problem is available, it is possible to use a backtracking solver on a knowledge-based problem. This is the strategy used in modern PDDL planners, which first describe the domain in a symbolic PDDL notation (which is knowledge based) and then use a fast solver to search the state space.
Suppose I build an unpruned decision tree A with an algorithm like ID3 from a training and validation set.
At the same time, I have another ID3 decision tree B generated from the same training and validation set, but B is pruned.
Now I test both A and B on a future unlabeled test set. Is it always the case that the pruned tree will perform better?
Any ideas are welcome, thanks.
I think we need to make the distinction clearer: pruned trees always perform at least as well on the validation set, but not necessarily on the testing set (in fact, they also perform equally well or worse on the training set). I am assuming that the pruning is done after the tree is built (i.e. post-pruning).
Remember that the whole reason of using a validation set is to avoid overfitting over the training dataset, and the key point here is generalization: we want a model (decision tree) that generalizes beyond the instances that have been provided at "training time" to new unseen examples.
Pruning is supposed to improve classification by preventing overfitting. Since pruning will only occur if it improves classification rates on the validation set, a pruned tree will perform as well as or better than an unpruned tree during validation.
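One common form of post-pruning that behaves this way is reduced-error pruning. The sketch below is only an illustration under several assumptions: the tree is a hand-built nested dict rather than the output of ID3, each internal node stores the majority class of its training samples under a hypothetical `majority` key, and a subtree is collapsed whenever doing so does not increase the validation error (a slightly looser rule than "only if it improves").

```python
def tree_predict(node, row):
    """Follow splits until a leaf (a dict with a 'label' key) is reached."""
    while "label" not in node:
        branch = "left" if row[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["label"]

def errors(node, rows):
    return sum(1 for x, y in rows if tree_predict(node, x) != y)

def prune(node, val_rows):
    """Bottom-up reduced-error pruning using the validation rows reaching each node."""
    if "label" in node:
        return node
    f, t = node["feature"], node["threshold"]
    left_rows = [(x, y) for x, y in val_rows if x[f] < t]
    right_rows = [(x, y) for x, y in val_rows if x[f] >= t]
    node = dict(node,
                left=prune(node["left"], left_rows),
                right=prune(node["right"], right_rows))
    leaf = {"label": node["majority"]}
    # Collapse only when the leaf is no worse than the subtree on validation data.
    return leaf if errors(leaf, val_rows) <= errors(node, val_rows) else node

# Hand-built tree whose right subtree has overfitted to noise.
tree = {"feature": 0, "threshold": 5, "majority": 1,
        "left": {"label": 0},
        "right": {"feature": 1, "threshold": 2, "majority": 1,
                  "left": {"label": 1}, "right": {"label": 0}}}
validation = [([1, 0], 0), ([2, 3], 0), ([6, 1], 1), ([7, 3], 1), ([8, 4], 1)]
pruned = prune(tree, validation)
print(errors(tree, validation), errors(pruned, validation))
```

By construction the pruned tree cannot do worse than the original on the validation rows, but nothing in the procedure guarantees the same on a separate test set.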
Bad pruning can lead to wrong results. Although a smaller decision tree is often desirable, you usually aim for better results when pruning. Therefore, how you prune is the crux of the matter.
I agree with the first answer by #AMRO. Post-pruning is the most common approach to decision tree pruning, and it is done after the tree is built. But pre-pruning can also be done. In pre-pruning, a tree is pruned by halting its construction early, using a specified threshold value: for example, by deciding not to split the subset of training tuples at a given node.
That node then becomes a leaf. The leaf may hold the most frequent class among the subset of tuples, or the probability distribution of those tuples.
I've got a classification problem on my hands, which I'd like to address with a machine learning algorithm (Bayes, or Markovian probably; the question is independent of the classifier to be used). Given a number of training instances, I'm looking for a way to measure the performance of an implemented classifier while taking the overfitting problem into account.
That is: given N[1..100] training samples, if I run the training algorithm on every one of the samples and use these very same samples to measure fitness, it might run into an overfitting problem: the classifier will know the exact answers for the training instances without having much predictive power, rendering the fitness results useless.
An obvious solution would be separating the hand-tagged samples into training and test samples, and I'd like to learn about methods for selecting statistically significant samples for training.
White papers, book pointers, and PDFs much appreciated!
You could use 10-fold cross-validation for this. I believe it's a pretty standard approach for evaluating the performance of classification algorithms.
The basic idea is to divide your learning samples into 10 subsets. Then use one subset as test data and the others as training data. Repeat this for each subset and calculate the average performance at the end.
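The procedure can be sketched in a few lines of plain Python. This is a minimal illustration, assuming at least k samples and a classifier supplied as a pair of functions; `cross_validate`, `train_fn` and `predict_fn` are hypothetical names, and the majority-class learner is only a stand-in for a real classifier.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(samples, labels, train_fn, predict_fn, k=10, seed=0):
    """Average accuracy over k folds; each fold is used as test data exactly once."""
    scores = []
    for test_idx in k_fold_indices(len(samples), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

# Stand-in classifier: always predicts the most frequent training label.
def train_majority(X, y):
    return Counter(y).most_common(1)[0][0]

def predict_majority(model, x):
    return model

samples = list(range(100))
labels = [0] * 70 + [1] * 30
print(cross_validate(samples, labels, train_majority, predict_majority, k=10))
```

Because no sample is ever in the training and test data of the same fold, the averaged score reflects performance on unseen samples rather than memorized ones.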
As Mr. Brownstone said, 10-fold cross-validation is probably the best way to go. I recently had to evaluate the performance of a number of different classifiers, and I used Weka for this. It has an API and a load of tools that allow you to easily test the performance of lots of different classifiers.