I am writing my own code for a decision tree. I need to decide when to terminate the tree-building process. I could limit the height of the tree, but that seems trivial. Could anyone give me a better idea of how to implement my termination function?
Here is my tree-building algorithm.
There is little context in your question, but I assume you are constructing a tree from a large set of data? In that case, one solution is, in addition to the "LearnSet", to take a "StopSet" of examples and regularly verify your decision-making process on this StopSet. If quality decreases, this is an indication that you are overtraining on the LearnSet.
I deliberately say "StopSet" and not "TestSet", because afterwards you should still apply your decision tree to the TestSet to assess the real quality.
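A minimal sketch of that loop, assuming you already have your own tree-growing and evaluation routines; `grow_one_level` and `accuracy` below are hypothetical placeholders for those, and only the stopping logic is the point:

```python
# Sketch of StopSet-based early stopping. grow_one_level() and accuracy()
# are hypothetical stand-ins for your own code.

def build_with_stop_set(tree, learn_set, stop_set, max_rounds=50):
    best = accuracy(tree, stop_set)
    for _ in range(max_rounds):
        candidate = grow_one_level(tree, learn_set)  # split the current leaves once more
        score = accuracy(candidate, stop_set)
        if score < best:
            # Quality on the StopSet dropped: we are starting to overtrain,
            # so keep the previous tree and stop growing.
            return tree
        tree, best = candidate, score
    return tree
```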
Because a decision tree produces imbalanced splits, one part of the tree can be heavier than the other. Hence it is not sensible to use the height of the tree, because that stops everywhere at the same level. Far better is to use the minimal number of observations required for a split search. This is elaborated further at http://zyxo.wordpress.com/2011/07/04/how-to-use-the-settings-to-control-the-size-of-decision-trees/
This is a somewhat old question, but I think I can improve the answers. The theory is that you stop when a split is pure (i.e. impurity = 0), or when all members of the left or right node have the same output value. For example, if you are trying to predict heart attack or no heart attack, and on a given split one group is all heart attacks or all no heart attacks, then you can safely stop splitting on that group, because everything is the same and you can safely predict that common value. This theory is supported by the pruning process: you can build a very tall tree, and if a node doesn't contribute to accuracy it gets pruned.
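For illustration, a minimal purity check in Python; it assumes each training example is a tuple whose last element is the class label, which may not match your own data layout:

```python
# A minimal purity check, assuming each example is a tuple whose last
# element is the class label (e.g. "heart attack" / "no heart attack").

def is_pure(examples):
    labels = {example[-1] for example in examples}
    return len(labels) <= 1

# Usage inside a hypothetical build_tree(): if is_pure(node_examples),
# return a leaf predicting node_examples[0][-1] instead of splitting further.
```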
Now, it's rare that you get entirely pure splits, and in order to split the data into entirely pure sets you will often keep splitting into smaller and smaller sets until you reach a single observation in each node. Tall trees usually won't survive the pruning process, and you are more than likely overfitting the training data anyway. So it's common practice to save the pruning algorithm some extra work by simply limiting the number of observations you are willing to split on, and setting a minimum on the size of each resulting split. You aren't going to keep a split that results in 1 and 999 observations. That's a bad split; try again.
So you add a configuration parameter for the minimum number of observations allowed in a node (i.e. after a split) and for the minimum number of observations required to attempt a split (before you split), both of which can be tweaked by the user.
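A rough sketch of those two thresholds as user-tweakable parameters; the names `min_samples_split` and `min_samples_leaf` are my own choice (they mirror common library conventions), not anything prescribed above:

```python
# Sketch of the two size-based stopping thresholds.

def acceptable_split(examples, left, right,
                     min_samples_split=20, min_samples_leaf=5):
    if len(examples) < min_samples_split:
        return False   # too few observations to bother searching for a split
    if len(left) < min_samples_leaf or len(right) < min_samples_leaf:
        return False   # rejects lopsided results like 1 vs 999 observations
    return True
```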
The final criterion is to stop if your splits don't improve on the last measurement of purity, i.e. if a node can't be split so as to produce a purer set than it had before. Then you can stop, because you're going in the wrong direction. In other words, if I(s) is the purity measurement of the node you are splitting, I(s[l]) is the purity of the left split set, I(s[r]) is the purity of the right split set, and p(s) is the proportion of that set relative to the parent set, then:
Gain = I(s) - p(s[r]) * I(s[r]) - p(s[l]) * I(s[l])
And you stop if that Gain < 0, because you aren't getting any more purity by splitting.
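To make that concrete, here is one possible way to compute the gain in Python, using Gini impurity as the purity measurement I(s); this is only a sketch, and any other impurity measure (entropy, misclassification error) slots in the same way:

```python
from collections import Counter

def gini(examples):
    """Gini impurity of a set of examples whose last element is the label."""
    counts = Counter(example[-1] for example in examples)
    total = len(examples)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def purity_gain(parent, left, right):
    p_left = len(left) / len(parent)
    p_right = len(right) / len(parent)
    return gini(parent) - p_left * gini(left) - p_right * gini(right)

# Stop splitting a node when even its best candidate split yields no
# positive purity_gain: the children are no purer than the parent.
```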
We have 3 general termination conditions:

1. All of the tuples in the partition belong to the same class.
2. There are no remaining attributes on which the tuples may be further partitioned.
3. There are no tuples for a given branch.
And of course you can add other conditions, such as a maximum depth.
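For illustration, a small Python sketch of those three conditions, assuming each example is an (attribute_values, label) pair and that you track the attributes still available for splitting at each node:

```python
# Sketch of the three general termination conditions.

def should_terminate(examples, attributes):
    if not examples:                      # 3. no tuples for this branch
        return True
    labels = {label for _, label in examples}
    if len(labels) == 1:                  # 1. all tuples belong to the same class
        return True
    if not attributes:                    # 2. no remaining attributes to split on
        return True
    return False
```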