Suppose I build a decision tree A with ID3 from a training and validation set, but A is unpruned. At the same time, I have another decision tree B, also built with ID3 from the same training and validation set, but B is pruned. Now I test both A and B on a future unlabeled test set: is it always the case that the pruned tree will perform better? Any idea is welcome, thanks.
I think we need to make the distinction clearer: a pruned tree always performs at least as well on the validation set, but not necessarily on the test set (in fact, it may perform equally well or worse on the training set). I am assuming that the pruning is done after the tree is built (i.e., post-pruning).
Remember that the whole reason for using a validation set is to avoid overfitting the training dataset. The key point here is generalization: we want a model (decision tree) that generalizes beyond the instances provided at "training time" to new, unseen examples.
Pruning is supposed to improve classification by preventing overfitting. Since pruning only occurs if it improves classification rates on the validation set, a pruned tree will perform as well as or better than an unpruned tree on the validation set.
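To make that concrete, here is a minimal sketch of reduced-error post-pruning. All the names (`Node`, `classify`, `prune`, the toy "outlook" tree) are illustrative assumptions, not from any particular library: each internal node is tentatively collapsed into a leaf, and the change is kept only if validation accuracy does not drop.

```python
# Sketch of reduced-error post-pruning (hypothetical helpers, for illustration).

class Node:
    def __init__(self, feature=None, children=None, label=None):
        self.feature = feature            # attribute tested at this node
        self.children = children or {}    # attribute value -> child Node
        self.label = label                # majority class of training tuples here

    def is_leaf(self):
        return not self.children

def classify(node, example):
    while not node.is_leaf():
        child = node.children.get(example.get(node.feature))
        if child is None:                 # unseen attribute value: fall back
            return node.label             # to this node's majority class
        node = child
    return node.label

def accuracy(tree, dataset):
    return sum(classify(tree, x) == y for x, y in dataset) / len(dataset)

def prune(node, root, validation):
    """Bottom-up: try turning each internal node into a leaf; keep the
    change only if validation accuracy does not decrease."""
    if node.is_leaf():
        return
    for child in node.children.values():
        prune(child, root, validation)
    before = accuracy(root, validation)
    saved = node.children
    node.children = {}                    # tentatively collapse to a leaf
    if accuracy(root, validation) < before:
        node.children = saved             # revert: pruning hurt validation

# Toy tree that overfits: it predicts "no" for sunny, but the
# validation set says "yes" for both sunny and rain.
root = Node(feature="outlook",
            children={"sunny": Node(label="no"), "rain": Node(label="yes")},
            label="yes")
validation = [({"outlook": "sunny"}, "yes"),
              ({"outlook": "rain"}, "yes")]

prune(root, root, validation)
print(root.is_leaf())   # True: collapsing the split raised validation accuracy
```

This is exactly why the pruned tree cannot do worse *on the validation set*: every collapse is accepted only after checking it there. The test set gives no such guarantee.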
Bad pruning can lead to wrong results. Although a reduced decision tree size is often desirable, the usual goal of pruning is better results, so how you prune is the crux of the matter.
I agree with the first answer by @AMRO. Post-pruning is the most common approach for decision tree pruning, and it is done after the tree is built. But pre-pruning can also be done: the tree is pruned by halting its construction early, using a specified threshold value, for example by deciding not to split the subset of training tuples at a given node. That node then becomes a leaf, which may hold the most frequent class among the subset of tuples, or the probability distribution of those tuples.