Prioritize...
After you have read this section, you should be able to explain how a tree grows in CART and how this growth is terminated.
Read...
Now that we know how to build rules, we need to combine the rules to create the tree. In addition, we need to know when to stop growing the tree. Read on to finish learning about CART.
Method
Now that you know how to build a rule, you can go ahead and grow a tree. To begin, build a rule on the full training dataset and use it to split the training cases into two subsets. Then go to each of the resulting subsets and repeat the process. Continue recursively down every branch until you hit the termination criterion.
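To make the recursion concrete, here is a toy sketch in R of how a tree might be grown. This is illustrative only, not how ‘rpart’ actually implements CART; the function names (gini, best_split, grow_tree) and the min_cases threshold are made up for this example.

gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)   # Gini impurity: 0 means the subset is pure
}

best_split <- function(data, outcome, predictors) {
  best <- NULL
  for (v in predictors) {
    for (s in unique(data[[v]])) {
      left  <- data[data[[v]] <= s, ]
      right <- data[data[[v]] >  s, ]
      if (nrow(right) == 0) next   # no split at the maximum value
      # Impurity of the two subsets, weighted by their sizes
      imp <- (nrow(left) * gini(left[[outcome]]) +
              nrow(right) * gini(right[[outcome]])) / nrow(data)
      if (is.null(best) || imp < best$imp) {
        best <- list(var = v, split = s, imp = imp)
      }
    }
  }
  best
}

grow_tree <- function(data, outcome, predictors, min_cases = 20) {
  rule <- best_split(data, outcome, predictors)
  # Termination: stop when the subset is pure, too small, or unsplittable
  if (nrow(data) < min_cases || gini(data[[outcome]]) == 0 || is.null(rule)) {
    return(list(leaf = names(which.max(table(data[[outcome]])))))
  }
  list(rule = rule,
       left  = grow_tree(data[data[[rule$var]] <= rule$split, ],
                         outcome, predictors, min_cases),
       right = grow_tree(data[data[[rule$var]] >  rule$split, ],
                         outcome, predictors, min_cases))
}

# e.g., grow_tree(iris, "Species", c("Petal.Length", "Petal.Width"))

Each call builds one rule, splits the cases into two subsets, and hands each subset back to the same function, which is exactly the process described above.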
Termination
How do we know when to terminate? There are multiple ways to decide when to stop growing the tree, and some work much better than others.
One approach is to keep splitting until every leaf is pure. This, however, tends to leave so few cases in each leaf that the tree has little statistical robustness. In turn, the tree will not work well on independent data (e.g., in actual operations). Instead, we want to stop growing our tree when the available data no longer support making additional splits.

Another approach is an ad hoc threshold. Rather than driving every leaf to purity, you set a threshold on the minimum number of cases allowed in a leaf. Although simple, this actually works quite well. What is the optimal threshold? You have to test several values to see which one overfits least on the data you’ve held out for testing (see previous lessons for how to test for overfitting).
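In ‘rpart’ (introduced below), this threshold is the minbucket argument of rpart.control. Here is a sketch of testing several values; the data frames train and test and the column Outcome are assumptions for illustration.

library(rpart)

# Try several minimum-leaf-size thresholds and compare each tree's
# performance on held-out data ('train', 'test', and 'Outcome' are assumed)
for (m in c(5, 10, 20, 50)) {
  fit  <- rpart(Outcome ~ ., data = train, method = "class",
                control = rpart.control(minbucket = m))
  pred <- predict(fit, newdata = test, type = "class")
  cat("minbucket =", m, "-> test accuracy =", mean(pred == test$Outcome), "\n")
}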
Alternatively, or in addition, we can ‘prune’ a tree back to eliminate splits supported by too few cases. Sometimes this works better, but it takes more computer time, since you have to grow the tree out until it is ‘too big’ and then cut it back. Again, numerous methods have been proposed to control which branches get pruned.
CART in R
In R, you can use the function ‘rpart’ from the package of the same name to grow a tree. Sound familiar? This is the same function used in Meteo 815. I will only provide a summary here. For more details, I suggest you review the Data Mining lesson from Meteo 815. To grow the tree, we use the following code:
rpart(formula, data=, method=, control=)
Where the formula format is:
Outcome ~ predictor1 + predictor2 + predictor3 + …
Set data to your data frame and method to ‘class’ (this tells the function to build a classification tree, as opposed to ‘anova’ for a regression tree of a continuous outcome). The ‘control’ parameter is optional but allows you to set constraints on how the tree is grown (such as the minimum number of cases allowed in a leaf). Remember, the function will evaluate many candidate rules at each split and select the best based on purity, subject to any controls you list.
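As a concrete sketch, a call might look like this. The data frame storms and its columns are hypothetical, made up for illustration; rpart, rpart.control, print, and printcp are real parts of the package.

library(rpart)

# Hypothetical example: classify severe vs. non-severe cases
# ('storms', 'Severe', 'CAPE', 'Shear', and 'Humidity' are made up)
fit <- rpart(Severe ~ CAPE + Shear + Humidity, data = storms,
             method = "class",
             control = rpart.control(minbucket = 20))

print(fit)    # text summary of the rule at each split
printcp(fit)  # complexity-parameter table (useful for pruning)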
For a review of how to examine the results, please look at the material from Meteo 815 or use this resource. In addition, you can prune a tree with the ‘prune’ function, which helps you avoid overfitting.
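For example, a common recipe (a sketch, assuming the fit object grown above) is to prune back to the complexity parameter with the lowest cross-validated error:

# Pick the complexity parameter (CP) with the lowest cross-validated
# error from the table rpart computed, then prune the tree back to it
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
print(pruned)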