Prioritize...
Once you have read this section, you should be able to perform the CART technique on a dataset in R.
Read...
Now it’s time to apply your knowledge and build a tree in R. Earlier, we fit a logistic model to daily precipitation (did it rain or not?). Sometimes, though, we need to know more than whether it rained; it may be more important to know whether the precipitation passed a certain threshold. Such thresholds usually correspond to an impact: for example, more than an inch of rain in a day might cause stormwater drains to back up. In this section, we will translate the precipitation into a categorical variable and create a CART model for prediction. Read on to learn more!
Example
We are going to continue with our earlier example of rain in Burnett, TX. Again, you can download the data here. To review the variables or their units, please use this reference.
Start by loading in the data set and retaining only complete cases.
# load in daily summary data for Burnett, TX
mydata <- read.csv("Texas_Daily_Summaries.csv")
# obtain only complete cases
mydata <- na.omit(mydata)
# verify that no missing values remain (badIndex should be empty)
badIndex <- which(is.na(mydata))
Instead of a yes/no rain outcome, let’s look at rain levels. I will create three rain levels: low (≤ 0.25 inches), moderate (between 0.25 and 1 inch), and high (> 1 inch). Use the code below:
# rain level thresholds (inches)
rainLevel <- c(0.25, 1)
# create rain variable, defaulting to "low"
rain <- array("low", length(mydata$PRCP))
# fill in the moderate and high levels
rain[which(mydata$PRCP > rainLevel[1] & mydata$PRCP <= rainLevel[2])] <- "mod"
rain[which(mydata$PRCP > rainLevel[2])] <- "high"
Now save the ‘rain’ variable to the data frame (mydata) and remove $DATE and $PRCP, as we will not use them in the model.
# save rain to data frame
mydata$rain <- as.vector(rain)
# remove $DATE and $PRCP by name (safer than column indices)
mydata <- subset(mydata, select = -c(DATE, PRCP))
Next, we need to split the data into training and testing sets. Let’s use the 2/3 training, 1/3 testing approach as we did previously. Use the code below:
# create training and testing data from the given data
library(caTools)
split <- sample.split(mydata$rain, SplitRatio = 0.66)
# get training and test data
train <- subset(mydata, split == TRUE)
test <- subset(mydata, split == FALSE)
Now we can build the tree by running the code below:
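The interactive code block is not reproduced in this text; a minimal sketch of the fitting step with `rpart` is shown below. The object name `modelTree` matches its later use, but the synthetic `train` data frame and the column names (TAVG, AWND, WSF5) are stand-ins for the real Burnett split.

```r
# sketch: fit a classification tree predicting rain level
# (synthetic 'train' data stands in for the real training split;
#  column names are assumptions, not taken from the actual file)
library(rpart)
set.seed(42)
train <- data.frame(
  TAVG = runif(300, 50, 95),   # average temperature (assumed name)
  AWND = runif(300, 0, 20),    # average wind speed (assumed name)
  WSF5 = runif(300, 5, 60),    # fastest 5-minute wind speed (assumed name)
  rain = factor(sample(c("low", "mod", "high"), 300, replace = TRUE))
)
# grow the tree with the default rpart.control settings (e.g., minsplit = 20)
modelTree <- rpart(rain ~ ., data = train, method = "class")
```

With the real data, `rain ~ .` tells rpart to use every remaining column in `train` as a candidate predictor.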
In this case, the best model was selected by the default parameters in ‘control’, which include minsplit = 20 (the minimum number of observations a node must contain before a split is attempted) and a tendency towards purity. To see all of the parameter defaults, use the help command on ‘rpart.control’. (Again, this was all covered in Meteo 815, so please return to that course if you need a refresher.) Please note that each time you run the code, a slightly different tree might grow. That is because the training/testing split is random: each time you hit 'run' in the DataCamp code, you re-run the analysis from the beginning, creating a new split and thus a new tree.
We can look at the final model using the print and summary command. The print command will provide the details of the model, while the summary provides information on how the model was selected. Run the code below:
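The inspection code itself is not shown in the text; a sketch of the two commands is below, using the `kyphosis` example data that ships with the rpart package as a stand-in for the Burnett model.

```r
# fit a small classification tree on rpart's built-in kyphosis data,
# standing in for the Burnett 'modelTree'
library(rpart)
modelTree <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")
# print: the splitting rules, case counts, and predicted class at each node
print(modelTree)
# summary: how each split was chosen (improvement, surrogate splits,
# variable importance), i.e., how the model was selected
summary(modelTree)
```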
The results might differ slightly from the tree formed previously, due to the random component of the analysis. To visualize the tree, run the code below:
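The plotting code is not reproduced here; a sketch using rpart's base-graphics functions (with the built-in `kyphosis` data standing in for the Burnett model) might look like:

```r
# sketch: draw the fitted tree with base graphics
library(rpart)
modelTree <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")
# plot() draws the branches; uniform spacing and a margin keep labels readable
plot(modelTree, uniform = TRUE, margin = 0.1)
# text() adds the split rules; use.n = TRUE prints per-class counts at each node
text(modelTree, use.n = TRUE, cex = 0.8)
```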
Again, the tree visualized here may be slightly different from the one formed previously. You can save the figure by running the code below (I find it easier to view the final tree by saving the figure and opening it, then viewing the output in RStudio):
# create attractive postscript plot of tree
post(modelTree, file = "tree.ps",
title = "Classification Tree for Burnett, TX")
You should get a figure that looks like this (might not be exactly the same):
In the example above, the first rule tests whether the fastest 5-minute wind speed is greater than 34. If it is less than 34, the outcome is labeled ‘low’. If it is greater than 34, you then test the average wind speed: if the average wind speed is greater than 10.4, the outcome is low; otherwise, you test the average temperature. If the average temperature is greater than 79.5, the outcome is low; otherwise, the outcome is high. The dash-separated numbers in each oval give the number of events at that node for each level (high/low/moderate).
Finally, if you wish to predict the testing values (or any other values), use the code below. Pay particular attention to the parameter ‘type’ as this must be set to ‘class’ to ensure you predict the appropriate variable type (classification instead of continuous).
# predict the rain level for the testing data
predictTree <- predict(modelTree, newdata = test, type = 'class')
You can then use any of the categorical assessment metrics from the earlier lessons to look at the fit of the model (such as joint histograms and confusion matrix metrics).
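As one example of such an assessment, a confusion matrix can be built with `table()`. The sketch below uses synthetic data as a stand-in for the Burnett split (the column names are assumptions); only the `table()` and accuracy lines are the point.

```r
# sketch: confusion matrix and overall accuracy for a classification tree
library(rpart)
set.seed(1)
n <- 300
dat <- data.frame(
  AWND = runif(n, 0, 20),    # assumed column name (average wind speed)
  WSF5 = runif(n, 5, 60),    # assumed column name (fastest 5-min wind)
  rain = factor(sample(c("low", "mod", "high"), n,
                       replace = TRUE, prob = c(0.7, 0.2, 0.1)))
)
# 2/3 training, 1/3 testing
idx   <- sample(n, 200)
train <- dat[idx, ]
test  <- dat[-idx, ]
modelTree   <- rpart(rain ~ ., data = train, method = "class")
predictTree <- predict(modelTree, newdata = test, type = "class")
# confusion matrix: rows = predicted level, columns = observed level
confMat <- table(Predicted = predictTree,
                 Observed  = factor(test$rain, levels = levels(predictTree)))
print(confMat)
# overall accuracy: fraction of days classified correctly
accuracy <- sum(diag(confMat)) / sum(confMat)
```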
Updated Flowchart
Below is an updated version of the flowchart that adds in the categorical forecasting methods we just discussed.