METEO 825
Predictive Analytic Techniques for Meteorological Data

Multiple Possible Categorical Outcomes

Prioritize...

Once you have completed this section, you should be able to explain rule-based forecast systems, their use of splits, and how to measure purity.

Read...

Rule-based forecasting is a machine learning method that identifies and uses rules to predict an outcome. The key is finding a set of rules that represent the system you are forecasting. There are numerous rule-based AI methods available. For this lesson, we will illustrate the Classification and Regression Tree (CART) method, which is used quite frequently in weather and climate applications. Before we dive into CART, let’s first go over some nomenclature.

Rules and Splits

In rule-based forecasting, we use a set of rules that split the data up and eventually lead to predictions. A rule says: if a specific condition is met, the outcome is Yes; otherwise, the outcome is No. Thus, a rule splits our set of cases into two groups: Yes and No. We’d like the cases in the Yes group to share a single categorical outcome, and likewise for the cases in the No group.
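To make this concrete, here is a minimal Python sketch of a single rule splitting a set of cases into Yes and No groups. The cases, the variable name cape, and the 1000 J/kg threshold are all hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical cases: a predictor (CAPE, J/kg) and an actual categorical outcome.
cases = [
    {"cape": 2500, "outcome": "thunderstorm"},
    {"cape": 300,  "outcome": "no storm"},
    {"cape": 1800, "outcome": "thunderstorm"},
    {"cape": 150,  "outcome": "no storm"},
]

# Rule: "Is CAPE >= 1000 J/kg?"  Yes goes to one group, No to the other.
yes_group = [c for c in cases if c["cape"] >= 1000]
no_group  = [c for c in cases if c["cape"] < 1000]

print([c["outcome"] for c in yes_group])  # ['thunderstorm', 'thunderstorm']
print([c["outcome"] for c in no_group])   # ['no storm', 'no storm']
```

Here the rule happens to be perfect: every case in each group shares the same actual outcome.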

A perfect rule system would yield splits where all the cases in each group had the same categorical outcome. We call this condition purity. An imperfect rule has a mix of actual outcomes on each side of the split. The better the rule, the closer each side of the split comes to having all cases with the same actual outcome.
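As a follow-on to the sketch above, tallying the actual outcomes on each side of a split shows how close each group comes to having a single outcome. The outcome lists below are hypothetical:

```python
from collections import Counter

# Hypothetical outcomes on each side of an imperfect rule's split.
yes_group = ["thunderstorm", "thunderstorm", "thunderstorm", "no storm"]
no_group  = ["no storm", "no storm", "no storm", "no storm"]

print(Counter(yes_group))  # Counter({'thunderstorm': 3, 'no storm': 1}) -- mixed
print(Counter(no_group))   # Counter({'no storm': 4}) -- pure
```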

Purity

We want our rule-based system to tend toward purity. But how do we actually assess purity? There are many ways of measuring it, some much better than others. The one we will focus on in this course is the one with the best theoretical foundation: Shannon’s information entropy, H. Mathematically:

H = -\sum_{i=1}^{N} p_i \log_2(p_i)

where i is the category, p_i is the fraction of the cases in that category, the summation runs over all N categories, and log_2(p_i) means taking the base-2 logarithm of p_i.

H goes to 0 as the sample of cases becomes pure (i.e., all cases belong to one category). For any category that doesn’t occur in the sample, p is zero, and by convention the product p·log_2(p) is also zero. For the one category that occurs in every case of a pure leaf (the end of a decision tree branch), p = 1, so p·log_2(p) = 1·0 = 0, and thus H = 0. Note that because p ≤ 1, log_2(p) ≤ 0, so each term -p·log_2(p) ≥ 0 and H ≥ 0: a pure sample achieves the minimum possible entropy.
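As a minimal sketch of how the formula behaves, the Python function below computes H for a list of categorical labels; categories with p = 0 simply never appear in the loop, matching the convention above. The example labels are hypothetical:

```python
import math

def shannon_entropy(labels):
    """Shannon information entropy H (base 2) of a list of categorical labels."""
    n = len(labels)
    h = 0.0
    for category in set(labels):        # categories with p = 0 never appear here
        p = labels.count(category) / n  # fraction of cases in this category
        h -= p * math.log2(p)           # H = -sum over i of p_i * log2(p_i)
    return h

print(shannon_entropy(["rain", "rain", "rain", "rain"]))        # pure sample: 0.0
print(shannon_entropy(["rain", "no rain", "rain", "no rain"]))  # 50/50 mix: 1.0
```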

You do not need to remember all the math. Just remember that our goal is to build rules that drive H down as close to zero as we can get it. This is similar to minimizing the AIC score when looking at multiple linear regression models.