Published on METEO 825: Predictive Analytic Techniques for Meteorological Data (https://www.e-education.psu.edu/meteo825)

Home > Lessons

Lessons

Use the links below to go directly to one of the lessons listed.

  • Lesson 1: An Introduction to Weather and Climate Prediction [1]
  • †Lesson 2: Measuring Forecast Accuracy [2]
  • Lesson 3: Statistical Forecast Methods [3]
  • †Lesson 4: Forecasting Categorical Variables [4]
  • Lesson 5: Forecasting Continuous Variables [5]
  • †Lesson 6: Ensemble Overview [6]
  • Lesson 7: Monte Carlo Simulations [7]
  • Lesson 8: Communicating Predictions (Analytical Dashboards) [8]

† indicates lessons that are Open Educational Resources.

Lesson 2: Measuring Forecast Accuracy

Motivation...

Forecasts vary widely in skill: between different systems, different seasons, from day to day, etc. Thus, we need to measure how well competing forecast systems typically do. And, maybe more importantly, how far off they could potentially be. Furthermore, we want to know how well a particular forecast system is meeting our needs. This is key, as needs differ.

For example, one rainfall forecast user, say an emergency manager, may be concerned about a rare extreme event which could lead to flooding, while another rainfall forecast user, say a gardener, might be concerned about the more normal rainfall amounts. They are both concerned about the rain forecast, but their needs are very different. So, we need to be able to tailor our skill measurements to take into account different user scenarios. Continuing our example, the emergency manager may be concerned only with rainfall amounts over 2 inches in 24 hours and so would need to measure the accuracy of a yes/no or probabilistic forecast that rainfall would exceed this threshold. In contrast, the gardener would want to measure how many inches the forecast was typically off by and whether it was biased wet or dry. Again, same forecast but different need and, thus, different skill assessment. In this lesson, I encourage you to develop the skill of selecting which metric is the most appropriate given the situation. This is not an easy task as there are many moving parts, but if you carefully consider the goal of the forecast and the intended use, you will succeed.

Lesson Objectives

  1. Select skill metrics appropriate to the user scenario.
  2. Quantify the skill of a forecast system.
  3. Compute the value a forecast system brings to making a particular decision.

Newsstand

  • WMO’s take on verification [9]
  • NWS verification statistics [10] 
  • NWS verification glossary [11]
Download [12] all of this lesson’s code and data.
You will be prompted to save a .zip file to your computer.

When to Measure Forecast Skill

Prioritize...

Once you have finished this section, you should be able to state when, during forecast system development and testing, we should measure forecast skill, and discuss the general goal of the forecast skill metric for each phase of the development and maintenance cycle.

Read...

You might think that we measure forecast skill after the forecast has been made. Although that is correct, we actually measure forecast skill throughout the development and selection process. This ensures not only that we begin operations with a suitable model, but also that we routinely verify that the forecast system is still performing as well as it did during testing. So instead of thinking of measuring forecast skill as a static, one-time event, I encourage you to look at the task as being dynamic - a constantly ongoing process. Read on to learn more about the three key phases in which we measure forecast skill.

During Forecast System Development

Every statistical forecast system contains one or more tunable parameters that control the relationship between the predictors and the predictand. Developing a forecast system consists of using ‘training’ data (e.g., developmental data) to find the ‘best’ values for these parameters. For example, take a look at the figure below that shows two linear regression fits (green and red lines) to the data (black dots).

Two linear regression fits: the red line (Y = 0.26X + 37.74) and the green line (Y = 0.84X + 10.79). Refer to text below.

Example of multiple linear regression fits. Which one is best? 
Credit: J. Roman

In linear regression, these parameters are the slope coefficients (circled in blue) and the intercept (circled in orange). But what is ‘best’? How were these coefficients selected? ‘Best’ is defined in terms of some measure of forecast skill (like the AIC score used in the example above) - or the reverse, forecast error.

In linear regression, we solve a set of equations once to find the best values of the parameters. Thus, the parameter tuning is done analytically. But we can also tune the parameters iteratively. We use a set of equations repeatedly to gradually improve initial guesses of the parameter’s value. This approach is used in logistic regression or other non-linear models.

Before we move on to the next phase of measuring forecast skill, let’s talk briefly about training vs. testing data. As a reminder, training data is the data used to tune the model parameters, while the testing data is independent of the training data and is used to assess how well the model will perform in operations. Generally, however, we will assess the skill on both the training and testing data. Why?

We can use the ratio of the error on the testing data to that on the training data to assess if the model is over fit. As a reminder, overfitting occurs when the model is too closely fit to a limited set of data points. (You can review the subject of overfitting in the Meteo 815 lessons on Linear Regression and Data Mining.) If the ratio is much larger than 1, then the model is over fit. Take a look at the figure below:

Over fit model. Refer to text below.
Example of the MSE Ratio (over fit) vs Polynomial Degree
Credit: J. Roman

As the degree of the polynomial fit increases, the ratio becomes larger. An over fit polynomial might look great in MSE on the training data but do much more poorly on independent data, indicating that it will do poorly in operations (where the data is also independent of that used to train the model). This will be discussed in more detail in future lessons. The key here is that we can test if a model has been over fit, but the requirement is that we MUST have testing data that is independent of the training data.

During Forecast System Selection

Often, we have several competing sources of forecasts available. For example, take a look at the figure below:

Mean RMSE for several three hour forecasts of 500 mb heights. Refer to text below.
RMSE for the three-hour forecast of 500 mb heights in the Northern Hemisphere for several models
Credit: NOAA/National Weather Service [13]

The figure shows the mean RMSE for several three-hour forecasts of 500 mb heights in the Northern Hemisphere (20°N through 80°N). You can find information about the models here [14]. In this example, the ECM (European Centre for Medium-Range Weather Forecasts) has the lowest average RMSE for the verification period and would thus be a better choice than, say, the CFSR (Legacy GFS used for Climate Forecast System Reanalysis).

Now, take a look at this figure:

Mean RMSE for several three hour forecasts of 500 mb heights, the mean over tropics. Refer to text below.
RMSE for the three-hour forecast of 500 mb heights in the Tropics for several models
Credit: NOAA/National Weather Service [13]

Again, this is the RMSE for several three-hour forecasts of 500 mb heights, but in this case, we are looking at the mean over the tropics (20°S to 20°N). In this example, the JMA (Japan Meteorological Agency) has the lowest average RMSE over the whole verification period and thus would be the best model for predicting the tropics. Same models, same variable, same forecast hour, different purpose (tropics vs. Northern Hemisphere). As you can see, comparing models is imperative because the same model may not be the best model across all user scenarios.

We need to decide which set of forecasts provides the most valuable guidance for the decision of interest. What we need is a quantitative measure of forecast skill so that we can make this comparison objectively. And this measure needs to be designed to fit the customer; that is, who needs the forecast, and what is their purpose (think back to our earlier example of the gardener and the emergency manager)?

In short, our forecast scoring metric needs to measure those aspects of skill that are important for our decision. For example, if we were forecasting frost for citrus growers, we wouldn’t care about the forecast skill at warmer temperatures. We would only care about the skill on days cool enough to have a chance of frost. Thus, one size most definitely does not fit all.

The forecast skill metric must be selected or designed to capture that which is of value to the customer! For example, this could include the quality of the decisions that need to be made based on the prediction (often as measured by the financial outcome of a set of such decisions). We will discuss this in more detail later on. The key here is to emphasize that the skill metric is greatly dependent on the situation at hand. You need to master the skill of understanding the customer’s needs and turning that into a quantitative, easy-to-use metric capturing the forecast’s value for decision making.

During Forecast System Use

You might think that once the forecast system has been selected, you won’t need to assess the skill. But routine verification is used to make sure the forecast system is still performing as well as it did during testing. Remember, skill assessment is dynamic. For example, the available sources of input data can evolve with time, changing the forecast skill for better or worse. This is an important issue when using either observational data or numerical weather prediction model output as input to your forecast system. In addition, during forecast system use, we can determine if there is a seasonality or location dependence in the forecast system’s skill.

Now that we know when to measure forecast skill, we need to learn how to measure it. Read on to learn more about the different techniques for measuring forecast skill.

Deterministic Forecasts

Prioritize...

After you have read this section, you should be able to assess the skill of a deterministic forecast using both graphical methods and statistical metrics.

Read...

So far in this lesson, we have highlighted when we need to assess a forecast. A natural follow-on is how do we assess a forecast? There are many different ways to test the forecast skill, both qualitatively and quantitatively. The choice of technique depends upon the type of predictand and what we're going to use the forecast for. In this section, we will discuss techniques used specifically for deterministic forecasts, which are forecasts in which we state what is going to happen, where it will happen and when (such as tomorrow’s temperature forecast for New York City).

Graphical Methods

As always, plotting the data is an essential and enlightening first step. In this case, the goal is to compare each forecast to the corresponding observation. There are two main types of plots we will focus on for deterministic forecasts: scatterplots and error histograms.

Scatterplots, which you should remember from Meteo 815 and 820, plot the forecast on the horizontal axis and the corresponding observation on the vertical axis. Scatterplots allow us to ask questions such as:

  1. Did our sample of forecasts cover the full range of expected weather conditions?
    1. If not, we really need a bigger sample in order to judge the forecast system accurately, or the forecast system lacks the statistical power to forecast extreme events.
    2. Do we have a good climatology of observations?
    3. Can the forecast system reproduce the full range of that climatology?
  2. Are there any outliers?
    1. If so, can we figure out what went wrong with our forecast system (or our observations!) for those cases?
      1. If we can, then we can fix them.
  3. Is there either a multiplicative or additive bias (regression and correlation discussion from Meteo 815)?
    1.  If so, could they be corrected by linear regression?

Let’s work through an example. We are going to create a scatterplot of forecasted daily maximum temperature versus observed daily maximum temperature in Ames, IA. Begin by downloading these predictions [15] of daily MOS (Model Output Statistics) GFS (Global Forecast System) forecasted maximum temperature (degrees F) for Ames, IA. You can read more about the forecast here [16]. In addition, download this set of observations [17] of daily maximum temperature (degrees F) for Ames, IA. You can read more about the observational dataset here [18].

Now, let’s start by loading in the data using the code below:

Show me the code...

Your script should look something like this:

### Example of MOS Max Temp (https://mesonet.agron.iastate.edu/mos/fe.phtml)
# load in forecast data
load("GFSMOS_KAMW_MaxTemp.RData")

# load in observed data (https://mesonet.agron.iastate.edu/agclimate/hist/dailyRequest.php)
load("ISUAgClimate_KAMW_MaxTemp.RData")

I have already paired the data (and all data in future examples) so you can begin with the analysis. Let’s overlay a linear model fit on the scatterplot. To do this, we must create a linear model between the observations and the predictions. Use the code below:

Show me the code...
# create linear model
linearMod <- lm(obsData$maxTemp~
                forecastData$maxTemp,na.action=na.exclude)

Now, create the scatterplot with the forecasted maximum temperatures on the x-axis and the observed on the y-axis by running the code below.
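A minimal sketch of that plotting code, using base R graphics and the objects defined above (the original course code may differ in styling), might look like this:

# scatterplot of forecasted (x-axis) vs. observed (y-axis) maximum temperature
plot(forecastData$maxTemp, obsData$maxTemp,
     xlab = "Forecasted Max Temp (deg F)", ylab = "Observed Max Temp (deg F)",
     main = "Maximum Temperature for Ames, IA")

# overlay the linear model fit from above
abline(linearMod, col = "red")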

You should get the following figure:

Maximum temperature for Ames, Iowa
Scatterplot of Observed vs. Forecasted temperature with a linear fit overlay
Credit: J. Roman

The linear fit is quite good, and you can see we do not need to apply any correction.

The second type of visual used in assessing deterministic forecasts is the error histogram. The formula for error is as follows:

Error=Forecast-Observed

An error histogram, as the name suggests, is a histogram plot of the errors. You can ask the following questions when using an error histogram:

  1. Are the errors normally distributed?
  2. Are there any extreme errors?
  3. How big are the most extreme positive and negative 1, 5, or 10% forecast errors?
    1. This basically creates confidence intervals for forecast errors.

Now, let’s work on an example using the data from above. To start, calculate the error for each forecast and observation.

Show me the code...
# calculate error
error <- forecastData$maxTemp-obsData$maxTemp

Now, estimate the 1st, 5th, 10th, 90th, 95th, and 99th percentiles which we will include on the error histogram.

Show me the code...
# estimate the percentiles (1st, 5th, 10th, 90th, 95th, and 99th)
errorPercentiles <- quantile(error,
                             probs=c(0.01,0.05,0.1,0.9,0.95,0.99),
                             na.rm=TRUE)

Now, plot the error histogram using the ‘hist’ function. Run the code below:
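A minimal sketch of that code, reusing the error and errorPercentiles objects from above (the original styling may differ):

# frequency histogram of the forecast errors
hist(error, breaks = 50,
     xlab = "Error (Forecast - Observed, deg F)",
     main = "Histogram of Forecast Error")

# mark the 1st/99th, 5th/95th, and 10th/90th percentiles as vertical lines
abline(v = errorPercentiles, col = "blue", lty = 2)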

You should get the following figure:

Histogram of error. Refer to text below.
Frequency histogram of the error with confidence intervals
Credit: J. Roman

The figure above shows the frequency histogram of the errors with confidence intervals at 1% (1st and 99th percentile), 5% (5th and 95th percentile), and 10% (10th and 90th percentile). The errors appear to be normally distributed.

Statistical Metrics

Now that we have visually assessed the forecast, we need to perform some quantitative assessments. As a note, most of the statistical metrics that we will discuss should be a review from Meteo 815. If you need a refresher, I suggest you go back to that course and review the material.

There are four main metrics that we will discuss in this lesson; a code sketch computing all four for our maximum temperature example follows this list.

  1. Bias
    The first obvious approach to quantifying the forecast error is to average the errors from a sufficiently large sample of forecasts. This average is called forecast bias. The formula for bias is as follows:
     $\text{bias} = \frac{1}{N}\sum_{i=1}^{N}\left(\text{forecast}_i - \text{observation}_i\right)$
    Where N is the number of cases. A good forecast system will exhibit little or no bias. A bad forecast system, however, can still have little bias because the averaging process lets positive and negative errors cancel out. Let’s compute the bias for our maximum temperature example from above. Run the code below: The bias is 0.52 degrees F, suggesting little bias.
  2. MAE
    Since the bias averages out the positive and negative errors, we need a method of seeing how big the errors are without the signs of the errors getting in the way. A natural solution is to take the absolute value of the errors before taking the average. This metric is called the Mean Absolute Error (MAE). Mathematically:
    $\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\text{forecast}_i - \text{observation}_i\right|$
    Let’s try computing this for our example. Run the code below:

    The result is 3.03 degrees F, which is much larger than the bias, suggesting there were some large errors with opposite signs that canceled out in the bias calculation.
  3. RMSE
    Another popular way of eliminating error signs so as to measure error magnitude is to square the errors before averaging. This is called the Mean Squared Error (MSE). Mathematically:
    $\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\text{forecast}_i - \text{observation}_i\right)^2$
    The downside to this approach is that the MSE has different units than those of the forecasts (the units are squared). This makes interpretation difficult.

    The solution is to take the square root of the MSE to get the Root Mean Squared Error (RMSE). The RMSE formula is as follows:
    $\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\text{forecast}_i - \text{observation}_i\right)^2}$
    RMSE is more sensitive to outliers than MAE; thus, it tends to be somewhat larger for any given set of forecasts. Neither the MSE nor the RMSE is as appropriate, however, as error confidence intervals if the error histogram is markedly skewed.

    Now, let’s try calculating the RMSE for our maximum daily temperature example. Run the code below:

    The resulting RMSE is 4.02 degrees F, slightly larger than the MAE.
  4. Skill Score
    To aid in the interpretation of these error metrics, their value for the forecast system of interest can be compared to that of some well-known no-skill reference forecast. This comparison effectively scales the forecast system error by how hard the forecast problem is. That way you know if your forecast system does well because it is good or because the problem is easy. The usual way to do this scaling is as a ratio of the improvement in MSE to the MSE of the reference forecast. This is formally called the Skill Score (SS). Mathematically:
    $SS = 1 - \frac{\text{MSE}_{\text{forecast}}}{\text{MSE}_{\text{reference}}}$
    Climatology and persistence are commonly used reference forecasts.

    Let’s give this a try with our daily maximum temperature example. We already have the MSE from the forecast; all we need to do is create a reference MSE. I’m going to use the observations and create a monthly climatology of means. Then I will estimate the MSE between the observations and the climatology to create a reference MSE. Use the code below:
    Show me the code...
    # initialize errorRef variable
    errorRef <- array(NA,length(obsData$Date))
    # create climatology of observations and reference error (anomaly)
    for(i in 1:12){
      index <- which(as.numeric(format(obsData$Date,"%m"))==i)
      # compute mean
      monthMean <- mean(obsData$maxTemp[index],na.rm=TRUE)
      # estimate anomaly
      errorRef[index] <- obsData$maxTemp[index]-monthMean
    }
    mseRef <- sum((errorRef^2),na.rm=TRUE)/length(errorRef)
    
    

    The code above loops through the 12 months and pulls out the corresponding maximum daily temperatures that lie in that month for the length of the data set and estimates the mean. The monthly mean is then subtracted from the observations of the maximum temperature in the corresponding month, which is identical to creating an anomaly. The reference MSE is then calculated. Now we can calculate the skill score by running the code below:

    A skill score of 1 would mean the forecast was perfect, while a skill score of 0 would suggest the forecast is equal in skill to that of the reference forecast. A negative skill score means the forecast is less skillful than the reference forecast. In this example, the SS is 0.86 (no units), which is a pretty good skill score.
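For reference, here is a minimal sketch of the bias, MAE, RMSE, and skill score calculations prompted in the list above, reusing the error vector and mseRef computed earlier (the original course code may differ in detail):

# bias: mean of the errors
bias <- mean(error, na.rm = TRUE)

# MAE: mean of the absolute errors
mae <- mean(abs(error), na.rm = TRUE)

# MSE and RMSE
mse <- mean(error^2, na.rm = TRUE)
rmse <- sqrt(mse)

# skill score relative to the monthly climatology reference
ss <- 1 - mse/mseRef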

Categorical Forecasts

Prioritize...

After you have read this section, you should be able to assess the skill of a categorical forecast using the appropriate techniques.

Read...

You will most likely encounter other forecast types besides deterministic ones. Categorical forecasts, such as the occurrence of rain, are another common type of forecast in meteorology that relies on a specific set of metrics to assess the skill. Read on to learn about the techniques commonly used to assess categorical forecasts.

Joint Histogram

For categorical forecasts, scatterplots and error histograms do not make sense. So, the exploratory error analysis is accomplished in a different way. Typically, we compute the joint histogram of the forecast categories and observed categories. The forecast categories are shown as columns, and the observed categories are rows. Each cell formed by the intersection of a row and column contains the count of how many times that column’s forecast was followed by that row’s outcome.

Although the joint histogram is formally called the confusion matrix, it need not be confusing at all. The cells on the diagonal (upper left to lower right) count cases where the forecast was correct. The cell containing event forecasts that failed to materialize captures ‘false alarms’ (bottom left), and the cell containing the count of events that were not forecasted captures ‘missed events’ (top right). Below is an example of the matrix where a, b, c, and d represent the count of each cell.

Example Confusion Matrix Table
Total Count=N=a+b+c+d Forecast: (Event Occurred) Forecast: (Event Did Not Occur)
Observed: (Event Occurred) TP=a FN=b
Observed: (Event Did Not Occur) FP=c TN=d


Let’s work through each cell starting in the top left. TP stands for true positive - the event was forecasted to occur and it did. FN stands for false negative - the event was not forecasted to occur, but it did (missed events). FP stands for false positive - the event was forecasted to occur, but it did not occur (false alarms). Finally, TN stands for true negative - the event was not forecasted to occur, and it did not occur. We will use these acronyms later on, so I encourage you to make use of this table and go back to it when necessary. 

All of the popular forecast metrics for categorical forecasts are computed from this matrix, so it’s important that you can read the matrix (it is not necessary to memorize, although you will probably end up memorizing it unintentionally). The confusion matrix can be generalized to forecasts with more than two categories, but we won’t go into that here.

Let’s work through an example. We are going to look at daily rainfall in Ames, IA from the MOS GFS and compare the predictions to observations (same data source as the previous example). You can download the rainfall forecast here [19] and the observations here [20]. A 1 means rain occurred (was predicted) and a 0 means rain did not occur (was not predicted).

Begin by loading in the data using the code below:

Show me the code...
# load in forecast data
load("GFSMOS_KAMW_rain.RData")

# load in observed data (https://mesonet.agron.iastate.edu/agclimate/hist/dailyRequest.php)
load("ISUAgClimate_KAMW_rain.RData")

Now compute the values of a, b, c, and d for the confusion matrix by running the code below.
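A minimal sketch of that computation is below. It assumes the loaded data frames expose 0/1 columns named forecastData$rain and obsData$rainfall (hypothetical names; adjust them to match the objects in the .RData files):

# count each cell of the confusion matrix (hypothetical column names)
a <- sum(forecastData$rain == 1 & obsData$rainfall == 1, na.rm = TRUE)  # TP: rain predicted and observed
b <- sum(forecastData$rain == 0 & obsData$rainfall == 1, na.rm = TRUE)  # FN: rain observed, not predicted
c <- sum(forecastData$rain == 1 & obsData$rainfall == 0, na.rm = TRUE)  # FP: rain predicted, not observed
d <- sum(forecastData$rain == 0 & obsData$rainfall == 0, na.rm = TRUE)  # TN: no rain predicted or observed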

If we fill in the matrix, we would get the following table:

Confusion Matrix Table for Rainfall in IA
Total Count=N=a+b+c+d=3653 Rain Predicted No Rain Predicted
Rain Observed TP=489 FN=422
No Rain Observed FP=326 TN=2396


You should also make sure a+b+c+d adds up to the length of the data set. Run the code below to check.
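One way to sketch that check (again assuming the hypothetical obsData$rainfall column name):

# the four cells should account for every forecast/observation pair
(a + b + c + d) == length(obsData$rainfall)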

And there you have it! Your first confusion matrix.

Confusion Matrix Metrics

Now that we have the confusion matrix, we want to create some quantifiable metrics that allow us to assess the categorical forecast. The first metric we will talk about is the Percent Correct. Percent Correct is just the ratio of correct forecasts to all forecasts. It answers the question: overall, what fraction of the forecasts were correct? The range is 0 to 1 with a perfect score equal to 1. The closer to 1 the better. Mathematically (referring back to the confusion matrix):

$\text{Percent Correct} = \frac{TP+TN}{TP+FN+FP+TN} = \frac{a+d}{a+b+c+d} = \frac{a+d}{N}$

It is a great metric if events and non-events are both fairly likely and equally important to the user. But for rare events, you get a great percent correct value by always forecasting non-events. That is not helpful at all. Likewise, if missed events and false alarms have different importance to the user (think tornados) then weighting them equally in the denominator isn’t a good idea.

Therefore, numerous other metrics have been derived that can be computed from the confusion matrix. One example is the Hit Rate. The Hit Rate is the ratio of true positives to the sum of the true positives and false negatives. It answers the question: what fraction of the observed ‘yes’ events were correctly forecasted? The range is 0 to 1 and a perfect score is 1. The closer to 1 the better. Mathematically:

$\text{Hit Rate} = \frac{TP}{TP+FN} = \frac{a}{a+b}$

Another metric is the False Alarm Rate, which is the fraction of observed non-events for which the forecast incorrectly predicted an event. Think of the False Alarm Rate as the forecast system crying wolf. It answers the question: what fraction of the observed ‘no’ events were incorrectly predicted as ‘yes’? The range is 0 to 1 with a perfect score of 0. The closer to 0 the better. Mathematically:

$\text{False Alarm Rate} = \frac{FP}{FP+TN} = \frac{c}{c+d}$

Another metric is the Threat Score. It is the number of times an event was both forecasted and observed, divided by the total number of cases in which the event was forecasted and/or observed (hits plus false alarms plus missed events). It answers the question: how well did the forecasted ‘yes’ events correspond to the observed ‘yes’ events? The range is 0 to 1 with a perfect score of 1. The closer to 1 the better. Mathematically:

$\text{Threat Score} = \frac{TP}{TP+FN+FP} = \frac{a}{a+b+c}$

The threat score is more appropriate than percent correct for rare events.

Now let’s try computing these ourselves. Continuing on with the rainfall example, run the code below to estimate the four different metrics discussed above.
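A minimal sketch of those calculations, using the confusion matrix counts a, b, c, and d from above (the original course code may differ in detail):

# percent correct: fraction of all forecasts that were correct
percentCorrect <- (a + d)/(a + b + c + d)

# hit rate: fraction of observed 'yes' events that were correctly forecasted
hitRate <- a/(a + b)

# false alarm rate: fraction of observed 'no' events forecasted as 'yes'
falseAlarmRate <- c/(c + d)

# threat score: hits relative to hits + misses + false alarms
threatScore <- a/(a + b + c)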

The percent correct is 0.79 (perfect score=1), the hit rate is 0.53 (perfect score =1), the false alarm rate is 0.12 (perfect score=0) and the threat score is 0.39 (perfect score=1). You can see that the percent correct is quite high, but upon further investigation, we see this might be biased due to the large number of predicted and observed ‘no’ events (TN).

Probabilistic Forecasts

Prioritize...

Once you have completed this section, you should be able to quantitatively and qualitatively assess a probabilistic forecast.

Read...

Now, we are on to the final forecast type, which we will discuss in this lesson. A probabilistic forecast is much like a categorical forecast, except that the forecast system is allowed to express its uncertainty by giving the probability of an event instead of just a yes or a no. While this provides more useful information for the decision maker, it requires different verification procedures, more closely related to those we used for continuous variables. Read on to learn more about how to assess the skill of a probabilistic forecast.

Brier Score

The basic metric for a probabilistic forecast is a variant of the MSE called the Brier Score (BS). It measures the mean squared difference between the predicted probability of the event and the actual outcome. Mathematically:

$BS = \frac{1}{N}\sum_{i=1}^{N}\left(f_i - o_i\right)^2$

Where $f_i$ is the forecasted probability of the event occurring (e.g., a 25% chance) and $o_i$ is the actual outcome (0 if the event does not occur and 1 if it does occur). The range of BS is 0 to 1 with a perfect score being 0. The closer to 0 the better.

Let’s try estimating the BS for a Probability of Precipitation (PoP) forecast. Again, we will use MOS GFS output and observations from Ames, IA using the same data sources from before. The model predictions can be downloaded here [21] and the observations can be downloaded here [22]. Begin by loading in the data:

Show me the code...
# load in forecast data
load("GFSMOS_KAMW_PoP.RData")

# load in observed data (https://mesonet.agron.iastate.edu/agclimate/hist/dailyRequest.php)
load("ISUAgClimate_KAMW_PoP.RData")

Now, turn the predicted probability of precipitation into a decimal value (instead of a percentage).

Show me the code...
# turn pop into decimal value
f <- forecastData$pop/100

Now, save the observed value (1 if it rained and 0 if it did not).

Show me the code...
# save observed value
o <- obsData$rainfall

Next, estimate the difference squared for each data point.

Show me the code...
# estimate difference squared
errorSquared <- (f-o)^2

Finally, sum up and divide by the length to estimate the Brier Score.
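A minimal sketch of that final step, reusing the errorSquared vector from above:

# Brier Score: average of the squared differences
BS <- sum(errorSquared, na.rm = TRUE)/length(errorSquared)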

In this example, the BS is 0.15, which is relatively close to 0 suggesting a good forecast skill.

Graphical Methods

BS does not, however, capture the value of a probabilistic forecast to a particular user. The user will be picking some threshold probability above which to assume the event will occur. They’ve got to decide which threshold is best for their particular decision.

Generally, this depends on how expensive false alarms are relative to missed events. For example, consider a hurricane shut-down decision for an oil refinery. If you shut the refinery down when you didn’t have to (false alarm), you will lose several days of production. On the flip side, if you were flooded out while operating (missed event), you might face considerable repair costs.

So, we need a way to measure the tradeoff between false alarms and missed events as we change our threshold probability. If you set the threshold near 0%, then you’ll always forecast the event and will have fewer missed events and more false alarms. If you set the threshold high, you will increase the number of missed events but reduce the number of false alarms.

How do we visualize this? One way is with a Receiver Operating Characteristic (ROC) curve. A ROC curve plots the fraction of events correctly forecasted (the True Positive Rate, TPR) on the vertical axis versus the fraction of non-events that were forecasted to be events (the False Positive Rate, FPR) on the horizontal axis. The TPR is given by the formula below:

$TPR = \frac{TP}{TP+FN} = \frac{a}{a+b}$

The FPR (also called the false alarm rate) is given by the formula below:

$FPR = \frac{FP}{FP+TN} = \frac{c}{c+d}$

Let’s give this a try with our PoP example. To begin, initialize variables TPR and FPR using the code below:

Show me the code...
# initialize variables
truePositiveRate <- {}
falsePositiveRate <- {}

Now, create an index that contains the instances when rain was observed and when it wasn’t.

Show me the code...
# create rain index 
rainIndex <- which(o==1)
noRainIndex <- which(o==0)

Now, create a list of threshold probabilities to use for creating the curve.

Show me the code...
# select the probabilities
probs <- seq(0,1,0.1)

Now, loop through the probabilities and estimate the TP and FP rates.

Show me the code...
# loop through probabilities and estimate the True Positive and False Positive rate
for(i in 1:length(probs)){
  popIndex <- which(f > probs[i])
  noPopIndex <- which(f < probs[i])
  TP <- length(intersect(popIndex,rainIndex))
  FP <- length(intersect(popIndex,noRainIndex))
  FN <- length(intersect(noPopIndex,rainIndex))
  TN <- length(intersect(noPopIndex,noRainIndex))
  truePositiveRate[i] <- TP/(TP+FN)
  falsePositiveRate[i] <- FP/(FP+TN)
}

So for each probability threshold, you found the cases in which the forecasted probability exceeded that threshold (popIndex). Then, you counted the number of times TP, FP, FN, and TN occurred based on that probability threshold (you essentially created the confusion matrix). Finally, you estimated the rate.

Now, we plot the false positive rate on the x-axis and the true positive rate on the y-axis by running the code below.
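A minimal sketch of that plotting code (base R graphics; the original styling may differ):

# ROC curve: false positive rate (x-axis) vs. true positive rate (y-axis)
plot(falsePositiveRate, truePositiveRate, type = "b",
     xlim = c(0, 1), ylim = c(0, 1),
     xlab = "False Positive Rate", ylab = "True Positive Rate",
     main = "TPR vs. FPR for Ames, IA")

# one-to-one reference line
abline(0, 1, col = "blue")

# label each point with its threshold probability
text(falsePositiveRate, truePositiveRate, labels = probs, pos = 3)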

I’ve included a one-to-one line in blue. I also displayed the probability threshold for each point as a text label. Below is a larger version of the figure:

TPR v. FPR for Ames, IA
True Positive Rate vs False Positive Rate for various threshold probabilities for PoP in IA
Credit: J. Roman

As we set our threshold probability high, we end up at the lower left of the curve with few cases forecasted as events, so low rates of both true and false positive. In contrast, if we set our threshold probability low, we forecast the event most of the time, so we move to the upper right end of the ROC curve with high rates of true positive (good) and false positive (bad). The trick here is finding the right threshold for our particular decision cost structure.

Cost-Based Decision Making

So, how do we pick the right threshold? Well, by applying a threshold to a probabilistic forecast, we turn the probabilistic forecast into a categorical forecast. This means we can compute the confusion matrix and all the associated error metrics.

The goal is to find a threshold probability that guides us to the most cost-effective decision. To do this, we need the decision’s cost matrix. This has the same columns (forecast category) and rows (observed outcome category) as the confusion matrix. But it contains losses (or profits) for each combination of forecast and observed category.

For example, below is the cost matrix for a hypothetical refinery in a potential hurricane-prone position:

Hypothetical Decision Cost Matrix for a Refinery
Profit or loss in millions of $ per day Fcst Yes Fcst No
Obs Yes -1.5 -5.0
Obs No -1.0 0.25

Let’s say that the corresponding confusion matrix for a threshold value of p is as follows:

Corresponding Confusion Matrix
Total Count=N Fcst Yes Fcst No
Obs Yes a b
Obs No c d

If we multiply each cell in the cost matrix by the corresponding cell of the confusion matrix, sum the products, and divide by the total number of forecasts, we get the expected profit or loss from using the forecast system with this threshold (p). In this example, the formula would be:

$\text{Expected Profit or Loss} = \frac{(-1.5 \cdot a) + (-5.0 \cdot b) + (-1.0 \cdot c) + (0.25 \cdot d)}{N}$

You then try a range of thresholds from 0 to 100% and see which yields the best financial outcome. The optimal threshold probability depends strongly on the cost matrix, so it’s not the same for any two decisions. This is why probabilistic forecasts are more valuable than categorical forecasts; they can be tuned to the decision problem at hand by picking the best threshold probability.

So let’s give it a try. Below is a hypothetical cost matrix for a bike share company when precipitation greater than 0 occurs in Ames, IA. When precipitation occurs, they lose out on people renting bikes. In addition, if precipitation occurs, and they don’t expect it, the bikes can get ruined if not stored correctly. On the flip side, if precipitation is forecasted, and they remove bikes, but then rain does not occur, they miss out on sales. (Please note, this is all hypothetical).

Hypothetical Decision Cost Matrix for Bike Share Company
Profit or Loss $ Per Day Fcst Yes Fcst No
Obs Yes -200 -500
Obs No -150 50

Now, let’s use our data and apply several different threshold probabilities to estimate the expected profit or loss. To begin, select the thresholds we will test:

Show me the code...
# select the probabilities
probs <- seq(0,1,0.1)

Now, initialize the variable that will save the expected profit or loss for each threshold.

Show me the code...
# initialize variables
expectedProfitLoss <- {}

Now, determine when it rained and when it did not rain.

Show me the code...
# create rain index 
rainIndex <- which(obsData$rainfall==1)
noRainIndex <- which(obsData$rainfall==0)

Now, enter the cost matrix values.

Show me the code...
# enter cost matrix values
costTP <- -200
costFN <- -500
costFP <- -150
costTN <- 50

Now, we can loop through the probabilities and estimate the expected gain/loss.

Show me the code...
# loop through probabilities and estimate the expectedProfitLoss
for(i in 1:length(probs)){
  popIndex <- which(f > probs[i])
  noPopIndex <- which(f < probs[i])
  TP <- length(intersect(popIndex,rainIndex))
  FP <- length(intersect(popIndex,noRainIndex))
  FN <- length(intersect(noPopIndex,rainIndex))
  TN <- length(intersect(noPopIndex,noRainIndex))
  sumProfitLoss <- (costTP*TP)+(costFN*FN)+(costFP*FP)+(costTN*TN)
  expectedProfitLoss[i] <- sumProfitLoss/length(obsData$Date)
}

You could do a number of things to visualize or quantify the best threshold. I will simply make a plot of the threshold probabilities and the expected profit or loss. Run the code below:
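A minimal sketch of that plot (base R graphics; styling may differ from the original):

# expected profit/loss as a function of the threshold probability
plot(probs, expectedProfitLoss, type = "b",
     xlab = "Threshold Probability", ylab = "Expected Profit/Loss ($ per day)",
     main = "Expected Profit/Loss vs. Threshold Probability")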

Below is a larger version of the figure:

Plot of threshold probabilities and the expected profit or loss. Refer to text below.
Expected profit/loss for hypothetical bike share decision cost matrix for various probability thresholds in IA
Credit: J. Roman

The optimal threshold is 30%, as this minimizes their loss. So, when there is a 30% or larger chance of precipitation, they should store bikes and expect a loss of roughly $60/day.

Summary

Prioritize...

Once you have completed this section, you should be able to select the appropriate visuals and statistical metrics for assessing forecast skill, depending on the forecast type.

Read...

In this lesson, we learned how to assess the skill of a forecast, dependent on the forecast type. The assessment relies heavily on the question at hand and the consequences of a bad forecast versus a good forecast (user-centric). Before you continue on to the assignment, make sure you feel comfortable selecting the appropriate assessment given a forecast type. Below, you will find one final note on strictly proper scores and a flowchart to help guide you in the decision making.

Strictly Proper Scores

If you choose not to go with cost minimization, there are a plethora of other error and skill metrics to choose from. Many of them, however, reward issuing forecasts that don’t reflect the true probability distribution. For example, MAE applied to a probability forecast will improve if you forecast a 1 for probability > 0.5 and a 0 for probability <= 0.5. Thus, destroying some of the available information about forecast uncertainty actually improves the score. It’s not rewarding the right aspects of skill in this setting.

In contrast, the Brier Score and cost minimization reward issuing the best guess of the event probability as your forecast. Error and skill metrics that reward issuing a correct probability forecast are called ‘strictly proper’.

Flowchart 

I will end this lesson by providing a flowchart that should help guide your forecast assessment. There are many more methods for assessing the skill of a forecast, and you are not limited to what was presented here.

Self Check

Test your knowledge...

Lesson 4: Forecasting Categorical Variables

Motivation...

Click for the transcript of AWESOME 3D Explanation on How Different Precipitation Types Form!

Oftentimes with storms, we can get snow to rain, we get snow to sleet, snow/sleet, to freezing rain, but what determines these precipitation types? So, let's talk a little bit about this. First of all, when you think about the troposphere, which is where the game is played, you're looking at about 40,000 feet. So, we're only going to need about 15,000 feet of that atmosphere to age, grow the snow, and produce these different precipitation types. And sometimes you do not even need all of that. But let's just start up here where the snow crystals are developing, you know, we call this the dendritic growth region. Now it's pretty easy to produce snow when you've got the snowflakes that are already being produced up there, and they fall down through the atmosphere all the way down to the ground as snow because the whole column itself is below 32 degrees Fahrenheit or zero Celsius. And so, what do you wind up with? This beautiful Currier and Ives postcard down here, right? That's easy. But what happens when you start introducing a warm layer?

So, on average, let's play with that warm layer somewhere between 5 and 10,000 feet. Maybe not all of it. So, we've got snow, we've always got snow up there in the winter time, usually at 15,000 feet on up. Ok, so here come those snow crystals rolling on the down the layer. All of a sudden, oh! It goes above 32 degrees, starts to melt, either the entire flake or a little bit of it, but once it starts to fall back into the air that's below 32, it becomes a sleet pellet. So, it completely refreezes again. You can hear the sleet. You can hear it on the rooftops. You can hear it on the street. You can hear it on your car if you're just sitting there. And sleet and snow are typically ok to drive in, but this last precipitation type, ew! Alright, so let's play again with the atmosphere. Go up to 15,000, we've got snow down to 10 on average. It melts, uh oh, it's still melting all the way down to about 2,000 feet and then, on under that air mass, that's just cold enough to refreeze whatever is at the surface, but it's very shallow. Now, this can vary anywhere between 200 and 2,000 feet. But the problem is, you don't get a chance to refreeze that droplet. It comes down as a liquid and whatever it hits it turns to ice. And ice is the worst thing to drive on! Especially if you're going too fast!

[CAR SKIDDING ON ICE]

Yeah, stuff like that can happen. So, the rain keeps coming down in the freezing layer. It accumulates on everything. The rooftops, the tree limbs, and you wind up with some big problems in through here. And unfortunately, too, you've got power lines involved in many, many cases where that ice is accumulating on power lines. Now, between the two poles, if you just take on average, the weight of a quarter inch of ice adds about 500 pounds to those lines! And yeah, that means power loss and all sorts of problems.

So, those are precipitation types and basically, how they form. Obviously, you can get all of the same precipitation in one storm, you can get a combination of all three. But that's essentially how the atmosphere works for precipitation types.

The video above discusses different precipitation types and how they form. Precipitation type is an example of a categorical variable, one in which a finite set of outcomes can occur. Accurately predicting the precipitation type may be more important for a user than the magnitude of the precipitation. For example, you might expect less than an inch of rain, but if that rain is actually freezing rain, hazards can occur such as slippery roads, downed trees, etc., that you must prepare for. Or maybe you are concerned with a certain threshold of freezing rain, such as an amount greater than a ½ inch, which could down power lines. The actual magnitude of the precipitation does not matter as much as the type of precipitation that occurs and/or if the amount surpasses a specific threshold.

Categorical variables, like the ones described in the example above, show up quite frequently in weather and climate analytics. They allow us to create thresholds that focus on a key aspect of weather that is important to the user’s question. Instead of predicting all outcomes (precipitation amount and type), we worry about predicting the outcomes that matter (precipitation type and threshold). Categorical variables require special forecasting methods, which will be highlighted in this lesson. Read on to learn more.

Lesson Objectives

  1. Select an appropriate categorical forecast method given a user scenario.
  2. Describe a logistic regression and fit the model to data in R.
  3. Explain the rules and splits of CART and measure purity.
  4. Build, terminate, and interpret a tree diagram in R.

Newsstand

  • Video on Logistic Regression [23]
  • Tree-Based Models [24]
  • Package 'rpart' [25]
Download [26] all of this lesson’s code and data.
You will be prompted to save a .zip file to your computer.

Problem Definition

Prioritize...

Once you have completed this section, you should be able to define a categorical weather-dependent problem.

Read...

Categorical variables take on one of a limited set (at least two) of fixed possible outcomes. In weather, categorical variables include hurricane categories, tornado strength, and precipitation type. You might think that categorical forecasts are not as important as say a probability forecast, but as we saw in the previous lessons, many probabilistic forecasts can become categorical forecasts, particularly when used to make a cost-based decision. For the same reason, many times a real-world scenario will require a categorical answer, not the value of a continuous variable. Read on to learn about what problems require categorical forecasts.

Overview

For a categorical forecast, there will be two or more possible outcomes. The goal is to predict the outcome category from the weather data. For example, one question may be whether a river will rise enough to overtop a levee. There are two possible outcomes: the river will overtop the levee, or it won’t. Flood preparation decisions would depend on the odds of the levee being overtopped.

Another example of a categorical weather-dependent forecast is with respect to aviation. Based on certain weather conditions, pilots need to know what set of flight rules apply. The rules are Visual Flight Rules (VFR), Marginal VFR (MVFR), Instrument Flight Rules (IFR), and Low Instrument Flight Rules (LIFR). You can learn more about these rules here [27]. Since there are four different flight rules, we have four possible outcomes. Fuel reserve decisions for an executive jet would depend on the odds of each of these categories occurring.

Logistic Regression

Prioritize...

When you have finished reading this section, you should be able to describe a logistic regression, be able to explain the shortcomings of such a model and perform a logistic regression in R.

Read...

The video below is a bit long, but I encourage you to watch the whole thing, as it provides a nice overview of logistic regression.

Click for the transcript of StatQuest: Logistic Regression.

[SINGING] If you can fit a line, you can fit a squiggle. If you can make me laugh, you can make me giggle. StatQuest.

[JOSH STARMER]: Hello, I'm Josh Starmer and welcome to StatQuest. Today we're going to talk about logistic regression. This is a technique that can be used for traditional statistics as well as machine learning. So let's get right to it.

Before we dive into logistic regression, let's take a step back and review linear regression. In another StatQuest, we talked about linear regression. We had some data, weight, and size. Then we fit a line to it and, with that line, we could do a lot of things. First, we could calculate r-squared and determine if weight and size are correlated. Large values imply a large effect. And second, calculate a p-value to determine if the r-squared value is statistically significant. And third, we could use the line to predict size given weight. If a new mouse has this weight, then this is the size that we predict from the weight. Although we didn't mention it at the time, using data to predict something falls under the category of machine learning. So plain old linear regression is a form of machine learning. We also talked a little bit about multiple regression. Now, we are trying to predict size using weight and blood volume. Alternatively, we could say that we are trying to model size using weight and blood volume. Multiple regression did the same things that normal regression did. We calculated r-squared, and we calculated the p-value, and we could predict size using weight and blood volume. And this makes multiple regression a slightly fancier machine learning method. We also talked about how we can use discrete measurements like genotype to predict size. If you're not familiar with the term genotype, don't freak out. It's no big deal. Just know that it refers to different types of mice.

Lastly, we could compare models. So, on the left side, we've got normal regression using weight to predict size. And we can compare those predictions to the ones we get from multiple regression, where we're using weight and blood volume to predict size. Comparing the simple model to the complicated one tells us if we need to measure weight and blood volume to accurately predict size, or if we can get away with just weight. Now that we remember all the cool things we can do with linear regression, let's talk about logistic regression. Logistic regression is similar to linear regression, except logistic regression predicts whether something is true or false instead of predicting something continuous, like size. These mice are obese, and these mice are not. Also, instead of fitting a line to the data, logistic regression fits an S-shaped logistic function. The curve goes from zero to one, and that means that the curve tells you the probability that a mouse is obese based on its weight. If we weighed a very heavy mouse, there is a high probability that the new mouse is obese. If we weighed an intermediate mouse, then there is only a 50% chance that the mouse is obese. Lastly, there's only a small probability that a light mouse is obese. Although logistic regression tells the probability that a mouse is obese or not, it's usually used for classification. For example, if the probability that a mouse is obese is greater than 50%, then we'll classify it as obese. Otherwise, we'll classify it as not obese. Just like with linear regression, we can make simple models. In this case, we can have obesity predicted by weight or more complicated models. In this case, obesity is predicted by weight and genotype. In this case, obesity is predicted by weight and genotype and age. And lastly, obesity is predicted by weight, genotype, age, and astrological sign. In other words, just like linear regression, logistic regression can work with continuous data like weight and age and discrete data like genotype and astrological sign. We can also test to see if each variable is useful for predicting obesity. However, unlike normal regression, we can't easily compare the complicated model to the simple model, and we'll talk more about why in a bit. Instead, we just test to see if a variable's effect on the prediction is significantly different from zero. If not, it means that the variable is not helping the prediction. Psst. We used Wald's test to figure this out. We'll talk about that in another StatQuest.

In this case, the astrological sign is totes useless. That's statistical jargon for not helping. That means, we can save time and space in our study by leaving it out. Logistic regression's ability to provide probabilities and classify new samples using continuous and discrete measurements makes it a popular machine learning method. One big difference between linear regression and logistic regression is how the line is fit to the data. With linear regression, we fit the line using least squares. In other words, we find the line that minimizes the sum of the squares of these residuals. We also use the residuals to calculate R squared and to compare simple models to complicated models. Logistic regression doesn't have the same concept of a residual, so it can't use least squares, and it can't calculate R squared. Instead, it uses something called maximum likelihood. There's a whole StatQuest on maximum likelihood so see that for details, but in a nutshell, you pick a probability scaled by weight of observing an obese mouse just like this curve, and you use that to calculate the likelihood of observing a non-obese mouse that weighs this much. And then, you calculate the likelihood of observing this mouse, and you do that for all of the mice. And lastly, you multiply all of those likelihoods together. That's the likelihood of the data given this line. Then you shift the line and calculate a new likelihood of the data and then shift the line and calculate the likelihood again, and again. Finally, the curve with the maximum value for the likelihood is selected. BAM!

In summary, logistic regression can be used to classify samples, and it can use different types of data like the size and/or genotype to do that classification. And it can also be used to assess what variables are useful for classifying samples, i.e., astrological sign is totes useless.

Hooray! We've made it to the end of another exciting StatQuest. Do you like this StatQuest and want to see more? Please subscribe. If you have suggestions for future StatQuests, well, put them in the comments below. Until next time, Quest on!

Logistic models, as the video discussed, predict whether something is true or false; that is, there are only two outcomes. It is an example of a categorical model. By using several variables of interest, we can create a logistic model and estimate the probability that an outcome will occur.

Overview

Logistic regression estimates the parameters of a logistic model, a widely used statistical model that uses a logistic function to model a binary (only two outcomes) dependent variable. The logistic function follows an S-curve, as the image below shows:

Logistic function demonstrating a s-curve
Example of a logistic function
Credit: Qef (talk) [28] - Created from scratch with Gnuplot

Mathematically, the expected probability that Y=1 for a given value of X for logistic regression is:

$P(Y=1) = \frac{1}{1+e^{-(\beta_0 + \beta_1 X)}}$

The expected probability that Y=0 for a given value of X is:

$P(Y=0) = 1 - P(Y=1)$

The equation turns linear regression into a probability forecast. Instead of predicting exactly 0 or 1 (the event does not happen, or it does), logistic regression generates a probability between 0 and 1 (the likelihood the event will occur). As the output of the linear regression equation increases, the exponential goes to zero so the probability goes to:

$\frac{1}{1+0} = 1$, or 100%

In contrast, as the output of the linear regression equation decreases, the exponential goes to infinity, so the probability goes to:

$\frac{1}{1+\infty} = \frac{1}{\infty} = 0$

Shortcomings

Logistic regression is advantageous in that the regression generates a probability that the outcome will occur. But there are some disadvantages to this technique. For one, logistic regression allows only linear relations or KNOWN interactions between variables. For example:

a*Temp + b*Humidity + c*Temp*Humidity

But we often don’t know what variable interactions are important, making logistic regression difficult to use. In addition, logistic regression only allows for two possible outcomes. You can only determine the odds of something occurring or not.

Example

Let’s work through an example in R. Begin by downloading this data set of daily values from Burnett, TX [29]. The variables included are: the date (DATE), average daily wind speed (AWND), time of the fastest one-mile wind (FMTM), peak gust time (PGTM), total daily precipitation (PRCP), average daily temperature (TAVG), maximum daily temperature (TMAX), minimum daily temperature (TMIN), wind direction of the fastest 2-minute wind (WDF2), wind direction of the fastest 5-second wind (WDF5), the fastest 2-minute wind speed (WSF2), and the fastest 5-second wind speed (WSF5). To review the variables or their units, please use this reference [30].

For this example, we are going to create a logistic model of rain for Burnett, TX using all of these variables. To start, load in the data set.

Show me the code...
# load in daily summary data for Burnett, TX
mydata <- read.csv("Texas_Daily_Summaries.csv")

We only want to retain complete cases, so remove any rows (observational dates) in which any variable has a missing value.

Show me the code...
# obtain only complete cases
mydata <- na.omit(mydata)

# check for missing values
badIndex <- which(is.na(mydata))

‘badIndex’ should be empty (this is a way of checking that all NAs were removed). Now that we have a complete data set, we need to create a categorical variable. I will model two rain outcomes: rain occurred (yes =1) and rain did not occur (no=0). Use the code below:

Show me the code...
# turn precipitation into a categorical variable with 2 outcomes (1 = it rained, 0 = it did not rain)
mydata$rain <- array(0, length(mydata$PRCP))
mydata$rain[which(mydata$PRCP > 0)] <- 1

Now, I want to remove the variables Date and PRCP as we will not use them in the model. Use the code below:

Show me the code...
# remove $PRCP and $DATE
mydata <- subset(mydata, select = -c(1, 5))

We are left with a data frame that only contains the dependent variable (rain) and the independent variables (all other variables). Next, we need to split the data into training and testing. For this example, run the code below to create a 2/3 | 1/3 split.

The function ‘sample.split’ from the package ‘caTools’ randomly selects 2/3rds (0.66) of the observations (set to TRUE). We can then create our training and testing set as follows:
Show me the code...
# create the random 2/3 split, then get training and test data
library(caTools)
split <- sample.split(mydata$rain, SplitRatio = 0.66)
train <- subset(mydata, split == TRUE)
test  <- subset(mydata, split == FALSE)

Now, we can create a logistic model by running the code below:

The function ‘glm’ fits a generalized linear model. We are telling the function to model ‘rain’ from the training data using all the variables (~.) in the training set. The ‘family’ parameter tells the function to use logistic regression (‘binomial’).
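The call might look like the sketch below (the object name ‘modelGLM’ is my choice, not prescribed by the lesson):

# fit a logistic regression of rain on all other variables in the training set
modelGLM <- glm(rain ~ ., data = train, family = "binomial")

# inspect the coefficients and their significance
summary(modelGLM)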

Some of the coefficients are not statistically significant. What can we do? Well, similar to stepwise linear regression, we can perform a stepwise logistic regression. Run the code below:
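A minimal sketch of that step, assuming the logistic model from the previous sketch is called ‘modelGLM’ and the stepwise model is saved as ‘bestModel’ (the object name used in the prediction code that follows):

# stepwise selection based on the AIC (by default, step() works backward from the full model)
bestModel <- step(modelGLM)

# check that the retained coefficients are statistically significant
summary(bestModel)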

The ‘step’ function selects the best model based on minimizing the AIC score. If you summarize the model, you will find that all the coefficients are statistically significant at alpha=0.05. The final set of variables may differ slightly each time you run the analysis, because the random data split changes the training set that the step function works from. But the model will probably include: AWND (average daily wind speed), FMTM (time of the fastest one-mile wind), TMAX (maximum temperature), TAVG (average temperature) or TMIN (minimum temperature), WDF2 (wind direction of the fastest 2-minute wind), WSF2 (fastest 2-minute wind speed), and WSF5 (fastest 5-second wind speed).

We can predict our testing values using the code below:

Show me the code...
# predict testing data
predictGLM <- predict(bestModel, newdata = test, type = "response")

The ‘type’ parameter puts the predictions on the scale of the response variable (probabilities) rather than the default scale of the linear predictor. Unlike linear regression, where we could plot the observed vs. predicted values and assess the fit, we must use alternative assessment techniques here. For categorical forecasts, we can display the confusion matrix (refer back to previous lessons for more details). Run the code below:
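One way to tabulate the confusion matrix, assuming a probability threshold of 0.5 (the threshold is an illustrative choice; try others):

# confusion matrix: rows are the observed outcome (0/1),
# columns are whether the predicted probability exceeds 0.5 (FALSE/TRUE)
table(test$rain, predictGLM > 0.5)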

You should get a similar table to the one below (again, there is a random component, so the numbers will most likely not be identical):

Confusion Matrix for Rain in Burnett, TX (rows: observed; columns: predicted)
              FALSE   TRUE
0 (no rain)     462     48
1 (rain)         80     94

Try changing the probability threshold to see how the confusion matrix changes. Or we can plot the ROC curve using the code below (again, refer back to previous lessons for more details):

Show me the code...
#ROC Curve
library(ROCR)
ROCRpred <- prediction(predictGLM, as.character(test$rain))
ROCRperf <- performance(ROCRpred, 'tpr','fpr')
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2,1.7))

The first line creates a prediction object that pairs the predicted probabilities with the observed labels (as.character(test$rain)). The next line (‘performance’) evaluates the True Positive Rate (TPR) and the False Positive Rate (FPR). The final line plots the assessment, with color representing the probability cutoffs. You should get the following figure:

Figure with colored line representing the probability cutoffs.
ROC Curve for precipitation model in Burnett, TX
Credit: J. Roman

When the probability cutoff is high (near 100%), the false positive rates and true positive rates are low because we have forecasted very few events. When the probability is low, we have high TPR (good) and FPR (bad), since we are forecasting most of the events. The trick here is finding the right threshold for our particular decision cost structure.

Multiple Possible Categorical Outcomes

Prioritize...

Once you have completed this section, you should be able to explain rule-based forecast systems, their use of splits, and how to measure purity.

Read...

Rule-based forecasting is a machine learning method that identifies and uses rules to predict an outcome. The key is finding a set of rules that represent the system you are forecasting. There are numerous rule-based AI methods available. For this lesson, we will illustrate the Classification and Regression Tree (CART) method, which is used quite frequently in the field of weather and climate. Before we dive into CART, let’s first go over some nomenclature.

Rules and Splits

In rule-based forecasting, we use a set of rules that will split the data up and eventually lead to predictions. A rule says: if a specific condition is met, the outcome is Yes; otherwise, the outcome is No. Thus, a rule splits our set of cases into two groups: Yes and No. We’d like these Yes decisions to correspond to the occurrence of one of the categorical outcomes; the same goes for the No decisions.

A perfect rule system would yield splits where all the cases in each group had the same categorical outcome. We call this condition purity. An imperfect rule has a mix of actual outcomes on each side of the split. The better the rule, the closer each side of the split comes to having all cases with the same actual outcome.

Purity

We want our rule-based system to tend toward purity. But how do we actually assess purity? There are all sorts of ways of measuring purity, some much better than others. The one we will focus on in this course is the one with the best theoretical foundation: Shannon’s information Entropy H. Mathematically:

$$H = -\sum_{i=1}^{N} p_i \log_2(p_i)$$

where $i$ is the category, $p_i$ is the fraction of the cases in that category, the sum runs over all $N$ categories, and $\log_2(p_i)$ is the base-2 logarithm of $p_i$.

H goes to 0 as the sample of cases becomes pure (i.e., all of one category). For any category that doesn’t occur in the sample, p is zero and the product p*log2(p) is taken to be zero. For the category that occurs in every case of a pure leaf (the end of the decision tree), p=1, so p*log2(p)=1*0=0. Note that because p<=1, log2(p)<=0, so H>=0, with H=0 only when the sample is pure.

You do not need to remember all the math. Just remember that our goal is to build rules that drive H down as close to zero as we can get it. This is similar to minimizing the AIC score when looking at multiple linear regression models.
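As a concrete illustration (not part of the original lesson code), a small R function that computes H from a vector of category fractions might look like this:

# Shannon entropy in bits; terms with p = 0 contribute 0 by convention
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

entropy(c(1, 0))        # pure sample: H = 0
entropy(c(0.5, 0.5))    # maximally mixed two-category sample: H = 1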

CART Part 1: Defining Rules

Prioritize...

After you have read this section, you should be able to interpret a tree diagram and understand how rules are built.

Read...

As I mentioned earlier, CART, or Classification and Regression Trees, is a rule-based machine learning method. It’s a common AI technique used in the field of weather and climate analytics. We actually used regression trees in Meteo 815 when we discussed data mining, so this should not be the first time you have heard of the method. In this lesson, however, we will dive deeper into the details of CART and explore exactly how the tree forms and how the rules are built that define the tree. Read on to learn more.

A Tree of Rules

In the end, we will have a set of rules that are used to predict the categorical outcome. So, how does the tree of rules work? Each case goes to the first rule and gets sent down one branch of a split (Yes or No). Each of the two branches leads to another rule. The case goes into the next rule and splits again, getting sent down another branch. This process continues to the tips of the branches (leaves).

The goal is to have all the cases that end up at a given branch tip to have the same categorical outcome. Below is an example of a decision tree for flight rules [27].

Decision tree for flight rules. Refer to text below.
Example of a decision tree for flight rules
Credit: J. Roman 

You start at the top with all of the observations (or whatever other inputs you are using). You follow Rule 1, which will lead you down one of the branches. Whichever branch you follow, you then go to Rule 2 which will lead you down two more branches which end up at the branch tip with the final outcome. The branch tip is formally called the leaf.

Building a Rule

Before we can build our tree, we need to build the rules. Each rule takes the form:

If predictor > threshold, then Yes, else No

Thus, we can build different rules using different predictors. And we can create different rules by changing the threshold applied to any one predictor.

We build rules by testing all the predictors to see which one gives us the biggest increase in purity (i.e., the biggest decrease in H). For each predictor, we have to test multiple threshold values to see which works best. Usually, about 20 thresholds spread across the range of a predictor is sufficient. We test each of these predictor-threshold combinations on our entire set of training cases (cases where we know both the predictors and the actual categorical outcome).
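A rough sketch of that search for a single predictor is shown below; the data, names, and thresholds are all made up for illustration:

# hypothetical example data: a numeric predictor (x) and a two-category outcome (y)
set.seed(1)
x <- runif(200, 20, 90)
y <- ifelse(x + rnorm(200, sd = 10) < 45, "freeze", "no freeze")

# Shannon entropy (bits) of a vector of category fractions (same as the earlier sketch)
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

# weighted entropy of the two groups produced by the rule "x > threshold"
split_entropy <- function(x, y, threshold) {
  yes <- y[x >  threshold]
  no  <- y[x <= threshold]
  (length(yes) * entropy(table(yes) / length(yes)) +
   length(no)  * entropy(table(no)  / length(no))) / length(y)
}

# test ~20 thresholds spread across the range of the predictor
thresholds <- seq(min(x), max(x), length.out = 20)
H <- sapply(thresholds, function(t) split_entropy(x, y, t))
thresholds[which.min(H)]   # the threshold giving the biggest drop in H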

Below is an example of a purity calculation for different thresholds.

Minneapolis, MN Predict Tmin less than 32 from Tmax. Refer to text below.
Purity for different maximum temperature thresholds in Minneapolis, MN to predict if the minimum temperature will be less than freezing
Credit: J. Roman

The goal was to predict whether the minimum temperature would be below 32°F in Minneapolis, MN based on the maximum temperature from the day before. I created 12 thresholds starting at 35°F and going up to 60°F. For each threshold, I estimated the probability of observing a Tmin < 32°F given Tmax was between the threshold values. There is an optimal threshold with worse (bigger) values of H on either side (this threshold is 43°F to 45°F). The H value for the optimal threshold is 0.15, which means it does not yield 0 or a perfect purity. This just means that our one forecast rule wasn’t enough to give perfect forecasts on the training data. Hence the need for multiple rules (e.g., CART).

CART Part 2: Growing A Tree

Prioritize...

After you have read this section, you should be able to explain how a tree grows in CART and how this growth is terminated.

Read...

Now that we know how to build rules, we need to combine the rules to create the tree. In addition, we need to know when to stop growing the tree. Read on to finish learning about CART.

Method

Now that you know how to build a rule, you can go ahead and grow a tree. To begin, build a rule on the full training dataset, using it to split the training cases into two subsets. Now go to one of the resulting subsets and repeat the process. Continue this process with all of the branches until you hit the termination criterion.

Termination

How do we know when to terminate? There are multiple ways to determine when to stop building the tree. Some are bad and some are good.

One approach is to drive to purity. This, however, tends to result in so few cases in each leaf that it doesn’t have much statistical robustness. In turn, the tree will not work well on independent data (e.g., in actual operations). Instead, we want to stop growing our tree when the available data no longer supports making additional splits. Another approach is an ad hoc threshold. Instead of solely driving down purity, you set a threshold on the minimum number of cases allowed in a leaf. Although simple, this actually works quite well. What is the optimal threshold? You have to test several values to see which model results in the least overfitting of the data you’ve held out for testing (see previous lessons for how to test for overfitting).

Alternatively, or in addition, we can ‘prune’ a tree back to eliminate splits supported by too few cases. Sometimes this works better, but it takes more computer time since you have to grow the tree out until it is ‘too big’ before stopping. Again, numerous methods have been proposed to control which branches get pruned.

CART in R

In R, you can use the function ‘rpart’ from the package of the same name to grow a tree. Sound familiar? This is the same function used in Meteo 815. I will only provide a summary here. For more details, I suggest you review the Data Mining lesson from Meteo 815. To grow the tree, we use the following code:

rpart(formula, data=, method=, control=)

Where the formula format is:

Outcome~predictor1+predictor2+predictor3+….

You set data to the data frame and method to ‘class’ (this tells the function to use a classification tree as compared to ‘anova’ for a regression tree of a continuous outcome). The ‘control’ parameter is optional but allows you to set parameters for selecting the best model (such as the minimum number of cases allowed in a leaf). Remember, the function will create multiple models and select the best based on purity and any controls listed.
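For instance, a call with an explicit control object might look like the sketch below; the tiny data set and all names are illustrative only, and the control values shown are the rpart defaults:

library(rpart)

# tiny synthetic data set, just to make the call concrete
set.seed(1)
mydf <- data.frame(x1 = runif(100), x2 = runif(100))
mydf$y <- factor(ifelse(mydf$x1 + 0.2 * mydf$x2 > 0.6, "yes", "no"))

# classification tree with explicit control settings:
#   minsplit  - minimum cases a node must have before a split is attempted
#   minbucket - minimum cases allowed in any leaf
#   cp        - complexity parameter: splits that do not improve the fit by cp are skipped
fit <- rpart(y ~ x1 + x2, data = mydf, method = "class",
             control = rpart.control(minsplit = 20, minbucket = 7, cp = 0.01))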

For a review of how to examine the results, please look at the material from Meteo 815 or use this resource [24]. In addition, we can prune a tree by using the ‘prune’ function, which will allow you to avoid overfitting.

CART R Example

Prioritize...

Once you have read this section, you should be able to perform the CART technique on a dataset in R.

Read...

Now, it’s time to apply your knowledge and build a tree in R. Earlier, we fit a logistic model to daily precipitation (did it rain or not?). Sometimes we need to know more than whether it rained or not. In some cases, it might be more important to know whether the precipitation passed a certain threshold. This threshold usually corresponds to an impact; for example, more than an inch of rain in a day might cause stormwater drains to back up. In this section, we will translate the precipitation into a categorical variable and create a CART model for prediction. Read on to learn more!

Example

We are going to continue from our earlier example of rain in Burnett, TX. Again, you can download the data here [29]. To review the variables or their units, please use this reference [30].

Start by loading in the data set and retaining only complete cases.

Show me the code...
# load in daily summary data for Burnett, TX
mydata <- read.csv("Texas_Daily_Summaries.csv")

# obtain only complete cases
mydata <- na.omit(mydata)

# check for missing values
badIndex <- which(is.na(mydata))

Instead of looking at whether or not it rained, let’s look at rain levels. I will create three rain levels, matching the thresholds applied in the code: low (0.25 inches or less), moderate (more than 0.25 and up to 0.5 inches), and high (more than 0.5 inches). Use the code below:

Show me the code...
# create rain levels
rainLevel <-c(0.25, 0.5,1)
# create rain variable
rain <- array("low",length(mydata$PRCP))
# fill in for rain levels
rain[which(mydata$PRCP > rainLevel[1] & mydata$PRCP <= rainLevel[2])] <- "mod"
rain[which(mydata$PRCP > rainLevel[2])] <- "high"

Now save the ‘rain’ variable to the data frame (mydata) and remove $DATE and $PRCP as we will not use them in the model.

Show me the code...
# save rain to data frame
mydata$rain <- as.vector(rain)

# remove $PRCP and $DATE
mydata <- subset(mydata, select = -c(1, 5))

Next, we need to split the data into training and testing. Let’s use the 2/3|1/3 approach as we did previously. Use the code below:

Show me the code...
#create training and validation data from given data
library(caTools)
split <- sample.split(mydata$rain, SplitRatio = 0.66)

#get training and test data
train<- subset(mydata,split == TRUE)
test <- subset(mydata,split == FALSE)

Now we can build the tree by running the code below:
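A minimal sketch of that step (I am assuming the fitted tree is stored as ‘modelTree’, since that is the object name used in the plotting and prediction code further down):

library(rpart)
# grow a classification tree for rain level using all remaining variables;
# method = "class" tells rpart to treat 'rain' as a categorical outcome
modelTree <- rpart(rain ~ ., data = train, method = "class")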

In this case, the best model was selected by the default parameters in ‘control’, which include minsplit=20 (the minimum number of cases a node must contain before a split is attempted), and by tending towards purity. To see all of the parameter defaults, use the help command on ‘rpart.control’. (Again, this was all covered in Meteo 815, so please return to that course if you need a refresher). Please note that each time you run the code, a slightly different tree might grow. That is because there is a random component in the splitting of the data and the growing of the tree. Each time you re-run the code, you re-run the analysis from the beginning, thus creating a new split and a new tree.

We can look at the final model using the print and summary command. The print command will provide the details of the model, while the summary provides information on how the model was selected. Run the code below:
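In sketch form, again assuming the tree object is named ‘modelTree’:

# details of the fitted tree (splits, case counts, and predicted class at each node)
print(modelTree)

# how the tree was built: variable importance, CP table, and node summaries
summary(modelTree)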

The results might differ slightly from the tree formed previously, due to the random component of the analysis. To visualize the tree, run the code below:
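One way to draw the tree with the base ‘rpart’ plotting functions (a sketch; the exact plotting calls used for the figure below may differ):

# plot the tree structure, then label the splits and leaves
plot(modelTree, uniform = TRUE, margin = 0.1)
text(modelTree, use.n = TRUE, cex = 0.8)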

Again, the tree visualized here may be slightly different from the one formed previously. You can save the figure by running the code below (I find it easier to view the final tree by saving the figure and opening it, rather than viewing the output in RStudio):

Show me the code...
# create attractive postscript plot of tree 
post(modelTree, file = "tree.ps", 
     title = "Classification Tree for Burnett, TX")

You should get a figure that looks like this (might not be exactly the same):

Classification tree for Burnett, TX. Refer to text below.
Example of a decision tree to predict precipitation levels in Burnett, TX

In the example above, the first rule is whether the fastest 5-second wind speed (WSF5) is greater than 34. If it’s less than 34, the outcome is labeled as ‘low’. If it is greater than 34, then you test the average wind speed. If the average wind speed is greater than 10.4, then the outcome is low. Otherwise, you test the average temperature. If the average temperature is greater than 79.5, the outcome is low; otherwise, the outcome is high. The numbers separated by dashes in each oval correspond to the number of cases at that node for each level (high/low/moderate).

Finally, if you wish to predict the testing values (or any other values), use the code below. Pay particular attention to the parameter ‘type’ as this must be set to ‘class’ to ensure you predict the appropriate variable type (classification instead of continuous).

Show me the code...
# predict 
predictTree <- predict(modelTree,newdata = test,type='class')

You can then use any of the categorical assessment metrics from the earlier lessons to look at the fit of the model (such as joint histograms and confusion matrix metrics).

Updated Flowchart

Below is an updated version of the flowchart that adds in the categorical forecasting methods we just discussed.

Self Check

Test your knowledge...

Lesson 6: Ensemble Overview

Motivation...

Transcript of the ECMWF 25 Years of Ensemble Prediction video:

DR. ROBERTO BUIZZA: If we go back 25 years ago, we used to have one single forecast issued every day. The problem we had was that we didn't know whether the forecast was going to be accurate or not.

DR. FLORENCE RABIER: When we take observations of the atmosphere, you know, it's like when you have a thermometer in a room – when if you put another thermometer next to it, you will not necessarily have the same reading. Because nothing is perfect. Which means, there's always uncertainty in what we measure and what we predict.

TOM HAMILL: Ed Lorenz was one of the first to really cast in a mathematical framework with which he was able to demonstrate that if you make a tiny error at the initial time, that tiny error is going to grow into a very substantial error.

DR. FLORENCE RABIER: We always say that atmospheric weather prediction is an initial value problem. So, we first determine what is the state of the atmosphere now, what is the initial value, the initial conditions, and then we run a model. Traditionally, people were just using one model, typically in one center. They were just doing an analysis of the atmosphere using one model and then getting the forecast for the next few days. And that was very much what we were doing. We were taking the best image of the atmosphere with the observations, the best model we could get, and we had one shot in the future to say what the forecast is.

Ensemble prediction is about running one model several times; and in particular, here at ECMWF, we were trying to change the initial conditions of the forecast.

DR. ROBERTO BUIZZA: For each point for the globe, instead of having just one value of temperature, we had 50 value of temperature at the initial point. The same for wind, the same for pressure.

DR. FLORENCE RABIER: And this would actually trigger the atmosphere to go in different directions in our forecast.

DR. ROBERTO BUIZZA: Then I can estimate using all of them what's going to be the possible range of forecasts, and this was, for example, extremely helpful in the summer 2003 when Europe experienced extreme hot temperatures. So with our ensemble systems, we're able to predict these extremes two, three weeks ahead.

DR. FLORENCE RABIER: What we really use a lot is tropical cyclones and the trajectory of tropical cyclones, so there you look at it on a map not on a given point, but you really look at the different trajectories of the tropical cyclones.

TOM HAMILL: You probably have heard of hurricane or superstorm Sandy along the east coast of the United States, and that was a case where the ECMWF prediction system provided the heads-up a day or two in advance of our own US Weather Service prediction system. That was a success story.

DR. FLORENCE RABIER: So, I think at the beginning, we were trying to really push the forecast where it was going to be sensitive. Then, we started also to incorporate some idea on the uncertainty in the observations. Sort of quantifying actually what comes from the uncertainty in the observations themselves. And, finally, the third step has been to actually quantify what is the uncertainty in the model itself. We know there are some ranges where some parameters could lie, for instance, and so, it's quantifying these as well.

DR. ROBERTO BUIZZA: Now, if I think about the future, we want to make our models better. We know, for example, that coupling the models to the ocean is key, and we have started doing that. We want to improve the initialization of the model to our estimation of state of the atmosphere. Third big area of work is resolution. So, we need more resolution in the system.

DR. FLORENCE RABIER: We have a very high collaboration goal, that we really think we benefit if we collaborate with other organizations.

TOM HAMILL: Oh, my colleagues and I really think the world of ECMWF. The scientists are really defining the forefront of the operational ensemble predictions, and where the research needs to head to address the remaining deficiencies in ensemble prediction systems.

DR. FLORENCE RABIER: It's fantastic, extremely motivating to know that you're working in a place where you can really make a difference, and whenever we have a good success in our forecast, and we think we might have saved lives using the science to do that. I can't think of anything better to do.

The video above provides an overview of ensemble forecasting at ECMWF. As technology has advanced, forecasting has improved. NWP models can now create multiple runs for the same forecast time, creating an ensemble. Ensembles provide more confidence in forecasts, at least when runs coincide with one another, and tell the forecaster where and when higher uncertainty exists (where the runs diverge). Thus, ensembles can be used to assess the amount of uncertainty in a model forecast and the range of likely outcomes. In this lesson, you will learn more about ensembles, including how to use them, how to assess the model uncertainty, and how to turn the ensemble into a PDF for application purposes. Read on to learn more!

Lesson Objectives

  1. Explain what an ensemble is and discuss its strengths and weaknesses.
  2. Discuss the different sources of model error and how ensembles can be used to assess the resulting forecast uncertainty.
  3. Turn ensembles into PDFs and calibrate the PDFs.

Newsstand

  • Developing a State-of-the-Art Verification Toolkit [32]. Retrieved June 27, 2019
  • Scherrer, S.C., Appenzeller, C., Eckert, P., & Cattani, D. (2004, June). Analysis of the Spread–Skill Relations Using the ECMWF Ensemble Prediction System over Europe [33]. Retrieved June 27, 2019.
  • Whitaker, J.S. and Loughe, A.F. (1998, December). The Relationship between Ensemble Spread and Ensemble Mean Skill [34]. Retrieved December 7, 2018.
  • Peña, M. (2011, March). Ensemble Forecasts and their Verification [35]. Retrieved December 7, 2018.
  • Hersbach, H. (2000, October). Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems [36]. Retrieved December 7, 2018.
Download Icon [37]
Download [37] all of this lesson’s code and data.
You will be prompted to save a .zip file to your computer.

Overview

Prioritize...

After you have read this section, you should be able to define ensemble, discuss various visuals for ensembles, and list the strengths and weaknesses of an ensemble.

Read...

Over the past 20-30 years, forecasting techniques have advanced from each model making a single forecast, to several runs of the model yielding an ensemble of forecasts. An ensemble, multiple runs for the same forecast time, provides more information for a forecaster. In particular, by examining several different model runs, a forecaster can build confidence in the predictions if they coincide or question the outcome if the model runs are spread out. Read on to learn more.

What is an Ensemble and how do we visualize?

An ensemble consists of multiple model runs for the same forecast valid time. Each model run is called a “member” of the ensemble. Most often, but not always, all the member model runs start at the same initial time.

A spaghetti diagram plots a mission-critical contour for each member forecast. Take a look at the example below:

Spaghetti diagram, refer to text below.
Example of a Spaghetti Diagram for an ensemble forecast at specific geopotential heights at 500 hPa
Credit: Malaquias Peña [35]

The diagram above shows the contours of an ensemble forecast at specific geopotential heights at 500 hPa. Note the two contours selected, one roughly corresponding to the edge of really cold weather and the other the edge of tropical weather.

Spaghetti plots allow us to visualize the amount of uncertainty in the forecast by looking at the spread among ensemble members. There is high confidence in the forecast when the members tend to coincide and low confidence when they are spread apart. This is highlighted in the figure above: in the right panel, the forecast has higher confidence along the Gulf Coast than over the Pacific Ocean; in the left panel, it has higher confidence near New York than along the West Coast.

Instead of viewing all the member forecasts on one plot, we can plot the output from each member on a ‘postage stamp’ sized image. This is useful for more complex forecast situations where a small number of contours just won’t suffice.

Take a look at the example below:

ECMWF monthly forecasts. Refer to text below.
Example of a postage stamp plot for the individual members of an ensemble forecasting mean sea level pressure and temperature at 850 hPa
Credit: Malaquias Peña [35]

Each image is the output of an individual member of the ensemble. In this example, we are looking at the mean sea level pressure and temperature at 850 hPa. This type of diagram allows the forecaster to examine all the individual outputs side by side and make comparisons.

Why ensemble?

Why should we use ensembles? Think back to earlier lessons when we discussed probabilistic forecasts. We’d really like the joint PDF of all the weather variables we are interested in. This lets us compute the odds of any combination of outcomes and therefore make decisions based on expected cost. Ensembles can be used to compute those PDFs.

The spread in the PDF can vary with the weather pattern – some patterns are much easier to predict than others. In theory, ensembles can quantify this forecast uncertainty since they know what weather pattern we are in and how it is likely to evolve. One of our big tasks will thus be to determine the spread/skill relationship of our ensemble so we can calibrate it into a good approximation of the true PDF.

Model Error

Prioritize...

Once you have read this section, you should be able to discuss the different sources of model error.

Read...

Every model will have errors. Ensembles tell us the expected value of the error in a statistical sense. By examining an ensemble of model runs that each differs in some respect (differences include initial conditions or small-scale process parametrization), we can assess how the weather, acting through those changes, impacts ensemble spread and thus forecast uncertainty. As with any form of statistical forecasting, this ensemble post-processing requires that we calibrate, in this case calibrating the model spread against observed forecast errors from a large set of training cases.

Sources of Model Error

There are several sources of model error, each of which we would hope to capture with our ensemble. Below is a list of a few.

  1. Nonlinear evolution of the weather pattern – making it sensitive to small errors in the initial weather analysis (3-D multivariate map) from which we start our NWP model run.
  2. Poor approximation of the equations of motion by the algebraic equations used to step the NWP model forward. This isn’t much of a problem with modern NWP models, as we’ve gotten quite good at the finite-difference mathematics used to turn the equations of motion (a set of partial differential equations) into the model equations (a set of algebraic equations).
  3. Poor approximation of all the physics that goes on at scales smaller than the NWP model grid (i.e., poor parametrization). The figure below shows you the smaller scale processes that I’m referring to.
    Smaller scale processes, refer to list below.
    Visual of the small-scale processes that are smaller than NWP model grids
    Credit: U.S. DOE. 2010. Atmospheric System Research (ASR) Science and Program Plan [38].

    The processes include:
    1. short and longwave radiation [39],
    2. condensation and freezing of cloud droplets and various types of precipitation (e.g., rain, snow, graupel, hail, etc.),
    3. turbulent mixing,
    4. cumulus clouds (no operational forecast model has a grid fine enough to resolve all of them),
    5. transfer of heat, moisture, and momentum between the atmosphere and earth.
    These tasks are hard; thus parametrization of them in NWP models is far from perfect.

The ideal ensemble would try to capture the errors due to uncertainty in the initial conditions and those due to uncertainty in the subgrid-scale physics.

Initial Conditions

We can perturb the initial conditions of a model away from the best guess – this is called initial condition diversity. We cannot perturb the conditions by more than the uncertainty in our analysis, which results from our inability to observe the atmosphere completely. In data-rich areas that may not be much, but in data-poor areas, it could be large.

The big question is how to perturb the initial conditions. If you do it at random, the atmosphere just mixes out the fluctuation in minutes to hours. Instead, you need to perturb the initial conditions on those scales, and in those places, that the atmosphere is most likely to respond to. Below is a visual of the scales of motion in the atmosphere from microscale to climate variation.

Scales of motion in the atmosphere, refer to text below.
Visual of temporal and spatial scales of motion in the atmosphere
Credit: The Comet Program [40].

The x-axis is the spatial scale, while the y-axis is the timescale. Generally speaking, we will focus on the mesoscale to synoptic scale for initial perturbations, as these are the most rapidly growing modes (i.e., weather patterns that are inherently unstable).

There are many ways to perturb the initial conditions, but the details are outside the scope of this course.

Physics

For each of the physics parameterizations (e.g., radiation, clouds/precipitation, turbulence, surface, cumulus, etc.) there are multiple “good” choices. Each ensemble member would have a different combination. Again, the details behind these parameterizations are outside the scope of this course. But, if you would like to see an example of how these aspects of ensemble initiations are implemented, please click here [41].

Turning Ensembles into PDFs

Prioritize...

After you have read this section, you should be able to fit parametric distributions to an ensemble, perform a kernel density estimation to fit a non-parametric PDF, and calibrate the resulting PDFs.

Read...

Since each model run produces only one forecast, we need an ensemble of runs large enough so that the resulting histogram of forecasts is a good approximation of the true PDF. The more variables we’re interested in putting in a joint PDF, the harder this is (i.e., Curse of Dimensionality [42]). For either univariate or joint PDFs, you can fight the curse of dimensionality (a bit) by bringing more information to the table - in particular, knowing what family the PDF belongs to (e.g., normal). Read on to learn about the different methods for fitting distributions to ensembles.

Fitting Parametric Distributions

It takes far fewer (10 to 30) ensemble members to fit a known distribution reasonably well than it does to form a histogram that is a good approximation of the true PDF.

Let’s give this a try. To begin, download this data set of GEFS temperature forecasts [43] valid at 00Z on December 1st, 2018 for Atlanta, GA. Please note that I sometimes say GEFS and sometimes I say GFS. It's the same model. Use the code below to load in the NetCDF file.

Show me the code...
# load in GEFS model temperature data
library(RNetCDF)
fid <- open.nc("tmp_2m_latlon_all_20181201_20181201_jzr8IGeTt4.nc")
mydata <- read.nc(fid)

Next, average over the lat and lon to create a temperature estimate for Atlanta, GA.

Show me the code...
# average over Atlanta
AtlantaTemp <- apply(mydata$Temperature_height_above_ground,3,mean)
# convert from kelvin to F
AtlantaTemp <- (9/5)*(AtlantaTemp-273.15)+32

We are left with a temperature value in degrees F for each of the 11 members of the ensemble. Let’s plot the probability density for the temperature. As a reminder, the probability density is a conceptual estimate of the probability of a single event (not to be confused with the probability histogram, which plots the likelihood of an event falling in a bin). Run the code below:
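One way to produce a probability-density plot of the 11 members (a sketch; the exact plotting calls used for the figure below may differ):

# probability-density-scaled histogram of the 11 ensemble member temperatures
hist(AtlantaTemp, probability = TRUE,
     xlab = "Temperature (degrees F)", ylab = "Probability Density",
     main = "GFS Ensemble Temperatures for Atlanta, GA 12/01/2018 00Z")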

Below is a larger version of the figure:

GFS ensemble temperatures for Atlanta, GA 12/01/2018.
Density plot of GFS Ensemble temperature for Atlanta, GA on December 1st, 2018 at 00Z 
Credit: J. Roman

Temperature is generally normally distributed. Let’s try and fit a normal distribution to the ensemble. As a reminder, we use the function ‘fitdist’ from the package ‘fitdistrplus’. If you need more of a review, I suggest you look back at the material in Meteo 815 on distribution fits.

Show me the code...
# estimate parameters for normal distribution
library(fitdistrplus)
NormParams <- fitdist(AtlantaTemp,"norm")

# estimate normal distribution 
NormEstimate <- dnorm(quantile(AtlantaTemp,probs=seq(0,1,0.05)),NormParams$estimate[1],NormParams$estimate[2])

The first part of the code estimates the parameters of the normal distribution based on the data. The second part estimates the actual distribution (for plotting purposes). Now we can plot the probability density and the normal density fit. Use the code below:

Show me the code...
# plot distribution
plot.new()
grid()
par(new=TRUE)
plot(hist(AtlantaTemp,probability =TRUE),xlab="Temperature (degrees F)", ylab="Probability Density",
     main="GFS Ensemble Temperatures for Atlanta, GA 12/01/2018 00Z",
     xlim=c(55.5,56.2),ylim=c(0,3))
par(new=TRUE)
plot(quantile(AtlantaTemp,probs=seq(0,1,0.05)),NormEstimate,type='l',col="dodgerblue4",lwd=3,
     xlab="Temperature (degrees F)", ylab="Probability Density",
     main="GFS Ensemble Temperatures for Atlanta, GA 12/01/2018 00Z",
     xlim=c(55.5,56.2),ylim=c(0,3))

You should get the following figure:

GFS ensemble temperatures for Atlanta, GA 12/01/2018 with normal distribution fit.
Density plot of GFS Ensemble temperature for Atlanta, GA on December 1st, 2018 at 00Z with normal distribution fit overlay
Credit: J. Roman

The blue line is the normal distribution fit. There you have it, a parametric distribution fit (normal) for an ensemble consisting of a small set of members (11).

Kernel Density Estimation

Alternatively, we can just spread out the existing data points (i.e., ensemble member forecasts) until we get a smoothed PDF. This is called a Kernel Density Estimation. We convert individual ensemble members into a smoothed PDF.

PDF of ensemble on the left and KDE on the right.
Visual of how a PDF of an ensemble (left) gets fit by a Kernel Density Estimator (right)
Credit: By Drleft at English Wikipedia [44], CC BY-SA 3.0

The figure above shows the PDF of the ensemble on the left and the KDE on the right (individual ensemble members in red dashed line). For those that are interested, you can view the equations for the KDE here [45].

When performing a KDE, there are a couple of parameters you should know about. First is the bandwidth, which describes how far to spread out each data point. The second is the kernel, which is the function used to spread the points.

Let’s give this a try by applying a KDE to our previous GEFS temperature forecast from Atlanta, GA. We will use the function ‘density’. You can set the bandwidth manually by setting the parameter ‘bw’, or you can use the default, ‘nrd0’, which is the ‘rule of thumb’ bandwidth for a Gaussian KDE. You can learn more about the bandwidth types here [46]. You can also set the ‘kernel’ parameter to a character string that tells the function what smoothing kernel should be used. The default is ‘gaussian’. Run the code below to create the KDE in R and plot the results.
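A minimal sketch of that step using the defaults described above (the histogram overlay and plot labels are my additions):

# kernel density estimate with the default Gaussian kernel and 'nrd0' bandwidth
AtlantaKDE <- density(AtlantaTemp)

# overlay the KDE on the probability-scaled histogram
hist(AtlantaTemp, probability = TRUE,
     xlab = "Temperature (degrees F)", ylab = "Probability Density",
     main = "GFS Ensemble Temperatures for Atlanta, GA 12/01/2018 00Z")
lines(AtlantaKDE, col = "dodgerblue4", lwd = 3)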

You should get the following figure:

GFS ensemble temperatures for Atlanta, GA 12/01/2018. Refer to text below.
Density plot of GFS Ensemble temperature for Atlanta, GA on December 1st, 2018 at 00Z with KDE fit overlay
Credit: J. Roman

The KDE captures the bimodal feature in the distribution that the normal fit from the previous example did not.

KDE has the advantage that you don’t need to know ahead of time what distribution the ensemble members come from. In particular, they don’t have to be normally distributed (even if you used the normal curve for the kernel function, as the example above showed!). This is very handy. But you pay a price for this advantage, which is that it can take more ensemble members to get a good fit to the true PDF using KDE than it does fitting a parametric distribution.

A note on estimating the right bandwidth to use. In the example, we used the default bandwidth. Generally, the more ensemble members you have, the less smoothing you need to do, so the smaller the bandwidth you want to use. To find out more about bandwidth selection, I suggest you review the material here [45]. When in doubt, stick to the default in R which is the rule-of-thumb bandwidth for a Gaussian KDE.

Joint PDFs

Both of these PDF fitting methods can also work for joint PDFs. But the number of ensemble members required to obtain a good estimate of the joint PDF goes up faster than the number of variables in the joint PDF. In other words, it takes more than TWICE as many ensemble members to accurately determine the joint PDF of two variables than it does to determine their two PDFs separately.

We are paying a big price to acquire probabilistic information on the correlation between predictands. This cost is especially high, since NWP runs are one of the most computationally expensive human activities. Below is the joint PDF for a 30-member ensemble of temperature and dew point temperature.

Three plots, refer to text below.
Joint PDF for a 30-member ensemble of temperature and dew point temperature using KDE on each separate variable then computing the joint PDF (left), using a joint KDE function (middle), and an overlay of the density plots (right)
Credit: J. Roman

The left joint density plot is a result of estimating the KDE for each variable separately and then estimating the joint PDF. The middle is the result of estimating the joint PDF directly using a joint KDE function (‘kde2d’ in R). The right panel is an overlay of the two density plots. Notice how different the PDFs are, particularly in the tails of the distribution.
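For reference, a joint KDE along the lines of the middle panel could be computed as sketched below; the temperature and dew point vectors here are simulated stand-ins, not the actual 30-member ensemble used for the figure:

library(MASS)

# simulated stand-in for a 30-member ensemble of temperature and dew point (degrees F)
set.seed(42)
temp <- rnorm(30, mean = 56, sd = 2)
dewp <- temp - abs(rnorm(30, mean = 4, sd = 1))

# two-dimensional kernel density estimate on a 50 x 50 grid
jointPDF <- kde2d(temp, dewp, n = 50)

# contour plot of the joint PDF
contour(jointPDF, xlab = "Temperature (degrees F)", ylab = "Dew Point (degrees F)")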

Calibrating PDFs

By comparing the error of the ensemble mean to the ensemble spread, averaged over many past forecasts, we can calibrate the PDF. Standard deviation is a good metric for ensemble spread – at least for normally distributed ensembles. You can regress the absolute value of the ensemble-mean error against the ensemble standard deviation (via linear regression), and then use the fitted equation to determine how much to scale the ensemble spread up (or down) when deriving the forecast PDF from the one you fit to the ensemble members. Please note that most ensembles are under-dispersive (i.e., the model spread is not as big as the model error).
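To make the procedure concrete, here is a rough sketch with a simulated training archive; all of the data and object names below are hypothetical:

# hypothetical training archive: 200 past forecasts from a 20-member ensemble
set.seed(7)
truth  <- rnorm(200, mean = 50, sd = 10)
spread <- runif(200, 1, 4)                         # true per-case uncertainty
ens    <- truth + matrix(rnorm(200 * 20, sd = spread), nrow = 200)

ensMean <- rowMeans(ens)                           # ensemble mean forecast
ensSD   <- apply(ens, 1, sd)                       # ensemble spread
obs     <- truth + rnorm(200, sd = spread)         # verifying observations

# regress the absolute error of the ensemble mean on the ensemble spread
fit <- lm(abs(obs - ensMean) ~ ensSD)
summary(fit)

# the fitted relationship suggests how much to inflate (or shrink) the ensemble
# spread when turning the fitted ensemble PDF into a calibrated forecast PDF
predict(fit, newdata = data.frame(ensSD = 2.5))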

Verifying Ensembles

Prioritize...

Once you have completed this section, you should be able to create and use a rank histogram to assess the spread of an ensemble and estimate the continuous ranked probability score.

Read...

Once you have an ensemble, you need to evaluate its skill. Two methods of verifying ensembles will be discussed in this lesson: the rank histogram and the continuous ranked probability score. Both methods evaluate the spread of the ensemble and can be used to assess the skill of the forecast. Read on to learn more.

Rank Histogram Method

A rank histogram is a diagnostic tool to evaluate the spread of an ensemble. For each forecast, bins are selected by ranking the ensemble member forecast (lowest to highest). For N ensemble members, there will be N+1 bins - the lowest and highest bins are open-ended. (source: The Rank Histogram [47]) Then you create a frequency histogram by plotting the number of occurrences in each bin.

A rank histogram is a great plot to eyeball if the ensemble is over or under-dispersed, or to see if the ensemble is biased. In R, you can use the package ‘SpecsVerification’ [48] to create a rank histogram.

Below are several different examples of rank histograms. Match them to the most appropriate description. 

Continuous Ranked Probability Score (CRPS)

The Continuous Ranked Probability Score (CRPS) is the PDF equivalent of the MSE. The CRPS rewards forecasts that place more probability near the observed value: the higher the forecast density at the observation, the lower the score. CRPS indicates how close the observation is to the weighted average of the PDF. A lower CRPS indicates better model performance (the PDF matches the distribution of forecast errors) – thus, CRPS lets you compare different ensemble models to pick the best. In R, you can use the ‘SpecsVerification’ package to compute this score.

For more information, I suggest you check out this link [48].

Example

Let’s create these two graphs in R. We will start with the rank histogram. Begin by downloading this GEFS forecast output for NYC [49]. The file contains 24-hour forecast temperatures valid at 00Z from January 1st through November 30th, 2018. Start by loading in the file.

Show me the code...
# load in temperature forecast from GFS for NYC
library(RNetCDF)
fid <- open.nc('tmp_2m_latlon_all_20180101_20181130_jzr8iFlxW9.nc')
mydata <- read.nc(fid)

Now, extract out the forecast valid time.

Show me the code...
validTime <- mydata$intValidTime

We want to match up the observations at the time for which the forecast was valid. First, download this data set of hourly temperature observations [50]. Load in the data set and convert the time variable to match that of the valid time forecast variable (YearMonthDayHour) by using the code below:

Show me the code...
# load in daily temperature observations
mydata2 <- read.csv("NCYTemps2018.csv")
obsTime <- as.POSIXct(strptime(mydata2$Date,"%m/%d/%Y %H:%M"))
tempTime <- format(obsTime,"%Y%m%d%H")

Now, loop through each forecast and pull out the corresponding observation that matches the time the forecast is valid.

Show me the code...
# for each forecast valid time find the corresponding observation time
obsTemp <- array(NA,length(validTime))
for(itime in 1:length(validTime)){
  goodIndex <- which(as.character(validTime[itime]) == tempTime)
  obsTemp[itime] <- mydata2$AirTemp[goodIndex]
}

Next, convert the forecast temperature to Fahrenheit (from Kelvin) and transpose the matrix, so the format is (n,k) where n is the forecast and k is the ensemble member. This is the setup needed to use the rank histogram function in R.

Show me the code...
# convert kelvin to F
forecastTemp <- (9/5)*(mydata$Temperature_height_above_ground-273.15)+32
# transpose forecast temp (n,k) where k is the number of members
forecastTemp <- t(forecastTemp)

Now, we can create the rank histogram. We will first estimate the histogram using the ‘Rankhist’ function, and then plot the histogram using the ‘PlotRankhist’ function. Run the code below:

Show me the code...
# load in library
library('SpecsVerification')
# create rank histogram
rh <- Rankhist(forecastTemp,obsTemp,handle.na="use.complete")
PlotRankhist(rh)

You should get the following figure:

Histogram with too small of a spread.
Rank histogram for GEFS forecasted temperature in NYC for January 1st through November 30th, 2018 valid at 00Z
Credit: J. Roman

This is an example of an ensemble that has too small of a spread.

Now, let’s make a CRPS plot. Begin by downloading this forecast output of temperature for NYC [51]. Again, the forecast output is from January 1st, 2018 – November 30th, 2018 but, this time, we have the output for all forecast times from 0 hours out to 192 hours (8 days). We will use the same observational data set as before. Begin by loading in the temperature forecast.

Show me the code...
# load in temperature forecast from GFS for NYC
library(RNetCDF)
fid <- open.nc('tmp_2m_latlon_all_20180101_20181130_jzr8YY8Qhv.nc')
mydata <- read.nc(fid)

Now, extract the time and select out the forecast hours you are interested in testing. I’m going to look at the 1-day forecast, 2-day, 3-day, and so on to the 8-day forecast. Use the code below to set this up and initialize the CRPS variable.

Show me the code...
time <- as.Date(mydata$time/24,origin="1800-01-01 00:00:00")
fhour <- seq(24,192,24)
tempTime <- format(obsTime,"%Y%m%d%H")

# initialize variable
CRPS <- array(NA,length(fhour))

Now, we are ready to loop through each forecast hour and estimate the mean CRPS. Use the code below:

Show me the code...
# loop through each fhour and estimate the mean CRPS
for(ihour in 1:length(fhour)){
  fhourIndex <- which(mydata$fhour == fhour[ihour])
  # extract out valid times
  validTime <- mydata$intValidTime[fhourIndex,]
  # match up observations
  obsTemp <- array(NA,length(time))
  for(itime in 1:length(validTime)){
    goodIndex <- which(as.character(validTime[itime]) == tempTime)
    if(length(goodIndex) == 0){
      obsTemp[itime] <- NA
    }else{
      obsTemp[itime] <- mydata2$AirTemp[goodIndex]
    }
    
  }
  
  # convert kelvin to F
  forecastTemp <- (9/5)*(mydata$Temperature_height_above_ground[fhourIndex,,]-273.15)+32
  # transpose forecast temp (n,k) where k is the number of members
  forecastTemp <- t(forecastTemp)
  # compute the CRPS
  tempCRPS <- EnsCrps(forecastTemp,obsTemp)
  CRPS[ihour] <- mean(tempCRPS,na.rm=TRUE)
}

The code starts by pulling out the valid times for the particular forecast hour of interest. Then we match up the observations for the valid forecast time. Next, we convert the forecast temperature to Fahrenheit and set up the matrix for the forecast values to be in the format (n,k) where n is the forecast and k is the ensemble member. Finally, you compute the CRPS using the function ‘EnsCrps’ and save the mean value to our CRPS variable.

Now, plot the mean CRPS for each forecast hour using the code below:

Show me the code...
# plot
plot.new()
grid()
par(new=TRUE)
plot(fhour/24,CRPS,xlim=c(1,8),ylim=c(3,4.5),type="b",pch=20,lty=1,lwd=2,
     xlab="Forecast Day",ylab="CRPS (degrees)",
     main="GEFS Temperature Forecast NYC")

You should get the following figure:

GEFS temperature forecast NYC. Refer to text below.
CRPS plot for GEFS forecasted temperature in NYC from January 1st through November 30th, 2018 for various forecast days
Credit: J. Roman

As a reminder, a lower CRPS value is better. Thus, the 1-day forecast is best (as expected) and the 8-day forecast is the worst (again as expected). By adding in more ensemble models (such as the ECMWF) we could assess which model is best, based on which one has the lowest CRPS value, for each forecast hour.

Self Check

Test your knowledge...


Source URL:https://www.e-education.psu.edu/meteo825/node/2

Links
[1] https://www.e-education.psu.edu/meteo825/node/3 [2] https://www.e-education.psu.edu/meteo825/node/6 [3] https://www.e-education.psu.edu/meteo825/node/517 [4] https://www.e-education.psu.edu/meteo825/node/524 [5] https://www.e-education.psu.edu/meteo825/node/525 [6] https://www.e-education.psu.edu/meteo825/node/526 [7] https://www.e-education.psu.edu/meteo825/node/527 [8] https://www.e-education.psu.edu/meteo825/node/694 [9] http://www.cawcr.gov.au/projects/verification/ [10] https://web.archive.org/web/20210805162759/https://www.weather.gov/mdl/verification_home [11] https://www.swpc.noaa.gov/sites/default/files/images/u30/Forecast%20Verification%20Glossary.pdf [12] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EXBlkrpYMUxNv8YLarODitkBnUWLeHUGJjePI8D1xgeYWA?download=1 [13] http://www.emc.ncep.noaa.gov/gmb/STATS_vsdb/ [14] https://www.emc.ncep.noaa.gov/users/verification/global/ [15] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EUQLlzJEpPJJv37X9Fz7gr4BdJaTJ78NLH2r80XJ5M7i3A?download=1 [16] https://mesonet.agron.iastate.edu/mos/fe.phtml [17] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/Eehp7Qx3rn9AhBAreH6QMVMBrmKJ6vzb-FHuhSr6e5UU-g?download=1 [18] https://mesonet.agron.iastate.edu/agclimate/hist/dailyRequest.php [19] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EcSL6ma8QoRLnlwbiGmyrPUBKuKgbW4Y9nrsd3FDMcx8IA?download=1 [20] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EQbOuwUtTnFAj3-RVkY81owBlu4LufB5nXaQTXZh9FAe2g?download=1 [21] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/ESF0fs7v6ZdHpQnDJEMYNd0BiXYMbUz-9pl2s7X013y4Ig?download=1 [22] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EYbTz0Lj4atMvdbmFNO2lncBCYuOwCzw2EODPRXGpPYukQ?download=1 [23] https://youtu.be/u-_4S5l-Cuw [24] https://www.statmethods.net/advstats/cart.html [25] https://cran.r-project.org/web/packages/rpart/rpart.pdf [26] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/ESW1xP9yjJZJjC-QZV1MyQQBDYjR3ozt86J4OGzQS8tWjg?download=1 [27] https://web.archive.org/web/20231002185827/https://aviationweather.gov/taf/help?page=plot [28] https://commons.wikimedia.org/w/index.php?curid=4310325 [29] https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/sas405_psu_edu/EY7OSE50zbtCgwns32zdFZMB3eEHKcBjkwhsiOTEym6HJA?download=1 [30] https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/GHCND_documentation.pdf [31] http://ai-depot.com/Tutorial/RuleBased-Methods.html [32] http://dtcenter.org/testing-evaluation/verification [33] http://journals.ametsoc.org/doi/10.1175/1520-0434%282004%29019%3C0552%3AAOTSRU%3E2.0.CO%3B2 [34] http://journals.ametsoc.org/doi/abs/10.1175/1520-0493(1998)126%3C3292%3ATRBESA%3E2.0.CO%3B2 [35] http://www.atmos.umd.edu/~ekalnay/syllabi/AOSC630/ensemble101.pdf [36] https://journals.ametsoc.org/doi/pdf/10.1175/1520-0434%282000%29015%3C0559%3ADOTCRP%3E2.0.CO%3B2 [37] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EVs-Uor5YERNrx87DQkUamsBl4EMFyVkh29dQTTNBOTINA?download=1 [38] http://asr.science.energy.gov/ [39] https://science.nasa.gov/ems/13_radiationbudget [40] http://www.comet.ucar.edu/ [41] https://www.ncei.noaa.gov/products/weather-climate-models/global-ensemble-forecast [42] https://en.wikipedia.org/wiki/Curse_of_dimensionality [43] 
https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/ES7di-4CYopEmIredWzzcNYB8Wg43gZBIbPUFfawe2jStw?download=1 [44] https://commons.wikimedia.org/w/index.php?curid=57332968 [45] https://en.wikipedia.org/wiki/Kernel_density_estimation [46] https://stat.ethz.ch/R-manual/R-devel/library/stats/html/density.html [47] https://www.esrl.noaa.gov/psd/people/tom.hamill/MSMM_vizverif_hamill.pdf [48] https://cran.r-project.org/web/packages/SpecsVerification/SpecsVerification.pdf [49] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EXKYPpN_tAVIlrkq-Pff0BsBrfe8cDOgTLKktdJYO-eR5w?download=1 [50] https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/sas405_psu_edu/EXDGcQcvzghLsCKhXKv95-0B8x4RSUz4XjU-2foAmE4D_A?download=1 [51] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EfLu7edGHlhOu0DDsLdi65sBGH1DCGRTMczuhTI_d4TFbA?download=1