Forecasts vary widely in skill: between different systems, different seasons, from day to day, etc. Thus, we need to measure how well competing forecast systems typically do. And, maybe more importantly, how far off they could potentially be. Furthermore, we want to know how well a particular forecast system is meeting our needs. This is key, as needs differ.
For example, one rainfall forecast user, say an emergency manager, may be concerned about a rare extreme event which could lead to flooding, while another rainfall forecast user, say a gardener, might be concerned about the more normal rainfall amounts. They are both concerned about the rainfall forecast, but their needs are very different. So, we need to be able to tailor our skill measurements to take into account different user scenarios. Continuing our example, the emergency manager may be concerned only with rainfall amounts over 2 inches in 24 hours and so would need to measure the accuracy of a yes/no or probabilistic forecast that rainfall would exceed this threshold. In contrast, the gardener would want to measure how many inches the forecast was typically off by and whether it was biased wet or dry. Again, same forecast but different need and, thus, different skill assessment. In this lesson, I encourage you to develop the skill of selecting which metric is the most appropriate given the situation. This is not an easy task as there are many moving parts, but if you carefully consider the goal of the forecast and the intended use, you will succeed.
Once you have finished this section, you should be able to state when in the forecast system development and testing we should measure forecast skill and discuss the general goal of the forecast skill metric for each phase of the development and maintenance cycle.
You might think that we measure forecast skill after the forecast has been made. Although that is correct, we actually measure forecast skill throughout the development and selection process. This ensures that not only do we begin running a suitable model, but we routinely verify that the forecast system is still performing as well as it did during testing. So instead of thinking of measuring forecast skill as a static one-time event, I encourage you to look at the task as being dynamic - a constantly ongoing process. Read on to learn more about the three key phases in which we measure forecast skill.
Every statistical forecast system contains one or more tunable parameters that control the relationship between the predictors and the predictand. Developing a forecast system consists of using ‘training’ data (i.e., developmental data) to find the ‘best’ values for these parameters. For example, take a look at the figure below that shows two linear regression fits (green and red lines) to the data (black dots).

In linear regression, these parameters are the slope coefficients (circled in blue) and the intercept (circled in orange). But what is ‘best’? How were these coefficients selected? ‘Best’ is defined in terms of some measure of forecast skill (like the AIC score used in the example above) - or the reverse, forecast error.
In linear regression, we solve a set of equations once to find the best values of the parameters. Thus, the parameter tuning is done analytically. But we can also tune the parameters iteratively. We use a set of equations repeatedly to gradually improve initial guesses of the parameter’s value. This approach is used in logistic regression or other non-linear models.
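To make the distinction concrete, here is a small sketch (using synthetic data, not the lesson’s) that contrasts the two approaches in R: ‘lm’ finds its coefficients with a single analytic solve, while ‘glm’ with a binomial family tunes its coefficients iteratively and reports how many iterations it needed.
# hypothetical illustration: analytic vs. iterative parameter tuning
set.seed(1)
x <- rnorm(100)
y <- 2 + 3*x + rnorm(100)             # continuous predictand
z <- rbinom(100, 1, plogis(1 + 2*x))  # binary predictand
# linear regression: coefficients found analytically (one solve)
linFit <- lm(y ~ x)
coef(linFit)
# logistic regression: coefficients found iteratively
logFit <- glm(z ~ x, family = binomial)
logFit$iter   # number of iterations used to converge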
Before we move on to the next phase of measuring forecast skill, let’s talk briefly about training vs. testing data. As a reminder, training data is the data used to tune the model parameter while the testing data is independent of the training data and is used to assess how well the model performs in operations. Generally, however, we will assess the skill on both the training and testing data. Why?
We can use the ratio of the error on the testing data to that on the training data to assess whether the model is overfit. As a reminder, overfitting occurs when the model is fit too closely to a limited set of data points. (You can review the subject of overfitting in the Meteo 815 lessons on Linear Regression and Data Mining.) If the ratio is much larger than 1, then the model is overfit. Take a look at the figure below:
As the degree of the polynomial fit increases, the ratio becomes larger. An overfit polynomial might look great in MSE on the training data but do much more poorly on independent data, indicating that it will do poorly in operations (where the data is also independent of that used to train the model). This will be discussed in more detail in future lessons. The key here is that we can test if a model has been overfit, but the requirement is that we MUST have testing data that is independent of the training data.
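As an illustration of this check (again with synthetic data, not the lesson’s), the sketch below fits polynomials of increasing degree and prints the test-to-train MSE ratio; a ratio well above 1 flags overfitting.
# hypothetical sketch: test/train error ratio as a check for overfitting
set.seed(42)
x <- seq(0, 1, length.out = 60)
y <- sin(2*pi*x) + rnorm(60, sd = 0.3)
trainIndex <- sample(1:60, 40)
train <- data.frame(x = x[trainIndex], y = y[trainIndex])
test <- data.frame(x = x[-trainIndex], y = y[-trainIndex])
for (degree in c(1, 3, 9, 15)) {
  fit <- lm(y ~ poly(x, degree), data = train)
  mseTrain <- mean((train$y - predict(fit, train))^2)
  mseTest <- mean((test$y - predict(fit, test))^2)
  cat("degree", degree, "test/train MSE ratio:", round(mseTest/mseTrain, 2), "\n")
}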
Often, we have several competing sources of forecasts available. For example, take a look at the figure below:
The figure shows the mean RMSE for several three-hour forecasts of 500 mb heights in the Northern Hemisphere (20°N through 80°N). You can find information about the models here [14]. In this example, the ECM (European Center for Medium-Range Weather Forecasts) has the lowest average RMSE for the verification period and would thus be a better choice than say the CFSR (Legacy GFS used for Climate Forecast System Reanalysis).
Now, take a look at this figure:
Again, this is the RMSE for several three-day forecasts of 500 mb heights, but in this case, we are looking at the mean over the tropics (20°S to 20° N). In this example, the JMA (Japan Meteorological Agency) has the lowest average RMSE over the whole verification period and thus would be the best model for predicting the tropics. Same models, same variable, same forecast hour, different purpose (tropics vs. Northern Hemisphere). As you can see, comparing models will be imperative because the same model may not be the best model across all user scenarios.
We need to decide which set of forecasts provides the most valuable guidance for the decision of interest. What we need is a quantitative measure of forecast skill so that we can make this comparison objectively. And this measure needs to be designed to fit the customer; that is, who needs the forecast, and what is their purpose (think back to our earlier example of the gardener and the emergency manager)?
In short, our forecast scoring metric needs to measure those aspects of skill that are important for our decision. For example, if we were forecasting frost for citrus growers, we don’t care about the forecast skill at warmer temperatures. We only care about the skill on days cool enough to have a chance of frost. Thus, one size most definitely does not fit all.
The forecast skill metric must be selected or designed to capture that which is of value to the customer! For example, this could include the quality of the decisions that need to be made based on the prediction (often as measured by the financial outcome of a set of such decisions). We will discuss this in more detail later on. The key here is that the skill metric is greatly dependent on the situation at hand. You need to master the skill of understanding the customer’s needs and turning that into a quantitative, easy-to-use metric capturing the forecast’s value for decision making.
You might think that once the forecast system has been selected, you won’t need to assess the skill. But routine verification is used to make sure the forecast system is still performing as well as it did during testing. Remember, skill assessment is dynamic. For example, the available sources of input data can evolve with time, changing the forecast skill for better or worse. This is an important issue when using either observational data or numerical weather prediction model output as input to your forecast system. In addition, during forecast system use, we can determine if there is a seasonality or location dependence in the forecast system’s skill.
Now that we know when to measure forecast skill, we need to learn how to measure it. Read on to learn more about the different techniques for measuring forecast skill.
After you have read this section, you should be able to assess the skill of a deterministic forecast using both graphical methods and statistical metrics.
So far in this lesson, we have highlighted when we need to assess a forecast. A natural follow-on is how do we assess a forecast? There are many different ways to test the forecast skill, both qualitatively and quantitatively. The choice of technique depends upon the type of predictand and what we're going to use the forecast for. In this section, we will discuss techniques used specifically for deterministic forecasts, which are forecasts in which we state what is going to happen, where it will happen and when (such as tomorrow’s temperature forecast for New York City).
As always, plotting the data is an essential and enlightening first step. In this case, the goal is to compare each forecast to the corresponding observation. There are two main types of plots we will focus on for deterministic forecasts: scatterplots and error histograms.
Scatterplots, which you should remember from Meteo 815 and 820, plot the forecast on the horizontal axis and the corresponding observation on the vertical axis. Scatterplots allow us to ask questions such as:
Let’s work through an example. We are going to create a scatterplot of forecasted daily maximum temperature versus observed daily maximum temperature in Ames, IA. Begin by downloading these predictions [15] of daily MOS (Model Output Statistics) GFS (Global Forecast System) forecasted maximum temperature (degrees F) for Ames, IA. You can read more about the forecast here [16]. In addition, download this set of observations [17] of daily maximum temperature (degrees F) for Ames, IA. You can read more about the observational dataset here [18].
Now, let’s start by loading in the data using the code below:
Your script should look something like this:
### Example of MOS Max Temp (https://mesonet.agron.iastate.edu/mos/fe.phtml)
# load in forecast data
load("GFSMOS_KAMW_MaxTemp.RData")
# load in observed data (https://mesonet.agron.iastate.edu/agclimate/hist/dailyRequest.php)
load("ISUAgClimate_KAMW_MaxTemp.RData")
I have already paired the data (and all data in future examples) so you can begin with the analysis. Let’s overlay a linear model fit on the scatterplot. To do this, we must create a linear model between the observations and the predictions. Use the code below:
# create linear model
linearMod <- lm(obsData$maxTemp ~ forecastData$maxTemp, na.action = na.exclude)
Now, create the scatterplot with the forecasted maximum temperatures on the x-axis and the observed on the y-axis by running the code below.
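The code box from the original page is not reproduced here, but a minimal sketch consistent with the description (the axis labels and the blue one-to-one reference line are my assumptions) would be:
# scatterplot of forecast (x) vs. observation (y)
plot(forecastData$maxTemp, obsData$maxTemp,
     xlab = "Forecasted Max Temperature (F)", ylab = "Observed Max Temperature (F)",
     main = "GFS MOS vs. Observed Daily Max Temperature, Ames, IA")
# overlay the linear fit and a one-to-one reference line
abline(linearMod, col = "red")
abline(0, 1, col = "blue", lty = 2)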
You should get the following figure:
The linear fit is quite good, and you can see we do not need to apply any correction.
The second type of visual used in assessing deterministic forecasts is the error histogram. The error for each forecast-observation pair is simply:
error = forecast - observation
An error histogram, as the name suggests, is a histogram plot of the errors. You can ask the following questions when using an error histogram:
Now, let’s work on an example using the data from above. To start, calculate the error for each forecast and observation.
# calculate error
error <- forecastData$maxTemp - obsData$maxTemp
Now, estimate the 1st, 5th, 10th, 90th, 95th, and 99th percentiles which we will include on the error histogram.
# estimate the percentiles (1st, 5th, 10th, 90th, 95th, and 99th)
errorPercentiles <- quantile(error,
                             probs = c(0.01, 0.05, 0.1, 0.9, 0.95, 0.99),
                             na.rm = TRUE)
Now, plot the error histogram using the ‘hist’ function. Run the code below:
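The plotting code is not shown on this page; a minimal sketch, assuming the percentiles are marked with vertical dashed lines, would be:
# error histogram with the percentiles marked
hist(error, breaks = 30,
     main = "Histogram of Forecast Errors",
     xlab = "Forecast - Observed Max Temperature (F)")
abline(v = errorPercentiles, col = "red", lty = 2)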
You should get the following figure:
The figure above shows the frequency histogram of the errors with confidence intervals at 1% (1st and 99th percentile), 5% (5th and 95th percentile), and 10% (10th and 90th percentile). The errors appear to be normally distributed.
Now that we have visually assessed the forecast, we need to perform some quantitative assessments. As a note, most of the statistical metrics that we will discuss should be a review from Meteo 815. If you need a refresher, I suggest you go back to that course and review the material.
There are four main metrics that we will discuss in this lesson.
# initialize errorRef variable
errorRef <- array(NA, length(obsData$Date))
# create climatology of observations and reference error (anomaly)
for(i in 1:12){
  index <- which(as.numeric(format(obsData$Date, "%m")) == i)
  # compute the monthly mean (climatology)
  monthMean <- mean(obsData$maxTemp[index], na.rm = TRUE)
  # estimate the anomaly (reference error)
  errorRef[index] <- obsData$maxTemp[index] - monthMean
}
# reference MSE, using only the non-missing anomalies
mseRef <- sum((errorRef^2), na.rm = TRUE)/sum(!is.na(errorRef))
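The reference MSE computed above is normally paired with the MSE of the forecast itself to form a skill score relative to climatology. The remaining code is not shown here, but a sketch assuming the standard MSE skill score (1 minus the ratio of the forecast MSE to the reference MSE) would be:
# forecast error and MSE (variable names follow the earlier example)
error <- forecastData$maxTemp - obsData$maxTemp
mseForecast <- sum(error^2, na.rm = TRUE)/sum(!is.na(error))
# MSE-based skill score: 1 = perfect, 0 = no better than climatology, < 0 = worse
skillScore <- 1 - mseForecast/mseRef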
After you have read this section, you should be able to assess the skill of a categorical forecast using the appropriate techniques.
You will most likely encounter other forecast types beside deterministic. Categorical forecasts, such as the occurrence of rain, are another common type of forecast in meteorology that relies on a specific set of metrics to assess the skill. Read on to learn about the techniques commonly used to assess categorical forecasts.
For categorical forecasts, scatterplots and error histograms do not make sense. So, the exploratory error analysis is accomplished in a different way. Typically, we compute the joint histogram of the forecast categories and observed categories. The forecast categories are shown as columns, and the observed categories are rows. Each cell formed by the intersection of a row and column contains the count of how many times that column’s forecast was followed by that row’s outcome.
Although the joint histogram is formally called the confusion matrix, it need not be confusing at all. The cells on the diagonal (upper left to lower right) count cases where the forecast was correct. The cell containing event forecasts that failed to materialize captures ‘false alarms’ (bottom left), and the cell containing the count of events that were not forecasted captures ‘missed events’ (top right). Below is an example of the matrix where a, b, c, and d represent the count of each cell.
| Total Count=N=a+b+c+d | Forecast: (Event Occurred) | Forecast: (Event Did Not Occur) |
|---|---|---|
| Observed: (Event Occurred) | TP=a | FN=b |
| Observed: (Event Did Not Occur) | FP=c | TN=d |
Let’s work through each cell starting in the top left. TP stands for true positive - the event was forecasted to occur and it did. FN stands for false negative - the event was not forecasted to occur, but it did (missed events). FP stands for false positive - the event was forecasted to occur, but it did not occur (false alarms). Finally, TN stands for true negative - the event was not forecasted to occur, and it did not occur. We will use these acronyms later on, so I encourage you to make use of this table and go back to it when necessary.
All of the popular forecast metrics for categorical forecasts are computed from this matrix, so it’s important that you can read the matrix (it is not necessary to memorize, although you will probably end up memorizing it unintentionally). The confusion matrix can be generalized to forecasts with more than two categories, but we won’t go into that here.
Let’s work through an example. We are going to look at daily rainfall in Ames, IA from the MOS GFS and compare the predictions to observations (same data source as the previous example). You can download the rainfall forecast here [19] and the observations here [20]. A 1 means rain occurred (was predicted) and a 0 means rain did not occur (was not predicted).
Begin by loading in the data using the code below:
# load in forecast data
load("GFSMOS_KAMW_rain.RData")
# load in observed data (https://mesonet.agron.iastate.edu/agclimate/hist/dailyRequest.php)
load("ISUAgClimate_KAMW_rain.RData")
Now compute the values of a, b, c, and d for the confusion matrix by running the code below.
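The code box is not reproduced here; a sketch of the counting step, assuming the paired rain indicators are stored in forecastData$rain and obsData$rain (adjust to your actual variable names), would be:
# count each cell of the confusion matrix
a <- sum(forecastData$rain == 1 & obsData$rain == 1, na.rm = TRUE)  # TP: rain forecast, rain observed
b <- sum(forecastData$rain == 0 & obsData$rain == 1, na.rm = TRUE)  # FN: no rain forecast, rain observed
c <- sum(forecastData$rain == 1 & obsData$rain == 0, na.rm = TRUE)  # FP: rain forecast, no rain observed
d <- sum(forecastData$rain == 0 & obsData$rain == 0, na.rm = TRUE)  # TN: no rain forecast, no rain observed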
If we fill in the matrix, we would get the following table:
| Total Count=N=a+b+c+d=3653 | Rain Predicted | No Rain Predicted |
|---|---|---|
| Rain Observed | TP=489 | FN=422 |
| No Rain Observed | FP=326 | TN=2396 |
You should also make sure a+b+c+d adds up to the length of the data set. Run the code below to check.
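A one-line check, assuming the observation data frame carries a Date column as in the earlier examples, might look like:
# the four counts should add up to the number of paired days
(a + b + c + d) == length(obsData$Date)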
And there you have it! Your first confusion matrix.
Now that we have the confusion matrix, we want to create some quantifiable metrics that allow us to assess the categorical forecast. The first metric we will talk about is the Percent Correct. Percent Correct is just the ratio of correct forecasts to all forecasts. It answers the question: overall, what fraction of the forecasts were correct? The range is 0 to 1 with a perfect score equal to 1. The closer to 1 the better. Mathematically (referring back to the confusion matrix):
Percent Correct = (a + d)/N = (TP + TN)/N
It is a great metric if events and non-events are both fairly likely and equally important to the user. But for rare events, you get a great percent correct value by always forecasting non-events. That is not helpful at all. Likewise, if missed events and false alarms have different importance to the user (think tornados) then weighting them equally in the denominator isn’t a good idea.
Therefore, numerous other metrics have been derived that can be computed from the confusion matrix. One example is the Hit Rate. The Hit Rate is the ratio of true positives to the sum of the true positives and false negatives. It answers the question: what fraction of the observed ‘yes’ events were correctly forecasted? The range is 0 to 1 and a perfect score is 1. The closer to 1 the better. Mathematically:
Hit Rate = a/(a + b) = TP/(TP + FN)
Another metric is the False Alarm Rate, which measures how often the forecast predicted an event that was not observed. Think of the False Alarm Rate as the forecast system crying wolf. It answers the question: what fraction of the observed ‘no’ events were incorrectly predicted as ‘yes’? The range is 0 to 1 with a perfect score of 0. The closer to 0 the better. Mathematically:
False Alarm Rate = c/(c + d) = FP/(FP + TN)
Another metric is the Threat Score: the number of correctly forecasted events divided by the total number of cases in which the event was forecasted and/or observed (hits plus misses plus false alarms). It answers the question: how well did the forecasted ‘yes’ events correspond to the observed ‘yes’ events? The range is 0 to 1 with a perfect score of 1. The closer to 1 the better. Mathematically:
Threat Score = a/(a + b + c) = TP/(TP + FN + FP)
The threat score is more appropriate than percent correct for rare events.
Now let’s try computing these ourselves. Continuing on with the rainfall example, run the code below to estimate the four different metrics discussed above.
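The original code box is not shown here; using the counts a, b, c, and d from the confusion matrix above, the four metrics can be computed as:
# categorical forecast metrics from the confusion-matrix counts
percentCorrect <- (a + d)/(a + b + c + d)
hitRate <- a/(a + b)
falseAlarmRate <- c/(c + d)
threatScore <- a/(a + b + c)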
The percent correct is 0.79 (perfect score = 1), the hit rate is 0.53 (perfect score = 1), the false alarm rate is 0.12 (perfect score = 0), and the threat score is 0.39 (perfect score = 1). You can see that the percent correct is quite high, but upon further investigation, we see this might be biased due to the large number of predicted and observed ‘no’ events (TN).

Once you have completed this section, you should be able to quantitatively and qualitatively assess a probabilistic forecast.
Now, we are on to the final forecast type, which we will discuss in this lesson. A probabilistic forecast is much like a categorical forecast, except that the forecast system is allowed to express its uncertainty by giving the probability of an event instead of just a yes or a no. While this provides more useful information for the decision maker, it requires different verification procedures, more closely related to those we used for continuous variables. Read on to learn more about how to assess the skill of a probabilistic forecast.
The basic metric for a probabilistic forecast is a variant of the MSE called the Brier Score (BS). It measures the mean squared difference between the predicted probability of the event and the actual outcome. Mathematically:
BS = (1/N) * sum of (f_i - o_i)^2 over all N forecasts
where f_i is the forecasted probability of the event occurring (e.g., a 25% chance) and o_i is the actual outcome (0 if the event does not occur and 1 if it does). The range of BS is 0 to 1 with a perfect score being 0. The closer to 0 the better.
Let’s try estimating the BS for a Probability of Precipitation (PoP) forecast. Again, we will use MOS GFS output and observations from Ames, IA using the same data sources from before. The model predictions can be downloaded here [21] and the observations can be downloaded here [22]. Begin by loading in the data:
# load in forecast data
load("GFSMOS_KAMW_PoP.RData")
# load in observed data (https://mesonet.agron.iastate.edu/agclimate/hist/dailyRequest.php)
load("ISUAgClimate_KAMW_PoP.RData")
Now, turn the predicted probability of precipitation into a decimal value (instead of a percentage).
# turn PoP into a decimal value
f <- forecastData$pop/100
Now, save the observed value (1 if it rained and 0 if it did not).
# save observed value
o <- obsData$rainfall
Next, estimate the difference squared for each data point.
# estimate the squared difference
errorSquared <- (f - o)^2
Finally, sum up and divide by the length to estimate the Brier Score.
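A minimal sketch of that final step (here dividing by the number of non-missing forecast-observation pairs) would be:
# Brier Score
BS <- sum(errorSquared, na.rm = TRUE)/sum(!is.na(errorSquared))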
In this example, the BS is 0.15, which is relatively close to 0 suggesting a good forecast skill.
BS does not, however, capture the value of a probabilistic forecast to a particular user. The user will be picking some threshold probability above which to assume the event will occur. They’ve got to decide which threshold is best for their particular decision.
Generally, this depends on how expensive false alarms are relative to missed events. For example, consider a hurricane shut-down decision for an oil refinery. If you shut the refinery down when you didn’t have to (false alarm), you will lose several days of production. On the flip side, if you were flooded out while operating (missed events), you might have considerable repairs.
So, we need a way to measure the tradeoff between false alarms and missed events as we change our threshold probability. If you set the threshold near 0%, then you’ll always forecast the event and will have fewer missed events and more false alarms. If you set the threshold high, you will increase the number of missed events but reduce the number of false alarms.
How do we visualize this? One way is with a Receiver Operating Characteristic (ROC) curve. A ROC curve plots the fraction of events correctly forecasted (True Positive Rate - TPR) on the vertical axis versus the fraction of the non-events which were forecasted to be events (False Positive Rate - FPR) on the horizontal axis. The TPR is the True Positive Rate and is given by:
TPR = TP/(TP + FN)
The FPR is the False Positive Rate (also called the false alarm rate) and is given by:
FPR = FP/(FP + TN)
Let’s give this a try with our PoP example. To begin, initialize variables TPR and FPR using the code below:
# initialize variables
truePositiveRate <- {}
falsePositiveRate <- {}
Now, create an index that contains the instances when rain was observed and when it wasn’t.
# create rain and no-rain indices
rainIndex <- which(o == 1)
noRainIndex <- which(o == 0)
Now, create a list of threshold probabilities to use for creating the curve.
# select the probability thresholds
probs <- seq(0, 1, 0.1)
Now, loop through the probabilities and estimate the TP and FP rates.
# loop through probabilities and estimate the true positive and false positive rates
for(i in 1:length(probs)){
  # forecast "yes" when the PoP is at or above the threshold
  popIndex <- which(f >= probs[i])
  noPopIndex <- which(f < probs[i])
  TP <- length(intersect(popIndex, rainIndex))
  FP <- length(intersect(popIndex, noRainIndex))
  FN <- length(intersect(noPopIndex, rainIndex))
  TN <- length(intersect(noPopIndex, noRainIndex))
  truePositiveRate[i] <- TP/(TP + FN)
  falsePositiveRate[i] <- FP/(FP + TN)
}
So for each probability threshold, you found when that probability or higher was forecasted (popIndex). Then, you counted the number of times TP, FP, FN, and TN occurred based on that probability threshold (you essentially created the confusion matrix). Finally, you estimated the rate.
Now, we plot the false positive rate on the x-axis and the true positive rate on the y-axis by running the code below.
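The plotting code is not reproduced here; a sketch that matches the description below (a one-to-one line in blue and the thresholds shown as text labels) would be:
# ROC curve
plot(falsePositiveRate, truePositiveRate, type = "b",
     xlab = "False Positive Rate", ylab = "True Positive Rate",
     xlim = c(0, 1), ylim = c(0, 1))
# one-to-one reference line and threshold labels
abline(0, 1, col = "blue")
text(falsePositiveRate, truePositiveRate, labels = probs, pos = 3, cex = 0.8)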
I’ve included a one-to-one line in blue. I also displayed the probability threshold as a text. Below is a larger version of the figure:
As we set our threshold probability high, we end up at the lower left of the curve with few cases forecasted as events, so low rates of both true and false positive. In contrast, if we set our threshold probability low, we forecast the event most of the time, so we move to the upper right end of the ROC curve with high rates of true positive (good) and false positive (bad). The trick here is finding the right threshold for our particular decision cost structure.
So, how do we pick the right threshold? Well, by applying a threshold to a probabilistic forecast, we turn the probabilistic forecast into a categorical forecast. This means we can compute the confusion matrix and all the associated error metrics.
The goal is to find a threshold probability that guides us to the most cost-effective decision. To do this, we need the decision’s cost matrix. This has the same columns (forecast category) and rows (observed outcome category) as the confusion matrix. But it contains losses (or profits) for each combination of forecast and observed category.
For example, below is the cost matrix for a hypothetical refinery in a potential hurricane-prone position:
| Profit or loss in millions of $ per day | Fcst Yes | Fcst No |
|---|---|---|
| Obs Yes | -1.5 | -5.0 |
| Obs No | -1.0 | 0.25 |
Let’s say that the corresponding confusion matrix for a threshold value of p is as follows:
| Total Count=N | Fcst Yes | Fcst No |
|---|---|---|
| Obs Yes | a | b |
| Obs No | c | d |
If we multiply each cell in the cost matrix by the corresponding cell of the confusion matrix, sum the products, and divide by the total number of forecasts, we get the expected profit or loss of using the forecast system with this threshold (p). In this example, the formula would be:
Expected profit or loss = [(-1.5)(a) + (-5.0)(b) + (-1.0)(c) + (0.25)(d)]/N
You then try a range of thresholds from 0 to 100% and see which yields the best financial outcome. The optimal threshold probability depends strongly on the cost matrix, so it’s not the same for any two decisions. This is why probabilistic forecasts are more valuable than categorical forecasts; they can be tuned to the decision problem at hand by picking the best threshold probability.
So let’s give it a try. Below is a hypothetical cost matrix for a bike share company when precipitation greater than 0 occurs in Ames, IA. When precipitation occurs, they lose out on people renting bikes. In addition, if precipitation occurs, and they don’t expect it, the bikes can get ruined if not stored correctly. On the flip side, if precipitation is forecasted, and they remove bikes, but then rain does not occur, they miss out on sales. (Please note, this is all hypothetical).
| Profit or Loss $ Per Day | Fcst Yes | Fcst No |
|---|---|---|
| Obs Yes | -200 | -500 |
| Obs No | -150 | 50 |
Now, let’s use our data and apply several different threshold probabilities to estimate the expected profit or loss. To begin, select the thresholds we will test:
# select the probability thresholds
probs <- seq(0, 1, 0.1)
Now, initialize the variable that will save the expected profit or loss for each threshold.
# initialize variables
expectedProfitLoss <- {}
Now, determine when it rained and when it did not rain.
# create rain and no-rain indices
rainIndex <- which(obsData$rainfall == 1)
noRainIndex <- which(obsData$rainfall == 0)
Now, enter the cost matrix values.
# enter the cost matrix values
costTP <- -200
costFN <- -500
costFP <- -150
costTN <- 50
Now, we can loop through the probabilities and estimate the expected gain/loss.
# loop through probabilities and estimate the expected profit or loss
for(i in 1:length(probs)){
  # forecast "yes" when the PoP is at or above the threshold
  popIndex <- which(f >= probs[i])
  noPopIndex <- which(f < probs[i])
  TP <- length(intersect(popIndex, rainIndex))
  FP <- length(intersect(popIndex, noRainIndex))
  FN <- length(intersect(noPopIndex, rainIndex))
  TN <- length(intersect(noPopIndex, noRainIndex))
  sumProfitLoss <- (costTP*TP) + (costFN*FN) + (costFP*FP) + (costTN*TN)
  expectedProfitLoss[i] <- sumProfitLoss/length(obsData$Date)
}
You could do a number of things to visualize or quantify the best threshold. I will simply make a plot of the threshold probabilities and the expected profit or loss. Run the code below:
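A minimal sketch of that plot would be:
# expected profit/loss as a function of the probability threshold
plot(probs, expectedProfitLoss, type = "b",
     xlab = "Probability Threshold", ylab = "Expected Profit or Loss ($ per day)")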
Below is a larger version of the figure:
The optimal threshold is 30%, as this minimizes their loss. So, when there is a 30% or larger chance of precipitation, they should store bikes and expect a loss of roughly $60/day.
Once you have completed this section, you should be able to select the appropriate visuals and statistical metrics for assessing forecast skill, depending on the forecast type.
In this lesson, we learned how to assess the skill of a forecast, dependent on the forecast type. The assessment relies heavily on the question at hand and the consequences of a bad forecast versus a good forecast (user-centric). Before you continue on to the assignment, make sure you feel comfortable selecting the appropriate assessment given a forecast type. Below, you will find one final note on strictly proper scores and a flowchart to help guide you in the decision making.
If you chose not to go with cost minimization, there are a plethora of other error and skill metrics to choose from. Many of them, however, reward issuing forecasts that don’t reflect the true probability distribution. For example, MAE applied to a probability forecast will improve if you forecast a 1 for probability > 0.5 and a 0 for probability <= 0.5. Thus, destroying some of the available information about forecast uncertainty actually improves the score. It’s not rewarding the right aspects of skill in this setting.
In contrast, the Brier Score and cost minimization reward issuing the best guess of the event probability as your forecast. Error and skill metrics that reward issuing a correct probability forecast are called ‘strictly proper’.
I will end this lesson by providing a flowchart that should help guide your forecast assessment. There are many more methods for assessing the skill of a forecast, and you are not limited to what was presented here.
Oftentimes with storms, we can get snow to rain, we get snow to sleet, snow/sleet, to freezing rain, but what determines these precipitation types? So, let's talk a little bit about this. First of all, when you think about the troposphere, which is where the game is played, you're looking at about 40,000 feet. So, we're only going to need about 15,000 feet of that atmosphere to age, grow the snow, and produce these different precipitation types. And sometimes you do not even need all of that. But let's just start up here where the snow crystals are developing, you know, we call this the dendritic growth region. Now it's pretty easy to produce snow when you've got the snowflakes that are already being produced up there, and they fall down through the atmosphere all the way down to the ground as snow because the whole column itself is below 32 degrees Fahrenheit or zero Celsius. And so, what do you wind up with? This beautiful Currier and Ives postcard down here, right? That's easy. But what happens when you start introducing a warm layer?
So, on average, let's play with that warm layer somewhere between 5 and 10,000 feet. Maybe not all of it. So, we've got snow, we've always got snow up there in the winter time, usually at 15,000 feet on up. Ok, so here come those snow crystals rolling on the down the layer. All of a sudden, oh! It goes above 32 degrees, starts to melt, either the entire flake or a little bit of it, but once it starts to fall back into the air that's below 32, it becomes a sleet pellet. So, it completely refreezes again. You can hear the sleet. You can hear it on the rooftops. You can hear it on the street. You can hear it on your car if you're just sitting there. And sleet and snow are typically ok to drive in, but this last precipitation type, ew! Alright, so let's play again with the atmosphere. Go up to 15,000, we've got snow down to 10 on average. It melts, uh oh, it's still melting all the way down to about 2,000 feet and then, on under that air mass, that's just cold enough to refreeze whatever is at the surface, but it's very shallow. Now, this can vary anywhere between 200 and 2,000 feet. But the problem is, you don't get a chance to refreeze that droplet. It comes down as a liquid and whatever it hits it turns to ice. And ice is the worst thing to drive on! Especially if you're going too fast!
[CAR SKIDDING ON ICE]
Yeah, stuff like that can happen. So, the rain keeps coming down in the freezing layer. It accumulates on everything. The rooftops, the tree limbs, and you wind up with some big problems in through here. And unfortunately, too, you've got power lines involved in many, many cases where that ice is accumulating on power lines. Now, between the two poles, if you just take on average, the weight of a quarter inch of ice adds about 500 pounds to those lines! And yeah, that means power loss and all sorts of problems.
So, those are precipitation types and basically, how they form. Obviously, you can get all of the same precipitation in one storm, you can get a combination of all three. But that's essentially how the atmosphere works for precipitation types.
The video above discusses different precipitation types and how they form. Precipitation type is an example of a categorical variable, one in which a finite set of outcomes can occur. Accurately predicting the precipitation type may be more important for a user than the magnitude of the precipitation. For example, you might expect less than an inch of rain, but if that rain is actually freezing rain, hazards can occur such as slippery roads, downed trees, etc., that you must prepare for. Or maybe you are concerned with a certain threshold of freezing rain, such as an amount greater than a ½ inch, which could down power lines. The actual magnitude of the precipitation does not matter as much as the type of precipitation that occurs and/or if the amount surpasses a specific threshold.
Categorical variables, like the ones described in the example above, show up quite frequently in weather and climate analytics. They allow us to create thresholds that focus on a key aspect of weather that is important to the user’s question. Instead of predicting all outcomes (precipitation amount and type), we worry about predicting the outcomes that matter (precipitation type and threshold). Categorical variables require special forecasting methods, which will be highlighted in this lesson. Read on to learn more.
Once you have completed this section, you should be able to define a categorical weather-dependent problem.
Categorical variables take on one of a limited number (at least two) of fixed possible outcomes. In weather, categorical variables include hurricane categories, tornado strength, and precipitation type. You might think that categorical forecasts are not as important as say a probability forecast, but as we saw in the previous lessons, many probabilistic forecasts become categorical forecasts, particularly when used to make a cost-based decision. For the same reason, many times a real-world scenario will require a categorical answer, not the value of a continuous variable. Read on to learn about what problems require categorical forecasts.
For a categorical forecast, there will be two or more possible outcomes. The goal is to predict the outcome category from the weather data. For example, one question may be whether a river will rise enough to overtop a levee. There are two possible outcomes: the river will overtop the levee, or it won’t. Flood preparation decisions would depend on the odds of the levee being overtopped.
Another example of a categorical weather-dependent forecast is with respect to aviation. Based on certain weather conditions, pilots need to know what set of flight rules apply. The rules are Visual Flight Rules (VFR), Marginal VFR (MVFR), Instrument Flight Rules (IFR), and Low Instrument Flight Rules (LIFR). You can learn more about these rules here [27]. Since there are four different flight rules, we have four possible outcomes. Fuel reserve decisions for an executive jet would depend on the odds of each of these categories occurring.
When you have finished reading this section, you should be able to describe a logistic regression, be able to explain the shortcomings of such a model and perform a logistic regression in R.
The video below is a bit long, but I encourage you to watch the whole thing, as it provides a nice overview of logistic regression.
[SINGING] If you can fit a line, you can fit a squiggle. If you can make me laugh, you can make me giggle. StatQuest.
[JOSH STARMER]: Hello, I'm Josh Starmer and welcome to StatQuest. Today we're going to talk about logistic regression. This is a technique that can be used for traditional statistics as well as machine learning. So let's get right to it.
Before we dive into logistic regression, let's take a step back and review linear regression. In another StatQuest, we talked about linear regression. We had some data, weight, and size. Then we fit a line to it and, with that line, we could do a lot of things. First, we could calculate r-squared and determine if weight and size are correlated. Large values imply a large effect. And second, calculate a p-value to determine if the r-squared value is statistically significant. And third, we could use the line to predict size given weight. If a new mouse has this weight, then this is the size that we predict from the weight. Although we didn't mention it at the time, using data to predict something falls under the category of machine learning. So plain old linear regression is a form of machine learning. We also talked a little bit about multiple regression. Now, we are trying to predict size using weight and blood volume. Alternatively, we could say that we are trying to model size using weight and blood volume. Multiple regression did the same things that normal regression did. We calculated r-squared, and we calculated the p-value, and we could predict size using weight and blood volume. And this makes multiple regression a slightly fancier machine learning method. We also talked about how we can use discrete measurements like genotype to predict size. If you're not familiar with the term genotype, don't freak out. It's no big deal. Just know that it refers to different types of mice.
Lastly, we could compare models. So, on the left side, we've got normal regression using weight to predict size. And we can compare those predictions to the ones we get from multiple regression, where we're using weight and blood volume to predict size. Comparing the simple model to the complicated one tells us if we need to measure weight and blood volume to accurately predict size, or if we can get away with just weight. Now that we remember all the cool things we can do with linear regression, let's talk about logistic regression. Logistic regression is similar to linear regression, except logistic regression predicts whether something is true or false instead of predicting something continuous, like size. These mice are obese, and these mice are not. Also, instead of fitting a line to the data, logistic regression fits an S-shaped logistic function. The curve goes from zero to one, and that means that the curve tells you the probability that a mouse is obese based on its weight. If we weighed a very heavy mouse, there is a high probability that the new mouse is obese. If we weighed an intermediate mouse, then there is only a 50% chance that the mouse is obese. Lastly, there's only a small probability that a light mouse is obese. Although logistic regression tells the probability that a mouse is obese or not, it's usually used for classification. For example, if the probability that a mouse is obese is greater than 50%, then we'll classify it as obese. Otherwise, we'll classify it as not obese. Just like with linear regression, we can make simple models. In this case, we can have obesity predicted by weight or more complicated models. In this case, obesity is predicted by weight and genotype. In this case, obesity is predicted by weight and genotype and age. And lastly, obesity is predicted by weight, genotype, age, and astrological sign. In other words, just like linear regression, logistic regression can work with continuous data like weight and age and discrete data like genotype and astrological sign. We can also test to see if each variable is useful for predicting obesity. However, unlike normal regression, we can't easily compare the complicated model to the simple model, and we'll talk more about why in a bit. Instead, we just test to see if a variable's effect on the prediction is significantly different from zero. If not, it means that the variable is not helping the prediction. Psst. We used Wald's test to figure this out. We'll talk about that in another StatQuest.
In this case, the astrological sign is totes useless. That's statistical jargon for not helping. That means, we can save time and space in our study by leaving it out. Logistic regression's ability to provide probabilities and classify new samples using continuous and discrete measurements makes it a popular machine learning method. One big difference between linear regression and logistic regression is how the line is fit to the data. With linear regression, we fit the line using least squares. In other words, we find the line that minimizes the sum of the squares of these residuals. We also use the residuals to calculate R squared and to compare simple models to complicated models. Logistic regression doesn't have the same concept of a residual, so it can't use least squares, and it can't calculate R squared. Instead, it uses something called maximum likelihood. There's a whole StatQuest on maximum likelihood so see that for details, but in a nutshell, you pick a probability scaled by weight of observing an obese mouse just like this curve, and you use that to calculate the likelihood of observing a non-obese mouse that weighs this much. And then, you calculate the likelihood of observing this mouse, and you do that for all of the mice. And lastly, you multiply all of those likelihoods together. That's the likelihood of the data given this line. Then you shift the line and calculate a new likelihood of the data and then shift the line and calculate the likelihood again, and again. Finally, the curve with the maximum value for the likelihood is selected. BAM!
In summary, logistic regression can be used to classify samples, and it can use different types of data like the size and/or genotype to do that classification. And it can also be used to assess what variables are useful for classifying samples, i.e., astrological sign is totes useless.
Hooray! We've made it to the end of another exciting StatQuest. Do you like this StatQuest and want to see more? Please subscribe. If you have suggestions for future StatQuests, well, put them in the comments below. Until next time, Quest on!
Logistic models, as the video discussed, predict whether something is true or false; that is, there are only two outcomes. It is an example of a categorical model. By using several variables of interest, we can create a logistic model and estimate the probability that an outcome will occur.
Logistic regression estimates the parameters of a logistic model, a widely used statistical model that uses a logistic function to model a binary (only two outcomes) dependent variable. The logistic function follows an s-curve, like the image below shows:
Mathematically, the expected probability that Y = 1 for a given value of X in logistic regression is:
P(Y = 1 | X) = 1/(1 + e^-(b0 + b1X))
The expected probability that Y = 0 for a given value of X is:
P(Y = 0 | X) = 1 - P(Y = 1 | X) = e^-(b0 + b1X)/(1 + e^-(b0 + b1X))
The equation turns linear regression into a probability forecast. Instead of predicting exactly 0 or 1 (the event does not happen, or it does), logistic regression generates a probability between 0 and 1 (the likelihood the event will occur). As the output of the linear regression equation (b0 + b1X) increases, the exponential term goes to zero, so the probability goes to 1. In contrast, as the output of the linear regression equation decreases, the exponential term goes to infinity, so the probability goes to 0.
Logistic regression is advantageous in that the regression generates a probability that the outcome will occur. But there are some disadvantages to this technique. For one, logistic regression allows only linear relations or KNOWN interactions between variables. For example, an interaction between two predictors X1 and X2 must be specified explicitly as an extra term (such as b3*X1*X2) when the model is built.
But we often don’t know what variable interactions are important, making logistic regression difficult to use. In addition, logistic regression only allows for two possible outcomes. You can only determine the odds of something occurring or not.
Let’s work through an example in R. Begin by downloading this data set of daily values from Burnett, TX [29]. The variables included are: the date (DATE), average daily wind speed (AWND), time of the fastest one-mile wind (FMTM), peak gust time (PGTM), total daily precipitation (PRCP), average daily temperature (TAVG), maximum daily temperature (TMAX), minimum daily temperature (TMIN), wind direction of the fastest 2-minute wind (WDF2), wind direction of the fastest 5-second wind (WDF5), the fastest 2-minute wind speed (WSF2), and the fastest 5-second wind speed (WSF5). To review the variables or their units, please use this reference [30].
For this example, we are going to create a logistic model of rain for Burnett, TX using all of these variables. To start, load in the data set.
# load in daily summary data for Burnett, TX
mydata <- read.csv("Texas_Daily_Summaries.csv")
We only want to retain complete cases, so remove any rows (observational dates) in which any variable has a missing value.
# obtain only complete cases
mydata <- na.omit(mydata)
# check for missing values
badIndex <- which(is.na(mydata))
‘badIndex’ should be empty (this is a way of checking that all NAs were removed). Now that we have a complete data set, we need to create a categorical variable. I will model two rain outcomes: rain occurred (yes =1) and rain did not occur (no=0). Use the code below:
# turn precipitation into a categorical variable with 2 outcomes (1 = it rained, 0 = it did not rain)
mydata$rain <- array(0, length(mydata$PRCP))
mydata$rain[which(mydata$PRCP > 0)] <- 1
Now, I want to remove the variables Date and PRCP as we will not use them in the model. Use the code below:
# remove $DATE and $PRCP
mydata <- subset(mydata, select = -c(1, 5))
We are left with a data frame that only contains the dependent variable (rain) and the independent variables (all other variables). Next, we need to split the data into training and testing. For this example, run the code below to create a 2/3 | 1/3 split.
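The code box is not reproduced here; the split mirrors the one used later in this lesson for the CART example:
# create training and testing data (2/3 - 1/3 split)
library(caTools)
split <- sample.split(mydata$rain, SplitRatio = 0.66)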
The function ‘sample.split’ from the package ‘caTools’ randomly assigns 2/3rds (0.66) of the observations to the training set (set to TRUE). We can then create our training and testing sets as follows:
# get training and test data
train <- subset(mydata, split == TRUE)
test <- subset(mydata, split == FALSE)
Now, we can create a logistic model by running the code below:
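The code box is not shown here; based on the description that follows, a sketch (the object name ‘model’ is my choice) would be:
# logistic regression of rain on all remaining variables in the training set
model <- glm(rain ~ ., data = train, family = "binomial")
summary(model)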
The function ‘glm’ is a generalized linear model. We are telling the function to model ‘rain’ from the training data using all the variables (~.) in the training set. The ‘family’ parameter tells the function to use logistic regression (‘binomial’).
Some of the coefficients are not statistically significant. What can we do? Well, similar to stepwise linear regression, we can perform a stepwise logistic regression. Run the code below:
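Again, the code box is not shown; a sketch, assuming the fitted model from the previous step is called ‘model’ and the result is the ‘bestModel’ object used below, would be:
# stepwise selection based on AIC
bestModel <- step(model)
summary(bestModel)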
The ‘step’ function selects the best model based on minimizing the AIC score. If you summarize the model, you will find that all the coefficients are statistically significant at alpha=0.05. The final variables will be slightly different each time you run the code, as there is a random aspect to the data split (and thus to the variables the step function retains). But the model will probably include: AWND (average daily wind speed), FMTM (time of the fastest one-mile wind), TMAX (maximum temperature), TAVG (average temperature) or TMIN (minimum temperature), WDF2 (wind direction of the fastest 2-minute wind), WSF2 (fastest 2-minute wind speed), and WSF5 (fastest 5-second wind speed).
We can predict our testing values using the code below:
# predict the testing data
predictGLM <- predict(bestModel, newdata = test, family = "binomial", type = "response")
The ‘type’ parameter makes the predictions scale to the response variable instead of the default which scales to the linear predictors. Unlike linear regression where we could plot the observed vs predicted values and assess the fit, we must use alternative assessment techniques. For categorical forecasts, we can display the confusion matrix (refer back to previous lessons for more details). Run the code below:
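A sketch of that step, assuming a 0.5 probability threshold (which produces the TRUE/FALSE columns in the table below), would be:
# confusion matrix: observed rain (rows) vs. predicted probability above 0.5 (columns)
table(test$rain, predictGLM > 0.5)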
You should get a similar table to the one below (again, there is a random component, so the numbers will most likely not be identical):
|  | FALSE | TRUE |
|---|---|---|
| 0 | 462 | 48 |
| 1 | 80 | 94 |
Try changing the probability threshold to see how the confusion matrix changes. Or we can plot the ROC curve using the code below (again, refer back to previous lessons for more details):
# ROC Curve
library(ROCR)
ROCRpred <- prediction(predictGLM, as.character(test$rain))
ROCRperf <- performance(ROCRpred, 'tpr', 'fpr')
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2, 1.7))
The first line creates a prediction object that pairs the predicted probabilities with the observed labels (as.character(test$rain)). The next line (‘performance’) evaluates the True Positive Rate (TPR) and the False Positive Rate (FPR). The final line plots the assessment. The color represents the probability cutoffs. You should get the following figure:
When the probability cutoff is high (near 100%), the false positive rates and true positive rates are low because we have forecasted very few events. When the probability is low, we have high TPR (good) and FPR (bad), since we are forecasting most of the events. The trick here is finding the right threshold for our particular decision cost structure.
Once you have completed this section, you should be able to explain rule-based forecast systems, their use of splits, and how to measure purity.
Rule-based forecasting is a machine learning method that identifies and uses rules to predict an outcome. The key is finding a set of rules that represent the system you are forecasting. There are numerous rule-based AI methods available. For this lesson, we will illustrate the Classification and Regression Tree (CART) method, which is used quite frequently in the field of weather and climate. Before we dive into CART, let’s first go over some nomenclature.
In rule-based forecasting, we use a set of rules that will split the data up and eventually lead to predictions. A rule says: if a specific condition is met, the outcome is Yes; otherwise, the outcome is No. Thus, a rule splits our set of cases into two groups: Yes and No. We’d like these Yes decisions to correspond to the occurrence of one of the categorical outcomes; the same goes for the No decisions.
A perfect rule system would yield splits where all the cases in each group had the same categorical outcome. We call this condition purity. An imperfect rule has a mix of actual outcomes on each side of the split. The better the rule, the closer each side of the split comes to having all cases with the same actual outcome.
We want our rule-based system to tend toward purity. But how do we actually assess purity? There are all sorts of ways of measuring purity, some much better than others. The one we will focus on in this course is the one with the best theoretical foundation: Shannon’s information entropy, H. Mathematically:
H = -Σ p_i * log2(p_i)
where i is the category, p_i is the fraction of the cases in that category, the sum (Σ) runs over all N categories, and log2(p_i) is the base-2 logarithm of p_i.
H goes to 0 as the sample of cases becomes pure (e.g., all of one category). For any category that doesn’t occur in the sample, p is zero and the product p*log2(p) is taken to be zero. For the category that occurs in every case of a pure leaf (the end of the decision tree), p = 1, so p*log2(p) = 1*0 = 0. Note that because p <= 1, log2(p) <= 0, so H >= 0.
You do not need to remember all the math. Just remember that our goal is to build rules that drive H down as close to zero as we can get it. This is similar to minimizing the AIC score when looking at multiple linear regression models.
After you have read this section, you should be able to interpret a tree diagram and understand how rules are built.
As I mentioned earlier, CART, or Classification and Regression Trees, is a rule-based machine learning method. It’s a common AI technique used in the field of weather and climate analytics. We actually used regression trees in Meteo 815 when we discussed data mining, so this should not be the first time you have heard of the method. In this lesson, however, we will dive deeper into the details of CART and explore exactly how the tree forms and how the rules are built that define the tree. Read on to learn more.
In the end, we will have a set of rules that are used to predict the categorical outcome. So, how does the tree of rules work? Each case goes to the first rule and gets sent down one branch of a split (Yes or No). Each of the two branches leads to another rule. The case goes into the next rule and splits again, getting sent down another branch. This process continues to the tips of the branches (leaves).
The goal is to have all the cases that end up at a given branch tip to have the same categorical outcome. Below is an example of a decision tree for flight rules [27].
You start at the top with all of the observations (or whatever other inputs you are using). You follow Rule 1, which will lead you down one of the branches. Whichever branch you follow, you then go to Rule 2 which will lead you down two more branches which end up at the branch tip with the final outcome. The branch tip is formally called the leaf.
Before we can build our tree, we need to build the rules. Each rule takes the form:
If predictor > threshold, then Yes, else No
Thus, we can build different rules using different predictors. And we can create different rules by changing the threshold applied to any one predictor.
We build rules by testing all the predictors to see which one gives us the biggest increase in purity (e.g., the biggest decrease in H). For each predictor, we have to test multiple threshold values to see which works best. Usually, about 20 thresholds spread across the range of a predictor is sufficient. We test each of these predictor threshold combinations on our entire set of training cases (cases where we know both the predictors and the actual categorical outcome).
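To make this concrete, here is a small sketch (hypothetical data, not from the lesson) that computes the weighted entropy H of the two groups created by each candidate threshold and keeps the purest split:
# entropy of a set of categorical outcomes
entropy <- function(outcome){
  p <- table(outcome)/length(outcome)
  # treat 0*log2(0) as 0
  -sum(ifelse(p > 0, p*log2(p), 0))
}
# weighted entropy of the two groups produced by one candidate rule
splitEntropy <- function(predictor, outcome, threshold){
  yes <- outcome[predictor > threshold]
  no <- outcome[predictor <= threshold]
  (length(yes)*entropy(yes) + length(no)*entropy(no))/length(outcome)
}
# synthetic example: test 20 thresholds and keep the one with the lowest H
set.seed(7)
predictor <- rnorm(200)
outcome <- ifelse(predictor + rnorm(200, sd = 0.5) > 0, "Yes", "No")
thresholds <- seq(min(predictor), max(predictor), length.out = 20)
H <- sapply(thresholds, function(t) splitEntropy(predictor, outcome, t))
bestThreshold <- thresholds[which.min(H)]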
Below is an example of a purity calculation for different thresholds.
The goal was to predict whether the minimum temperature would be below 32°F in Minneapolis, MN based on the maximum temperature from the day before. I created 12 thresholds starting at 35°F and going up to 60°F. For each threshold, I estimated the probability of observing a Tmin < 32°F given Tmax was between the threshold values. There is an optimal threshold with worse (bigger) values of H on either side (this threshold is 43°F to 45°F). The H value for the optimal threshold is 0.15, which means it does not yield 0 or a perfect purity. This just means that our one forecast rule wasn’t enough to give perfect forecasts on the training data. Hence the need for multiple rules (e.g., CART).
After you have read this section, you should be able to explain how a tree grows in CART and how this growth is terminated.
Now that we know how to build rules, we need to combine the rules to create the tree. In addition, we need to know when to stop growing the tree. Read on to finish learning about CART.
Now that you know how to build a rule, you can go ahead and grow a tree. To begin, build a rule on the full training dataset, using it to split the training cases into two subsets. Now go to one of the resulting subsets and repeat the process. Continue this process with all of the branches until you hit the termination criterion.
How do we know when to terminate? There are multiple ways to determine when to stop building the tree. Some are bad and some are good.
One approach is to drive to purity. This, however, tends to result in so few cases in each leaf that it doesn’t have much statistical robustness. In turn, the tree will not work well on independent data (e.g., in actual operations). Instead, we want to stop growing our tree when the available data no longer supports making additional splits. Another approach is an ad hoc threshold. Instead of solely driving down purity, you set a threshold on the minimum number of cases allowed in a leaf. Although simple, this actually works quite well. What is the optimal threshold? You have to test several values to see which model results in the least overfitting of the data you’ve held out for testing (see previous lessons for how to test for overfitting).
Alternatively, or in addition, we can ‘prune’ a tree back to eliminate splits supported by too few cases. Sometimes this works better, but it takes more computer time since you have to grow the tree out until it is ‘too big’ before stopping. Again, numerous methods have been proposed to control which branches get pruned.
In R, you can use the function ‘rpart’ from the package of the same name to grow a tree. Sound familiar? This is the same function used in Meteo 815. I will only provide a summary here. For more details, I suggest you review the Data Mining lesson from Meteo 815. To grow the tree, we use the following code:
rpart(formula, data=, method=, control=)
Where the formula format is:
Outcome~predictor1+predictor2+predictor3+….
You set data to the data frame and method to ‘class’ (this tells the function to use a classification tree as compared to ‘anova’ for a regression tree of a continuous outcome). The ‘control’ parameter is optional but allows you to set parameters for selecting the best model (such as the minimum number of cases allowed in a leaf). Remember, the function will create multiple models and select the best based on purity and any controls listed.
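As an illustration only (the names outcome, predictor1, predictor2, and mydata below are placeholders, not objects from our data set), a call might look like this:
# hypothetical classification tree for a categorical outcome from two predictors,
# requiring at least 10 cases in every leaf
library(rpart)
fit <- rpart(outcome ~ predictor1 + predictor2, data = mydata,
             method = "class",
             control = rpart.control(minbucket = 10))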
For a review of how to examine the results, please look at the material from Meteo 815 or use this resource [24]. In addition, we can prune a tree by using the ‘prune’ function, which will allow you to avoid overfitting.
Once you have read this section, you should be able to perform the CART technique on a dataset in R.
Now, it’s time to apply your knowledge and build a tree in R. Earlier, we fit a logistic model to daily precipitation (did it rain or not?). Sometimes we need to know more than whether it rained. In some cases, it might be more important to know whether the precipitation passed a certain threshold. This threshold usually corresponds to an impact; for example, more than an inch of rain in a day might cause stormwater drains to back up. In this section, we will translate the precipitation into a categorical variable and create a CART model for prediction. Read on to learn more!
We are going to continue off of our earlier example of rain in Burnett, TX. Again, you can download the data here [29]. To review the variables or their units, please use this reference [30].
Start by loading in the data set and retaining only complete cases.
# load in daily summary data for Burnett, TX
mydata <- read.csv("Texas_Daily_Summaries.csv")
# obtain only complete cases
mydata <- na.omit(mydata)
# verify that no missing values remain (badIndex should be empty)
badIndex <- which(is.na(mydata))
Instead of looking at yes/no it rained, let’s look at rain levels. I will create 3 rain levels called low (0.25 inches or less), moderate (more than 0.25 and up to 0.5 inches), and high (more than 0.5 inches), matching the thresholds applied in the code. Use the code below:
# create rain levels
rainLevel <- c(0.25, 0.5, 1)
# create rain variable
rain <- array("low",length(mydata$PRCP))
# fill in for rain levels
rain[which(mydata$PRCP > rainLevel[1] & mydata$PRCP <= rainLevel[2])] <- "mod"
rain[which(mydata$PRCP > rainLevel[2])] <- "high"
Now save the ‘rain’ variable to the data frame (mydata) and remove $DATE and $PRCP as we will not use them in the model.
# save rain to data frame
mydata$rain <- as.vector(rain)
# remove $PRCP and $DATE
mydata <- subset(mydata, select = -c(1, 5))
Next, we need to split the data into training and testing. Let’s use the 2/3|1/3 approach as we did previously. Use the code below:
# create training and validation data from given data
library(caTools)
split <- sample.split(mydata$rain, SplitRatio = 0.66)
# get training and test data
train <- subset(mydata, split == TRUE)
test <- subset(mydata, split == FALSE)
Now we can build the tree by running the code below:
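The embedded code is not reproduced here, so below is a minimal sketch of the tree-growing step. I assume the fitted model is stored in an object named modelTree (the name used later with post and predict) and that all remaining columns of the training data are used as predictors.
# grow a classification tree predicting rain level from all other variables in 'train'
library(rpart)
modelTree <- rpart(rain ~ ., data = train, method = "class")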
In this case, the best model was selected by the default parameters in ‘control’, which include minsplit=20 (the minimum number of cases a node must contain before a split is attempted), and by tending towards purity. To see all of the parameter defaults, use the help command on ‘rpart.control’. (Again, this was all covered in Meteo 815, so please return to that course if you need a refresher.) Please note that each time you run the code, a slightly different tree might grow. That is because there is a random component in the splitting of the data and in growing the tree. Each time you hit 'run' in the DataCamp code, you re-run the analysis from the beginning, thus creating a new split and a new analysis.
We can look at the final model using the print and summary commands. The print command provides the details of the model, while summary provides information on how the model was selected. Run the code below:
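A sketch of those two commands, assuming the model object is named modelTree as above:
# details of the fitted tree (splits, case counts, predicted class at each node)
print(modelTree)
# how the model was selected (CP table, variable importance, node details)
summary(modelTree)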
The results might differ slightly from the tree formed previously, due to the random component of the analysis. To visualize the tree, run the code below:
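One way to draw the tree with the base rpart plotting functions (a sketch; the course's embedded code may differ):
# draw the tree skeleton and label each split and leaf with case counts
plot(modelTree, uniform = TRUE, margin = 0.1)
text(modelTree, use.n = TRUE, all = TRUE, cex = 0.8)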
Again, the tree visualized here may be slightly different from the one formed previously. You can save the figure by running the code below (I find it easier to view the final tree by saving the figure and opening it than by viewing the output in RStudio):
# create attractive postscript plot of tree
post(modelTree, file = "tree.ps",
title = "Classification Tree for Bartlett, TX")
You should get a figure that looks like this (might not be exactly the same):
In the example above, the first rule is whether the fastest 5-minute wind speed is greater than 34. If it’s less than 34, the outcome is labeled ‘low’. If it is greater than 34, then you test the average wind speed. If the average wind speed is greater than 10.4, then the outcome is low. Otherwise, you test the average temperature. If the average temperature is greater than 79.5, the outcome is low; otherwise, the outcome is high. The numbers separated by dashes in each oval give the number of training cases at that node for each level (high/low/moderate).
Finally, if you wish to predict the testing values (or any other values), use the code below. Pay particular attention to the parameter ‘type’ as this must be set to ‘class’ to ensure you predict the appropriate variable type (classification instead of continuous).
# predict
predictTree <- predict(modelTree, newdata = test, type = 'class')
You can then use any of the categorical assessment metrics from the earlier lessons to look at the fit of the model (such as joint histograms and confusion matrix metrics).
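For example, a quick confusion matrix and proportion correct could be computed as follows (a sketch using the objects created above):
# contingency table of predicted vs. observed rain levels on the test data
confusion <- table(predicted = predictTree, observed = test$rain)
print(confusion)
# overall proportion of test cases classified correctly
sum(diag(confusion)) / sum(confusion)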
Below is an updated version of the flowchart that adds in the categorical forecasting methods we just discussed.
DR. ROBERTO BUIZZO: If we go back 25 years ago, we used to have one single forecast issued every day. The problem we had was that we didn't know whether the forecast was going to be accurate or not.
DR. FLORENCE ROBIER: When we take observations of the atmosphere, you know, it's like when you have a thermometer in a room – if you put another thermometer next to it, you will not necessarily have the same reading. Because nothing is perfect. Which means there's always uncertainty in what we measure and what we predict.
TOM HAMILL: Ed Lorenz was one of the first to really cast in a mathematical framework with which he was able to demonstrate that if you make a tiny error at the initial time, that tiny error is going to grow into a very substantial error.
DR. FLORENCE ROBIER: We always say that atmospheric weather prediction is an initial value problem. So, we first determine what is the state of the atmosphere now, what is the initial value, the initial conditions, and then we run a model. Traditionally, people were just using one model, typically in one center. They were just doing an analysis of the atmosphere using one model and then getting the forecast for the next few days. And that was very much what we were doing. We were taking the best image of the atmosphere with the observations, the best model we could get, and we had one shot in the future to say what the forecast is.
Ensemble prediction is about running one model several times; and in particular, here at ECMWF, we were trying to change the initial conditions of the forecast.
DR. ROBERTO BUIZZO: For each point of the globe, instead of having just one value of temperature, we had 50 values of temperature at the initial point. The same for wind, the same for pressure.
DR. FLORENCE ROBIER: And this would actually trigger the atmosphere to go in different directions in our forecast.
DR. ROBERTO BUIZZO: Then I can estimate using all of them what's going to be the possible range of forecasts, and this was, for example, extremely helpful in the summer 2003 when Europe experienced extreme hot temperatures. So with our ensemble systems, we're able to predict these extremes two, three weeks ahead.
DR. FLORENCE ROBIER: What we really use a lot is tropical cyclones and the trajectory of tropical cyclones, so there you look at it on a map not on a given point, but you really look at the different trajectories of the tropical cyclones.
TOM HAMILL: You probably have heard of hurricane or superstorm Sandy along the east coast of the United States, and that was a case where the ECMWF prediction system provided the heads-up a day or two in advance of our own US Weather Service prediction system. That was a success story.
DR. FLORENCE ROBIER: So, I think at the beginning, we were trying to really push the forecast where it was going to be sensitive. Then, we started also to incorporate some idea on the uncertainty in the observations. Sort of quantifying actually what comes from the uncertainty in the observations themselves. And, finally, the third step has been to actually quantify what is the uncertainty in the model itself. We know there are some ranges where some parameters could lie, for instance, and so, it's quantifying these as well.
DR. ROBERTO BUIZZO: Now, if I think about the future, we want to make our models better. We know, for example, that coupling the models to the ocean is key, and we have started doing that. We want to improve the initialization of the model, our estimate of the state of the atmosphere. The third big area of work is resolution. So, we need more resolution in the system.
DR. FLORENCE ROBIER: We have a very high collaboration goal; we really think we benefit if we collaborate with other organizations.
TOM HAMILL: Oh, my colleagues and I really think the world of ECMWF. The scientists are really defining the forefront of the operational ensemble predictions, and where the research needs to head to address the remaining deficiencies in ensemble prediction systems.
DR. FLORENCE ROBIER: It's fantastic, extremely motivating to know that you're working in a place where you can really make a difference, and whenever we have a good success in our forecast, and we think we might have saved lives using the science to do that. I can't think of anything better to do.
The video above provides an overview of ensemble forecasting at ECMWF. As technology has advanced, forecasting has improved. NWP models can now create multiple runs for the same forecast time, creating an ensemble. Ensembles provide more confidence in forecasts, at least when runs coincide with one another, and tell the forecaster where and when higher uncertainty exists (where the runs diverge). Thus, ensembles can be used to assess the amount of uncertainty in a model forecast and the range of likely outcomes. In this lesson, you will learn more about ensembles, including how to use them, how to assess the model uncertainty, and how to turn the ensemble into a PDF for application purposes. Read on to learn more!
After you have read this section, you should be able to define ensemble, discuss various visuals for ensembles, and list the strengths and weaknesses of an ensemble.
Over the past 20-30 years, forecasting techniques have advanced from each model making a single forecast, to several runs of the model yielding an ensemble of forecasts. An ensemble, multiple runs for the same forecast time, provides more information for a forecaster. In particular, by examining several different model runs, a forecaster can build confidence in the predictions if they coincide or question the outcome if the model runs are spread out. Read on to learn more.
An ensemble consists of multiple model runs for the same forecast valid time. Each model run is called a “member” of the ensemble. Most often, but not always, all the member model runs start at the same initial time.
A spaghetti diagram plots a mission-critical contour for each member forecast. Take a look at the example below:
The diagram above shows contours of specific 500-hPa geopotential height values for each ensemble member. Note the two contour values selected: one roughly corresponds to the edge of really cold weather and the other to the edge of tropical weather.
Spaghetti plots allow us to visualize the amount of uncertainty in the forecast by looking at the spread among ensemble members. There is high confidence in the forecast when the members tend to coincide and low confidence when they are spread apart. This is highlighted in the figure above: the forecast has higher confidence along the Gulf Coast than over the Pacific Ocean in the right panel, and higher confidence near New York than along the West Coast in the left panel.
Instead of viewing all the member forecasts on one plot, we can plot the output from each member on a ‘postage stamp’ sized image. This is useful for more complex forecast situations where a small number of contours just won’t suffice.
Take a look at the example below:
Each image is the output of an individual member of the ensemble. In this example, we are looking at the mean sea level pressure and temperature at 850 hPa. This type of diagram allows the forecaster to examine all the individual outputs side by side and make comparisons.
Why should we use ensembles? Think back to earlier lessons when we discussed probabilistic forecasts. We’d really like the joint PDF of all the weather variables we are interested in. This lets us compute the odds of any combination of outcomes and therefore make decisions based on expected cost. Ensembles can be used to compute those PDFs.
The spread in the PDF can vary with the weather pattern – some patterns are much easier to predict than others. In theory, ensembles can quantify this forecast uncertainty because they sample the weather pattern we are in and how it is likely to evolve. One of our big tasks will thus be to determine the spread/skill relationship of our ensemble so we can calibrate it into a good approximation of the true PDF.
Once you have read this section, you should be able to discuss the different sources of model error.
Every model will have errors. Ensembles tell us the expected value of the error in a statistical sense. By examining an ensemble of model runs that each differs in some respect (differences include initial conditions or small-scale process parametrization), we can assess how the weather, acting through those changes, impacts ensemble spread and thus forecast uncertainty. As with any form of statistical forecasting, this ensemble post-processing requires that we calibrate, in this case calibrating the model spread against observed forecast errors from a large set of training cases.
There are several sources of model error, each of which we would hope to capture with our ensemble. The most important are described below.
The ideal ensemble would try to capture the errors due to uncertainty in the initial conditions and those due to uncertainty in the subgrid-scale physics.
We can perturb the initial conditions of a model away from the best guess – this is called initial condition diversity. We cannot perturb the conditions by more than the uncertainty in our analysis, which results from our inability to observe the atmosphere completely. In data-rich areas that may not be much, but in data-poor areas, it could be large.
The big question is how to perturb the initial conditions. If you do it at random, the atmosphere just mixes out the fluctuation in minutes to hours. Instead, you need to perturb the initial conditions on the scales, and in the places, to which the atmosphere is most likely to respond. Below is a visual of the scales of motion in the atmosphere from microscale to climate variation.
The x-axis is the spatial scale, while the y-axis is the timescale. Generally speaking, we will focus on the mesoscale to synoptic scale for initial perturbations, as these are the most rapidly growing modes (i.e., weather patterns that are inherently unstable).
There are many ways to perturb the initial conditions, but the details are outside the scope of this course.
For each of the physics parameterizations (e.g., radiation, clouds/precipitation, turbulence, surface, cumulus, etc.) there are multiple “good” choices. Each ensemble member would have a different combination. Again, the details behind these parameterizations are outside the scope of this course. But, if you would like to see an example of how these aspects of ensemble initiations are implemented, please click here [41].
After you have read this section, you should be able to fit parametric distributions to an ensemble, perform a kernel density estimation to fit a non-parametric PDF, and calibrate the resulting PDFs.
Since each model run produces only one forecast, we need an ensemble of runs large enough so that the resulting histogram of forecasts is a good approximation of the true PDF. The more variables we’re interested in putting in a joint PDF, the harder this is (i.e., Curse of Dimensionality [42]). For either univariate or joint PDFs, you can fight the curse of dimensionality (a bit) by bringing more information to the table - in particular, knowing what family the PDF belongs to (e.g., normal). Read on to learn about the different methods for fitting distributions to ensembles.
It takes far fewer (10 to 30) ensemble members to fit a known distribution reasonably well than it does to form a histogram that is a good approximation of the true PDF.
Let’s give this a try. To begin, download this data set of GEFS temperature forecasts [43] valid at 00Z on December 1st, 2018 for Atlanta, GA. Please note that I sometimes say GEFS and sometimes GFS; the GEFS is the ensemble built from the GFS model, so I use the names interchangeably here. Use the code below to load in the NetCDF file.
# load in GEFS model temperature data
library(RNetCDF)
fid <- open.nc("tmp_2m_latlon_all_20181201_20181201_jzr8IGeTt4.nc")
mydata <- read.nc(fid)
Next, average over the lat and lon to create a temperature estimate for Atlanta, GA.
# average over Atlanta
AtlantaTemp <- apply(mydata$Temperature_height_above_ground, 3, mean)
# convert from kelvin to F
AtlantaTemp <- (9/5)*(AtlantaTemp - 273.15) + 32
We are left with a temperature value in degrees F for each of the 11 members of the ensemble. Let’s plot the probability density for the temperature. As a reminder, the probability density describes the relative likelihood per unit of temperature (not to be confused with a probability histogram, which plots the probability of an event falling in each bin). Run the code below:
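The embedded code is not shown here, but a minimal sketch that produces a density-style histogram of the 11 member forecasts might look like this:
# probability density histogram of the ensemble member temperatures
hist(AtlantaTemp, probability = TRUE,
     xlab = "Temperature (degrees F)", ylab = "Probability Density",
     main = "GFS Ensemble Temperatures for Atlanta, GA 12/01/2018 00Z")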
Below is a larger version of the figure:
Temperature is generally normally distributed. Let’s try and fit a normal distribution to the ensemble. As a reminder, we use the function ‘fitdist’ from the package ‘fitdistrplus’. If you need more of a review, I suggest you look back at the material in Meteo 815 on distribution fits.
# estimate parameters for normal distribution
library(fitdistrplus)
NormParams <- fitdist(AtlantaTemp, "norm")
# estimate normal distribution
NormEstimate <- dnorm(quantile(AtlantaTemp, probs = seq(0, 1, 0.05)),
                      NormParams$estimate[1], NormParams$estimate[2])
The first part of the code estimates the parameters of the normal distribution based on the data. The second part evaluates the fitted normal density at a set of quantiles (for plotting purposes). Now we can plot the probability density and the normal density fit. Use the code below:
# plot distribution
plot.new()
grid()
par(new=TRUE)
plot(hist(AtlantaTemp,probability =TRUE),xlab="Temperature (degrees F)", ylab="Probability Density",
main="GFS Ensemble Temperatures for Atlanta, GA 12/01/2018 00Z",
xlim=c(55.5,56.2),ylim=c(0,3))
par(new=TRUE)
plot(quantile(AtlantaTemp,probs=seq(0,1,0.05)),NormEstimate,type='l',col="dodgerblue4",lwd=3,
xlab="Temperature (degrees F)", ylab="Probability Density",
main="GFS Ensemble Temperatures for Atlanta, GA 12/01/2018 00Z",
xlim=c(55.5,56.2),ylim=c(0,3))
You should get the following figure:
The blue line is the normal distribution fit. There you have it, a parametric distribution fit (normal) for an ensemble consisting of a small set of members (11).
Alternatively, we can just spread out the existing data points (i.e., ensemble member forecasts) until we get a smoothed PDF. This is called a Kernel Density Estimation. We convert individual ensemble members into a smoothed PDF.
The figure above shows the PDF of the ensemble on the left and the KDE on the right (individual ensemble members in red dashed line). For those that are interested, you can view the equations for the KDE here [45].
When performing a KDE, there are a couple of parameters you should know about. First is the bandwidth, which describes how far to spread out each data point. The second is the kernel, which is the function used to spread the points.
Let’s give this a try by applying a KDE to our previous GEFS temperature forecast for Atlanta, GA. We will use the function ‘density’. You can set the bandwidth manually via the parameter ‘bw’, or you can use the default, ‘nrd0’, which is considered the ‘rule of thumb’ bandwidth for a Gaussian KDE. You can learn more about the bandwidth types here [46]. You can also set the ‘kernel’ parameter to a character string that tells the function which smoothing kernel to use; the default is ‘gaussian’. Run the code below to create the KDE in R and plot the results.
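A minimal sketch of that step, using the defaults described above:
# kernel density estimate of the ensemble temperatures
# (default Gaussian kernel and default rule-of-thumb bandwidth bw = "nrd0")
AtlantaKDE <- density(AtlantaTemp)
plot(AtlantaKDE, xlab = "Temperature (degrees F)",
     main = "KDE of GFS Ensemble Temperatures, Atlanta, GA 12/01/2018 00Z")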
You should get the following figure:
The KDE captures the bimodal feature in the distribution that the normal fit from the previous example did not.
KDE has the advantage that you don’t need to know ahead of time what distribution the ensemble members come from. In particular, they don’t have to be normally distributed (even if you used the normal curve for the kernel function, as the example above showed!). This is very handy. But you pay a price for this advantage, which is that it can take more ensemble members to get a good fit to the true PDF using KDE than it does fitting a parametric distribution.
A note on estimating the right bandwidth to use. In the example, we used the default bandwidth. Generally, the more ensemble members you have, the less smoothing you need to do, so the smaller the bandwidth you want to use. To find out more about bandwidth selection, I suggest you review the material here [45]. When in doubt, stick to the default in R which is the rule-of-thumb bandwidth for a Gaussian KDE.
Both of these PDF fitting methods can also work for joint PDFs. But the number of ensemble members required to obtain a good estimate of the joint PDF goes up faster than the number of variables in the joint PDF. In other words, it takes more than TWICE as many ensemble members to accurately determine the joint PDF of two variables than it does to determine their two PDFs separately.
We are paying a big price to acquire probabilistic information on the correlation between predictands. This cost is especially high, since NWP runs are one of the most computationally expensive human activities. Below is the joint PDF for a 30-member ensemble of temperature and dew point temperature.
The left joint density plot is a result of estimating the KDE for each variable separately and then estimating the joint PDF. The middle is the result of estimating the joint PDF directly using a joint KDE function (‘kde2d’ in R). The right panel is an overlay of the two density plots. Notice how different the PDFs are, particularly in the tails of the distribution.
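A direct joint KDE of two variables can be computed with ‘kde2d’ from the MASS package. The sketch below assumes two hypothetical member vectors, ensTemp and ensDewPoint, of equal length; they are not part of the lesson’s data files.
# joint kernel density estimate of temperature and dew point on a 50 x 50 grid
library(MASS)
jointKDE <- kde2d(ensTemp, ensDewPoint, n = 50)
# visualize the joint density as an image with contours overlaid
image(jointKDE, xlab = "Temperature (degrees F)", ylab = "Dew Point (degrees F)")
contour(jointKDE, add = TRUE)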
By comparing the error of the ensemble mean to the ensemble spread, averaged over many forecasts, we can calibrate the PDF. Standard deviation is a good metric for ensemble spread – at least for normally distributed ensembles. You can regress the absolute value of the error against the ensemble standard deviation (via linear regression). Then use the resulting equation to determine how much to scale up (or down) the ensemble spread when deriving the forecast PDF from the one you fit to the ensemble members. Please note that most ensembles are under-dispersive (i.e., the model spread is not as big as the model error).
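A minimal sketch of that spread-skill calibration, assuming hypothetical vectors obs (verifying observations), ensMean (ensemble means), and ensSD (ensemble standard deviations), one element per forecast case:
# regress the absolute error of the ensemble mean on the ensemble spread
absError <- abs(ensMean - obs)
spreadFit <- lm(absError ~ ensSD)
summary(spreadFit)
# predicted error magnitude for a new forecast whose spread is 2.5 degrees;
# comparing this to the raw spread suggests how much to inflate (or shrink)
# the width of the PDF fit to the ensemble members
predict(spreadFit, newdata = data.frame(ensSD = 2.5))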
Once you have completed this section, you should be able to create and use a rank histogram to assess the spread of an ensemble and estimate the continuous ranked probability score.
Once you have an ensemble, you need to evaluate its skill. Two methods of verifying ensembles will be discussed in this lesson: the rank histogram and the continuous ranked probability score. Both methods evaluate the spread of the ensemble and can be used to assess the skill of the forecast. Read on to learn more.
A rank histogram is a diagnostic tool to evaluate the spread of an ensemble. For each forecast, the ensemble member forecasts are ranked from lowest to highest; for N ensemble members, this defines N+1 bins, with the lowest and highest bins open-ended, and the verifying observation falls into one of these bins. (source: The Rank Histogram [47]) You then create a frequency histogram by plotting the number of observations falling in each bin over all forecasts.
A rank histogram is a great plot to eyeball if the ensemble is over or under-dispersed, or to see if the ensemble is biased. In R, you can use the package ‘SpecsVerification’ [48] to create a rank histogram.
Below are several different examples of rank histograms. Match them to the most appropriate description.
A Continuous Ranked Probability Score (CRPS) is the PDF equivalent of the MSE. The CRPS rewards forecasts based on the amount of probability placed at and around the observed value: the higher the forecast density near the observation, the lower the score. In effect, CRPS measures how close the observation is to the probability-weighted bulk of the forecast PDF. A lower CRPS indicates better model performance (the forecast PDF is sharp and places its probability near what actually happens) – thus, CRPS lets you compare different ensemble models to pick the best. In R, you can use the ‘SpecsVerification’ package to compute this score.
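For reference (this formula is not given in the lesson itself), the CRPS for a single forecast, written in LaTeX notation, compares the forecast cumulative distribution function F with the step function of the verifying observation y:

\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left[ F(x) - H(x - y) \right]^{2} \, dx, \qquad H(x - y) = \begin{cases} 0, & x < y \\ 1, & x \ge y \end{cases}

Averaging this quantity over many forecast/observation pairs gives the mean CRPS computed later in this lesson.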
For more information, I suggest you check out this link [48].
Let’s create these two graphs in R. We will start with the rank histogram. Begin by downloading this GEFS forecast output for NYC [49]. The file contains 24-hour forecast temperatures valid at 00Z from January 1st through November 30th, 2018. Start by loading in the file.
# load in temperature forecast from GFS for NYC
library(RNetCDF)
fid <- open.nc('tmp_2m_latlon_all_20180101_20181130_jzr8iFlxW9.nc')
mydata <- read.nc(fid)
Now, extract out the forecast valid time.
validTime <- mydata$intValidTime
We want to match up the observations at the time for which the forecast was valid. First, download this data set of hourly temperature observations [50]. Load in the data set and convert the time variable to match that of the valid time forecast variable (YearMonthDayHour) by using the code below:
# load in hourly temperature observations
mydata2 <- read.csv("NCYTemps2018.csv")
obsTime <- as.POSIXct(strptime(mydata2$Date,"%m/%d/%Y %H:%M"))
tempTime <- format(obsTime,"%Y%m%d%H")
Now, loop through each forecast and pull out the corresponding observation that matches the time the forecast is valid.
# for each forecast valid time find the corresponding observation time
obsTemp <- array(NA,length(validTime))
for(itime in 1:length(validTime)){
goodIndex <- which(as.character(validTime[itime]) == tempTime)
obsTemp[itime] <- mydata2$AirTemp[goodIndex]
}
Next, convert the forecast temperature to Fahrenheit (from Kelvin) and transpose the matrix, so the format is (n,k) where n is the forecast and k is the ensemble member. This is the setup needed to use the rank histogram function in R.
# convert kelvin to F
forecastTemp <- (9/5)*(mydata$Temperature_height_above_ground - 273.15) + 32
# transpose forecast temp (n,k) where k is the number of members
forecastTemp <- t(forecastTemp)
Now, we can create the rank histogram. We will first estimate the histogram using the ‘Rankhist’ function, and then plot the histogram using the ‘PlotRankhist’ function. Run the code below:
# load in library
library('SpecsVerification')
# create rank histogram
rh <- Rankhist(forecastTemp,obsTemp,handle.na="use.complete")
PlotRankhist(rh)
You should get the following figure:
This is an example of an ensemble that has too small of a spread.
Now, let’s make a CRPS plot. Begin by downloading this forecast output of temperature for NYC [51]. Again, the forecast output is from January 1st, 2018 – November 30th, 2018 but, this time, we have the output for all forecast times from 0 hours out to 192 hours (8 days). We will use the same observational data set as before. Begin by loading in the temperature forecast.
# load in temperature forecast from GFS for NYC
library(RNetCDF)
fid <- open.nc('tmp_2m_latlon_all_20180101_20181130_jzr8YY8Qhv.nc')
mydata <- read.nc(fid)
Now, extract the time and select out the forecast hours you are interested in testing. I’m going to look at the 1-day forecast, 2-day, 3-day, and so on to the 8-day forecast. Use the code below to set this up and initialize the CRPS variable.
time <- as.Date(mydata$time/24, origin = "1800-01-01 00:00:00")
fhour <- seq(24, 192, 24)
tempTime <- format(obsTime, "%Y%m%d%H")
# initialize variable
CRPS <- array(NA, length(fhour))
Now, we are ready to loop through each forecast hour and estimate the mean CRPS. Use the code below:
# loop through each fhour and estimate the mean CRPS
for(ihour in 1:length(fhour)){
fhourIndex <- which(mydata$fhour == fhour[ihour])
# extract out valid times
validTime <- mydata$intValidTime[fhourIndex,]
# match up observations
obsTemp <- array(NA,length(time))
for(itime in 1:length(validTime)){
goodIndex <- which(as.character(validTime[itime]) == tempTime)
if(length(goodIndex) == 0){
obsTemp[itime] <- NA
}else{
obsTemp[itime] <- mydata2$AirTemp[goodIndex]
}
}
# convert kelvin to F
forecastTemp <- (9/5)*(mydata$Temperature_height_above_ground[fhourIndex,,]-273.15)+32
# transpose forecast temp (n,k) where k is the number of members
forecastTemp <- t(forecastTemp)
# compute the CRPS
tempCRPS <- EnsCrps(forecastTemp,obsTemp)
CRPS[ihour] <- mean(tempCRPS,na.rm=TRUE)
}
The code starts by pulling out the valid times for the particular forecast hour of interest. Then we match up the observations for the valid forecast time. Next, we convert the forecast temperature to Fahrenheit and set up the matrix for the forecast values to be in the format (n,k) where n is the forecast and k is the ensemble member. Finally, you compute the CRPS using the function ‘EnsCrps’ and save the mean value to our CRPS variable.
Now, plot the mean CRPS for each forecast hour using the code below:
# plot
plot.new()
grid()
par(new=TRUE)
plot(fhour/24,CRPS,xlim=c(1,8),ylim=c(3,4.5),type="b",pch=20,lty=1,lwd=2,
xlab="Forecast Day",ylab="CRPS (degrees)",
main="GEFS Temperature Forecast NYC")
You should get the following figure:
As a reminder, a lower CRPS value is better. Thus, the 1-day forecast is best (as expected) and the 8-day forecast is the worst (again as expected). By adding in more ensemble models (such as the ECMWF) we could assess which model is best, based on which one has the lowest CRPS value, for each forecast hour.
Links
[1] https://www.e-education.psu.edu/meteo825/node/3
[2] https://www.e-education.psu.edu/meteo825/node/6
[3] https://www.e-education.psu.edu/meteo825/node/517
[4] https://www.e-education.psu.edu/meteo825/node/524
[5] https://www.e-education.psu.edu/meteo825/node/525
[6] https://www.e-education.psu.edu/meteo825/node/526
[7] https://www.e-education.psu.edu/meteo825/node/527
[8] https://www.e-education.psu.edu/meteo825/node/694
[9] http://www.cawcr.gov.au/projects/verification/
[10] https://web.archive.org/web/20210805162759/https://www.weather.gov/mdl/verification_home
[11] https://www.swpc.noaa.gov/sites/default/files/images/u30/Forecast%20Verification%20Glossary.pdf
[12] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EXBlkrpYMUxNv8YLarODitkBnUWLeHUGJjePI8D1xgeYWA?download=1
[13] http://www.emc.ncep.noaa.gov/gmb/STATS_vsdb/
[14] https://www.emc.ncep.noaa.gov/users/verification/global/
[15] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EUQLlzJEpPJJv37X9Fz7gr4BdJaTJ78NLH2r80XJ5M7i3A?download=1
[16] https://mesonet.agron.iastate.edu/mos/fe.phtml
[17] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/Eehp7Qx3rn9AhBAreH6QMVMBrmKJ6vzb-FHuhSr6e5UU-g?download=1
[18] https://mesonet.agron.iastate.edu/agclimate/hist/dailyRequest.php
[19] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EcSL6ma8QoRLnlwbiGmyrPUBKuKgbW4Y9nrsd3FDMcx8IA?download=1
[20] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EQbOuwUtTnFAj3-RVkY81owBlu4LufB5nXaQTXZh9FAe2g?download=1
[21] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/ESF0fs7v6ZdHpQnDJEMYNd0BiXYMbUz-9pl2s7X013y4Ig?download=1
[22] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EYbTz0Lj4atMvdbmFNO2lncBCYuOwCzw2EODPRXGpPYukQ?download=1
[23] https://youtu.be/u-_4S5l-Cuw
[24] https://www.statmethods.net/advstats/cart.html
[25] https://cran.r-project.org/web/packages/rpart/rpart.pdf
[26] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/ESW1xP9yjJZJjC-QZV1MyQQBDYjR3ozt86J4OGzQS8tWjg?download=1
[27] https://web.archive.org/web/20231002185827/https://aviationweather.gov/taf/help?page=plot
[28] https://commons.wikimedia.org/w/index.php?curid=4310325
[29] https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/sas405_psu_edu/EY7OSE50zbtCgwns32zdFZMB3eEHKcBjkwhsiOTEym6HJA?download=1
[30] https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/GHCND_documentation.pdf
[31] http://ai-depot.com/Tutorial/RuleBased-Methods.html
[32] http://dtcenter.org/testing-evaluation/verification
[33] http://journals.ametsoc.org/doi/10.1175/1520-0434%282004%29019%3C0552%3AAOTSRU%3E2.0.CO%3B2
[34] http://journals.ametsoc.org/doi/abs/10.1175/1520-0493(1998)126%3C3292%3ATRBESA%3E2.0.CO%3B2
[35] http://www.atmos.umd.edu/~ekalnay/syllabi/AOSC630/ensemble101.pdf
[36] https://journals.ametsoc.org/doi/pdf/10.1175/1520-0434%282000%29015%3C0559%3ADOTCRP%3E2.0.CO%3B2
[37] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EVs-Uor5YERNrx87DQkUamsBl4EMFyVkh29dQTTNBOTINA?download=1
[38] http://asr.science.energy.gov/
[39] https://science.nasa.gov/ems/13_radiationbudget
[40] http://www.comet.ucar.edu/
[41] https://www.ncei.noaa.gov/products/weather-climate-models/global-ensemble-forecast
[42] https://en.wikipedia.org/wiki/Curse_of_dimensionality
[43] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/ES7di-4CYopEmIredWzzcNYB8Wg43gZBIbPUFfawe2jStw?download=1
[44] https://commons.wikimedia.org/w/index.php?curid=57332968
[45] https://en.wikipedia.org/wiki/Kernel_density_estimation
[46] https://stat.ethz.ch/R-manual/R-devel/library/stats/html/density.html
[47] https://www.esrl.noaa.gov/psd/people/tom.hamill/MSMM_vizverif_hamill.pdf
[48] https://cran.r-project.org/web/packages/SpecsVerification/SpecsVerification.pdf
[49] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EXKYPpN_tAVIlrkq-Pff0BsBrfe8cDOgTLKktdJYO-eR5w?download=1
[50] https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/sas405_psu_edu/EXDGcQcvzghLsCKhXKv95-0B8x4RSUz4XjU-2foAmE4D_A?download=1
[51] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EfLu7edGHlhOu0DDsLdi65sBGH1DCGRTMczuhTI_d4TFbA?download=1