Prioritize...
After you have read this section, you should be able to fit parametric distributions to an ensemble, perform a kernel density estimation to fit a non-parametric PDF, and calibrate the resulting PDFs.
Read...
Since each model run produces only one forecast, we need an ensemble of runs large enough that the resulting histogram of forecasts is a good approximation of the true PDF. The more variables we want to include in a joint PDF, the harder this becomes (the Curse of Dimensionality). For either univariate or joint PDFs, you can fight the curse of dimensionality (a bit) by bringing more information to the table - in particular, knowledge of what family the PDF belongs to (e.g., normal). Read on to learn about the different methods for fitting distributions to ensembles.
Fitting Parametric Distributions
It takes far fewer (10 to 30) ensemble members to fit a known distribution reasonably well than it does to form a histogram that is a good approximation of the true PDF.
Let’s give this a try. To begin, download this data set of GEFS temperature forecasts valid at 00Z on December 1st, 2018 for Atlanta, GA. Please note that I sometimes say GEFS and sometimes I say GFS. It's the same model. Use the code below to load in the NetCDF file.
# load in GEFS model temperature data
library(RNetCDF)
fid <- open.nc("tmp_2m_latlon_all_20181201_20181201_jzr8IGeTt4.nc")
mydata <- read.nc(fid)
Next, average over the lat and lon to create a temperature estimate for Atlanta, GA.
# average over Atlanta
AtlantaTemp <- apply(mydata$Temperature_height_above_ground, 3, mean)

# convert from Kelvin to Fahrenheit
AtlantaTemp <- (9/5)*(AtlantaTemp - 273.15) + 32
We are left with a temperature value in degrees F for each of the 11 members of the ensemble. Let’s plot the probability density for the temperature. As a reminder, the probability density describes the relative likelihood of each possible value (not to be confused with the probability histogram, which gives the probability of an event falling in each bin). Run the code below:
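A minimal sketch of that code is given below; the axis labels and title are assumptions based on the figure that follows.

# plot the probability density of the ensemble temperatures
hist(AtlantaTemp, probability = TRUE,
     xlab = "Temperature (degrees F)", ylab = "Probability Density",
     main = "GFS Ensemble Temperatures for Atlanta, GA 12/01/2018 00Z")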
[Figure: probability density histogram of the 11 GEFS ensemble member temperatures]
Temperature is generally normally distributed. Let’s try to fit a normal distribution to the ensemble. As a reminder, we use the function ‘fitdist’ from the package ‘fitdistrplus’. If you need more of a review, I suggest you look back at the material in Meteo 815 on distribution fits.
# estimate parameters for normal distribution
library(fitdistrplus)
NormParams <- fitdist(AtlantaTemp, "norm")

# estimate the normal density at the sample quantiles
NormEstimate <- dnorm(quantile(AtlantaTemp, probs = seq(0, 1, 0.05)),
                      NormParams$estimate[1], NormParams$estimate[2])
The first part of the code estimates the parameters of the normal distribution from the data. The second part evaluates the fitted density at the sample quantiles (for plotting purposes). Now we can plot the probability density and the normal density fit. Use the code below:
# plot the probability density histogram
hist(AtlantaTemp, probability = TRUE,
     xlab = "Temperature (degrees F)", ylab = "Probability Density",
     main = "GFS Ensemble Temperatures for Atlanta, GA 12/01/2018 00Z",
     xlim = c(55.5, 56.2), ylim = c(0, 3))
grid()

# overlay the fitted normal density
lines(quantile(AtlantaTemp, probs = seq(0, 1, 0.05)), NormEstimate,
      col = "dodgerblue4", lwd = 3)
You should get the following figure:
The blue line is the normal distribution fit. There you have it: a parametric distribution fit (normal) for an ensemble with a small number of members (11).
Kernel Density Estimation
Alternatively, we can simply spread out the existing data points (i.e., the ensemble member forecasts) until we get a smoothed PDF. This is called kernel density estimation (KDE): we convert the individual ensemble members into a smoothed PDF.
[Figure: PDF of the ensemble (left) and the KDE (right), with the individual ensemble members shown as red dashed lines]

For those who are interested, you can view the equations for the KDE here.
When performing a KDE, there are a couple of parameters you should know about. First is the bandwidth, which describes how far to spread out each data point. The second is the kernel, which is the function used to spread the points.
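For reference, the standard form of the KDE makes both parameters explicit: given ensemble members x_1, ..., x_n, kernel K, and bandwidth h, the estimated density is the average of a scaled kernel centered on each member:

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)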
Let’s give this a try by applying a KDE to our previous GEFS temperature forecast for Atlanta, GA. We will use the function ‘density’. You can set the bandwidth manually via the parameter ‘bw’, or use the default, ‘nrd0’, which is considered the ‘rule of thumb’ bandwidth for a Gaussian KDE. You can learn more about the bandwidth types here. You can also set the ‘kernel’ parameter to a character string that tells the function which smoothing kernel to use; the default is ‘gaussian’. Run the code below to create the KDE in R and plot the results.
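A minimal sketch of that code (the plot title and axis label are assumptions):

# kernel density estimate with the default bandwidth and Gaussian kernel
AtlantaKDE <- density(AtlantaTemp)

# plot the resulting smoothed PDF
plot(AtlantaKDE, xlab = "Temperature (degrees F)",
     main = "KDE of GFS Ensemble Temperatures for Atlanta, GA 12/01/2018 00Z")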
You should get the following figure:
The KDE captures the bimodal feature of the distribution that the normal fit from the previous example missed.
KDE has the advantage that you don’t need to know ahead of time what distribution the ensemble members come from. In particular, they don’t have to be normally distributed, even when you use a normal curve as the kernel function (as the example above showed!). This is very handy. But you pay a price for this advantage: it can take more ensemble members to get a good fit to the true PDF using KDE than it does to fit a parametric distribution.
A note on estimating the right bandwidth to use: in the example, we used the default. Generally, the more ensemble members you have, the less smoothing you need to do, so the smaller the bandwidth you want to use. To find out more about bandwidth selection, I suggest you review the material here. When in doubt, stick to the default in R (‘nrd0’), the rule-of-thumb bandwidth for a Gaussian KDE.
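To see the effect of the bandwidth for yourself, here is a quick sketch comparing a narrow and a wide bandwidth on the same ensemble (the two ‘bw’ values are arbitrary choices for illustration):

# compare a narrow and a wide bandwidth on the same ensemble
plot(density(AtlantaTemp, bw = 0.02), col = "firebrick", lwd = 2,
     xlab = "Temperature (degrees F)", main = "Effect of KDE Bandwidth")
lines(density(AtlantaTemp, bw = 0.2), col = "dodgerblue4", lwd = 2)
legend("topleft", legend = c("bw = 0.02", "bw = 0.2"),
       col = c("firebrick", "dodgerblue4"), lwd = 2)

The narrow bandwidth hugs the individual members (a spiky estimate), while the wide bandwidth smooths them into a single broad bump.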
Joint PDFs
Both of these PDF-fitting methods also work for joint PDFs. But the number of ensemble members required to obtain a good estimate of the joint PDF grows faster than the number of variables in the joint PDF. In other words, it takes more than TWICE as many ensemble members to accurately estimate the joint PDF of two variables as it does to estimate their two PDFs separately.
We are paying a big price to acquire probabilistic information on the correlation between predictands. This cost is especially high because NWP runs are among the most computationally expensive calculations routinely performed. Below is the joint PDF for a 30-member ensemble of temperature and dew point temperature.
The left joint density plot results from estimating the KDE for each variable separately and then combining them into a joint PDF (their product, which assumes the variables are independent). The middle plot results from estimating the joint PDF directly using a joint KDE function (‘kde2d’ in R). The right panel is an overlay of the two density plots. Notice how different the PDFs are, particularly in the tails of the distribution.
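To make the two approaches concrete, here is a sketch in R. The dew point vector (AtlantaDewPt) is hypothetical, invented for illustration since dew point is not in the file we downloaded:

# hypothetical ensemble dew point forecasts (one per member), for illustration only
set.seed(1)
AtlantaDewPt <- AtlantaTemp - 4 + rnorm(length(AtlantaTemp), 0, 0.3)

# approach 1: separate 1-D KDEs, combined under an independence assumption
tKDE <- density(AtlantaTemp, n = 50)
dKDE <- density(AtlantaDewPt, n = 50)
JointIndep <- outer(tKDE$y, dKDE$y)   # 50 x 50 grid of density products

# approach 2: estimate the joint PDF directly with a 2-D KDE
library(MASS)
JointKDE <- kde2d(AtlantaTemp, AtlantaDewPt, n = 50)

# contour plot of the direct joint estimate
contour(JointKDE, xlab = "Temperature (degrees F)",
        ylab = "Dew Point (degrees F)",
        main = "Joint KDE of Temperature and Dew Point")

The direct 2-D estimate retains the correlation between the two variables; the product of the marginals throws it away, which is why the two panels differ most in the tails.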
Calibrating PDFs
We can calibrate the PDF by comparing the error in the PDF mean to the PDF spread, with both averaged over many past forecasts. Standard deviation is a good metric for ensemble spread - at least for normally distributed ensembles. You can regress the absolute value of the error against the ensemble standard deviation (via linear regression), then use the resulting equation to determine how much to scale up (or down) the ensemble spread when deriving the forecast PDF from the one you fit to the ensemble members. Please note that most ensembles are under-dispersive (i.e., the model spread is not as big as the model error).
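As a rough sketch of that regression, with synthetic spread and error values standing in for a real archive of past forecasts (all numbers here are invented for illustration):

# synthetic archive of 200 past forecasts: ensemble spread and ensemble-mean error
set.seed(42)
EnsSD      <- runif(200, 0.5, 3)
EnsMeanErr <- rnorm(200, mean = 0, sd = 1.4 * EnsSD)  # errors larger than spread,
                                                      # mimicking an under-dispersive ensemble

# regress |error| on ensemble standard deviation
SpreadFit <- lm(abs(EnsMeanErr) ~ EnsSD)

# scale today's fitted-PDF spread accordingly
TodaySD      <- sd(AtlantaTemp)
CalibratedSD <- predict(SpreadFit, newdata = data.frame(EnsSD = TodaySD))

You would then use CalibratedSD in place of the raw ensemble standard deviation when constructing the forecast PDF.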