METEO 810
Weather and Climate Data Sets

Missing Values

Prioritize...

By the end of this section, you should be able to apply different methods for replacing missing values and list the pros and cons for each method.

Read...

Remember that in R, missing data, or data that is "not available," is designated by the constant NA. You can think of NA as a placeholder that has no value. Now, let's assume we have a data set where known bad/questionable/missing values have been replaced with NAs (using a line of code like: my_obs$temp[which(my_obs$temp==-99999)]<-NA). The next question becomes: how do we deal with these non-existent values? There are three main actions we can take: remove rows of data containing NAs completely, fill them in (with a proxy value), or leave them alone. Which you choose will depend on your analysis and how sensitive that analysis is to missing data.
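For instance, here is a minimal, self-contained sketch of that clean-up step (the data frame and the -99999 sentinel value are made up for illustration):

# Hypothetical observations where -99999 marks bad/missing data
my_obs<-data.frame("day"=1:5,
                   "temp"=c(54.2, -99999, 61.0, -99999, 58.3))

# Replace the sentinel value with NA, then count the missing entries
my_obs$temp[which(my_obs$temp==-99999)]<-NA
print(sum(is.na(my_obs$temp)))  # 2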

Removal

Generally, if we are interested in some bulk characteristic of the data (and depending on the amount of data we have), simply removing missing values may be the best option. There are two ways to remove missing values. The first is listwise deletion, which removes every row containing a missing value from the data set. For example, if I have a time series of temperature with 100 values, and the values at positions 2 and 24 are missing, I would delete those values, reducing my time series to 98 values. In R, you can use the function na.omit(...), which performs a listwise deletion.

But what if you are examining more than one variable? Let's say you want to look at a temperature and a relative humidity time series at a station -- again, there are 100 values for each. As with the first example, the temperature series is missing 2 values, but the relative humidity series is missing 3 values, at positions 2, 18, and 44. With listwise deletion, all cases with missing values are deleted. So, for the example above, we would delete the 2nd, 18th, 24th, and 44th values, resulting in a matched time series that is only 96 values long. In this case, you can think of listwise deletion as a "complete-case analysis": we keep only the cases where all variables have real data. Let's consider some code to demonstrate the use of na.omit(...)

#generate some random data and place it in a data frame
# the function "rnorm" creates random normal data with the 
# given mean and standard deviation
my_obs<-data.frame("day"=1:100, 
                   "temp"=rnorm(100,mean=50,sd=15), 
                   "rh"=rnorm(100,mean=70,sd=15))

# Create some missing entries
my_obs$temp[which(my_obs$temp<25)]<-NA
my_obs$rh[which(my_obs$rh>100)]<-NA

# Print out number of observations and number of NAs for each variable
print(paste("Num obs:",length(my_obs$day),", Missing T:",sum(is.na(my_obs$temp)),", Missing RH:",sum(is.na(my_obs$rh))))

# Throw out rows with missing data
my_obs<-na.omit(my_obs)

# Print out the new number of observations and verify no NAs are left
print(paste("Num obs:",length(my_obs$day),", Missing T:",sum(is.na(my_obs$temp)),", Missing RH:",sum(is.na(my_obs$rh))))

A second type of omission is called pairwise deletion, which can be described as an "available-case analysis." Pairwise deletion attempts to minimize the loss caused by listwise deletion: instead of discarding a case everywhere, each calculation uses all of the cases that have data for the variables involved in that particular calculation. So, in the example above, a statistic computed from temperature alone would lose only the 2 cases missing temperature (positions 2 and 24), a statistic computed from relative humidity alone would lose only the 3 cases missing relative humidity (positions 2, 18, and 44), and only a calculation involving both variables would lose all 4 cases. This means that different statistics may be calculated from different subsets of the data. If you have a long data set with few missing values, or your analysis requires complete samples (containing no missing values), I suggest going with listwise deletion. However, if you can't afford the reduction in sample size, pairwise deletion may be an option.
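You can see this distinction directly in R's cor(...) function, whose use parameter selects how missing values are handled. Here is a quick sketch (run it on the my_obs data frame from the code above, with the na.omit(...) line commented out so the NAs are still present):

# Listwise: every entry of the correlation matrix uses only the rows
# that are complete in ALL variables
cor(my_obs, use="complete.obs")

# Pairwise: each entry uses all rows complete in THAT pair of variables,
# so the day/temp correlation keeps rows where only rh is missing
cor(my_obs, use="pairwise.complete.obs")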

Imputation

Imputation is a general term used to describe the process of replacing missing data with generated (artificial) values. You should only use imputation if missing values will adversely affect your analysis or visualization of the data. There are many methods of imputation, and choosing the best method will depend on the data itself as well as the analysis you wish to perform. We will only discuss a few simple methods here; more advanced approaches will be discussed in future courses.

One of the more common and easy methods of imputation is interpolation. Interpolation creates a new value that "fits" within the surrounding context of other observations and works particularly well for temporal or spatial data sets. Interpolation comes in many flavors. Linear interpolation approximates the missing value by first constructing a trend 'line' that spans two known observations, then reading a value off that line according to how far the missing point lies from each of the two known points: for a missing point x between observations (x1, y1) and (x2, y2), the estimate is y = y1 + (y2 - y1)*(x - x1)/(x2 - x1). Linear interpolation assumes that observations vary quasi-linearly between pairs of points. This is a decent approximation if the resolution of the observations is finer than the timescale of the significant natural variability. For example, interpolating a missing hourly temperature will be more accurate than interpolating a missing daily temperature, because hourly temperature can, in most cases, be treated as varying linearly, while average daily temperature shows considerably more variability from day to day.

R has several built-in functions for linear interpolation. Check out the approx(...) function, which creates a linear approximation for a given data set containing missing values.

# Create some fake temperature data
time<-c(1:48)
temp<-(-30)*sin(time/24*pi+pi/10)+70

# Pick 10 random times and replace those temperatures with NAs
missing_times<-sample(2:47, 10, replace=FALSE)
temp[missing_times]<-NA

# plot the time series using square symbols
plot(time, temp, pch=0, cex=2,
     main="Using approx() to Fill in Missing Values",
     xlab="Observations",
     ylab="Temperature (F)")

# interpolate the missing data 
# data that's not missing will just be the same
complete_temp<-approx(time, temp, xout=time)

# plot the interpolated data using stars
points(complete_temp$x, complete_temp$y, pch=8)

# How good is it? This computes the root mean squared  
# error caused by the interpolation and writes it on the plot.
text(35,50, paste0("RMSE: ",
    sqrt(mean((complete_temp$y[missing_times]-
        ((-30)*sin(missing_times/24*pi+pi/10)+70))^2))))

Here is the graph created by the R script above. The squares represent the original "observed" data with some random missing values. The stars show the interpolated time series. Note that the root mean squared error is less than half a degree for this run (your numbers will differ slightly because the missing times are chosen at random).

Another method is to use a more complex function to model the gap created by the missing data. In R, we can use the function spline(...) to fit cubic splines (3rd-order polynomials) instead of straight lines to find the missing values. Modify the code above by replacing the line containing the approx(...) function with: complete_temp<-spline(time, temp, xout=time, method="fmm"). Here's the graph of the spline interpolation for our simulated data. Notice how small the RMS error is! Mind you, real data won't behave quite so nicely, but you can see how higher-order approximations can outperform linear ones.
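In context, the only change to the earlier script is this one line (method="fmm" selects the default spline of Forsythe, Malcolm and Moler):

# interpolate the missing data with a cubic spline instead of a line
complete_temp<-spline(time, temp, xout=time, method="fmm")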

In the case of spatial data, you can also fill in a missing value by considering surrounding points. There are many different methods of varying complexity. If you are looking for some simple bi-linear or bi-cubic interpolation routines, check out the package "akima".
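As a rough sketch of what that looks like (this assumes the akima package is installed; the station locations and temperatures below are invented for illustration):

# install.packages("akima")  # if needed
library(akima)

# Hypothetical station coordinates (x, y) and temperatures (z);
# the observation at the center point (1, 1) is missing
x<-c(0, 1, 2, 0, 2, 0, 1, 2)
y<-c(0, 0, 0, 1, 1, 2, 2, 2)
z<-c(55, 57, 60, 54, 59, 52, 56, 58)

# Bi-linear interpolation of the missing value at (1, 1)
est<-interp(x, y, z, xo=1, yo=1, linear=TRUE)
print(est$z)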

Leave the Missing Values In

The last option, of course, is to leave the missing values in the data set. This is perhaps the most informative choice for the users of the data and may itself provide additional information pertinent to your analysis. For example, let's say you wanted to compute the mean daily temperature at a location for December, January, and February. However, on February 2 an ice storm caused a two-week data loss (the sensor was damaged) at your site. Should you simply calculate a mean with the data you have? Or is two weeks (out of roughly 13 weeks total) too much missing data? We'll tackle such questions in future courses, but suffice it to say, you should give serious consideration to at least providing a notation that a significant amount of data was missing from your calculation of the mean.

Leaving missing data in the data set need not sabotage your analysis, however. In R, many functions have parameters that allow you to ignore NA values. Look for a parameter such as na.rm (or, in model-fitting functions, na.action) on the help page for a particular function. Setting na.rm to TRUE causes the function to exclude the NA values from any calculations it performs. Consider the help page for the function mean(...). Modify the code above (remove the listwise deletion) to see the difference na.rm=TRUE makes when computing a mean with NA values.
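For example (a minimal sketch with a made-up five-value temperature sample):

# a small temperature sample with one missing value
temps<-c(51.2, 48.7, NA, 55.0, 49.3)

mean(temps)              # returns NA -- the missing value poisons the result
mean(temps, na.rm=TRUE)  # 51.05, the mean of the four available values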

You will also see in future courses that the number of samples is a key parameter in the calculation of various statistics. If you leave the missing values in, the number of samples stays the same, but no information is added. Deleting the missing values reduces the number of samples, which can change the results; the size of the effect varies depending on the specific statistic. You can perform a sensitivity analysis to gauge this effect (compute the statistic with the missing values left in or filled in, and again with them removed).
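For example, reusing the time and temp vectors from the interpolation example above, a crude sensitivity check might look like this:

# the same statistic computed two ways: with the missing values
# deleted versus filled in by linear interpolation
temp_removed<-temp[!is.na(temp)]               # deletion: 38 samples remain
temp_filled<-approx(time, temp, xout=time)$y   # imputation: all 48 retained

print(c(mean(temp_removed), mean(temp_filled)))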