As we near the end of this course, I’d like to provide a bigger picture of how we can utilize all of the tools we’ve learned so far. For this lesson, nothing new will be presented. Instead, we will combine all the statistical metrics we have already learned into one process - a simplified yet effective template for data mining. Please note that there are many ways to go about data mining; this lesson will only present one method.
Before we talk about data mining, I want to present a general motivation for the case study we will focus on. In this lesson, we are going to look at corn. As an agricultural product, you might think that weather (and climate) play a big part in the market. But there are two reasons why this is not necessarily true.
The first is that corn has been genetically altered many times to withstand the extremes of weather. In particular, ongoing research has created drought-tolerant corn hybrids. The second part of this problem is that corn, and in particular corn prices, are largely governed by the futures market. Understanding futures markets is not a requirement for this class, but if we want to ask the right questions, we need to learn a little about all the aspects of the problem. Check out this video that describes the futures markets.
PRESENTER: A $3 box of corn cereal stays at roughly the same price day to day and week to week. But corn prices can change daily, sometimes by a few cents, sometimes by a lot more. Why does the cost of processed food generally stay quite stable, even though the crops that go into them have prices that fluctuate?
It's partly thanks to the futures market. The futures market allows the people who sell and buy large quantities of corn to insulate you, the consumer, from those changes without going out of business themselves. Let's meet our corn producer, this farmer. Of course, she's always looking to sell her corn at a high price. And on the other side, our corn user. This cereal company is always looking to buy corn at a low price.
Now the farmer has a little bit of a problem because her whole crop gets harvested at once. Lots and lots of farmers will be harvesting at the same time, and the huge supply can send the price falling. And even though that price might be appealing to the company that makes cereal from corn, it doesn't want to purchase all of its corn at once because, among other reasons, it would have to pay to store it.
[CASH REGISTER SOUND EFFECTS]
But it's fortunate that corn can be stored because that means it can be sold and bought throughout the year. And this is where the futures market fits in. Buyers and sellers move bushels around in the market, though actual corn rarely changes hands. Instead of buying and selling corn, the farmer and cereal maker buy and sell contracts. Now, we're getting closer to peace of mind for both sides because a futures contract provides a hedge against a change in the price. This way, neither side is stuck with only whatever the market price is when they want to buy or sell.
These contracts can be made at any time, even before the farmer plants the corn. She'll use the futures market to sell some of her anticipated crop on a certain date in the future. Of course she's not going to sell all of her corn on that contract, just enough corn to reassure her that a low price at harvest won't ruin her business. The contract provides that security. The cereal company uses the same market to buy bushels. Their contract protects against a high price later. Contracts will gain or lose money in the futures market.
If the price goes high, the farmer loses money on that futures contract. Because she's stuck with it. But that's OK. Because now, she can sell the rest of her corn-- what wasn't in that contract-- at the higher price. That offsets her loss in the futures market. If, at harvest time, the price of corn is low, well, that's exactly why she entered the futures market. The low price means her contract makes money. So that profit shields her from the sting of the low price she'll get for the bushels she sells now.
The corn cereal company doesn't like those higher prices, and that's why they have a futures contract. They make money on it and can use that profit to cover the higher price of the corn they now need to buy. The futures market serves as a risk management tool. It doesn't maximize profit. Instead, it focuses on balance. And in this way, it keeps your cereal from breaking your weekly shopping budget.
The futures market is regulated more by contracts than by supply and demand. Weather, however, does play an important role (source: "Investing Seasonally In The Corn Market [1]"). If we can predict the amount of corn that will be produced or the amount of acreage that will be harvestable (not damaged from weather), that information could be used to help farmers decide how many contracts to sell now or whether a company should wait to buy a contract because the supply will be larger (greater harvest) than anticipated.
Our goal for this lesson is to determine how much weather impacts the corn supply (in the form of harvest, yield, price, etc.), and if certain weather events can be used to predict the corn harvest, providing actionable information on whether to buy or sell a futures contract.
By the end of this section, you should be able to describe how data mining works, know when to use it, and explain why it may be beneficial.
Big data, data mining, data analytics. These are all popular terms you probably have seen in the news recently. But what exactly is data mining, and how can it be used on weather data? Data mining might sound complicated, but in reality you already know all about it.
What exactly is data mining? Well, let’s start by stating the goal. The general goal of data mining is to transform data into useful, sometimes actionable, information. It is the general process of looking through the data and finding key patterns that could be used to predict something. Data mining will examine massive amounts of information to determine which variables are pertinent and which are nonessential with respect to predicting the variable of interest.
This might sound amazing - we have an easy and quick way to obtain useful information in a predictive format. But I stress that you should tread carefully. Data mining can be used for many reasons, but when abused, it can produce misleading results and erroneous interpretations. An understanding of what's behind this data mining 'machine' is important. You should be able to answer questions such as: How did we find the patterns? Can we repeat them? What do the results actually mean?
My suggestion is to use data mining for two purposes. The first purpose is to eliminate irrelevant data. Big data is called big data for a reason. There is an enormous amount of data out there, especially for weather and climate. Here is a graph illustrating the amount of data available from NOAA.
The second reason to use data mining is to provide an initial pass at predictive analytics. Use it to determine whether the data you collected can actually be used to predict the variable of interest. Then you can take the variables deemed essential and examine the model in more detail, evaluating and changing when necessary.
I want to stress that data mining should not be used as a 'one and done' tool. Although the process can be helpful, letting a computer decide things can be risky. You need to know what's going on inside the black box. Simply running a data mining scheme and accepting its output is not okay. A thoughtful reasoning behind the results is mandatory (at least for this class).
The process is pretty simple. Supply the data and let R do everything else. But like I stressed above, we will focus on what is actually being performed during the data mining process. Here is an overview of what we will do to set up the analysis. Each step will be discussed in more detail later on. As a reminder, there are many ways to go about data mining. This is just one example that utilizes the tools you have already learned.
Formulating the scope of the question can be difficult. A question that is too wide and expansive can make it difficult to interpret the results, but a question that is too specific doesn’t use data mining to its full advantage. You want the question to optimize the ability of data mining and your ability to interpret the results.
For this lesson, we are going to look at one case study, related to corn, with one question. The goal is to determine if weather data can play a role in the futures market of corn. Specifically, can weather data aid in the decision on whether to buy contracts for corn? To do this, we want to be able to predict the corn harvest. Here is my initial question:
Can the weather during the growing season be used to predict the corn harvest in the United States?
We can’t narrow our question down more until we get an idea of what data is available. It is narrow enough to give us a starting point to look for data, but broad enough that we can select a variety of variables.
At the end of this section you should be able to download storm events data and fill in missing values using various methods, and understand when one method may be more appropriate over another.
Data mining begins with the data. It’s the most important part of the process. You should carefully consider which variables to actually examine and how to create a clean dataset. Although data mining can look at an abundance of variables, we still want to be able to understand the results, and too many variables can be overwhelming to digest.
Let's begin by selecting the data and the specific variables for this case study. There are two types of variables we will need. The first is the response variable(s), or the variable(s) we will try to predict. The second is the factor variables, or the variables that will be used to predict the response variable(s).
Let’s start by choosing the response variables which are going to be related to corn harvest. We need to rely on what’s freely available. For agriculture purposes in the United States, a good place to start is the Department of Agriculture. They have an Economics, Statistics and Market Information System that archives statistics on different agricultural crops. One such crop is sweet corn [10]. (Note that there are other types of corn, specifically feed corn, but for now I’m going to focus on sweet corn).
Before even looking at the details in the data (search for sweet corn), you will probably notice that some data is organized by states while others represent the entire Continental United States (CONUS). Weather is location-specific, so right away I know I want to look at the individual states instead of CONUS.
You might also notice that some datasets are labeled ‘fresh’ while others are labeled ‘processed’. If you remember from the motivational video at the beginning of the lesson, they specifically discussed processed corn. Corn futures were bought and sold primarily for companies that would use the corn to create a product. Sticking with that route, I will focus on processed sweet corn.
If you open up one of the Excel sheets [11], you will find seven variables: Year, Acreage Planted, Acreage Harvested, Yield per Acre, Production, Season-Average Price, and Value of Production. Most of these variables are self-explanatory. The only one I will mention is the value of production, which is simply the production amount times the average price. With the exception of acreage planted, all of these variables are metrics of how good the harvest was. Five response variables to predict is not unreasonable, so we will keep them all. The acreage planted can be used as a known (factor) variable to aid in prediction, since we will know that information each year before the growing season begins. Let's refine the question now.
How does the weather, along with acreage planted, affect the value of production, the average price, the total production, the yield per acre, or the acreage harvested for sweet corn intended for processing in the United States?
You will notice that not all states in CONUS are listed. This could be due to lack of data or simply that those states do not grow corn. Whatever the reason, we will stick with what’s available and let those states ‘represent’ the outcome for the United States. Remember, the futures market is not on a state by state basis so we need to create a model that reflects the United States.
I’ve already combined all of the state data into one file. You can download the file here [12].
We have the response variables, now we need the factor variables, which are the ones that we will use to predict the response variables. There are a number of weather variables we could look at such as daily mean temperature or precipitation. Although these more typical variables are important, I think that high impact events might have a bigger effect on the corn harvest.
For this example, I'm going to look at the Storm Events Database [13], which provides a list of several different weather phenomena. Data is available from January 1950 through July 2016; however, only tornadoes were documented from 1950-1954; tornadoes, hail, and thunderstorm wind events from 1955-1995; and all 48 event types from 1996 onward. To learn more about each variable, you can find the documentation here [14].
What’s great about this database is that it's extremely localized, meaning events are listed at the county level. This is beneficial to us because corn is not grown everywhere in one state. In fact, most states have specific counties where corn production is more prevalent. You can find this list here: Vegetables - Usual Planting and Harvesting Dates [15].
If we know the counties in each state where corn is grown, we can extract out the storm events data for only those counties. This means we are examining only cases where a storm or weather event likely affected the crop. Furthermore, we have typical harvesting dates by county. We can extract out events for each county only between the planting and harvesting dates. This narrows our time and spatial sampling to something more specific and realistic. Let’s rework our question now into the final form.
How does the frequency of storm events during the growing season, along with acreage planted, affect the value of production, the average price, the total production, the yield per acre, or the acreage harvested for sweet corn intended for processing in the United States?
We have a refined and clear question. Now we need to prepare the data. We want the frequency of each type of storm event only for the counties where corn is typically grown during the growing season. How do we actually extract that?
Here is an Excel sheet that contains the harvest dates [16] and a sheet that contains the principal counties [17] where corn is grown, broken down by state. All you need to do is bulk download the storm events data [13]. You want to download only the files that say 'StormEvents_details'. Download these files (by year) into a folder. Now we can start extracting out the data. The goal is to have one file that contains the response and factor variables temporally and spatially matched up. To begin, load in the growing season dates and key counties by state as well as the sweet corn data.
Your script should look something like this:
# load in growing season dates for all states
growingDates <- read.csv("ProcessedSweetCorn_GrowingDates.csv", header=TRUE)
# load in Key counties for Corn
keyCounties <- read.csv("ProcessedSweetCorn_PrincipleCounties.csv", header=TRUE)
# load in processed sweet corn data
ProcessedSweetCorn <- read.csv("ProcessedCorn.csv", header=TRUE)
# determine which states we have data for
states <- levels(ProcessedSweetCorn$State)
Next, we are going to prep the storm events data.
Your script should look something like this:
### Storm Event data is in multiple files (1 for each year)
# path of data
path <- "StormEvents/"
# names of files
file.names <- dir(path, pattern=".csv")
# create an array for each meteorological variable we will include
tornado <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
hail <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
tsWind <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
# the rest of these variables are only available from 1996 onward
flashFlood <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
iceStorm <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
heavySnow <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
coldWindChill <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
strongWind <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
tropicalStorm <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
winterStorm <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
flood <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
winterWeather <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
blizzard <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
excessiveHeat <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
freezingFog <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
heat <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
highWind <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
sleet <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
drought <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
extremeColdWindChill <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
frostFreeze <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
heavyRain <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
hurricane <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
tropicalDepression <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
wildfire <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
The code above saves the path to your CSV files of storm events (the ones you downloaded in bulk). It also creates, for each storm variable we will include, an array whose length matches the number of corn observations (the length of the Year column). I selected a subset of the event types that are available. For each state and year, the variable will contain the frequency of the event. Now we will loop through each year and count the number of times each storm event type occurred.
Your script should look something like this:
# loop through each year of storm event files
for(ifile in 1:length(file.names)){
# open the file
file <- read.csv(paste0(path,file.names[ifile]),header=TRUE)
For each year, we open the corresponding storm events file. Next, we loop through each state and extract out the key counties for that state.
Your script should look something like this:
# loop through each state
for(istate in 1:length(states)){
# Determine the counties to look for --> change to uppercase
counties <- factor(toupper(keyCounties$Principal.Producing.Counties[which(keyCounties$State == states[istate])]))
For this given state, we have the counties in which corn is grown. We can use this to subset the storm events data. We want to also subset the storm data for the growing season. Let’s begin by determining the typical planting date for this state.
Your script should look something like this:
### Determine the growing season dates
# extract out typical planting dates
plantingStartDate <- growingDates$Planting.Begins[which(growingDates$State == states[istate])]
plantingEndDate <- growingDates$Planting.Ends[which(growingDates$State == states[istate])]
# create planting date that is in the middle of usual start/end dates
plantingDate <- as.Date(plantingStartDate)+floor((as.Date(plantingEndDate)-as.Date(plantingStartDate))/2)
# change planting date to julian day for year of data examining
plantingDateJulian <- format(as.Date(ISOdate(substr(file$BEGIN_YEARMONTH[1],1,4),
format(plantingDate,'%m'),format(plantingDate,'%d'))),'%j')
The data provide a range of dates over which planting typically begins. What I've done is take the midpoint of that range, which will be used as the planting date. Now we need to determine the harvest date.
Your script should look something like this:
# extract out typical harvest dates
harvestStartDate <- growingDates$Harvest.Begins[which(growingDates$State == states[istate])]
harvestEndDate <- growingDates$Harvest.Ends[which(growingDates$State == states[istate])]
# create harvest date that is in the middle of usual start/end dates
harvestDate <- as.Date(harvestStartDate)+floor((as.Date(harvestEndDate)-as.Date(harvestStartDate))/2)
# change harvest date to julian day for year of data examining
harvestDateJulian <- format(as.Date(ISOdate(substr(file$BEGIN_YEARMONTH[1],1,4),
format(harvestDate,'%m'),format(harvestDate,'%d'))),'%j')
We have done the same thing as for the planting date: taken the midpoint of the range as the harvest date. We want to extract out the storm events only between these two dates. For this particular file (year), let's create a spatial index (counties) and a time index (growing season).
Your script should look something like this:
### Extract out storm events that occurred in the state's key counties over the growing season
state_index <- which(states[istate] == file$STATE)
county_index <- match(as.character(file$CZ_NAME[state_index]), unlist(strsplit(as.character(counties),',')))
spatial_index <- state_index[which(!is.na(county_index))]
# create time variable for meteorological data (julian days)
stormDateJulian <- format(as.Date(ISOdate(substr(file$BEGIN_YEARMONTH,1,4),substr(file$BEGIN_YEARMONTH,5,6),file$BEGIN_DAY)),'%j')
# find time stamps that fall in the typical growing season for the state
time_index <- which(as.numeric(stormDateJulian) <= as.numeric(harvestDateJulian) & as.numeric(stormDateJulian) >= as.numeric(plantingDateJulian))
Now extract out the intersection of these.
Your script should look something like this:
# check if there is an intersection with the spatial index -> for the year (ifile) in the state (istate),
# these are the indices of events that occurred in the counties typical of growing sweet corn during the growing season
index <- intersect(spatial_index, time_index)
This index represents all the cases in which storms occurred in our key counties for this particular state during the growing season. For each corn production observation, we want a corresponding storm events data observation. To determine the correct index, use this code:
Your script should look something like this:
# determine the index for the corn data
stormYear <- format(as.Date(ISOdate(substr(file$BEGIN_YEARMONTH,1,4),
substr(file$BEGIN_YEARMONTH,5,6),file$BEGIN_DAY)),'%Y')
stormYear <- stormYear[1]
corn_index <- intersect(which(ProcessedSweetCorn$Year == stormYear),
which(ProcessedSweetCorn$State == states[istate]))
The corn index is just the row in which we will store the storm frequency (refer to the initialization of the variables). Now we can save the frequency of each event.
Your script should look something like this:
# count the frequency of each meteorological event for this specific corn observation
tornado[corn_index] <- length(which(file$EVENT_TYPE[index] == "Tornado"))
hail[corn_index] <- length(which(file$EVENT_TYPE[index] == "Hail"))
tsWind[corn_index] <- length(which(file$EVENT_TYPE[index] == "Thunderstorm Wind"))
If the year is 1996 or later, we have more variables to look at.
Your script should look something like this:
# if the year is 1996 or later check for other variables
if(as.numeric(stormYear) >= 1996){
flashFlood[corn_index] <- length(which(file$EVENT_TYPE[index] == "Flash Flood"))
iceStorm[corn_index] <- length(which(file$EVENT_TYPE[index] == "Ice Storm"))
heavySnow[corn_index] <- length(which(file$EVENT_TYPE[index] == "Heavy Snow"))
coldWindChill[corn_index] <- length(which(file$EVENT_TYPE[index] == "Cold/Wind Chill"))
strongWind[corn_index] <- length(which(file$EVENT_TYPE[index] == "Strong Wind"))
tropicalStorm[corn_index] <- length(which(file$EVENT_TYPE[index] == "Tropical Storm"))
winterStorm[corn_index] <- length(which(file$EVENT_TYPE[index] == "Winter Storm"))
flood[corn_index] <- length(which(file$EVENT_TYPE[index] == "Flood"))
winterWeather[corn_index] <- length(which(file$EVENT_TYPE[index] == "Winter Weather"))
blizzard[corn_index] <- length(which(file$EVENT_TYPE[index] == "Blizzard"))
excessiveHeat[corn_index] <- length(which(file$EVENT_TYPE[index] == "Excessive Heat"))
freezingFog[corn_index] <- length(which(file$EVENT_TYPE[index] == "Freezing Fog"))
heat[corn_index] <- length(which(file$EVENT_TYPE[index] == "Heat"))
highWind[corn_index] <- length(which(file$EVENT_TYPE[index] == "High Wind"))
sleet[corn_index] <- length(which(file$EVENT_TYPE[index] == "Sleet"))
drought[corn_index] <- length(which(file$EVENT_TYPE[index] == "Drought"))
extremeColdWindChill[corn_index] <- length(which(file$EVENT_TYPE[index] == "Extreme Cold/Wind Chill"))
frostFreeze[corn_index] <- length(which(file$EVENT_TYPE[index] == "Frost/Freeze"))
heavyRain[corn_index] <- length(which(file$EVENT_TYPE[index] == "Heavy Rain"))
hurricane[corn_index] <- length(which(file$EVENT_TYPE[index] == "Hurricane"))
tropicalDepression[corn_index] <- length(which(file$EVENT_TYPE[index] == "Tropical Depression"))
wildfire[corn_index] <- length(which(file$EVENT_TYPE[index] == "Wildfire"))
} # end if statement for additional variables
Now finish the loop of states and storm files (years):
Your script should look something like this:
  } # end loop of states
} # end loop of storm files
Finally, assign the storm event variables to the corn data frame. Remember, the goal was to create one CSV file that contains both the response variables and the factor variables.
Your script should look something like this:
# assign the storm event frequencies to the processed sweet corn data
ProcessedSweetCorn$tornado <- tornado
ProcessedSweetCorn$hail <- hail
ProcessedSweetCorn$tsWind <- tsWind
ProcessedSweetCorn$flashFlood <- flashFlood
ProcessedSweetCorn$iceStorm <- iceStorm
ProcessedSweetCorn$heavySnow <- heavySnow
ProcessedSweetCorn$coldWindChill <- coldWindChill
ProcessedSweetCorn$strongWind <- strongWind
ProcessedSweetCorn$tropicalStorm <- tropicalStorm
ProcessedSweetCorn$winterStorm <- winterStorm
ProcessedSweetCorn$flood <- flood
ProcessedSweetCorn$winterWeather <- winterWeather
ProcessedSweetCorn$blizzard <- blizzard
ProcessedSweetCorn$excessiveHeat <- excessiveHeat
ProcessedSweetCorn$freezingFog <- freezingFog
ProcessedSweetCorn$heat <- heat
ProcessedSweetCorn$highWind <- highWind
ProcessedSweetCorn$sleet <- sleet
ProcessedSweetCorn$drought <- drought
ProcessedSweetCorn$extremeColdWindChill <- extremeColdWindChill
ProcessedSweetCorn$frostFreeze <- frostFreeze
ProcessedSweetCorn$heavyRain <- heavyRain
ProcessedSweetCorn$hurricane <- hurricane
ProcessedSweetCorn$tropicalDepression <- tropicalDepression
ProcessedSweetCorn$wildfire <- wildfire
# save the data frame as a new file
write.csv(ProcessedSweetCorn, file="FinalProcessedSweetCorn.csv")
We are left with one file. For every observation of corn production for each state/year, we have a corresponding frequency of storm events that is spatially (key counties) and temporally (growing season) relevant. For your convenience, here is the R code from above in one file [18].
Now that the variables have been selected and the data has been extracted, we need to talk about missing values. In data mining, unknown values can be problematic. Choosing the best method for dealing with missing values will depend on the question and the data. There’s no right or wrong way. I’m going to discuss three ways we can deal with missing values (some of which may be a review from Meteo 810).
To show you these three methods, I will use our actual data but to make things simpler I will only look at one state. This will minimize the number of observations. Let’s begin by loading in our final data file.
Your script should look something like this:
# load in the final table of extracted meteorological events and Processed sweet corn data
mydata <- read.csv("FinalProcessedSweetCorn.csv", stringsAsFactors = FALSE)
Let’s extract the data for one state, picking a state that has a lot of observations.
Your script should look something like this:
# start with one state with the most comprehensive amount of data available
states <- unique(mydata$State)
numObs <- {}
for(istate in 1:length(states)){
state_index <- which(mydata$State == states[istate])
numObs[istate] <- length(which(!is.nan(mydata$Production[state_index])))
}
# pick one of the states
state <- states[which(numObs == max(numObs))[3]]
# create a data frame for this state
stateData <- subset(mydata,mydata$State == state)
The state selected should be Minnesota. Before I go into the different methods for filling missing values, let’s start by looking at the data and determining how many missing values we are dealing with. Run the code below to see the number of variables that have missing data.
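A minimal sketch of that check, using apply on the stateData data frame, looks something like this:

# count the number of missing values in each row (each year) of stateData
apply(stateData, 1, function(x) sum(is.na(x)))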
For each row (each year) of data in the stateData data frame, we count the number of variables that have missing data. You will see that the majority of years have 22 missing values, and these are the years prior to 1996. If you change the 1 to a 2, you will get the total number of missing values for each column, that is, the number of missing values for each variable. If you have a threshold you want to meet, say each variable must have at least 80% of its values present (no more than 20% missing), you can use the function 'manyNAs'.
Your script should look something like this:
# determine which variables have more than 20% of their values missing
library(DMwR)
manyNAs(t(stateData), 0.2)
By changing the 0.2 you can adjust the threshold. This is a helpful function in pointing out which variables need to be addressed.
So, how do we deal with the missing values? Well, the first method is to fill the missing values with the most representative value. But how do you decide what the most representative value is? Do you use the mode? Median? Mean? Let's take, for example, the variable flashFlood. If we want to fill in the missing values with the median, we could use the following code.
Your script should look something like this:
# fill in missing values with the median value
stateData$flashFlood[is.na(stateData$flashFlood)] <- median(stateData$flashFlood, na.rm=T)
The median was 1, so now all the observations in that variable that were NAs have been replaced with 1. If we use the function ‘centralImputation’, we can fill in all missing values automatically for all variables in the data frame using this method.
Your script should look something like this:
# fill in missing values for all variables
stateData <- centralImputation(stateData)
Now rerun the code to determine the number of missing values in each row:
Your script should look something like this:
# determine the number of missing values in each row
apply(stateData, 1, function(x) sum(is.na(x)))
# determine the number of missing values in each column
apply(stateData, 2, function(x) sum(is.na(x)))
You should get 0. But is it realistic to say that from 1960-1995 we can replace missing values with the median from 1996-2009? Probably not. So let’s look at another method. If you are following along in R, you need to reset your stateData data frame.
Your script should look something like this:
# create a data frame for this state
stateData <- subset(mydata, mydata$State == state)
The next method we will talk about is based on correlations. Many times meteorological variables (or variables in general) can be correlated with other variables. If you have the data observations for variable A and you know that variable B is correlated to variable A, you can estimate Variable B using the correlation. Start by looking at the correlations among the variables. Run the code below to see a table of correlation coefficients and to synthesize the results.
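One way to build such a table with base R (a sketch; the numericCols helper name is just illustrative) is:

# a sketch: correlation matrix of the numeric columns, displayed symbolically with symnum()
numericCols <- sapply(stateData, is.numeric)
symnum(cor(stateData[, numericCols], use = "complete.obs"))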
This uses symbols for the level of significance of the correlation. You will see that the highest correlation is 0.629 between extremeColdWindChill and hail. (You will also notice a correlation of 1 between highWind and iceStorms, but I will talk about this in a bit.) So how do we use this correlation? We create a linear model between the two variables. Since hail has fewer missing values, I want to use hail to predict extremeColdWindChill. Use this code to create the linear model and fill in the missing values from extreme cold wind chill with predictions from the model.
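A sketch of that idea (the names lmFill, fillCWC, and missingIdx are illustrative):

# fit extremeColdWindChill as a linear function of hail
lmFill <- lm(extremeColdWindChill ~ hail, data = stateData)
# helper that turns a hail value into a predicted extremeColdWindChill value
fillCWC <- function(h) coef(lmFill)[1] + coef(lmFill)[2] * h
# fill only the rows where extremeColdWindChill is missing but hail is known
missingIdx <- which(is.na(stateData$extremeColdWindChill) & !is.na(stateData$hail))
stateData$extremeColdWindChill[missingIdx] <- fillCWC(stateData$hail[missingIdx])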
The code above fills in the missing values from extremeColdWindChill by finding the corresponding hail values and running the function that uses the linear model to create a new value. What are the problems with this method? Well, the primary issue is that we are calculating correlations between 14 observations (1996-2009). That is not nearly enough to create a valid result. Hence, the correlation coefficient of 1 for highWind and iceStorms. If you look at the data, they have 1 instance of each event over the 14 observations. This correlation is probably spurious rather than real.
The third and more realistic method for this example is to simply remove cases. Make sure you reset your data if you are following along in R.
You can do two things: either set a threshold requiring no more than x% of missing values or completely get rid of all missing values. If we wanted a dataset of only complete cases, we would use the following code:
Your script should look something like this:
stateData <- na.omit(stateData)
This code removes any row (year) in which any variable has a missing value. This means we would only retain values from 1996 onward. In this case, I think it would be better to just remove variables that have too many unknowns. I’m going to say any variables with more than 20% of the data missing should be omitted. To do this, use the following code (remember to reset your state data frame):
Your script should look something like this:
clean.stateData <- stateData[,-manyNAs(t(stateData))]
You will be left with the corn variables for all years along with the thunderstorm/wind, tornado, and hail storm event variables. In this case, I decided it was more valuable to have fewer variables with more observations than more variables with limited observations, so as to avoid overfitting when too many variables are used to fit the model. Any of these methods are acceptable; it just depends on the situation and your justification.
After finishing this section, you should be able to visualize and summarize data using numerous methods.
Before diving into the analysis, it is really important that we synthesize the data. Although the purpose of data mining is to find patterns automatically, you still need to understand the data to ensure your results are realistic. Although we have learned several methods of summarizing data (think back to the descriptive statistics lesson), with data mining you will potentially be using massive amounts of data and new techniques may need to be used for efficiency.
In this section, I will present a few new methods for looking at data that will help you synthesize the observations. You don’t always need to apply these methods but having them in your repertoire will be helpful when you come across a result you don't understand.
For simplicity, I will use clean.stateData for the rest of the lesson, showing you the data mining process with this one state’s data that has been ‘cleaned’ (missing values dealt with). Furthermore, since we are only looking at one state, I’m going to only consider the three storm variables that have a longer time period (tornado, hail, and thunderstorm/wind). I want more data points to create a more robust result. At the end, I’ll show you how we could create one process to look at all of the states at once.
Run the code below to summarize the data:
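Something like:

# summarize every variable in the cleaned data frame
summary(clean.stateData)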
This provides the minimum, maximum, median, mean, 1st quartile, and 3rd quartile values for each variable. The next step is to visualize this summary. We can do this by plotting probability density histograms of the data. Let's start with the tsWind variable.
Your script should look something like this:
# plot a probability density histogram of variable tsWind
hist(clean.stateData$tsWind, prob=T)
You would get the following figure:

Visually, the greatest frequency occurs for annual event counts between 0 and 5, which means the frequency of this event (tsWind) is relatively low, although there are instances of the frequency exceeding 25 events for one year. How do we enrich this diagram? Run the code below to plot the PDF overlaid with the actual data, as well as a Q-Q plot.
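A base-R sketch that produces a comparable pair of plots:

# histogram with a normal-density overlay and a rug of the data, next to a normal Q-Q plot
par(mfrow = c(1, 2))
hist(clean.stateData$tsWind, prob = TRUE, main = "Thunderstorm/Wind", xlab = "events per year")
curve(dnorm(x, mean = mean(clean.stateData$tsWind, na.rm = TRUE),
            sd = sd(clean.stateData$tsWind, na.rm = TRUE)), add = TRUE)
rug(jitter(clean.stateData$tsWind))
qqnorm(clean.stateData$tsWind)
qqline(clean.stateData$tsWind)
par(mfrow = c(1, 1))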
The left plot is the probability density histogram, but we’ve overlaid an estimate of the Gaussian distribution (line) and marked the data points on the x-axis. The plot on the right is the Q-Q plot which we’ve seen before; it describes how well the data fits the normal distribution.
But what if the data doesn’t appear normal? We can instead visualize the data with no underlying assumption of the fit. We do this by creating a boxplot.
Your script should look something like this:
# plot boxplot
boxplot(clean.stateData$tsWind, ylab="Thunderstorm/Wind")
rug(jitter(clean.stateData$tsWind), side=2)
abline(h=mean(clean.stateData$tsWind, na.rm=T), lty=2)
We have done boxplots before, so I won't go into much detail. But let's talk about the additional functions I used. The rug function (with jitter) plots the data as points along the y-axis (side=2). The abline function plots the mean value as a horizontal dashed line, allowing you to compare the mean to the median. The boxplot is also a great tool for looking at outliers. Although we have 'cleaned' the dataset of missing values, we may still have outliers, which are represented by the circle points.
If we want to dig into outliers a bit more, we can plot the data and create thresholds of what we would consider ‘extreme’. Run the code below to create a plot with thresholds:
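A sketch of such a threshold plot (line types chosen to match the description below):

# plot the observations with mean (solid), median (dotted), and mean + 1 sd (dashed) lines
plot(clean.stateData$tsWind, ylab = "Thunderstorm/Wind events", xlab = "observation")
abline(h = mean(clean.stateData$tsWind, na.rm = TRUE), lty = 1)
abline(h = median(clean.stateData$tsWind, na.rm = TRUE), lty = 3)
abline(h = mean(clean.stateData$tsWind, na.rm = TRUE) + sd(clean.stateData$tsWind, na.rm = TRUE), lty = 2)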
This code creates a horizontal line for the mean (solid line), median (dotted line), and mean+1 standard deviation (dashed line). It creates boundaries of realistic values and highlights those that may be extreme. But remember, you create those thresholds, so you are deciding what is extreme. By adding one more line of code, we can actually click on the outliers to see what their values are:
identify(clean.stateData$tsWind)
This function allows you to interactively click on all possible outliers you want to examine (hit ‘esc’ when finished). The numbers displayed represent the index of the outliers you clicked. If you want to find out more details, then use this code to display the data values for those specific indices:
Your script should look something like this:
# output outlier values
clicked.lines <- identify(clean.stateData$tsWind)
clean.stateData[clicked.lines,]
If, instead, you don’t want to use the identify function but have some threshold in mind, you can look at the indices by using the following code:
Your script should look something like this:
# look at indices with tsWind frequency greater than 20
clean.stateData[clean.stateData$tsWind > 20,]
The last type of figure we will discuss is the conditional plot. Conditional plots allow us to visualize possible relationships between variables. Let's start by looking at acreage planted and how it compares to the yield per acre. First, create levels of the acreage planted (low, moderate, high).
Your script should look something like this:
# determine levels of acreage planted
clean.stateData$levelAcreagePlanted <- array(data="Moderate", dim=length(clean.stateData$Year))
highBound <- quantile(as.numeric(clean.stateData$Acreage.Planted), 0.75, na.rm=TRUE)
lowBound <- quantile(as.numeric(clean.stateData$Acreage.Planted), 0.25, na.rm=TRUE)
clean.stateData$levelAcreagePlanted[which(as.numeric(clean.stateData$Acreage.Planted) > highBound)] <- "High"
clean.stateData$levelAcreagePlanted[which(as.numeric(clean.stateData$Acreage.Planted) < lowBound)] <- "Low"
Run the code below that uses the function bwplot from lattice to create a conditional boxplot.
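A sketch of the bwplot call, assuming the yield column is named Yield.Acre as in the model code later on:

# conditional boxplot of yield per acre for each acreage-planted level
library(lattice)
bwplot(factor(levelAcreagePlanted) ~ as.numeric(Yield.Acre), data = clean.stateData,
       ylab = "Acreage Planted level", xlab = "Yield per Acre")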
For each level of acreage planted (y-axis), the corresponding corn yield data (x-axis) is displayed as a boxplot. As one might expect, if the acreage planted is low the yield is low (boxplot shifted to the left), while if the acreage planted is high we have a higher corn yield (boxplot shifted to the right). This type of plot is a good first step in exploring relationships between response variables and factors.
We can enrich this figure by looking at the density of each boxplot. Run the code below.
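One way to enrich it is with the panel.bpplot panel from the Hmisc package (a sketch):

# box-percentile style panels with the raw data marked along each box
library(Hmisc)
library(lattice)
bwplot(factor(levelAcreagePlanted) ~ as.numeric(Yield.Acre), data = clean.stateData,
       panel = panel.bpplot, probs = seq(0.01, 0.49, by = 0.01), datadensity = TRUE,
       ylab = "Acreage Planted level", xlab = "Yield per Acre")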
Below is an enlarged version of the figure
Credit: J. Roman
The vertical lines represent the 1st quartile, the median, and the 3rd quartile. The small dashes are the actual values of the data (these are pretty light on this figure and difficult to see).
The final conditional plot we will talk about uses the function stripplot. In this plot, we are going to add the ts_wind variable to our conditional plot about acreage planted and yield. It adds another layer of synthesis. To start, we chunk up the ts_wind data into three groups.
Your script should look something like this:
ts_wind <- equal.count(na.omit(clean.stateData$tsWind),number=3,overlap=1/5)
The function ‘equal.count’ essentially does what we did for the acreage planted. We created three categories for the frequency of thunderstorm/wind events. Now use the function stripplot to plot the acreage planted, conditioned by yield and ts_wind.
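A sketch of the stripplot call, assuming clean.stateData$tsWind has no missing values after cleaning so the ts_wind shingle lines up with the rows:

# panels conditioned on the ts_wind shingle created above
library(lattice)
stripplot(factor(levelAcreagePlanted) ~ as.numeric(Yield.Acre) | ts_wind,
          data = clean.stateData, jitter.data = TRUE,
          xlab = "Yield per Acre", ylab = "Acreage Planted level")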
The graph may appear complex, but it really isn't, so let's break it down. There are three panels: one for high ts_wind (upper left), one for low (bottom left), and one for moderate (bottom right), as indicated by the orange bar. The x-axis of each panel is the yield and the y-axis is the acreage planted. For each panel, the yield/acre data corresponding to that level of ts_wind is extracted and then chunked based on the corresponding acreage. This plot can provide information for questions such as: Does a high frequency of thunderstorm wind, along with a high acreage planted, result in a higher yield than, say, a low thunderstorm wind frequency? This type of graph may or may not be fruitful. It can also be used as a follow-up figure to an analysis (once you know which factors are important to a response variable).
By the end of this section, you should be able to perform multiple linear regressions on a dataset and optimize the resulting model.
There are many techniques for pattern analysis, but for this course we are going to talk about two that utilize what we already have learned. They are multiple linear regressions and regression trees. In other courses within this program, we may come back to data mining and explore more advanced techniques. But for now, let’s focus on expanding what we already know.
We discussed linear regression in a previous lesson, so you are already familiar with this method. A multiple linear regression uses several factor variables at once to predict a response; we will then refit the regression repeatedly, dropping variables, to home in on an optimal model. Before we get into how we decide what's optimal, we need an initial model to start with. Let's begin by creating a linear model for each of our response variables (yield/acre, production, season average price, value of production, and acreage harvested) using all the factor variables.
Your script should look something like this:
# create a linear model for each response variable
data <- clean.stateData[,c(4,6,10:12)]
lm1.YieldAcre <- lm(Yield.Acre~., lapply(data, as.numeric))
lm1.Production <- lm(Production~., lapply(clean.stateData[,c(4,7,10:12)], as.numeric))
lm1.SeasonAveragePrice <- lm(Season.Average.Price~., lapply(clean.stateData[,c(4,8,10:12)], as.numeric))
lm1.ValueofProduction <- lm(Value.of.Production~., lapply(clean.stateData[,c(4,9:12)], as.numeric))
lm1.AcreageHarvested <- lm(Acreage.Harvested~., lapply(clean.stateData[,c(4:5,10:12)], as.numeric))
For each response variable, we have a linear equation that takes as inputs our four factor variables: acreage planted, tornado, hail, and thunderstorm wind frequency. Run the code below to summarize each model.
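For example:

# inspect each initial model
summary(lm1.YieldAcre)
summary(lm1.Production)
summary(lm1.SeasonAveragePrice)
summary(lm1.ValueofProduction)
summary(lm1.AcreageHarvested)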
The first thing to look at is the P-value at the bottom, which tells us whether we should reject the null hypothesis. In this case, the null hypothesis is that there is no dependence; that is, the response variable does not depend on any of the factor variables (tornado, hail, thunderstorm/wind, acreage planted). If this P-value is greater than 0.05, there is not enough evidence to support a linear model with these variables, which means we should not continue on! If the P-value is less than 0.05 (remember, this corresponds to 95% confidence; you can change this depending on what confidence level you want to test), then we can reject the null hypothesis and continue examining this model. All P-values are less than 0.05 for this example, so we can continue on.
The next step is to look at the R-squared and adjusted R-squared values. Remember, R-squared tells us how much variability is captured by the model - it's a metric of fit. The closer to 1, the better the fit. If the R-squared value is really low (< 0.5), the linear model is probably not the best method for modelling the data at hand and should no longer be pursued. If it's greater than 0.5, we can continue on. The adjusted R-squared value is a more robust metric (it is usually lower than the plain R-squared). One thing to note about the adjusted R-squared: if it is substantially smaller than the R-squared, you do not have enough training cases for the number of factor variables you are using. This will be important to look at later on. For now, all the R-squared values are greater than 0.5; we can move on.
The next step is to look at the individual coefficients for the factor variables. We want to know whether the coefficients are statistically significant. We do this by testing whether the null hypothesis, that the coefficient is equal to 0, can be rejected. The probability is shown in the last column, Pr(>|t|). If a coefficient is statistically significant, it will be marked by a symbol. As long as you have at least one coefficient that is significant, you should keep going. In this example, each model has at least one coefficient that's statistically significant.
At this point, we have an initial model (for each response variable) that is statistically significant. Now we want to begin optimizing the model. Optimizing a model is a balance between creating a simple and efficient model (few factor variables) and minimizing errors from the fit.
One method for optimization is called backwards elimination. We start with an initial model containing all the factor variables, determine which coefficients are least significant, and remove the corresponding factor variables. We then update the model using just the remaining factor variables. There are several metrics for deciding which factor variable to reject next. You could look at which coefficients in the summary output fail the significance test, suggesting they are not statistically different from 0. Another option is to look at how well the model predicts the response variable. We can do this by using ANOVA. If you remember from our previous lesson, we used ANOVA as sort of a multiple t- or Z-test. It told us whether the response variable is affected by a factor variable with different levels. If we rejected the null hypothesis, then the evidence suggested the factor variable affected the response variable. Since ANOVA uses a GLM to estimate the p-value, we can also use it to test the fit of the model. Run the code below to give this a try:
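A sketch of that step, applying anova() to each initial model:

# sequential ANOVA table for each initial model
anova(lm1.YieldAcre)
anova(lm1.Production)
anova(lm1.SeasonAveragePrice)
anova(lm1.ValueofProduction)
anova(lm1.AcreageHarvested)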
For ANOVA, the null hypothesis is that each coefficient does not reduce the error. You want to look at the values of Pr(>F) which are again marked by symbols if they are statistically significant. If a coefficient is statistically significant, then that factor variable statistically reduces the error created by the fit. If a coefficient is not marked, it would be optimal to remove that variable from our model.
For this example, I would get rid of tsWind for the response variables yield/acre, production, season average price, and value of production, and I would get rid of tornado for the response variable acreage harvested. Let's update our models by removing these variables. Fill in the missing code for the 'AcreageHarvested' linear model. Make sure you remove the variable 'tornado'.
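One way to perform the update, using R's update() function (the lm2.* names are illustrative):

# drop the least useful factor variable from each model
lm2.YieldAcre <- update(lm1.YieldAcre, . ~ . - tsWind)
lm2.Production <- update(lm1.Production, . ~ . - tsWind)
lm2.SeasonAveragePrice <- update(lm1.SeasonAveragePrice, . ~ . - tsWind)
lm2.ValueofProduction <- update(lm1.ValueofProduction, . ~ . - tsWind)
lm2.AcreageHarvested <- update(lm1.AcreageHarvested, . ~ . - tornado)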
Now we repeat the process, by summarizing our new models.
The P-value should still be less than 0.05 and the adjusted R-squared should have increased. If you remove a factor variable from a regression, the R-squared will decrease (or at best stay the same), but the adjusted R-squared may increase. The goal here is to pare down the factor variables to drive up the adjusted R-squared, even though the loss of each factor variable will cost you a bit of R-squared. If the adjusted R-squared didn't increase, then the update wasn't successful. Let's compare the two models (before and after). Add in the code for the Acreage Harvested models.
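A sketch of the comparison, assuming the lm2.* names from the update step above:

# compare each original model with its updated counterpart
anova(lm1.YieldAcre, lm2.YieldAcre)
anova(lm1.Production, lm2.Production)
anova(lm1.SeasonAveragePrice, lm2.SeasonAveragePrice)
anova(lm1.ValueofProduction, lm2.ValueofProduction)
anova(lm1.AcreageHarvested, lm2.AcreageHarvested)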
This form of the ANOVA test tells us how significant the update was. If the change were statistically significant, the value Pr(>F) would be less than 0.05 and would have a symbol by it. None of our models changed in a statistically significant way. However, we have still optimized the models by reducing the number of factor variables and increasing the adjusted R-squared. Now we can repeat this process on the new model and remove another variable, further refining the model.
If you have a lot of factor variables, you would not want to do this routine manually. There are functions in R that allow you to set this routine up automatically. One such function is called ‘step’. By default, ‘step’ is a backward elimination process but instead of using the ANOVA fit as a metric, it uses the AIC score. As a reminder, the AIC score is the Akaike Information Criterion, another metric describing the quality of the model relative to other models. AIC focuses on the information lost when a given model is used, whereas R-squared focuses on the prediction power. ‘Step’ will compare the before and after AIC scores to optimize the model. You still need to make the initial linear model, but then you can run ‘step’ to automatically optimize. Fill in the missing code for the Value of Production final linear model.
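A sketch of the automated step; the lmFinal.* names match the prediction code near the end of the lesson:

# backward elimination by AIC, starting from each initial model
lmFinal.YieldAcre <- step(lm1.YieldAcre)
lmFinal.Production <- step(lm1.Production)
lmFinal.SeasonAveragePrice <- step(lm1.SeasonAveragePrice)
lmFinal.ValueofProduction <- step(lm1.ValueofProduction)
lmFinal.AcreageHarvested <- step(lm1.AcreageHarvested)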
We can then summarize the final models by running the code below:
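For example:

# summarize the optimized models
summary(lmFinal.YieldAcre)
summary(lmFinal.Production)
summary(lmFinal.SeasonAveragePrice)
summary(lmFinal.ValueofProduction)
summary(lmFinal.AcreageHarvested)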
You are left with the optimized model. Please note that there is a random component, so the results will not always be the same. For the example I ran, the models with high adjusted R-squared values are the models for variables acreage harvested (0.885) and production (0.7689). This means that 88.5% of the variability of acreage harvested can be explained by the acreage planted and the frequency of hail events. 76.9% of the variability of production can be explained by the combination of acreage planted, hail, and ts_wind events. We will discuss evaluating these models in more detail later on. For now, make sure you understand what the ‘step’ function does and how we perform a data mining process using a multiple linear regression approach.
By the end of this section, you should be able to perform a regression tree analysis on a dataset and be able to optimize the model.
Sometimes we can overfit our model with too many parameters for the number of training cases we have. An alternative approach to multiple linear regression is to create regression trees. Regression trees are sort of like diagrams of thresholds. Was this observation above or below the threshold? If above, move to this branch; if below, move to the other branch. It's kind of like a roadmap for playing the game "20 questions" with your data to get a numerical answer.
Similar to the multiple linear regression analysis, we need to start with an initial tree. To create a regression tree for each response variable, we will use the function 'rpart'. When using this function, there are three ways a tree could stop growing.
Use the code below to create a regression tree for each variable. Note the indexing to make sure we create a tree with respect to other variables in the data set. I included one example (AcreageHarvested) of the summary results of the regression model.
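A sketch that mirrors the lapply/column-index approach used for the linear models (the rt1.* names match the plotting code below):

# build an initial regression tree for each response variable
library(rpart)
rt1.YieldAcre <- rpart(Yield.Acre~., lapply(clean.stateData[,c(4,6,10:12)], as.numeric))
rt1.Production <- rpart(Production~., lapply(clean.stateData[,c(4,7,10:12)], as.numeric))
rt1.SeasonAveragePrice <- rpart(Season.Average.Price~., lapply(clean.stateData[,c(4,8,10:12)], as.numeric))
rt1.ValueofProduction <- rpart(Value.of.Production~., lapply(clean.stateData[,c(4,9:12)], as.numeric))
rt1.AcreageHarvested <- rpart(Acreage.Harvested~., lapply(clean.stateData[,c(4:5,10:12)], as.numeric))
# example summary mentioned in the text
summary(rt1.AcreageHarvested)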
The result is an initial threshold model for the data to follow. Use the code below to visualize the tree:
Your script should look something like this:
# plot regression trees
prettyTree(rt1.YieldAcre)
prettyTree(rt1.Production)
prettyTree(rt1.SeasonAveragePrice)
prettyTree(rt1.ValueofProduction)
prettyTree(rt1.AcreageHarvested)
The boxes represent end points (cutely called the leaves of the tree). The circles represent a split. The tree depth is 5 and the 'n' in each split represents the number of samples. You can see that the 'n' in the final boxes is less than 20, less than the minimum requirement for another split. Here is the tree for the response variable yield/acre.
You can think of the tree as a conditional plot. In this example, the first threshold is acreage planted. If acreage planted is greater than 12,730 we then consider hail events. If hail is greater than 4.5 we expect a yield of 6.53 per acre.
Regression trees are easy to use in that you just follow the map. However, one thing you might notice is that the predictions are set values; in this case there are 5 possible predictions. You don't capture variability as well as you would with a linear model.
Similar to the multiple linear regression analysis, we want to optimize our model. We want a final tree that minimizes the error of the predictions but also minimizes the size of the tree; large trees tend to generalize poorly when forecasting because they fit the noise in the training data. We do this by 'pruning': deciding which splits provide little value and removing them. Again, there are a variety of metrics that can be used for optimizing.
One option would be to look at the number of samples (the n= value in each leaf). If you have a leaf with a small sample size (fewer than 5 or 10, say, depending on the number of samples you actually have), you might decide that there aren't enough values to really confirm whether that split is optimal. Another option is to obtain the subtrees and calculate the error you get for each subtree. Let's try this out; run the code below to use the command 'printcp'.
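For example, using the trees built above:

# display the cp table of subtrees and their cross-validated errors
printcp(rt1.YieldAcre)
printcp(rt1.Production)
printcp(rt1.SeasonAveragePrice)
printcp(rt1.ValueofProduction)
printcp(rt1.AcreageHarvested)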
What is displayed here? Our whole tree has 4 'subtrees', so the results are broken into 5 parts representing each subtree and the whole tree. For each subtree (and the whole tree), we have several values we can look at. The first relevant value is the relative error (rel error). This is the error compared to the root node (the starting point of your tree), computed on the training data. The cross-validated error (xerror), in contrast, is calculated using a ten-fold cross-validation process. This process estimates the error multiple times by breaking the data into chunks. One chunk is the training data, which is used to create the model, and the second chunk is the test data, which is used to predict the values and calculate the error. For each iteration, it breaks the data up into new chunks. For 10-fold cross-validation, the testing chunk always contains 1/10 of the cases and the training chunk 9/10. This lets R calculate the error on independent data in just 10 passes while still using almost all of the data to train each time. You can visualize this process below.
For more information on cross-validation, you can read the cross-validation wiki page [20]. The xerror is an estimate of the average error from the 10 iterations performed in the cross-validation process.
Let’s look at the results from our yield/acre tree.

(Please note that your actual values may be different because the cross-validation process is randomized). The average error (xerror) is lowest at node 3 or split 2. You would prune your tree at split 2 (remove everything after the second split).
An alternative and more common optimization method is the One Standard Error (1 SE) rule. The 1 SE rule uses the xerror and xstd (the standard error of xerror). We first pick the subtree with the smallest xerror. In this case, that is the third one (3). We then calculate the 1 SE threshold:
xerror+1*xstd
In this case, it would be 0.93446+0.19904=1.1335. We compare this value to the relative error column (rel error). The rel error represents the fraction of wrong predictions in each split compared to the number of wrong predictions in the root. For this example, the first split results in a rel error of 0.587, meaning the error in this node is 58.7% of that in the root node (we have improved the error by 1-0.587, or about 41%). However, the rel error is computed using the training data and thus decreases as the tree grows and adjusts to the data. We should not use it as an estimate of the real error; instead we minimize the xerror and xstd, which use the testing data. We still want to minimize the rel error, so we select the first split that has a rel error less than 1.1335 (the 1 SE threshold). In this case, that is tree 1. To read how xerror, rel error, and xstd are calculated in this function, please see the documentation [21].
If we want to prune our tree using this metric, we would prune the tree to subtree 1. To do this in R, we will use the 'prune' function and insert the corresponding cp value to that tree, in this case 0.4. Run the code below to test this on the YieldAcre variable.
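A sketch of the pruning step (rt2.YieldAcre is an illustrative name):

# prune the yield/acre tree back using the cp value of the chosen subtree
rt2.YieldAcre <- prune(rt1.YieldAcre, cp = 0.4)
prettyTree(rt2.YieldAcre)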
You will end up with the first tree, shown visually:
This may seem a bit confusing, especially the metric portion. All you need to know is that, when using the 1 SE rule, we calculate a threshold (xerror + 1*xstd) and find the first subtree with a relative error less than this value. Remember, the key here is that we don't want the most extensive tree; we want one that predicts relatively well but doesn't overfit.
The good thing is that we can automate this process. So if you don't understand the SE metric well, don't fret. Just know that it's one way of balancing the error and the size of the tree. To automate, we will use the function rpartXse.
Your script should look something like this:
# automate process
(rtFinal.YieldAcre <- rpartXse(Yield.Acre~.,lapply(stateData[,c(4,6,10:12)],as.numeric)))
(rtFinal.Production <- rpartXse(Production~.,lapply(stateData[,c(4,7,10:12)],as.numeric)))
(rtFinal.SeasonAveragePrice <- rpartXse(Season.Average.Price~.,lapply(stateData[,c(4,8,10:12)],as.numeric)))
(rtFinal.ValueofProduction <- rpartXse(Value.of.Production~.,lapply(stateData[,c(4,9:12)],as.numeric)))
(rtFinal.AcreageHarvested <- rpartXse(Acreage.Harvested~.,lapply(stateData[,c(4:5,10:12)],as.numeric)))
Again, you need to remember that due to the ten-fold cross-validation, the error estimates will be different if you rerun your code.
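If you want your own runs to be repeatable, one option (not required for this lesson) is to fix the random number generator before fitting, for example:
# fixing the seed makes the internal cross-validation reproducible
set.seed(1234)   # the seed value itself is arbitrary
(rtFinal.YieldAcre <- rpartXse(Yield.Acre~.,lapply(stateData[,c(4,6,10:12)],as.numeric)))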
At the end of this section you should be able to employ tools to choose the best model for your data.
By now you should know how to create multiple linear regression models as well as regression tree models for any dataset. The next step is to select the best model. Does a regression tree better predict the response variable than a multiple linear regression model? And which response variable can we best predict? Yield/acre, production, acreage harvested? In this section, we will discuss how to evaluate and select the best model.
Now that we have our models using the multiple linear regression and regression tree approach, we need to use the models to predict. This is the best way to determine how good they actually are. There are multiple ways to go about this.
The first is to use the same data that you trained your model with (when I say train, I mean the data used to create the model). Another option is to break your data up into chunks and use one part for training and the other part for testing; for the best results with this option, use 10-fold cross-validation so that most of the data is used for training in each pass and all of it eventually gets used for testing. The last option is to keep two completely separate datasets: one for training and one for testing. Because the model then sees less data during training, we will use this option only as a secondary check later in the lesson.
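As a rough sketch of the second option, a single random split of the cleaned state data might look like this (the 70/30 proportion is just an illustrative choice):
# split the data into one training chunk and one testing chunk
set.seed(42)                                                       # repeatable split
trainIndex <- sample(1:nrow(clean.stateData), round(0.7*nrow(clean.stateData)))
trainData  <- clean.stateData[trainIndex,]                         # used to fit the model
testData   <- clean.stateData[-trainIndex,]                        # held back for evaluation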
Let’s start by predicting our corn production response variables using our multiple linear regression models that were optimized. To do this, I will use the function ‘predict’. Use the code below to predict response variables.
Your script should look something like this:
# create predictions for multiple linear models using the test data (only those with actual lm model results)
lmPredictions.YieldAcre <- predict(lmFinal.YieldAcre,lapply(clean.stateData[,4:12],as.numeric))
lmPredictions.YieldAcre[which(lmPredictions.YieldAcre < 0)] <- 0
lmPredictions.Production <- predict(lmFinal.Production,lapply(clean.stateData[,4:12],as.numeric))
lmPredictions.Production[which(lmPredictions.Production < 0)] <- 0
lmPredictions.SeasonAveragePrice <- predict(lmFinal.SeasonAveragePrice,lapply(clean.stateData[,4:12],as.numeric))
lmPredictions.SeasonAveragePrice[which(lmPredictions.SeasonAveragePrice < 0)] <- 0
lmPredictions.ValueofProduction <- predict(lmFinal.ValueofProduction,lapply(clean.stateData[,4:12],as.numeric))
lmPredictions.ValueofProduction[which(lmPredictions.ValueofProduction < 0)] <- 0
lmPredictions.AcreageHarvested <- predict(lmFinal.AcreageHarvested,lapply(clean.stateData[,4:12],as.numeric))
lmPredictions.AcreageHarvested[which(lmPredictions.AcreageHarvested < 0)] <- 0
Because we may have predictions less than 0, which are not plausible for this example, I've changed those cases to 0 (the nearest plausible limit). This is a caveat with linear regression: the models are lines, so their output can theoretically extend to plus and minus infinity, well beyond the range of physical plausibility. This means you have to make sure your predictions are realistic. Next, let's predict the corn production response variables using the optimized regression trees.
Your script should look something like this:
# create predictions for regression tree results
rtPredictions.YieldAcre <- predict(rtFinal.YieldAcre,lapply(clean.stateData[,4:12],as.numeric))
rtPredictions.Production <- predict(rtFinal.Production,lapply(clean.stateData[,4:12],as.numeric))
rtPredictions.SeasonAveragePrice <- predict(rtFinal.SeasonAveragePrice,lapply(clean.stateData[,4:12],as.numeric))
rtPredictions.ValueofProduction <- predict(rtFinal.ValueofProduction,lapply(clean.stateData[,4:12],as.numeric))
rtPredictions.AcreageHarvested <- predict(rtFinal.AcreageHarvested,lapply(clean.stateData[,4:12],as.numeric))
We don't have to worry about non-plausible results here because a regression tree is a 'flow' chart, not an equation: every prediction is one of the leaf values, which always lie within the range of the training data. Now we have predictions for 50 observations and can evaluate the predictions.
To evaluate the models, we are going to compare the predictions to the actual observations. The function regr.eval compares the two and calculates several measures of error that we have used before, including the mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE). Let's evaluate the linear models first.
Your script should look something like this:
# use regression evaluation function to calculate several measures of error
lmEval.YieldAcre <- regr.eval(clean.stateData[,"Yield.Acre"],lmPredictions.YieldAcre,train.y=clean.stateData[,"Yield.Acre"])
lmEval.Production <- regr.eval(clean.stateData[,"Production"],lmPredictions.Production,train.y=clean.stateData[,"Production"])
lmEval.SeasonAveragePrice <- regr.eval(clean.stateData[,"Season.Average.Price"],lmPredictions.SeasonAveragePrice,train.y=clean.stateData[,"Season.Average.Price"])
lmEval.ValueofProduction <- regr.eval(subset(clean.stateData[,"Value.of.Production"],!is.na(clean.stateData[,"Value.of.Production"])),subset(lmPredictions.ValueofProduction,!is.na(clean.stateData[,"Value.of.Production"])),train.y=subset(clean.stateData[,"Value.of.Production"],!is.na(clean.stateData[,"Value.of.Production"])))
lmEval.AcreageHarvested <- regr.eval(clean.stateData[,"Acreage.Harvested"],lmPredictions.AcreageHarvested,train.y=clean.stateData[,"Acreage.Harvested"])
There are 6 metrics, of which you should already know 3: MAE, MSE, and RMSE. MAPE is the mean absolute percent error, the average prediction error expressed as a percentage of the observed value. For the model of yield/acre, the MAPE is 0.11: the predictions were off by 11% on average. MAPE can be biased, particularly when the observed values are close to zero. NMSE is the normalized mean squared error. It is a unitless error typically ranging between 0 and 1. A value closer to 0 means your model is doing well, while values closer to 1 mean that your model is performing little better than simply predicting the average of your observations. This is a good metric for selecting which response variable is best predicted. In the yield/acre model, the NMSE is 0.397 while in acreage harvested it is 0.11, suggesting that acreage harvested is better predicted than yield/acre. The last metric is NMAE, or normalized mean absolute error, which also typically ranges from 0 to 1. For the yield/acre model, the NMAE is 0.59 while in acreage harvested it is 0.32.
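To make the NMSE concrete, here is how you could compute it by hand for the yield/acre linear model (using the prediction objects created above); the result should be close to what regr.eval reports:
# NMSE by hand: model mean squared error divided by the
# mean squared error of simply predicting the average observation
obs  <- as.numeric(clean.stateData[,"Yield.Acre"])
nmse <- mean((lmPredictions.YieldAcre - obs)^2) / mean((mean(obs) - obs)^2)
nmse   # 0 = perfect predictions, near 1 = no better than the mean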
Now let’s do the same thing for the regression tree results.
Your script should look something like this:
rtEval.YieldAcre <- regr.eval(clean.stateData[,"Yield.Acre"],rtPredictions.YieldAcre,train.y=clean.stateData[,"Yield.Acre"])
rtEval.Production <- regr.eval(clean.stateData[,"Production"],rtPredictions.Production,train.y=clean.stateData[,"Production"])
rtEval.SeasonAveragePrice <- regr.eval(clean.stateData[,"Season.Average.Price"],rtPredictions.SeasonAveragePrice,train.y=clean.stateData[,"Season.Average.Price"])
rtEval.ValueofProduction <- regr.eval(subset(clean.stateData[,"Value.of.Production"],!is.na(clean.stateData[,"Value.of.Production"])),subset(rtPredictions.ValueofProduction,!is.na(clean.stateData[,"Value.of.Production"])),train.y=subset(clean.stateData[,"Value.of.Production"],!is.na(clean.stateData[,"Value.of.Production"])))
rtEval.AcreageHarvested <- regr.eval(clean.stateData[,"Acreage.Harvested"],rtPredictions.AcreageHarvested,train.y=clean.stateData[,"Acreage.Harvested"])
For the regression tree, the smallest NMSE is with acreage harvested (0.13) similar to the linear model.
Looking at the numbers can become quite tedious, so visual inspection can be used. The first visual is to plot the observed values against the predicted values. Here's an example for the 'production' response variable.
Your script should look something like this:
# visually inspect data
old.par <- par(mfrow=c(1,2))
plot(lmPredictions.Production,clean.stateData[,"Production"],main="Linear Model",
xlab="Predictions",ylab="Observations")
abline(0,1,lty=2)
plot(rtPredictions.Production,clean.stateData[,"Production"],main="Regression Tree",
xlab="Predictions",ylab="Observations")
abline(0,1,lty=2)
par(old.par)
You should get a figure similar to this:
The x-axes for both panels are the predicted values; the y-axes are the observations. The left panel is using the linear model and the right is using the regression tree. The dashed line is a 1-1 line. The linear model does really well, visually, while the regression tree has ‘stacked’ points. This is due to the nature of the ‘flow’ type modeling. There are only a few possible results. In this particular case, there are 4 values as you can see from the tree chart:
For this particular problem, the tree does not capture the variability as well as the linear model. But if the relationship had a discontinuity or were strongly nonlinear, the tree could have come out ahead.
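You can confirm this numerically with a quick tabulation of the tree's predictions (the object name comes from the prediction code above):
# count how many times each distinct leaf value is predicted
table(round(rtPredictions.Production,2))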
Let’s say you notice some poorly predicted values that you want to investigate. You can create an interactive diagram that lets you click on poor predictions to pull out the data. This allows you to look at the observations and predictions, possibly identifying the cause of the problem.
Your script should look something like this:
# interactively click on poor predictions
plot(lmPredictions.Production,clean.stateData[,"Production"],main="Linear Model",
     xlab="Predictions",ylab="Observations")
abline(0,1,lty=2)
clean.stateData[identify(lmPredictions.Production,clean.stateData[,"Production"]),]
Doing this for every single model, especially when you have several response variables, can be tedious. So how can we compare all the model types efficiently?
When it comes to data mining, the goal is to be efficient. When trying to choose between types of models as well as which response variable we can best predict, automating the process is imperative. To begin, we are going to set up functions that automate the processes we just discussed: create a model, predict the model, and calculate a metric for comparison. The code below creates two functions: one using the multiple linear regression approach and one with the regression tree approach.
# create function to train test and evaluate models through cross-validation
cvRT <- function(form,train,test,...){
modelRT<-rpartXse(form,train,...)
predRT <- predict(modelRT,test)
predRT <- ifelse(predRT <0,0,predRT)
mse <- mean((predRT-resp(form,test))^2)
c(nmse=mse/mean((mean(resp(form,train))-resp(form,test))^2))
}
cvLM <- function(form,train,test,...){
modelLM <- lm(form,train,...)
predLM <- predict(modelLM,test)
predLM <- ifelse(predLM <0,0,predLM)
mse <- mean((predLM-resp(form,test))^2)
c(nmse=mse/mean((mean(resp(form,train))-resp(form,test))^2))
}
Let’s break the first function (cvRT) down.
You supply the function the form of the model you want, response variables~factor variables. You then supply the training data and the testing data. When this function is called, it uses the same function we used before ‘rpartXse’. Once the optimal regression tree is created, the function predicts the values using the optimized model (modelRT) and the test data. We then replace any nonsensical values (less than 0) with 0. Then we calculate NMSE and save the value. This is the value we will use for model comparison.
The second function (cvLM) is exactly the same, but instead of using the regression tree approach we use the multiple linear regression approach.
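Before automating everything, you can sanity-check these functions by calling one of them on a single hand-made split; a rough sketch for acreage harvested, reusing the column indices from the earlier code:
# one manual train/test split, just to see what the function returns
set.seed(7)
idx   <- sample(1:nrow(clean.stateData), round(0.7*nrow(clean.stateData)))
train <- as.data.frame(lapply(clean.stateData[idx,c(4:5,10:12)],as.numeric))
test  <- as.data.frame(lapply(clean.stateData[-idx,c(4:5,10:12)],as.numeric))
cvLM(Acreage.Harvested~., train, test)   # returns the nmse for this one split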
Once the optimal models for each response variable are selected (one for each approach), we want to select the approach that does a better job. To do this, we are going to use the function 'experimentalComparison'. This function implements a cross-validation method. With this function, we need to supply the datasets (form, train, test) and the parameter called 'system'. These are the types of models you want to run (cvLM and cvRT). Here is where you would supply any other parameters you may want to use; for example, in the function rpartXse we can assign thresholds like the number of splits or the cp. The last parameter is cvSettings. These are the settings for the cross-validation experiment. It is in the form of X, K, Z, where X is the number of times to perform the cross-validation process, K is the number of folds (the number of equally sized subsamples), and Z is the seed for the random number generator. Let's give this a try for the acreage harvested response variable.
Your script should look something like this:
# perform comparison
results <- experimentalComparison(
c(dataset(Acreage.Harvested~.,lapply(clean.stateData[,c(4,5,10:12)],as.numeric),'Acreage.Harvested')),
c(variants('cvLM'), variants('cvRT',se=c(0,0.5,1))),
cvSettings(3,10,100))
For the regression tree, I added the parameter se = 0, 0.5, and 1. Remember, the 1 SE threshold is xerror + xstd; in general there is a coefficient in front of xstd, given by the numbers used in this setting (0, 0.5, and 1). This setting will therefore create three regression tree variants with varying thresholds (xerror, xerror + 0.5*xstd, xerror + 1*xstd). I set 'cvSettings' to 3,10,100, which means we are going to run a 3 x 10-fold cross-validation: 3 repetitions, each chunking the data into 10 equally sized groups, with the random number generator seeded at 100.
Why do we run iterations? It provides statistics on the performance metric! Let’s take a look by summarizing the results.
Your script should look something like this:
# summarize the results
summary(results)
The output begins by telling us the type of cross-validation that was used, in this case 3x10. It also tells us the seed for the random number generator. Next, the data sets are displayed, followed by four learners (model types): one version of a linear model and three versions of a regression tree (se = 0, 0.5, and 1). Next is the summary of the results, broken down by learner (model). We only saved the metric NMSE, so that is the only metric we will consider. You can change this depending on what you are interested in, but with respect to comparing models the NMSE is a pretty good metric. For each learner, the average, standard deviation, minimum, and maximum of the NMSE are computed. This is possible because of the cross-validation experiment: since we ran it 3 times with 10 folds (10 chunks of data), we can compute statistics on the error of the model. For this case, it appears the linear model is the best, as it has the lowest average NMSE.
If we want to visualize these results, we can run the following code:
Your script should look something like this:
# plot the results
plot(results)
You should get a figure similar to this:
This creates a boxplot of the NMSE values estimated for each iteration and fold (30 total points). It’s a good way to visually determine which model type is best for this particular response variable as well as display the spread of the error.
We now want to take the automation to another level. We want to be able to create models for all the response variables, not just one. To do this, we need to create a list that contains the form of each model. Use the code below:
Your script should look something like this:
# apply function for all variables to predict
dataForm <- sapply(names(clean.stateData)[c(5:9)],
function(x,names.attrs){
f <- as.formula(paste(x,"~."))
dataset(f,lapply(clean.stateData[,c(names.attrs,x)],as.numeric),x)
},
names(clean.stateData)[c(4,10:12)])
dataForm is a list variable that contains the datasets needed for the experimental comparison: form, train, and test. Now we can automate the whole process.
Your script should look something like this:
allResults <- experimentalComparison(dataForm,
c(variants('cvLM'),variants('cvRT',se=c(0,0.5,1))),
cvSettings(5,10,100))
Notice that this time I’ve decided to perform a 5x10 cross-validation process. This might take a bit longer to run, so be patient.
Now let’s plot the results to get a visual.
Your script should look something like this:
# plot all results
plot(allResults)
You should get the following figure:
Each panel represents one of the response variables we want to predict. Within each panel, we see a boxplot for the four learners (the one linear model and the 3 regression trees). In this case, we have 50 estimates of NMSE going into each boxplot. We will see that some variables, like acreage harvested or yield/acre, might have an obvious model selection, but for others it's unclear. We can use the function 'bestScores' to select the model with the smallest average NMSE for each response variable.
Your script should look something like this:
# determine the best scores
bestScores(allResults)
For this example, the linear model is actually chosen for all response variables. And acreage harvested has the smallest overall NMSE, suggesting that acreage harvested might be the corn response variable we should focus on.
There is one more type of model that we should discuss. In addition to regression trees and linear regression models, we can create a random forest. A random forest is an example of an ensemble model. Ensembles combine several individual models and can be better at predicting than any of their individual parts. In this case, a random forest is an ensemble of regression trees. Let's begin by creating a function to compute the random forest.
# now create a random forest (ensemble model)
library(randomForest)
cvRF <- function(form,train,test,...){
modelRF <- randomForest(form,train,...)
predRF <- predict(modelRF,test)
predRF <- ifelse(predRF <0,0,predRF)
mse <- mean((predRF-resp(form,test))^2)
c(nmse=mse/mean((mean(resp(form,train))-resp(form,test))^2))
}
Similar to the other model methods, we need a dataset with the form, train, and test inputs. We then use the experimental comparison but add in our new learner, cvRF.
Your script should look something like this:
allResults <- experimentalComparison(dataForm,
c(variants('cvLM'), variants('cvRT',se=c(0,0.5,1)),
variants('cvRF',ntree=c(70,90,120))),
cvSettings(5,10,100))
The ntree parameter tells the random forest how many individual trees to create. You want enough trees to produce a stable result but few enough to remain computationally efficient. Here's an article, "How Many Trees in a Random Forest? [22]," that suggests the number should be between 64 and 128. If the function is taking too long to run, you probably have too many trees.
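If you want to see the ntree argument in isolation, a minimal sketch of fitting a single forest directly, outside of the comparison framework (and assuming no missing values in these columns), might look like this:
# one stand-alone random forest for acreage harvested with 90 trees
rfExample <- randomForest(Acreage.Harvested~.,
                          data=as.data.frame(lapply(clean.stateData[,c(4:5,10:12)],as.numeric)),
                          ntree=90)
rfExample   # printing it shows the number of trees and the % variance explained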
Let’s run the code with the random forest included and determine the best scores.
Your script should look something like this:
# look at scores
bestScores(allResults)
After running this, you will notice that now we have some random forests listed as the best models instead of only the linear regression model.
After reading this section you should be able to obtain the best model and use it.
Now we know which model is best (linear model, regression tree, or random forest). But how do we actually obtain the model parameters and use it?
At this point, we have determined which type of model best represents the response variable but we have yet to actually obtain the model in a usable way. Let's extract out the best model for each response variable. Start by determining the best model names.
# obtain the best model for each predicted variable
NamesBestModels <- sapply(bestScores(allResults),
function(x) x['nmse','system'])
Next, classify the model type.
modelTypes <- c(cvRF='randomForest',cvRT='rpartXse',cvLM='lm')
Now create a function to apply each model type.
modelFuncs <- modelTypes[sapply(strsplit(NamesBestModels,split="\\."),
function(x) x[1])]
Then obtain the parameters for each model.
modelParSet <- lapply(NamesBestModels,
function(x) getVariant(x,allResults)@pars)
Now we can loop through each response variable and save the best model.
Your script should look something like this:
bestModels <- list()
for(i in 1:length(NamesBestModels)){
form <-as.formula(paste(names(clean.stateData)[4+i],'~.'))
bestModels[[i]]<- do.call(modelFuncs[i],
c(list(form,lapply(clean.stateData[,c(4,4+i,10:12)],as.numeric)),modelParSet[[i]]))
}
The code above begins by creating a formula for the model (response variable~factor variables). It then calls the appropriate entry of 'modelFuncs', which maps to one of our learners (lm, randomForest, rpartXse), and supplies the formula, the data, and the tuned parameters. You are left with a list of five fitted models: for each response variable, the list contains the best model ready to be used for prediction.
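A quick way to check what ended up in the list is to look at the class of each element and print one of the models:
# confirm which learner was kept for each response variable
sapply(bestModels, function(m) class(m)[1])   # e.g. "lm", "randomForest", or "rpart"
bestModels[[1]]                               # the fitted model for the first response variable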
Now that we have the fitted models, let's try predicting values that were not used in the training set. Here is a set of corn and storm data from 2010-2015 [23]. We can use this data to test how well the model actually does at predicting values not used in the training dataset. Remember, the reason I didn't use this dataset for testing before was its limited sample size, but we can use it as a secondary check.
Begin by loading the data and extracting it out for our state of interest (Minnesota).
Your script should look something like this:
# load in test data
mydata2 <- read.csv("finalProcessedSweetCorn_TestData.csv",stringsAsFactors = FALSE)
# create a data frame for this state
stateData2 <- subset(mydata2,mydata2$State == state)
Now we will predict the values using the best model we extracted previously.
Your script should look something like this:
# create predictions
modelPreds <- matrix(ncol=length(NamesBestModels),nrow=length(stateData2$Observation.Number))
for(i in 1:nrow(stateData2))
  modelPreds[i,] <- sapply(1:length(NamesBestModels),
                           function(x) predict(bestModels[[x]],lapply(stateData2[i,4:12],as.numeric)))
modelPreds[which(modelPreds < 0)] <- 0
We loop through each observation and apply the best model for each response variable. We then replace any non-plausible predictions with 0. You are left with a set of predictions for each year for each corn response variable. Although we have only 6 years to work with, we can still calculate the NMSE to see how well the model did with data not included in the training dataset. The hope is that the NMSE is similar to the NMSE calculated in our process for selecting the best model.
Your script should look something like this:
# calculate NMSE for all models
avgValue <- apply(data.matrix(clean.stateData[,5:9]),2,mean,na.rm=TRUE)
apply(((data.matrix(stateData2[,5:9])-modelPreds)^2),2,mean,na.rm=TRUE)/
  apply((scale(data.matrix(stateData2[,5:9]),avgValue,F)^2),2,mean)
The code starts by calculating an average value of each corn response variable based on the longer dataset. Then we calculate the MSE from our 6-year prediction and normalize it by scaling the test data by the long-term average value. You should get these scores:
| Acreage.Harvested | Yield.Acre | Production | Season.Average.Price | Value.of.Production |
|---|---|---|---|---|
| 0.1395567 | 0.7344319 | 0.4688927 | 0.6716387 | 0.7186660 |
For comparison, these were the best scores during model selection. Remember, the ones listed below may be different from yours due to the random factor in the k-fold cross-validation.
$Acreage.Harvested
system score
nmse cv.lm.v1 0.2218191
$Yield.Acre
system score
nmse cv.rf.v2 0.624105
$Production
system score
nmse cv.lm.v1 0.2736667
$Season.Average.Price
system score
nmse cv.rf.v3 0.4404177
$Value.of.Production
system score
nmse cv.lm.v1 0.4834414
The scores are relatively similar. Yield/acre has the highest (worst) NMSE, while acreage harvested and production have the lowest. This confirms that our factor variables are better at predicting acreage harvested than yield/acre.
Let’s go through one more example of the whole process for the entire United States (not just Minnesota). In addition, I'm going to use more factor variables and only look at observations after 1996 (this will include events such as flash floods, wildfires, etc.).
To start, let's extract out the data from 1996 onward for all states.
Your script should look something like this:
# extract data from 1996 onward to get all variables to include
allData <- subset(mydata,mydata$Year >= 1996)
Now clean up the data. I've decided that if no events happen, I will remove that event type. Start by summing up the number of events for each event type (over the whole time period) and determining which factor variables to keep.
Your script should look something like this:
# compute total events
totalEvents <- apply(data.matrix(allData[,4:34]),2,sum,na.rm=TRUE)
varName <- names(which(totalEvents>0))
Now subset the data for only event types where events actually occurred and remove any missing cases.
Your script should look something like this:
# subset data for only events where actual events occurred
clean.data <- allData[varName]
clean.data <- clean.data[complete.cases(clean.data),]
Now create the functions to train, test, and evaluate the different model types.
Your script should look something like this:
# create function to train test and evaluate models through cross-validation
cvRT <- function(form,train,test,...){
modelRT<-rpartXse(form,train,...)
predRT <- predict(modelRT,test)
predRT <- ifelse(predRT <0,0,predRT)
mse <- mean((predRT-resp(form,test))^2)
c(nmse=mse/mean((mean(resp(form,train))-resp(form,test))^2))
}
cvLM <- function(form,train,test,...){
modelLM <- lm(form,train,...)
predLM <- predict(modelLM,test)
predLM <- ifelse(predLM <0,0,predLM)
mse <- mean((predLM-resp(form,test))^2)
c(nmse=mse/mean((mean(resp(form,train))-resp(form,test))^2))
}
library(randomForest)
cvRF <- function(form,train,test,...){
modelRF <- randomForest(form,train,...)
predRF <- predict(modelRF,test)
predRF <- ifelse(predRF <0,0,predRF)
mse <- mean((predRF-resp(form,test))^2)
c(nmse=mse/mean((mean(resp(form,train))-resp(form,test))^2))
}
Now apply these functions to all the response variables.
Your script should look something like this:
# apply function for all variables to predict
dataForm <- sapply(names(clean.data)[c(2:6)],
function(x,names.attrs){
f <- as.formula(paste(x,"~."))
dataset(f,lapply(clean.data[,c(names.attrs,x)],as.numeric),x)
},
names(clean.data)[c(1,7:27)])
allResults <- experimentalComparison(dataForm,
c(variants('cvLM'),
variants('cvRT',se=c(0,0.5,1)),
variants('cvRF',ntree=c(70,90,120))),
cvSettings(5,10,100))
Finally, look at the scores.
Your script should look something like this:
# look at scores
bestScores(allResults)
Hopefully, by including more data points (in the form of multiple states) as well as more factor variables, we can decrease the NMSE. The best score is still for acreage harvested, with an NMSE of 0.0019, but production and value of production are quite good with NMSEs of 0.039 and 0.11 respectively.
Again the end goal is prediction, so let's obtain the best model for each response variable.
Your script should look something like this:
# obtain the best model for each predicted variable
NamesBestModels <- sapply(bestScores(allResults),
function(x) x['nmse','system'])
modelTypes <- c(cvRF='randomForest',cvRT='rpartXse',cvLM='lm')
modelFuncs <- modelTypes[sapply(strsplit(NamesBestModels,split="\\."),
function(x) x[1])]
modelParSet <- lapply(NamesBestModels,
function(x) getVariant(x,allResults)@pars)
bestModels <- list()
for(i in 1:length(NamesBestModels)){
form <-as.formula(paste(names(clean.data)[1+i],'~.'))
bestModels[[i]]<- do.call(modelFuncs[i],
c(list(form,lapply(clean.data[,c(1,i+1,7:27)],as.numeric)),
modelParSet[[i]])) }
And let's test the model again using the separate test dataset. We need to create a clean test dataset first.
Your script should look something like this:
# create a data frame for test data
data2 <- subset(mydata2,mydata2$Year >= 2010)
# subset data for only events where actual events occurred
clean.test.data <- data2[varName]
clean.test.data <- subset(clean.test.data,!is.na(as.numeric(clean.test.data$Acreage.Planted)))
Now calculate the predictions.
Your script should look something like this:
# create predictions
modelPreds <- matrix(ncol=length(NamesBestModels),nrow=nrow(clean.test.data))
for(i in 1:nrow(clean.test.data)){
modelPreds[i,] <- sapply(1:length(NamesBestModels),
function(x)
predict(bestModels[[x]],lapply(clean.test.data[i,c(1,7:27)],as.numeric)))
}
modelPreds[which(modelPreds < 0)] <-0
Finally, calculate the NMSE.
Your script should look something like this:
# calculate NMSE for all models
avgValue <- apply(data.matrix(clean.data[,2:6]),2,mean,na.rm=TRUE)
apply(((data.matrix(clean.test.data[,2:6])-modelPreds)^2),2,mean,na.rm=TRUE)/
  apply((scale(data.matrix(clean.test.data[,2:6]),avgValue,F)^2),2,mean)
In this case, I get relatively high scores for every corn response variable except acreage harvested. Remember, there are fewer points going into this secondary test, so it may be skewed. But one thing is for sure: acreage harvested seems to be the best route for us to go. If we want to add value to corn futures, we would predict the acreage harvested using storm event frequency and acreage planted. Let's just look at the acreage harvested model in more detail. First, let's visually inspect the predictions and the observations for acreage harvested.
Your script should look something like this:
# plot test results
plot(modelPreds[,1],clean.test.data$Acreage.Harvested,main="Linear Model",
     xlab="Predictions",ylab="True values")
abline(0,1,lty=2)
After visually inspecting, we see a really good fit. One of the things to note is that acreage planted is a factor in this model. If there is no difference between acreage planted and acreage harvested, then we aren't really providing more information by looking at storm event frequency. Let's start by looking at the difference between acreage harvested and acreage planted.
Your script should look something like this:
# difference between planted and harvested
differenceHarvestedPlanted <- clean.data$Acreage.Planted-clean.data$Acreage.Harvested
If you now apply the summary function to the difference, you will see that although there are instances where no change has occurred between the planted and the harvested acreage, the difference is typically between 800 and 1,700 acres. We can also perform a t-test. As a reminder, the paired t-test will tell us whether the mean acreage planted is statistically different from the mean acreage harvested.
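The summary call itself is short:
# distribution of the planted-minus-harvested differences
summary(differenceHarvestedPlanted)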
Your script should look something like this:
# perform a t-test
t.test(clean.data$Acreage.Planted,clean.data$Acreage.Harvested,alternative = "two.sided",paired=TRUE)
The result is a P-value <0.01, which means that the mean acreage planted and the acreage harvested are statistically different.
Let's check which variables have statistically significant coefficients in this model (coefficients different from 0).
Your script should look something like this:
# follow up analysis
summary(bestModels[[1]])
We see that the coefficients for the factor variables cold wind chill events, winter storm events, and droughts are statistically significant. The last thing we can do is compare a linear model only using acreage planted (as this is where most of the variability is probably being captured), one which uses acreage planted and the other statistically significant coefficients (cold wind chill events, winter storm events, and droughts), and the one we selected with our automated process.
Your script should look something like this:
# create three separate linear models for comparison
lm.1 <- lm(clean.data$Acreage.Harvested~clean.data$Acreage.Planted)
lm.2 <- lm(clean.data$Acreage.Harvested~as.numeric(clean.data$Acreage.Planted)+as.numeric(clean.data$coldWindChill)+as.numeric(clean.data$winterStorm)+as.numeric(clean.data$drought))
lm.3 <- lm(Acreage.Harvested~.,lapply(clean.data[,c(1,2,7:27)],as.numeric))
If you apply the summary function on all the models you will see that the second model is almost a perfect fit and that lm.3 (which is the best model selected through cross-validation) has a slightly higher R-squared value than lm.1. If we perform an ANOVA analysis between the models, we can determine if the models are statistically different.
Your script should look something like this:
anova(lm.1,lm.2)
anova(lm.1,lm.3)
The results suggest that the change was statistically significant from lm.1 to lm.2. As for the difference between lm.1 and lm.3 (lm.3 is the actual model selected), it is not as strong. This follow-up analysis suggests that it might be best to use linear model 2, which is less complex than linear model 3 but provides a statistically different result from the simple usage of acreage planted. You can see that even though the automation created a good model, by digging in a little more we can further refine it. Data mining is a starting point, not the final solution.
By combining a number of tools you have already learned, we were able to build a model that effectively predicts an aspect of the corn harvest. We were able to dig through a relatively large dataset and pull out patterns that could bring added knowledge to investors and farmers, hopefully aiding in better business decisions.
Links
[1] http://www.investopedia.com/articles/optioninvestor/08/corn-futures-market-seasons.asp
[2] https://www.youtube.com/watch?v=rtYWTzTLLjQ
[3] https://www.youtube.com/watch?v=AsrepRdQKVY
[4] https://hbr.org/2016/11/you-dont-need-big-data-you-need-the-right-data
[5] http://www.forbes.com/sites/metabrown/2016/10/31/5-big-data-and-analytics-learning-resources-that-most-people-miss-but-shouldnt/#2d9a67a53205
[6] http://www.npr.org/sections/ed/2016/10/30/499200614/how-one-university-used-big-data-to-boost-graduation-rates
[7] https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/sas405_psu_edu/EaRKZw2KIFJOpp4ljDRxCCgBjZSAYAGGMctIKQZyGvLSZw?download=1
[8] http://www.wordcloudit.com/
[9] http://celebrating200years.noaa.gov/visions/data_mgmt/image14.html
[10] http://usda.library.cornell.edu/concern/publications/1r66j112r?locale=en
[11] https://usda.library.cornell.edu/concern/publications/1r66j112r?locale=en
[12] https://www.e-education.psu.edu/meteo815/sites/www.e-education.psu.edu.meteo815/files/Rfiles/ProcessedCorn.csv
[13] http://www.ncdc.noaa.gov/stormevents/
[14] http://www.ncdc.noaa.gov/stormevents/pd01016005curr.pdf
[15] https://usda.library.cornell.edu/concern/publications/v118rd53f?locale=en
[16] https://www.e-education.psu.edu/meteo815/sites/www.e-education.psu.edu.meteo815/files/Rfiles/ProcessedSweetCorn_GrowingDates.csv
[17] https://www.e-education.psu.edu/meteo815/sites/www.e-education.psu.edu.meteo815/files/Rfiles/ProcessedSweetCorn_PrincipleCounties_0.csv
[18] https://www.e-education.psu.edu/meteo815/sites/www.e-education.psu.edu.meteo815/files/Rfiles/dataExtraction_stormEvents_ProcessedCorn.R
[19] https://en.wikipedia.org/wiki/Cross-validation_(statistics)#/media/File:K-fold_cross_validation_EN.jpg
[20] https://en.wikipedia.org/wiki/Cross-validation_(statistics)
[21] http://cran.r-project.org/web/packages/rpart/rpart.pdf
[22] https://www.researchgate.net/publication/230766603_How_Many_Trees_in_a_Random_Forest
[23] https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/sas405_psu_edu/EaKvRpB_PktCv2tikxNE6EQBhsI5P9S2XTlrZD97w2ZA8g?download=1