METEO 825
Predictive Analytic Techniques for Meteorological Data

Logistic Regression

Prioritize...

When you have finished reading this section, you should be able to describe a logistic regression, explain the shortcomings of such a model, and perform a logistic regression in R.

Read...

The video below is a bit long, but I encourage you to watch the whole thing, as it provides a nice overview of logistic regression.

Click for the transcript of StatQuest: Logistic Regression.

[SINGING] If you can fit a line, you can fit a squiggle. If you can make me laugh, you can make me giggle. StatQuest.

[JOSH STARMER]: Hello, I'm Josh Starmer and welcome to StatQuest. Today we're going to talk about logistic regression. This is a technique that can be used for traditional statistics as well as machine learning. So let's get right to it.

Before we dive into logistic regression, let's take a step back and review linear regression. In another StatQuest, we talked about linear regression. We had some data, weight, and size. Then we fit a line to it and, with that line, we could do a lot of things. First, we could calculate r-squared and determine if weight and size are correlated. Large values imply a large effect. And second, calculate a p-value to determine if the r-squared value is statistically significant. And third, we could use the line to predict size given weight. If a new mouse has this weight, then this is the size that we predict from the weight. Although we didn't mention it at the time, using data to predict something falls under the category of machine learning. So plain old linear regression is a form of machine learning. We also talked a little bit about multiple regression. Now, we are trying to predict size using weight and blood volume. Alternatively, we could say that we are trying to model size using weight and blood volume. Multiple regression did the same things that normal regression did. We calculated r-squared, and we calculated the p-value, and we could predict size using weight and blood volume. And this makes multiple regression a slightly fancier machine learning method. We also talked about how we can use discrete measurements like genotype to predict size. If you're not familiar with the term genotype, don't freak out. It's no big deal. Just know that it refers to different types of mice.

Lastly, we could compare models. So, on the left side, we've got normal regression using weight to predict size. And we can compare those predictions to the ones we get from multiple regression, where we're using weight and blood volume to predict size. Comparing the simple model to the complicated one tells us if we need to measure weight and blood volume to accurately predict size, or if we can get away with just weight. Now that we remember all the cool things we can do with linear regression, let's talk about logistic regression. Logistic regression is similar to linear regression, except logistic regression predicts whether something is true or false instead of predicting something continuous, like size. These mice are obese, and these mice are not. Also, instead of fitting a line to the data, logistic regression fits an S-shaped logistic function. The curve goes from zero to one, and that means that the curve tells you the probability that a mouse is obese based on its weight. If we weighed a very heavy mouse, there is a high probability that the new mouse is obese. If we weighed an intermediate mouse, then there is only a 50% chance that the mouse is obese. Lastly, there's only a small probability that a light mouse is obese. Although logistic regression tells the probability that a mouse is obese or not, it's usually used for classification. For example, if the probability that a mouse is obese is greater than 50%, then we'll classify it as obese. Otherwise, we'll classify it as not obese. Just like with linear regression, we can make simple models. In this case, we can have obesity predicted by weight or more complicated models. In this case, obesity is predicted by weight and genotype. In this case, obesity is predicted by weight and genotype and age. And lastly, obesity is predicted by weight, genotype, age, and astrological sign. In other words, just like linear regression, logistic regression can work with continuous data like weight and age and discrete data like genotype and astrological sign. We can also test to see if each variable is useful for predicting obesity. However, unlike normal regression, we can't easily compare the complicated model to the simple model, and we'll talk more about why in a bit. Instead, we just test to see if a variable's effect on the prediction is significantly different from zero. If not, it means that the variable is not helping the prediction. Psst. We used Wald's test to figure this out. We'll talk about that in another StatQuest.

In this case, the astrological sign is totes useless. That's statistical jargon for not helping. That means, we can save time and space in our study by leaving it out. Logistic regression's ability to provide probabilities and classify new samples using continuous and discrete measurements makes it a popular machine learning method. One big difference between linear regression and logistic regression is how the line is fit to the data. With linear regression, we fit the line using least squares. In other words, we find the line that minimizes the sum of the squares of these residuals. We also use the residuals to calculate R squared and to compare simple models to complicated models. Logistic regression doesn't have the same concept of a residual, so it can't use least squares, and it can't calculate R squared. Instead, it uses something called maximum likelihood. There's a whole StatQuest on maximum likelihood so see that for details, but in a nutshell, you pick a probability scaled by weight of observing an obese mouse just like this curve, and you use that to calculate the likelihood of observing a non-obese mouse that weighs this much. And then, you calculate the likelihood of observing this mouse, and you do that for all of the mice. And lastly, you multiply all of those likelihoods together. That's the likelihood of the data given this line. Then you shift the line and calculate a new likelihood of the data and then shift the line and calculate the likelihood again, and again. Finally, the curve with the maximum value for the likelihood is selected. BAM!

In summary, logistic regression can be used to classify samples, and it can use different types of data like the size and/or genotype to do that classification. And it can also be used to assess what variables are useful for classifying samples, i.e., astrological sign is totes useless.

Hooray! We've made it to the end of another exciting StatQuest. Do you like this StatQuest and want to see more? Please subscribe. If you have suggestions for future StatQuests, well, put them in the comments below. Until next time, Quest on!

Logistic models, as the video discussed, predict whether something is true or false; that is, there are only two possible outcomes. A logistic model is thus an example of a categorical model. By using several variables of interest, we can create a logistic model and estimate the probability that an outcome will occur.

Overview

Logistic regression estimates the parameters of a logistic model, a widely used statistical model that uses a logistic function to model a binary (two-outcome) dependent variable. The logistic function follows an S-curve, as the image below shows:

Logistic function demonstrating an S-curve
Example of a logistic function
Credit: Qef (talk) - Created from scratch with Gnuplot

Mathematically, the expected probability that Y=1 for a given value of X for logistic regression is:

$$P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$

The expected probability that Y=0 for a given value of X is:

$$P(Y=0) = 1 - P(Y=1)$$

The equation turns linear regression into a probability forecast. Instead of predicting exactly 0 or 1 (the event does not happen, or it does), logistic regression generates a probability between 0 and 1 (the likelihood that the event will occur). As the output of the linear regression equation ($\beta_0 + \beta_1 X$) increases, the exponential term goes to zero, so the probability goes to:

$$\frac{1}{1+0} = 1 \text{, or } 100\%$$

In contrast, as the output of the linear regression equation decreases, the exponential goes to infinity, so the probability goes to:

$$\frac{1}{1+\infty} = \frac{1}{\infty} = 0$$
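To see these limits in action, here is a short R sketch (the coefficient values are arbitrary, chosen purely for illustration) that evaluates the logistic function over a range of predictor values and plots the S-curve:

# illustrative logistic curve; the coefficients below are arbitrary example values
beta0 <- -10
beta1 <- 0.5

# logistic function: maps the linear predictor (beta0 + beta1*x) onto a probability in (0, 1)
logistic <- function(x) 1 / (1 + exp(-(beta0 + beta1 * x)))

x <- seq(0, 40, by = 1)
logistic(min(x))   # near 0: large negative linear predictor
logistic(20)       # 0.5: linear predictor equals zero
logistic(max(x))   # near 1: large positive linear predictor

plot(x, logistic(x), type = "l", xlab = "X", ylab = "P(Y = 1)")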

Shortcomings

Logistic regression is advantageous in that the regression generates a probability that the outcome will occur. But there are some disadvantages to this technique. For one, logistic regression allows only linear relations or KNOWN interactions between variables. For example:

a*Temp + b*Humidity + c*Temp*Humidity

But we often don’t know which variable interactions are important, making logistic regression difficult to use. In addition, logistic regression allows only two possible outcomes: you can only estimate the probability that something occurs or does not.
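If an interaction is known to matter, it must be written into the model explicitly. A minimal sketch in R (the data frame ‘wx’ and its columns are hypothetical) of fitting the temperature and humidity interaction above:

# hypothetical data frame 'wx' with a 0/1 rain column plus Temp and Humidity
# Temp * Humidity expands to Temp + Humidity + Temp:Humidity in R formula syntax
interactionModel <- glm(rain ~ Temp * Humidity, data = wx, family = "binomial")
summary(interactionModel)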

Example

Let’s work through an example in R. Begin by downloading this data set of daily values from Burnett, TX. The variables included are: the date (DATE), average daily wind speed (AWND), time of the fastest one-mile wind (FMTM), peak gust time (PGTM), total daily precipitation (PRCP), average daily temperature (TAVG), maximum daily temperature (TMAX), minimum daily temperature (TMIN), wind direction of the fastest 2-minute wind (WDF2), wind direction of the fastest 5-second wind (WDF5), the fastest 2-minute wind speed (WSF2), and the fastest 5-second wind speed (WSF5). To review the variables or their units, please use this reference.

For this example, we are going to create a logistic model of rain for Burnett, TX using all of these variables. To start, load in the data set.

Show me the code...
# load in daily summary data for Burnett, TX
mydata <- read.csv("Texas_Daily_Summaries.csv")

We only want to retain complete cases, so remove any rows (observational dates) in which any variable has a missing value.

Show me the code...
# obtain only complete cases
mydata <- na.omit(mydata)

# check for missing values
badIndex <- which(is.na(mydata))

‘badIndex’ should be empty (this is a way of checking that all NAs were removed). Now that we have a complete data set, we need to create a categorical variable. I will model two rain outcomes: rain occurred (yes = 1) and rain did not occur (no = 0). Use the code below:

Show me the code...
# turn precipitation into a categorical variable with 2 outcomes (it rained = 1, it did not rain = 0)
mydata$rain <- array(0, length(mydata$PRCP))
mydata$rain[which(mydata$PRCP > 0)] <- 1

Now, I want to remove the variables DATE and PRCP, as we will not use them in the model. Use the code below:

Show me the code...
# remove $PRCP and $DATE
mydata <- subset(mydata, select = -c(1, 5))

We are left with a data frame that only contains the dependent variable (rain) and the independent variables (all other variables). Next, we need to split the data into training and testing. For this example, run the code below to create a 2/3 | 1/3 split.
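One way to write that call (here the split is stratified on the rain outcome) is:

Show me the code...
# load caTools, which provides the sample.split function
library(caTools)

# randomly flag 2/3rds (0.66) of the observations as TRUE for training
# (no seed is set, so your split will differ from run to run)
split <- sample.split(mydata$rain, SplitRatio = 0.66)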

The function ‘sample.split’ from the package ‘caTools’ randomly selects 2/3rds (0.66) of the observations (set to TRUE). We can then create our training and testing set as follows:
Show me the code...
#get training and test data
train<- subset(mydata,split == TRUE)
test <- subset(mydata,split == FALSE)

Now, we can create a logistic model by running the code below:
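A minimal version of that call (‘rainModel’ is simply the name we choose for the fitted model object) is:

Show me the code...
# fit a logistic regression of rain on all the other variables in the training set
rainModel <- glm(rain ~ ., data = train, family = "binomial")

# inspect the coefficients and their p-values
summary(rainModel)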

The function ‘glm’ is a generalized linear model. We are telling the function to model ‘rain’ from the training data using all the variables (~.) in the training set. The ‘family’ parameter tells the function to use logistic regression (‘binomial’).

Some of the coefficients are not statistically significant. What can we do? Well, similar to stepwise linear regression, we can perform a stepwise logistic regression. Run the code below:
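One way to do that, starting from the full model fit above, is:

Show me the code...
# stepwise selection: drop or add terms to minimize the AIC
bestModel <- step(rainModel)

# summarize the selected model
summary(bestModel)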

The ‘step’ function selects the best model based on minimizing the AIC score. If you summarize the model, you will find that all the remaining coefficients are statistically significant at alpha = 0.05. The final variables will be slightly different each time you run the code, as there is a random aspect to the data split (and thus to the model the step function selects). But the model will probably include: AWND (average daily wind speed), FMTM (time of the fastest one-mile wind), TMAX (maximum temperature), TAVG (average temperature) or TMIN (minimum temperature), WDF2 (wind direction of the fastest 2-minute wind), WSF2 (fastest 2-minute wind speed), and WSF5 (fastest 5-second wind speed).

We can predict our testing values using the code below:

Show me the code...
# predict probabilities for the testing data
predictGLM <- predict(bestModel, newdata = test, type = "response")

The ‘type’ parameter (‘response’) returns predictions on the scale of the response variable, that is, probabilities between 0 and 1, instead of the default, which is on the scale of the linear predictor. Unlike linear regression, where we could plot the observed vs. predicted values and assess the fit, here we must use alternative assessment techniques. For categorical forecasts, we can display the confusion matrix (refer back to previous lessons for more details). Run the code below:
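A simple way to build that table (a 0.5 probability cutoff is assumed here) is:

Show me the code...
# confusion matrix: observed rain (rows) vs. predicted rain (columns) at a 0.5 cutoff
table(test$rain, predictGLM > 0.5)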

You should get a similar table to the one below (again, there is a random component, so the numbers will most likely not be identical):

Confusion Matrix for Rain in Burnett, TX

                        Predicted FALSE (no rain)   Predicted TRUE (rain)
Observed 0 (no rain)                          462                      48
Observed 1 (rain)                              80                      94

Try changing the probability threshold to see how the confusion matrix changes. Or we can plot the ROC curve using the code below (again, refer back to previous lessons for more details):

Show me the code...
#ROC Curve
library(ROCR)
ROCRpred <- prediction(predictGLM, as.character(test$rain))
ROCRperf <- performance(ROCRpred, 'tpr','fpr')
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2,1.7))

The first line creates a standardized ROCR prediction object from the predicted probabilities and the observed labels (as.character(test$rain)). The next line (performance) evaluates the True Positive Rate (TPR) and the False Positive Rate (FPR). The final line plots the assessment. The color represents the probability cutoffs. You should get the following figure:

Figure with colored line representing the probability cutoffs.
ROC Curve for precipitation model in Burnett, TX
Credit: J. Roman

When the probability cutoff is high (near 100%), the false positive rate and true positive rate are both low because we have forecasted very few events. When the probability cutoff is low, we have a high TPR (good) and a high FPR (bad), since we are forecasting most of the events. The trick here is finding the right threshold for our particular decision cost structure.
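One way to explore that trade-off (a sketch; the candidate cutoffs below are arbitrary) is to rebuild the confusion matrix at several thresholds and compare the results:

# compare confusion matrices at a few illustrative probability cutoffs
for (cutoff in c(0.3, 0.5, 0.7)) {
  cat("Cutoff =", cutoff, "\n")
  print(table(test$rain, predictGLM > cutoff))
  cat("\n")
}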