GEOG 586
Geographic Information Analysis

Project 5: Examine Relationships in Data: Performing and Assessing Regression Analysis


To perform a regression analysis, we will use several functions available in RStudio. To begin, we will create a multivariate regression model using the independent variables for this assignment. Chunk 8 shows the functions that will be used to run and summarize the regression output, conduct a VIF test, and examine the results of the ANOVA test.

### Chunk 8: Multivariate Regression, VIF, and ANOVA
```{r}

# Load the car Package, Which Provides the vif() Function
library(car)

# Run a Regression Analysis on the Poverty Data
Poverty_Regress <- lm(Poverty_Data_Cleaned$PctFamsPov ~ Poverty_Data_Cleaned$PctNoHealth + Poverty_Data_Cleaned$MedHHIncom + Poverty_Data_Cleaned$PCTUnemp)

# Report the Regression Output
summary(Poverty_Regress)

# Carry out a VIF Test on the Independent Variables
vif(Poverty_Regress)

# Carry out an ANOVA Test
anova(Poverty_Regress)

```

Perform a Regression Analysis

In RStudio, a regression is carried out using the lm() function (linear model). Using our poverty data, the dependent variable is the percent of families below the poverty line, while the independent variables are the percent of individuals without health insurance, median household income, and the percent of unemployed individuals. The tilde character ~ indicates that the percent of families below the poverty line is being predicted from the independent variables, and additional independent variables are appended to the formula with the “+” sign. Note that we are creating a new object called Poverty_Regress that holds all of the regression output.
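As an aside, the same model can be written more concisely by naming the columns once and passing the data frame through lm()'s data argument. This sketch assumes the column names shown above; it fits an identical model, though the coefficient labels in the output become the bare column names rather than the Poverty_Data_Cleaned$ form shown below.

```{r}
# Equivalent, more concise call: reference the columns directly and
# supply the data frame via the data argument
Poverty_Regress <- lm(PctFamsPov ~ PctNoHealth + MedHHIncom + PCTUnemp,
                      data = Poverty_Data_Cleaned)
```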

[Figure: Annotated Regression Analysis in RStudio]

To summarize the results of this regression, use the summary() function:

> summary(Poverty_Regress)

which returns the following output…

```
Residuals:
     Min       1Q   Median       3Q      Max 
-11.8404  -2.9583   0.6864   2.7394   9.2912 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      30.9649344  8.4170726   3.679 0.000429 ***
Poverty_Data_Cleaned$PctNoHealth  0.8289709  0.3323119   2.495 0.014725 *  
Poverty_Data_Cleaned$MedHHIncom  -0.0005061  0.0001061  -4.770 8.43e-06 ***
Poverty_Data_Cleaned$PCTUnemp     2.1820454  0.4326694   5.043 2.91e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.797 on 78 degrees of freedom
Multiple R-squared:  0.6984,	Adjusted R-squared:  0.6868 
F-statistic: 60.21 on 3 and 78 DF,  p-value: < 2.2e-16
```

Writing the Regression Equation

Recall that the general regression equation takes the form

$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_n x_n$

where

- $\hat{y}$ = the predicted value of $y$ (often called “y-hat”)
- $\beta_0$ = the y-intercept (the value on the y-axis where the line of best fit crosses)
- $\beta_1, \beta_2, \beta_3, \dots, \beta_n$ = the regression coefficients of the predictor variables $x_1, x_2, x_3, \dots, x_n$

According to the values shown in the RStudio output under the Coefficients heading, the regression equation for percent below poverty predicted from the independent variables is shown in Equation 5.1. The value 30.9649344 is the y-intercept, while each remaining value is the coefficient (partial slope) for its independent variable; for example, 0.8289709 is the coefficient for the percent without health insurance. Thus, rounding the coefficients, we have the following:

$\hat{y} = 30.96 + 0.83\,x_1 - 0.0005\,x_2 + 2.18\,x_3$
(5.1)
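Rather than transcribing these values by hand, the intercept and coefficients used in Equation 5.1 can also be pulled directly from the model object:

```{r}
# Extract the intercept and the three coefficients used in Equation 5.1
coef(Poverty_Regress)
```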

Equation 5.1 shows that inserting a value of x for each independent variable into the equation yields a predicted value of y (percent below poverty) based on the regression coefficients. Remember that the original variables were measured in percentages and dollars.

Interpreting Equation 5.1, we see that when all of the independent variables’ values are 0 (the value of x = 0), the baseline percentage of families below the poverty line is 30.96%. In other words, if everyone had health insurance, the median household income were zero, and no one were unemployed, the model would still predict a moderate level of poverty. This is a rather unrealistic scenario, but it helps in understanding the baseline that the regression equation is reporting.

Continuing on, a one percentage point increase in the percent unemployed (keeping all other variables constant) adds 2.18 percentage points to the percent of families below the poverty line. Evidence of this relationship can also be seen in the correlation of 0.71 between these two variables reported in Table 5.2. By contrast, each additional dollar of median household income decreases the percent of families below the poverty line by a paltry 0.0005 percentage points. In other words, small changes in the percent unemployed appear to be far more impactful in determining the percent of families in poverty than comparable small changes in median household income.
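To see Equation 5.1 at work, we can plug in a set of illustrative values (these are hypothetical, not drawn from the data) and compute the predicted poverty rate:

```{r}
# A quick sketch using hypothetical values: 15% without health insurance,
# a median household income of $45,000, and 6% unemployment
b <- coef(Poverty_Regress)
y_hat <- b[1] + b[2] * 15 + b[3] * 45000 + b[4] * 6
unname(y_hat)  # predicted percent of families below the poverty line
```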

Assessing the Regression Output

Just how significant is the regression model? RStudio provides several metrics by which we can assess the model’s performance.

The t- and p-values

The t-value is given for the intercept and each independent variable. We are most interested in whether or not an independent variable is statistically significant in predicting the percent below poverty, and the t- and p-values assist in this determination. Notice that the t-values listed range from -4.770 to 5.043. As an example, consider the largest of these (5.043). Using α = 0.01 and roughly 80 degrees of freedom, the tabled critical t-value is 2.374, meaning that the likelihood of a t-value of 2.374 or greater occurring by chance is less than 1%. A t-value of 5.043 is therefore a very rare event, and this variable is statistically significant in the model.
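If a t-table is not handy, the same critical value and p-value can be computed in R; a quick sketch using the model's 78 residual degrees of freedom:

```{r}
# One-tailed critical t-value at alpha = 0.01 with 78 degrees of freedom
qt(0.01, df = 78, lower.tail = FALSE)

# Two-tailed p-value for the observed t-value of 5.043
2 * pt(5.043, df = 78, lower.tail = FALSE)
```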

To confirm each independent variable's statistical significance, we can examine its p-value, listed under the Pr(>|t|) heading. For example, the p-value for the percent with no health insurance is 0.014, which is statistically significant at the 0.1 and 0.05 levels, but not at the 0.01 level. Depending on our chosen p-value threshold, we can conclude that this variable is statistically significant in the regression model. Both median household income and percent unemployed have very small p-values, indicating that both are highly statistically significant in predicting the percent below the poverty line, as flagged by the “***” code printed to the right of each Pr(>|t|) value.
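These p-values can also be extracted programmatically from the summary object, which is handy when reporting results:

```{r}
# Pull the coefficient table from the model summary and isolate the p-values
summary(Poverty_Regress)$coefficients[, "Pr(>|t|)"]
```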

Coefficient of Determination

The r2 value assesses the degree to which the overall model is able to predict a value of y accurately. The multiple r2 value reported here is 0.6984. Had there been only one independent variable, the multiple r2 would simply be the square of Pearson’s r between that variable and the dependent variable. Interpreting the value of 0.6984 can be tricky. Generally speaking, a high r2 suggests that the independent variables are doing a good job of explaining the variation in the dependent variable, while a low r2 suggests they are not. Here, the three independent variables explain almost 70% of the variation in percent below the poverty line, leaving room for other variables that might increase the predictive ability of the model.

It is important to note that the multiple r2 tends to overestimate the true population coefficient of determination, since it can only increase as variables are added to the model. The adjusted r2 corrects for the number of independent variables and thus provides a more conservative estimate; its value will always be less than the multiple r2. In this case, the adjusted r2 value is 0.6868.
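Both values can be read directly from the summary object:

```{r}
# Multiple and adjusted r-squared from the model summary
summary(Poverty_Regress)$r.squared
summary(Poverty_Regress)$adj.r.squared
```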

Examining the Independent Variables in the Regression Model

Although seemingly useful, one should not rely solely on the value of r2 as a measure of the overall model’s success at making predictions. Instead, we need to confirm the statistical contribution of each variable to the model’s ability to predict y. We can do this with the anova() function applied to Poverty_Regress, which calculates an F-statistic and its p-value for each independent variable in the equation.

```
Analysis of Variance Table

Response: Poverty_Data_Cleaned$PctFamsPov
                                 Df  Sum Sq Mean Sq F value    Pr(>F)    
Poverty_Data_Cleaned$PctNoHealth  1 1588.09 1588.09  69.001 2.393e-12 ***
Poverty_Data_Cleaned$MedHHIncom   1 1983.49 1983.49  86.181 3.046e-14 ***
Poverty_Data_Cleaned$PCTUnemp     1  585.37  585.37  25.434 2.908e-06 ***
Residuals                        78 1795.20   23.02                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Looking at the RStudio output from the anova() function, we see several data values for each variable. The ANOVA table can help us refine our understanding of the contribution that each independent variable makes to the overall regression model. To learn about each independent variable’s contribution to the model’s ability to explain the variation in the dependent variable, we examine the values in the sum of squares (Sum Sq) column. The total sum of squares for the model is 5952.15 (add up the individual values in the Sum Sq column). Dividing each variable’s sum of squares by this total gives the percent contribution that variable makes in explaining the dependent variable.

For example, percent with no health insurance has a sum of squares value of 1588.09, and 1588.09 / 5952.15 = 27%. This implies that percent with no health insurance explains 27% of the variation in percent below the poverty line. Median household income and percent unemployed explain roughly 33% and 10%, respectively. Adding these contributions together produces essentially the same 70% of explained variation reported by the multiple r2 statistic (0.6984) in the summary() output. Note that anova() reports sequential sums of squares, so these contributions depend on the order in which the variables were entered into the model.
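This arithmetic is easy to reproduce in R; a short sketch:

```{r}
# Percent of total variation attributed to each term; the last entry is
# the residual (unexplained) share
aov_table <- anova(Poverty_Regress)
round(100 * aov_table[["Sum Sq"]] / sum(aov_table[["Sum Sq"]]), 1)
```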

Multicollinearity of the Independent Variables

Ideally, we would like the independent variable (or, in the case of multiple regression, several independent variables) to be highly correlated with the dependent variable, implying that it potentially contributes to explaining the dependent variable. However, it is also desirable to limit the degree to which the independent variables correlate with each other. For example, assume a researcher is examining a person’s weight as a function of their height, recording each person’s height in both centimeters and inches. Obviously, the two height variables would be perfectly correlated with each other; each gives the same predictive ability for weight, so they are redundant. One of these height variables should be eliminated to avoid what is referred to as multicollinearity.

The important question here is: when should the researcher consider removing one or more independent variables? The VIF (variance inflation factor) test helps answer this question in cases where there are several independent variables. The vif() function examines the independent variables that were used to build the multiple regression model; specifically, it measures the degree to which each independent variable is correlated with the others. You must have more than one independent variable to use the vif() function. The smallest possible value of VIF is one (a complete absence of multicollinearity). As a rule of thumb, a VIF value that exceeds 5 or 10 (depending on the reference selected) indicates a problem with multicollinearity (James et al., 2014). Based on the vif() results shown below, none of our independent variables exhibit multicollinearity, as all of the values are smaller than 5.

```
Poverty_Data_Cleaned$PctNoHealth  Poverty_Data_Cleaned$MedHHIncom    Poverty_Data_Cleaned$PCTUnemp 
                        1.368166                         2.018607                         1.616680 
```
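Because VIF values are driven by correlations among the predictors, a quick look at the pairwise correlation matrix tells a consistent story. This is a sketch assuming the three column names used above are numeric columns of the data frame:

```{r}
# Pairwise correlations among the three independent variables
cor(Poverty_Data_Cleaned[, c("PctNoHealth", "MedHHIncom", "PCTUnemp")])
```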

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated.