Establishing Relationships Between Two Variables
Another important application of OLS is the comparison of two different data sets. In this case, we can think of one of the time series as constituting the independent variable x and the other as constituting the dependent variable y. The methods that we discussed in the previous section for estimating trends in a time series generalize readily, except that our predictor is no longer time but some other variable. Note that the correction for autocorrelation is somewhat more complicated in this case, and the details are beyond the scope of this course. As a general rule, even if the residuals show substantial autocorrelation, the required correction to the statistical degrees of freedom (N') will be small as long as at least one of the two time series being compared has low autocorrelation. Nonetheless, any substantial structure in the residuals remains a cause for concern regarding the reliability of the regression results.
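As a rough illustration of regressing one series on another, the sketch below fits y = intercept + slope * x by ordinary least squares using only the Python standard library. The function name ols_fit and the five data pairs are hypothetical and are not the course data set; they only show the mechanics.

```python
import math

def ols_fit(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length (at least 3)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and cross products
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical illustration values (NOT the course data set)
index = [-1.5, -0.5, 0.0, 0.5, 1.5]     # predictor x, e.g. a climate index
temp = [29.0, 31.5, 30.0, 32.0, 31.0]   # response y, e.g. a temperature

slope, intercept, r = ols_fit(index, temp)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")
```

The only change from a trend fit is that x holds the values of the second series rather than the year.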
We will investigate this sort of application of OLS with an example, where our independent variable is a measure of El Niño — the so-called Niño 3.4 index — and our dependent variable is December average temperatures in State College, PA.
The demonstration is given in three parts below:
Video: Demo - Part 1 (3:22)
PRESENTER: Now we're going to look at a somewhat different situation, where our independent variable is no longer time but some other quantity. It could be temperature, or it could be an index of El Niño or the North Atlantic Oscillation. Let's look at an example of that sort. We are going to look at the relationship between El Niño and December temperatures in State College, Pennsylvania. We can plot that relationship as a scatterplot. On the y-axis we have December temperature in State College; the x-axis is our independent variable, the Niño 3.4 index. Negative values indicate La Niñas and positive values indicate El Niños. The strength of the relationship between the two is going to be given by the slope of the trend line, which describes how December temperatures in State College depend on El Niño. By fitting the regression, we obtain a slope of 0.7397. That means for each unit change in Niño 3.4, we get a 0.74 unit change in temperature. So for a moderate El Niño event, where the Niño 3.4 index is in the range of plus one, that would imply that December temperatures in State College for that year are 0.74 degrees Fahrenheit warmer than usual. And for a modestly strong La Niña, where the Niño 3.4 index is on the order of minus one or so, State College December temperatures would be about 0.74 degrees colder than normal. You can also see that at the y-intercept, the case where the Niño 3.4 index is zero, we get roughly the climatological value for December temperatures, 30.9. Now, the correlation coefficient associated with that linear regression is in this case 0.174. We have 107 years in our data set; as before, it goes from 1888 to 1994. If we use our table and take N equal to 107 and r of 0.174, we find that the one-tailed value of p is 0.0365 and the two-tailed value is 0.073.
So if our threshold for significance is p of 0.05, the 95 percent significance level, then that relationship, a correlation coefficient of 0.174 with 107 years of information, would be significant for a one-tailed test, but it would not pass the 0.05 (95%) significance threshold for a two-tailed test. So we have to ask the question: which is more appropriate here, the one-tailed test or the two-tailed test? Now, if you had a reason to believe that El Niño events warm the northeastern US, for example, you might motivate a one-tailed test, since only a positive relationship would be consistent with your expectations. But if we didn't know beforehand whether El Niños had a cooling or a warming influence on the northeastern US, you might argue for a two-tailed test. So whether or not the relationship is significant at the p = 0.05 level is going to depend on which type of hypothesis test we're able to use in this case.
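The one-tailed and two-tailed values can also be checked without a lookup table. The sketch below is one possible approach, not necessarily how the course table was built: it converts r to a t statistic with N - 2 degrees of freedom and integrates the Student's t density numerically with Simpson's rule. The helper names t_pdf and correlation_p_values are hypothetical. It assumes independent samples (no autocorrelation correction), and the one-tailed value refers to the tail in the direction of the observed correlation.

```python
import math

def t_pdf(x, dof):
    """Probability density of Student's t with dof degrees of freedom."""
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(x * x / dof))

def correlation_p_values(r, n, steps=2000):
    """Return (t, one_tailed_p, two_tailed_p) for correlation r from n samples.

    steps must be even (Simpson's rule).
    """
    dof = n - 2
    t = r * math.sqrt(dof / (1.0 - r * r))
    upper = abs(t)
    h = upper / steps
    # Simpson's rule for the area under the density between 0 and |t|
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    area = total * h / 3
    one_tailed = 0.5 - area          # probability beyond |t| in one tail
    return t, one_tailed, 2 * one_tailed

t, p_one, p_two = correlation_p_values(0.174, 107)
print(f"t = {t:.3f}, one-tailed p = {p_one:.4f}, two-tailed p = {p_two:.4f}")
```

With r = 0.174 and N = 107, the one-tailed value falls below 0.05 and the two-tailed value falls above it, which is the split described in the demonstration.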
Video: Demo - Part 2 (4:10)
PRESENTER: Let's continue with this analysis. What I'm going to do here is plot the temperature as a function of the year instead of Niño 3.4. That's plot number one: the State College December temperatures. And now for plot number two, I'm going to plot the Niño 3.4 index as a function of year. I use axis B here to put them on the same scale, so here we can see the two series. We have the State College December temperatures in blue and the Niño 3.4 index in yellow. And you can see that in various years there does seem to be a bit of a relationship: large positive departures in the Niño 3.4 index are associated with warm December temperatures, and large negative departures are associated with cold temperatures. We can see that relationship visually. We also saw it when we plotted the two variables in a two-dimensional scatterplot and looked at the slope of the line relating the two data sets. Here we're looking at the time series of the two data sets, and we can see some of that positive covariance, if you will. There does seem to be a positive relationship, although we already know it's a fairly weak relationship. So let's do a formal regression.
So I'm going to take away the Niño series here. What we've got here is our State College December temperatures in blue. Now our regression model is going to use the Niño 3.4 index as the independent variable and temperature as our dependent variable. We'll run the linear regression. There is a slope: 0.74 is the coefficient that describes the relationship between the Niño 3.4 index and December temperatures. It's positive; we already saw the slope was positive. There's also a constant term that we're not going to worry much about here. What we're really interested in is the slope of the regression line, which describes how changes in temperature depend on changes in the Niño 3.4 index. And as we've seen, 0.74 implies that for a unit increase in Niño 3.4, an anomaly of +1 on the Niño 3.4 scale, we'll get a temperature for December that, on average, is 0.74 degrees Fahrenheit warmer than average. The r-squared value right here is 0.032, and if we take the square root of that number, that's an r-value of 0.1734, and we know that's a positive correlation because the slope is positive. We already looked up the statistical significance of that number, and we found that for a one-sided hypothesis test the relationship is significant at the 0.05 level. But if we were using a two-sided hypothesis test, that is to say, if we didn't know a priori whether El Niños warm or cool State College December temperatures, then the relationship would not quite be statistically significant. So we've calculated the linear model, and now we can plot it. I'm going to plot year and model output on the same scale. You can change the scale of these axes by clicking on these arrows. I'm going to make them both go from 20 to 40. This one over here. And so now the yellow curve is showing us the component of variation in the blue curve that can be explained by El Niño, and we can see it's a fairly small component.
It's small compared to the overall level of variability in December State College temperatures, which vary by as much as plus or minus 4 degrees Fahrenheit or so.
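To get a feel for how small the El Niño component is, the model output can be evaluated directly from the fitted line. The sketch below uses the slope (0.7397) and the approximate intercept (30.9) quoted in the demonstration; the function name predicted_december_temp and the chosen index values are illustrative assumptions.

```python
SLOPE = 0.7397      # degrees F per unit of the Nino 3.4 index (from the demo)
INTERCEPT = 30.9    # degrees F, approximate value when the index is zero

def predicted_december_temp(nino34):
    """Model output: the part of December temperature explained by Nino 3.4."""
    return INTERCEPT + SLOPE * nino34

# Illustrative index values spanning strong La Nina to strong El Nino
for nino34 in (-2.0, -1.0, 0.0, 1.0, 2.0):
    anomaly = SLOPE * nino34
    print(f"Nino 3.4 = {nino34:+.1f}: anomaly {anomaly:+.2f} F, "
          f"model output {predicted_december_temp(nino34):.2f} F")
```

Even an index of plus or minus 2 shifts the model output by only about 1.5 degrees Fahrenheit, compared with year-to-year swings of roughly 4 degrees in the observed series.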
Video: Demo - Part 3 (3:22)
PRESENTER: So, continuing where we left off: the yellow curve is showing us the component of the variation in December State College temperatures that can be explained by El Niño. In a particularly strong El Niño year, where the Niño 3.4 index is, say, as large as +2, we get a December temperature that's about one and a half degrees Fahrenheit above average. That is to say, twice the 0.74 degrees Fahrenheit that we get for a one-unit change in Niño 3.4. For a particularly strong La Niña event, with a Niño 3.4 anomaly of negative two or so, we get a cooling effect of about -1.5 degrees Fahrenheit. So the influence of El Niño is small compared to the overall variability of roughly four degrees Fahrenheit in the series. But it is statistically significant, at least if we are able to motivate a one-sided hypothesis test. If we had reason to believe that El Niño events warm State College temperatures in the winter, then the regression gives us a result that's significant at the 0.05 level, the standard threshold for statistical significance. Okay, so that may not be that satisfying. We're not explaining a large amount of the variation in the data, but we do appear to be explaining a statistically significant fraction of the variability in the data. Now finally, let's look at the residuals from that regression. So what I'll do is get rid of these other graphs. Let's keep year. Let's change this to model residuals. I'm just going to plot the model residuals as a function of time, and that's what they look like. There isn't a whole lot of obvious structure, and in fact, if we go back to the regression model tab and look at the value of the lag-1 autocorrelation coefficient, we'll see that it's -0.09. That's slightly negative and quite small, close to zero. If we look up the statistical significance, it's not going to be even remotely significant.
So we don't have to worry much about autocorrelation influencing our estimate of statistical significance. We also don't have much evidence here of the sort of low-frequency structure in the residuals that might cause us to worry. So the nominal results of our regression analysis appear to be valid, and again, if we were to invoke a one-sided hypothesis test, we would have found a statistically significant, albeit weak, influence of El Niño on State College December temperatures.
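The residual check described above can be sketched as follows. The first function computes a lag-1 autocorrelation coefficient from a list of residuals. The second applies a common large-sample rule of thumb (an assumption here, not necessarily the lookup used in the demonstration): for uncorrelated noise, the lag-1 coefficient is approximately normal with standard deviation 1/sqrt(N), so values within about 1.96/sqrt(N) of zero are not significant at the two-sided 0.05 level. Both function names are hypothetical.

```python
import math

def lag1_autocorrelation(series):
    """Lag-1 autocorrelation coefficient of a sequence of values."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    numer = sum(dev[i] * dev[i + 1] for i in range(n - 1))
    return numer / denom

def is_significant(rho1, n, z=1.96):
    """Rule-of-thumb test: is |rho1| larger than z / sqrt(n)?"""
    return abs(rho1) > z / math.sqrt(n)

# Value quoted in the demonstration: rho1 = -0.09 with N = 107 years
threshold = 1.96 / math.sqrt(107)
print(f"threshold = {threshold:.3f}, significant: {is_significant(-0.09, 107)}")
```

Under this approximation, a coefficient of -0.09 lies well inside the threshold, consistent with the conclusion that autocorrelation in the residuals is not a concern here.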
You can play around with the data set used in this example using this link: Explore Using the File testdata.txt