From Meteorology to Mitigation: Understanding Global Warming

Review of Basic Statistical Analysis Methods for Analyzing Data - Part 3


Establishing Relationships Between Two Variables

Another important application of OLS is the comparison of two different data sets. In this case, we can think of one of the time series as constituting the independent variable x and the other as constituting the dependent variable y. The methods that we discussed in the previous section for estimating trends in a time series generalize readily, except that our predictor is no longer time but some other variable. Note that the correction for autocorrelation is somewhat more complicated in this case, and the details are beyond the scope of this course. As a general rule, even if the residuals show substantial autocorrelation, the required correction to the statistical degrees of freedom (N') will be small as long as either one of the two time series being compared has low autocorrelation. Nonetheless, any substantial structure in the residuals remains a cause for concern regarding the reliability of the regression results.
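The general rule about N' can be illustrated with a rough sketch. The course does not give a formula for this case, so the code below assumes one commonly used approximation, N' = N (1 - r1x r1y) / (1 + r1x r1y), where r1x and r1y are the lag-1 autocorrelation coefficients of the two series. The function name and the example autocorrelation values are hypothetical, chosen only for illustration.

```python
def effective_sample_size(n, r1_x, r1_y):
    """Approximate effective sample size N' when correlating two series.

    ASSUMPTION: uses the approximation N' = N * (1 - r1_x*r1_y) / (1 + r1_x*r1_y),
    which is not taken from the course text. r1_x and r1_y are the lag-1
    autocorrelation coefficients of the two series.
    """
    product = r1_x * r1_y
    return n * (1.0 - product) / (1.0 + product)


# Hypothetical autocorrelation values, for illustration only.
n_eff_one_white = effective_sample_size(107, 0.6, 0.0)  # one series uncorrelated
n_eff_both_red = effective_sample_size(107, 0.6, 0.6)   # both series autocorrelated

print(n_eff_one_white, n_eff_both_red)
```

Under this approximation, the correction vanishes whenever either series has zero lag-1 autocorrelation, which is consistent with the rule of thumb stated above.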

We will investigate this sort of application of OLS with an example, where our independent variable is a measure of El Niño (the so-called Niño 3.4 index) and our dependent variable is December average temperatures in State College, PA.
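As a minimal sketch of what regressing one variable on another involves, the code below computes the OLS slope, intercept, and correlation coefficient from first principles. The function name `ols_fit` and the sample values are made up for illustration; they are not the course data set.

```python
import math


def ols_fit(x, y):
    """Return (slope, intercept, r) for an ordinary least squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products of deviations from the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Made-up illustrative values (NOT the State College record).
nino34_index = [-1.5, -0.5, 0.0, 0.5, 1.5]
december_temp_f = [29.0, 31.5, 30.0, 32.0, 31.0]

slope, intercept, r = ols_fit(nino34_index, december_temp_f)
print(slope, intercept, r)
```

The slope has units of the dependent variable per unit of the independent variable, here degrees Fahrenheit per unit of the Niño 3.4 index.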

The demonstration is given in three parts below:

Video: Demo - Part 1 (3:22)


PRESENTER: Now we're going to look at a somewhat different situation, where our independent variable is no longer time but some other quantity. It could be temperature, or it could be an index of El Niño or the North Atlantic Oscillation. Let's look at an example of that sort. We are going to look at the relationship between El Niño and December temperatures in State College, Pennsylvania, and we can plot that relationship as a scatterplot. On the y-axis we have December temperature in State College. The x-axis is our independent variable, the Niño 3.4 index: negative values indicate La Niñas and positive values indicate El Niños. The strength of the relationship between the two is going to be determined by the trend line that describes how December temperatures in State College depend on El Niño.

By fitting the regression, we obtain a slope of 0.7397. That means that for each unit change in Niño 3.4, we get a 0.74 unit change in temperature. So for a moderate El Niño event, where the Niño 3.4 index is in the range of +1, that would imply that December temperatures in State College for that year are 0.74 degrees Fahrenheit warmer than usual. And for a modestly strong La Niña, where the Niño 3.4 index is on the order of -1 or so, State College December temperatures would be about 0.74 degrees colder than normal. You can also see that at the y-intercept, the case where the Niño 3.4 index is zero, we get roughly the climatological value for December temperatures, 30.9.

Now, the correlation coefficient associated with that linear regression is in this case 0.174. We have 107 years in our data set; as before, it goes from 1888 to 1994. If we use our table and take N equal to 107 and an r of 0.174, we find that the one-tailed value of p is 0.0365 and the two-tailed value is 0.073. So if our threshold for significance were p of 0.05, the 95 percent significance level, then that relationship, a correlation coefficient of 0.174 with 107 years of information, would be significant for a one-tailed test, but it would not pass the 0.05 (95%) significance threshold for a two-tailed test.

So we have to ask the question: which is more appropriate here, the one-tailed test or the two-tailed test? If you had a reason to believe that El Niño events warm the northeastern US, for example, you might motivate a one-tailed test, since only a positive relationship would be consistent with your expectations. But if we didn't know beforehand whether El Niños had a cooling influence or a warming influence on the northeastern US, you might argue for a two-tailed test. So whether or not the relationship is significant at the p = 0.05 level is going to depend on which type of hypothesis test we're able to use in this case.
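The table lookup described in the transcript can be approximated numerically. The sketch below assumes the usual test for a correlation coefficient, in which t = r sqrt(N - 2) / sqrt(1 - r^2) follows a Student's t distribution with N - 2 degrees of freedom; the course table may be built differently. The function names are hypothetical, and the tail area is obtained with a simple Simpson's-rule integration so that only the standard library is needed.

```python
import math


def t_pdf(t, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def correlation_p_values(r, n, steps=2000):
    """Return (t_stat, one_tailed_p, two_tailed_p) for a correlation r from n pairs.

    ASSUMPTION: independent samples, so df = n - 2 (no autocorrelation correction).
    The one-tailed p is for the tail in the direction of the observed sign of r.
    `steps` must be even for Simpson's rule.
    """
    df = n - 2
    t_stat = abs(r) * math.sqrt(df / (1.0 - r * r))
    # Simpson's rule for the area under the density between 0 and t_stat.
    h = t_stat / steps
    total = t_pdf(0.0, df) + t_pdf(t_stat, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3.0
    one_tailed = 0.5 - area
    return t_stat, one_tailed, 2.0 * one_tailed


t_stat, p_one, p_two = correlation_p_values(0.174, 107)
print(t_stat, p_one, p_two)
```

With r = 0.174 and N = 107, the results should land close to the one-tailed and two-tailed values read from the table in the demonstration.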

Video: Demo - Part 2 (4:10)


PRESENTER: Let's continue with this analysis. What I'm going to do here is plot the temperature as a function of the year instead of Niño 3.4. That's plot number one, the State College December temperatures. For plot number two, I'm going to plot the Niño 3.4 index as a function of year. I use axis B here to put them on the same scale. So here we can see the two series: we have the State College December temperatures in blue and the Niño 3.4 index in yellow. You can see that in various years there does seem to be a little bit of a relationship: large positive departures in the Niño 3.4 index are associated with warm December temperatures, and large negative departures are associated with cold temperatures. We can visually see that relationship. Earlier, we plotted the two variables in a two-dimensional scatterplot and looked at the slope of the line relating the two data sets. Now we're looking at the time series of the two data sets, and we can see some of that positive covariance, if you will. There does seem to be a positive relationship, although we already know it's a fairly weak relationship.

So let's do a formal regression. I'm going to take away the Niño series here; what we've got here is our State College December temperatures in blue. Now our regression model is going to use the Niño 3.4 index as the independent parameter and temperature as our dependent variable. We'll run the linear regression. There is a slope: 0.74 is the coefficient that describes the relationship between the Niño 3.4 index and December temperatures. It's positive; we already saw the slope was positive. There's also a constant term that we're not going to worry much about here. What we're really interested in is the slope of the regression line, which describes how changes in temperature depend on changes in the Niño 3.4 index. As we've seen, it is 0.74, so for a unit increase in Niño 3.4, an anomaly of +1 on the Niño 3.4 scale, we'll get a temperature for December that on average is 0.74 degrees Fahrenheit warmer than average.

The R-squared value right here is 0.0302, and if we take the square root of that number, that's an r value of 0.1734. We know that's a positive correlation because the slope is positive. We already looked up the statistical significance of that number, and we found that for a one-sided hypothesis test the relationship is significant at the 0.05 level. But if we were using a two-sided hypothesis test, that is to say, if we didn't know a priori whether we had a reason to believe that El Niños warm or cool State College December temperatures, then the relationship would not quite be statistically significant.

So we've calculated the linear model, and now we can plot it. I'm going to plot year and model output on the same scale. You can change the scale of these axes by clicking on these arrows; I'm going to make them both go from 20 to 40. So now the yellow curve is showing us the component of variation in the blue curve that can be explained by El Niño, and we can see it's a fairly small component. It's small compared to the overall level of variability in December State College temperatures, which vary by as much as plus or minus 4 degrees Fahrenheit or so.
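To make the numbers in the demonstration concrete, the sketch below plugs the quoted slope (0.74), the approximate intercept (30.9), and R-squared (0.0302) into the fitted line. The constant and function names are hypothetical; the values are rounded as quoted in the transcripts, so results are approximate.

```python
import math

SLOPE_F_PER_UNIT = 0.74   # deg F per unit of Niño 3.4, as quoted in the demo
INTERCEPT_F = 30.9        # approximate y-intercept (deg F) quoted in the demo
R_SQUARED = 0.0302        # R-squared quoted in the demo


def predicted_anomaly(nino34):
    """Temperature departure (deg F) attributed to El Niño by the fitted line."""
    return SLOPE_F_PER_UNIT * nino34


def predicted_december_temp(nino34):
    """December temperature (deg F) given by the fitted line."""
    return INTERCEPT_F + predicted_anomaly(nino34)


r = math.sqrt(R_SQUARED)            # correlation coefficient (positive slope)
explained_percent = 100 * R_SQUARED  # share of variance explained, in percent

for index in (-2, -1, 0, 1, 2):
    print(index, predicted_anomaly(index), predicted_december_temp(index))
print(r, explained_percent)
```

Because R-squared is the fraction of variance explained, a value of 0.0302 means the regression accounts for only about 3 percent of the year-to-year variance in December temperatures, which is why the modeled curve looks so flat next to the observations.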

Video: Demo - Part 3 (3:22)


PRESENTER: So, continuing where we left off, the yellow curve is showing us the component of the variation in December State College temperatures that can be explained by El Niño. In a particularly strong El Niño year, where the Niño 3.4 index is, say, as large as +2, we get a December temperature that's about one and a half degrees Fahrenheit above average; that is to say, twice the 0.74 degrees Fahrenheit that we get for a one-unit change in Niño 3.4. For a particularly strong La Niña event, with a Niño 3.4 anomaly of -2 or so, we get a cooling effect of about 1.5 degrees Fahrenheit. So the influence of El Niño is small compared to the overall variability of roughly four degrees Fahrenheit in the series, but it is statistically significant, at least if we are able to motivate a one-sided hypothesis test. If we had reason to believe that El Niño events warm State College temperatures in the winter, then the regression gives us a result that's significant at the 0.05 level, the standard threshold for statistical significance. That may not be that satisfying: we're not explaining a large amount of the variation in the data, but we do appear to be explaining a statistically significant fraction of the variability in the data.

Now, finally, let's look at the residuals from that regression. What I'll do is get rid of these other graphs; let's keep year and change this to model residuals. I'm just going to plot the model residuals as a function of time, and that's what they look like. There isn't a whole lot of obvious structure. In fact, if we go back to the regression model tab and look at the value of the lag-1 autocorrelation coefficient, we'll see that it's -0.09. That's slightly negative, and it's quite small, close to zero. If we look up the statistical significance, it's not going to be even remotely significant, so we don't have to worry much about the influence of autocorrelation on our estimate of statistical significance. We also don't have much evidence here of the sort of low-frequency structure in the residuals that might cause us to worry. So the nominal results of our regression analysis appear to be valid. And again, if we were able to invoke a one-sided hypothesis test, we would have found a statistically significant, albeit weak, influence of El Niño on State College December temperatures.
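The residual check described in the transcript can be sketched as follows. The code computes a lag-1 autocorrelation coefficient and compares it against a rough large-sample bound of about 1.96 / sqrt(N) for uncorrelated noise. That bound is an assumption used here for illustration, not necessarily the test behind the course tool, and the function names are hypothetical.

```python
import math


def lag1_autocorrelation(series):
    """Lag-1 autocorrelation: covariance of neighbors divided by total variance."""
    n = len(series)
    mean = sum(series) / n
    deviations = [value - mean for value in series]
    numerator = sum(deviations[i] * deviations[i + 1] for i in range(n - 1))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator


def white_noise_bound(n, z=1.96):
    """Rough 95% bound on |r1| for uncorrelated noise (large-sample ASSUMPTION)."""
    return z / math.sqrt(n)


# Value quoted in the demo for the regression residuals (N = 107 years).
quoted_r1 = -0.09
bound = white_noise_bound(107)
is_notable = abs(quoted_r1) > bound

print(bound, is_notable)
```

A coefficient of -0.09 sits well inside this rough bound for 107 values, in line with the presenter's conclusion that autocorrelation in the residuals is not a concern here.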

You can play around with the data set used in this example using this link: Explore Using the File testdata.txt