### Activity: Statistical Analysis of Climate Data

**NOTE: **For this assignment, you will need to record your work on a word processing document. Your work must be submitted in Word (.doc or .docx), or PDF (.pdf) format.

The documents associated with this problem set, including a formatted answer sheet, can be found on CANVAS.

- Worksheet successfully downloaded! Be sure also to download the answer sheet from CANVAS (Files > Problem Sets > PS#1).

Each problem (#2 through #7) is equally weighted for grading and will be graded on a quality scale from 1 to 10 using the general rubric as a guideline. Thus, a raw score as high as 70 is possible, and that raw score will be adjusted to a 50-point scale; the adjusted score will be recorded in the grade book.

The objective of this problem set is for you to work with some of the data analysis/statistics concepts and mechanics covered in Lesson 2, namely the coefficient of variation (You are welcome to use any software or computer programming environment you wish to complete this problem set, but the instructor can only provide support for Excel should you need help. The instructions also will assume you are using Excel.

Now, in CANVAS, in the menu on the left-hand side for this course, there is an option called Files. Navigate to that, and then navigate to the folder called Problem Sets, inside of which is another folder called PS#1. Inside that folder is a data file called PS1.xlsx. Download that data file to your computer, and open the file in Excel. You should see two time series: average annual temperature for State College, PA [home of the main campus of Penn State!], and Oceanic Niño Index [ONI, a measure of ENSO phase]. Both series are for 1950 to 2021.

- An important first step of exploratory data analysis is always to visualize the data. Construct a scatterplot of each time series (i.e., two different plots). If you need pointers on how to make a scatterplot in Excel, this is a good resource. Include only markers in the scatterplots, and remember that a proper graph always has a descriptive title and axis labels, including the units of the quantities being displayed.
**Place the scatterplots on the answer sheet.**

- An important second step of exploratory data analysis is to find some basic summary statistics.
**For both time series, find the mean, median, and standard deviation, and record the values of these statistics on the answer sheet.**

- The next step is to explore whether there is a linear trend in State College annual average temperature over time and the nature of that trend. You will explore the calculation of the linear trend line in a couple of different ways.

The first way is using the equations derived in Lesson 2. As a recap, the equation for a linear trend line is:

$${y}_{i}=b{x}_{i}+a\text{,}$$

where $i$ ranges from 1 to $N$ , $a$ is the intercept of the linear relationship between $y$ and $x$, and $b$ is the slope of that relationship. In this context, $N$ is given by the number of years (and also values) in the time series, $y$ is given by State College annual average temperature, and $x$ is given by time. Accordingly, $a$ and $b$ are calculated from the data as follows:

$$b=\frac{N\xb7{\displaystyle \sum {y}_{i}{x}_{i}-{\displaystyle \sum {y}_{i}\xb7{\displaystyle \sum {x}_{i}}}}}{N\xb7{\displaystyle \sum \left({x}_{i}^{2}\right)-{\left({\displaystyle \sum {x}_{i}}\right)}^{2}}}$$

and

$$a=\frac{{\displaystyle \sum {y}_{i}-b{\displaystyle \sum {x}_{i}}}}{N}$$

Note that $b$ must be calculated before $a$ can be calculated. To calculate the value of $b$ ,

**first calculate each value that is in the equation for it, recording your results on the answer sheet:**$$\begin{array}{l}N=\\ {\displaystyle \sum {y}_{i}}=\\ {\displaystyle \sum {x}_{i}}=\\ {\displaystyle \sum {y}_{i}{x}_{i}}=\\ {\displaystyle \sum \left({x}_{i}^{2}\right)}=\\ {\left({\displaystyle \sum {x}_{i}}\right)}^{2}=\end{array}$$

As a hint, the second value that you are to calculate is a summation (which is what the sigma notation indicates), and it is suggested that you use Excel to find the sum of all the $y$ values (i.e., State College annual average temperature values); if you need a resource on sum functions in Excel, look here. For the fourth value, you should first calculate the “$\mathrm{xy}$” products (again, Excel would be helpful), and then add those products together. For the fifth value, you are finding the square of each $x$ value and then summing those squares. For the sixth value, you are finding the square of the sum of all the $x$ values; in other words, you are squaring a value that you previously found.

Now, you can calculate $b$ . Substitute the values you calculated above into the equation for $b$ that is given above (and also given in Lesson 2).

**On the answer sheet, report the value you calculate for $b$ , and also show the equation for $b$ with the values substituted into the equation.**If you need a refresher on Microsoft Word Equation Editor, look here. Be cautious with order of operations!The calculation of the value of $a$ is more straightforward, as you might note that you have already calculated all the values that appear in the equation for $a$ .

**On the answer sheet, report the value you calculate for $a$ , and also show the equation for $a$ with the values substituted into the equation.****You now can write the equation for the linear trend line. Report this equation on the answer sheet.**You may have found this calculation to be tedious. It is a useful exercise to work through the mechanics of how a linear trend line equation comes together, but you would not want to go through all this work every time you want to perform a regression analysis. Excel has already helped quite a lot, but it can help even more.

The second way of exploring whether there is a linear trend in State College annual average temperature over time is to use Excel to add a linear trend line to the scatterplot you already made for the State College annual average temperature time series dataset and to display the equation for that trend line.

**Use Excel to do this, and place the scatterplot with trend line and trend line equation on the answer sheet.**If you need help with how to do this in Excel, look here. You will find that the trend line equation given by Excel is likely different than the one that you found, especially in terms of the y-intercept. This discrepancy is due to rounding: you likely rounded more than Excel does. Do not worry about this.Now, you will consider whether the linear trend is statistically significant. (It is likely that you can already see a trend!)

**Calculate the correlation coefficient $r$ , and report that value on the answer sheet.**It is suggested that you use Excel’s CORREL function to do this, but, whatever tool or software you use, make sure you are calculating Pearson’s correlation coefficient value, as there are many different flavors of correlation value.Next, use the online calculator that is introduced in Lesson 2 to test whether the correlation coefficient is statistically significant. Note that we are using the correlation coefficient as a means of evaluating the statistical significance of the linear trend.

**Report the P-value on the answer sheet, and state, in a complete sentence, whether the correlation coefficient is statistically significant based on that P-value.**For the last part of the exploration, you will consider whether the time series exhibits autocorrelation by calculating the lag-1 autocorrelation coefficient $\rho $ . This calculation can be done using the CORREL function in Excel with one modification that is illustrated below:

The third column shown is actually just the second column shifted forward one year (a lag of one year, hence lag-1 autocorrelation). This leaves $N-1$ pairs (i.e., the degrees of freedom) on which you can implement the CORREL function to calculate the lag-1 autocorrelation coefficient.

**Report the value of ρ that you find on the answer sheet.**Then, use the online calculator that is introduced in Lesson 2 to determine whether the value of $\rho $ is statistically significant.**Report the P-value that you find on the answer sheet and, in a complete sentence, whether the autocorrelation in the time series is statistically significant, and also report the value of $N$ that you inputted into the calculator.** - Perform a similar linear trend analysis as you performed in #4 for the ONI time series. That is, calculate the linear trend line (you can use Excel and need not calculate the $a$ and $b$ values using the equations), calculate the correlation coefficient, calculate the lag-1 autocorrelation coefficient, and discuss whether the linear trend and whether any autocorrelation are statistically significant.
**Report all results and your discussion on the answer sheet.**

- Is there a linear relationship between ONI and State College annual average temperature? If so, what is the strength of that relationship, and is the relationship statistically significant? Is there any concern about autocorrelation in this context?
**Report your results and discussion on the answer sheet.**

- Suppose, hypothetically, that you split the State College annual average temperature time series into two sub-time series – one covering the earlier 1950-1984 period and another covering the latter 1985-2021 period. You calculate the slopes of the linear trend lines corresponding to these sub-time series and the standard errors of those slopes. By adding the corresponding standard error to each slope value and also subtracting it from each slope value, you find the upper and lower bounds, respectively, of the 95% confidence interval for each slope. You find that the two confidence intervals overlap. Given this finding, explain whether or not the slopes are statistically different.
**Give your discussion on the answer sheet.**