Activity: Statistical Analysis of Atlantic Tropical Cyclones and Underlying Climate Influences
NOTE: For this assignment, you will need to record your work on a word processing document. Your work must be submitted in Word (.doc or .docx) or PDF (.pdf) formats. A formatted answer sheet is available on CANVAS as a convenience for students enrolled in the course.
- Be sure also to download the answer sheet from CANVAS (Files > Problem Sets > PS#2).
Each problem (#2 through #6) is equally weighted for grading and will be graded on a quality scale from 1 to 10 using the general rubric as a guideline. Thus, a score as high as 50 is possible, and that score will be recorded in the grade book.
The objective of this problem set is for you to work with some of the data analysis/statistics concepts and mechanics covered in Lesson 2, namely the coefficient of determination and multi-variate regression. You are welcome to use any software or computer programming environment you wish to complete this problem set, but, should you need help, the instructor can provide support only for Excel and for an online tool that is introduced in this problem set. The instructions also assume you are using Excel and that online tool.
Now, in CANVAS, in the menu on the left-hand side for this course, there is an option called Files. Navigate to that, and then navigate to the folder called Problem Sets, inside of which is another folder called PS#2. Download the data file PS2.xlsx in that folder to your computer, and open the file in Excel. You should see five time series: HURDAT (“unadjusted”) tropical-cyclone (TC) count for the Atlantic basin, Vecchi-Knutson (2008) (“adjusted”) TC count, August-October Main Development Region (MDR) sea-surface temperature (SST), December-March North Atlantic Oscillation (NAO) index, and December-February Niño3.4 index (which measures ENSO phase). All five series cover 1878 to 2019.
- Construct a scatterplot and calculate the mean, median, and standard deviation for each of the five time series in the data file. Recall that you constructed scatterplots and calculated these summary statistics in PS#1. Place the five scatterplots and report the summary statistics for each time series on the answer sheet.
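For those working outside Excel, the summary statistics can be sketched in a few lines of Python. This is only an illustration: the function name `summarize` and the values in `toy_series` are made up for the example and are not taken from PS2.xlsx. Note the two standard deviations: the sample version divides by n - 1 (Excel's STDEV.S) and the population version divides by n (Excel's STDEV.P), so state which one you report.

```python
import statistics


def summarize(values):
    """Return mean, median, and both standard deviations for one series."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Sample standard deviation: n - 1 in the denominator.
        "stdev_sample": statistics.stdev(values),
        # Population standard deviation: n in the denominator.
        "stdev_population": statistics.pstdev(values),
    }


# Made-up illustrative values, NOT from the course data file.
toy_series = [8, 11, 6, 14, 9, 12]

summary = summarize(toy_series)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

To use it on the real data, you would replace `toy_series` with one column read from the spreadsheet and call `summarize` once per time series.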
- The two TC count time series should be thought of as dependent variables; you will be building models that should give meaningful predictions of those two dependent variables. One TC count series is “unadjusted” and comes from the HURDAT data-set. However, a weakness of this data-set is that it does not take into account the effect of satellite observations on TC counts. Satellite observations began to be used to observe TCs around 1966, providing comprehensive counts of annual TC activity, whereas, prior to 1966, some TCs were missed due to sparse observation networks, which grow more sparse going back farther and farther in time. Many scientists have tried to estimate the undercount of TCs in the more distant past, and Vecchi and Knutson made such an estimate in a 2008 paper, yielding the “adjusted” TC count series that you will explore in this data-set.
On the other hand, the MDR SST, NAO, and Niño3.4 time series should be thought of as independent variables, or predictor variables. The justification for the choice of these predictor variables is that warmth of the surface of the seawater in the Main Development Region (an area of the Atlantic Ocean roughly east of the Caribbean Sea), pressure patterns in the North Atlantic region, and pressure patterns in the equatorial Pacific region, respectively corresponding to MDR SST, NAO, and Niño3.4, are thought to be related to TC activity in the Atlantic basin.
Considering the “unadjusted” TC count as the predictand, calculate the linear trend line equation (i.e., single-variable regression equation) for each of the three predictor variables. Also calculate the correlation coefficient $r$ and the coefficient of determination $R^2$ for each of the three regression equations. Moreover, for each of the three regressions, state how much of the variation in the predictand is explained by the predictor (this fraction is $R^2$). Report your results on the answer sheet.
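A single-variable regression of this kind can be sketched as follows, assuming an ordinary least-squares fit. The function name `linear_fit` and the toy `x` and `y` values are hypothetical and chosen only so the arithmetic is easy to check by hand. In Excel, the corresponding built-in functions are SLOPE, INTERCEPT, CORREL, and RSQ.

```python
import math


def linear_fit(x, y):
    """Least-squares line y = intercept + slope * x, plus r and R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and cross-deviations about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    # For a single predictor, R^2 is the square of r.
    return slope, intercept, r, r * r


# Made-up illustrative values, NOT from the course data file.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept, r, r_squared = linear_fit(x, y)
print(f"y = {intercept:.4f} + {slope:.4f} x")
print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
```

For this toy data, an R^2 of 0.6 would be read as "60% of the variation in y is explained by x", which is the kind of statement the problem asks you to make for each predictor.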
- Repeat the calculations and analysis that you performed in #3, but, this time, consider the “adjusted” TC count as the predictand. Comment on any differences compared to the “unadjusted” regressions. Report your results on the answer sheet.
- Multi-variate regression methods, which you learned about in Lesson 3, allow for the simultaneous use of multiple variables to predict a response. Instead of constructing a separate regression for each predictor-predictand pair, as you did in #3 and #4, one regression can be constructed that uses all available predictors to predict a response.
A sidebar: Excel could be used to construct a multi-variate regression, but it is not the best tool for this task. In the real world, you might leverage your computer programming skills to accomplish this task, but knowledge of programming is not a pre-requisite for this course. More likely, you might have available a statistical software package for the task, but such packages generally are proprietary and cost money and are not a required technical capability for this course. In past offerings of this course, for this problem set, we have asked students to use the regression tool that you saw used in video demonstrations in Lesson 3, but it presents technical problems depending on the Web browser being used. Therefore, we suggest use of this online multi-variate regression calculator, but you may use any tool you wish.
Using this online multi-variate regression calculator (or any tool you wish), calculate two multi-variate regressions using all three predictor variables available to you: one to predict the “unadjusted” TC count and one to predict the “adjusted” TC count. The equation should be given in the form
$$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3$$
where $y$ is the predictand, $b_0$ is a constant, and $b_1$, $b_2$, and $b_3$ are coefficients for predictor variables $x_1$, $x_2$, and $x_3$, respectively. If you are using the online calculator, display output to four decimal places, and include no interactions. Find the coefficient of determination $R^2$ for each regression, and interpret it, comparing it to the $R^2$ values for the single-variable regressions you calculated in #3 and #4. Note that $R^2$ is an output of the online calculator. Report your results and discussion on the answer sheet.
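If you prefer to program the regression yourself, the following is a minimal sketch, assuming an ordinary least-squares fit of a constant plus one coefficient per predictor, solved through the normal equations. The function names `solve` and `fit_multiple` and the toy data are hypothetical. The toy predictand is built exactly from known coefficients (1, 2, -3, 0.5), so the fit recovers them and R^2 comes out as 1; real TC data will not fit this well, and this is not necessarily how the online calculator works internally.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column up.
        pivot = max(range(col, n), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for c in range(col, n + 1):
                m[row][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_multiple(predictors, y):
    """Return ([b0, b1, ...], R^2) for y = b0 + b1*x1 + b2*x2 + ..."""
    design = [[1.0] + list(row) for row in predictors]  # column of ones
    k = len(design[0])
    # Normal equations: (X^T X) b = X^T y.
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coef, 1.0 - ss_res / ss_tot


# Made-up predictor rows [x1, x2, x3], NOT from the course data file.
rows = [[1, 0, 2], [2, 1, 0], [0, 3, 1], [4, 1, 3], [3, 2, 5], [5, 0, 1]]
# Toy predictand built from y = 1 + 2*x1 - 3*x2 + 0.5*x3.
targets = [1 + 2 * a - 3 * b + 0.5 * c for a, b, c in rows]

coef, r_squared = fit_multiple(rows, targets)
print("coefficients:", [round(c, 4) for c in coef])
print("R^2:", round(r_squared, 4))
```

With the course data, each row would hold one year's MDR SST, NAO, and Niño3.4 values, and `targets` would hold that year's TC counts.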
- In #5, you found two multi-variate regressions, one that predicts (or models) “unadjusted” TC counts and one that predicts “adjusted” TC counts. You now will see how well these regressions model the 2020 TC count. Given that, in 2020, the August to October MDR SST was 28.3531 degrees Celsius, the December to March NAO index was -0.135, and the Niño3.4 index was -0.987, use both regressions to model the 2020 TC count in the Atlantic basin, showing the predictor values substituted into each multi-variate regression equation. Then research (citing your source) the number of TCs that actually occurred in the Atlantic basin in 2020, and compare your modelled counts to the actual count. Report your results and discussion on the answer sheet.
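The substitution step can be sketched as below. The 2020 predictor values are the ones given in the problem statement; the coefficients in `placeholder_coefficients` are deliberately meaningless placeholders (0, 1, 1, 1), not results, and you would replace them with the constant and coefficients from your own regressions in #5. The function name `predict_count` is likewise hypothetical.

```python
def predict_count(coefficients, sst, nao, nino34):
    """Evaluate constant + b1*SST + b2*NAO + b3*Nino3.4."""
    b0, b1, b2, b3 = coefficients
    return b0 + b1 * sst + b2 * nao + b3 * nino34


# 2020 predictor values from the problem statement.
SST_2020 = 28.3531    # August-October MDR SST, degrees Celsius
NAO_2020 = -0.135     # December-March NAO index
NINO34_2020 = -0.987  # Nino3.4 index

# PLACEHOLDER coefficients (constant, SST, NAO, Nino3.4): not real results.
placeholder_coefficients = (0.0, 1.0, 1.0, 1.0)

modelled = predict_count(placeholder_coefficients,
                         SST_2020, NAO_2020, NINO34_2020)

# Show the substitution explicitly, as the problem asks.
b0, b1, b2, b3 = placeholder_coefficients
print(f"y = {b0} + {b1}*({SST_2020}) + {b2}*({NAO_2020}) "
      f"+ {b3}*({NINO34_2020}) = {modelled:.4f}")
```

You would run this twice, once with the “unadjusted” coefficients and once with the “adjusted” ones, and compare both outputs to the observed 2020 count from your cited source.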