Automating Downloads

Prioritize...

In this section, we'll look at functions in R that can automatically download files from FTP or HTTP sites. This allows for automation techniques that can harvest data across many different files.

Read...

It might have occurred to you that if you wanted any kind of data analysis that spanned many years or a suite of different variables, you might have to spend a great deal of time downloading and processing the various required files. However, what if you wanted to automate the process? We've already seen that having an R-script can assist in the parsing, analysis, and display of large data sets. Can the file download process be included in the script as well?

It turns out the answer is "yes". R provides you access to your local file system as well as tools for retrieving files directly from FTP and HTTP sites.

Let's consider the problem that we want to retrieve the Christmas Day mean surface temperatures for University Park, PA for the years 1990 through 2010. How might we do that in an automated manner? Let's look at some code... (Don't forget to set your working directory!)

# if for some reason we need to reinstall the
# package, then do so.

if (!require("ncdf4")) {
  install.packages("ncdf4") }

library(ncdf4)

# specify the file name and location URL
filename<-"air.sfc.1990.nc"
saved_filename<-"air.sfc.1990mean.nc"
url <- "ftp://ftp.cdc.noaa.gov/Datasets/NARR/Dailies/monolevel/"

# get the file... the mode MUST be "wb" (write binary)
download.file(paste0(url,filename), saved_filename, quiet = FALSE, mode="wb")

# open the file
nc <- nc_open(saved_filename)

# print some info to show a good retrieval
print(head(nc$dim$time$vals))

# close the file
nc_close(nc)

# delete the file (this is optional)
file.remove(saved_filename)

Notice in this code we send the filename and its location to a function: download.file(...) which fetches the file and places it in your working directory (as a default). There are numerous parameters that can be set but notice that you can control the saved file name as well as the appearance of the progress bar via the quiet parameter. In fact, we can get quite fancy here if we want. For example, I can create a sub-directory (manually) named "data" within the working directory where I can store the data files that I download (by adding "data/" to the saved file name). I can also check to see if the file I want to download already exists in my directory by using the file.exists(...) function. See the modified code snippet below that only downloads the file if it doesn't exist in the "data/" subdirectory. Note that I have commented out the file.remove(saved_filename) command at the end of the script. In this context (and for the sake of speed) I want to keep the files that I download, but when I run the script for real, I will delete them to save space (it will take much longer to run the script though).

filename<-"air.sfc.1990.nc"
saved_filename<-"data/air.sfc.1990.nc"
url <- "ftp://ftp.cdc.noaa.gov/Datasets/NARR/Dailies/monolevel/"

# get the file... the mode MUST be "wb" (write binary)
if(!file.exists(saved_filename)) {
    download.file(paste0(url,filename), saved_filename, quiet = FALSE, mode="wb")
}

Are you getting a feel for where this is going? If we could create a list of filenames to retrieve then we could loop over those names and retrieve the information we need for each year. This is easy and can be accomplished by these three lines located right after the library call...

dates<-seq(1990,2010,1)
in_file_list<-paste0("air.sfc.",dates,".nc")
out_file_list<-paste0("air.sfc.",dates,"mean.nc")

Run these three lines of code and look at the variables in_file_list and out_file_list in the environment inspector. Easy, right? Now, we need the automation part... do you remember how to do it? We put everything that we want to automate and place it in a function. Then we run the command: mapply(...) with the function name and the variables we want to send the function. Here are the basics...

# ALL THE HEADER STUFF GOES HERE

# get our list of dates (keep the list short for testing)
dates<-seq(1990,1992,1)
in_file_list<-paste0("air.sfc.",dates,".nc")
out_file_list<-paste0("air.sfc.",dates,"mean.nc")
url <- "ftp://ftp.cdc.noaa.gov/Datasets/NARR/Dailies/monolevel/"

# make a placeholder for the specific point data we want
narr.timeseries <- data.frame()

# make the retrieve function
retrieve_data <- function(filename, saved_filename) {
  #
  # PUT ALL THE AUTOMATION STUFF IN HERE
  # 
}

# call the automation function with the in and out file lists
mapply (retrieve_data, in_file_list, out_file_list)

# ALL THE PLOT STUFF GOES HERE

All that's left is for you to build the data retrieval code in the automation section and add a plot function after the mapply(...). Can you do it? You've done all the pieces before, and there's a small issue or two to sort out... but give it a try. My version of the Christmas daily mean temperatures histogram is below. If you get stuck, I have put my complete copy of the code in the Quiz Yourself section.

A histogram of Christmas Day mean temperatures for University Park, PA during the years 1990-2010. The data were taken from the North American Regional Reanalysis.

Credit: D. Babb

Quiz Yourself...

Were you able to figure everything out? If not, click on the bar below to reveal my complete code. Remember (yes, I forgot) that inside of the automation function all of the variables disappear every time the function restarts. If you want to access a variable after the function finished (called a global variable), you must use the "<<-" assignment operator, not just "<-".

Click for the answer...

# if for some reason we need to reinstall the
# package, then do so.

if (!require("ncdf4")) {
  install.packages("ncdf4") }

library(ncdf4)

# Set up my date range and file info
dates<-seq(1990,2010,1)
in_file_list<-paste0("air.sfc.",dates,".nc")
out_file_list<-paste0("data/air.sfc.",dates,"mean.nc")
url <- "ftp://ftp.cdc.noaa.gov/Datasets/NARR/Dailies/monolevel/"

# choose a day and location
target_datetime<-"12-25"
location<-c(40.85, -77.85)

# This will be our placeholder for the specific point data we want
narr.timeseries <- data.frame()

# we are going to only compute the cloest_point once,
# so let's delete any old versions.
if (exists("closest_point")) { rm(closest_point) }

retrieve_data <- function(filename, saved_filename) {

# get the file
if(!file.exists(saved_filename)) {
  print(paste0("Retrieving <", filename, "> ... Saving <",saved_filename, ">"))
    download.file(paste0(url,filename), saved_filename, quiet = TRUE, mode="wb")
}

  # open the file
  nc <- nc_open(saved_filename)
  data_dims<-nc$var$air$varsize
  
  #Get the properly formatted date strings/target time index
  ncdates<-nc$dim$time$vals  #hours since 1800-1-1
  origin<-as.POSIXct('1800-01-01 00:00:00', tz = 'UTC')
  ncdates<-origin + ncdates*(60*60)
 
  # time_index may change due to leap days so we have to calulate it every time
  time_index<-which(strftime(ncdates,format="%m-%d")==target_datetime)
  
  # lat and lon index will not change, let's find it only once
  if (!exists("closest_point")) {
    lats<-ncvar_get(nc, varid = 'lat', 
                     start = c(1,1),
                     count = c(data_dims[1],data_dims[2]))
    
    lons<-ncvar_get(nc, varid = 'lon', 
                     start = c(1,1),
                     count = c(data_dims[1],data_dims[2]))
    lons<-ifelse(lons>0,lons-360,lons)
    
    distance<-(lats-location[1])**2+(lons-location[2])**2

  # I want closest_point defined outside of the function (global) so I used <<-
    closest_point<<-which(distance==min(distance), arr.ind = TRUE)
  }
  
  temp_out<-array(data = NA, dim = c(1,1,1))
  
  temp_out[,,]<-ncvar_get(nc, varid = 'air', 
                            start = c(closest_point[1],closest_point[2],time_index),
                            count = c(1,1,1))
  temp_out[,,]<-(temp_out[,,]-273.15)*9/5+32
  
  # add the row to the data frame
  row <- data.frame(Year=strftime(ncdates[time_index], format="%Y"), Temperature=temp_out[1,1,1])
  
  # use the global variable
  narr.timeseries <<- rbind(narr.timeseries,row)

  # close the file
  nc_close(nc)
  
  # delete the file (optional)
  file.remove(saved_filename)

}

# The main loop
mapply (retrieve_data, in_file_list, out_file_list)

# Write the data out to a file (just so we don't have to calculate it again
write.csv(narr.timeseries, "data/christmas_temps.csv")

# converts $Year from a factor to a number
# need this if you want a line/scatter plot 
narr.timeseries$Year<-as.numeric(levels(narr.timeseries$Year))[narr.timeseries$Year]

# I think a histogram makes more sense here than a line/scatter plot
hist(narr.timeseries$Temperature,
     ylab = "Occurances", xlab = "Temperature (F)",
     main = paste0("NARR Mean Daily Surface Temperature\n University Park, PA\n",
            "December 25,", head(narr.timeseries$Year,1),"-",tail(narr.timeseries$Year,1)),
     las=1,
     breaks=10,
     col="red",
     xlim=c(10,50)
)

Prioritize...

Read...

Quiz Yourself...

Navigation

EMS

Programs

Related Links