Accessing and Manipulating Data

Prioritize...

In this section, you will learn how to import comma-delimited datafiles into R. You'll also learn some ways to begin exploring the data.

Read...

Now that we have some data loaded, let's look at how we might learn some things about it. To help me walk you through some basic R, I have set up a DataCamp session below. DataCamp will allow you to run R right from this webpage (pretty cool, don't you think?). Start by clicking the "Run" button to source the code. When you do so, DataCamp executes the R code and switches to the Console panel. Now, let's look at the data...

First, you can view any variable simply by typing its name in the Console (try typing "y" at the Console prompt and hitting "Return"). You should see the value of the variable "y" displayed. You can look at a specific value(s) of "y" by adding square brackets containing an expression. Try the following commands: y[3], y[3:4], y[seq(1,9, by=2)]. We can also look at values of a variable that meet certain criteria by using the which() command. For example, try this command (don't type the command prompt ">"):

> rand_nums[which(rand_nums>0)]

You'll find this type of data access to be very powerful. The which() command produces a vector of indices that satisfy the condition (in this case, all the indices of rand_nums whose value is greater than zero). When that vector of indices is placed inside the "[...]", R retrieves the actual values of rand_nums that are greater than zero. By the way, you can make the condition statement as complex as you like, allowing you to filter data in numerous ways. And remember, you can use these commands not only in the console, but also in a script, storing the values in new variables as well.
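If you'd like to see the two steps separately, here is a minimal sketch. Note that the rand_nums vector below is only a stand-in that I am assuming for illustration (ten random draws); the values in the DataCamp session will differ.

```r
# Hypothetical stand-in for the rand_nums vector used in the DataCamp session
set.seed(42)                       # make the random draw repeatable
rand_nums <- rnorm(10)             # assumption: 10 standard-normal values

idx <- which(rand_nums > 0)        # integer positions where the condition holds
positives <- rand_nums[idx]        # the values found at those positions

# A logical vector placed inside [...] selects the same values without which()
positives_direct <- rand_nums[rand_nums > 0]

print(idx)
print(positives)
print(identical(positives, positives_direct))   # TRUE
```

Storing the indices in a variable such as idx (a name I made up) is handy in a script because you can reuse them to pull matching values from other variables of the same length.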

Solve It!

In the DataCamp window above, can you come up with a command to count the number of random numbers between 0.25 and 0.5? There are lots of ways to do it. Hint: You might want to look up the length() and sum() commands.

Click for answer...

Here are some possible answers:

> length(rand_nums[which(rand_nums>0.25 & rand_nums<0.5)])

> length(which(rand_nums>0.25 & rand_nums<0.5))

# Why does this work? ...Look at the output passed to sum()
> sum((rand_nums>0.25 & rand_nums<0.5))
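To see why the sum() version works, here is a small sketch using a hypothetical five-value vector x (not the rand_nums data from the session), chosen so that you can check the result by hand.

```r
# Hypothetical five-value vector, chosen so the result is easy to verify
x <- c(0.10, 0.30, 0.45, 0.60, 0.26)

in_range <- x > 0.25 & x < 0.5     # logical vector: FALSE TRUE TRUE FALSE TRUE
print(in_range)

# sum() treats TRUE as 1 and FALSE as 0, so it counts the TRUE values
print(sum(in_range))               # 3
print(length(which(in_range)))     # 3, the same count obtained via which()
```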

Working with Dataframes

Dataframes can be accessed in ways that are similar to simple variables. In the DataCamp window below, I have loaded the .CSV file from the last section into a dataframe called mydata. Switch over to the Console tab and start with the command: str(mydata). This command displays the structure of the dataframe in much the same format as the Environment tab in RStudio. You can see a listing of the variables within the dataframe and their types, along with some sample data.

Remember that dataframes are basically tables of data (if you don't remember, you might want to revisit the Swirl Tutorial: Lesson 7). In some cases, we might want to look at a sampling of the dataframe across all of its columns. Below are some commands for you to try in the console window (remember to omit the ">").

# list the variable names (header) of mydata
> names(mydata)

# list the first 5 lines of mydata
> head(mydata, n=5)

# list the last 5 lines of mydata
> tail(mydata, n=5)

We can also access data within a dataframe a few other ways. First, if we want just one variable (a single column), remember that we use the "$" character. For example, mydata$date. This command selects only the column named "date" (if you type that in the Console, R will print out the values in that column. Try it!). You can also use the bracket [...] notation that we learned above. The reference mydata[[1]] (or mydata[, 1]) returns the same values as mydata$date because "date" is the first column in the dataframe. Single brackets, as in mydata[1], also select the first column, but return it as a one-column dataframe rather than as a vector. To access individual data values, we can use a combination of the $ and [...] notations. For example, the third date value can be found with mydata$date[3], or mydata[3,1]. (Note that the 2-D index is in the format [row, column].)
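Here is a short sketch comparing these ways of reaching the first column. The three-row dataframe below is a made-up stand-in for mydata (the values are invented for illustration), so the printed values will not match the DataCamp session.

```r
# Hypothetical three-row stand-in for mydata (values are made up)
mydata <- data.frame(
  date     = c("2014-01-01", "2014-01-02", "2014-01-03"),
  max_dwpf = c(30.2, 28.4, 15.8),
  min_dwpf = c(19.4, 10.4, -4.0)
)

# $ , [[ ]] and [ , column] all return the column itself
print(identical(mydata$date, mydata[[1]]))    # TRUE
print(identical(mydata$date, mydata[, 1]))    # TRUE

# Single brackets keep the dataframe wrapper around the column
print(class(mydata[1]))                       # "data.frame"

# Two ways to reach the third date: column first, or [row, column]
print(as.character(mydata$date[3]))           # "2014-01-03"
print(as.character(mydata[3, 1]))             # "2014-01-03"
```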

The dataframe mydata contains the daily minimum and maximum dew point temperatures for each day in 2014 at the University Park Airport (State College, PA). In case you are unfamiliar with dew point, you can read this article in Wikipedia. Meteorologists use dew point as a proxy for the amount of water vapor in the air because, unlike relative humidity, its value is not affected by air temperature. Let's look at some other commands we can use to query this dataframe. The first thing that you might want to do is look at some bulk characteristics of the data. For example, try out this command in the console tab of the DataCamp window.

> max(mydata$max_dwpf)

This command asks for the maximum value of the column (“max_dwpf”). Notice that the “$” is again used to select a variable from a dataframe. I get a value of 75.2 F. Note that we also get the same result with the commands: max(mydata[2]) and max(mydata[, 2]). This is because "max_dwpf" is the second column in the dataframe. I should comment that, for clarity's sake, using the actual name of the column is preferable to just its number. Using the column's name makes it immediately clear what we are calculating when looking at the code.

How about the date on which this maximum value occurred? Now try:

> mydata$date[which(mydata$max_dwpf==max(mydata$max_dwpf))]

Can you see what this command does? First, we find the position in the data where the dew point temperature equals the maximum value. The statement: which(mydata$max_dwpf==max(mydata$max_dwpf)) returns a value of 182 (the max occurs on the 182nd row). Now we can retrieve the specific date by placing the which() statement inside the mydata$date[…] selector. You could also use the following command:

> mydata$date[which.max(mydata$max_dwpf)]

Note, however, that the function which.max() always returns only the first instance of the maximum value. If the maximum value occurred on two different days, the first command above would return both dates, while the second would return only the first one. I point this out simply to emphasize that there are many ways to accomplish similar things in R. However, you have to make sure that the command does exactly what you think it does.
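A quick way to convince yourself of the difference is to try both commands on a vector with a tie. The values below are hypothetical (they are not from mydata); I have simply placed the maximum, 75.2, in two positions.

```r
# Hypothetical dew point values in which the maximum (75.2) occurs twice
max_dwpf <- c(60.8, 75.2, 71.6, 75.2, 64.4)

print(which(max_dwpf == max(max_dwpf)))   # 2 4  (every position of the maximum)
print(which.max(max_dwpf))                # 2    (only the first position)
```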

Let’s perform a more complex calculation. For example, let's count the number of days where the daily maximum dew point was between 40 F and 50 F:

> sum(mydata$max_dwpf <= 50 & mydata$max_dwpf >= 40)

or, the number of days where dew point temperature changed by more than 30 degrees during the day:

> sum((mydata$max_dwpf - mydata$min_dwpf) > 30)

I hope that you are beginning to see how powerful R can be. Let’s look at one final example. What if we wanted to know when such large changes in dew points occur (as organized by month)? Here’s the code to find out:

> table(strftime(mydata$date[(mydata$max_dwpf - mydata$min_dwpf) > 30], "%b"))

This command looks for large differences between the maximum and minimum daily dew point (greater than 30 F), gets the dates associated with those differences, extracts and formats the months (learn more about strftime) of the occurrences, and finally places the results in a frequency table. Remember that you can always look up each one of these commands in RStudio (or on the web for that matter) to find out what it does. You can find lots of information and examples by Googling... "R strftime", for example.

Here’s the output…

Apr Dec Feb Jan Mar
  3   1   1   5   5
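If the one-line command feels dense, it can help to run it one step at a time. The sketch below does that on a made-up, four-row stand-in for mydata; the intermediate variable names (big_swing, swing_dates, swing_month) are my own, and I have added as.Date() on the assumption that the dates are stored as text in year-month-day form.

```r
# Hypothetical stand-in for mydata (dates and values are made up)
mydata <- data.frame(
  date     = c("2014-01-05", "2014-01-20", "2014-03-11", "2014-07-04"),
  max_dwpf = c(45.0, 41.0, 50.0, 68.0),
  min_dwpf = c(10.0,  5.0, 15.1, 60.1)
)

# Step 1: TRUE for each day whose dew point range exceeds 30 degrees
big_swing <- (mydata$max_dwpf - mydata$min_dwpf) > 30

# Step 2: keep only the dates of those days
swing_dates <- mydata$date[big_swing]

# Step 3: convert to Date (assumed text format) and extract the month abbreviation
swing_month <- strftime(as.Date(swing_dates), "%b")

# Step 4: count how many large swings fall in each month
print(table(swing_month))
```

Notice that table() lists the months in alphabetical order rather than calendar order, which is why the output above reads Apr, Dec, Feb, Jan, Mar. The month abbreviations themselves depend on your system's language settings.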

Pretty cool, huh? Can you think of a meteorological reason why the dew point temperature (the moisture content of the air) would change drastically during the day?

We'll look at many other ways to explore large data sets in future lessons. Now, however, let's resume our brief survey of R with a look at other ways of managing data. Read on.
