METEO 810
Weather and Climate Data Sets

Managing Data

Prioritize...

In this section, you will learn how to manage variables in R by changing data classes, formatting times, and adding and deleting columns/rows from dataframes.

Read...

We've looked at how to query variables and dataframes to retrieve values, but we haven't discussed ways that variables and dataframes can be modified. Let's return to the DataCamp window where we load the dew point data file into the mydata dataframe (click "Run" to start the session).

Let's look at the structure of the dataframe using the following command (entered in the R Console window):

> str(mydata)

This command gives you the same information as the Environment tab in R-Studio. Note that $max_dwpf and $min_dwpf have a data class of "numeric". (You can also verify this fact with the command class(mydata$max_dwpf).) The $date variable, however, has a strange class: "Factor". What is a "factor"? You might want to read about them in this documentation page. Basically, a factor variable is a list of indices that point to a look-up table (called the "levels" table in R). You can think of a factor variable as a finite list of categories rather than actual numbers (a "categorical variable" in statistics). The problem is that factors are not sequential variables and thus don't always play nicely with other R functions (plots, for example). In many cases, we are going to want to change these factor-type variables to something more manageable. So, how can we transform one class of variable into another?
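
If you want to see this index/look-up structure for yourself, here's a quick sketch using a small toy vector (this is just for illustration and is not part of the dew point data):

# A toy factor (not from mydata) to illustrate the codes/levels structure
> f <- factor(c("low", "high", "low", "medium"))

# The look-up ("levels") table
> levels(f)

# The integer codes actually stored in the variable
> as.integer(f)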

The process is rather easy, but does come with some caveats. R has a set of functions having the form: as.numeric(...), as.integer(...), as.character(...), etc., that can transform one class of variable into another (where possible). For example, try the command:

# Convert max_dwpf to characters and display the first few values

> head(as.character(mydata$max_dwpf))

Notice that all the values for $max_dwpf are now surrounded by quotes. This means that they are no longer numbers, but strings of characters instead, and we can no longer perform meaningful mathematics on them. (For example, try max(as.character(mydata$max_dwpf)) and notice that the "maximum" is determined by sorting text, not by comparing numbers.) Conversions from numbers to characters are pretty straightforward (everything converts, no problem). The opposite, however, is not always the case. Let's consider the following scenario (enter just the commands in the console window):

# Create a variable of character data
# This is what you might get after reading a data file
#   with mixed numbers and non-numbers.
> temperature <- c("25", "-2", "10.5", "M", "16*")

# Try doing some math... Nope, can't do it
> temperature/10

# Now convert the temperature to numbers
> temperature <- as.numeric(temperature) 

# Look at the values... Notice that some values 
#   cannot be converted and are assigned 'NA'
> temperature 

# Try doing some math now...
> temperature/10

Notice that you must be careful when changing data from one type/class to another, particularly when you have mixed types assigned to one variable. Factors can be particularly troublesome because the values stored in a factor variable are numbers pointing to a look-up table, not actual values. To see what I mean, run the DataCamp console below. This console loads a data file called daily_obs_may2016.csv (I linked to the file in case you want to download and play with it in R-Studio).

First, after running the console, use str(daily_obs) to look at the loaded dataframe. Note the $PRCP variable, which was loaded as a factor. By using levels(daily_obs$PRCP), we can see all the possible values of the variable (and why the variable is listed as a factor... the "T" stands for "trace of precip" and represents values less than 0.01 inches). So, how do we convert this variable into something we can use (to find the maximum value, for example)? First, let me say that you can't just convert using as.numeric(daily_obs$PRCP). Try it... what do you get? You do, in fact, get numbers, but those numbers represent the position of the values in the look-up table. Now try: as.character(daily_obs$PRCP). This gets you closer, doesn't it? Now the output is a list of the proper values pulled from the look-up table, but they are still character values, not numbers. Let's add the as.numeric(...) conversion to what we have and take a look at the output. (If you want to see how some other data scientists solved the "Trace" problem, you can check out their paper, published in the Journal of Service Climatology (2013).)

> as.numeric(as.character(daily_obs$PRCP))

Now we've got what we want, but we've lost something in the process (the "T" observations have been transformed into NAs). NA is a bit misleading here because the observation is not missing; in fact, a "Trace" is a perfectly valid observation. So what to do? Well, that's up to you. If it suits your purpose, you might leave those values as NA (if you are tabulating a "Total Precip", for example). Or, you might assign the T's a numerical value before converting (0.00 or 0.001, perhaps). Or, you may wish to create a "flag" column in the dataframe which stores a letter designating the type of observation (Number, Trace, or Missing) for each value. As the data scientist, you must decide how you are going to treat data that is not so nicely behaved. Just be cautious! My mantra is to always perform operations that preserve the original data and that only alter that data in a manner consistent with the type of result I need to compute.
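
For example, here is a minimal sketch of the "assign the T's a value" option (the 0.001 value and the new column name PRCP_num are just illustrative choices, not requirements):

# One possible approach: treat "T" as 0.001 inches before converting,
#   storing the result in a new column so the original data are preserved
> prcp <- as.character(daily_obs$PRCP)
> prcp[prcp == "T"] <- "0.001"
> daily_obs$PRCP_num <- as.numeric(prcp)

Notice that this approach leaves $PRCP untouched, in keeping with the "preserve the original data" mantra.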

Now that we've dealt with the factors, let's talk about dates. If you noticed, the dataframe mydata contained dates in the form of factors, while the daily_obs dates were read in as integers. Neither of these formats is acceptable -- dates should be dates! For dates that are already in character format (and in some sort of recognizable date format), you can use the as.Date(...) function. For example, you might want to recast the mydata$date variable like this (remember, it starts as a factor):

> mydata$date <- as.Date(as.character(mydata$date))

The function as.Date(...) is quick and dirty, but not all that powerful. In the daily_obs case, the dates are stored as integers (namely because they don't contain any non-digit characters). This format is troublesome for as.Date(...) because R thinks that these numbers represent a count of days from some reference point (called the origin). So, to convert these dates (and in a general sense, ALL dates/times), we are going to use the even more powerful function strptime(...). Here's the command to try:

> daily_obs$DATE <- strptime(daily_obs$DATE, "%Y%m%d", tz="EST")

Notice that strptime(...) can take almost any input form (number, character, even factor), it can handle dates and times in a format of your choosing, and it can even embed time zone information. It creates a data class called POSIXlt, which is a one-stop shop for storing date/time information. You have already encountered its sister function, strftime(...), which is used to format the output of POSIXlt objects in much the same way. Take my advice and convert all of your dates/times using strptime(...).
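
For example, here's a small sketch (using made-up values, not taken from either data file) that shows strptime(...) and strftime(...) working together:

# Parse a character date/time into a POSIXlt object...
> obs_time <- strptime("2016-05-01 18:30", "%Y-%m-%d %H:%M", tz="EST")

# ...then format it back out however you like
> strftime(obs_time, "%B %d, %Y at %H:%M %Z")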

Manipulating Dataframes

You've probably figured out that dataframes are nice, compact structures for storing data. Ideally, we are going to want to keep all similar data together rather than having to keep track of several independent variables. Adding columns to an existing dataframe is easy (Run the DataCamp session below before continuing).

Let's say that we wanted to convert the dew point temperature in mydata to degrees Celsius instead of Fahrenheit. We could issue the following command (don't run it):

# Don't run me!....
> mydata$max_dwpf<-(mydata$max_dwpf-32)*5/9

However, this will overwrite the original data. Instead, let's create a new column in mydata called max_dwpC.

# This is better... it creates a new column 
#   without overwriting the original data.
> mydata$max_dwpC<-(mydata$max_dwpf-32)*5/9

You can remove columns from a dataframe as well by simply assigning NULL to an existing column, like so:

> mydata$max_dwpC<-NULL

Note that NULL is not the same as NA. Assigning NULL to a variable removes it from R's memory space. NA, on the other hand, is treated as an actual value, even though it's usually used to signify an undefined or missing value.
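
If you'd like to convince yourself of the difference, here's a quick illustration using a throwaway vector:

# NA occupies a slot in the vector like any other value...
> x <- c(1, NA, 3)
> length(x)

# ...while assigning NULL removes the contents entirely
> x <- NULL
> length(x)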

What about adding and deleting rows? You guessed it... that's pretty easy too. To add rows, we simply use the command rbind(...) to bind together rows from one dataframe with rows from another. Look at the following series of commands:

# start with the original 'mydata' (not the one with added columns)

# create a new dataframe with two rows of data
> mydata2 <- data.frame(date=c("2015-01-01","2015-01-02"),
           max_dwpf=c(9.1, 12.6), min_dwpf=c(2.7, 1.6))

# bind the new rows to the original dataframe
> mydata <- rbind(mydata, mydata2)

# look at the result (the last 10 lines)
> tail(mydata, n=10)

Note that the two dataframes must have the same structure (the same number of columns, with the same names). Therefore, if you added a Celsius column to mydata, you must also add that variable to mydata2 (as shown below); otherwise, you will get an error.
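
For instance, a sketch of that situation (assuming you had kept the max_dwpC column from earlier) might look like this:

# Give mydata2 the same Celsius column before binding the rows
> mydata2$max_dwpC <- (mydata2$max_dwpf - 32) * 5/9
> mydata <- rbind(mydata, mydata2)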

Deleting rows, on the other hand, is a bit different. Instead of directly deleting rows, we need to tell R what rows we want to keep, and then reassign that selection to the original dataframe. For example, let's say that we want to delete the rows added in the commands above.

# Select only the first 365 rows and delete the rest
> mydata <- mydata[1:365,]

This method of parsing a data set is quite powerful when combined with the selection functions we've previously discussed. For example, the command:

> mydata <- mydata[which(mydata$max_dwpf > 20), ]

deletes all rows in the dataframe where the max dew point is not greater than 20°F. We might also want to use the function is.na(...) to delete rows that contain missing or corrupt data. We'll discuss what to do with data sets that contain missing values at the end of this lesson (it's more involved than you might think).
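
For example, one possible pattern for dropping rows with a missing maximum dew point might look like this:

# Keep only the rows where max_dwpf is not NA
> mydata <- mydata[!is.na(mydata$max_dwpf), ]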

For now, let's move on to how we might display our data. Read on.