3.8 Pandas and the manipulation of tabular data

The pandas package provides high-performance data structures and analysis tools, in particular for working with tabular data based on a fast and efficient implementation of a data frame class. It also allows for reading and writing data from/to various formats including CSV and Microsoft Excel. In the following, we show you some examples illustrating how to perform the most important data frame related operations with pandas. Again, we can only scratch the surface of the functionality provided by pandas here. Resources provided at the end will allow you to dive deeper if you wish to do so. We recommend that you start a new Jupyter Notebook and use it to try out the examples from this section for yourself. Use a new code cell for each block of code you encounter on the following pages.

3.8.1 Creating a new data frame

In our examples, we will be using pandas in combination with the numpy package, which provides many fundamental scientific computing functionalities for Python and which many other packages are built on. So we start by importing both packages:

import pandas as pd
import numpy as np

A data frame consists of cells that are organized into rows and columns. The rows and columns have names that serve as indices to access the data in the cells. Let us start by creating a data frame with some random numbers that simulate a time series of different measurements (columns) taken on consecutive days (rows) from January 1, 2017 to January 6, 2017. The first step is to create a sequence of dates that will serve as the row names for our data frame. For this, we use the pandas function date_range(…):

dates = pd.date_range('20170101', periods=6, freq='D')
dates

The first parameter given to date_range is the starting date. The ‘periods’ parameter tells the function how many dates we want to generate, while we use ‘freq’ to tell it that we want a date for every consecutive day. If you look at the output from the second line we included, you will see that the object returned by the function is a DatetimeIndex object, which is a special class defined in pandas.
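
If you are curious, other frequency strings work in the same way. The following two lines are just a quick sketch for illustration and are not needed for the rest of this section ('H' stands for hourly and 'M' for month-end frequency):

hours = pd.date_range('20170101', periods=6, freq='H')      # six consecutive hours
monthEnds = pd.date_range('20170101', periods=6, freq='M')  # six month-end dates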

Next, we generate the random numbers that will make up the content of our data frame with the help of the numpy function randn(…), which creates a set of random numbers that follow a standard normal distribution:

numbers = np.random.randn(6,4)
numbers

The output is a two-dimensional numpy array with 6 rows and 4 columns of random numbers normally distributed around 0. We create a pandas data frame object from it with the following code:

df = pd.DataFrame(numbers, index=dates, columns=['m1', 'm2', 'm3', 'm4'])
df

Note that we provide our array of random numbers as the first parameter, followed by the DatetimeIndex object we created earlier for the row index. For the columns, we simply provide a list of the names with ‘m1’ standing for measurement 1, ‘m2’ standing for measurement 2, and so on. Please also note how the resulting data frame is displayed as a nicely formatted table in your Jupyter Notebook, which renders data frames as HTML tables. Please keep in mind that because we are using random numbers for the content of the cells, the output produced by the commands in the following examples will look different in your notebook because the numbers are different.
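
If you want to see for yourself how the row and column names are stored, you can have a look at the index and columns properties of the data frame (purely optional at this point):

print(df.index)    # the DatetimeIndex holding the row names
print(df.columns)  # an Index object holding the column names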

3.8.2 Subsetting and changing cell values

Now that we have a data frame object available, let’s quickly go through some of the basic operations that one can perform on a data frame to access or modify the data.

The info() method can be used to display some basic information about a data frame such as the number of rows and columns and data types of the columns:

df.info()

The output tells us that we have four columns, all for storing floating point numbers, and each column has 6 rows with values that are not null. If you ever need the number of rows and columns, you can get them by applying the len(…) operation to the data frame and to the columns property of the data frame:

print(len(df))          # gives you the number of rows
print(len(df.columns))  # gives you the number of columns
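
As a small side note, data frames also have a shape property that gives you both numbers at once as a (rows, columns) tuple:

print(df.shape)    # prints (6, 4) for our data frame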

We can use the head(…) and tail(…) methods to get only the first or last n rows of a data frame:

firstRows = df.head(2)
print(firstRows)
lastRows = df.tail(2)
print(lastRows)

We can also just get a subset of consecutive rows by applying slicing to the data frame similar to how this can be done with lists or strings:

someRows = df[3:5]    # gives you the 4th and 5th rows
print(someRows)

This operation gives us rows 4 and 5 (those with index 3 and 4) from the original data frame because the second number used is the index of the first row that is no longer included.

If we are just interested in a single column, we can get it based on its name:

thirdColumn = df.m3
print(thirdColumn)

The same can be achieved by using the notation df['m3'] instead of df.m3 in the first line of code. Moreover, instead of using a single column name, you can use a list of column names to get a data frame with just these columns and in the specified order:

columnsM3andM2 = df[ ['m3', 'm2'] ]
columnsM3andM2
Table 3.1 Data frame with swapped columns
m3 m2
2017-01-01 0.510162 0.163613
2017-01-02 0.025050 0.056027
2017-01-03 -0.422343 -0.840010
2017-01-04 -0.966351 -0.721431
2017-01-05 -1.339799 0.655267
2017-01-06 -1.160902 0.192804

The column subsetting and row slicing operations shown above can be concatenated into a single expression. For instance, the following command gives us columns ‘m3’ and ‘m2’ and only the rows with index 3 and 4:

someSubset = df[['m3', 'm2']][3:5]
someSubset

The order here doesn’t matter. We could also have written df[3:5][['m3', 'm2']].

The most flexible methods for creating subsets of a data frame are the .loc and .iloc index properties. .iloc[…] is based on the numerical indices of the rows and columns. Here is an example:

someSubset = df.iloc[2:4,1:3] 
print(someSubset)

The part before the comma determines the rows (rows with indices 2 and 3 in this case) and the part after the comma, the columns (columns with indices 1 and 2 in this case). So we get a data frame with the 3rd and 4th rows and 2nd and 3rd columns of the original data frame. Instead of slices we can also use lists of row and column indices to create completely arbitrary subsets. For instance, using iloc in the following example

someSubset = df.iloc[ [0,2,4], [1,3] ]
print(someSubset)

...gives us a data frame with the 1st, 3rd, and 5th rows and the 2nd and 4th columns of the original data frame. Both the part before the comma and the part after the comma can also just be a colon symbol (:), in which case all rows/columns will be included. For instance,

allRowsSomeColumns = df.iloc[ : , [1,3] ]
print(allRowsSomeColumns)

...will give you all rows but only the 2nd and 4th columns.

In contrast to iloc, loc doesn’t use the row and column numbers but instead is based on their labels, while otherwise working in the same way as iloc. The following command produces the same subset of the 1st, 3rd, and 5th rows and 2nd and 4th columns as the iloc code from two examples above:

someSubset = df.loc[ [pd.Timestamp('2017-01-01'), pd.Timestamp('2017-01-03'),
                      pd.Timestamp('2017-01-05')], ['m2', 'm4'] ]
print(someSubset)

Please note that, in this example, the list for the column names at the very end is simply a list of strings but the list of dates for the row names has to consist of pandas Timestamp objects. That is because we used a DatetimeIndex for the rows when we created the original data frame. When a data frame is displayed, the row names show up as simple strings but they are actually Timestamp objects. However, a DatetimeIndex for the rows has many advantages; for instance, we can use it to get all rows for dates that are from a particular year, e.g. with

df.loc['2017', ['m2', 'm4']]

...to get all rows with dates from 2017 which, of course, in this case, are all rows. In the same way, we can also select all rows from a specified time period.
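
For instance, the following line (just a quick sketch) uses a slice of date strings to select a range of consecutive days; note that, unlike the number-based slicing shown earlier, a label-based slice with .loc includes both endpoints:

df.loc['2017-01-02':'2017-01-04', ['m2', 'm4']]   # rows for January 2 to 4 (inclusive)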

The methods explained above for accessing subsets of a data frame can also be used as part of an assignment to change the values in one or several cells. In the simplest case, we can change the value in a single cell, for instance with

df.iloc[0,0] = 0.17 

...or

df.loc['2017-01-01', 'm1'] = 0.17 

...to change the value of the top left cell to 0.17. Please note that this operation will change the original data frame, not create a modified copy. So if you now print out the data frame with

df 

you will see the modified value for 'm1' of January 1, 2017. Even more importantly, if you have used the operations explained above for creating a subset, the data frame with the subset will still refer to the original data, so changing a cell value will change your original data. If you ever want to make changes but keep the original data frame unchanged, you need to explicitly make a copy of the data frame by calling the copy() method as in the following example:

dfCopy = df.copy()
dfCopy.iloc[0,0] = 0
print(df)
print(dfCopy)

Check out the output and compare the top left value for both data frames. The data frame in df still has the old value of 0.17, while the value will be changed to 0 in dfCopy. Without using the copy() method in the first line, both variables would still refer to the same underlying data and both would show the 0 value. Here is another slightly more complicated example where we change the values for the first column of the 1st and 5th rows to 1.2 (intentionally modifying the original data):

df.iloc[ [0,4] , [0] ] = 1.2
print(df)

If you ever need to explicitly go through the rows in a data frame, you can do this with a for-loop that uses the itertuples(…) method of the data frame. itertuples(…) gives you an object that can be used to iterate through the rows as named tuples, meaning each element in the tuple is labeled with the respective column name. By providing the parameter index=False to the method, we are saying that we don’t want the row name to be part of the tuple, just the cell values for the different columns. You can access the elements of the tuple either via their index or via the column name:

for row in df.itertuples(index=False):
    print(row)     # print entire row tuple
    print(row[0])  # print value from column with index 0
    print(row.m2)  # print value from column with name m2
    print('----------')

Try out this example and have a look at the named tuple and the output produced by the other two print statements.
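
In case you also need the row names while looping, a common alternative is the iterrows() method. Here is a minimal sketch; each iteration yields the row name together with the row content as a pandas Series:

for date, row in df.iterrows():
    print(date, row.m2)   # print the row name and the value of column m2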

3.8.3 Sorting

Pandas also provides operations for sorting the rows in a data frame. The following command can be used to sort our data frame by the values in the ‘m2’ column in decreasing order:

dfSorted = df.sort_values(by='m2', ascending=False)
dfSorted
Table 3.2 Data frame with rows sorted by descending values in the m2 column
m1 m2 m3 m4
2017-01-05 1.200000 0.655267 -1.339799 1.075069
2017-01-06 0.192804 0.192804 -1.160902 0.525051
2017-01-01 1.200000 0.163613 0.510162 0.628612
2017-01-02 0.056027 0.056027 0.025050 0.283586
2017-01-04 -0.721431 -0.721431 -0.966351 -0.380911
2017-01-03 -0.840010 -0.840010 -0.422343 1.022622

The ‘by’ argument specifies the column that the sorting should be based on and, by setting the ‘ascending’ argument to False, we are saying that we want the rows to be sorted in descending rather than ascending order. It is also possible to provide a list of column names for the ‘by’ argument, to sort by multiple columns. The sort_values(...) method will create a new copy of the data frame, so modifying any cells of dfSorted in this example will not have any impact on the data frame in variable df.
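
As a quick sketch of sorting by multiple columns, the following line sorts first by ‘m1’ in ascending order and breaks ties by ‘m2’ in descending order (the ‘ascending’ argument can take a list of Boolean values, one per column):

dfSorted2 = df.sort_values(by=['m1', 'm2'], ascending=[True, False])
dfSorted2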

3.8.4 Adding / removing columns and rows

Adding a new column to a data frame is very simple when you have the values for that column ready in a list. For instance, in the following example, we want to add a new column ‘m5’ with additional measurements and we already have the numbers stored in a list m5values that is defined in the first line of the example code. To add the column, we then simply make an assignment to df['m5'] in the second line. If a column ‘m5’ already existed, its values would now be overwritten by the values from m5values. But since this is not the case, a new column is added under the name ‘m5’ with the values from m5values.

m5values = [0.432523, -0.123223, -0.231232, 0.001231, -0.23698, -0.41231]
df['m5'] = m5values
df
Table 3.3 Data frame after adding the new column m5
m1 m2 m3 m4 m5
2017-01-01 1.200000 0.163613 0.510162 0.628612 0.432523
2017-01-02 0.056027 0.056027 0.025050 0.283586 -0.123223
2017-01-03 -0.840010 -0.840010 -0.422343 1.022622 -0.231232
2017-01-04 -0.721431 -0.721431 -0.966351 -0.380911 0.001231
2017-01-05 1.200000 0.655267 -1.339799 1.075069 -0.236980
2017-01-06 0.192804 0.192804 -1.160902 0.525051 -0.412310

For adding new rows, we can simply make assignments to the rows selected via the loc operation, e.g. we could add a new row for January 7, 2017 by writing

df.loc[pd.Timestamp('2017-01-07'),:] = [ ... ]

where the part after the equal sign is a list of five numbers, one for each of the columns. Again, this would replace the values if there already were a row for January 7. The following example uses this idea to create new rows for January 7 to 9 using a for loop:

for i in range(7,10):
    df.loc[ pd.Timestamp('2017-01-0'+str(i)),:] = [ np.random.rand() for j in range(5) ]
df
Table 3.4 Data frame after adding rows for January 7 to 9
m1 m2 m3 m4 m5
2017-01-01 1.200000 0.163613 0.510162 0.628612 0.432523
2017-01-02 0.056027 0.056027 0.025050 0.283586 -0.123223
2017-01-03 -0.840010 -0.840010 -0.422343 1.022622 -0.231232
2017-01-04 -0.721431 -0.721431 -0.966351 -0.380911 0.001231
2017-01-05 1.200000 0.655267 -1.339799 1.075069 -0.236980
2017-01-06 0.192804 0.192804 -1.160902 0.525051 -0.412310
2017-01-07 0.768633 0.559968 0.591466 0.210762 0.610931
2017-01-08 0.483585 0.652091 0.183052 0.278018 0.858656
2017-01-09 0.909180 0.917903 0.226194 0.978862 0.751596

In the body of the for loop, the part on the left of the equal sign uses .loc[...] to refer to a row for the new date based on the loop variable i, while the part on the right side simply uses the numpy function rand() inside a list comprehension to create a list of five random numbers that will be assigned to the cells of the new row.

If you ever want to remove columns or rows from a data frame, you can do so by using df.drop(...). The first parameter given to drop(...) is a single column or row name or, alternatively, a list of names that should be dropped. By default, drop(...) will consider these to be row names. To indicate that these are column names that should be removed, you have to specify the additional keyword argument axis=1. We will see an example of this in a moment.
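
Just as a quick sketch of dropping a row, the following lines remove the row for January 1 (remember that the row names of our data frame are Timestamp objects and, as we will see below, drop(...) returns a modified copy rather than changing the original data frame):

dfWithoutFirstDay = df.drop(pd.Timestamp('2017-01-01'))
dfWithoutFirstDay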

3.8.5 Joining data frames

The following short example demonstrates how pandas can be used to merge two data frames based on a common key, that is, to perform a join operation in database terms. Let’s say we have two tables, one listing the capitals of states in the U.S. and one listing the populations of these states. For simplicity, we define data frames with entries for just two states, Washington and Oregon:

df1 = pd.DataFrame( {'state': ['Washington', 'Oregon'], 'capital': ['Olympia', 'Salem']} )
print(df1)
df2 = pd.DataFrame( {'name': ['Washington', 'Oregon'], 'population':[7288000, 4093000]} )
print(df2)

The two data frames produced by this code look like this:

Table 3.5 Data frame 1 (df1) listing states and state capitals
capital state
0 Olympia Washington
1 Salem Oregon
Table 3.6 Data frame 2 (df2) listing states and population numbers
name population
0 Washington 7288000
1 Oregon 4093000

We here use a new way of creating a data frame, namely from a dictionary that has one entry for each of the columns, where the keys are the column names (‘state’ and ‘capital’ in the case of df1, and ‘name’ and ‘population’ in the case of df2) and the values stored are lists of the values for that column. We now invoke the merge(...) method on df1 and provide df2 as the first parameter, meaning that a new data frame will be produced by merging df1 and df2. We further have to specify which columns should be used as keys for the join operation. Since the two columns containing the state names have different names, we have to provide the column name for df1 through the ‘left_on’ argument and the one for df2 through the ‘right_on’ argument.

merged = df1.merge(df2, left_on='state', right_on='name')
merged

The joined data frame produced by the code will look like this:

Table 3.7 Joined data frame
capital state name population
0 Olympia Washington Washington 7288000
1 Salem Oregon Oregon 4093000

As you see, the data frames have been merged correctly. However, we do not want two columns with the state names, so, as a last step, we remove the column called ‘name’ with the previously mentioned drop(...) method. As explained, we have to use the keyword argument axis=1 to indicate that we want to drop a column, not a row.

newMerged = merged.drop('name', axis=1)
newMerged

Result:

Table 3.8 Joined data frame after dropping the 'name' column
capital state population
0 Olympia Washington 7288000
1 Salem Oregon 4093000

If you print out variable merged, you will see that it still contains the 'name' column. That is because drop(...) doesn't change the original data frame but rather produces a copy with the column/row removed.
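
As a side note, merge(...) performs a so-called inner join by default, meaning rows without a matching key in the other data frame are left out of the result. The following sketch shows how the optional ‘how’ keyword argument can be used to instead keep all rows of the left data frame:

leftJoined = df1.merge(df2, left_on='state', right_on='name', how='left')
leftJoined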

3.8.6 Advanced data frame manipulation: Filtering via Boolean indexing

When working with tabular data, it is very common that one wants to do something with particular data entries that satisfy a certain condition. For instance, we may want to restrict our analysis to rows that have a value larger than a given threshold for one of the columns. Pandas provides some powerful methods for this kind of filtering, and we are going to show one of these to you in this section, namely filtering with Boolean expressions.

The first important thing to know is that we can use data frames in comparison expressions, like df > 0, df.m1 * 2 < 0.2, and so on. The output will be a data frame that only contains Boolean values (True or False) indicating whether the corresponding cell values satisfy the comparison expression or not. Let’s try out these two examples:

df > 0

The result is a data frame with the same rows and columns as the original data frame in df with all cells that had a value larger than 0 set to True, while all other cells are set to False:

Table 3.9 Boolean data frame produced by the expression df > 0
m1 m2 m3 m4 m5
2017-01-01 True True True True True
2017-01-02 True True True True False
2017-01-03 False False False False False
2017-01-04 False False False False True
2017-01-05 True True False True False
2017-01-06 True True False True False
2017-01-07 True True True True True
2017-01-08 True True True True True
2017-01-09 True True True True True

df.m1 * 2 < 0.2

Here we are doing pretty much the same thing but only for a single column (‘m1’), and the comparison expression is slightly more complex, involving multiplication of the cell values by 2 before the result is compared to 0.2. The result is a one-dimensional vector of True and False values corresponding to the cells of the ‘m1’ column in the original data frame:

Table 3.10 Boolean data frame for the expression df.m1 * 2 < 0.2
2017-01-01 False
2017-01-02 True
2017-01-03 True
2017-01-04 True
2017-01-05 False
2017-01-06 False
2017-01-07 False
2017-01-08 True
2017-01-09 True
Freq: D, Name: m1, dtype: bool

Just to introduce another useful pandas method, we can apply the value_counts() method to get a summary of the values telling us how often each value occurs:

(df.m1 * 2 < 0.2).value_counts()

The expression in the parentheses will give us a boolean column vector as we have seen above, and invoking its value_counts() method tells us how often True and False occur in this vector. (The actual numbers will depend on the random numbers in your original data frame).
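
A related trick (just a small sketch): since True is treated as 1 and False as 0 in numerical operations, invoking sum() on such a Boolean vector directly gives you the number of rows that satisfy the condition:

(df.m1 * 2 < 0.2).sum()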

The second important component of Boolean indexing is that we can use Boolean operators to combine Boolean data frames as illustrated in the next example:

v1 = df.m1 * 2 < 0.2
print(v1)
v2 = df.m2 > 0
print(v2)
print(~v1)
print(v1 & v2)
print(v1 | v2)

This will produce the following output data frames:

Table 3.11 Data frame for v1
2017-01-01 False
2017-01-02 True
2017-01-03 True
2017-01-04 True
2017-01-05 False
2017-01-06 False
2017-01-07 False
2017-01-08 True
2017-01-09 True
Freq: D, Name: m1, dtype: bool
Table 3.12 Data frame for v2
2017-01-01 True
2017-01-02 False
2017-01-03 False
2017-01-04 True
2017-01-05 True
2017-01-06 True
2017-01-07 True
2017-01-08 True
2017-01-09 True
Freq: D, Name: m2, dtype: bool
Table 3.13 Data frame for ~v1
2017-01-01 True
2017-01-02 False
2017-01-03 False
2017-01-04 False
2017-01-05 True
2017-01-06 True
2017-01-07 True
2017-01-08 False
2017-01-09 False
Freq: D, Name: m1, dtype: bool
Table 3.14 Data frame for v1 & v2
2017-01-01 False
2017-01-02 True
2017-01-03 False
2017-01-04 False
2017-01-05 False
2017-01-06 False
2017-01-07 False
2017-01-08 True
2017-01-09 True
Freq: D, dtype: bool
Table 3.15 Data frame for v1 | v2 
2017-01-01 True
2017-01-02 True
2017-01-03 True
2017-01-04 True
2017-01-05 True
2017-01-06 True
2017-01-07 True
2017-01-08 True
2017-01-09 True
Freq: D, dtype: bool

What is happening here? We first create two different Boolean vectors using two different comparison expressions for columns ‘m1’ and ‘m2’, respectively, and store the results in variables v1 and v2. Then we use the Boolean operators ~ (not), & (and), and | (or) to create new Boolean vectors from the original ones, first by negating the Boolean values from v1, then by taking the logical AND of the corresponding values in v1 and v2 (meaning only cells that have True for both v1 and v2 will be set to True in the resulting vector), and finally by doing the same but with the logical OR (meaning only cells that have False for both v1 and v2 will be set to False in the result). We can construct arbitrarily complex Boolean expressions over the values in one or multiple data frames in this way.

The final important component is that we can use Boolean vectors or lists to select rows from a data frame. For instance,

df[ [True, False, True, False, True, False, True, False, True] ]

... will give us a subset of the data frame with only every second row:

Table 3.16 Data frame resulting from Boolean indexing operation
m1 m2 m3 m4 m5
2017-01-01 1.200000 0.163613 0.510162 0.628612 0.432523
2017-01-03 -0.840010 -0.840010 -0.422343 1.022622 -0.231232
2017-01-05 1.200000 0.655267 -1.339799 1.075069 -0.236980
2017-01-07 0.399069 0.029156 0.937808 0.476401 0.766952
2017-01-09 0.041115 0.984202 0.912212 0.740345 0.148835

Taking these three things together means we can use arbitrary logical expressions over the values in a data frame to select a subset of rows that we want to work with. To continue the examples from above, let’s say that we want only those rows that satisfy both the criteria df.m1 * 2 < 0.2 and df.m2 > 0, so only those rows for which the value of column ‘m1’ times 2 is smaller than 0.2 and the value of column ‘m2’ is larger than 0. We can use the following expression for this:

df[v1 & v2]

Or even without first having to define v1 and v2:

df[ (df.m1 * 2 < 0.2) & (df.m2 > 0) ]

Here is the resulting data frame:

Table 3.17 Data frame produced by the expression df[ (df.m1 * 2 < 0.2) & (df.m2 > 0) ]
m1 m2 m3 m4 m5
2017-01-02 0.056027 0.056027 0.025050 0.283586 -0.123223
2017-01-08 0.043226 0.904844 0.181999 0.253381 0.165105
2017-01-09 0.041115 0.984202 0.912212 0.740345 0.148835

Hopefully you are beginning to see how powerful this approach is and how it allows for writing very elegant and compact code for working with tabular data. You will get to see more examples of this and of using pandas in general in the lesson’s walkthrough. There we will also be using GeoPandas, an extension built on top of pandas that allows for working with data frames that contain geometry data, e.g. entire attribute tables of an ESRI shapefile.
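
If you already want a first impression of GeoPandas, here is a minimal sketch; it assumes that GeoPandas is installed and that 'someShapefile.shp' is just a placeholder for a shapefile of your own:

import geopandas as gpd

gdf = gpd.read_file('someShapefile.shp')   # read the shapefile into a GeoDataFrame
gdf.head()                                 # the attribute table plus a geometry column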