GEOG 489
Advanced Python Programming for GIS

3.8.2 Subsetting and changing cell values

PrintPrint

Now that we have a data frame object available, let’s quickly go through some of the basic operations that one can perform on a data frame to access or modify the data.

The info() method can be used to display some basic information about a data frame such as the number of rows and columns and data types of the columns:

df.info()

The output tells us that we have four columns, all for storing floating point numbers, and each column has 6 rows with values that are not null. If you ever need the number of rows and columns, you can get them by applying the len(…) operation to the data frame and to the columns property of the data frame:

print(len(df))          # gives you the number of rows
print(len(df.columns))  # gives you the number of columns

We can use the head(…) and tail(…) methods to get only the first or last n rows of a data frame:

firstRows = df.head(2)
print(firstRows)
lastRows = df.tail(2)
print(lastRows)

We can also just get a subset of consecutive rows by applying slicing to the data frame similar to how this can be done with lists or strings:

someRows = df[3:5]    # gives you the 4th and 5th row
print(someRows)

This operation gives us rows 4 and 5 (those with index 3 and 4) from the original data frame because the second number used is the index of the first row that will not be included anymore.

If we are just interested in a single column, we can get it based on its name:

thirdColumn = df.m3
print(thirdColumn)

The same can be achieved by using the notation df['m3'] instead of df.m3 in the first line of code. Moreover, instead of using a single column name, you can use a list of column names to get a data frame with just these columns and in the specified order:

columnsM3andM2 = df[ ['m3', 'm2'] ]
columnsM3andM2
Table 3.1 Data frame with swapped columns
m3 m2
2017-01-01 0.510162 0.163613
2017-01-02 0.025050 0.056027
2017-01-03 -0.422343 -0.840010
2017-01-04 -0.966351 -0.721431
2017-01-05 -1.339799 0.655267
2017-01-06 -1.160902 0.192804
 

The column subsetting and row slicing operations shown above can be concatenated into a single expression. For instance, the following command gives us columns ‘m3’ and ‘m2’ and only the rows with index 3 and 4:

someSubset = df[['m3', 'm2']][3:5]
someSubset

The order here doesn’t matter. We could also have written df[3:5][['m3', 'm2']] .

The most flexible methods for creating subsets of data frame are via the loc and .iloc index properties of a data frame. .iloc[…] is based on the numerical indices of the columns and rows. Here is an example:

someSubset = df.iloc[2:4,1:3] 
print(someSubset)

The part before the comma determines the rows (rows with indices 2 and 3 in this case) and the part after the comma, the columns (columns with indices 1 and 2 in this case). So we get a data frame with the 3rd and 4th rows and 2nd and 3rd columns of the original data frame. Instead of slices we can also use lists of row and column indices to create completely arbitrary subsets. For instance, using iloc in the following example

someSubset = df.iloc[ [0,2,4], [1,3] ]
print(someSubset)

...gives us a data frame with the 1st, 3rd, and 5th row and 2nd and 4th column of the original dataframe. Both the part before the comma and after the comma can just be a colon symbol (:) in which case all rows/columns will be included. For instance,

allRowsSomeColumns = df.iloc[ : , [1,3] ]
print(allRowsSomeColumns)

...will give you all rows but only the 2nd of 4th column.

In contrast to iloc, loc doesn’t use the row and column numbers but instead is based on their labels, while otherwise working in the same way as iloc. The following command produces the same subset of the 1st, 3rd, and 5th rows and 2nd and 4th columns as the iloc code from two examples above:

someSubset = df.loc[ [pd.Timestamp('2017-01-01'), pd.Timestamp('2017-01-03'), pd.Timestamp('2017-01-05')] , ['m2', 'm4'] ]
print(someSubset)

Please note that, in this example, the list for the column names at the very end is simply a list of strings but the list of dates for the row names has to consist of pandas Timestamp objects. That is because we used a DatetimeIndex for the rows when we created the original data frame. When a data frame is displayed, the row names show up as simple strings but they are actually Timestamp objects. However, a DatetimeIndex for the rows has many advantages; for instance, we can use it to get all rows for dates that are from a particular year, e.g. with

df.loc['2017' , ['m2', 'm4'] ]

...to get all dates from 2017 which, of course, in this case, are all rows. Without going into further detail here, we can also get all dates from a specified time period, etc.

The methods explained above for accessing subsets of a data frame can also be used as part of an assignment to change the values in one or several cells. In the simplest case, we can change the value in a single cell, for instance with

df.iloc[0,0] = 0.17 

...or

df.loc['2017-01-01', 'm1'] = 0.17 

...to change the value of the top left cell to 0.17. Please note that this operation will change the original data frame, not create a modified copy. So if you now print out the data frame with

df 

you will see the modified value for 'm1' of January 1, 2017. Even more importantly, if you have used the operations explained above for creating a subset, the data frame with the subset will still refer to the original data, so changing a cell value will change your original data. If you ever want to make changes but keep the original data frame unchanged, you need to explicitly make a copy of the data frame by calling the copy() method as in the following example:

dfCopy = df.copy()
dfCopy.iloc[0,0] = 0
print(df)
print(dfCopy)

Check out the output and compare the top left value for both data frames. The data frame in df still has the old value of 0.17, while the value will be changed to 0 in dfCopy. Without using the copy() method in the first line, both variables would still refer to the same underlying data and both would show the 0 value. Here is another slightly more complicated example where we change the values for the first column of the 1st and 5th rows to 1.2 (intentionally modifying the original data):

df.iloc[ [0,4] , [0] ] = 1.2
print(df)

If you ever need to explicitly go through the rows in a data frame, you can do this with a for-loop that uses the itertuples(…) method of the data frame. itertuples(…) gives you an object that can be used to iterate through the rows as named tuples, meaning each element in the tuple is labeled with the respective column name. By providing the parameter index=False to the method, we are saying that we don’t want the row name to be part of the tuple, just the cell values for the different columns. You can access the elements of the tuple either via their index or via the column name:

for row in df.itertuples(index=False):
    print(row)     # print entire row tuple
    print(row[0])  # print value from column with index 0
    print(row.m2)  # print value from column with name m2
    print('----------')

Try out this example and have a look at the named tuple and the output produced by the other two print statements.