GEOG 489
Advanced Python Programming for GIS

3.8.6 Advanced data frame manipulation: Filtering via Boolean indexing

PrintPrint

When working with tabular data, it is very common that one wants to do something with particular data entries that satisfy a certain condition. For instance, we may want to restrict our analysis to rows that have a value larger than a given threshold for one of the columns. Pandas provides some powerful methods for this kind of filtering, and we are going to show one of these to you in this section, namely filtering with Boolean expressions.

The first important thing to know is that we can use data frames in comparison expressions, like df > 0, df.m1 * 2 < 0.2, and so on. The output will be a data frame that only contains Boolean values (True or False) indicating whether the corresponding cell values satisfy the comparison expression or not. Let’s try out these two examples:

df > 0

The result is a data frame with the same rows and columns as the original data frame in df with all cells that had a value larger than 0 set to True, while all other cells are set to False:

Table 3.9 Boolean data frame produced by the expression df > 0
m1 m2 m3 m4 m5
2017-01-01 True True True True True
2017-01-02 True True True True False
2017-01-03 False False False False False
2017-01-04 False False False False True
2017-01-05 True True False True False
2017-01-06 True True False True False
2017-01-07 True True True True True
2017-01-08 True True True True True
2017-01-09 True True True True True

df.m1 * 2 < 0.2

Here we are doing pretty much the same thing but only for a single column (‘m1’) and the comparison expression is slightly more complex involving multiplication of the cell values with 2 before the result is compared to 0.2. The result is a one-dimensional vector of True and False values corresponding to the cells of the ‘m1’ column in the original data frame:

Table 3.10 Boolean data frame for the expression df.m1 * 2 < 0.2
2017-01-01 False
2017-01-02 True
2017-01-03 True
2017-01-04 True
2017-01-05 False
2017-01-06 False
2017-01-07 False
2017-01-08 True
2017-01-09 True
Freq: D, Name: m1, dtype: bool

Just to introduce another useful pandas method, we can apply the value_counts() method to get a summary of the values in a data frame telling how often each value occurs:

(df.m1 * 2 < 0.2).value_counts()

The expression in the parentheses will give us a boolean column vector as we have seen above, and invoking its value_counts() method tells us how often True and False occur in this vector. (The actual numbers will depend on the random numbers in your original data frame).

The second important component of Boolean indexing is that we can use Boolean operators to combine Boolean data frames as illustrated in the next example:

v1 = df.m1 * 2 < 0.2
print(v1)
v2 = df.m2 > 0
print(v2)
print(~v1)
print(v1 & v2)
print(v1 | v2)

This will produce the following output data frames:

Table 3.11 Data frame for v1
2017-01-01 False
2017-01-02 True
2017-01-03 True
2017-01-04 True
2017-01-05 False
2017-01-06 False
2017-01-07 False
2017-01-08 True
2017-01-09 True
Frew: D, Name: m1, dtype: bool
Table 3.12 Data frame for v2
2017-01-01 True
2017-01-02 False
2017-01-03 False
2017-01-04 True
2017-01-05 True
2017-01-06 True
2017-01-07 True
2017-01-08 True
2017-01-09 True
Frew: D, Name: m2, dtype: bool
Table 3.13 Data frame for ~v1
2017-01-01 True
2017-01-02 False
2017-01-03 False
2017-01-04 False
2017-01-05 True
2017-01-06 True
2017-01-07 True
2017-01-08 False
2017-01-09 False
Frew: D, Name: m1, dtype: bool
Table 3.14 Data frame for v1 & v2
2017-01-01 False
2017-01-02 True
2017-01-03 False
2017-01-04 False
2017-01-05 False
2017-01-06 False
2017-01-07 False
2017-01-08 True
2017-01-09 True
Frew: D, dtype: bool
Table 3.15 Data frame for v1 | v2 
2017-01-01 True
2017-01-02 True
2017-01-03 True
2017-01-04 True
2017-01-05 True
2017-01-06 True
2017-01-07 True
2017-01-08 True
2017-01-09 True
Frew: D, dtype: bool

What is happening here? We first create two different Boolean vectors using two different comparison expressions for columns ‘m1’ and ‘m2’, respectively, and store the results in variables v1 and v2. Then we use the Boolean operators ~ (not), & (and), and | (or) to create new Boolean vectors from the original ones, first by negating the Boolean values from v1, then by taking the logical AND of the corresponding values in v1 and v2 (meaning only cells that have True for both v1 and v2 will be set to True in the resulting vector), and finally by doing the same but with the logical OR (meaning only cells that have False for both v1 and v2 will be set to False in the result). We can construct arbitrarily complex Boolean expressions over the values in one or multiple data frames in this way.

The final important component is that we can use Boolean vectors or lists to select rows from a data frame. For instance,

df[ [True, False, True, False, True, False, True, False, True] ]

... will give us a subset of the data frame with only every second row:

Table 3.13 Data frame resulting from Boolean indexing operation
m1 m2 m3 m4 m5
2017-01-01 1.200000 0.163613 0.510162 0.628612 0.432523
2017-01-03 -0.840010 -0.840010 -0.422343 1.022622 -0.231232
2017-01-05 1.200000 0.655267 -1.339799 1.075069 -0.236980
2017-01-07 0.399069 0.029156 0.937808 0.476401 0.766952
2017-01-09 0.041115 0.984202 0.912212 0.740345 0.148835

Taken these three things together means we can use arbitrarily logical expressions over the values in a data frame to select a subset of rows that we want to work with. To continue the examples from above, let’s say that we want only those rows that satisfy both the criteria df.m1 * 2 < 0.2 and df.m2 > 0, so only those rows for which the value of column ‘m1’ times 2 is smaller than 0.2 and the value of column ‘m2’ is larger than 0. We can use the following expression for this:

df[v1 & v2 ]

Or even without first having to define v1 and v2:

df[ (df.m1 * 2 < 0.2)  & (df.m2 > 0)  ]

Here is the resulting data frame:

Table 3.14 Data frame produced by the expression df[ (df.m1 * 2 < 0.2) & (df.m2 > 0) ]
m1 m2 m3 m4 m5
2017-01-02 0.056027 0.056027 0.025050 0.283586 -0.123223
2017-01-08 0.043226 0.904844 0.181999 0.253381 0.165105
2017-01-09 0.041115 0.984202 0.912212 0.740345 0.148835

Hopefully you are beginning to see how powerful this approach is and how it allows for writing very elegant and compact code for working with tabular data. You will get to see more examples of this and of using pandas in general in the lesson’s walkthrough. There we will also be using GeoPandas, an extension built on top of pandas that allows for working with data frames that contain geometry data, e.g. entire attribute tables of an ESRI shapefile.