Lesson 3: Advanced Geoprocessing

We have made it past the halfway mark in the class and I hope you are starting to see where and how you can employ Python to make your daily tasks easier. Lesson 3 builds on the previous lesson and we will be investigating advanced geoprocessing in Python. We will start with looking at List Comprehension, Regular Expressions (regex), and then transition to exploring two of the popular python package managers, pip and conda.

Since most scientific packages for Python contain many dependencies, installation through conda is the preferred to take advantage of condas’ built dependency resolver during package installation. Conda also is an environment manager and exporting environments for distribution, creating, restoring, and cloning can be done via the conda command prompt. Conda environments can be troublesome to get working correctly and for the brevity of this course, we are touching on the topics that can use esri’s conda installation. If you are interested trying to create an external environment using Anaconda, feel free to view the steps at our full 10-week GEOG 489 course content [1].

Lastly, after reviewing some scientific packages, we will dive into some Pandas operations and then finish it up with an assignment that involves iterating through a folder, reading csv data, and concatenates metrics to produce a summary file.

Overview and Checklist

This lesson builds on the previous and will be investigating advanced geoprocessing in Python. We will start with looking at Lists comprehensions and Regular Expressions (regex) and then transition to discussing two of the popular python package managers pip and conda. We will introduce some well known data science packages at a high level and finally work through some data manipulation in pandas.

Learning Outcomes

By the end of this lesson, you should be able to:

Operations on data lists
Use regex to perform data extraction.
Identify the main python packages used for spatial data science.
Perform data manipulation using pandas.

Lesson Roadmap

To finish this lesson, you must complete the activities listed below. You may find it useful to print this page out first so that you can follow along with the directions.

Steps for Completing Lesson 3
Step	Activity	Access/Directions
1	Engage with Lesson 3 Content	Begin with List Comprehension
2	Programming Assignment and Reflection	Submit your code for the programming assignment and 400 words write-up with reflections
3	Quiz 3	Complete the Lesson 3 Quiz
4	Questions/Comments	Remember to visit Canvas to post/answer any questions or comments pertaining to Lesson 3

Downloads

The following is a list of datasets that you will be prompted to download through the course of the lesson. They are divided into two sections: Datasets that you will need for the assignments and Datasets used for the content and examples in the lesson.

Required:

Assignment_3_files.zip [2]

Suggested:

None for this lesson; data for this lesson's examples have been downloaded during lesson 1.

Assignment

Please review the assignment at the of the lesson for full details

In this homework assignment, we want you to practice working with pandas and the other Python packages introduced in this lesson some more, and you are supposed to submit your solution as a nice-looking script.

The situation is the following: You have been hired by a company active in the northeast of the United States to analyze and produce different forms of summaries for the traveling activities of their traveling salespersons. Unfortunately, the way the company has been keeping track of the related information leaves a lot to be desired. The information is spread out over numerous .csv files. Please download the .zip [3] file containing all (imaginary) data you need for this assignment and extract it to a new folder. Then open the files in a text editor and read the explanations below.

Write-up

Produce a 400-word write-up on how the assignment went for you; reflect on and briefly discuss the issues and challenges you encountered and what you learned from the assignment.

Deliverable

Submit a single .zip file to the corresponding drop box on Canvas; the zip file should contain:

Your script file
Your 400-word write-up

Questions?

If you have any questions, please send a message through Canvas. We will check daily to respond. If your question is one that is relevant to the entire class, we may respond to the entire class rather than individually.

List Comprehension

Like the first lesson, we are going to start Lesson 3 with a bit of Python theory. From mathematics, you probably are familiar with the elegant way of defining sets based on other sets using a compact notation as in the example below:

M = { 1, 5, 9, 27, 31}

N = {x2, x ∈ M ∧ x > 11}

What is being said here is that the set N should contain the squares of all numbers in set M that are larger than 11. The notation uses { … } to indicate that we are defining a set, then an expression that describes the elements of the set based on some variable (x2) followed by a set of criteria specifying the values that this variable (x) can take (x ∈ M and x > 11).

This kind of compact notation has been adopted by Python for defining lists and it is called list comprehension. Recall that we used list comprehension twice already in our multiprocessing script in lesson one to create the list of objectid’s and a list of jobs. A list comprehension has the general form

[< new value expression using variable> for <variable> in <list> if <condition for variable>]

The fixed parts are written in bold here, while the parts that need to be replaced by some expressions using some variable are put into angular brackets <..> . The if and following condition are optional. To give a first example, here is how this notation can be used to create a list containing the squares of the numbers from 1 to 10:

squares = [ x**2 for x in range(1,11) ]

print(squares)

Output:
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

In case you haven’t seen this before, ** is the Python operator for a to the power of b.

What happens when Python evaluates this list comprehension is that it goes through the numbers in the list produced by range(1,11), so the numbers from 1 to 10, and then evaluates the expression x**2 with each of these numbers assigned to variable x. The results are collected to form the new list produced by the entire list comprehension. We can easily extend this example to only include the squares of numbers that are even:

evenNumbersSquared = [ x**2 for x in range(1,11) if x % 2 == 0 ]
print(evenNumbersSquared)

Output:
[4, 16, 36, 64, 100]

This example makes use of the optional if condition to make sure that the new value expression is only evaluated for certain elements from the original list, namely those for which the remainder of the division by 2 with the Python modulo operator % is zero. To show that this not only works with numbers, here is an example in which we use list comprehension to simply reduce a list of names to those names that start with the letter ‘M’ or the letter ‘N’:

names = [ 'Monica', 'John', 'Anne', 'Mike', 'Nancy', 'Peter', 'Frank', 'Mary' ]
namesFiltered = [ n for n in names if n.startswith('M') or n.startswith('N') ]
print(namesFiltered)

Output:
['Monica', 'Mike', 'Nancy', 'Mary']

This time, the original list is defined before the actual list comprehension rather than inside it as in the previous examples. We are also using a different variable name here (n) so that you can see that you can choose any name here but, of course, you need to use that variable name consistently directly after the for and in the condition following the if. The new value expression is simply n because we want to keep those elements from the original list that satisfy the condition unchanged. In the if condition, we use the string method startswith(…) twice connected by the logical or operator to check whether the respective name starts with letter ‘M’ or the letter ‘N’.

Surely, you are getting the general idea and how list comprehension provides a compact and elegant way to produce new lists from other lists by (a) applying the same operation to the elements of the original list and (b) optionally using a condition to filter the elements from the original list before this happens. The new value expression can be arbitrarily complex involving multiple operators as well as function calls. It is also possible to use several variables, either with each variable having its own list to iterate through corresponding to nested for-loops, or with a list of tuples as in the following example:

pairs = [ (21,23), (12,3), (3,11) ]
sums = [ x + y for x,y in pairs ]
print(sums)

Output:
[44, 15, 14]

With “for x,y in pairs” we here go through the list of pairs and for each pair, x will be assigned the first element of that pair and y the second element. Then these two variables will be added together based on the expression x + y and the result will become part of the new list. Often we find this form of a list comprehension used together with the zip(…) function from the Python standard library that takes two lists as parameters and turns them into a list of pairs. Let’s say we want to create a list that consists of the pairwise sums of corresponding elements from two input lists. We can do that as follows:

list1 = [ 1, 4, 32, 11 ]
list2 = [ 3, 2, 1, 99 ]

sums = [ x + y for x,y in zip(list1,list2) ]

print(sums)

Output:
[4, 6, 33, 110]

The expression zip(list1,list2) will produce the list of pairs [ (1,3), (4,2), (32,1), (11,99) ] from the two input lists and then the rest works in the same way as in the previous example.

Most of the examples of list comprehensions that you will encounter in the rest of this course will be rather simple and similar to the examples you saw in this section. We will also practice writing list comprehensions a bit in the practice exercises of this lesson. If you'd like to read more about them and see further examples, there are a lot of good tutorials and blogs out there on the web if you search for Python + list comprehension, like this List Comprehensions in Python page(link is external) for example. As a last comment, we focussed on list comprehension in this section but the same technique can also be applied to other Python containers such as dictionaries, for example. If you want to see some examples, check out the section on "Dictionary Comprehension" in this article here(link is external).

As provided in lesson 1 and 2, you can use list comprehension with arcpy methods and cursors as an elegant way to reduce for loop code.

# Create the idList by list comprehension and SearchCursor
idList = [row[0] for row in arcpy.da.SearchCursor(clipping_fc, [field])]

jobs = [(clipping_fc, tobeclipped, field, id) for id in idList]

fields = [f.name for f in arcpy.ListFields(fc)]

Comprehension can also be used to create dictionaries as demonstrated in lesson 2 when we showed you how you can convert a featureclass into a dictionaries. The code below, the construct is creating a dictionary of all fields in the row and assigning it to the objectid key in the avalache_dict dictionary.

with arcpy.da.SearchCursor(fc, fields) as sCur:
    for row in sCur:
        avalache_dict[row[fields.index('OBJECTID')]] = {k: v for k, v in zip(sCur.fields, row)}

Portions of lesson content developed by Jan Wallgrun and James O’Brien

Generators and Yield

Python Generators retrieve data from a source when needed instead of retrieving all data at once. Generators are useful when you need to manage the amount of memory you are using and behave like an iterator. Further definitions and explaination can be found in the Python documentation [4]. Let's look at an example:

def first_n(n):
	'''Build and return a list'''
	num, nums = 0, []
	while num < n:
	    nums.append(num)
	    num += 1
	return nums

sum_of_first_n = sum(first_n(1000000))

The function returns the whole list and holds it in memory. This can take valuable resources away from other geoprocessing tasks, such as dissolve. The benefit of Generators is that it will only return the item called before retrieving the next value. Creating an iteration as a Generator is simple as adding the keyword yield instead of return.

# a generator that yields items instead of returning a list
def firstn(n):
	num = 0
	while num < n:
	    yield num
	num += 1

sum_of_first_n = sum(firstn(1000000))

This may be useful when delaing with a lot of data where a process is performed on a result before doing the same to the next item. It is very beneficial if the pc you are using contains a small amount of RAM or there are multiple people competeing for resources in a networked environment.

Regular Expressions

Often you need to find a string or all strings that match a particular pattern among a given set of strings.

For instance, you may have a list of names of persons and need all names from that list whose last name starts with the letter ‘J’. Or, you want do something with all files in a folder whose names contain the sequence of numbers “154” and that have the file extension “.shp”. Or, you want to find all occurrences where the word “red” is followed by the word “green” with at most two words in between in a longer text.

Support for these kinds of matching tasks is available in most programming languages based on an approach for denoting string patterns that is called regular expressions.

A regular expression is a string in which certain characters like '.', '*', '(', ')', etc. and certain combinations of characters are given special meanings to represent other characters and sequences of other characters. Surely you have already seen the expression “*.txt” to stand for all files with arbitrary names but ending in “.txt”.

To give you another example before we approach this topic more systematically, the following regular expression “a.*b” in Python stands for all strings that start with the character ‘a’ followed by an arbitrary sequence of characters, followed by a ‘b’. The dot here represents all characters and the star stands for an arbitrary number of repetitions. Therefore, this pattern would, for instance, match the strings 'acb', 'acdb', 'acdbb', etc.

Regular expressions like these can be used in functions provided by the programming language that, for instance, compare the expression to another string and then determine whether that string matches the pattern from the regular expression or not. Using such a function and applying it to, for example, a list of person names or file names allows us to perform some task only with those items from the list that match the given pattern.

In Python, the package from the standard library that provides support for regular expressions together with the functions for working with regular expressions is simply called “re”. The function for comparing a regular expression to another string and telling us whether the string matches the expression is called match(...).

There are two websites whose purpose is to help developers write and test regex. These two are invaluable and I would recommend bookmarking them. https://regex101.com/ [5] is a great resource that provides a tester in multiple languages. Be sure to switch to Python if you use this one. The other is https://pythex.org/ [6], which is much more focused on Python and not as feature rich, but still helpful in creating the regex string.

Let’s create a small example to learn how to write regular expressions. In this example, we have a list of names in a variable called personList, and we loop through this list comparing each name to a regular expression given in variable pattern and print out the name if it matches the pattern.

import re
personList = [ 'Julia Smith', 'Francis Drake', 'Michael Mason',
'Jennifer Johnson', 'John Williams', 'Susanne Walker',
'Kermit the Frog', 'Dr. Melissa Franklin', 'Papa John',
'Walter John Miller', 'Frank Michael Robertson', 'Richard Robertson',
'Erik D. White', 'Vincent van Gogh', 'Dr. Dr. Matthew Malone',
'Rebecca Clark' ]

pattern = "John"

for person in personList:
	if re.match(pattern, person):
	    print(person)

Output:
  John Williams

Before we try out different regular expressions with the code above, we want to mention that the part of the code following the name list is better written in the following way:

pattern = "John"

compiledRE = re.compile(pattern)

for person in personList:
	if compiledRE.match(person):
	    print(person)

Whenever we call a function from the “re” module like match(…) and provide the regular expression as a parameter to that function, the function will do some preprocessing of the regular expression and compile it into some data structure that allows for matching strings to that pattern efficiently. If we want to match several strings to the same pattern, as we are doing with the for-loop here, it is more time efficient to explicitly perform this preprocessing and store the compiled pattern in a variable, and then invoke the match(…) method of that compiled pattern. In addition, explicitly compiling the pattern allows for providing additional parameters, e.g. when you want the matching to be done in a case-insensitive manner. In the code above, compiling the pattern happens in line 3 with the call of the re.compile(…) function and the compiled pattern is stored in variable compiledRE. Instead of the match(…) function, we now invoke the method match(…) of the compiled pattern object in variable person (line 6) that only needs one parameter, the string that should be matched to the pattern. Using this approach, the compilation of the pattern only happens once instead of once for each name from the list as in the first version.

One important thing to know about match(…) is that it always tries to match the pattern to the beginning of the given string but it allows for the string to contain additional characters after the entire pattern has been matched. That is the reason why when running the code above, the simple regular expression “John” matches “John Williams” but neither “Jennifer Johnson”, “Papa John”, nor “Walter John Miller”. You may wonder how you would then ever write a pattern that only matches strings that end in a certain sequence of characters. The answer is that Python's regular expressions use the special characters ^ and $ to represent the beginning or the end of a string and this allows us to deal with such situations as we will see a bit further below.

Now let’s have a look at the different special characters and some examples using them in combination with the name list code from above. Here is a brief overview of the characters and their purpose:

Special Characters and Their Purpose
Character	Purpose
.	stands for a single arbitrary character
[ ]	are used to define classes of characters and match any character of that class
( )	are used to define groups consisting of multiple characters in a sequence
+	stands for arbitrarily many repetitions of the previous character or group but at least one occurrence
*	stands for arbitrarily many repetitions of the previous character or group including no occurrence
?	stands for zero or one occurrence of the previous character or group, so basically says that the character or group is optional
{m,n}	stands for at least m and at most n repetitions of the previous group where m and n are integer numbers
^	stands for the beginning of the string
$	stands for the end of the string
\|	stands between two characters or groups and matches either only the left or only the right character/group, so it is used to define alternatives
\	is used in combination with the next character to define special classes of characters

Since the dot stands for any character, the regular expression “.u” can be used to get all names that have the letter ‘u’ as the second character. Give this a try by using “.u” for the regular expression in line 1 of the code from the previous example.

pattern = ".u"

The output will be:
  Julia Smith
  Susanne Walker

Similarly, we can use “..cha” to get all names that start with two arbitrary characters followed by the character sequence resulting in “Michael Mason” and “Richard Robertson” being the only matches. By the way, it is strongly recommended that you experiment a bit in this section by modifying the patterns used in the examples. If in some case you don’t understand the results you are getting, feel free to post this as a question on the course forums.

Maybe you are wondering how one would use the different special characters in the verbatim sense, e.g. to find all names that contain a dot. This is done by putting a backslash in front of them, so \. for the dot, \? for the question mark, and so on. If you want to match a single backslash in a regular expression, this needs to be represented by a double backslash in the regular expression. However, one has to be careful here when writing this regular expression as a string literal in the Python code because of the string escaping mechanism, a sequence of two backslashes will only produce a single backslash in the string character sequence. Therefore, you actually have to use four backslashes, "xyz\\\\xyz" to produce the correct regular expression involving a single backslash. Or you use a raw string in which escaping is disabled, so r"xyz\\xyz". Here is one example that uses \. to search for names with a dot as the third character returning “Dr. Melissa Franklin” and “Dr. Dr. Matthew Malone” as the only results:

pattern = "..\."

Next, let us combine the dot (.) with the star (*) symbol that stands for the repetition of the previous character. The pattern “.*John” can be used to find all names that contain the character sequence “John”. The .* at the beginning can match any sequence of characters of arbitrary length from the . class (so any available character). For Instance, for the name “Jennifer Johnson”, the .* matches the sequence “Jennifer “ produced from nine characters from the . class and since this is followed by the character sequence “John”, the entire name matches the regular expression.

pattern = ".*John"

Output:

  Jennifer Johnson
  John Williams
  Papa John
  Walter John Miller

Please note that the name “John Williams” is a valid match because the * also includes zero occurrences of the preceding character, so “.*John” will also match “John” at the beginning of a string.

The dot used in the previous examples is a special character for representing an entire class of characters, namely any character. It is also possible to define your own class of characters within a regular expression with the help of the squared brackets. For instance, [abco] stands for the class consisting of only the characters ‘a’, ‘b’,‘c’ and ‘o’. When it is used in a regular expression, it matches any of these four characters. So the pattern “.[abco]” can, for instance, be used to get all names that have either ‘a’, ‘b’, ‘c’ or ‘o’ as the second character. This means using ...

pattern = ".[abco]"

... we get the output:

John Williams
  Papa John
  Walter John Miller

When defining classes, we can make use of ranges of characters denoted by a hyphen. For instance, the range m-o stands for the lower-case characters ‘m’, ‘n’, ‘o’ . The class [m-oM-O.] would then consist of the characters ‘m’, ‘n’, ‘o’, ‘M’, ‘N’, ‘O’, and ‘.’ . Please note that when a special character appears within the squared brackets of a class definition (like the dot in this example), it is used in its verbatim sense. Try out this idea of using ranges with the following example:

pattern = "......[m-oM-O.]"

The output will be...

Papa John
  Frank Michael Robertson
  Erik D. White
  Dr. Dr. Matthew Malone

… because these are the only names that have a character from the class [m-oM-O.] as the seventh character.

In addition to the dot, there are more predefined classes of characters available in Python for cases that commonly appear in regular expressions. For instance, these can be used to match any digit or any non-digit. Predefined classes are denoted by a backslash followed by a particular character, like \d for a single decimal digit, so the characters 0 to 9. The following table lists the most important predefined classes:

Predefined Character Classes
Predefined class	Description
\d	stands for any decimal digit 0…9
\D	stands for any character that is not a digit
\s	stands for any whitespace character (whitespace characters include the space, tab, and newline character)
\S	stands for any non-whitespace character
\w	stands for any alphanumeric character (alphanumeric characters are all Latin letters a-z and A-Z, Arabic digits 0…9, and the underscore character)
\W	stands for any non-alphanumeric character

To give one example, the following pattern can be used to get all names in which “John” appears not as a single word but as part of a longer name (either first or last name). This means it is followed by at least one character that is not a whitespace which is represented by the \S in the regular expression used. The only name that matches this pattern is “Jennifer Johnson”.

pattern = ".*John\S"

In addition to the *, there are more special characters for denoting certain cases of repetitions of a character or a group. + stands for arbitrarily many occurrences but, in contrast to *, the character or group needs to occur at least once. ? stands for zero or one occurrence of the character or group. That means it is used when a character or sequence of characters is optional in a pattern. Finally, the most general form {m,n} says that the previous character or group needs to occur at least m times and at most n times.

If we use “.+John” instead of “.*John” in an earlier example, we will only get the names that contain “John” but preceded by one or more other characters.

pattern = ".+John"

Output:

  Jennifer Johnson
  Papa John
  Walter John Miller

By writing ...

pattern = ".{11,11}[A-Z]"

... we get all names that have an upper-case character as the 12th character. The result will be “Kermit the Frog”. This is a bit easier and less error-prone than writing “………..[A-Z]”.

Lastly, the pattern “.*li?a” can be used to get all names that contain the character sequences ‘la’ or ‘lia’.

pattern = ".*li?a"

Output:

  Julia Smith
  John Williams
  Rebecca Clark

So far we have only used the different repetition matching operators *, +, {m,n}, and ? for occurrences of a single specific character. When used after a class, these operators stand for a certain number of occurrences of characters from that class. For instance, the following pattern can be used to search for names that contain a word that only consists of lower-case letters (a-z) like “Kermit the Frog” and “Vincent van Gogh”. We use \s to represent the required whitespaces before and after the word and then [a-z]+ for an arbitrarily long sequence of lower-case letters but consisting of at least one letter.

pattern = ".*\s[a-z]+\s"

Sequences of characters can be grouped together with the help of parentheses (…) and then be followed by a repetition operator to represent a certain number of occurrences of that sequence of characters. For instance, the following pattern can be used to get all names where the first name starts with the letter ‘M’ taking into account that names may have a ‘Dr. ’ as prefix. In the pattern, we use the group (Dr.\s) followed by the ? operator to say that the name can start with that group but doesn’t have to. Then we have the upper-case M followed by .*\s to make sure there is a white space character later in the string so that we can be reasonably sure this is the first name.

pattern = "(Dr.\s)?M.*\s"

Output:

  Michael Mason
  Dr. Melissa Franklin

You may have noticed that there is a person with two doctor titles in the list whose first name also starts with an ‘M’ and that it is currently not captured by the pattern because the ? operator will match at most one occurrence of the group. By changing the ? to a * , we can match an arbitrary number of doctor titles.

pattern = "(Dr.\s)*M.*\s"

Output:

  Michael Mason
  Dr. Melissa Franklin
  Dr. Dr. Matthew Malone

Similarly to how we have the if-else statement to realize case distinctions in addition to loop based repetitions in normal Python, regular expression can make use of the | character to define alternatives. For instance, (nn|ss) can be used to get all names that contain either the sequence “nn” or the sequence “ss” (or both):

pattern = ".*(nn|ss)"

Output:

  Jennifer Johnson
  Susanne Walker
  Dr. Melissa Franklin

As we already mentioned, ^ and $ represent the beginning and end of a string, respectively. Let’s say we want to get all names from the list that end in “John”. This can be done using the following regular expression:

pattern = ".*John$"

Output:
    Papa John

Here is a more complicated example. We want all names that contain “John” as a single word independent of whether “John” appears at the beginning, somewhere in the middle, or at the end of the name. However, we want to exclude cases in which “John” appears as part of longer word (like “Johnson”). A first idea could be to use ".*\sJohn\s" to achieve this making sure that there are whitespace characters before and after “John”. However, this will match neither “John Williams” nor “Papa John” because the beginning and end of the string are not whitespace characters. What we can do is use the pattern "(^|.*\s)John" to say that John needs to be preceded either by the beginning of the string or an arbitrary sequence of characters followed by a whitespace. Similarly, "John(\s|$)" requires that John be succeeded either by a whitespace or by the end of the string. Taken together we get the following regular expressions:

pattern = "(^|.*\s)John(\s|$)"

Output:

  John Williams
  Papa John
  Walter John Miller

An alternative would be to use the regular expression "(.*\s)?John(\s.*)?$" That uses the optional operator ? rather than | . There are often several ways to express the same thing in a regular expression. Also, as you start to see here, the different special matching operators can be combined and nested to form arbitrarily complex regular expressions. You will practice writing regular expressions like this a bit more in the practice exercises and in the homework assignment.

In addition to the main special characters we explained in this section, there are certain extension operators available denoted as (?x...) where the x can be one of several special characters determining the meaning of the operator. Here we just briefly want to mention the operator (?!...) for negative lookahead assertion because we will use it later in the lesson's walkthrough to filter files in a folder. A negative lookahead extension means that what comes before the (?!...) can only be matched if it isn't followed by the expression given for the ... . For instance, if we want to find all names that contain John but are not followed by "son" as in "Johnson", we could use the following expression:

pattern = ".*John(?!son)"

Output:

  John Williams
  Papa John
  Walter John Miller

If match(…) does not find a match, it will return the special value None. That’s why we can use it with an if-statement as we have been doing in all the previous examples. However, if a match is found it will not simply return True but a match object that can be used to get further information, for instance about which part of the string matched the pattern. The match object provides the methods group() for getting the matched part as a string, start() for getting the character index of the starting position of the match, end() for getting the character index of the end position of the match, and span() to get both start and end indices as a tuple. The example below shows how one would use the returned matching object to get further information and the output produced by its four methods for the pattern “John” matching the string “John Williams”:

pattern = "John"
compiledRE = re.compile(pattern)

for person in personList:
    match = compiledRE.match(person)
    if match:
        print(match.group())
        print(match.start())
        print(match.end())
        print(match.span())

Output:

  John <- output of group()
  0 <- output of start()
  4 <- output of end()
  (0,4) <- output of span()

In addition to match(…), there are three more matching functions defined in the re module. Like match(…), these all exist as standalone functions taking a regular expression and a string as parameters, and as methods to be invoked for a compiled pattern. Here is a brief overview:

search(…) - In contrast to match(…), search(…) tries to find matching locations anywhere within the string not just matches starting at the beginning. That means “^John” used with search(…) corresponds to “John” used with match(…), and “.*John” used with match(…) corresponds to “John” used with search(…). However, “corresponds” here only means that a match will be found in exactly the same cases but the output by the different methods of the returned matching object will still vary.
findall(…) - In contrast to match(…) and search(…), findall(…) will identify all substrings in the given string that match the regular expression and return these matches as a list.
finditer(…) – finditer(…) works like findall(…) but returns the matches found not as a list but as a so-called iterator object.

By now you should have enough understanding of regular expressions to cover maybe ~80 to 90% of the cases that you encounter in typical programming. However, there are quite a few additional aspects and details that we did not cover here that you potentially need when dealing with rather sophisticated cases of regular-expression-based matching. The full documentation of the “re” package [7] is always a good source for looking up details when needed. In addition, this HOWTO [8] provides a good overview.

We also want to mention that regular expressions are very common in programming and matching with them is very efficient, but they do have certain limitations in their expressivity. For instance, it is impossible to write a regular expression for names with the first and last name starting with the same character. Or, you cannot define a regular pattern for all strings that are palindromes, so words that read the same forward and backward. For these kinds of patterns, certain extensions to the concept of a regular expression are needed. One generalization of regular expressions are what are called recursive regular expressions. The regex [9] Python package currently under development, backward compatible to re, and planned to replace re at some point, has this capability, so feel free to check it out if you are interested in this topic.

Lesson content developed by Jan Wallgrun and James O’Brien

Python Package Management

There are many Python packages available for use, and there are a couple of different ways to effectively manage (install, uninstall, update) packages. The two package managers that are commonly used are pip and conda. In the following sections, we will discuss each of them in more detail. At the end of the section, we will discuss the merits of the two tools and make recommendations for their use.

We will be doing some more complicated technical "stuff" here so the steps might not work as planned because everyone’s PC is configured a little differently. For this section, it is probbaly best to just review it and know that if you needed to perform these steps, you can refer to the process to get you started. Anaconda environments can take days to install and get working. Installation of Anaconda is not required, but included for review and if you want to try it, I suggest referring to it at a later date.

pip

As already mentioned, pip is a Python package manager. It allows for an easier install, uninstall and update of packages. Pip comes installed with Python, and if you have multiple versions of Python you will have a different version of pip for each. To make sure we are using the version of pip that comes installed with ArcGIS Pro, we will go to the directory where pip is installed. Go to the Windows Start Menu and open the Python Command Prompt as before.

In the command window that now opens, you will again be located in the default Python environment folder of your ArcGIS Pro installation. For newer versions of Pro this will be C:\Users\<username>\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone\. Pip is installed in the Scripts subfolder of that location, so type in:

cd Scripts

Now you can run a command to check that pip is in the directory – type in:

dir pip.*

The resulting output will show you all occurrences of files that start with pip. in the current folder, in this case, there is only one file found – pip.exe.

Figure 2.30 Files that Start with "pip"

Next, let’s run our first pip command, type in:

pip --version

The output shows you the current version of pip. Pip allows you to see what packages have been installed. To look at the list type in:

pip list

The output will show (Figure 31) the list of packages and their respective versions.

Figure 2.31 Package Versions

To install a package, you run the pip command with the install option and provide the name of the package, for example, try:

pip install numpy

Pip will run for a few seconds and show you a progress bar as it is searching for the numpy package online and installing it. When you run pip install, the packages are loaded from an online repository named PyPI, short for Python Package Index. You can browse available packages at Python's Package Index page(link is external). If the installation has been successful you will see a message stating the same, which you can confirm by running pip list again.

In order to find out if any packages are outdated you can run the pip list with the outdated option:

pip list –-outdated

If you find that there are packages you want to update, you run the install with the upgrade option, for example:

pip install numpy –-upgrade

This last command will either install a newer version of numpy or inform you that you already have the latest version installed.

If you wanted to uninstall a package you would run pip with the uninstall option, for example:

pip uninstall numpy

You will be asked to confirm that you want the package uninstalled, and, if you do (better not to do this or you will have to install the package again!), the package will be removed.

The packages installed with pip are placed in the Lib\site-packages folder of the Python environment you are using. You will recall that that was one of the search locations Python uses in order to find the packages you import.

Lesson content developed by Jan Wallgrun and James O’Brien

Conda and Anaconda

Another option for packaging and distributing your Python programs is to use conda. Just like pip, it is a package manager. In addition, it is also an environment manager. What that means is that you can use conda to create virtual environments for Python, while specifying the packages you want to have available in that environment. Conda comes installed with ArcGIS Pro. We can double check that it is by opening the Python Command Prompt and then typing in:

cd Scripts

followed by:

conda –-version

The output should show the conda version.

In order to find out what packages are installed type in:

conda list

Your output should look something like Figure 2.34:

Figure 2.34 Conda Package List

The first column shows the package name, the second the version of the package. The third column provides clues on how the package was installed. You will see that for some of the packages installed, Esri is listed, showing they are related to the Esri installation. The list option of conda is useful, not only to find out if the package you need is already installed but also to confirm that you have the appropriate version.

Conda has the functionality to create different environments. Think of an environment as a sandbox – you can set up the environment with a specific Python version and different packages. That allows you to work in environments with different packages and Python versions without affecting other applications. The default environment used by conda is called base environment. We do not need to create a new environment, but, should you need to, the process is simple – here is an example:

conda create -n gisenv python=3.6 arcpy numpy

the –n flag is followed by the name of the environment (in this case gisenv), then you would choose the Python version which matches the one you already have installed (3.5, 3.6 etc.) and follow that up with a list of packages you want to add to it. If you later find out you need other packages to be added, you could use the install option of conda, for example:

conda install –n gisenv matplotlib

To activate an environment, you would run:

activate gisenv

And to deactivate an environment, simply:

deactivate

Full reference to the command line arguments and flags [10] are helpful.

To see the commands usage, you access the documentation through the command: conda env [-h] command ... as

conda env –h

In the conda prompt.

Output:
positional arguments:
{create,export,list,remove,update,config}
create Create an environment based on an environment definition file. If using an environment.yml file (the default), you can name the environment in the first line of the file with 'name: envname' or you can specify the environment name in the CLI command using the -n/--name argument. The name specified in the CLI will override the name specified in the environment.yml file. Unless you are in the directory containing the environment definition file, use -f to specify the file path of the environment definition file you want to use.
export Export a given environment list List the Conda environments remove Remove an environment update Update the current environment based on environment file config Configure a conda environment

optional arguments:
-h, --help Show this help message and exit.

Once you know the command you want to use, you can add the parameter and view its help.
conda env export –h

Export a given environment

Options:
optional arguments:
-h, --help Show this help message and exit.
-c CHANNEL, --channel CHANNEL

Additional channel to include in the export
--override-channels Do not include .condarc channels
-f FILE, --file FILE
--no-builds Remove build specification from dependencies
--ignore-channels Do not include channel names with package names.
--from-history Build environment spec from explicit specs in history

Target Environment Specification:
-n ENVIRONMENT, --name ENVIRONMENT

Name of environment.
-p PATH, --prefix PATH

Full path to environment location (i.e. prefix).

Output, Prompt, and Flow Control Options:
--json Report all output as json. Suitable for using conda programmatically.
-v, --verbose Use once for info, twice for debug, three times for trace.
-q, --quiet Do not display progress bar.

examples:
conda env export
conda env export --file SOME_FILE

Once we have made sure that everything is working correctly in this new environment and we built our export command, we can export a YAML file using the command:

conda env export > “C:\Users\...\Documents\ArcGIS\AC37.yml”

This creates a file that contains the name, channel and version of each package in the environment and can be used to create or restore an environment to this state. It also includes packages that were installed via pip, and will attempt to install them through conda first when the file is used to create the environment.

It is important to note that using pip can break a conda environment which results in conda not being able to resolve dependencies. There are times when packages are only found in pip and for those instances, you can create the conda environment first, and then use pip. This is not guaranteed and could take several tries to find the right package version numbers that work nicely together.

There are other options you can use with environments – a great resource for the different options is Conda's Managing Environments page

Lesson content developed by Jan Wallgrun and James O’Brien

Python Packages for (spatial) Data Science

It would be impossible to introduce or even just list all the packages available for conducting spatial data analysis projects in Python here, so the following is just a small selection of those that we consider most important.

numpy

numpy Python numpy page(link is external) [11], Wikipedia numpy page(link is external) [12] stands for “Numerical Python” and is a library that adds support for efficiently dealing with large and multi-dimensional arrays and matrices to Python together with a large number of mathematical operations to apply to these arrays, including many matrix and linear algebra operations. Many other Python packages are built on top of the functionality provided by numpy.

matplotlib

matplotlib Python matplotlib page(link is external) [13], Wikipedia matplot page(link is external) [14] is an example of a Python library that builds on numpy. Its main focus is on producing plots and embedding them into Python applications. Take a quick look at its Wikipedia page to see a few examples of plots that can be generated with matplotlib. We will be using matplotlib a few times in this lesson’s walkthrough to quickly create simple map plots of spatial data.

SciPy

SciPy Python SciPy page(link is external) [15], Wikipedia SciPy page(link is external) [16] is a large Python library for application in mathematics, science, and engineering. It is built on top of both numpy and matplotlib, providing methods for optimization, integration, interpolation, signal processing and image processing. Together numpy, matplotlib, and SciPy roughly provide a similar functionality as the well known software Matlab. While we won’t be using SciPy in this lesson, it is definitely worth checking out if you're interested in advanced mathematical methods.

pandas

pandas Python pandas page(link is external) [17], Wikipedia pandas software page(link is external) [18] provides functionality to efficiently work with tabular data, so-called data frames, in a similar way as this is possible in R. Reading and writing tabular data, e.g. to and from .csv files, manipulating and subsetting data frames, merging and joining multiple data frames, and time series support are key functionalities provided by the library. A more detailed overview on pandas will be given in the upcoming section.

Shapely

Shapely Python Shapely page(link is external) [19], Shapely User Manual(link is external) [20] adds the functionality to work with planar geometric features in Python, including the creation and manipulation of geometries such as points, polylines, and polygons, as well as set-theoretic analysis capabilities (intersection, union, …). It is based on the widely used GEOS(link is external) library, the geometry engine that is used in PostGIS(link is external), which in turn is based on the Java Topology Suite(link is external) (JTS) and largely follows the OGC’s Simple Features Access Specification(link is external).

geopandas

geopandas Python geopandas page(link is external) [21], GeoPandas page(link is external) [22] combines pandas and Shapely to facilitate working with geospatial vector data sets in Python. While we will mainly use it to create a shapefile from Python, the provided functionality goes significantly beyond that and includes geoprocessing operations, spatial join, projections, and map visualizations.

GDAL/OGR

GDAL/OGR Python GDAL page(link is external) [23], GDAL/OGR Python(link is external) [24] is a powerful library for working with GIS data in many different formats widely used from different programming languages. Originally, it consisted of two separated libraries, GDAL (‘Geospatial Data Abstraction Library‘) for working with raster data and OGR (used to stand for ‘OpenGIS Simple Features Reference Implementation’) for working with vector data, but these have now been merged. The gdal Python package provides an interface to the GDAL/OGR library written in C++.

ArcGIS API for Python

As we already mentioned in the last lesson, Esri provides its own Python API ArcGIS for Python page(link is external) [25] for working with maps and GIS data via their ArcGIS Online and Portal for ArcGIS web platforms. The API allows for conducting administrative tasks, performing vector and raster analyses, running geocoding tasks, creating map visualizations, and more. While some services can be used autonomously, many are tightly coupled to Esri’s web platforms and you will at least need a free ArcGIS Online account. The Esri API for Python will be further discussed in Lesson 4.

Lesson content developed by Jan Wallgrun and James O’Brien

Pandas and the Manipulation of Tabular Data

The pandas package provides high-performance data structures and analysis tools, in particular for working with tabular data based on a fast and efficient implementation of a data frame class. It also allows for reading and writing data from/to various formats including CSV and Microsoft Excel. In this section, we show you some examples illustrating how to perform the most important data frame related operations with pandas. Again, we can only scratch the surface of the functionality provided by pandas here. Resources provided at the end will allow you to dive deeper if you wish to do so.

Creating a New Dataframe

In our examples, we will be using pandas in combination with the numpy package, the package that provides many fundamental scientific computing functionalities for Python and that many other packages are built on. So we start by importing both packages:

import pandas as pd
import numpy as np

A data frame consists of cells that are organized into rows and columns. The rows and columns have names that serve as indices to access the data in the cells. Let us start by creating a data frame with some random numbers that simulate a time series of different measurements (columns) taken on consecutive days (rows) from January 1, 2017 to January 6, 2017. The first step is to create a pandas series of dates that will serve as the row names for our data frame. For this, we use the pandas function date_range(…):

dates = pd.date_range('20170101' , periods=6, freq='D')
dates

The first parameter given to date_range is the starting date. The ‘periods’ parameter tells the function how many dates we want to generate, while we use ‘freq’ to tell it that we want a date for every consecutive day. If you look at the output from the second line we included, you will see that the object returned by the function is a DatetimeIndex object which is a special class defined in pandas.

Next, we generate random numbers that should make up the content of our date frame with the help of the numpy function randn(…) for creating a set of random numbers that follow a standard normal distribution:

numbers = np.random.randn(6,4)
numbers

The output is a two-dimensional numpy array of random numbers normally distributed around 0 with 4 columns and 6 rows. We create a pandas data frame object from it with the following code:

df = pd.DataFrame(numbers, index=dates, columns=['m1', 'm2', 'm3', 'm4'])
df

Note that we provide our array of random numbers as the first parameter, followed by the DatetimeIndex object we created earlier for the row index. For the columns, we simply provide a list of the names with ‘m1’ standing for measurement 1, ‘m2’ standing for measurement 2, and so on. Please also note how the resulting data frame is displayed as a nicely formatted table in your Jupyter Notebook because it makes use of IPython widgets. Please keep in mind that because we are using random numbers for the content of the cells, the output produced by commands used in the following examples will look different in your notebook because the numbers are different.

Lesson content developed by Jan Wallgrun and James O’Brien

Subsetting and changing cell values

Now that we have a data frame object available, let’s quickly go through some of the basic operations that one can perform on a data frame to access or modify the data.

The info() method can be used to display some basic information about a data frame such as the number of rows and columns and data types of the columns:

df.info()

The output tells us that we have four columns, all for storing floating point numbers, and each column has 6 rows with values that are not null. If you ever need the number of rows and columns, you can get them by applying the len(…) operation to the data frame and to the columns property of the data frame:

print(len(df)) # gives you the number of rows
print(len(df.columns)) # gives you the number of columns

We can use the head(…) and tail(…) methods to get only the first or last n rows of a data frame:

firstRows = df.head(2)
print(firstRows)
lastRows = df.tail(2)
print(lastRows)

We can also just get a subset of consecutive rows by applying slicing to the data frame similar to how this can be done with lists or strings:

someRows = df[3:5]    # gives you the 4th and 5th row
print(someRows)

This operation gives us rows 4 and 5 (those with index 3 and 4) from the original data frame because the second number used is the index of the first row that will not be included anymore.

If we are just interested in a single column, we can get it based on its name:

thirdColumn = df.m3
print(thirdColumn)

The same can be achieved by using the notation df['m3'] instead of df.m3 in the first line of code. Moreover, instead of using a single column name, you can use a list of column names to get a data frame with just these columns and in the specified order:

columnsM3andM2 = df[ ['m3', 'm2'] ]
columnsM3andM2

Table 3.1 Data frame with swapped columns

m3 m2
2017-01-01 0.510162 0.163613
2017-01-02 0.025050 0.056027
2017-01-03 -0.422343 -0.840010
2017-01-04 -0.966351 -0.721431
2017-01-05 -1.339799 0.655267
2017-01-06 -1.160902 0.192804

The column subsetting and row slicing operations shown above can be concatenated into a single expression. For instance, the following command gives us columns ‘m3’ and ‘m2’ and only the rows with index 3 and 4:

someSubset = df[['m3', 'm2']][3:5]
someSubset

The order here doesn’t matter. We could also have written df[3:5][['m3', 'm2']] . The most flexible methods for creating subsets of data frame are via the loc and .iloc index properties of a data frame. .iloc[…] is based on the numerical indices of the columns and rows. Here is an example:

someSubset = df.iloc[2:4,1:3]
print(someSubset)

The part before the comma determines the rows (rows with indices 2 and 3 in this case) and the part after the comma, the columns (columns with indices 1 and 2 in this case). So we get a data frame with the 3rd and 4th rows and 2nd and 3rd columns of the original data frame. Instead of slices we can also use lists of row and column indices to create completely arbitrary subsets. For instance, using iloc in the following example

someSubset = df.iloc[ [0,2,4], [1,3] ]
print(someSubset)

...gives us a data frame with the 1st, 3rd, and 5th row and 2nd and 4th column of the original dataframe. Both the part before the comma and after the comma can just be a colon symbol (:) in which case all rows/columns will be included. For instance,

allRowsSomeColumns = df.iloc[ : , [1,3] ]
print(allRowsSomeColumns)

...will give you all rows but only the 2nd of 4th column.

In contrast to iloc, loc doesn’t use the row and column numbers but instead is based on their labels, while otherwise working in the same way as iloc. The following command produces the same subset of the 1st, 3rd, and 5th rows and 2nd and 4th columns as the iloc code from two examples above:

someSubset = df.loc[ [pd.Timestamp('2017-01-01'), pd.Timestamp('2017-01-03'), pd.Timestamp('2017-01-05')] , ['m2','m4'] ]
print(someSubset)

Please note that, in this example, the list for the column names at the very end is simply a list of strings but the list of dates for the row names has to consist of pandas Timestamp objects. That is because we used a DatetimeIndex for the rows when we created the original data frame. When a data frame is displayed, the row names show up as simple strings but they are actually Timestamp objects. However, a DatetimeIndex for the rows has many advantages; for instance, we can use it to get all rows for dates that are from a particular year, e.g. with df.loc['2017' , ['m2', 'm4'] ]

...to get all dates from 2017 which, of course, in this case, are all rows. Without going into further detail here, we can also get all dates from a specified time period, etc. The methods explained above for accessing subsets of a data frame can also be used as part of an assignment to change the values in one or several cells. In the simplest case, we can change the value in a single cell, for instance with

df.iloc[0,0] = 0.17

...or

df.loc['2017-01-01', 'm1'] = 0.17

...to change the value of the top left cell to 0.17. Please note that this operation will change the original data frame, not create a modified copy. So if you now print out the data frame with:

print(df)

you will see the modified value for 'm1' of January 1, 2017. Even more importantly, if you have used the operations explained above for creating a subset, the data frame with the subset will still refer to the original data, so changing a cell value will change your original data. If you ever want to make changes but keep the original data frame unchanged, you need to explicitly make a copy of the data frame by calling the copy() method as in the following example:

dfCopy = df.copy()
dfCopy.iloc[0,0] = 0
print(df)
print(dfCopy)

Check out the output and compare the top left value for both data frames. The data frame in df still has the old value of 0.17, while the value will be changed to 0 in dfCopy. Without using the copy() method in the first line, both variables would still refer to the same underlying data and both would show the 0 value. Here is another slightly more complicated example where we change the values for the first column of the 1st and 5th rows to 1.2 (intentionally modifying the original data):

df.iloc[ [0,4] , [0] ] = 1.2
print(df)

If you ever need to explicitly go through the rows in a data frame, you can do this with a for-loop that uses the itertuples(…) method of the data frame. itertuples(…) gives you an object that can be used to iterate through the rows as named tuples, meaning each element in the tuple is labeled with the respective column name. By providing the parameter index=False to the method, we are saying that we don’t want the row name to be part of the tuple, just the cell values for the different columns. You can access the elements of the tuple either via their index or via the column name:

for row in df.itertuples(index=False):
  print(row)# print entire row tuple
  print(row[0]) # print value from column with index 0
  print(row.m2) # print value from column with name m2
  print('----------')

Try out this example and have a look at the named tuple and the output produced by the other two print statements.

Lesson content developed by Jan Wallgrun and James O’Brien

Sorting

Pandas also provides operations for sorting the rows in a data frame. The following command can be used to sort our data frame by the values in the ‘m2’ column in decreasing order:

  dfSorted = df.sort_values(by='m2', ascending=False)
  dfSorted

m1 m2 m3 m4 m5
  2017-01-05 1.200000 0.655267 -1.339799 1.075069 -0.236980
  2017-01-06 0.192804 0.192804 -1.160902 0.525051 -0.412310
  2017-01-01 1.200000 0.163613 0.510162 0.628612 0.432523
  2017-01-02 0.056027 0.056027 0.025050 0.283586 -0.123223
  2017-01-04 -0.721431 -0.721431 -0.966351 -0.380911 0.001231
  2017-01-03 -0.840010 -0.840010 -0.422343 1.022622 -0.231232

The ‘by’ argument specifies the column that the sorting should be based on and, by setting the ‘ascending’ argument to False, we are saying that we want the rows to be sorted in descending rather than ascending order. It is also possible to provide a list of column names for the ‘by’ argument, to sort by multiple columns. The sort_values(...) method will create a new copy of the data frame, so modifying any cells of dfSorted in this example will not have any impact on the data frame in variable df.

Lesson content developed by Jan Wallgrun and James O’Brien

Adding / Removing Columns and Rows

Adding a new column to a data frame is very simple when you have the values for that column ready in a list. For instance, in the following example, we want to add a new column ‘m5’ with additional measurements and we already have the numbers stored in a list m5values that is defined in the first line of the example code. To add the column, we then simply make an assignment to df['m5'] in the second line. If a column ‘m5’ would already exist, its values would now be overwritten by the values from m5values. But since this is not the case, a new column gets added under the name ‘m5’ with the values from m5values.

m5values = [0.432523, -0.123223, -0.231232, 0.001231, -0.23698, -0.41231]
df['m5'] = m5values
df

m1 m2 m3 m4 m5 
2017-01-01 1.200000 0.163613 0.510162 0.628612 0.432523 
2017-01-02 0.056027 0.056027 0.025050 0.283586 -0.123223 
2017-01-03 -0.840010 -0.840010 -0.422343 1.022622 -0.231232 
2017-01-04 -0.721431 -0.721431 -0.966351 -0.380911 0.001231 
2017-01-05 1.200000 0.655267 -1.339799 1.075069 -0.236980 
2017-01-06 0.192804 0.192804 -1.160902 0.525051 -0.412310

For adding new rows, we can simply make assignments to the rows selected via the loc operation, e.g. we could add a new row for January 7, 2017 by writing

df.loc[pd.Timestamp('2017-01-07'),:] = [ ... ]

where the part after the equal sign is a list of five numbers, one for each of the columns. Again, this would replace the values in the case that there already is a row for January 7. The following example uses this idea to create new rows for January 7 to 9 using a for loop:

for i in range(7,10):
  df.loc[ pd.Timestamp('2017-01-0'+str(i)),:] = [ np.random.rand() for j in range(5) ]
  df

m1 m2 m3 m4 m5
2017-01-01 1.200000 0.163613 0.510162 0.628612 0.432523
2017-01-02 0.056027 0.056027 0.025050 0.283586 -0.123223
2017-01-03 -0.840010 -0.840010 -0.422343 1.022622 -0.231232
2017-01-04 -0.721431 -0.721431 -0.966351 -0.380911 0.001231
2017-01-05 1.200000 0.655267 -1.339799 1.075069 -0.236980
2017-01-06 0.192804 0.192804 -1.160902 0.525051 -0.412310
2017-01-07 0.768633 0.559968 0.591466 0.210762 0.610931
2017-01-08 0.483585 0.652091 0.183052 0.278018 0.858656
2017-01-09 0.909180 0.917903 0.226194 0.978862 0.751596

In the body of the for loop, the part on the left of the equal sign uses loc(...) to refer to a row for the new date based on loop variable i, while the part on the right side simply uses the numpy rand() method inside a list comprehension to create a list of five random numbers that will be assigned to the cells of the new row.

If you ever want to remove columns or rows from a data frame, you can do so by using df.drop(...). The first parameter given to drop(...) is a single column or row name or, alternatively, a list of names that should be dropped. By default, drop(...) will consider these as row names. To indicate these are column names that should be removed, you have to specify the additional keyword argument axis=1 . We will see an example of this in a moment.

Lesson content developed by Jan Wallgrun and James O’Brien

Joining Dataframes

The following short example demonstrates how pandas can be used to merge two data frames based on a common key, so to perform a join operation in database terms. Let’s say we have two tables, one listing capitals of states in the U.S. and one listing populations for each state. For simplicity we just define data frames for these with just entries for two states, Washington and Oregon:

df1 = pd.DataFrame( {'state': ['Washington', 'Oregon'], 'capital': ['Olympia', 'Salem']} )
print(df1)
df2 = pd.DataFrame( {'name': ['Washington', 'Oregon'], 'population':[7288000, 4093000]} )
print(df2)

The two data frames produced by this code look like this: Table 3.5 Data frame 1 (df1) listing states and state capitals

capital state
0 Olympia Washington
1 Salem Oregon

name population
0 Washington 7288000
1 Oregon 4093000

We here use a new way of creating a data frame, namely from a dictionary that has entries for each of the columns where the keys are the column names (‘state’ and ‘capital’ in the case of df1, and ‘name’ and ‘population’ in case of df2) and the values stored are lists of the values for that column. We now invoke the merge(...) method on df1 and provide df2 as the first parameter meaning that a new data frame will be produced by merging df1 and df2. We further have to specify which columns should be used as keys for the join operation. Since the two columns containing the state names are called differently, we have to provide the name for df1 through the ‘left_on’ argument and the name for df2 through the ‘right_on’ argument.

merged = df1.merge(df2, left_on='state', right_on='name')
merged

The joined data frame produced by the code will look like this: Table 3.7 Joined data frame

capital state name population
0 Olympia Washington Washington 7288000
1 Salem Oregon Oregon 4093000

As you see, the data frames have been merged correctly. However, we do not want two columns with the state names, so, as a last step, we remove the column called ‘name’ with the previously mentioned drop(...) method. As explained, we have to use the keyword argument axis=1 to indicate that we want to drop a column, not a row.

newMerged = merged.drop('name', axis=1)
newMerged

Result: Joined data frame after dropping the 'name' column

capital state population
0 Olympia Washington 7288000
1 Salem Oregon 4093000

If you print out variable merged, you will see that it still contains the 'name' column. That is because drop(...) doesn't change the original data frame but rather produces a copy with the column/row removed.

Lesson content developed by Jan Wallgrun and James O’Brien

Advanced Dataframe Manipulation: Filtering via Boolean indexing

When working with tabular data, it is very common that one wants to do something with particular data entries that satisfy a certain condition. For instance, we may want to restrict our analysis to rows that have a value larger than a given threshold for one of the columns. Pandas provides some powerful methods for this kind of filtering, and we are going to show one of these to you in this section, namely filtering with Boolean expressions. The first important thing to know is that we can use data frames in comparison expressions, like df > 0, df.m1 * 2 < 0.2, and so on. The output will be a data frame that only contains Boolean values (True or False) indicating whether the corresponding cell values satisfy the comparison expression or not. Let’s try out these two examples:

df > 0

The result is a data frame with the same rows and columns as the original data frame in df with all cells that had a value larger than 0 set to True, while all other cells are set to False: Table 3.9 Boolean data frame produced by the expression df > 0

m1 m2 m3 m4 m5
2017-01-01 True True True True True
2017-01-02 True True True True False
2017-01-03 False False False False False
2017-01-04 False False False False True
2017-01-05 True True False True False
2017-01-06 True True False True False
2017-01-07 True True True True True
2017-01-08 True True True True True
2017-01-09 True True True True True

df.m1 * 2 < 0.2

Here we are doing pretty much the same thing but only for a single column (‘m1’) and the comparison expression is slightly more complex involving multiplication of the cell values with 2 before the result is compared to 0.2. The result is a one-dimensional vector of True and False values corresponding to the cells of the ‘m1’ column in the original data frame:

df.m1 * 2 < 0.2

2017-01-01 False 2017-01-02 True 2017-01-03 True 2017-01-04 True 2017-01-05 False 2017-01-06 False 2017-01-07 False 2017-01-08 True 2017-01-09 True Freq: D, Name: m1, dtype: bool

Just to introduce another useful pandas method, we can apply the value_counts() method to get a summary of the values in a data frame telling how often each value occurs:

(df.m1 * 2 < 0.2).value_counts()

The expression in the parentheses will give us a boolean column vector as we have seen above, and invoking its value_counts() method tells us how often True and False occur in this vector. (The actual numbers will depend on the random numbers in your original data frame). The second important component of Boolean indexing is that we can use Boolean operators to combine Boolean data frames as illustrated in the next example:

v1 = df.m1 * 2 < 0.2
print(v1)
v2 = df.m2 > 0
print(v2)
print(~v1)
print(v1 & v2)
print(v1 | v2)

This will produce the following output data frames:

2017-01-01 False
2017-01-02 True
2017-01-03 True
2017-01-04 True
2017-01-05 False
2017-01-06 False
2017-01-07 False
2017-01-08 True
2017-01-09 True
Frew: D, Name: m1, dtype: bool

Table 3.12 Data frame for v2
2017-01-01 True
2017-01-02 False
2017-01-03 False
2017-01-04 True
2017-01-05 True
2017-01-06 True
2017-01-07 True
2017-01-08 True
2017-01-09 True
Frew: D, Name: m2, dtype: bool

Table 3.13 Data frame for ~v1
2017-01-01 True
2017-01-02 False
2017-01-03 False
2017-01-04 False
2017-01-05 True
2017-01-06 True
2017-01-07 True
2017-01-08 False
2017-01-09 False
Frew: D, Name: m1, dtype: bool

Table 3.14 Data frame for v1 & v2
2017-01-01 False
2017-01-02 True
2017-01-03 False
2017-01-04 False
2017-01-05 False
2017-01-06 False
2017-01-07 False
2017-01-08 True
2017-01-09 True
Frew: D, dtype: bool

Table 3.15 Data frame for v1 | v2 
2017-01-01 True
2017-01-02 True
2017-01-03 True
2017-01-04 True
2017-01-05 True
2017-01-06 True
2017-01-07 True
2017-01-08 True
2017-01-09 True
Frew: D, dtype: bool

What is happening here? We first create two different Boolean vectors using two different comparison expressions for columns ‘m1’ and ‘m2’, respectively, and store the results in variables v1 and v2. Then we use the Boolean operators ~ (not), & (and), and | (or) to create new Boolean vectors from the original ones, first by negating the Boolean values from v1, then by taking the logical AND of the corresponding values in v1 and v2 (meaning only cells that have True for both v1 and v2 will be set to True in the resulting vector), and finally by doing the same but with the logical OR (meaning only cells that have False for both v1 and v2 will be set to False in the result). We can construct arbitrarily complex Boolean expressions over the values in one or multiple data frames in this way. The final important component is that we can use Boolean vectors or lists to select rows from a data frame. For instance,

df[ [True, False, True, False, True, False, True, False, True] ]

... will give us a subset of the data frame with only every second row:

m1 m2 m3 m4 m5
2017-01-01 1.200000 0.163613 0.510162 0.628612 0.432523
2017-01-03 -0.840010 -0.840010 -0.422343 1.022622 -0.231232
2017-01-05 1.200000 0.655267 -1.339799 1.075069 -0.236980
2017-01-07 0.399069 0.029156 0.937808 0.476401 0.766952
2017-01-09 0.041115 0.984202 0.912212 0.740345 0.148835

Taken these three things together means we can use arbitrarily logical expressions over the values in a data frame to select a subset of rows that we want to work with. To continue the examples from above, let’s say that we want only those rows that satisfy both the criteria df.m1 * 2 < 0.2 and df.m2 > 0, so only those rows for which the value of column ‘m1’ times 2 is smaller than 0.2 and the value of column ‘m2’ is larger than 0. We can use the following expression for this:

df[v1 & v2 ]

Or even without first having to define v1 and v2:

df[ (df.m1 * 2 < 0.2)  & (df.m2 > 0)  ]

Here is the resulting data frame:

m1 m2 m3 m4 m5
2017-01-02 0.056027 0.056027 0.025050 0.283586 -0.123223
2017-01-08 0.043226 0.904844 0.181999 0.253381 0.165105
2017-01-09 0.041115 0.984202 0.912212 0.740345 0.148835

Hopefully you are beginning to see how powerful this approach is and how it allows for writing very elegant and compact code for working with tabular data. You will get to see more examples of this and of using pandas in general in the lesson’s walkthrough. There we will also be using GeoPandas, an extension built on top of pandas that allows for working with data frames that contain geometry data, e.g. entire attribute tables of an ESRI shapefile.

Lesson content developed by Jan Wallgrun and James O’Brien

Practice Exercises

After this excursion into the realm of package management and package managers, let's come back to the previous topics covered in this lesson (list comprehension, web access, GUI development) and wrap up the lesson with a few practice exercises. These are meant to give you the opportunity to test how well you have understood the main concepts from this lesson and as a preparation for the homework assignment in which you are supposed to develop a small standalone and GUI-based program for a relatively simple GIS workflow using arcpy. Even if you don't manage to implement perfect solutions for all three exercises yourself, thinking about them and then carefully studying the provided solutions will be helpful, in particular since reading and understanding other people's code is an important skill and one of the main ways to become a better programmer. The solutions to the three practice exercises can be found in the following subsections.

Practice Exercise 1: List Comprehension

You have a list that contains dictionaries describing spatial features, e.g. obtained from some web service. Each dictionary stores the id, latitude, and longitude of the feature, all as strings, under the respective keys "id", "lat", and "lon":

features = [ { "id": "A", "lat": "23.32", "lon": "-54.22" }, 
             { "id": "B", "lat": "24.39", "lon": "53.11" }, 
             { "id": "C", "lat": "27.98", "lon": "-54.01" } ]

We want to convert this list into a list of 3-tuples instead using a list comprehension (Section 2.2). The first element of each tuple should be the id but with the fixed string "Feature " as prefix (e.g. "Feature A"). The other two elements should be the lat and lon coordinates but as floats not as strings. Here is an example of the kind of tuple we are looking for, namely the one for the first feature from the list above: ('Feature A', 23.32, -54.22). Moreover, only features with a longitude coordinate < 0 should appear in the new list. How would you achieve this task with a single list comprehension?

Practice Exercise Solution

features = [ { "id": "A", "lat": "23.32", "lon": "-54.22" }, { "id": "B", "lat": "24.39", "lon": "53.11" }, { "id": "C", "lat": "27.98", "lon": "-54.01" } ]
featuresAsTuples = [ ("Feature " + feat['id'], float(feat['lat']), float(feat['lon']) ) for feat in features if float(feat['lon']) < 0 ]

print(featuresAsTuples)

Let's look at the components of the list comprehension starting with the middle part:

for feat in features

This means we will be going through the list features using a variable feat that will be assigned one of the dictionaries from the features list. This also means that both the if-condition on the right and the expression for the 3-tuples on the left need to be based on this variable feat.

if float(feat['lon']) < 0

Here we implement the condition that we only want 3-tuples in the new list for dictionaries that contain a lon value that is < 0.

("Feature " + feat['id'], float(feat['lat']), float(feat['lon']) )

Finally, this is the part where we construct the 3-tuples to be placed in the new list based on the dictionaries contained in variable feat. It should be clear that this is an expression for a 3-tuple with different expressions using the values stored in the dictionary in variable feat to derive the three elements of the tuple. The output produced by this code will be:

Output:

[('Feature A', 23.32, -54.22), ('Feature C', 27.98, -54.01)]

Lesson 3 Assignment

The situation is the following: You have been hired by a company active in the northeast of the United States to analyze and produce different forms of summaries for the traveling activities of their traveling salespersons. Unfortunately, the way the company has been keeping track of the related information leaves a lot to be desired. The information is spread out over numerous .csv files. Please download the .zip file [3] containing all (imaginary) data you need for this assignment and extract it to a new folder. Then open the files in a text editor and read the explanations below.

Explanation of the files:

File employees.csv: Most data files of the company do not use the names of their salespersons directly but instead refer to them through an employee ID. This file maps employee name to employee ID number. It has two columns, the first contains the full names in the format first name, last name and the second contains the ID number. The double quotes around the names are needed in the csv file to signal that this is the content of a single cell containing a comma rather than two cells separated by a comma.

"Smith, Richard",1234421
"Moore, Lisa",1231233
"Jones, Frank",2132222
"Brown, Justine",2132225
"Samulson, Roger",3981232
"Madison, Margaret",1876541

Files travel_???.csv: each of these files describes a single trip by a salesperson. The number in the file name is not the employee ID but a trip number. There are 75 such files with numbers from 1001 to 1075. Each file contains just a single row; here is the content of one of the files, the one named travel_1001.csv:

2132222,2016-01-07 16:00:00,2016-01-26 12:00:00,Cleveland;Bangor;Erie;Philadelphia;New York;Albany;Cleveland;Syracuse

The four columns (separated by comma) have the following content:

the ID of the employee who did this trip (here: 2132222),
the start date and time of the trip (here: 2016-01-07 16:00:00),
the end date and time of the trip (here: 2016-01-26 12:00:00),
and the route consisting of the names of the cities visited on the trip as a string separated by semi-colons (here: Cleveland;Bangor;Erie;Philadelphia;New York;Albany;Cleveland;Syracuse). Please note that the entire list of cities visited is just a single column in the csv file!

There are a few more files in the folder. They are actually empty but you are not allowed to delete these from the folder. This is to make sure that you have to be as specific as possible when using regular expressions for file names in your solution.

Your Task

The Python code you are supposed to write should take a list of employee names (e.g., ['Jones, Frank', 'Brown, Justine', 'Samulson, Roger'] )

It should then produce a new .csv file that lists the trips made by employees from the given employee name list with all information from the respective travel_???.csv files as well as the name of the employee. The rows should be ordered by employee name. The figure below shows the exemplary content of this output file.

Example CSV Output File

Steps in Detail

Your script should roughly follow the steps below; in particular you should use the APIs mentioned for performing each of the steps:

The input variables defined at the beginning of your code should include
1. the list of employee names to include in the output
2. the folder that contains the input files
3. the name of the output csv file
Use pandas to read the data from employees.csv into a data frame (see Hint 1).
Use pandas to create a single data frame with the content from all 75 travel_???.csv files. The content from each file should form a row in the new data frame (see Hints 1 and 2). Use regular expression and the functions from the re package for this step to only include files that start with "travel_", followed by a number, and ending in ".csv".
Use pandas operations to join the two data frames from steps (2) and (3) using the employee ID as key. Derive from this combined data frame a new data frame with only those rows that contain trips of employees from the input name list
Write the data frame produced in the previous step to a new csv file using the specified output file name from (1) (see Hint 1 and image of example csv output file above).

Hint 1:

Pandas provides functions for reading and writing csv files (and quite a few other file formats). They are called read_csv(...) and to_csv(...). See this Pandas documentation site for more information(link is external)

import pandas as pd
import datetime
df = pd.read_csv(r'C:\489\test.csv', sep=",", header=None)

Hint 2:

The pandas concat(…)(link is external) function can be used to combine several data frames with the same columns stored in a list to form a single data frame. This can be a good approach for this step. Let's say you have the individual data frames stored in a list variable called dataframes. You'd then simply call concat like this:

combinedDataFrame = pd.concat(dataframes)

This means your main task will be to create the list of pandas data frames, one for each travel_???.csv file before calling concat(...). For this, you will first need to use a regular expression to filter the list of all files in the input folder you get from calling os.listdir(inputFolder) to only the travel_???.csv files and then use read_csv(...) as described under Hint 1 to create a pandas DataFrame object from the csv file and add this to the data frame list.

Write-up

Produce a 400-word write-up on how the assignment went for you; reflect on and briefly discuss the issues and challenges you encountered and what you learned from the assignment.

Deliverable

Submit a single .zip file to the corresponding drop box on Canvas; the zip file should contain:

Your script file
Your 400-word write-up