Lesson 1: Advanced Python and Multiprocessing

Lesson 1 is one week in length and serves as a refresher and preparation for the rest of the course. You may already know many of the topics covered in the modules and want to skip over them, but it may be worth skimming though the topic. The assignment will walk you through implementing multiprocessing to clip a set of feature classes in parallel. Don’t stress out too much about this, much of the necessary coding and process will be provided to you. 

The Integrated Development Environment (IDE) we are going to start with in this class is called PyScripter, but feel free to use another IDE that you are comfortable with. We will start the lesson with discussing the debugger. Learning how to use this tool will save you time and save you from many headaches. I encourage you to spend some time learning how to use it since it is cleaner, faster, and provides more infomration than print statements. It will also help you identify errors and execution misteps throughout your code.

Next, we will look at Python types and structures, logic constructs, methods/functions, and finish the lesson with multiprocessing.

Overview and Checklist

The lesson contains a lot of material, but it reads quickly. If there is a portion that is not clear or if you have questions, please feel free to use the canvas discussion boards and post your questions to the class. If you can answer a question, please feel free to answer it. 

Pace yourself and leave plenty of time for working on the assignment. If you have completed the assignment and want to try your hand at implementing a different means of multi-processing or try it on another process that may benefit your work, I encourage you try it and submit the code. There are many solutions and implementations so feel free to explore them and find one that works for your process.

Learning Outcomes

Utilize the debugger and interpret its output.
Identify and properly use Python types and functions.
Execute a task using multiprocessing.

Lesson Roadmap

Steps for Completing Lesson 1
Step	Activity	Access/Directions
1	Engage with Lesson 1 Content	Begin with Debugging.
2	Programming Assignment and Reflection	Submit your code for the programming assignment and 400 words write-up with reflections
3	Quiz 1	Complete the Lesson 1 Quiz
4	Questions/Comments	Remember to visit Canvas to post/answer any questions or comments pertaining to Lesson 3

Downloads

The following is a list of datasets that you will be prompted to download through the course of the lesson. They are divided into two sections: Datasets that you will need for the assignments and Datasets used for the content and examples in the lesson.

Required:

Lesson1_Assignment_initial_code.zip [1]

Suggested:

USA.gdb.zip [2]
In the Multiprocessing with raster data section you will also use some DEM raster data that you can download here [3] if you want to follow along with the content. You can also wait with obtaining that data until you reach that section in the lesson material.

Assignments

We are going to use the arcpy vector data processing code from Multiprocessing with vector data (download Lesson1_Assignment_initial_code.zip [1]) as the basis for our Lesson 1 programming project. The code is already in multiprocessing mode, so you will not have to write multiprocessing code on your own from scratch but you still will need a good understanding of how the script works. If you are unclear about anything the script does, please ask on the course forums. This part of the assignment will be for getting back into the rhythm of writing arcpy based Python code and practice creating a multiprocessing script. Your task is to extend our vector data clipping script by doing the following:

Expand the code so that it can handle multiple input feature classes to be clipped (still using a single polygon clipping feature class). The input variable data_to_be_clipped should now take a list of feature class names rather than a single name. The worker function should, as before, perform the operation of clipping a single input file (not all of them!) to one of the features from the clipper feature class. The main change you will have to make will be in the main code where the jobs are created. The names of the output files produced should have the format

clip_<oid>_<name of the input feature class>.shp

For instance clip_0_Roads.shp for clipping the Roads feature class from USA.gdb to the state featureclass with oid 0. You can change the OID to the state name if you want to expand the code.

Submit a single .zip file to the corresponding drop box on Canvas; the zip file should contain:

Your modified code files. Please organize the files cleanly.
A 400-word write-up of what you have learned during this exercise.

Questions?

If you have any questions, please send a message through Canvas. We will check daily to respond. If your question is one that is relevant to the entire class, we may respond to the entire class rather than individually.

Debugging

Debugging is a very important part of writing code. The simplest method of debugging is to embed print statements in your code to either determine how far your code is running through a loop or to print out the contents of a variable. This works to some degree, but it does not tell you why your code failed where it did. A more detailed method involves using the tools or features of your IDE to create watches for checking the contents of variables and breakpoints for stepping through your code.

We will provide a generic overview of the techniques here of setting breakpoints, watches, and stepping through code. Don’t focus on the specifics of the interface as we do this. Instead, it is more important to understand the purpose of each of the different methods of debugging. The debugger will be your greatest tool for writing code and working through complex processes since it will allow to inspect your variables, data, and step through your application's or script's execution flow.

We will start off by looking at PyScripter debugging functions. There's more details of the different debugging Engines in the PyScripter Debugger [4] wiki. For this class, the standard debugger will work.

The best way to explain the aspects of debugging is to work through an example. This time, we'll look at some code that tries to calculate the factorial of an integer (the integer is hard-coded to 5 in this case). In mathematics, a factorial is the product of an integer and all positive integers below it. Thus, 5! (or "5 factorial") should be

5 * 4 * 3 * 2 * 1 = 120

The code below attempts to calculate a factorial through a loop that increments the multiplier by 1 until it reaches the original integer. This is a valid approach since 1 * 2 * 3 * 4 * 5 would also yield 120.

# This script calculates the factorial of a given
#  integer, which is the product of the integer and
#  all positive integers below it.

number = 5
multiplier = 1

while multiplier < number:
    number *= multiplier
    multiplier += 1

print (number)

Even if you can spot the error, follow along with the steps below to get a feel for the debugging process and the PyScripter Debug toolbar.

Open PyScripter and copy the above code into a new script.

Save your script as debugger_walkthrough.py. You can optionally run the script, but you won't get a result.
Click View > Toolbars and ensure Debug is checked. You should see a toolbar like this: Many IDEs have debugging toolbars like this, and the tools they contain are pretty standard: a way to run the code, a way to set breakpoints, a way to step through the code line by line, and a way to watch the value of variables while stepping through the code. We'll cover each of these in the steps below.
Move your cursor to the left of line 5 (number = 5) and click. If you are in the right area, you will see a red dot next to the line number, indicating the addition of a breakpoint. A breakpoint is a place where you want your code to stop running, so you can examine it line by line using the debugger. Often you'll set a breakpoint deep in the middle of your script, so you don't have to examine every single line of code. In this example, the script is very short, so we're putting the breakpoint right at the beginning. The breakpoint is represented by a circle next to the line of code, and this is common in other debuggers too. Note that F5 is the shortcut key for this command.
Press the Debug file button . This runs your script up to the breakpoint. In the Python Interpreter console, note that the debugfile() function is run on your script rather than the normal runfile() function. Also, instead of the normal >>> prompt, you should now see a [Dbg]>>> prompt. The cursor will be on that same line in PyScripter's Editor pane, which causes that line to be highlighted.
Click the Step over next function call button or the Step into subroutine button. This executes the line of your code, in this case the number = 5 line. Both buttons execute the highlighted statement, but it is important to note here that they will behave differently when the statement includes a call to a function. The Step over next function call button will execute all the code within the function, return to your script, and then pause at the script's next line. You'd use this button when you're not interested in debugging the function code, just the code of your main script. The Step into subroutine button, on the other hand, is used when you do want to step through the function code one line at a time. The two buttons will produce the same behavior for this simple script. You'll want to experiment with them later in the course when we discuss writing our own functions and modules.
Before going further, click the Variable window tab in PyScripter's lower pane. Here, you can track what happens to your variables as you execute the code line by line. The variables will be added automatically as they are encountered. At this point, you should see a globals {}, which contain variables from the python packages and locals {} that will contain variables created by your script. We will be looking at the locals dictionary so you can disregard the globals. Expanding the locals dictionary, you should see some are built in variables (__<name>__) and we can ignore those for now. The "number" variable should be listed, with a type of int. Expanding on the +, will expose more of the variables properties.
Click the Step button again. You should now see the "multiplier" variable has been added in the Variable window, since you just executed the line that initializes that variable, as called out in the image.
Click the Step button a few more times to cycle through the loop. Go slowly, and use the Variable window to understand the effect that each line has on the two variables. (Note that the keyboard shortcut for the Step button is F8, which you may find easier to use than clicking on the GUI.) Setting a watch on a variable is done by placing the cursor on the variable and then pressing Alt+W or right clicking in the Watches pane and selecting Add Watch At Cursor. This isolates the variable to the Watch window and you can watch the value change as it changes in the code execution.
Step through the loop until "multiplier" reaches a value of 10. It should be obvious at this point that the loop has not exited at the desired point. Our intent was for it to quit when "number" reached 120.

Can you spot the error now? The fact that the loop has failed to exit should draw your attention to the loop condition. The loop will only exit when "multiplier" is greater than or equal to "number." That is obviously never going to happen as "number" keeps getting bigger and bigger as it is multiplied each time through the loop.

In this example, the code contained a logical error. It re-used the variable for which we wanted to find the factorial (5) as a variable in the loop condition, without considering that the number would be repeatedly increased within the loop. Changing the loop condition to the following would cause the script to work:
```
while multiplier < 5:
```
Even better than hard-coding the value 5 in this line would be to initialize a variable early and set it equal to the number whose factorial we want to find. The number could then get multiplied independent of the loop condition variable.
Click the Stop button in the Debug toolbar to end the debugging session. We're now going to step through a corrected version of the factorial script, but you may notice that the Variable window still displays a list of the variables and their values from the point at which you stopped executing. That's not necessarily a problem, but it is good to keep in mind.

Open a new script, paste in the code below, and save the script as debugger_walkthrough2.py

# This script calculates the factorial of a given
# integer, which is the product of the integer and
# all positive integers below it.
number = 5
loopStop = number
multiplier = 1
while multiplier < loopStop: 
    number *= multiplier
    multiplier += 1

print (number)

Step through the loop a few times as you did above. Watch the values of the "number" and "multiplier" variables, but also the new "loopStop" variable. This variable allows the loop condition to remain constant while "number" is multiplied. Indeed, you should see "loopStop" remain fixed at 5 while "number" increases to 120.
Keep stepping until you've finished the entire script. Note that the usual >>> prompt returns to indicate you've left debugging mode.

In the above example, you used the Debug toolbar to find a logical error that had caused an endless loop in your code. Debugging tools are often your best resource for hunting down subtle errors in your code.

You can and should practice using the Debug toolbar in the script-writing assignments that you receive in this course. Spending a little time to master the few simple steps of the debugger will save you tremendous amounts of time troubleshooting.

Import and Some Syntactic Sugar

Now that we are familiar with how to step into the code and look at our variables, lets warm up a bit and briefly revisit a few Python features that you are already familiar with but for which there exist some forms or details that you may not yet know. We will start with the Python “import” command and introduce a few Python constructs that you may be new to you on the way. 

It is highly recommended that you try these examples yourself, experiment with them to get a better understanding, and use the debugger to step into the process to explore and see the variables values.

Import

What happens here is that the module (either a module from the standard library, a module that is part of another package you installed, or simply another .py file in your project directory) is loaded, unless it has already been loaded before, and the name of the module becomes part of the namespace of the script that contains the import command. As a result, you can now access all variables, functions, or classes defined in the imported module, by writing

<module name>.<variable or function name>

e.g.,

arcpy.Describe(…)

You can also use the import command like this instead:

import arcpy as ap

This form introduces a new alias for the module name, typically to save some typing when the module name is rather long, and instead of writing

arcpy.Describe(…)

you would now use

ap.Describe(…)

in your code. Note that the use of ‘…’ is to indicate parameters, code between two important lines of code that would otherwise be distracting or is not necessary to show, or a continuation of list items.

Another approach of using “import” is to directly add content of a module (again either variables, functions, or classes) to the namespace of the importing Python script. This is done by using the form "from … import …" as in the following example:

from arcpy import Describe, Point , …  

... 

Describe(…)

The difference is that now you can use the imported names directly in our code without having to use the module name (or an alias) as a prefix as it is done in line 5 of the example code. However, be aware that if you are importing multiple modules, this can easily lead to name conflicts if, for instance, two modules contain functions with the same name. It can also make your code a little more difficult to read since

  arcpy.Describe(...)

helps you or another programmer recognize that you’re using something defined in arcpy and not in another library or the main code of your script.

You can also use

from arcpy import *

to import all variable, function and class names from a module into the namespace of your script if you don’t want to list all those you actually need. However, this can increase the likelihood of a name conflict.

Keep in mind that readability matters and using the <module name>.<variable or function name> helps the reader know where the function is coming from. Modules may have similar function names and this convention helps describe which package you are using. Your IDE might flag the function as ambiguous if it finds any same named function declarations that could be from multiple modules.

Lesson content developed by Jan Wallgrun and James O’Brien

Python data types

Python contains many data types built in. These include text, numeric, sequence, mapping, sets, Boolean, binary and None. Coding at its core is largely working with data that is in one of these types and it is important to know them and how they work.

As a developer, you need to know types for operations and for parameters and this helps with debugging, performing conditionals, calculations, dictionary operations, and data transformations. For example, the string “1” does not equal the int 1. Trying to add “2” + 5 will result in a TypeError

res = "2" + 5
print (res)

Output 
Traceback (most recent call last): TypeError: can only concatenate str (not "int") to str

Changing the 5 to “5” results in concatenation of the “2” and “5”:

res = "2" + "5"
print (res)

Output 
25

For this course, we will be looking at more advanced structures in the mapping type. W3 provides a great overview of the data types [5] and I encourage you to review them.

Variables: local vs. global, mutable vs. immutable

When making the transition from a beginner to an intermediate or advanced Python programmer, it also gets important to understand the intricacies of variables used within functions and of passing parameters to functions in detail. First of all, we can distinguish between global and local variables within a Python script. Global variables are defined outside of any function or loop construct, or with the keyword global. They can be accessed from anywhere in the script after they are instantiated. They exist and keep their values as long as the script is loaded, which typically means as long as the Python interpreter into which they are loaded is running.

In contrast, local variables are defined inside a function or loop construct and can only be accessed in the body (scope) of that function or loop. Furthermore, when the body of the function has been executed, its local variables will be discarded and cannot be used anymore to access their current values. A local variable is either a parameter of that function, in which case it is assigned a value immediately when the function is called, or it is introduced in the function body by making an assignment to the name for the first time.

Here are a few examples to illustrate the concepts of global and local variables and how to use them in Python.

def doSomething(x):  # parameter x is a local variable of the function 
    count = 1000 * x  # local variable count is introduced 
    return count 
 
y = 10  # global variable y is introduced  
print(doSomething(y)) 
print(count)  # this will result in an error  
print(x)  # this will also result in an error

This example introduces one global variable, y, and two local variables, x and count, both part of the function doSomething(…). x is a parameter of the function, while count is introduced in the body of the function in line 3. When this function is called in line 11, the local variable x is created and assigned the value that is currently stored in global variable y, so the integer number 10. Then the body of the function is executed. In line 3, an assignment is made to variable count. Since this variable hasn’t been introduced in the function body before, a new local variable will now be created and assigned the value 10000. After executing the return statement in line 5, both x and count will be discarded. Hence, the two print statements at the end of the code would lead to errors because they try to access variables that do not exist anymore.

Think of the function as a house with one-way mirrored windows. You can see out, but cannot see in. Variables, can ‘see’ to the outside, but a variable on the outside cannot see 'into' the function. This is also referred to as variable scope. The scope of the variable is where the variable is created and follows the code indents, unless it is decorated with global. Simply, inner scopes can access the outer scopes, but the outer scopes cannot reach into the inner scopes.

Now let’s change the example to the following:

def doSomething():  
    count = 1000 * y    # global variable y is accessed here 
    return count  
y = 10          
print(doSomething())

This example shows that global variable y can also be directly accessed from within the function doSomething(): When Python encounters a variable name that is neither the name of a parameter of that function nor has been introduced via an assignment previously in the body of that function, it will look for that variable among the global (outer scope) variables. However, the first version using a parameter instead is usually preferable because then the code in the function doesn’t depend on how you name and use variables outside of it. That makes it much easier to, for instance, re-use the same function in different projects.

So maybe you are wondering whether it is also possible to change the value of a global variable from within a function, not just read its value? One attempt to achieve this could be the following:

def doSomething():  
    count = 1000  
    y = 5 
    return count * y  
y = 10 
print(doSomething())  
print(y)  # output will still be 10 here

However, if you run the code, you will see that last line still produces the output 10, so the global variable y hasn't been changed by the assignment in line 5. That is because the rule is if the variable is not passed to the function or the variable is not marked as global within the function, it will be considered a local variable to that function. Since this is the first time an assignment to y is made in the body of the function, a new local variable with that name is created at that point that will overshadow the global variable with the same name until the end of the function has been reached. Instead, you explicitly must tell Python that a variable name should be interpreted as the name of a global variable by using the keyword ‘global’, like this:

def doSomething(): 
    count = 1000 
    global y  # tells Python to treat y as the name of global variable 
    y = 5  # as a result, global variable y is assigned a new value here 
    return count * y 
 
 
y = 10 
print(doSomething()) 
print(y)  # output will now be 5 here

In line 5, we are telling Python that y in this function should refer to the global variable y. As a result, the assignment in line 7 changes the value of the global variable called y and the output of the last line will be 5. While it's good to know how these things work in Python, we again want to emphasize that accessing global variables from within functions should be avoided as much as possible. Passing values via parameters and returning values is usually preferable because it keeps different parts of the code as independent of each other as possible.

So after talking about global vs. local variables, what is the issue with mutable vs. immutable mentioned in the heading? There is an important difference in passing values to a function depending on whether the value is from a mutable or immutable data type. All values of primitive data types like numbers and boolean values in Python are immutable, meaning you cannot change any part of them. On the other hand, we have mutable data types like lists and dictionaries for which it is possible to change their parts: You can, for instance, change one of the elements in a list or what is stored under a particular key in a given dictionary without creating a completely new object.

What about strings and tuples? You may think these are mutable objects, but they are actually immutable. While you can access a single character from a string or element from a tuple, you will get an error message if you try to change it by using it on the left side of the equal sign in an assignment. Moreover, when you use a string method like replace(…) to replace all occurrences of a character by another one, the method cannot change the string object in memory for which it was called but has to construct a new string object and return that to the caller.

Why is that important to know in the context of writing functions? Because mutable and immutable data types are treated differently when provided as a parameter to functions as shown in the following two examples:

def changeIt(x): 
    x = 5  # this does not change the value assigned to y because x is considered local 
 
 
y = 3 
changeIt(y) 
print(y)  # will print out 3

As we already discussed above, the parameter x is treated as a local variable in the function body. We can think of it as being assigned a copy of the value that variable y contains when the function is called. As a result, the value of the global variable y doesn’t change and the output produced by the last line is 3. But it only works like this for immutable objects, like numbers in this case! Let’s do the same thing for a list:

def changeIt(x): 
    x[0] = 5  # this will change the list y refers to 
 
y = [3, 5, 7] 
changeIt(y) 
print(y)  # output will be [5, 5, 7]

The output [5, 5, 7] produced by the print statement in the last line shows that the assignment in line 3 changed the list object that is stored in global variable y. How is that possible? Well, for values of mutable data types like lists, assigning the value to function parameter x cannot be conceived as creating a copy of that value and, as a result, having the value appear twice in memory. Instead, x is set up to refer to the same list object in memory as y. Therefore, any change made with the help of either variable x or y will change the same list object in memory. When variable x is discarded when the function body has been executed, variable y will still refer to that modified list object. Maybe you have already heard the terms “call-by-value” and “call-by-reference” in the context of assigning values to function parameters in other programming languages. What happens for immutable data types in Python works like “call-by-value,” while what happens to mutable data types works like “call-by-reference.” If you feel like learning more about the details of these concepts, check out this article on Parameter Passing [6].

While the reasons behind these different mechanisms are very technical and related to efficiency, this means it is actually possible to write functions that take parameters of mutable type as input and modify their content. This is common practice (in particular for class objects which are also mutable) and not generally considered bad style because it is based on function parameters and the code in the function body does not have to know anything about what happens outside of the function. Nevertheless, often returning a new object as the return value of the function rather than changing a mutable parameter is preferable. This brings us to the last part of this section.

Lesson content developed by Jan Wallgrun and James O’Brien

Python Dictionaries

In programming, we often want to store larger amounts of data that somehow belongs together inside a single variable. You probably already know about lists, which provide one option to do so. As long as available memory permits, you can store as many elements in a list as you wish and the append(...) method allows you to add more elements to an existing list.

Dictionaries are another data structure that allows for storing complex information in a single variable. While lists store elements in a simple sequence and the elements are then accessed based on their index in the sequence, the elements stored in a dictionary consist of key-value pairs and one always uses the key to retrieve the corresponding values from the dictionary. It works like in a real dictionary where you look up information (the stored value) under a particular keyword (the key). Similar to real dictionaries, there can only be one unique keyword (key), but can have multiple values attached to it. Values can be simple data structures such as strings, ints, lists, dictionaries, or they can be more complex objects such as featureclasses, classes, and even functions.

Dictionaries can be useful to realize a mapping, for instance from English words to the corresponding words in Spanish. Here is how you can create such a dictionary for just the numbers from one to four:

englishToSpanishDic = {"one": "uno", "two": "dos", "three": "tres", "four": "cuatro"}

The curly brackets { } delimit the dictionary similarly to how squared brackets [ ] do for lists. Inside the dictionary, we have four key-value pairs separated by commas. The key and value for each pair are separated by a colon. The key appears on the left of the colon, while the value stored under the key appears on the right side of the colon.

We can now use the dictionary stored in variable englishToSpanishDic to look up the Spanish word for an English number, e.g.

print(englishToSpanishDic["two"])

Output 
dos

To retrieve some value stored in the dictionary, we here use the name of the variable followed by squared brackets containing the key under which the value is stored in the dictionary. There are also built-in methods that you can use to avoid some of the Exceptions raised (such as KeyError if the key does not exist in the dictionary). This method (.get()) has been used in the previous lesson examples.

We can add a new key-value pair to an existing dictionary by using the dictionanary[key] notation but on the left side of an assignment operator (=):

englishToSpanishDic["five"] = "cinco" 
print(englishToSpanishDic)

Output 
{'four': 'cuatro', 'three': 'tres', 'five': 'cinco', 'two': 'dos', 'one': 'uno'}

Here we added the value "cinco" appearing on the right side of the equal sign under the key "five" to the dictionary. If something would have already been stored under the key "five" in the dictionary, the stored value would have been overwritten. You may have noticed that the order of the elements of the dictionary in the output has changed but that doesn’t matter since we always access the elements in a dictionary via their key. If our dictionary would contain many more word pairs, we could use it to realize a very primitive translator that would go through an English text word-by-word and replace each word by the corresponding Spanish word retrieved from the dictionary.

Now let’s use Python dictionaries to do something a bit more complex. Let’s simulate the process of creating a book index that lists the page numbers on which certain keywords occur. We want to start with an empty dictionary and then go through the book page-by-page. Whenever we encounter a word that we think is important enough to be listed in the index, we add it and the page number to the dictionary.

To create an empty dictionary in a variable called bookIndex, we use the notation with the curly brackets but nothing in between:

bookIndex = {} 
print(bookIndex)

Output 
{}

Now let’s say the first keyword we encounter in the imaginary programming book we are going through is the word "function" on page 2. We now want to store the page number 2 (value) under the keyword "function" (key) in the dictionary. But since keywords can appear on many pages, what we want to store as values in the dictionary are not individual numbers but lists of page numbers. Therefore, what we put into our dictionary is a list with the number 2 as its only element:

bookIndex["function"] =  [2] 
print (bookIndex)

Output 
{'function': [2]}

Next, we encounter the keyword "module" on page 3. So we add it to the dictionary in the same way:

bookIndex["module"] =  [3] 
print (bookIndex)

Output 
{'function': [2], 'module': [3]}

So now our dictionary contains two key-value pairs and for each key it stores a list with just a single page number. Let’s say we next encounter the keyword “function” a second time, this time on page 5. Our code to add the additional page number to the list stored under the key “function” now needs to look a bit differently because we already have something stored for it in the dictionary and we do not want to overwrite that information. Instead, we retrieve the currently stored list of page numbers and add the new number to it with append(…):

pages = bookIndex["function"] 
pages.append(5) 
print(bookIndex) 
>>> {'function': [2, 5], 'module': [3]} 
print(bookIndex["function"]) 
>>> [2, 5]

Please note that we didn’t have to put the list of page numbers stored in variable pages back into the dictionary after adding the new page number. Both, variable pages and the dictionary refer to the same list such that appending the number changes both. Our dictionary now contains a list of two page numbers for the key “function” and still a list with just one page number for the key “module”. Surely you can imagine how we would build up a large dictionary for the entire book by continuing this process. Dictionaries can be used in concert with a for loop to go through the keys of the elements in the dictionary. This can be used to print out the content of an entire dictionary:

for k in bookIndex.keys():  # loop through keys of the dictionary 
    print("keyword: " + k)  # print the key 
    print("pages: " + str(bookIndex[k]))  # print the value

Output 
keyword: function 
pages: [2, 5] 
keyword: module 
pages: [3]

When adding the second page number for “function”, we ourselves decided that this needs to be handled differently than when adding the first page number. But how could this be realized in code? We can check whether something is already stored under a key in a dictionary using an if-statement together with the “in” operator:

keyword = "function" 
if keyword in bookIndex.keys():
    ("entry exists") 
else:
    print ("entry does not exist")

Output 
entry exists

So assuming we have the current keyword stored in variable word and the corresponding page number stored in variable pageNo, the following piece of code would decide by itself how to add the new page number to the dictionary:

if word in bookIndex:
    # entry for word already exists, so we just add page
    pages = bookIndex[word]
    pages.append(pageNo)
else:
    # no entry for word exists, so we add new entry
    bookIndex[word] = [pageNo]

This can also be written as a ternary operation as discussed in the previous lesson, however the order of the operation needs to be careful that it doesn’t trigger a KeyError by checking if the value is in the keys first:

bookIndex[word] = [pageNo] if word not in bookIndex.keys() else bookIndex[word].append(word)

A more sophisticated version of this code would also check whether the list of page numbers retrieved in the if-block already contains the new page number to deal with the case that a keyword occurs more than once on the same page. Feel free to think about how this could be included.

Lesson content developed by Jan Wallgrun and James O’Brien

JSON

JSON, which is an acronym for Javascript Serializion Object Notation, is a data structure mostly associated with the web. It provides a dictionary like structure that is easily created, transmitted, read, and parsed. Many coding languages include built in methods for these processes and it is largely used for transferring data and information between languages. For example, a web API on a server may be written in C# and the front-end application written in Javascript. The API request from Javascript may be serialized to JSON and then deserialized by the C# API. C# continues to process the request and responds with return data in JSON form. Python includes a JSON package aptly named json that makes the working with JSON and disctionaries easy.

An example of JSON is:

{ 
  "objectIdFieldName" : "OBJECTID",  
  "uniqueIdField" :  
  { 
    "name" : "OBJECTID",  
    "isSystemMaintained" : true 
  },  
  "globalIdFieldName" : "GlobalID",  
  "geometryType" : "esriGeometryPoint",  
  "spatialReference" : { 
    "wkid" : 102100,  
    "latestWkid" : 3857 
  },  
  "fields" : [ 
    { 
      "name" : "OBJECTID",  
      "type" : "esriFieldTypeOID",  
      "alias" : "OBJECTID",  
      "sqlType" : "sqlTypeOther",  
      "domain" : null,  
      "defaultValue" : null 
    },  
    { 
      "name" : "Acres",  
      "type" : "esriFieldTypeDouble",  
      "alias" : "Acres",  
      "sqlType" : "sqlTypeOther",  
      "domain" : null,  
      "defaultValue" : null 
    } 
  ],  
  "features" : [ 
    { 
      "attributes" : { 
        "OBJECTID" : 6,  
        "GlobalID" : "c4c4bcfd-ce86-4bc4-b9a2-ac7b75027e12",  
        "Contained" : "Yes",  
        "FireName" : "66",  
        "Responsibility" : "Local",  
        "DPA" : "LRA",  
        "StartDate" : 1684177800000,  
        "Status" : "Out",  
        "PercentContainment" : 50,  
        "Acres" : 127,  
        "InciWeb" : "https://www.fire.ca.gov/incidents/2023/5/15/66-fire",  
        "SocialMediaHyperlink" : "https://twitter.com/hashtag/66fire?src=hashtag_click",  
        "StartTime" : "1210",  
        "ImpactBLM" : "No",  
        "FireNotes" : "Threats exist to structures, critical infrastructure, and agricultural ands. High temperatures, and low humidity. Forward spread stopped. Resources continue to strengthen ontainment lines.",  
        "CameraLink" : null 
      },  
      "geometry" :  
      { 
        "x" : -12923377.992696606,  
        "y" : 3971158.6933410829 
      } 
    },  
    { 
      "attributes" : { 
        "OBJECTID" : 7,  
        "GlobalID" : "04772c32-acab-4f12-8875-7f456e21eda7",  
        "Contained" : "No",  
        "FireName" : "Ramona",  
        "Responsibility" : "LRA",  
        "DPA" : "Local",  
        "StartDate" : 1684789860000,  
        "Status" : "Out",  
        "PercentContainment" : 80,  
        "Acres" : 348,  
        "InciWeb" : "https://www.fire.ca.gov/incidents/2023/5/22/ramona-fire/",  
        "SocialMediaHyperlink" : "https://twitter.com/hashtag/ramonafire?src=hashtag_click",  
        "StartTime" : null,  
        "ImpactBLM" : "Possible",  
        "FireNotes" : "Minimal fire behavior observed. Resources continue to strengthen control ines and mop-up.",  
        "CameraLink" : "https://alertca.live/cam-console/2755" 
      },  
      "geometry" :  
      { 
        "x" : -13029408.882657332,  
        "y" : 4003241.0902095754 
      } 
    },  
    { 
      "attributes" : { 
        "OBJECTID" : 8,  
        "GlobalID" : "737be4e4-a127-486a-8481-a0ca62a631d7",  
        "Contained" : "Yes",  
        "FireName" : "Range",  
        "Responsibility" : "State",  
        "DPA" : "State",  
        "StartDate" : 1685925480000,  
        "Status" : "Out",  
        "PercentContainment" : 100,  
        "Acres" : 72,  
        "InciWeb" : "https://www.fire.ca.gov/incidents/2023/6/4/range-fire",  
        "SocialMediaHyperlink" : "https://twitter.com/hashtag/RangeFire?src=hashtag_click",  
        "StartTime" : null,  
        "ImpactBLM" : "No",  
        "FireNotes" : null,  
        "CameraLink" : "https://alertca.live/cam-console/2731" 
      },  
      "geometry" :  
      { 
        "x" : -13506333.475177869,  
        "y" : 4366169.4120039716 
      } 
    } 
] 
}

Where the left side of the : is the property, and the right side is the value, much like the dictionary. It is important to note that while it looks like a python dictionary, JSON needs to be converted to a dictionary for it to be recognized as a dictionary and vice versa to JSON. One main difference between dictionaries and JSON is that JSON is that the properties (keys in Python) need to be strings whereas Python dictionary keys can be ints, floats, strings, Booleans or other immutable types.

Many API’s will transmit the requested data in JSON form and conversion is simple as using JSON.loads() to convert to JSON to a python dictionary and JSON.dumps() to convert it to a JSON object. We will be covering more details of this process in Lesson 2.

Classes

Let’s recapitulate a bit: the underlying perspective of object-oriented programming is that the domain modeled in a program consists of objects belonging to different classes. If your software models some part of the real world, you may have classes for things like buildings, vehicles, trees, etc. and then the objects (also called instances) created from these classes during run-time represent concrete individual buildings, vehicles, or trees with their specific properties. The classes in your software can also describe non real-world and often very abstract things like a feature layer or a random number generator.

Class definitions specify general properties that all objects of that class have in common, together with the things that one can do with these objects. Therefore, they can be considered blueprints for the objects. Each object at any moment during run-time is in a particular state that consists of the concrete values it has for the properties defined in its class. So, for instance, the definition of a very basic class Car may specify that all cars have the properties owner, color, currentSpeed, and lightsOn. During run-time we might then create an object for “Tom’s car” in variable carOfTom with the following values making up its state:

carOfTom.owner = "Tom" 
carOfTom.color = "blue" 
carOfTom.currentSpeed = 48   (mph) 
carOfTom.lightsOn = False

While all objects of the same class have the same properties (also called attributes or fields), their values for these properties may vary and, hence, they can be in different states. The actions that one can perform with a car or things that can happen to a car are described in the form of methods in the class definition. For instance, the class Car may specify that the current speed of cars can be changed to a new value and that lights can be turned on and off. The respective methods may be called changeCurrentSpeed(…), turnLightsOn(), and turnLightsOff(). Methods are like functions but they are explicitly invoked on an object of the class they are defined in. In Python this is done by using the name of the variable that contains the object, followed by a dot, followed by the method name:

carOfTom.changeCurrentSpeed(34) # change state of Tom’s car to current speed being 34mph 

carOfTom.turnLightsOn()# change state of Tom’s car to lights being turned on

The purpose of methods can be to update the state of the object by changing one or several of its properties as in the previous two examples. It can also be to get information about the state of the car, e.g. are the lights turned on? But it can also be something more complicated, e.g. performing a certain driving maneuver or fuel calculation.

In object-oriented programming, a program is perceived as a collection of objects that interact by calling each other’s methods. Object-oriented programming adheres to three main design principles:

Encapsulation: Definitions related to the properties and methods of any class appear in a specification that is encapsulated independently from the rest of the software code and properties are only accessible via a well-defined interface, e.g. via the defined methods.
Inheritance: Classes can be organized hierarchically with new classes being derived from previously defined classes inheriting all the characteristics of the parent class but potentially adding specialized properties or specialized behavior. For instance, our class Car could be derived from a more general class Vehicle adding properties and methods that are specific for cars.
Polymorphism: Inherited classes can change the behavior of methods by overwriting them and the code executed when such a method is invoked for an object then depends on the class of that object.

We will talk more about inheritance and polymorphism soon. All three principles aim at improving reusability and maintainability of software code. These days, most software is created by mainly combining parts that already exist because that saves time and costs and increases reliability when the re-used components have already been thoroughly tested. The idea of classes as encapsulated units within a program increases reusability because these units are then not dependent on other code and can be moved over to a different project much more easily.

For now, let’s look at how our simple class Car can be defined in Python.

 class Car(): 

     def __init__(self): 
          self.owner = 'UNKNOWN' 
          self.color = 'UNKNOWN' 
          self.currentSpeed = 0 
          self.lightsOn = False 

     def changeCurrentSpeed(self,newSpeed): 
          self.currentSpeed = newSpeed 

     def turnLightsOn(self): 
          self.lightsOn = True 

     def turnLightsOff(self): 
          self.lightsOn = False 

     def printInfo(self): 
          print('Car with owner = {0}, color = {1}, currentSpeed = {2}, lightsOn = {3}'.format(self.owner, self.color, self.currentSpeed, self.lightsOn))

Here is an explanation of the different parts of this class definition: each class definition in Python starts with the keyword ‘class’ followed by the name of the class (‘Car’) followed by parentheses that may contain names of classes that this class inherits from, but that’s something we will only see later on. The rest of the class definition is indented to the right relative to this line.

The rest of the class definition consists of definitions of the methods of the class which all look like function definitions but have the keyword ‘self’ as the first parameter, which is an indication that this is a method. The method __init__(…) is a special method called the constructor of the class. It will be called when we create a new object of that class like this:

carOfTom = Car()    # uses the __init__() method of Car to create a new Car object

In the body of the constructor, we create the properties of the class Car. Each line starting with “self.<name of property> = ...“ creates a so-called instance variable for this car object and assigns it an initial value, e.g. zero for the speed. The instance variables describing the state of an object are another type of variable in addition to global and local variables that you already know. They are part of the object and exist as long as that object exists. They can be accessed from within the class definition as “self.<name of the instance variable>” which happens later in the definitions of the other methods, namely in lines 10, 13, 16 and 19. If you want to access an instance variable from outside the class definition, you have to use <name of variable containing the object>.<name of the instance variable>, so, for instance:

print(carOfTom.lightsOn)    # will produce the output False because right now this instance variable still has its default value

The rest of the class definition consists of the methods for performing certain actions with a Car object. You can see that the already mentioned methods for changing the state of the Car object are very simple. They just assign a new value to the respective instance variable, a new speed value that is provided as a parameter in the case of changeCurrentSpeed(…) and a fixed Boolean value in the cases of turnLightsOn() and turnLightsOff(). In addition, we added a method printInfo() that prints out a string with the values of all instance variables to provide us with all information about a car’s current state. Let us now create a new instance of our Car class and then use some of its methods:

carOfSue = Car() 
carOfSue.owner = 'Sue' 
carOfSue.color = 'white' 
carOfSue.changeCurrentSpeed(41) 
carOfSue.turnLightsOn() 
carOfSue.printInfo()

Output

Car with owner = Sue, color = white, currentSpeed = 41, lightsOn = True

Since we did not define any methods to change the owner or color of the car, we are directly accessing these instance variables and assigning new values to them in lines 2 and 3. While this is okay in simple examples like this, it is recommended that you provide so-called getter and setter methods (also called accessor and mutator methods) for all instance variables that you want the user of the class to be able to read (“get”) or change (“set”). The methods allow the class to perform certain checks to make sure that the object always remains in an allowed state. How about you go ahead and for practice create a second car object for your own car (or any car you can think of) in a new variable and then print out its information?

A method can call any other method defined in the same class by using the notation “self.<name of the method>(...)”. For example, we can add the following method randomSpeed() to the definition of class Car:

def setRandomSpeed(self): 
    self.changeCurrentSpeed(random.randint(0,76))

The new method requires the “random” module to be imported at the beginning of the script. The method generates a random number and then uses the previously defined method changeCurrentSpeed(…) to actually change the corresponding instance variable. In this simple example, one could have simply changed the instance variable directly but in more complex cases changes to the state can require more code so that this approach here actually avoids having to repeat that code. Give it a try and add some lines to call this new method for one of the car objects and then print out the info again.

Lesson content developed by Jan Wallgrun and James O’Brien

Inheritance, class hierarchies, and polymorphism

We already mentioned building class hierarchies via inheritance and polymorphism as two main principles of object-oriented programming in addition to encapsulation. To introduce you to these concepts, let us start with another exercise in object-oriented modeling and writing classes in Python. Imagine that you are supposed to write a very basic GIS or vector drawing program that only deals with geometric features of three types: circles, and axis-aligned rectangles and squares. You need the ability to store and manage an arbitrary number of objects of these three kinds and be able to perform simple operations with these objects like computing their area and perimeter and moving the objects to a different position. How would you write the classes for these three kinds of geometric objects?

Let us start with the class Circle: a circle in a two-dimensional coordinate system is typically defined by three values, the x and y coordinates of the center of the circle and its radius. So these should become the properties (= instance variables) of our Circle class and for computing the area and perimeter, we will provide two methods that return the respective values. The method for moving the circle will take the values by how much the circle should be moved along the x and y axes as parameters but not return anything.

import math  

class Circle():  
    def __init__(self, x = 0.0, y = 0.0, radius = 1.0):  
        self.x = x  
        self.y = y  
        self.radius = radius  

    def computeArea(self):  
        return math.pi * self.radius ** 2 

    def computePerimeter (self):  
        return 2 * math.pi * self.radius  

    def move(self, deltaX, deltaY):  
        self.x += deltaX  
        self.y += deltaY 

    def __str__(self):  
        return 'Circle with coordinates {0}, {1} and radius {2}'.format(self.x, self.y, self.radius)

In the constructor, we have keyword arguments with default values for the three properties of a circle and we assign the values provided via these three parameters to the corresponding instance variables of our class. We import the math module of the Python standard library so that we can use the constant math.pi for the computations of the area and perimeter of a circle object based on the instance variables. Finally, we add the __str__() method to produce a string that describes a circle object with its properties. It should by now be clear how to create objects of this class and, for instance, apply the computeArea() and move(…) methods.

circle1 = Circle(10,4,3) 
print(circle1) 
print(circle1.computeArea()) 
circle1.move(3,-1) 
print(circle1)

Output
Circle with coordinates 10, 4 and radius 3 
28.274333882308138 
Circle with coordinates 13, 3 and radius 3

How about a similar class for axis-aligned rectangles? Such rectangles can be described by the x and y coordinates of one of their corners together with width and height values, so four instance variables taking numeric values in total. Here is the resulting class and a brief example of how to use it:

class Rectangle(): 
	def __init__(self, x = 0.0, y = 0.0, width = 1.0, height = 1.0): 
		self.x = x 
		self.y = y 
		self.width = width 
		self.height = height 

    def computeArea(self): 
		return self.width * self.height 

    def computePerimeter (self): 
		return 2 * (self.width + self.height) 

    def move(self, deltaX, deltaY): 
		self.x += deltaX 
		self.y += deltaY 

	def __str__(self): 
		return 'Rectangle with coordinates {0}, {1}, width {2} and height {3}'.format(self.x, self.y, self.width, self.height ) 

rectangle1 = Rectangle(10,10,3,2) 
print(rectangle1) 
print(rectangle1.computeArea()) 
rectangle1.move(2,2) 
print(rectangle1)

Output
Rectangle with coordinates 10, 10, width 3 and height 2 
6 
Rectangle with coordinates 12, 12, width 3 and height 2

There are a few things that can be observed when comparing the two classes Circle and Rectangle we just created: the constructors obviously vary because circles and rectangles need different properties to describe them and, as a result, the calls when creating new objects for the two classes also look different. All the other methods have exactly the same signature, meaning the same parameters and the same kind of return value; just the way they are implemented differs. That means the different calls for performing certain actions with the objects (computing the area, moving the object, printing information about the object) also look exactly the same; it doesn’t matter whether the variable contains an object of class Circle or of class Rectangle. If you compare the two versions of the move(…) method, you will see that these even do not differ in their implementation, they are exactly the same!

This all is a clear indication that we are dealing with two classes of objects that could be seen as different specializations of a more general class for geometric objects. Wouldn’t it be great if we could now write the rest of our toy GIS program managing a set of geometric objects without caring whether an object is a Circle or a Rectangle in the rest of our code? And, moreover, be able to easily add classes for other geometric primitives without making any changes to all the other code, and in their class definitions only describe the things in which they differ from the already defined geometry classes? This is indeed possible by arranging our geometry classes in a class hierarchy starting with an abstract class for geometric objects at the top and deriving child classes for Circle and Rectangle from this class with both adding their specialized properties and behavior. Let’s call the top-level class Geometry. The resulting very simple class hierarchy is shown in the figure below.

Flowchart with "Geometry" leading to "Circle" and "Rectangle."

Figure 4.17 Simple class hierarchy with three classes. Classes Circle and Rectangle are both derived from parent class Geometry.

Inheritance allows the programmer to define a class with general properties and behavior and derive one or more specialized subclasses from it that inherit these properties and behavior but also can modify them to add more specialized properties and realize more specialized behavior. We use the terms derived class and base class to refer to the two classes involved when one class is derived from another.

Lesson content developed by Jan Wallgrun and James O’Brien

Implementing the class hierarchy

Let’s change our example so that both Circle and Rectangle are derived from such a general class called Geometry. This class will be an abstract class in the sense that it is not intended to be used for creating objects from. Its purpose is to introduce properties and templates for methods that all geometric classes in our project have in common.

class Geometry():  

    def __init__(self, x = 0.0, y = 0.0):  
        self.x = x  
        self.y = y  

    def computeArea(self):  
        pass 

    def computePerimeter(self):  
        pass 

    def move(self, deltaX, deltaY):  
        self.x += deltaX  
        self.y += deltaY  

    def __str__(self):  
        return 'Abstract class Geometry should not be instantiated and derived classes should override this method!'

The constructor of class Geometry looks pretty normal, it just initializes the instance variables that all our geometry objects have in common, namely x and y coordinates to describe their location in our 2D coordinate system. This is followed by the definitions of the methods computeArea(), computePerimeter(), move(…), and __str__() that all geometry objects should support. For move(…), we can already provide an implementation because it is entirely based on the x and y instance variables and works in the same way for all geometry objects. That means the derived classes for Circle and Rectangle will not need to provide their own implementation. In contrast, you cannot compute an area or perimeter in a meaningful way just from the position of the object. Therefore, we used the keyword pass to indicate that we are leaving the body of the computeArea() and computePerimeter() methods intentionally empty. These methods will have to be overridden in the definitions of the derived classes with implementations of their specialized behavior. We could have done the same for __str__() but instead we return a warning message that this class should not have been instantiated.

It is worth mentioning that, in many object-oriented programming languages, the concepts of an abstract class (= a class that cannot be instantiated) and an abstract method (= a method that must be overridden in every subclass that can be instantiated) are built into the language. That means there exist special keywords to declare a class or method to be abstract and then it is impossible to create an object of that class or a subclass of it that does not provide an implementation for the abstract methods. In Python, this has been added on top of the language via a module in the standard library called abc [7] (for abstract base classes). Although we won’t be using it in this course, it is a good idea to check it out and use it if you get involved in larger Python projects. This Abstract Classes page [8] is a good source for learning more.

Here is our new definition for class Circle that is now derived from class Geometry. We also use a few commands at the end to create and use a new Circle object of this class to make sure everything is indeed working as before:

import math  

class Circle(Geometry): 

	def __init__(self, x = 0.0, y = 0.0, radius = 1.0): 
		super(Circle,self).__init__(x,y) 
		self.radius = radius 

	def computeArea(self): 
		return math.pi * self.radius ** 2 

	def computePerimeter (self): 
		return 2 * math.pi * self.radius 

	def __str__(self): 
		return 'Circle with coordinates {0}, {1} and radius {2}'.format(self.x, self.y, self.radius) 

circle1 = Circle(10, 10, 10) 
print(circle1.computeArea()) 
print(circle1.computePerimeter()) 
circle1.move(2,2) 
print(circle1)

Here are the things we needed to do in the code:

In line 3, we had to change the header of the class definition to include the name of the base class we are deriving Circle from (‘Geometry’) within the parentheses.
The constructor of Circle takes the same three parameters as before. However, it only initializes the new instance variable radius in line 7. For initializing the other two variables it calls the constructor of its base class, so the class Geometry, in line 6 with the command “super(Circle,self).__init__(x,y)”. This is saying “call the constructor of the base class of class Circle and pass the values of x and y as parameters to it”. It is typically a good idea to call the constructor of the base class as the first command in the constructor of the derived class so that all general initializations are taken care off.
Then we provide definitions of computeArea() and computePerimeter() that are specific for circles. These definitions override the “empty” definitions of the Geometry base class. This means whenever we invoke computeArea() or computePerimeter() for an object of class Circle, the code from these specialized definitions will be executed.
Note that we do not provide any definition for method move(…) in this class definition. That means when move(…) will be invoked for a Circle object, the code from the corresponding definition in its base class Geometry will be executed.
We do override the __str__() method to produce the same kind of string with information about all instance variables that we had in the previous definition. Note that this function accesses both the instance variables defined in the parent class Geometry as well as the additional one added in the definition of Circle.

The new definition of class Rectangle, now derived from Geometry, looks very much the same as that of Circle if you replace “Circle” with “Rectangle”. Only the implementations of the overridden methods look different, using the versions specific for rectangles.

class Rectangle(Geometry): 

	def __init__(self, x = 0.0, y = 0.0, width = 1.0, height = 1.0): 
		super(Rectangle, self).__init__(x,y) 
		self.width = width 
        self.height = height 

	def computeArea(self): 
		return self.width * self.height 

	def computePerimeter (self): 
		return 2 * (self.width + self.height) 

	def __str__(self): 
		return 'Rectangle with coordinates {0}, {1}, width {2} and height {3}'.format(self.x, self.y, self.width, self.height ) 

rectangle1 = Rectangle(15,20,4,5) 
print(rectangle1.computeArea()) 
print(rectangle1.computePerimeter()) 
rectangle1.move(2,2) 
print(rectangle1)

Lesson content developed by Jan Wallgrun and James O’Brien

Class attributes and static class functions

In this section we are going to look at two additional concepts that can be part of a class definition, namely class variables/attributes and static class functions. We will start with class attributes even though it is the less important one of these two concepts and won't play a role in the rest of this lesson. Static class functions, on the other hand, will be used in the walkthrough code of this lesson and also will be part of the homework assignment.

We learned in this lesson that for each instance variable defined in a class, each object of that class possesses its own copy so that different objects can have different values for a particular attribute. However, sometimes it can also be useful to have attributes that are defined only once for the class and not for each individual object of the class. For instance, if we want to count how many instances of a class (and its subclasses) have been created while the program is being executed, it would not make sense to use an instance variable with a copy in each object of the class for this. A variable existing at the class level is much better suited for implementing this counter and such variables are called class variables or class attributes. Of course, we could use a global variable for counting the instances but the approach using a class attribute is more elegant as we will see in a moment.

The best way to implement this instance counter idea is to have the code for incrementing the counter variable in the constructor of the class because that means we don’t have to add any other code and it’s guaranteed that the counter will be increased whenever the constructor is invoked to create a new instance. The definition of a class attribute in Python looks like a normal variable assignment but appears inside a class definition outside of any method, typically before the definition of the constructor. Here is what the definition of a class attribute counter for our Geometry class could look like. We are adding the attribute to the root class of our hierarchy so that we can use it to count how many geometric objects have been created in total.

class Geometry(): 
   counter = 0 

   def __init__(self, x = 0.0, y = 0.0): 
      self.x = x 
      self.y = y 
      Geometry.counter += 1 
…

The class attribute is defined in line 2 and the initial value of zero is assigned to it when the class is loaded so before the first object of this class is created. We already included a modified version of the constructor that increases the value of counter by one. Since each constructor defined in our class hierarchy calls the constructor of its base class, the counter class attribute will be increased for every geometry object created. Please note that the main difference between class attributes and instance variables in the class definition is that class attributes don’t use the prefix “self.” but the name of the class instead, so Geometry.counter in this case. Go ahead and modify your class Geometry in this way, while keeping all the rest of the code unchanged.

While instance variables can only be accessed for an object, e.g. using <variable containing the object>.<name of the instance variable>, we can access class attributes by using the name of the class, i.e. <name of the class>.<name of the class attribute>. That means you can run the code and use the statement

print(Geometry.counter)

… to get the value currently stored in this new class attribute. Since we have not created any geometry objects since making this change, the output should be 0.

Let’s now create two geometry objects of different types, for instance, a circle and a square:

Circle(10,10,10) 
Square(5,5,8)

Now run the previous print statement again and you will see that the value of the class variable is now 2. Class variables like this are suitable for storing all information related to the class, so essentially everything that does not describe the state of individual objects of the class.

Class definitions can also contain definitions of functions that are not methods, meaning they are not invoked for a specific object of that class and they do not access the state of a particular object. We will refer to such functions as static class functions. Like class attributes they will be referred to from code by using the name of the class as prefix. Class functions allow for implementing some functionality that is in some way related to the class but not the state of a particular object. They are also useful for providing auxiliary functions for the methods of the class. It is important to note that since static class functions are associated with the class but not an individual object of the class, you cannot directly refer to the instance variables in the body of a static class function like you can in the definitions of methods. However, you can refer to class attributes as you will see in a moment.

A static class function definition can be distinguished from the definition of a method by the lack of the “self” as the first parameter of the function; so it looks like a normal function definition but is located inside a class definition. To give a very simple example of a static class function, let’s add a function called printClassInfo() to class Geometry that simply produces a nice output message for our counter class attribute:

class Geometry(): 
    … 

    def printClassInfo(): 
        print( "So far, {0} geometric objects have been created".format(Geometry.counter) )

We have included the header of the class definition to illustrate how the definition of the function is embedded into the class definition. You can place the function definition at the end of the class definition, but it doesn’t really matter where you place it, you just have to make sure not to paste the code into the definition of one of the methods. To call the function you simply write:

Geometry.printClassInfo()

The exact output depends on how many objects have been created but it will be the current value of the counter class variable inserted into the text string from the function body.

Go ahead and save your completed geometry script since we'll be using it later in this lesson.

In the program that we will develop in the walkthroughs of this lesson, we will use static class functions that work somewhat similarly to the constructor in that they can create and return new objects of the class but only if certain conditions are met. We will use this idea to create event objects for certain events detected in bus GPS track data. The static functions defined in the different bus event classes (called detect()) will be called with the GPS data and only return an object of the respective event class if the conditions for this kind of bus event are fulfilled. Here is a sketch of a class definition that illustrates this idea:

class SomeEvent(): 
    ...

    # static class function that creates and returns an object of this class only if certain conditions are satisfied
    def detect(data): 
        ... # perform some tests with data provided as parameter
        if ...: # if conditions are satisfied, use constructor of SomeEvent to create an object and return that object
              return SomeEvent(...)
        else:   # else the function returns None
              return None

# calling the static class function from outside the class definition,
# the returned SomeEvent object will be stored in variable event
event = SomeEvent.detect(...)
if event: # test whether an object has been returned
    ... # do something with the new SomeEvent object

Lesson content developed by Jan Wallgrun and James O’Brien

Loops, if/else, ternary operators

Loops, continue, break

Next, let’s quickly revisit loops in Python. There are two kinds of loops in Python, the for-loop and the while-loop. You should know that the for-loop is typically used when the goal is to go through a given set or list of items or do something a certain number of times. In the first case, the for-loop typically looks like this:

for item in list:
	# do something with item

while in the second case, the for-loop is often used together with the range(…) function to determine how often the loop body should be executed:

for item in list: 
    # do something with item 
    for i in range(50):

In contrast, the while-loop has a condition that is checked before each iteration and if the condition becomes False, the loop is terminated and the code execution continues after the loop body. With this knowledge, it should be pretty clear what the following code example does:

import random 
r = random.randrange(100)  # produce random number between 0 and 99  
 
attempts = 1 
 
while r != 11: 
    attempts += 1 
    r = random.randrange(100) 
 
print('This took ' + str(attempts) + ' attempts')

Break and Continue

What you may not yet know is that there are two additional commands, break and continue, that can be used in combination with either a for or a while-loop. The break command will automatically terminate the execution of the current loop and continue with the code after it. If the loop is part of a nested loop only the inner loop will be terminated. This means we can rewrite the program from above using a for-loop rather than a while-loop like this:

import random  
attempts = 0 
for i in range(1000):   
    r = random.randrange(100)  
    attempts += 1 
    if r == 11:  
        break  # terminate loop and continue after it  
print('This took ' + str(attempts) + ' attempts')

When the random number produced in the loop body is 11, the body of the if-statement, so the break command, will be executed and the program execution immediately leaves the loop and continues with the print statement after it. Obviously, this version is not completely identical to the while based version from above because the loop will be executed at most 1000 times here.

If you have experience with programming languages other than Python, you may know that some languages have a "do … while" loop construct where the condition is only tested after each time the loop body has been executed so that the loop body is always executed at least once. Since we first need to create a random number before the condition can be tested, this example would actually be a little bit shorter and clearer using a do-while loop. Python does not have a do-while loop but it can be simulated using a combination of while and break:

import random 
attempts = 0  
while True:  
    r = random.randrange(100)  
    attempts += 1 
    if r == 11:  
        break 
print('This took ' + str(attempts) + ' attempts')

A while loop with the condition True will in principle run forever. However, since we have the if-statement with the break, the execution will be terminated as soon as the random number generator rolls an 11. While this code is not shorter than the previous while-based version, we are only creating random numbers in one place, so it can be considered a little bit more clear.

When a continue command is encountered within the body of a loop, the current execution of the loop body is also immediately stopped, but in contrast to the break command, the execution then continues with the next iteration of the loop body. Of course, the next iteration is only started if, in the case of a while-loop, the condition is still true, or in the case of a for-loop, there are still remaining items in the list that we are looping through. The following code goes through a list of numbers and prints out only those numbers that are divisible by 3 (without remainder).

l = [3,7,99,54,3,11,123,444]  
for n in l:  
    if n % 3 != 0:   # test whether n is not divisible by 3 without remainder  
        continue 
    print(n)

This code uses the modulo operator % to get the remainder of the division of n and 3 in line 5. If this remainder is not 0, the continue command is executed and, as a result, the program execution directly jumps back to the beginning of the loop and continues with the next number. If the condition is False (meaning the number is divisible by 3), the execution continues as normal after the if-statement and prints out the number. Hopefully, it is immediately clear that the same could have been achieved by changing the condition from != to == and having an if-block with just the print statement, so this is really just a toy example illustrating how continue works.

As you saw in these few examples, there are often multiple ways in which for, while, break, continue, and if-else can be combined to achieve the same thing. While break and continue can be useful commands, they can also make code more difficult to read and understand. Therefore, they should only be used sparingly and when their usage leads to a simpler and more comprehensible code structure than a combination of for /while and if-else would do.

Lesson content developed by Jan Wallgrun and James O’Brien

Expressions, if/else & ternary operator, and match

Expressions

You are already familiar with Python binary operators that can be used to define arbitrarily complex expressions. For instance, you can use arithmetic expressions that evaluate to a number, or boolean expressions that evaluate to either True or False. Here is an example of an arithmetic expression using the arithmetic operators – and *:

x = 25 – 2 * 3

Each binary operator takes two operand values of a particular type (all numbers in this example) and replaces them by a new value calculated from the operands. All Python operators are organized into different precedence classes, determining in which order the operators are applied when the expression is evaluated unless parentheses are used to explicitly change the order of evaluation. This operator precedence table [9] shows the classes from lowest to highest precedence. The operator * for multiplication has a higher precedence than the – operator for subtraction, so the multiplication will be performed first and the result of the overall expression assigned to variable x is 19.

Here is an example for a boolean expression:

x = y > 12 and z == 3

The boolean expression on the right side of the assignment operator contains three binary operators: two comparison operators, > and ==, that take two numbers and return a boolean value, and the logical ‘and’ operator that takes two boolean values and returns a new boolean (True only if both input values are True, False otherwise). The precedence of ‘and’ is lower than that of the two comparison operators, so the ‘and’ will be evaluated last. So if y has the value 6 and z the value 3, the value assigned to variable x by this expression will be False because the comparison on the left side of the ‘and’ evaluates to False.

if/else & ternary operator

In addition to all these binary operators, Python has a ternary operator, so an operator that takes three operands as input. This operator has the format

x if c else y

x, y, and c here are the three operands while ‘if’ and ‘else’ are the keywords making up the operator and demarcating the operands. While x and y can be values or expressions of arbitrary type, the condition c needs to be a boolean value or expression. What the operator does is it looks at the condition c and if c is True it evaluates to x, else it evaluates to y. So for example, in the following line of code

p = 1 if x > 12 else 0

variable p will be assigned the value 1 if x is larger than 12, else p will be assigned the value 0. Obviously what the ternary if-else operator does is very similar to what we can do with an if or if-else statement. For instance, we could have written the previous code as

p = 1 
if x > 12:  
    p = 0

The “x if c else y” operator is an example of a language construct that does not add anything principally new to the language but enables writing things more compactly or more elegantly. That’s why such constructs are often called syntactic sugar. The nice thing about “x if c else y” is that in contrast to the if-else statement, it is an operator that evaluates to a value and, hence, can be embedded directly within more complex expressions as in the following example that uses the operator twice:

newValue = 25 + (10 if oldValue < 20 else 44) / 100 + (5 if useOffset else 0)

Using an if-else statement for this expression would have required at least five lines of code. If you have more than two possibilities, you will need to utilize the if/elif/else structure or you can implement what is called a ‘object literal’ or ‘switch case’.

Match

Other coding languages include a switch/case construct that executes or assigns values based on a condition. Python introduced this as ‘match’ in 3.10 but it can also done with a dictionary and the built in dict.get() method. This construct replaces multiple elifs in the if/elif/else structure and provides an explicit means of setting values.

For example, what if we wanted to set a value based on another value? The long way would be to create an if, elif, else like so:

p = 0 
 
for x in [1, 13, 12, 6]: 
    if x == 1: 
        p = ‘One’ 
    elif x == 13: 
        p = ‘Two’ 
    elif x == 12: 
        p = ‘Three’ 
 
    print(p)

Output
One
Two 
Three

The elifs can get long depending on the number of possibilities and difficult to read. Using match, you can control the flow of the program by explicitly setting cases and the desired code that should be executed if that case matches the condition.

An example is provided below:

command = 'Hello, Geog 485!' 
 
match command: 
    case ‘Hello, Geog 485!’: 
        print('Hello to you too!') 
    case 'Goodbye, World!': 
        print('See you later') 
    case other: 
        print('No match found')

Output 
Hello to you too!

‘Hello, Geog 485’ is a string assigned to the variable command. The interpreter will compare the incoming variable against the cases. When there is a True result, a ‘match’ between the incoming object and one of the cases, the code within the case scope will execute. In the example, the first case equaled the command, resulting in the Hello to you too! printing.

With the dict.get(…) dictionary lookup mentioned earlier, you can also include a default value if one of the values does not match any of the keys in a much more concise way:

possible_values_dict = {1: 'One', 13: 'Two', 12: 'Three'} 
for x in [1, 13, 12, 6]: 
    print(possible_values_dict.get(x, 'No match found'))

Output 
One 
Two 
Three 
No match found

In the example above, 1, 13, and 12 are keys in the dictionary and their values were returned for the print statement. Since 6 is not present in the dictionary, the result is the set default value of ‘No match found’. This default value return is helpful when compared to the dict[‘key’] retrieval method by not throwing a KeyError Exception and stopping the script or requiring that added code is written to handle the KeyError as shown below.

possible_values_dict = {1: 'One', 13: 'Two', 12: 'Three'} 
for x in [1, 13, 12, 6]: 
    print(possible_values_dict[x])

As mentioned earlier in the lesson, dictionaries are a very powerful data structure in Python and can even be used to execute functions as values using the .get(…) construct above. For example, let’s say we have different tasks that we want to run depending on a string value. This construct will look like the code below:

task = ‘monthly’ 
getTask = {'daily': lambda: get_daily_tasks(), 
            'monthly': lambda: get_monthly_tasks(), 
           'taskSet': lambda: get_one_off()} 
 
getTask.get(task)()

The .get will return the lambda (introduced in the next module) for the matching key passed in . The empty () after the .get(task) then executes the function that was returned in the .get(task) call. .get() takes a second parameter that is a default return if there is no key match. You can set it to be a function, or a value.

getTask.get(task, get_hourly_tasks)()

Portions of this content developed by Jan Wallgrun and James O’Brien

Functions to include lambdas

From previous experience, you should be familiar with defining simple functions that take a set of input parameters and potentially return some value. When calling such a function from somewhere in your Python code, you must provide values (or expressions that evaluate to some value) for each of these parameters, and these values are then accessible under the names of the respective parameters in the code that makes up the body of the function.

However, from working with different tool functions provided by arcpy and different functions from the Python standard library, you also already know that functions can also have optional parameters (often denoted by italics or within {} or 'opt' in documentation and are generally indexed after all of the required parameters), and you can use the names of such parameters to explicitly provide a value for them when calling the function. In this section, we will show you how to write functions with such keyword arguments and functions that take an arbitrary number of parameters, and we will discuss some more details about passing different kinds of values as parameters to a function.

Functions with keyword arguments

The parameters we have been using so far, for which we only specify a name in the function definition, are called positional parameters or positional arguments because the value that will be assigned to them when the function is called depends on their position in the parameter list: The first positional parameter will be assigned the first value given within the parentheses (…) when the function is called, and so on. Here is a simple function with two positional parameters, one for providing the last name of a person and one for providing a form of address. The function returns a string to greet the person with.

def greet(lastName, formOfAddress): 
      return 'Hello {0} {1}!'.format(formOfAddress, lastName) 

print(greet('Smith', 'Mrs.'))

Output 

Hello Mrs. Smith!

Note how the first value used in the function call (“Smith”) in line 6 is assigned to the first positional parameter (lastName) and the second value (“Mrs.”) to the second positional parameter (formOfAddress). Nothing new here so far.

The parameter list of a function definition can also contain one or more so-called keyword arguments. A keyword argument appears in the parameter list as

A keyword argument can be provided in the function by again using the notation

It can also be left out, in which case the default value specified in the function definition is used. This means keyword arguments are optional. Here is a new version of our greet function that now supports English and Spanish, but with English being the default language:

def greet(lastName, formOfAddress, language = 'English'): 
      greetings = { 'English': 'Hello', 'Spanish': 'Hola' }

      return '{0} {1} {2}!'.format(greetings[language], formOfAddress, lastName) 

 
print(greet('Smith', 'Mrs.')) 

print(greet('Rodriguez', 'Sr.', language = 'Spanish'))

Output

Hello Mrs. Smith! 
Hola Sr. Rodriguez!

Compare the two different ways in which the function is called in lines 8 and 10. In line 8, we do not provide a value for the ‘language’ parameter, so the default value ‘English’ is used when looking up the proper greeting in the dictionary stored in variable greetings. In the second version in line 10, the value ‘Spanish’ is provided for the keyword argument ‘language,’ so this is used instead of the default value and the person is greeted with “Hola” instead of "Hello." Keyword arguments can be used like positional arguments meaning the second call could also have been

print(greet('Rodriguez', 'Sr.', 'Spanish'))

without the “language =” before the value.

Things get more interesting when there are several keyword arguments, so let’s add another one for the time of day:

def greet(lastName, formOfAddress, language = 'English', timeOfDay = 'morning'): 
    greetings = { 'English': { 'morning': 'Good morning', 'afternoon': 'Good afternoon' }, 
                  'Spanish': { 'morning': 'Buenos dias', 'afternoon': 'Buenas tardes' } } 
    return '{0}, {1} {2}!'.format(greetings[language][timeOfDay], formOfAddress, lastName) 

 
print(greet('Smith', 'Mrs.')) 
print(greet('Rodriguez', 'Sr.', language = 'Spanish', timeOfDay = 'afternoon'))

Output

Good morning, Mrs. Smith! 
Buenas tardes, Sr. Rodriguez!

Since we now have four different forms of greetings depending on two parameters (language and time of day), we now store these in a dictionary in variable greetings that for each key (= language) contains another dictionary for the different times of day. For simplicity reasons, we left it at two times of day, namely “morning” and “afternoon.” In line 7, we then first use the variable language as the key to get the inner dictionary based on the given language and then directly follow up with using variable timeOfDay as the key for the inner dictionary.

The two ways we are calling the function in this example are the two extreme cases of (a) providing none of the keyword arguments, in which case default values will be used for both of them (line 10), and (b) providing values for both of them (line 12). However, we could now also just provide a value for the time of day if we want to greet an English person in the afternoon:

print(greet('Rogers', 'Mrs.', timeOfDay = 'afternoon'))

Output 

Good afternoon, Mrs. Rogers!

This is an example in which we have to use the prefix “timeOfDay =” because if we leave it out, it will be treated like a positional parameter and used for the parameter ‘language’ instead which will result in an error when looking up the value in the dictionary of languages. For similar reasons, keyword arguments must always come after the positional arguments in the definition of a function and in the call. However, when calling the function, the order of the keyword arguments doesn’t matter, so we can switch the order of ‘language’ and ‘timeOfDay’ in this example:

print(greet('Rodriguez', 'Sr.', timeOfDay = 'afternoon', language = 'Spanish'))

Of course, it is also possible to have function definitions that only use optional keyword arguments in Python.

Let us continue with the “greet” example, but let’s modify it to be a bit simpler again with a single parameter for picking the language, and instead of using last name and form of address we just go with first names. However, we now want to be able to not only greet a single person but arbitrarily many persons, like this:

greet('English', 'Jim', 'Michelle')

Output: 

Hello Jim! 
Hello Michelle!

greet('Spanish', 'Jim', 'Michelle', 'Sam')

Output: 

Hola Jim! 
Hola Michelle! 
Hola Sam!

To achieve this, the parameter list of the function needs to end with a special parameter that has a * symbol in front of its name. If you look at the code below, you will see that this parameter is treated like a list in the body of the function:

def greet(language, *names): 
     greetings = { 'English': 'Hello', 'Spanish': 'Hola' } 
     for n in names: 
          print('{0} {1}!'.format(greetings[language], n))

What happens is that all values given to that function from the one corresponding to the parameter with the * on will be placed in a list and assigned to that parameter. This way you can provide as many parameters as you want with the call and the function code can iterate through them in a loop. Please note that for this example we changed things so that the function directly prints out the greetings rather than returning a string.

We also changed language to a positional parameter because if you want to use keyword arguments in combination with an arbitrary number of parameters, you need to write the function in a different way. You then need to provide another special parameter starting with two stars ** and that parameter will be assigned a dictionary with all the keyword arguments provided when the function is called. Here is how this would look if we make language a keyword parameter again:

def greet(*names, **kwargs): 
 
     greetings = { 'English': 'Hello', 'Spanish': 'Hola' } 
 
     language = kwargs['language'] if 'language' in kwargs else 'English'
 
     for n in names: 
 
          print('{0} {1}!'.format(greetings[language], n))

If we call this function as

greet('Jim', 'Michelle')

the output will be:

Hello Jim! 
Hello Michelle!

And if we use

greet('Jim', 'Michelle', 'Sam', language = 'Spanish')

we get:

Hola Jim! 
Hola Michelle! 
Hola Sam!

Yes, this is getting quite complicated, and it’s possible that you will never have to write functions with both * and ** parameters, still here is a little explanation: All non-keyword parameters are again collected in a list and assigned to variable names. All keyword parameters are placed in a dictionary using the name appearing before the equal sign as the key, and the dictionary is assigned to variable kwargs. To really make the ‘language’ keyword argument optional, we have added line 5 in which we check if something is stored under the key ‘language’ in the dictionary (this is an example of using the ternary "... if ... else ..." operator). If yes, we use the stored value and assign it to variable language, else we instead use ‘English’ as the default value. In line 9, language is then used to get the correct greeting from the dictionary in variable greetings while looping through the name list in variable names.

Lesson content developed by Jan Wallgrun and James O’Brien

Higher order functions

In this section, we are going to introduce a new and very powerful concept of Python (and other programming languages), namely the idea that functions can be given as parameters to other functions similar to how we have been doing so far with other types of values like numbers, strings, or lists. You see examples of this near the end of the lesson with the pool.starmap(...) function. A function that takes other functions as arguments is often called a higher order function.

Let us immediately start with an example: Let’s say you often need to apply certain string functions to each string in a list of strings. Sometimes you want to convert the strings from the list to be all in upper-case characters, sometimes to be all in lower-case characters, sometimes you need to turn them into all lower-case characters but have the first character capitalized, or apply some completely different conversion. The following example shows how one can write a single function for all these cases and then pass the function to apply to each list element as a parameter to this new function:

def applyToEachString(stringFunction, stringList):
	myList = []
	for item in stringList:
		myList.append(stringFunction(item))
	return myList

allUpperCase = applyToEachString(str.upper, ['Building', 'ROAD', 'tree'] )
print(allUpperCase)

As you can see, the function definition specifies two parameters; the first one is for passing a function that takes a string and returns either a new string from it or some other value. The second parameter is for passing along a list of strings. In line 7, we call our function with using str.upper for the first parameter and a list with three words for the second parameter. The word list intentionally uses different forms of capitalization. upper() is a string method that turns the string it is called for into all upper-case characters. Since this a method and not a function, we have to use the name of the class (str) as a prefix, so “str.upper”. It is important that there are no parentheses () after upper because that would mean that the function will be called immediately and only its return value would be passed to applyToEachString(…).

In the function body, we simply create an empty list in variable myList, go through the elements of the list that is passed in parameter stringList, and then in line 4 call the function that is passed in parameter stringFunction to an element from the list. The result is appended to list myList and, at the end of the function, we return that list with the modified strings. The output you will get is the following:

['BUILDING', 'ROAD', 'TREE']

If we now want to use the same function to turn everything into all lower-case characters, we just have to pass the name of the lower() function instead, like this:

allLowerCase = applyToEachString(str.lower, ['Building', 'ROAD', 'tree'] )
print(allLowerCase)

Output 
['building', 'road', 'tree']

You may at this point say that this is more complicated than using a simple list comprehension that does the same, like:

[ s.upper() for s in ['Building', 'ROAD', 'tree'] ]

That is true in this case but we are just creating some simple examples that are easy to understand here. For now, trust us that there are more complicated cases of higher-order functions that cannot be formulated via list comprehension.

For converting all strings into strings that only have the first character capitalized, we first write our own function that does this for a single string. There actually is a string method called capitalize() that could be used for this, but let’s pretend it doesn’t exist to show how to use applyToEachString(…) with a self-defined function.

def capitalizeFirstCharacter(s):
	return s[:1].upper() + s[1:].lower()

allCapitalized = applyToEachString(capitalizeFirstCharacter, ['Building', 'ROAD', 'tree'] )
print(allCapitalized)

Output
['Building', 'Road', 'Tree']

The code for capitalizeFirstCharacter(…) is rather simple. It just takes the first character of the given string s and turns it into upper-case, then takes the rest of the string and turns it into lower-case, and finally puts the two pieces together again. Please note that since we are passing a function as parameter not a method of a class, there is no prefix added to capitalizeFirstCharacter in line 4.

Lesson content developed by Jan Wallgrun and James O’Brien

Lamda function

In a case like this where the function you want to use as a parameter is very simple like just a single expression and you only need this function in this one place in your code, you can skip the function definition completely and instead use a so-called lambda expression. A lambda expression basically defines a function without giving it a name using the format (there's a good first principles discussion on Lambda functions here [10] at RealPython).

lambda <parameters>: <expression for the return value>

For capitalizeFirstCharacter(…), the corresponding lamba expression would be this:

lambda s: s[:1].upper() + s[1:].lower()

Note that the part after the colon does not contain a return statement; it is always just a single expression and the result from evaluating that expression automatically becomes the return value of the anonymous lambda function. That means that functions that require if-else or loops to compute the return value cannot be turned into lambda expression. When we integrate the lambda expression into our call of applyToEachString(…), the code looks like this:

allCapitalized = applyToEachString(lambda s: s[:1].upper() +  s[1:].lower(), ['Building', 'ROAD', 'tree'] )

Lambda expressions can be used everywhere where the name of a function can appear, so, for instance, also within a list comprehension:

[(lambda s: s[:1].upper() + s[1:].lower())(s) for s in ['Building', 'ROAD', 'tree'] ]

We here had to put the lambda expression into parenthesis and follow up with “(s)” to tell Python that the function defined in the expression should be called with the list comprehension variable s as parameter.

So far, we have only used applyToEachString(…) to create a new list of strings, so the functions we used as parameters always were functions that take a string as input and return a new string. However, this is not required. We can just as well use a function that returns, for instance, numbers like the number of characters in a string as provided by the Python function len(…). Before looking at the code below, think about how you would write a call of applyToEachString(…) that does that!

Here is the solution.

wordLengths = applyToEachString(len, ['Building', 'ROAD', 'tree'] )
print(wordLengths)

len(…) is a function so we can simply put in its name as the first parameter. The output produced is the following list of numbers:

Output
[8, 4, 4]

With what you have seen so far in this lesson the following code example should be easy to understand:

def applyToEachNumber(numberFunction, numberList):
	l = []
	for item in numberList:
		l.append(numberFunction(item))
	return l

roundedNumbers = applyToEachNumber(round, [12.3, 42.8] )
print(roundedNumbers)

Right, we just moved from a higher-order function that applies some other function to each element in a list of strings to one that does the same but for a list of numbers. We call this function with the round(...) function for rounding a floating point number. The output will be:

Output
[12.0, 43.0]

If you compare the definition of the two functions applyToEachString(…) and applyToEachNumber(…), it is pretty obvious that they are exactly the same, we just slightly changed the names of the input parameters! The idea of these two functions can be generalized and then be formulated as “apply a function to each element in a list and build a list from the results of this operation” without making any assumptions about what type of values are stored in the input list. This kind of general higher-order function is already available in the Python standard library. It is called map(…) and it is one of several commonly used higher-order functions defined there. In the following, we will go through the three most important list-related functions defined there, called map(…), reduce(…), and filter(…).

Lesson content developed by Jan Wallgrun and James O’Brien

Map

Like our more specialized versions, map(…) takes a function (or method) as the first input parameter and a list as the second parameter. It is the responsibility of the programmer using map(…) to make sure that the function provided as a parameter is able to work with whatever is stored in the provided list. In Python 3, a change to map(…) has been made so that it now returns a special map object rather than a simple list. However, whenever we need the result as a normal list, we can simply apply the list(…) function to the result like this:

l = list(map(…, …))

The three examples below show how we could have performed the conversion to upper-case and first character capitalization, and the rounding task with map(...) instead of using our own higher-order functions:

map(str.upper, ['Building', 'Road', 'Tree'])

map(lambda s: s[:1].upper() + s[1:].lower(), ['Building', 'ROAD', 'tree']) # uses lambda expression for only first character as upper-case

map(round, [12.3, 42.8])

Map is actually more powerful than our own functions from above in that it can take multiple lists as input together with a function that has the same number of input parameters as there are lists. It then applies that function to the first elements from all the lists, then to all second elements, and so on. We can use that to, for instance, create a new list with the sums of corresponding elements from two lists as in the following example. The example code also demonstrates how we can use the different Python operators, like the + for addition, with higher-order functions: The operator module [11] from the standard Python library contains function versions of all the different operators that can be used for this purpose. The one for + is available as operator.add(...).

import operator
map(operator.add, [1,3,4], [4,5,6])

Output
[5, 8, 10]

As a last map example, let’s say you instead want to add a fixed number to each number in a single input list. The easiest way would then again be to use a lambda expression:

number = 11
map(lambda n: n + number, [1,3,4,7])

Output
[12, 14, 15, 18]

Lesson content developed by Jan Wallgrun and James O’Brien

Filter

The goal of the filter(…) higher-order function is to create a new list with only certain items from the original list that all satisfy some criterion by applying a boolean function to each element (a function that returns either True or False) and only keeping an element if that function returns True for that element.

Below we provide two examples for this, one for a list of strings and one for a list of numbers. The first example uses a lambda expression that uses the string method startswith(…) to check whether or not a given string starts with the character ‘R’. Here is the code:

newList = filter(lambda s: s.startswith('R'), ['Building', 'ROAD', 'tree'])
print(newList)

Output
['ROAD']

In the second example, we use is_integer() from the float class to take only those elements from a list of floating point numbers that are integer numbers. Since this is a method, we again need to use the class name as a prefix (“float.”).

newList = filter(float.is_integer, [12.4, 11.0, 17.43, 13.0])
print(newList)

Output
[11.0, 13.0]

Lesson content developed by Jan Wallgrun and James O’Brien

Reduce

The last higher-order function we are going to discuss here is reduce(…). In Python 3, it needs to be imported from the module functools. Its purpose is to combine (or “reduce”) all elements from a list into a single value by using an aggregation function taking two parameters that is used to combine the first and the second element, then the result with the third element, and so on until all elements from the list have been incorporated. The standard example for this is to sum up all values from a list of numbers. reduce(…) takes three parameters: (1) the aggregation function, (2) the list, and (3) an accumulator parameter. To understand this third parameter, think about how you would solve the task of summing up the numbers in a list with a for-loop. You would use a temporary variable initialized to zero and then add each number from that list to that variable which in the end would contain the final result. If you instead would want to compute the product of all numbers, you would do the same but initialize that variable to 1 and use multiplication instead of addition. The third parameter of reduce(…) is the value used to initialize this temporary variable. That should make it easy to understand the arguments used in the following two examples:

import operator
from functools import reduce

result = reduce(operator.add, [234,3,3], 0) # sum
print(result)

Output
240

import operator
from functools import reduce

result = reduce(operator.mul, [234,3,3], 1) # product
print(result)

Output
2106

Other things reduce(…) can be used for are computing the minimum or maximum value of a list of numbers or testing whether or not any or all values from a list of booleans are True. We will see some of these use cases in the practice exercises of this lesson. Examples of the higher-order functions discussed in this section will occasionally appear in the examples and walkthrough code of the remaining lessons.

Lesson content developed by Jan Wallgrun and James O’Brien

Multiprocessing

Python includes several different syntaxial ways of executing processes in parallel. Each way comes with a list of pros and cons. The major hinderance with Python’s multithreading is the serialization and deserialization of data to and from the threads. This is called ‘pickling’ and will be discussed more in detail later in the section. It is important to note that custom classes, such as those found in arcpy (geoprocessing results, Featureclasses or layers) will need a custom serializer and de-serializer for geoprocessing results to be returned from threads. This takes significant work and coding to create and I have yet to see one. Trying to return a object outside of the built in types will result in an Exception that the object cannot be pickled. The method of multiprocessing that we will be using utilizes the map method that we covered earlier in the lesson as a starmap(), or you can think of it as ‘start map’. The method starts a new thread for each item in the list and holds the results from each process in a list.

What if you wanted to run different scripts at the same time? The starmap() method is great for a single process done i number of times but you can also be more explicit by using the pool.apply_async() method. Instead of using the map construct, you assign each process to a variable and then call .get() for the results. Note here that the parameters need to be passed as a tuple. Single params need to be passed as (arg,), but if you have more than one parameter to pass, the tuple is (arg1, arg2, arg3).

For example:

with mp.Pool(processes=5) as pool: 
    p1 = pool.apply_async(scriptA, (1param,)) 
    p2 = pool.apply_async(scriptB, (1param, 2param)) 
    p3 = pool.apply_async(scriptC, (1param,)) 
 
    res = [p1.get(), p2.get(), p3.get(), …]

First steps with multiprocessing

You might have realized that there are generally two broad types of tasks – those that are input/output (I/O) heavy which require a lot of data to be read, written or otherwise moved around; and those that are CPU (or processor) heavy that require a lot of calculations to be done. Because getting data is the slowest part of our operation, I/O heavy tasks do not demonstrate the same improvement in performance from multiprocessing as CPU heavy tasks. The more work there is to do for the CPU the greater the benefit in splitting that workload among a range of processors so that they can share the load.

The other thing that can slow us down is outputting to the screen – although this isn’t really an issue in multiprocessing because printing to our output window can get messy. Think about two print statements executing at exactly the same time – you’re likely to get the content of both intermingled, leading to a very difficult to understand message. Even so, updating the screen with print statements is a slow task.

Don’t believe me? Try this sample piece of code that sums the numbers from 0-100.

import time 
 
start_time = time.time() 
 
sum = 0 
for i in range(0, 100): 
    sum += i 
    print(sum) 
 
# Output how long the process took.  
print("--- %s seconds ---" % (time.time() - start_time))

If I run it with the print function in the loop the code takes 0.049 seconds to run on my PC. If I comment that print function out, the code runs in 0.0009 seconds.

4278
4371
4465
4560
4656
4753
4851
4950
--- 0.04900026321411133 seconds ---

runfile('C:/Users/jao160/Documents/Teaching_PSU/Geog489_SU_21/Lesson 1/untitled1.py', wdir='C:/Users/jao160/Documents/Teaching_PSU/Geog489_SU_21/Lesson 1')
--- 0.0009996891021728516 seconds ---

You might remember a similar situation in GEOG 485 with the Hi-ho Cherry-O example [12] where we simulated 10,000 runs of this children's game to determine the average number of turns it takes. If we printed out the results, the code took a minute or more to run. If we skipped all but the final print statement the code ran in less than a second.

We’ll revisit that Cherry-O example as we experiment with moving code from the single processor paradigm to multiprocessor. We’ll start with it as a simple, non arcpy example and then move on to two arcpy examples – one raster (our raster calculation example from before) and one vector.

Here’s our original Cherry-O code. (If you did not take GEOG485 and don't know the game, you may want to have a quick look at the description from GEOG485 [12]).

# Simulates 10K game of Hi Ho! Cherry-O  
# Setup _very_ simple timing.  
import time 
 
start_time = time.time() 
import random 
 
spinnerChoices = [-1, -2, -3, -4, 2, 2, 10] 
turns = 0 
totalTurns = 0 
cherriesOnTree = 10 
games = 0 
 
while games < 10001: 
    # Take a turn as long as you have more than 0 cherries  
    cherriesOnTree = 10 
    turns = 0 
    while cherriesOnTree > 0: 
        # Spin the spinner  
        spinIndex = random.randrange(0, 7) 
        spinResult = spinnerChoices[spinIndex] 
        # Print the spin result      
        # print ("You spun " + str(spinResult) + ".")     
        # Add or remove cherries based on the result  
        cherriesOnTree += spinResult 
 
        # Make sure the number of cherries is between 0 and 10     
        if cherriesOnTree > 10: 
            cherriesOnTree = 10 
        elif cherriesOnTree < 0: 
            cherriesOnTree = 0 
            # Print the number of cherries on the tree         
        # print ("You have " + str(cherriesOnTree) + " cherries on your tree.")      
        turns += 1 
    # Print the number of turns it took to win the game  
    # print ("It took you " + str(turns) + " turns to win the game.")  
    games += 1 
    totalTurns += turns 
print("totalTurns " + str(float(totalTurns) / games)) 
# lastline = raw_input(">")  
# Output how long the process took.  
print("--- %s seconds ---" % (time.time() - start_time))

We've added in our very simple timing from earlier and this example runs for me in about 1/3 of a second (without the intermediate print functions). That is reasonably fast and you might think we won't see a significant improvement from modifying the code to use multiprocessor mode but let's experiment.

The Cherry-O task is a good example of a CPU bound task; we’re limited only by the calculation speed of our random numbers, as there is no I/O being performed. It is also an embarrassingly parallel task as none of the 10,000 runs of the game are dependent on each other. All we need to know is the average number of turns; there is no need to share any other information. Our logic here could be to have a function (Cherry-O) which plays the game and returns to our calling function the number of turns. We can add that value returned to a variable in the calling function and when we’re done divide by the number of games (e.g. 10,000) and we’ll have our average.

Lesson content developed by Jan Wallgrun and James O’Brien

Converting from sequential to multiprocessing

So with that in mind, let us examine how we can convert a simple program like a programmatic version of the game Hi Ho Cherry-O from sequential to multiprocessing.

You can download the Hi Ho Cherry-O script [13].

There are a couple of basic steps we need to add to our code in order to support multiprocessing. The first is that our code needs to import multiprocessing which is a Python library which as you will have guessed from the name enables multiprocessing support. We’ll add that as the first line of our code.

The second thing our code needs to have is a __main__ method defined. We’ll add that into our code at the very bottom with:

if __name__ == '__main__': 
    play_a_game()

With this, we make sure that the code in the body of the if-statement is only executed for the main process we start by running our script file in Python, not the subprocesses we will create when using multiprocessing, which also are loading this file. Otherwise, this would result in an infinite creation of subprocesses, subsubprocesses, and so on. Next, we need to have that play_a_game() function we are calling defined. This is the function that will set up our pool of processors and also assign (map) each of our tasks onto a worker (usually a processor) in that pool.

Our play_a_game() function is very simple. It has two main lines of code based on the multiprocessing module:

The first instantiates a pool with a number of workers (usually our number of processors or a number slightly less than our number of processors). There’s a function to determine how many processors we have, multiprocessing.cpu_count(), so that our code can take full advantage of whichever machine it is running on. That first line is:

with multiprocessing.Pool(multiprocessing.cpu_count()) as myPool:
   ... # code for setting up the pool of jobs

You have probably already seen this notation from working with arcpy cursors. This with ... as ... statement creates an object of the Pool class defined in the multiprocessing module and assigns it to variable myPool. The parameter given to it is the number of processors on my machine (which is the value that multiprocessing.cpu_count() is returning), so here we are making sure that all processor cores will be used. All code that uses the variable myPool (e.g., for setting up the pool of multiprocessing jobs) now needs to be indented relative to the "with" and the construct makes sure that everything is cleaned up afterwards. The same could be achieved with the following lines of code:

myPool = multiprocessing.Pool(multiprocessing.cpu_count())
... # code for setting up the pool of jobs
myPool.close()
myPool.join()

Here the Pool variable is created without the with ... as ... statement. As a result, the statements in the last two lines are needed for telling Python that we are done adding jobs to the pool and for cleaning up all sub-processes when we are done to free up resources. We prefer to use the version using the with ... as ... construct in this course.

The next line that we need in our code after the with ... as ... line is for adding tasks (also called jobs) to that pool:

    res = myPool.map(hi_ho_cherry_o, range(10000))

What we have here is the name of another function, hi_ho_cherry_o(), which is going to be doing the work of running a single game and returning the number of turns as the result. The second parameter given to map() contains the parameters that should be given to the calls of thecherryO() function as a simple list. So this is how we are passing data to process to the worker function in a multiprocessing application. In this case, the worker function hi_ho_cherry_o() does not really need any input data to work with. What we are providing is simply the number of the game this call of the function is for, so we use the range from 0-9,999 for this. That means we will have to introduce a parameter into the definiton of the hi_ho_cherry_o() function for playing a single game. While the function will not make any use of this parameter, the number of elements in the list (10000 in this case) will determine how many times hi_ho_cherry_o() will be run in our multiprocessing pool and, hence, how many games will be played to determine the average number of turns. In the final version, we will replace the hard-coded number by an argument called numGames. Later in this part of the lesson, we will show you how you can use a different function called starmap(...) instead of map(...) that works for worker functions that do take more than one argument so that we can pass different parameters to it.

Python will now run the pool of calls of the hi_ho_cherry_o() worker function by distributing them over the number of cores that we provided when creating the Pool object. The returned results, so the number of turns for each game played, will be collected in a single list and we store this list in variable res. We’ll average those turns per game to get an average using the Python library statistics and the function mean().

To prepare for the multiprocessing version, we’ll take our Cherry-O code from before and make a couple of small changes. We’ll define function hi_ho_cherry_o() around this code (taking the game number as parameter as explained above) and we’ll remove the while loop that currently executes the code 10,000 times (our map range above will take care of that) and we’ll therefore need to “dedent“ the code.

Here’s what our revised function will look like :

def hi_ho_cherry_o(game): 
    spinnerChoices = [-1, -2, -3, -4, 2, 2, 10] 
    turns = 0 
    cherriesOnTree = 10 
 
    # Take a turn as long as you have more than 0 cherries  
    while cherriesOnTree > 0: 
        # Spin the spinner  
        spinIndex = random.randrange(0, 7) 
        spinResult = spinnerChoices[spinIndex] 
        # Print the spin result      
        # print ("You spun " + str(spinResult) + ".")  
        # Add or remove cherries based on the result  
        cherriesOnTree += spinResult 
        # Make sure the number of cherries is between 0 and 10     
        if cherriesOnTree > 10: 
            cherriesOnTree = 10 
        elif cherriesOnTree < 0: 
            cherriesOnTree = 0 
            # Print the number of cherries on the tree         
        # print ("You have " + str(cherriesOnTree) + " cherries on your tree.")  
        turns += 1 
    # return the number of turns it took to win the game  
    return turns

Now lets put it all together. We’ve made a couple of other changes to our code to define a variable at the very top called numGames = 10000 to define the size of our range.

import random
import multiprocessing
import statistics
import time

def hi_ho_cherry_o(game):
    spinnerChoices = [-1, -2, -3, -4, 2, 2, 10]
    turns = 0
    cherriesOnTree = 10

    # Take a turn as long as you have more than 0 cherries
    while cherriesOnTree > 0:
        # Spin the spinner
        spinIndex = random.randrange(0, 7)
        spinResult = spinnerChoices[spinIndex]
        # Print the spin result
        # print ("You spun " + str(spinResult) + ".")
        # Add or remove cherries based on the result
        cherriesOnTree += spinResult
        # Make sure the number of cherries is between 0 and 10
        if cherriesOnTree > 10:
            cherriesOnTree = 10
        elif cherriesOnTree < 0:
            cherriesOnTree = 0
            # Print the number of cherries on the tree
        # print ("You have " + str(cherriesOnTree) + " cherries on your tree.")
        turns += 1
        # return the number of turns it took to win the game
    return turns


def play_a_game(numGames):
    with multiprocessing.Pool(multiprocessing.cpu_count()) as myPool:
        ## The Map function part of the MapReduce is on the right of the = and the Reduce part on the left where we are aggregating the results to a list.
        turns = myPool.map(hi_ho_cherry_o, range(numGames))
        # Uncomment this line to print out the list of total turns (but note this will slow down your code's execution)
        # print(turns)
    # Use the statistics library function mean() to calculate the mean of turns
    print(f'Average turns for {len(turns)} games is {statistics.mean(turns)}')


if __name__ == '__main__':
    start_time = time.time()
    play_a_game(10000)
    # Output how long the process took.
    print(f" Process took {time.time() - start_time} seconds")

You will also see that we have the list of results returned on the left side of the = before our map function (~line 35). We’re taking all of the returned results and putting them into a list called turns (feel free to add a print or type statement here to check that it's a list). Once all of the workers have finished playing the games, we will use the Python library statistics function mean, which we imported at the very top of our code (right after multiprocessing) to calculate the mean of our list in variable turns. The call to mean() will act as our reduce as it takes our list and returns the single value that we're really interested in.

When you have finished writing the code in PyScripter, you can run it.

Lesson content developed by Jan Wallgrun and James O’Brien

Arcpy multiprocessing examples

Now that we have completed a non-ArcGIS parallel processing exercise, let's look at a couple of examples using ArcGIS functions. There are several caveats or gotchas to using multiprocessing with ArcGIS and it is important to cover them up-front because they affect the ways in which we can write our code.

Esri describes several best practices for multiprocessing with arcpy. These include:

Use “memory“ workspaces to store temporary results because as noted earlier memory is faster than disk.
Avoid writing to file geodatabase (FGDB) data types and GRID raster data types. These data formats can often cause schema locking or synchronization issues. That is because file geodatabases and GRID raster types do not support concurrent writing – that is, only one process can write to them at a time. You might have seen a version of this problem in arcpy previously if you tried to modify a feature class in Python that was open in ArcGIS. That problem is magnified if you have an FGDB and you’re trying to write many feature classes to it at once. Even if all of the feature classes are independent you can only write them to the FGDB one at a time.
Use 64-bit. This isn’t an issue if we are writing code in ArcGIS Pro (although Esri does recommend using a version of Pro greater than 1.4) because we are already using 64-bit, but if you were planning on using Desktop as well, then you would need to use ArcGIS Server 10.5 (or greater) or ArcGIS Desktop with Background Geoprocessing (64-bit). The reason for this is that as we previously noted 64-bit processes can access significantly more memory and using 64-bit might help resolve any large data issues that don’t fit within the 32-bit memory limits of 4GB.

So bearing the top two points in mind we should make use of memory workspaces wherever possible and we should avoid writing to FGDBs (in our worker functions at least – but we could use them in our master function to merge a number of shapefiles or even individual FGDBs back into a single source).

Since we work with other packages such as arcpy, it is important to note that Classes within arcpy such as the Featureclass, Layer, Table, Raster, etc.,. cannot be returned from the worker threads without creating custom serializers to serialize and deserialize the objects between threads. This is known as Pickling and is the process of converting the object to JSON and back to an object. This method is beyond the scope of this course but built in classes and objects within python can be returned. For our example, we will return a dictionary containing information of the process.

Lesson content developed by Jan Wallgrun and James O’Brien

Multiprocessing with raster data

There are two types of operations with rasters that can easily (and productively) be implemented in parallel: operations that are independent components in a workflow, and raster operations which are local, focal or zonal – that is they work on a small portion of a raster such as a pixel or a group of pixels.

Esri’s Clinton Dow and Neeraj Rajasekar presented way back at the 2017 User Conference demonstrating multiprocessing with arcpy and they had a number of useful graphics in their slides which demonstrate these two categories of raster operations which we have reproduced here as they're still appropriate and relevant.

An example of an independent workflow would be if we calculate the slope, aspect and some other operations on a raster and then produce a weighted sum or other statistics. Each of the operations is independently performed on our raster up until the final operation which relies on each of them (see the first image below). Therefore, the independent operations can be parallelized and sent to a worker and the final task (which could also be done by a worker) aggregates or summarises the result. Which is what we can see in the second image as each of the tasks is assigned to a worker (even though two of the workers are using a common dataset) and then Worker 4 completes the task. You can probably imagine a more complex version of this task where it is scaled up to process many elevation and land-use rasters to perform many slope, aspect and reclassification calculations with the results being combined at the end.

parallel problem slide see text description below

Figure 1.17 Slide 15 from Parallel Python

Click for a text description

Slide titled: Pleasingly parallel problems. Slide shows serialized execution of a model workflow as worker 1 does 3 steps sequentially which feed into the final worker 1 which does weighted sum leading to an output suitability raster. The 3 original processes completed by work one are: First, elevation raster to slope, then, elevation raster to aspect, and finally, Land use raster to reclassify. The slide then asks at the bottom, why is multiprocessing relevant to geoprocessing workflows?

Credit: Esri GIS Co., permission requested for use

Figure 1.18 Slide 16 from Parallel Python

Click for a text description

Slide titled: Pleasingly parallel problems. Slide shows parallelized execution of a model workflow as three different workers simultaneously feed into a fourth worker which does weighted sum leading to an output suitability raster. Worker 1 processes elevation raster to slope, worker 2 processes elevation raster to aspect, and worker 3 processes land use raster to reclassify. The slide then asks at the bottom, why is multiprocessing relevant to geoprocessing workflows?

Credit: Esri GIS Co., permission requested for use

An example of the second type of raster operation is a case where we want to make a mathematical calculation on every pixel in a raster such as squaring or taking the square root. Each pixel in a raster is independent of its neighbors in this operation so we could have multiple workers processing multiple tiles in the raster and the result is written to a new raster. In this example, instead of having a single core serially performing a square root calculation across a raster (the first image below) we can segment our raster into a number of tiles, assign each tile to a worker and then perform the square root operation for each pixel in the tile outputting the result to a single raster which is shown in the second image below.

Figure 1.19 Slide 19 from Parallel Python

Click for a text description

Slide titled: Pleasingly parallel problems. Slide shows tool executed serially on a large input dataset. Starts with large elevation raster leading to worker 1 leading to square root math tool and finally output square root raster. The slide then asks at the bottom, why is multiprocessing relevant to geoprocessing workflows?

Credit: Esri GIS Co., permission requested for use

Figure 1.20 Slide 20 from Parallel Python

Click for a text description

Slide titled: Pleasingly parallel problems. Slide shows tool executed parallelly on a large input dataset. Starts with large elevation raster leading to four different workers identically using the square root math tool and all leading to the same output square root raster. The slide then asks at the bottom, why is multiprocessing relevant to geoprocessing workflows?

Credit: Esri GIS Co., permission requested for use

Bearing in mind the caveats about parallel programming from above and the process that we undertook to convert the Hi Ho Cherry-O program, let's begin.

The DEM that we will be using can be downloaded [14] and the sample code is below that we want to conver it is below.

# This script uses map algebra to find values in an 
#  elevation raster greater than 3500 (meters). 
 
import arcpy 
from arcpy.sa import *
 
# Specify the input raster 
inRaster = arcpy.GetParameterAsText(0)
cutoffElevation = arcpy.GetParameter(1)
outPath = arcpy.env.workspace
 
# Check out the Spatial Analyst extension 
arcpy.CheckOutExtension("Spatial") 
 
# Make a map algebra expression and save the resulting raster 
outRaster = Raster(inRaster) > cutoffElevation 
outRaster.save(outPath+"/foxlake_hi_10")
 
# Check in the Spatial Analyst extension now that you're done 
arcpy.CheckInExtension("Spatial")

Our first task is to identify the parts of our problem that can work in parallel and the parts which we need to run sequentially.

The best place to start with this can be with the pseudocode of the original task. If we have documented our sequential code well, this could be as simple as copying/pasting each line of documentation into a new file and working through the process. We can start with the text description of the problem and build our sequential pseudocode from there and then create the multiprocessing pseudocode. It is very important to correctly and carefully design our multiprocessing solutions to ensure that they are as efficient as possible and that the worker functions have the bare minimum of data that they need to complete the tasks, use memory workspaces, and write as little data back to disk as possible.

Our original task was :

Get a list of raster tiles  
For every tile in the list: 
     Fill the DEM 
     Create a slope raster  
     Calculate a flow direction raster 
     Calculate a flow accumulation raster  
     Convert those stream rasters to polygon or polyline feature classes.

You will notice that I’ve formatted the pseudocode just like Python code with indentations showing which instructions are within the loop.

As this is a simple example we can place all of the functionality within the loop into our worker function as it will be called for every raster. The list of rasters will need to be determined sequentially and we’ll then pass that to our multiprocessing function and let the map element of multiprocessing map each raster onto a worker to perform the tasks. We won’t explicitly be using the reduce part of multiprocessing here as the output will be a featureclass but reduce will probably tidy up after us by deleting temporary files that we don’t need.

Our new pseudocode then will look like :

Get a list of raster tiles  
For every tile in the list: 
    Launch a worker function with the name of a raster

Worker:

Fill the DEM 
     Create a slope raster  
     Calculate a flow direction raster 
     Calculate a flow accumulation raster  
     Convert those stream rasters to polygon or polyline feature classes.

Bear in mind that not all multiprocessing conversions are this simple. We need to remember that user output can be complicated because multiple workers might be attempting to write messages to our screen at once and that can cause those messages to get garbled and confused. A workaround for this problem is to use Python’s logging library which is much better at handling messages than us manually using print statements. We haven't implemented logging in this sample solution for this script but feel free to briefly investigate it to supplement the print and arcpy.AddMessage functions with calls to the logging function. The Python Logging Cookbook [15] has some helpful examples.

As an exercise, attempt to implement the conversion from sequential to multiprocessing. You will probably not get everything right since there are a few details that need to be taken into account such as setting up an individual scratch workspace for each call of the worker function. In addition, to be able to run as a script tool the script needs to be separated into two files with the worker function in its own file. But don't worry about these things, just try to set up the overall structure in the same way as in the Hi Ho Cherry-O multiprocessing version and then place the code from the sequential version of the raster example either in the main function or worker function depending on where you think it needs to go. Then check out the solution linked below.

Click here for one way of implementing the solution [16]

When you run this code, do you notice any performance differences between the sequential and multiprocessor versions?

The sequential version took 96 seconds on the same 4-processor PC we were using in the Cherry-O example, while the multiprocessor version completed in 58 seconds. Again not 4 times faster as we might expect but nearly twice as fast with multiprocessing is a good improvement. For reference, the 32-processor PC from the Cherry-O example processed the sequential code in 110 seconds and the multiprocessing version in 40 seconds. We will look in more detail at the individual lines of code and their performance when we examine code profiling but you might also find it useful to watch the CPU usage tab in Task Manager to see how hard (or not) your PC is working.

Lesson content developed by Jan Wallgrun and James O’Brien

Multiprocessing with vector data

The best practices of multiprocessing that we introduced earlier are even more important when we are working with vector data than they are with raster data. The geodatabase locking issue is likely to become much more of a factor as typically we use more vector data than raster and often geodatabases are used more with feature classes.

The example we’re going to use here involves clipping a feature layer by polygons in another feature layer. A sample use case of this might be if you need to segment one or several infrastructure layers by state or county (or even a smaller subdivision). If I want to provide each state or county with a version of the roads, sewer, water or electricity layers (for example) this would be a helpful script. To test out the code in this section (and also the first homework assignment), you can again use the data from the USA.gdb geodatabase (Section 1.5) we provided. The application then is to clip the data from the roads, cities, or hydrology data sets to the individual state polygons from the States data set in the geodatabase.

To achieve this task, one could run the Clip tool manually in ArcGIS Pro but if there are a lot of polygons in the clip data set, it will be more effective to write a script that performs the task. As each state/county is unrelated to the others, this is an example of an operation that can be run in parallel.

The code example below has been adapted from a code example written by Duncan Hornby at the University of Southampton in the United Kingdom that has been used to demonstrate multiprocessing and also how to create a script tool that supports multiprocessing. We will take advantage of Mr. Hornby’s efforts and make use of his code (with attribution of course) but we have also reorganized and simplified it quite a bit and added some enhancements.

Let us examine the code’s logic and then we’ll dig into the syntax.

The code has two Python files [1]. This is important because when we want to be able to run it as a script tool in ArcGIS, it is required that the worker function for running the individual tasks be defined in its own module file, not in the main script file for the script tool with the multiprocessing code that calls the worker function. The first file called scripttool.py imports arcpy, multiprocessing, and the worker code contained in the second python file called multicode.py and it contains the definition of the main function mp_handler() responsible for managing the multiprocessing operations similar to the hi_ho_cherry_o multiprocessing version. It uses two script tool parameters, the file containing the polygons to use for clipping (variable clipper) and the file to be clipped (variable tobeclipped). The main function mp_handler() calls the worker(...) function located in the multicode file, passing it the files to be used and other information needed to perform the clipping operation. This will be further explained below . The code for the first file including the main function is shown below.

import arcpy
import multiprocessing as mp
from WorkerScript import clipper

# Input parameters
clipping_fc = arcpy.GetParameterAsText(0) if arcpy.GetParameterAsText(0) else r"C:\489\USA.gdb\States"
data_to_be_clipped = arcpy.GetParameterAsText(1) if arcpy.GetParameterAsText(1) else r"C:\489\USA.gdb\Roads"

def mp_handler():
    try:
        # Create a list of object IDs for clipper polygons
        arcpy.AddMessage("Creating Polygon OID list...")
        clipperDescObj = arcpy.Describe(clipping_fc)
        field = clipperDescObj.OIDFieldName

        # Create the idList by list comprehension and SearchCursor
        idList = [row[0] for row in arcpy.da.SearchCursor(clipping_fc, [field])]

        arcpy.AddMessage(f"There are {len(idList)} object IDs (polygons) to process.")

        # Create a task list with parameter tuples for each call of the worker function. Tuples consist of the clippper, tobeclipped, field, and oid values.
        # adds tuples of the parameters that need to be given to the worker function to the jobs list
        # using list comprehension
        jobs = [(clipping_fc, data_to_be_clipped, field, id) for id in idList]

        arcpy.AddMessage(f"Job list has {len(jobs)} elements.\n Sending to pool")

        cpuNum = mp.cpu_count()  # determine number of cores to use
        arcpy.AddMessage(f"There are: {cpuNum} cpu cores on this machine")

        # Create and run multiprocessing pool.
        with mp.Pool(processes=cpuNum) as pool:  # Create the pool object
            # run jobs in job list; res is a list with return dictionary values from the worker function
            res = pool.starmap(clipper, jobs)

        # After the threads are complete, iterate over the results and check for errors.
        for r in res:
            if r['errorMsg'] != None:
                arcpy.AddError(f'Task {r["name"]} Failed with: {r["errorMsg"]}')

        arcpy.AddMessage("Finished multiprocessing!")

    except Exception as ex:
        arcpy.AddError(ex)


if __name__ == '__main__':
    mp_handler()

Let's now have a close look at the logic of the two main functions which will do the work. The first one is the mp_handler() function shown in the code section above. It takes the input variables and has the job of processing the polygons in the clipping file to get a list of their unique IDs, building a job list of parameter tuples that will be given to the individual calls of the worker function, setting up the multiprocessing pool and running it, and taking care of error handling.

The second function is the worker function called by the pool (named worker in this example) located in the WorkerScript.py file (code shown below). This function takes the name of the clipping feature layer, the name of the layer to be clipped, the name of the field that contains the unique IDs of the polygons in the clipping feature layer, and the feature ID identifying the particular polygon to use for the clipping as parameters. This function will be called from the pool constructed in mp_handler().

The worker function will then make a selection from the clipping layer. This has to happen in the worker function because all parameters given to that function in a multiprocessing scenario need to be of a simple type that can be "pickled." Pickling data [17] means converting it to a byte-stream which in the simplest terms means that the data is converted to a sequence of simple Python types (string, number etc.). As feature classes are much more complicated than that containing spatial and non-spatial data, they cannot be readily converted to a simple type. That means feature classes cannot be "pickled" and any selections that might have been made in the calling function are not shared with the worker functions. Therefore, we need to think about creative ways of getting our data shared with our sub-processes. In this case, that means we’re not going to do the selection in the master module and pass the polygon to the worker module. Instead, we’re going to create a list of feature IDs that we want to process and we’ll pass an ID from that list as a parameter with each call of the worker function that can then do the selection with that ID on its own before performing the clipping operation. For this, the worker function selects the polygon matching the OID field parameter when creating a layer with MakeFeatureLayer_management() and uses this selection to clip the feature layer to be clipped. The results are saved in a shapefile including the OID in the file's name to ensure that the names are unique.

def clipper(clipper, tobeclipped, field, oid):
    """
       This is the function that gets called and does the work of clipping the input feature class to one of the
       polygons from the clipper feature class. Note that this function does not try to write to arcpy.AddMessage() as
       nothing is ever displayed.
       param: clipper
       param: tobeclipped
       param: field
       param: oid
    """
    # Create result dictionary that will be exclusive to the thread if ran in parallel.
    result_dict = {'name': None, 'errorMsg': None}
    try:
        # Set the name of the clipped to the result dictionary name
        result_dict['name'] = tobeclipped

        # Create a layer with only the polygon with ID oid. Each clipper layer needs a unique name, so we include oid in the layer name.
        query = f"{field} = {oid}"
        tmp_flayer = arcpy.MakeFeatureLayer_management(clipper, f"clipper_{oid}", query)

        # Do the clip. We include the oid in the name of the output feature class.
        outFC = fr"C:\NGA\Lesson 1 Data\output\clip_{oid}.shp"
        arcpy.Clip_analysis(tobeclipped, tmp_flayer, outFC)

        print(f"finished clipping: {oid}")
        return result_dict  # everything went well so we return the dictionary

    except Exception as ex:
        result_dict['errorMsg'] = ex
        # Some error occurred so return the exception thrown.
        print(f"error condition: {ex}")
        return result_dict

Having covered the logic of the code, let's review the specific syntax used to make it all work. While you’re reading this, try visualizing how this code might run sequentially first – that is one polygon being used to clip the to-be-clipped feature class, then another polygon being used to clip the to-be-clipped feature class and so on (maybe through 4 or 5 iterations). Then once you have an understanding of how the code is running sequentially try to visualize how it might run in parallel with the worker function being called 4 times simultaneously and each worker performing its task independently of the other workers.

We’ll start with exploring the syntax within the mp_handler(...) function.

The mp_handler(...) function begins by determining the name of the field that contains the unique IDs of the clipper feature class using the arcpy.Describe(...) function (line 13). The code then uses a Search Cursor to get a list of all of the object (feature) IDs from within the clipper polygon feature class (line 17). This gives us a list of IDs that we can pass to our worker function along with the other parameters. As a check, the length of that list is printed out (line 26).

Next, we create the job list with one entry for each call of the clipper() function we want to make (line 24). Each element in this list is a tuple of the parameters that should be given to that particular call of clipper(). This list will be required when we set up the pool by calling pool.starmap(...). To construct the list, we simply loop through the ID list and append a parameter tuple to the list in variable jobs. The first three parameters will always be the same for all tuples in the job list; only the polygon ID will be different. In the homework assignment for this lesson, you will adapt this code to work with multiple input files to be clipped. As a result, the parameter tuples will vary in both the values for the oid parameter and for the tobeclipped parameter.

To prepare the multiprocessing pool, we start it using the with statement. The code then sets up the size of the pool using the maximum number of processors in line 28 (as we have done in previous examples) and then, using the starmap() method of Pool, calls the worker function clipper(...) once for each parameter tuple in the jobs list (line 34).

Any outputs from the worker function will be stored in a list of results dictionaries. These are values returned by the clipper() function. The results of each process is iterated over and checked for the Exceptions by checking if the r[‘errorMsg’] key holds a value other than None.

Let's now look at the code in our worker function clipper(...). As we noted in the logic section above, it receives four parameters: the full paths of the clipping and to-be-clipped feature classes, the name of the field that contains the unique IDs in the clipper feature class, and the OID of the polygon it is to use for the clipping.

We create a results dictionary to help with returning valuable information back to the main processes. The 'name' key will be set to the tobeclipped parameter in line 15. The errorMsg is set to None to indicate that everything went ok as default (line 12 of the clipper function) and would be set to an Exception message to indicate that the operation failed (line 29). In the main function, the results are iterated over to print out any error messages that were encountered during the clipping process.

Notice that the MakeFeatureLayer_management(...) function in line 19 is used to create an memory layer which is a copy of the original clipper layer. This use of the memory layer is important in three ways: The first is performance – memory layers are faster; second, the use of an memory layer can help prevent any chance of file locking (although not if we were writing back to the file); third, selection only works on layers so even if we wanted to, we couldn’t get away without creating this layer.

The call of MakeFeatureLayer_management(...) also includes an SQL query string defined one line earlier in line 11 to create the layer with just the polygon that matches the oid that was passed as a parameter. The name of the layer we are producing here should be unique; this is why we’re adding str(oid) to the name in the first parameter.

Now with our selection held in our memory, uniquely named feature layer, we perform the clip against our to-be-clipped layer (line 16) and store the result in outFC which we define in line 15 to be a hardcoded folder with a unique name starting with "clip_" followed by the oid. To run the code, you will most likely have to adapt the path used in variable outFC.

The process then returns from the worker function and will be supplied with another oid. This will repeat until a call has been made for each polygon in the clipping feature class.

We are going to use this code as the basis for our Lesson 1 homework project. Have a look at the Assignment Page for full details.

You can test this code out by running it in a number of ways. If you run it from ArcGIS Pro as a script tool, you will have to swap the hashmarks for the clipper and tobeclipped input variables so that GetParameterAsText() is called instead of using hardcoded paths and file names. Be sure to set your parameter type for both parameters to Feature Class. If you make changes to the code and have problems with the changes to the code not being reflected in Pro, delete your Script tool from the Toolbox, restart Pro and re-add the script tool.

You can run your code from Pyscripter as a stand alone script. Make sure you're running the scripttool.py in PyScripter (not the multicode.py). You can also run your code from the Command Prompt which is the fastest way with the smallest resource overhead.

The final thing to remember about this code is that it has a hardcoded output path defined in variable outFC in the worker() function - which you will want to change, create and/or parameterize etc. so that you have some output to investigate. If you do none of these things then no output will be created.

When the code runs it will create a shapefile for every unique object identifier in the "clipper" shapefile (there are 51 in the States data set from the sample data) named using the OID (that is clip_1.shp - clip_59.shp).

Lesson content developed by Jan Wallgrun and James O’Brien

Lesson 1 Assignment

We are going to use the arcpy vector data processing code from the Multiprocessing section in this lesson. Download Lesson1_Assignment_initial_code.zip [1] as the basis for our Lesson 1 programming project. The code is already in multiprocessing mode, so you will not have to write multiprocessing code on your own from scratch but you still will need a good understanding of how the script works. If you are unclear about anything the script does, please ask on the course forums. This part of the assignment will be for getting back into the rhythm of writing arcpy based Python code and practice creating a multiprocessing script.

With the data from Multiprocessing with vector data section, (also here [18]), your task is to extend our vector data clipping script by doing the following:

Write-up

Produce a 400-word write-up on how the assignment went for you; reflect on and briefly discuss the issues and challenges you encountered and what you learned from the assignment.

Deliverable

Submit a single .zip file to the corresponding drop box on Canvas; the zip file should contain:

Your modified code files. Please organize the files cleanly.
Your 400-word write-up.