Lesson 1 is two weeks in length. The goal is to get back into Python programming with arcpy, in particular doing so under ArcGIS Pro, and to learn about the concepts of parallel programming and multiprocessing and how they can be used in Python to speed up time-consuming computations. In addition, we will discuss some important general programming topics such as debugging code, complemented by a discussion of profiling code to detect bottlenecks, version control systems like Git and hosting services like GitHub, and different integrated development environments (IDEs) available for Python. The IDE we are going to start with in this class is called Spyder, but one part of the first homework assignment will be to try out another IDE and present it in a short video.
Some sections in this lesson related to 64-bit processing for ArcGIS Desktop and code profiling are optional so that you can decide for yourself how deep you want to dive into the respective topic. The lessons in this course contain quite a lot of content, so feel absolutely free to skip these optional sections; you can always come back to check them out later or after the end of the class.
Please refer to the Calendar for specific time frames and due dates. To finish this lesson, you must complete the activities listed below. You may find it useful to print this page out first so that you can follow along with the directions.
Step | Activity | Access/Directions |
---|---|---|
1 | Engage with Lesson 1 Content | Begin with 1.2 Differences between Python 2 and Python 3 |
2 | Video Presentation of IDE Research | Choose the IDE you wish to research (in the IDE Investigation: Choose Topic discussion forum) and submit a video demonstration and discussion (to both the Assignment Dropbox and the Media Gallery). When picking your IDE, please take into account that we would like to see all the IDEs presented by at least one student. |
3 | Programming Assignment and Reflection | Submit your modified code versions and ArcGIS Pro toolboxes along with a short report (400 words) including your profiling results and a reflection on what you learned and/or what you found challenging. |
4 | Quiz 1 | Complete the Lesson 1 Quiz. |
5 | Questions/Comments | Remember to visit the Lesson 1 Discussion Forum to post/answer any questions or comments pertaining to Lesson 1. |
All downloads and full instructions are available and included in the Lesson 1 course material. The list below is for those who want to frontload downloading items.
Optional software: The following software will only be required if you decide to follow along with the steps described in optional sections included in this lesson that contain complementary materials. We recommend that you do not install this software now, but wait until you are sure that you want to test it out.
These Python modules are likewise only needed for the optional materials in this lesson, so the same advice applies: we recommend that you do not install them now, but wait until you are sure that you want to test them out.
To install pyprof2calltree (Section 1.7.2.2), open the Python command prompt in Administrator mode and type in:
scripts\pip install pyprof2calltree
If you see a message about upgrading pip, you can ignore it.
To install line_profiler (Section 1.7.2.4), open the Python command prompt in Administrator mode and type in:
scripts\pip install line_profiler
If you receive an error that "Microsoft Visual C++ 14.0 is required", visit Microsoft's Visual Studio Downloads page [3] and download the package for "Visual Studio Community 2017" which will download Visual Studio Installer. Run Visual Studio Installer, and under the "Workloads" tab, you will select two components to install. Under Windows, check the box in the upper right-hand corner of "Desktop Development with C++," and under Web & Cloud, check the box for "Python development". After checking the boxes, click "Install" in the lower right-hand corner. After installing those, open the Python command prompt again and enter:
scripts\pip install misaka
If that works, then install the line_profiler with...
scripts\pip install line_profiler
You should see a message saying "Successfully installed line-profiler-2.1.2", although your version number may be different and that's okay.
If you have taken GEOG 485 before we changed it from ArcGIS Desktop to Pro or you learned about Python programming and customization of ArcGIS Desktop in some other context, then you have been working with arcpy under Python version 2. ArcPy is also available in ArcGIS Pro, but everything here runs under Python version 3. This is why we start this first lesson of the course with an overview on the differences between Python 2 and Python 3.
Python 3.0 was released in 2008 and the final major version of Python 2, version 2.7, was released in mid-2010, so they have both been around for a long time. Python 2 officially reached its end of life in January 2020, with all development attention now focused on Python 3.
While a lot of the changes from Python 2 to Python 3 were in the background or with special features that we won’t need in this course, there are a few changes that you are somewhat likely to encounter and that we, therefore, list on the following pages.
In addition, while many things in the ArcGIS Pro version of arcpy still work in the same way as in the version you might know from ArcGIS Desktop, there are some differences, in particular with respect to the availability of certain modules and tools. This won’t be a concern in this first lesson, but we will examine the differences in the following section.
You might wonder why we're talking about old versions of Python and ArcGIS Desktop, and it's a fair question. At some point, you may want to run some pre-existing Python 2 code in ArcGIS Pro for any number of reasons: updating legacy tools, converting old code, using new ArcGIS Pro tools, sharing with ArcGIS Pro users, using Python 3 libraries, importing a legacy MXD from ArcGIS Desktop, or perhaps taking advantage of better performance, which may come from using either multiprocessing or 64-bit processing (although these are also available in Python 2, and you can experiment with them later in the lesson). There's a reasonable chance you're going to be exposed to this older Python 2 code at some point, and when that happens, we want you to know how to update it easily.
There are several differences between Python 2 and 3, but the most obvious one is that the print statement from Python 2 is now a function in 3. You might recall from GEOG 485 that a function takes parameters (and sometimes returns a value).
Have a look at the example below for a simple illustration of this change. You’ll notice that for Python 2 both will work. Most existing Python 2 code, though, uses the standard print statement rather than the form that appears to call print as a function.
#Python 2
print "Hello World"
print ("Hello World")
#Python 3
print ("Hello World")
All of the more complicated things that we can do with a print statement such as adding in variables or using the .format statement can be implemented just the same. If you're unfamiliar with .format we will look at it in more detail soon [4].
#Python 2
Name = "James"
print "Hello World. My name is " + Name
print "Hello World. My name is {0}".format(Name)
#Python 3
Name = "James"
print ("Hello World. My name is " + Name)
print ("Hello World. My name is {0}".format(Name))
You can experiment with creating more complicated versions of those print statements or functions depending on which version of Python you’re using. In the class, if we’re describing print, we will be using the terms statement and function interchangeably – be sure to adjust your code according to the version of Python you’re using. Be aware that you can use the Python 3 version in Python 2.7 and many programmers have transitioned to using this approach over time (but you'll potentially still see print used as per our Python 2 examples above).
For the technically minded: I mentioned above that in Python 2 print appears to work as a function, but what is actually happening is that the print statement is applied to the parenthesized expression (“Hello World”), which evaluates to the same value as “Hello World” in Python.
In Python 2, the result of a division between two integer numbers was again an integer number, namely the result with everything after the decimal point truncated. So for instance, the result of the expression
1 / 2
is the integer number 0. If you wanted to have the result as a floating point number you had to use something like 1 / float(2) or 1 / 2.0 to first turn one of the operands into a float. This behavior has changed in Python 3. The result is now a floating point number, so 0.5 in this case.
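As a quick sketch of the difference: Python 3 also offers the floor division operator //, which reproduces the old truncating behavior when you actually want it.

```python
# Python 3: / always performs true (floating point) division
print(1 / 2)         # 0.5

# // is floor division and gives the old Python 2 behavior of 1 / 2
print(1 // 2)        # 0
print(7 // 2)        # 3

# the explicit conversion mentioned above still works, of course
print(1 / float(2))  # 0.5
```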
Python 2 had two different string data types: str for ASCII strings (providing only a very limited set of different characters but requiring just one byte per character) and unicode for Unicode strings, allowing for a much larger set of characters and supporting writing systems that are not Latin-based, such as Chinese. To create a Unicode string you had to use the prefix u like this: u'some string'. You might remember this u appearing in front of some output in Python 2, for instance from arcpy functions such as ListFeatureClasses().
In Python 3, everything within quotes is considered a Unicode string (and source files are assumed to be UTF-8 encoded by default), so you can write
print('Saying hello in Chinese: 你好')
As in Python 2, Unicode characters can be written with a \u followed by their 4-digit hexadecimal code (or \U followed by 8 digits) if you have no other way of entering the characters into the code. So the previous command can also be written as:
print('Saying hello in Chinese: \u4f60\u597d')
In case you have not heard much about the Unicode standard for encoding characters, here is a good article explaining what this is all about: A Beginner-Friendly Guide to Unicode [5].
With the change to Python 3, the modules from the standard library have been reorganized. As a result, the import statements used in a Python 2 script may not work anymore because the names of modules have changed, etc. For instance, in Python 2 the standard library contains the modules urllib and urllib2 for working with URLs and accessing content on the web. In Python 3, the functionality has been reorganized into three submodules called urllib.parse, urllib.request, and urllib.error. There are more examples like this and also examples of individual functions or classes that have a changed name or have been removed entirely.
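As a small illustration of this reorganization (the URL is just a made-up example), parsing a URL in Python 3 uses the urllib.parse submodule, whereas in Python 2 the urlparse function lived in its own urlparse module (and urllib2.urlopen became urllib.request.urlopen):

```python
from urllib.parse import urlparse   # Python 2: from urlparse import urlparse

# split a URL into its components without accessing the network
parts = urlparse('https://www.example.com/data/index.html?year=2020')
print(parts.netloc)   # www.example.com
print(parts.path)     # /data/index.html
print(parts.query)    # year=2020
```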
If you want to dive deeper into this topic, have a look at the page "What's New in Python 3.0" [6] from the Python documentation and this article [7] summarizing the key differences between Python 2 and 3.
There are some differences in the level of functionality available in arcpy in Desktop when compared to Pro. These are documented in the Pro help here [8] and here [9]. Probably the one most likely to trip up those of you with existing scripts is the renaming of arcpy.mapping to arcpy.mp which will require some changes to code (including any function calls to arcpy.mapping.<name> functions).
There is also a list of tools which are not supported in Pro which aren't commonly used but could have specialty tools that rely on them. These tools include Coverage (arcpy.arc), Data Interoperability (arcpy.interop), Parcel Fabric (arcpy.fabric), Schematics (arcpy.schematics), and Tracking Analyst (arcpy.ta). If you're migrating from Desktop to Pro in your professional lives, it might be worthwhile checking that none of your scripts or workflows require these tools.
In addition to these entire toolboxes which are no longer available in Pro, a number of individual tools within toolboxes have either not been implemented or haven't been implemented yet. We won't repeat that long list here, but it may be worth having a quick look over the list [9] (this is the second link from two paragraphs above) to double-check that any of your necessary or favorite tools aren't in the list.
There are also both new and improved tools within Pro that don't exist within Desktop. Some of these take advantage of parallel processing or are written more efficiently and therefore perform better than their legacy (old) versions. When using Pro you'll often see a small tooltip in the top of the Geoprocessing window mentioning that another tool offers improved performance or additional functionality.
To warm up a bit, let’s briefly revisit a few Python features that you are already familiar with but for which there exist some forms or details that you may not yet know, starting with the Python “import” command. We are also going to introduce a few Python constructs that you may not have heard about yet on the way.
It is highly recommended that you try out these examples yourself and experiment with them to get a better understanding. The examples work in both Python 2 and Python 3, so you can use any Python installation and IDE that you have on your computer for this. If you are not sure what to use, you can also look ahead at the part of Section 1.5 [10] about getting a Python 3 IDE for ArcGIS Pro, spyder, up and running and then come back to this section here.
The form of the “import” command that you definitely should already know is
import <module name>
e.g.,
import arcpy
What happens here is that the module (either a module from the standard library, a module that is part of another package you installed, or simply another .py file in your project directory) is loaded, unless it has already been loaded before, and the name of the module becomes part of the namespace of the script that contains the import command. As a result, you can now access all variables, functions, or classes defined in the imported module, by writing
<module name>.<variable or function name>
e.g.,
arcpy.Describe(…)
You can also use the import command like this instead:
import arcpy as ap
This form introduces a new alias for the module name, typically to save some typing when the module name is rather long, and instead of writing
arcpy.Describe(…)
, you would now use
ap.Describe(…)
in your code.
Another approach of using “import” is to directly add content of a module (again either variables, functions, or classes) to the namespace of the importing Python script. This is done by using the form "from … import …" as in the following example:
from arcpy import Describe, Point, …

...

Describe(…)
The difference is that now you can use the imported names directly in our code without having to use the module name (or an alias) as a prefix, as is done in the last line of the example code. However, be aware that if you are importing multiple modules, this can easily lead to name conflicts if, for instance, two modules contain functions with the same name. It can also make your code a little more difficult to read since
arcpy.Describe(...)
helps you or another programmer recognize that you’re using something defined in arcpy and not in another library or the main code of your script.
You can also use
from arcpy import *
to import all variable, function, and class names from a module into the namespace of your script if you don’t want to list all the ones you actually need. However, this can increase the likelihood of a name conflict even further.
Next, let’s quickly revisit loops in Python. There are two kinds of loops in Python, the for-loop and the while-loop. You should know that the for-loop is typically used when the goal is to go through a given set or list of items or do something a certain number of times. In the first case, the for-loop typically looks like this
for item in list:
    # do something with item
while in the second case, the for-loop is often used together with the range(…) function to determine how often the loop body should be executed:
for i in range(50):
    # do something 50 times
In contrast, the while-loop has a condition that is checked before each iteration and if the condition becomes False, the loop is terminated and the code execution continues after the loop body. With this knowledge, it should be pretty clear what the following code example does:
import random

r = random.randrange(100)  # produce random number between 0 and 99
attempts = 1
while r != 11:
    attempts += 1
    r = random.randrange(100)
print('This took ' + str(attempts) + ' attempts')
What you may not yet know is that there are two additional commands, break and continue, that can be used in combination with either a for or a while-loop. The break command will automatically terminate the execution of the current loop and continue with the code after it. If the loop is part of a nested loop only the inner loop will be terminated. This means we can rewrite the program from above using a for-loop rather than a while-loop like this:
import random

attempts = 0
for i in range(1000):
    r = random.randrange(100)
    attempts += 1
    if r == 11:
        break  # terminate loop and continue after it
print('This took ' + str(attempts) + ' attempts')
When the random number produced in the loop body is 11, the body of the if-statement, so the break command, will be executed and the program execution immediately leaves the loop and continues with the print statement after it. Obviously, this version is not completely identical to the while based version from above because the loop will be executed at most 1000 times here.
If you have experience with programming languages other than Python, you may know that some languages have a "do … while" loop construct where the condition is only tested after each time the loop body has been executed so that the loop body is always executed at least once. Since we first need to create a random number before the condition can be tested, this example would actually be a little bit shorter and clearer using a do-while loop. Python does not have a do-while loop but it can be simulated using a combination of while and break:
import random

attempts = 0
while True:
    r = random.randrange(100)
    attempts += 1
    if r == 11:
        break
print('This took ' + str(attempts) + ' attempts')
A while loop with the condition True will in principle run forever. However, since we have the if-statement with the break, the execution will be terminated as soon as the random number generator rolls an 11. While this code is not shorter than the previous while-based version, we are only creating random numbers in one place, so it can be considered a little bit more clear.
When a continue command is encountered within the body of a loop, the current execution of the loop body is also immediately stopped, but in contrast to the break command, the execution then continues with the next iteration of the loop body. Of course, the next iteration is only started if, in the case of a while-loop, the condition is still true, or in the case of a for-loop, there are still remaining items in the list that we are looping through. The following code goes through a list of numbers and prints out only those numbers that are divisible by 3 (without remainder).
l = [3,7,99,54,3,11,123,444]

for n in l:
    if n % 3 != 0:  # test whether n is not divisible by 3 without remainder
        continue
    print(n)
This code uses the modulo operator % to get the remainder of the division of n by 3 in the if-statement. If this remainder is not 0, the continue command is executed and, as a result, the program execution directly jumps back to the beginning of the loop and continues with the next number. If the condition is False (meaning the number is divisible by 3), the execution continues as normal after the if-statement and prints out the number. Hopefully, it is immediately clear that the same could have been achieved by changing the condition from != to == and having an if-block with just the print statement, so this is really just a toy example illustrating how continue works.
As you saw in these few examples, there are often multiple ways in which for, while, break, continue, and if-else can be combined to achieve the same thing. While break and continue can be useful commands, they can also make code more difficult to read and understand. Therefore, they should only be used sparingly and when their usage leads to a simpler and more comprehensible code structure than a combination of for /while and if-else would do.
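To illustrate the earlier point that break only terminates the innermost loop, here is a small sketch with nested for-loops (the numbers are just made-up example data):

```python
rows = [[1, 2, -3, 4], [5, 6], [-7, 8]]
for row in rows:
    for n in row:
        if n < 0:
            break    # leaves only the inner for-loop
        print(n)
    # execution continues here with the next row of the outer loop
```

This prints 1, 2, 5, and 6: each row is abandoned at its first negative number, but the outer loop still moves on to the next row.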
You are already familiar with Python binary operators that can be used to define arbitrarily complex expressions. For instance, you can use arithmetic expressions that evaluate to a number, or boolean expressions that evaluate to either True or False. Here is an example of an arithmetic expression using the arithmetic operators - and *:
x = 25 - 2 * 3
Each binary operator takes two operand values of a particular type (all numbers in this example) and replaces them by a new value calculated from the operands. All Python operators are organized into different precedence classes, determining in which order the operators are applied when the expression is evaluated unless parentheses are used to explicitly change the order of evaluation. This operator precedence table [11] shows the classes from lowest to highest precedence. The operator * for multiplication has a higher precedence than the - operator for subtraction, so the multiplication will be performed first and the result of the overall expression assigned to variable x is 19.
Here is an example for a boolean expression:
x = y > 12 and z == 3
The boolean expression on the right side of the assignment operator contains three binary operators: two comparison operators, > and ==, that take two numbers and return a boolean value, and the logical ‘and’ operator that takes two boolean values and returns a new boolean (True only if both input values are True, False otherwise). The precedence of ‘and’ is lower than that of the two comparison operators, so the ‘and’ will be evaluated last. So if y has the value 6 and z the value 3, the value assigned to variable x by this expression will be False because the comparison on the left side of the ‘and’ evaluates to False.
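You can check this evaluation directly:

```python
y = 6
z = 3
# y > 12 is False, so the whole 'and' expression is False,
# even though z == 3 is True
x = y > 12 and z == 3
print(x)   # False
```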
In addition to all these binary operators, Python has a ternary operator, so an operator that takes three operands as input. This operator has the format
x if c else y
x, y, and c here are the three operands while ‘if’ and ‘else’ are the keywords making up the operator and demarcating the operands. While x and y can be values or expressions of arbitrary type, the condition c needs to be a boolean value or expression. What the operator does is it looks at the condition c and if c is True it evaluates to x, else it evaluates to y. So for example in the following line of code
p = 1 if x > 12 else 0
variable p will be assigned the value 1 if x is larger than 12, else p will be assigned the value 0. Obviously what the ternary if-else operator does is very similar to what we can do with an if or if-else statement. For instance, we could have written the previous code as
p = 0
if x > 12:
    p = 1
The “x if c else y” operator is an example of a language construct that does not add anything principally new to the language but enables writing things more compactly or more elegantly. That’s why such constructs are often called syntactic sugar. The nice thing about “x if c else y” is that in contrast to the if-else statement, it is an operator that evaluates to a value and, hence, can be embedded directly within more complex expressions as in the following example that uses the operator twice:
newValue = 25 + (10 if oldValue < 20 else 44) / 100 + (5 if useOffset else 0)
Using an if-else statement for this expression would have required at least five lines of code.
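Plugging in some sample values (chosen purely for illustration) shows how the two embedded operators are evaluated; note that / performs true division here, as in Python 3:

```python
oldValue = 15
useOffset = True

# oldValue < 20 is True  -> the first if-else operator evaluates to 10
# useOffset is True      -> the second if-else operator evaluates to 5
newValue = 25 + (10 if oldValue < 20 else 44) / 100 + (5 if useOffset else 0)
print(newValue)   # 25 + 0.1 + 5 = 30.1
```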
In GEOG 485, we used the + operator for string concatenation to produce strings from multiple components to then print them out or use them in some other way, as in the following two examples:
print('The feature class contains ' + str(n) + ' point features.')
queryString = '"' + fieldName + '" = ' + "'" + countryName + "'"
An alternative to this approach using string concatenation is to use the string method format(…). When this method is invoked for a particular string, the string content is interpreted as a template in which parts surrounded by curly brackets {…} should be replaced by the variables given as parameters to the method. Here is how the two examples from above would look in this approach:
print('The feature class contains {0} point features.'.format(n))
queryString = '"{0}" = \'{1}\''.format(fieldName, countryName)
In both examples, we have a string literal '….' and then directly call the format(…) method for this string literal to give us a new string in which the occurrences of {…} have been replaced. In the simple form {i} used here, each occurrence of this pattern will be replaced by the i-th parameter given to format(…). In the second example, {0} will be replaced by the value of variable fieldName and {1} will be replaced by variable countryName. Please note that the second example will also use \' to produce the single quotes so that the entire template could be written as a single string. The numbers within the curly brackets can also be omitted if the parameters should be inserted into the string in the order in which they appear.
The main advantages of using format(…) are that the string can be a bit easier to produce and read as in particular in the second example, and that we don’t have to explicitly convert all non-string variables to strings with str(…). In addition, format allows us to include information about how the values of the variables should be formatted. By using {i:n}, we say that the value of the i-th variable should be expanded to n characters if it’s less than that. For strings, this will by default be done by adding spaces after the actual string content, while for numbers, spaces will be added before the actual string representation of the number. In addition, for numbers, we can also specify the number d of decimal digits that should be displayed by using the pattern {i:n.df}. The following example shows how this can be used to produce some well-formatted list output:
items = [('Maple trees', 45.232), ('Pine trees', 30.213), ('Oak trees', 24.331)]

for i in items:
    print('{0:20} {1:3.2f}%'.format(i[0], i[1]))
Output:
Maple trees          45.23%
Pine trees           30.21%
Oak trees            24.33%
The pattern {0:20} is used here to always fill up the names of the tree species with spaces to a width of 20 characters. The pattern {1:3.2f} then displays the percentages with two digits after the decimal point (the 3 specifies a minimum overall width, which these numbers already exceed). As a result, the numbers line up perfectly.
The format method can do a few more things, but we are not going to go into further details here. Check out this page about formatted output [12] if you would like to learn more about this.
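As an aside, Python 3.6 introduced formatted string literals ("f-strings") as an even more compact alternative to format(…): the expressions are embedded directly between the curly brackets. Our two earlier examples would look like this (with made-up example values):

```python
n = 42
fieldName = 'NAME'
countryName = 'Chile'   # example values for illustration

print(f'The feature class contains {n} point features.')
print(f'"{fieldName}" = \'{countryName}\'')
```

All the formatting patterns shown above, such as {1:3.2f}, work inside f-strings as well.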
From GEOG 485 or similar previous experience, you should be familiar with defining simple functions that take a set of input parameters and potentially return some value. When calling such a function from somewhere in your Python code, you have to provide values (or expressions that evaluate to some value) for each of these parameters, and these values are then accessible under the names of the respective parameters in the code that makes up the body of the function.
However, from working with different tool functions provided by arcpy and different functions from the Python standard library, you also already know that functions can also have optional parameters, and you can use the names of such parameters to explicitly provide a value for them when calling the function. In this section, we will show you how to write functions with such keyword arguments and functions that take an arbitrary number of parameters, and we will discuss some more details about passing different kinds of values as parameters to a function.
The parameters we have been using so far, for which we only specify a name in the function definition, are called positional parameters or positional arguments because the value that will be assigned to them when the function is called depends on their position in the parameter list: The first positional parameter will be assigned the first value given within the parentheses (…) when the function is called, and so on. Here is a simple function with two positional parameters, one for providing the last name of a person and one for providing a form of address. The function returns a string to greet the person with.
def greet(lastName, formOfAddress):
    return 'Hello {0} {1}!'.format(formOfAddress, lastName)

print(greet('Smith', 'Mrs.'))
Output: Hello Mrs. Smith!
Note how the first value used in the function call (“Smith”) is assigned to the first positional parameter (lastName) and the second value (“Mrs.”) to the second positional parameter (formOfAddress). Nothing new here so far.
The parameter list of a function definition can also contain one or more so-called keyword arguments. A keyword argument appears in the parameter list as
<argument name> = <default value>
A value for a keyword argument can be provided in the function call by again using the <argument name> = <value> notation; if no value is provided, the default value is used. Here is the greeting example extended with a keyword argument for the language:
def greet(lastName, formOfAddress, language = 'English'):
    greetings = { 'English': 'Hello', 'Spanish': 'Hola' }
    return '{0} {1} {2}!'.format(greetings[language], formOfAddress, lastName)

print(greet('Smith', 'Mrs.'))
print(greet('Rodriguez', 'Sr.', language = 'Spanish'))
Output:
Hello Mrs. Smith!
Hola Sr. Rodriguez!
Compare the two different ways in which the function is called. In the first call, we do not provide a value for the ‘language’ parameter, so the default value ‘English’ is used when looking up the proper greeting in the dictionary stored in variable greetings. In the second call, the value ‘Spanish’ is provided for the keyword argument ‘language,’ so this is used instead of the default value and the person is greeted with “Hola” instead of "Hello." Keyword arguments can also be used like positional arguments, meaning the second call could also have been
print(greet('Rodriguez', 'Sr.', 'Spanish'))
without the “language =” before the value.
Things get more interesting when there are several keyword arguments, so let’s add another one for the time of day:
def greet(lastName, formOfAddress, language = 'English', timeOfDay = 'morning'):
    greetings = { 'English': { 'morning': 'Good morning', 'afternoon': 'Good afternoon' },
                  'Spanish': { 'morning': 'Buenos dias', 'afternoon': 'Buenas tardes' } }
    return '{0}, {1} {2}!'.format(greetings[language][timeOfDay], formOfAddress, lastName)

print(greet('Smith', 'Mrs.'))
print(greet('Rodriguez', 'Sr.', language = 'Spanish', timeOfDay = 'afternoon'))
Output:
Good morning, Mrs. Smith!
Buenas tardes, Sr. Rodriguez!
Since we now have four different forms of greetings depending on two parameters (language and time of day), we now store these in a dictionary in variable greetings that for each key (= language) contains another dictionary for the different times of day. For simplicity, we left it at two times of day, namely “morning” and “afternoon.” In the return statement, we first use the variable language as the key to get the inner dictionary for the given language and then directly follow up with using variable timeOfDay as the key into that inner dictionary.
The two ways we are calling the function in this example are the two extreme cases of (a) providing none of the keyword arguments, in which case the default values are used for both of them, and (b) providing values for both of them. However, we could also just provide a value for the time of day if we want to greet an English person in the afternoon:
print(greet('Rogers', 'Mrs.', timeOfDay = 'afternoon'))
Output: Good afternoon, Mrs. Rogers!
This is an example in which we have to use the prefix “timeOfDay =” because if we leave it out, it will be treated like a positional parameter and used for the parameter ‘language’ instead which will result in an error when looking up the value in the dictionary of languages. For similar reasons, keyword arguments must always come after the positional arguments in the definition of a function and in the call. However, when calling the function, the order of the keyword arguments doesn’t matter, so we can switch the order of ‘language’ and ‘timeOfDay’ in this example:
print(greet('Rodriguez', 'Sr.', timeOfDay = 'afternoon', language = 'Spanish'))
Of course, it is also possible to have function definitions that only use optional keyword arguments in Python.
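For instance, a version of greet in which every parameter is optional could look like this (a minimal sketch; the parameter names and default values here are made up for illustration):

```python
# All parameters are optional keyword arguments with default values
def greet(name='everyone', language='English'):
    greetings = {'English': 'Hello', 'Spanish': 'Hola'}
    return '{0} {1}!'.format(greetings[language], name)

print(greet())                                  # Hello everyone!
print(greet(name='Maria', language='Spanish'))  # Hola Maria!
```

Such a function can be called without any arguments at all, with only some of them, or with all of them, in any order.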
Let us continue with the “greet” example, but let’s modify it to be a bit simpler again with a single parameter for picking the language, and instead of using last name and form of address we just go with first names. However, we now want to be able to not only greet a single person but arbitrarily many persons, like this:
greet('English', 'Jim', 'Michelle')
Output: Hello Jim! Hello Michelle!
greet('Spanish', 'Jim', 'Michelle', 'Sam')
Output: Hola Jim! Hola Michelle! Hola Sam!
To achieve this, the parameter list of the function needs to end with a special parameter that has a * symbol in front of its name. If you look at the code below, you will see that this parameter is treated like a sequence (technically, a tuple) in the body of the function:
def greet(language, *names):
    greetings = { 'English': 'Hello',
                  'Spanish': 'Hola' }

    for n in names:
        print('{0} {1}!'.format(greetings[language], n))
What happens is that all values given to the function from the position of the *-parameter onward will be packed into a tuple and assigned to that parameter. This way you can provide as many arguments as you want in the call, and the function code can iterate through them in a loop. Please note that for this example we changed things so that the function directly prints out the greetings rather than returning a string.
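Incidentally, the * symbol can also be used in the opposite direction when calling such a function: if the names are already stored in a list, prefixing it with * unpacks the list into individual arguments. Here is a small sketch of this (the variable name guests is made up for illustration):

```python
def greet(language, *names):
    greetings = {'English': 'Hello', 'Spanish': 'Hola'}
    for n in names:
        print('{0} {1}!'.format(greetings[language], n))

guests = ['Jim', 'Michelle', 'Sam']
# the * unpacks the list, so this call is equivalent to
# greet('English', 'Jim', 'Michelle', 'Sam')
greet('English', *guests)
```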
We also changed language to a positional parameter because if you want to use keyword arguments in combination with an arbitrary number of parameters, you need to write the function in a different way. You then need to provide another special parameter starting with two stars ** and that parameter will be assigned a dictionary with all the keyword arguments provided when the function is called. Here is how this would look if we make language a keyword parameter again:
def greet(*names, **kwargs):
    greetings = { 'English': 'Hello',
                  'Spanish': 'Hola' }

    language = kwargs['language'] if 'language' in kwargs else 'English'

    # loop through the names and print a greeting for each of them
    for n in names:
        print('{0} {1}!'.format(greetings[language], n))
If we call this function as
greet('Jim', 'Michelle')
the output will be:
Hello Jim! Hello Michelle!
And if we use
greet('Jim', 'Michelle', 'Sam', language = 'Spanish')
we get:
Hola Jim! Hola Michelle! Hola Sam!
Yes, this is getting quite complicated, and it’s possible that you will never have to write functions with both * and ** parameters. Still, here is a little explanation: All non-keyword arguments are again collected in a tuple and assigned to variable names. All keyword arguments are placed in a dictionary using the name appearing before the equal sign as the key, and the dictionary is assigned to variable kwargs. To really make the ‘language’ keyword argument optional, we have added line 5 in which we check whether something is stored under the key ‘language’ in the dictionary (this is an example of using the ternary "... if ... else ..." operator). If yes, we use the stored value and assign it to variable language; else we instead use ‘English’ as the default value. In line 9, language is then used to get the correct greeting from the dictionary in variable greetings while looping through the names in variable names.
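As an aside, this kind of "use the stored value if the key exists, else a default" check can be written more compactly with the dictionary method get(...), which takes a default value as its second argument. This is just an equivalent variant of the function above, not something the example requires:

```python
def greet(*names, **kwargs):
    greetings = {'English': 'Hello', 'Spanish': 'Hola'}
    # get() returns the value stored under 'language',
    # or 'English' if that key is absent from kwargs
    language = kwargs.get('language', 'English')
    for n in names:
        print('{0} {1}!'.format(greetings[language], n))

greet('Jim', 'Michelle')          # Hello Jim! / Hello Michelle!
greet('Sam', language='Spanish')  # Hola Sam!
```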
When making the transition from a beginner to an intermediate or advanced Python programmer, it also becomes important to understand the intricacies of variables used within functions and of passing parameters to functions in detail. First of all, we can distinguish between global and local variables within a Python script. Global variables are defined outside of any function. They can be accessed from anywhere in the script, and they exist and keep their values as long as the script is loaded, which typically means as long as the Python interpreter into which they are loaded is running.
In contrast, local variables are defined inside a function and can only be accessed in the body of that function. Furthermore, when the body of the function has been executed, its local variables will be discarded and cannot be used anymore to access their current values. A local variable is either a parameter of that function, in which case it is assigned a value immediately when the function is called, or it is introduced in the function body by making an assignment to the name for the first time.
Here are a few examples to illustrate the concepts of global and local variables and how to use them in Python.
def doSomething(x):       # parameter x is a local variable of the function

    count = 1000 * x      # local variable count is introduced

    return count


y = 10                    # global variable y is introduced


print(doSomething(y))

print(count)              # this will result in an error
print(x)                  # this will also result in an error
This example introduces one global variable, y, and two local variables, x and count, both part of the function doSomething(…). x is a parameter of the function, while count is introduced in the body of the function in line 3. When this function is called in line 11, the local variable x is created and assigned the value that is currently stored in global variable y, so the integer number 10. Then the body of the function is executed. In line 3, an assignment is made to variable count. Since this variable hasn’t been introduced in the function body before, a new local variable will now be created and assigned the value 10000. After executing the return statement in line 5, both x and count will be discarded. Hence, the two print statements at the end of the code would lead to errors because they try to access variables that do not exist anymore.
Now let’s change the example to the following:
def doSomething():
    count = 1000 * y      # global variable y is accessed here
    return count


y = 10
print(doSomething())
This example shows that global variable y can also be directly accessed from within the function doSomething(): When Python encounters a variable name that is neither the name of a parameter of that function nor has been introduced via an assignment previously in the body of that function, it will look for that variable among the global variables. However, the first version using a parameter instead is usually preferable because then the code in the function doesn’t depend on how you name and use variables outside of it. That makes it much easier to, for instance, re-use the same function in different projects.
So maybe you are wondering whether it is also possible to change the value of a global variable from within a function, not just read its value? One attempt to achieve this could be the following:
def doSomething():
    count = 1000

    # attempt to change the value of the global variable y
    y = 5

    return count * y


y = 10
print(doSomething())
print(y)                  # output will still be 10 here
However, if you run the code, you will see that the last line still produces the output 10, so the global variable y hasn't been changed by the assignment in line 5. That is because of the rule that if a name appears on the left side of an assignment in a function, it is considered a local variable. Since this is the first assignment to y in the body of the function, a new local variable with that name is created at that point that will shadow the global variable with the same name until the end of the function has been reached. Instead, you explicitly have to tell Python that a variable name should be interpreted as the name of a global variable by using the keyword ‘global’, like this:
def doSomething():
    count = 1000

    # tells Python to treat y as the name of a global variable
    global y

    y = 5                 # as a result, global variable y is assigned a new value here

    return count * y


y = 10
print(doSomething())
print(y)                  # output will now be 5 here
In line 5, we are telling Python that y in this function should refer to the global variable y. As a result, the assignment in line 7 changes the value of the global variable called y and the output of the last line will be 5. While it's good to know how these things work in Python, we again want to emphasize that accessing global variables from within functions should be avoided as much as possible. Passing values via parameters and returning values is usually preferable because it keeps different parts of the code as independent of each other as possible.
So after talking about global vs. local variables, what is the issue with mutable vs. immutable mentioned in the heading? There is an important difference in passing values to a function depending on whether the value is from a mutable or immutable data type. All values of primitive data types like numbers and boolean values in Python are immutable, meaning you cannot change any part of them. On the other hand, we have mutable data types like lists and dictionaries for which it is possible to change their parts: You can, for instance, change one of the elements in a list or what is stored under a particular key in a given dictionary without creating a completely new object.
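A quick illustration of this mutability (the variable names here are made up for the example):

```python
numbers = [3, 5, 7]
numbers[0] = 99                 # lists are mutable: the element is changed in place
print(numbers)                  # [99, 5, 7]

capitals = {'France': 'Paris', 'Spain': 'Madrid'}
capitals['Spain'] = 'MADRID'    # dictionaries are mutable, too
print(capitals['Spain'])        # MADRID
```

In both cases, the same object in memory is modified; no new list or dictionary is created.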
What about strings and tuples? You may think these are mutable objects, but they are actually immutable. While you can access a single character from a string or element from a tuple, you will get an error message if you try to change it by using it on the left side of the equal sign in an assignment. Moreover, when you use a string method like replace(…) to replace all occurrences of a character by another one, the method cannot change the string object in memory for which it was called but has to construct a new string object and return that to the caller.
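The following short sketch illustrates this: attempting to assign to a single character raises an error, and replace(...) leaves the original string untouched:

```python
s = 'hello'
try:
    s[0] = 'H'                   # not allowed: strings are immutable
except TypeError:
    print('cannot assign to a character of a string')

t = s.replace('l', 'L')          # replace() builds and returns a NEW string
print(s)                         # hello  (the original is unchanged)
print(t)                         # heLLo
```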
Why is that important to know in the context of writing functions? Because mutable and immutable data types are treated differently when provided as a parameter to functions as shown in the following two examples:
def changeIt(x):

    x = 5                 # this does not change the value assigned to y


y = 3
changeIt(y)
print(y)                  # will print out 3
As we already discussed above, the parameter x is treated as a local variable in the function body. We can think of it as being assigned a copy of the value that variable y contains when the function is called. As a result, the value of the global variable y doesn’t change and the output produced by the last line is 3. But it only works like this for immutable objects, like numbers in this case! Let’s do the same thing for a list:
def changeIt(x):

    x[0] = 5              # this will change the list y refers to


y = [3, 5, 7]
changeIt(y)
print(y)                  # output will be [5, 5, 7]
The output [5,5,7] produced by the print statement in the last line shows that the assignment in line 3 changed the list object that is stored in global variable y. How is that possible? Well, for values of mutable data types like lists, assigning the value to function parameter x cannot be conceived as creating a copy of that value and, as a result, having the value appear twice in memory. Instead, x is set up to refer to the same list object in memory as y. Therefore, any change made with the help of either variable x or y will change the same list object in memory. When variable x is discarded when the function body has been executed, variable y will still refer to that modified list object. Maybe you have already heard the terms “call-by-value” and “call-by-reference” in the context of assigning values to function parameters in other programming languages. What happens for immutable data types in Python works like “call-by-value,” while what happens to mutable data types works like “call-by-reference.” If you feel like learning more about the details of these concepts, check out this article on Parameter Passing [13].
While the reasons behind these different mechanisms are very technical and related to efficiency, this means it is actually possible to write functions that take parameters of mutable type as input and modify their content. This is common practice (in particular for class objects which are also mutable) and not generally considered bad style because it is based on function parameters and the code in the function body does not have to know anything about what happens outside of the function. Nevertheless, often returning a new object as the return value of the function rather than changing a mutable parameter is preferable. This brings us to the last part of this section.
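As a quick sketch of that preferable pattern, the changeIt(...) example from above can be rewritten so that it leaves its parameter untouched and returns a new list instead (the function name changedCopy is made up for this illustration):

```python
def changedCopy(lst):
    result = list(lst)    # make a copy so the caller's list is not modified
    result[0] = 5
    return result

y = [3, 5, 7]
z = changedCopy(y)
print(y)                  # [3, 5, 7]  -- unchanged
print(z)                  # [5, 5, 7]
```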
It happens quite often that you want to hand back different things as the result of a function, for instance four coordinates describing the bounding box of a polygon. But a function can only have one return value. It is common practice in such situations to simply return a tuple with the different components you want to return, so in this case a tuple with the four coordinates. Python has a useful mechanism to help with this by allowing us to assign the elements of a tuple (or other sequences like lists) to several variables in a single assignment. Given a tuple t = (12,3,2,2), instead of writing
top = t[0]
left = t[1]
bottom = t[2]
right = t[3]
you can write
top, left, bottom, right = t
and it will have the exact same effect. The following example illustrates how this can be used with a function that returns a tuple of multiple return values. For simplicity, the function computeBoundingBox() in this example only returns a fixed tuple rather than computing the actual tuple values from a polygon given as input parameter.
def computeBoundingBox():
    return (12, 3, 41, 32)


# assigns the four elements of the returned tuple to individual variables
top, left, bottom, right = computeBoundingBox()

print(top)                # output: 12
This section has been quite theoretical, but you will often encounter the constructs presented here when reading other people’s Python code and also in the rest of this course.
Now that we’re all warmed up with some Python revision and a few clues about the changes between Python 2 and 3, we’ll start getting familiar with Python 3 in ArcGIS Pro by exploring how we write code and deploy tools just like we did when we started out in GEOG 485.
We’ll cover the conda environment that ArcGIS Pro uses for Python 3 in more detail in Lesson 2, but for now it might be helpful to think of conda as a box or container that Python 3 and all of its parts sit inside. In order to access Python 3, we’ll need to open the conda box, and to do that we will need a command prompt with administrator privileges.
Spyder is the easiest IDE to install for Python 3 development as we can install it from ArcGIS Pro. Within Pro, you can navigate to the "Project" menu and then choose "Python" to access the Python package and environment manager of the ArcGIS Pro installation.
Since version 2.3 of ArcGIS Pro, it is no longer possible to modify the default Python environment (see here [14] for details). If you already have a working Pro + Spyder setup (e.g., from GEOG 485) and it is at least Pro version 2.7, you can keep using this installation for this class. Otherwise, I'd recommend you work with the newest version, so you will first have to create a clone of Pro's default Python environment and make it the active environment of ArcGIS before installing Spyder. In the past, students sometimes had problems with the cloning operation that we were able to solve by running Pro in admin mode.
Therefore, we recommend that before performing the following steps, you exit Pro and restart it in admin mode by doing a right-click -> Run as administrator. Then go back to "Project" -> "Python", click on "Manage Environments", and then click on "Clone Default" in the Manage Environments dialog that opens up. Installing the clone will take some time (you can watch the individual packages being installed within the "Manage Environments" window and you may be prompted to restart ArcGIS Pro to effect your changes); when it's done, the new environment "arcgispro-py3-clone" (or whatever you choose to call it - but we'll be assuming it's called the default name) can be activated by clicking on the button on the left.
Do so and also note down the path where the cloned environment has been installed appearing below the name. It should be something like C:\Users\<username>\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone . Then click the OK button.
Important: The cloned environment will most likely become unusable when you update Pro to a newer main version (e.g., from 2.9 to 3.0 or 3.0 to 3.1). So once you have cloned the environment successfully, please don't update your Pro installation before the end of the class, unless you are willing to do the cloning and Spyder installation again. Versions 3.x and later of Pro include a function to update your Python installation, but it is new functionality, so it might not yet always work as expected.
Now back at the package manager, the new Python environment should appear under "Project Environment" as shown in the figure below (but be aware this might take 30+ minutes so you'll need to be patient).
To now install Spyder, select "Add Packages," search for Spyder and click the "Install" button. This might also take around 30+ minutes and it'll be best if you've restarted Pro after creating your new environment and selecting it.
The package manager will show you a list of packages that will have to be installed and ask you to agree to the terms and conditions. After doing that, the installation will start and probably take a while. You may also get a "User Access Control" popup asking if you want conda_uac.exe to make changes to your device; it is OK to choose Yes.
Once the installation is finished, it is recommended that you restart ArcGIS Pro (and if you have trouble, restarting your PC as well usually helps). If you keep having problems with the installation failing or Spyder simply not showing up in the list of installed packages (even after refreshing the list), please try starting ArcGIS Pro in admin mode (if you are not already running it this way) by doing a right-click -> Run as administrator.
Once Spyder is installed, you might like to create a shortcut to it on your Desktop or Start Menu. In that case, you should be able to find the Spyder executable in the Scripts subfolder of your cloned Python environment, so at C:\Users\<username>\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone\Scripts\spyder.exe where <username> needs to be replaced with your Windows user name. If you don't see the AppData folder, you will have to change the options in the Windows File Explorer to display hidden files and folders. Make sure to use the .exe file called spyder.exe, not the one called spyder3.exe. If you are using an older version of ArcGIS Pro and installed Spyder directly into the default environment, the path will most likely be C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\Scripts\spyder.exe .
If you are familiar with another IDE, you're welcome to substitute it for Spyder (just verify that it is using Python 3).
When Spyder launches, it may ask you whether you want to update to a newer version. We recommend NOT trying this because the update procedure will most likely not work with the ArcGIS Pro Python environment. Once Spyder has started, it should display a message in the IPython tab similar to:
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 12:30:02) [MSC v.1900 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 6.3.1 -- An enhanced Interactive Python.

In [1]:
Don’t worry if the version number is different, as long as it starts with a 3. What we’re looking at here is equivalent to the Python interactive window in ArcGIS Desktop, ArcGIS Pro, PythonWin or any of the IDEs you might be familiar with.
We can experiment here by typing "import arcpy"
to import arcpy or running some of those print statement examples from earlier.
In [1]: import arcpy

In [2]: print("Hello World")
Hello World
You might have noticed while typing in that second example a useful feature of the IPython interactive window: code completion. This is where the IDE (Spyder does it too) is smart enough to recognize that you're entering a function name, and it provides you with information about the parameters that function takes. If you missed it the first time, enter print( in the IPython window and wait for a second (or less) and the print function's parameters will appear. This also works for arcpy functions (or those from any library that you import). Try it out with arcpy.management.CreateFeatureclass (or your favorite arcpy function).
Click the File menu -> New File option to open a blank code editor window that we can use to write our first piece of Python 3 code with the ArcGIS Pro version of arcpy. In the remainder of this lesson, we’re going to look at some simple examples taken from GEOG 485 (because they should be somewhat familiar to most people) which we’ll use to practice modifying code from Python 2 to 3 where needed and working with arcpy under ArcGIS Pro. Later, we’ll use some of these same code examples to migrate from single processor, sequential execution to multiprocessor, parallel execution. Below, we show the "old" Python 2 version of the code followed by the Python 3 version that you can try out in Spyder, e.g., by copying the code into an empty editor window and running it from there.
This first example script reports the spatial reference (coordinate system) of a feature class stored in a geodatabase [1]:
# Opens a feature class from a geodatabase
# and prints the spatial reference
import arcpy

featureClass = "C:/Data/USA/USA.gdb/States"

# Describe the feature class and get its spatial reference
desc = arcpy.Describe(featureClass)
spatialRef = desc.spatialReference

# Print the spatial reference name
print spatialRef.Name
Python 3 / ArcGIS Pro version:
# Opens a feature class from a geodatabase
# and prints the spatial reference
import arcpy

featureClass = "C:/Data/USA/USA.gdb/States"

# Describe the feature class and get its spatial reference
desc = arcpy.Describe(featureClass)
spatialRef = desc.spatialReference

# Print the spatial reference name
print(spatialRef.Name)
Did you notice the very subtle difference?
First, let us look at all of the things that are the same and refresh our memories of what the code is doing:
So, what’s different? The only difference is in the last, highlighted line of the script. The print statement from Python 2 is now a function as we described earlier, so it takes parameters and therefore we’re passing print a value, in this case the spatialRef.Name that we want it to print. That's all!
We’re going to look at a couple more examples (also borrowed from GEOG 485) and convert them from Python 2 to 3 if needed as we continue through the lesson. Esri recognized that a lot of existing Python developers would want to migrate from Python 2 to 3 and to smooth the way they developed a tool for ArcGIS Desktop (which they've since ported to Pro) called Analyze Tools for Pro [15] which does just what the name suggests.
To test the example code we just investigated manually, we saved the Python 2 version to a .py file and supplied it as input to the tool. The output we get from this displaying all of the elements which need to be converted as warnings is shown below.
As you can see from the image, we get a warning about the print statement (on line 12) as well as a suggestion of what to change that line to. Those warnings are also written into our output file which will be helpful when we’re trying to modify longer pieces of code (or if you wanted to share the task among many programmers).
Here’s another simple script that finds all cells over 3500 meters in an elevation raster and makes a new raster that codes all those cells as 1. Remaining values in the new raster are coded as 0. By now, you’re probably familiar with this type of “map algebra” operation which is common in site selection and other GIS scenarios.
Just in case you’ve forgotten, the expression Raster(inRaster) tells arcpy that it needs to treat your inRaster variable as a raster dataset so that you can perform map algebra on it. If you didn't do this, the script would treat inRaster as just a literal string of characters (the path) instead of a raster dataset.
# This script uses map algebra to find values in an
# elevation raster greater than 3500 (meters).
import arcpy
from arcpy.sa import *

# Specify the input raster
inRaster = "C:/Data/Elevation/foxlake"
cutoffElevation = 3500

# Check out the Spatial Analyst extension
arcpy.CheckOutExtension("Spatial")

# Make a map algebra expression and save the resulting raster
outRaster = Raster(inRaster) > cutoffElevation
outRaster.save("C:/Data/Elevation/foxlake_hi_10")

# Check in the Spatial Analyst extension now that you're done
arcpy.CheckInExtension("Spatial")
You can probably easily work out what this script is doing but, just in case, the main points to remember on this script are:
Copy the code above into a file called Lesson1A.py (or similar, as long as it has a .py extension) in Spyder or your favorite IDE or text editor and then save it.
We don’t need to do anything to this code to get it to work in Python 3, it will be fine just as it is. Feel free to check it against Analyze Tools for Pro if you like. Your results should say “Analyze Tools for Pro Completed Successfully” with the lack of warnings signifying that the code you supplied is compatible with Python 3.
Next, we'll convert the script to a Tool.
Now, let’s convert this script to a script tool in ArcGIS Pro to familiarize ourselves with the process and we’ll examine the differences between ArcGIS Desktop and ArcGIS Pro when it comes to working with script tools (hint: there aren’t any other than the interface looking slightly different).
We’ll get started by opening ArcGIS Pro. You will be prompted to sign in (use your Penn State ArcGIS Online account which you should already have) and create a project when Pro starts.
Signing in to ArcGIS Pro is an important, new development for running code in Pro as compared to Desktop. As you may be aware, Pro operates with a different licensing structure such that it will regularly "phone home" to Esri's license servers to check that you have a valid license. With Desktop, once you had installed it and set up your license, you could run it for the 12 months the license was valid, online or offline, without any issues. As Pro will regularly check-in with Esri, we need to be mindful that if our code stops working due to an extension not being licensed error or due to a more generic licensing issue, we should check that Pro is still signed in. For nearly everyone, this won't be an issue as you'll generally be using Pro on an Internet connected computer and you won't notice the licensing checks. If you take your computer offline for an extended period, you will need to investigate Esri's offline licensing options [16].
Projects are Pro’s way of keeping all of your maps, layouts, tasks, data, toolboxes etc. organized. If you’re coming from Desktop, think of it as an MXD with a few more features (such as allowing multiple layouts for your maps).
Choose to Create a new project using the Blank template, give it a meaningful name and put it in a folder appropriate for your local machine (things will look slightly different in version 3.0 of Pro: simply click on the Map option under New Project there if you are using that version).
You will then have Pro running with your own toolbox already created. In the figure below, I’ve clicked on the Toolboxes to expand it to show the toolbox which has the same name as my project.
If we right-click on our toolbox we can choose to create a New > Script.
A window will pop up allowing us to enter a name for our script (“Lesson1A”) and a label for our script (“Geog 489 Lesson 1A”), and then we’ll use the file browse icon to locate the script file we saved earlier. In new versions of Pro (2.9 and 3.0), the script file now has to be selected in a new tab called "Execution" located below "Parameters". If your script isn’t showing up in that folder or you get a message that says “Container is empty” press F5 on your keyboard to refresh the view.
We won’t choose to “Import Script” or define any parameters (yet) or investigate validation (yet). When we click OK, we’ll have our script tool created in Pro. We’re not going to run our script tool (yet) as it’s currently expecting to find the foxlake DEM data in C:\data\elevation and write the results back to that folder which is not very convenient. It also has the hardcoded cutoff of 3500 embedded in the code. You can download the FoxLake DEM here [17].
To make the script more user-friendly, we’re going to make a few changes to allow us to pick the location of the input and output files as well as allow the user to input the cutoff value. Later we’ll also use validation to check whether that cutoff value falls inside the range of values present in the raster and, if not, we’ll change it.
We can edit our script from within Pro, but if we do that, it opens in Notepad, which isn’t the best environment for coding. You can use Notepad if you like, but I’d suggest opening the script again in your favorite text editor (I like Notepad++) or just using Spyder.
If you want, you can change this preferred editor by modifying Pro’s geoprocessing options (see http://pro.arcgis.com/en/pro-app/help/analysis/geoprocessing/basics/geoprocessing-options.htm). To access these options in Pro, click Home -> Options -> Geoprocessing Options. Here you can also choose an option to automatically validate tools and scripts for Pro compatibility (so you don’t need to run the Analyze Tools for Pro manually each time).
We're going to make a few changes to our code now, swapping out the hardcoded paths in lines 8 and 17 and the hardcoded cutoffElevation value in line 9. We’re also setting up an outPath variable in line 10 and setting it to arcpy.env.workspace.
You might recall from GEOG 485 or your other experience with Desktop that the default workspace in Desktop is usually default.gdb in your user path. Pro is smarter than that and sets the default workspace to be the geodatabase of your project. We’ll take advantage of that to put our output raster into our project workspace. Note the difference in the type of parameter we’re using in lines 8 & 9. It’s ok for us to get the path as Text, but we don’t want to get the number in cutoffElevation as Text because we need it to be a number.
To simplify the programming, we’ll specify a different parameter type in Pro and let that be passed through to our script. To make that happen, we’ll use GetParameter instead of GetParameterAsText.
# This script uses map algebra to find values in an
# elevation raster greater than 3500 (meters).
import arcpy
from arcpy.sa import *

# Get the input raster and the cutoff value as
# parameters of the script tool
inRaster = arcpy.GetParameterAsText(0)
cutoffElevation = arcpy.GetParameter(1)
outPath = arcpy.env.workspace

# Check out the Spatial Analyst extension
arcpy.CheckOutExtension("Spatial")

# Make a map algebra expression and save the resulting raster
outRaster = Raster(inRaster) > cutoffElevation
outRaster.save(outPath + "/foxlake_hi_10")

# Check in the Spatial Analyst extension now that you're done
arcpy.CheckInExtension("Spatial")
Once you have made those changes, save the file and we’ll go back to our script tool in Pro and update it to use the parameters we’ve just defined. Right click on the script tool within the toolbox and choose Properties and then click Parameters. The first parameter we defined (remember Python counts from 0) was the path to our input raster (inRaster), so let's set that up. Click in the text box under Label and type “Input Raster” and when you click into Name you’ll see that Name is already automatically populated for you. Next, click the Data Type (currently String) and change it to “Raster Dataset” and we’ll leave the other values with their defaults.
Click the next Label text box below your first parameter (currently numbered with a *) and type “Cutoff Value” and change the Data Type to Long (which is a type of number) and we’ll keep the rest of the defaults here too. The final version should look as in the figure below.
Click OK and then we’ll run the tool to test the changes we made by double-clicking it. Use the file icon alongside our Input Raster parameter to navigate to your foxlake raster (which is the FoxLake digital elevation model (DEM) in your Lesson 1 data folder), then enter 3500 into the Cutoff Value parameter and click Run to run the tool.
The tool should have executed without errors and placed a raster called foxlake_hi_10 into your project geodatabase.
If it doesn’t work the first time, verify that:
Now let’s expand on the user friendliness of the tool by using the validator methods to ensure that our cutoff value falls within the minimum and maximum values of our raster (otherwise performing the analysis is a waste of resources).
The purpose of the validation process is to allow us to have some customizable behavior depending on what values we have in our tool parameters. For example, we might want to make sure a value is within a range as in this case (although we could do that within our code as well), or we might want to offer a user different options if they provide a point feature class instead of a polygon feature class, or different options if they select a different type of field (e.g. a string vs. a numeric type).
The Esri help for Tool Validation [18] gives a longer list of uses and also explains the difference between internal validation (what Desktop & Pro do for us already) and the validation that we are going to do here which works in concert with that internal validation.
You will notice in the help that Esri specifically tells us not to do what I’m doing in this example – running geoprocessing tools. The reason for this is they generally take a long time to run. In this case, however, we’re using a very simple tool which gets the minimum & maximum raster values and therefore executes very quickly. We wouldn’t want to run an intersection or a buffer operation for example in the ToolValidator, but for something very small and fast such as this value checking, I would argue that it’s ok to break Esri’s rule. You will probably also note that Esri hints that it’s ok to do this by using Describe to get the properties of a feature class and we’re not really doing anything different except we’re getting the properties of a raster.
So how do we do it? Go back to your tool (either in the Toolbox for your Project, Results, or the Recent Tools section of the Geoprocessing sidebar), right click and choose Properties and then Validation.
You will notice that we have a pre-written, Esri-provided class definition here. We will talk about how class definitions look in Python in Lesson 4 but the comments in this code should give you an idea of what the different parts are for. We’ll populate this template with the lines of code that we need. For now, it is sufficient to understand that different methods (initializeParameters(), updateParameters(), etc.) are defined that will be called by the script tool dialog to perform the operations described in the documentation strings following each line starting with def.
Take the code below and use it to overwrite what is in your ToolValidator:
import arcpy

class ToolValidator(object):
    """Class for validating a tool's parameter values and controlling
    the behavior of the tool's dialog."""

    def __init__(self):
        """Setup arcpy and the list of tool parameters."""
        self.params = arcpy.GetParameterInfo()

    def initializeParameters(self):
        """Refine the properties of a tool's parameters. This method is
        called when the tool is opened."""

    def updateParameters(self):
        """Modify the values and properties of parameters before internal
        validation is performed. This method is called whenever a parameter
        has been changed."""

    def updateMessages(self):
        """Modify the messages created by internal validation for each tool
        parameter. This method is called after internal validation."""
        ## Remove any existing messages
        self.params[1].clearMessage()

        if self.params[1].value is not None:
            ## Get the raster path/name from the first [0] parameter as text
            inRaster1 = self.params[0].valueAsText
            ## Calculate the minimum value of the raster and store it in a variable
            elevMINResult = arcpy.GetRasterProperties_management(inRaster1, "MINIMUM")
            ## Calculate the maximum value of the raster and store it in a variable
            elevMAXResult = arcpy.GetRasterProperties_management(inRaster1, "MAXIMUM")
            ## Convert those values to floating point numbers
            elevMin = float(elevMINResult.getOutput(0))
            elevMax = float(elevMAXResult.getOutput(0))

            ## Calculate a new cutoff value if the original value falls
            ## outside the range of the raster
            if self.params[1].value < elevMin or self.params[1].value > elevMax:
                cutoffValue = elevMin + ((elevMax - elevMin) / 100 * 90)
                self.params[1].value = cutoffValue
                self.params[1].setWarningMessage("Cutoff Value was outside the range [" + str(elevMin) + "," + str(elevMax) + "] of the supplied raster, so a 90% value was calculated")
Our logic here is to take the raster supplied by the user and determine the min and max values so that we can evaluate whether the cutoff value supplied by the user falls within that range. If that is not the case, we're going to do a simple mathematical calculation to find the value 90% of the way between the min and max values and suggest that as a default to the user (by putting it into the parameter). We’ll also display a warning message to the user telling them that the value has been adjusted and why their original value doesn’t work.
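Since the 90% fallback is plain arithmetic, you can check it in standalone Python. Here we use 2798 and 3884, the approximate minimum and maximum of the FoxLake DEM, as example inputs:

```python
# Example min/max values (the approximate range of the FoxLake DEM)
elev_min = 2798.0
elev_max = 3884.0

# Value 90% of the way between the min and max
cutoff = elev_min + ((elev_max - elev_min) / 100 * 90)
print(cutoff)  # → 3775.4
```

Any cutoff the user supplies between 2798 and 3884 would be left alone; only out-of-range values trigger this replacement.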
As you look over the code, you’ll see that all of the work is being done in the bottom function updateMessages(). This function is called after the updateParameters() and the internal arcpy validation code have been executed. It is mainly intended for modifying the warning or error messages produced by the internal validation code. The reason why we are putting all our validation code here is because we want to produce the warning message and there is no entirely simple way to do this if we already perform the validation and potentially automatic adjustment of the cutoff value in updateParameters() instead. Here is what happens in the updateMessages() function:
We start by cleaning up any previous messages with self.params[1].clearMessage() (line 24). Then we check on line 26 whether the user has entered a value into the cutoffValue parameter (self.params[1]). If they haven't, we don’t do anything (for efficiency). If the user has entered a value (i.e., the value is not None), then we get the raster name from the first parameter (self.params[0]) and extract it as text (because we want the content to use as a path) on line 28. Then we call the arcpy GetRasterProperties function twice, once to get the min value (line 30) and again to get the max value (line 32) of the raster. We then convert those values to floating point numbers (lines 34 & 35).
Once we’ve done that, we do a little bit of checking to see if the value the user supplied is within the range of the raster. If it is not, then we will do some simple math to calculate a value that falls 90% of the way into the range and then update the parameter (self.params[1].value) with the number we calculated (line 40 and 41). Finally, in line 42, we produce the warning message informing the users of the automatic value adjustment.
Now let’s test our Validator. Click OK and return to your script in the Toolbox, Results or Geoprocessing window. Run the script again. Insert the name of the input raster again. If you didn’t make any mistakes entering the code there won’t be a red X by the Input Raster. If you did make a mistake, an error message will be displayed there, showing you the usual arcpy / geoprocessing error message and the line of code that the error is occurring on. If you have to do any debugging, exit the script, return to the Toolbox, right click the script and go back to the Tool Validator and correct the error. Repeat as many times as necessary.
If there were no errors, we should test out our validation by putting a value into our Cutoff Value parameter that we know to be outside the range of our data. If you choose a value < 2798 or > 3884, you should see a yellow warning triangle appear that displays our warning message, and you will also note that the value in Cutoff Value has been updated to our 90% value.
We can change the value to one we know works within the range (e.g. 3500), and now the tool should run.
Now that we are back into the groove of writing arcpy code and creating script tools, we want to look at a topic that didn't play a big role in our introductory class, GEOG 485, but that is very important to programming: We want to talk about run-time performance and the related topics of parallelisation and code profiling. These will be the main topics of the next sections and this lesson in general.
We are going to address the question of how we can improve the performance and reliability of our Python code when dealing with more complicated tasks that require a larger number of operations on a greater number of datasets and/or more memory. To do this we’re going to look at both 64-bit processing and multiprocessing. We’re going to start investigating these topics using a simple raster script to process LiDAR data from Penn State’s main campus and surrounding area. In later sections, we will also look at a vector data example using different data sets for Pennsylvania.
The raster data consists of 808 tiles which are all individually zipped, 550MB zipped in total. The individual .zip files can be downloaded from PASDA directly [19].
Previously PASDA provided access via FTP but unfortunately that ability has been removed. However, we recommend you use a little Python script we put together that uses BeautifulSoup (which we'll look at more in Lesson 2) to download the files. The script will also automatically extract the individual .zip files. For this you have to do the following:
Doing any GIS processing with these LiDAR files is definitely a task to be handled by scripting, and any performance benefits we can gain when we’re processing that many tiles will be worthwhile. The question you might be asking is: why don’t we just join all of the tiles together and process them at once? Because we’d run out of memory very fast, and if something went wrong, we would need to start over. Processing small tiles, we can do one (or a few) at a time using less memory, and if one tile fails, we still have all of the others and just need to restart that tile.
Below is our simple raster script which gets our list of tiles and then for every tile in the list we fill the DEM, create a flow direction and flow accumulation raster to then derive a stream raster (to determine where the water might flow), and lastly we convert the stream raster to polygon or polyline feature classes. This is a simplified version of the sort of analysis you might undertake to prepare data prior to performing a flood study. The code we are writing here will work in both Desktop and Pro as long as you have the Spatial Analyst extension installed, authorized and enabled (it is this last step that generally causes errors). I’ve restricted the processing to a subset of those tiles for testing and performance reasons using only tiles with 227 in the name but more tiles can be included by modifying the wild card list in line 19.
If you used the download script above, you already have the downloaded raster files ready. You can move them to a new folder or keep them where they are. In any case, you will need to make sure that the workspace in the script below points to the folder containing the extracted raster files (line 9). If you obtained the raster files in some other way, you may have to unzip them to a folder first.
Let’s look over the code now. You will notice that the version below is for Python 3 (the print function gives it away) but it will work in both Python 2 and 3 without changes.
# Setup _very_ simple timing.
import time
process_start_time = time.time()

import arcpy
from arcpy.sa import *

arcpy.env.overwriteOutput = True
arcpy.env.workspace = r'C:\489\PSU_LiDAR'

## If our rasters aren't in our filter list then drop them from our list.
def filter_list(fileList, filterList):
    return [i for i in fileList if any(j in i for j in filterList)]

# Ordinarily we would want all of the rasters; I'm filtering by a small set for testing & efficiency.
# I did this by manually looking up the tile index for the LiDAR and determining an area of interest:
# tiles ending in 227, 228, 230, 231, 232, 233, 235, 236

wildCardList = set(['227'])  ## ,'228','230','231','232','233','235','236'])

# Get a list of rasters in my folder
rasters = arcpy.ListRasters("*")
new_rasters = filter_list(rasters, wildCardList)

for raster in new_rasters:
    raster_start_time = time.time()
    ## Note for performance we're not saving any of the intermediate rasters - they will exist only in memory
    try:
        ## Check out the Spatial Analyst extension (needed by the map algebra functions below)
        arcpy.CheckOutExtension("Spatial")
        ## Fill the DEM to remove any sinks
        FilledRaster = Fill(raster)
        ## Calculate the Flow Direction (how water runs across the surface)
        FlowDirRaster = FlowDirection(FilledRaster)
        ## Calculate the Flow Accumulation (where the water accumulates in the surface)
        FlowAccRaster = FlowAccumulation(FlowDirRaster)
        ## Convert the Flow Accumulation to a stream network.
        ## We're setting an arbitrary threshold of 100 cells flowing into another cell to set it as part of our stream:
        ## http://pro.arcgis.com/en/pro-app/tool-reference/spatial-analyst/identifying-stream-networks.htm
        Streams = Con(FlowAccRaster, 1, "", "Value > 100")
        ## Convert the raster stream network to a feature class
        output_Polyline = raster.replace(".img", ".shp")
        arcpy.sa.StreamToFeature(Streams, FlowDirRaster, output_Polyline)
        arcpy.CheckInExtension("Spatial")
    except:
        print("Errors occurred")
        print(arcpy.GetMessages())
        arcpy.AddMessage("Errors occurred")
        arcpy.AddMessage(arcpy.GetMessages())

# Output how long the whole process took.
arcpy.AddMessage("--- %s seconds ---" % (time.time() - process_start_time))
print("--- %s seconds ---" % (time.time() - process_start_time))
We have set up some very simple timing functionality in this script using the time() function defined in the module time of the Python standard library. The function gives you the current time and, by calling it at the beginning and end of the program and then taking the difference in the very last line of the script, we get an idea of the runtime of the script.
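This timing pattern works independently of arcpy; here is a minimal sketch with an arbitrary stand-in computation in place of the raster processing:

```python
import time

process_start_time = time.time()

# Stand-in for a long-running computation
total = sum(i * i for i in range(1_000_000))

# The difference between the two time() calls is the elapsed
# wall-clock time in seconds
elapsed = time.time() - process_start_time
print("--- %s seconds ---" % elapsed)
```

Note that time.time() measures wall-clock time, so anything else your PC is doing at the same time is included in the measurement.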
Later in the lesson, we will go into more detail about properly profiling code where we will examine the performance of a whole program as well as individual instructions. For now, we just want an estimate of the execution time. Of course, it’s not going to be very precise as it will depend on what else you’re doing on your PC at the same time and we would need to run a number of iterations to remove any inconsistencies (such as the delay when arcpy loads for the first time etc.). On my PC that code runs in around 40 seconds. Your results will vary depending on many factors related to the performance of your PC (we'll review some of them in the Speed Limiters section) but you should test out the code to get an idea of the baseline performance of the algorithm on your PC.
In lines 12 and 13, we have a simple function to filter our list of rasters to just those we want to work with (centered on the PSU campus). This function might look a little different to what you have seen before - that's because we're using list comprehension which we'll examine in more detail in Lesson 2 [21]. So don't worry about understanding how exactly this works at the moment. It basically says to return a list with only those file names from the original list that contain one of the numbers in the wild card list.
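To see what the function returns without any rasters involved, you can run it on a plain list of file names (the tile names below are made up for illustration):

```python
def filter_list(fileList, filterList):
    # Keep only those names that contain at least one of the wild card strings
    return [i for i in fileList if any(j in i for j in filterList)]

tiles = ["tile_227_1.img", "tile_228_1.img", "tile_230_2.img"]
print(filter_list(tiles, {"227", "230"}))  # → ['tile_227_1.img', 'tile_230_2.img']
```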
We set up some environment variables and our wildcard list (used by our function for filtering) at line 19, where you will notice I have commented out part of the list for speed during testing. Then we get our list of rasters, filter it, and iterate through the remaining rasters with the central for-loop in line 25, performing the spatial analysis tasks mentioned earlier. There is some basic error checking wrapped around the tasks (the geoprocessing messages it reports also include running times if anything goes wrong) and then, lastly, there is a message and print function with the total time. I’ve included both print and AddMessage just in case you wanted to test the code as a script tool in ArcGIS.
Feel free to run the script now and see what total computation time you get from the print statement in the last line of the code. We‘re going to demonstrate some very simple performance evaluation of the different versions of ArcGIS (32 bit Desktop, 64 bit Desktop, Pro and arcpy Multiprocessing) using this code. Before we do that though it is important to understand the differences between each of them. You do not have to run this testing yourself; we’re mainly providing it as some background. You are welcome to experiment with it, but please do not do that to the detriment of your project.
Once we’ve examined the theory of 64-bit processing and parallelisation and worked through a simple example using the Hi-ho Cherry-O game from GEOG 485, we’ll come back to the raster example above and convert it to running in parallel using the Python multiprocessing package instead of sequentially and we will further look at an example of multiprocessing using vector data.
32-bit software or hardware can only directly represent and operate with numbers up to 2^32 and, hence, can only address a maximum of 4GB of memory (2^32 = 4294967296 bytes). If the file system of your operating system is limited to 32-bit integers as well, this also means you cannot have any single file larger than 4GB either in memory or on disk (you can still page or chain larger files together, though).
64-bit architectures don’t have this limit. Instead you can access up to 16 terabytes of memory and this is actually only the limit of current chip architectures which "only" use 44 bits which will change over time as software and hardware architectures evolve. Technically with a 64-bit architecture you could access 16 Exabytes of memory (2^64) and while not wanting to paraphrase Bill Gates, that is probably more than we’ll need for the foreseeable future.
There most likely won't be any innate performance benefits to be gained by moving from 32-bit to 64-bit unless you need that extra memory. While in principle you can move larger amounts of data per unit of time between memory and CPU with 64-bit, this typically doesn't result in significantly improved execution times because of caching and other optimization techniques used by modern CPUs. However, if we start using programming models where we run many tasks at once, we might want more than 4GB allocated to those processes. For example, if you had 8 tasks that all needed 500MB of RAM each, that’s very close to the 4GB limit in total (500MB * 8 = 4000MB). If you had a machine with more processors (e.g. 64), you would very easily hit the 32-bit 4GB limit, as you would only be able to allocate 62.5MB of RAM per processor from your code.
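The arithmetic behind those numbers is easy to verify in a few lines of Python:

```python
# ~4 GB addressable by a 32-bit process, expressed in MB
total_mb = 4000

# 8 tasks at 500 MB each sit right at the 4 GB limit
per_task_mb = 500
tasks = 8
print(per_task_mb * tasks)   # → 4000

# On a 64-processor machine, the per-processor share shrinks to 62.5 MB
workers = 64
print(total_mb / workers)    # → 62.5
```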
Even with hardware architectures and operating systems mainly being 64-bit these days, a lot of software is still only available in 32-bit versions. 64-bit operating systems are designed to be backwards compatible with 32-bit applications, and if there is no real expected benefit for a particular piece of software, its developer may just as well decide to stick with 32-bit and avoid the effort and cost it would take to make the change to 64-bit or to support multiple versions. ArcGIS Desktop is an example of software that is only available as 32-bit, and this is (most likely) not going to change anymore since ArcGIS Pro, which is 64-bit, fills that role. However, Esri also provides a 64-bit geoprocessing extension for ArcGIS Desktop, which will be further described in the next section. That section is considered optional. You may read through it and learn how the extension can be set up and what performance gain can be achieved using it, or you may skip most of it and just have a look at the table with the computation time comparison at the very end of the section. But we strongly recommend that you do not try to install 64-bit geoprocessing and perform the steps yourself before you have worked through the rest of the lesson and the homework assignment.
This section is provided for interest only - as it only applies to ArcGIS Desktop - not Pro (which is natively 64 bit). It is recommended that you only read/skim through the section and check out the computation time comparison at the end without performing the described steps yourself and then loop back to it at the end of the lesson if you have free time and are interested in exploring 64-bit Background Geoprocessing in ArcGIS Desktop.
A number of versions ago (since Server 10.1), Esri added support for 64-bit arcpy. Esri also introduced 64-bit geoprocessing using the 64-bit Background Geoprocessing patch which was part of 10.1 Service Pack 1 as an option in ArcGIS Desktop (Pro is entirely 64-bit) to work around these memory issues for large geoprocessing tasks. Not all tools support 64-bit geoprocessing within Desktop and there are some tricks to getting it installed so you can access it in Desktop. There is also a 64-bit arcpy geoprocessing library so you can run your code (any code) from the command line. Background Geoprocessing (64-bit) is still available as a separate installation on top of ArcGIS (see this ArcMap/Background Geoprocessing (64-bit) page [22]) and we’ve provided a link for students to obtain it within Canvas. You'll find this link on the "64-bit Geoprocessing downloads for ArcGIS (optional)" page under Lesson 1 in Canvas.
As Esri hint in their documentation, 64-bit processing within ArcGIS Desktop requires that the tool run in the background. This is because it is running using a separate set of tools which are detached from the Desktop application (which is 32-bit). Personally, I rarely use Background Geoprocessing but I do make use of the 64-bit version of Python that it installs to run a lot of my scripts in 64-bit mode from the command line.
If you’ve typically run your code in the past from within an IDE (such as PythonWin, IDLE or spyder) or from within ArcGIS you might not be aware that you can also run that code from the command line by calling Python directly.
For ArcGIS Desktop you can start a regular command prompt and, using the standard Windows commands, change to that path where your Python script is located. Usually when you open a command window, it will start in your home folder (e.g. c:\users\yourname). We could dedicate an entire class to operating system commands but Microsoft has a good resource at this Windows Commands page [23] for those who are interested.
We just need a couple of the commands listed there:
We’ll change the directory to where our code [26] from section 1.6 is (e.g. mine is c:\wcgis\geog489\lesson1) and see how to run the script using the command line versions of 32-bit and 64-bit Python.
cd c:\wcgis\geog489\lesson1
If you downloaded and installed the 64-bit Background Geoprocessing from above you will have both 32-bit and 64-bit Python installed. We’ll use the 32-bit Python first which should be located at c:\python27\arcgis10.6\python.exe (substitute 10.6 by whichever version of ArcGIS you have installed).
There’s a neat little feature built into the command line where you can use the TAB key to autocomplete paths so you could start typing c:\python27\a and then hit TAB and you should see the path cycling through the various ArcGIS folders.
We’ll run our code using:
C:\python27\ArcGIS10.6\python.exe myScriptName.py
Where myScriptName.py is whatever you saved the code from section 1.6 as. You will now see the code run in the command window and pop up all of the same messages you would have seen if you had run it from an IDE (but not the AddMessages messages as they are only interpreted by ArcGIS).
To run the code against the 64-bit version of Python the command is almost identical except that you’ll use the x64 version of Python that has been installed by Background Geoprocessing. In my case that means the command is:
C:\python27\ArcGISx6410.6\python.exe myScriptName.py
Once your script finishes you’ll have a few time stamp values. Running that code from Section 1.6 through the 32-bit and 64-bit versions a few times we have some sample results below. The first runs of each are understandably slower as arcpy is imported for the first time. You have probably witnessed this behavior yourself as your code takes longer the very first time it runs.
32-bit Desktop | 64-bit Desktop | 64-bit Pro |
---|---|---|
149 seconds | 107 seconds | 109 seconds |
119 seconds | 73 seconds | 144 seconds |
91 seconds | 90 seconds | 111 seconds |
85 seconds | 73 seconds | |
93 seconds | 75 seconds | |
We can see a couple of things with these results – they are a little inconsistent depending on what else my PC was doing at the time the code was running and, if you are looking at individual executions of the code, it is difficult to see which pieces of the code are slower or faster from time to time. This is the problem that we will solve later in the lesson when we look at profiling code where we examine how long each line of code takes to run.
You have probably noticed if you have a relatively modern PC (anything from the last several years) that when you open Windows Task Manager (from the bottom of the list when you press CTRL-ALT-DEL) and you click the Performance tab and right click on the CPU graph and choose Change Graph to -> Logical Processors you have a number of processors (or cores) within your PC. These are actually “logical processors“ within your main processor but they function as though they were individual processors – and we’ll just refer to them as processors here for simplicity.
Now because we have multiple processors, we can run multiple tasks in parallel at the same time instead of one at a time. There are two ways that we can run tasks at the same time – multithreaded and multiprocessing. We’ll look at the differences in each in the following but it’s important to know that arcpy doesn’t support multithreading but it does support multiprocessing. In addition, there is a third form of parallelisation called distributed computing, which involves distributing the task over multiple computers, that we will also briefly talk about.
Multithreading is based on the notion of "threads": a number of tasks that are executed within the same memory space. The advantage of this is that, because the memory is shared between the threads, they can share information. This results in a much lower memory overhead because information doesn’t need to be duplicated between threads. The basic logic is that a single thread starts off a task and then multiple threads are spawned to undertake sub-tasks. At the conclusion of those sub-tasks, all of the results are joined back together again. Those threads might run across multiple processors or all on the same one, depending on how the operating system (e.g. Windows) chooses to prioritize the resources of your computer. In the example of the PC above, which has 4 processors, a single-threaded program would only run on one processor while a multi-threaded program would run across all of them (or as many as necessary).
Multiprocessing achieves broadly the same goal as multi-threading which is to split the workload across all of the available processors in a PC. The difference is that multiprocessing tasks cannot communicate directly with each other as they each receive their own allocation of memory. That means there is a performance penalty as information that the processes need must be stored in each one. In the case of Python a new copy of python.exe (referred to as an instance) is created for each process that you launch with multiprocessing. The tasks to run in multiprocessing are usually organized into a pool of workers which is given a list of the tasks to be completed. The multiprocessing library will assign each task to a worker (which is usually a processor on your PC) and then once a worker completes a task the next one from the list will be assigned to that worker. That process is repeated across all of the workers so that as each finishes a task a new one will be assigned to them until there are no more tasks left to complete.
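Since arcpy isn't needed to demonstrate the pattern itself, here is a minimal, standalone sketch of a pool of workers using Python's multiprocessing package; the square function is a made-up stand-in for a real task such as processing one raster tile:

```python
import multiprocessing

def square(n):
    # Hypothetical stand-in for one unit of work (e.g. processing a single tile)
    return n * n

if __name__ == "__main__":
    # A pool of 4 workers; map() hands the next item in the list to the
    # next free worker until no items are left
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(square, range(10))
    print(results)  # → [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

pool.map() returns the results in the order of the input list even though the tasks may finish in a different order, and the if __name__ == "__main__": guard is required on Windows because each worker process re-imports the script.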
You might have heard of the MapReduce framework which underpins the Hadoop parallel processing approach. The use of the term map might be confusing to us as GIS folks as it has nothing to do with our normal concept of maps for displaying geographical information. Instead in this instance map means to take a function (as in a programming function) and apply it once to every item in a list (e.g. our list of rasters from the earlier example).
The reduce part of the name is similar as we apply a function to a list and combine the results of our function into a single result (e.g. a list from 1 – 10,000 which is our number of Hi-ho Cherry-O games and we want the number of turns for each game).
The two elements map and reduce work harmoniously to solve our parallel problems. The map part takes our one large task (which we have broken down into a number of smaller tasks and put into a list) and applies whatever function we give it to the list (one item in the list at a time) on each processor (which is called a worker). Once we have a result, that result is collected by the reduce part from each of the workers and brought back to the calling function. There is a more technical explanation in the Python documentation [27].
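The same map/reduce idea can be tried in plain Python with the built-in map() function and functools.reduce() (the numbers below are made up for illustration):

```python
from functools import reduce

game_ids = [1, 2, 3, 4]

# "map": apply a function once to every item in the list
turns = list(map(lambda g: g * 3, game_ids))   # pretend game g took g*3 turns

# "reduce": combine the mapped results into a single value
total_turns = reduce(lambda a, b: a + b, turns)

print(turns)        # → [3, 6, 9, 12]
print(total_turns)  # → 30
```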
At around the same time that Esri introduced 64-bit processing, they also introduced multiprocessing to some of the tools within ArcGIS Desktop (mostly raster based tools in the first iteration) and also added multiprocessor support to the arcpy library.
Multiprocessing has been available in Python for some time and it’s a reasonably complicated concept so we will do our best to simplify it here. We’ll also provide a list of resources at the end of this section for you to continue exploring if you are interested. The multiprocessing package of Python is part of the standard library and has been available since around Python 2.6. The multiprocessing library is required if you want to implement multiprocessing and we import it into our code just like any other package using:
import multiprocessing
Using multiprocessing isn’t as simple as switching from 32-bit to 64-bit as we did above. It does require some careful thought about which processes we can run in parallel and which need to run sequentially. There are also issues about file sharing and file locking, performance penalties where sometimes multiprocessing is slower due to the time taken to setup and remove the multiprocessing pool, and some tasks that do not support multiprocessing. We’ll cover all of these issues in the following sections and then we’ll convert our simple, sequential raster processing example into a multiprocessing one to demonstrate all of these concepts.
Distributed processing is a type of parallel processing that instead of (just) using each processor in a single machine will use all of the processors across multiple machines. Of course, this requires that you have multiple machines to run your code on but with the rise of cloud computing architectures from providers such as Amazon, Google, and Microsoft this is getting more widespread and more affordable. We won’t cover the specifics of how to implement distributed processing in this class but we have provided a few links if you want to explore the theory in more detail.
In a nutshell, what we are doing with distributed processing is taking our idea of multiprocessing on a single machine and, instead of using the 4 or however many processors we might have available, accessing a number of machines over the internet and utilizing the processors in all of them. Hadoop [28] is one method of achieving this, and others include Amazon's Elastic Map Reduce [29], MongoDB [30], and Cassandra [31]. GEOG 865 [32] has cloud computing as its main topic, so if you are interested in this, you may want to check it out.
With all of these approaches to speeding up our code, what are the elements which will cause bottlenecks and slow us down?
Well, there are a few. One is the time needed to set up each of the processes for multiprocessing. Remember, we mentioned earlier that because the processes don’t share memory, each needs its own copy of the data, which has to be copied to a memory location. Also, as each process runs its own Python.exe instance, that instance needs to be launched and arcpy needs to be imported for each one (although, fortunately, multiprocessing takes care of this for us). All of that takes time, so our code won’t appear to do much at first while it is doing this housekeeping – and if we’re not starting a lot of processes, we won’t see enough of a speed-up in processing to make up for those start-up costs.
The speed of our RAM can also slow us down. Access times for RAM used to be measured in nanoseconds but are now usually quoted as a frequency in megahertz (MHz). The method of calculating the speed isn’t especially important, but if you’re moving large files around in RAM, or performing calculations that require getting a number out of RAM, adding, subtracting, or multiplying it, and then putting the result into another location in RAM – and you’re doing that millions of times – very small delays will quickly add up to seconds or minutes. Another speed bump is running out of RAM. While we can allocate more than 4GB per process using 64-bit programming, if we don’t have enough RAM to complete all of the tasks that we might launch, then our operating system will start swapping between RAM (which is fast) and our hard disk (which isn’t – even if it’s one of the solid state types, SSDs).
Speaking of hard disks, it’s very likely that we’re loading and saving data to them, and as our disks are slower than our RAM and our processors, that is going to cause a delay. The less we need to load and save data, the better, so good multiprocessing practice is to keep as much data as possible in RAM (see the caveat above about running out of RAM). The speed of disks is governed by a few factors: the speed at which the motor spins (unless it is an SSD), the seek time, and the amount of cache that the disk has. Here is how these elements work together to speed up (or slow down) your code. The hard disk receives a request for data from the operating system and goes looking for it. The seek time is how long it takes the disk to position the read head over the segment of disk where the data is located, which is also a function of motor speed. Once the file is found, it is loaded into memory – the cache – and from there sent through to the process that needed the data. When data is written back to your disk, the reverse process takes place: the cache is filled (as memory is faster than disks) and then the cache is written to the disk. If the file is larger than the cache, the cache gets topped up as it starts to empty until the whole file is written. A slow-spinning hard disk motor or a small amount of cache can both slow down this process.
It’s also possible that we’re loading data from across a network connection (e.g. from a database or remotely stored files) and that will also be slow due to network latency – the time it takes to get to and from the other device on the network with the request and the result.
We can also be slowed down by inefficient code, for example, using too many loops or an inefficient if / else / elif statement that we evaluate too many times or using a mathematical function that is slower than its alternatives. We'll examine these sorts of coding bottlenecks - or at least how to identify them when we look at code profiling later in the lesson.
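As a tiny illustration of this kind of coding bottleneck (this snippet is ours, not part of the lesson materials), the standard library's timeit module can compare an explicit summing loop against the built-in sum() function; on most machines the built-in version is noticeably faster:

```python
# Compare two ways of doing the same work with timeit.
import timeit

loop_version = """
total = 0
for i in range(1000):
    total += i
"""

builtin_version = "total = sum(range(1000))"

# Each statement is executed 10,000 times and the total time reported.
loop_time = timeit.timeit(loop_version, number=10000)
builtin_time = timeit.timeit(builtin_version, number=10000)

print("explicit loop:  %.3f s" % loop_time)
print("built-in sum(): %.3f s" % builtin_time)
```

Code profiling, which we look at later in the lesson, automates exactly this kind of comparison across an entire program rather than one statement at a time.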
From the brief description in the previous section, you might have realized that there are generally two broad types of tasks – those that are input/output (I/O) heavy which require a lot of data to be read, written or otherwise moved around; and those that are CPU (or processor) heavy that require a lot of calculations to be done. Because getting data is the slowest part of our operation, I/O heavy tasks do not demonstrate the same improvement in performance from multiprocessing as CPU heavy tasks. The more work there is to do for the CPU the greater the benefit in splitting that workload among a range of processors so that they can share the load.
Outputting to the screen can also slow us down. In multiprocessing we tend to avoid print statements anyway, because printing from several processes at once can get messy: think about two print statements executing at exactly the same time – you’re likely to get the content of both intermingled, leading to a very difficult to understand message. But even in sequential code, updating the screen with print statements is a slow task.
Don’t believe me? Try this sample piece of code that sums the numbers from 0 to 99.
# Setup _very_ simple timing.
import time
start_time = time.time()

sum = 0
for i in range(0, 100):
    sum += i
    print(sum)

# Output how long the process took.
print("--- %s seconds ---" % (time.time() - start_time))
If I run it with the print function in the loop the code takes 0.049 seconds to run on my PC. If I comment that print function out, the code runs in 0.0009 seconds.
4278
4371
4465
4560
4656
4753
4851
4950
--- 0.04900026321411133 seconds ---

runfile('C:/Users/jao160/Documents/Teaching_PSU/Geog489_SU_21/Lesson 1/untitled1.py', wdir='C:/Users/jao160/Documents/Teaching_PSU/Geog489_SU_21/Lesson 1')
--- 0.0009996891021728516 seconds ---
In Penn State's GEOG 485 course, we simulated 10,000 runs of the children's game Cherry-O to determine the average number of turns it takes. If we printed out the results, the code took a minute or more to run. If we skipped all but the final print statement the code ran in less than a second. We’ll revisit that Cherry-O example as we experiment with moving code from the single processor paradigm to multiprocessor. We’ll start with it as a simple, non arcpy example and then move on to two arcpy examples – one raster (our raster calculation example from before) and one vector.
Since you most likely did not take GEOG 485, you may want to have a quick look at the description [33].
Following is the original Cherry-O code.
# Simulates 10K games of Hi Ho! Cherry-O
# Setup _very_ simple timing.
import time
start_time = time.time()

import random

spinnerChoices = [-1, -2, -3, -4, 2, 2, 10]
turns = 0
totalTurns = 0
cherriesOnTree = 10
games = 0

while games < 10000:
    # Take a turn as long as you have more than 0 cherries
    cherriesOnTree = 10
    turns = 0
    while cherriesOnTree > 0:
        # Spin the spinner
        spinIndex = random.randrange(0, 7)
        spinResult = spinnerChoices[spinIndex]
        # Print the spin result
        # print ("You spun " + str(spinResult) + ".")
        # Add or remove cherries based on the result
        cherriesOnTree += spinResult
        # Make sure the number of cherries is between 0 and 10
        if cherriesOnTree > 10:
            cherriesOnTree = 10
        elif cherriesOnTree < 0:
            cherriesOnTree = 0
        # Print the number of cherries on the tree
        # print ("You have " + str(cherriesOnTree) + " cherries on your tree.")
        turns += 1
    # Print the number of turns it took to win the game
    # print ("It took you " + str(turns) + " turns to win the game.")
    games += 1
    totalTurns += turns

print("totalTurns " + str(float(totalTurns) / games))
# lastline = raw_input(">")
# Output how long the process took.
print("--- %s seconds ---" % (time.time() - start_time))
We've added in our very simple timing from earlier and this example runs for me in about 1/3 of a second (without the intermediate print functions). That is reasonably fast and you might think we won't see a significant improvement from modifying the code to use multiprocessor mode but let's experiment.
The Cherry-O task is a good example of a CPU bound task; we’re limited only by the calculation speed of our random numbers, as there is no I/O being performed. It is also an embarrassingly parallel task as none of the 10,000 runs of the game are dependent on each other. All we need to know is the average number of turns; there is no need to share any other information. Our logic here could be to have a function (Cherry-O) which plays the game and returns to our calling function the number of turns. We can add that value returned to a variable in the calling function and when we’re done divide by the number of games (e.g. 10,000) and we’ll have our average.
So with that in mind, let us examine how we can convert a simple program like Cherry-O from sequential to multiprocessing.
There are a couple of basic steps we need to add to our code in order to support multiprocessing. The first is that our code needs to import the multiprocessing package which, as you will have guessed from the name, enables multiprocessing support. We’ll add that as the first line of our code.
The second thing our code needs to have is a __main__ method defined. We’ll add that into our code at the very bottom with:
if __name__ == '__main__':
    mp_handler()
With this, we make sure that the code in the body of the if-statement is only executed for the main process we start by running our script file in Python, not for the subprocesses we create when using multiprocessing, which also load this file. Otherwise, this would result in an infinite creation of subprocesses, subsubprocesses, and so on. Next, we need to define the mp_handler() function we are calling here. This is the function that will set up our pool of processors and also assign (map) each of our tasks onto a worker (usually a processor) in that pool.
Our mp_handler() function is very simple. It has two main lines of code based on the multiprocessing module:
The first instantiates a pool with a number of workers (usually our number of processors or a number slightly less than our number of processors). There’s a function to determine how many processors we have, multiprocessing.cpu_count(), so that our code can take full advantage of whichever machine it is running on. That first line is:
with multiprocessing.Pool(multiprocessing.cpu_count()) as myPool:
    ...  # code for setting up the pool of jobs
You have probably already seen this notation from working with arcpy cursors. This with ... as ... statement creates an object of the Pool class defined in the multiprocessing module and assigns it to variable myPool. The parameter given to it is the number of processors on my machine (which is the value that multiprocessing.cpu_count() is returning), so here we are making sure that all processor cores will be used. All code that uses the variable myPool (e.g., for setting up the pool of multiprocessing jobs) now needs to be indented relative to the "with" and the construct makes sure that everything is cleaned up afterwards. The same could be achieved with the following lines of code:
myPool = multiprocessing.Pool(multiprocessing.cpu_count())
...  # code for setting up the pool of jobs
myPool.close()
myPool.join()
Here the Pool variable is created without the with ... as ... statement. As a result, the statements in the last two lines are needed for telling Python that we are done adding jobs to the pool and for cleaning up all sub-processes when we are done to free up resources. We prefer to use the version using the with ... as ... construct in this course.
The next line that we need in our code after the with ... as ... line is for adding tasks (also called jobs) to that pool:
res = myPool.map(cherryO, range(10000))
What we have here is the name of another function, cherryO(), which is going to do the work of running a single game and returning the number of turns as the result. The second parameter given to map() contains the parameters that should be given to the calls of the cherryO() function as a simple list. This is how we pass the data to be processed to the worker function in a multiprocessing application. In this case, the worker function cherryO() does not really need any input data to work with. What we are providing is simply the number of the game this call of the function is for, so we use the range from 0-9,999 for this. That means we will have to introduce a parameter into the definition of the cherryO() function for playing a single game. While the function will not make any use of this parameter, the number of elements in the list (10,000 in this case) determines how many times cherryO() will be run in our multiprocessing pool and, hence, how many games will be played to determine the average number of turns. In the final version, we will replace the hard-coded number with a variable called numGames. Later in this part of the lesson, we will show you how you can use a different function called starmap(...) instead of map(...) that works for worker functions that take more than one argument, so that we can pass different parameters to them.
Python will now run the pool of calls of the cherryO() worker function by distributing them over the number of cores that we provided when creating the Pool object. The returned results, so the number of turns for each game played, will be collected in a single list and we store this list in variable res. We’ll average those turns per game to get an average using the Python library statistics and the function mean().
To prepare for the multiprocessing version, we’ll take our Cherry-O code from before and make a couple of small changes. We’ll define function cherryO() around this code (taking the game number as parameter as explained above) and we’ll remove the while loop that currently executes the code 10,000 times (our map range above will take care of that) and we’ll therefore need to “dedent” the code.
Here’s what our revised function will look like:
def cherryO(game):
    spinnerChoices = [-1, -2, -3, -4, 2, 2, 10]
    turns = 0
    cherriesOnTree = 10
    # Take a turn as long as you have more than 0 cherries
    while cherriesOnTree > 0:
        # Spin the spinner
        spinIndex = random.randrange(0, 7)
        spinResult = spinnerChoices[spinIndex]
        # Print the spin result
        #print ("You spun " + str(spinResult) + ".")
        # Add or remove cherries based on the result
        cherriesOnTree += spinResult
        # Make sure the number of cherries is between 0 and 10
        if cherriesOnTree > 10:
            cherriesOnTree = 10
        elif cherriesOnTree < 0:
            cherriesOnTree = 0
        # Print the number of cherries on the tree
        #print ("You have " + str(cherriesOnTree) + " cherries on your tree.")
        turns += 1
    # Return the number of turns it took to win the game
    return turns
Now let's put it all together. We’ve also made one other small change, defining a variable numGames = 10000 at the very top of the code to set the size of our range.
# Simulates 10K games of Hi Ho! Cherry-O
# Setup _very_ simple timing.
import time
start_time = time.time()

import multiprocessing
from statistics import mean
import random

numGames = 10000

def cherryO(game):
    spinnerChoices = [-1, -2, -3, -4, 2, 2, 10]
    turns = 0
    cherriesOnTree = 10
    # Take a turn as long as you have more than 0 cherries
    while cherriesOnTree > 0:
        # Spin the spinner
        spinIndex = random.randrange(0, 7)
        spinResult = spinnerChoices[spinIndex]
        # Print the spin result
        #print ("You spun " + str(spinResult) + ".")
        # Add or remove cherries based on the result
        cherriesOnTree += spinResult
        # Make sure the number of cherries is between 0 and 10
        if cherriesOnTree > 10:
            cherriesOnTree = 10
        elif cherriesOnTree < 0:
            cherriesOnTree = 0
        # Print the number of cherries on the tree
        #print ("You have " + str(cherriesOnTree) + " cherries on your tree.")
        turns += 1
    # Return the number of turns it took to win the game
    return turns

def mp_handler():
    with multiprocessing.Pool(multiprocessing.cpu_count()) as myPool:
        # The Map part of MapReduce is on the right of the = and the Reduce
        # part on the left where we are aggregating the results to a list.
        turns = myPool.map(cherryO, range(numGames))
    # Uncomment this line to print out the list of total turns (but note
    # this will slow down your code's execution)
    #print(turns)
    # Use the statistics library function mean() to calculate the mean of turns
    print(mean(turns))

if __name__ == '__main__':
    mp_handler()
    # Output how long the process took.
    print("--- %s seconds ---" % (time.time() - start_time))
You will also see that we have the list of results returned on the left side of the = before our map function. We’re taking all of the returned results and putting them into a list called turns (feel free to add a print or type statement here to check that it's a list). Once all of the workers have finished playing the games, we use the function mean() from the Python statistics library, which we imported at the very top of our code (right after multiprocessing), to calculate the mean of our list in variable turns. The call to mean() acts as our reduce step, as it takes our list and returns the single value that we're really interested in.
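To make the reduce idea concrete, here is a tiny standalone example (with made-up turn counts) showing that mean() collapses the list to one value, just as an explicit reduce would:

```python
from functools import reduce
from statistics import mean

# Pretend these are turn counts returned by myPool.map()
turns = [12, 15, 18, 14]

# mean() collapses the whole list into the single value we care about...
print(mean(turns))  # -> 14.75

# ...which is equivalent to an explicit reduce (a sum) followed by a division.
total = reduce(lambda a, b: a + b, turns)
print(total / len(turns))  # -> 14.75
```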
When you have finished writing the code in Spyder, you can run it. However, it is important to know that there are some well-documented problems with running multiprocessing code directly in Spyder. You may only experience these issues with the more complicated arcpy-based examples in Section 1.6.6, but we recommend that you run all multiprocessing examples from the command line rather than inside Spyder.
The Windows command line and its commands have already been explained in Section 1.6.2 but since this was an optional section, we are repeating the explanation here: Use the shortcut called "Python command prompt" that can be found within the ArcGIS program group on the start menu. This will open a command window running within the Pro conda environment indicating that this is Python 3 (py3). You actually may have several shortcuts with rather similar sounding names, e.g. if you have both ArcGIS Pro and ArcGIS Desktop installed, and it is important that you pick the right one from ArcGIS Pro that mentions Python 3. The prompt will tell you that you are in the folder C:\Users\<username>\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone\ or C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\ depending on your version of ArcGIS Pro.
We could dedicate an entire class to operating system commands that you can use in the command window but Microsoft has a good resource at this Windows Commands page [23] for those who are interested.
We just need a couple of the commands listed there:
We’ll change the directory to where we saved the code from above (e.g. mine is in c:\489\lesson1) with the following command:
cd c:\489\lesson1
Before you run the code for the first time, we suggest you change the number of games to a much smaller number (e.g. 5 or 10) just to check everything is working fine so you don’t spawn 10,000 Python instances that you need to kill off. In the event that something does go horribly wrong with your multiprocessing code, see the information about the Windows taskkill command below. To now run the Cherry-O script (which we saved under the name cherry-o.py) in the command window, we use the command:
python cherry-o.py
You should now get the output from the different print statements, in particular the average number of turns and the time it took to run the script. If everything went ok, set the number of games back to 10000 and run the script again.
It is useful to know that there is a Windows command that can kill off all of your Python processes quickly and easily. Imagine having to open Task Manager and manually kill them off, answer a prompt and then move to the next one! The easiest way to access the command is by pressing your Windows key, typing taskkill /im python.exe and hitting Enter which will kill off every task called python.exe. It’s important to only use this when absolutely necessary as it will usually also stop your IDE from running and any other Python processes that are legitimately running in the background. The full help for taskkill is at the Microsoft Windows IT Pro Center taskkill page [34].
Look closely at the images below, which show a four-processor PC running the sequential and multiprocessing versions of the Cherry-O code. In the sequential version, you’ll see that the CPU usage is relatively low (around 50%) and there are two instances of Python running (one for the code and (at least) one for Spyder).
In the multiprocessing version, the code was run from the command line instead (which is why it’s sitting within a Windows Command Processor task) and you can see the CPU usage is pegged at 100% as all of the processors are working as hard as they can and there are five instances of Python running.
This might seem odd as there are only four processors, so what is that extra instance doing? Four of the Python instances, the ones all working hard, are the workers; the fifth one that isn’t working hard is the master process which launched the workers – it is waiting for the results to come back from the workers. There isn’t another Python instance for Spyder because I ran the code from the command prompt – therefore Spyder wasn’t running. We'll cover running code from the command prompt in the Profiling [35] section.
On this four-processor PC, this code runs in about 1 second and returns an answer between 15 and 16. That is about three times slower than the sequential version, which ran in 1/3 of a second. If I instead play 1M games rather than 10K, the parallel version takes 20 seconds on average while the sequential version takes on average 52 seconds. If I run the game 100M times, the parallel version takes around 1,600 seconds (about 27 minutes) while the sequential version takes 2,646 seconds (44 minutes). The more games I play, the better the relative performance of the parallel version. Those results aren’t as fast as you might expect with 4 processors, but halving the run time is still a real gain. When we look at profiling our code a bit later in this lesson, we’ll examine why this code isn’t running 4x faster.
When moving the code to a much more powerful PC with 32 processors, there is a much more significant performance improvement. The parallel version plays 100M games in 273 seconds (< 5 minutes) while the sequential version takes 3136 seconds (52 minutes) which is about 11 times slower. Below you can see what the task manager looks like for the 32 core PC in sequential and multiprocessing mode. In sequential mode, only one of the processors is working hard – in the middle of the third row – while the others are either idle or doing the occasional, unrelated background task. It is a different story for the multiprocessor mode where the cores are all running at 100%. The spike you can see from 0 is when the code was started.
Let's examine some of the reasons for these speed differences. The 4-processor PC’s CPU runs at 3GHz while the 32-processor PC runs at 2.4GHz; the extra cycles that the 4-processor CPU can perform per second make it a little quicker at math. The reason the multiprocessing code runs much faster on the 32-processor PC than on the 4-processor PC is straightforward enough – there are 8 times as many processors (although it isn’t 8 times faster, it is reasonably close at about 5.9x, i.e., 1,600 seconds / 273 seconds). So while each individual processor is a little slower on the larger PC, because there are so many more of them, it nearly catches up to the full 8x (but not quite, due to each processor being a little slower).
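The speedup figures quoted above can be double-checked with a few lines of arithmetic based on the timings reported in the text:

```python
# Speedup ratios computed from the timings quoted in the text (seconds,
# 100M games in each case).
seq_32core = 3136   # sequential on the 32-processor PC
par_32core = 273    # multiprocessing on the 32-processor PC
seq_4core = 2646    # sequential on the 4-processor PC
par_4core = 1600    # multiprocessing on the 4-processor PC

print(round(seq_32core / par_32core, 1))  # ~11.5x speedup on the 32-core PC
print(round(seq_4core / par_4core, 1))    # ~1.7x speedup on the 4-core PC
print(round(par_4core / par_32core, 1))   # ~5.9x: 32-core parallel vs 4-core parallel
```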
Memory quantity isn’t really an issue here as the numbers being calculated are very small but if we were doing bigger operations, the 4-processor PC with just 8GB of RAM would be slower than the 32-processor PC with 128GB. The memory in the 32-processor PC is also faster at 2.13 GHz versus 1.6GHz in the 4-processor PC.
So the takeaway message here is if you have a lot of tasks that are largely the same but independent of each other, you can save a significant amount of time utilizing all of the resources within your PC with the help of multiprocessing. The more powerful the PC, the more time that can potentially be saved. However, the caveat is that as already noted multiprocessing is generally only faster for CPU-bound processes, not I/O-bound ones.
Now that we have completed a non-ArcGIS parallel processing exercise, let's look at a couple of examples using ArcGIS functions. There are a number of caveats or gotchas to using multiprocessing with ArcGIS and it is important to cover them up-front because they affect the ways in which we can write our code.
Esri describe a number of best practices for multiprocessing with arcpy. These include:
So, bearing the top two points in mind, we should make use of in_memory workspaces wherever possible, and we should avoid writing to file geodatabases (FGDBs) in our worker functions at least – though we could use FGDBs in our master function to merge a number of shapefiles, or even individual FGDBs, back into a single source.
There are two types of operations with rasters that can easily (and productively) be implemented in parallel: operations that are independent components in a workflow, and raster operations which are local, focal or zonal – that is they work on a small portion of a raster such as a pixel or a group of pixels.
Esri’s Clinton Dow and Neeraj Rajasekar presented on multiprocessing with arcpy back at the 2017 Esri User Conference, and their slides contained a number of useful graphics demonstrating these two categories of raster operations, which we have reproduced here as they are still appropriate and relevant.
An example of an independent workflow would be calculating the slope, aspect, and some other operations on a raster and then producing a weighted sum or other statistic. Each of the operations is independently performed on our raster up until the final operation, which relies on each of them (see the first image below). The independent operations can therefore be parallelized and sent to a worker, while the final task (which could also be done by a worker) aggregates or summarizes the results. This is what we can see in the second image: each of the tasks is assigned to a worker (even though two of the workers are using a common dataset) and then Worker 4 completes the task. You can probably imagine a more complex version of this task where it is scaled up to process many elevation and land-use rasters, performing many slope, aspect, and reclassification calculations with the results being combined at the end.
An example of the second type of raster operation is a case where we want to make a mathematical calculation on every pixel in a raster such as squaring or taking the square root. Each pixel in a raster is independent of its neighbors in this operation so we could have multiple workers processing multiple tiles in the raster and the result is written to a new raster. In this example, instead of having a single core serially performing a square root calculation across a raster (the first image below) we can segment our raster into a number of tiles, assign each tile to a worker and then perform the square root operation for each pixel in the tile outputting the result to a single raster which is shown in the second image below.
Let’s return to the raster coding example that we used to build our ArcGIS Pro tool earlier in the lesson. That simple example processed a list of rasters and completed a number of tasks on each raster. Based on what you have read so far, you have probably realized that this is also a pleasingly parallel problem.
Bearing in mind the caveats about parallel programming from above and the process that we undertook to convert the Cherry-O program, let's begin.
Our first task is to identify the parts of our problem that can work in parallel and the parts which we need to run sequentially.
The best place to start is with the pseudocode of the original task. If we have documented our sequential code well, this could be as simple as copying/pasting each line of documentation into a new file and working through the process. We can start with the text description of the problem, build our sequential pseudocode from there, and then create the multiprocessing pseudocode. It is very important to design our multiprocessing solutions correctly and carefully to ensure that they are as efficient as possible, that the worker functions receive the bare minimum of data they need to complete their tasks, that they use in_memory workspaces, and that they write as little data back to disk as possible.
Our original task was:
Get a list of raster tiles
For every tile in the list:
    Fill the DEM
    Create a slope raster
    Calculate a flow direction raster
    Calculate a flow accumulation raster
    Convert those stream rasters to polygon or polyline feature classes.
You will notice that I’ve formatted the pseudocode just like Python code with indentations showing which instructions are within the loop.
As this is a simple example, we can place all of the functionality within the loop into our worker function, as it will be called for every raster. The list of rasters will need to be determined sequentially, and we’ll then pass it to our multiprocessing function and let the map element of multiprocessing assign each raster to a worker to perform the tasks. We won’t explicitly be using the reduce part of multiprocessing here, as the output will be a feature class, but reduce will probably tidy up after us by deleting temporary files that we don’t need.
Our new pseudocode then will look like:
Get a list of raster tiles
For every tile in the list:
    Launch a worker function with the name of a raster

Worker:
    Fill the DEM
    Create a slope raster
    Calculate a flow direction raster
    Calculate a flow accumulation raster
    Convert those stream rasters to polygon or polyline feature classes.
Bear in mind that not all multiprocessing conversions are this simple. We need to remember that user output can be complicated because multiple workers might be attempting to write messages to our screen at once, and that can cause those messages to get garbled and confused. A workaround for this problem is to use Python’s logging library, which is much better at handling messages than manually using print statements. We haven't implemented logging in this sample solution, but feel free to briefly investigate it to supplement the print and arcpy.AddMessage functions with calls to the logging functions. The Python Logging Cookbook [36] has some helpful examples.
As an exercise, attempt to implement the conversion from sequential to multiprocessing. You will probably not get everything right since there are a few details that need to be taken into account such as setting up an individual scratch workspace for each call of the worker function. In addition, to be able to run as a script tool the script needs to be separated into two files with the worker function in its own file. But don't worry about these things, just try to set up the overall structure in the same way as in the Cherry-O multiprocessing version and then place the code from the sequential version of the raster example either in the main function or worker function depending on where you think it needs to go. Then check out the solution linked below.
Click here for one way of implementing the solution [37]
When you run this code, do you notice any performance differences between the sequential and multiprocessor versions?
The sequential version took 96 seconds on the same 4-processor PC we were using in the Cherry-O example, while the multiprocessor version completed in 58 seconds. Again not 4 times faster as we might expect but nearly twice as fast with multiprocessing is a good improvement. For reference, the 32-processor PC from the Cherry-O example processed the sequential code in 110 seconds and the multiprocessing version in 40 seconds. We will look in more detail at the individual lines of code and their performance when we examine code profiling but you might also find it useful to watch the CPU usage tab in Task Manager to see how hard (or not) your PC is working.
The best practices of multiprocessing that we introduced earlier are even more important when we are working with vector data than with raster data. The geodatabase locking issue is likely to become much more of a factor, as we typically use more vector data than raster data, and vector feature classes are more often stored in geodatabases.
The example we’re going to use here involves clipping a feature layer by polygons in another feature layer. A sample use case might be needing to segment one or several infrastructure layers by state or county (or an even smaller subdivision). If you want to provide each state or county with its own version of the roads, sewer, water, or electricity layers (for example), a script like this would be helpful. To test out the code in this section (and also the first homework assignment), you can again use the data from the USA.gdb geodatabase (Section 1.5) we provided. The task then is to clip the data from the roads, cities, or hydrology data sets to the individual state polygons from the States data set in the geodatabase.
To achieve this task, one could run the Clip tool manually in ArcGIS Pro, but if there are a lot of polygons in the clip data set, it will be more efficient to write a script that performs the task. As each state/county is unrelated to the others, this is an example of an operation that can be run in parallel.
The code example below has been adapted from a code example written by Duncan Hornby at the University of Southampton in the United Kingdom that has been used to demonstrate multiprocessing and also how to create a script tool that supports multiprocessing. We will take advantage of Mr. Hornby’s efforts and make use of his code (with attribution of course) but we have also reorganized and simplified it quite a bit and added some enhancements.
Let us examine the code’s logic and then we’ll dig into the syntax.
The code consists of two Python files [38]. This is important because, to run it as a script tool in ArcGIS, the worker function for running the individual tasks must be defined in its own module file, not in the main script file that contains the multiprocessing code calling the worker function. The first file, called scripttool.py, imports arcpy, multiprocessing, and the worker code contained in the second Python file, called multicode.py. It contains the definition of the main function mp_handler(), which is responsible for managing the multiprocessing operations, similar to the Cherry-O multiprocessing version. It uses two script tool parameters: the file containing the polygons to use for clipping (variable clipper) and the file to be clipped (variable tobeclipped). Furthermore, the file includes the definition of an auxiliary function get_install_path(), which is needed to determine the location of the Python interpreter for running the subprocesses when the code is run as a script tool in ArcGIS; you don't have to worry about the content of this function. The main function mp_handler() calls the worker(...) function located in the multicode file, passing it the files to be used and other information needed to perform the clipping operation. This will be further explained below. The code for the first file, including the main function, is shown below.
import os, sys
import arcpy
import multiprocessing
from multicode import worker

# Input parameters
clipper = r"C:\489\USA.gdb\States"
#clipper = arcpy.GetParameterAsText(0)
tobeclipped = r"C:\489\USA.gdb\Roads"
#tobeclipped = arcpy.GetParameterAsText(1)

def get_install_path():
    ''' Return 64bit python install path from registry (if installed and registered),
        otherwise fall back to current 32bit process install path.
    '''
    if sys.maxsize > 2**32: return sys.exec_prefix #We're running in a 64bit process
    #We're 32 bit so see if there's a 64bit install
    path = r'SOFTWARE\Python\PythonCore\2.7'
    from _winreg import OpenKey, QueryValue
    from _winreg import HKEY_LOCAL_MACHINE, KEY_READ, KEY_WOW64_64KEY
    try:
        with OpenKey(HKEY_LOCAL_MACHINE, path, 0, KEY_READ | KEY_WOW64_64KEY) as key:
            return QueryValue(key, "InstallPath").strip(os.sep) #We have a 64bit install, so return that.
    except:
        return sys.exec_prefix #No 64bit, so return 32bit path

def mp_handler():
    try:
        # Create a list of object IDs for clipper polygons
        arcpy.AddMessage("Creating Polygon OID list...")
        print("Creating Polygon OID list...")
        clipperDescObj = arcpy.Describe(clipper)
        field = clipperDescObj.OIDFieldName
        idList = []
        with arcpy.da.SearchCursor(clipper, [field]) as cursor:
            for row in cursor:
                id = row[0]
                idList.append(id)
        arcpy.AddMessage("There are " + str(len(idList)) + " object IDs (polygons) to process.")
        print("There are " + str(len(idList)) + " object IDs (polygons) to process.")

        # Create a task list with parameter tuples for each call of the worker function.
        # Tuples consist of the clipper, tobeclipped, field, and oid values.
        jobs = []
        for id in idList:
            jobs.append((clipper,tobeclipped,field,id)) # adds tuples of the parameters that need to be given to the worker function to the jobs list
        arcpy.AddMessage("Job list has " + str(len(jobs)) + " elements.")
        print("Job list has " + str(len(jobs)) + " elements.")

        # Create and run multiprocessing pool.
        multiprocessing.set_executable(os.path.join(get_install_path(), 'pythonw.exe')) # make sure Python environment is used for running processes, even when this is run as a script tool
        arcpy.AddMessage("Sending to pool")
        print("Sending to pool")
        cpuNum = multiprocessing.cpu_count() # determine number of cores to use
        print("there are: " + str(cpuNum) + " cpu cores on this machine")
        with multiprocessing.Pool(processes=cpuNum) as pool: # Create the pool object
            res = pool.starmap(worker, jobs) # run jobs in job list; res is a list with return values of the worker function

        # If an error has occurred report it
        failed = res.count(False) # count how many times False appears in the list with the return values
        if failed > 0:
            arcpy.AddError("{} workers failed!".format(failed))
            print("{} workers failed!".format(failed))
        arcpy.AddMessage("Finished multiprocessing!")
        print("Finished multiprocessing!")
    except arcpy.ExecuteError:
        # Geoprocessor threw an error
        arcpy.AddError(arcpy.GetMessages(2))
        print("Execute Error:", arcpy.ExecuteError)
    except Exception as e:
        # Capture all other errors
        arcpy.AddError(str(e))
        print("Exception:", e)

if __name__ == '__main__':
    mp_handler()
Let's now have a close look at the logic of the two main functions which will do the work. The first one is the mp_handler() function shown in the code section above. It takes the input variables and has the job of processing the polygons in the clipping file to get a list of their unique IDs, building a job list of parameter tuples that will be given to the individual calls of the worker function, setting up the multiprocessing pool and running it, and taking care of error handling.
The second function is the worker function called by the pool (named worker in this example) located in the multicode.py file (code shown below). This function takes the name of the clipping feature layer, the name of the layer to be clipped, the name of the field that contains the unique IDs of the polygons in the clipping feature layer, and the feature ID identifying the particular polygon to use for the clipping as parameters. This function will be called from the pool constructed in mp_handler().
The worker function will then make a selection from the clipping layer. This has to happen in the worker function because all parameters given to that function in a multiprocessing scenario need to be of a simple type that can be "pickled." Pickling data [39] means converting it to a byte-stream which in the simplest terms means that the data is converted to a sequence of simple Python types (string, number etc.). As feature classes are much more complicated than that containing spatial and non-spatial data, they cannot be readily converted to a simple type. That means feature classes cannot be "pickled" and any selections that might have been made in the calling function are not shared with the worker functions. Therefore, we need to think about creative ways of getting our data shared with our sub-processes. In this case, that means we’re not going to do the selection in the master module and pass the polygon to the worker module. Instead, we’re going to create a list of feature IDs that we want to process and we’ll pass an ID from that list as a parameter with each call of the worker function that can then do the selection with that ID on its own before performing the clipping operation. For this, the worker function selects the polygon matching the OID field parameter when creating a layer with MakeFeatureLayer_management() and uses this selection to clip the feature layer to be clipped. The results are saved in a shapefile including the OID in the file's name to ensure that the names are unique.
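The pickling restriction is easy to demonstrate with plain Python: simple parameter tuples like the ones in our jobs list pickle without problems, while complex objects do not (an open file handle stands in here for an arcpy feature layer):

```python
import pickle

# Simple parameter tuples, like the ones in the jobs list, pickle fine.
job = (r"C:\489\USA.gdb\States", r"C:\489\USA.gdb\Roads", "OBJECTID", 7)
restored = pickle.loads(pickle.dumps(job))
print(restored == job)  # True

# Complex objects cannot be pickled; trying raises a TypeError.
try:
    with open(__file__) as f:
        pickle.dumps(f)
except TypeError as error:
    print("cannot pickle:", error)
```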
import os, sys
import arcpy

def worker(clipper, tobeclipped, field, oid):
    """
       This is the function that gets called and does the work of clipping the input feature class
       to one of the polygons from the clipper feature class. Note that this function does not try
       to write to arcpy.AddMessage() as nothing is ever displayed. If the clip succeeds then it
       returns TRUE else FALSE.
    """
    try:
        # Create a layer with only the polygon with ID oid. Each clipper layer needs a unique name,
        # so we include oid in the layer name.
        query = '"' + field + '" = ' + str(oid)
        arcpy.MakeFeatureLayer_management(clipper, "clipper_" + str(oid), query)

        # Do the clip. We include the oid in the name of the output feature class.
        outFC = r"c:\489\output\clip_" + str(oid) + ".shp"
        arcpy.Clip_analysis(tobeclipped, "clipper_" + str(oid), outFC)

        print("finished clipping:", str(oid))
        return True # everything went well so we return True
    except:
        # Some error occurred so return False
        print("error condition")
        return False
Having covered the logic of the code, let's review the specific syntax used to make it all work. While you’re reading this, try visualizing how this code might run sequentially first – that is one polygon being used to clip the to-be-clipped feature class, then another polygon being used to clip the to-be-clipped feature class and so on (maybe through 4 or 5 iterations). Then once you have an understanding of how the code is running sequentially try to visualize how it might run in parallel with the worker function being called 4 times simultaneously and each worker performing its task independently of the other workers.
We’ll start with exploring the syntax within the mp_handler(...) function.
The mp_handler(...) function begins by determining the name of the field that contains the unique IDs of the clipper feature class using the arcpy.Describe(...) function (lines 36 and 37). The code then uses a Search Cursor to get a list of all of the object (feature) IDs from within the clipper polygon feature class (lines 39 to 43). This gives us a list of IDs that we can pass to our worker function along with the other parameters. As a check, the length of that list is printed out (lines 45 and 46).
Next, we create the job list with one entry for each call of the worker() function we want to make (lines 50 to 53). Each element in this list is a tuple of the parameters that should be given to that particular call of worker(). This list will be required when we set up the pool by calling pool.starmap(...). To construct the list, we simply loop through the ID list and append a parameter tuple to the list in variable jobs. The first three parameters will always be the same for all tuples in the job list; only the polygon ID will be different. In the homework assignment for this lesson, you will adapt this code to work with multiple input files to be clipped. As a result, the parameter tuples will vary in both the values for the oid parameter and for the tobeclipped parameter.
To prepare the multiprocessing pool, we first specify what executable should be used each time a worker is spawned (line 60). Without this line, a new instance of ArcGIS Pro would be launched by each worker, which is clearly less than ideal. Instead, this line calls the get_install_path() function defined in lines 12-27 to determine the path to the pythonw.exe executable.
The code then sets up the size of the pool using the maximum number of processors in lines 65-68 (as we have done in previous examples) and then, using the starmap() method of Pool, calls the worker function worker(...) once for each parameter tuple in the jobs list (line 69).
Any outputs from the worker function will be stored in variable res. These are the boolean values returned by the worker() function, True to indicate that everything went ok and False to indicate that the operation failed. If there is at least one False value in the list, an error message is produced stating the exact number of worker processes that failed (lines 73 to 76).
Let's now look at the code in our worker function worker(...). As we noted in the logic section above, it receives four parameters: the full paths of the clipping and to-be-clipped feature classes, the name of the field that contains the unique IDs in the clipper feature class, and the OID of the polygon it is to use for the clipping.
Notice that the MakeFeatureLayer_management(...) function in line 12 is used to create an in_memory layer which is a copy of the original clipper layer. This use of the in_memory layer is important in three ways: first, performance – in_memory layers are faster; second, using an in_memory layer helps prevent any chance of file locking (although not if we were writing back to the file); and third, selections only work on layers, so even if we wanted to, we couldn’t get away without creating this layer.
The call of MakeFeatureLayer_management(...) also includes an SQL query string defined one line earlier in line 11 to create the layer with just the polygon that matches the oid that was passed as a parameter. The name of the layer we are producing here should be unique; this is why we’re adding str(oid) to the name in the first parameter.
Now with our selection held in our in_memory, uniquely named feature layer, we perform the clip against our to-be-clipped layer (line 16) and store the result in outFC which we define in line 15 to be a hardcoded folder with a unique name starting with "clip_" followed by the oid. To run the code, you will most likely have to adapt the path used in variable outFC.
The process then returns from the worker function and will be supplied with another oid. This will repeat until a call has been made for each polygon in the clipping feature class.
We are going to use this code as the basis for our Lesson 1 homework project. Have a look at the Assignment Page [40] for full details.
You can test this code out by running it in a number of ways. If you run it from ArcGIS Pro as a script tool, you will have to move the comment marks (#) on the clipper and tobeclipped input variable lines so that GetParameterAsText() is called instead of the hardcoded paths and file names being used. Be sure to set the parameter type for both parameters to Feature Class. If you make changes to the code and they are not reflected in Pro, delete your script tool from the toolbox, restart Pro, and re-add the script tool.
You can run your code from Spyder, but only if you hardcode your parameters or supply them from Spyder (Run > Run configuration per file, tick the Command line options checkbox, and then enter your file names separated by a space in the text box alongside). Also make sure you're running scripttool.py in Spyder (not multicode.py). You can also run your code from the Command Prompt, which is the fastest way with the smallest resource overhead.
The final thing to remember about this code is that it has a hardcoded output path defined in variable outFC in the worker() function, which you will want to change, create, and/or parameterize so that you have some output to investigate. If you do none of these things, no output will be created.
When the code runs it will create a shapefile for every unique object identifier in the "clipper" shapefile (there are 51 in the States data set from the sample data) named using the OID (that is clip_1.shp - clip_59.shp).
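As a quick, optional sanity check, you could count the generated shapefiles with a few lines of Python (the folder path here is the hardcoded one from the worker function; adjust it to whatever output folder you used):

```python
import glob
import os

# Hypothetical output folder from the worker function; change to your own path.
output_folder = r"c:\489\output"

# Collect all shapefiles whose names follow the clip_<oid>.shp pattern.
shapefiles = glob.glob(os.path.join(output_folder, "clip_*.shp"))
print("Found", len(shapefiles), "clip shapefiles")
```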
Debugging and profiling are important skills for any serious programmer. Debugging lets you step through your code, analyze the contents of variables (watches), and set breakpoints to check code progress. Profiling runs code to provide an in-depth breakdown of the execution times of individual lines or blocks of code so that you can identify performance bottlenecks (e.g., slow I/O, inefficient loops, etc.).
In this section, we will first examine debugging techniques and processes before investigating code profiling (which is required for the Lesson 1 Homework Assignment [40]).
As you may remember from GEOG 485, the simplest method of debugging is to embed print statements in your code to either determine how far your code is running through a loop or to print out the contents of a variable.
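A simple print-based debugging session might look like this (a toy function, not course code): the print statement inside the loop shows both how far the loop has progressed and how a variable evolves.

```python
def count_evens(numbers):
    count = 0
    for n in numbers:
        # Debug output: shows loop progress and the running count on each pass
        print("checking:", n, "count so far:", count)
        if n % 2 == 0:
            count += 1
    return count

print(count_evens([3, 4, 7, 10]))  # prints 2 at the end
```

Once the bug is found, remember to remove or comment out such print statements so they don't slow down (or clutter) the final script.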
A more sophisticated and detailed method involves using the tools of your IDE to create watches for checking the contents of variables and breakpoints for stepping through your code. While we will introduce a range of IDEs at the end of the lesson, and you will be investigating IDEs further and examining how their debugging tools work, we provide a generic overview here of setting watches, setting breakpoints, and stepping through code. Don’t focus on the specifics of the interface as we do this; it is more important to understand the purpose of each of the different debugging methods. While we could examine debugging using IDLE (or PythonWin as we did in GEOG 485), we’ve chosen to use a more representative IDE, Spyder, as it more accurately demonstrates the features available in the majority of IDEs, and you should already have it installed from earlier.
We will start off by looking at Spyder’s debugging functions, which are similar to those of PythonWin from GEOG 485 and all other IDEs. You might remember back in GEOG 485 when we looked at debugging [41] we examined setting up breakpoints, watches, and stepping through code. We’re going to revisit those concepts here briefly using Spyder; the same functionality will be available in all of the IDEs you’ll be investigating later, but you might have to dig a little in the menus for it. There are more details in the Spyder help here [42].
For our debugging and profiling with Spyder, we’re going to use our raster multiprocessing example (the one that involved delineating streams from lidar data) from earlier. After starting Spyder and opening that file, click on the Debug menu and you will see a number of options.
Set the cursor to the line of code which filters our list of rasters:
new_rasters = filter_list(rasters,wildCardList)
and set a breakpoint by choosing Set/Clear breakpoint from the Debug menu or pressing F12 or double-clicking alongside the line number of the code where you want a breakpoint (removing breakpoints uses the same methods as adding them).
Now run your code using the Debug item from the Debug menu, by pressing CTRL+F5, or by using the Debug toolbar button. Your code will now start running and then pause prior to running the line of code at your breakpoint.
One of the nice touches to Spyder is that it has a variable explorer pane, accessible from the View > Panes menu. This will show you a list of your variables, their type, size, and contents once they’re declared, and you can watch them change.
If you bring this pane up you will see two variables already in it, start_time and rasters, our list which has a size of 43 – that is 43 elements - so the code found 43 rasters in my case.
We can step to the next line of code using Debug > Step, CTRL+F10, or the Step toolbar button. Notice that when you do this, an extra variable (new_rasters) will appear in your list, and your next executable line of code will be highlighted in a pale grey band.
We can step through our code line by line to monitor its state as well as the contents of our variables to ensure that our code is doing what we expect it to. While we are doing this, we are looking for unexpected results in variables, loops that are not running correctly (either doing too many or too few iterations or not at all) and if/elif/else statements which are not being correctly evaluated.
While we can step through lines individually, most IDEs offer two more useful options: "run current function until it returns" and "run to the next breakpoint." Spyder implements both with toolbar buttons, and I suggest that you experiment with both to see how your functions are executed – particularly with the multiprocessing code. If you want to run between multiple breakpoints, you will need to add some more breakpoints to your code; otherwise, the code will run to the end.
Lastly, if you don’t want to step through all of the lines of code in a function, you can use the Step Return button, which will return you to the calling function.
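Outside an IDE, Python's built-in pdb debugger offers the same pause-and-inspect workflow from the command line. This small, hypothetical example shows where you would set the trace; with the line uncommented, execution pauses there and you can print variables, step with `n`, and continue with `c`:

```python
import pdb

def buggy_average(values):
    total = sum(values)
    # pdb.set_trace()  # uncomment to pause here and inspect total interactively
    return total / len(values)

print(buggy_average([2, 4, 6]))  # 4.0
```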
We have experimented with some very simple code profiling in previous examples by introducing the time() function into our code, using it to record the start and end times and check the overall performance.
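As a reminder, that simple timing pattern looks like this (the computation here is just a stand-in for real geoprocessing work):

```python
import time

start_time = time.time()  # record the wall-clock start time

# Stand-in for the work we actually want to measure
total = sum(i * i for i in range(1_000_000))

elapsed = time.time() - start_time  # overall runtime in seconds
print("Elapsed time: {:.3f} seconds".format(elapsed))
```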
While that gives us a high-level view of the performance of our code, we do not know where specific bottlenecks might exist within that code. For example, what is the slowest part of our algorithm – the file reading, the writing, or the calculations in between? When we know where these bottlenecks are, we can investigate ways of removing them, or at least use faster techniques to speed up our code.
In this section, we will focus on basic profiling that looks at how long each function call takes – not each individual line of code. This basic type of code profiling is built into Spyder (and most IDEs). However, we also provide some complementary materials that explain how to visualize profiling results and how to profile each line of code individually. These parts are entirely optional because they are quite a bit more complex and require the installation of additional software and packages. It is possible that you will run into some technical/installation issues, so our recommendation is that you only come back and try the steps described in these optional subsections yourself once you are done with the lesson and homework assignment and still have time left.
We will again use the Spyder IDE and our basic raster code from earlier in the lesson. Spyder has a Profiler pane as well, accessible from the View -> Panes menu. You may need to manually load your code into the Profiler using the folder icon. The Spyder help for the Profiler is here if you'd like to read it (Profiler — Spyder 5 documentation (spyder-ide.org) [43]), but we explain the important parts below.
Once you load your code, the Profiler automatically starts profiling it and displays the message “Profiling, please wait...“ at the top of the Profiler window. You will need to be a little patient while Spyder runs through all of your code to perform the timing (or, alternatively, reduce the sample size in these raster examples). You probably remember that we recommended running multiprocessing code from the command line rather than inside Spyder. However, using the built-in profiler to load and run multiprocessing code works as long as the worker function is defined in its own module, as we did for the vector clipping example so that it could be used as a script tool. If this is not the case, you will receive an error that certain elements in the code cannot be “pickled,“ which, as you might remember from our multiprocessing discussion, means those objects cannot be converted to a basic type. We didn't split the multiprocessing version of the raster example into two separate modules, so here we will only profile the non-multiprocessing version. We won't have this issue when we use other profilers in the following optional sections, and in the homework assignment you will work with the vector clipping multiprocessing example, which has been set up in a way that allows for profiling with the Spyder profiler.
Once the Profiler has completed you will see a set of results like the ones below.
Looking over those results, you will see a list of functions together with the time each has taken. The important column to examine is the Local Time column, which shows how long each function took to execute in µs (microseconds), ms (milliseconds), seconds, minutes, etc. The Total Time column shows the cumulative time for each of those processes that was run (e.g., if your code was running in a function). You can sort by any of the columns, but arranging the Total Time column in ascending order gives you a logical starting point, as the times will run from shortest to longest. There is no way to order the results by the order in which your code ran, so you will see (depending on your code arrangement) overwriteOutput followed by the time function, then filterList, etc.
The next column to look at is the Calls column which has the count of how many times each of those functions was launched. So a high value in Local Time might be indicative of either a large number of calls to a very fast function or a small number of calls to a very slow function.
In my timings, there aren’t any obvious places to look for performance improvements, although the code could be fractionally faster (but a less good team player) if I didn’t check the extension back in; my .replace() calls and print statements also add a small amount of time to the execution.
What we are doing with this type of profiling is examining the sum of the functions and methods which were called during the code execution and how long they took, in total, for the number of times that they were called. It is possible to identify inefficiencies with this sort of profiling, particularly in combination with debugging. Am I calling a slow function too often? Can I find a faster function to do the same job (for example some mathematical functions are significantly faster than others and achieve the same result and often there exist approximations that give almost the exact result but are much faster to compute)?
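For quick A/B comparisons of candidate functions, the standard library's timeit module is handy. This sketch (with made-up example data) compares two ways of computing the same distance values; the timings themselves will vary by machine, so no winner is asserted here:

```python
import timeit

# Shared setup: a small list of coordinate pairs to process
setup = "import math; data = [(3.0, 4.0)] * 1000"

# Candidate 1: the library function math.hypot
t1 = timeit.timeit("[math.hypot(x, y) for x, y in data]", setup=setup, number=100)

# Candidate 2: a manual square-root computation of the same quantity
t2 = timeit.timeit("[(x * x + y * y) ** 0.5 for x, y in data]", setup=setup, number=100)

print("hypot: {:.4f}s, manual: {:.4f}s".format(t1, t2))
```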
It is worth pointing out here that the results from Spyder’s Profiler are actually the output from the cProfile [44] package of the Python standard library which is essentially wrapped around our script to calculate the statistics we are seeing above. You could import this package into your own code and use it there directly but we will focus on using its functionality from the IDE which is usually more convenient and the results are presented in a more readily understood format.
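If you do want to call cProfile directly from your own code, a minimal sketch (with a hypothetical function to profile) looks like this:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately naive loop so the profiler has something to measure
    total = 0
    for i in range(n):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()        # start collecting timing data
slow_sum(100_000)
profiler.disable()       # stop collecting

# Print the five most expensive entries, sorted by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
print(stream.getvalue())
```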
You might be thinking that these results aren’t really that readily understood and that it would be easier if there were a graphical visualization of the timings. Luckily there is, and if you want to learn more about it, the following optional sections on code profiling with visualization are a good starting point. In addition, we include another optional section that explains how you can do more detailed profiling, looking at each individual line rather than complete functions. However, we recommend that you skip or only skim through these optional sections on your first pass through the lesson materials and come back when you have the time.
As we said at the beginning of this section, this section is provided for interest only. Please feel free to skip over it and you can loop back to it at the end of the lesson if you have free time or after the end of the class. Be warned: installing the required extensions and using them is a little complicated - but this is an advanced class, so don't let this deter you.
We are going to need to download some software called QCacheGrind, which reads a tree-type file (like a family tree). Unfortunately, QCacheGrind doesn’t natively support the profile files we are going to be creating, so we will also need a converter (pyprof2calltree [45]), written in Python. Our workflow is going to be:
Download QCacheGrind [2] and unzip it to a folder. QCacheGrind can be run by double-clicking on the qcachegrind executable in the folder you’ve just unzipped it to. Don’t do that just yet though; we’ll come back to it once we’ve done the other steps in our workflow and have some profile files to visualize.
Now we’re going to install our converter using the Python Preferred Installer Program, pip. If you would like to learn more about pip, the Python 3 Help (Key Terms) [46] has a full explanation. You will also learn more about Python package managers in the next lesson. Pip is included by default with the Python installation you have, but we have to access it from the Python Command Prompt.
As we mentioned in Section 1.6.5.2, there should be a shortcut within the ArcGIS program group on the start menu called "Python Command Prompt" on your PC that opens a command window running within the conda environment indicating that this is Python 3 (py3). You actually may have several shortcuts with rather similar sounding names, e.g. if you have both ArcGIS Pro and ArcGIS Desktop installed, and it is important that you pick the right one from ArcGIS Pro using Python 3.
In the event that there isn’t a shortcut, you can start Python from a standard Windows command prompt by typing:
"%PROGRAMFILES%\ArcGIS\Pro\bin\Python\Scripts\propy"
The instructions above mirror Esri's help [47] for running Python.
Open your Python command prompt and you should be in the folder C:\Users\<username>\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone\ or C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\ depending on your version of ArcGIS Pro. This is the default folder when the command prompt opens (which you can see at the prompt). Then type (and hit Enter):
Scripts\pip install pyprof2calltree
Pip will then download the converter we need from a repository of hosted Python packages and install it. You should see a message saying “Successfully installed pyprof2calltree-1.4.3“, although your version number may be higher and that’s ok. If you receive an error message about permissions during the pyprof2calltree installation, close out of the Python command prompt and reopen the program with administrative privileges (usually by right-clicking on the program and selecting "Run as administrator"; in Windows 10, you might need to "Open File Location" and then right-click to "Run as administrator").
After running commands in Python command prompt, you will probably also see an information message stating:
“You are using pip version 9.0.3, however version 10.0.1 is available. You should consider upgrading via the 'python -m pip install --upgrade pip' command.“
You can ignore this message, we’re going to leave pip at its current version as that is what came with the ArcGIS Pro Python distribution and we know that it works.
This section is provided for interest only. Please feel free to skip over it and you can loop back to it at the end of the lesson if you have free time or after the end of the class.
Now that we have the QCacheGrind viewer and converter installed, we can create some profile files, and we'll do that using IPython (available on another tab in Spyder). Think of IPython a little like the interactive Python window in ArcGIS, with a few more bells and whistles. We’ll use IPython very shortly for line-by-line profiling, so as a bridge to that task, let's use it to perform some more simple profiling.
Open the IPython pane in Spyder and then you will probably need to change to the folder where your code is located. This should be as easy as looking at the top of the Spyder window, selecting and copying the folder name where your code is, then clicking in the IPython window and typing cd and pasting in the folder name and hitting enter (as seen below).
This might look like:
cd c:\Users\YourName\Documents\GEOG489\Lesson1\
We could (but won't) run our code from IPython by typing:
run Lesson1B_basic_raster_for_mp.py
and hitting enter and our code will run and display its output in the IPython window.
That is somewhat useful, but more useful is the ability to use additional packages within IPython for more detailed profiling. Let’s create a function-level profile file using IPython for that raster code and then we’ll run it through the converter and then visualize it in QCacheGrind.
To create that function-level profile we’ll use the built-in profiler prun.
We'll access it using a magic word instruction to IPython. Think of a magic word in the same way as import for Python code – it gives us access to extra functionality, in this case IPython's built-in interface to Python's standard profiler.
Our magic words have % in front of them. Let's use one to see what the parameters are for using prun (the ? on the end tells prun to show us its built-in help):
%prun?
If you scroll back up through the IPython console, you will be able to see all of the options for prun. You can compare our command below to that list to break down the various options and experiment with others if you wish. Notice the very last line of the help which states:
If you want to run complete programs under the profiler's control, use "%run -p [prof_opts] filename.py [args to program]" where prof_opts contains profiler specific options as described here.
That is what we’re going to do – use run with the prun options.
cd "C:\Users\YourName"
%run -p -T profile_run.txt -D profile_run.prof Lesson1B_basic_raster_for_mp
There are a couple of important things to note in these commands. The first is the double quotes around the full path name in the cd command – these are important: just in case there is a space in your path, the double quotes encapsulate it so your folder is found correctly. The other thing is the casing of the various parameters (remember Python is case-sensitive and so are a lot of the built-in tools).
It could take a little while to complete our profiling as our code will run through from start to end. We can check that our code is running by opening the Windows Task Manager and watching the CPU usage which is probably at 100% on one of our cores.
While our code is running we'll see the normal output with the timing print functions we had implemented earlier. When the run command completes, we'll see a few lines of output that look like:
%run -p -T profile_run.txt -D profile_run.prof Lesson1B_basic_raster_for_mp

*** Profile stats marshalled to file 'profile_run.prof'.
*** Profile printout saved to text file 'profile_run.txt'.
         3 function calls in 0.000 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 ...
        1    0.000    0.000    0.000    0.000 SSL.py:677(Session)
        1    0.000    0.000    0.000    0.000 cookiejar.py:1753(LoadError)
        1    0.000    0.000    0.000    0.000 socks.py:127(ProxyConnectionError)
        1    0.000    0.000    0.000    0.000 _conditional.py:177(cryptography_has_mem_functions)
        1    0.000    0.000    0.000    0.000 _conditional.py:196(cryptography_has_x509_store_ctx_get_issuer)
        1    0.000    0.000    0.000    0.000 _conditional.py:210(cryptography_has_evp_pkey_get_set_tls_encodedpoint)
These summary outputs will also be written to a text file that we can open in a text editor (or in Spyder) and our profile file which we will convert and open in QCacheGrind. Writing that output to a text file is useful because there is too much of it to fit within IPython’s window buffer, and you won’t be able to get back to the output right at the start of the execution. If you open the profile_run.txt file in Spyder you’ll see the full output.
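Incidentally, %prun and %run -p drive Python's standard-library cProfile module behind the scenes. The snippet below is a standalone sketch (simulate is just a hypothetical toy workload, not part of the lesson's files) showing how the same kind of function-level statistics can be produced outside of IPython:

```python
import cProfile
import io
import pstats

def simulate(n):
    """Toy workload so the profiler has something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total

# Collect function-level statistics, just as %prun does
pr = cProfile.Profile()
pr.enable()
simulate(100_000)
pr.disable()

# Sort by internal time, matching prun's "Ordered by: internal time"
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats('tottime').print_stats(5)
print(buf.getvalue())
```

Running this prints a table with ncalls, tottime, and cumtime columns like the one above, with simulate near the top.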
We’ll run the converter using some familiar Python commands and the convert function within the IPython window:
from pyprof2calltree import convert, visualize
convert('profile_run.prof', 'callgrind.profile_run')
The converted output file can now be opened in QCacheGrind. Open QCacheGrind from the folder you installed it into earlier by double-clicking its icon. Click the folder icon in the top left of QCacheGrind or choose File -> Open from the menu and open the callgrind.profile_run file we just created, which should be in the same folder as your source code.
What we have now is a complicated and detailed interface and visualization of every function that our code called but in a more graphically friendly format than the original text file. We can sort by time, number of times a function was called, the function name and the location of that code (our own, within the arcpy library or another library) in the left-hand pane of the interface.
In the list, you will see a lot of built-in functions (things that Python does behind the scenes or that arcpy has it do – calculations, string functions, etc.) but you will also see the names of some of the arcpy functions that we used, such as FlowAccumulation(...) or StreamToFeature(...). If you double-click on one of them and click on the Call Graph tab in the lower pane, you will see a graphical representation of where the tasks were called from. If you double-click on the function's box above it in the Call Graph pane, QCacheGrind makes that function the new focus so you can keep moving up the call chain.
In the example below, we can see that FlowAccumulation(...) is the slowest of our tasks, taking about 43% of the execution time. If we can find a way to speed up (or eliminate) this process, we'll make our code more efficient. If we can't – that's ok too – we'll just have to accept that our code takes a certain amount of time to run.
Spend a little time clicking around in the interface and exploring your code – don’t worry too much about going down the rabbit hole of optimizing your code – just explore. Check out functions whose name you recognize such as those raster ones, or ListRasters(). Experiment with examining the content of the different tabs and seeing which modules call which functions (Callers, All Callers and Callee Map). Click down deep into one of the modules and watch the Callee Map change to show each small task being undertaken. If you get too far down the rabbit hole you can use the up arrow near the menu bar to find your way back to the top level modules.
If you’re interested, feel free to run through the process again with your multiprocessing code and see the differences. As a quick reminder, the IPython commands are (although your filenames might be different and be sure to double-check that you're in the correct folder if you get file not found errors):
%run -p -T profile_run_mp.txt -D profile_run_mp.prof Lesson1B_basic_raster_using_mp
from pyprof2calltree import convert, visualize
convert('profile_run_mp.prof', 'callgrind.profile_run_mp')
If you load that newly created file into QCacheGrind, you'll note it looks a little different – like an Escher drawing or a geometric pattern. That is the representation of the multiprocessing functions being run. Feel free to explore here as well – you will notice that the functions we previously saw are harder to find, or invisible.
I haven't forgotten about my promise from earlier in the lesson to review the reasons why the Cherry-O code is only about 2-3x faster in multiprocessing mode than the 4x that we would have hoped.
Feel free to run both versions of your Cherry-O code against the profiler and you'll notice that most of the time is taken up by some code described as {method 'acquire' of '_thread.lock' objects} which is called a small number of times. This doesn't give us a lot of information but does hint that perhaps the slower performance is related to something to do with handling multiprocessing objects.
Remember back to our brief discussion about pickling objects which was required for multiprocessing?
It's the culprit, and the following optional section on line profiling will take a closer look at this issue. However, as we said before, line profiling adds quite a bit of complexity, so feel free to skip this section entirely or get back to it after you have worked through the rest of the lesson.
As we said at the beginning of this section, this section is provided for interest only. Please feel free to skip over it and you can loop back to it at the end of the lesson if you have free time or after the end of the class. Be warned: installing the required extensions and using them is a little complicated - but this is an advanced class, so don't let this deter you.
Before we begin this optional section, you should know that line profiling is slow – it adds a lot of overhead to our code execution, and it should only be done on functions that we know are slow and whose specific causes of slowness we are trying to identify. Due to that overhead, we cannot rely on the absolute timing reported in line profiling, but we can rely on the relative timings. If a line of code is taking 50% of our execution time and we can reduce that to 40%, 30%, or 20% (or better) of our total execution time, then we have been successful.
With that warning about the performance overhead of line profiling, we’re going to install our line profiler (which isn’t the same one that Spyder or ArcGIS Pro would have us install) again using pip (although you are welcome to experiment with those too).
Setting Permissions

Before we start installing packages we will need to adjust operating system (Windows) permissions on the Python folder of ArcGIS Pro, as we will need the ability to write to some of the folders it contains. This will also help us if we inadvertently attempt to create profile files in the Python folder, as the files will be created instead of producing an error that the folder is inaccessible (but we shouldn't create those files there as it will add clutter).
Open up Windows Explorer and navigate to the c:\Program Files\ArcGIS\Pro\bin folder. Right-click the Python folder and select Properties. Select the Security tab, and click the Advanced button. In the new window that opens, select Change Permissions, select the Users group, uncheck the "Include inheritable permissions from this object's parent" box or choose Disable Inheritance (depending on your version of Windows), and select Add (or Make Explicit) on the dialog that appears.

Once again open (if it isn't already) your Python command prompt. When it opens, you should be in the default folder of your ArcGIS Pro Python environment, either C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3 or C:\Users\<username>\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone (you can see the current folder at the prompt). If in your version of Pro the command prompt shows the arcgispro-py3-clone path, then you will have to use that path instead in some of the following commands where the kernprof program is used.
Then type:
scripts\pip install line_profiler
Pip will then download from a repository of hosted Python packages the line_profiler that we need and then install it.
If you receive an error that "Microsoft Visual C++ 14.0 is required," visit https://www.visualstudio.com/downloads/ [3] and download the package for "Visual Studio Community 2017" which will download Visual Studio Installer. Run Visual Studio Installer, and under the "Workloads" tab, you will select two components to install. Under Windows, check the box in the upper right hand corner of "Desktop Development with C++," and, under Web & Cloud, check the box for "Python development." After checking the boxes, click "Install" in the lower right hand corner. After installing those, open the Python command prompt again and enter:
scripts\pip install misaka
If that works, then install the line_profiler with...
scripts\pip install line_profiler
You should see a message saying "Successfully installed line_profiler 2.1.2" although your version number may be higher and that’s okay.
Now that the line profiler is installed, we can run it. There are two modes for running the line profiler: function mode, where we supply a Python file and a function we want to run (as well as the parameters for that function, given as parameters to the profiler), and module mode, where we supply the module name.
Function-level line profiling is very useful when you want to test just a single function, or if you're doing multiprocessing as we saw above. Module-level line profiling is a useful first pass to identify those functions that might be slowing things down, which is why we took a similar approach with our higher-level profiling earlier.
Now we can dive into function-level profiling to find the specific lines which might be slowing down our code and then optimize or enhance them and then perform further function-level line profiling (or module-level profiling) again to test our improvements.
We will start with module-level profiling using our single processor Cherry-O code [48], look at our non-multiprocessing raster example code [26] that did the analysis of the Penn State campus, and finally move on to the multiprocessing Cherry-O code [49]. You may notice a few little deviations in the code that is being used in this section compared to the versions presented in Section 1.5 and 1.6. These deviations are really minimal and have no effect on how the code works and the insights we gain from the profiling.
Our line profiler is in a package called kernprof (named after its author) and it works as a wrapper around the standard cProfile and line_profiler tools.
We need to make some changes to our code so that the line profiler knows which functions we wish it to interrogate. The first of those changes is to wrap a function definition around our code (so that we have a function to profile instead of just a single block of code). The second change is to use a decorator, which is an instruction or piece of meta-information attached to a function. In the case of our profiler, we use the decorator @profile to tell kernprof which functions to examine (and we can mark many); any function without the decorator is ignored. Note that @profile is only defined while your code is running under the profiler, so the decorator may give you errors if you run the script normally – in that case comment it out.
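Rather than commenting the decorator out every time you run the script without kernprof, one common pattern (a convenience sketch on my part, not part of the lesson's files) is to define a do-nothing profile fallback when kernprof hasn't injected one:

```python
import builtins

# kernprof injects a global 'profile' decorator when it runs your script.
# This fallback defines a do-nothing decorator so the same script also
# runs under plain Python without a NameError on @profile.
if not hasattr(builtins, 'profile'):
    def profile(func):
        return func

@profile
def demo():
    # Hypothetical stand-in for a function you want to profile.
    return sum(range(1000))

print(demo())  # → 499500
```

Under kernprof the real decorator is used and line timings are collected; under plain Python the fallback simply returns the function unchanged.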
We’ve made those changes to our original Cherry-O code below so you can see them for yourselves. Check out line 8 for the decorator, line 9 for the function definition (and note how the code is now indented within the function) and line 53 where the function is called. You might also notice that I reduced the number of games back down to 10001 from our very large number earlier. Don’t forget to save your code after you make these changes.
# Simulates 10K game of Hi Ho! Cherry-O
# Setup _very_ simple timing.
import time
start_time = time.time()

import random

@profile
def cherryo():
    spinnerChoices = [-1, -2, -3, -4, 2, 2, 10]
    turns = 0
    totalTurns = 0
    cherriesOnTree = 10
    games = 0

    while games < 10001:
        # Take a turn as long as you have more than 0 cherries
        cherriesOnTree = 10
        turns = 0
        while cherriesOnTree > 0:

            # Spin the spinner
            spinIndex = random.randrange(0, 7)
            spinResult = spinnerChoices[spinIndex]

            # Print the spin result
            #print ("You spun " + str(spinResult) + ".")

            # Add or remove cherries based on the result
            cherriesOnTree += spinResult

            # Make sure the number of cherries is between 0 and 10
            if cherriesOnTree > 10:
                cherriesOnTree = 10
            elif cherriesOnTree < 0:
                cherriesOnTree = 0

            # Print the number of cherries on the tree
            #print ("You have " + str(cherriesOnTree) + " cherries on your tree.")

            turns += 1
        # Print the number of turns it took to win the game
        #print ("It took you " + str(turns) + " turns to win the game.")
        games += 1
        totalTurns += turns

    print ("totalTurns "+str(float(totalTurns)/games))
    #lastline = raw_input(">")

    # Output how long the process took.
    print ("--- %s seconds ---" % (time.time() - start_time))

cherryo()
We could try to run the profiler and the other code from within IPython but that often causes issues such as unfound paths, files, etc., as well as making it difficult to convert our output to a nice readable text file. Instead, we’ll use our Python command prompt and then we’ll run the line profiler using (note: the "-l" is a lowercase L and not the number 1):
python "c:\program files\arcgis\pro\bin\python\envs\arcgispro-py3\lib\site-packages\kernprof.py" -l "c:\users\YourName\CherryO.py"
When the profiling completes (and it will be fast in this simple example), you’ll see our normal code output from our print functions and the summary from the line profiler:
Wrote profile results to CherryO.py.lprof
This tells us the profiler has created an lprof file in our current directory called CherryO.py.lprof (or whatever our input code was called).
The profile files will be saved wherever your Python command prompt path is pointing. Unless you've changed the directory, the Python command prompt will most likely be pointing to C:\Program Files\ArcGIS\Pro\Bin\Python\envs\arcgispro-py3, and the files will be saved in that folder.
The profile files are binary files that are impossible to read without help from another tool. To rectify that, we'll run the .lprof file back through line_profiler (which seems a little confusing – we did just create that file with line_profiler, but line_profiler can also read the files it creates) and then pipe (redirect) the output to a text file, which we'll put back in our code directory so we can find it more easily.
To achieve this, we run the following command in our Python command window:
..\python -m line_profiler CherryO.py.lprof > "c:\users\YourName\CherryO.profile.txt"
This command will instruct Python to run the line_profiler (which is some Python code itself) to process the .lprof file we created. The > will redirect the output to a text file at the provided path instead of displaying the output to the screen.
We can then open the resulting output file, which should be back in our code folder, from within Spyder and read the results. I've included my output for reference below, and it is also in the CherryO.Profile pdf [50].
Timer unit: 4.27655e-07 s

Total time: 3.02697 s
File: c:\users\YourName\Lesson 1\CherryO.py
Function: cherryo at line 8

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           @profile
     9                                           def cherryo():
    10         1          5.0      5.0      0.0      spinnerChoices = [-1, -2, -3, -4, 2, 2, 10]
    11         1          2.0      2.0      0.0      turns = 0
    12         1          1.0      1.0      0.0      totalTurns = 0
    13         1          1.0      1.0      0.0      cherriesOnTree = 10
    14         1          1.0      1.0      0.0      games = 0
    15
    16     10002      36775.0      3.7      0.5      while games < 10001:
    17                                                   # Take a turn as long as you have more than 0 cherries
    18     10001      25568.0      2.6      0.4          cherriesOnTree = 10
    19     10001      27091.0      2.7      0.4          turns = 0
    20    168060     464529.0      2.8      6.6          while cherriesOnTree > 0:
    21
    22                                                       # Spin the spinner
    23    158059    4153276.0     26.3     58.7              spinIndex = random.randrange(0, 7)
    24    158059     487698.0      3.1      6.9              spinResult = spinnerChoices[spinIndex]
    25
    26                                                       # Print the spin result
    27                                                       #print "You spun " + str(spinResult) + "."
    28
    29                                                       # Add or remove cherries based on the result
    30    158059     460642.0      2.9      6.5              cherriesOnTree += spinResult
    31
    32                                                       # Make sure the number of cherries is between 0 and 10
    33    158059     458508.0      2.9      6.5              if cherriesOnTree > 10:
    34     42049     112815.0      2.7      1.6                  cherriesOnTree = 10
    35    116010     325651.0      2.8      4.6              elif cherriesOnTree < 0:
    36      5566      14506.0      2.6      0.2                  cherriesOnTree = 0
    37
    38                                                       # Print the number of cherries on the tree
    39                                                       #print "You have " + str(cherriesOnTree) + " cherries on your tree."
    40
    41    158059     445969.0      2.8      6.3              turns += 1
    42                                                   # Print the number of turns it took to win the game
    43                                                   #print "It took you " + str(turns) + " turns to win the game."
    44     10001      29417.0      2.9      0.4          games += 1
    45     10001      31447.0      3.1      0.4          totalTurns += turns
    46
    47         1        443.0    443.0      0.0      print ("totalTurns "+str(float(totalTurns)/games))
    48                                               #lastline = raw_input(">")
    49
    50                                               # Output how long the process took.
    51         1       3723.0   3723.0      0.1      print ("--- %s seconds ---" % (time.time() - start_time))
What we can see here are the individual times to run each line of code: The numbers on the left are the code line numbers, the number of times each line was run (Hits), the time each line took in total (Hits * Time Per Hit), the time per hit, the percentage of time those lines took and, for reference, the line of code alongside.
The first thing that jumps out at me is that the random number selection (line 23) takes the longest time and is called the most – if we can speed this up somehow we can improve our performance.
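As a hedged illustration of the kind of micro-optimization the profile suggests (an experiment of my own, not part of the lesson's code), random.randrange(0, 7) can be compared against an index computed from random.random(), which skips randrange's argument handling; the standard timeit module makes the comparison easy:

```python
import timeit

# Time each spinner-index approach over the same number of calls.
n = 200_000
t_randrange = timeit.timeit('random.randrange(0, 7)',
                            setup='import random', number=n)
t_random = timeit.timeit('int(random.random() * 7)',
                         setup='import random', number=n)

print("randrange(0, 7):   %.3f s" % t_randrange)
print("int(random() * 7): %.3f s" % t_random)
```

On most machines the random.random() version is noticeably faster, though absolute times will vary; this is exactly the kind of targeted change line profiling is meant to motivate.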
Let’s move onto the other examples of our code to see some more profiling results.
First we’ll look at the sequential version of our raster processing code before coming back to look at the multiprocessing examples as they’re a little special.
As before with the Cherry-O example we’ll need to wrap our code into a function and use the @profile decorator (and of course call the function). Attempt to make these changes on your own and if you get stuck you can find my code sample here [51] if you want to check your work against mine.
We’ll run the profiler again, produce our output file and then convert it to text and review the results using:
python "c:\program files\arcgis\pro\bin\python\envs\arcgispro-py3\lib\site-packages\kernprof.py" -l "c:\users\YourName\Lesson1B_basic_raster_for_mp.py"
and then
..\python -m line_profiler Lesson1B_basic_raster_for_mp.py.lprof > "c:\users\YourName\Lesson1B_basic_raster_for_mp.py_profile.txt"
If we investigate these outputs (my outputs are here [52]) we can see that the FlowAccumulation calculation is again the slowest, just as we saw when we were doing the module-level calculations. In this case, because we're predominantly using arcpy functions, we're not seeing as much granularity or resolution in the results. That is, we don't know why FlowAccumulation(...) is so slow, but I'm sure you can see that in some other circumstances, you could identify multiple arcpy functions which could achieve the same result – and choose the most efficient.
Next, we’ll look at the multiprocessing example of the Cherry-O example to see how we can implement line profiling into our multiprocessing code. As we noted earlier, multiprocessing and profiling are a little special as there is a lot going on behind the scenes and we need to very carefully select what functions and lines we’re profiling as some things cannot be pickled.
Therefore what we need to do is use the line profiler in its API mode. That means instead of using the line profiling outside of our code, we need to embed it in our code and put it in a function between our map and the function we’re calling. This will give us output for each process that we launch. Now if we do this for the Cherry-O example we’re going to get 10,000 files – but thankfully they are small so we’ll work through that as an example.
The point to reiterate before we do that is the Cherry-O code runs in seconds (at most) – once we make these line profiling changes the code will take a few minutes to run.
We'll start with the easy steps and work our way up to the more complicated ones. First, add an import of the line_profiler package at the top of the script:
import line_profiler
Now let’s define a function within our code to sit between the map function and the called function (cherryO in my version).
We’ll break down what is happening in this function shortly and that will also help to explain how it fits into our workflow. This new function will be called from our mp_handler() function instead of our original call to cherryO and this new function will then call cherryO.
Our new mp_handler function looks like:
def mp_handler():
    myPool = multiprocessing.Pool(multiprocessing.cpu_count())
    ## The Map function part of the MapReduce is on the right of the = and the Reduce part on the left where we are aggregating the results to a list.
    turns = myPool.map(worker, range(numGames))
    # Uncomment this line to print out the list of total turns (but note this will slow down your code's execution)
    #print(turns)
    # Use the statistics library function mean() to calculate the mean of turns
    print(mean(turns))
Note that our call to myPool.map now has worker(...) not cherryO(...) as the function being spawned multiple times. Now let's look at this intermediate function that will contain the line profiler as well as our call to cherryO(...).
def worker(game):
    profiler = line_profiler.LineProfiler(cherryO)
    call = 'cherryO(' + str(game) + ')'
    turns = profiler.run(call)
    profiler.dump_stats('profile_' + str(game) + '.lprof')
    return(turns)
The first line of our new function is setting up the line profiler and instructing it to track the cherryO(...) function.
As before, we pass the variable game to the worker(...) function and then pass it through to cherryO(...) so it can still perform as before. It's also important that, when we call cherryO(...), we record the value it returns in a variable turns, which we then return to the calling function so our calculations work as before. You will notice we're not just calling cherryO(...) and passing it the variable, though – we need to pass it a little differently, as the profiler can only support certain picklable objects. The most straightforward way to achieve that is to encode our function call into a string (call) and then have the profiler run that call. If we don't do this, the profiler will run but no results will be returned.
Just before we send that value back we use the profiler’s function dump_stats to write out the profile results for the single game to an output file.
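As an aside, both cProfile and line_profiler also expose a runcall(...) method, which profiles a function call directly and passes the function's return value straight through, avoiding the string-encoding step. The sketch below demonstrates the pattern with the standard-library cProfile (cherryO_stub is a hypothetical stand-in for the lesson's cherryO function, so this is an alternative approach rather than the lesson's exact method):

```python
import cProfile

def cherryO_stub(game):
    # Hypothetical stand-in for the lesson's cherryO(game) function.
    return game * 2

def worker(game):
    profiler = cProfile.Profile()
    # runcall profiles the call AND returns the function's return value
    turns = profiler.runcall(cherryO_stub, game)
    profiler.dump_stats('profile_' + str(game) + '.prof')
    return turns

print(worker(5))  # → 10
```

line_profiler's LineProfiler offers a similar runcall interface, which can be worth trying if the string-based run(...) call does not hand back the value you expect.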
Don’t forget to save your code after you make these changes. Now we can run through a slightly different (but still familiar) process to profile and export our results, just with different file names. To run this code we’ll use the Python command prompt:
python c:\users\YourName\CherryO_mp.py
Notice how much longer the code now takes to run – this is another reason to wrap the line profiling in its own function: for production runs we can simply point the map call back at cherryO(...), while leaving the worker function in place in case we want to profile again.
It's also possible you'll receive several error messages when the code runs, but the lprof files are still created.
Once our code completes, you will notice we have those 10,000 lprof files (which is overkill as they are probably all largely the same). Examine a few of the files if you like by converting them to text files and viewing them in your favorite text editor or Spyder using the following in the Python command prompt:
python -m line_profiler profile_1.lprof > c:\users\YourName\profile_1.txt
If you examine one of those files, you’ll see results similar to:
Timer unit: 4.27655e-07 s

Total time: 0.00028995 s
File: c:\users\obrien\Dropbox\Teaching_PSU\Geog489_SU_1_18\Lesson 1\CherryO_MP.py
Function: cherryO at line 25

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                           def cherryO(game):
    26         1         11.0     11.0      1.6      spinnerChoices = [-1, -2, -3, -4, 2, 2, 10]
    27         1          9.0      9.0      1.3      turns = 0
    28         1          8.0      8.0      1.2      totalTurns = 0
    29         1          8.0      8.0      1.2      cherriesOnTree = 10
    30         1          9.0      9.0      1.3      games = 0
    31
    32                                               # Take a turn as long as you have more than 0 cherries
    33         1          9.0      9.0      1.3      cherriesOnTree = 10
    34         1          9.0      9.0      1.3      turns = 0
    35        16         41.0      2.6      6.0      while cherriesOnTree > 0:
    36
    37                                                   # Spin the spinner
    38        15        402.0     26.8     59.3          spinIndex = random.randrange(0, 7)
    39        15         38.0      2.5      5.6          spinResult = spinnerChoices[spinIndex]
    40
    41                                                   # Print the spin result
    42                                                   #print "You spun " + str(spinResult) + "."
    43
    44                                                   # Add or remove cherries based on the result
    45        15         34.0      2.3      5.0          cherriesOnTree += spinResult
    46
    47                                                   # Make sure the number of cherries is between 0 and 10
    48        15         35.0      2.3      5.2          if cherriesOnTree > 10:
    49         4          8.0      2.0      1.2              cherriesOnTree = 10
    50        11         24.0      2.2      3.5          elif cherriesOnTree < 0:
    51                                                       cherriesOnTree = 0
    52
    53                                                   # Print the number of cherries on the tree
    54                                                   #print "You have " + str(cherriesOnTree) + " cherries on your tree."
    55
    56        15         32.0      2.1      4.7          turns += 1
    57                                               # Print the number of turns it took to win the game
    58         1          1.0      1.0      0.1      return(turns)
Arguably we’re not learning anything that we didn’t know from the sequential version of the code – we can still see the randrange() function is the slowest or most time consuming (by percentage) – however, if we didn’t have the sequential version and wanted to profile our multiprocessing code this would be a very important skill.
The same steps to modify our code above would be implemented if we were performing this line profiling on arcpy (or any other) multiprocessing code. The same type of intermediate function would be required, we would need to pass and return parameters (if necessary), and we would also need to reformat the function call so that it was picklable. The output from the line profiler is delivered in a different format from the module-level profiling we were doing before and, therefore, isn't suitable for loading into QCacheGrind. I'd suggest that isn't as important here, as we're looking at a much smaller number of lines of code, so the graphical representation isn't as necessary.
Returning to our ongoing discussion about the less than anticipated performance improvement between our sequential and multiprocessing Cherry-O code, what we can infer by comparing the line profile output of the sequential version of our code and the multiprocessing version is that pretty much all of the steps take the same proportion of time. So if we're doing nearly everything in about the same proportions, but 4 times as many of them (using our 4 processor PC example) then why isn't the performance improvement around 4x faster? We'd expect that setting up the multiprocessing environment might be a little bit of an overhead so maybe we'd be happy with 3.8x or so.
That isn't the case, though, so I did a little bit of experimenting with calculating how much time it takes to pickle those simple integers. I modified the mp_handler function in my multiprocessing code so that, instead of doing actual work selecting cherries, it pickled the 1 million integers that would represent the game numbers. That function looked like this (nothing else changed in the code):
import pickle

def mp_handler():
    myPool = multiprocessing.Pool(multiprocessing.cpu_count())
    ## The Map function part of the MapReduce is on the right of the = and the Reduce part on the left where we are aggregating the results to a list.
    #turns = myPool.map(worker,range(numGames))
    #turns = myPool.map(cherryO,range(numGames))
    t_start = time.time()
    for i in range(0, numGames):
        pickle_data = pickle.dumps(range(numGames))
    print ("cPickle data took", time.time() - t_start)
    t_start = time.time()
    pickle.loads(pickle_data)
    print ("cPickle data took", time.time() - t_start)
What I learned from this experimentation was that the pickling took around 4 seconds - or about 1/4 of the time my code took to play 1M Cherry-O games in multiprocessing mode - 16 seconds (it was 47 seconds for the sequential version).
A simplistic analysis suggests that the pickling comprises about 25% of my execution time (your results might vary). If I subtract the time taken for the pickling, my code would have run in 12 seconds, and 47s ÷ 12s = 3.916 – or nearly the 4x improvement we would have anticipated. So the takeaway message here reinforces some of the implications of implementing multiprocessing that we discussed earlier: there is an overhead to multiprocessing, and lots of small calculations, as in this case, aren't the best application for it because we lose some of that performance benefit to the implementation overhead. Still, an almost tripling of performance (47s / 16s) is worth the effort.
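If you want to reproduce this kind of back-of-the-envelope measurement yourself, the standard library is all you need. The sketch below is standalone (a list of 1 million integers stands in for the game numbers) and times a full pickle round trip:

```python
import pickle
import time

n = 1_000_000
data = list(range(n))  # concrete list of game numbers

# Time serializing (pickling) the data...
t0 = time.time()
blob = pickle.dumps(data)
print("pickling %d ints took %.3f s (%d bytes)" % (n, time.time() - t0, len(blob)))

# ...and deserializing (unpickling) it again.
t0 = time.time()
restored = pickle.loads(blob)
print("unpickling took %.3f s" % (time.time() - t0))
```

Absolute times will differ between machines, but the exercise makes the serialization cost of shipping work to child processes concrete.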
Your coding assessment for this lesson will have you modify some code (as mentioned earlier) and profile some multiprocessing analysis code which is why we’re not demonstrating it specifically here. See the lesson assignment page for full details.
Before we move on from profiling a few important points need to be made. As you might have worked out for yourself by this point, profiling is time-consuming and you really should only undertake it if you have a very slow piece of code or one where you will be running it thousands of times or more and a small performance improvement is likely to be beneficial in the long run. To put that another way, if you spend a day profiling your code to improve its performance and you reduce the execution time from ten minutes to five minutes but you only run that code once then I would argue you haven’t used your time productively. If your code is already fast enough that it executes in a reasonable amount of time – that is fine.
Do not get caught in the trap of beginning to optimize and profile your code too early, particularly before the code is complete. You may be focusing on a slow piece of code that will only be executed once or twice and the performance improvement will not be significant compared to the execution of the other 99% of the code.
We have to accept that some external libraries are inefficient and if we need to use them, then we must accept that they take as long as they do to get the job done. It is also possible that those libraries are extremely efficient and take as long as they do because the task they are performing is complicated. There isn’t any point attempting to speed up the arcpy.da cursors for example as they are probably as fast as they are likely to be in the near future. If that is the slowest part of our code, we may have to accept that.
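As a reminder of what basic profiling looks like (outside the multiprocessing context of the assignment), here is a minimal sketch using the standard library's cProfile and pstats modules. The slow_sum function is a made-up example, not code from the lesson:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """A deliberately naive loop to give the profiler something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total

# Profile just the call we care about rather than the whole script.
profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(200000)
profiler.disable()

# Collect the statistics into a string and show the most expensive calls.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Sorting by "cumulative" time is a quick way to find which function calls dominate the runtime before deciding whether any of them are worth optimizing.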
Software projects can often grow in complexity and expand to include multiple developers. Version control systems (VCS) are designed to record changes to data and encourage team collaboration. Data, often software code, can be backed up to prevent data loss, and the changes made can be tracked. VCS are tools to facilitate teamwork and the merging of different contributors' changes. Version control [1] [53] is also known as “revision control.” Version control tools like Git can help development teams or individuals manage their projects in a logical, procedural way without needing to email copies of files around and worry about who made what changes in which version.
Centralized VCS, like Subversion (SVN), Microsoft Team Foundation Server (TFS) and IBM ClearCase, all use a centralized, client-server model for data storage and to varying degrees discourage “branching” of code (discussed in more detail below). These systems instead encourage a file check-out, check-in process and often have longer “commit cycles” where developers work locally with their code for longer periods before committing their changes to the central repository for back-up and collaboration with others. Centralized VCS have a longer history in the software development world than DVCS, which are comparatively newer. Some of these tools are difficult to compare solely on their VCS merits because they perform more operations than just version control. For example, TFS and ClearCase are not just VCS software, but integrate bug tracking and release deployment as well.
Distributed VCS (DVCS) like Git (what we’re focusing on) or Mercurial (hg), all use a decentralized, peer-to-peer model where each developer can check out an entire repository to their local environment. This creates a system of distributed backup where if any one system becomes unavailable, the code can be reconstructed from a different developer’s copy of the repository. This also allows off-line editing of the repository code when a network connection to a central repository is unavailable. As well, DVCS software encourages branching to allow developers to experiment with new functionality without “breaking” the main “trunk” of the code base.
A hybrid VCS might use the concept of a central main repository that can be branched by multiple developers using DVCS software, but where all changes are eventually merged back to the main trunk code repository. This is generally the model used by online code repositories like GitHub or Bitbucket.
Git is a VCS that stores and tracks source code in a repository. A variety of data about code projects is tracked such as what changes were made, who made them, and comments about the changes [3] [54]. Past versions of a project can be accessed and reinstated if necessary. Git uses permissions to control what changes get incorporated in the master repository. In projects with multiple people, one user will be designated as the project owner and will approve or reject changes as necessary.
Changes to the source code are handled by branches, merges, and commits. Branching, sometimes called forking, lets a developer copy a code repository (or part of a repository) for development in parallel with the main trunk of the code base. This is typically done to allow multiple developers to work separately and then merge their changes back into a main trunk code repository.
Although Git is commonly used on code projects with multiple developers, the technology can be applied to any number of users (including one) working on any types of digital files. More recently, Git has gained in popularity since it is used as the back end for GitHub among other platforms. Although other VCS exist, Git is frequently chosen since it is free, open source, and easily implemented.
Git has a few key terms to know moving forward [2] [55]:
A Git repository begins as a folder, either one that already exists or one that is created specifically to house the repository. For the cleanest approach, this folder will only contain folders and files that contribute to one particular project. When a folder is designated as a repository, Git adds one additional hidden subfolder called .git that houses several folders and files, and two text files called .gitignore and .gitmodules, as highlighted in Figure 1.29.
These file components handle all of the version control and tracking as the user commits changes to Git. If the user does not commit their changes to Git, the changes are not “saved” in the version control system. Because of this, it’s best to commit changes at fairly frequent intervals. The committed changes are only active on one particular user’s computer at this point. If the user is working on a branch of another repository, they will want to pull changes from the master repository fairly often to make sure they’re working on the most recent version of the code. If a conflict arises when the branch and the master have both changed in the same place in different ways, the user can work through how to resolve the conflict. When the user wants to integrate their changes with the master repository, the user will create a pull request to the owner of the repository. The owner will then review the changes made and any conflicts that exist, and either choose to accept the pull request to merge the edits into the master repository or send the changes back for additional work. These workflow steps may happen hundreds or thousands of times throughout the lifetime of a code project.
On its own, Git operates off a command line interface; users perform all actions by typing commands. Although this method is perfectly fine, visualizing what’s going on with the project can be a bit hard. To help with that, multiple GUI interfaces have been created to visualize and thus simplify the version control process, and some IDEs include built-in version control hooks. Currently, GitHub is the most popular front-end for Git and offers a free version for basic users.
Resources:
[1] https://en.wikipedia.org/wiki/Version_control
[2] https://github.com/kansasgis/GithubWebinar_2015
[3] https://en.wikipedia.org/wiki/Git
Some popular online hosting solutions for VCS and DVCS code repositories include: GitHub, Bitbucket, Google Code and Microsoft CodePlex. These online repositories are often used as the main trunk repositories for open-source projects with many developers who may be geographically dispersed. For the purposes of this class, we will focus on GitHub.
GitHub takes all of Git’s version control components, adds a graphical user interface to repositories, change history, and branch documentation, and adds several social components. Users can add comments, submit issues, and get involved in the bug tracking process. Users can follow other GitHub users or specific projects to be notified of project updates. GitHub can either be used entirely online or with an application download for easily managing and syncing local and online repositories. Optional (not required for class): Click here for the desktop application download [56].
The following exercise will cover the basics of Git and how they are used on the GitHub website.
GitHub has the ability to display everything that changed with every commit. Take a look at GitHub's Kansasgis/NG911 page [59]. If you click on one of the titles of one of the commits, it displays whatever basic description the developer included of the changes and then as you scroll down, you can see every code change that occurred - red highlighting what was removed and green highlighting what was added. If you mouse over the code, a plus sign graphic shows up, and users can leave comments and such.
Conflicts occur if two branches being merged have had different changes in the same places. Git automatically flags conflicts and will not complete the merge like normal; instead, the user will be notified that the conflicts must be resolved. Some conflicts can be resolved inside GitHub, and other types of conflicts have to be resolved in the Git command line [4] [60]. Due to the complexity of resolving conflicts in the command line, it’s best to plan ahead and silo projects as much as possible to avoid conflicts.
Git adds three different markers to the code to flag conflicts:
<<<<<<< HEAD – This marker indicates the beginning of the conflict in the base branch. The code from the base branch is located directly under this marker.
======= – This marker divides the base branch code from the other branch.
>>>>>>> BRANCH-NAME – This marker will have the name of the other branch next to it and indicates the end of the conflict.
Here’s a full example of how Git flags a conflict between branches:
<<<<<<< HEAD
myString = "Monty Python and the Holy Grail is the best."
=======
myString = "John Cleese is hilarious."
>>>>>>> cleese-branch
To resolve the conflict, the user needs to pick what myString will equal. Possible resolution options:
Keeping the base branch:
myString = "Monty Python and the Holy Grail is the best."
Using the other branch:
myString = "John Cleese is hilarious."
Combining branches, in this case combining the options:
myString = "Monty Python and the Holy Grail is the best. John Cleese is hilarious."
GitHub has an interface that can be activated for resolving basic conflicts by clicking on the “Resolve Conflicts” button under the “Pull Requests” tab. This interface steps through each conflict and the user must decide which version to take, keep their changes, use the other changes, or work out a way to integrate both sets of changes. Inside the GitHub interface, the user must also remove the Git symbols for the conflict. The user steps through every conflict in that particular file to decide how to resolve the conflict and then will eventually click on the “Mark as resolved” button. The next file in the project with conflicts will show up and the user will repeat all of the steps until the conflicts are resolved. At this point, the user will click “Commit merge” and then “Merge pull request.”
For more complex types of conflicts like one branch deleting a file that the other keeps, the resolution has to take place in the Git command line. This process can hopefully be avoided, but basic instructions are available at GitHub Help: Resolving a merge conflict using the command line [61].
Resources:
[4] https://help.github.com/articles/resolving-a-merge-conflict-on-github/ [60]
GitHub is a great fit for managing open source code projects since, with a free account, all repositories are available on the internet at large. For example, the open source GIS software QGIS (see Lesson 4) is housed on GitHub at GitHub's qgis/QGIS page [62]. Take a look at the repository.
On the front page, you can see in the dashboard statistics that (at the time of this writing) there have been over 40,000 commits, 50 branches, 100 releases, and 250 contributors to the QGIS project. Users worldwide can now contribute their ideas, bugs, and code improvements to a central location that can be managed with standard version control workflows.
Some software companies that have traditionally been protective about their code have adopted GitHub to open certain projects. Esri is rather active on GitHub at GitHub's Esri page [63] including the documentation and samples for the ArcGIS API for Python (see Lesson 3). Microsoft also is present at the GitHub Microsoft page [64] with the tagline “Open source, from Microsoft with love.”
While GitHub is open to all digital files and any programming languages, Python is a great fit for use in GitHub for multiple reasons. Unlike other, heavier programming languages, Python doesn’t require extensive libraries with complex DLLs and installation structures to get the job done.
Creating Python repositories is as simple as adding the .py files, and then the project can be shared, documented, and updated as needed. GitHub is also a great place to find both Python snippets and entire modules to use. For basic purposes, users can copy/paste just the portions of code off another project they want to try. Otherwise, users can fork an entire repository and tweak it as necessary to fit their purposes.
GitHub strongly recommends that every repository contain a README.txt or README.md file. This file will act as the “home page” for the project and is displayed on the repository page after files and folders are listed. This document should contain specific information about the project, how to use it, licensing, and support.
Text files will show up without formatting, so many users choose to use an .md (markdown) file instead. Markdown notation will be interpreted to show various formatting components like font size, bold, italics, embedded links, numbered lists, and bullet points.
For more information on markdown formatting, visit GitHub Guide's Mastering Markdown page [65]. We will also use Markdown in Lesson 3, in the context of Jupyter notebooks, and provide a brief introduction there.
While repositories on free GitHub accounts must be public, all accounts have the ability to create Gists. Gists are single-page repositories in GitHub, so they don't support projects with folder structures or multiple files. Since a Gist is a single-page repository, Gists are good for storing code snippets or one-page projects. Gists can be public or private, even with a free account.
To create a Gist in GitHub, log into GitHub and then click on the plus sign in the upper right-hand corner. In the options presented, choose "New gist." Enter a description of the Gist (in Figure 1.30, "Delete if Exists Shortcut" is the description) as well as the filename with extension (in Figure 1.30, this is DeleteIfExists.py). Enter code or notes in the large portion of the screen, or import the code by using the "Add File" button. You have two options for saving your Gist: either "Create secret gist" or "Create public gist."
"Secret" Gists are only mostly secret since they use the internet philosophy of difficult-to-guess urls. If you create a secret Gist, you can still share the Gist with anyone by sending them the url, but there are no logins required to view the Gist. Along this same philosophy, if someone stumbles across the url, they will be able to see the Gist.
For more information about Gists, see the official GitHub documentation at "About Gists" page on Github's website [66].
For GIS professionals, Gists are additionally useful since a Gist can be a single GeoJson file. GeoJson files are essentially a text version of geographic data in json formatting. Other developers can instantly access your GeoJson data and incorporate it from GitHub into their online mapping applications without needing to get a hard copy of the shapefile or geodatabase feature class or rely on some kind of map server. GitHub will automatically display GeoJson files as a map whether the file is a Gist or a part of a larger repository. For example, take a look at GitHub's lyzidiamond/learn-geojson page [67]. At first, you’ll see the GeoJson file interpreted as a map. If you click the “Raw” button located on the upper right-ish side of the map, you will see what the GeoJson file looks like in text form. GeoJson can be easily used in Python since after reading in the file, Python can work with the text as if it is one giant dictionary.
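To illustrate the "one giant dictionary" point, here is a minimal sketch that parses a small, hand-written GeoJSON string with the standard library's json module; the feature and its coordinates are made up for illustration, and in practice you would read the text from a file or URL:

```python
import json

# A tiny hand-written GeoJSON FeatureCollection used only for illustration.
geojson_text = """{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"name": "Sample point"},
      "geometry": {"type": "Point", "coordinates": [-77.86, 40.79]}
    }
  ]
}"""

data = json.loads(geojson_text)        # now just nested dicts and lists
feature = data["features"][0]
print(feature["properties"]["name"])       # prints: Sample point
print(feature["geometry"]["coordinates"])  # prints: [-77.86, 40.79]
```

Once parsed, the geometry and attribute values can be accessed with ordinary dictionary and list indexing, which is what makes GeoJSON so convenient to work with in Python.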
In GEOG 489, using GitHub to store the sample code and exercise code from the lessons can be a great way to practice and gain experience with a new software tool. Using GitHub is not required, and we do not recommend that you store your completed projects there. Nevertheless, GitHub is an encouraged platform for students to learn since many organizations use GitHub or other VCS.
Git and GitHub provide fast and convenient ways to track projects, whether the project is by one individual or a team of software developers. Although GitHub has many complex features available, it’s easily accessible for individual and small projects that need some kind of tracking mechanism. In addition to version control, GitHub provides users with a social platform for project management as well as the ability for users to create Gists and store GeoJson.
ArcGIS Pro does install an IDE by default (IDLE, which can be accessed from a conda prompt by typing idle), and there is also the straightforward installation of Spyder, which we covered when we started looking at writing Python 3 code for ArcGIS Pro.
The benefit of using an IDE like Spyder is that, for more complicated coding and particularly debugging, we have tools that enable us to step through our code line by line, debug and profile multiprocessing code, and watch what is happening in our variables, as we saw earlier in the lesson.
While we've had you install and use Spyder a little, there is a wide range of IDEs that you could use. We'll provide a brief overview of some different IDEs and code editors in case you would like to use something else (the choice is entirely yours). Esri supplies a link to a list of IDEs that work with Anaconda (the implementation of conda used for ArcGIS Pro).
Part of the homework assignment for this lesson involves selecting an IDE and reviewing it. For full details see the Assignment Page [40].
Since Python is used in so many different fields and for so many different purposes, there are multiple places Python can be written and edited, and we have seen some of these already.
Some ways are fairly straightforward. You can create a text file, write your Python code, save it with a “.py” extension, and execute it on your computer. However, this approach gives you little to no assistance with writing your code or debugging it.
Because writing code can be such an intensive process, Integrated Development Environment (IDE) applications were developed. Depending on its complexity, an IDE can assist developers in a multitude of ways. Many IDEs exist for many different languages, but several focus particularly on Python. Typical Python IDEs include a source code editor and debugging tools. Extended features in some IDEs include auto-completion, syntax checking, version control integration, environment control, and project organization.
IDLE is installed with ArcGIS Desktop software when Python is also installed. IDLE offers very basic code editing capabilities and color codes objects, functions, and methods so users can easily differentiate parts of their code. IDLE also includes a basic debugger that reports messages and errors back to the user. IDLE has some simple text editing tools for bulk indenting, dedenting, commenting, and tabifying.
In GEOG485, we used PythonWin [68] coming with ArcGIS Desktop as the course IDE (until we changed to ArcGIS Pro recently). PythonWin offers a bit more coder assistance and debugging tools than IDLE. For example, the debugger can step through code line by line so you can see exactly what’s going on with your code.
PyScripter is an open-source IDE available for download at SourceForge's PyScripter page [69]. The full project is also available on GitHub at their pyscripter page [70]. Each download of PyScripter installs several different versions of PyScripter so you can select the right one for the version you’re developing in. Developing in the wrong version of Python can lead to compatibility issues and lost functionality.
If you want to use PyScripter to develop code to be used in ArcGIS Desktop software, you will want to know ahead of time that ArcGIS Desktop uses Python 2.7 and that ArcGIS Desktop is a 32-bit software program. On SourceForge, the PyScripter version is not related to any Python version, so you’ll probably just want to download the latest version of PyScripter without the term “x64” included in it. As you can see in Figure 1.31, this installation of PyScripter will include versions for Python 2.4 through 3.6. For ArcGIS Desktop, you’ll want to use PyScripter for Python 2.7.
If you want to use PyScripter to develop for ArcGIS Pro or to use the ArcGIS API for Python, you’ll want to download and install the latest version that includes “x64.” To check and see what version of Python ArcGIS Pro is using, open ArcGIS Pro, and on the “Analysis” tab, open the Python interpreter. Inside the Python interpreter, type in the following lines:
import sys
sys.version_info
The Python window will report back what version of Python ArcGIS Pro is using so you can use the corresponding PyScripter version. In Figure 1.32, the version used is 3.5.3, so PyScripter for Python 3.5 would be used.
PyScripter has many tools available for making developer’s lives easier. The source code editor includes auto-completion and a built-in syntax checker. Each developer can also customize certain parts of the interface based on their personal preferences. PyScripter also has more advanced debugging tools than IDLE or PythonWin.
PyCharm is another popular IDE that has a free community edition available for download; see the JetBrains Download PyCharm page [71]. PyCharm differs most from PyScripter in that PyCharm integrates easily with conda, so developers can specify which conda environment a project needs to be developed in and run on. PyCharm has many different development aids and tools available including auto-completion, enhanced debugging, various integrated code-checking processes, error detection, and on-the-fly code fixes.
Microsoft has a plugin for Visual Studio so it can be used as a Python IDE. Visual Studio has Community Editions available for free download, and Python Tools for Visual Studio can be downloaded and installed; see Microsoft's How to Install Python Support in Visual Studio on Windows page [72]. For Visual Studio 2015 and lower, you must also install a Python interpreter. Visual Studio is a robust IDE with organization tools that can integrate project components from various source code languages as well as auto-completion and enhanced debugging. Visual Studio can also be used to directly debug certain Python components in ArcGIS Pro including script tool execution, tool validation, and Python toolboxes.
There are multiple other Python IDEs available, and what to pick usually depends on a developer’s personal preference. Others include Eclipse/PyDev [73], Eric [74], and Spyder [75] of course.
For more information about IDEs in general, feel free to refer to these links: Wikipedia's page for Integrated Development Environment [76], Esri's Technical Support Page [77], and ArcGIS Pro's Debug Python code page [78].
For the first part of the Lesson 1 homework project, you will be evaluating an IDE. Each student will be evaluating a different IDE and can “claim” their IDE in the "L1: IDE Investigation: Choose topic" discussion forum within Canvas. Possible IDEs include but are not limited to the following (please do NOT choose Spyder!):
First, claim your IDE in the "L1: IDE Investigation: Choose topic" discussion forum. Then experiment with writing and debugging code in that IDE and study the documentation. Pay special attention to which of the features mentioned in Section 1.9 (auto-completion, syntax checking, version control, environment control, and project organization) are available in that IDE. Record a 5 minute demo and discussion video of your chosen IDE using Kaltura that highlights the IDE’s features, functionalities, and possible difficulties. Post a link to your video in the Media Gallery.
We are going to use the arcpy vector data processing code from Section 1.6.6.2 (download Lesson1_Assignment_initial_code.py [38]) as the basis for our Lesson 1 programming project. The code is already in multiprocessing mode, so you will not have to write multiprocessing code on your own from scratch, but you will still need a good understanding of how the script works. If you are unclear about anything the script does, please ask on the course forums. This part of the assignment is for getting back into the rhythm of writing arcpy-based Python code and practicing creating script tools with ArcGIS Pro. Your task is to extend our vector data clipping script by doing the following:
You will have to submit several versions of the modified script for this assignment:
To realize the modified code versions in this assignment, all main modifications have to be made to the input variables and within the code of the worker() and mp_handler() functions; the code from the get_install_path() function should be left unchanged. Of course, we will also look at code quality, so make sure the code is readable and well documented. Here are a few more hints that may be helpful:
When you adapt the worker() function, I strongly recommend that you do some tests with individual calls of that function first before you run the full multiprocessing version. For this, you can, for instance, comment out the pool code and instead call worker() directly from the loop that produces the job list, meaning all calls will be made sequentially rather than in parallel. This makes it easier to detect errors compared to running everything in multiprocessing mode right away. Similarly, it could be a good idea to add print statements for printing out the parameter tuples placed in the job list to make sure that the correct values will be passed to the worker function.
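The sequential-first testing pattern described above can be sketched as follows. The worker() function here is a hypothetical stand-in (the parameter names and return value are assumptions, not the actual assignment code); the point is the structure of testing with direct calls before switching back to the pool:

```python
import multiprocessing

# Hypothetical worker mirroring the assignment's pattern: it receives one
# tuple of parameters from the job list and returns a simple result.
def worker(job):
    clipper, tobeclipped = job
    return "clipped " + tobeclipped + " with " + clipper

# Build the job list of parameter tuples, as in the assignment script.
job_list = [("Boundary.shp", fc) for fc in ("Roads.shp", "Rivers.shp")]

# Sequential test run: print each parameter tuple and call worker() directly
# so any error surfaces with a normal traceback.
for job in job_list:
    print(job)
    print(worker(job))

# Once the sequential version behaves as expected, switch back to the pool:
if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        results = pool.map(worker, job_list)
    print(results)
```

Keeping both call paths in the script (with one commented out) makes it easy to flip between sequential debugging and the full multiprocessing run.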
When changing to the multiple-input-files version, you will not only have to change the code that produces the name of the output files in variable outFC by incorporating the name of the input feature class; you will have to do the same for the name of the temporary layer that is being created by MakeFeatureLayer_management() to make sure that the layer names remain unique. Otherwise, some worker calls will fail because they try to create a layer with a name that is already in use.
To get the basename of a feature class without file extension, you can use a combination of the os.path.basename() and os.path.splitext() functions defined in the os module of the Python standard library. The basename() function will remove the leading path (so e.g., turn "C:\489\data\Roads.shp" into just "Roads.shp"). The expression os.path.splitext(filename)[0] will give you the filename without file extension. So for instance "Roads.shp" will become just "Roads". (Using [1] instead of [0] will give you just the file extension but you won't need this here.)
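Put together, the two functions can be combined like this (the sample paths are just illustrations; forward slashes are used so the sketch also runs outside Windows):

```python
import os

fc_path = "C:/489/data/Roads.shp"   # example input feature class path

filename = os.path.basename(fc_path)       # "Roads.shp" (leading path removed)
basename = os.path.splitext(filename)[0]   # "Roads"
extension = os.path.splitext(filename)[1]  # ".shp"

# e.g., build a unique output name from the input feature class name:
outFC = "C:/489/output/" + basename + "_clipped.shp"
print(basename, extension, outFC)
```

The same basename can then also be incorporated into the temporary layer name to keep layer names unique across worker calls.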
This is not required but if you decide to create a script tool for the multiple-input-files version from step (4) for over and above points, you will have to use the "Multiple value" option for the input parameter you create for the to-be-clipped feature class list in the script tool interface. If you then use GetParameterAsText(...) for this parameter in your code, what you will get is a single string(!) with the names of the feature classes the user picked separated by semicolons, not a list of name strings. You can then either use the string method split(...) to turn this string into a list of feature class names or you use GetParameter(...) instead of GetParameterAsText(...) which will directly give you the feature class names as a list.
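Since GetParameterAsText() cannot be run outside ArcGIS here, the sketch below simply simulates the semicolon-separated string such a "Multiple value" parameter returns and splits it into a list; the feature class paths are made up for illustration:

```python
# Simulated return value of arcpy.GetParameterAsText(...) for a
# "Multiple value" parameter -- a single semicolon-separated string:
param_text = "C:/489/data/Roads.shp;C:/489/data/Rivers.shp;C:/489/data/Lakes.shp"

# Turn the single string into a list of feature class names:
feature_classes = param_text.split(";")
print(feature_classes)
# prints: ['C:/489/data/Roads.shp', 'C:/489/data/Rivers.shp', 'C:/489/data/Lakes.shp']
```

With GetParameter(...) instead, the split step would be unnecessary because the parameter is returned directly as a list.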
Submit a single .zip file to the corresponding drop box on Canvas; the zip file should contain:
Links
[1] https://www.e-education.psu.edu/geog489/sites/www.e-education.psu.edu.geog489/files/downloads/USA.gdb.zip
[2] https://www.e-education.psu.edu/geog489/sites/www.e-education.psu.edu.geog489/files/downloads/qcachegrind074-32bit-x86.zip
[3] https://www.visualstudio.com/downloads/
[4] https://www.e-education.psu.edu/geog489/node/2247
[5] https://medium.com/free-code-camp/a-beginner-friendly-guide-to-unicode-d6d45a903515
[6] https://docs.python.org/3.0/whatsnew/3.0.html
[7] http://sebastianraschka.com/Articles/2014_python_2_3_key_diff.html
[8] http://pro.arcgis.com/en/pro-app/arcpy/get-started/python-migration-for-arcgis-pro.htm
[9] http://pro.arcgis.com/en/pro-app/tool-reference/appendices/unavailable-tools.htm
[10] https://www.e-education.psu.edu/geog489/l1_p5.html
[11] https://docs.python.org/3/reference/expressions.html#operator-precedence
[12] https://www.python-course.eu/python3_formatted_output.php
[13] https://www.python-course.eu/python3_passing_arguments.php
[14] https://community.esri.com/docs/DOC-12021-python-at-arcgispro-22
[15] http://desktop.arcgis.com/en/arcmap/10.3/tools/data-management-toolbox/analyzetoolsforpro.htm
[16] http://pro.arcgis.com/en/pro-app/get-started/faq.htm#anchor25
[17] https://www.e-education.psu.edu/geog489/sites/www.e-education.psu.edu.geog489/files/downloads/FoxLake.zip
[18] http://pro.arcgis.com/en/pro-app/arcpy/geoprocessing_and_python/understanding-validation-in-script-tools.htm
[19] http://www.pasda.psu.edu/uci/DataSummary.aspx?dataset=5037
[20] https://github.com/obrienja-RF/DownloadUnzipLidarWWW/blob/main/DownloadUnzipLiDARwww.py
[21] https://www.e-education.psu.edu/geog489/node/2265
[22] http://desktop.arcgis.com/en/arcmap/10.5/analyze/executing-tools/64bit-background.htm
[23] https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/windows-commands
[24] https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/cd
[25] https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/dir
[26] https://www.e-education.psu.edu/geog489/node/2249
[27] https://docs.python.org/3/library/multiprocessing.html
[28] http://hadoop.apache.org/
[29] https://aws.amazon.com/elasticmapreduce/
[30] https://www.mongodb.org/
[31] https://cassandra.apache.org/
[32] https://www.e-education.psu.edu/geog865/node/25
[33] http://www.e-education.psu.edu/geog485/node/242
[34] https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/taskkill
[35] https://www.e-education.psu.edu/geog489/node/2269
[36] https://docs.python.org/3/howto/logging-cookbook.html
[37] https://www.e-education.psu.edu/geog489/node/2282
[38] https://www.e-education.psu.edu/geog489/sites/www.e-education.psu.edu.geog489/files/Lesson1_Assignment_initial_code.zip
[39] https://www.datacamp.com/community/tutorials/pickle-python-tutorial
[40] https://www.e-education.psu.edu/geog489/node/2293
[41] https://www.e-education.psu.edu/geog485/node/316
[42] https://docs.spyder-ide.org/current/panes/debugging.html
[43] https://docs.spyder-ide.org/current/panes/profiler.html
[44] https://docs.python.org/3/library/profile.html#module-cProfile
[45] https://pypi.org/project/pyprof2calltree/
[46] https://docs.python.org/3/installing/index.html
[47] http://pro.arcgis.com/en/pro-app/arcpy/get-started/installing-python-for-arcgis-pro.htm
[48] https://www.e-education.psu.edu/geog489/node/2258
[49] https://www.e-education.psu.edu/geog489/node/2260
[50] https://www.e-education.psu.edu/geog489/sites/www.e-education.psu.edu.geog489/files/downloads/CherryO.profile.pdf
[51] https://www.e-education.psu.edu/geog489/sites/www.e-education.psu.edu.geog489/files/downloads/Lesson1B_basic_raster_for_mp.py.txt
[52] https://www.e-education.psu.edu/geog489/sites/www.e-education.psu.edu.geog489/files/downloads/Lesson1B_basic_raster_for_mp_profile.txt
[53] https://en.wikipedia.org/wiki/Version_control
[54] https://en.wikipedia.org/wiki/Git
[55] https://github.com/kansasgis/GithubWebinar_2015
[56] https://desktop.github.com/
[57] https://github.com/
[58] https://guides.github.com/activities/hello-world/
[59] https://github.com/kansasgis/NG911/commits/master
[60] https://help.github.com/articles/resolving-a-merge-conflict-on-github/
[61] https://help.github.com/articles/resolving-a-merge-conflict-using-the-command-line/
[62] https://github.com/qgis/QGIS
[63] https://github.com/esri
[64] https://github.com/Microsoft
[65] https://guides.github.com/features/mastering-markdown/
[66] https://help.github.com/articles/about-gists/
[67] https://github.com/lyzidiamond/learn-geojson/blob/master/geojson/cupcakes.geojson
[68] https://wiki.python.org/moin/PythonWin
[69] https://sourceforge.net/projects/pyscripter
[70] https://github.com/pyscripter/pyscripter
[71] https://www.jetbrains.com/pycharm/download/#section=windows
[72] https://docs.microsoft.com/en-us/visualstudio/python/installing-python-support-in-visual-studio
[73] http://www.pydev.org/
[74] https://eric-ide.python-projects.org/
[75] https://pypi.python.org/pypi/spyder
[76] https://en.wikipedia.org/wiki/Integrated_development_environment
[77] https://support.esri.com/en/technical-article/000013224
[78] http://pro.arcgis.com/en/pro-app/arcpy/get-started/debugging-python-code.htm