Lesson 2: Open Source Data

Hopefully lesson 1 wasn’t too hard and you learned something new. Lesson 2 will take a look at how to use Python to retrieve data in many different formats and translate it to other formats. This process is known by an acronym ETL, or Extract, Translate, and Load and is a large part of any script that works with data. Once you get the hang of working through data structures in Python, it opens up many possibilities for task automation and efficiency improvements.

This lesson is more relaxed than the first, but still covers many topics. You will learn how to read and parse a CSV, access data in multiple database flavors, parse KML’s and KMZ's, convert structured data such as featureclasses to dictionaries, and lastly, how to use python to retrieve open source data from the web.

We will also take a look at ESRI’s GIS module from their ArcGIS for Python API. This portion was designed for Jupyter Notebooks, but the code can be ran in PyScripter as a script. If you are interested in stepping through the material within a Notebook, I encourage you to reference the 3rd lesson in Penn State's full GEOG 489 course [1]. The reason we are not pursuing Notebook here, is that the environment can take a very long time to set up and involve a lot of troubleshooting if it does not resolve.

This is not required for this lesson, but a simple and fast way of accessing a Jupyter Notebook through ArcGIS Pro can be done follwoing these instructions How to use ArcGIS Notebooks in ArcGIS Pro [2]. Note that we will be going over Jupyter Notebooks in lesson 4.

Overview and Checklist

We will look at data structures and how you can use Python to transform it from one structure to another. We will look at how to access data on the web from within a Python program, and introduce some python packages that help with Extracting, Transforming, and Loading (ETL) data from CSV’s, SQL, JSON, and KML/KMZ’s.

Learning Outcomes

By the end of this lesson, you should be able to:

Traverse data formats in different formats
Use cursors to iterate over data
Access and work with data from the web
Access GIS data within an Enterprise or AGOL.

Lesson Roadmap

To finish this lesson, you must complete the activities listed below. You may find it useful to print this page out first so that you can follow along with the directions.

Steps for Completing Lesson 2
Step	Activity	Access/Directions
1	Engage with Lesson 2 Content	Begin with 2.1 Reading and parsing text using the Python csv module.
2	Programming Assignment and Reflection	Submit your code for the programming assignment and 400 words write-up with reflections
3	Quiz 2	Complete the Lesson 2 Quiz
4	Questions/Comments	Remember to visit Canvas to post/answer any questions or comments pertaining to Lesson 2

Downloads

The following is a list of datasets that you will be prompted to download through the course of the lesson. They are divided into two sections: Datasets that you will need for the assignments and Datasets used for the content and examples in the lesson.

Required:

None for this lesson!

Suggested:

GPS_track.txt [3] <- opens a text file in the browser.
KML and KMZ sample files [4]
Pennsylvania State Data [5]

Assignments

In this homework assignment, the task is to use requests and any other packages you may need to query the service and parse the features out of the returned JSON and save as a csv file or shapefile. You can select a service from esri, you can be adventurous and select a service from the Pennsylvania Spatial Data Access PASDA hub, or test your skill and see if you can extract vector data from a service of your choosing.

The url for the services are:

Pennsylvania Spatial Data Access (PASDA) https://www.pasda.psu.edu/ [7]

Hint to return all features of the service, you still need to provide a valid sql query. For this assignment, you can use OBJECTID IS NOT NULL or 1=1 for the where clause. If you want to try other sql statements for other results, please feel free to do so.

If you are having issues with the url string, you can use the service query UI to see how it should be formatted.

Write-up

Produce a 400-word write-up on how the assignment went for you; reflect on and briefly discuss the issues and challenges you encountered and what you learned from the assignment.

Deliverable

Submit a single .zip file to the corresponding drop box on Canvas; the zip file should contain:

• The script

• Your 400-word write-up

Questions?

If you have any questions, please send a message through Canvas. I will check daily to respond. If your question is one that is relevant to the entire class, I may respond to the entire class rather than individually.

Open Source Data

One of the best ways to increase your effectiveness as a GIS programmer is to learn how to manipulate text-based information. GIS data is often collected and shared in more "raw" formats such as a spreadsheet in CSV (comma-separated value), a list of coordinates in a text file, or a response received through a web service such as XML, JSON or GEOJSON.

When faced with these files, you should first understand if your GIS software already comes with a tool or script that can read or convert the data to a format it can use. If no tool or script exists, you'll need to do some programmatic work to read the file and separate out the pieces of text that you really need.

For example, a Web service may return you many lines of XML describing all the readings at a weather station, when all you're really interested in are the coordinates of the weather station and the annual average temperature. Parsing the response involves writing some code to read through the lines and tags in the XML and isolating only those three values. Similarly, many APIs provide data in a JSON (Javascript Serialized Object Notation) format and parsing the response includes accessing the desired keys. JSON and GeoJSON both have documentation: JSON documentation [8] and GeoJSON documentation [9]. There are some differences between the two and require the use of different packages.

There are several different approaches to parsing. Usually, the wisest is to see if some Python module exists that will examine the text for you and turn it into an object that you can then work with. In this lesson, you will work with the Python "csv" module that can read comma-delimited values and turn them into a Python list. Other helpful libraries such as this include lxml and xml.dom for parsing XML, BeautifulSoup for parsing HTML, and the built in module json for working with JSON.

If a module or library doesn't exist that fits your parsing needs, then you'll have to extract the information from the data yourself using a combinations of Python's string manipulation methods and packages. It is common to translate the structure from one package to another or one type to another to be able to extract what you need. One of the most helpful string manipulation methods is string.split(), which turns a big string into a list of smaller strings based on some delimiting character, such as a space or comma. When you write your own parser, however, it's hard to anticipate all the exceptional cases you might run across. For example, sometimes a comma-separated value file might have substrings that naturally contain commas, such as dates or addresses. In these cases, splitting the string using a simple comma as the delimiter is not sufficient and you need to add extra logic or use Regex.

text = "green,red,blue"
text.split(",")

['green', 'red', 'blue']

Another pitfall when parsing is the use of "magic numbers" to slice off a particular number of characters in a string, to refer to a specific column number in a spreadsheet, and so on. If the structure of the data changes, or if the script is applied to data with a slightly different structure, the code could be rendered inoperable and would require some precision surgery to fix. People who read your code and see a number other than 0 (to begin a series) or 1 (to increase a counter) will often be left wondering how the number was derived and what it refers to. In programming, numbers other than 0 or 1 are magic numbers that should typically be avoided, or at least accompanied by a comment explaining what the number refers to.

There are an infinite number of parsing scenarios that you can encounter. This lesson will attempt to teach you the general approach by walking through a couple examples.

Lesson content developed by Jan Wallgrun and James O’Brien

The Python csv module

A common text-based data interchange format is the comma-separated value (CSV) file. This is often used when transferring spreadsheets or other tabular data. Each line in the file represents a row of the dataset, and the columns in the data are separated by commas. The file often begins with a header line containing all the field names.

Spreadsheet programs like Microsoft Excel can understand the CSV structure and display all the values in a row-column grid. A CSV file may look a little messier when you open it in a text editor, but it can be helpful to always continue thinking of it as a grid structure. If you had a Python list of rows and a Python list of column values for each row, you could use looping logic to pull out any value you needed. This is exactly what the Python csv module gives you. It's easiest to learn about the csv module by looking at a real example.

Lesson content developed by Jim Detwiler, Jan Wallgrun and James O’Brien

GPS Track csv Parsing Example

This example reads a text file collected from a GPS unit. The lines in the file represent readings taken from the GPS unit as the user traveled along a path. In this section of the lesson, you'll learn one way to parse out the coordinates from each reading. The next section of the lesson uses a variation of this example to show how you could write the user's track to a polyline feature class.

The file for this example is called gps_track.txt and it looks something like the text string shown below. (Please note, line breaks have been added to the file shown below to ensure that the text fits within the page margins. Click on this link to the gps track.txt [3] file to see what the text file actually looks like.)

type,ident,lat,long,y_proj,x_proj,new_seg,display,color,altitude,depth,temp,time,model,filename,ltime 
TRACK,ACTIVE LOG,40.78966141,-77.85948515,4627251.76270444,1779451.21349775,True,False, 
    255,358.228393554688,0,0,2008/06/11-14:08:30,eTrex Venture, ,2008/06/11 09:08:30 
TRACK,ACTIVE LOG,40.78963995,-77.85954952,4627248.40489401,1779446.18060893,False,False, 
    255,358.228393554688,0,0,2008/06/11-14:09:43,eTrex Venture, ,2008/06/11 09:09:43 
TRACK,ACTIVE LOG,40.78961849,-77.85957098,4627245.69008772,1779444.78476531,False,False, 
    255,357.747802734375,0,0,2008/06/11-14:09:44,eTrex Venture, ,2008/06/11 09:09:44 
TRACK,ACTIVE LOG,40.78953266,-77.85965681,4627234.83213242,1779439.20202706,False,False, 
    255,353.421875,0,0,2008/06/11-14:10:18,eTrex Venture, ,2008/06/11 09:10:18 
TRACK,ACTIVE LOG,40.78957558,-77.85972118,4627238.65402635,1779432.89982442,False,False, 
    255,356.786376953125,0,0,2008/06/11-14:11:57,eTrex Venture, ,2008/06/11 09:11:57 
TRACK,ACTIVE LOG,40.78968287,-77.85976410,4627249.97592111,1779427.14663093,False,False, 
    255,354.383178710938,0,0,2008/06/11-14:12:18,eTrex Venture, ,2008/06/11 09:12:18 
TRACK,ACTIVE LOG,40.78979015,-77.85961390,4627264.19055204,1779437.76243578,False,False, 
    255,351.499145507813,0,0,2008/06/11-14:12:50,eTrex Venture, ,2008/06/11 09:12:50 
etc. ...

Notice that the file starts with a header line, explaining the meaning of the values contained in the readings from the GPS unit. Each subsequent line contains one reading. The goal for this example is to create a Python list containing the X,Y coordinates from each reading. Specifically, the script should be able to read the above file and print a text string like the one shown below.

[['-77.85948515', '40.78966141'], ['-77.85954952', '40.78963995'], ['-77.85957098', '40.78961849'], etc.]

Approach for parsing the GPS track

Before you start parsing a file, it's helpful to outline what you're going to do and break up the task into manageable chunks. Here's some pseudocode for the approach we'll take in this example:

Open the file.
Read the header line.
Loop through the header line to find the index positions of the "lat" and "long" values.
Read the rest of the lines one by one.
Find the values in the list that correspond to the lat and long coordinates and write them to a new list.

Importing the module

When you work with the csv module, you need to explicitly import it at the top of your script, just like you do with arcpy.

import csv

You don't have to install anything special to get the csv module; it just comes with the base Python installation.

Opening the file and creating the CSV reader

The first thing the script needs to do is open the file. Python contains a built-in open() method [10] for doing this. The parameters for this method are the path to the file and the mode in which you want to open the file (read, write, etc.). In this example, "r" stands for read-only mode. If you wanted to write items to the file, you would use "w" as the mode. The open() method is commonly used within a "with" statement, like cursors were instantiated in the previous lesson, for much the same reason: it simplifies "cleanup." In the case of opening a file, using "with" is done so that the file is closed automatically when execution of the "with" block is completed. A close() method does exist, but need not be called explicitly.

with open("C:\\data\\Geog485\\gps_track.txt", "r") as gpsTrack:

Notice that your file does not need to have the extension .csv in order to be read by the CSV module. It can be suffixed .txt as long as the text in the file conforms to the CSV pattern where commas separate the columns and carriage returns separate the rows. Once the file is open, you create a CSV reader object, in this manner:

csvReader = csv.reader(gpsTrack)

This object is kind of like a cursor. You can use the next() method to go to the next line, but you can also use it with a for loop to iterate through all the lines of the file. Note that this and the following lines concerned with parsing the CSV file must be indented to be considered part of the "with" block.

Reading the header line

The header line of a CSV file is different from the other lines. It gets you the information about all the field names. Therefore, you will examine this line a little differently than the other lines. First, you advance the CSV reader to the header line by using the next() method, like this:

header = next(csvReader)

This gives you back a Python list of each item in the header. Remember that the header was a pretty long string beginning with: "type,ident,lat,long...". The CSV reader breaks the header up into a list of parts that can be referenced by an index number. The default delimiter, or separating character, for these parts is the comma. Therefore, header[0] would have the value "type", header[1] would have the value "ident", and so on.

We are most interested in pulling latitude and longitude values out of this file, therefore we're going to have to take note of the position of the "lat" and "long" columns in this file. Using the logic above, you would use header[2] to get "lat" and header[3] to get "long". However, what if you got some other file where these field names were all in a different order? You could not be sure that the column with index 2 represented "lat" and so on.

A safer way to parse is to use the list.index() method and ask the list to give you the index position corresponding to a particular field name, like this:

latIndex = header.index("lat")
lonIndex = header.index("long")

In our case, latIndex would have a value of 2 and lonIndex would have a value of 3, but our code is now flexible enough to handle those columns in other positions.

Processing the rest of the lines in the file

The rest of the file can be read using a loop. In this case, you treat the csvReader as an iterable list of the remaining lines in the file. Each run of the loop takes a row and breaks it into a Python list of values. If we get the value with index 2 (represented by the variable latIndex), then we have the latitude. If we get the value with index 3 (represented by the variable lonIndex), then we get the longitude. Once we get these values, we can add them to a list we made, called coordList:

# Make an empty list
coordList = []

# Loop through the lines in the file and get each coordinate
for row in csvReader:
    lat = row[latIndex]
    lon = row[lonIndex]
    coordList.append([lat,lon])

# Print the coordinate list
print (coordList)

Note a few important things about the above code:

coordList actually contains a bunch of small lists within a big list. Each small list is a coordinate pair representing the x (longitude) and y (latitude) location of one GPS reading.
The list.append() method is used to add items to coordList. Notice again that you can append a list itself (representing the coordinate pair) using this method.

Full code for the example

Here's the full code for the example. Feel free to download the text file [11] and try it out on your computer.

# This script reads a GPS track in CSV format and
#  prints a list of coordinate pairs
import csv

# Open the input file 
with open(r"C:\NGA\Lesson 2\gps_track.txt", "r") as gpsTrack:
    #Set up CSV reader and process the header
    csvReader = csv.reader(gpsTrack)
    header = next(csvReader)
    latIndex = header.index("lat")
    lonIndex = header.index("long")
 
    # Make an empty list
    coordList = []
 
    # Loop through the lines in the file and get each coordinate
    for row in csvReader:
        lat = row[latIndex]
        lon = row[lonIndex]
        coordList.append([lat,lon])
 
    # Print the coordinate list
    print (coordList)

Applications of this script

You might be asking at this point, "What good does this list of coordinates do for me?" Admittedly, the data is still very "raw." It cannot be read directly in this state by a GIS. However, having the coordinates in a Python list makes them easy to get into other formats that can be visualized. For example, these coordinates could be written to points in a feature class, or vertices in a polyline or polygon feature class. The list of points could also be sent to a Web service for reverse geocoding, or finding the address associated with each point. The points could also be plotted on top of a Web map using programming tools like the ArcGIS JavaScript API. Or, if you were feeling really ambitious, you might use Python to write a new file in KML format, which could be viewed in 3D in Google Earth.

Summary

Parsing any piece of text requires you to be familiar with file opening and reading methods, the structure of the text you're going to parse, the available parsing modules that fit your text structure, and string manipulation methods. In the preceding example, we parsed a simple text file, extracting coordinates collected by a handheld GPS unit. We used the csv module to break up each GPS reading and find the latitude and longitude values. In the next section of the lesson, you'll learn how you could do more with this information by writing the coordinates to a polyline dataset.

As you use Python in your GIS work, you could encounter a variety of parsing tasks. As you approach these, don't be afraid to seek help from Internet examples, code reference topics such as the ones linked to in this lesson, and your textbook.

Lesson content developed by Jim Detwiler, Jan Wallgrun and James O’Brien

Database Query With Cursors

Before we get too deep into vector data access, it's going to be helpful to quickly review how the vector data is stored in the software. Vector features in ArcGIS feature classes (remember, including shapefiles) are stored in a table. The table has rows (records) and columns (fields).

Fields in the table

Fields in the table store the geometry and attribute information for the features.

There are two fields in the table that you cannot delete. One of the fields (usually called Shape) contains the geometry information for the features. This includes the coordinates of each vertex in the feature and allows the feature to be drawn on the screen. The geometry is stored in binary format; if you were to see it printed on the screen, it wouldn't make any sense to you. However, you can read and work with geometries using objects that are provided with arcpy.

The other field included in every feature class is an object ID field (OBJECTID or FID). This contains a unique number, or identifier for each record that is used by ArcGIS to keep track of features. The object ID helps avoid confusion when working with data. Sometimes records have the same attributes. For example, both Los Angeles and San Francisco could have a STATE attribute of 'California,' or a USA cities dataset could contain multiple cities with the NAME attribute of 'Portland;' however, the OBJECTID field can never have the same value for two records.

The rest of the fields contain attribute information that describe the feature. These attributes are usually stored as numbers or text.

Discovering field names

When you write a script, you'll need to provide the names of the particular fields you want to read and write. You can get a Python list of field names using arcpy.ListFields().

# Reads the fields in a feature class

import arcpy

featureClass = r"C:\Data\USA\USA.gdb\Cities"
fieldList = arcpy.ListFields(featureClass)
# Loop through each field in the list and print the name
for field in fieldList:
    print (field.name)

The above would yield a list of the fields in the Cities feature class in a file geodatabase named USA.gdb. If you ran this script in PyScripter (try it with one of your own feature classes!) you would see something like the following in the IPython Console.

OBJECTID
Shape
UIDENT
POPCLASS
NAME
CAPITAL
STATEABB
COUNTRY

Notice the two special fields we already talked about: OBJECTID, which holds the unique identifying number for each record, and Shape, which holds the geometry for the record. Additionally, this feature class has fields that hold the name (NAME), the state (STATEABB), whether or not the city is a capital (CAPITAL), and so on.

arcpy treats the field as an object. Therefore the field has properties that describe it. That's why you can print field.name. The help reference topic Using fields and indexes [12] lists all the properties that you can read from a field. These include aliasName, length, type, scale, precision, and others.

Properties of a field are read-only, meaning that you can find out what the field properties are, but you cannot change those properties in a script using the Field object. If you wanted to change the scale and precision of a field, for instance, you would have to programmatically add a new field.

Lesson content developed by Jim Detwiler, Jan Wallgrun and James O’Brien

Reading Through Records

Let’s examine how to read up and down through the table records within Featureclasses and tables.

The arcpy module contains some objects called cursors that allow you to move through records in a table. At version 10.1 of ArcMap, Esri released a new data access module, which offered faster performance along with more robust behavior when crashes or errors were encountered with the cursor. The module contains different cursor classes for the different operations a developer might want to perform on table records -- one for selecting, or reading, existing records; one for making changes to existing records; and one for adding new records. We'll discuss the editing cursors later, focusing our discussion now on the cursor used for reading tabular data, the search cursor.

As with all geoprocessing classes, the SearchCursor class is documented in the Help system [13]. (Be sure when searching the Help system that you choose the SearchCursor class found in the Data Access module. An older, pre-10.1 class is available through the arcpy module and appears in the Help system as an "ArcPy Function.")

The common workflow for reading data with a search cursor is as follows:

Create the search cursor. This is done through the method arcpy.da.SearchCursor(). This method takes several parameters in which you specify which dataset and, optionally, which specific rows you want to read as a sql filter, and some limited sorting functions.
Set up a loop that will iterate through the rows until there are no more available to read.
Within the loop, do something with the values in the current row.

Here's a very simple example of a search cursor that reads through a point dataset of cities and prints the name of each.

# Prints the name of each city in a feature class

import arcpy

featureClass = r"C:\Data\USA\USA.gdb\Cities"

with arcpy.da.SearchCursor(featureClass,["NAME"]) as cursor:
    for row in cursor:
        print (row[0])

Important points to note in this example:

The cursor is created using a "with" statement. Although the explanation of "with" is somewhat technical, the key thing to understand is that it allows your cursor to exit the dataset gracefully, whether it crashes or completes its work successfully. This is a big issue with cursors, which can sometimes maintain locks on data if they are not exited properly.
The "with" statement requires that you indent all the code beneath it. After you create the cursor in your "with" statement, you'll initiate a for loop to run through all the rows in the table. This requires additional indentation.
Creating a SearchCursor object requires specifying not just the desired table/feature class, but also the desired fields within it, as a list. Supplying this list speeds up the work of the cursor because it does not have to deal with the potentially dozens of fields included in your dataset. In the example above, the list contains just one field, "NAME".
The "with" statement creates a SearchCursor object, and declares that it will be named "cursor" in any subsequent code.
A for loop is used to iterate through the SearchCursor (using the "cursor" name assigned to it). Just as in a loop through a list, the iterator variable can be assigned a name of your choice (here, it's called "row" to represent the feature as a “row” of values).
Retrieving values out of the rows is done using the index position of the field name in the tuple you submitted when you created the object. Since the above example submits only one item in the tuple, then the index position of "NAME" within that tuple is 0 (remember that we start counting from 0 in Python).

Here's another example where something more complex is done with the row values. This script finds the average population for counties in a dataset. To find the average, you need to divide the total population by the number of counties. The code below loops through each record and keeps a running total of the population and the number of records counted. Once all the records have been read, only one line of division is necessary to find the average. You can download the sample data [5] for this script.

# Finds the average population in a county dataset
import arcpy

featureClass = r"C:\Data\Pennsylvania\Counties.shp"
populationField = "POP1990"
nameField = "NAME"

average = 0
totalPopulation = 0
recordsCounted = 0

print("County populations:")

with arcpy.da.SearchCursor(featureClass, [nameField, populationField]) as countiesCursor:
    for row in countiesCursor:
        print (row[0] + ": " + str(row[1]))
        totalPopulation += row[1]

        recordsCounted += 1

average = totalPopulation / recordsCounted

print ("Average population for a county is " + str(average))

Differences between this example and the previous one:

The field list includes multiple fields, with their names having been stored in variables near the top of the script.
The SearchCursor object goes by the variable name countiesCursor rather than cursor. It is good practice to indicate what type of cursor you are creating and a simple method of denoting this is to use the first letter of the type of cursor you create. For example,

sCur = SearchCursor

uCur = UpdateCursor

iCur = InsertCursor

Before moving on, you should note that cursor objects have a couple of methods that you may find helpful in traversing their associated records. To understand what these methods do, and to better understand cursors in general, it may help to visualize the attribute table with an arrow pointing at the "current row." When a cursor is first created, that arrow is pointing just above the first row in the table. When a cursor is included in a for loop, as in the above examples, each execution of the for statement moves the arrow down one row and assigns that row's values to the row variable. If the for statement is executed when the arrow is pointing at the last row, there is not another row to advance to and the loop will terminate. (The row variable will be left holding the last row's values.)

SearchCursors are read only and return the attributes as a tuple of values. Remember that tuples are immutable, and you will not be able to assign new values to this tuple “row”.

Imagine that you wanted to iterate through the rows of the cursor a second time. It is best practice to limit nested iterations of the same dataset, and to store the values in a dictionary that could be referenced in the second loop. But for sake of exploring that it can be done, if you were to modify the Cities example above, adding a second loop immediately after the first, you'd see that the second loop would never "get off the ground" because the cursor's internal pointer is still left pointing at the last row. To deal with this problem, you could just re-create the cursor object. However, a simpler solution would be to call on the cursor's reset() method. For example:

cursor.reset()

This will cause the internal pointer (the arrow) to move just above the first row again, enabling you to loop through its rows again.

The other method supported by cursor objects is the next() method, which allows you to retrieve rows without using a for loop. Returning to the internal pointer concept, a call to the next() method moves the pointer down one row and returns the row's values (again, as a tuple). For example:

row = cursor.next()

An alternative means of iterating through all rows in a cursor is to use the next() method together with a while loop. Here is the original Cities example modified to iterate using next() and while:

# Prints the name of each city in a feature class (using next() and while)

import arcpy

featureClass = r"C:\Data\USA\USA.gdb\Cities"

with arcpy.da.SearchCursor(featureClass,("NAME")) as cursor:
    try:
        row = cursor.next()
        while row:
            print (row[0])
            row = cursor.next()
    except StopIteration:
        pass

Points to note in this script:

This approach requires invoking next() both prior to the loop and within the loop.
Calling the next() method when the internal pointer is pointing at the last row will raise a StopIteration exception. The example prevents that exception from displaying an ugly error.

You should find that using a for loop is usually the better approach, and in fact, you won't see the next() method even listed in the documentation of arcpy.da.SearchCursor. We're pointing out the existence of this method because a) older ArcGIS versions have cursors that can/must be traversed in this way, so you may encounter this coding pattern if looking at older scripts, and b) you may want to use the next() method if you're in a situation where you know the cursor will contain exactly one row (or you're interested in only the first row if you have applied some sort of sorting on the results).

Lesson content developed by Jim Detwiler, Jan Wallgrun and James O’Brien

KML, KMZ and Featureclasses

It is a very common need to convert the form of data from one structure to another. Whether it’s data stored as a simple list, CSV, JSON, or XML, it is likely that it will need to be in a different structure in order for it to be consumed or analyzed. For this section, we will look at extracting and translating data from a KML and extracting Featureclass data to a dictionary.

KML and KMZ

A friendly competitor or companion to the shapefile format is the KML or Keyhole Markup Language and it is used to display geographic data in several different GIS software suites. Some sample KML and KMZ files [4] can be downloaded. KML organizes the different features into placemarks, paths, and polygons and groups like features into these categories. When you import a KML into GIS, you will see these three shapefiles, even if there are different features with different attributes. For example, if you have a KML of hotels and soccer field locations, these would be combined within the placemark features. Esri provides some conversion of KML’s but they do not offer the ability to also dissolve on an attribute (separating hotels from soccer fields) during the import. This becomes a multistep process if you want your data compartmentalized by feature type.

There are python packages that provide tools to access the feature’s data and parse them into familiar featureclass or geojson structures that are worth knowing about. One package that is worth knowing is kml2geojson, which extracts the KML data into a geojson format. The steps are mostly the same, but involves writing the JSON to a file in order for esri's JSONToFeatureClass method to read it.

from zipfile import ZipFile
import kml2geojson
import os
import json

outDir = r'C:\NGA\kml kmz data\kml kmz data'
kmz = os.path.join(outDir, r'CurtGowdyStatePark.kmz')

# extract the kml file
with ZipFile(kmz) as kmzFile:
    # extract files
    kmzFile.extractall(outDir)

# convert the doc.kml to a geojson object
kml_json = kml2geojson.main.convert(fr'{outDir}\doc.kml', r'curt_gowdy')

From here, you have a geojson object that you can convert to a Featureclass via JSONToFeatureClass, import into other analytical tools like geopandas/ pandas dataframes, or use in API's like the ArcGIS API for Python. Having the data in this portable format also assists with extracting data out of the KML. For example, if you were tasked with getting a list of all places within the KML, you could use a mapping program like Google Earth, QGIS or ArcGIS Pro to open the dataset and copy out each requested attribute, or you could employ Python and a few other packages to do the work for you now that you know how to parse the KML. For example, getting the place name of each feature by using the geopandas package:

import kml2geojson
import os
import geopandas as gpd

# read in the kml and convert to geojson.
kml_json = kml2geojson.main.convert(fr'{outDir}\CurtGowdyArchery.kml', r'curt_gowdy_archery')

# ETL to geopandas for manipulation/ data extraction
gdf = gpd.GeoDataFrame.from_features(kml_json[0]["features"])

place_names = list(set(gdf['name']))

print(place_names)

The features from the kml2geojson are read into the geopandas dataframe, where you can then use pandas functions to access the values. You could drill down into the features of the kml_json to loop over the list of features, though geopandas does it for us in less code.

Writing the geojson to file can be done using json.dump() method by adding it into our script:

...
# write the converted items to a file, using the name property of the kml_json object
# as the output filename.
with open(fr'{outDir}\{kml_json[0]["name"]}.geojson', 'w') as file:
    json.dump(kml_json, file)

...

ESRI Tools

Esri provides several geoprocessing tools that assist in the ETL of KML’s. KML To Layer [14] converts a .kml or .kmz file into Featureclasses and a symbology layer file. This method creates a file geodatabase and parses the KML features into the respective point, polyline, and polygon featureclasses as part of a Feature Dataset. It uses a layer file to maintain the symbology that is included within the KMZ. The example below takes it a step further and splits the resulting featurclasses into individual featureclasses based on the attributes.

import os

outDir = r'C:\NGA\kml kmz data'
file_name = 'CurtGowdyStatePark'
kmz = os.path.join(outDir, f'{file_name}.kmz')

# extract the kml file
with ZipFile(kmz) as kmzFile:
    # extract files
    kmzFile.extractall(outDir)

# convert the kml/kmz to featureclasses in a gdb.
arcpy.conversion.KMLToLayer(fr"{outDir}\doc.kml", fr"{outDir}\{file_name}", file_name)

# Change the workspace to the gdb created by KMLToLayer. The method creates a file geodatabase named .gdb
arcpy.env.workspace = fr'{outDir}\{file_name}\{file_name}.gdb'

# get the featureclasses created by the KMLToLayer method
fcs = [fc for fc in arcpy.ListFeatureClasses(feature_dataset='Placemarks')]

# Set the fields that will be used for splitting the geometry into named featureclasses
# Multiple fields can be used, ie ['Name', 'SymbolId']
splitDict = {'Points': 'SymbolID',
             'Polylines': 'Name',
             'Polygons': 'Name'}

# iterate over featureclasses and execute the split by attributes.
for fc in fcs:
    arcpy.SplitByAttributes_analysis(fc, arcpy.env.workspace, splitDict.get(fc))

Featureclass to Dictionary

When working with Featureclasses and Tables, it is necessary to sometimes compare one dataset to another and make updates. You can do this with nested cursors, but it can get confusing, circular, and costly regarding speed and performance. It is better to read one dataset into an organized structure and operate against it than trying to nest cursors. I'll provide some methods of creating dictionaries and then demonstrate how you could use it.

A simple way to avoid nested cursors is to load all the data into a dictionary using a search cursor. Then use dictionary lookups to retrieve new data. For example, this code below creates a dictionary of attribute values from a search cursor.

fc = r'C:\NGA 489\USA.gdb\Cities' 

# Create a dictionary of fields and values, using the objectid as key
# Get a list of field names from the Featureclass.
fields = [fld.name for fld in arcpy.ListFields(fc)] 

# Create empty dictionary
fc_dict = {} 

# Start the search cursor
with arcpy.da.SearchCursor(fc, fields) as sCur: 
    for row in sCur: 
        # Add the attribute values from the cursor to the avalache_dict using the OBJECTID as the key
        fc_dict[row[fields.index('OBJECTID')]] = {k: v for k, v in zip(sCur.fields, row)}

This code converts all features into a dictionary using a field (TYPE) as the key, and adds the feature's UIDENTvalue to a list to create a group of UIDENT values for each TYPE.

fc = r'C:\NGA 489\USA.gdb\Hydro' 
fields = ['OBJECTID', 'TYPE', 'HYDRO_', 'UIDENT'] 

fc_dct = {} 

with arcpy.da.SearchCursor(fc, fields) as sCur: 
    for row in sCur: 
        if dct.get(row[fields.index('TYPE')]): # gets the list index of ‘TYPE’ from the fields list as index of row 
            # Append the UIDENT value to the list
            fc_dct[row[fields.index('TYPE')]].append(row[3]) 
        else: 
            # Create the list for the accountno and add the space value.
            fc_dct[row[fields.index('TYPE')]] = [row[3]]

This example creates a dictionary of dictionaries using the OBJECTID as the key and {field:value, …} for all features in dictionary comprehension. We will discuss the list and dictionary comprehension construct in more detail in lesson 3.

fc = r'C:\NGA 489\USA.gdb\Hydro'
fields = [f.name for f in arcpy.ListFields(fc)] 

# Create a dictionary of fields and values, using the objectid as key
sCur = arcpy.da.SearchCursor(fc, fields)

fc_dct = {row[fields.index('OBJECTID')]: {k: v for k, v in zip(sCur.fields, row)} for row in sCur} # zip is a built in method that creates a field:value, one to one representation of the field to the value in the row, like a zipper.

Once the data is in the dictionary structure, you can iterate over the dictionary within an Update or Insert cursor to save the new data. For example:

# Create InsertCursor to add new features to a Featureclass

fc = r'C:\NGA 489\USA.gdb\Hydro'
# filter out the objectid field since it is autogenerated
fields = [f.name for f in arcpy.ListFields(fc) if f.name not in ['OBJECTID']] 
with arcpy.da.InsertCursor(fc, fields) as iCur: 
    # Iterate over the dictionary items.  Note that the dictionary may be in the { objectid : {field: value, ...}, ...} 
    # format so there is a need to get the inner dictionary of fields and values 
    for k, v in fc_dct.items(): 
        # k = objectid
        # v = {field: value, ...}
        # get list of values in the order of the cursor fields.
        iRow = [v.get(f) for f in iCur.fields]
        iCur.insertRow(iRow)

or as an Update and iterating over the features from the cursor and matching to the key in the dictionary by OBJECTID:

with arcpy.da.UpdateCursor(fc, [‘OBJECTID’, 'TYPE', 'UIDENT']) as uCur: 
    for row in uCur: 
        vals = fc_dct.get(row[0]) 
        if vals: 
            row[1] = vals['TYPE'] 
            row[2] = vals['UIDENT'] 
            uCur.updatetRow(row)

Accessing and Working with Web Data

There is a wealth of geographic (and other) information available out there on the web in the form of web pages and web services, and sometimes we may want to make use of this information in our Python programs. In the first walkthrough of this lesson, we will access two web services from our Python code that allow us to retrieve information about places based on the places’ names. Another common web-based programming task is scraping the content of web pages with the goal of extracting certain pieces of information from them, for instance links leading to other pages. In this section, we are laying the foundation to perform such tasks in Python by showing some examples of working with URLs and web requests using the urllib and requests packages from the standard Python library, and the BeautifulSoup4 (bs4) package, which is a 3rd party package that you will have to install.

Urllib and Requests

Urllib in Python 3 consists of the three main modules: urllib.requests for opening and reading URLs, urllib.error defining the exceptions that can be raised, and urllib.parse for parsing URLs. It is quite comprehensive and includes many useful auxiliary functions for working with URLs and communicating with web servers, mainly via the HTTP protocol. Nevertheless, we will only use it to access a web page in this first example here, and then we will switch over to using the requests package instead, which is more convenient to use for the high-level tasks we are going to perform.

In the following example, we use urllib to download the start page from Lesson 1 of the 10-week course:

import urllib.request 

url = "https://www.e-education.psu.edu/geog489/l1.html" 
response = urllib.request.urlopen(url) 
htmlCode = response.read() 
print(htmlCode)

After importing the urllib.request module, we define the URL of the page we want to access in a string variable. Then in line 4, we use function urlopen(…) of urllib to send out an HTTP request over the internet to get the page whose URL we provide as a parameter. After a successful request, the response object returned by the function will contain the html code of the page and we can access it via the read() method (line 5). If you run this example, you will see that the print statement in the last line prints out the raw html code of the Lesson 1 start page.

Here is how the same example looks using the requests package rather than urllib:

import requests 

url = "https://www.e-education.psu.edu/geog489/l1.html" 
response = requests.get(url) 
htmlCode = response.text 
print(htmlCode)

As you can see, for this simple example there really isn’t a big difference in the code. The function used to request the page in line 4 is called get(…) in requests and the raw html code can be accessed by using a property called text of the response object in line 5 not a method, that’s why there are no parentheses after text.

The most common things returned by a single web request, at least in our domain, are:

html code
plain text
an image (e.g. JPEG or PNG)
XML code
JSON code

Most likely you are at least somewhat familiar with html code and how it uses tags to hierarchically organize the content of a page including semantic and meta information about the content as well as formatting instructions. Most common browsers like Chrome, Firefox, and Edge have some tools to inspect the html code of a page in the browser. Open the first lesson page [15] in a new browser window and then do a right-click -> Inspect (element) on the first bullet point for “1.1 Overview and Checklist” in the middle of the window. That should open up a window in your browser showing the html code with the part that produces this line with the link to the Section 1.1 web page highlighted as in the figure below.

screenshot of code, see caption and surrounding text

Figure 2.1 Lesson 1 start page html code as shown when using the Inspect function of the browser

The arrows indicate the hierarchical organization of the html code, the so-called Document Object Model (DOM) [16], and can be used to unfold/fold in part of the code. Also note how most html tags (‘body’,‘div’, ‘a’, ‘span’, etc.) have an attribute “id” that defines a unique ID for that element in the document as well as an attribute “class” which declares the element to be of one or several classes (separated by spaces) that, for instance, affect how the element will be formatted. We cannot provide an introduction to html and DOM here but this should be enough background information to understand the following examples. (These topics are addressed in more detail in our GEOG 863 class [17].)

Unless our program contains a browser component for displaying web pages, we are typically downloading the html code of a web page because we are looking for very specific information in that code. For this, it is helpful to first parse the entire html code and create a hierarchical data structure from it that reflects the DOM structure of the html code and can be used to query for specific html elements in the structure to then access their attributes or content. This is exactly what BeautifulSoup does.

Go ahead and install the beautifulsoup4 package in the Python Package Manager of ArcGIS Pro as you did with PyScripter in Section 1.5. Once installed, BeautifulSoup will be available under the module name bs4. The following example shows how we can use it to access the <title> element of the html document:

import requests 
from bs4 import BeautifulSoup 

url = "https://www.e-education.psu.edu/geog489/l1.html" 
response = requests.get(url) 
soup = BeautifulSoup(response.text, 'html.parser') 

print(soup.find('title'))

Output:
<title>Lesson 1 Python 3, ArcGIS Pro & Multiprocessing | GEOG 489: GIS Application Development</title>

In line 6, we are taking the raw html code from response.text and create a BeautifulSoup object from it using an html parser and store it in variable soup. Parsing the html code and creating the hierarchical data structure can take a few seconds. We then call the find(…) method to get the element demarcated by the title tags <title>…</title> in the document. This works fine here for <title> because an html document only contains a single <title> tag. If used with other tags, find(…) will always return only the first element, which may not be the one we are looking for.

However, we can provide additional attributes like a class or id for the element we are looking for. For instance, the following command can be used to get the link element (= html tag <a>) that is of the class “print-page”:

print(soup.find('a', attrs = {'class': 'print-page'}))

The output will start with <a class=”print-page” href…” and include the html code for all child elements of this <a> element. The “attrs” keyword argument takes a dictionary that maps attribute names to expected values. If we don’t want to print out all this html code but just a particular attribute of the found element, we can use the get(…) method of the object returned by find(…), for instance with ‘href’ for the attribute that contains the actual link URL:

element = soup.find('a', attrs = {'class': 'print-page'}) 
print(element.get('href'))

Output: 
https://www.e-education.psu.edu/geog489/print/book/export/html/1703

You can also get a list of all elements that match the given criteria, not only the first element, by using the method find_all(…) instead of find(…). But let’s instead look at another method that is even more powerful, the method called select(…). Let’s say what we really want to achieve with our code is extract the link URLs for all the pages linked to from the content list on the page. If you look at the highlighted part in the image above again, you will see that the <a> tags for these links do not have an id or class attribute to distinguish them from other <a> tags appearing in the document. How can we unambiguously characterize these links?

What we can say is that these are the links that are formed by a <a> tag within a <li> element within a <ul> element within a <div> element that has the class “book-navigation”. This condition is only satisfied by the links we are interested in. With select(…) we can perform such queries by providing a string that describes these parent-child relationships:

elementList = soup.select('div.book-navigation > ul > li > a') 
for e in elementList: 
	print(e.get('href'))

Output: 
/geog/489/l1_p1.html 
/geog/489/l1_p2.html 
/geog/489/l1_p3.html 
…

The list produced by the code should consist of ten URLs in total. Note how in the string given to select(…) the required class for the <div> element is appended with a dot and how the > symbol is used to describe the parent-child relationships along the chain of elements down to the <a> elements we are interested in. The result is a list of elements that match this condition and we loop through that list in line 2 and print out the “href” attribute of each element to display the URLs.

One final example showing the power of BeautifulSoup: The web page www.timeanddate.com [18], among other things, allows you to look up the current time for a given place name by directly incorporating country and place name into the URL, e.g.

http://www.timeanddate.com/worldclock/usa/state-college [19]

… to get a web page showing the current time in State College, PA. Check out the web page returned by this request and use right-click -> Inspect (element) again to check how the digital clock with the current time for State College is produced in the html code. The highlighted line contains a <span> tag with the id “ct”. That makes it easy to extract this information with the help of BeautifulSoup. Here is the full code for this:

import requests 
from bs4 import BeautifulSoup 

url = "http://www.timeanddate.com/worldclock/usa/state-college" 

response = requests.get(url) 
soup = BeautifulSoup(response.text, 'html.parser') 
time = soup.find('span', attrs= { 'id': 'ct'}) 

print('Current time in State College: ' + time.text)

Output: 
Current time in State College: 13:32:28

Obviously, the exact output depends on the time of day you run the code. Please note that in the last line we use time.text to get the content of the <span> tag found, which is what appears between the <span> and </span> tags in the html.

We are intentionally only doing this for a single place here because if you ever do this kind of scraping of web pages on a larger scale, you should make sure that this form of usage is not against the web site’s terms of use.

We also mentioned above that one common response format is JSON (JavaScript Object Notation) code [20]. If you actually click the link above, you will see that Google sends back the response as JSON code. JSON is intended to be easily readable by computers not humans, but the good thing is that we as Python programmers are already used to reading it because it is based on notations for arrays (=lists) and objects (=dictionaries) that use the same syntax as Python.

Study the JSON response to our Zandbergen query from above for a moment. At the top level we have a dictionary that describes the response. One entry “totalItems” in the dictionary says that the response contains 16 results. The entry “items” contains these results as a list of dictionaries/objects. The first dictionary from the list is the one for our course textbook. One attribute of this dictionary is “volumeInfo”, which is again a dictionary/object whose attributes include the title of the book and name of the author. Please note that the “authors” attribute is again a list because books can have multiple authors. If you scroll down a bit, you will see that at some point the dictionary for the Zandbergen book is closed with a “}” and then a new dictionary for another book starts which is the second item from the “items” list, and so on.

After this explanation of web APIs and JSON, here is the Python code to run this query and process the returned JSON code:

import requests, urllib.parse 

url = "https://www.googleapis.com/books/v1/volumes"  
query = "Zandbergen Python" 

parameterString = "?q=" + urllib.parse.quote(query) 

response = requests.get(url + parameterString)  
jsonCode = response.json() 

print(jsonCode['items'][0]['volumeInfo']['title'])

Output: 
Python Scripting for Arcgis

We here define the base URL for this web API call and the query term string in different variables (lines 3 and 4). You saw above that certain characters like spaces appearing in URLs need to be encoded in certain ways. When we enter such URLs into a browser, the browser will take care of this but if we construct the URL for a request in our code we have to take care of this ourselves. Fortunately, the urllib.parse module provides the function quote(…) for this, which we use in line 6 to construct the correctly encoded parameter list which is then combined with the base url in the call of requests.get(…) in line 8.

By using the json() method of the response object in line 9, we get a Python data structure that will represent the JSON response and store it in variable jsonCode. In this case, it is a dictionary that under the key “items” contains a Python list with dictionaries for the individual book items returned. In line 11, we use this data structure to access the 'title' attribute of the first book item in the list: With ['items'] we first get the “items” list, then we take the first element from that list with [0], then we access the 'volumeInfo' property of the resulting dictionary, and finally with ['title'] we get the 'title' attribute from the volume info dictionary.

The code from above was supposed to show you how you to explicitly encode parameters for web API requests (with the help of urllib.parse.quote(...)) and build the final URL. The great thing about the requests module is that it can take care of all these things for you: You can simply provide an additional parameter for get(…) that contains a dictionary of parameter names for the web request and what values should be assigned to these parameters. Requests then automatically encodes these values and builds the final URL. Here is the version of the previous example that uses this approach.

import requests 

url = "https://www.googleapis.com/books/v1/volumes"  
query = "Zandbergen Python" 

response = requests.get(url, {'q': query})  
jsonCode = response.json() 

print(jsonCode['items'][0]['volumeInfo']['title'])

The dictionary with parameters for the web request that we use in line 6 says that the value assigned to parameter 'q' should be the string contained in variable query. As said, requests will automatically take care of encoding special characters like spaces in the parameter values and of producing the final URL from them.

You will see more examples of using web APIs and processing the JSON code returned in the first walkthrough of this lesson. These examples will actually return GeoJSON [21] code which is a standardized approach for encoding spatial features in JSON including their geometry data. However, there is a rather large but also very interesting topic that we have to cover first.

Lesson content developed by Jan Wallgrun and James O’Brien

Sites Robots.txt

Believe it or not, the internet does try to have some rule and order and 'a code of conduct'. After all, without rules, there is chaos. The code of conduct is up to the websites to create and provide. This code of conduct sets rules on what bots, crawlers, and scrapers can access within their site. This conduct request can be a legal binding document so it is a good idea to make sure you are within allowance. The code of conduct is the robots.txt file.

You can view the sites robots.txt file to see what endpoints they allow for scraping by adding /robots.txt to the end of the domain in the browser. For example:

http://www.timeanddate.com/robots.txt

returns a long list of endpoints they are ok with and are not ok with as well as rules for certain crawlers and bots.

# http://web.nexor.co.uk/mak/doc/robots/norobots.html 
# 
# internal note, this file is in git now! 

User-agent: MSIECrawler 
Disallow: /
User-agent: PortalBSpider 
Disallow: / 
User-agent: Bytespider 
Disallow: / 
User-agent: Mediapartners-Google* 
Disallow: /
User-agent: Yahoo Pipes 1.0 
Disallow: / 
# disallow any urls with ? in 
User-Agent: AhrefsBot 
Disallow: /*? 
Disallow: /information/privacy.html 
Disallow: /information/terms-conditions.html 
User-Agent: dotbot 
Disallow: /*? 
Disallow: /worldclock/results.html 
Disallow: /scripts/tzq.php 
User-agent: ScoutJet 
Disallow: /worldclock/distanceresult.html 
Disallow: /worldclock/distanceresult.html* 
Disallow: /information/privacy.html 
Disallow: /information/terms-conditions.html 
Crawl-delay: 1 

User-agent: * 
Allow: /calendar/create.html?*year=2022 
Allow: /calendar/create.html?*year=2023 
Allow: /calendar/create.html?*year=2024 
Disallow: /adverts/ 
Disallow: /advgifs/ 
Disallow: /eclipse/in/*?starty= 
Disallow: /eclipse/in/*?iso 
Allow:    /eclipse/in/*?iso=202 
… 
Disallow: /worldclock/sunearth.html? 
Disallow: /worldclock/sunearth.html?iso

Give that a try in a browser for your favorite website and read through it to see what they are requesting to not be scanned by bots/ scrappers and what they do allow.

In addition, some things can be done to keep the load on the server produced by web scraping as low as possible, e.g. by making sure the results are stored/cached when the program is running and not constantly being queried again unless the result may have changed. In this example, while the time changes constantly, one could still only run the query once, calculate the offset to the local computer’s current time once, and then always recalculate the current time for State College based on this information and the current local time.

The examples we have seen so far all used simple URLs, although this last example was already an example where parameters of the query are encoded in the URL (country and place name), and the response was always an html page intended to be displayed in a browser. In addition, there exist web APIs that realize a form of programming interface that can be used via URLs and HTTP requests. Such web APIs are, for instance, available by Twitter to search within recent tweets, by Google Maps, and by Esri [22]. Often there is a business model behind these APIs that requires license fees and some form of authorization.

Web APIs often allow for providing additional parameters for a particular request that have to be included in the URL. This works very similar to a function call, just the syntax is a bit different with the special symbol ? used to separate the base URL of a particular web API call from its parameters and the special symbol & used to separate different parameters. Here is an example of using a URL for querying the Google Books API for the query parameter “Zandbergen Python”:

Example of using a URL for querying Google Books API [23]

www.googleapis.com/books/v1/volumes [24] is the base URL for using the web API to perform this kind of query and q=Zandbergen%20Python is the query parameter specifying what terms we want to search for. The %20 encodes a single space in a URL. If there would be more parameters, they would be separated by & symbols like this:

<parameter 1>=<value 1>&<parameter 2>=<value 2>&…

Portions of lesson content developed by Jan Wallgrun and James O’Brien

Parsing JSON

Python includes several built in methods for handling JSON (or json). As you remember from Lesson 1, json is widely used to transfer data from one language to another or from one data source to another. We already looked at some geojson earlier in the lesson when we extracted data from KML's. For this section we will demonstrate loading it into a Python Dictionary and dumping it to a json string. The official documentation can be found here [25].

Loading json means to convert a formatted string into a json object. This is useful for converting string representations of dictionaries that come from various sources, or from API’s results. It is important to note that there is also json.load(…) which takes an file or bytes like object as an input parameter whereas json.loads(…) takes a json string. The s at the end of loads denotes that the method is for a string input, whereas the load method without the s is for objects such as files. The dump and dumps follows the same convetion, except it outputs to a file writer type object or a json string so beware of what type of data you are working with. If you do forget, the Python interpreter will kindly let you know.

A simple json loads example:

import json 
 
# JSON string:Multi-line string 
x = '{"City": "Cheyenne", "State": "Wyoming", "population": "Very Little", "Industries":["Mining", "Restaurants", "Rodeos"]}' 
 
# parse x: 
y = json.loads(x) 
 
print(type(y)) 
print(y)

To write the python dictionary back to a json string, you would use the .dumps() method.

And a simple json dumps example:

# Creating a dictionary 
wyo_hi_dict = {1:'Welcome', 2:'to', 3:'Cheyenne', 4:'Wyoming'} 
 
# Converts input dictionary into 
# string and stores it in json_string 
json_string = json.dumps(wyo_hi_dict)

print('Equivalent json string of input dictionary:', json_string)

Equivalent json string of input dictionary: '{"1": "Welcome", "2": "to", "3": "Cheyenne", "4": "Wyoming"}'

You will notice that the JSON dumps converts the keys and values to strings. The deserialization process converts the value to its datatype, but it doesn't always get it right so sometimes we are left with adding custom casting.

Accessing the properties of the json object when loaded from json.loads() is the same as accessing them via Python dictionary.

print(json_string["1"])

Accessing GIS Data via REST

Now that we know some methods for parsing JSON data, let’s look at the REST and how we can use it to generate a url for our code. There are four parameters that we need to fill out for it to return a result. These four parameters are the Where, Out Fields, and Return Geometry, and return format. If you need a more specific result, you can enter more parameters as needed.

The base url for the query is looks like this from Query: BLM California Active Fires 2023 (ID: 0) [26].

With the query at the end followed by the ?. The ? separates the url into sections. This ? is starting the passing of the parameters, which are separated by & sign. Note that the url does not contain spaces

The complete request’s parameters are added to the URL when the request is submitted:

?where=1=1&objectIds=&time=&geometry=&geometryType=esriGeometryEnvelope&inSR=&spatialRel=esriSpatialRelIntersects&resultType=none&distance=0.0&units=esriSRUnit_Meter&relationParam=&returnGeodetic=false&outFields=*&returnGeometry=true&featureEncoding=esriDefault&multipatchOption=xyFootprint&maxAllowableOffset=&geometryPrecision=&outSR=&defaultSR=&datumTransformation=&applyVCSProjection=false&returnIdsOnly=false&returnUniqueIdsOnly=false&returnCountOnly=false&returnExtentOnly=false&returnQueryGeometry=false&returnDistinctValues=false&cacheHint=false&orderByFields=&groupByFieldsForStatistics=&outStatistics=&having=&resultOffset=&resultRecordCount=&returnZ=false&returnM=false&returnExceededLimitFeatures=true&quantizationParameters=&sqlFormat=none&f=json&token=

Building our query that will return all features, all fields in json format will look like:

https://www.arcgis.com/apps/mapviewer/index.html?url=https://services3.arcgis.com/T4QMspbfLg3qTGWY/ArcGIS/rest/services/Boulder_Fire_Tax_District/FeatureServer/0/query?where=1=1&outFields=*&returnGeometry=true &f=json

With this url string, can use request to retrieve data from services and save them locally. There are a few ways of doing this such as using pandas to read the url, converting the json to a dataframe and then to a featureclass, converting the JSON result to a dictionary and using an insert/update cursor, or creating a new featureclass using the arcpy method JSONToFeatures_conversion(…) method. It is important to note an important aspect of this method noted in the Summary section of the documentation:

Converts feature collections in an Esri JSON formatted file (.json) or a GeoJSON formatted file (.geojson) to a feature class.

This means that the method is expecting a file as the input and trying to pass the response json will results in an error and if you do not need to manipulate the data before outputting to a featureclass, this method might be the simplest to implement with the least amount of packages. If you need to work on the data, such as convert the datetimes for a field, converting the JSON to a dataframe or dictionary would be the preferred process.

Below is an example of the process.

import arcpy 
import requests 
import json 

arcpy.env.overwriteOutput = True 

url = " https://services3.arcgis.com/T4QMspbfLg3qTGWY/ArcGIS/rest/services/Boulder_Fire_Tax_District/FeatureServer/0/query?where=1=1&returnGeometry=true&outFields=*&f=json" 

# send the request to the url and store the reply as response
response = requests.get(url) 

# Get result as json
jsonCode = response.json() 

jsonFile = r"C:\NGA\Lesson 3\response.json" 

# write the response json to a file
with open(jsonFile, "w") as outfile: 
    json.dump(jsonCode, outfile) 

# JSONToFeatures requires a file as input
arcpy.JSONToFeatures_conversion(jsonFile, r'C:\NGA\Lesson 3\Geog489.gdb\Boulder_Fire_Tax_Dist', 'POLYGON') 

# Clean up
if os.path.exists(jsonFile): 
    os.remove(jsonFile)

You can also separate the parameters into a dictionary and let requests do the url formatting, making for a cleaner code and this method seems to also format the returned date fields, if there are any:

params = {'where': '1=1', 'outFields': '*', 'f': 'pjson', 'returnGeometry': True} 
response = requests.get(url, params)

FeatureService to Featureclass

As we said earlier about searching for packages that perform an ETL process, we saved a hidden gem for last. Compared to the previous methods of retreiving data from from a service that we went over, the few lines of code this process requires is welcoming from a managerial and readbility standpoint.

A hidden capability of arcpy’s conversion FeatureclassToFeatureclass [27] is that it can take a service endpoint as an input and make short work of this conversion to a Featureclass. However, as promising as it seems, some services do not transform. Since it is only a few lines of code, it is worth giving it a try and saving some time.

url = 'https://services3.arcgis.com/T4QMspbfLg3qTGWY/ArcGIS/rest/services/Boulder_Fire_Tax_District/FeatureServer/0' 
 
out_location = r"C:\NGA\TestingExportData\output.gdb" 
out_featureclass = "Fire_Tax_District" 
 
# Run FeatureClassToFeatureClass 
arcpy.conversion.FeatureClassToFeatureClass(url, out_location, out_featureclass)

To add the definition query, you can set the parameter of the method.

## Adding Expression 
delimitedField = arcpy.AddFieldDelimiters(arcpy.env.workspace, "NAME") 
expression = delimitedField + " = 'Post Office'"

arcpy.conversion.FeatureClassToFeatureClass(url, out_location, out_featureclass, expression)

ArcGIS API for Python

Esri’s ArcGIS API for Python was announced in summer 2016 and was officially released at the end of the same year. The goal of the API as stated in this ArcGIS blog(link is external) [28] accompanying the initial release is to provide a pythonic GIS API that is powerful, modern, and easy to use. Pythonic here means that it complies with Python best practices regarding design and implementation and is embedded into the Python ecosystem leveraging standard classes and packages (pandas and numpy, for instance). Furthermore, the API is supposed to be “not tied to specific ArcGIS jargon and implementation patterns” (Singh, 2016) and has been developed specifically to work well with Jupyter Notebook. Most of the examples from the API’s website(link is external) [29] actually come in the form of Jupyter notebooks.

The current version (at the time of this writing) is version 2.1.0.3 published in March 2023. The API consists of several modules for:

managing a GIS hosted on ArcGIS Online or ArcGIS Enterprise/Portal (module arcgis.gis)
administering environment settings (module arcgis.env)
working with features and feature layers (module arcgis.features)
working with raster data (module arcgis.raster)
performing network analyses (module arcgis.network)
distributed analysis of large datasets (module arcgis.Geoanalytics)
performing geocoding tasks (module arcgis.geocoding)
answering questions about locations (module arcgis.geoenrichment)
manipulating geometries (module arcgis.geometry)
creating and sharing geoprocessing tools (module arcgis.geoprocessing)
creating map presentations and visualizing data (module arcgis.mapping)
processing real-time data feeds and streams (module arcgis.realtime)
working with simplified representations of networks called schematics (module arcgis.schematics)
embedding maps and visualizations as widgets, for instance within a Jupyter Notebook (module arcgis.widgets)

You will see some of these modules in action in the examples provided in the rest of this section.

Lesson content developed by Jan Wallgrun and James O’Brien

GIS, map and content

The central class of the API is the GIS class defined in the arcgis.gis module. It represents the GIS you are working with to access and publish content or to perform different kinds of management or analysis tasks. A GIS object can be connected to AGOL or Enterprise/Portal for ArcGIS. Typically, the first step of working with the API in a Python script consists of constructing a GIS object by invoking the GIS(...) function defined in the argis.gis module. There are many different ways of how this function can be called, depending on the target GIS and the authentication mechanism that should be employed. Below we are showing a couple of common ways to create a GIS object before settling on the approach that we will use in this class, which is tailored to work with your pennstate organizational account.

Most commonly, the GIS(...) function is called with three parameters, the URL of the ESRI web GIS to use (e.g. ‘https://www.arcgis.com [30]’ for a GIS object that is connected to AGOL), a username, and the login password of that user. If no URL is provided, the default is ArcGIS Online and it is possible to create a GIS object anonymously without providing username and password. So the simplest case would be to use the following code to create an anonymous GIS object connected to AGOL (we do not recommend running this code - it is shown only as an example):

from arcgis.gis import GIS
gis = GIS()

Due to not being connected to a particular user account, the available functionality will be limited. For instance, you won’t be able to upload and publish new content. If, instead, you'd want to create the GIS object but with a username and password for a normal AGOL account (so not an enterprise account), you would use the following way with <your username> and <your password> replaced by the actual user name and password (we do not recommend running this code either- it is shown only as an example):

from arcgis.gis import GIS
gis = GIS('https://www.arcgis.com', '', '')
gis?

This approach unfortunately does not work with the pennstate organizational account you had to create at the beginning of the class. But we will need to use this account in this lesson to make sure you have the required permissions. The URL needs to be changed to ‘https://pennstate.maps.arcgis.com [31]’ and we have to use a parameter called client_id to provide the ID of an app that we created for this class on AGOL. When using this approach, calling the GIS(...) function will open a browser window where you can authenticate with your PSU credentials and/or you will be asked to grant the permission to use Python with your pennstate AGOL account. After that, you will be given a code that you have to paste into a box that will be showing up in your notebook. The details of what is going on behind the scenes are a bit complicated but whenever you need to create a GIS object in this class to work with the API, you can simply use these exact few lines and then follow the steps we just described. The last line of the code below is for testing that the GIS object has been created correctly. It will print out some help text about the GIS object in a window at the bottom of the notebook.

Craeting a clientID in AGOL for authentication can be found here [32]. This is not required for the lesson or the assignment, and is being provided as FYI.

The screenshot below illustrates the steps needed to create the GIS object and the output of the gis? command. (for safety it would be best to restart your Notebook kernel prior to running the code below if you did run the code above as it can cause issues with the below instructions working properly)

import arcgis
from arcgis.gis import GIS
gis = GIS('https://pennstate.maps.arcgis.com', client_id='lDSJ3yfux2gkFBYc')
gis?

Steps for creating GIS object and output produced by the gis? command in the previous code example

The GIS object in variable gis gives you access to a user manager (gis.users), a group manager (gis.groups) and a content manager (gis.content) object. The first two are useful for performing administrative tasks related to groups and user management. If you are interested in these topics, please check out the examples on the Accessing and Managing Users(link is external) [33] and Batch Creation of Groups(link is external) [34] pages. Here we simply use the get(...) method of the user manager to access information about a user, namely ourselves. Again, you will have to replace the <your username> tag with your PSU AGOL name to get some information display related to your account:

user = gis.users.get('')
user

The user object now stored in variable user contains much more information about the given user in addition to what is displayed as the output of the previous command. To see a list of available attributes, type in

user.

and then press the TAB key for the Jupyter autocompletion functionality. The following command prints out the email address associated with your account:

user.email

We will talk about the content manager object in a moment. But before that, let’s talk about maps and how easy it is to create a web map like visualization in a Jupyter Notebook with the ArcGIS API. A map can be created by calling the map(...) method of the GIS object. We can pass a string parameter to the method with the name of a place that should be shown on the map. The API will automatically try to geocode that name and then set up the parameters of the map to show the area around that place. Let’s use State College as an example:

map = gis.map('State College, PA')
map

Map widget in Jupyter Notebook produced by the previous code example

The API contains a widget for maps so that these will appear as an interactive map in a Jupyter Notebook. As a result, you will be able to pan around and use the buttons at the top left to zoom the map. The map object has many properties and methods that can be used to access its parameters, affect the map display, provide additional interactivity, etc. For instance, the following command changes the zoom property to include a bit more area around State College. The map widget will automatically update accordingly.

	
map.zoom = 11

With the following command, we change the basemap tiles of our map to satellite imagery. Again the map widget automatically updates to reflect this change.

	
map.basemap = 'satellite'

As another example, let’s add two simple circle marker objects to the map, one for Walker Building on the PSU campus, the home of the Geography Department, and one for the Old Main building. The ArcGIS API provides a very handy geocode(...) function in the arcgis.geocoding module so that we won’t have to type in the coordinates of these buildings ourselves. The properties of the marker are defined in the pt_sym dictionary.

from arcgis.geocoding import geocode
 
pt_sym = {
    "type": "esriSMS",
    "style": "esriSMSCircle",
    "color": [255,0,0,255],        
}
 
walker = geocode('Walker Bldg, State College, PA')[0]
oldMain = geocode('Old Main Bldg, State College, PA')[0]
map.draw(walker, {'title':'Walker Bldg', 'content': walker['attributes']['LongLabel']}, symbol=pt_sym)
map.draw(oldMain, {'title':'Old Main Bldg', 'content': oldMain['attributes']['LongLabel']}, symbol=pt_sym)

Map widget in Jupyter Notebook produced by the previous code example

The map widget should now include the markers for the two buildings as shown in the image above (the markers may look different). A little explanation on what happens in the code: the geocode(...) function returns a list of candidate entities matching the given place name or address, with the candidate considered most likely appearing as the first one in the list. We here simply trust that the ArcGIS API has been able to determine the correct candidate and hence just take the first object from the respective lists via the “[0]” in the code. What we have now stored in variables walker and oldMain are dictionaries with many attributes to describe the entities. When adding the markers to the map, we provide the respective variables as the first parameter and the API will automatically access the location information in the dictionaries to figure out where the marker should be placed. The second parameter is a dictionary that specifies what should be shown in the popup that appears when you click on the marker. We use short names for the title of the popups and then use the ‘LongLabel’ property from the dictionaries, which contains the full address, for the content of the popups.

Let's have a look at a last map example demonstrating how one can use the API to manually draw a route from Walker building to Old Main and have the API calculate the length of that route. The first thing we do is import the lengths(...) function from the arcgis.geometry module and define a function that will be called once the route has been drawn to calculate the length of the resulting polyline object that is passed to the function in parameter g:

from arcgis.geometry import lengths
 
def calcLength(map,g):
    length = lengths(g['spatialReference'], [g], "", "geodesic")['lengths']
    print('Length: '+ str(length[0]) + 'm.')

Once the length has been computed, the function will print out the result in meters. We now register this function as the function to be called when a drawing action on the map has terminated. In addition, we define the symbology for the drawing to be a dotted red line:

map.on_draw_end(calcLength)
 
line_sym = {
"type": "esriSLS",
"style": "esriSLSDot",
"color": [255,0,0,255],
"width": 3
}

With the next command we start the drawing with the ‘polyline’ tool. Before you execute the command, it's a good idea to make sure the map widget is zoomed in to show the area between the two markers but you will still be able to pan the map by dragging with the left mouse button pressed and to zoom with the mouse wheel during the drawing. You need to do short clicks with the left mouse button to set the points for the start point and end points of the line segments. To end the polyline, make a double click for the last point while making sure you don't move the mouse at the same time. After this, the calculated distance will appear under the map widget.

map.draw('polyline', symbol=line_sym)

You can always restart the drawing by executing the previous code line again. Using the command map.clear_graphics() allows for clearing all drawings from the map but you will have to recreate the markers after doing so.

Map widget produced by the previous code example showing the drawn route

Now let’s get back to the content manager object of our GIS. The content manager object allows us to search and use all the content we have access to. That includes content we have uploaded and published ourselves but also content published within our organization and content that is completely public on AGOL. Content can be searched with the search(...) method of the content manager object. The method takes a query string as parameter and has additional parameters for setting the item type we are looking for and the maximum number of entries returned as a result for the query. For instance, try out the following command to get a list of available feature services that have PA in the title:

featureServicesPA = gis.content.search(query='title:PA', item_type='Feature Layer Collection', max_items = 50)
featureServicesPA

This will display a list of different AGOL feature service items available to you. The list is probably rather short at the moment because not much has been published in the new pennstate organization yet, but there should at least be one entry with municipality polygons for PA. Each feature service from the returned list can, for instance, be added to your map or used for some spatial analysis. To add a feature service to your map as a new layer, you have to use the add_layer(...) method of the map object. For example, the following command takes the first feature service from our result list in variable featureServicesPA and adds it to our map widget:

map.add_layer(featureServicesPA[0], {'opacity': 0.8})

The first parameter specifies what should be added as a layer (the feature service with index 0 from our list in this case), while the second parameter is an optional dictionary with additional attributes (e.g., opacity, title, visibility, symbol, renderer) specifying how the layer should be symbolized and rendered. Feel free to try out a few other queries (e.g. using "*" for the query parameter to get a list of all available feature services) and adding a few other layers to the map by changing the index used in the previous command. If the map should get too cluttered, you can simply recreate the map widget with the map = gis.map(...) command from above.

The query string given to search(...) can contain other search criteria in addition to ‘title’. For instance, the following command lists all feature services that you are the owner of (replace <your agol username> with your actual Penn State AGOL/Pro user name):

gis.content.search(query='owner:', item_type='Feature Service')

Unless you have published feature services in your AGOL account, the list returned by this search will most likely be empty ([]). So how can we add new content to our GIS and publish it as a feature service? That’s what the add(...) method of the content manager object and the publish(...) method of content items are for. The following code uploads and publishes a shapefile of larger cities in the northeast of the United States.

To run it, you will need to first download the zipped shapefile [35] and then slightly change the filename on your computer. Since AGOL organizations must have unique services names, edit the file name to something like ne_cities_[initials].zip or ne_cities_[your psu id].zip so the service name will be unique to the organization. Adapt the path used in the code below to refer to this .zip file on your disk.

neCities = gis.content.add({'type': 'Shapefile'}, r'C:\489\L3\ne_cities.zip')
neCitiesFS = neCities.publish()

The first parameter given to gis.content.add(...) is a dictionary that can be used to define properties of the data set that is being added, for instance tags, what type the data set is, etc. Variable neCitiesFS now refers to a published content item of type arcgis.gis.Item that we can add to a map with add_layer(...). There is a very slight chance that this layer will not load in the map. If this happens for you - continue and you should be able to do the buffer & intersection operations later in the walkthrough without issue. Let’s do this for a new map widget:

cityMap = gis.map('Pennsylvania')
cityMap.add_layer(neCitiesFS, {})
cityMap

Map widget produced by the previous code example

The image shows the new map widget that will now be visible in the notebook. If we rerun the query from above again...

gis.content.search(query='owner:', item_type='Feature Service')

...our newly published feature service should now show up in the result list as:

Lesson content developed by Jan Wallgrun and James O’Brien

Accessing Features and geoprocessing tools from Portal

The ArcGIS Python API also provides the functionality to access individual features in a feature layer, to manipulate layers, and to run analyses from a large collection of GIS tools including typical GIS geoprocessing tools. Most of this functionality is available in the different submodules of arcgis.features. To give you a few examples, let’s continue with the city feature service we still have in variable neCitiesFS from the previous section. A feature service can actually have several layers but, in this case, it only contains one layer with the point features for the cities. The following code example accesses this layer (neCitiesFE.layers[0]) and prints out the names of the attribute fields of the layer by looping through the ...properties.fields list of the layer:

for f in neCitiesFS.layers[0].properties.fields:
    print(f)

When you try this out, you should get a list of dictionaries, each describing a field of the layer in detail including name, type, and other information. For instance, this is the description of the STATEABB field:

{
 "name": "STATEABB",
 "type": "esriFieldTypeString",
 "actualType": "nvarchar",
 "alias": "STATEABB",
 "sqlType": "sqlTypeNVarchar",
 "length": 10,
 "nullable": true,
 "editable": true,
 "domain": null,
 "defaultValue": null
 }

The query(...) method allows us to create a subset of the features in a layer using an SQL query string similar to what we use for selection by attribute in ArcGIS itself or with arcpy. The result is a feature set in ESRI terms. If you look at the output of the following command, you will see that this class stores a JSON representation of the features in the result set. Let’s use query(...) to create a feature set of all the cities that are located in Pennsylvania using the query string "STATEABB"='US-PA' for the where parameter of query(...):

paCities = neCitiesFS.layers[0].query(where='"STATEABB"=\'US-PA\'')
print(paCities.features)
paCities

Output: 
[{"geometry": {"x": -8425181.625237303, "y": 5075313.651659228}, 
"attributes": {"FID": 50, "OBJECTID": "149", "UIDENT": 87507, "POPCLASS": 2, "NAME": 
"Scranton", "CAPITAL": -1, "STATEABB": "US-PA", "COUNTRY": "USA"}}, {"geometry": 
{"x": -8912583.489066456, "y": 5176670.443556941}, "attributes": {"FID": 53, 
"OBJECTID": "156", "UIDENT": 88707, "POPCLASS": 3, "NAME": "Erie", "CAPITAL": -1, 
"STATEABB": "US-PA", "COUNTRY": "USA"}}, ...  ]
<FeatureSet> 10 features

Of course, the queries we can use with the query(...) function can be much more complex and logically connect different conditions. The attributes of the features in our result set are stored in a dictionary called attributes for each of the features. The following loop goes through the features in our result set and prints out their name (f.attributes['NAME']) and state (f.attributes['STATEABB']) to verify that we only have cities for Pennsylvania now:

for f in paCities:
    print(f.attributes['NAME'] + " - "+ f.attributes['STATEABB'])

Output: 
Scranton - US-PA
Erie - US-PA
Wilkes-Barre - US-PA
...
Pittsburgh - US-PA

Now, to briefly illustrate some of the geoprocessing functions in the ArcGIS Python API, let’s look at the example of how one would determine the parts of the Appalachian Trail that are within 30 miles of a larger Pennsylvanian city. For this we first need to upload and publish another data set, namely one that’s representing the Appalachian Trail. We use a data set that we acquired from PASDA(link is external) [36] for this, and you can download the DCNR_apptrail file here [37]. As with the cities layer earlier, there is a very small chance that this will not display for you - but you should be able to continue to perform the buffer and intersection operations. The following code uploads the file to AGOL (don’t forget to adapt the path and add your initials or ID to the name!), publishes it, and adds the resulting trail feature service to your cityMap from above:

appTrail = gis.content.add({'type': 'Shapefile'}, r'C:\489\L3\dcnr_apptrail_2003.zip')
appTrailFS = appTrail.publish()
cityMap.add_layer(appTrailFS, {})

Map of Pennsylvania with the appalachian trail highlighted & cities marked

Next, we create a 30 miles buffer around the cities in variable paCities that we can then intersect with the Appalachian Trail layer. The create_buffers(...) function that we will use for this is defined in the arcgis.features.use_proximity module together with other proximity based analysis functions. We provide the feature set we want to buffer as the first parameter, but, since the function cannot be applied to a feature set directly, we have to invoke the to_dict() method of the feature set first. The second parameter is a list of buffer distances allowing for creating multiple buffers at different distances. We only use one distance value, namely 30, here and also specify that the unit is supposed to be ‘Miles’. Finally, we add the resulting buffer feature set to the cityMap above.

from arcgis.features.use_proximity import create_buffers
bufferedCities = create_buffers(paCities.to_dict(), [30], units='Miles')
cityMap.add_layer(bufferedCities, {})

Map widget with buffers created by the previous code example

As the last step, we use the overlay_layers(...) function defined in the arcgis.features.manage_data module to create a new feature set by intersecting the buffers with the Appalachian Trail polylines. For this we have to provide the two input sets as parameters and specify that the overlay operation should be ‘Intersect’.

from arcgis.features.manage_data import overlay_layers
trailPartsCloseToCity = overlay_layers(appTrailFS, bufferedCities, overlay_type='Intersect')

We show the result by creating a new map ...

resultMap.add_layer(trailPartsCloseToCity)

... and just add the features from the resulting trailPartsCloseToCity feature set to this map:

resultMap.add_layer(trailPartsCloseToCity)

The result is shown in the figure below.

New map widget showing the final result of our analysis

Lesson content developed by Jan Wallgrun and James O’Brien

Spatially Enabled Dataframes

ArcGIS API for Python also includes a spatially enabled wrapper for dataframes and provides access to Python packages pyshp, shapely and fiona. Full documenation can be seen here [38]. This package enables you to translate data from multiple sources and diverse data types to perform geoprocessing tasks. This also provides a quick mechanism for translating the GeoJSON extracted from the KML to other formats such as Feature Layers, Feature Collections, Feature Sets, and Featueclasses.

The important thing to note is that it needs a special import into the script.

import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor

If you are using an intelligent IDE, you may notice that these imports will remain as ‘unused’, but they are essential for performing geoprocessing and error messages related to them being absent are not helpful in diagnosing their absence.

The package can read features from AGOL and Portal, as well as local file types such as shapefiles and featureclasses and I suggest you refer the documentation [38] for its implimentation.

We will go over data manipulation through dataframes in the next lesson.

Practice Exercises

Practice Exercise 2: Requests and BeautifulSoup4

We want to write a script to extract the text from the three text paragraphs from section 1.7.2 on profiling [39] without the heading and following list of subsections. Write a script that does that using the requests module to load the html code and BeautifulSoup4 to extract the text (Section 2.3).

Finally, use a list comprehension (Section 2.2) to create a list that contains the number of characters for each word in the three paragraphs. The output should start like this:

[2, 4, 12, 4, 4, 4, 6, 4, 9, 2, 11… ]

Hint 1

If you use Inspect in your browser, you will see that the text is the content of a <div> element within another <div> element within an <article> element with a unique id attribute (“node-book-2269”). This should help you write a call of the soup.select(…) method to get the <div> element you are interested in. An <article> element with this particular id would be written as “article#node-book-2269” in the string given to soup.select(…).

Hint 2:

Remember that you can get the plain text content of an element you get from BeautifulSoup from its .text property (as in the www.timeanddate.com [18] example in Section 2.3).

Hint 3

It’s ok not to care about punctuation marks, etc. in this exercise and simply use the string method split() to split the text into words at any whitespace character. The number of characters in a string can be computed with the Python function len(...).

Lesson 2 Assignment

The url for the services are:

Pennsylvania Spatial Data Access (PASDA) https://www.pasda.psu.edu/ [7]

NGA https://geonames.nga.mil/gn-ags/rest/services/ [40]

Hint to return all features of the service, you still need to provide a valid sql query. For this assignment, you can use OBJECTID IS NOT NULL or 1=1 for the where clause. If you want to try other sql statements for other results, please feel free to do so.

If you are having issues with the url string, you can use the service query UI to see how it should be formatted.

Write-up

Produce a 400-word write-up on how the assignment went for you; reflect on and briefly discuss the issues and challenges you encountered and what you learned from the assignment.

Deliverable

Submit a single .zip file to the corresponding drop box on Canvas; the zip file should contain:

• The script

• Your 400-word write-up