Now that we are back in the groove of writing arcpy code and creating script tools, we want to look at a topic that didn't play a big role in our introductory class, GEOG 485, but that is very important in programming: run-time performance, together with the related topics of parallelization and code profiling. These will be the main topics of the next sections and of this lesson in general.
We are going to address the question of how we can improve the performance and reliability of our Python code when dealing with more complicated tasks that require a larger number of operations on a greater number of datasets and/or more memory. To do this we’re going to look at both 64-bit processing and multiprocessing. We’re going to start investigating these topics using a simple raster script to process LiDAR data from Penn State’s main campus and surrounding area. In later sections, we will also look at a vector data example using different data sets for Pennsylvania.
The raster data consists of 808 tiles, each individually zipped, amounting to 550 MB zipped in total. The individual .zip files can be downloaded directly from PASDA.
Previously, PASDA provided access via FTP, but unfortunately that option has been removed. Instead, we recommend you use a little Python script we put together that uses BeautifulSoup (which we'll look at more in Lesson 2) to download the files. The script will also automatically extract the individual .zip files. For this, you have to do the following:
- Download the script from here at GitHub, or simply copy & paste the code into a new script file.
- At the beginning of the main function, two folders are specified: the first one for storing the .zip files and the second for storing the extracted raster files. These are currently set to C:\temp and C:\temp\unzipped, but you may not have permission to write to these folders. We therefore recommend that you edit these variables so that the LiDAR files are downloaded and extracted to folders within your Windows user's home directory, e.g. C:\Users\<username>\Documents\489 and C:\Users\<username>\Documents\489\unzipped (assuming that the folder 489 already exists in your Documents folder). The script uses a wildcard on line 66 that tells Python to only download tile files with 227 in the file name, not all of them. This is fine for running the code examples in this lesson, but if you want to do test runs with more or all of the files, you can also edit this line so that it reads wildcard_pattern = "zip" (because "zip" occurs in all the file names). A rough sketch of how such a download script works is shown after this list.
- Run the script and you should see the downloaded .zip files and extracted raster files appear in the specified target folders.
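To give you an idea of what the download script does internally, here is a minimal sketch of the same approach: scraping an index page with BeautifulSoup for .zip links that match a wildcard, downloading them with requests, and extracting them with zipfile. This is not the course script itself; the index URL is a placeholder, and the folder paths and wildcard_pattern variable simply mirror the description above.

```python
# Minimal sketch of a tile downloader, assuming the tiles are linked as
# relative .zip hrefs on a single index page (a placeholder URL is used here,
# not PASDA's actual one).
import os
import zipfile

import requests
from bs4 import BeautifulSoup

index_url = "https://www.pasda.psu.edu/lidar-tiles/"  # hypothetical index page
zip_folder = r"C:\Users\<username>\Documents\489"
unzip_folder = r"C:\Users\<username>\Documents\489\unzipped"
wildcard_pattern = "227"  # set to "zip" to fetch every tile

os.makedirs(zip_folder, exist_ok=True)
os.makedirs(unzip_folder, exist_ok=True)

# Scrape the index page for links to .zip files that match the wildcard.
soup = BeautifulSoup(requests.get(index_url).text, "html.parser")
for link in soup.find_all("a"):
    name = link.get("href", "")
    if name.endswith(".zip") and wildcard_pattern in name:
        zip_path = os.path.join(zip_folder, os.path.basename(name))
        # Download the archive (assumes hrefs are relative to index_url).
        with open(zip_path, "wb") as f:
            f.write(requests.get(index_url + name).content)
        # Extract the raster from the downloaded archive.
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(unzip_folder)
```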
Doing any GIS processing with these LiDAR files is definitely a task to be handled by scripting, and any performance benefits we can gain when we're processing that many tiles will be worthwhile. You might ask why we don't just join all the tiles together and process them at once: we would run out of memory very fast, and if something went wrong we would need to start over. Processing small tiles, we can work on one (or a few) at a time using less memory, and if one tile fails, we still have all the others and just need to restart that one tile.
Below is our simple raster script, which gets our list of tiles and then, for every tile in the list, fills the DEM, creates flow direction and flow accumulation rasters to derive a stream raster (to determine where the water might flow), and lastly converts the stream raster to a polyline feature class. This is a simplified version of the sort of analysis you might undertake to prepare data prior to performing a flood study. The code we are writing here will work in both Desktop and Pro as long as you have the Spatial Analyst extension installed, authorized, and enabled (it is this last step that generally causes errors). I've restricted the processing to a subset of the tiles for testing and performance reasons, using only tiles with 227 in the name, but more tiles can be included by modifying the wildcard list in line 19.
If you used the download script above, you already have the downloaded raster files ready. You can move them to a new folder or keep them where they are. In any case, you will need to make sure that the workspace in the script below points to the folder containing the extracted raster files (line 9). If you obtained the raster files in some other way, you may have to unzip them to a folder first.
Let’s look over the code now. You will notice that the version below is written for Python 3 (the f-string in the error handling gives it away); apart from that one line, it would run in both Python 2 and 3 without changes.
```python
# Setup _very_ simple timing.
import time
process_start_time = time.time()

import arcpy
from arcpy.sa import *

arcpy.env.overwriteOutput = True
arcpy.env.workspace = r'C:\489\PSU_LiDAR'

## If our rasters aren't in our filter list then drop them from our list.
def filter_list(fileList, filterList):
    return [i for i in fileList if any(j in i for j in filterList)]

# Ordinarily we would want all of the rasters; I'm filtering by a small set for
# testing & efficiency. I did this by manually looking up the tile index for the
# LiDAR and determining an area of interest: tiles ending in 227, 228, 230, 231,
# 232, 233, 235, 236.
wildCardList = set(['227'])  ## ,'228','230','231','232','233','235','236'])

# Get a list of rasters in my folder.
rasters = arcpy.ListRasters("*")
new_rasters = filter_list(rasters, wildCardList)

# For all of our rasters:
for raster in new_rasters:
    raster_start_time = time.time()
    ## Note also for performance we're not saving any of the intermediate
    ## rasters - they will exist only in memory.
    try:
        arcpy.CheckOutExtension("Spatial")
        ## Fill the DEM to remove any sinks.
        FilledRaster = Fill(raster)
        ## Calculate the Flow Direction (how water runs across the surface).
        FlowDirRaster = FlowDirection(FilledRaster)
        ## Calculate the Flow Accumulation (where the water accumulates in the surface).
        FlowAccRaster = FlowAccumulation(FlowDirRaster)
        ## Convert the Flow Accumulation to a Stream Network. We're setting an
        ## arbitrary threshold of 100 cells flowing into another cell to set it
        ## as part of our stream:
        ## http://pro.arcgis.com/en/pro-app/tool-reference/spatial-analyst/identifying-stream-networks.htm
        Streams = Con(FlowAccRaster, 1, "", "Value > 100")
        ## Convert the raster stream network to a feature class.
        output_Polyline = raster.replace(".img", ".shp")
        arcpy.sa.StreamToFeature(Streams, FlowDirRaster, output_Polyline)
        arcpy.CheckInExtension("Spatial")
    except Exception as ex:
        # print(f"Errors occurred: {ex}")
        arcpy.AddMessage(f"Errors occurred: {ex}")
        # Report how long this tile ran before failing.
        arcpy.AddMessage("--- %s seconds ---" % (time.time() - raster_start_time))

# Output how long the whole process took.
arcpy.AddMessage("--- %s seconds ---" % (time.time() - process_start_time))
print("--- %s seconds ---" % (time.time() - process_start_time))
```
We have set up some very simple timing functionality in this script using the time() function defined in the time module of the Python standard library. The function returns the current time and, by calling it at the beginning and end of the program and then taking the difference in the very last lines of the script, we get an idea of the runtime of the script.
Later in the lesson, we will go into more detail about properly profiling code, where we will examine the performance of a whole program as well as of individual instructions. For now, we just want an estimate of the execution time. Of course, it's not going to be very precise, as it depends on what else you're doing on your PC at the same time, and we would need to run a number of iterations to remove any inconsistencies (such as the delay when arcpy loads for the first time, etc.). On my PC, that code runs in around 40 seconds. Your results will vary depending on many factors related to the performance of your PC (we'll review some of them in the Speed Limiters section), but you should test out the code to get an idea of the baseline performance of the algorithm on your PC.
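If you want a slightly more robust estimate without doing full profiling, you can repeat a run several times and take the best (or average) timing. The sketch below illustrates the idea with a stand-in workload (run_once is a hypothetical placeholder for the raster processing); note that time.perf_counter() offers better resolution than time.time() for measuring intervals.

```python
# Minimal sketch of reducing timing noise by repeating a run several times
# and reporting the best result. run_once() is a stand-in workload, not part
# of the lesson script.
import time

def run_once():
    sum(range(1_000_000))  # placeholder for the real processing work

timings = []
for _ in range(5):
    start = time.perf_counter()
    run_once()
    timings.append(time.perf_counter() - start)

# The minimum is usually the least contaminated by background activity.
print("best of 5: %.4f seconds" % min(timings))
```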
In lines 12 and 13, we have a simple function to filter our list of rasters down to just those we want to work with (centered on the PSU campus). This function might look a little different from what you have seen before - that's because we're using a list comprehension, which we'll examine in more detail in Lesson 2. So don't worry about understanding exactly how this works at the moment. It basically says: return a list containing only those file names from the original list that contain one of the numbers in the wildcard list. The sketch below shows the equivalent logic written as an explicit loop.
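For reference, here is what the list comprehension in filter_list does, rewritten as a plain loop; filter_list_verbose is just an illustrative name, not part of the lesson script.

```python
# Equivalent of filter_list written without a list comprehension.
def filter_list_verbose(fileList, filterList):
    result = []
    for i in fileList:
        # Keep the file name if it contains any of the wildcard strings.
        if any(j in i for j in filterList):
            result.append(i)
    return result

# Example: keeps only the names containing '227'.
print(filter_list_verbose(["tile_227a.img", "tile_228b.img"], {"227"}))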
We set up some environment variables and our wildcard list (used by our function for filtering) at line 19, where you will notice I have commented out part of the list for speed during testing. We then get our list of rasters, filter it, and iterate through the remaining rasters with the central for-loop at line 25, performing the spatial analysis tasks mentioned earlier. There is some basic error checking wrapped around the tasks (which also reports the running time for a tile if anything goes wrong), and lastly there is a message and print function with the total time. I've included both print and AddMessage just in case you want to test the code as a script tool in ArcGIS.
Feel free to run the script now and see what total computation time you get from the print statement in the last line of the code. We're going to demonstrate some very simple performance evaluation of the different versions of ArcGIS (32-bit Desktop, 64-bit Desktop, Pro, and arcpy multiprocessing) using this code. Before we do that, though, it is important to understand the differences between them. You do not have to run this testing yourself; we're mainly providing it as background. You are welcome to experiment with it, but please do not do so to the detriment of your project.
Once we’ve examined the theory of 64-bit processing and parallelization and worked through a simple example using the Hi-ho Cherry-O game from GEOG 485, we’ll come back to the raster example above and convert it to run in parallel using the Python multiprocessing package instead of sequentially, and we will further look at an example of multiprocessing using vector data. As a small preview, the sketch below shows the basic pattern we will build toward.
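This is just a flavor of the multiprocessing pattern, not the lesson's actual solution: a worker function that handles a single tile, mapped over the tile list by a pool of worker processes. The process_tile function and tile names here are hypothetical placeholders.

```python
# Minimal sketch of the multiprocessing pattern: one worker per tile,
# distributed across a pool of processes. process_tile is a placeholder
# for the per-raster work in the script above.
import multiprocessing

def process_tile(raster_name):
    # ... Fill / FlowDirection / FlowAccumulation / StreamToFeature ...
    return raster_name  # placeholder result

if __name__ == "__main__":  # required on Windows for multiprocessing
    with multiprocessing.Pool() as pool:
        results = pool.map(process_tile, ["tile_227a.img", "tile_227b.img"])
    print(results)
```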