GEOG 489
Advanced Python Programming for GIS

1.6.6.2 Multiprocessing with vector data


The best practices for multiprocessing that we introduced earlier are even more important when working with vector data than with raster data. The geodatabase locking issue, in particular, is likely to become much more of a factor, since we typically work with more vector data than raster data and geodatabases are more commonly used to store feature classes.

The example we’re going to use here involves clipping a feature layer by polygons in another feature layer. A sample use case might be that you need to segment one or several infrastructure layers by state or county (or an even smaller subdivision). If you want to provide each state or county with a version of the roads, sewer, water, or electricity layers (for example), this script would be helpful. To test out the code in this section (and also the first homework assignment), you can again use the data from the USA.gdb geodatabase (Section 1.5) we provided. The application then is to clip the data from the roads, cities, or hydrology data sets to the individual state polygons from the States data set in the geodatabase.

To achieve this task, one could run the Clip tool manually in ArcGIS Pro but if there are a lot of polygons in the clip data set, it will be more effective to write a script that performs the task. As each state/county is unrelated to the others, this is an example of an operation that can be run in parallel.

The code example below has been adapted from a code example written by Duncan Hornby at the University of Southampton in the United Kingdom that has been used to demonstrate multiprocessing and also how to create a script tool that supports multiprocessing. We will take advantage of Mr. Hornby’s efforts and make use of his code (with attribution of course) but we have also reorganized and simplified it quite a bit and added some enhancements.

Let us examine the code’s logic and then we’ll dig into the syntax.

The code has two Python files. This is important because, when we want to be able to run it as a script tool in ArcGIS, the worker function that runs the individual tasks must be defined in its own module file, not in the main script file that contains the multiprocessing code calling the worker function. The first file, called scripttool.py, imports arcpy, multiprocessing, and the worker code contained in the second Python file, called multicode.py. It contains the definition of the main function mp_handler(), which is responsible for managing the multiprocessing operations, similar to the multiprocessing version of the Cherry-O example. It uses two script tool parameters: the file containing the polygons to use for clipping (variable clipper) and the file to be clipped (variable tobeclipped). Furthermore, the file includes the definition of an auxiliary function get_install_path(), which is needed to determine the location of the Python interpreter for running the subprocesses when the code is run as a script tool in ArcGIS. You don't have to worry about the details of this function. The main function mp_handler() calls the worker(...) function located in the multicode file, passing it the files to be used and the other information needed to perform the clipping operation. This will be further explained below. The code for the first file, including the main function, is shown below.

import os, sys
import arcpy 
import multiprocessing 
from multicode import worker

# Input parameters
clipper = r"C:\489\USA.gdb\States"
#clipper = arcpy.GetParameterAsText(0) 
tobeclipped = r"C:\489\USA.gdb\Roads"
#tobeclipped = arcpy.GetParameterAsText(1)

def get_install_path():
    ''' Return 64bit python install path from registry (if installed and registered),
        otherwise fall back to current 32bit process install path.
    '''
    if sys.maxsize > 2**32: return sys.exec_prefix #We're running in a 64bit process
 
    #We're 32 bit so see if there's a 64bit install
    path = r'SOFTWARE\Python\PythonCore\2.7'
 
    from winreg import OpenKey, QueryValue
    from winreg import HKEY_LOCAL_MACHINE, KEY_READ, KEY_WOW64_64KEY
 
    try:
        with OpenKey(HKEY_LOCAL_MACHINE, path, 0, KEY_READ | KEY_WOW64_64KEY) as key:
            return QueryValue(key, "InstallPath").strip(os.sep) #We have a 64bit install, so return that.
    except: return sys.exec_prefix #No 64bit, so return 32bit path 
   
def mp_handler():

    try: 
        # Create a list of object IDs for clipper polygons 
        
        arcpy.AddMessage("Creating Polygon OID list...") 
        print("Creating Polygon OID list...") 
        clipperDescObj = arcpy.Describe(clipper) 
        field = clipperDescObj.OIDFieldName 
     
        idList = [] 
        with arcpy.da.SearchCursor(clipper, [field]) as cursor: 
            for row in cursor: 
                id = row[0] 
                idList.append(id)

        arcpy.AddMessage("There are " + str(len(idList)) + " object IDs (polygons) to process.") 
        print("There are " + str(len(idList)) + " object IDs (polygons) to process.") 

        # Create a task list with parameter tuples for each call of the worker function. Tuples consist of the clipper, tobeclipped, field, and oid values.
       
        jobs = []
    
        for id in idList:
            jobs.append((clipper,tobeclipped,field,id)) # adds tuples of the parameters that need to be given to the worker function to the jobs list

        arcpy.AddMessage("Job list has " + str(len(jobs)) + " elements.") 
        print("Job list has " + str(len(jobs)) + " elements.") 

        # Create and run multiprocessing pool.

        multiprocessing.set_executable(os.path.join(get_install_path(), 'pythonw.exe')) # make sure Python environment is used for running processes, even when this is run as a script tool

        arcpy.AddMessage("Sending to pool") 
        print("Sending to pool") 

        cpuNum = multiprocessing.cpu_count()  # determine number of cores to use
        print("there are: " + str(cpuNum) + " cpu cores on this machine") 
 
        with multiprocessing.Pool(processes=cpuNum) as pool: # Create the pool object 
            res = pool.starmap(worker, jobs)  # run jobs in job list; res is a list with return values of the worker function

        # If an error has occurred report it 
        
        failed = res.count(False) # count how many times False appears in the list with the return values
        if failed > 0:
            arcpy.AddError("{} workers failed!".format(failed)) 
            print("{} workers failed!".format(failed)) 
        
        arcpy.AddMessage("Finished multiprocessing!") 
        print("Finished multiprocessing!") 

    except arcpy.ExecuteError:
        # Geoprocessor threw an error 
        arcpy.AddError(arcpy.GetMessages(2)) 
        print("Execute Error:", arcpy.GetMessages(2)) 
    except Exception as e: 
        # Capture all other errors 
        arcpy.AddError(str(e)) 
        print("Exception:", e)

if __name__ == '__main__':   
    mp_handler() 

Let's now have a close look at the logic of the two main functions which will do the work. The first one is the mp_handler() function shown in the code section above. It takes the input variables and has the job of processing the polygons in the clipping file to get a list of their unique IDs, building a job list of parameter tuples that will be given to the individual calls of the worker function, setting up the multiprocessing pool and running it, and taking care of error handling.

The second function is the worker function called by the pool (named worker in this example) located in the multicode.py file (code shown below). This function takes the name of the clipping feature layer, the name of the layer to be clipped, the name of the field that contains the unique IDs of the polygons in the clipping feature layer, and the feature ID identifying the particular polygon to use for the clipping as parameters. This function will be called from the pool constructed in mp_handler().

The worker function will then make a selection from the clipping layer. This has to happen in the worker function because all parameters given to that function in a multiprocessing scenario need to be of a simple type that can be "pickled". Pickling data means converting it to a byte stream; in the simplest terms, the data is converted to a sequence of simple Python types (string, number, etc.). Feature classes, which contain both spatial and non-spatial data, are much more complicated than that and cannot readily be converted to a simple type. That means feature classes cannot be pickled, and any selections that might have been made in the calling function are not shared with the worker functions. Therefore, we need to think about creative ways of getting our data shared with our subprocesses. In this case, that means we're not going to make the selection in the main module and pass the polygon to the worker module. Instead, we're going to create a list of feature IDs that we want to process, and we'll pass an ID from that list as a parameter with each call of the worker function, which can then perform the selection with that ID on its own before performing the clipping operation. For this, the worker function selects the polygon matching the OID field parameter when creating a layer with MakeFeatureLayer_management() and uses this selection to clip the feature layer to be clipped. The results are saved in a shapefile that includes the OID in the file's name to ensure that the names are unique.

import os, sys
import arcpy

def worker(clipper, tobeclipped, field, oid): 
    """  
       This is the function that gets called and does the work of clipping the input feature class to one of the polygons from the clipper feature class. 
       Note that this function does not try to write to arcpy.AddMessage() as nothing is ever displayed.  If the clip succeeds then it returns TRUE else FALSE.  
    """
    try:   
        # Create a layer with only the polygon with ID oid. Each clipper layer needs a unique name, so we include oid in the layer name.
        query = '"' + field +'" = ' + str(oid)
        arcpy.MakeFeatureLayer_management(clipper, "clipper_" + str(oid), query) 
       
        # Do the clip. We include the oid in the name of the output feature class. 
        outFC = r"c:\489\output\clip_" + str(oid) + ".shp"
        arcpy.Clip_analysis(tobeclipped, "clipper_" + str(oid), outFC) 
        
        print("finished clipping:", str(oid)) 
        return True # everything went well so we return True
    except: 
        # Some error occurred so return False 
        print("error condition") 
        return False

Having covered the logic of the code, let's review the specific syntax used to make it all work. While you're reading this, first try visualizing how this code might run sequentially: one polygon is used to clip the to-be-clipped feature class, then another polygon, and so on (maybe through 4 or 5 iterations). Once you have an understanding of how the code runs sequentially, try to visualize how it might run in parallel, with the worker function being called 4 times simultaneously and each worker performing its task independently of the other workers.

We’ll start with exploring the syntax within the mp_handler(...) function.

The mp_handler(...) function begins by determining the name of the field that contains the unique IDs of the clipper feature class using the arcpy.Describe(...) function. The code then uses a Search Cursor to collect all of the object (feature) IDs from within the clipper polygon feature class into the list idList. This gives us a list of IDs that we can pass to our worker function along with the other parameters. As a check, the length of that list is printed out.

Next, we create the job list with one entry for each call of the worker() function we want to make. Each element in this list is a tuple of the parameters that should be given to that particular call of worker(). This list will be required when we set up the pool by calling pool.starmap(...). To construct the list, we simply loop through the ID list and append a parameter tuple to the list in variable jobs. The first three parameters are the same for all tuples in the job list; only the polygon ID differs. In the homework assignment for this lesson, you will adapt this code to work with multiple input files to be clipped; as a result, the parameter tuples will then vary in both the oid parameter and the tobeclipped parameter.
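Stripped of the arcpy-specific parts, the job-list construction is just a loop that packs one parameter tuple per polygon. This sketch uses made-up paths and a hardcoded ID list standing in for the OIDs collected with the SearchCursor:

```python
# Stand-ins for the script tool parameters and the Describe() result
clipper = r"C:\489\USA.gdb\States"
tobeclipped = r"C:\489\USA.gdb\Roads"
field = "OBJECTID"
idList = [1, 2, 3]  # in the real script, these come from the SearchCursor

# Build one parameter tuple per polygon ID; only the last element varies.
jobs = []
for id in idList:
    jobs.append((clipper, tobeclipped, field, id))

print(len(jobs))  # prints 3
print(jobs[0])    # prints ('C:\\489\\USA.gdb\\States', 'C:\\489\\USA.gdb\\Roads', 'OBJECTID', 1)
```

For the homework variation with several to-be-clipped files, you would use a nested loop so that the tuples vary in both the tobeclipped and the id element.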

To prepare the multiprocessing pool, we first specify which executable should be used each time a worker is spawned, by calling multiprocessing.set_executable(...). Without this call, a new instance of ArcGIS Pro would be launched by each worker, which is clearly less than ideal. Instead, we use the get_install_path() function defined near the top of the script to determine the path to the pythonw.exe executable.

The code then sets up the size of the pool using the maximum number of processor cores (as we have done in previous examples) and then, using the starmap() method of Pool, calls the worker function worker(...) once for each parameter tuple in the jobs list.

Any outputs from the worker function are collected in variable res. These are the boolean values returned by the worker() function: True to indicate that everything went well and False to indicate that the operation failed. If there is at least one False value in the list, an error message is produced stating the exact number of worker processes that failed.

Let's now look at the code in our worker function worker(...). As we noted in the logic section above, it receives four parameters: the full paths of the clipping and to-be-clipped feature classes, the name of the field that contains the unique IDs in the clipper feature class, and the OID of the polygon it is to use for the clipping.

Notice that the MakeFeatureLayer_management(...) function is used to create an in-memory layer which is a copy of the original clipper layer. This use of an in-memory layer is important in three ways: first, performance, since in-memory layers are faster; second, an in-memory layer helps prevent any chance of file locking (although not if we were writing back to the file); third, selections only work on layers, so even if we wanted to, we could not get away without creating this layer.

The call of MakeFeatureLayer_management(...) also includes an SQL query string, defined on the previous line, to create the layer with just the polygon that matches the oid that was passed as a parameter. The name of the layer we are producing here must be unique; this is why we append str(oid) to the name in the first parameter.
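The query string itself is assembled by simple concatenation. As a plain-Python illustration (the double quotes around the field name are the delimiters used for file geodatabase field names, which is what the code assumes), for a field named OBJECTID and oid 7 it evaluates like this:

```python
field = "OBJECTID"  # OID field name, as returned by arcpy.Describe in the real script
oid = 7             # example polygon ID

# Same construction as in worker(): '"<field>" = <oid>'
query = '"' + field + '" = ' + str(oid)
print(query)  # prints "OBJECTID" = 7
```

The same string could also be built with str.format() or an f-string; the concatenation form is kept here to match the worker code.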

Now with our selection held in our in-memory, uniquely named feature layer, we perform the clip against our to-be-clipped layer and store the result in outFC, which is defined as a hardcoded folder plus a unique file name starting with "clip_" followed by the oid. To run the code, you will most likely have to adapt the path used in variable outFC.
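When you adapt that path, one small robustness tweak worth considering is building it with os.path.join rather than raw string concatenation; the folder name below is just the example from the code and should be changed for your machine:

```python
import os

oid = 7
out_dir = r"c:\489\output"  # adapt this folder for your own machine
outFC = os.path.join(out_dir, "clip_" + str(oid) + ".shp")
print(outFC)  # on Windows this yields c:\489\output\clip_7.shp
```

Remember that the folder must already exist; Clip_analysis() will not create it for you, and the worker will simply return False if writing the output fails.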

The process then returns from the worker function and will be supplied with another oid. This will repeat until a call has been made for each polygon in the clipping feature class.

We are going to use this code as the basis for our Lesson 1 homework project. Have a look at the Assignment Page for full details.

You can test this code out by running it in a number of ways. If you run it from ArcGIS Pro as a script tool, you will have to swap the comment marks on the clipper and tobeclipped input variables so that GetParameterAsText() is called instead of using hardcoded paths and file names. Be sure to set the parameter type for both parameters to Feature Class. If you make changes to the code and have problems with the changes not being reflected in Pro, delete your script tool from the toolbox, restart Pro, and re-add the script tool.

You can run your code from Spyder, but only if you hardcode your parameters or supply them from Spyder (Run > Run configuration per file, tick the Command line options checkbox, and then enter your filenames separated by a space in the text box alongside). Also make sure you're running scripttool.py in Spyder (not multicode.py). You can also run your code from the Command Prompt, which is the fastest way with the smallest resource overhead.

The final thing to remember about this code is that it has a hardcoded output path defined in variable outFC in the worker() function. You will want to change this path, make sure the folder exists, and/or parameterize it so that you have some output to investigate. If you do none of these things, no output will be created.

When the code runs, it will create a shapefile for every unique object identifier in the clipper feature class (there are 51 in the States data set from the sample data), named using the OID (that is, clip_1.shp through clip_59.shp; note that OIDs are not necessarily consecutive).