GIS for Transportation: Principles, Data and Applications

3.1 Geocoding


Geocoding is the process of taking the description of a specific location and converting it into a set of coordinates or a point feature which can then be displayed on a map or used in some type of spatial analysis. A variety of location description types can be geocoded including addresses and place names. There are a number of different approaches which can be used for geocoding, but at a high level they all follow the same process:

  • Descriptions of the locations to be geocoded are compiled into a standard format.
  • The location descriptions are compared to a reference dataset.
  • Candidate locations are established and scored according to a set of rules.
  • If the score for a candidate location exceeds a threshold value, it is declared a match.
  • If no candidate location score exceeds the established threshold, the location of interest is flagged as unmatched.
  • If two or more candidate locations share the same score and that score exceeds the threshold value, a tie is declared.
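The decision rules in the list above can be sketched in a few lines of Python. This is only an illustration, not how any particular geocoding engine is implemented. The function name `classify` is hypothetical, scores at or above the threshold are assumed to qualify, and a tie is assumed to be declared only when the best qualifying score is shared.

```python
from typing import List


def classify(candidate_scores: List[float], min_match_score: float) -> str:
    """Return 'matched', 'tied', or 'unmatched' for one location description."""
    # Assumption: a candidate qualifies when its score meets or exceeds the threshold.
    qualifying = [score for score in candidate_scores if score >= min_match_score]
    if not qualifying:
        return "unmatched"
    best = max(qualifying)
    # Assumption: only a shared *best* score counts as a tie.
    if qualifying.count(best) > 1:
        return "tied"
    return "matched"
```

For example, `classify([90, 90, 70], 85)` reports a tie because two candidates share the top score of 90.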

Geocoding is a widely used geospatial technique with applications across many industries. It is often a prerequisite to network analysis such as routing. Geocoding methods differ primarily in the type of reference data they use. The most common method uses roadway centerline data in which each street segment carries address range attributes for each side of the street. Most online geocoding services, including Google Maps, Yahoo Maps, and MapQuest, rely almost exclusively on this type of geocoding. Other methods use parcel boundary data or address point data. You’ll read more about the different types of geocoding in Assignment 3-1.
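To make the address range idea concrete, here is a minimal sketch of one common way a house number can be placed along a street segment: linear interpolation between the from and to addresses on the matching side. It assumes a straight segment with planar coordinates, odd and even numbers on opposite sides, and no side offset. All names (`interpolate_address`, `left_range`, `right_range`) are hypothetical and are not taken from any specific geocoding product.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]
AddressRange = Tuple[int, int]  # (from_address, to_address) along the segment


def interpolate_address(
    house_number: int,
    start: Point,
    end: Point,
    left_range: AddressRange,
    right_range: AddressRange,
) -> Optional[Tuple[Point, str]]:
    """Estimate a point for house_number on a straight segment, or None if out of range."""
    for side, (from_addr, to_addr) in (("left", left_range), ("right", right_range)):
        low, high = min(from_addr, to_addr), max(from_addr, to_addr)
        # Assumption: each side holds only odd or only even house numbers.
        same_parity = house_number % 2 == from_addr % 2
        if same_parity and low <= house_number <= high:
            if to_addr == from_addr:
                fraction = 0.5  # single-address range: use the segment midpoint
            else:
                fraction = (house_number - from_addr) / (to_addr - from_addr)
            x = start[0] + fraction * (end[0] - start[0])
            y = start[1] + fraction * (end[1] - start[1])
            return (x, y), side
    return None
```

With a segment running from (0, 0) to (100, 0), a right-side range of 100 to 200, and a left-side range of 101 to 201, house number 150 falls halfway along the right side. Note that the result is an estimate: the real building may not sit at the interpolated position.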

There are many geocoding services which are available, some of which are free and some of which are subscription based. The free services generally limit the number of locations you can process at one time. Given a suitable reference dataset, it is also possible to create your own geocoding service. You’ll have an opportunity to do just that in Assignment 3-2.

Assignment 3-1 (15 points)

Read the article “A comparison of address point, parcel and street geocoding techniques” by Paul A. Zandbergen. Address the following questions and submit a Microsoft Word document with your responses to Assignment 3-1 in Canvas.

  1. List some pros and cons of address point, parcel, and street geocoding techniques. (2 points)
  2. Zandbergen points out that while the street network data model can result in higher match rates than the parcel boundary address model, this is in part due to false positive matches. What gives rise to the false positive matches? (2 points)
  3. In your own words, briefly describe how Soundex is used in geocoding. (2 points)
  4. In the street network data model, describe how a particular street number is located along a street. (3 points)
  5. List an important advantage the address point data model offers over the parcel boundary address model. (2 points)
  6. Zandbergen discusses three components of geocoding quality. What are they? (2 points)
  7. Zandbergen states that the positional accuracy of geocoding is generally higher in urban areas than in rural areas. Why might this be the case? (2 points)

ArcGIS Address Locators

The first step in geocoding with ArcGIS is selecting an address locator. The address locator defines the reference dataset and the rules the geocoding engine uses to identify candidates and matches for the location descriptions (typically addresses) you are trying to locate. You can use an existing address locator, which typically requires a subscription, or you can create your own. To create your own address locator, you need access to a suitable set of reference data. Many potential reference datasets are available, including those created by state or county governments. One good source of reference data for geocoding is the TIGER/Line shapefiles we examined in Lesson 2.

To create an address locator, use the “Create Address Locator” tool in ArcToolbox (see Figure 3.1).

ArcToolbox menu
Figure 3.1 - Create Address Locator Tool

When you launch the tool, you are presented with the Address Locator dialog (see Figure 3.2).

Create Address Locator dialog
Figure 3.2 - Address Locator Dialog

The first step in creating an address locator is selecting a locator style. The most appropriate locator style depends on the reference data you’re planning to use and on the format of the locations you’re trying to geocode. A commonly used address locator style is US Address – Dual Ranges (see Figure 3.3).

Select Address Locator Style
Figure 3.3 - Address Locator Styles

Once the locator style has been selected, the Field Map list in the bottom portion of the Address Locator dialog is automatically populated (see Figure 3.4). Fields with an asterisk are required by the locator style, and fields without an asterisk are optional. Once you have loaded a reference dataset, you can map these fields to the corresponding fields in the reference data.

Figure 3.4 - Address Locator Dialog with Field Map List Populated

The second step in creating an address locator is defining the reference dataset or datasets which will be used. As mentioned above, many reference data sources can be used. For example, you can use a linear feature class based on roadway centerlines such as the TIGER/Line Address Range-Feature shapefiles we reviewed in Lesson 2. Alternatively, you could use a polygon feature class based on parcel boundaries or ZIP Code boundaries. Yet another option would be to use a point feature class based on address points.

Once you have selected the reference data, you can map the fields required by the address locator style you have selected to the corresponding fields in the reference data (see Figure 3.5).

Figure 3.5 - Mapping Locator Style Fields with Reference Data Fields

The final step is to save the address locator to a location you select. While you can store the locator in either a geodatabase or a file folder, ESRI recommends storing an address locator in a file folder for better performance.

Here is a link to an ESRI webpage where you can download a white paper which tells you everything you’d ever want to know about address locators in ArcGIS.

Geocoding a List of Addresses (or other location descriptions)

To geocode a list of addresses, you should first add the table of addresses to your map document in ArcGIS. The addresses to be geocoded can be prepared in any of several file formats including xlsx, xls, dbf, csv, and txt. Once the table of addresses has been added, you can right-click on the newly added table and select “Geocode Addresses” from the resulting context menu. At this point, you’ll be asked to select an address locator (see Figure 3.6).
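Before a table is geocoded, it often helps to tidy the address text so that it is in a consistent format. The sketch below reads a small csv table and normalizes whitespace and letter case using only the Python standard library. The column names (`Address`, `City`, `State`, `ZIP`) and the function name `load_addresses` are hypothetical, and the cleanup rules are assumptions for illustration rather than anything ArcGIS requires.

```python
import csv
import io
from typing import Dict, List


def load_addresses(csv_text: str) -> List[Dict[str, str]]:
    """Parse csv text with a header row and return cleaned rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        # Collapse repeated spaces, trim the ends, and uppercase every value.
        # Assumption: every row has a value for every column in the header.
        cleaned = {
            key.strip(): " ".join(value.split()).upper()
            for key, value in row.items()
        }
        rows.append(cleaned)
    return rows


sample = "Address,City,State,ZIP\n 123  main st ,State College,pa,16801\n"
addresses = load_addresses(sample)
```

Here `addresses[0]["Address"]` becomes `123 MAIN ST`, which is easier to compare against reference data than the raw input.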

Figure 3.6 - Select an Address Locator

If the address locator you wish to use is not in the list, you can add it. Once you select an address locator and click “OK,” you will be presented with the “Geocode Addresses” dialog (see Figure 3.7).

Figure 3.7 - Geocode Addresses Dialog

In the top portion of the dialog, you can map the fields in the input table to the corresponding fields in the address locator, if it isn’t done automatically, and define the location and name of the shapefile or feature class where the results of the geocoding process should be stored. You can also configure some parameters for the address locator by clicking the “Geocoding Options” button. The “Geocoding Options” dialog is then displayed (see Figure 3.8).

Figure 3.8 - Geocoding Options Dialog

In the top portion of the dialog, you can exercise some control over how matching is performed. The spelling sensitivity level controls how much misspelling is tolerated when the engine searches for candidates. The lower the sensitivity level, the more tolerant the geocoding engine is of misspelled words. The minimum candidate score sets the threshold score for identifying candidates. The lower this score, the more candidates an address could have. Finally, the minimum match score establishes the threshold score for declaring a match for the address. Lowering the minimum match score will generally increase the match rate but will also tend to result in a higher rate of false positives.
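The interaction between the two score thresholds can be illustrated with a small sketch. The ArcGIS scoring rules are not described in this lesson, so the example substitutes a generic string similarity from Python's standard `difflib` module as a stand-in score on a 0 to 100 scale. The function names and the sample street names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def street_name_similarity(a: str, b: str) -> float:
    """Stand-in score from 0 to 100, ignoring letter case (not the ArcGIS score)."""
    return 100 * SequenceMatcher(None, a.upper(), b.upper()).ratio()


def filter_candidates(
    input_name: str,
    reference_names: List[str],
    min_candidate_score: float,
    min_match_score: float,
) -> List[Tuple[str, float, bool]]:
    """Return (name, score, is_match) for each candidate, best score first."""
    candidates = []
    for name in reference_names:
        score = street_name_similarity(input_name, name)
        # Only names at or above the candidate threshold are kept at all.
        if score >= min_candidate_score:
            # A kept candidate is a potential match only above the match threshold.
            candidates.append((name, score, score >= min_match_score))
    return sorted(candidates, key=lambda c: c[1], reverse=True)


reference = ["College Ave", "Park Ave"]
strict = filter_candidates("Colege Ave", reference, 99, 99)
loose = filter_candidates("Colege Ave", reference, 60, 85)
```

With the strict settings, the misspelled input "Colege Ave" produces no candidates. With the looser settings, "College Ave" is returned as a candidate that also clears the match threshold. The same loosening would let more incorrect names through in a larger reference dataset, which is the false positive risk described above.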

The dialog can also be used to set other parameters for the geocoding engine such as offset positions for geocoded point features and some output data elements which can optionally be included as attributes in the resultant shapefile or feature class.

Once the geocoding options have been defined, the geocoding process can be initiated by clicking “OK” on the “Geocode Addresses” dialog (see Figure 3.7). When the geocoding process is complete, a summary of the geocoding results is presented (see Figure 3.9).

Summary of geocoding results: matched, tied, unmatched and percentage complete
Figure 3.9 - Geocoding Results Summary Screen

This summary shows the number of addresses with a single best candidate at or above the minimum match score (i.e., matches), the number of addresses with multiple candidates that share the same best score at or above the minimum match score (i.e., ties), and the number of addresses for which no candidate reached the minimum match score (i.e., unmatched).
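The counts and percentages on a summary like this are straightforward to reproduce once each address has been labeled. The sketch below assumes a list of status labels ("matched", "tied", "unmatched") has already been produced by some geocoding step. The function name `summarize` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize(statuses: List[str]) -> Dict[str, Tuple[int, float]]:
    """Return {status: (count, percent of total)} for the three result categories."""
    counts = Counter(statuses)
    total = len(statuses)
    summary = {}
    for status in ("matched", "tied", "unmatched"):
        n = counts[status]  # Counter returns 0 for labels that never occur
        percent = round(100 * n / total, 1) if total else 0.0
        summary[status] = (n, percent)
    return summary


results = summarize(["matched"] * 8 + ["tied"] + ["unmatched"])
```

For this made-up list of ten addresses, `results["matched"]` is `(8, 80.0)`.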

From the results summary screen, a manual rematch process can be initiated by clicking the “Rematch” button. This brings up the “Interactive Rematch” screen (see Figure 3.10).

Figure 3.10 - Interactive Rematch Screen

On this screen, unmatched addresses, ties, and matched addresses can be reviewed. Unmatched addresses generally result from either a problem with the address or a problem in the reference data. If a problem is observed with the address, it can be corrected and matched with the correct candidate directly on this screen. Often, however, it is unclear what the problem is with a particular address, and additional research is required to determine where the problem lies before it can be corrected.

Assignment 3-2 (20 points)

In this assignment, you will use TIGER/Line data to create an address locator and geocode some addresses.

  1. Download the 2016 Address Range-Feature Shapefile for Centre County, Pennsylvania.
  2. Create a custom address locator in ArcGIS using the locator style “US Address – Dual Ranges” and the Centre County roadway centerline reference data.
  3. Download this list of addresses from Centre County and geocode them using the custom address locator you created.

Submit a Microsoft Word document to Assignment 3-2 in Canvas which addresses the following items:

  1. Capture a screenshot of the geocoding summary screen showing the number of matches, ties, and unmatched addresses. (5 points)
  2. For each unmatched address and tie, use the interactive rematch screen to assess why a match for the address was not found. (Note: Out of the 15,000 or so addresses in the table, there should be fewer than 10 addresses that were unmatched or had tied candidates. This is only because this list of addresses was carefully prepared. Generally, you should expect to see a lower match rate.) (7 points)
  3. Include a screenshot of a map showing the geocoded addresses. The map should also include layers for the TIGER roadway centerlines used as reference data, county boundaries (hint: you can download the TIGER/Line shapefile for counties for the entire nation), and the topographic base map which is available by default in ArcGIS. The extent of the map should be a little larger than Centre County, so the whole county is visible. (5 points)
  4. The address locator you created actually consists of three separate files. One of them is an XML file (i.e., the file’s extension is ‘xml’) which contains the rules the geocoding engine follows while processing the table of addresses. Open this file in Notepad or another text editor and take a brief look at the information it contains. If you’ve never looked at an XML file before, try not to get hung up on the unusual structure of the document. Is there anything you can identify in the file that gives you some insight into some of the rules the address locator defines (be brief)? (3 points)