GEOG 482
The Nature of Geographic Information

5. Shapefiles

Since 2007, TIGER/Line extracts from the MAF/TIGER database have been distributed in shapefile format. Esri introduced shapefiles in the early 1990s as the native digital vector data format of its ArcView software product. The shapefile format is proprietary, but open: its technical specifications are published and can be implemented and used freely. Largely as a result of ArcView’s popularity, the shapefile has become a de facto standard for the creation and interchange of vector geospatial data. The Census Bureau’s adoption of the shapefile as a distribution format is therefore consistent with its overall strategy of conformance with mainstream information technology practices.

Elements of a Shapefile Data Set

The first thing GIS pros need to know about shapefiles is that every shapefile data set includes a minimum of three files. One of the three required files stores the geometry of the digital features as sets of vector coordinates. A second required file holds an index that, much like the index in a book, allows quicker access to the spatial features and therefore speeds up any operation that involves only a subset of the features. The third required file stores attribute data in dBASE format, one of the earliest and most widely used digital database management system formats. All of the files that make up a shapefile data set have the same root or prefix name, followed by a three-letter suffix or file extension. The list below shows the names of the three required files making up a shapefile data set named “counties.” Take note of the file extensions:

  • counties.shp: the main shape file, containing vector coordinate data
  • counties.shx: the index file
  • counties.dbf: the dBASE table

Esri lists twelve additional optional files, and practitioners are able to include still others. Two of the most important optional files are the “.prj” file, which includes the coordinate system definition, and “.xml”, which stores metadata. (Why do you suppose that something as essential as a coordinate system definition is considered “optional”?)
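
Because the component files are tied together only by their shared root name, it can be handy to confirm that nothing is missing before opening a data set. The following is a minimal sketch, not part of any GIS package: the function names are made up for illustration, and it assumes the metadata file is named with the ".shp.xml" suffix, as it is in the TIGER/Line downloads.

```python
from pathlib import Path

# Extensions of the three files every shapefile data set must include.
REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf")
# Two optional companions named in the text. Assumption: metadata is
# stored as <root>.shp.xml, the naming used in TIGER/Line downloads.
OPTIONAL_EXTENSIONS = (".prj", ".shp.xml")


def check_shapefile_dataset(folder, root_name):
    """Report which component files of a shapefile data set exist in folder."""
    folder = Path(folder)
    report = {}
    for ext in REQUIRED_EXTENSIONS + OPTIONAL_EXTENSIONS:
        # All components share the same root name and differ only in suffix.
        report[ext] = (folder / f"{root_name}{ext}").is_file()
    return report


def is_complete(report):
    """A data set is usable only if all three required files are present."""
    return all(report[ext] for ext in REQUIRED_EXTENSIONS)
```

Calling check_shapefile_dataset("data", "counties") would return a dictionary keyed by extension, with True for each file found in the hypothetical "data" folder. On file systems that distinguish upper and lower case, a file named counties.SHP would not be matched by this sketch.
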

Try This!

Downloading and viewing a TIGER/Line Shapefile

In this Try This! (the second of three dealing with TIGER/Line Shapefiles), you will download a TIGER/Line Shapefile dataset, investigate the file structure of a typical Esri shapefile, and view it in GIS software.

You can use a software application called Global Mapper (originally known as dlgv32 Pro) to investigate TIGER/Line shapefiles. Originally developed by the staff of the USGS Mapping Division at Rolla, Missouri, as a viewer for USGS data, Global Mapper has since been commercialized, but it is available in a free trial version. The instructions below will guide you through the process of installing the software and opening the TIGER/Line data.

  1. Downloading TIGER/Line Shapefiles: You are going to use the 2010 TIGER/Line Shapefiles.
    • Return to the 2010 TIGER/Line Shapefiles download page.
    • From the Select a layer type pick list, under Features, choose All Lines, and click Submit. (You are welcome to download and investigate any TIGER/Line Shapefile(s), but we will use an All Lines dataset in the geocoding Try This later in the chapter, so downloading one here will make you more familiar with the content.)
    • From the All Lines pick list select a state or territory, and click Submit.
    • Select a County from the next pick list that appears, and click Download.
    • Save the file to your computer.
      The file you download should have a name like tl_2010_42027_edges.zip. The root name of this file, tl_2010_42027_edges in this example, will also be the name of the shapefile dataset. The number 42027 is a federal (FIPS) code that identifies Pennsylvania (state 42) and Centre County (county 027). The five-digit code in your file name will depend on which state and county you selected.
    • The data are compressed in a .zip archive. Extract the data to a new named folder in a known location. (Within the file hierarchy that is extracted, there may be a second .zip file that needs to be uncompressed.)
  2. Investigating the shapefile data set:
    • Navigate to within the folder in which you stored your uncompressed TIGER/Line Shapefile dataset.
    • Notice the multiple files which make up the shapefile dataset, including:
      • tl_2010_42027_edges.shp, containing the vector coordinate data
      • tl_2010_42027_edges.shp.xml, containing metadata
      • tl_2010_42027_edges.shx, the index file
      • tl_2010_42027_edges.dbf, the dBASE file
      • tl_2010_42027_edges.prj, containing the projection/spatial reference
    • All of the files work in concert to store the necessary components of the Esri shapefile data set. You may be familiar with some of the individual file types. The contents of three of them can be easily viewed. Let's open those three. You can double-click on the file and then select "from a list of installed programs," or you may need to run the suggested application and open the file from within it. Let me know if you need help, or help each other in the Canvas Chapter 4 Discussion Forum.
      • Open the .dbf file using Microsoft Excel.
        Note the typical row-column structure of a flat-file database. Can you find the four columns, or fields, that hold the address range information? Look for LFROMADD, etc. The field name LFROMADD is shorthand for Left From Address. The terse abbreviation reflects one of the constraints of the dBASE format: field names are limited to 10 characters.
      • Open the .xml file using your web browser.
        You should see the metadata information enclosed by tags written within angle brackets < >. XML stands for Extensible Markup Language, and is a common set of rules for encoding documents. Can you locate the portion of the document having to do with horizontal spatial accuracy? (Spatial accuracy metadata is available when you've chosen the All Lines file as your candidate shapefile.)
      • Open the .prj file using Notepad, or any vanilla text editor.
        There are five pieces of information in this file, separated by commas. What are they? They should reinforce some of what you learned in Chapter 2 regarding what defines a geographic coordinate system.
      • The .shp and .shx files are binary files specific to the shapefile format, so their contents cannot be read in a text editor or spreadsheet.
    • Note that one should not alter the contents of any of these files with any application other than a GIS program that is designed for that task.
      In posts to the Chapter 4 Discussion Forum, discuss with your classmates what you find when you open the .dbf, the .xml, and the .prj files.
  3. Viewing the shapefile dataset in Global Mapper:
    • Download and install the Global Mapper software:
      1. Navigate to the Blue Marble Global Mapper site.
      2. Download the trial version of the software.
      3. Double-click on the setup file you downloaded to install the program.
      4. Launch the Global Mapper program.
    • After opening the Global Mapper software, choose Open Data File(s)... under the File menu, or click the "Open Your Own Data Files" button in the center of the window. Navigate to the extracted shapefile dataset you downloaded above and open it. (Remember, your complete shapefile data set will have a name similar to tl_2010_42027_edges. It will show up in the Open dialog with a .shp extension.)
    • You should be able to see all of the line features (the edges, from the MAF/TIGER database) contained in your county. If you are using the newest version of Global Mapper, you should be able to discern roads from rivers/streams from administrative boundaries, etc. In older versions of the application, the default view showed all line features in a single color and line weight, so the user needed to use the symbolization tools to make the different classes of features distinguishable.
      What do you think has to be understood by the mapping application to allow it to automatically symbolize features differently? Post your thoughts in the Chapter 4 Discussion Forum.
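
The 10-character limit on field names noted in step 2 comes from the layout of the .dbf file itself: each field is described by a fixed-size record in the file header that reserves 11 bytes for the name (10 characters plus a terminating null). The sketch below reads field names straight from that header using only the Python standard library. It assumes the common dBASE III header layout, and the function name read_dbf_fields is invented for this illustration. It only reads, so it does not conflict with the advice above about not altering these files outside GIS software.

```python
import struct


def read_dbf_fields(data):
    """Return (record_count, [(name, type, length), ...]) from dBASE file bytes."""
    # Bytes 4-7 hold the record count, bytes 8-9 the header length and
    # bytes 10-11 the record length, all little-endian.
    record_count, header_length, _record_length = struct.unpack_from("<IHH", data, 4)
    fields = []
    offset = 32  # field descriptors start after the 32-byte file header
    # Each descriptor is 32 bytes long; a 0x0D byte ends the descriptor list.
    while offset + 32 <= header_length and data[offset] != 0x0D:
        raw_name = data[offset:offset + 11]   # up to 10 characters, null padded
        name = raw_name.split(b"\x00", 1)[0].decode("ascii")
        field_type = chr(data[offset + 11])   # e.g. "C" for character, "N" for numeric
        field_length = data[offset + 16]      # width of the field in bytes
        fields.append((name, field_type, field_length))
        offset += 32
    return record_count, fields
```

To try it, read the bytes of your own file (for example, tl_2010_42027_edges.dbf opened in binary mode) and pass them to read_dbf_fields. The field list should include LFROMADD and the other address range fields you saw in Excel.
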

Shapefile Primitives

A single shapefile data set can contain one of three types of spatial data primitives, or features – points, lines or polygons (areas). The technical specification defines these as follows:

  • Points: A point consists of a pair of double-precision coordinates in the order X,Y.
  • Lines: A line, or more specifically a polyline, is an ordered set of points, or vertices, that consists of one or more parts. A part is a connected sequence of two or more points. Parts may or may not be connected to one another. Parts may or may not intersect one another.
  • Polygons: A polygon consists of one or more rings. A ring is a connected sequence of four or more points, or vertices, that form a closed, non-self-intersecting loop.
  • Other: M (measured; route data) and Z (3D; vertical datum) versions of point, polyline and polygon Shapefile data sets can be created, but are not included in the TIGER/Line Shapefile extracts.
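
Since a shapefile data set holds only one type of primitive, that type is recorded once, in the fixed 100-byte header at the start of the .shp file. The sketch below decodes the shape type and bounding box from those bytes. The byte offsets and type codes follow the published Esri technical description as understood here; the function name read_shp_header is invented for this illustration, and the code table includes a few types (MultiPoint, MultiPatch) that the list above does not discuss.

```python
import struct

# Shape type codes from the published shapefile technical description.
SHAPE_TYPES = {
    0: "Null Shape",
    1: "Point", 3: "PolyLine", 5: "Polygon", 8: "MultiPoint",
    11: "PointZ", 13: "PolyLineZ", 15: "PolygonZ", 18: "MultiPointZ",
    21: "PointM", 23: "PolyLineM", 25: "PolygonM", 28: "MultiPointM",
    31: "MultiPatch",
}


def read_shp_header(header):
    """Decode the shape type and bounding box from a .shp file header."""
    if len(header) < 100:
        raise ValueError("a .shp header is 100 bytes long")
    # The file code is stored big-endian and is always 9994.
    (file_code,) = struct.unpack_from(">i", header, 0)
    if file_code != 9994:
        raise ValueError("not a shapefile main file")
    # The shape type and bounding box are stored little-endian.
    (shape_code,) = struct.unpack_from("<i", header, 32)
    xmin, ymin, xmax, ymax = struct.unpack_from("<4d", header, 36)
    return {
        "shape_type": SHAPE_TYPES.get(shape_code, f"unknown ({shape_code})"),
        "bbox": (xmin, ymin, xmax, ymax),
    }
```

Passing the first 100 bytes of an All Lines file such as tl_2010_42027_edges.shp should report a polyline type, since the edges are line features.
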
Diagram illustrating geometric primitives of the Shapefile format
Figure 4.5.1 Three Shapefile data sets that could be extracted from the MAF/TIGER data depicted on the preceding page.

At left in the figure above, a polygon Shapefile data set holds the Census blocks in which the edges from the MAF/TIGER database have been combined to form two distinct polygons, P1 and P2. The diagram shows the two polygons separated to emphasize the fact that what is the single E12 edge in the MAF/TIGER database (see Figure 4.4.1 on page 4) is now present in each of the Census block polygon features.
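
The duplication of a shared boundary can be illustrated with a few lines of code. The sketch below uses made-up coordinates for two adjacent square blocks (they are not taken from the figure), checks the two ring rules that are easy to test, and then finds the segment that is stored in both rings. It does not test for self-intersection, and the function names are invented for this example.

```python
def is_valid_ring(ring):
    """Check two ring rules: at least four points, and first point equals last."""
    return len(ring) >= 4 and ring[0] == ring[-1]


def ring_edges(ring):
    """Return the ring's segments as unordered vertex pairs, ignoring direction."""
    return {frozenset((a, b)) for a, b in zip(ring, ring[1:])}


# Hypothetical coordinates for two adjacent blocks, each a closed ring of
# five points (the first point is repeated at the end to close the loop).
p1 = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0)]
p2 = [(1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (2.0, 0.0), (1.0, 0.0)]

# The boundary between the two blocks appears once in each ring, so the
# same segment is stored twice in the polygon data set.
shared = ring_edges(p1) & ring_edges(p2)
```

Here shared contains a single segment, the one running between (1.0, 0.0) and (1.0, 1.0), which plays the role of the E12 edge in this toy example.
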

In the middle of the illustration above, a polyline Shapefile data set holds seven line features (L1-7) that correspond to the seven edges in the MAF/TIGER database. The directionality of the line features that represent streets corresponds to address range attributes in the associated dBASE table. Vertices define the shape of a polygon or a line, and the Start and End Nodes from the MAF/TIGER database are now First and Last Vertices.

Finally, at right in the illustration above, a point Shapefile data set holds the three isolated nodes from the MAF/TIGER database.

Practice Quiz

Registered Penn State students should return now to the Chapter 4 section of the Modules pages in Canvas to take a self-assessment quiz about Shapefiles.

You may take practice quizzes as many times as you wish. They are not scored and do not affect your grade in any way.

Students who register for this Penn State course gain access to assignments and instructor feedback, and earn academic credit. Information about Penn State's Online Geospatial Education programs is available at the Geospatial Education Program Office.