GEOG 486
Cartography and Visualization

Part II: Data Classification in ArcGIS


Now, let's apply these concepts in ArcGIS. Data classification and symbolization are controlled as Properties of a given data layer. In this part of the lesson you will classify the burglary data using different techniques and then compare and contrast the results.

A. First, some basic symbolization

The roads and hydrology data are included in this exercise to provide some geographic context to the pattern analysis you will do in Lesson 5. You don't necessarily have to include these layers in your screen captures for this lesson, but you may find that they help give a clearer picture of why areas have high or low crime statistics.

  1. Open your Lesson4.mxd document.
  2. Symbolize the padot_stateroads_philadelphia layer by TRAF_RT_NO.
    Give the Interstates an Expressway symbol, the US highways a Highway Ramp symbol, and PA routes a 1pt black line.
  3. Symbolize the Hydrology layer with the default Lakes symbol from the Symbol Selector.
  4. Arrange the layers with roads on top, followed by hydrology and then the census tracts.
  5. Open the Properties for the Tracts2000 layer. Click the Symbology tab.

    So far in this course we have often symbolized data using the options under the Categories field. Here and in Lesson 5 you will explore the Quantities symbolization options.
  6. In the Show: field, choose Quantities > Graduated colors. This is the option for creating choropleth maps.

    Notice there are two drop-down menus in the Fields area: Value and Normalization. Remember from the section on choropleth maps in the concept gallery that choropleth maps best present data as rates or ratios. We create a rate or ratio by normalizing a raw count by some other value. For example, population density is a ratio created by dividing the number of people by a unit of area.
  7. Scroll down the Value: drop-down menu and choose P500_ALL_2009.

    Notice that ArcMap will map this raw count. There are instances when a simple map of occurrences may be useful, but a Graduated Colors symbol scheme (choropleth map) is generally not the right choice for mapping counts. We will focus more on mapping counts in Lesson 5.
  8. For now, let's keep a version of the Philadelphia map just showing the counts, so we can compare it to maps with normalized data.
  9. Click OK to dismiss the Layer Properties window.
  10. In the Table of Contents, click on the heading for the Tracts2000 layer, and change it to 2009 Burglary Counts.

B. Normalizing Data and Using an Equal Interval Classification

As mentioned above, and in the Lesson 4 concept gallery, enumeration units in choropleth maps rarely represent equal populations or equal areas. This means that if we are counting a phenomenon that relates to people, there will almost always be more incidences of that phenomenon where there are more people. A map of crime counts would therefore likely just show you where more people live, not where crime rates are higher. So let's not map the burglary data by counts, but create a crime rate from the counts we have.
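
To see why counts and rates can tell different stories, here is a minimal Python sketch of the normalization idea. The tract names, counts, and populations are invented for illustration and are not taken from the Philadelphia data; the function name rate_per_1000 is likewise hypothetical.

```python
# Hypothetical tracts: (name, burglary count, population).
# These numbers are made up for illustration only.
tracts = [
    ("Tract A", 120, 6000),
    ("Tract B", 30, 1000),
    ("Tract C", 60, 4000),
]


def rate_per_1000(count, population):
    """Return burglaries per 1,000 residents, or None if nobody lives there."""
    if population == 0:
        return None  # a rate is undefined for an unpopulated tract
    return count * 1000 / population


rates = {name: rate_per_1000(count, pop) for name, count, pop in tracts}

for name, count, pop in tracts:
    print(f"{name}: count = {count}, rate per 1,000 = {rates[name]:.1f}")
```

In this made-up example, Tract A has the most burglaries, but Tract B has the highest rate once population is taken into account.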

  1. In the Table of Contents, copy the 2009 Burglary Counts layer (by right-clicking and selecting copy), and then paste it into the Data Frame to make a second map.
  2. Rename the new layer Equal Interval (since we eventually will classify the data in this layer in equal intervals).
  3. Open the Properties for the Equal Interval layer. Click the Symbology tab.

    As the name suggests, an Equal Interval classification divides the range of data values into classes of equal width. Because this classification only deals with class ranges, it is possible to have classes with no data points.
  4. Look at the choices under the Normalization: drop-down menu. What field is an appropriate choice? Choose AREA from the list.

    Look at the data categories that are generated by your choices so far. Do the Range values make any sense? Normalizing by area creates a ratio of burglaries per unit area (Do you remember what the Map Display units are?). The current settings create a ratio of burglaries per square foot - not very useful information for us here.
  5. Change the Normalization: value to CS_POPN_2000_2000, the population value from the 2000 census.

    This will give us the number of burglaries per person. Because rates per person are so low, rates are often expressed per 1,000 people or per 100,000 people, depending on the frequency of the phenomenon. We will come back to this and change our labels to represent a different rate in Step 23.
  6. To the right of the Fields area is the Classification control, which summarizes the current classification scheme and number of classes. Click the Classify... button.
  7. You will see in the histogram that there are clearly some outliers in the data. Most of the data is not even visible because it is under 0.1 burglaries per person. Can you figure out what causes these extreme outliers that appear once we normalize the data? Look at the counts for those tracts with the extreme outliers. How many burglaries do they have? How many people lived there in 2000?

    In this case, with counts throughout an area with enumeration units of various populations, the extreme outliers present in the rates come from "small numbers," often called the "small numbers problem."

    Statisticians have certain guidelines they use to eliminate small population areas or smooth them over statistically. But it is also important as a cartographer - visually communicating data to others - to be able to deal with these issues. Otherwise, your maps may show unreliable information. Let's investigate the small numbers problem a bit using a scatter plot to see what census tracts should be excluded from our classifications.
  8. Click OK to dismiss the Classification and Layer Properties windows. At this point it does not matter which classification method you have chosen. You will see that the data is heavily skewed and most of the census tracts end up in the lowest classes (for nearly every classification method other than Quantile). You also may see some of the census tracts are empty (or show parts of another map underneath - if you have another one turned on). This is because those tracts have zero population and therefore came out null when we normalized the crime counts with the population figure.
  9. Under the View menu, select Graphs > Create...
  10. Under Graph Type: choose Scatter Plot.
  11. Make sure Equal Interval is the layer being used.
  12. Select P500_ALL_RT_2009 as the variable for the Y field (notice the "RT" in the variable name I want you to select). This is the rate of burglaries per 1000 people in 2009, essentially the same variable we created with the count divided by the population. (You can map this variable - without normalizing it - and see how it does look the same as the variable we created. It too has the same issues with small numbers).
  13. Select P500_ALL_2009 as the variable for the X field. Stretch the Graph Wizard Window horizontally as much as possible.

    We are looking at the 2009 burglary rate as a function of the 2009 burglary count. You should be able to see that the highest rates actually have low counts. This is indicative of a small numbers problem. Those census tracts must have low populations. Let's look at the rate as a function of the population.
  14. Select CS_POPN_2000_2000 as the variable for the X field (keeping P500_ALL_RT_2009 as the variable for the Y field).

    This should confirm that all those tracts with really high rates have very low populations. Because there are not a lot of controls for the scatter plot in ArcMap, e.g. limiting the ranges or zooming into the parts of the x or y axes, we cannot see from this tool where the bulk of rates actually sits, or what population count we should use as a cut-off to stabilize the rates. You could use another application if you are interested in doing this. But for the sake of staying in one application, I did the legwork and suggest excluding all tracts with populations less than 700 people. I came up with this figure by looking temporally at the rates and counts and seeing how they changed as compared to census tracts with higher populations (you can do this too using the attribute table, sorting and selecting systematically). I also looked at some of the crime rate maps produced from crimeBase (that we got the data from) and was able to see the tracts that they excluded or they portrayed as parks in the city.
  15. Click Cancel to get out of the Graph Wizard since we were just using it to observe the small numbers problem.
  16. In the Symbology tab of the layer properties for your Equal Interval layer, first confirm that you are mapping the 2009 burglary counts, P500_ALL_2009, and normalizing with the CS_POPN_2000_2000 variable, then click the Classify... button.
  17. Click the Exclusion... button.
  18. Double click the "CS_POPN_2000_2000" variable in the exclude clause: window so the variable shows up in the bottom window. Click the less than symbol ("<"), and then type in 700.
  19. Go to the Legend tab of the Data Exclusion Properties window. Place a checkmark in Show symbol for excluded data and then click the symbol button to choose an appropriate symbol to show which census tracts have sparse data. Also type in "Sparse data" or some other label to show up in the TOC and legend for the census tracts that are excluded from the classification.
  20. Click OK to execute the data exclusion.

    You should now see the histogram in the Classification window show much more of the data, although it is still skewed, with most tracts clustered at the low end.
  21. Click the Classification Method: drop-down menu and choose Equal Interval. Use five classes.

    Notice that the dynamic histogram below updates to show the class break points and their data values. You can manually reposition the break points, but, for now, leave them as is.
  22. Click OK to apply these changes and dismiss the Classification window.
  23. Still in the Symbology tab of the layer properties, under the Label column, click on the top class. The text there should then become editable. Change the values there so the rates are per 1000 people rather than per person. To do this, just move the decimal point to the right three places. For instance if it did say "0.000000000 - 0.024725823", change the text for the label to "0.0 - 24.73", rounding to a value you feel is appropriate.

    These will be reflected in the TOC and any legend made from the legend wizard.
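
The steps in this section (exclude sparse tracts, classify the remaining rates in equal intervals, relabel per 1,000 people) can be sketched in Python as follows. This is only an illustration under stated assumptions: the tract values are invented, the helper equal_interval_breaks is hypothetical, and ArcMap's internal handling of break values and rounding may differ.

```python
def equal_interval_breaks(values, n_classes):
    """Return the upper bound of each class for an equal interval scheme."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes  # every class spans the same data interval
    return [lo + width * i for i in range(1, n_classes + 1)]


# Hypothetical tracts as (burglary count, population); invented for illustration.
tracts = [(50, 5000), (0, 1500), (90, 3000), (4, 20), (20, 2000), (0, 0), (45, 900)]

MIN_POP = 700  # cut-off suggested in the lesson for the small numbers problem

# Exclude sparse tracts before classifying, as the Exclusion... button does.
kept = [(count, pop) for count, pop in tracts if pop >= MIN_POP]
excluded = [(count, pop) for count, pop in tracts if pop < MIN_POP]

# Burglaries per person, like normalizing the count by the population field.
rates = [count / pop for count, pop in kept]

breaks = equal_interval_breaks(rates, 5)

# Relabel each class per 1,000 people (move the decimal point three places).
lower_bounds = [min(rates)] + breaks[:-1]
labels = [f"{lo * 1000:.1f} - {hi * 1000:.1f}" for lo, hi in zip(lower_bounds, breaks)]

print(f"Excluded {len(excluded)} sparse tracts")
for label in labels:
    print(label)
```

Note that the tract with 4 burglaries and 20 residents would have had a rate of 200 per 1,000 people, which is exactly the kind of outlier the exclusion removes.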

C. Quantile (or Percentile) Classification

Rather than separate classes by set value intervals, the quantile classification creates classes with equal numbers of data points in each class. By dictating a certain number of sample points per class, quantile classification schemes can sometimes create classes that include a very wide range of data values. Data values and classes aside, this method produces maps that have an apparent balance - that is to say that each class is represented equally.
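
A minimal Python sketch of the quantile idea follows. The rates are invented, the helper quantile_classes is hypothetical, and tied values are ignored here, so ArcMap's own break values may differ.

```python
def quantile_classes(values, n_classes):
    """Split the sorted values into n_classes groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    groups = []
    for i in range(n_classes):
        start = i * n // n_classes
        end = (i + 1) * n // n_classes
        groups.append(ordered[start:end])
    return groups


# Hypothetical burglary rates per 1,000 people; invented for illustration.
rates = [2, 3, 5, 8, 9, 12, 15, 21, 40, 95]

groups = quantile_classes(rates, 5)
for group in groups:
    span = group[-1] - group[0]
    print(f"{group[0]} - {group[-1]}: {len(group)} tracts, value range {span}")
```

Every class holds two tracts, but the value range covered by a class grows from 1 in the lowest class to 55 in the highest, which is the trade-off described above.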

  1. In the Table of Contents, copy the Equal Interval layer and paste it into the Data Frame.
  2. Rename the duplicate Equal Interval layer as Quantile.
  3. Open the Properties for the new Quantile layer. Leave the Value and Normalization fields as is, and click the Classify... button.
  4. Leaving the exclusion as is, choose Quantile from the Method drop-down list and specify 5 classes. Click OK to dismiss the Classification and Layer Properties windows.

    Take a moment and compare the two classified layers. Same data, different classification method. There is quite a difference, wouldn't you agree?
  5. Copy either the Equal Interval or Quantile layer and paste it into the data frame two more times for two more maps using different classification methods.

D. Other Classification Methods

ArcGIS includes several other classification methods. They are all organized in the same manner as Equal Interval and Quantile. Two of the common methods are Natural Breaks (Jenks) and Standard Deviations. Classification by natural breaks places class breaks at gaps inherent in the data, minimizing the variation within each class and maximizing the differences between classes. In a standard deviation classification, class breaks are set at multiples of the standard deviation above and below the mean. By default, ArcGIS will use a diverging color scheme to visually emphasize the idea of classes varying from a central mean, and will label the classes only in terms of standard deviations from the mean (whether or not this is useful for visually communicating your data).
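
As a rough illustration of the standard deviation approach, the Python sketch below places breaks at whole multiples of the standard deviation around the mean. The data values and the helper std_dev_breaks are hypothetical, and the sketch assumes the population standard deviation; where exactly ArcGIS places its breaks relative to the mean may differ.

```python
import statistics


def std_dev_breaks(values, n_each_side=2):
    """Return breaks at the mean plus or minus whole standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return [mean + k * sd for k in range(-n_each_side, n_each_side + 1)]


# Hypothetical data values; invented for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

breaks = std_dev_breaks(values)
print(breaks)
```

For these values the mean is 5 and the standard deviation is 2, so the breaks fall at 1, 3, 5, 7, and 9.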

  1. Use the steps above to classify the two new layers by Natural Breaks and Standard Deviations. Feel free to experiment with any options within the Classification dialogs.
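
For intuition about what a natural breaks classification is optimizing, here is a brute-force Python sketch that tries every way of cutting the sorted values into classes and keeps the grouping with the smallest total within-class squared deviation. This is not the algorithm ArcGIS uses and is only practical for a handful of values; the data and function names are hypothetical.

```python
from itertools import combinations


def sum_sq_dev(group):
    """Sum of squared deviations of a group's values from the group mean."""
    mean = sum(group) / len(group)
    return sum((v - mean) ** 2 for v in group)


def natural_breaks_brute_force(values, n_classes):
    """Return the grouping of sorted values with minimal within-class variation."""
    ordered = sorted(values)
    n = len(ordered)
    best_groups, best_cost = None, float("inf")
    # Each combination is a set of cut positions between neighboring values.
    for cuts in combinations(range(1, n), n_classes - 1):
        bounds = (0, *cuts, n)
        groups = [ordered[a:b] for a, b in zip(bounds, bounds[1:])]
        cost = sum(sum_sq_dev(g) for g in groups)
        if cost < best_cost:
            best_groups, best_cost = groups, cost
    return best_groups


# Hypothetical data with three visible clusters; invented for illustration.
values = [1, 2, 3, 10, 11, 12, 30, 31]

groups = natural_breaks_brute_force(values, 3)
print(groups)
```

The class boundaries land in the large gaps between the clusters rather than at fixed intervals or fixed counts.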

E. Save your map document.