Lesson 4: Multiple Classifications

The lessons this week and next week examine issues related to presenting data for thematic display. In Lesson 4 we focus on data classification and color schemes, and in Lesson 5 we focus on different kinds of map representations. How do different data classifications affect map pattern recognition? How do different color schemes affect pattern recognition? What is the appropriate classification for a given dataset? In Lessons 4 and 5, you will explore classification and symbolization tools to create several map series using longitudinal crime data.

A. Goals

B. Deliverables

See the Lesson 4 Deliverables page. For specific due dates, see the course Calendar tab in ANGEL.

Questions?

If you have any questions now or at any point during this week, please feel free to post them to the Lesson 4 Discussion Forum. (To access the forums, return to ANGEL via the ANGEL link in the Resources menu. Once in ANGEL, navigate to the Communicate tab and then scroll down to the Discussion Forums section.) While you are there, feel free to post responses of your own if you are able to help a classmate.

Checklist

Lesson 4 is one week in length. (See the Calendar in ANGEL for specific due dates.) To finish this lesson, you must complete the activities listed below.

Steps to Completing Lesson 4

  1. Read the lesson Overview and Checklist. You are in the Lesson 4 online content now; the Overview page precedes this page.
  2. Read the concepts introduced for this lesson. Go to the Concept Gallery for Lesson 4.
  3. Work through the Lesson 4 exercise. Click on the "Next Page" link to access the Lesson 4 exercise, and also make sure to read the Lesson 4 Deliverables for the project requirements.
  4. Participate in the Discussion. To participate in the discussion, go to the Lesson 4 Discussion Forum in ANGEL.
  5. Submit the series of maps made for the lesson. After working through the lesson and completing the layout for your maps, submit them to the Lesson 4 drop box in ANGEL, which can be accessed inside the Lesson 4 folder under the Lessons tab.
  6. Complete the Lesson 4 quiz. The Lesson 4 Quiz can be accessed inside the Lesson 4 folder under the Lessons tab.

 

Part I: Getting Started

Before you get started on the assignment, read the concepts in the Lesson 4 Concept Gallery.

Concept Gallery

Learn more about Classification Schemes in the Concept Gallery.

Learn more about Choropleth Maps in the Concept Gallery.

For this lesson, you will be downloading all of the spatial and attribute data from two websites: one sponsored by the Cartographic Modeling Lab at the University of Pennsylvania, and the other sponsored by the Pennsylvania Department of Environmental Protection. Note: the crime data from the Cartographic Modeling Lab are no longer being made available for download. Until I rewrite this part of the lesson with different data, I will still have you walk through the steps of downloading the data below so you understand what data you will be using - and how you might download data from a website like this. To acquire the actual burglary data (not the boundary data) used for the lesson, however, go to the ANGEL Lesson 4 folder. You will be prompted to do this in the lesson instructions below.

A. Download Lesson Data from a Website

  1. In your geog486 directory, create a Lesson4 directory.

    First, we will visit the Philadelphia Neighborhood Information System website, where we will download three datasets: (1) an outline of the city of Philadelphia, (2) the US Census Tracts for Philadelphia, and (3) several years' worth of crime data aggregated to the Census tract level.

  2. Go to the Philadelphia NIS Web site (http://www.cml.upenn.edu/nis/).
  3. Click on the crimeBase link, and then click on go to crimeBase.

    Read the Disclaimer to get a sense of where the data come from. You can peruse the instructions if and when you like (the link is in the New Users box). In the meantime, follow the steps below to download the data we will be using.



    First we will retrieve the two spatial data layers.

  4. In the left-hand column, click on the GIS Data link.
  5. In the ensuing screen, click on the boundary.zip link.

    This will allow you to download the Philadelphia city boundary.

  6. Via the dialog window that appears, save the boundary.zip file to your geog486/Lesson4 directory.
  7. Unzip and extract the contents of boundary.zip to your Lesson4 directory.

    The boundary layer is a shapefile dataset.



    Next you will download the US Census Tracts for Philadelphia. Census block groups are also available, but the crime data aggregated at that spatial level are not made available to the public. Block groups are smaller than tracts, and privacy issues may arise where only a few people reside in a block group area.

  8. You should still be viewing the Crimebase GIS Data page of the website. Click on the tracts2000.zip link, and
  9. Save the tracts2000.zip file to your Lesson4 directory.
  10. Extract the contents of the tracts2000.zip file to your Lesson4 directory. Again, this is a shapefile dataset.

    Now you will retrieve incidence of burglary information.

  11. In the left-hand column of the web page, click on the Tables link.
  12. Make certain that slot #1 in the box that appears contains 2000 Census Tracts.
  13. In the #2 dropdown slot, choose Burglaries(500 series).
  14. Click the Next button.

    In the page that appears you should see, in step #3 "Choose Data Element(s)," a list of 1998 through 2009, Burglaries (500 series).

  15. Hold down the Shift or the Control key, highlight 1998 through 2009 entries in the #3 window, and then click the Add Element button. Those 12 entries should appear in the lower window of the dialog box.
  16. In the same window, scroll past the Residential and Commercial Burglaries, and then hold down the shift or control key and highlight the 1998 through 2009 entries for the Burglaries(500 series) Rate per 1,000 population. Then click the Add Element button. Those 12 entries should also appear in the lower window of the dialog box.
  17. In the same window, scroll all the way to the bottom past many datasets for other types of crimes and some population figures. Select the seventh data element from the bottom, "2000, Population, Number" and then click the Add Element button. That one entry should appear in the lower window of the dialog box.

    Note that when you click the Add Element button, the entries disappear from the #3 window. So, if things get confusing, you may need to use your browser's Back button to get back to our Step 13.

  18. Click the Next button.

    The ensuing page will present you with a listing of the burglary counts and rates for each of the 12 years, for each of the Census tracts for Philadelphia (scroll down to see them). Take note of the numeric designations of the Census tracts, the 000100, etc. values.



    On the right side of the web page, toward the top, you should also find a small diskette icon with Export It written next to it.

  19. Normally at this point you would click on the Export It link. But as mentioned above, the download for these data has been deactivated. I saved the data before the download was removed. Get the data ("burglaries.xlsx") from the Lesson 4 folder under the Lessons tab in ANGEL, and then go to instruction #25. Read through the other steps if you like.
  20. You should see the Export Table page displaying a list of the 25 Data Elements you chose, and Step 1 highlighted. Fill in the data requested and click on the next button. (You can use "Penn State" for Organization and "Course Project" for how you will be using the information.)
  21. In the Step 2 window, you should see a dropdown slot, entitled Choose a Data Export Format:.
  22. Make certain that the slot contains Comma-delimited Text file.
  23. Click the Create Export File button.
  24. Follow the instructions given in order to download the .txt file that was created for you. (Depending upon which web browser you are using, the instruction you choose when you right-click on the file name may say, Save Link Target As...) Direct the .txt file to your Lesson4 directory. (Don't worry about the "necessary shapefiles," we have already downloaded them.)
  25. If you are able, open the .xlsx file with Excel and review the contents. You should see 26 fields of information. The first field contains the encoded designation for the Census tract. What is its name? _______________. The other 25 fields are the burglary counts and rates for each of the 12 years, and the population of each tract in 2000.
  26. Leave the burglaries.xlsx file open.

B. Download additional data

Lesson4.zip (1.2 Mb) - For a review of the download/extraction steps, see Lesson 2, part 1.

The Lesson4.zip file contains a set of files for Philadelphia hydrology and state-maintained roads in Philadelphia. They are provided primarily for reference and for understanding the layout of the city.

C. Let's view the spatial data via ArcCatalog

  1. Open ArcCatalog. We will preview the spatial data here before we begin to work in ArcMap.
  2. With the right-side pane set to Preview (instead of the default Contents), view the boundary shapefile. If necessary, select Geography in the Preview: dropdown slot at the bottom of the display area. Note the detail along the south-eastern portion of the polygon. That is the waterfront along the Delaware River.
  3. Next, preview the tracts2000 shapefile. Switch the Preview: setting to Table in order to find out how many Census tract polygons there are in this data layer.

    Eventually you will be joining the burglary data to the Census tracts. Recall the field header name of the Census tract codes that you saw in the burglaries.xlsx file. (If you have been working through from the beginning of this part of the lesson, that file should still be open.) Which of the fields in the attribute table of the tracts2000 shapefile has contents that match those code numbers? Make note of the field name, for use later: _________________.

  4. Switch the Preview: setting back to Geography.
  5. Now, click on the padot_stateroads-philadelphia_2004 shapefile.

    As you do so, note the difference in shape between the Census tract layer and this roads layer. You can toggle back and forth between the two in order to get a good feel for the difference.
  6. Via the Properties window (double-click to get to the properties) for the padot_stateroads-philadelphia_2004 shapefile, look at this layer's Coordinate System property. You should see that the XY coordinates for this layer are in GCS_North_American_1983. In other words, longitude-latitude in units of decimal degrees, based on the NAD83 datum. This is borne out by viewing the X/Y Domain extent values and comparing them to the approximate longitude and latitude of Philadelphia.
  7. Of the four spatial datasets, only the roads appear to have a different coordinate system. Is this really the case? Look at the Properties for the other shapefiles and note each of their coordinate systems. (Note, too, the Linear Unit values.)

    Boundary.shp ______________________________

    Hydrology.shp ______________________________

    Tracts2000.shp ______________________________

D. Add the spatial and tabular data to an ArcMap session

  1. Open a new ArcMap session, and save it as lesson4.
  2. Add the 4 spatial layers to the ArcMap session: the (1) boundary, (2) hydrology, (3) padot_stateroads-philadelphia_2004, and (4) the tracts2000 shapefiles.



    Arrange the layers with the roads on top, then hydrology, tracts2000, and boundary on the bottom.

    What coordinate system has been assigned to the Data Frame?

  3. If necessary, set the coordinate system of the data frame to that of the boundary (or tracts2000) layer. We want to be doing our GIS in projected coordinates, right? (In Lesson 7 we will investigate reasons for this).
  4. Now, add the burglaries.xlsx file to your ArcMap session.

E. Join the tabular attributes to a spatial layer

  1. Based on what you observed above when you viewed the attributes of the spatial data you are working with, join the contents of the burglaries.xlsx file to the tracts2000 shapefile dataset.

    Hint: you need to know the names of the fields in the burglaries table and in the shapefile that contain data that the two tables have in common. (A conceptual sketch of how a join works appears just after this list of steps.)

  2. Review the attribute table of the tracts2000 shapefile. If, after performing the join, you see columns of null values in the attribute table of the tracts2000 layer, you did not choose the correct columns on which to base the join. Perform the join again if necessary.
  3. Save the map document.
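
If the idea of an attribute join is new to you, here is the concept in miniature. This is a sketch in plain Python with invented records; "TRACTID" is a hypothetical stand-in for whichever tract-code field names you noted above, not the actual field names. ArcMap matches rows on the common key field and appends the table's columns to the layer's attribute table.

```python
# Hypothetical attribute rows of the tracts2000 shapefile
tracts = [
    {"TRACTID": "000100", "geometry": "..."},
    {"TRACTID": "000200", "geometry": "..."},
]

# Invented burglary-table rows, keyed by the same tract code
burglaries = {
    "000100": {"P500_ALL_2009": 42, "CS_POPN_2000_2000": 3150},
    "000200": {"P500_ALL_2009": 17, "CS_POPN_2000_2000": 1280},
}

# A join looks up each tract's code in the table and copies the
# matching attributes onto that tract's record.
for tract in tracts:
    tract.update(burglaries.get(tract["TRACTID"], {}))

print(tracts[0])
```

If a tract code has no match in the table (or the wrong fields are used as the key), the joined columns come out empty, which is exactly the null-value symptom described in step 2 above.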

That is it for Part I

You have just completed Part I of this project.

Part II: Data Classification in ArcGIS

Now, let's apply these concepts in ArcGIS. Data classification and symbolization are controlled as Properties of a given data layer. In this part of the lesson you will classify the burglary data using different techniques and then compare and contrast the results.

A. First, some basic symbolization

The roads and hydrology data are included in this exercise to provide some geographic context to the pattern analysis you will do in Lesson 5. You don't necessarily have to include these layers in your screen captures for this lesson, but you may find that they help give a clearer picture of why areas have high or low crime statistics.

  1. Open your Lesson4.mxd document.
  2. Symbolize the padot_stateroads-philadelphia_2004 layer by TRAF_RT_NO.

    Give the Interstates an Expressway symbol, the US highways a Highway Ramp symbol, and PA routes a 1pt black line.
  3. Symbolize the Hydrology layer with the default Lakes symbol from the Symbol Selector.
  4. Arrange the layers with roads on top, followed by hydrology and then the census tracts.
  5. Open the Properties for the Tracts2000 layer. Click on the symbology tab.



    So far in this course we have often symbolized data using the options under the Categories field. Here and in Lesson 5 you will explore the Quantities symbolization options.
  6. In the Show: field, choose Quantities > Graduated colors. This is the option for creating choropleth maps.



    Notice there are two drop-down menus in the Fields area: Value: and Normalization:. Remember from the section on choropleth maps in the Concept Gallery that choropleth maps best present data as rates or ratios. We create a rate or ratio by normalizing a raw count by some other value. For example, population density is a ratio created by dividing the number of people by a unit of area.
  7. Scroll down the Value: drop-down menu and choose P500_ALL_2009.



    Notice that ArcMap will map this raw count. There are instances when a simple map of occurrences may be useful, but a Graduated Colors symbol scheme (choropleth map) is generally not the right choice for mapping counts. We will focus more on mapping counts in Lesson 5.
  8. For now, let's keep a version of the Philadelphia map just showing the counts, so we can compare it to maps with normalized data.
  9. Click OK to dismiss the Layer Properties window.
  10. In the Table of Contents, click on the heading for the Tracts2000 layer, and change it to 2009 Burglary Counts.

B. Normalizing Data and Using an Equal Interval Classification

As mentioned above, and in the Lesson 4 Concept Gallery, enumeration units in choropleth maps rarely represent equal populations or equal areas. This means that if we are counting a phenomenon that relates to people, there will almost always be a higher incidence of that phenomenon where there are more people. A map of crime counts would therefore likely just show you where more people live, not where crime rates are higher. So let's not map the burglary data as counts; instead, let's create a crime rate from the counts we have.
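
To make the arithmetic concrete before you click through the dialogs, here is a minimal sketch of the normalization we are after (plain Python with invented tract values - not the lesson data, and not anything ArcMap runs internally): divide the count field by the population field and express the result per 1,000 people.

```python
# Hypothetical tract records: (tract_id, burglary_count_2009, population_2000)
tracts = [
    ("000100", 42, 3150),
    ("000200", 17, 1280),
    ("000300", 55, 4920),
]

for tract_id, burglaries, population in tracts:
    # Normalizing the raw count by population gives burglaries per person...
    rate_per_person = burglaries / population
    # ...which reads more easily when expressed per 1,000 residents.
    rate_per_1000 = rate_per_person * 1000
    print(f"Tract {tract_id}: {rate_per_1000:.1f} burglaries per 1,000 people")
```

Note that the tract with the most burglaries is not necessarily the tract with the highest rate; that is exactly why we normalize.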

  1. In the Table of Contents, copy the 2009 Burglary Counts layer (by right-clicking and selecting copy), and then paste it into the Data Frame to make a second map.
  2. Rename the new layer Equal Interval (since we eventually will classify the data in this layer in equal intervals).
  3. Open the Properties for the Equal Interval layer. Click on the symbology tab.



    As the name suggests, an Equal Interval classification divides the range of data values into classes of equal width. Because this classification considers only the class ranges, not how the observations are distributed within them, it is possible to have classes that contain no data points.
  4. Look at the choices under the Normalization: drop-down menu. What field is an appropriate choice? Choose AREA from the list.



    Look at the data categories that are generated by your choices so far. Do the Range values make any sense? Normalizing by area creates a ratio of burglaries per unit area (Do you remember what the Map Display units are?). The current settings create a ratio of burglaries per square foot - not very useful information for us here.
  5. Change the Normalization: value to CS_POPN_2000_2000, the population value from the 2000 census.



    This will give us the number of burglaries per person. Because rates per person are so low, rates are often expressed per 1,000 or per 100,000 people, depending on the frequency of the phenomenon. We will come back to this and change our labels to represent a per-1,000 rate in Step 23.
  6. To the right of the Fields area is the Classification control, which summarizes the current classification scheme and number of classes. Click the Classify... button.
  7. You will see in the histogram that there are clearly some outliers in the data. Most of the data are not even visible because they fall below 0.1 burglaries per person. Can you figure out what causes these extreme outliers that appear once we normalize the data? Look at the counts for the tracts with the extreme outliers. How many burglaries do they have? How many people lived there in 2000?



     In this case, with counts spread across enumeration units of widely varying populations, the extreme outliers in the rates come from "small numbers," often called the "small numbers problem." (A short numerical sketch of this appears just after this list of steps.)



    Statisticians have certain guidelines they use to eliminate small population areas or smooth them over statistically. But it is also important as a cartographer - visually communicating data to others - to be able to deal with these issues. Otherwise, your maps may show unreliable information. Let's investigate the small numbers problem a bit using a scatter plot to see what census tracts should be excluded from our classifications.
  8. Click OK to dismiss the Classification and Layer Properties windows. At this point it does not matter which classification method you have chosen. You will see that the data are heavily skewed and most of the census tracts end up in the lowest classes (for nearly every classification method other than Quantile). You may also see that some of the census tracts are empty (or show parts of another map underneath, if you have another one turned on). This is because those tracts have zero population and therefore came out null when we normalized the crime counts with the population figure.
  9. Under the View menu, select Graphs > Create...
  10. Under Graph Type: choose Scatter Plot.
  11. Make sure Equal Interval is the layer being used.
  12. Select P500_ALL_RT_2009 as the variable for the Y field (notice the "RT" in the variable name I want you to select). This is the rate of burglaries per 1,000 people in 2009 - essentially the same variable we created by dividing the count by the population. (You can map this variable, without normalizing it, and see that it looks the same as the variable we created. It, too, has the same small numbers issues.)
  13. Select P500_ALL_2009 as the variable for the X field. Stretch the Graph Wizard Window horizontally as much as possible.



    We are looking at the 2009 burglary rate as a function of the 2009 burglary count. You should be able to see that the highest rates actually have low counts. This is indicative of a small numbers problem. Those census tracts must have low populations. Let's look at the rate as a function of the population.
  14. Select CS_POPN_2000_2000 as the variable for the X field (keeping P500_ALL_RT_2009 as the variable for the Y field).



    This should confirm that all those tracts with really high rates have very low populations. Because there are not many controls for the scatter plot in ArcMap (e.g., limiting the ranges or zooming in on parts of the x or y axes), we cannot see from this tool where the bulk of the rates actually sits, or what population count we should use as a cut-off to stabilize the rates. You could use another application if you are interested in doing this. But for the sake of staying in one application, I did the legwork and suggest excluding all tracts with populations of less than 700 people. I came up with this figure by looking at the rates and counts over time and seeing how they changed compared to census tracts with higher populations (you can do this too, using the attribute table and sorting and selecting systematically). I also looked at some of the crime rate maps produced by crimeBase (the source of our data) and was able to see the tracts that they excluded or portrayed as parks in the city.
  15. Click Cancel to get out of the Graph Wizard since we were just using it to observe the small numbers problem.
  16. In the symbology tab of the layer properties for your Equal Interval layer, first confirm that you are mapping the 2009 burglary counts, P500_ALL_2009, and normalizing with the CS_POPN_2000_2000 variable, then click the Classify... button.
  17. Click the Exclusion... button.
  18. Double click the "CS_POPN_2000_2000" variable in the  exclude clause: window so the variable shows up in the bottom window. Click  the less than symbol ("<"), and then type in 700.
  19. Go to the Legend tab of the Data Exclusion Properties window. Place a checkmark in Show symbol for excluded data, and then click the symbol button to choose an appropriate symbol to show which census tracts have sparse data. Also type in "Sparse data" (or another label of your choosing) to appear in the TOC and legend for the census tracts that are excluded from the classification.
  20. Click OK to execute the data exclusion.



    You should now see that the histogram in the Classification window shows much more of the data, although the distribution is still heavily skewed, with most values bunched at the low end.
  21. Click the Classification Method: drop-down menu and choose Equal Interval. Use five classes.



    Notice that the dynamic histogram below updates to show the class break points and their data values. You can manually reposition the break points, but, for now, leave them as is.
  22. Click OK to apply these changes and dismiss the Classification window.
  23. Still in the Symbology tab of the layer properties, under the Label column, click on the top class. The text there should then become editable. Change the values so the rates read as per 1,000 people rather than per person. To do this, just move the decimal point three places to the right. For instance, if the label says "0.000000000 - 0.024725823", change it to "0.0 - 24.73", rounding to a value you feel is appropriate.



    These will be reflected in the TOC and any legend made from the legend wizard.
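
As an aside, the exclusion threshold is easier to appreciate with a toy example. The sketch below (plain Python, with invented numbers rather than the Philadelphia data) shows how a tract with only a handful of residents produces an unstable, extreme rate, and how a population cut-off like the 700 used above flags it as sparse data.

```python
# Hypothetical tracts: (tract_id, burglary_count, population)
tracts = [
    ("000100", 48, 4800),  # a typical residential tract
    ("980000", 3, 45),     # a park-like tract with almost no residents
]

POPULATION_CUTOFF = 700  # same threshold used in the exclusion expression above

for tract_id, burglaries, population in tracts:
    rate_per_1000 = burglaries / population * 1000
    if population < POPULATION_CUTOFF:
        # One or two additional incidents would wildly change this rate,
        # so the tract is excluded from classification and shown as sparse data.
        print(f"Tract {tract_id}: {rate_per_1000:.0f} per 1,000 -- excluded (sparse data)")
    else:
        print(f"Tract {tract_id}: {rate_per_1000:.0f} per 1,000")
```

The nearly empty tract reports a rate more than six times higher than the ordinary tract on the strength of just three incidents; that instability is the small numbers problem.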

C. Quantile (or Percentile) Classification

Rather than separating classes by set value intervals, the quantile classification creates classes with equal numbers of data points in each class. Because it dictates a certain number of observations per class, a quantile classification can sometimes create classes that span a very wide range of data values. Data values and class ranges aside, this method produces maps with an apparent balance - that is to say, each class contains the same number of enumeration units, so every class is equally represented on the map.
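
If you would like to see the difference between the two methods numerically, here is a small sketch (plain Python, using an invented right-skewed sample rather than the burglary rates; ArcMap's exact break placement may differ slightly) that computes five-class breaks both ways.

```python
# An invented, right-skewed sample of rates (per 1,000 people)
rates = [2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 10, 12, 14, 18, 25, 31, 40, 55, 70, 95]
n_classes = 5

# Equal Interval: split the data *range* into classes of equal width.
lo, hi = min(rates), max(rates)
width = (hi - lo) / n_classes
equal_interval_breaks = [lo + width * i for i in range(1, n_classes + 1)]

# Quantile: split the *observations* so each class holds the same number of them
# (a simple approximation of quantile breaks).
sorted_rates = sorted(rates)
per_class = len(sorted_rates) / n_classes
quantile_breaks = [sorted_rates[int(per_class * i) - 1] for i in range(1, n_classes + 1)]

print("Equal interval breaks:", [round(b, 1) for b in equal_interval_breaks])
print("Quantile breaks:      ", quantile_breaks)
```

With a skewed sample like this, the equal-interval scheme leaves most observations in the lowest one or two classes, while the quantile scheme puts exactly four observations in each class - which is why the two maps you just made can look so different.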

  1. In the Table of Contents, copy the Equal Interval layer and paste it into the Data Frame.
  2. Rename the duplicate Equal Interval layer as Quantile.
  3. Open the Properties for the new Quantile layer. Leave the Value and Normalization fields as they are, and click the Classify... button.
  4. Leaving the exclusion as is, choose Quantile from the Method drop-down list and specify 5 classes. Click OK to dismiss the Classification and Layer Properties windows.



    Take a moment and compare the two classified layers. Same data, different classification method. There is quite a difference, wouldn't you agree?
  5. Copy either the Equal Interval or Quantile layer and paste it into the data frame two more times for two more maps using different classification methods.

D. Other Classification Methods

ArcGIS includes several other classification methods, all organized in the same manner as Equal Interval and Quantile. Two of the common methods are Natural Breaks (Jenks) and Standard Deviations. Classification by natural breaks uses an optimization that places class breaks where natural groupings exist in the data, maximizing the differences between classes (and minimizing the variation within them). In a standard deviation classification, class breaks are placed at set intervals of the standard deviation above and below the mean of the data values. By default, ArcGIS will use a diverging color scheme to visually emphasize the idea of classes varying from a central mean, and will label the classes only in terms of standard deviations (whether or not this is useful for visually communicating your data).
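
As a rough illustration of the standard deviation method, here is a sketch in plain Python with invented values. It shows the general idea only - it is not how ArcMap computes its defaults, and Natural Breaks requires an optimization routine that is not reproduced here.

```python
from statistics import mean, stdev

# An invented sample of rates (per 1,000 people)
rates = [2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 10, 12, 14, 18, 25, 31, 40, 55, 70, 95]

m, s = mean(rates), stdev(rates)

# Place class breaks at one-standard-deviation intervals centered on the mean:
# ..., mean - 0.5s, mean + 0.5s, mean + 1.5s, ...
# (A break below the data minimum just means the lowest class starts at the minimum.)
offsets = [-1.5, -0.5, 0.5, 1.5]
breaks = [m + k * s for k in offsets]

print(f"mean = {m:.1f}, standard deviation = {s:.1f}")
print("class breaks:", [round(b, 1) for b in breaks])
```

Because the classes are defined relative to the mean, the default labels describe deviations from the mean rather than data values, which is why the deliverables notes suggest relabeling them for your legend.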

  1. Use the steps above to classify the two new layers by Natural Breaks and Standard Deviations. Feel free to experiment with any options within the Classification dialogs.

E. Save your map document.

 

Part III: Layer Files and Color Schemes

In Part II of this lesson you investigated classification methods using ArcGIS's tools, and you probably used default color schemes. I'd like to revisit (and expand on) some of the color topics we touched on briefly in Lessons 1 and 2. In addition, you will see how to create and use layer files in ArcGIS.

A. Working with Layer files

It is important to remember how ArcGIS stores information. As you know, an ArcMap (*.mxd) document does not store the underlying data compiled in your map. Instead it records a pathway (either full or relative) to the location of the data. Other map elements like a legend, north arrow, or scale bar are saved as part of the *.mxd. Similarly, all of the symbolization and classification changes you make are saved with the map document. You may have experienced the process of remaking a map - when you start over all of the steps taken to give your map a certain look have to be repeated. A Layer File is a way of saving your classification and symbolization choices as a stand-alone file that can be used on other maps.

  1. Open your lesson4.mxd map document.
  2. In the Table of Contents, right-click on your Equal Interval layer and choose Save As a Layer File. Save the new file to your Lesson 4 directory and call it Equal_Interval.lyr.

    Layer files can also be created from groups of map layers.

  3. In the Table of Contents, Ctrl-click on your padot_stateroads-philadelphia_2004 and hydrology layers. With both selected, right-click on either and choose Group. Notice that the two individual layers now fall under the heading New Group Layer. Change this name to Roads and Rivers.
  4. As you did in step 2, save Roads and Rivers as a layer file in your lesson 4 directory. Call it roads_rivers.lyr.

    Let's take a moment and look at the results in ArcCatalog.

  5. Open ArcCatalog and navigate to your Lesson 4 directory. Click the Preview tab.
  6. First, look at the previews of the tracts2000, padot_stateroads-philadelphia_2004, and hydrology shapefiles. Notice that polygons are filled with yellow and lines are displayed in blue.

    Layer files are given yellow, diamond-shaped icons.

  7. Now click on the previews of the two new layer files. Do they look as they did on your map?

    It's likely that your roads_rivers.lyr appears unprojected - like the roads shapefile. This is because of the way in which you grouped and saved the *.lyr file. The resultant group inherited coordinate system information from the roads because they were arranged above the rivers in the table of contents. Let's have a closer look at the *.lyr files.

  8. Open the Properties for the Equal_Interval.lyr file. You should notice that the Properties window looks just like one from ArcMap and NOT like an ArcCatalog Properties window for a shapefile.

    Like a *.mxd file, a layer file does not store the actual data - only the appearance information. This means that delivering a layer file to a colleague is only useful if you both have access to the same datasets.

  9. Open the Properties for the roads_rivers.lyr file. Because this is a grouped layer file the information is organized a little differently.
  10. Click on the Group tab. Choose hydrology from the Layers list and then click the Properties button. The window should now look like what you saw for the Equal_Interval.lyr file.

You will add the two new layer files to a new map shortly, but first let's review some material about appropriate color schemes for different mapped data and constraints on color selection.

B. Color schemes for maps

Concept Gallery

Learn more about color schemes in the Concept Gallery.

In the Lesson 4 Concept Gallery, we discussed common classification methods and color schemes. Before you start creating any custom colors and ramps, it is worth taking a few minutes to think about these topics again. Below, you will find three sets of maps of different census data (from the 2000 Census). Each map is shown with nine different color schemes. After reviewing the alternatives, decide which option is the most appropriate for the mapped data. Click the Best Choice link in each caption to see results and comments.

The first set of nine maps (Figures 4.1.a through 4.1.i), each of which uses a unique color scheme, depicts the percentage of people under age 18 identifying themselves as two or more races. The data are aggregated to counties and classified identically in each example. You may click on each individual map to see an enlarged version of that map.

 

(Nine map thumbnails, Figures 4.1.a through 4.1.i, each showing the same choropleth map of this variable in a different color scheme.)

Figure 4.1.a through 4.1.i Percentage of people under 18 identifying themselves as two or more races, represented with nine different color schemes.

Which one did you choose? How did you come to your decision? View the best choice.

 

Best choice: Figure 4.1.f

A choropleth map (f) showing the percentage of people under 18 in a section of the northeastern U.S. who identify themselves as two or more races.

Rates suggest a sequential scheme like Figures 4.1.c, 4.1.f, or 4.1.i. Figure 4.1.c is a poor choice because there is a big lightness jump between the two darkest colors and the lightest three. Figure 4.1.i is a poor choice because of the big saturation difference in its second color.

The second set of nine maps (Figures 4.2.a through 4.2.i), each using a unique color scheme, presents the percent change in population from 1990 to 2000. The data are again aggregated by county. The U.S. rate of change was 13.2%. You may click on each individual map to see an enlarged version of that map.

 

(Nine map thumbnails, Figures 4.2.a through 4.2.i, each showing the same choropleth map of percent population change in a different color scheme.)

Figure 4.2.a through 4.2.i Percent change in population from 1990 to 2000, represented with nine different color schemes.

Which one did you choose? How did you come to your decision? View the best choice.

 

Best choice: Figure 4.2.f

A choropleth map (f) showing the percent change in population from 1990-2000 in a section of the northeastern U.S.

The important consideration with these data is the national rate of change (13.2%). Colors representing gains and losses should diverge from this value. Figure 4.2.c uses a diverging scheme but is a poor choice because the point where the hue changes does not correspond to the U.S. rate. Figure 4.2.i is also an unwise choice because the hue choices don't make sense (even though they roughly form diverging lightness ramps).

The third set of nine maps (Figures 4.3.a through 4.3.i) shows religious affiliation. Counties are classified by the denomination with the highest percentage of religious adherents. [Additional source: Gaustad and Barlow, 2001, New Historical Atlas of Religion in America. Oxford University Press, New York.] You may click on each individual map to see an enlarged version of that map.

 

(Nine map thumbnails, Figures 4.3.a through 4.3.i, each showing the same map of religious affiliation in a different color scheme.)

Figure 4.3.a through 4.3.i Religious affiliation, represented with nine different color schemes.

Which one did you choose? How did you come to your decision? View the best choice.

 

Best choice: Figure 4.3.e

A choropleth map (e) showing religious affiliation in a section of the northeastern U.S.

Religious affiliation is not a rate, and it does not diverge from a baseline or central value. The data classes are qualitative, so the hues need only distinguish the categories; they should not imply any order or magnitude. Figure 4.3.b is not a bad choice, but the two Catholic classes should be represented with the same hue. At first glance, Figure 4.3.h appears promising, but it uses light and dark hue pairs that have no consistent relationship with the data.

C. Constraints on color selection

Most of us take color for granted. We see the world in vivid hues and with subtle variations. As map designers, we also need to be cognizant of those occasions when maps need to be read without color - by choice or because of color blindness. Most people who are colorblind are still able to distinguish differences in lightness and see many hues. Color confusion tends to be exacerbated when desaturated colors are used.

The nature of colorblindness has been extensively researched as a matter of physiology and perception, and it has been modeled in numerous color spaces. While the full list of reliably distinguishable color combinations is long (especially when described as luminosity measurements), we can generalize a list of ten color-pair combinations that are clearly distinguishable to people with common color vision impairments.

  • red-blue
  • red-purple
  • orange-blue
  • orange-purple
  • brown-blue
  • brown-purple
  • yellow-blue
  • yellow-purple
  • yellow-gray
  • blue-gray

To visualize how these hue-pairs were determined, imagine a 3D cube (see Figure 4.4, below). Once the cube is flattened, we can arrange the hues in spectral order around the perimeter and create lightness variation by placing white in the center.

A 3-D CMY color cube.
Figure 4.4 A CMY color cube with the white corner facing forward. In this view, hues are arranged in spectral order around the edge of the cube.
Figure by Mark Wherley

Using this arrangement, we can create regions of colors that are indistinguishable by drawing colorblind confusion lines through the space. The lines drawn on the figure below were approximated based on the CIExyY color space.

A flattened color cube demarcated to show colorblind confusion regions.
Figure 4.5 The bold lines demarcate colorblind confusion regions. Click on the image to see the color specifications as percentages of CMY.
Figure by Cynthia Brewer

Colors within the same or neighboring regions that share similar lightness will be confused. A good rule of thumb is to choose colors that vary in lightness and that are separated by at least one region. In Figure 4.6, below, a diverging color scheme is created by choosing two three-color lightness ramps from regions that are appropriately distant.

A diverging color scheme created from the colorblind confusion diagram.
Figure 4.6 Use of the colorblind confusion diagram to create a diverging color scheme. First, choose three colors (light yellow, medium orange, and red - avoiding greens) that vary in lightness. Move two regions over and choose another set of three colors (light blue, medium blue, and purple). When organized, these six colors form a colorblind-safe scheme.
Figure by Cynthia Brewer

In the previous example, we discussed color choices in terms of diverging color schemes, but all types of color schemes can be modified to be colorblind safe. The basic task is to choose colors that vary distinctly in lightness. An easy test for color stability is to use a black-and-white photocopier: if your map uses hues or lightness specifications that are too similar, the resulting photocopy will appear as a uniform gray mess.
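
If you want a quick numeric version of the photocopy test, the sketch below (plain Python, using a standard luma approximation; the RGB values are a hypothetical five-class sequential ramp, not colors taken from ArcMap) converts colors to an approximate gray value. Classes whose gray values are close together will merge on a black-and-white copy and will likely also be hard for colorblind readers to separate.

```python
def gray_value(r, g, b):
    """Approximate perceived lightness of an RGB color (Rec. 601 luma weights)."""
    return 0.299 * r + 0.587 * g + 0.114 * b

# A hypothetical five-class sequential ramp (RGB, 0-255)
ramp = {
    "lightest": (255, 255, 204),
    "light":    (194, 230, 153),
    "medium":   (120, 198, 121),
    "dark":     (49, 163, 84),
    "darkest":  (0, 104, 55),
}

for name, rgb in ramp.items():
    print(f"{name:9s} -> gray {gray_value(*rgb):.0f}")
# Well-separated gray values (here roughly 249, 210, 166, 120, 67)
# indicate the ramp should survive a grayscale photocopy.
```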

D. Creating and managing custom colors and color ramps

Let's get back to ArcGIS. For the rest of this part, you will use ArcMap to specify and save custom colors and color ramps. To date, all of the custom colors you have created have been stored only in the map document in which you created them. If you specified a great shade of blue for use in one map, it is not available in the next unless you repeat the specification. In this step, you will learn to save colors and color ramps so they are independent of a specific map document.
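
Before you dive into the steps, it may help to see what an "algorithmic" ramp does conceptually. The sketch below (plain Python, a simple linear blend in RGB space, which is only a rough stand-in for the algorithm choices the Style Manager actually offers) interpolates between two endpoint colors to build a five-step ramp, much as you will do interactively in steps 16-19.

```python
def interpolate_ramp(color1, color2, steps):
    """Linearly blend two RGB colors into a list of `steps` colors."""
    ramp = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at color1, 1.0 at color2
        ramp.append(tuple(round(c1 + (c2 - c1) * t) for c1, c2 in zip(color1, color2)))
    return ramp

# Hypothetical endpoints: a light pale green blending into a dark blue
light_green = (220, 245, 220)
dark_blue = (20, 40, 120)

for rgb in interpolate_ramp(light_green, dark_blue, steps=5):
    print(rgb)
```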

  1. In ArcMap, open a new map document. Save it as color_schemes.mxd in your Lesson4 directory.
  2. Add the Equal_Interval.lyr file to your map.
  3. Add the roads_rivers.lyr file. In the Table of Contents, right-click Roads and Rivers and choose Ungroup.

    Let's create some custom colors for the road features.

  4. In the Table of Contents, click on the Interstate symbol. This will open the Symbol Selector window. Click the Edit Symbol... button to open the Symbol Property Editor window.
  5. The default interstate symbol is composed of three elements - a heavy black line, a medium yellow line, and a thin black line. In the Layers field, select the thin black line with a single click, and then click the stylized X button to delete this element.
  6. Select the yellow line and change the Color to a light orange and the Width to 2.0.
  7. Select the thick black line and change the Width to 4.0. Click OK to dismiss the editor window.
  8. Back in the Symbol Selector window, click the Save As... button.
  9. In the Item Properties window, name the new road symbol my_Interstates and leave the Category field blank. Click Finish.

    Notice that the new symbol is now listed in the symbol menu (it should be right at the top). Included with ArcMap are a wide variety of purpose- and industry-specific symbol sets. By saving your interstate symbol in the last step, you have added to these. By default most are turned off when ArcMap launches.

  10. Click OK to apply the my_Interstates symbol and then click the US routes symbol. In the Symbol Selector window click the Style References... button.

    Your custom symbols are stored under the user ID you use to log onto your computer. Notice that this set and the Esri set are checked (by default).

  11. Take a few minutes and turn some of the other symbol sets on and off. The new symbol sets will appear in the scrolling symbol list.

    Symbol sets vary for point, line, and polygon features. Because you started this process by selecting a line feature, you will only see line symbols.

  12. Check the Transportation symbol set to make it active. Scroll down the symbols list and choose A23 for your US Routes.

    The alternative to choosing, customizing, and saving a symbol is to create one from scratch using the Style Manager. The Style Manager is a tool window that gives you access to, and control over, the look of predefined and custom symbols and elements. Any custom symbol you create is stored within your Windows profile in a file called <username>.style.

  13. Click OK to apply and dismiss the Symbol Selector windows.
  14. On the Main toolbar, click Customize > Style Manager...

    On the left-hand side is a list of the active symbol sets. Notice that yours is listed first (described as C:/Documents and Settings/... /<username>.style), followed by Esri and then any others you activated in the previous steps.

  15. Expand your folder by clicking on the + sign next to the folder. Symbols are organized into subtypes. Sub folders with white icons are empty. You should see a yellow folder next to Line Symbols. Click on the folder on the left hand side to see your my_Interstates symbol on the right.
  16. Click on the Color Ramps folder on the left. Then in the right hand pane, right-click in open space and choose New > Algorithmic Color Ramp...
  17. In the Color Ramp window that opens, choose a light pale green for Color 1. Click the Color 2 radio button and pick a dark blue.
  18. Experiment with the different Algorithm choices and the Black and White sliders until you are happy with the ramp.
  19. Click OK and then name the new ramp A Ramp.

    Before you leave the Style Manager, let's create another custom ramp. This one will be a colorblind safe, multi-part, algorithmic, diverging scheme. Can you guess how to make it?

  20. Right-click in the right pane and choose New > Multi-part Color Ramp. Click the Add button twice, choosing Algorithmic Color Ramps each time. Each portion of a multi-part ramp has its own properties.
  21. Click on the top ramp and then the Properties button. Choose a Dark Brown and White for Colors 1 and 2, respectively.
  22. Repeat this process for the second ramp using a Dark Bluish Purple (or Dark Blue) and White. Does white appear in the center of the diverging ramp? If not, you will need to swap Colors 1 and 2 in the second ramp to fix the orientation. Save the new ramp as AA Ramp. Close the Style Manager.
  23. Open the Properties for the Equal_Interval layer. Look at the choices under the Color Ramp drop-down menu (Symbology Tab > Quantities > Graduated Colors). The color ramps are organized alphabetically (even though the names are not listed), so your ramps should be the first two in the list.

    Because you are applying a continuous color ramp to classed data, the legend will appear as distinct steps. If you were using your ramp on unclassed data (like elevations), you would see a smooth transition along the ramp.

  24. Make a duplicate of the Equal_Interval layer. Give the new layer a unique name.
  25. Of the classification methods you have used in this lesson, which one is commonly symbolized with a diverging color scheme? ______________________________
  26. Re-classify the new layer with this classification method, and use your AA Ramp as the color scheme.

E. Save the map document

That is it for the walk-through instructions for Lesson 4. See the deliverables page, next, to see what is due as the assignment for this lesson.

Next week, using the same data, you will learn about map representations other than the choropleth map.

If you have any questions, please post them to the Lesson 4 Discussion Forum.

 

Lesson 4 Deliverables

There are four deliverables for this lesson.

  1. Submit a proposal for your course capstone project to the Capstone Proposal Submissions discussion forum in ANGEL.



    Info on the capstone project is available here: https://www.e-education.psu.edu/geog486/l9.html, and info specifically about the proposal is here: https://www.e-education.psu.edu/geog486/l9_p3.html.
  2. Complete the Lesson 4 quiz.
  3. Create a map layout with a series of four maps (all on the one layout) comparing the four classification methods we covered: equal interval, quantile, natural breaks and standard deviation, using the same data (i.e. for the same year) in each map. Design the map to best illustrate to the map reader how different classifications can drastically alter the look of the information being communicated.
  4. Create a second map layout with a series of four maps (all on the second layout) comparing four different years of burglary data. Design the map to best illustrate change (or lack thereof) in burglaries across the four years you use. I recommend using the same classes (i.e., the same numerical class breaks) in each map to aid comparison for the map reader. For example, see the Unemployment map at this link: it uses the same classes (even the same legend) for all the years/data frames of data, even though the range and distribution of the data across that range change from year to year.

Both map layouts should:

  • include legends and headings as appropriate
  • include a title (describing purpose), scale bar and north arrow as you see appropriate
  • include color schemes that aid in the communication of the purpose of the map
  • be on 8.5" x 11" pages (oriented as either landscape or portrait)

Notes:

  • You do not need to include the street and stream data. Those layers are included for reference and to demonstrate the bits of the lesson that have to do with the coordinate system of the layout and layer files. If you would like to include them you can, but make sure they don't interfere with the reading of the map (e.g. local roads looking like census tract divisions).
  • Maps using the standard deviation classification method are labeled with standard deviation groupings by default. You can change the labels (or append to them) so they show the actual data-value break points, to fit with the other classification methods in the legend. This is done in the Symbology tab.

When your maps are complete, export them as PDFs and save them with the names "LASTnameFirstinitial_L4_map1.pdf" and "LASTnameFirstinitial_L4_map2.pdf"  (using your name of course) and submit via the ANGEL Lesson 4 Drop Box in the Lesson 4 folder.

Concept Gallery

Concepts Covered:

Surfaces

Together with the visual variables (refer back to the Symbolization concept gallery item from Lesson 2), one of the most important choices you will make in designing a thematic map is what type of representation you would like to use for your data. In this course, we will focus on and discuss four common types of maps: choropleth (here in Lesson 4), graduated/proportional symbol (in Lesson 5), dot density (in Lesson 5), and isoline maps (in Lesson 6).

When cartographers create thematic maps, one general goal they often have is to try to help the map reader understand the character of the spatial distribution of the attribute(s) displayed in the map. One useful way of talking about spatial distributions is to use the concept of a cartographic data model, first developed by George Jenks, a professor at the University of Kansas (Jenks, 1967). As Jenks defined it, a cartographic data model is an abstract method for representing the most important characteristics of a particular spatial distribution; this representation could be either mathematical or graphical. Other cartographers further developed this notion by creating a typology of how map types can be related to data models (MacEachren and DiBiase, 1991). They identified two important axes along which the spatial distribution of a variable can vary: from discrete to continuous and from abruptly changing to smoothly changing (see Figure 4.cg.1, below). They also matched the visual characteristics of these data models to the visual characteristics of different map types (see Figure 4.cg.2, further below; you may recognize this matching as an exercise in creating map-signs that best match their real world referents - remember our discussion of semiotics in Lesson 1, Part II: Visual Communication).

In the figure below, one of the axes shows a range from discrete to continuous. Discrete phenomena are those that have space between observations (e.g., mobile phone towers), while continuous phenomena exist throughout space (e.g., temperature - it exists everywhere even if we do not choose to measure at every possible location). The other axis relates to the degree of spatial dependence of a phenomenon. A phenomenon with a low degree of spatial dependence may change abruptly over a short distance (e.g., income tax rates between states), while phenomena with a high degree of spatial dependence change more smoothly. Elevation is generally a good example of a smoothly changing phenomenon, with the rare exceptions of canyons and cliffs, where there is a large, abrupt elevation change. One important point is that the character of a phenomenon may be scale-dependent (both spatially and temporally). For example, the distribution of cars may be generally considered to be abrupt: generally we do not find cars in locations that are not paved, and there is some amount of space between cars. However, at certain times of the day (e.g., rush hour), the distribution of cars (at least in particular locations, such as freeways) may become continuous.

Chart to show surface conceptualizations of geographic phenomena
Figure 4.cg.1 Different ways of conceptualizing distributions of geographic phenomena (i.e., data models) as they occur through space.
Credit: MacEachren, 1992

These different conceptualizations of geographic phenomena lend themselves to certain map types (or representations) better than others. For example, elevation is smooth and continuous and is therefore often represented with isolines, as shown in the lower right corner of Fig 4.cg.2 (and discussed more in Lesson 6). A tax rate, or most any kind of rate (e.g., mortality), is a value for an area (e.g., a county) and therefore is abrupt but still continuous. A choropleth map would then best represent rate data, as shown in the lower left corner of Fig 4.cg.2. What data model in Fig 4.cg.1 represents counts of people per area? And what kind of map type in Fig 4.cg.2 would work with that kind of geographic data? Depending on the scale of aggregation, you may consider such data to be discrete and abrupt (upper left corner of Fig 4.cg.1) and use a proportional circle for each unit of area, or, with small areas of aggregation (compared to your extent), you may think of your data as discrete but smooth (e.g., population per census tract when looking at a whole state), and then you may consider a dot density map (more on this in Lesson 5).

A chart of map types based on the conceptualizations shown in Figure 4.cg.1
Figure 4.cg.2 Corresponding map types that match the data models depicted in the figure above.
Credit: MacEachren, 1992

Recommended Readings

If you are interested in investigating this subject further, I recommend the following:

  • MacEachren, A. M. and D. W. DiBiase. 1991. "Animated maps of aggregate data: Conceptual and practical problems." Cartography and Geographic Information Systems, 18(4): 221-229.
  • MacEachren, Alan M. 1992. "Visualizing uncertain information." Cartographic Perspectives. 13: 10-19.

Choropleth Maps

Choropleth maps are probably the most commonly created type of map in GIS cartography. Their popularity is due to two main reasons: (1) choropleth mapping capabilities are implemented in nearly every GIS software package; and (2) much of the data that geographers and GIScientists work with is collected and aggregated into enumeration units, which form the basis for choropleth maps (see Figure 4.cg.3, below). Recall that choropleth maps give the map reader the impression that the phenomenon of interest is continuous (i.e., present throughout the areal unit) and abruptly changing (i.e., that the phenomenon is present at the same intensity throughout each areal unit, but changes abruptly at the area's borders).

The first choropleth map, a map of France, created in 1826 by C. Dupin.
Figure 4.cg.3 The first choropleth map, created by Charles Dupin in 1826 to depict literacy by department in France (Robinson 1982). Departments are one of the main administrative districts in France.
Credit: Friendly, 2004

An enumeration unit is an area defined for a particular purpose (often other than collecting data) and within which data are collected and aggregated. Some common examples of enumeration units include school districts (created to help manage the assignment of students to particular schools within a city or metropolitan area), counties (created as a form of local government) or census tracts (created to help manage the complicated task of counting the population). Typically, the boundaries of enumeration units do not correspond to breaks in the statistical surface of the data that are collected and aggregated to each unit (e.g., the population density does not suddenly change when we cross the border from Los Angeles county to Orange county in southern California). However, there are some cases where enumeration units do provide a good reflection of the structure in the statistical surface (e.g., in the case of income tax rates, which do change abruptly from state to state).

Choropleth maps typically use either differences in color value (sometimes in combination with hue) or differences in spacing (e.g., the intensity of a hatched pattern) to represent differences in the phenomenon being mapped (see Figure 4.cg.4, below). Generally, we use a darker or more closely spaced pattern to represent larger quantities of the phenomenon and a lighter or more sparsely spaced pattern to represent smaller quantities. One empirical study has shown that in most cases (especially with light map backgrounds), map readers do assume that "dark means more and light means less" (McGranaghan 1993).

Two different visual representations of the same data.
Figure 4.cg.4 Here, we map the same data variable with two different visual variables. Which variable do you think is more effective for showing varying intensities of motor vehicle accident mortality?

Although choropleth maps are quite easy to create, there are several issues that you should be aware of and consider when you are thinking about using a choropleth symbolization for representing the phenomenon you would like to map:

Enumeration Units

One important issue is that the size of enumeration units can be quite variable. This matters because larger units dominate the visual appearance of the map and can exaggerate the importance of particular enumeration units. If we choose to map raw counts with a choropleth map, counties such as San Bernardino County (the largest county by area in California) may dominate the map, as the county has a relatively large population along with a relatively large area. However, if what we are really interested in is where people are more likely to die as the result of a motor vehicle accident, we should be looking at rates: wherever there is a larger population, we would expect a larger number of deaths due to motor vehicle accidents, as is true of almost any count pertaining to people. For this reason, in choropleth maps we typically want to avoid mapping raw counts and instead transform the data so that we are mapping densities, rates, or ratios that allow us to make more realistic comparisons between unevenly sized units. This is not to say that there is never a good reason for mapping raw counts, just that other symbolization methods may be more appropriate for them (e.g., graduated or proportional symbols).
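
To make the count-versus-rate distinction concrete, here is a minimal sketch in Python. The county names and figures are invented for illustration; they are not the California data shown in the figures below.

    def rate_per_100k(count, population):
        """Convert a raw count into a rate per 100,000 residents."""
        return 100000 * count / population

    # Hypothetical counties: a large urban county and a small rural one.
    urban = {"deaths": 900, "population": 9000000}
    rural = {"deaths": 30, "population": 100000}

    print(rate_per_100k(urban["deaths"], urban["population"]))  # 10.0 deaths per 100,000
    print(rate_per_100k(rural["deaths"], rural["population"]))  # 30.0 deaths per 100,000

Mapped as raw counts, the urban county dominates (900 deaths versus 30); mapped as rates, the rural county stands out instead.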

A comparison of two maps to show that rates or ratios, instead of raw counts, allow for more realistic comparisons between unevenly sized units.
Figure 4.cg.5 The map at the left shows the count of motor vehicle deaths by county in California. As we would expect, the larger numbers of deaths occur in the more populous counties of the Los Angeles, San Francisco and San Diego metropolitan areas. The map of rates at the right shows a very different picture of risk of dying in a motor vehicle accident: the highest rates are in non-metropolitan California.

Modifiable Areal Unit Problem

Because of the arbitrary nature of the boundaries of most enumeration units, we can find ourselves facing the modifiable areal unit problem (MAUP) (Openshaw 1984). Simply put, MAUP arises when different aggregations of individual counts (i.e., drawing the boundaries of enumeration units in different ways) produce different spatial patterns (see Figure 4.cg.6, below). Although there is no 'solution' to MAUP, if we can find data at different scales that are aggregated to different units, we can create multiple maps that tell a more complete story about the distribution of the phenomenon we are working with. We might also choose to use other types of symbolization, e.g., dot maps (see Lesson 5) or dasymetric maps (see the concept below in this concept gallery), that can tell us more about the spatial distribution of our phenomenon of interest. Incidentally, using multiple representations (whether with the same symbolization method or different methods) can also help us better understand where we have mismatches between the breaks in the statistical surface and breaks in the geographical surface.

A set of three maps to show that different aggregations of count data can affect the appearance of a choropleth map created from the aggregated data.
Figure 4.cg.6 The three maps above show how different aggregations of count data (e.g., using three different types of enumeration units) can affect the appearance of a choropleth map created from the aggregated data. All three maps use the same color scheme and classification, but the aggregation units contain different counts of the phenomenon. As you see, the effects of MAUP can significantly alter the appearance of the final map.

Data Classification and Map Appearance

Finally, there is the issue of the effect of different data classification decisions (e.g., the classification method or the number of classes employed) on the appearance of the map pattern. We discuss this issue in detail in the next section of this concept gallery, but we should briefly note the potential for unclassed choropleth maps. Although there has been some discussion among cartographers about using color values that are proportional to the data values represented in the map (i.e., creating unclassed maps, as Tobler (1973) suggested), the consensus today is that it is difficult for map readers to extract quantitative information from static unclassed color value maps, so most cartographers still prefer to classify their data, especially in maps where map readers may need to extract an individual value or compare regions. A final point to note is that in classed choropleth maps, it is important to ensure that symbols are visually differentiable from each other (i.e., that the value differences between symbols are large enough to avoid confusion). This should be evaluated within the context of the map, as simultaneous contrast (the effect of surrounding symbols on the appearance of an interior symbol) can change a symbol's differentiability.

Recommended Readings

If you are interested in investigating this subject further, I recommend the following:

  • Crampton, J. 2004. "Are choropleth maps good for geography?" GeoWorld, Jan. 2003, p. 58.  Text of article also available here: http://www.rodbuckclasses.com/105/cartography/choropleth.htm
  • McGranaghan, M. 1993. "A cartographic view of spatial data quality." Cartographica, 30(2): 8-19.
  • Herzog, A.P. 2003. "MAPresso Java Applet". http://www.mapresso.com/download/map_zh171bfs.html.

    You can check out an unclassed choropleth map using the MAPresso applet created by Adrian Herzog. Note that it might take some time for the applet to completely download, so be patient.

Classification Schemes

At its heart, classification is an exercise in categorization. We assign locations to categories in order to reduce the complexity of the real world, thereby creating an abstraction that helps us better understand particular characteristics of the world without the distraction of all of the other possible characteristics that we could examine. A distinguishing feature of locations that belong to the same category is that they have a set of shared characteristics. The way in which we assign locations to a particular category can depend on qualitative or quantitative characteristics of that location (e.g., what type of phenomenon is present at that location, or how much of the phenomenon is found at the location). In the remainder of this lesson, we will focus on quantitative classification schemes (i.e., on grouping locations together because they have similar amounts of some phenomenon).

The most important choices you will have to make when classifying your data are which classification method to use and the number of classes to create. Generally, the fewer classes you use, the more important your choice of classification method is, as the map pattern will typically be more variable when you have fewer classes (see Figure 4.cg.7, below).

Three four-class maps to show different classification methods.
Figure 4.cg.7 You can see considerable variation in the location of light and dark areas in this set of four-class maps, each created using a different classification method.

However, when you are deciding on how many classes to use, it is also important to evaluate whether your map readers will be able to physically see differences in the symbol set you will use. For example, if you are creating a choropleth map and are only using color value as a visual variable (instead of a combination of color value and hue, which will allow readers to differentiate between a larger number of symbols), most map readers will only be able to distinguish six or seven different value levels, so your map should not exceed six or seven classes (see Figure 4.cg.8, below).

Three maps, each with a different number of classes.
Figure 4.cg.8 In the four-class map at the left, it should be quite easy to decide which observations are in the same class. Take a look at the six- and eleven-class maps and see if you can do the same. You will probably succeed in the six-class map (middle), but have difficulty with the eleven-class map. Although you should be able to tell if one county is lighter or darker than another in the eleven-class map in a pairwise comparison, it will probably be difficult to pick out all observations that fall in a given class.

We can group classification methods into three main types, depending on the characteristics of the data that each method uses to create the classification scheme: those that are based on some exogenous (i.e., outside) criteria, those that only consider statistical characteristics of the data, and those that consider both statistical and geographical characteristics of the data. Most easily accessible classification methods within GIS software today only consider the statistical characteristics of the data, although it is also possible to create your own classification scheme based on exogenous criteria.

Classification schemes based on exogenous criteria are schemes that use important data values that are not derived from a statistical property of the data set as classification break points (i.e., boundaries between one class and another). Some common examples of exogenous criteria can include definitions (e.g., the amount of income defined as the poverty level), points at which the direction of change is altered (e.g., zero population growth), or values at a previous point in time (e.g., 1996 level of greenhouse gas emissions for each country). All of these exogenous criteria provide benchmarks against which the value for each location in the map can be compared.

Most methods that cartographers use for creating classification schemes consider the statistical properties of the data set. Some common examples include equal interval, quantile, natural breaks, optimal, mean-standard deviation, and classifications based on mathematical progressions. The equal interval classification method divides the range of the data into classes with equal-sized ranges. This is done by determining the range of the data and dividing that range by the number of classes desired (e.g., a data set with values ranging from 0 to 80 and divided into four classes would have the following classes: 0-20, 20-40, 40-60, and 60-80). The quantile method places equal numbers of observations in each class (e.g., in a dataset with 20 observations and 4 classes, each class would contain 25% of the observations, i.e., 5 observations). Natural breaks classifications are typically determined by looking at a graph of the data values (ordered from highest to lowest) and placing breaks where the slope changes substantially (see Figure 4.cg.9, below). The optimal classification scheme automates the natural breaks process by using an iterative procedure that divides values into classes that minimize within-group variability and maximize between-group variability, in an attempt to create the most homogeneous classes possible with the dataset. The mean-standard deviation scheme uses the mean of the dataset as the middle break point, and sets the other class breaks at standard deviation intervals (or some fraction thereof, such as 0.5 of a standard deviation) above and below the mean. Finally, mathematical progressions (e.g., arithmetic and geometric sequences) can be used to create classes that are successively larger or smaller in size.

A cumulative frequency graph to show where best to establish class breaks.
Figure 4.cg.9 Here, orange lines mark locations where class breaks should be established for a natural breaks classification with three classes.
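
If you want to see exactly what the two simplest schemes described above are doing, the following Python sketch computes equal interval and quantile breaks from an arbitrary list of values (a rough illustration of the logic, not how any particular GIS package implements it):

    import math

    def equal_interval_breaks(values, k):
        """Upper class limits for k classes of equal width."""
        lo, hi = min(values), max(values)
        width = (hi - lo) / k
        return [lo + width * i for i in range(1, k + 1)]

    def quantile_breaks(values, k):
        """Upper class limits that place roughly equal numbers of observations in each class."""
        ordered = sorted(values)
        n = len(ordered)
        return [ordered[min(n - 1, math.ceil(i * n / k) - 1)] for i in range(1, k + 1)]

    # A hypothetical 20-value dataset ranging from 0 to 80, as in the example above.
    data = [0, 2, 3, 5, 8, 12, 15, 18, 22, 25, 30, 34, 41, 47, 52, 58, 63, 70, 76, 80]
    print(equal_interval_breaks(data, 4))  # [20.0, 40.0, 60.0, 80.0]
    print(quantile_breaks(data, 4))        # [8, 25, 52, 80] -- five observations per class

Notice that the two methods produce quite different breaks from the same data, which is exactly why the maps in Figure 4.cg.7 look so different.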

Each of the schemes that consider statistical properties is more or less useful for mapping data with particular types of statistical distributions. For example, the equal interval scheme seems to work best for data with a rectangular distribution (i.e., approximately equal numbers of observations over the data range), while it is not very effective for highly skewed data as there may be many empty classes, forcing most observations into one or two classes, and leaving a very uninteresting map. Others, such as the mean-standard deviation scheme, work best for normally distributed data but do not work very well for other types of distributions. Generally, the factors to consider when choosing a classification method include the purpose for which the map will be used, the audience who will be using the map, and the distribution of the data (see Figure 4.cg.10, below).


A classification comparison made up of nine maps. Refer to the image caption for more information.

Figure 4.cg.10 Here, we show maps made from data with three different distribution types. At the left, we mapped a variable with a skewed distribution (diabetes mortality) using both the optimized and equal interval classification methods. With the equal interval classification, very few observations fall into the top two classes, and the map suggests that there is less variability in diabetes mortality than the optimized map does. In the middle maps of asthma mortality, a normally distributed data set, you can see that the mean-standard deviation classification method is able to highlight counties with substantially higher or lower mortality rates than the average county, while it is perhaps less easy to describe regional patterns from the optimal map (e.g., that in Northern California those living on the coast may be more likely to die of asthma than those living farther inland). Finally, in other cases, as with the basically rectangular distribution at the right, the map pattern may be fairly stable across classification types (as in the equal interval and optimized classifications of influenza mortality rates).

Recently, several cartographers have argued that classification methods focusing only on the statistical characteristics of the data ignore an important characteristic of the data: its geographical distribution (Cromley 1996; Murray and Shyy 2000; Armstrong et al. 2003). Without considering the geographical distribution of the data, map readers may have a harder time building regions from the map (Armstrong et al. 2003). Each of these groups has developed a method that takes contiguity (i.e., whether the geographic proximity of observations should matter in drawing class boundaries) into account as well as the statistical properties of the data. Cromley (1996) created a minimum-boundary classification in which the largest differences between adjacent polygons fall across class boundaries, while smaller differences across boundaries are contained within classes. Murray and Shyy (2000) used spatial data mining methods to identify spatial clusters of similar observations, and Armstrong et al. (2003) present a method that uses multicriteria decision analysis to help the cartographer decide which class breaks to choose (from the universe of possible classification schemes) depending on which criteria he or she thinks are most important (e.g., spatial structure, minimizing class variation, etc.).

Recommended Readings

If you are interested in investigating this subject further, I recommend the following:

  • Armstrong, M.P. et al. 2003. "Using genetic algorithms to create multicriteria class intervals for choropleth maps." Annals of the Association of American Geographers. 93(3): 595-623.
  • Slocum, T. et al. 2005. "Chapter 5: Data Classification." Thematic Cartography and Geographic Visualization, Second Edition.


Dasymetric Maps

The term 'dasymetric mapping' was first used by Russian geographers, who described dasymetric maps as density-measuring maps (Wright 1936). Dasymetric maps are similar to choropleth maps in that both types of maps represent data as stepped statistical surfaces. In other words, the data within a polygon are assumed to be distributed equally throughout that polygon's area, and changes in the surface occur abruptly, and only at polygon boundaries.

The main difference between choropleth maps and dasymetric maps is the type of areal unit that is used for collecting data and representing the phenomenon of interest. In choropleth maps, data are typically represented using enumeration units (e.g., census tracts, health service areas, etc.) whose shapes may not be related to the distribution of the geographic phenomenon we are interested in mapping. For this reason, the visual impression that the map gives (i.e., that the phenomenon is evenly distributed throughout the enumeration unit) is usually incorrect. In dasymetric maps, however, the areal units that divide the space are based on the actual character of the data surface, often in combination with enumeration units (see Figure 4.cg.11, below).

A series of dasymetric maps of population density.
Figure 4.cg.11 This example shows a dasymetric map of population density that was created by using an intersection of land cover information with county boundaries. County polygons were split into multiple polygons based on the land uses present in each county, and the population data from each county were then reapportioned into the new polygons. The map at the left shows a dasymetric map of population density, while the maps in the middle and at the right show estimates of error in estimating the population density using the dasymetric method. The error surface was created by comparing the dasymetric results to a population density surface based on larger scale data (census block groups).
Credit: Eicher and Brewer 2001

By now, you might be wondering how we can create dasymetric maps if data are usually collected using unrelated enumeration units rather than areal units that reflect the nature of the data surface. To get around this problem, we can use ancillary data to create a new set of areal units that better represent the data surface. For example, land use is an ancillary data variable that is often used for creating dasymetric maps of population density. Generally, we can use two types of ancillary data variables: limiting variables and related variables. Limiting variables are attributes that help us rule out areas where the phenomenon cannot occur. For example, a data layer that depicts where water bodies are located may be useful for mapping population density, as it is highly unlikely that there will be any people living in the middle of lakes or rivers. Related variables have some sort of association or predictable relationship with the data variable we are trying to map. In our population density application, an example of a related ancillary attribute might be land cover; we know that fewer people tend to live in areas with a cropland land cover than in areas with a developed (i.e., built up) land cover, so we can require the cropland areas to have a lower density.

Note: We will discuss ancillary data in more detail in the Lesson 5 Concept Gallery item called Dot Maps.

A dasymetric map made with a limiting variable (left) and a dasymetric map made with a related variable (right).
Figures 4.cg.12a and 4.cg.12b The figure at the left (a) depicts the process of creating a dasymetric map from a limiting variable (e.g., lakes), while the figure at the right (b) depicts a map created from a related variable. In figure a, at left, we know that people cannot live on 10% of the area of Beltrami County, Minnesota, so we can calculate a new population density figure for the county based on the 90% of the area they can live on (bottom portion of the figure). In figure b, at right, we also know that there are some areas where people will not live (lakes, 10% of the total county area, and bogs and wetlands, in dark blue, 35% of the total county area). This leaves 55% of the county where people can reside. We know that people are likely to reside at higher densities in towns and farmlands (yellow; 10% of the total county area) and at lower densities in forested areas (green; 45% of the total county area). If we know that half of all people live in towns or on farmlands and half live in forested areas, we can calculate new population densities by apportioning the total number of people into the new areas that we have calculated for each land use type. We then arrive at new densities of 82 people per square mile in towns and 36 people per square mile in forested areas.
Credit: Photo Source

When we are creating this new set of areal units, we are basically performing what is called an areal interpolation. In other words, we are transferring quantities of our phenomenon from one set of areal units to another. One thing that we need to be careful about is that we should preserve what Tobler (1979) called the pycnophylactic property. An easy way of describing this is that if you have 100 people in a county, and you subdivide the county into a larger number of units (e.g., new units based on land cover) and redistribute the population among the new units, the sum of the population in the new units should still add up to 100 people. As Langford and Unwin (1994, p. 24) succinctly phrased it: "People are not destroyed or manufactured during the redistribution process."
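
Here is a minimal sketch of that reapportionment step in Python. The land cover categories and weights below are made up; in practice the weights would come from your ancillary data (e.g., an assumed relative density multiplied by the area of each new polygon).

    def reapportion(zone_total, weights):
        """Split a zone's total among sub-zones in proportion to the given weights.

        The shares always sum to the original total, so the pycnophylactic
        property is preserved: no one is created or destroyed.
        """
        weight_sum = sum(weights.values())
        return {name: zone_total * w / weight_sum for name, w in weights.items()}

    # A hypothetical county of 100 people, split by land cover.
    weights = {"developed": 50.0, "cropland": 30.0, "forest": 20.0, "water": 0.0}
    parts = reapportion(100, weights)

    print(parts)  # {'developed': 50.0, 'cropland': 30.0, 'forest': 20.0, 'water': 0.0}
    assert abs(sum(parts.values()) - 100) < 1e-9  # the total is unchanged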

Although off-the-shelf GIS software does not have built-in functionality for creating dasymetric maps, in recent years there has been renewed interest in developing automated methods for producing this type of map in both raster and vector formats (e.g., Fisher and Langford (1996); Eicher (1999); Mennis (2003)).

Recommended Readings

If you are interested in investigating this subject further, I recommend the following:

  • Mennis, J. 2003. "Generating surface models of population using dasymetric mapping." Professional Geographer, 55(1): 31-42.
  • Tobler, W. 2001. "Pycnophylactic reallocation." CSISS. http://www.csiss.org/streaming_video/csiss/tobler_pycno.htm.

    Note that this resource discusses pycnophylactic reallocation within the context of making isoline maps rather than dasymetric maps. The principle is the same; what differs is the way the surface changes (i.e., smoothly rather than abruptly).


Color Schemes

As you may recall from the Symbolization and Color Spaces concept gallery items in Lesson 2, there are three components of color that cartographers have to work with: hue, value, and chroma. In this part of the lesson, we will discuss the different ways that you can use these three components to create different types of color schemes.

The main thing to remember when designing a color scheme is that you want the logic of your colors to relate to the logic in your data (i.e., if you are representing differences in the kind of things on your map, use the component of color that works best for showing nominal differences (hue)). We will discuss four main types of color schemes: sequential, diverging, qualitative and binary.

A sequential scheme is typically used to represent differences in the amount of the phenomenon you are mapping. This difference may be quantitative (e.g., inches of rainfall, hours of sunlight, etc.) or ordinal (e.g., least polluted to most polluted; least desirable vacation spot to most desirable vacation spot). Typically, we use color value combined with color chroma differences when we are creating sequential schemes (see Figure 4.cg.13, below). Experiments with map readers have shown that most map readers associate darker symbols with a larger quantity and lighter symbols with a smaller quantity (McGranaghan 1989), so this is a convention that cartographers generally follow when designing a sequential scheme. Generally, map readers will not be able to tell the difference between more than six or seven levels of color value, especially in the complicated context of the map itself. It is possible to extend your sequence by using more than one hue in combination with value (e.g., from yellow through green to blue). This combination will allow you to create a larger number of symbols (that are still differentiable from each other) than you could with color value alone. A final consideration when creating your sequential schemes is that cartographers typically try to use value differences that are perceptually equal throughout the symbol set (i.e., we do not want the difference in lightness between any two neighboring symbols in the scheme to seem larger than the difference between other neighboring pairs).
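
As a rough illustration of building a single-hue sequence by varying lightness, here is a sketch that uses Python's standard colorsys module (the hue, saturation, and lightness endpoints are arbitrary choices). Note that equal steps in HLS lightness are not perceptually equal; for a finished map you would want to work in a perceptually uniform color space or start from a tested scheme such as ColorBrewer.

    import colorsys

    def sequential_ramp(hue, n=5, lightest=0.90, darkest=0.25, saturation=0.6):
        """n colors of one hue, stepping evenly from light to dark (0-255 RGB tuples)."""
        step = (lightest - darkest) / (n - 1)
        ramp = []
        for i in range(n):
            r, g, b = colorsys.hls_to_rgb(hue, lightest - i * step, saturation)
            ramp.append(tuple(round(c * 255) for c in (r, g, b)))
        return ramp

    # A five-class orange ramp (hue of about 0.08), lightest class first.
    print(sequential_ramp(0.08))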

An example of a map made with a one-hue sequential color scheme.visual spaceAn example of a map made with a two-hue sequential color scheme.
Figure 4.cg.13a and 4.cg.13b At the left (a) is a one-hue sequential color scheme that uses different value levels for one hue (orange) to represent quantitative information. At the right (b) is a two-hue sequential color scheme that starts at yellow (a high value color), and progresses through green to blue while also decreasing in color value (i.e., getting darker).
Credit: Photo Source

A diverging scheme can be constructed by fusing two sequential schemes together, using a common color (typically white or another light color such as yellow or light gray) as the midpoint. Hence the name diverging: the scheme is composed of two sequential schemes that diverge from a common color. Diverging schemes are most useful for making comparisons with some critical value in the data. You can choose any number of values as the critical value, ranging from zero (e.g., in a map of population change, zero represents no change, with either side of the diverging sequence representing positive or negative population growth), to the mean or median (e.g., in a map of mortality from vehicle accidents (see Figure 4.cg.14, below) to highlight areas that are at higher or lower risk), to some targeted level (e.g., in a map of greenhouse gas emission reductions, to emphasize how far some countries have reduced their emissions beyond the target specified in a treaty, which countries have not met that target, and how far they still have to go). One research group has also found that diverging schemes are better able to help map readers identify true clusters of high or low values on maps (and avoid seeing spurious ones), perhaps because of the added differentiation that a second hue brings to the map (Brewer et al. 1997).
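
Continuing the sketch above, a diverging scheme can be assembled by mirroring two single-hue lightness ramps around a shared near-white midpoint (again, the specific hues, lightness levels, and midpoint color are arbitrary illustrative choices):

    import colorsys

    def diverging_ramp(hue_low, hue_high, n_per_side=3, lightest=0.85, darkest=0.30, sat=0.6):
        """Two single-hue lightness ramps joined at a shared near-white midpoint."""
        def side(hue):
            # Light-to-dark lightness steps for one hue, as 0-255 RGB tuples.
            step = (lightest - darkest) / (n_per_side - 1)
            return [tuple(round(c * 255) for c in colorsys.hls_to_rgb(hue, lightest - i * step, sat))
                    for i in range(n_per_side)]
        midpoint = [(242, 242, 242)]  # the class that straddles the critical value
        # Dark-to-light in the low hue, the midpoint, then light-to-dark in the high hue.
        return list(reversed(side(hue_low))) + midpoint + side(hue_high)

    # For example, a seven-class blue-to-red scheme centered on the mean rate.
    print(diverging_ramp(0.60, 0.02))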

A map made with diverging color schemes.
Figure 4.cg.14 In this map of motor vehicle death rates, you can clearly differentiate areas that are substantially higher or lower than the mean mortality rate, as each direction of variation is represented by a different color hue. Here, we can see that counties in coastal California and a strip running from San Francisco to Lake Tahoe experience lower motor vehicle death rates. We may hypothesize that this is due to better access to hospitals, as these areas are where the majority of Californians live.
Credit: Photo Source

A qualitative scheme mainly uses differences in color hue to indicate differences in the kind of some phenomenon (e.g., land use, crop type, religion, etc.). In a qualitative scheme, you will generally want to choose color hues that have approximately the same lightness and chroma level (see Figure 4.cg.15, below); otherwise, you will find that the more saturated or lighter colors pop out from the map. One exception to this may be in cases where you have groups of related variables within the map. For example, if you were creating a map of foreign-born residents but also wanted to distinguish levels of residential segregation of new immigrants, you might choose to use a different hue for each continent of origin, and then specify two levels of that hue based on the proportion of the enumeration unit that each group makes up (e.g., a county where persons born in South America made up more than 30% of the population might be represented by a dark blue color, while a county where they made up less than 30% would be represented with a lighter blue color).

Contact the instructor if you have difficulty viewing this image
Figure 4.cg.15 This map uses color hue to show the major fuel that is used to heat homes in California. Notice that all of the colors have about the same color value and chroma (i.e., they are all muted shades rather than including some bright, vibrant colors).
Credit: Photo Source

Binary color schemes are a special case of qualitative or sequential color schemes that have only two categories. Depending on what you are aiming to represent, you may choose to use either color hue or color value for creating a binary scheme. For example, a map that depicted the candidate that most people voted for in the last presidential election might use color hue (e.g., blue and red are colors traditionally used in the United States for this type of map). In other cases, you might choose to use color value (e.g., if you are representing which locations are visible from a particular viewpoint, you might use black for areas that are not visible and white for areas that are visible).

Recommended Readings

If you are interested in investigating this subject further, I recommend the following: