GEOG 486
Cartography and Visualization

Classification Schemes


At its heart, classification is an exercise in categorization. We assign locations to categories in order to reduce the complexity of the real world, thereby creating an abstraction that helps us better understand particular characteristics of the world without the distraction of all of the other possible characteristics that we could examine. A distinguishing feature of locations that belong to the same category is that they have a set of shared characteristics. The way in which we assign locations to a particular category can depend on qualitative or quantitative characteristics of that location (e.g., what type of phenomenon is present at that location, or how much of the phenomenon is found at the location). In the remainder of this lesson, we will focus on quantitative classification schemes (i.e., on grouping locations together because they have similar amounts of some phenomenon).

The most important choices you will have to make when classifying your data are which classification method to use and the number of classes to create. Generally, the fewer classes you use, the more important your choice of classification method is, as the map pattern will typically be more variable when you have fewer classes (see Figure 4.cg.7, below).

Three four-class maps to show different classification methods.
Figure 4.cg.7 You can see considerable variation in the location of light and dark areas in this set of four-class maps, each created using a different classification method.

However, when you are deciding on how many classes to use, it is also important to evaluate whether your map readers will be able to physically see differences in the symbol set you will use. For example, if you are creating a choropleth map and are only using color value as a visual variable (instead of a combination of color value and hue, which will allow readers to differentiate between a larger number of symbols), most map readers will only be able to distinguish six or seven different value levels, so your map should not exceed six or seven classes (see Figure 4.cg.8, below).

Three class maps, each with a different number of classes.
Figure 4.cg.8 In the four-class map at the left, it should be quite easy to decide which observations are in the same class. Take a look at the six- and eleven-class maps and see if you can do the same. You will probably succeed in the six-class map (middle), but have difficulty with the eleven-class map. Although you should be able to tell if one county is lighter or darker than another in the eleven-class map in a pairwise comparison, it will probably be difficult to pick out all observations that fall in a given class.

We can group classification methods into three main types, depending on the characteristics of the data that each method uses to create the classification scheme: those that are based on some exogenous (i.e., outside) criteria, those that only consider statistical characteristics of the data, and those that consider both statistical and geographical characteristics of the data. Most easily accessible classification methods within GIS software today only consider the statistical characteristics of the data, although it is also possible to create your own classification scheme based on exogenous criteria.

Classification schemes based on exogenous criteria are schemes that use important data values that are not derived from a statistical property of the data set as classification break points (i.e., boundaries between one class and another). Some common examples of exogenous criteria can include definitions (e.g., the amount of income defined as the poverty level), points at which the direction of change is altered (e.g., zero population growth), or values at a previous point in time (e.g., 1996 level of greenhouse gas emissions for each country). All of these exogenous criteria provide benchmarks against which the value for each location in the map can be compared.
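As a minimal sketch of classifying against exogenous break points, the Python snippet below assigns places to classes using break values chosen by the map maker rather than derived from the data. The place names, growth rates, the 1.0 threshold, and the class labels are all hypothetical; only the zero-growth break comes from the example above.

```python
from bisect import bisect_right

def classify(value, breaks):
    """Return the 0-based class index for value, given ascending break points.

    A value equal to a break point is placed in the upper class.
    """
    return bisect_right(breaks, value)

# Hypothetical population growth rates (percent per year) for five places.
growth = {"A": -1.2, "B": -0.1, "C": 0.0, "D": 0.8, "E": 2.5}

# Exogenous break points: zero growth is the meaningful benchmark;
# the 1.0 break is an arbitrary threshold used only for illustration.
breaks = [0.0, 1.0]
labels = ["below zero", "0 to under 1", "1 or more"]

classes = {place: labels[classify(rate, breaks)] for place, rate in growth.items()}
for place, label in classes.items():
    print(place, label)
```

Because the breaks are fixed in advance, the same scheme can be applied to another data set or year and the maps remain directly comparable.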

Most methods that cartographers use for creating classification schemes consider the statistical properties of the data set. Some common examples include equal interval, quantile, natural breaks, optimal, mean-standard deviation, and classifications based on mathematical progressions. The equal interval classification method divides the range of the data into classes with equal-sized ranges. This is done by figuring out the range of the data and dividing that range by the number of classes desired (e.g., a data set with values ranging from 0 to 80 and divided into four classes would have the following classes: 0-20; 20-40; 40-60 and 60-80). The quantile method divides the data set into equal numbers of observations per class (e.g., in a dataset with 20 observations and 4 classes, each class would contain 25% of the observations (i.e., 5 observations)). Natural breaks classifications are typically determined by looking at a graph of the data values (ordered from highest to lowest) and placing breaks in places where the slope substantially changes (see Figure 4.cg.9, below). The optimal classification scheme automates the natural breaks process by using an iterative procedure that divides values into classes that minimize within-group variability and maximize between-group variability in an attempt to create the most homogeneous classes that are possible with the dataset. The mean and standard deviation scheme uses the mean of the dataset as the middle break point, and uses the values of standard deviations (or some part thereof, such as 0.5 of a standard deviation) added to or subtracted from the mean for determining the other class breaks. Finally, mathematical progressions (e.g., arithmetic and geometric sequences) can be used to create classes that become progressively wider or progressively narrower.
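To make three of these schemes concrete, here is a rough Python sketch that computes break points for the equal interval, quantile, and mean-standard deviation methods. The function names and sample values are illustrative assumptions, as are two conventions: each break is treated as the inclusive upper limit of its class, and the population standard deviation is used. GIS packages may handle ties and boundaries differently.

```python
import statistics
from bisect import bisect_left

def equal_interval_breaks(values, k):
    """Split the data range into k classes of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def quantile_breaks(values, k):
    """Upper value of each of the first k-1 classes when the sorted
    observations are split into k groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k - 1] for i in range(1, k)]

def mean_std_breaks(values, steps=1, fraction=1.0):
    """Breaks at the mean and at multiples of (fraction * sd) on either side.

    Assumption: population standard deviation (statistics.pstdev).
    """
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [mean + j * fraction * sd for j in range(-steps, steps + 1)]

def class_counts(values, breaks):
    """Count observations per class; a value equal to a break
    falls in the lower class."""
    counts = [0] * (len(breaks) + 1)
    for v in values:
        counts[bisect_left(breaks, v)] += 1
    return counts

# Values from 0 to 80 in four classes, as in the example above.
ei = equal_interval_breaks([0, 10, 20, 30, 40, 50, 60, 70, 80], 4)

# Twenty observations in four classes: five observations per class.
twenty = list(range(1, 21))
q = quantile_breaks(twenty, 4)
q_counts = class_counts(twenty, q)

# Hypothetical data with mean 5 and standard deviation 2.
ms = mean_std_breaks([2, 4, 4, 4, 5, 5, 7, 9])

print(ei, q, q_counts, ms)
```

Passing `fraction=0.5` to `mean_std_breaks` places the breaks half a standard deviation apart, which corresponds to the "some part thereof" variant mentioned above.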

A cumulative frequency graph to show where best to establish class breaks.
Figure 4.cg.9 Here, orange lines mark locations where class breaks should be established for a natural breaks classification with three classes.
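The idea behind the optimal scheme (minimizing within-class variability) can be sketched for a very small data set with an exhaustive search. Note that this is not the iterative procedure used in GIS software; it simply tries every way of cutting the sorted values into k contiguous groups and keeps the one with the smallest total of squared deviations from the class means. The function names and sample values are hypothetical, and the approach is only practical for a handful of observations.

```python
from itertools import combinations

def sum_sq_dev(group):
    """Sum of squared deviations of a group from its own mean."""
    m = sum(group) / len(group)
    return sum((x - m) ** 2 for x in group)

def optimal_classes(values, k):
    """Exhaustively search for the k contiguous groups of the sorted
    values with the smallest total within-class squared deviation."""
    ordered = sorted(values)
    n = len(ordered)
    best_groups, best_cost = None, float("inf")
    # Each combination lists the positions where a new class starts.
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        groups = [ordered[a:b] for a, b in zip(bounds, bounds[1:])]
        cost = sum(sum_sq_dev(g) for g in groups)
        if cost < best_cost:
            best_groups, best_cost = groups, cost
    return best_groups

# Hypothetical values with two obvious gaps.
values = [1, 2, 3, 10, 11, 12, 30, 31, 33]
groups = optimal_classes(values, 3)
print(groups)
```

In this sample, the class boundaries land in the two large gaps in the sorted values, which is where you would expect to place breaks when inspecting a graph of the data by eye.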

Each of the schemes that consider statistical properties is more or less useful for mapping data with particular types of statistical distributions. For example, the equal interval scheme seems to work best for data with a rectangular distribution (i.e., approximately equal numbers of observations over the data range), while it is not very effective for highly skewed data as there may be many empty classes, forcing most observations into one or two classes, and leaving a very uninteresting map. Others, such as the mean-standard deviation scheme, work best for normally distributed data but do not work very well for other types of distributions. Generally, the factors to consider when choosing a classification method include the purpose for which the map will be used, the audience who will be using the map, and the distribution of the data (see Figure 4.cg.10, below).

A classification comparison made up of nine maps. Refer to the image caption for more information.
Figure 4.cg.10 Here, we show maps made from data with three different distribution types. At the left, we mapped a variable with a skewed distribution with both the optimized and equal interval classification methods. You can see that with the equal interval classification, very few observations fall into the top two classes, and the map suggests that there is less variability in diabetes mortality than the optimized map does. In the middle maps of asthma mortality, a normally distributed data set, you can see that the mean-standard deviation classification method in combination is able to highlight counties with substantially higher or lower mortality rates than the average county, while it is perhaps less easy to describe regional patterns from the optimal map (e.g., that in Northern California those living on the coast may be more likely to die of asthma than those living farther inland). Finally, in other cases, as in this basically rectangular distribution, the map pattern may be fairly stable across classification types (as in the equal interval and optimized classifications of influenza mortality rates at the right).
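The effect of skew on an equal interval scheme is easy to check numerically. The short Python sketch below uses a made-up, strongly skewed set of twelve values (not the mortality data in the figure) and counts how many observations land in each of four classes under equal interval and quantile breaks. As in any such sketch, the helper names and the convention that a value equal to a break falls in the lower class are assumptions.

```python
from bisect import bisect_left

def class_counts(values, breaks):
    """Count observations per class; a value equal to a break
    falls in the lower class."""
    counts = [0] * (len(breaks) + 1)
    for v in values:
        counts[bisect_left(breaks, v)] += 1
    return counts

# Hypothetical skewed data: most values are small, one is very large.
skewed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 100]
k = 4

# Equal interval: divide the range (1 to 100) into four equal widths.
lo, hi = min(skewed), max(skewed)
width = (hi - lo) / k
equal_breaks = [lo + i * width for i in range(1, k)]

# Quantile: upper value of each of the first three groups of three.
ordered = sorted(skewed)
n = len(ordered)
quant_breaks = [ordered[(i * n) // k - 1] for i in range(1, k)]

equal_counts = class_counts(skewed, equal_breaks)
quant_counts = class_counts(skewed, quant_breaks)
print("equal interval:", equal_counts)
print("quantile:      ", quant_counts)
```

With these values, the equal interval scheme puts eleven of the twelve observations in the lowest class and leaves the two middle classes empty, while the quantile scheme places three observations in each class.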

Recently, several cartographers have argued that classification methods that focus on the statistical characteristics of the data are ignoring an important characteristic of the data: its geographical distribution (Cromley 1996; Murray and Shyy 2000; Armstrong et al. 2003). Without considering the geographical distribution of the data, map readers may have a harder time building regions from the map (Armstrong et al. 2003). Each of these groups has developed a new method that takes contiguity factors (i.e., whether the geographic proximity of observations should be important in drawing class boundaries) into account as well as statistical properties of the data. Cromley (1996) created a minimum-boundary classification that creates classes where the largest differences between adjacent polygons are represented with different classes, while smaller differences across boundaries are contained within classes. Murray and Shyy (2000) used spatial data mining methods to identify spatial clusters of similar observations, and Armstrong et al. (2003) present a method of using multicriteria decision analysis to aid the cartographer in deciding which class breaks to choose (from the universe of possible classification schemes) depending on which criteria they think are most important (e.g., spatial structure, class variation minimization, etc.).
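To give a feel for what a contiguity-aware criterion might measure, the Python sketch below scores a candidate set of class breaks by the share of the total difference between neighboring polygons that falls across a class boundary. This is only an illustrative measure loosely inspired by the description above; it is not the algorithm published by Cromley, Murray and Shyy, or Armstrong et al., and the function name, polygon values, and adjacency list are all hypothetical.

```python
from bisect import bisect_left

def boundary_capture(values, neighbors, breaks):
    """Share of the total neighbor-to-neighbor difference that lies
    between polygons assigned to different classes.

    values:    dict mapping polygon id to data value
    neighbors: list of (id, id) pairs of adjacent polygons
    breaks:    ascending class breaks (inclusive upper limits)
    """
    total = 0.0
    captured = 0.0
    for a, b in neighbors:
        diff = abs(values[a] - values[b])
        total += diff
        if bisect_left(breaks, values[a]) != bisect_left(breaks, values[b]):
            captured += diff
    return captured / total if total else 0.0

# Hypothetical values for four polygons arranged in a row: A-B-C-D.
values = {"A": 10, "B": 12, "C": 30, "D": 33}
neighbors = [("A", "B"), ("B", "C"), ("C", "D")]

# A break at 20 separates B from C, the largest neighbor difference.
score_large_gap = boundary_capture(values, neighbors, [20])
# A break at 11 separates A from B, a small neighbor difference.
score_small_gap = boundary_capture(values, neighbors, [11])
print(score_large_gap, score_small_gap)
```

A higher score means the class boundaries on the map coincide with the sharpest contrasts between neighbors, which is the kind of property a contiguity-based method tries to favor alongside the purely statistical criteria.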

Recommended Readings

If you are interested in investigating this subject further, I recommend the following:

  • Armstrong, M.P. et al. 2003. "Using genetic algorithms to create multicriteria class intervals for choropleth maps." Annals of the Association of American Geographers. 93(3): 595-623.
  • Slocum, T. et al. 2005. "Chapter 5: Data Classification." Thematic cartography and geographic visualization, Second Edition.