At its heart, classification is an exercise in categorization. We assign locations to categories in order to reduce the complexity of the real world, thereby creating an abstraction that helps us better understand particular characteristics of the world without the distraction of all of the other possible characteristics that we could examine. A distinguishing feature of locations that belong to the same category is that they have a set of shared characteristics. The way in which we assign locations to a particular category can depend on qualitative or quantitative characteristics of that location (e.g., what type of phenomenon is present at that location, or how much of the phenomenon is found at the location). In the remainder of this lesson, we will focus on quantitative classification schemes (i.e., on grouping locations together because they have similar amounts of some phenomenon).
The most important choices you will have to make when classifying your data are which classification method to use and the number of classes to create. Generally, the fewer classes you use, the more important your choice of classification method is, as the resulting map pattern will typically differ more from one method to another when you have fewer classes (see Figure 4.cg.7, below).
However, when you are deciding on how many classes to use, it is also important to evaluate whether your map readers will be able to physically see differences in the symbol set you will use. For example, if you are creating a choropleth map and are only using color value as a visual variable (instead of a combination of color value and hue, which will allow readers to differentiate between a larger number of symbols), most map readers will only be able to distinguish six or seven different value levels, so your map should not exceed six or seven classes (see Figure 4.cg.8, below).
We can group classification methods into three main types, depending on the characteristics of the data that each method uses to create the classification scheme: those that are based on some exogenous (i.e., outside) criteria, those that only consider statistical characteristics of the data, and those that consider both statistical and geographical characteristics of the data. Most easily accessible classification methods within GIS software today only consider the statistical characteristics of the data, although it is also possible to create your own classification scheme based on exogenous criteria.
Classification schemes based on exogenous criteria use meaningful data values that are not derived from a statistical property of the data set as classification break points (i.e., boundaries between one class and another). Common examples of exogenous criteria include definitions (e.g., the amount of income defined as the poverty level), points at which the direction of change is altered (e.g., zero population growth), or values at a previous point in time (e.g., 1996 level of greenhouse gas emissions for each country). All of these exogenous criteria provide benchmarks against which the value for each location in the map can be compared.
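
As a rough illustration of classifying against exogenous break points, here is a minimal Python sketch. The zero-growth break comes from the example above; the other two break points, the class labels, the place names, and the growth rates are all made up for illustration, and the rule that a value exactly on a break goes to the upper class is an assumption.

```python
from bisect import bisect_right

# Break points chosen from outside the data. 0.0 is the zero-growth
# benchmark; -1.0 and 1.0 are hypothetical values used only for illustration.
BREAKS = [-1.0, 0.0, 1.0]
LABELS = ["below -1%", "-1% to 0%", "0% to 1%", "1% and above"]

def classify(value, breaks=BREAKS):
    """Return the index of the class that value falls in.

    Assumption: a value equal to a break point goes to the upper class.
    """
    return bisect_right(breaks, value)

# Made-up annual population growth rates (percent) for five places.
growth = {"A": -2.3, "B": -0.4, "C": 0.0, "D": 0.7, "E": 3.1}
classes = {name: LABELS[classify(rate)] for name, rate in growth.items()}
```

Because the break points do not depend on the data, the same classes can be reused across maps of different years or regions, which is what makes them useful as benchmarks.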
Most methods that cartographers use for creating classification schemes consider the statistical properties of the data set. Common examples include equal interval, quantile, natural breaks, optimal, mean-standard deviation, and classifications based on mathematical progressions. The equal interval classification method divides the range of the data into classes with equal-sized ranges. This is done by finding the range of the data and dividing it by the number of classes desired (e.g., a data set with values ranging from 0 to 80 and divided into four classes would have a class width of 20 and the following classes: 0-20, 20-40, 40-60, and 60-80, with a consistent rule needed for values that fall exactly on a shared boundary). The quantile method divides the data set so that each class contains an equal number of observations (e.g., in a data set with 20 observations and 4 classes, each class would contain 25% of the observations, or 5 observations).
Natural breaks classifications are typically determined by looking at a graph of the data values (ordered from highest to lowest) and placing breaks in places where the slope substantially changes (see Figure 4.cg.9, below). The optimal classification scheme automates the natural breaks process by using an iterative procedure that divides values into classes that minimize within-group variability and maximize between-group variability, in an attempt to create the most homogeneous classes that are possible with the data set.
The mean and standard deviation scheme uses the mean of the data set as the middle break point, and determines the other class breaks by adding standard deviations (or some fraction thereof, such as 0.5 of a standard deviation) to the mean or subtracting them from it. Finally, mathematical progressions (e.g., arithmetic and geometric sequences) can be used to create classes whose ranges become progressively wider or progressively narrower.
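
Three of the schemes described above (equal interval, quantile, and mean-standard deviation) are simple enough to sketch in a few lines of Python. This is an illustrative sketch rather than what any particular GIS package does: the function names are hypothetical, the quantile rule is one of several reasonable conventions, and the use of the population standard deviation (`pstdev`) rather than the sample standard deviation is an assumption.

```python
from statistics import mean, pstdev

def equal_interval_breaks(values, k):
    """Interior break points that split the data range into k equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def quantile_breaks(values, k):
    """Interior break points that put roughly equal counts in each class.

    Each break is the first value of the next class in sorted order. Tied
    values, or a count not divisible by k, make the classes only roughly equal.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def mean_stddev_breaks(values, step=1.0, n_each_side=1):
    """Breaks at the mean and at multiples of `step` standard deviations
    on either side of it (population standard deviation, by assumption)."""
    m = mean(values)
    s = pstdev(values)
    return [m + j * step * s for j in range(-n_each_side, n_each_side + 1)]

# The 0-to-80, four-class example from the text.
print(equal_interval_breaks([0, 35, 80], 4))       # [20.0, 40.0, 60.0]

# Twenty distinct observations in four classes: five observations per class.
print(quantile_breaks(list(range(1, 21)), 4))      # [6, 11, 16]
```

Each function returns only the interior break points, so k classes are described by k - 1 values. A separate rule is still needed to decide which class a value exactly on a break belongs to.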
Each of the schemes that consider statistical properties is more or less useful for mapping data with particular types of statistical distributions. For example, the equal interval scheme works best for data with a rectangular distribution (i.e., approximately equal numbers of observations over the data range), while it is not very effective for highly skewed data: most observations fall into one or two classes and the remaining classes may be nearly or entirely empty, producing a map that shows very little variation. Others, such as the mean-standard deviation scheme, work best for normally distributed data but do not work very well for other types of distributions. Generally, the factors to consider when choosing a classification method include the purpose for which the map will be used, the audience who will be using the map, and the distribution of the data (see Figure 4.cg.10, below).
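
The effect of skew on the equal interval scheme can be seen with a small made-up data set. The sketch below counts observations per class under equal interval breaks and under a simple quantile rule. The data values, the helper name `class_counts`, and the convention that a value on a break goes to the upper class are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter

def class_counts(values, breaks):
    """Count observations per class; a value on a break goes to the upper class."""
    counts = Counter(bisect_right(breaks, v) for v in values)
    return [counts.get(i, 0) for i in range(len(breaks) + 1)]

# Made-up, highly skewed data: eleven small values and one very large one.
skewed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 100]
k = 4

# Equal interval: the range 1..100 is split into four classes of width 24.75.
lo, hi = min(skewed), max(skewed)
equal_breaks = [lo + i * (hi - lo) / k for i in range(1, k)]

# Quantile-style breaks: the first value of each later class in sorted order.
ordered = sorted(skewed)
quantile_breaks = [ordered[(i * len(ordered)) // k] for i in range(1, k)]

print(class_counts(skewed, equal_breaks))     # [11, 0, 0, 1]
print(class_counts(skewed, quantile_breaks))  # [3, 3, 3, 3]
```

With equal intervals, eleven of the twelve locations would share a single symbol and two classes would go unused, while the quantile breaks spread the locations evenly at the cost of grouping the outlier with much smaller values.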
Several cartographers have argued that classification methods that focus on the statistical characteristics of the data are ignoring an important characteristic of the data: its geographical distribution (Cromley 1996; Murray and Shyy 2000; Armstrong et al. 2003). Without considering the geographical distribution of the data, map readers may have a harder time building regions from the map (Armstrong et al. 2003). Each of these groups has developed a new method that takes contiguity factors (i.e., whether the geographic proximity of observations should be important in drawing class boundaries) into account as well as statistical properties of the data. Cromley (1996) created a minimum-boundary classification in which the largest differences between adjacent polygons are represented by different classes, while smaller differences across boundaries are contained within classes. Murray and Shyy (2000) used spatial data mining methods to identify spatial clusters of similar observations, and Armstrong et al. (2003) present a method of using multi-criteria decision analysis to aid the cartographer in deciding which class breaks to choose (from the universe of possible classification schemes) depending on which criteria the cartographer considers most important (e.g., spatial structure, class variation minimization, etc.).
If you are interested in investigating this subject further, I recommend the following:
- Armstrong, M.P. et al. 2003. "Using genetic algorithms to create multicriteria class intervals for choropleth maps." Annals of the Association of American Geographers. 93(3): 595-623.
- Slocum, T. et al. 2009. "Chapter 5: Data Classification." Thematic cartography and geographic visualization, Third Edition.