In this exercise we will be using the GeoViz Toolkit, an application that employs cartography for visual thinking purposes. It incorporates a set of tools developed for the visualization of multidimensional, geographic data.
The GeoViz Toolkit was developed by Frank Hardisty and others, primarily at the GeoVISTA Center at The Pennsylvania State University. A full contributor list can be found here. Please note that the GeoViz Toolkit is a research tool, and therefore has some bugs, inconsistencies, and possibly unexplained features. You must have Java 1.6 or later to use this version of the GeoViz Toolkit.
If you have not done so already, it will be valuable to read the Parts in this lesson on Visual Communication and Visual Thinking. And if you are not familiar with methods of Classification, read that concept gallery item from lesson 4. Also consider reading through the short concept gallery item from Lesson 8 on the Integration of Maps and Information Graphics.
A. Download the GeoViz Toolkit
Download the GeoViz Toolkit (8.6 Mb). (If you have trouble downloading the toolkit from this site, the same version is also available in the lesson 1 folder in Canvas).
This link will download a zip file that includes a jar file version of the GeoViz Toolkit, as well as a csv file that describes the variables present in the Toolkit. The csv file will be useful for deciphering the variable names used in the toolkit. Save the zip to your local file system, unzip it to any local directory, and start the toolkit by double-clicking the jar file. You must have Java 1.6 or later installed. The jar file will not run from inside the default Windows zip program; it needs to be unzipped first and then run.
B. Introduction to the GeoViz Toolkit
- Open the jar file. The GeoViz Toolkit should open up, showing data from the 2008 presidential election in six different tools: 1) the VariablePicker, 2) the SingleScatterPlot, 3) the GeoMap, 4) the ParallelPlot, 5) the SingleHistogram, and 6) the IndicationAnimator.
Credit: A. Gruver
Before exploring any variables in particular, use your pointer and roll over the various tool windows to see how the tools are linked together. If you roll over a county in the GeoMap tool you will see where that county is represented in some of the other tools (i.e. as a dot in the scatter plot, a line in the parallel coordinate plot, and a highlighted bin in the histogram). Now also roll over some of the dots in the scatter plot and some lines in the parallel plot. And click a bin in the histogram. You can also drag and select dots in the scatter plot (and although you are supposed to be able to do that in the parallel plot as well, I cannot get that to work).
- Open the csv file that was downloaded with the jar file. It is named USA_Election_ObMeta.csv.
Briefly review the variable names (under ColumnHeader) and their descriptions to get an idea of the kind of variables you will be able to look at in the GeoViz Toolkit. There are election variables from 2008 and 2004, demographic variables from 2000, socioeconomic variables from 2002, and 1997 emissions variables - the last of which you will probably want to ignore. Most of these variables have been used with other datasets like cancer incidence and mortality, and we put them in here since we had them. Clearly not all of these will show much of interest.
Please note: There are a few known errors in this dataset, and you will probably notice them when looking through the data. Keep them in mind when you are visualizing and analyzing the data. Several large cities, Washington D.C., Baltimore, St. Louis, and some smaller ones in VA, are showing 0% vote for Obama and/or McCain. Also, all the 2008 election figures for CA are incorrect, both percents and counts. Miami-Dade county also is showing information incorrectly for the 2004 and 2008 elections.
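If you ever work with an exported copy of this data outside the toolkit, a quick check like the one below can flag the records described in the note. This is only a sketch: the rows are invented, and the NAME and STATE column names are assumptions made for illustration (OBAMA_PCT and MCCAIN_PCT are the variable names used in the toolkit). It does not cover the Miami-Dade problem.

```python
def flag_suspect(rows):
    """Return the names of rows that match the known problems noted above."""
    suspect = []
    for row in rows:
        # Several cities show a 0% vote for one or both candidates.
        zero_vote = row["OBAMA_PCT"] == 0 or row["MCCAIN_PCT"] == 0
        # All 2008 figures for CA are reported as incorrect.
        in_california = row["STATE"] == "CA"  # hypothetical column name
        if zero_vote or in_california:
            suspect.append(row["NAME"])  # hypothetical column name
    return suspect


# Invented rows, for illustration only.
rows = [
    {"NAME": "Example City", "STATE": "MD", "OBAMA_PCT": 0.0, "MCCAIN_PCT": 0.0},
    {"NAME": "Example County", "STATE": "PA", "OBAMA_PCT": 48.0, "MCCAIN_PCT": 51.0},
    {"NAME": "Example County CA", "STATE": "CA", "OBAMA_PCT": 55.0, "MCCAIN_PCT": 43.0},
]

flagged = flag_suspect(rows)
print(flagged)
```

Flagged records are worth keeping in mind (or excluding) when you interpret a selection that includes them.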
A couple more notes before exploring some data:
Back in the GeoViz Toolkit, go to the Tutorial under the Help menu. In the left column of the help menu you will see links to all the tools included in the GeoViz Toolkit. Read about the tools you see as well as others to get an idea of what they are for. Minimize the tutorial - without closing it - so you can access it again. (If you close the tutorial, you will have to close out of the program and restart it to see the tutorial again).
Here, also, are a few video demonstrations of tools within the GeoViz Toolkit that may be helpful if you feel like you need more introduction to the individual tools or toolkit: http://www.geovista.psu.edu/geoviztoolkit/.
C. Let's use some of the tools in the toolkit
Let's start exploring data visually using the GeoMap. Two variables can be shown at a time in this bivariate map. In this version of the toolkit, the GeoMap opens up with the OBAMA_PCT as the first variable, using a light gray to blue sequential color scheme, and MCCAIN_PCT as the second variable, using a light gray to red color scheme. The variables are classed in three quantile classes.
Use the sliders at the top of the GeoMap, or type a value into the box next to the slider, to increase the number of classes. More classes allows more variation to be seen across the map, but makes the map and class values harder to comprehend.
- Explore the classification methods possible in the GeoMap via the drop-down menus at the top of the tool window. (But don't use the Custom Classifier. It doesn't work and it stops the other classifications from working. If this happens, just close the toolkit and start it over). Notice how different the map looks depending on what classification method is used. This would be even more evident with data that is not as normally distributed as these two percentage vote variables are. Because the spatial patterns we see on a map are dependent on the classification method used, it is very important to be aware of your classes (or the method used). We will discuss this more a bit later in this exercise, and in depth in Lesson 4.
Note: look at the SingleHistogram tool to see a variable's distribution along the number line. A variable that shows a bell curve and is symmetric around its mean has a normal distribution.
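To see why the choice of classification method matters, here is a minimal sketch that computes class breaks two ways for a small, strongly skewed list of made-up values. The function names and the sample data are assumptions for illustration; this is not the toolkit's own code, and its classifiers may handle ties and rounding differently.

```python
def quantile_breaks(values, n_classes):
    """Upper bounds of classes holding roughly equal numbers of values."""
    ordered = sorted(values)
    breaks = []
    for k in range(1, n_classes + 1):
        # Index of the last value that belongs to class k.
        idx = round(k * len(ordered) / n_classes) - 1
        breaks.append(ordered[idx])
    return breaks


def equal_interval_breaks(values, n_classes):
    """Upper bounds of classes that each span the same data range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    return [lo + width * k for k in range(1, n_classes + 1)]


def classify(value, breaks):
    """Return the 0-based class index of value given upper-bound breaks."""
    for i, upper in enumerate(breaks):
        if value <= upper:
            return i
    return len(breaks) - 1


# Made-up, strongly skewed sample.
sample = [1, 2, 2, 3, 4, 5, 8, 12, 40]

q = quantile_breaks(sample, 3)
e = equal_interval_breaks(sample, 3)
q_classes = [classify(v, q) for v in sample]
e_classes = [classify(v, e) for v in sample]
print("quantile breaks:", q, "classes:", q_classes)
print("equal interval breaks:", e, "classes:", e_classes)
```

With quantiles each class gets three values, while with equal intervals nearly every value falls in the lowest class and the single large value sits alone in the highest. The same data would therefore produce two very different looking maps.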
- Take note of the legend in the upper right of the GeoMap display (the bivariate colored rectangle). This is the gamut of colors that are possible on the map. Because the percentage votes for McCain and Obama are (almost) exactly inversely correlated, you will see some areas that are blue and red, and several middle purple colors, but not any areas high in both (bright purple) or low in both (light gray).
Credit: A. Gruver
- Now look at the scatter plot (with the tool name SingleScatterPlot) to understand the relationship between the two variables. Make sure you are looking at the same two variables in the scatter plot as you are in the map. The color scheme should still be the same as on the map, and you can update the scatter plot's classification to match that of the map, if you find that useful.
Scatter plots are used to illustrate the relationship between two variables. In our case, each point on the plot represents a county (from the map), and each county is plotted along the two axes for the variables shown, making a point somewhere in the 2D space of the scatter plot. If the points tend to create a band running from lower left to upper right, there is a positive correlation between the two variables. If the points create a band from the upper left to lower right, there is a negative, or inverse, correlation. The less distance there is from the points on the plot to the line of best fit (the red line on the plot), the stronger is the relationship that exists between the two variables.
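The ideas in this paragraph can be put into numbers. The sketch below computes the standard Pearson correlation coefficient and a least-squares line of best fit for five hypothetical counties. The data are invented, and I am assuming the usual textbook formulas here; the lesson does not show how the toolkit itself calculates its line or its correlation value.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def pearson_r(xs, ys):
    """Pearson correlation coefficient, between -1 and 1."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_fit_line(xs, ys):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


# Made-up percentages for five hypothetical counties in a two-candidate race.
cand_a = [30.0, 45.0, 50.0, 62.0, 71.0]
cand_b = [69.0, 54.0, 49.0, 37.0, 28.0]  # each value is 99 minus cand_a

r = pearson_r(cand_a, cand_b)
slope, intercept = best_fit_line(cand_a, cand_b)
print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.1f}")
```

Because every point here lies exactly on a downward-sloping line, the correlation is as strongly inverse as it can be. Real county data would scatter around the line and give a value closer to zero.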
You should be able to see in the scatter plot, as it is when the toolkit is opened and as pictured in the figure below, that the percent vote for McCain and the percent vote for Obama are very clearly and strongly inversely correlated.
Note: To have all points in the scatter plot in focus, drag a box around the whole plot area, as if selecting all points. On the map you can simply drag a small box outside the continental U.S. (in the white area) to have all counties selected and in focus.
Credit: A. Gruver
- Next, look at the parallel coordinate plot (named ParallelPlot in the toolkit). Each vertical line in the plot represents a data variable, and each segmented line that moves horizontally across the vertical lines is a county (in our case, or a data record, otherwise). Each county is represented as a point along each vertical variable line, and the points for each county are connected, creating the segmented line. As you roll over the lines with your pointer, they should highlight and the values for the variables shown should appear. The light purple label states the name of the county highlighted. In my figure below, I'm hovering on the line representing Houston county. The names of the variables used in the plot are stated along the bottom of the vertical lines, and the pale transparent yellow boxes that appear show the attribute values for those variables, e.g. for Houston county, MCCAIN_PCT is 70.0% and OBAMA_PCT is 29.0%.
You should be able to see the clear inverse correlation between the first two variables showing the percent of people voting for Obama and the percent of people voting for McCain. If you look at the next two variables on the plot as well, from the ordering of the colored horizontal segmented lines, you should be able to see that there is a positive correlation between OBAMA_PCT and CHANGE_P_O (the 1st and 3rd vertical lines), and also between MCCAIN_PCT and CHANGE_P_ (the 2nd and 4th vertical lines).
After consulting the csv file describing the variable names, this should make more sense. These show a positive correlation between the counties that voted more for Obama and the counties that had a higher percent change in their vote for the Democratic candidate between '04 and '08, AND a positive correlation between the counties that voted more for McCain and the counties that had a higher percent change in their vote for the Republican candidate between '04 and '08.
So generally speaking - but not necessarily true in all cases (or the outcome of the '08 election would likely have been different) - places that voted Democratic for the presidential election in '04 voted more strongly Democratic in '08, and counties that voted Republican in '04 voted more strongly Republican in '08. The last two columns in the figure shown above represent counts of votes for Obama and McCain, which show that counties with more votes (which are more populated counties) correlate with a higher percent vote for Obama in '08, and also correlate with each other (from what we can see); that is, counties with higher votes for Obama also generally had higher votes for McCain and vice versa.
Credit: A. Gruver
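If it helps to see how a parallel coordinate plot is put together, here is a minimal sketch. It rescales each variable to its own 0 to 1 range and returns, for each county, the points that would be joined into one segmented line. The function name and the three counties are invented for illustration, and the toolkit may scale its axes differently.

```python
def polyline_points(records, variables):
    """Return, per record, its (axis_index, height) points.

    Each variable is rescaled to 0..1 using its own minimum and maximum,
    so every vertical axis spans the full range of that variable.
    """
    ranges = {}
    for var in variables:
        column = [rec[var] for rec in records]
        ranges[var] = (min(column), max(column))

    lines = []
    for rec in records:
        points = []
        for axis, var in enumerate(variables):
            lo, hi = ranges[var]
            # A variable with no spread is drawn at mid-height.
            height = 0.5 if hi == lo else (rec[var] - lo) / (hi - lo)
            points.append((axis, height))
        lines.append(points)
    return lines


# Three hypothetical counties; the values are invented for illustration.
counties = [
    {"OBAMA_PCT": 29.0, "MCCAIN_PCT": 70.0},
    {"OBAMA_PCT": 50.0, "MCCAIN_PCT": 49.0},
    {"OBAMA_PCT": 71.0, "MCCAIN_PCT": 28.0},
]

lines = polyline_points(counties, ["OBAMA_PCT", "MCCAIN_PCT"])
for line in lines:
    print(line)
```

The first county runs from the bottom of the first axis to the top of the second, and the third county does the opposite. Lines that cross between two axes like this are what an inverse correlation looks like in a parallel plot, while lines that stay roughly level indicate a positive one.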
- Click on the Add Tool drop down menu in the Toolkit and explore other visualization tools. Consult the tutorial on their use in the toolkit or the page on video demonstrations of the GeoViz Toolkit if you want more information on any of these tools. You are also welcome to post thoughts or questions in the discussion forum.
D. Data exploration with the GeoViz Toolkit
Let's start to use the toolkit to visually think and explore the data we have at hand. This section will walk you through an example data exploration using the GeoViz Toolkit with the data preloaded into the Toolkit.
- First let's start with something we know about the data we are using. Look at this election map from the New York Times. Take time to look through it if you are unfamiliar with the election from 2008. We are going to use some of this as our hypothetical basis of what we already know about this election data.
- On the NY Times map, click on the County Leaders button in the upper left of the display so the map looks as it does in the image below.
Note: This representation is called a choropleth map. Click on the County bubbles link in the display to see a good example of a proportional symbols representation. In lesson 4 we discuss choropleth maps and in lesson 5 we discuss map representations (including proportional symbols).
Credit: The New York Times
Notice that the map is not a bivariate map, like the one we are working with in the GeoViz Toolkit. It is a univariate map that uses a diverging color scheme: red for a Republican win in 2008 and blue for a Democratic win in 2008, with the value of the red or blue determined by the margin of the win (see the map key for class ranges). This representation is also referred to as a bipolar map.
- Looking at our bivariate map in the GeoViz Toolkit showing the two 'percent vote' variables (OBAMA_PCT and MCCAIN_PCT), classed as quantiles, it doesn't look too far off from the NYT county leaders map.
- Because I have access to election data from 2004 and 2008, I am going to explore some of the differences between those two elections via the Democratic candidate variables. And because there has been a lot of press about how race has affected the election (in '08 and '12), I am going to look at some race/ethnicity variables.
The VariablePicker, the tool in the upper left of the Toolkit window (by default), is the way to update all or many of the tools in the Toolkit at one time, and the only way to get the variables you want in the parallel plot.
Using the variable picker, and the Ctrl key (or Command on a Mac), select OBAMA_PCT, KERRY_PCT, BLACK, WHITE, HISP_LAT. Then click the Send Selection button. (You will have to scroll down past a lot of variables to get to the last four mentioned).
- Looking at the bivariate map with the percent Obama vote from '08 and the percent Kerry vote from '04, we can see much of the U.S. voted similarly in the two elections.
If you did not change the default color scheme, we can see that the light gray counties represent low votes for both of those candidates, and counties that are dark purple have relatively high votes for both of those candidates. (Keep in mind that we are using quantile classification - which arbitrarily places the same number of counties in each class for the two variables). Once you understand the color scheme and correlation of these two variables, it follows that counties that are blue had higher percentage votes for Obama in '08 than they did for Kerry in '04, and counties that are red had higher percentage votes for Kerry in '04 than Obama in '08. With this tool we can then easily identify geographical areas that voted more Democratic or more Republican in '08 compared to the '04 presidential election.
Credit: A. Gruver
Where do you see voting shifts from '04 to '08? It appears from this symbolization that areas in the eastern Midwest became more Democratic in '08, and the Arkansas and Tennessee area became more Republican in '08 compared to '04.
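The reading of the bivariate colors described here can be written as a small lookup. This is a deliberately simplified sketch: the function name, the color names, and the rule that the higher class decides the hue are my assumptions for illustration, not the toolkit's actual color blending.

```python
def bivariate_color(class_first, class_second, n_classes=3):
    """Name the rough color family for a pair of class indices (0 = lowest).

    class_first is the class of the variable shown in blue,
    class_second is the class of the variable shown in red.
    """
    top = n_classes - 1
    if class_first == class_second:
        if class_first == 0:
            return "light gray"    # low on both variables
        if class_first == top:
            return "dark purple"   # high on both variables
        return "medium purple"
    # Unequal classes: the variable in the higher class sets the hue.
    return "blue" if class_first > class_second else "red"


# Print the legend grid, with the blue variable's class increasing upward.
for first in (2, 1, 0):
    print([bivariate_color(first, second) for second in (0, 1, 2)])
```

With OBAMA_PCT as the first (blue) variable and KERRY_PCT as the second (red) one, a county in the top class for Obama and the bottom class for Kerry comes out blue, which is how we spot places that shifted toward the Democratic candidate in '08.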
- Now look at the scatter plot and the parallel plot with these variables. The scatter plot shows a strong correlation between the two variables (OBAMA_PCT and KERRY_PCT), but with more scatter than the inverse correlation between Obama and McCain (e.g. there are more dots further from the line of best fit and the "rSquare" value is not as strong). Note: the r-squared value quantifies how much of the variability in one variable is accounted for by its linear relationship with the other.
Roll over or select some of the points in the scatter plot to see where these areas are in the bivariate map. You can also select areas of outliers to see where the counties are that deviated from the correlation, e.g. points below the line of best fit. FYI, in the scatter plot (alone) you are able to shift-click and add on to your selection.
Credit: A. Gruver
- Next let's look at the ParallelPlot. Hover over some of the strings to see how particular counties rank across the variables. Looking at the plot overall, paying particular attention to the color scheme used on the OBAMA_PCT variable (specified in the visual classifier at the bottom of the ParallelPlot window) and how it correlates or doesn't with the other variables, we can see some things we might expect. For instance, counties with a high % black population have a high % vote for Obama. But the opposite does not necessarily hold true. Counties with a high % white population (or small % black population) are made up of both places that voted strongly for McCain as well as strongly for Obama.
A question that comes to mind at this point is: are there geographic variations between the high % white areas that voted more for McCain and the high % white areas that voted more for Obama? How might we go about looking at this using this toolkit?
Credit: A. Gruver
Selection is not functioning, or not yet implemented, in the Parallel Plot tool, so let's use it in combination with other tools. That is the benefit of linking and brushing in a toolkit like this.
Note: Brushing is when you select or filter records or geographic areas based on an attribute. Linking is the display of that selection or filter in tools/windows other than the one where it was made.
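For those curious how brushing and linking can be wired together in software, here is a minimal sketch. The class names and county identifiers are invented, and this says nothing about how the GeoViz Toolkit is actually built; it only shows the general idea of views sharing one selection.

```python
class SelectionHub:
    """Shares one selection among all registered views (linking)."""

    def __init__(self):
        self.views = []

    def register(self, view):
        self.views.append(view)

    def brush(self, source, selected_ids):
        # Forward the selection to every view except the one that made it.
        for view in self.views:
            if view is not source:
                view.highlight(selected_ids)


class View:
    """A stand-in for a map, scatter plot, or histogram window."""

    def __init__(self, name, hub):
        self.name = name
        self.highlighted = set()
        self.hub = hub
        hub.register(self)

    def select(self, ids):
        # Brushing: the user picks records in this view.
        self.highlighted = set(ids)
        self.hub.brush(self, self.highlighted)

    def highlight(self, ids):
        # Linking: show a selection that was made in another view.
        self.highlighted = set(ids)


hub = SelectionHub()
scatter = View("scatter plot", hub)
geomap = View("map", hub)
histogram = View("histogram", hub)

# Selecting two (hypothetical) counties in the scatter plot...
scatter.select({"county_17", "county_42"})
# ...highlights the same counties in the other views.
print(geomap.highlighted, histogram.highlighted)
```

A view whose own selection does not work can still take part in this arrangement, because it only needs to display selections made elsewhere.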
- The scatter plot is an easy way to select a set of points/counties based on one or two variables (whereas the parallel plot would be more useful if you wanted to select attributes from more than two variables, e.g. counties lower for Obama than Kerry, high % white, and higher % Catholic).
In the scatter plot keep OBAMA_PCT as the first variable, but make WHITE the second variable. You will likely have to select all the dots on the scatter plot so they come into focus.
We see here essentially what we saw in the parallel plot above: there is a clear visual relationship between counties with low white populations and higher Obama votes, but the relationship falls apart for areas with high percents of white people. You will see there is a slight to moderate inverse correlation overall between the OBAMA_PCT and WHITE variables, with an r-squared value of -.338 as reported by the toolkit. Note that the toolkit reports this value with a sign (a true squared value could not be negative), so read it like a correlation coefficient: the closer the value is to 1 or -1, the stronger the correlation (direct or inverse, respectively).
- Use your pointer to select the different quadrants of the scatter plot to see how the correlation changes via the r-squared value. Also pay attention to where those counties are in the map (and if you had variables other than those related to race or the vote in the parallel plot, it would be beneficial to pay attention to the selections as they appear there as well).
Specifically, select the points in the upper left quadrant of the scatter plot which represent counties with low % white populations and high % vote for Obama; then select the upper right quadrant representing high % white and high % vote for Obama; and then select the lower right quadrant representing high % white and low % vote for Obama. Since there are no counties in the lower left quadrant you do not need to select there.
What r-squared values did you see?
Did you notice interesting geographic differences with the selections?
When I selected >50% Obama vote and <50% white I saw a -.649 r-squared value, suggesting a much stronger correlation for those two variables than for the whole U.S. But for the other two selections the correlation weakened substantially (I got -.188 and -.106 respectively). And there were some noticeable geographic differences between those three selections, though I did not see anything that surprised me. In other words, the selection of high % white and high % vote for Obama seemed representative of areas where Obama did well generally (except for the low % white areas in the SE), and the selection of high % white and low % vote for Obama seemed representative of areas where Obama did poorly generally. From this I could conclude that these differences in the vote for Obama (outside of part of the SE) had more to do with variables outside of race.
- Now let's look at this relationship via the map. Keeping OBAMA_PCT and KERRY_PCT selected as the variables displayed in the map, and OBAMA_PCT and WHITE as the variables displayed in the scatter plot, select different regions of the U.S. in the map and observe the correlations in the scatter plot and the parallel plot. For instance, select the Pacific NW, the entire West, the plains, the Midwest, the NE, the Mid-Atlantic and/or the SE.
What are the areas where the inverse correlation between the vote for Obama and the % white is much stronger than it is for the whole U.S.? There are a couple of areas where strong correlations are found. Along the Mississippi valley (i.e. essentially selecting MS, AR and LA) and also in central and southern SC, GA and AL, I was able to see correlations in the -.8 to -.95 region of values, which is quite strong.
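Selecting part of a scatter plot and watching the correlation change is the same as recomputing the correlation on a subset of the records. Here is a minimal sketch with eight invented counties, using the standard Pearson formula (an assumption on my part, since the lesson does not show how the toolkit computes its value). The numbers are made up and are not meant to match the toolkit's results.

```python
from math import sqrt


def pearson_r(points):
    """Pearson correlation for a list of (x, y) pairs."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in points)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented (percent white, percent vote for the candidate) pairs.
counties = [
    (10, 90), (20, 80), (30, 72), (40, 58),
    (60, 60), (70, 30), (80, 55), (90, 25),
]

r_all = pearson_r(counties)
r_low_white = pearson_r([p for p in counties if p[0] < 50])
r_high_white = pearson_r([p for p in counties if p[0] >= 50])

print(f"all counties:   {r_all:.3f}")
print(f"under 50% white: {r_low_white:.3f}")
print(f"50% white or more: {r_high_white:.3f}")
```

In this made-up data the inverse relationship is tighter among the low % white counties than among the high % white ones, so a single value for the whole set hides differences that only appear once you select subsets.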
Because my original interest was to explore some of the differences between the '04 and '08 election, let's now look to see if this is a substantially different voting pattern for '08 with Obama as a candidate than in '04 with Kerry as the Democratic candidate.
There are several ways to start comparing these variables over the two different years. One option would be to employ another multivariate tool (other than the PCP), e.g. the StarPlotMap. The star plot creates a polygon with the number of sides/vertices based on the variables of interest, and it varies the lengths of the vertices from the center based on the values of the variables. And then the star plot map plots those polygons on the geographic areas they pertain to. So in essence, you should be able to look at a series of variables together in a star plot map and if there are some clear multivariate relationships you might be able to see some patterns.
Feel free to try using this tool, which is in the Add Tool menu. See the image below of part of a star plot map with other tools in the Toolkit, though it will be much easier to inspect at a larger scale by experimenting with this tool directly.
Credit: A. Gruver
- Since we were able to see interesting differences in the scatter plot from regional selections on the map, let's look at the scatter plot again, comparing the % vote for Kerry and the % white population.
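The geometry of a star plot is simple enough to sketch. The function below places one spoke per variable at evenly spaced angles and scales each spoke by the variable's value. The function name, the scaling by each variable's maximum, and the sample values are assumptions for illustration; the StarPlotMap tool may scale and orient its spokes differently.

```python
from math import cos, sin, pi


def star_vertices(values, max_values, radius=1.0):
    """Return (x, y) vertices of a star plot polygon.

    One spoke per variable, evenly spaced around the center; each spoke's
    length is the value scaled by that variable's maximum.
    """
    n = len(values)
    vertices = []
    for k, (value, vmax) in enumerate(zip(values, max_values)):
        angle = 2 * pi * k / n
        length = radius * value / vmax
        vertices.append((length * cos(angle), length * sin(angle)))
    return vertices


# Four invented variables for one hypothetical county, each with maximum 100.
star = star_vertices([50, 100, 25, 100], [100, 100, 100, 100])
for x, y in star:
    print(f"({x:.2f}, {y:.2f})")
```

Counties with similar values across the variables produce similarly shaped polygons, which is why regional patterns can show up when the stars are drawn on the map.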
Rather than change the variables in the scatter plot you already have, open a second scatter plot so that we can compare the correlations and r-squared values for the % white population with both Obama and Kerry at the same time. Under the Add Tool menu, click on SingleScatterPlot. A second scatter plot window will open (probably with the % Obama vote on the y-axis and the % Kerry vote on the x-axis).
- Change the axes so the y-axis shows the % Kerry vote and the x-axis shows the % white population. This way the scatter plot will be comparable to the first scatter plot showing the % Obama vote on the y-axis with the % white population on the x-axis.
- Select all the points on the scatter plot so they come into focus. We can see that the r-squared values for the two scatter plots are similar, -.338 (for the '08 election) compared to -.382 (for the '04 election).
Credit: A. Gruver
- Now use the map as we did before to select regions of the U.S. and then look at the two scatter plots, to see if (and how) they differ. Do you see differences for certain regions between the '04 and '08 elections (i.e. for the correlation between % white population and the % vote for the two Democratic candidates)?
Recall the bivariate map, still showing the Obama variable in blue and the Kerry variable in red, illustrates the areas where the vote differed for the candidates. The red areas had greater % votes for Kerry than Obama and the blue areas the opposite. Thus, I suspected that there would be differences in the scatter plots for those areas.
- Select the region around the southern Mississippi River, i.e. the states of LA, AR, and MS, including some of the area we see in red (where there was a decrease in % vote for the Democratic candidate from '04 to '08). I observed a -.833 r-squared value for '08 compared to -.592 for '04. A bit of a difference.
Credit: A. Gruver
- Now look at the other area I mentioned above as having a strong inverse correlation for % white and % vote for Obama: the states of AL, GA and SC. Here, there was not as much variation between the % vote for Obama and % vote for Kerry, as illustrated by the primarily gray or purple counties (as opposed to blue or red) in the area. And accordingly, there is less variation between the two scatter plots as well. For the selection of AL, GA and SC, as I show below, the scatter plots are similar in their strong inverse correlations; they show -.922 and -.901 respectively for '08 and '04.
Credit: A. Gruver
- So what can you infer from these observations?
For the Alabama, Georgia, and the southern South Carolina area, I'm led to think that differences in voting for Democrats or Republicans had a lot to do with race before Obama came into the picture. The inverse correlation between the % white population and the % vote for Kerry was essentially as strong as it was for Obama, which is quite strong.
Looking at the Arkansas, Louisiana, and Mississippi region, the inverse correlation between the % white population and the % vote for Kerry was not nearly as strong as it was for Obama. Maybe voting for Democrats or Republicans in this area wasn't as racially divided until the race of the candidate was a factor. Or maybe the change had to do with any of the very many other variables involved.
Regardless of what is actually affecting the correlations we see here, we do have to be cautious about our conclusions because we are looking at aggregated data. The error of ascribing correlations seen in aggregate data to individuals is known as the ecological fallacy. To learn more about this, see the links in the suggested readings below.
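A tiny made-up example shows how the ecological fallacy can arise. Below, three invented areas each contain two individuals with two attributes, x and y. Correlating the area averages gives a perfect positive relationship, while correlating the individuals gives a weak negative one. All values and names are hypothetical, and the standard Pearson formula is assumed.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented individual-level records: area -> list of (x, y) pairs.
areas = {
    "area_1": [(0, 2), (2, 0)],
    "area_2": [(1, 3), (3, 1)],
    "area_3": [(2, 4), (4, 2)],
}

# Aggregate view: one point per area, using the area means.
mean_x = [sum(x for x, _ in people) / len(people) for people in areas.values()]
mean_y = [sum(y for _, y in people) / len(people) for people in areas.values()]
r_areas = pearson_r(mean_x, mean_y)

# Individual view: every record counted on its own.
all_x = [x for people in areas.values() for x, _ in people]
all_y = [y for people in areas.values() for _, y in people]
r_individuals = pearson_r(all_x, all_y)

print(f"correlation of area averages: {r_areas:.2f}")
print(f"correlation of individuals:   {r_individuals:.2f}")
```

County-level data like ours is the aggregate view, so a strong correlation between two county variables does not tell us how any individual voter in those counties behaved.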
You have just completed the Visual Thinking exercise, which involved using cartography for the explicit purpose of visual thinking (as opposed to visual communication). As you will read in the deliverables on the next page, it is now your turn to explore data using the GeoViz Toolkit.