In the previous lesson, we saw how a spatial process can be described in mathematical terms so that the patterns it is expected to produce can be predicted. In this lesson, we will apply this knowledge to the analysis of point patterns. Point pattern analysis is the application in which these ideas are most thoroughly developed, so it is the best place to learn about this approach.
Point pattern analysis has become an extremely important application in recent years, particularly in crime analysis, in epidemiology, and in facility location planning and management. Point pattern analysis also goes all the way back to the very beginning of spatial analysis in Dr. John Snow's work on the London cholera epidemic of 1854.
By the end of this lesson, you should be able to:
Lesson 4 is one week in length. (See the Calendar in Canvas for specific due dates.) To finish this lesson, you must complete the activities listed below. You may find it useful to print this page out first so that you can follow along with the directions.
|1||Work through Lesson 4.||You are in the Lesson 4 online content now. The Overview page precedes this page, and you are on the Checklist page right now.|
|2||Reading Assignment||This week, the reading is detailed, demanding, and long. I therefore recommend that you start it as soon as possible. The project only requires the first chapter of reading for its completion, so you may want to do the first part of the reading, complete the project, and then return to the reading. Whatever you do, don't leave the reading to the last minute this week!|
|3||Lesson 4 Deliverables||This lesson is one week in length. The following items must be completed by the end of the week. See the Calendar tab, above, for the specific date.|
Please use the 'Lesson 4 Discussion Forum' to ask for clarification on any of these concepts and ideas. Hopefully, some of your classmates will be able to help with answering your questions, and I will also provide further commentary there where appropriate.
It should be pointed out that the distinction between first- and second-order effects is a fine one. In fact, it is often scale-dependent, and often an analytical convenience, rather than a hard and fast distinction. This becomes particularly clear when you realize that an effect that is first-order at one scale may become second-order at a smaller scale (that is, when you 'zoom out').
A simple example is a steady east-west rise in land elevation. Viewed at a regional scale, it is a first-order trend, but when you zoom out to the continental scale, it becomes a more localized topographic feature. This is yet another example of the scale dependence inherent in spatial analysis and noted in Lesson 1.
It is worth emphasizing the point that quadrats need not be square, although it is rare for them not to be.
With regard to kernel density estimation (KDE) it is worth pointing out the strongly scale-dependent nature of this method. This becomes apparent when we view the effect of varying the KDE bandwidth on the estimated event density map in the following sequence of maps, all generated from the same pattern of Redwood saplings as recorded by Strauss, and available in the spatstat package in R (which you will learn about in the project). To begin, Figure 4.1 shows a bandwidth of 0.25.
It may be helpful to briefly distinguish the four major distance methods discussed here:
It is useful to see these measures as forming a progression from least to most informative (with an accompanying rise in complexity).
The measures discussed in the preceding two sections can all be tested statistically for deviations from the expected values associated with a random point process. In fact, deviations from any well defined process can be tested, although the mathematics required becomes more complex.
This section simply outlines how each of the measures described in previous sections may be tested statistically. The most complex of these is the K function, where the additional concept of an L function is introduced to make it easier to detect large deviations from a random pattern. In fact, using the pair correlation function, many of the difficulties of interpreting the K function disappear, so this approach is becoming more widely used.
More important, in practical terms, is the Monte Carlo procedure discussed on pages 148-154 [pages 104-108, 1st edn]. Monte Carlo methods are common in statistics generally, but are particularly useful in spatial analysis when mathematical derivation of the expected values of a pattern measure can be very difficult. Instead of trying to derive analytical results, we simply make use of the computer's ability to randomly generate many patterns according to the process description we have in mind, and then compare our observed result to the simulated distribution of results. This approach is explored in more detail in the project for this lesson.
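To make the Monte Carlo idea concrete, here is a minimal sketch in base R (no spatstat needed). It is not the procedure from the text, just one simple instance of it: the 'observed' pattern is made-up toy data in a unit square, the pattern measure is the mean nearest neighbor distance, and the helper name mean_nnd is my own.

```r
# Monte Carlo test of mean nearest neighbor distance against IRP/CSR.
# Base R only; the "observed" pattern is toy data, not real data.
set.seed(42)

mean_nnd <- function(x, y) {
  d <- as.matrix(dist(cbind(x, y)))  # all pairwise distances
  diag(d) <- Inf                     # ignore each point's distance to itself
  mean(apply(d, 1, min))             # mean distance to the nearest neighbor
}

# Toy clustered pattern: 5 parent locations, 10 offspring around each
n <- 50
parents <- matrix(runif(10, 0.2, 0.8), ncol = 2)
idx <- rep(1:5, each = 10)
obs_x <- parents[idx, 1] + rnorm(n, sd = 0.03)
obs_y <- parents[idx, 2] + rnorm(n, sd = 0.03)
observed <- mean_nnd(obs_x, obs_y)

# Simulate the null process: n points placed uniformly at random, many times
nsim <- 999
simulated <- replicate(nsim, mean_nnd(runif(n), runif(n)))

# One-sided Monte Carlo p-value for clustering (small distances are extreme)
p_value <- (1 + sum(simulated <= observed)) / (nsim + 1)
observed
range(simulated)
p_value
```

The only 'theory' needed is the ability to simulate the null process. The same recipe works for any pattern measure you can compute, which is why the approach is so useful when analytical results are hard to derive.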
Ready? Take the first Lesson 4 Quiz to check your knowledge! Return now to the Lesson 4 folder in Canvas to access it. You have an unlimited number of attempts and must score 90% or more.
You may want to come back to this section, which considers the discussion and ideas in Sections 6.1-6.6 [Section 5.1, 1st edn], later, after you've worked on this week's project.
In the real world, the approaches discussed up to this point have their place, but they also have some severe limitations.
The key issue is that classical point pattern analysis allows us to say that a pattern is 'evenly-spaced' or 'clustered' relative to some null spatial process (usually the independent random process), but it does not allow us to say where the pattern is clustered. This is important in most real world applications. A criminal investigator takes it for granted that crime is more common at particular 'hotspots', i.e., that the pattern is clustered, so statistical confirmation of this assumption is nice to have ("I'm not imagining things... phew!"), but it is not particularly useful. However, an indication of where the crime hotspots are located would certainly be useful.
The problem is that detecting clusters in the presence of background variation in the affected population is very difficult. This is especially so for rare events. You can get some idea of the degree of difficulty from the description of the Geographical Analysis Machine (GAM) on pages 166-168 and 173-177 [pages 119-122]. Although GAM has not been widely adopted by epidemiologists, the approach suggested by it was ground-breaking and other more recent tools use very similar methods. (See the optional 'Try This' box below for more on this.)
The basic idea is very simple: repeatedly examine circular areas on the map and compare the observed number of events of interest to the number that would be expected under some null hypothesis (usually spatial randomness). Tag all those circles that are statistically unusual. That's it!
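The circle-scanning idea can be sketched in a few lines of base R. This is only an illustration under strong assumptions, not GAM itself: the events are toy data in a unit square, the expected count assumes a uniform at-risk population, there is a single fixed radius, and the 0.001 threshold is an arbitrary choice of mine.

```r
# Scan a grid of circles and flag those with unusually many events.
set.seed(1)

# Toy events: 200 uniform background points plus 30 in a tight cluster
ex <- c(runif(200), rnorm(30, mean = 0.7, sd = 0.03))
ey <- c(runif(200), rnorm(30, mean = 0.3, sd = 0.03))
n <- length(ex)

radius <- 0.1
# Circle centres chosen so that every circle lies wholly inside the unit square
centres <- expand.grid(cx = seq(0.1, 0.9, by = 0.05),
                       cy = seq(0.1, 0.9, by = 0.05))

# Observed number of events inside each circle
count_in_circle <- function(cx, cy) sum((ex - cx)^2 + (ey - cy)^2 <= radius^2)
centres$observed <- mapply(count_in_circle, centres$cx, centres$cy)

# Expected count under uniformity: n * (circle area / study area of 1)
expected <- n * pi * radius^2

# Poisson upper-tail probability of seeing at least the observed count
centres$p <- ppois(centres$observed - 1, lambda = expected, lower.tail = FALSE)

# Tag the statistically unusual circles
flagged <- centres[centres$p < 0.001, ]
flagged
```

In this toy example the flagged circles should sit around the planted cluster near (0.7, 0.3).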
Three things make this conceptually simple procedure tricky.
If you are interested, take a look at the SaTScan website. SaTScan is a tool developed by the Biometry Research Group of the National Cancer Institute in the United States. SaTScan works in a very similar way to the original GAM tool, but has wider acceptance among epidemiological researchers. You can download a free copy of the software and try it on some sample data.
Ready? Take the second Lesson 4 Quiz to check your knowledge! Return now to the Lesson 4 folder in Canvas to access it. You have an unlimited number of attempts and must score 90% or more.
Now that you've completed the readings and self-test quizzes for this lesson, it is time to apply what you've learned!
There is no specific deliverable for this week; however, you should use this week to begin the peer review process for the preliminary proposals. Early this week, I will send you a message letting you know which students' proposals you have been assigned to review. Begin by looking at the proposals you have been assigned to review as posted on the 'Project Topic Discussion Forum.' Then, simply post your comments to the assigned project proposal topic. Your peer reviews are due by the end of Week 5, although you are welcome to post them at any point between now and then.
Timely submission of your peer reviews is worth up to 3 points of the total 30 points available for the term-long final project.
You should consider the following aspects in writing comments for the authors of the proposals:
Remember... you will be receiving two reviews from other students of your own proposal, so you should include the types of useful feedback that you would like to see in those commentaries. Criticism is fine, provided that it includes constructive inputs and suggestions. If something is wrong, how can it be fixed?
Meanwhile, I will be reviewing the preliminary proposals, and providing each of you with feedback and suggestions. I will aim to complete my reviews and e-mail them to you this week.
Please use the 'General Issues' discussion forum to ask any questions now or at any point during this project. You'll find this forum listed under 'Term-Long Project Discussion Forums' in the 'Modules' section in Canvas.
In this week's project, you will use some of the point pattern analysis tools available in the R package spatstat to investigate some point patterns of crime in St. Louis, Missouri.
You need an installation of R, to which you will need to add the spatstat and maptools packages. You should already have added spatstat. To add maptools, use the Packages - Install package(s)... menu option as before.
You will also need data:
You should get your R installation ready (install the packages mentioned above), start up R, and change the working directory to wherever you have put the city_limits_km shapefile (you will have to unzip it) and the crime data file.
For Project 4, the items you are required to submit are as follows:
Please use the 'Lesson 4 Discussion Forum' to ask for clarification on any of these concepts and ideas. Hopefully, some of your classmates will be able to help with answering your questions, and I will also provide further commentary there where appropriate. To access the forums, click on 'Discussions' and navigate to the appropriate forum from there.
To get started, we need to first get the city limits (i.e., the study area) into R, so that it can be associated with the point data. Here's how:
> library(maptools)
> S <- readShapePoly("city_limits_km.shp")
> SP <- as(S, "SpatialPolygons")
> W <- as(SP, "owin")
In order, this: loads up the maptools package, reads the shapefile into data object S, converts S into a collection of polygons SP, and then converts SP into an 'owin' object, which is the format that spatstat requires so that the data can be used as an analysis window. You can plot W to see what you are dealing with:
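The plotting command itself is not reproduced at this point. Assuming W is a spatstat 'owin' object as described, the generic plot() function is the likely candidate. Here is a self-contained sketch in which a hypothetical rectangular window, W_demo, stands in for the city limits:

```r
library(spatstat)

# Hypothetical stand-in for the city limits window (units assumed to be km)
W_demo <- owin(xrange = c(0, 10), yrange = c(0, 15))

plot(W_demo, main = "Study area (stand-in window)")
area.owin(W_demo)  # area of the window, in square km for this stand-in
```

With the real data loaded, plot(W) should draw the city limits polygon in the same way.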
Next, read in the crime data:
> xy <- read.table("StLouisCrime2014.txt", header=T, sep="\t")
You can inspect the contents of xy, by typing xy, and see the names of this dataset's attributes by typing names(xy). Then convert it to a spatstat point pattern object, with the different crime types as an identifying mark:
> attach(xy)
> pp <- ppp(X, Y, window=W, marks=CRIME)
Remember that you will have to load the spatstat library using library(spatstat) before you can use any of its functions. Note that the attach() command above makes the various attributes of the raw data xy available for direct access by name. You can now make a map:
To see the three crime types separately:
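The two mapping commands referred to above are not shown here. A minimal sketch of what they might look like follows, using a toy marked pattern (pp_demo, in a hypothetical rectangular window) in place of the real crime data. The mark labels are copied from this project, but the locations are random.

```r
library(spatstat)
set.seed(7)

# Hypothetical rectangular window standing in for the city limits
W_demo <- owin(c(0, 10), c(0, 15))

# Toy marked pattern: random locations, each labelled with a crime type
crime <- factor(sample(c("DISORDERLY", "BURGLARY", "HITANDRUN"), 90,
                       replace = TRUE))
pp_demo <- ppp(runif(90, 0, 10), runif(90, 0, 15),
               window = W_demo, marks = crime)

plot(pp_demo)          # one map, with a different symbol for each mark
plot(split(pp_demo))   # one panel for each mark level
```

With the real data, the equivalent calls would presumably be plot(pp) and plot(split(pp)).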
We are going to work with each crime as a distinct dataset, so it's convenient to split them permanently:
> gun <- pp[CRIME=="DISORDERLY"]
> rob <- pp[CRIME=="BURGLARY"]
> hit <- pp[CRIME=="HITANDRUN"]
And you can make maps of each individually like this:
> plot(density(gun))
> contour(density(gun), add=T)
> plot(gun, add=T)
Once you're comfortable that you have the data loaded, proceed to the next page.
The first step in any spatial analysis is to become familiar with your data. In point pattern analysis, kernel density analysis is often used for this purpose; so, first, you are asked to experiment with the kernel density function in spatstat.
Kernel density visualization is performed in spatstat using the density() function which we have already seen in action. The only additional piece of information you need to know is how to vary the bandwidth:
> plot(density(gun, 0.25))
The second parameter in the density function is the bandwidth. R's definition of bandwidth requires some care in its use. Because it is the standard deviation of a Gaussian (i.e., normal) kernel function, it is actually only around 1/2 of the radius across which events will be 'spread' by the kernel function. Remember that the spatial units we are using here are kilometers. It's probably best to add contours to a plot by storing the result of the density analysis:
> d250 <- density(gun, 0.25)
> plot(d250)
> contour(d250, add=T)
and you can also add the points themselves, if you wish:
> plot(gun, add=T)
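To see why a Gaussian bandwidth is only around half of the effective 'spread' radius, it helps to work out how much of a single event's kernel weight falls within a given distance. For an isotropic two-dimensional Gaussian kernel with standard deviation sigma, the share within radius r is 1 - exp(-r^2 / (2 sigma^2)). This sketch assumes that kernel shape, and uses the 0.25 km bandwidth from above:

```r
# Share of a 2D isotropic Gaussian kernel's weight within radius r
sigma <- 0.25                 # bandwidth (standard deviation), in km
r <- sigma * 1:3              # radii of 1, 2 and 3 bandwidths
share <- 1 - exp(-r^2 / (2 * sigma^2))

data.frame(radius_km = r, share_of_kernel = round(share, 3))
```

Only about 39% of the weight lies within one bandwidth of the event, while about 86% lies within two bandwidths. That is why a bandwidth of 0.25 km behaves more like a search radius of roughly 0.5 km.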
R provides a function that will suggest an optimal bandwidth to use:
> r <- bw.diggle(gun)
> r
which will tell you the value it has calculated. You can then use this with d_opt <- density(gun, r). You may not feel that the optimal value is optimal. Or you may find it useful to consider what is 'optimal' about this setting.
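One way to judge the suggested value is to map it side by side with some alternatives. The sketch below does this for the redwood data that ships with spatstat (not the crime data). The two comparison bandwidths, 0.05 and 0.25, are arbitrary choices of mine.

```r
library(spatstat)

# Bandwidth suggested by cross-validation for the built-in redwood pattern
sigma_cv <- bw.diggle(redwood)
sigma_cv

# Compare the suggested bandwidth with two hand-picked alternatives
opar <- par(mfrow = c(1, 3))
for (s in c(as.numeric(sigma_cv), 0.05, 0.25)) {
  plot(density(redwood, sigma = s), main = paste("sigma =", round(s, 3)))
  plot(redwood, add = TRUE, pch = 16, cex = 0.4)
}
par(opar)
```

Cross-validated bandwidths are chosen to do well on a statistical criterion, which is not necessarily the same thing as producing the most readable map, so it is worth looking at the alternatives before settling on one.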
Create density maps (in R) of the gun homicide data, experimenting with different kernel density bandwidths. Provide a commentary discussing the most suitable bandwidth choice for this visualization method.
For completeness, this page describes how to perform nearest neighbor distance analysis on a point pattern. However, as discussed in the reading, this approach is rarely used now, so there is no need to report findings if you do not think they are useful.
The spatstat nearest neighbor function is nndist.ppp():
> nnd <- nndist.ppp(gun)
which returns a vector of all the nearest neighbor distances in the pattern (one value per event). You can plot these:
and also summarize them:
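The plotting and summary commands are not shown here. A histogram and summary() are the obvious candidates, so here is a minimal sketch using the built-in redwood pattern in place of the crime data:

```r
library(spatstat)

# Nearest neighbor distances for the built-in redwood pattern
nnd_demo <- nndist(redwood)

# Distribution of the distances
hist(nnd_demo, breaks = 20, xlab = "Nearest neighbor distance", main = "")

# Minimum, quartiles, mean and maximum
summary(nnd_demo)
```

For the project data, the same calls applied to nnd should work in the same way.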
For a quick statistical assessment, you can also compare the mean value to that expected for an IRP/CSR pattern of the same intensity:
> mnnd <- mean(nnd)
> exp_nnd <- 0.5 / sqrt(gun$n / area.owin(W))
> mnnd / exp_nnd
Give this a try for one or more of the crime patterns. Are they clustered? Or evenly-spaced?
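To help interpret the ratio, here is the same calculation for the built-in redwood pattern, which is a clustered pattern. The expected value, 0.5 / sqrt(intensity), is the mean nearest neighbor distance under IRP/CSR ignoring edge effects. As a cross-check, the sketch also calls spatstat's clarkevans() function with no edge correction.

```r
library(spatstat)

X <- redwood

# Observed mean nearest neighbor distance
observed_mean <- mean(nndist(X))

# Expected mean under IRP/CSR: 1 / (2 * sqrt(intensity)), no edge correction
expected_mean <- 0.5 / sqrt(npoints(X) / area.owin(Window(X)))

ratio <- observed_mean / expected_mean
ratio

# The same index computed by spatstat
clarkevans(X, correction = "none")
```

A ratio below 1 means events are closer to their nearest neighbors than expected (clustering). A ratio above 1 suggests even spacing. Bear in mind that this uncorrected version takes no account of edge effects.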
Like nearest neighbor distance analysis, quadrat analysis is a relatively limited method for the analysis of a point pattern, as has been discussed in the text.
However, it is easy to perform in R, and can provide useful insight into the distribution of events in a pattern. The functions you need in spatstat are quadratcount() and quadrat.test():
> q <- quadratcount(hit, 4, 8)
> plot(q)
> plot(hit, add=T)
> quadrat.test(hit, 4, 8)
The second and third parameters supplied to these functions are the number of quadrats to create across the study area in the x (east-west) and y (north-south) directions. The test will report a p-value, whose interpretation is discussed in the course text.
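If you are curious about where the test statistic comes from, the sketch below recomputes it by hand for the built-in redwood pattern. It assumes equal-area quadrats, which holds for redwood's rectangular window but not for an irregular window like the city limits. The 4 x 4 grid is an arbitrary choice.

```r
library(spatstat)

X <- redwood

# Counts in a 4 x 4 grid of quadrats
q <- quadratcount(X, nx = 4, ny = 4)
observed <- as.vector(q)

# Equal-area quadrats in a rectangle: the same expected count in every cell
expected <- npoints(X) / length(observed)

# Pearson chi-square statistic and its upper-tail probability
chi_sq <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1
p_upper <- pchisq(chi_sq, df, lower.tail = FALSE)
c(chi_sq = chi_sq, df = df, p_upper = p_upper)

# spatstat's version, for comparison
qt <- quadrat.test(X, nx = 4, ny = 4)
qt
```

The statistic should agree with the one reported by quadrat.test(). The p-values can differ, because quadrat.test() reports a two-sided p-value unless you ask for a one-sided alternative.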
The real workhorses of contemporary point pattern analysis are the distance-based functions: G, F, K (and its relative L) and the more recent pair correlation function.
Once again, spatstat provides full support for all of these, using the built-in functions, Gest(), Fest(), Kest(), Lest() and pcf(). In each case, the 'est' suffix refers to the fact the function is an estimate based on the empirical data. Calculation is straightforward:
> g_gun <- Gest(gun)
> plot(g_gun)
When you plot the functions, you will see that spatstat actually provides a number of different estimates of the function. Without getting into the details, the different estimates are based on various possible corrections that can be applied for edge effects.
To make a statistical assessment of any of these functions for our patterns, we need to compare the estimated functions to those we expect to see for IRP/CSR. Given the complexity involved, the easiest way to do this is to calculate the function for a set of simulated realizations of IRP/CSR in the same study area. This is done using the envelope() function:
> g_gun_env <- envelope(gun, Gest, nsim=99, nrank=1)
> plot(g_gun_env)
Figure 4.4 shows an example of the output from this (not for the crime data, but for the redwood saplings we saw earlier):
The point pattern, on the left, is clearly clustered. What does the plot show us?
Well, the dashed red line is the theoretical value of the pair correlation function for a pattern generated by IRP/CSR. We aren't much interested in that, except as a point of reference.
The grey region shows us the range of values of the function which occurred across all the simulated realizations of IRP/CSR which you see spatstat producing when you run the envelope function. The black line is the function for the actual pattern (i.e., the redwood seedlings). What we are interested in is whether or not the observed (actual) function lies inside or outside the grey 'envelope'. In this case, the observed function is outside the envelope over the range of distances (on the x-axis) from around 0.01 to around 0.07.
As this is the pair correlation function in this case, this tells us that there are more pairs of events at this range of spacings from one another than we would expect to occur by chance. Over the rest of the range of values shown here, the PCF falls within the expected bounds (except for a minor departure below expected values at around 0.225). This observation supports the view that the pattern is clustered or aggregated at the stated range of distances.
The exact interpretation of the relationship between the envelope and the observed function is dependent on the function in question, but this should give you the idea.
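Rather than reading the distances off the plot by eye, you can pull them out of the envelope object, which behaves like a data frame with columns r, obs, theo, lo and hi. Here is a minimal sketch for the redwood data. The object names are my own, and the exact ranges will vary from run to run because the envelope is simulated.

```r
library(spatstat)
set.seed(2024)

# Envelope of the pair correlation function for the redwood pattern
pcf_env <- envelope(redwood, pcf, nsim = 39, nrank = 1, verbose = FALSE)
env_df <- as.data.frame(pcf_env)

# Skip distances where the estimate is not finite (this can happen near r = 0)
ok <- is.finite(env_df$obs) & is.finite(env_df$hi) & is.finite(env_df$lo)

# Distances at which the observed function leaves the envelope
above <- env_df$r[ok & env_df$obs > env_df$hi]
below <- env_df$r[ok & env_df$obs < env_df$lo]

if (length(above) > 0) range(above)   # more pairs than expected
if (length(below) > 0) range(below)   # fewer pairs than expected
```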
One thing to watch out for... you may find that it's rather tedious waiting for 99 simulated patterns each time you run the envelope() function. This is the default number of simulations. You can change this by specifying a different value for nsim:
> K_e <- envelope(rob, Kest, nsim=19, nrank=1)
Once you are sure what examples you want to use, you will probably want to do a final run with nsim set to 99, so that you have more faith in the envelope generated (since it is based on more realizations and more likely to be stable). Also, you can change the rank setting. This will mean that the 'hi' and 'lo' lines in the plot will be placed at the corresponding low or high values in the range produced by the simulated realizations of IRP/CSR. So, for example:
> G_e <- envelope(hit, Gest, nsim=99, nrank=5)
will run 99 simulations of IRP/CSR and place high and low limits on the envelope at the 5th highest and 5th lowest values in the set of simulated patterns.
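The choice of nsim and nrank together fixes how strict the envelope is. For a pointwise envelope of this kind, the significance level at any single, pre-chosen distance is 2 * nrank / (nsim + 1). The helper function below, envelope_alpha, is my own shorthand for that arithmetic and is not part of spatstat:

```r
# Pointwise significance level implied by nsim and nrank
envelope_alpha <- function(nsim, nrank) 2 * nrank / (nsim + 1)

envelope_alpha(nsim = 99, nrank = 1)  # 0.02
envelope_alpha(nsim = 99, nrank = 5)  # 0.10
envelope_alpha(nsim = 19, nrank = 1)  # 0.10
```

So 19 simulations with nrank=1 and 99 simulations with nrank=5 give envelopes of the same nominal strictness, but the latter is based on more realizations. Note that these levels apply at one distance at a time, not to the whole curve at once.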
Something worth knowing is that the L function implemented in R deviates from that discussed in the text: its expected behavior for CSR is an upward-sloping line at 45 degrees, that is, expected L(r) = r. This can be confusing if you are not expecting it.
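You can confirm this behavior, and recover a version whose expected value under CSR is a flat line at zero, by plotting L(r) - r instead. This sketch uses the built-in redwood data and assumes that the flat-line form is the one you want for comparison with the text.

```r
library(spatstat)

L_red <- Lest(redwood)

# The theoretical CSR value stored by spatstat is just r itself
all.equal(L_red$theo, L_red$r)

# Plot L(r) - r, so that the CSR expectation becomes a horizontal line at 0
plot(L_red, . - r ~ r, main = "L(r) - r for redwood")
```

In the transformed plot, values above zero indicate more events within distance r than expected under CSR, and values below zero indicate fewer.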
One final (minor) point: for the pair correlation function in particular, the values at short distances can be very high and R will scale the plot to include all the values, making it very hard to see the interesting part of the plot. To control the range of values displayed in a plot use xlim and ylim. For example:
> plot(pcf_e, ylim=c(0, 5))
will ensure that only the range between 0 and 5 is plotted on the y-axis.
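Note that pcf_e has not been created anywhere above. It is assumed to be an envelope of the pair correlation function. Here is a sketch of the full sequence, using the built-in redwood data rather than one of the crime patterns, with axis limits that are arbitrary choices of mine:

```r
library(spatstat)
set.seed(11)

# Envelope of the pair correlation function (19 simulations for speed)
pcf_e <- envelope(redwood, pcf, nsim = 19, nrank = 1, verbose = FALSE)

# Clip both axes so that very large values at short distances
# do not dominate the plot
plot(pcf_e, xlim = c(0, 0.2), ylim = c(0, 5),
     main = "PCF envelope, clipped axes")
```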
Got all that? If you do have questions, you should, as usual, post them to the Discussion Forum for this week's project. Also, go to the additional resources at the end of this lesson, where I have included links to some articles that use some of these methods.
Perform point pattern analysis on two of the three crime datasets (preferably contrasting ones) by using whatever methods seem the most useful, and present your findings in the form of maps, plots, and accompanying commentary.
Please put your write-up, or a link to your write-up, in the Project 4 Drop Box.
For Project 4, the items you are required to submit are as follows:
I suggest that you review "Final Activities for Lesson 4" to be sure you have completed all the required work for Lesson 4.
Now that you are finished with this week's project, you may be interested to know that some of the tools you've been using are available in ArcGIS. You will find mean nearest neighbor distance and Ripley's K tools in the Spatial Statistics - Analyzing Patterns toolbox. The Ripley's K tool in particular has improved significantly in ArcGIS 10, so that it now includes the ability to generate confidence envelopes using simulation just like the envelope() function in R.
For kernel density surfaces, there is a density estimation tool in the Spatial Analyst Tools - Density toolbox. This is essentially the same as the density() tool in R with one very significant difference, namely that Arc does not correct for edge effects. In the figure below, the results of kernel density analysis applied to all the crime events in the project data set are shown for (from left to right) the default settings in Arc, with a mask and processing extent set in Arc to cover the city limits area, and for R.
The search radius in Arc was set to 2km and the 'sigma' parameter in R was set to 1km - these should give roughly equivalent results. More significant than the exact shape of the results is that R is correcting for edge effects. This is most clear at the north end of the map, where R's output implies that the region of higher density runs off the edge of the study area, while Arc confines it to the analysis area. R accomplishes this by basing its density estimate on the area inside the study area at each location.
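You can get a feel for the effect of edge correction within R itself, because density() lets you switch it off. The sketch below uses the built-in redwood data with an arbitrary bandwidth of 0.1. Treat the edge=FALSE result only as a rough stand-in for an uncorrected surface, not as a reproduction of what Arc does.

```r
library(spatstat)

X <- redwood

d_edge <- density(X, sigma = 0.1, edge = TRUE)   # default: edge corrected
d_raw  <- density(X, sigma = 0.1, edge = FALSE)  # no edge correction

# Integrating an intensity surface over the window estimates the event count
c(points = npoints(X),
  with_edge = integral(d_edge),
  without_edge = integral(d_raw))

opar <- par(mfrow = c(1, 2))
plot(d_edge, main = "Edge corrected")
plot(d_raw, main = "Uncorrected")
par(opar)
```

Without correction, kernel weight that spills outside the window is simply lost, so the uncorrected surface integrates to fewer events than are actually present and density is underestimated near the boundary.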
The extensibility of both packages makes it to some extent a matter of taste which you choose to use for point pattern analysis. At the time of writing (2010), it is clear that R remains the better choice in terms of the range of available options and tools, although Arc may have the edge in terms of its familiarity to GIS analysts. For users starting with limited knowledge of both tools, it is debatable which has the steeper learning curve - certainly neither is simple to use!
To see how some of these methods are applied, have a quick look at some of these journal articles.
Here is an article from an MGIS capstone project that investigated sinkholes in Florida. Related to crime, here is a link to an article that uses spatial analysis for understanding crime in national forests and the poaching of elephants.
For a comprehensive read on crime analysis, look through the book Crime Modeling and Mapping Using Geospatial Technologies, available through the Penn State Library.
Don't forget to use the library and search for other books that may be applicable to your studies.