GEOG 863
GIS Mashups for Geospatial Professionals

Web Scraping

A programming technique that offers a great deal of potential for the creation of mapping mashups (and for controversy) is web scraping. Web scraping is essentially reading and parsing data published on the web in formats that were originally intended for human consumption, not machine consumption. As we saw earlier in the course, JavaScript running in the browser can't read files hosted on another domain because of the same-origin policy. However, a server-side language like PHP can be used to parse the text from the host site and package the desired data into a format that can be consumed by JavaScript.

Note

PHP is discussed in a bit more detail in the optional database lesson.

Because of time constraints, I will just demonstrate this concept through an example rather than have you work through it yourself.

As a native Pennsylvanian and Penn State alum, I am a big Penn State football fan. I even follow the recruitment of high school players a bit, though I'm a bit embarrassed to admit that! Anyway, as a way of comparing the geography of football recruiting across different schools in Penn State's conference (the Big Ten), I developed a mashup several years ago that showed the hometowns of the players on Big Ten rosters.

So, where did I get these rosters and the latitude and longitude of each player's hometown? Well, I scraped the rosters off of ESPN's website, then geocoded the hometowns using Google's geocoder.

Let's begin by having a look at the HTML source of one of these rosters. If you search for the word "hometown," you'll land two lines above the point where the player listing begins in an HTML <table>. Take note of the pattern that repeats for each row of the table.

Now, let's have a look at the scraping program in which I use string manipulation functions in PHP to extract the desired pieces of data from the table:

The program begins by opening a new XML file to hold its output; an XML header and root element are then written to the file.

Note

The usage of XML as the source data format for a Google API-based map is also covered in the optional database lesson.
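A minimal sketch of that opening step in PHP might look like the following. The output file name and the <players> root element are placeholders of my own, not necessarily what the actual script uses.

<?php
// Sketch only: open (or create) an XML file to hold the program's output.
// "big_ten_roster.xml" and the <players> root element are placeholder names.
$xmlFile = fopen("big_ten_roster.xml", "w");

// Write the XML declaration and the opening tag of the root element.
// (The closing </players> tag and an fclose() call would be written
// after the scraping loop finishes.)
fwrite($xmlFile, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
fwrite($xmlFile, "<players>\n");
?>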

Next, PHP's file_get_contents() function is used to read the HTML of the specified roster into a variable ($file). The contents of this variable are then shortened to just the part of the page that comes after the first occurrence of the word "HOMETOWN" using the strstr() function. A while loop is then used to process each row in the table. The loop contains several uses of PHP's strpos() and substr() functions: strpos() returns the index position of a search string within a larger string, and substr() is used to obtain a smaller portion of a larger string (i.e., all characters to the right of the character at position x). I'm not going to bore you with the details of this part of the script. If you have any questions, you're welcome to post them in the discussion forum.
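To give you a feel for the technique, here is a rough sketch of that parsing logic in PHP. The roster URL and the exact table markup (and therefore the search strings) are assumptions of mine, so treat this as an illustration of the strstr()/strpos()/substr() pattern rather than the actual script.

<?php
// Sketch only: the URL and the table markup assumed here are placeholders.
$file = file_get_contents("http://sports.espn.example.com/team/roster?id=213");

// Keep only the portion of the page from the first "HOMETOWN" onward.
$file = strstr($file, "HOMETOWN");

// Loop over the remaining table rows, one <tr>...</tr> block at a time.
while (strpos($file, "<tr") !== false) {
    $rowStart = strpos($file, "<tr");
    $rowEnd   = strpos($file, "</tr>", $rowStart);
    $row      = substr($file, $rowStart, $rowEnd - $rowStart);

    // Within the row, pull the text of the first <td> cell; the remaining
    // cells (position, hometown, etc.) would be extracted the same way by
    // advancing past each closing </td> in turn.
    $tdStart  = strpos($row, "<td");
    $tdStart  = strpos($row, ">", $tdStart) + 1;
    $tdEnd    = strpos($row, "</td>", $tdStart);
    $cellText = trim(substr($row, $tdStart, $tdEnd - $tdStart));

    // Whittle down $file so the just-processed row is gone before the next pass.
    $file = substr($file, $rowEnd + strlen("</tr>"));
}
?>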

Let's skip down to the statement where a long maps.google.com URL is constructed. This URL is an alternate means of geocoding an address that is available when working within a server-side script (see http://code.google.com/apis/maps/documentation/geocoding/#GeocodingRequests). Here, the player's town and state are inserted into the URL, and the desired output is specified as comma-separated values (csv). This URL is read, again using the file_get_contents() function. The comma-separated values are then converted to an array using the explode() function; the latitude and longitude end up in positions 2 and 3 of the array, respectively. Once the lat/long values are obtained, all of the desired values are written to the XML output file. The last step in the loop is to whittle down the HTML string stored in the $file variable so that it no longer contains the just-processed row. The program then moves on to the next row in the table and continues looping until no rows are left.
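Pulling those pieces together, the geocoding and output step looks roughly like the sketch below. It assumes the comma-separated response arrives with the latitude and longitude in positions 2 and 3, as described above, and that $xmlFile, $name, $town, and $state were set earlier in the loop; those variable names are my own placeholders.

<?php
// Sketch only: build the server-side geocoding request for the player's
// hometown, asking for comma-separated output. (An API key parameter may
// also be required, depending on the service's terms.)
$url = "http://maps.google.com/maps/geo?q=" . urlencode($town . ", " . $state)
     . "&output=csv";

// Fetch the response and split it apart; positions 2 and 3 of the resulting
// array hold the latitude and longitude.
$response = file_get_contents($url);
$values   = explode(",", $response);
$lat      = $values[2];
$lng      = $values[3];

// Write one <player> element to the XML output file. (A real script would
// also escape any special characters in the values, e.g., with
// htmlspecialchars().)
fwrite($xmlFile, "  <player name=\"$name\" hometown=\"$town, $state\" "
               . "lat=\"$lat\" lng=\"$lng\" />\n");
?>

The urlencode() call ensures that the spaces and comma in the hometown are safe to include in the request URL.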

Note

There are a number of code libraries in different languages that specialize in scraping data found in HTML tables. If you're interested in doing this sort of thing yourself, I recommend looking into whether one of these existing libraries will make your task easier.
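For instance, PHP's bundled DOM extension can take over the low-level parsing; a minimal sketch of the same row-by-row extraction with it (the URL and markup are again placeholders of mine) might look like this:

<?php
// Sketch only: parse an HTML table with PHP's DOM extension rather than
// manual string functions.
$html = file_get_contents("http://sports.espn.example.com/team/roster?id=213");

$doc = new DOMDocument();
@$doc->loadHTML($html);   // @ suppresses warnings about imperfect HTML

// Walk every table row and collect the text of its cells.
foreach ($doc->getElementsByTagName("tr") as $row) {
    $cells = array();
    foreach ($row->getElementsByTagName("td") as $cell) {
        $cells[] = trim($cell->textContent);
    }
    // $cells now holds one player's columns (name, position, hometown, etc.).
}
?>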

The resulting XML file is then used as the data source for the cfb_rosters.html page, which operates in much the same way as examples from the optional database lesson.

You may be wondering why I didn't just set up these scripts such that the XML was passed directly from the PHP script to the JavaScript in the HTML file. That would be a truly nifty mashup! I actually did set it up that way at first just to confirm that I could make it work. However, there are two reasons why a "geocoding-on-the-fly" solution is not a good idea:

  1. Geocoding is a resource-intensive task, and Google asks developers to avoid sending the same locations to its geocoding service repeatedly (which would happen if your page got any kind of traffic). Locations should be geocoded once and their coordinates stored locally, as in my example.
  2. It is significantly faster to read lat/long values from a local XML file than it is to geocode the entire list of towns on the fly. Since the roster information changes just once a year, maintaining up-to-date XML files is quite easy. However, if the data were more dynamic, that might call for an application in which each location is first checked against a local database and sent to Google's geocoder only if it doesn't already exist there (a sketch of this idea follows the list).

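Here is a minimal sketch of that "check locally first" idea from point 2, using a simple file-based cache; the file name and cache layout are placeholders of my own, and in practice a database table would serve the same purpose.

<?php
// Sketch only: reuse previously geocoded places instead of re-sending them.
$cacheFile = "geocode_cache.json";
$cache = file_exists($cacheFile)
       ? json_decode(file_get_contents($cacheFile), true)
       : array();

$place = $town . ", " . $state;

if (isset($cache[$place])) {
    // Already geocoded on an earlier run; just reuse the stored coordinates.
    $lat = $cache[$place][0];
    $lng = $cache[$place][1];
} else {
    // New place: send it to the geocoder (same request as in the sketch above)...
    $csv    = file_get_contents("http://maps.google.com/maps/geo?q="
            . urlencode($place) . "&output=csv");
    $values = explode(",", $csv);
    $lat    = $values[2];
    $lng    = $values[3];

    // ...and store the result so it never has to be geocoded again.
    $cache[$place] = array($lat, $lng);
    file_put_contents($cacheFile, json_encode($cache));
}
?>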
Finally, at the beginning of this section, I mentioned the word "controversy" with regard to web scraping. Hopefully, thoughts on the legality and ethics of using this programming technique ran through your mind as you read this section. I constructed this football roster mashup for fun and to get practice with the technology. I haven't advertised it and certainly haven't tried to profit from it, but there's a chance that I'm violating ESPN's terms of use by re-purposing their data without checking with them first. This kind of ethics question makes for interesting discussion, which is why I asked you to read and comment on the "Mashups: who’s really in control?" blog post by Richard MacManus.