PHP is discussed in a bit more detail in the optional database lesson.
Because of time constraints, I will just demonstrate this concept through an example rather than lead you through it yourself.
As a native Pennsylvanian and Penn State alum, I am a big Penn State football fan. I even follow the recruitment of high school players, though I'm a bit embarrassed to admit it! Anyway, as a way of comparing the geography of football recruiting across different schools in Penn State's conference (the Big Ten), I developed a mashup several years ago that showed the hometowns of the players on Big Ten rosters.
So, where did I get these rosters and the latitude and longitude of each player's hometown? Well, I scraped the rosters off of ESPN's website, then geocoded the hometowns using Google's geocoder.
Let's begin by having a look at the HTML source of one of these rosters. If you search for the word "hometown," you'll land two lines above the point where the player listing begins in an HTML <table>. Take note of the pattern that repeats for each row of the table.
Now, let's have a look at the scraping program in which I use string manipulation functions in PHP to extract the desired pieces of data from the table:
The program begins by opening up a new XML file to hold the output from the program and an XML header and root element are written to the file.
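The opening step might look something like the sketch below. The output file name and root element name here are my own placeholders, not necessarily the ones used in the actual script:

```php
<?php
// A minimal sketch of the opening step: create the XML output file and
// write an XML header and a root element. "rosters.xml" and <players>
// are assumed names for illustration.
$out = fopen('rosters.xml', 'w');
fwrite($out, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
fwrite($out, "<players>\n");
// ... one element per player would be written here inside the loop ...
fwrite($out, "</players>\n");
fclose($out);
?>
```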
The usage of XML as the source data format for a Google API-based map is also covered in the optional database lesson.
Next, PHP's file_get_contents() function is used to read the HTML of the specified roster into a variable ($file). The contents of this variable are then shortened to just the part of the page that comes after the first occurrence of the word "HOMETOWN" using the strstr() function. A while loop is then used to process each row in the table. The loop contains several uses of PHP's strpos() and substr() functions. strpos() returns the index position of a search string within a larger string. substr() extracts a smaller portion of a larger string (e.g., all characters from position x onward). I'm not going to bore you with the details of this part of the script. If you have any questions, you're welcome to post them in the discussion forum.
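To give you a feel for how these functions work together, here is a small sketch that extracts one cell from a made-up table row (this is simplified, illustrative markup, not ESPN's actual HTML):

```php
<?php
// Hypothetical, simplified table row for illustration.
$file = '<tr><td>Joe Smith</td><td>QB</td><td>Altoona, PA</td></tr>';

// strstr() returns the remainder of a string starting at the first match.
$rest = strstr($file, '<td>');                   // "<td>Joe Smith</td>..."

// strpos() finds the index of a search string; substr() extracts a piece.
$start = strpos($rest, '<td>') + strlen('<td>');
$end   = strpos($rest, '</td>');
$name  = substr($rest, $start, $end - $start);   // "Joe Smith"

// Whittle the string down past the just-processed cell, the same pattern
// the roster script uses to move through the table row by row.
$rest = substr($rest, $end + strlen('</td>'));

echo $name;  // prints "Joe Smith"
?>
```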
Let's skip down to the statement where a long maps.google.com URL is constructed. This URL is an alternate means of geocoding an address that is available when working within a server-side script (see http://code.google.com/apis/maps/documentation/geocoding/#GeocodingRequests). Here, the player's town and state are inserted into the URL and the desired output is specified as comma-separated values (csv). This URL is read once again using the file_get_contents() function. The comma-separated values are then converted to an array using the explode() function, which places the latitude and longitude values in positions 2 and 3 of the array, respectively. Once the lat/long values are obtained, all of the desired values are written to the XML output file. The last step in the loop is to whittle down the HTML string stored in the $file variable so that it no longer contains the just-processed row. The program then processes the next row in the table and continues looping until no more rows are left.
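That portion of the loop might be sketched as follows. Note that the CSV-output geocoder described above has since been retired by Google, so this sketch simulates the response rather than fetching it; the old format was status,accuracy,latitude,longitude, which is why the lat/long land in positions 2 and 3:

```php
<?php
// Build a geocoding URL for the player's town and state, requesting
// comma-separated output. YOUR_KEY is a placeholder, not a real key.
$town  = 'Altoona';
$state = 'PA';
$url = 'http://maps.google.com/maps/geo?q=' . urlencode("$town, $state")
     . '&output=csv&key=YOUR_KEY';

// $csv = file_get_contents($url);   // what the script actually did
$csv = '200,6,40.5187,-78.3947';     // simulated response for illustration

// explode() turns the CSV string into an array:
// [0] status, [1] accuracy, [2] latitude, [3] longitude
$parts = explode(',', $csv);
$lat = $parts[2];
$lng = $parts[3];
?>
```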
There are a number of code libraries in different languages that specialize in scraping data found in HTML tables. If you're interested in doing this sort of thing yourself, I recommend looking into whether one of these existing libraries will make your task easier.
The resulting XML file is then used as the data source for the cfb_rosters.html page, which operates in much the same way as examples from the optional database lesson. A couple of points about geocoding are worth noting:
- Geocoding is a resource-intensive task, and Google asks developers not to send the same locations to their geocoding service repeatedly (which is exactly what would happen if you geocoded on the fly on every page load and your page got any traffic). Locations should be geocoded once and their coordinates stored locally, as in my example.
- It is significantly faster to read lat/long values from a local XML file than it is to geocode the entire list of towns on the fly. Since the roster information changes just once a year, maintaining up-to-date XML files is quite easy. However, if the data were more dynamic, that might call for an application in which the location is first checked against a local database and sent to Google's geocoder only if it doesn't already exist in the local database.
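The cache-first pattern described in that last point might be sketched like this. Here a simple array stands in for the local database, and fakeGeocode() is a placeholder for a real call to Google's geocoder:

```php
<?php
// Pre-populated "database" of previously geocoded locations (assumed data).
$cache = array('Altoona, PA' => array(40.5187, -78.3947));

function fakeGeocode($address) {
    // Placeholder: a real implementation would call the geocoding service.
    return array(0.0, 0.0);
}

function getCoords($address, &$cache) {
    if (isset($cache[$address])) {
        return $cache[$address];       // hit: no request sent to Google
    }
    $coords = fakeGeocode($address);   // miss: geocode the address once...
    $cache[$address] = $coords;        // ...then store the result locally
    return $coords;
}

$first  = getCoords('Altoona, PA', $cache);  // served from the local cache
$second = getCoords('Erie, PA', $cache);     // geocoded, then cached
?>
```

A real application would replace the array with SELECT/INSERT statements against a database table keyed on the address string.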