Now that you have a solid handle on the basics of relational database design and query writing, we're ready to dive into spatial database technology. Over the next two lessons, we'll experiment with the open-source RDBMS PostgreSQL (pronounced pōst-grɛs kyü'-ɛl ) and its spatial extension PostGIS (pronounced pōst-jis). This software combination is quite popular for those looking for an alternative to vendor solutions that are often more costly than their organization can afford.
Unlike MS-Access, which is intended for relatively small projects, Postgres is a full-fledged enterprise RDBMS more akin to the leading vendor products (e.g., Oracle and SQL Server). Though there are certainly differences between Postgres and Access, you should find that the concepts you learned earlier in the course will transfer over to this new environment.
After orienting you to working with Postgres, we'll get into the spatial functionality provided by PostGIS.
At the successful completion of this lesson, students should be able to:
If you have any questions now or at any point during this week, please feel free to post them to the Lesson 3 Discussion Forum.
Lesson 3 is one week in length. See the Canvas Calendar for specific due dates. To finish this lesson, you must complete the activities listed below:
Download and install Postgres 16.x [1], accepting all of the default settings - Go with the 64-bit version unless your computer will not support it.
When running the installation, you will need to have access to the Internet.
- After Postgres is installed, you'll be asked if you want to launch the Stack Builder, a separate package that allows for the installation of add-ons to Postgres. Check the box for Stack Builder may be used... and in the next window, choose PostgreSQL 16.x on Port 5432 as the installation you are installing software for.
- Under the category of Spatial Extensions, choose the PostGIS 3.4 Bundle... Installing this add-on will enable you to execute the spatial queries covered in Lessons 3-4.
- Make your way through the installation, accepting the defaults. When the Choose Components dialog appears, check the Create spatial database box.
- For the Database Connection, leave the User Name set to postgres and Port set to 5432. Set the Password to postgres, same as the User Name, making it easy to remember.
- For the Database Name, leave it set to the default of postgis_34_sample.
In this first part of the lesson, you'll get an introduction to Postgres's graphical interface called pgAdmin. You'll also import a shapefile, load data from a text file, and see how queries are performed in pgAdmin.
A common workflow for PostGIS users is to convert their data from Esri shapefile format to PostGIS tables. Fortunately, some PostGIS developers have created a Shapefile Import/Export Manager that makes this conversion easy. In prior versions of pgAdmin, the shapefile importer was accessible as a plug-in. In pgAdmin 4, it must be run as a separate application.
If you encounter an error that the file "libintl-9.dll is missing," the easiest fix for this problem is to navigate up to the bin folder where libintl-9.dll is found, copy it, and paste it into the postgisgui folder.
Since we will be using this executable several times, I suggest that you make a desktop shortcut for it.
It's sometimes necessary to refresh the GUI after creating new objects like this. This can be done by right-clicking on the schema or Tables node in the Browser and selecting Refresh (or hitting F5 on the keyboard).
Loading data from a comma-delimited text file is a common workflow for database developers. Let's see how this can be done in Postgres by loading some state demographic info from the 2010 Census. We'll begin by creating a new blank table.
Instead of having to expand the Data Type pick list, you can start typing the word integer, and the slot will let you autofill with choices. After you type inte, you can pick "integer." Be sure to add the columns in this order, otherwise the data load will not work properly.
male |
female |
white |
black |
amind |
asian |
hawaiian |
other |
mixed |
Before executing the command that will import the data into the table, let's have a look at the data file in a plain text editor and also note its location.
COPY usa.census2010 FROM 'C:\PSU\Geog868\Lesson3data\census2010.csv' WITH (FORMAT csv, HEADER True);
If you encounter a "permission denied" error, it means the "postgres" database login doesn't have permission to read the csv file where it is currently located. Try copying it to a sub-directory belonging to the "Public" user (e.g., 'C:\Users\Public\Public Documents') or to a location that has no permission restrictions (e.g., 'C:\temp'). You could also reset the permissions on the folder that stores the CSV file as outlined in this stackoverflow thread [5].
COPY usa.census2010 (state, total, male....) FROM ....
SELECT name, sub_region FROM states WHERE sub_region = 'Soda';
SELECT name, sub_region FROM usa.states WHERE sub_region = 'Soda';The second solution is to reset pgAdmin's search path so that the schema you're using is part of that path. By default, pgAdmin searches only the public schema. We will take this second approach since it allows us to omit the schema qualifier.
SET search_path TO usa, public;
SHOW search_path;
You may still be receiving an error if you left the table's name set to States rather than states during the import process; pgAdmin converts all table/column names to lower-case prior to execution by default. Thus, even if your FROM clause reads "FROM States", it will be evaluated as "FROM states". And if your table is named States, pgAdmin won't find a matching table. To override this case conversion, you can put the table/column name in double quotes like this:
SELECT name, sub_region FROM "States" WHERE sub_region = 'Soda';To avoid having to qualify your table/column names in this way, it's best to use lower case in your naming.
select name, sub_region from states where sub_region = 'Soda';
Now that you have a feel for how Postgres works, go on to the next page to practice writing queries in pgAdmin.
To help you get oriented to writing SQL queries on the pgAdmin command line, try your hand at the following exercises. Recall that the 2008 population data, soft drink data, and geometries are in the states table, and that the 2010 data are in the census2010 table.
Solutions [7] (This link takes you to the bottom of the Lesson.)
What sets spatial databases apart from their non-spatial counterparts is their support for answering geometric and topological questions. Let's have a look at some simple examples to demonstrate. We'll continue working with the states table we created in the last section.
SELECT name, ST_Centroid(geom) AS centroid FROM states WHERE sub_region = 'Soda';
SELECT name, ST_AsText(ST_Centroid(geom)) AS centroid FROM states WHERE sub_region = 'Soda';
SELECT name, ST_AsText(ST_Centroid(geom)) AS centroid, ST_Area(geom) AS area FROM states WHERE sub_region = 'Soda';In the area column, take note of the values returned by ST_Area(). They are in the units of the input geometry, squared. Recall that the Lesson 3 shapefiles are in latitude/longitude coordinates, which means the area values we're seeing are in square degrees. Hopefully, you recognize that this is a poor way to compute area, since a square degree represents a different area depending on the part of the globe you're dealing with. The ST_Transform() function exists exactly for situations like this. It takes a geometry (in whatever spatial reference) as input and re-projects it into some other spatial reference.
SELECT name, ST_AsText(ST_Centroid(geom)) AS centroid, ST_Area(ST_Transform(geom,2163)) AS area FROM states WHERE sub_region = 'Soda';Take note of the values now displayed in the area column. In this version of the query, the ST_Transform() function is first used to re-project the geometry into the spatial reference 2163 before ST_Area() is called. That spatial reference is an equal-area projection in meters that is suitable for the continental U.S. Don't worry, we'll discuss how you'd find that information later in this lesson.
We'll spend much more time discussing the spatial functions that are available in PostGIS later. Right now, let's go over the geometry types that are supported.
In the last section, we worked with a table – usa.states – containing geometries of the type POLYGON. The other basic geometry types are POINT and LINESTRING. As we'll see momentarily, there are numerous other geometry types available in PostGIS that allow for the storage of multipart shapes, 3-dimensional shapes, and shapes that have a measure (or M value) associated with its vertices. If keeping all of the various types straight becomes difficult, it may help to remember that the simple geometries we deal with most often are POINT, LINESTRING, and POLYGON.
To demonstrate some of the concepts in this section, we're going to create a new schema to store points of interest in New York City. Unlike the last schema where we used the Shapefile Import/Export Manager to both create and populate a table at the same time, here we'll carry out those steps separately.
SET search_path TO nyc_poi, public;
SELECT AddGeometryColumn('nyc_poi','pts','geom',4269,'POINT',2);First, let's address the unusual syntax of this statement. You've no doubt grown accustomed to listing column names (or *) in the SELECT clause, but here we're plugging in a function without any columns. We're forced to use this awkward syntax because SQL rules don't allow for invoking functions directly. Function calls must be made in one of the statement types we've encountered so far (SELECT, INSERT, UPDATE, or DELETE). In this situation, a SELECT statement is the most appropriate.
We're about to add rows to our pts table through a series of INSERT statements. You'll find it much easier to copy and paste these statements rather than typing them manually, if not now, then certainly when we insert polygons later using long strings of coordinates.
INSERT INTO pts (name, geom) VALUES ('Empire State Building', ST_GeomFromText('POINT(-73.985744 40.748549)',4269));The key point to take away from this statement (no pun intended) is the call to the ST_GeomFromText() function. This function converts a geometry supplied in text format to the hexadecimal form that PostGIS geometries are stored in. The other argument is the spatial reference of the geometry. This argument is required in this case because when we created the geom column using AddGeometryColumn(), it added a constraint that values in that column must be in a particular spatial reference (which we specified as 4269).
INSERT INTO pts (name, geom) VALUES ('Statue of Liberty', ST_GeomFromText('POINT(-74.044508 40.689229)',4269)); INSERT INTO pts (name, geom) VALUES ('World Trade Center', ST_GeomFromText('POINT(-74.013371 40.711549)',4269));
INSERT INTO pts (name, geom) VALUES ('Grand Central Station', ST_SetSRID(ST_MakePoint(-73.976522, 40.7528),4269));
INSERT INTO pts (name, geom) VALUES ('Radio City Music Hall', ST_GeomFromText('POINT(-73.97988 40.760171)',4269)), ('Madison Square Garden', ST_GeomFromText('POINT(-73.993544 40.750541)',4269));
INSERT INTO lines (name, geom) VALUES ('Holland Tunnel',ST_GeomFromText('LINESTRING( -74.036486 40.730121, -74.03125 40.72882, -74.011123 40.725958)',4269)), ('Lincoln Tunnel',ST_GeomFromText('LINESTRING( -74.019921 40.767119, -74.002841 40.759773)',4269)), ('Brooklyn Bridge',ST_GeomFromText('LINESTRING( -73.99945 40.708231, -73.9937 40.703676)',4269));Note that I've split this statement across several lines to improve its readability, not for any syntax reasons. You should feel welcome to format your statements however you see fit.
INSERT INTO polys (name, geom) VALUES ('Central Park',ST_GeomFromText('POLYGON(( -73.973057 40.764356, -73.981898 40.768094, -73.958209 40.800621, -73.949282 40.796853, -73.973057 40.764356))',4269));While the syntax for constructing a polygon looks very similar to that of a linestring, there are two important differences:
INSERT INTO polys (name, geom) VALUES ('Central Park',ST_GeomFromText('POLYGON(( -73.973057 40.764356, -73.981898 40.768094, -73.958209 40.800621, -73.949282 40.796853, -73.973057 40.764356), (-73.966681 40.785221, -73.966058 40.787674, -73.965586 40.788064, -73.9649 40.788291, -73.963913 40.788194, -73.963333 40.788291, -73.962539 40.788259, -73.962153 40.788389, -73.96181 40.788714, -73.961359 40.788909, -73.960887 40.788925, -73.959986 40.788649, -73.959492 40.788649, -73.958913 40.78873, -73.958269 40.788974, -73.957797 40.788844, -73.957497 40.788568, -73.957497 40.788259, -73.957776 40.787739, -73.95784 40.787057, -73.957819 40.786569, -73.960801 40.782394, -73.961145 40.78215, -73.961638 40.782036, -73.962518 40.782199, -73.963076 40.78267, -73.963677 40.783661, -73.965694 40.784457, -73.966681 40.785221) )',4269));
Earlier in this section, we discussed 3-dimensional (XYZ and XYM) and 4-dimensional (XYZM) geometries in the context of properly specifying the dimension argument to the AddGeometryColumn() function. We won't be doing so in this course, but let's look for a moment at the syntax used for creating these geometries.
To define a column that can store M values as part of the geometry, use the POINTM, LINESTRINGM, and POLYGONM data types. When specifying objects of these types, the M value should appear last. For example, an M value of 9999 is attached to each coordinate in these features from our nyc_poi schema:
POINTM(-73.985744 40.748549 9999) LINESTRINGM(-74.019921 40.767119 9999, -74.002841 40.759773 9999) POLYGONM((-73.973057 40.764356 9999, -73.981898 40.768094 9999, -73.958209 40.800621 9999, -73.949282 40.796853 9999, -73.973057 40.764356 9999)
Perhaps the most common usage of M coordinates is in linear referencing (e.g., to store the distance from the start of a road, power line, pipeline, etc.). This Wikipedia article on Linear Referencing [9] provides a good starting point if you're interested in learning more.
To define a column capable of storing Z values along with X and Y, use the "plain" POINT, LINESTRING and POLYGON data types rather than their "M" counterparts. The syntax for specifying an XYZ coordinate is the same as that for an XYM coordinate. The "plain" data type name tells PostGIS that the third coordinate is a Z value rather than an M value. For example, we could include sea level elevation in the coordinates for the Empire State Building (in feet):
POINT(-73.985744 40.748549 190).
Finally, in the event you want to store both Z and M values, again use the "plain" POINT, LINESTRING and POLYGON data types. The Z value should be listed third and the M value last. For example:
POINT(-73.985744 40.748549 190 9999)
PostGIS provides support for features with multiple parts through the MULTIPOINT, MULTILINESTRING, and MULTIPOLYGON data types. A classic example of multipart geometry is the state of Hawaii, which is composed of multiple disconnected islands. The syntax for specifying a MULTIPOLYGON builds upon the rules for a regular POLYGON; the parts are separated by commas and an additional set of parentheses is used to enclose the full coordinate list. The footprints of the World Trade Center Towers 1 and 2 (now fountains in the 9/11 Memorial) can be represented as a single multipart polygon as follows:
MULTIPOLYGON(((-74.013751 40.711976, -74.01344 40.712439, -74.012834 40.712191, -74.013145 40.711732, -74.013751 40.711976)), ((-74.013622 40.710772, -74.013311 40.711236, -74.012699 40.710992, -74.013021 40.710532, -74.013622 40.710772)))
This basic example shows the syntax for storing just X and Y coordinates. Keep in mind that Z values and M values are also supported for multipart geometries. As you might guess, the "MULTI" data types have "M" counterparts too: MULTIPOINTM, MULTILINESTRINGM and MULTIPOLYGONM.
The tables we've created so far reflect a bias toward Esri-centric design with each table storing a single column of homogeneous geometries (i.e. all points, or all lines, or all polygons, but not a mix). However, PostGIS supports two design approaches that are good to keep in mind when putting together a database:
Let's see how this heterogeneous column approach can be used to store all of our nyc_poi data in the same table.
INSERT INTO mixed (name, geom) VALUES ('Empire State Building', ST_GeomFromText('POINT(-73.985744 40.748549)',4269)), ('Statue of Liberty', ST_GeomFromText('POINT(-74.044508 40.689229)',4269)), ('World Trade Center', ST_GeomFromText('POINT(-74.013371 40.711549)',4269)), ('Radio City Music Hall', ST_GeomFromText('POINT(-73.97988 40.760171)',4269)), ('Madison Square Garden', ST_GeomFromText('POINT(-73.993544 40.750541)',4269)), ('Holland Tunnel',ST_GeomFromText('LINESTRING( -74.036486 40.730121, -74.03125 40.72882, -74.011123 40.725958)',4269)), ('Lincoln Tunnel',ST_GeomFromText('LINESTRING( -74.019921 40.767119, -74.002841 40.759773)',4269)), ('Brooklyn Bridge',ST_GeomFromText('LINESTRING( -73.99945 40.708231, -73.9937 40.703676)',4269)), ('Central Park',ST_GeomFromText('POLYGON(( -73.973057 40.764356, -73.981898 40.768094, -73.958209 40.800621, -73.949282 40.796853, -73.973057 40.764356))',4269));
At some point in this lesson, you probably thought to yourself, "This is fine, but what if I want to see the geometries?" Well, you can get a quick look at the geometries returned by a query in pgAdmin by clicking on the "eye" icon that appears on the right side of the geometry column header. But you'll likely want to go beyond this, for example, to utilize your geometries in the context of other data layers. That is the focus of the next part of the lesson, where we will use the third-party application Quantum GIS (QGIS) to view our PostGIS data.
Quantum GIS (QGIS, pronounced kyü'-jis) is a free and open-source desktop GIS package analogous to Esri's ArcMap/ArcGIS Pro. Because of its support for viewing PostGIS data and strong cartographic capabilities, QGIS and PostGIS are often found paired together. (OpenJUMP is another desktop application often used in combination with PostGIS, though its strengths are in spatial querying and geoprocessing.)
Let's see how QGIS can be used to view the tables we created and populated in the previous section.
"name" LIKE '%Tunnel'
The % character is the wildcard character in Postgres; we saw it was the * character in MS-Access in Lesson 1.
gid < 4Note that you can build the expression graphically by expanding the Fields and Values and Operators lists, then double-clicking on items in those lists.
That completes our quick tour of QGIS. In the next section, we'll return to pgAdmin to see how queries can be saved for later re-use.
In Lesson 1, we saved a number of our MS-Access queries so that we could easily re-run them later and, in a couple of cases, to build a query upon another query rather than a table. In Postgres and other sophisticated RDBMSs, stored SQL statements like these are called views. In this section, we'll see how views can be created in Postgres.
SELECT * FROM usa.cities WHERE capital = 1 ORDER BY stateabb;(Sorry Montpelier, I guess you were too small.)
Just as we saw in MS-Access, the records returned by views can be used as the source for a query.
SELECT * FROM usa.vw_capitals WHERE popclass = 2;
Views can also include spatial functions, or a combination of spatial and non-spatial criteria, in their definition. To demonstrate this, let's create views that re-project our states and cities data on the fly.
SELECT gid, name, pop2008, sub_region, ST_Transform(geom,2163) AS geom FROM usa.states;
SELECT *, ST_Transform(geom,2163) AS geom_2163 FROM usa.cities;
QGIS makes it possible to add both tables and views as layers. We'll take advantage of this feature now by creating layers from the views we just created.
This section showed how to save queries as views, which can then be utilized in the same way as tables. In the next section, we'll go into a bit more detail on the topic of spatial references.
As we’ve seen, populating a geometry column with usable data requires specifying the spatial reference of the data. We also saw that geometries can be re-projected from one spatial reference to another using the ST_Transform() function. In both cases, it is necessary to refer to spatial reference systems by an SRID (Spatial Reference ID). So, where do these IDs come from, and where can a list of them be found?
The answer to the question of where the IDs come from is that PostGIS uses the spatial reference IDs defined by the European Petroleum Survey Group (EPSG). As for finding the ID for a spatial reference you want to use, there are a few different options.
All of the spatial reference IDs are stored in a Postgres table in the public schema called spatial_ref_sys.
SELECT srid, srtext FROM spatial_ref_sys WHERE srtext LIKE '%Pennsylvania%';This query shows the SRIDs of each Pennsylvania-specific spatial reference supported in PostGIS.
Another way to find SRIDs is to look them up in QGIS.
The Prj2EPSG website [12] provides an easy-to-use interface for finding EPSG IDs. As its name implies, it allows the user to upload a .prj file (used by Esri to store projection metadata) and get back the matching EPSG ID. The site also makes it possible to enter search terms. My test search for ‘pennsylvania state plane’ yielded some garbage matches, but also the ones that I would expect.
This service appears to be down as of 5/29/2020. I leave this section here in case anyone is aware of a web-based alternative. If so, please share with the class in the discussion forum!
We’ve seen that the public schema contains a table called spatial_ref_sys that stores all of the spatial references supported by PostGIS. Another important item in that schema is the geometry_columns view. Have a look at the data returned by that view and note that it includes a row for each geometry column in the database. Among the metadata stored here are the parent schema, the parent table, the geometry column’s name, the coordinate dimension, the SRID and the geometry type (e.g., POINT, LINESTRING, etc.). Being able to conduct spatial analysis with PostGIS requires accurate geometry column information, so the PostGIS developers have made these data accessible through a read-only view rather than a table.
Earlier in the lesson, we used the AddGeometryColumn() function instead of adding the geometry column through the table definition GUI. An important reason for adding the geometry column in that manner is that it updates the geometry metadata that you can see through the geometry_columns view, something that would not happen if we had used the GUI.
We’ll talk more about measuring lengths, distances, and areas in the next lesson, but while we’re on the topic of spatial references, it makes sense to consider 2D Cartesian measurement in the context of planimetric map data versus measurement in the context of the spherical surface of the Earth. For example, the PostGIS function ST_Distance() can be used to calculate the distance between two geometries. When applied to geometries of the type we’ve dealt with so far, ST_Distance() will calculate distances in 2D Cartesian space. This is fine at a local or regional scale, since the impact of the curvature of the earth at those scales is negligible, but, over a continental or global scale, a significant error would result.
PostGIS offers a couple of alternative approaches to taking the earth’s curvature into account. Let’s assume that we wanted to measure the distance between points in the (U.S.) cities table that we created earlier in the lesson. We could use a version of the ST_Distance() function called ST_Distance_Spheroid(). As its name implies, this function is designed to calculate the minimum great-circle distance between two geometries.
The other approach is to store the features using a data type introduced in PostGIS 1.5 called geography. Unlike the geometry data type, the geography data type is meant for storing only latitude/longitude coordinates. The advantage of the geography data type is that measurement functions like ST_Distance(), ST_Length() and ST_Area() will return measures calculated in 3D space rather than 2D space. The disadvantage is that the geography data type is compatible with a significantly smaller subset of functions as compared to the geometry type. Calculating spherical measures can also take longer than Cartesian measures, since the mathematics involved is more complex.
The take-away message is that the geography data type can simplify data handling for projects that cover a continental-to-global scale. For projects covering a smaller portion of the earth’s surface, you are probably better off sticking with the geometry data type.
With that, we've covered all of the content for Lesson 3. In the next section, you'll find a project that will allow you to put what you've learned to use.
For Project 3, I would like you to map the hometowns of everyone on the class roster using Postgres/PostGIS and QGIS. Included in the data you downloaded at the beginning of the lesson were U.S. state and counties shapefiles, along with a comma-separated values file called postal_codes.txt that stores U.S. and Canadian postal codes and the coordinates of their centroids.
Right-click here to download the class roster with postal codes [13]. (If there are any students from outside the U.S. and Canada, they will appear at the bottom of the file with a special code. There will be a matching record at the bottom of the postal_codes.txt file.)
Here are the broad steps you should follow:
'POINT(' || lon || ' ' || lat || ')'
This project is one week in length. Please refer to Canvas Calendar for the due date.
SELECT * FROM states WHERE pop2008 > 10000000;
SELECT * FROM cities WHERE capital = 1;
SELECT * FROM states WHERE name LIKE 'New%';
SELECT * FROM cities WHERE name LIKE '%z%';
SELECT * FROM states ORDER BY pop2008 DESC;
SELECT * FROM states ORDER BY sub_region, name;
SELECT * FROM states WHERE pop2008 > 10000000 AND sub_region = 'Pop';
SELECT * FROM cities WHERE stateabb IN ('US-NY','US-NJ','US-PA');
SELECT state, (white::double precision/total) * 100 AS pctwhite FROM census2010;
SELECT sub_region, Sum(pop2008) FROM states GROUP BY sub_region;
SELECT states.name, census2010.total, states.geom FROM states INNER JOIN census2010 ON states.name = census2010.state;
SELECT states.sub_region, Avg(census2010.male) FROM states INNER JOIN census2010 ON states.name = census2010.state GROUP BY states.sub_region;
Links
[1] https://www.postgresql.org/download/
[2] https://postgis.net/windows_downloads/
[3] http://qgis.org/en/site/forusers/download.html
[4] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/Lesson3Data.zip
[5] http://stackoverflow.com/questions/14083311/permission-denied-when-trying-to-import-a-csv-file-from-pgadmin
[6] https://www.postgresql.org/docs/current/static/sql-copy.html
[7] https://www.e-education.psu.edu/spatialdb/L3_practice_solutions.html
[8] http://en.wikipedia.org/wiki/Hexadecimal
[9] http://en.wikipedia.org/wiki/Linear_referencing
[10] http://en.wikipedia.org/wiki/SpatiaLite
[11] http://en.wikipedia.org/wiki/SQLite
[12] http://prj2epsg.org/
[13] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/868_roster_sp24.txt
[14] https://www.postgresql.org/docs/current/static/index.html