One of the core skills required by database professionals is the ability to manipulate and retrieve data held in a database using Structured Query Language (SQL, sometimes pronounced "sequel"). As we'll see, queries can be written to create tables, add records to tables, update existing table records, and retrieve records meeting certain criteria. This last function, retrieving data through SELECT queries, is arguably the most important because of its utility in answering the questions that led the database developer to create the database in the first place. SELECT queries are also the most fun type to learn about, so, in this first lesson, we will focus on using them to retrieve data from an existing database. Later, we'll see how new databases can be designed, implemented, populated and updated using other types of queries.
At the successful completion of this lesson, students should be able to:
Conversation and comments in this course will take place within the course discussion forums. If you have any questions now or at any point during this week, please feel free to post them to the Lesson 1 Discussion Forum.
Lesson 1 is one week in length (see the Canvas Calendar for specific due dates). To finish this lesson, you must complete the activities listed below:
A number of RDBMS vendors provide a GUI to aid their users in developing queries. These can be particularly helpful to novice users, as they make it possible to learn the overarching concepts involved in query development without getting bogged down in syntax details. For this reason, we will start the course with Microsoft Access, which provides perhaps the most user-friendly interface.
Throughout this lesson, we'll use a database of baseball statistics to help demonstrate the basics of SELECT queries.
With our first query, we'll retrieve data from selected fields in the STATS table.
You've probably recognized by now that the output from these queries is not particularly human friendly. In the next part of the lesson, we'll see how to use a join between the two tables to add the names of the players to the query output.
One of the essential functions of an RDBMS is the ability to pull together data from multiple tables. In this section, we'll make a join between our PLAYERS and STATS tables to produce output in which the stats are matched up with the players who compiled them.
You'll probably notice the key symbol next to PLAYER_ID in the PLAYERS table and next to PLAYER_ID, YEAR and TEAM in the STATS table. This symbol indicates that the field is part of the table's primary key, a topic we'll discuss more later. For now, it's good to understand that a field need not be a key field to participate in a join, and that while the fields often share the same name (as is the case here), that is not required. What's important is that the fields contain matching values of compatible data types (e.g., a join between a numeric zip code field and a string zip code field will not work because of the mismatch between data types).
PLAYER: [PLAYERS]![FIRST_NAME] & ' ' & [PLAYERS]![LAST_NAME]
Suppose you wanted to sort this list by the last name, then first name, and then by the number of RBI. This would make it easier to look at each player's best seasons in terms of RBI production.
If you were to count the number of players being displayed by our query, you'd find that there are 10 (PLAYER_IDs 1 through 10).
Another type of join that we'll use later when we write spatial queries in PostGIS is the cross join, which produces the Cartesian, or cross, product of the two tables. If one table has 10 records and the other 5, the cross join will output 50 records.
The usefulness of this kind of join may not be obvious given this particular example. For a better example, imagine a table of products that includes each product's retail price and a table of states and their sales taxes. Using a cross join between these two tables, you could calculate the sales price of each product in each state.
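A sketch of what that query might look like (the PRODUCTS and STATES tables here, along with their RETAIL_PRICE and SALES_TAX columns, are hypothetical; SALES_TAX is assumed to be stored as a decimal rate):

SELECT PRODUCTS.NAME AS PRODUCT_NAME, STATES.NAME AS STATE_NAME, PRODUCTS.RETAIL_PRICE * (1 + STATES.SALES_TAX) AS SALES_PRICE FROM PRODUCTS, STATES;

With 100 products and 50 states, this query would return 5,000 rows, one for each product/state combination.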
Finally, let's put together a query that joins data from three tables.
In this way, it is possible to create queries containing several joins. However, in a real database, you are likely to notice a drop in performance as the number of joins increases.
In this section, we saw how to pull together data from two or more related tables, how to concatenate values from two text fields, and how to sort query records based on fields not included in the output. This enabled us to produce a much more user-friendly output than we had at the end of the previous section. In the next section, we'll further explore the power of SELECT queries by compiling career statistics and calculating new statistics on the fly from values stored in the STATS table.
One of the beauties of SQL is in its ability to construct groupings from the values in a table and produce summary statistics for those groupings. In our baseball example, it is quite easy to calculate each player's career statistics, seasons played, etc. Let's see how this type of query is constructed by calculating each player's career RBI total.
Home run hitters are often compared based on the frequency of their home run hitting (i.e., 'Player X averages a home run every Y times at bat'). Let's calculate this value for each player and season.
ABPERHR: [STATS]![AB]/[STATS]![HR]
Hopefully, you've found the MS-Access query builder to be helpful in learning the basics of retrieving data from a database. As you move forward, it will become important for you to learn some of the syntax details (i.e., how to write the statements you saw in SQL View without the aid of a graphical query builder). That will be the focus of this part of the lesson.
All SELECT queries begin with a SELECT clause whose purpose is to specify which columns should be retrieved. The desired columns are separated by commas. Our first query in this lesson had a SELECT clause that looked like the following:
SELECT STATS.PLAYER_ID, STATS.HR, STATS.RBI
Note that each column name is preceded by the name of its parent table, followed by a dot. This is critical when building a query involving joins and one of the desired columns is found in multiple tables. Including the parent table eliminates any confusion the SQL interpreter would have in deciding which column should be retrieved. However, you should keep in mind that the table name can be omitted when the desired column is unambiguous. For example, because our simple query above is only retrieving data from one table, the SELECT clause could be reduced to:
SELECT PLAYER_ID, HR, RBI
The easiest way to retrieve data from a table is to substitute the asterisk character (*) for the list of columns:
SELECT *
This will retrieve data from all of the columns. While it's tempting to use this syntax because it's so much easier to type, you should be careful to do so only when you truly want all of the data held in the table or when the table is rather small. Retrieving all the data from a table can cause significant degradation in the performance of an SQL-based application, particularly when the data is being transmitted over the Internet. Grab only what you need!
Generally speaking, SQL is a case-insensitive language. You'll find that the following SELECT clause will retrieve the same data as the earlier ones:
SELECT player_id, hr, rbi
Why did I say "generally speaking?" There are some RDBMS that are case sensitive depending on the operating system they are installed on. Also, some RDBMS have administrative settings that make it possible to turn case sensitivity on. For these reasons, I suggest treating table and column names as if they are case sensitive.
On the subject of case, one of the conventions followed by many SQL developers is to capitalize all of the reserved words (i.e., the words that have special meaning in SQL) in their queries. Thus, you'll often see the words SELECT, FROM, WHERE, etc., capitalized. You're certainly not obligated to follow this convention, but doing so can make your queries more readable, particularly when table and column names are in lower or mixed case.
Finally, recall that we renamed our output columns (assigned aliases) in some of the earlier query builder exercises by putting the desired alias and a colon in front of the input field name or expression. Doing this in the query builder results in the addition of the AS keyword to the query's SELECT clause, as in the following example:
SELECT PLAYER_ID, HR AS HomeRuns, RBI
The other required clause in a SELECT statement is the FROM clause. This specifies the table(s) containing the columns specified in the SELECT clause. In the first queries we wrote, we pulled data from a single table, like so:
SELECT STATS.PLAYER_ID, STATS.HR, STATS.RBI FROM STATS
Or, omitting the parent table from the column specification:
SELECT PLAYER_ID, HR, RBI FROM STATS
The FROM clause becomes a bit more complicated when combining data from multiple tables. The Access query builder creates an explicit inner join by default. Here is the code for one of our earlier queries:
SELECT PLAYERS.FIRST_NAME, PLAYERS.LAST_NAME, STATS.YEAR, STATS.RBI FROM PLAYERS INNER JOIN STATS ON PLAYERS.PLAYER_ID = STATS.PLAYER_ID
An inner join like this one causes a row-by-row comparison to be conducted between the two tables looking for rows that meet the criterion laid out in the ON predicate (in this case, that the PLAYER_ID value in the PLAYERS table is the same as the PLAYER_ID value in the STATS table). When a match is found, a new result set is produced containing all of the columns listed in the SELECT clause. If one of the tables has a row with a join field value that is not found in the other table, that row would not be included in the result.
We also created a couple of outer joins to display data from records that would otherwise not appear in an inner join. In the first example, we chose to include all data from the left table and only the matching records from the right table. That yielded the following code:
SELECT [PLAYERS]![FIRST_NAME] & ' ' & [PLAYERS]![LAST_NAME] AS PLAYER, STATS.YEAR, STATS.RBI FROM STATS LEFT JOIN PLAYERS ON STATS.PLAYER_ID = PLAYERS.PLAYER_ID
Note that the only difference between this FROM clause and the last is that the INNER keyword was changed to the word LEFT. As you might be able to guess, our second outer join query created code that looks like this:
SELECT [PLAYERS]![FIRST_NAME] & ' ' & [PLAYERS]![LAST_NAME] AS PLAYER, STATS.YEAR, STATS.RBI FROM STATS RIGHT JOIN PLAYERS ON STATS.PLAYER_ID = PLAYERS.PLAYER_ID
Here the word LEFT is replaced with the word RIGHT. In practice, RIGHT JOINs are more rare than LEFT JOINs since it's possible to produce the same results by swapping the positions of the two tables and using a LEFT JOIN. For example:
SELECT [PLAYERS]![FIRST_NAME] & ' ' & [PLAYERS]![LAST_NAME] AS PLAYER, STATS.YEAR, STATS.RBI FROM PLAYERS LEFT JOIN STATS ON STATS.PLAYER_ID = PLAYERS.PLAYER_ID
Recall that the cross join query we created was characterized by the fact that it had no line connecting key fields in the two tables. As you might guess, this translates to a FROM clause that has no ON predicate:
SELECT STATS.PLAYER_ID, TEAMS.ABBREV FROM STATS, TEAMS;
In addition to lacking an ON predicate, the clause also has no form of the word JOIN. The two tables are simply separated by commas.
Finally, we created a query involving two inner joins. Here is how that query was translated to SQL:
SELECT PLAYERS.FIRST_NAME, PLAYERS.LAST_NAME, STATS.YEAR, STATS.HR, TEAMS.CITY, TEAMS.NICKNAME FROM (PLAYERS INNER JOIN STATS ON PLAYERS.PLAYER_ID = STATS.PLAYER_ID) INNER JOIN TEAMS ON STATS.TEAM = TEAMS.ABBREV;
The first join is enclosed in parentheses to control the order in which the joins are carried out. Like in algebra, operations in parentheses are carried out first. The result of the first join then participates in the second join.
As we saw earlier, it is possible to limit a query's result to records meeting certain criteria. These criteria are spelled out in the query's WHERE clause. This clause includes an expression that evaluates to either True or False for each row. The "True" rows are included in the output; the "False" rows are not. Returning to our earlier examples, we used a WHERE clause to identify seasons of 100+ RBIs:
SELECT STATS.PLAYER_ID, STATS.YEAR, STATS.RBI FROM STATS WHERE STATS.RBI > 99
As with the specification of columns in the SELECT clause, columns in the WHERE clause need not be prefaced by their parent table if there is no confusion as to where the columns are coming from.
This statement exemplifies the basic column-operator-value pattern found in most WHERE clauses. The most commonly used operators are:
Operator | Description |
---|---|
= | equals |
<> | not equals |
> | greater than |
>= | greater than or equal to |
< | less than |
<= | less than or equal to |
BETWEEN | within a value range |
LIKE | matching a search pattern |
IN | matching an item in a list |
The usage of most of these operators is straightforward. But let's talk for a moment about the LIKE and IN operators. LIKE is used in combination with the wildcard character (the * character in Access, but the % character in most other RDBMSs) to find rows that contain a text string pattern. For example, the clause WHERE FIRST_NAME LIKE 'B*' would return Babe Ruth and Barry Bonds. WHERE FIRST_NAME LIKE '*ar*' would return Barry Bonds and Harmon Killebrew. WHERE FIRST_NAME LIKE '*ank' would return Hank Aaron and Frank Robinson.
Text strings like those in the WHERE clauses above must be enclosed in quotes. Single quotes are recommended; double quotes also work in Access, but many other RDBMSs reserve double quotes for identifiers such as table and column names. This is in contrast to numeric values, which should not be quoted. The key point to consider is the data type of the column in question. For example, zip code values may look numeric, but if they are stored in a text field (as they should be, because some begin with zeroes), they must be enclosed in quotes.
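For example (ZIP_CODE here is a hypothetical text field, while RBI is the numeric field from our STATS table):

WHERE ZIP_CODE = '16802'

WHERE RBI = 100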
The IN operator is used to identify values that match an item in a list. For example, you might find seasons compiled by members of the Giants franchise (which started in New York and moved to San Francisco) using a WHERE clause like this:
WHERE TEAM IN ('NYG', 'SFG')
Finally, remember that WHERE clauses can be compound. That is, you can evaluate multiple criteria using the AND and OR operators. The AND operator requires that both the expression on the left and the expression on the right evaluate to True, whereas the OR operator requires that at least one of the expressions evaluates to True. Here are a couple of simple illustrations:
WHERE TEAM = 'SFG' AND YEAR > 2000
WHERE TEAM = 'NYG' OR TEAM = 'SFG'
Sometimes WHERE clauses require parentheses to ensure that the filtering logic is carried out properly. One of the queries you were asked to write earlier sought to identify 100+ RBI seasons compiled before 1960 or after 1989. The SQL generated by the query builder looks like this:
SELECT STATS.PLAYER_ID, STATS.YEAR, STATS.RBI FROM STATS WHERE (((STATS.YEAR)>1989) AND ((STATS.RBI)>99)) OR (((STATS.YEAR)<1960) AND ((STATS.RBI)>99));
It turns out that this statement is more complicated than necessary for a couple of reasons. The first has to do with the way I instructed you to build the query:
This resulted in the RBI>99 criterion being inefficiently added to the query twice. A more efficient approach would be to combine the two YEAR criteria into a single row in the design grid, as follows:
This yields a more efficient version of the statement:
SELECT STATS.PLAYER_ID, STATS.YEAR, STATS.RBI FROM STATS WHERE (((STATS.YEAR)>1989 Or (STATS.YEAR)<1960) AND ((STATS.RBI)>99));
It is difficult to see, but this version puts a set of parentheses around the two year criteria so that the condition of being before 1960 or after 1989 is evaluated separately from the RBI condition.
One reason the logic is difficult to follow is because the query builder adds parentheses around each column specification. Eliminating those parentheses and the table prefixes produces a more readable query:
SELECT PLAYER_ID, YEAR, RBI FROM STATS WHERE (YEAR>1989 Or YEAR<1960) AND RBI>99;
The GROUP BY clause was generated by the query builder when we calculated each player's career RBI total:
SELECT LAST_NAME, FIRST_NAME, Sum(RBI) AS SumOfRBI FROM STATS GROUP BY LAST_NAME, FIRST_NAME;
The idea behind this type of query is that you want the RDBMS to find all of the unique values (or value combinations) in the field (or fields) listed in the GROUP BY clause and include each in the result set. When creating this type of query, each field in the GROUP BY clause must also be listed in the SELECT clause. In addition to the GROUP BY fields, it is also common to use one of the SQL aggregation functions in the SELECT clause to produce an output column. The function we used to calculate the career RBI total was Sum(). Other useful aggregation functions include Max(), Min(), Avg(), Last(), First() and Count(). Because the Count() function is only concerned with counting the number of rows associated with each grouping, it doesn't really matter which field you plug into the parentheses. Frequently, SQL developers will use the asterisk rather than some arbitrary field. This modification to the query will return the number of years in each player's career, in addition to the RBI total:
SELECT LAST_NAME, FIRST_NAME, Sum(RBI) AS SumOfRBI, Count(*) AS Seasons FROM STATS GROUP BY LAST_NAME, FIRST_NAME;
Note: In actuality, this query would overcount the seasons played for players who switched teams mid-season (e.g., Mark McGwire was traded in the middle of 1997 and has a separate record in the STATS table for each team). We'll account for this problem using a sub-query later in the lesson.
The purpose of the ORDER BY clause is to specify how the output of the query should be sorted. Any number of fields can be included in this clause, so long as they are separated by commas. Each field can be followed by the keywords ASC or DESC to indicate ascending or descending order. If this keyword is omitted, the sorting is done in an ascending manner. The query below sorts the seasons in the STATS table from the best RBI total to the worst:
SELECT PLAYER_ID, YEAR, RBI FROM STATS ORDER BY RBI DESC;
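As a further illustration of multi-field sorting, here is a sketch (not one of the queries we built earlier) that groups each player's seasons together and lists them from his best RBI year to his worst:

SELECT PLAYER_ID, YEAR, RBI FROM STATS ORDER BY PLAYER_ID ASC, RBI DESC;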
That concludes our review of the various clauses found in SELECT queries. You're likely to memorize much of this syntax if you're called upon to write much SQL from scratch. However, don't be afraid to use the Access query builder or other GUIs as an aid in producing SQL statements. I frequently take that approach when writing database-related PHP scripts. Even though the data is actually stored in a MySQL database, I make links to the MySQL tables in Access and use the query builder to develop the queries I need. That workflow (which we'll explore in more depth later in the course) is slightly complicated by the fact that the flavor of SQL produced by Access differs slightly from industry standards. We'll discuss a couple of the important differences in the next section.
If you plan to use the SQL generated by the Access query builder in other applications as discussed on the previous page, you'll need to be aware of some of the differences between Access SQL and the dialects used by other RDBMSs.
We used the Expression Builder to combine the players' first and last names, and later to calculate their at-bats-per-home-run ratio. Unfortunately, the Expression Builder specifies table/column names in a format that is unique to Access. Here is a simplified version of the AB-per-HR query that doesn't bother including names from the PLAYERS table:
SELECT STATS.PLAYER_ID, STATS.YEAR, [STATS]![AB]/[STATS]![HR] AS ABPERHR FROM STATS;
The brackets and exclamation points used in this expression are non-standard SQL and would not work outside of Access. A safer syntax would be to use the same table.column notation used earlier in the SELECT clause:
SELECT STATS.PLAYER_ID, STATS.YEAR, STATS.AB/STATS.HR AS ABPERHR FROM STATS;
Or, to leave out the parent table name altogether:
SELECT PLAYER_ID, YEAR, AB/HR AS ABPERHR FROM STATS;
Note that this modified syntax will work in Access.
In our queries that combined the players' first and last names, the following syntax was used:
SELECT [PLAYERS]![FIRST_NAME] & ' ' & [PLAYERS]![LAST_NAME]....
As mentioned above, columns should be specified using the table.column notation rather than with brackets and exclamation points. In addition to the different column specification, the concatenation itself would also be done differently in other RDBMS. While Access uses the & character for concatenation, the concatenation operator in standard SQL is ||. However, there is considerable variation in the adherence to this standard. For a comparison of a number of the major RDBMS [2], see http://troels.arvin.dk/db/rdbms/#functions-concat.
One of the points of comparison in the linked page is whether or not there is automatic casting of values. All this is getting at is what happens if the developer tries to concatenate a string with some other data type (say, a number). Some SQL implementations will automatically cast, or convert, the number to a string. Others are less flexible and require the developer to perform such a cast explicitly.
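For example, in an implementation that does not auto-cast, the numeric PLAYER_ID would need to be explicitly converted before being concatenated with a name. A sketch in standard SQL (note that this exact syntax would not run in Access, which uses & for concatenation and its own conversion functions):

SELECT FIRST_NAME || ' ' || CAST(PLAYER_ID AS varchar(10)) AS PLAYER_LABEL FROM PLAYERS;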
There comes a time when every SQL developer has a problem that is too difficult to solve using only the methods we've discussed so far. In our baseball stats database, difficulty arises when you consider the fact that players who switch teams mid-season have a separate row for each team in the STATS table. For example, Mark McGwire started the 1997 season with OAK, before being traded to and spending the rest of that season with STL. The STATS table contains two McGwire-1997 rows; one for his stats with OAK and one for his stats with STL.
If you wanted to identify each player's best season (in terms of batting average, home runs or runs batted in), you wouldn't be able to do a straight GROUP BY on the player's name or ID because that would not account for the mid-season team switchers. What's needed in this situation is a multistep approach that first computes each player's yearly aggregated stats, then identifies the maximum value in that result set.
A novice's approach to this problem would be to output the results from the first query to a new table, then build a second query on top of the table created by the first. The trouble with this approach is that it requires re-running the first query each time the STATS table changes.
A more ideal solution to the problem can be found through the use of a subquery. Let's have a look at how to build a subquery in Access.
SELECT PLAYERS.FIRST_NAME, PLAYERS.LAST_NAME, Max(STATS_AGG.SumOfHR) AS MaxOfSumOfHR FROM PLAYERS INNER JOIN (SELECT STATS.PLAYER_ID, STATS.YEAR, Sum(STATS.AB) AS SumOfAB, Sum(STATS.HITS) AS SumOfHITS, Sum(STATS.HR) AS SumOfHR, Sum(STATS.RBI) AS SumOfRBI, Count(STATS.TEAM) AS CountOfTEAM FROM STATS GROUP BY STATS.PLAYER_ID, STATS.YEAR) AS STATS_AGG ON PLAYERS.PLAYER_ID = STATS_AGG.PLAYER_ID GROUP BY PLAYERS.FIRST_NAME, PLAYERS.LAST_NAME;
This approach is a bit more complicated than the first, since it cannot be carried out with the graphical Query Builder. However, the advantage is that all of the code involved can be found in one place, and you don't have a second intermediate query cluttering the database's object list.
This completes the material on retrieving data from an RDBMS using SELECT queries. In the next section, you'll be posed a number of questions that will test your ability to write SELECT queries on your own.
To demonstrate that you've learned the material from Lesson 1, please build queries in your Access database for each of the problems posed below. Some tips:
Here are the problems to solve:
Name your queries so that it's easy for me to identify them (e.g., Proj1_1 through Proj1_10).
This project is one week in length. Please refer to the Canvas Calendar for the due date.
Now that you've gotten a taste of the power provided by relational databases to answer questions, let's shift our attention to the dirty work: designing, implementing, populating and maintaining a database.
At the successful completion of this lesson, students should be able to:
If you have any questions now or at any point during this week, please feel free to post them to the Lesson 2 Discussion Forum.
Lesson 2 is one week in length (see the Canvas Calendar for specific due dates). To finish this lesson, you must complete the activities listed below:
Before digging deeper into the workings of relational database management systems (RDBMSs), it's a good idea to take a step back and look briefly at the history of the relational model and how it fits into the bigger picture of DBMSs in general. The relational model of data storage was originally proposed by an IBM employee named Edgar Codd in the early 1970s. Prior to that time, computerized data storage techniques followed a "navigational" model that was not well suited to searching. The ground-breaking aspect of Codd's model was the allocation of data to separate tables, all linked to one another by keys - values that uniquely identify particular records. The process of breaking data into multiple tables is referred to as normalization.
If you do much reading on the relational model, you're bound to come across the terms relation, attribute, and tuple. While purists would probably disagree, for our purposes you can consider relation to be synonymous with table, attribute with the terms column and field, and tuple with the terms row and record.
SQL, which we started learning about in Lesson 1, was developed in response to Codd's relational model. Today, relational databases built around SQL dominate the data storage landscape. However, it is important to recognize that the relational model is not the only ballgame in town. Object-oriented databases [3] arose in the 1980s in parallel with object-oriented programming languages. Some of the principles of object-oriented databases (such as classes and inheritance) have made their way into most of today's major RDBMSs, so it is more accurate to describe them as object-relational hybrids.
More recently, a class of DBMSs that deviate significantly from the relational model has developed. These NoSQL databases [4] seek to improve performance when dealing with large volumes of data (often in web-based applications). Essentially, these systems sacrifice some of the less critical functions found in an RDBMS in exchange for gains in scalability and performance.
When building a relational database from scratch, it is important that you put a good deal of thought into the process. A poorly designed database can cause a number of headaches for its users, including:
Entire courses can be spent on database design concepts, but we don't have that kind of time, so let's just focus on some basic design rules that should serve you well. A well-designed table is one that:
The process of designing a database according to the rules described above is formally referred to as normalization. All database designers carry out normalization, whether they use that term to describe the process or not. Hardcore database designers not only use the term normalization, they're also able to express the extent to which a database has been normalized:
In most cases, normalizing a database so that it is in 3NF is sufficient. However, it is worth pointing out that there are other normal forms, including Boyce-Codd normal form (BCNF, or 3.5NF), fourth normal form (4NF) and fifth normal form (5NF). Rather than spend time going through examples of these other forms, I encourage you to simply keep in mind the basic characteristics of a well-designed table listed above. If you follow those guidelines carefully, in particular, constantly being on the lookout for redundant data, you should be able to reap the benefits of normalization.
Generally speaking, a higher level of normalization results in a higher number of tables. And as the number of tables increases, the costs of bringing together data through joins increases as well, both in terms of the expertise required in writing the queries and in the performance of the database. In other words, the normalization process can sometimes yield a design that is too difficult to implement or that performs too slowly. Thus, it is important to bear in mind that database design is often a balancing of concerns related to data integrity and storage efficiency (why we normalize) versus concerns related to its usability (getting data into and out of the database).
Earlier, we talked about city/state combinations being redundant with zip code. That is a great example of a situation in which de-normalizing the data might make sense. I have no hard data on this, but I would venture to say that the vast majority of relational databases that store these three attributes keep them all together in the same table. Yes, there is a benefit to storing the city and state names once in the zip code table (less chance of a misspelling, less disk space used). However, my guess is that the added complexity of joining the city/state together with the rest of the address elements outweighs that benefit to most database designers.
Let's work through an example design scenario to demonstrate how these rules might be applied to produce an efficient database. Ice cream entrepreneurs Jen and Barry have opened their business and now need a database to track orders. When taking an order, they record the customer's name, the details of the order such as the flavors and quantities of ice cream needed, the date the order is needed, and the delivery address. Their database needs to help them answer two important questions:
A first crack at storing the order information might look like this:
Customer | Order | DeliveryDate | DeliveryAdd |
---|---|---|---|
Eric Cartman | 1 vanilla, 2 chocolate | 12/1/11 | 101 Main St |
Bart Simpson | 10 chocolate, 10 vanilla, 5 strawberry | 12/3/11 | 202 School Ln |
Stewie Griffin | 1 rocky road | 12/3/11 | 303 Chestnut St |
Bart Simpson | 3 mint chocolate chip, 2 strawberry | 12/5/11 | 202 School Ln |
Hank Hill | 2 coffee, 3 vanilla | 12/8/11 | 404 Canary Dr |
Stewie Griffin | 5 rocky road | 12/10/11 | 303 Chestnut St |
The problem with this design becomes clear when you imagine trying to write a query that calculates the number of gallons of vanilla that have been ordered. The quantities are mixed with the names of the flavors, and any one flavor could be listed anywhere within the order field (i.e., it won't be consistently listed first or second).
A design like the following would be slightly better:
Customer | Flavor1 | Qty1 | Flavor2 | Qty2 | Flavor3 | Qty3 | DeliveryDate | DeliveryAdd |
---|---|---|---|---|---|---|---|---|
Eric Cartman | vanilla | 1 | chocolate | 2 | | | 12/1/11 | 101 Main St |
Bart Simpson | chocolate | 10 | vanilla | 10 | strawberry | 5 | 12/3/11 | 202 School Ln |
Stewie Griffin | rocky road | 1 | | | | | 12/3/11 | 303 Chestnut St |
Bart Simpson | mint chocolate chip | 3 | strawberry | 2 | | | 12/5/11 | 202 School Ln |
Hank Hill | coffee | 2 | vanilla | 3 | | | 12/8/11 | 404 Canary Dr |
Stewie Griffin | rocky road | 5 | | | | | 12/10/11 | 303 Chestnut St |
This is an improvement because it enables querying on flavors and summing quantities. However, to calculate the gallons of vanilla ordered you would need to sum the values from three fields. Also, the design would break down if a customer ordered more than three flavors.
Slightly better still is this design:
Customer | Flavor | Qty | DeliveryDate | DeliveryAdd |
---|---|---|---|---|
Eric Cartman | vanilla | 1 | 12/1/11 | 101 Main St |
Eric Cartman | chocolate | 2 | 12/1/11 | 101 Main St |
Bart Simpson | chocolate | 10 | 12/3/11 | 202 School Ln |
Bart Simpson | vanilla | 10 | 12/3/11 | 202 School Ln |
Bart Simpson | strawberry | 5 | 12/3/11 | 202 School Ln |
Stewie Griffin | rocky road | 1 | 12/3/11 | 303 Chestnut St |
Hank Hill | coffee | 2 | 12/8/11 | 404 Canary Dr |
Hank Hill | vanilla | 3 | 12/8/11 | 404 Canary Dr |
Stewie Griffin | rocky road | 5 | 12/10/11 | 303 Chestnut St |
This design makes calculating the gallons of vanilla ordered much easier. Unfortunately, it also produces a lot of redundant data and spreads a complete order from a single customer across multiple rows.
Better than all of these approaches would be to separate the data into four entities (Customers, Flavors, Orders and Order Items):
CustID | NameLast | NameFirst | DeliveryAdd |
---|---|---|---|
1 | Cartman | Eric | 101 Main St |
2 | Simpson | Bart | 202 School Ln |
3 | Griffin | Stewie | 303 Chestnut St |
4 | Hill | Hank | 404 Canary Dr |
FlavorID | Name |
---|---|
1 | vanilla |
2 | chocolate |
3 | strawberry |
4 | rocky road |
5 | mint chocolate chip |
6 | coffee |
OrderID | CustID | DeliveryDate |
---|---|---|
1 | 1 | 12/1/11 |
2 | 2 | 12/3/11 |
3 | 3 | 12/3/11 |
4 | 2 | 12/5/11 |
5 | 4 | 12/8/11 |
6 | 3 | 12/10/11 |
OrderItemID | OrderID | FlavorID | Qty |
---|---|---|---|
1 | 1 | 1 | 1 |
2 | 1 | 2 | 2 |
3 | 2 | 2 | 10 |
4 | 2 | 1 | 10 |
5 | 2 | 3 | 5 |
6 | 3 | 4 | 1 |
7 | 4 | 5 | 3 |
8 | 4 | 3 | 2 |
9 | 5 | 6 | 2 |
10 | 5 | 1 | 3 |
11 | 6 | 4 | 5 |
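With the data separated this way, the gallons-of-vanilla question from earlier becomes a simple join and aggregation. Here is a sketch (the table names Flavors and OrderItems are assumptions; the column names come from the tables above, and Qty is assumed to be in gallons):

SELECT Sum(OrderItems.Qty) AS VanillaGallons FROM Flavors INNER JOIN OrderItems ON Flavors.FlavorID = OrderItems.FlavorID WHERE Flavors.Name = 'vanilla';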
If one were to implement a design like this in MS-Access, the query needed to display orders that must be delivered in the next 2 days would look like this in the GUI:
It would produce the following SQL:
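(The generated statement isn't reproduced here. A sketch of what it might look like, assuming the tables are named Customers, Orders, OrderItems, and Flavors, and using Access's built-in Date() function:)

SELECT Customers.NameLast, Customers.NameFirst, Customers.DeliveryAdd, Orders.DeliveryDate, Flavors.Name, OrderItems.Qty FROM ((Customers INNER JOIN Orders ON Customers.CustID = Orders.CustID) INNER JOIN OrderItems ON Orders.OrderID = OrderItems.OrderID) INNER JOIN Flavors ON OrderItems.FlavorID = Flavors.FlavorID WHERE Orders.DeliveryDate BETWEEN Date() AND Date()+2;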
If you aren't experienced with relational databases, then such a table join may seem intimidating. You probably won't need to do anything quite so complex in this course. The purpose of this example is two-fold:
Whether it's just a quick sketch on a napkin or a months-long process involving many stakeholders, the life cycle of any effective database begins with data modeling. Data modeling itself begins with a requirements analysis, which can be more or less formal, depending on the scale of the project. One of the common products of the data modeling process is an entity-relationship (ER) diagram. This sort of diagram depicts the categories of data that must be stored (the entities) along with the associations (or relationships) between them. The Wikipedia entry on ER diagrams is quite good, so I'm going to point you there to learn more:
Entity-relationship model article at Wikipedia [6] [http://en.wikipedia.org/wiki/Entity-relationship_model]
An ER diagram is essentially a blueprint for a database structure. Some RDBMSs provide diagramming tools (e.g., Oracle Designer, MySQL Workbench) and often include the capability of automatically creating the table structure conceptualized in the diagram.
In a GIS context, Esri makes it possible to create new geodatabases based on diagrams authored using CASE (Computer-Aided Software Engineering) tools. This blog post, Using CASE tools in ArcGIS 10 [7] [http://blogs.esri.com/esri/arcgis/2010/08/05/using-case-tools-in-arcgis-10/], provides details if you are interested in learning more.
To help drive these concepts home, here is a scenario for you to consider. You work for a group with an idea for a fun website: to provide a place for music lovers to share their reviews of albums on the 1001 Albums You Must Hear Before You Die list [8]. (All of these albums had been streamable from Radio Romania 3Net [9], but sadly it appears that's no longer the case.)
Spend 15-30 minutes designing a database (on paper, no need to implement it in Access) for this scenario. Your database should be capable of efficiently storing all of the following data:
When you're satisfied with your design, move to the next page and compare yours to mine.
There is no one correct design for the music scenario posed on the last page. The figure below depicts one design. Note the FK notation beside some of the fields. FK stands for foreign key and indicates a field whose values uniquely identify rows in another table. There is no way to designate a field as a foreign key in MS-Access. For our purposes, you should just remember that building foreign keys into your database is what will enable you to create the table joins needed to answer the questions you envision your database answering.
Note also the 1 and * labels at the ends of the relationship lines, which indicate the number of times any particular value will appear in the connected field. As shown in the diagram, values in the Albums table's Album_ID field will appear just once; the same values in the associated field in the Tracks table may appear many times (conveyed by the * character). For example, an album with 10 tracks will have its Album_ID value appear in the Albums table just once. That Album_ID value would appear in the Tracks table 10 times. This uniqueness of values in fields is referred to as cardinality.
If you have questions about this design or see ways to improve it, please share your thoughts in the Lesson 2 Discussion Forum.
In this part of the lesson, we'll create and populate some of the tables depicted in the diagram above to give you a sense of how to move from the design phase to the implementation phase of a database project. Table creation is something that can be accomplished using SQL, as we'll see later in the course. Most RDBMSs also make it possible to create tables through a GUI, which is the approach we will take right now.
An integral part of creating a table is defining its fields and the types of data that can be stored in those fields. Here is a list of the most commonly used data types available in Access:
Beyond the data type, fields have a number of other properties that are sometimes useful to set. Among these are Default Value which specifies the value the field takes on by default, and Required, which specifies whether or not the field is allowed to contain Null values. It is also possible to specify that a field is the table's primary key (or part of a multi-field primary key).
Perhaps the most commonly set field property is the Field Size. For Text fields this property defaults to a value of 255 characters. This property should be set to a lower value when appropriate, for a couple of reasons. First, it will reduce the size of the database, and secondly, it can serve as a form of data validation (e.g., assigning a Field Size of 2 to a field that will store state abbreviations will ensure that no more than 2 characters are entered).
When dealing with a Number field, the Field Size property is used to specify the type of number:
As with Text fields, it is good practice to choose the smallest possible Field Size for a Number field.
With this background on fields in mind, let's move on to implementing the music database.
Name | Type |
---|---|
Album_ID | AutoNumber |
Title | Text (200) |
Artist_ID | Long Integer |
Release_Year | Integer |
Label_ID | Long Integer |
Try adding a field with a name of Year to see why we used the name Release_Year instead.
Name | Type |
---|---|
Artist_ID | AutoNumber |
Artist_Name | Text (200) |
Year_Begin | Integer |
Year_End | Integer |
Name | Type |
---|---|
Label_ID | AutoNumber |
Label_Name | Text (200) |
While it's not strictly necessary to do so, it can be beneficial to spell out the relationships between the tables that you envisioned during the design phase.
With your database design implemented, you are now ready to add records to your tables.
Records may be added to tables in three ways: manually through the table GUI, using an SQL INSERT query to add a single record, and using an INSERT query to add multiple records in bulk.
This method is by far the easiest, but also the most tedious. It is most suitable for populating small tables.
Label_ID | Label_Name |
---|---|
1 | Capitol |
2 | Pye |
3 | Columbia |
4 | Track |
5 | Brunswick |
6 | Parlophone |
7 | Apple |
When a record is deleted from a table with an AutoNumber field like this one, the AutoNumber value that had been associated with that record is gone forever. For example, imagine that the Columbia Records row was deleted from the table. The next new record would take on a Label_ID of 8. The Label_ID of 3 has already been used and will not be used again.
INSERT INTO Artists ( Artist_Name, Year_Begin, Year_End ) SELECT "The Beatles" AS Expr1, 1957 AS Expr2, 1970 AS Expr3;
INSERT INTO Artists ( Artist_Name, Year_Begin, Year_End ) VALUES ( "The Kinks", 1964, 1996 );
INSERT INTO Artists ( Artist_Name, Year_Begin ) VALUES ( "The Who", 1964 );
The last method of adding records to a table is perhaps the most common one, the bulk insertion of records that are already in some other digital form. To enable you to see how this method works in Access, I've created a comma-delimited text file of albums released in the 1960s by the three artists above.
The Albums.txt text file you just imported was written assuming that the records in the Labels table are stored with Label_ID values as specified in the figure in Part A and Artist_ID values of 1 for The Beatles, 2 for The Kinks and 3 for The Who. If your tables don't have these ID values (e.g., if you ran into a problem with one of the inserts into the Artists table and had to add that artist again), you should modify the ID values in the Albums_temp table so that they match up with the ID values in your Labels and Artists tables before building the Append query in the steps below.
As with adding records, updating existing records can be done both manually through the GUI and programmatically with SQL. The manual method simply requires navigating to the correct record (sorting the table can often help with this step) and replacing the old value(s) with new one(s).
SQL UPDATE statements are used in a couple of different scenarios. The first is when you have bulk edits to make to a table. For example, let's say you wanted to add a region field to a states table. You might add the field through the table design GUI, then execute a series of UPDATE queries to populate the new region field. The logic of the queries would take the form, "Update the region field to X for the states in this list Y." We'll perform a bulk update like this in a moment.
The other common usage for the UPDATE statement goes back to the website account scenario. Let's say you've moved and need to provide the company with your new address. Upon clicking the Submit button, a script will generate and execute an UPDATE query on your record in the table.
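A minimal sketch of the kind of statement such a script might generate, using a hypothetical accounts table and column names:

UPDATE accounts SET address = '500 New St', city = 'State College' WHERE account_id = 1234;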
After working with your album database for a while, you notice that you're constantly using string concatenation to add the word "Records" to the names of the companies in the Labels table. You decide that the benefit of adding that text to the label names outweighs the drawback of storing that text redundantly.
[Labels]![Label_Name] & " Records"

If you needed to, you could specify selection criteria to limit the update to a subset of records. For example, following on the state-region example mentioned above, you might update the region field to "New England" using a WHERE condition of:
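(The condition itself isn't shown here; as a sketch, assuming the states table identifies each state by a name column, it might read:)

name IN ('Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island', 'Vermont')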
Here is the SQL Access generates for the Labels update:

UPDATE Labels SET Labels.Label_Name = [Labels]![Label_Name] & " Records";

As discussed earlier when we were dealing with SELECT queries, the brackets and exclamation point do not follow the SQL standard. You could also omit the parent table in the field specification if there is no chance of duplicate field names in your query. Thus, the following syntax would also work and be closer to standard SQL:
UPDATE Labels SET Label_Name = Label_Name & " Records";

Also, as mentioned before, concatenation requires a different syntax depending on the RDBMS.
When working in Access, deleting records can be as easy as opening the table, navigating to the desired records, selecting them using the gray selection boxes to the left of the records, and hitting the Delete key on the keyboard. Likewise, it is possible to follow the same methodology to delete records returned by a SELECT query (if it's a one-table query).
However, as with the other query types, it is sometimes necessary to write a query to perform the task. For example, if you've ever canceled a hotel reservation or an order from an online vendor, a DELETE query was probably executed behind the scenes. Let's see how such a query can be built using the Access GUI and examine the resulting SQL.
Let's say a revised version of the 1001 albums list is released, and the albums recorded by The Who on the Track Records label have dropped off the list. Knowing that Track Records no longer has albums on the list, you decide to delete it from the Labels table.
DELETE Labels.*, Labels.Label_Name FROM Labels WHERE (((Labels.Label_Name)="Track Records"));
DELETE * FROM Labels WHERE Label_Name="Track Records";
One final note before moving on to the graded assignment. I had you delete the record label from the database because I wanted to show you how to write a DELETE query. However, think about other actions you might have taken instead under that scenario. Depending on the circumstance, it might make more sense to leave that label in the database and delete only the albums (in case the label once again has an album on the list). The upside of deleting the label is that answering a question like, "Which record companies released the albums on the list?" requires dealing only with the Labels table. On the other hand, if the Labels table included companies that had no albums on the list, answering that question would require a more complex query, involving a join between Albums and Labels.
Another response to the revised-list scenario might be to make it possible for the database to store all versions of the list, rather than just its current state. Generally speaking, it's better to err on the side of keeping data versus deleting. Doing so makes your database capable of answering a larger array of questions; it's quite common for database users to come up with questions that could not have been predicted in the initial design. The downside to this decision is that it necessitates changes to the database design. What changes would be required in this scenario?
This lesson focused on concepts involved in the design of a relational database. To demonstrate your understanding of these concepts, you'll now complete an exercise in database design. Here is the scenario: you work for a UFO investigation organization that has collected a lot of information related to sightings, and you've been tasked with designing a database that will be used as the back end for a website. The information to be stored in the database includes the following:
In addition to this basic information, the organization has many digital files that must be made available through the website:
While most RDBMSs make it possible to store such data in the form of BLOBs (Binary Large OBjects), doing so can be tricky. An alternative approach is to store the names of the files and use front-end application logic to present the file data appropriately.
If you have access to software that can be used to produce an entity-relationship diagram of your proposed database structure, feel free to use it. However, hand-drawn and scanned diagrams are perfectly acceptable as well. If you don't have access to a scanner at home or work, try your local library or FedEx Office store.
As part of your submission, please include notations of the key fields (primary and foreign), the relationships and cardinality between tables, and a short narrative that explains the thinking that went into your design.
This project is one week in length. Please refer to the Canvas Calendar for the due date.
Now that you have a solid handle on the basics of relational database design and query writing, we're ready to dive into spatial database technology. Over the next two lessons, we'll experiment with the open-source RDBMS PostgreSQL (pronounced pōst-grɛs kyü'-ɛl) and its spatial extension PostGIS (pronounced pōst-jis). This software combination is quite popular for those looking for an alternative to vendor solutions that are often more costly than their organization can afford.
Unlike MS-Access, which is intended for relatively small projects, Postgres is a full-fledged enterprise RDBMS more akin to the leading vendor products (e.g., Oracle and SQL Server). Though there are certainly differences between Postgres and Access, you should find that the concepts you learned earlier in the course will transfer over to this new environment.
After orienting you to working with Postgres, we'll get into the spatial functionality provided by PostGIS.
At the successful completion of this lesson, students should be able to:
If you have any questions now or at any point during this week, please feel free to post them to the Lesson 3 Discussion Forum.
Lesson 3 is one week in length. See the Canvas Calendar for specific due dates. To finish this lesson, you must complete the activities listed below:
Download and install Postgres 16.x [11], accepting all of the default settings - Go with the 64-bit version unless your computer will not support it.
When running the installation, you will need to have access to the Internet.
- After Postgres is installed, you'll be asked if you want to launch the Stack Builder, a separate package that allows for the installation of add-ons to Postgres. Check the box for Stack Builder may be used... and in the next window, choose PostgreSQL 16.x on Port 5432 as the installation you are installing software for.
- Under the category of Spatial Extensions, choose the PostGIS 3.4 Bundle... Installing this add-on will enable you to execute the spatial queries covered in Lessons 3-4.
- Make your way through the installation, accepting the defaults. When the Choose Components dialog appears, check the Create spatial database box.
- For the Database Connection, leave the User Name set to postgres and Port set to 5432. Set the Password to postgres, same as the User Name, making it easy to remember.
- For the Database Name, leave it set to the default of postgis_34_sample.
In this first part of the lesson, you'll get an introduction to Postgres's graphical interface called pgAdmin. You'll also import a shapefile, load data from a text file, and see how queries are performed in pgAdmin.
A common workflow for PostGIS users is to convert their data from Esri shapefile format to PostGIS tables. Fortunately, some PostGIS developers have created a Shapefile Import/Export Manager that makes this conversion easy. In prior versions of pgAdmin, the shapefile importer was accessible as a plug-in. In pgAdmin 4, it must be run as a separate application.
If you encounter an error saying that the file "libintl-9.dll" is missing, the easiest fix for this problem is to navigate up to the bin folder where libintl-9.dll is found, copy it, and paste it into the postgisgui folder.
Since we will be using this executable several times, I suggest that you make a desktop shortcut for it.
It's sometimes necessary to refresh the GUI after creating new objects like this. This can be done by right-clicking on the schema or Tables node in the Browser and selecting Refresh (or hitting F5 on the keyboard).
Loading data from a comma-delimited text file is a common workflow for database developers. Let's see how this can be done in Postgres by loading some state demographic info from the 2010 Census. We'll begin by creating a new blank table.
Instead of having to expand the Data Type pick list, you can start typing the word integer and the cell will offer matching choices to auto-complete; after you type inte, you can pick "integer." Be sure to add the columns in this order, otherwise the data load will not work properly. (An equivalent SQL definition of the table is sketched just after the column list below.)
male
female
white
black
amind
asian
hawaiian
other
mixed
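For reference, the same table could be created with SQL rather than through the GUI. Here is a rough sketch; the state and total columns are assumed from the COPY example later on this page, and the data types (text for state, integer for the counts) are assumptions as well:

CREATE TABLE usa.census2010 (state varchar(50), total integer, male integer, female integer, white integer, black integer, amind integer, asian integer, hawaiian integer, other integer, mixed integer);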
Before executing the command that will import the data into the table, let's have a look at the data file in a plain text editor and also note its location.
COPY usa.census2010 FROM 'C:\PSU\Geog868\Lesson3data\census2010.csv' WITH (FORMAT csv, HEADER True);
If you encounter a "permission denied" error, it means the "postgres" database login doesn't have permission to read the csv file where it is currently located. Try copying it to a sub-directory belonging to the "Public" user (e.g., 'C:\Users\Public\Public Documents') or to a location that has no permission restrictions (e.g., 'C:\temp'). You could also reset the permissions on the folder that stores the CSV file as outlined in this stackoverflow thread [15].
COPY usa.census2010 (state, total, male....) FROM ....
SELECT name, sub_region FROM states WHERE sub_region = 'Soda';
SELECT name, sub_region FROM usa.states WHERE sub_region = 'Soda';

The second solution is to reset the session's search path so that the schema you're using is part of that path. By default, Postgres searches only the public schema. We will take this second approach since it allows us to omit the schema qualifier.
SET search_path TO usa, public;
SHOW search_path;
You may still be receiving an error if you left the table's name set to States rather than states during the import process; Postgres folds all unquoted table/column names to lower case before executing a statement. Thus, even if your FROM clause reads "FROM States", it will be evaluated as "FROM states". And if your table is named States, Postgres won't find a matching table. To override this case folding, you can put the table/column name in double quotes like this:
SELECT name, sub_region FROM "States" WHERE sub_region = 'Soda';To avoid having to qualify your table/column names in this way, it's best to use lower case in your naming.
select name, sub_region from states where sub_region = 'Soda';
Now that you have a feel for how Postgres works, go on to the next page to practice writing queries in pgAdmin.
To help you get oriented to writing SQL queries on the pgAdmin command line, try your hand at the following exercises. Recall that the 2008 population data, soft drink data, and geometries are in the states table, and that the 2010 data are in the census2010 table.
Solutions [17] (This link takes you to the bottom of the Lesson.)
What sets spatial databases apart from their non-spatial counterparts is their support for answering geometric and topological questions. Let's have a look at some simple examples to demonstrate. We'll continue working with the states table we created in the last section.
SELECT name, ST_Centroid(geom) AS centroid FROM states WHERE sub_region = 'Soda';
SELECT name, ST_AsText(ST_Centroid(geom)) AS centroid FROM states WHERE sub_region = 'Soda';
SELECT name, ST_AsText(ST_Centroid(geom)) AS centroid, ST_Area(geom) AS area FROM states WHERE sub_region = 'Soda';

In the area column, take note of the values returned by ST_Area(). They are in the units of the input geometry, squared. Recall that the Lesson 3 shapefiles are in latitude/longitude coordinates, which means the area values we're seeing are in square degrees. Hopefully, you recognize that this is a poor way to compute area, since a square degree represents a different area depending on the part of the globe you're dealing with. The ST_Transform() function exists exactly for situations like this. It takes a geometry (in whatever spatial reference) as input and re-projects it into some other spatial reference.
SELECT name, ST_AsText(ST_Centroid(geom)) AS centroid, ST_Area(ST_Transform(geom,2163)) AS area FROM states WHERE sub_region = 'Soda';

Take note of the values now displayed in the area column. In this version of the query, the ST_Transform() function is first used to re-project the geometry into spatial reference 2163 before ST_Area() is called. That spatial reference is an equal-area projection in meters that is suitable for the continental U.S. Don't worry, we'll discuss how you'd find that information later in this lesson.
We'll spend much more time discussing the spatial functions that are available in PostGIS later. Right now, let's go over the geometry types that are supported.
In the last section, we worked with a table – usa.states – containing geometries of the type POLYGON. The other basic geometry types are POINT and LINESTRING. As we'll see momentarily, there are numerous other geometry types available in PostGIS that allow for the storage of multipart shapes, 3-dimensional shapes, and shapes that have a measure (or M value) associated with their vertices. If keeping all of the various types straight becomes difficult, it may help to remember that the simple geometries we deal with most often are POINT, LINESTRING, and POLYGON.
To demonstrate some of the concepts in this section, we're going to create a new schema to store points of interest in New York City. Unlike the last schema where we used the Shapefile Import/Export Manager to both create and populate a table at the same time, here we'll carry out those steps separately.
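The lesson steps for this are carried out through the pgAdmin GUI; done in SQL instead, creating the schema and the points table might look roughly like the sketch below (the gid column and the varchar length are assumptions; the name column is the one used by the INSERT statements later on this page):

CREATE SCHEMA nyc_poi;

CREATE TABLE nyc_poi.pts (gid serial PRIMARY KEY, name varchar(50));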
SET search_path TO nyc_poi, public;
SELECT AddGeometryColumn('nyc_poi','pts','geom',4269,'POINT',2);

First, let's address the unusual syntax of this statement. You've no doubt grown accustomed to listing column names (or *) in the SELECT clause, but here we're plugging in a function without any columns. We're forced to use this awkward syntax because SQL rules don't allow for invoking functions directly. Function calls must be made in one of the statement types we've encountered so far (SELECT, INSERT, UPDATE, or DELETE). In this situation, a SELECT statement is the most appropriate.
We're about to add rows to our pts table through a series of INSERT statements. You'll find it much easier to copy and paste these statements rather than typing them manually, if not now, then certainly when we insert polygons later using long strings of coordinates.
INSERT INTO pts (name, geom) VALUES ('Empire State Building', ST_GeomFromText('POINT(-73.985744 40.748549)',4269));
The key point to take away from this statement (no pun intended) is the call to the ST_GeomFromText() function. This function converts a geometry supplied in text format to the hexadecimal form that PostGIS geometries are stored in. The other argument is the spatial reference of the geometry. This argument is required in this case because when we created the geom column using AddGeometryColumn(), it added a constraint that values in that column must be in a particular spatial reference (which we specified as 4269).
INSERT INTO pts (name, geom) VALUES ('Statue of Liberty', ST_GeomFromText('POINT(-74.044508 40.689229)',4269)); INSERT INTO pts (name, geom) VALUES ('World Trade Center', ST_GeomFromText('POINT(-74.013371 40.711549)',4269));
INSERT INTO pts (name, geom) VALUES ('Grand Central Station', ST_SetSRID(ST_MakePoint(-73.976522, 40.7528),4269));
INSERT INTO pts (name, geom) VALUES ('Radio City Music Hall', ST_GeomFromText('POINT(-73.97988 40.760171)',4269)), ('Madison Square Garden', ST_GeomFromText('POINT(-73.993544 40.750541)',4269));
INSERT INTO lines (name, geom) VALUES
('Holland Tunnel', ST_GeomFromText('LINESTRING(-74.036486 40.730121, -74.03125 40.72882, -74.011123 40.725958)',4269)),
('Lincoln Tunnel', ST_GeomFromText('LINESTRING(-74.019921 40.767119, -74.002841 40.759773)',4269)),
('Brooklyn Bridge', ST_GeomFromText('LINESTRING(-73.99945 40.708231, -73.9937 40.703676)',4269));
Note that I've split this statement across several lines to improve its readability, not for any syntax reasons. You should feel welcome to format your statements however you see fit.
INSERT INTO polys (name, geom) VALUES ('Central Park', ST_GeomFromText('POLYGON((-73.973057 40.764356, -73.981898 40.768094, -73.958209 40.800621, -73.949282 40.796853, -73.973057 40.764356))',4269));
While the syntax for constructing a polygon looks very similar to that of a linestring, there are two important differences:
INSERT INTO polys (name, geom) VALUES ('Central Park',ST_GeomFromText('POLYGON(( -73.973057 40.764356, -73.981898 40.768094, -73.958209 40.800621, -73.949282 40.796853, -73.973057 40.764356), (-73.966681 40.785221, -73.966058 40.787674, -73.965586 40.788064, -73.9649 40.788291, -73.963913 40.788194, -73.963333 40.788291, -73.962539 40.788259, -73.962153 40.788389, -73.96181 40.788714, -73.961359 40.788909, -73.960887 40.788925, -73.959986 40.788649, -73.959492 40.788649, -73.958913 40.78873, -73.958269 40.788974, -73.957797 40.788844, -73.957497 40.788568, -73.957497 40.788259, -73.957776 40.787739, -73.95784 40.787057, -73.957819 40.786569, -73.960801 40.782394, -73.961145 40.78215, -73.961638 40.782036, -73.962518 40.782199, -73.963076 40.78267, -73.963677 40.783661, -73.965694 40.784457, -73.966681 40.785221) )',4269));
Earlier in this section, we discussed 3-dimensional (XYZ and XYM) and 4-dimensional (XYZM) geometries in the context of properly specifying the dimension argument to the AddGeometryColumn() function. We won't be doing so in this course, but let's look for a moment at the syntax used for creating these geometries.
To define a column that can store M values as part of the geometry, use the POINTM, LINESTRINGM, and POLYGONM data types. When specifying objects of these types, the M value should appear last. For example, an M value of 9999 is attached to each coordinate in these features from our nyc_poi schema:
POINTM(-73.985744 40.748549 9999)
LINESTRINGM(-74.019921 40.767119 9999, -74.002841 40.759773 9999)
POLYGONM((-73.973057 40.764356 9999, -73.981898 40.768094 9999, -73.958209 40.800621 9999, -73.949282 40.796853 9999, -73.973057 40.764356 9999))
Perhaps the most common usage of M coordinates is in linear referencing (e.g., to store the distance from the start of a road, power line, pipeline, etc.). This Wikipedia article on Linear Referencing [19] provides a good starting point if you're interested in learning more.
To define a column capable of storing Z values along with X and Y, use the "plain" POINT, LINESTRING and POLYGON data types rather than their "M" counterparts. The syntax for specifying an XYZ coordinate is the same as that for an XYM coordinate. The "plain" data type name tells PostGIS that the third coordinate is a Z value rather than an M value. For example, we could include sea level elevation in the coordinates for the Empire State Building (in feet):
POINT(-73.985744 40.748549 190)
Finally, in the event you want to store both Z and M values, again use the "plain" POINT, LINESTRING and POLYGON data types. The Z value should be listed third and the M value last. For example:
POINT(-73.985744 40.748549 190 9999)
PostGIS provides support for features with multiple parts through the MULTIPOINT, MULTILINESTRING, and MULTIPOLYGON data types. A classic example of multipart geometry is the state of Hawaii, which is composed of multiple disconnected islands. The syntax for specifying a MULTIPOLYGON builds upon the rules for a regular POLYGON; the parts are separated by commas and an additional set of parentheses is used to enclose the full coordinate list. The footprints of the World Trade Center Towers 1 and 2 (now fountains in the 9/11 Memorial) can be represented as a single multipart polygon as follows:
MULTIPOLYGON(((-74.013751 40.711976, -74.01344 40.712439, -74.012834 40.712191, -74.013145 40.711732, -74.013751 40.711976)), ((-74.013622 40.710772, -74.013311 40.711236, -74.012699 40.710992, -74.013021 40.710532, -74.013622 40.710772)))
This basic example shows the syntax for storing just X and Y coordinates. Keep in mind that Z values and M values are also supported for multipart geometries. As you might guess, the "MULTI" data types have "M" counterparts too: MULTIPOINTM, MULTILINESTRINGM and MULTIPOLYGONM.
The tables we've created so far reflect a bias toward Esri-centric design with each table storing a single column of homogeneous geometries (i.e. all points, or all lines, or all polygons, but not a mix). However, PostGIS supports two design approaches that are good to keep in mind when putting together a database:
Let's see how this heterogeneous column approach can be used to store all of our nyc_poi data in the same table.
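Before running the INSERT below, the mixed table needs a geometry column that is not constrained to a single geometry type. One way to set that up, sketched here under the assumption that the non-geometry columns mirror those of the earlier tables, is to pass the generic 'GEOMETRY' type to AddGeometryColumn():
CREATE TABLE nyc_poi.mixed (
    gid serial PRIMARY KEY,
    name varchar(50)
);
SELECT AddGeometryColumn('nyc_poi','mixed','geom',4269,'GEOMETRY',2);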
INSERT INTO mixed (name, geom) VALUES
('Empire State Building', ST_GeomFromText('POINT(-73.985744 40.748549)',4269)),
('Statue of Liberty', ST_GeomFromText('POINT(-74.044508 40.689229)',4269)),
('World Trade Center', ST_GeomFromText('POINT(-74.013371 40.711549)',4269)),
('Radio City Music Hall', ST_GeomFromText('POINT(-73.97988 40.760171)',4269)),
('Madison Square Garden', ST_GeomFromText('POINT(-73.993544 40.750541)',4269)),
('Holland Tunnel', ST_GeomFromText('LINESTRING(-74.036486 40.730121, -74.03125 40.72882, -74.011123 40.725958)',4269)),
('Lincoln Tunnel', ST_GeomFromText('LINESTRING(-74.019921 40.767119, -74.002841 40.759773)',4269)),
('Brooklyn Bridge', ST_GeomFromText('LINESTRING(-73.99945 40.708231, -73.9937 40.703676)',4269)),
('Central Park', ST_GeomFromText('POLYGON((-73.973057 40.764356, -73.981898 40.768094, -73.958209 40.800621, -73.949282 40.796853, -73.973057 40.764356))',4269));
At some point in this lesson, you probably thought to yourself, "This is fine, but what if I want to see the geometries?" Well, you can get a quick look at the geometries returned by a query in pgAdmin by clicking on the "eye" icon that appears on the right side of the geometry column header. But you'll likely want to go beyond this, for example, to utilize your geometries in the context of other data layers. That is the focus of the next part of the lesson, where we will use the third-party application Quantum GIS (QGIS) to view our PostGIS data.
Quantum GIS (QGIS, pronounced kyü'-jis) is a free and open-source desktop GIS package analogous to Esri's ArcMap/ArcGIS Pro. Because of its support for viewing PostGIS data and strong cartographic capabilities, QGIS and PostGIS are often found paired together. (OpenJUMP is another desktop application often used in combination with PostGIS, though its strengths are in spatial querying and geoprocessing.)
Let's see how QGIS can be used to view the tables we created and populated in the previous section.
"name" LIKE '%Tunnel'
The % character is the wildcard character in Postgres; we saw it was the * character in MS-Access in Lesson 1.
gid < 4
Note that you can build the expression graphically by expanding the Fields and Values and Operators lists, then double-clicking on items in those lists.
That completes our quick tour of QGIS. In the next section, we'll return to pgAdmin to see how queries can be saved for later re-use.
In Lesson 1, we saved a number of our MS-Access queries so that we could easily re-run them later and, in a couple of cases, to build a query upon another query rather than a table. In Postgres and other sophisticated RDBMSs, stored SQL statements like these are called views. In this section, we'll see how views can be created in Postgres.
SELECT * FROM usa.cities WHERE capital = 1 ORDER BY stateabb;
(Sorry Montpelier, I guess you were too small.)
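Wrapping this SELECT in a CREATE VIEW statement is what actually stores it for later re-use. A minimal sketch is shown below; the view name matches the usa.vw_capitals view queried a little later, though the exact statement used in the original exercise is my assumption:
CREATE VIEW usa.vw_capitals AS
SELECT * FROM usa.cities WHERE capital = 1 ORDER BY stateabb;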
Just as we saw in MS-Access, the records returned by views can be used as the source for a query.
SELECT * FROM usa.vw_capitals WHERE popclass = 2;
Views can also include spatial functions, or a combination of spatial and non-spatial criteria, in their definition. To demonstrate this, let's create views that re-project our states and cities data on the fly.
SELECT gid, name, pop2008, sub_region, ST_Transform(geom,2163) AS geom FROM usa.states;
SELECT *, ST_Transform(geom,2163) AS geom_2163 FROM usa.cities;
QGIS makes it possible to add both tables and views as layers. We'll take advantage of this feature now by creating layers from the views we just created.
This section showed how to save queries as views, which can then be utilized in the same way as tables. In the next section, we'll go into a bit more detail on the topic of spatial references.
As we’ve seen, populating a geometry column with usable data requires specifying the spatial reference of the data. We also saw that geometries can be re-projected from one spatial reference to another using the ST_Transform() function. In both cases, it is necessary to refer to spatial reference systems by an SRID (Spatial Reference ID). So, where do these IDs come from, and where can a list of them be found?
The answer to the question of where the IDs come from is that PostGIS uses the spatial reference IDs defined by the European Petroleum Survey Group (EPSG). As for finding the ID for a spatial reference you want to use, there are a few different options.
All of the spatial reference IDs are stored in a Postgres table in the public schema called spatial_ref_sys.
SELECT srid, srtext FROM spatial_ref_sys WHERE srtext LIKE '%Pennsylvania%';
This query shows the SRIDs of each Pennsylvania-specific spatial reference supported in PostGIS.
Another way to find SRIDs is to look them up in QGIS.
The Prj2EPSG website [22] provides an easy-to-use interface for finding EPSG IDs. As its name implies, it allows the user to upload a .prj file (used by Esri to store projection metadata) and get back the matching EPSG ID. The site also makes it possible to enter search terms. My test search for ‘pennsylvania state plane’ yielded some garbage matches, but also the ones that I would expect.
This service appears to be down as of 5/29/2020. I leave this section here in case anyone is aware of a web-based alternative. If so, please share with the class in the discussion forum!
We’ve seen that the public schema contains a table called spatial_ref_sys that stores all of the spatial references supported by PostGIS. Another important item in that schema is the geometry_columns view. Have a look at the data returned by that view and note that it includes a row for each geometry column in the database. Among the metadata stored here are the parent schema, the parent table, the geometry column’s name, the coordinate dimension, the SRID and the geometry type (e.g., POINT, LINESTRING, etc.). Being able to conduct spatial analysis with PostGIS requires accurate geometry column information, so the PostGIS developers have made these data accessible through a read-only view rather than a table.
Earlier in the lesson, we used the AddGeometryColumn() function instead of adding the geometry column through the table definition GUI. An important reason for adding the geometry column in that manner is that it updates the geometry metadata that you can see through the geometry_columns view, something that would not happen if we had used the GUI.
We’ll talk more about measuring lengths, distances, and areas in the next lesson, but while we’re on the topic of spatial references, it makes sense to consider 2D Cartesian measurement in the context of planimetric map data versus measurement in the context of the spherical surface of the Earth. For example, the PostGIS function ST_Distance() can be used to calculate the distance between two geometries. When applied to geometries of the type we’ve dealt with so far, ST_Distance() will calculate distances in 2D Cartesian space. This is fine at a local or regional scale, since the impact of the curvature of the earth at those scales is negligible, but, over a continental or global scale, a significant error would result.
PostGIS offers a couple of alternative approaches to taking the earth’s curvature into account. Let’s assume that we wanted to measure the distance between points in the (U.S.) cities table that we created earlier in the lesson. We could use a version of the ST_Distance() function called ST_DistanceSpheroid(). As its name implies, this function is designed to calculate the minimum distance between two geometries over the earth’s spheroid rather than in a 2D Cartesian plane.
The other approach is to store the features using a data type introduced in PostGIS 1.5 called geography. Unlike the geometry data type, the geography data type is meant for storing only latitude/longitude coordinates. The advantage of the geography data type is that measurement functions like ST_Distance(), ST_Length() and ST_Area() will return measures calculated in 3D space rather than 2D space. The disadvantage is that the geography data type is compatible with a significantly smaller subset of functions as compared to the geometry type. Calculating spherical measures can also take longer than Cartesian measures, since the mathematics involved is more complex.
The take-away message is that the geography data type can simplify data handling for projects that cover a continental-to-global scale. For projects covering a smaller portion of the earth’s surface, you are probably better off sticking with the geometry data type.
With that, we've covered all of the content for Lesson 3. In the next section, you'll find a project that will allow you to put what you've learned to use.
For Project 3, I would like you to map the hometowns of everyone on the class roster using Postgres/PostGIS and QGIS. Included in the data you downloaded at the beginning of the lesson were U.S. state and counties shapefiles, along with a comma-separated values file called postal_codes.txt that stores U.S. and Canadian postal codes and the coordinates of their centroids.
Right-click here to download the class roster with postal codes [23]. (If there are any students from outside the U.S. and Canada, they will appear at the bottom of the file with a special code. There will be a matching record at the bottom of the postal_codes.txt file.)
Here are the broad steps you should follow:
'POINT(' || lon || ' ' || lat || ')'
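One way to turn those strings into stored point geometries is sketched below. The table name postal_codes and the column names lon and lat are assumptions on my part; substitute whatever names you end up with after importing the CSV files:
SELECT AddGeometryColumn('public','postal_codes','geom',4269,'POINT',2);
UPDATE postal_codes
SET geom = ST_GeomFromText('POINT(' || lon || ' ' || lat || ')', 4269);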
This project is one week in length. Please refer to Canvas Calendar for the due date.
SELECT * FROM states WHERE pop2008 > 10000000;
SELECT * FROM cities WHERE capital = 1;
SELECT * FROM states WHERE name LIKE 'New%';
SELECT * FROM cities WHERE name LIKE '%z%';
SELECT * FROM states ORDER BY pop2008 DESC;
SELECT * FROM states ORDER BY sub_region, name;
SELECT * FROM states WHERE pop2008 > 10000000 AND sub_region = 'Pop';
SELECT * FROM cities WHERE stateabb IN ('US-NY','US-NJ','US-PA');
SELECT state, (white::double precision/total) * 100 AS pctwhite FROM census2010;
SELECT sub_region, Sum(pop2008) FROM states GROUP BY sub_region;
SELECT states.name, census2010.total, states.geom FROM states INNER JOIN census2010 ON states.name = census2010.state;
SELECT states.sub_region, Avg(census2010.male) FROM states INNER JOIN census2010 ON states.name = census2010.state GROUP BY states.sub_region;
The real power of a spatial database is in its ability to conduct spatial analysis. This lesson focuses on the many spatial functions that are made available in the Postgres environment by the PostGIS extension. These functions are categorized in Chapter 8 of the online PostGIS manual [25] based on the jobs they do:
We won't discuss every function in all of these categories, but we will go through most of the more useful ones with plenty of examples.
At the successful completion of this lesson, students should be able to:
If you have any questions now or at any point during this week, please feel free to post them to the Lesson 4 Discussion Forum.
Lesson 4 is one week in length. See the Canvas Calendar for specific due dates. To finish this lesson, you must complete the activities listed below:
AddGeometryColumn()
We used this function in the previous lesson. Recall that we used this function instead of adding the geometry column through the Postgres table definition GUI because PostGIS requires an entry in the geometry_columns view. AddGeometryColumn() handles this for us.
DropGeometryColumn()
As the name implies, this function does the opposite of AddGeometryColumn().
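For example, the geometry column added to nyc_poi.pts in the last lesson could be removed as follows (just a sketch of the syntax; only run it if you really want to drop the column):
SELECT DropGeometryColumn('nyc_poi','pts','geom');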
UpdateGeometrySRID()
This function is the equivalent of Esri’s Define Projection tool. Use it if a geometry column has its SRID defined incorrectly. As with the Define Projection tool, this function has no effect on coordinate values; it simply modifies the metadata for the geometry column.
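For example, if the SRID of the pts geometry column had been registered incorrectly, it could be corrected to 4269 with a statement like the one sketched below. As noted above, the coordinate values themselves are left untouched:
SELECT UpdateGeometrySRID('nyc_poi','pts','geom',4269);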
The constructor functions are used to create new geometries. While it’s natural to think of using these functions to populate a table’s geometry column, you should keep in mind that they can also be used to create short-lived geometries that are used only within a query. For example, you might use the ST_MakeEnvelope() constructor function to create a bounding box that you then use to select intersecting geometries in some table.
ST_GeomFromText()
We used this function throughout Lesson 3 to create various types of geometries. In the documentation, you may see that it takes Well-Known Text (WKT) as its input. The “Well-Known” in Well-Known Text refers to the fact that the text follows an industry standard for specifying geometries.
There are a number of other “FromText” functions (e.g., ST_LineFromText, ST_PointFromText) that produce the same result as ST_GeomFromText(). The difference in these functions is that they validate the input text to ensure that it matches the expected geometry type. Thus, ST_PointFromText() will return Null if it is passed something other than a POINT geometry. If you don’t see any benefit to this validation, then you’re better served using the more generic ST_GeomFromText() as it performs a bit better.
ST_GeogFromText()
At the end of Lesson 3, we talked about the geography data type. You can use ST_GeogFromText() to populate geography columns in the same way we used ST_GeomFromText() to populate geometry columns. Both functions allow for specifying or omitting the SRID of the coordinates. If omitted, ST_GeomFromText() makes no assumption about the SRID; it is set to 0. ST_GeogFromText(), on the other hand, assumes that the SRID is 4326 unless specified otherwise.
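A quick way to confirm this difference is to ask for the SRID of each result, as sketched below (ST_SRID() reports a geometry's SRID; the geography value is cast to geometry just so ST_SRID() can be applied to it):
SELECT ST_SRID(ST_GeomFromText('POINT(-73.985744 40.748549)')) AS geom_srid,            -- returns 0 (unknown)
       ST_SRID(ST_GeogFromText('POINT(-73.985744 40.748549)')::geometry) AS geog_srid;  -- returns 4326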
An alternative to creating geometries with ST_GeomFromText() is the family of “Make” functions: ST_MakePoint(), ST_MakeLine() and ST_MakePolygon(). These functions offer better performance than ST_GeomFromText() and their syntax is a bit more compact. So, why does ST_GeomFromText() even exist? Well, ST_GeomFromText() provides compliance with the Open Geospatial Consortium (OGC)’s Simple Features specification. The advantage to using ST_GeomFromText() is that other RDBMSs (Oracle, SQL Server) offer nearly identical functions; your familiarity with the function would transfer to those other environments. One drawback of the “Make” functions is that they do not take an SRID as an input, which results in the returned geometry having an undefined SRID. Thus, calls to these functions are normally nested inside a call to ST_SetSRID().
Given the advantages offered by the “Make” functions, let’s have a look at their use.
ST_MakePoint()
As you’d probably guess, this function accepts an X coordinate and a Y coordinate (and optionally Z and M values) as inputs. Here is how to add the Empire State Building to our nyc_poi.pts table using ST_MakePoint():
INSERT INTO pts (name, geom) VALUES ('Empire State Building', ST_SetSRID(ST_MakePoint(-73.985744, 40.748549),4269));
ST_MakeLine()
This function has three alternative syntaxes. The first simply accepts two point geometries as inputs and returns a straight line connecting the two. We could use this syntax to add the Lincoln Tunnel feature:
INSERT INTO lines (name, geom) VALUES ('Lincoln Tunnel', ST_SetSRID(ST_MakeLine(ST_MakePoint(-74.019921, 40.767119), ST_MakePoint(-74.002841, 40.759773)),4269));
The second syntax is used when the points that make up the line are stored in a table. The field containing the points is specified as the only input to the function. This example from the PostGIS documentation shows creating a line from a set of GPS points:
SELECT gps.gps_track, ST_MakeLine(gps.the_geom) As newgeom FROM (SELECT gps_track, gps_time, the_geom FROM gps_points ORDER BY gps_track, gps_time) As gps GROUP BY gps.gps_track;
In the example, a table called gps_points stores a track identifier (gps_track), the time (gps_time) and the point geometry (the_geom). The data held in those three columns are retrieved as a subquery with an alias of gps. The parent query groups by the track identifier and passes the point geometry field to the ST_MakeLine() function to create a line from the points.
The last ST_MakeLine() syntax accepts an array of points as its input. We could use this syntax to add the Holland Tunnel:
INSERT INTO lines (name, geom) VALUES ('Holland Tunnel', ST_SetSRID(ST_MakeLine(ARRAY[ST_MakePoint(-74.036486,40.730121), ST_MakePoint(-74.03125,40.72882), ST_MakePoint(-74.011123,40.725958)]),4269));
This example demonstrates the use of the Postgres ARRAY data type. Arrays are built using the ARRAY keyword followed by a list of items enclosed in square brackets.
The documentation shows a clever use of the array syntax, in which the centroids of polygons stored in a table called visit_locations are used as input to ST_MakeLine():
SELECT ST_MakeLine(ARRAY(SELECT ST_Centroid(the_geom) FROM visit_locations ORDER BY visit_time));
ST_MakePolygon()
This function takes a LINESTRING representing the polygon’s exterior ring as an input. Optionally, an array of interior ring LINESTRINGs can be included as a second input. Let’s have a look at an example of both, starting with the simpler case:
INSERT INTO polys (name, geom) VALUES ('Central Park',ST_SetSRID(ST_MakePolygon(ST_GeomFromText('LINESTRING(-73.973057 40.764356, -73.981898 40.768094, -73.958209 40.800621, -73.949282 40.796853, -73.973057 40.764356)')),4269));
In this example, I used ST_GeomFromText() to create the LINESTRING because supplying the string of points is much easier than if I had taken the ST_MakePoint() approach used for the Holland Tunnel example. In our previous uses of ST_GeomFromText(), we included the optional SRID argument but in this example I omitted it. Why? Because ST_MakePolygon() will return an SRID-less geometry no matter what, so it’s sensible to specify the SRID just once in the call to the ST_SetSRID() function.
And now, here’s an example that uses ST_MakePolygon() to cut out the reservoir from the Central Park polygon:
INSERT INTO polys (name, geom) VALUES ('Central Park',ST_SetSRID(ST_MakePolygon(ST_GeomFromText('LINESTRING(-73.973057 40.764356, -73.981898 40.768094, -73.958209 40.800621, -73.949282 40.796853, -73.973057 40.764356)'), ARRAY[ST_GeomFromText('LINESTRING(-73.966681 40.785221, -73.966058 40.787674, -73.9649 40.788291, -73.963913 40.788194, -73.963333 40.788291, -73.962539 40.788259, -73.962153 40.788389, -73.96181 40.788714, -73.961359 40.788909, -73.960887 40.788925, -73.959986 40.788649, -73.959492 40.788649, -73.958913 40.78873, -73.958269 40.788974, -73.957797 40.788844, -73.957497 40.788568, -73.957497 40.788259, -73.957776 40.787739, -73.95784 40.787057, -73.957819 40.786569, -73.960801 40.782394, -73.961145 40.78215, -73.961638 40.782036, -73.962518 40.782199, -73.963076 40.78267, -73.963677 40.783661, -73.965694 40.784457, -73.966681 40.785221)')]),4269));
ST_MakeEnvelope()
The ST_MakeEnvelope() function is used to create a rectangular box from a list of bounding coordinates: the box’s minimum x value, minimum y value, maximum x value, and maximum y value. While it’s rare that such a geometry would be used to depict a real-world feature, envelopes are often used as inputs to other functions (e.g., selecting all features that are within a bounding box). Here is an example that produces an envelope surrounding Pennsylvania:
SELECT ST_MakeEnvelope(-80.52, 39.72, -74.70, 42.27, 4269);
This example simply demonstrates the syntax of ST_MakeEnvelope(). Note that the SRID of the envelope is provided as the last input to the function. We’ll see a practical use for this envelope later in the lesson when we talk about the ST_Intersects() function.
Unlike the previous category of functions which were concerned with creating new geometries, this category involves functions used to retrieve information about geometries that already exist.
GeometryType()
This function returns the input geometry’s type as a string (e.g., ‘POINT’, ‘LINESTRING’, or ‘POLYGON’). It comes in particularly handy when dealing with a table of mixed geometries. Here we retrieve just the lines from the mixed nyc_poi table:
SELECT name FROM nyc_poi.mixed WHERE GeometryType(geom) = 'LINESTRING';
ST_X() and ST_Y()
These functions take a point as input and return its X or Y coordinate in numeric form. Similar functions exist for the M and Z coordinates as well. Here we get the coordinates of our nyc_poi pts data:
SELECT name, ST_X(geom), ST_Y(geom) FROM nyc_poi.pts;
ST_StartPoint() and ST_EndPoint()
These functions take a LINESTRING as input and return its first and last vertex, respectively. Here is an example based on our nyc_poi lines table:
SELECT name, ST_AsText(ST_StartPoint(geom)), ST_AsText(ST_EndPoint(geom)) FROM nyc_poi.lines;
Note that this example, and many others throughout this section, use the ST_AsText() function to output the returned geometry in a more human-friendly WKT.
ST_NPoints()
This function returns the number of points (vertices) that define the input geometry. Here we get the number of vertices from the states table:
SELECT name, ST_NPoints(geom) FROM usa.states ORDER BY name;
ST_Envelope()
This function accepts any type of geometry and returns that geometry’s minimum bounding box. Here we get the bounding box for Pennsylvania:
SELECT ST_AsText(ST_Envelope(geom)) FROM usa.states WHERE name = 'Pennsylvania';
ST_ExteriorRing()
This function takes a polygon as input and returns its exterior ring as a LINESTRING. Example:
SELECT name, ST_AsText(ST_ExteriorRing(geom)) FROM nyc_poi.polys;
ST_NumInteriorRings()
This function takes a polygon as input and returns the number of interior rings it contains. Example:
SELECT name, ST_NumInteriorRings(geom) FROM nyc_poi.polys;
ST_InteriorRingN()
This function takes a polygon and interior ring number as inputs and returns that ring as a LINESTRING. Note that the rings are numbered beginning with 1. This may seem obvious, but in many programming contexts items are numbered beginning with 0. Here we retrieve the Central Park reservoir ring:
SELECT name, ST_AsText(ST_InteriorRingN(geom,1)) FROM nyc_poi.polys;
ST_SetSRID()
This function, as we've already seen, is used to set the SRID of a geometry whose SRID is undefined or defined incorrectly. The statement below would re-set the SRID of the geometries in the states table to 4269:
SELECT ST_SetSRID(geom, 4269) FROM usa.states;
Just as the Define Projection tool in ArcMap only changes metadata and not the coordinate values themselves, ST_SetSRID() also has no effect on coordinate values. Where ST_SetSRID() differs from the Define Projection tool is that it is applied on an individual geometry basis rather than on a table basis. To re-project geometries into different coordinate systems (changing both the data and metadata), use the ST_Transform() function.
ST_Transform()
We worked with this function in Lesson 3. If ST_SetSRID() is analogous to the Define Projection tool, then ST_Transform() is analogous to the Project tool. An important thing to remember about ST_Transform() is that it leaves the underlying geometry unchanged when used in a SELECT query. If you want to store the transformed version of the geometry in a table, you should use ST_Transform() in an UPDATE or INSERT query.
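Since a SELECT leaves the stored geometries untouched, one way to persist a re-projected copy is to add a second geometry column and populate it with an UPDATE. The sketch below assumes the states geometries are stored as POLYGON, as described in Lesson 3 (use MULTIPOLYGON, or the generic GEOMETRY type, if your import produced multipart shapes); the geom_2163 column name is my invention:
SELECT AddGeometryColumn('usa','states','geom_2163',2163,'POLYGON',2);
UPDATE usa.states SET geom_2163 = ST_Transform(geom, 2163);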
PostGIS offers a number of functions for converting geometries between different forms. We saw ST_AsText() in a few examples from earlier in the lesson. Here we'll look at a few others.
ST_AsBinary()
This function outputs the geometry in Well-Known Binary (WKB) format, as laid out in the OGC specification. Outputting geometries in this format is sometimes necessary to interoperate with third-party applications. Here we output the NYC points in WKB format:
SELECT ST_AsBinary(geom) FROM nyc_poi.pts;
ST_AsEWKB()
One of the shortcomings of ST_AsBinary() is that it doesn't provide the geometry's SRID as part of its output. That's where the ST_AsEWKB() function comes in (the "E" in "EWKB" stands for extended). Note that the SRID is binary-encoded just like the coordinates, so you shouldn't expect to be able to read the SRID in the output. Here we use ST_AsEWKB() on the same NYC points:
SELECT ST_AsEWKB(geom) FROM nyc_poi.pts;
ST_AsText()
As we've seen in a number of examples, this function can be used to output geometries in a human-readable format.
ST_AsEWKT()
Just as ST_AsBinary() omits the SRID from its output, so does ST_AsText(). ST_AsEWKT() outputs the same text as ST_AsText(), but it also includes the SRID. For example:
SELECT ST_AsEWKT(geom) FROM nyc_poi.pts;
Other output functions
Other formats supported by PostGIS include GeoJSON (ST_AsGeoJSON), Geography Markup Language (ST_AsGML), Keyhole Markup Language (ST_AsKML) and scalable vector graphics (ST_AsSVG). Consult the PostGIS documentation [27] for details on using these functions.
The functions in this category are the big ones in terms of providing the true power of a GIS. (So pay attention!)
ST_Contains()
This function takes two geometries as input and determines whether or not the first geometry contains the other. The example below selects each city in the state of New York and checks to see if it is contained by the bounding box representing the bounds of Pennsylvania, which we created earlier using ST_MakeEnvelope(). This query should return True values for the border cities of Binghamton, Elmira and Jamestown, and False for all other cities.
SELECT name, ST_Contains(ST_MakeEnvelope(-80.52, 39.72, -74.70, 42.27, 4269),geom) FROM usa.cities WHERE stateabb = 'US-NY';
ST_Within()
The converse of the ST_Contains() function is ST_Within(), which determines whether or not the first geometry is within the other. Thus, you could obtain the same results returned by ST_Contains() by reversing the geometries:
SELECT name, ST_Within(geom,ST_MakeEnvelope(-80.52, 39.72, -74.70, 42.27, 4269)) FROM usa.cities WHERE stateabb = 'US-NY';
ST_Covers()
This function will return the same results as ST_Contains() in most cases. To illustrate the difference between the two functions, imagine a road segment that is exactly coincident with a county boundary (i.e., the road forms the boundary between two counties). If the road segment and county geometries were fed to the ST_Contains() function, it would return False. The ST_Covers() function, on the other hand, would return True.
ST_CoveredBy()
This function is to ST_Covers() as ST_Within() is to ST_Contains().
ST_Intersects()
This function determines whether or not two geometries share the same space in any way. Unlike ST_Contains(), which tests whether or not one geometry is fully within another, ST_Intersects() looks for intersection between any parts of the geometries. Returning to the road/county example, a road segment that is partially within a county and partially outside of it would return False using ST_Contains(), but True using ST_Intersects().
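As promised back in the ST_MakeEnvelope() discussion, here is one way the Pennsylvania envelope might be put to practical use: a sketch that returns the cities falling within (or touching) that bounding box:
SELECT name, stateabb
FROM usa.cities
WHERE ST_Intersects(geom, ST_MakeEnvelope(-80.52, 39.72, -74.70, 42.27, 4269));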
ST_Disjoint()
This function is the converse of ST_Intersects(). It returns True if the two geometries share no space, and False if they intersect.
ST_Overlaps()
This function is quite similar to ST_Intersects with a couple of exceptions: a. the geometries must be of the same dimension (i.e., two lines or two polygons), and b. one geometry cannot completely contain the other.
ST_Touches()
This function returns True if the two geometries are tangent to one another but do not share any interior space. If the geometries are disjoint or overlapping, the function returns False. Two neighboring land parcels would return True when fed to ST_Touches(); a county and its parent state would yield a return value of False.
ST_DWithin()
This function performs "within a distance of" logic, accepting two geometries and a distance as inputs. It returns True if the geometries are within the specified distance of one another, and False if they are not. The example below reports on whether or not features in the NYC pts table are within a distance of 2 miles (5280 feet x 2) of the Empire State Building.
SELECT ptsA.name, ptsB.name, ST_DWithin(ST_Transform(ptsA.geom,2260),ST_Transform(ptsB.geom,2260),5280*2) FROM pts AS ptsA, pts AS ptsB WHERE ptsA.name = 'Empire State Building';
Some important aspects of this query are:
ST_DFullyWithin()
This function is similar to ST_DWithin(), with the difference being that ST_DFullyWithin() requires each point that makes up the two geometries to be within the search distance, whereas ST_DWithin() is satisfied if any of the points comprising the geometries are within the search distance. The example below demonstrates the difference by performing a cross join between the NYC pts and polys.
SELECT pts.name, polys.name, ST_DWithin(ST_Transform(pts.geom,2260),ST_Transform(polys.geom,2260),5280*2), ST_DFullyWithin(ST_Transform(pts.geom,2260),ST_Transform(polys.geom,2260),5280*2) FROM pts CROSS JOIN polys WHERE pts.name = 'Empire State Building';
ST_DWithin() reports that the Empire State Building and Central Park are within 2 miles of each other, whereas ST_DFullyWithin() reports that they are not (because part of the Central Park polygon is greater than 2 miles away). Note that this query shows an alternative syntax for specifying a cross join in Postgres.
ST_Area()
The key point to remember with this function is to use it on a geometry that is suitable for measuring areas. As we saw in Lesson 3, the ST_Transform() function can be used to re-project data on the fly if it is not stored in an appropriate projection.
ST_Area() can be used on both geometry and geography data types. Though geography objects are in latitude/longitude coordinates by definition, ST_Area() is programmed to return area values in square meters when a geography object is passed to it. By default, the area will be calculated using the WGS84 spheroid. This can be costly in terms of performance, so the function has an optional use_spheroid parameter. Setting that parameter to false causes the function to use a much simpler, but less accurate, sphere.
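To make the units concrete, the sketch below compares the two approaches for Pennsylvania: a planar area computed after re-projecting to the equal-area spatial reference 2163, and a spheroidal area computed by re-projecting to 4326 and casting to geography. Both results are divided by 1,000,000 to convert square meters to square kilometers.
SELECT name,
       ST_Area(ST_Transform(geom, 2163)) / 1000000 AS sq_km_planar,
       ST_Area(ST_Transform(geom, 4326)::geography) / 1000000 AS sq_km_spheroid
FROM usa.states
WHERE name = 'Pennsylvania';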
ST_Centroid()
See Lesson 3 for example usages of this function.
ST_Distance()
This function calculates the 2D (Cartesian) distance between two geometries. It should only be used at a local or regional scale when the curvature of the earth's surface is not a significant factor. The example below again uses a cross join between the NYC pts table and itself to compute the distance in miles between the Empire State Building and the other features in the table:
SELECT ptsA.name, ptsB.name, ST_Distance(ST_Transform(ptsA.geom,2260),ST_Transform(ptsB.geom,2260))/5280 FROM pts AS ptsA CROSS JOIN pts AS ptsB WHERE ptsA.name = 'Empire State Building';
The ST_Distance() function can also be used to calculate distances between geography data types. If only geography objects are supplied in the call to the function, the distance will be calculated based on a simple sphere. For a more accurate calculation, an optional use_spheroid argument can be set to True, as we saw with ST_Area().
ST_DistanceSpheroid() and ST_DistanceSphere()
These functions exist to provide for high-accuracy distance measurement when the data are stored using the geometry data type (rather than geography) and the distance is large enough for the earth's curvature to have an impact. They essentially eliminate the need to transform lat/long data stored as geometries prior to using ST_Distance(). The example below illustrates the use of both functions to calculate the distance between Los Angeles and New York.
SELECT cityA.name, cityB.name, ST_DistanceSphere(cityA.geom,cityB.geom)/1000 AS dist_sphere, ST_DistanceSpheroid(cityA.geom,cityB.geom,'SPHEROID["GRS 1980",6378137,298.257222101]')/1000 AS dist_spheroid FROM cities AS cityA CROSS JOIN cities AS cityB WHERE cityA.name = 'Los Angeles' AND cityB.name = 'New York';
Note that the Spheroid function requires specification of a spheroid. In this case, the GRS80 spheroid is used because it is associated with the NAD83 GCS. Other spheroid specifications can be found in the spatial_ref_sys table in the public schema. You can query that table like so:
SELECT srtext FROM spatial_ref_sys WHERE srid = 4326;
The query above returns the description of the WGS84 GCS, including its spheroid parameters. These parameters could be copied for use in the ST_DistanceSpheroid() function as in the example above.
ST_Length()
This function returns the length of a linestring. The length of polygon outlines is provided by ST_Perimeter(); see below. As with measuring distance, be sure to use an appropriate spatial reference. Here we get the length of the features in our NYC lines table in feet:
SELECT name, ST_Length(ST_Transform(geom,2260)) FROM lines;
As with the ST_Distance() function, ST_Length() accepts the geography data type as an input and can calculate length using either a sphere or spheroid.
ST_3DLength()
This function is used to measure the lengths of linestrings that have a Z dimension.
ST_LengthSpheroid()
Like the ST_DistanceSpheroid() function, this function is intended for measuring the lengths of lat/long geometries without having to transform to a different spatial reference. It can be used on 2D or 3D geometries.
ST_Perimeter()
This function is used to measure the length of a polygon's perimeter. Here we obtain the perimeter of Central Park:
SELECT name, ST_Perimeter(ST_Transform(geom,2260)) FROM polys;
Note that the returned length will include the perimeter of both the exterior ring and any interior rings. For example, the function returns a length of just over 6 miles for the version of Central Park without the lake and just under 8 miles for the version with the lake.
ST_3DPerimeter()
This function is used to measure the perimeter of polygons whose boundaries include a Z dimension.
Before moving on to Project 4, try your hand at the following practice exercises. Solutions can be found at the bottom of the page.
Solutions [28]
INSERT INTO nyc_poi.pts (name, geom) VALUES ('Times Square', ST_SetSRID(ST_MakePoint(-73.985727,40.757685),4269));
OR
INSERT INTO nyc_poi.pts (name, geom) VALUES ('Times Square', ST_GeomFromText('POINT(-73.985727 40.757685)',4269));
SELECT name, ST_NPoints(geom) FROM states INNER JOIN census2010 ON states.name = census2010.state WHERE census2010.total > 10000000;
SELECT name, ST_Transform(geom,32015) FROM nyc_poi.pts;
OR for human-readable geom:
SELECT name, ST_AsText(ST_Transform(geom,32015)) FROM nyc_poi.pts;
SELECT name, ST_AsEWKT(geom) FROM usa.cities;
SELECT name, ST_AsText(ST_Centroid(geom)) FROM states INNER JOIN census2010 ON states.name = census2010.state WHERE male > female;
SELECT states.name FROM states CROSS JOIN cities WHERE ST_Contains(states.geom, cities.geom) AND cities.name = 'Springfield';
OR
SELECT states.name FROM states CROSS JOIN cities WHERE ST_Covers(states.geom, cities.geom) AND cities.name = 'Springfield';
SELECT cities.name, cities.stateabb FROM states CROSS JOIN cities WHERE ST_Within(cities.geom, states.geom) AND states.sub_region='Soda' ORDER BY cities.stateabb, cities.name;
OR
SELECT cities.name, cities.stateabb FROM states CROSS JOIN cities WHERE ST_Intersects(cities.geom, states.geom) AND states.sub_region='Soda' ORDER BY cities.stateabb, cities.name;
SELECT lines.name FROM lines CROSS JOIN pts WHERE pts.name = 'Madison Square Garden' AND ST_DWithin(ST_Transform(lines.geom,2260),ST_Transform(pts.geom,2260),5280*1);
SELECT cityA.name, cityB.name, ST_DistanceSphere(cityA.geom, cityB.geom) / 1000 AS dist_km FROM cities AS cityA CROSS JOIN cities AS cityB WHERE (cityA.capital = 1 AND cityB.capital = 1) AND (cityA.name != cityB.name);
SELECT * FROM spatial_ref_sys WHERE srtext LIKE '%Pennsylvania North%';
The correct spatial reference has an SRID of 2271.
SELECT DISTINCT states.name FROM states CROSS JOIN cities WHERE ST_Contains(states.geom, cities.geom) AND cities.popclass >= 4;
If you completed this exercise on your own, you probably had Texas appear in your results 3 times. Remember that a cross join creates a result set that pairs every row from table A with every row from table B. The WHERE clause narrows down the cities in the output to the 9 having a popclass of 4 or 5, but 3 of those 9 cities are in Texas. That explains why the query returns Texas 3 times. The answer in a case like this is to insert the DISTINCT keyword after SELECT, which ensures that none of the output rows are duplicated.
If you took GEOG 483, the first project had you finding the best locations for "Jen and Barry" to open an ice cream business. We're going to revisit that scenario for this project. Your task is to import the project shapefiles into a PostGIS schema and then write a series of SQL statements that automate the site selection process. If you need a new copy of the data, you can download and unzip the Project 4 Data [26] file. Here's a reminder of Jen and Barry's original selection criteria:
You should narrow the cities down to 9, based on the county- and city-level criteria. After evaluating the interstate and recreation area criteria (which are a bit more difficult), that should get you down to 4 cities.
This project is one week in length. Please refer to the Canvas Calendar for the due date.
Critical to maintaining spatial databases is the job of ensuring the integrity of your spatial and nonspatial data. There are several behaviors that you can impose upon your data that will give you access to robust methods of assessing the integrity of existing data and of ensuring the integrity of newly created data. This lesson will expose you to some of those techniques.
Note that there are two sets of pages in this lesson; the first written for ArcGIS Pro and the second for ArcGIS Desktop (ArcMap). I strongly encourage you to work through the ArcGIS Pro pages since it is the more modern of the two packages. If you do that, there is no need to also work through the ArcMap pages, unless you're particularly interested. But if you have good reason to avoid Pro, such as having no experience with it or working in an environment where you're required to use ArcMap instead, then you have that option.
The point has been made that best practice is to think through your database design before you implement it. That philosophy certainly holds for what we will be covering in this lesson. However, in order to illustrate the concepts that we want to cover, it is advantageous to already have some existing data, both spatial and attribute. The scope of this lesson does not include a lot of tracing of spatial features and entering of attribute values. So, we will be applying the various techniques to existing data.
At the successful completion of this lesson, students should be able to:
Conversation and comments in this course will take place within the course discussion forums.
Lesson 5 is one week in length. See the Canvas Calendar for specific due dates. To finish this lesson, you must complete the activities listed below:
In Geography 484: GIS Database Development, you were introduced to attribute domains; however, it is possible to go through that course without having actually implemented them. So we will go over Coded Value Domains and Range Domains in the following section. Some of the material will be review for some of you, but even if you did work with attribute domains in Geography 484, you are apt to be exposed to functionality associated with them that was not covered in that course.
Follow this link to download the data for this part of the lesson: AttributeDomains.zip [29]
The zip archive contains two Esri File Geodatabases:
A coded value domain allows you to choose from a list of values that are allowed in a field. A range domain allows you to specify a valid range for values in a numeric field. Both types of attribute domain thus provide means of enforcing data integrity.
Additional behaviors that can accompany attribute domains allow us to define what values get assigned to the field when a feature is split into two or when two features are merged into one. According to the Esri documentation, "...merge policies are not evaluated when merging features in the Editor. However, developers can take advantage of merge policies when writing their own merge implementations." The scope of this lesson does not include developer-level exercises, so we will visit only the use of split policies.
For the homework deliverable for this part of the lesson, I am going to have you create two attribute domains to be assigned to existing fields in a feature class attribute table.
The data you will use is found in the AttributeDomainsProject.gdb file geodatabase. That geodatabase contains a single feature class named BuildingFootprints.
Because attribute domains are stored in the geodatabase, they can be used by any dataset within the geodatabase: a feature class, a table, or a subtype (covered in the next part of the lesson).
You can now move on to the Subtypes part of the lesson.
In a situation where we have a lot of similar features, roads for example, that we want to easily assign separate attribute values to, symbolize differently, etc., we can designate them as what Esri calls subtypes. The alternative for managing such data would be to create a separate feature class for, in this example, each type of road, a practice that can degrade the performance and ease of use of the database.
Subtypes can be created for spatial features in a feature class or for objects (records) in a nonspatial, standalone table. Subtypes can be based on an existing field, or a new field can be created to hold the subtype values.
A subtype is defined by a code and a description. The codes are stored in a field in the attribute table and must be either Short or Long Integer values. Each code value is equated to a descriptive category. Sounds a lot like a coded value domain but, as you will see, they are not the same.
In our exercise example, we will be creating road subtypes from integer values that represent six different road type categories.
Follow this link to download the data for this part of the lesson: Subtypes.zip [33]
The zip archive contains two Esri file geodatabases:
We are going to place roads in subtype categories according to the values in an existing field that contains integer values that represent six road type categories.
So instead of creating separate feature classes for the different road types, we put the road line features into subtypes, which gave us the ability to specify unique default values for new features in each road type category and to assign (or not assign) particular attribute domains to each category.
As was mentioned above, subtypes can be created for objects (records) in a nonspatial table. In this homework project, that is what you will do.
Investigate the field structure and attribute data in the Encounters table residing in the SubtypesProject.gdb geodatabase. In addition to the OBJECTID field that is automatically created by the GIS, you will find the following five fields: ID, Encounter, Category, EducationLevel, and BloodAlcohol.
The values of 1 and 2 in the Encounter field represent Distant Encounters and Close Encounters respectively.
The descriptions of the values in the Category field are as follows:
The values in the EducationLevel field are the number of years of school attended.
The values in the BloodAlcohol field are blood alcohol content measures (see the Wikipedia Blood Alcohol Content page [35]: http://en.wikipedia.org/wiki/Blood_alcohol_content).
Here is what I want you to do for this project:
A geodatabase topology is another construct that is stored within a geodatabase and gives us added control over assessing and maintaining the integrity of our spatial data.
Follow this link to download the data for this part of the lesson: Topology.zip [36]
The zip archive contains the following:
An Esri File Geodatabase: geodatabasetopol.gdb
A zip archive: TopologyProject.zip
A geodatabase topology provides a robust way of defining topological relationships among spatial features. It does so by analyzing coordinate locations of feature vertices both among features within a feature class and between features in multiple feature classes taking part in the topology. Therefore, it is not only important that all of the feature classes participating in a geodatabase topology be in the same coordinate system, but also that the measurement precision defined for each feature class be the same. To assure that this is the case, all feature classes that take part in a geodatabase topology must reside within what is known as a Feature Dataset. When a feature dataset is created, the coordinate system and precision are defined, and any subsequent feature class that is added to the feature dataset inherits that coordinate system and precision. In the exercise that follows, you will see that the precision is controlled by the Tolerance and Resolution settings. I encourage you to read more about these topics in the Topology in ArcGIS [37] entry in the ArcGIS Pro documentation.
A geodatabase topology is governed by topology "rules" that the user specifies, and those rules are based on knowledge of the relationships between and among the features that will be taking part in the topology. So, the onus is on the user to understand the data being created/edited in order that appropriate rules are specified.
The manifestation of a geodatabase topology is as a layer in the feature dataset. As such, the topology errors that it contains are symbolized just as are the features in any other map layer in ArcGIS. In the documentation, the Validate and fix geodatabase topology [38] topic provides an overview of the error fixing process along with links to the rules available for points, polylines, and polygons. If you are inclined to adorn your walls with GIS help guides, you may want to print the topology rules poster [39]. Whether you print it out or not, it offers a bit more in the way of graphic description and examples of the rules.
Once topology rules have been imposed on the data in a feature dataset, errors are discovered by "validating" the topology. Validation can be done on the entire visible extent of the data or on a smaller specified area. The latter technique allows you to check just an area that you have been editing, rather than the entire dataset. This can save time when the entire dataset is large.
As is mentioned above, one needs to be aware of how the features involved in a geodatabase topology relate to each other in order to be able to define appropriate topology rules to govern the creation of spatial data and aid in discovering errors in existing data. In the following exercise, we will be working with the data depicted in the image below, and I will be the arbitrary source of what is known about the relationships among the features in the four feature classes involved. We will be basing our choices of rules on the following:
With a geodatabase topology in place, there are certain tools at your disposal that enable you to maintain topological relationships when you make edits to spatial data. Here, we contrast editing a shared polygon boundary with and without having a geodatabase topology set up. It is important to realize that the "shared" boundary between two polygons is actually a duplicate set of line segments; each polygon's boundary exists as complete and independent from the adjacent polygon's. So, when a shared boundary needs to be edited, one must actually edit the boundaries of two polygons.
Because this is not a course in inputting and editing spatial and attribute data, we are choosing to focus on what needs to be done to prepare to implement a geodatabase topology. There will be some errors in spatial data to repair, but it involves going over ground already covered in section A of the lesson.
Unzip the TopologyProject.zip archive. The archive contains a folder named TopologyProject. Within that folder is a geodatabase (TopologyProject.gdb) and a georeferenced image of a map (with ancillary files).
The BuildingFootprints feature class contains some instances of polygon overlap that need to be repaired.
In order for you to accomplish finding and repairing the errors in the BuildingFootprints feature class, you are going to have to create and employ a geodatabase topology. You know from section A how to create a geodatabase topology, and you know that in order to do so, the data in question has to reside within a feature dataset. What we did not explicitly go over in the lesson was how to (1) create a feature dataset, and (2) get existing data, in this case the feature class contained in the TopologyProject.gdb, into a feature dataset that we create. But that's what I want you to do. Given that Geography 484 or comparable experience was the prerequisite for taking this course, you should be able to do it.
Once you have accomplished that, proceed to find and correct the overlapping building footprint polygons. All of the offending features will be in the area covered by the included georeferenced map image. You can use it as reference to make sure you are performing the corrections to the polygons correctly.
When you finish, zip up your version of the TopologyProject.gdb and upload it to the Lesson 5 Drop Box.
In Geography 484: GIS Database Development, you were introduced to attribute domains; however, it is possible to go through that course without having actually implemented them. So we will go over Coded Value Domains and Range Domains in the following section. Some of the material will be review for some of you, but even if you did work with attribute domains in Geography 484, you are apt to be exposed to functionality associated with them that was not covered in that course.
Follow this link to download the data for this part of the lesson: AttributeDomains.zip [29]
The zip archive contains two Esri File Geodatabases:
A coded value domain allows you to choose from a list of values that are allowed in a field. A range domain allows you to specify a valid range for values in a numeric field. Both types of attribute domain thus provide means of enforcing data integrity.
Additional behaviors that can accompany attribute domains allow us to define what values get assigned to the field when a feature is split into two or when two features are merged into one. According to the Esri documentation, "...merge policies are not evaluated when merging features in the Editor. However, developers can take advantage of merge policies when writing their own merge implementations." The scope of this lesson does not include developer-level exercises, so we will visit only the use of split policies.
For the homework deliverable for this part of the lesson, I am going to have you create two attribute domains to be assigned to existing fields in a feature class attribute table.
You will use the behavior functionality that the domains give you to find several attribute errors that reside in the dataset.
The data you will use is found in the AttributeDomainsProject.gdb file geodatabase. That geodatabase contains a single feature class named BuildingFootprints.
Because attribute domains are stored in the geodatabase, they can be used by any dataset within the geodatabase: a feature class, a table, or a subtype (covered in the next part of the lesson).
You can now move on to the Subtypes part of the lesson.
When we have a lot of similar features, roads for example, that we want to easily assign separate attribute values to, symbolize differently, and so on, we can designate them as what are called Subtypes. The alternative for managing such data and information would be to create a separate feature class for, in this example, each type of road, a practice that can degrade the performance and ease of use of the database.
Subtypes can be created for spatial features in a feature class or for objects (records) in a nonspatial, standalone table. Subtypes can be based on an existing field, or a new field can be created to hold the subtype values.
A subtype is defined by a code and a description. The codes are stored in a field in the attribute table and must be either Short or Long Integer values. Each code value is equated to a descriptive category. Sounds a lot like a coded value domain but, as you will see, they are not the same.
In our exercise example, we will be creating road subtypes from integer values that represent six different road type categories.
Follow this link to download the data for this part of the lesson: Subtypes.zip [33]
The zip archive contains two Esri File Geodatabases:
We are going to place roads in subtype categories according to the values in an existing field that contains integer values that represent six road type categories.
So, instead of creating separate feature classes for the different road types, we put the road line features into subtypes, which gave us the ability to specify unique default values for each new feature of a given road type category and to assign (or not) certain attribute domains to each category.
As was mentioned above, subtypes can be created for objects (records) in a nonspatial table. In this homework project, that is what you will do.
Investigate the field structure and attribute data in the Encounters table residing in the SubtypesProject.gdb geodatabase. In addition to the OBJECTID field that is automatically created by the GIS, you will find the following five fields: ID, Encounter, Category, EducationLevel, and BloodAlcohol.
The values of 1 and 2 in the Encounter field represent Distant Encounters and Close Encounters respectively.
The descriptions of the values in the Category field are as follows:
The values in the EducationLevel field are the number of years of school attended.
The values in the BloodAlcohol field are blood alcohol content measures (see the Wikipedia Blood Alcohol Content page [35]: http://en.wikipedia.org/wiki/Blood_alcohol_content).
Here is what I want you to do for this project:
A geodatabase topology is another construct that is stored within a geodatabase and gives us added control over assessing and maintaining the integrity of our spatial data.
Follow this link to download the data for this part of the lesson: Topology_ArcMap.zip [40]
The zip archive contains the following:
An Esri Personal Geodatabase: geodatabasetopol.mdb
A zip archive: TopologyProject.zip
A geodatabase topology provides a robust way of defining topological relationships among spatial features. It does so by analyzing coordinate locations of feature vertices, both among features within a feature class and between features in multiple feature classes taking part in the topology. Therefore, it is not only important that all of the feature classes participating in a geodatabase topology be in the same coordinate system, but also that the measurement precision defined for each feature class be the same. To assure that this is the case, all feature classes that take part in a geodatabase topology must reside within what is known as a Feature Dataset. When a feature dataset is created, the coordinate system and precision are defined, and any subsequent feature class that is added to the feature dataset inherits that coordinate system and precision. In the exercise that follows, you will see that the precision is controlled by the Tolerance and Resolution settings. I encourage you to read more about these topics in the Desktop Help when you have time. Search on "topology rules", and then select the "Topology in ArcGIS" topic entry.
A geodatabase topology is governed by topology "rules" that the user specifies, and those rules are based on knowledge of the relationships between and among the features that will be taking part in the topology. So, the onus is on the user to understand the data being created/edited in order that appropriate rules are specified. Rules can be added to and taken away from a topology as long as the data is not part of an active ArcMap editing session.
The manifestation of a geodatabase topology is as a layer in the feature dataset. As such, the topology errors that it contains are symbolized just as are the features in any other map layer in ArcMap. In the Desktop Help, search on "topology rules", and then select the "Geodatabase topology rules and topology error fixes" topic entry. This will take you to a compilation of the types of rules that can be defined along with suggestions for ways to fix violations of each rule and a description of how the errors are symbolized in the map data frame, as Points, Lines, and Areas. If you are inclined to adorn your walls with GIS help guides, you will find a link to a topology rules poster in the opening passage to the Help page I just referred you to. Whether you print it out or not, it offers a bit more in the way of graphic description and examples of the rules.
Once topology rules have been imposed on the data in a feature dataset, errors are discovered by "validating" the topology. Validation can be done on the entire visible extent of the data or on a smaller specified area. The latter technique allows you to check just an area that you have been editing, rather than the entire dataset. This can save time when the entire dataset is large.
As is mentioned above, one needs to be aware of how the features involved in a geodatabase topology relate to each other in order to be able to define appropriate topology rules to govern the creation of spatial data and aid in discovering errors in existing data. In the following exercise, we will be working with the data depicted in the image below, and I will be the arbitrary source of what is known about the relationships among the features in the four feature classes involved. We will be basing our choices of rules on the following:
With a geodatabase topology in place, there are certain tools at your disposal that enable you to maintain topological relationships when you make edits to spatial data. Here, we contrast editing a shared polygon boundary with and without having a geodatabase topology set up. It is important to realize that the "shared" boundary between two polygons is actually a duplicate set of line segments; each polygon's boundary exists as complete and independent from the adjacent polygon's. So, when a shared boundary needs to be edited, one must actually edit the boundaries of two polygons.
Because this is not a course in inputting and editing spatial and attribute data, we are choosing to focus on what needs to be done to prepare to implement a geodatabase topology. There will be some errors in spatial data to repair, but it involves going over ground already covered in section A of the lesson.
Unzip the TopologyProject.zip archive. The archive contains a folder named TopologyProject. Within that folder is a map document (TopologyProject.mxd file), a geodatabase (TopologyProject.gdb) and a georeferenced image of a map (with ancillary files). When you unzip the archive, keep the folder and its contents intact. The map document is set to maintain relative paths to the data that it points to, so after you have extracted the folder and you navigate into it and double-click on the TopologyProject.mxd, the map image and the single feature class contained in the geodatabase should be present in the resulting ArcMap session.
The BuildingFootprints feature class contains some instances of polygon overlap that need to be repaired.
In order for you to accomplish finding and repairing the errors in the BuildingFootprints feature class, you are going to have to create and employ a geodatabase topology. You know from section A how to create a geodatabase topology, and you know that in order to do so, the data in question has to reside within a feature dataset. What we did not explicitly go over in the lesson was how to (1) create a feature dataset, and (2) how to get existing data, in this case the feature class contained in the TopologyProject.gdb, into a feature dataset that we create. But that's what I want you to do. Given that Geography 484 or comparable experience was the prerequisite for taking this course, you should be able to do it.
Once you have accomplished that, proceed to find and correct the overlapping building footprint polygons. All of the offending features will be in the area covered by the included georeferenced map image. You can use it as a reference to make sure you are performing the corrections to the polygons correctly.
When you finish, zip up your version of the TopologyProject.gdb and upload it to the Lesson 5 Drop Box.
This project is one week in length. Please refer to the Canvas Calendar for the due date.
The various Project deliverables were described along the way throughout the lesson. Below is a list summarizing what you need to submit.
Teaching enterprise geodatabase concepts online has historically been a challenge because it's been difficult to provide students with a "playground" to get hands-on experience. Fortunately, today it is much easier to give students hands-on experience with enterprise geodatabases through virtual machines hosted by Amazon. Over the next three weeks, you will use this Amazon cloud-based solution to set up and experiment with your own remote enterprise geodatabase server.
Objectives
At the successful completion of this lesson, students should be able to:
If you have any questions now or at any point during this week, please feel free to post them to the Lesson 6 Discussion Forum.
Lesson 6 is one week in length. See the Canvas Calendar for specific due dates. To finish this lesson, you must complete the activities listed below. You may find it useful to print this page out first so that you can follow along with the directions.
Using the Amazon service mentioned above, Esri makes it possible to set up cloud-based installations of ArcGIS Enterprise, software required for implementing an enterprise geodatabase. Because our Cloud and Server GIS course also guides students through putting together an instance of ArcGIS Enterprise, the instructions below are borrowed from that course. While our focus in this course will be on the enterprise geodatabase that we'll implement on the instance with ArcGIS Enterprise, a side benefit is that you'll also be able to experiment with ArcGIS Enterprise functionality if you like.
Go to Amazon Web Services [41] and click on "Create an AWS Account."
If prompted to create a Business or Personal account, choose Personal. Please make careful note of the password you select when setting up your account; you will be needing it. It is characteristic of Amazon Web Services that things work the way they are supposed to, but you don't get a lot of hand-holding, so if you lose your password, I'm not sure it would be easy to recover. First, you will need to fill out a form with personal information.
Second, you will need to provide payment information, including a credit card number. If you are careful and follow the course instructions about explicitly stopping your instance (virtual machine) when you are not using it, you should be able to complete the coursework while incurring charges of $20-$40. The current step of signing up is free, but you should be aware that you will start being charged immediately upon starting to use AWS services. As part of this step, you'll be asked to select a support plan. The Basic (Free) support is all you need for this class.
Third, there is an identity confirmation step during which you will receive a phone call and enter a code. This ensures you are a human. Amazon does not use the phone number provided here for other purposes.
You can monitor your billing status by clicking your user name at the top-right of the AWS screen and choosing "My Billing Dashboard." On the right side of the billing dashboard, you should see a "Month-to-Date Spend by Service" section, from which you can view details about what you're being charged for in the current month or any other month.
If any of these directions are confusing or inaccurate, please post a question or comment to the Lesson 6 discussion forum.
In a few moments, we'll see that two of the settings involved in launching a new instance in EC2 are the VPC and key pair. A VPC (Virtual Private Cloud) is sort of your own special space carved out of Amazon's cloud. Instances in a VPC can see each other and your own network fairly easily, but they're not immediately accessible from elsewhere without some extra work on your part. That's a good thing for security.
A key pair is another security measure that will come into play when you log in to your instance for the first time. You will be logging in to your instance as a user named Administrator. The password for the Administrator user will be encrypted by AWS. The procedure for getting that password so that you can log in to your instance involves 1) creating a key pair (one key held by Amazon and another key given to you) in the AWS Management Console, 2) providing the name of that key pair when launching the instance, and 3) using the key pair after the instance has been created to decrypt the password.
Note: If you've taken our Cloud and Server GIS course (GEOG 865), you will have already created a VPC and key pair. You may skip over the steps in this section (or do them again if you like) and pick up with section C below.
Esri provides two ways to deploy ArcGIS in AWS: using Amazon's CloudFormation service and Amazon's AWS Management Console. For our purposes, the AWS Management Console is the best option, so we will lead you through the launching of an instance via that route. However, if you decide to deploy ArcGIS in the cloud as part of your job, you may want to explore the CloudFormation option as well. Instructions for both can be found in Esri's documentation [44] [http://server.arcgis.com/en/server/latest/cloud/amazon/use-aws-management-console-with-arcgis-server.htm]. Note that this link opens the Management Console instructions; instructions for the CloudFormation method can be found through the navigation headings on the left side of the page.
The basic idea behind what we're about to do is that Amazon has made it possible for vendors like Esri to create machine images (configurations of operating system, software, data, etc.) that can serve as blueprints for the making of child instances. Esri has created several of these AMIs (Amazon Machine Images): one that runs ArcGIS Enterprise on the Linux OS Ubuntu with Postgres, one that runs ArcGIS Enterprise on Windows with SQL Server, etc. Third parties (like us) can discover and "subscribe" to these AMIs through the AWS Marketplace.
There were a lot of steps involved in launching this instance and some of it may have seemed confusing, but the good news is you should not need to go through these steps again. Stopping and re-starting your instance now that it's been created is a much simpler process.
Whether starting or stopping your instance, you'll want to be viewing the Instances page in the AWS Management Console [48].
When you've finished your coursework for the day, you can Stop your instance as follows:
When you want to Start your instance:
Note: The biggest benefit to having a consistent IP address associated with an ArcGIS Enterprise/Server instance comes when you're using it to develop map/feature/geoprocessing services to be consumed by an app or a third party. A constantly changing IP address would render such services practically unusable. We won't be authoring such services in this course, though you can check out our Cloud & Server GIS course [49] if you're interested in learning how to do that. In this course, the benefit to the Elastic IP will be avoiding the need to locate the instance's new Public DNS address each time you want to re-start it and connect to it through remote desktop. If you'd rather not bother with associating an Elastic IP with your instance, then you're welcome to skip that step and instead look up the instance's new Public DNS and connect through that address.
If you just completed section B above, then your Enterprise Geodatabase instance is currently running, and we're going to work with it in the next section. If you are going to continue on, you don't need to stop your instance now. But be sure to stop it when you're ready to quit working.
Now that your site has been created, you can get ready to log in to the instance and start working with your software.
Your instance needs to be running, so if you did Stop it at the end of the previous section, open the AWS Management Console and Start it again.
Recall that when launching the instance, you created a new Security Group using the default Create new based on seller settings option. This set up the instance so that it would accept http and https connections from any IP address. We now want to log in to the instance using the Windows Remote Desktop Connection app, but to do that, we need to add a rule to the security group allowing that sort of traffic.
Next, choose My IP from the Source dropdown list and click Save rules.
Important: You've just specified that your instance should accept remote desktop connections from your current IP address, and you'll shortly make your first remote desktop connection. It's possible that your IP address will change over the rest of the term. For example, your Internet service provider might use dynamic IP address assignment. Or you might be working on a laptop in a different location. If that's the case, then you'll need to come back and edit your RDP rule to accept connections from whatever your new IP address happens to be at that time. Alternatively, you could also choose to set the Source to Anywhere - IPv4, which would allow any IP address to attempt an RDP connection, but eliminate the need to update the RDP rule whenever your connecting-from IP address changes. For a low-stakes instance such as the one you're using in this class, you may decide it's worth the risk. (Anyone wanting access to your instance would still need to supply your password. More on that below.)
Now that the instance is ready to accept remote desktop connections, there are two bits of information we'll need to make a connection: the instance's IP address and the password of the Administrator account.
Copy and Paste your Elastic IP address into a simple text editor like Notepad.
In the Windows Security dialog, log in with the following credentials:
User name: Administrator
Password: the password you decrypted in the AWS Management Console
Click OK.
You'll probably receive a warning that "the identity of the remote computer cannot be verified." Go ahead and answer Yes, that you want to connect anyway.
You should see the desktop of your remote instance open up.
As a security precaution, it's usually not a good idea to go around browsing the web from your production server machine. To do so is to invite malware intrusions onto one of your most sensitive computers. The operating system on your instance, Windows Server 2016, enforces this by blocking Internet Explorer from accessing most sites. This is called IE Enhanced Security Configuration (ESC). IE ESC gets burdensome when you're using the server solely for development or testing purposes, like we are. To smooth out the workflows in this course, you'll disable IE ESC right now and leave it off for the duration of the course.
Amazon gave you a pretty strong password for this instance, but it's not one you're liable to remember easily. You should change the administrator password to something you'll remember.
Figure 6.1: Password Security Setting Rules and Restrictions
As with the resetting of the instance password, these licensing steps need only be performed once after launching your instance.
Esri supports the implementation of enterprise geodatabases using a number of relational database management packages (e.g., Oracle, SQL Server, Postgres). To expose you to another RDBMS that's commonly used in the industry, I'm going to ask you to install SQL Server Express. (This is a free, lite version of Microsoft's SQL Server package. For everything we'll do in this class, the two SQL Server packages operate the same.)
With that, we're ready to begin playing with our enterprise geodatabases. First, you'll read a bit about ArcSDE, a technology that enables the ArcGIS products to work with data stored in an RDBMS.
ArcSDE is software that enables ArcGIS applications to store, manage and retrieve data in a RDBMS. The “Arc” comes from the ubiquitous naming convention used by Esri for their products. The SDE part stands for Spatial Database Engine.
ArcSDE is sometimes described as middleware, a layer of software that sits between Esri’s ArcGIS products and RDBMS software and manages data exchanges between them. It makes it possible for GIS data users to utilize their data without need for special knowledge of the underlying RDBMS.
As of ArcGIS 10, ArcSDE supports the commercial RDBMS’s Oracle, SQL Server, DB2, and Informix, and the free and open-source PostgreSQL.
In the last lesson, you worked with file geodatabases, a format that Esri recommends for small, single-user projects. Esri often refers to geodatabases stored in an RDBMS and accessed using ArcSDE as multiuser geodatabases because they are better able to support access by more than one user. The advantages of a multiuser geodatabase include:
Esri enables users in the Amazon cloud to run ArcGIS Enterprise on either the Ubuntu operating system or Windows. While there would be some benefit to seeing how Postgres operates as an Esri enterprise geodatabase, we're going to proceed with SQL Server on Windows, as SQL Server is more commonly used in the industry as a geodatabase DBMS. That said, the concepts involved in administering an enterprise geodatabase are similar regardless of the RDBMS used, so what you learn here will be transferable to other RDBMS’s.
One of the first things you might want to do after launching an enterprise geodatabase is set up login roles and privileges for those roles. That will be the focus of the next section of the lesson.
Before diving into enterprise geodatabase concepts, it's important to understand some basics of working with SQL Server, the DBMS that we'll be using on our Amazon cloud instances. (While we're actually using SQL Server Express, I will refer to the software as simply SQL Server, since the two operate essentially the same for our purposes in this class.) Access to SQL Server itself is granted through logins. A login allows someone to authenticate with SQL Server and answers the question, "Who is connecting?" Access to an individual database (of which there could be several for a given SQL Server instance) is granted through users. Whereas logins are concerned with authentication, user accounts provide authorization to perform different tasks with a database. They answer the question, "What can this person do in the database?" As we'll see, logins must be mapped to database user accounts in order to work with database objects. This blog post [50] provides further information.
Related to logins and users is the concept of roles. A role provides a way to group similar users together so that permissions can be granted more easily. Database administrators can define their own desired roles and make permission changes for those roles as they see fit. SQL Server also comes with a set of fixed server and database roles whose permissions cannot be changed. Among these is the sysadmin role, which, as you might guess, has permissions that grant full control over the server.
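To make the login/user/role distinction concrete, here is a minimal T-SQL sketch of the statements involved. The names (egdb, gisviewer, data_viewers) are placeholders I made up for illustration; in the lesson itself, we'll create the actual logins, users, and roles through SQL Server Management Studio and ArcGIS tools.

```sql
-- Authentication: a login answers "who is connecting?" at the server level.
CREATE LOGIN gisviewer WITH PASSWORD = 'StrongP@ssw0rd!';

-- Authorization: a user maps that login into one particular database.
USE egdb;
CREATE USER gisviewer FOR LOGIN gisviewer;

-- Roles group users so that permissions can be granted once, to the role.
CREATE ROLE data_viewers;
ALTER ROLE data_viewers ADD MEMBER gisviewer;
```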
All databases found on a SQL Server instance have a special user called DBO (short for database owner). Any member of the sysadmin fixed server role who uses a database is mapped to the DBO user, and any object created by any member of the sysadmin fixed server role belongs to DBO automatically.
An important consideration when implementing an enterprise geodatabase in SQL Server is who will be the owner of the geodatabase -- the DBO user or the SDE user. While the DBO user is the person who administers the database at the SQL Server level (e.g., creating new users and roles), the SDE user is the person who administers the database within ArcGIS (e.g., updating database statistics and compressing the database). Esri's documentation includes a page that discusses the pros and cons of each [51] [https://pro.arcgis.com/en/pro-app/help/data/geodatabases/manage-sql-server/comparison-geodatabase-owners-sqlserver.htm]. Generally speaking, if the SQL Server database administrator and the geodatabase administrator are the same person, then having the DBO user own the geodatabase is sensible. If, on the other hand, those two roles are filled by different people, then having the SDE user own the geodatabase is probably more advisable. In this course, we're going to work with a DBO-owned geodatabase.
In the next section, you will be back working, via your Remote Desktop Connection, on your instance of the Amazon Machine Image (AMI) that holds our enterprise geodatabase, etc.
If you Stopped your instance on purpose, or if your Remote Desktop Connection gets interrupted, you will need to Start your instance again via the EC2 Console.
And, if the IP address of the machine you're connecting with has changed since your last connection, remember that you may need to edit the RDP rule to allow connection from your new IP address.
Before we move on to setting up geodatabase users and roles, there are a couple of authentication settings we will change to make our lives a bit easier in this safe learning environment.
The first will be to set up SQL Server to allow mixed-mode authentication. By default, SQL Server only allows authentication through operating system logins. However, it is also possible to authenticate via user names and passwords stored in the database. In a real-world implementation, you may find it advantageous to utilize Windows authentication. But since that involves a bit of extra work on the instances we've just created, we'll go with SQL Server authentication.
The second setting we'll make will be to relax the default password standards on the Windows Server operating system that filter down to SQL Server. This will allow us to avoid the need to create lengthy, complex passwords.
We'll now be able to authenticate using simple, easy-to-remember passwords. Obviously, in a real-world implementation, you would want to think twice before lessening your database security like this.
Esri's ArcGIS Server/Enterprise AMIs have gone back and forth, sometimes including database software and a pre-configured enterprise geodatabase, sometimes not. This version's AMI does not -- we had to install the RDBMS ourselves -- so let's now create our first database.
We'll start our work with the egdb database by connecting to it as the DBO user.
Before adding some data to the geodatabase, let's discuss the importance of users and roles in a geodatabase context and lay out the scenario for the rest of our work in this section.
Database roles enable the database administrator to improve security by assigning users or groups of users various levels of access to the database tables. Commonly used roles in an enterprise geodatabase include viewer, editor, etc. As the names of these roles imply, one user group might be limited to read-only access to data while another group might be allowed to both read and edit the data. While we're only getting to this topic now, you should keep in mind that similar strategies can also be applied to the kind of Postgres/PostGIS databases we talked about in previous lessons.
For illustration purposes, imagine you work in the U.S. federal government and that you are administering a geodatabase that will host data from a number of different departments: state boundaries from the Census Bureau, major highways from the Department of Transportation, rivers from the Department of the Interior, etc. You want each department to have stewardship over their data and to have just read-only access to other departments' data.
Click here to download the Lesson 6 data [53] to your machine. In the steps below, you will then copy the us_data.zip archive file from your local machine to your remote instance. The Downloads (or Documents) folder on your remote instance is a logical place to paste and unzip the data.
Note that in our enterprise database environment, vector data is stored as feature classes. So, later, the Shapefile datasets that I gave you will be uploaded and converted to feature classes.
One aspect of the data loading/creation process that's worth keeping in mind is that when a feature class or feature dataset is created in the geodatabase, its name in the ArcGIS applications has two parts:
<owner>.<feature class/feature dataset>
Some organizations set up users specifically to handle data loading/data creation so that the owner listed in the two-part name clearly conveys the department that maintains the data. Following on this practice, let's add a user to the database who will control the loading of the Census data.
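In SQL Server terms, the owner portion of the two-part name corresponds to the schema the dataset is stored in, so the same naming shows up if you query the table directly. A hedged example, assuming the census user has loaded a states feature class as in our scenario:

```sql
-- The census user's states feature class, referenced by its two-part name
-- (owner/schema first, then the dataset name). Data created by a sysadmin
-- would instead appear under the dbo owner.
SELECT TOP 5 * FROM census.states;
```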
The Create Database User tool is one of several geodatabase administration tools available through the ArcGIS Pro GUI.
With the census user created, let's work as that user to load the state and city data you downloaded above. Because feature datasets are frequently used to house feature classes, let's create one of those first.
An important point to note about the process we just went through is that users created via the Create Database User tool have the ability to load/create data. It's generally considered a best practice to restrict this ability to a small number of trusted administrators to avoid cluttering the database with unwanted data. We'll now see that following this best practice requires adding lower-permission users through a different process.
Returning to our scenario, the users in the departments are as follows:
Department | editors | viewers |
---|---|---|
Census | Moe | Larry, Curly |
Transportation | Lisa | Bart, Homer |
Interior | Stan | Cartman, Kenny |
To add these users, we'll need to return to the SQL Server Management Studio.
In order for users to have the ability to view and/or edit the data in a geodatabase, the data owner must grant those privileges. That can be done on a user basis (using an ArcGIS tool), but the process can be done more efficiently by assigning users to roles and assigning privileges to the roles. In this section, you'll create a viewer role and an editor role in the egdb database.
One important note on roles: while the privileges must be granted by the data owner (the census and transportation users in our scenario), the roles must be created by someone with sysadmin privileges. That is the DBO user in our implementation.
Finally, we'll now work through the data owner connections to assign the proper privileges to go with the roles just created.
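Behind the GUI, those privilege assignments boil down to standard SQL GRANT statements issued by the data owner. Below is a sketch of what the census user's grants might look like, assuming viewer and editor roles named egdb_viewer and egdb_editor; we'll actually make these assignments with the ArcGIS privileges tool.

```sql
-- Run while connected as the data owner (census in our scenario).
-- Viewers get read-only access to the census data...
GRANT SELECT ON census.states TO egdb_viewer;

-- ...while editors can also modify it.
GRANT SELECT, INSERT, UPDATE, DELETE ON census.states TO egdb_editor;
```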
Note:
This concludes the hands-on activities you'll be doing with your geodatabase instance in this lesson. Feel free to do some experimenting with your instance. When done, close the remote desktop connection to your instance, and most importantly, remember to stop your instance through the EC2 Console to avoid racking up unnecessary charges.
Go on to the next page to see this week's graded activity.
With that, we've finished our tutorial on creating a SQL Server geodatabase, adding users, loading some data, and assigning data privileges to the users. You weren't specifically instructed to add the Department of the Interior users and import the us_hydro shapefile, but you're welcome to do so if you're looking for additional practice.
The Adding Users and Data to Your Geodatabase part of the lesson discusses the various 'roles' that people who are involved with an enterprise geodatabase might have.
For Project 6, I'd like you to spend some time reflecting on the data users in your organization. If you already utilize an enterprise geodatabase, tell me a bit about your system, who uses it, through what login roles, and with what privileges. In preparation for this write-up, you may want to talk to your geodatabase administrator if that person is not you.
If your organization isn't using an enterprise geodatabase, tell me a bit about your data workflows and users, and spend some time describing the login roles and privileges you think would best meet your organization's needs if you were to implement one.
It's not necessary for you to divulge the identities of co-workers, but you may find that your write-up flows better if you use at least first names. You can always use fictitious names if you have concerns about identifying people by their real names.
Some of the job responsibilities frequently associated with geodatabase usage that you may want to consider include:
Note: If you feel you're unable to report on a current or former workplace, I will also accept an analysis of any organization that has published information on its enterprise geodatabase implementation. Such reports can be found by doing an online search for GIS + "master plan" or "strategic plan."
This project is one week in length. Please refer to the Canvas Calendar for the due date.
Now that you've gotten your feet wet with an enterprise geodatabase, it's time to dig into some of the details of managing vector and raster data in that environment. In this lesson, you'll learn how to add new feature classes and raster data sets to an enterprise geodatabase and about steps that should be taken after the creation of new datasets. You'll also see how geodatabase data is stored within SQL Server.
If you have any questions now or at any point during this week, please feel free to post them to the Lesson 7 Discussion Forum.
Lesson 7 is one week in length. See the Canvas Calendar for specific due dates. To finish this lesson, you must complete the activities listed below. You may find it useful to print this page out first so that you can follow along with the directions.
In the previous lesson, we saw how to import vector data into new geodatabase feature classes as part of our experimentation with logins, users, and roles. Most of this section of Lesson 7 will be review as we import more data, though the process is outlined in greater detail here.
We are going to transfer four shapefile datasets from your computer to your Amazon cloud enterprise geodatabase instance. The shapefiles are data you worked with back in Lessons 3 and 4: the States, counties, and us_cities shapefiles that we used in Lesson 3, and also the cities shapefile that you used in Project 4 in the Jen and Barry's site selection exercise. I gave you copies of them in the DataFromLessons3and4.zip archive that you downloaded in the Checklist section, above.
A common workflow for organizations that are migrating their data to an enterprise geodatabase is to import data that already exist in other formats. Let's walk through that process.
Recall from the last lesson that users created using the Create Database User tool are able to load data and create new feature classes from scratch, and that it's considered a best practice to limit these capabilities to administrative staff. We used SQL Server Management Studio to create the users to whom we didn't want to grant data-loading ability.
Speaking of creating new tables from scratch, let's take a look at that workflow in the next section.
Always remember to Stop your EC2 Instance when you finish or when you take a long break.
In this section of the lesson, we'll look at creating a new feature class and populating it using the Append tool. To illustrate the process, imagine you're again working for Jen and Barry.
Name | data type |
---|---|
population | Long Integer |
total_crim | Long Integer |
crime_inde | Double |
university | Short Integer |
New features can be added to the feature class using Pro's editing tools, which were covered in detail in GEOG 484. Another way to populate a feature class is by using the Append tool.
As its name implies, the Append tool is used to append features held in feature classes/shapefiles to another existing feature class. Let's use it to append the features in our Jen and Barry's cities shapefile to the empty cities feature class we just added to our geodatabase.
The next part of the dialog is concerned with whether the fields in the input dataset match the fields in the target dataset. The default must match option checks to see if the fields match (in name and data type) and will not allow the append operation to occur if there is a mismatch. The Use the field map option allows for differences between the datasets.
Because our cities feature class doesn't have all of the fields found in the cities shapefile, select the Use the field map option. A couple of notes on the Append tool that you should keep in mind:
Always remember to Stop your Instance when you finish or when you take a long break.
Esri recommends the following after loading data into a geodatabase feature class:
The second item above, which we covered last lesson, is the only one that is absolutely critical. The first item, which we'll discuss in a moment, can greatly improve performance, especially as the size of the feature class increases. Metadata, covered in GEOG 484, is often overlooked, but it can save a lot of headaches for anyone who has questions about the data. Geodatabase behavior functionality, covered in GEOG 484 and in Lesson 5 of this course, offers useful ways to improve the efficiency and accuracy of data maintenance workflows.
To this list, I would add the implementation of attribute and spatial indexes to improve performance. This page of the lesson will focus on database statistics and indexes.
Relational database packages like SQL Server provide users with the ability to calculate basic statistics on their tables, such as the common values and data distribution in each column. These statistics are stored in system tables that are utilized by the DBMS to determine the best way to carry out queries, thereby improving performance. As a table's data changes over time, the statistics will become out of date and less helpful in optimizing performance. This is why Esri recommends running the Analyze tool after major edits are made to a feature class. Let's run the tool on the states feature class we imported earlier.
At the bottom of the dialog are checkboxes that control which tables associated with the selected datasets should be analyzed (base, delta and/or archive). The base table (sometimes referred to as the business table) is essentially what you see when you open the feature class's attribute table. The delta table stores changes made to the base data in a versioned feature class, while the archive table stores data enabling database users to retrieve historical states of a feature class. We'll look at these topics in the next lesson. For now, you can just leave all three boxes checked. No harm is done if the feature classes don't have delta or archive tables.
Click Run to execute the tool. Keep in mind that running Analyze may have no perceptible effect for small datasets like the ones we're dealing with here, but it might result in significant performance gains in larger datasets.
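If you're curious what Analyze corresponds to at the database level, refreshing statistics is a standard DBMS operation. A rough SQL Server equivalent is sketched below (the schema and table name are assumed from our scenario); in practice, let the ArcGIS tool handle it so that any delta and archive tables are covered as well.

```sql
-- Refresh the query optimizer's statistics for the states base table.
UPDATE STATISTICS census.states;

-- Or refresh statistics for every table in the current database.
EXEC sp_updatestats;
```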
Attribute indexes are another mechanism used in relational databases to improve performance, particularly in the execution of queries. Developing better indexing algorithms is one of the more popular research topics in the computer science field. A comprehensive review of indexing schemes is outside the scope of this course. But at the very least, you should understand that one of the more common schemes works much like the index of a book.
If you're looking for discussion of a particular topic in a book, you don't skim through each page of the book beginning with page one. You look up the topic in the index, which tells you the pages where you can conduct a much narrower search for your topic. A database index often works in much the same way. Given a WHERE clause like "WHERE city = 'Philadelphia'", the index helps the DBMS begin its search at a particular row of the table rather than at row one.
Some points to keep in mind regarding indexes:
To see how attribute indexes are built in ArcGIS, let's create one on the name column in the us_cities feature class.
I won't bother to have you do a test query before and after because I doubt we'd see much difference in performance with such a small table. Just keep this capability in mind if you find that your queries are taking a long time to execute.
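For reference, the index and the kind of test query I'm describing would look something like this in plain SQL. The table reference is simplified (in your geodatabase the feature class carries an owner prefix), and the index name is made up:

```sql
-- Without an index, this query forces the DBMS to scan every row.
SELECT * FROM us_cities WHERE name = 'Philadelphia';

-- An index on the name column lets the DBMS jump to the matching rows,
-- much like looking a topic up in a book's index.
CREATE INDEX idx_us_cities_name ON us_cities (name);
```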
While attribute indexes improve the performance of attribute queries, spatial indexes are used to improve the performance of spatial queries. Esri geodatabases support three different methods of spatial indexing: grid, R-tree, and B-tree. The grid method is analogous to the map index found in road atlases. A grid of equal-sized cells is laid over the feature class, and each row and column of the grid is assigned an identifier. Geometries in the feature class are compared to this grid, and a list of grid cells intersected by each geometry is produced. These geometry-grid cell intersections are stored in a table. In the example below, feature 101 intersects three grid cells, while feature 102 is completely within a single cell.
FID | GX | GY |
---|---|---|
101 | 5 | 9 |
101 | 5 | 10 |
101 | 6 | 9 |
102 | 4 | 8 |
Index tables like this are used to enable GIS software to answer spatial questions without having to look at each geometry in the feature class. For example, imagine selecting features from our us_cities feature class that are within the state of Pennsylvania. The software will first look up the grid cells intersected by Pennsylvania. It can then throw out all of the us_cities points that don't intersect those same grid cells. It only needs to test for containment on points that share grid cells with the Pennsylvania polygon. This testing of only the close features is much more efficient than testing all features.
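To make that lookup concrete, here is a conceptual SQL sketch of the first filtering step, using the kind of feature-to-grid-cell table shown above. The table names and the FID value are hypothetical; in reality, the software manages these index tables internally.

```sql
-- Candidate cities: points that share at least one grid cell with Pennsylvania.
-- Only these candidates need the exact (and more expensive) containment test.
SELECT DISTINCT cities_idx.FID
FROM cities_grid_index AS cities_idx
JOIN states_grid_index AS states_idx
  ON cities_idx.GX = states_idx.GX
 AND cities_idx.GY = states_idx.GY
WHERE states_idx.FID = 42;  -- 42 = hypothetical FID of the Pennsylvania polygon
```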
It is possible to define up to three of these spatial grids per feature class. Multiple grids with different resolutions can capture the extent of features more efficiently, especially when the feature class contains features that vary greatly in their extent (i.e., some small features and some large).
The grid method is employed by Esri file geodatabases and Oracle-based ArcSDE geodatabases that store geometry using the Esri ST_Geometry type. ArcGIS calculates a default grid that typically provides a high level of performance. This page in the Esri documentation (An overview of spatial indexes in the geodatabase) [58] provides further information on spatial indexes, including when you might want to rebuild one.
SQL Server geodatabase administrators have two options available for storing geometries: the Microsoft geometry and Microsoft geography data types, which are similar in concept to the geometry and geography spatial data types we saw in PostGIS. The default storage method when using SQL Server is Microsoft geometry. (More on how spatial indexing works for geometry and geography types can be found below.) This can be changed when creating a feature class by selecting Use configuration keyword on the last panel of the New Feature Class wizard. For example, if you have data covering a large spatial extent and want to use SQL Server's spatial functions to calculate spherical lengths and areas on the SQL command line, then storing the data using the geography type might be the way to go. Further information on these storage options can be found in the documentation (Configuration keywords for enterprise geodatabases) [59].
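The practical difference between the two Microsoft types is easy to see on the SQL command line: geometry treats coordinates as planar values, while geography does its math on the WGS84 ellipsoid. Here is a small sketch, using an arbitrary longitude/latitude line:

```sql
-- Planar length, in the units of the coordinates themselves (here, decimal degrees).
SELECT geometry::STGeomFromText('LINESTRING(-78 40, -77 41)', 4326).STLength() AS planar_length;

-- Length computed on the ellipsoid, returned in meters.
SELECT geography::STGeomFromText('LINESTRING(-78 40, -77 41)', 4326).STLength() AS ellipsoidal_length;
```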
Another spatial indexing method employed in ArcGIS is the R-tree, which uses a set of irregularly sized rectangles (R stands for rectangle) to group together nearby objects. This (File: R-tree.svg) figure [60] helps to illustrate how an R-tree works. The red rectangles (labeled R8-R19) are the bounding boxes around some set of features (lines or polygons). The blue rectangles (R3-R7) are an aggregation of those features into groups, and the black rectangles (R1-R2) are a higher level of aggregation.
The basic idea of a search is the same: if the search geometry falls within R1, then the software knows it can disregard the features within the bounding boxes R15-R19 and instead focus on R8-R14. After that first check is completed, the blue level of the tree might be used to further narrow the search.
R-tree indexes are used in Oracle geodatabases that utilize the SDO_Geometry type. They are automatically created and managed by ArcGIS, and while it is possible to delete and re-create an R-tree index, it's not clear that doing so would improve performance. If you're having performance issues in a geodatabase that uses R-tree indexing, you may want to dig further into the documentation and/or contact Esri customer support.
SQL Server-based geodatabases that implement the geometry or geography spatial type are spatially indexed using a B-tree method [61]. (As noted above, the geometry spatial type is the default in SQL Server.) This is an indexing method commonly used for non-spatial data, but in this context modified by Microsoft to handle spatial indexing as well. Like the R-tree method, this modified B-tree method employs a rectangular grid for locating features.
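If you ever need to build (or rebuild) one of these spatial indexes manually, the SQL Server statement looks roughly like the sketch below. The index name, column name, bounding box, and grid settings are all placeholder values; ArcGIS normally creates and manages the spatial index for you.

```sql
-- Grid-based spatial index on a geometry column (hypothetical names and extent).
CREATE SPATIAL INDEX six_us_cities_shape
ON us_cities (Shape)
USING GEOMETRY_GRID
WITH (
    BOUNDING_BOX = (-180, -90, 180, 90),
    GRIDS = (LOW, LOW, MEDIUM, HIGH),
    CELLS_PER_OBJECT = 16
);
```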
Finally, Postgres-based geodatabases are spatially indexed using a Generalized Search Tree (GiST) approach [62]. This indexing method was developed as an alternative to the older B-tree and R-tree methods for irregular data structures (such as GIS data). It realizes performance gains by breaking data up into groupings, like objects that are within, objects that overlap, objects to one side, etc.
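For comparison, the PostGIS counterpart you may recall from earlier lessons is a one-liner; the table and column names below are placeholders:

```sql
-- GiST spatial index on a PostGIS geometry column.
CREATE INDEX us_cities_geom_gist ON us_cities USING GIST (geom);
```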
Now that you've learned about some of the settings to consider when loading data into an enterprise geodatabase, let's look in SQL Server Management Studio to see how feature classes are stored. Reminder: remember to Stop your Instances when you finish or when you take a long break.
Let's take a look at how the data we've been working with in ArcGIS Pro is stored in SQL Server.
ArcSDE relies on a number of tables behind the scenes. Many of these so-called repository tables are owned by the dbo superuser.
It's not really important that you remember much about these repository tables. However, hopefully, you now have a bit of an appreciation for what's going on behind the scenes and will see the tables as a bit less of a mystery.
Always remember to Stop your Instance when you finish or when you take a long break.
Esri offers a number of different options for managing raster data in an enterprise geodatabase. One area that receives a good deal of attention is managing collections of adjacent raster datasets (e.g., aerial photos). The options for dealing with such collections range from working with the raster datasets individually to merging them together into one large dataset. In between is something Esri calls a mosaic dataset, which attempts to provide the best of both worlds: the ability to work with multiple raster datasets as one combined layer or to break them up into their individual components. We'll talk about a couple of these approaches in this section of the lesson.
Let's see how to bring a raster data set into a geodatabase.
The "_1" part of these table names comes from the rastercolumn_id value assigned to the raster, found in the SDE_raster_columns repository table. If the earthatnight raster instead had a rastercolumn_id of 2, its pixel data would be stored in SDE_blk_2, for example.
Raster datasets can hold either integer or floating-point data. They can also be comprised of multiple bands. If you have a single-band integer raster dataset, a value attribute table (VAT) can be built that stores the number of cells associated with each integer value. The earthatnight raster holds integer values, but it is comprised of three bands. If it were a single-band integer raster dataset, we would see an SDE_vat_1 table in addition to the other tables.
As mentioned at the beginning of this section, it is common in GIS to deal with collections of adjacent raster datasets. Esri's mosaic dataset can be used to treat such collections as a single unit while still having the ability to work with the individual files as needed. Let's create a mosaic dataset to manage some elevation raster data sets for the State College, PA, area (found in the doqs folder from the Lesson7_data download).
The nearest neighbor method is best for discrete raster datasets (like a land use or soils grid) and for scanned maps. Bilinear interpolation and cubic convolution are better suited for continuous raster datasets (like an elevation grid) and for satellite imagery and aerial photography. Bilinear interpolation is faster at pyramid creation time than cubic convolution, but on the flip side, cubic convolution typically produces the most visually pleasing output.
The pixel data created by the pyramid building process can be compressed to reduce storage requirements and improve performance. Higher levels of compression can be achieved with different methods, though care should be taken to match the data's use to an appropriate compression method.
The LZ77 method is referred to as a lossless compression method because it results in no degradation of the input data. It should be used when the highest accuracy possible is required. The other method, JPEG, can produce a significantly higher level of compression, though at the expense of some degradation of the input data. Thus, it is referred to as a lossy compression method. The JPEG method can be used in situations when the highest level of spatial accuracy is not really necessary.
A bit of degradation in the raster quality in the pyramid data is acceptable, so let's go with the JPEG compression option.
Choose JPEG from the Compression type dropdown list, and accept the default Quality value of 75.
In the Raster Statistics part of the dialog, you'll see a couple of "skip factor" options, one for the x dimension and one for the y. These values specify how many rows/columns to skip when computing statistics on the raster. The default skip factor value is 1 for each dimension, which means that cell values are retrieved for every other row and every other column. This decreases the time required to calculate the statistics, though it also decreases the accuracy of the statistics. In most cases, a skip value of 1 should produce statistics that are "good enough".
The Statistics ignore value is a value or list of values that should not be included in the statistics calculations. For example, if you used a value like -9999 for missing data, you would want to specify that as the ignore value to avoid generating distorted statistics.
Accept all of the defaults in the Raster Statistics part of the dialog. Click OK to close the Environment Settings window.
Always remember to Stop your Instance when you finish or when you take a long break.
For Project 7, you are going to revisit the historical maps of Charlottesville, VA, that you may have worked with at the end of GEOG 484. In the data download for Lesson 7, you were given a folder containing 4 scanned maps of Charlottesville, circa 1920. You were also given a shapefile of buildings digitized from those scanned maps. Your task for this project is to:
This project is one week in length. Please refer to the Canvas Calendar for the due date.
One of the primary reasons organizations adopt enterprise geodatabase technology is to take advantage of versioned editing. This is a form of editing that makes it possible to perform edits in isolation from the main version of the feature class. After a group of edits has been completed in the edit version, and perhaps run through a quality check, it can be incorporated into the main version. In Lesson 8 you'll see how versioned editing is conducted using Esri tools.
If you have any questions now or at any point during this week, please feel free to post them to the Lesson 8 Discussion Forum.
Lesson 8 is one week in length. See the Canvas Calendar for specific due dates. To finish this lesson, you must complete the activities listed below. You may find it useful to print this page out first so that you can follow along with the directions.
* - Data acquired from the National Historic GIS [66] (www.nhgis.org), Minnesota Population Center. National Historical Geographic Information System: Version 2.0. Minneapolis, MN: University of Minnesota 2011.
In the last lesson, you digitized streets from a set of Sanborn maps. Perhaps without realizing it, you were performing non-versioned editing. This editing model, also sometimes called the short transaction model, is typical of traditional non-spatial database systems and has the advantage of being simple to implement. It can be preferable to versioned editing in situations where it is important for edits to be available to users immediately or if the data is frequently accessed by non-Esri applications.
However, non-versioned editing has a longer list of limitations and disadvantages that often makes versioned editing the more logical choice. Among these are:
The versioned editing, or long transaction, model addresses all of these issues. Editors of a versioned feature class are able to undo/redo individual edits and can perform edits on either simple or complex data. Different editors can perform edits to the same features because, as the name implies, all of the editors are working off of their own version of the data. As we'll see, mechanisms exist for merging changes made in different versions and resolving conflicts when they crop up.
As you might guess, versioned editing is especially useful for organizations that make frequent and/or complicated edits to their data. Land parcels are a classic example in that they are in a constant state of flux and require a great deal of care to delineate properly.
Versioned editing relies on the usage of delta tables (tables that track changes made to the base feature class). One good reason to avoid using a versioned editing workflow is that non-Esri applications are not built to work with the data in these delta tables. If your organization uses non-Esri tools to access your data, you may need to develop workarounds or live with the limitations of non-versioned editing.
With these introductory notes behind us, let's take a look at how versioned editing works.
Throughout this lesson, we'll use a feature class that stores the boundaries of the United States at the time of its first decennial census in 1790. We'll make edits to this feature class that reflect changes in the state boundaries as new states were admitted to the union.
Again, you are going to transfer some data from your machine to your Amazon enterprise geodatabase instance. This time, it will be the US_state_1790 shapefile dataset that you download from the Checklist page.
First, you will import the shapefile to your enterprise geodatabase as a feature class, then you'll perform the registration. Throughout this lesson, we'll use the census and Moe users established back in Lesson 6. In an actual implementation of a project like this, you would likely want to load the data via a loader user (like census) and then grant editing privileges to two non-loader users; but for simplicity's sake, we'll just grant editing privileges to Moe and use census as the second editor.
The feature class is now ready for versioned editing. The decision on whether to use a versioned editing workflow is made at the feature class level unless the feature class is part of a feature dataset. In that case, you would register the feature dataset as versioned, and that setting would cascade to all of its feature classes. Note that the setting applies only to feature classes in the feature dataset at that time; if you later add a new feature class to the feature dataset, you'll need to re-execute the Register As Versioned command. Further details can be found in this Tech Support article [67].
In our scenario, users logged in as census and as Moe are sharing the responsibility for editing the feature class to reflect when states joined the Union. As the project progresses, snapshots at different moments in time can be taken to produce an historical timeline of the country's expansion.
You will simulate these two users performing their edits by launching a separate ArcGIS Pro application window for each. The users want to isolate their work from the base feature class, so they will create their own versions of the feature class.
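Creating the versions can likewise be scripted with the Create Version geoprocessing tool. The sketch below is only an illustration: the connection-file paths are hypothetical, the Private access level is an assumption, and the parent version name "dbo.DEFAULT" assumes a dbo-schema geodatabase.

import arcpy

# Hypothetical connection files -- one per editor.
census_conn = r"C:\data\census.sde"
moe_conn = r"C:\data\moe.sde"

# Each editor creates a child version of DEFAULT to isolate their edits.
# "dbo.DEFAULT" assumes a dbo-schema geodatabase; adjust if yours differs.
arcpy.management.CreateVersion(census_conn, "dbo.DEFAULT", "Statehood_edits", "PRIVATE")
arcpy.management.CreateVersion(moe_conn, "dbo.DEFAULT", "Statehood_changes", "PRIVATE")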
Now you'll play the role of the Moe user. In our scenario, imagine that the census and Moe users didn't coordinate their work very well and Moe thought that creating Kentucky was his responsibility.
Now, let's take a look at how the edits you just made are stored in the geodatabase.
When asked to display a version, Esri's software is programmed to start with the parent DEFAULT version of the feature class and incorporate the changes recorded in the delta tables. This process relies on the concept of a feature class state. Every edit made to a feature class version produces a new state, with states tracked using sequential ID values. For example, the first edit to a new version will produce state #1.
The SDE_versions table stores the names and IDs for all of the geodatabase’s versions. It is there that you’ll find the Statehood_edits and Statehood_changes versions created earlier. Other tables used in conjunction with the delta tables to track feature class changes are SDE_states and SDE_state_lineages. The details of how this process works are a bit complicated, so we won’t dig into them. If you’re curious to learn more, you can read through Esri’s Geodatabase system tables in SQL Server [68] documentation page.
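If you'd like to see that version metadata without querying the system tables directly, arcpy exposes much of the same information. A minimal sketch, assuming a hypothetical connection-file path:

import arcpy

# Hypothetical connection file; you'll see only the versions your user can access.
conn = r"C:\data\census.sde"

# Each Version object reflects metadata like that stored in the SDE_versions table.
for version in arcpy.da.ListVersions(conn):
    print(version.name, version.parentVersionName, version.access)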
It is sometimes necessary to switch the version of the feature class you're currently viewing to some other version you have permission to view. For example, the census user would be able to switch to the DEFAULT version or to the Statehood_changes version (created by Moe). We already saw how to change to a new version immediately after creating it, but let's drive home the point that the version can be changed later as well.
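For the record, the same switch can be made with the Change Version geoprocessing tool. This sketch assumes a layer named US_state_1790 in the active ArcGIS Pro map and an owner-qualified version name that you would adjust to match your geodatabase:

import arcpy

# Assumes a layer named "US_state_1790" is present in the active map and came
# from a versioned enterprise geodatabase connection.
arcpy.management.ChangeVersion(
    "US_state_1790",
    "TRANSACTIONAL",          # traditional (transactional) versioning
    "Moe.Statehood_changes"   # owner-qualified version name; adjust to your setup
)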
A useful way to inspect changes that exist between two feature class versions is to open the Version Changes view.
At some point, you'll want to merge the changes you've made in your isolated version with the base feature class, perhaps after performing a QA/QC check. That's the topic of the next section.
When a feature class is registered as versioned, the dbo superuser is made owner of the DEFAULT version and permission to work with that version is set to Public. The Public permission setting means that any other user can both view and edit the DEFAULT version. This type of permission may not be ideal though, particularly if you want to avoid mistaken edits from being propagated to the base feature class. One way to safeguard against this happening is to set permission on the DEFAULT version to Protected. This allows anyone to create their own child versions based on DEFAULT, but only the dbo user can post changes to the DEFAULT version.
As the dbo user is the owner of the DEFAULT version, only the dbo user can change permission on that version. So the first step we'll need to take to alter permissions is to open up a connection through the dbo user.
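Once that dbo connection exists, the permission change itself could also be made with the Alter Version geoprocessing tool. A minimal sketch, assuming a hypothetical dbo connection file:

import arcpy

# Hypothetical connection file made as the dbo superuser (owner of DEFAULT).
dbo_conn = r"C:\data\dbo.sde"

# Switch the DEFAULT version from Public to Protected so that other users can
# still create child versions, but only dbo can post changes to DEFAULT.
arcpy.management.AlterVersion(dbo_conn, "dbo.DEFAULT", access="PROTECTED")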
Synchronizing the changes between two versions of the same feature class requires the passing of edits in two directions. These transfers happen through operations called reconcile and post:
Both reconcile and post operations must be carried out while viewing the child version.
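The Reconcile Versions geoprocessing tool bundles both operations, which is handy if your organization automates them (say, as a nightly task). Below is a minimal sketch for the census user's version, assuming a hypothetical connection file and an owner-qualified version name you would adjust to match your geodatabase:

import arcpy

# Hypothetical connection file for the census user (owner of Statehood_edits).
census_conn = r"C:\data\census.sde"

# Reconcile pulls changes from DEFAULT into the child version; with_post="POST"
# then pushes the child version's changes back up to DEFAULT.
arcpy.management.ReconcileVersions(
    census_conn,
    "ALL_VERSIONS",
    "dbo.DEFAULT",                # target (parent) version
    "census.Statehood_edits",     # edit (child) version, owner-qualified
    with_post="POST"
)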
To summarize, in Part B, you performed a reconcile and a post between the census user's version and DEFAULT. Then, in Part C, you performed a reconcile and a post between Moe's version and DEFAULT. In Part C, the edits made by census were propagated to Moe's version (and were rejected) by virtue of the fact that census had just posted its changes to DEFAULT. However, at this point, the changes that were made by Moe have not been propagated to the census user's version. To do that, the census user would need to (a) do another reconcile, or (b) delete the Statehood_edits version and re-create it.
Let's go with option a.
A few final points on versioned editing:
Through this short tutorial, you've seen how an organization can carry out edits to its data in a way that balances the need for preserving data integrity with the need to keep the data as up-to-date as possible.
Move on to the next page to access the graded activity for this lesson.
In the last project, you digitized streets in Charlottesville, VA circa 1920 using a set of Sanborn maps. For Project 8, I’d like you to return to the Sanborn map scenario. Imagine that you've been tasked with leading a team of four editors in capturing the building footprints found on Sanborn maps for a different point in time (say 1950). Your job for Project 8 is to draft a workflow that outlines how you and your team will digitize the building features using versioned editing. Describe how you would lead the development of this 1950 buildings feature class as the dbo superuser. Include in your workflow all of the steps you would follow from the initial creation of the feature class to the incorporation of each editor’s work into the final product. Assume that there are 30 individual scanned map sheets, like the four you saw in Lesson 7, and that your company has told the client that it will take a total of 2000 person-hours to perform the digitizing and to compile the attribute data from all 30 maps. In other words, each editor will have to engage in multiple data-entry sessions.
Note: Recall that you were given the 1920 building footprints in shapefile format for Project 7. You may assume that this dataset is available to you in Project 8 as well.
This project is one week in length. Please refer to the Canvas Calendar for the due date.
The goal of the Final Project is for you to put together what you’ve learned in the course to develop a spatial database that solves a problem of your own choosing. You may choose to attack your problem using either a Postgres/PostGIS approach or an Esri geodatabase approach.
A database’s usefulness is dependent on how well it helps to solve the problem it was built to address. Typically, a database serves as the "back end" of some sort of application which could be centered around a map or not, and could be web-based or not. We haven’t covered how to build such applications and this project’s timeline is fairly short, so I can’t require you to do much in terms of applying your database to your problem. However, what I will ask you to do is some combination of the following:
Those of you who have skills in this area and would like the extra challenge are welcome to put together a front-end application (such as an ArcGIS Server or OpenLayers web map). Just be sure to complete the project’s minimum requirements before getting bogged down in something more advanced.
This project is two weeks in length. Please refer to the Canvas Calendar for the due date.
Submit your write-up in Word or PDF format to the Final Project. Your write-up should contain the following sections:
Your submission will be graded as follows:
Links
[1] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/baseball_stats.accdb
[2] http://troels.arvin.dk/db/rdbms/#functions-concat
[3] http://en.wikipedia.org/wiki/Object_database
[4] http://en.wikipedia.org/wiki/NoSQL
[5] http://www.1keydata.com/database-normalization/second-normal-form-2nf.php
[6] http://en.wikipedia.org/wiki/Entity-relationship_model
[7] http://blogs.esri.com/esri/arcgis/2010/08/05/using-case-tools-in-arcgis-10/
[8] http://www.rocklistmusic.co.uk/1001Albums.htm
[9] https://www.radio3net.ro/
[10] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/Albums.txt
[11] https://www.postgresql.org/download/
[12] https://postgis.net/windows_downloads/
[13] http://qgis.org/en/site/forusers/download.html
[14] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/Lesson3Data.zip
[15] http://stackoverflow.com/questions/14083311/permission-denied-when-trying-to-import-a-csv-file-from-pgadmin
[16] https://www.postgresql.org/docs/current/static/sql-copy.html
[17] https://www.e-education.psu.edu/spatialdb/L3_practice_solutions.html
[18] http://en.wikipedia.org/wiki/Hexadecimal
[19] http://en.wikipedia.org/wiki/Linear_referencing
[20] http://en.wikipedia.org/wiki/SpatiaLite
[21] http://en.wikipedia.org/wiki/SQLite
[22] http://prj2epsg.org/
[23] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/868_roster_sp24.txt
[24] https://www.postgresql.org/docs/current/static/index.html
[25] http://postgis.net/docs/manual-3.0/reference.html
[26] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/Project4.zip
[27] http://postgis.net/docs/manual-3.0/reference.html#Geometry_Outputs
[28] https://www.e-education.psu.edu/spatialdb/L4_practice_solutions.html
[29] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/AttributeDomains.zip
[30] http://desktop.arcgis.com/en/arcmap/latest/tools/data-management-toolbox/table-to-domain.htm
[31] http://desktop.arcgis.com/en/arcmap/latest/tools/data-management-toolbox/an-overview-of-the-domains-toolset.htm
[32] http://www.7-zip.org/
[33] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/Subtypes.zip
[34] http://www.ufocasebook.com/Hynek.html
[35] http://en.wikipedia.org/wiki/Blood_alcohol_content
[36] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/Topology.zip
[37] https://pro.arcgis.com/en/pro-app/help/data/topologies/topology-in-arcgis.htm
[38] https://pro.arcgis.com/en/pro-app/help/editing/validate-and-fix-geodatabase-topology.htm
[39] https://pro.arcgis.com/en/pro-app/help/editing/pdf/topology_rules_poster.pdf
[40] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/Topology_ArcMap.zip
[41] https://aws.amazon.com/
[42] https://console.aws.amazon.com/
[43] https://console.aws.amazon.com/ec2/
[44] http://server.arcgis.com/en/server/latest/cloud/amazon/use-aws-management-console-with-arcgis-server.htm
[45] https://aws.amazon.com/marketplace/pp/prodview-rh32a6tw3ju4a?sr=0-3&ref_=beagle&applicationId=AWSMPContessa
[46] http://aws.amazon.com/ec2/pricing/on-demand/
[47] https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Subnets.html
[48] https://docs.aws.amazon.com/awsconsolehelpdocs/latest/gsg/getting-started.html
[49] https://www.e-education.psu.edu/geog865/
[50] https://blogs.msdn.microsoft.com/lcris/2007/03/23/basic-sql-server-security-concepts-logins-users-and-principals/
[51] https://pro.arcgis.com/en/pro-app/help/data/geodatabases/manage-sql-server/comparison-geodatabase-owners-sqlserver.htm
[52] https://support.esri.com/en/technical-article/000022869
[53] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/us_data.zip
[54] https://pro.arcgis.com/en/pro-app/help/data/geodatabases/manage-sql-server/user-accounts-groups.htm
[55] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/DataFromLessons3and4.zip
[56] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/Lesson7_data.zip
[57] https://www.e-education.psu.edu/geog865/node/247
[58] https://pro.arcgis.com/en/pro-app/help/data/geodatabases/overview/an-overview-of-spatial-indexes-in-the-geodatabase.htm
[59] https://pro.arcgis.com/en/pro-app/help/data/geodatabases/overview/configuration-keywords-for-enterprise-geodatabases.htm#GUID-B588092F-E687-40A8-8661-E7E3AA0DC13C
[60] http://en.wikipedia.org/wiki/File:R-tree.svg
[61] https://docs.microsoft.com/en-us/sql/relational-databases/spatial/spatial-indexes-overview?view=sql-server-2017
[62] https://access.crunchydata.com/documentation/postgis/2.2.7/using_postgis_dbmanagement.html#gist_indexes
[63] https://msdn.microsoft.com/en-us/library/bb933790.aspx
[64] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/earthatnight.zip
[65] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/nhgis0001_shapefile_us_state_1790.zip
[66] http://www.nhgis.org
[67] https://support.esri.com/en-us/knowledge-base/error-editing-a-feature-class-in-a-versioned-feature-da-000029232#:~:text=When%20new%20feature%20classes%20are%20added%20to%20a,feature%20classes%20are%20not%20automatically%20registered%20as%20versioned
[68] https://pro.arcgis.com/en/pro-app/help/data/geodatabases/manage-sql-server/geodatabase-system-tables-sqlserver.htm
[69] https://pro.arcgis.com/en/pro-app/tool-reference/data-management/compress.htm
[70] http://downloads.esri.com/support/whitepapers/ao_/Versioning_Workflows_2.pdf