Now that you've gotten your feet wet with an enterprise geodatabase, it's time to dig into some of the details of managing vector and raster data in that environment. In this lesson, you'll learn how to add new feature classes and raster data sets to an enterprise geodatabase and about steps that should be taken after the creation of new datasets. You'll also see how geodatabase data is stored within SQL Server.
If you have any questions now or at any point during this week, please feel free to post them to the Lesson 7 Discussion Forum.
Lesson 7 is one week in length. See the Canvas Calendar for specific due dates. To finish this lesson, you must complete the activities listed below. You may find it useful to print this page out first so that you can follow along with the directions.
In the previous lesson, we saw how to import vector data into new geodatabase feature classes as part of our experimentation with logins, users, and roles. Much of this section of Lesson 7 will be review as we import more data, though the process is outlined in greater detail here.
We are going to transfer four shapefile datasets from your computer to your Amazon cloud enterprise geodatabase instance. The shapefiles are data you worked with back in Lessons 3 and 4: the states, counties, and us_cities shapefiles that we used in Lesson 3, and also the cities shapefile that you used in Project 4 in the Jen and Barry's site selection exercise. I gave you copies of them in the DataFromLessons3and4.zip archive that you downloaded in the Checklist section, above.
A common workflow for organizations that are migrating their data to an enterprise geodatabase is to import data that already exist in other formats. Let's walk through that process.
Recall from the last lesson that users created using the Create Database User tool are able to load data and create new feature classes from scratch, and that it's considered a best practice to limit these capabilities to administrative staff. We used SQL Server Management Studio to create the users to whom we didn't want to grant data-loading ability.
Speaking of creating new tables from scratch, let's take a look at that workflow in the next section.
Always remember to Stop your EC2 Instance when you finish or when you take a long break.
In this section of the lesson, we'll look at creating a new feature class and populating it using the Append tool. To illustrate the process, imagine you're again working for Jen and Barry and need a new cities feature class containing the following fields:
| Name | Data type |
| --- | --- |
| population | Long Integer |
| total_crim | Long Integer |
| crime_inde | Double |
| university | Short Integer |
New features can be added to the feature class using Pro's editing tools, which were covered in detail in GEOG 484. Another way to populate a feature class is by using the Append tool.
As its name implies, the Append tool is used to append features held in feature classes/shapefiles to another existing feature class. Let's use it to append the features in our Jen and Barry's cities shapefile to the empty cities feature class we just added to our geodatabase.
The next part of the dialog is concerned with whether the fields in the input dataset match the fields in the target dataset. The default must match option checks to see if the fields match (in name and data type) and will not allow the append operation to occur if there is a mismatch. The Use the field map option allows for differences between the datasets.
Because our cities feature class doesn't have all of the fields found in the cities shapefile, select the Use the field map option. A couple of notes on the Append tool that you should keep in mind:
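To make the field map idea concrete, here is a minimal sketch in plain Python (not arcpy) of what field mapping does during an append: only attributes that exist in the target schema are carried over, and extra source fields are dropped. The sample row and the extra NAME/AREA fields are hypothetical.

```python
# Illustrative sketch (plain Python, not arcpy) of the Append tool's
# field map behavior: fields present in the target schema are copied;
# source fields with no match are dropped. Data values are made up.
target_fields = ["population", "total_crim", "crime_inde", "university"]

source_rows = [
    {"NAME": "Berryville", "population": 18500, "total_crim": 42,
     "crime_inde": 0.02, "university": 1, "AREA": 12.7},
]

def map_row(row, target_fields):
    """Keep only the attributes that exist in the target feature class."""
    return {f: row[f] for f in target_fields if f in row}

appended = [map_row(r, target_fields) for r in source_rows]
print(appended[0])
```

Note how NAME and AREA never reach the target; a real field map can also rename fields or merge several source fields into one target field.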
Always remember to Stop your Instance when you finish or when you take a long break.
Esri recommends the following after loading data into a geodatabase feature class:
The second item above, which we covered last lesson, is the only one that is absolutely critical. The first item, which we'll discuss in a moment, can greatly improve performance, especially as the size of the feature class increases. Metadata, covered in GEOG 484, is often overlooked, but can save a lot of headaches for anyone who has questions about the data. Geodatabase behavior functionality, covered in GEOG 484 and in Lesson 5 in this course, offers useful ways to improve the efficiency and accuracy of data maintenance workflows.
To this list, I would add the implementation of attribute and spatial indexes to improve performance. This page of the lesson will focus on database statistics and indexes.
Relational database packages like SQL Server provide users with the ability to calculate basic statistics on their tables, such as the common values and data distribution in each column. These statistics are stored in system tables that are utilized by the DBMS to determine the best way to carry out queries, thereby improving performance. As a table's data changes over time, the statistics will become out of date and less helpful in optimizing performance. This is why Esri recommends running the Analyze tool after major edits are made to a feature class. Let's run the tool on the states feature class we imported earlier.
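As a toy illustration (not SQL Server's actual internals), the sketch below computes the sort of per-column statistics a DBMS keeps, distinct-value counts and value frequencies, which help the query planner estimate how selective a WHERE clause will be. The sample rows are hypothetical.

```python
# A toy illustration of the column statistics a DBMS gathers for its
# query planner: how many distinct values a column has, and which
# values dominate. Sample rows are hypothetical.
from collections import Counter

state_rows = [
    {"name": "Pennsylvania", "region": "Northeast"},
    {"name": "Ohio", "region": "Midwest"},
    {"name": "New York", "region": "Northeast"},
]

def column_stats(rows, column):
    """Summarize one column: distinct count and most common value."""
    freq = Counter(r[column] for r in rows)
    return {"distinct": len(freq), "most_common": freq.most_common(1)[0]}

stats = column_stats(state_rows, "region")
print(stats)  # {'distinct': 2, 'most_common': ('Northeast', 2)}
```

When the table changes and these numbers go stale, the planner's cost estimates degrade, which is exactly why Esri recommends re-running Analyze after major edits.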
At the bottom of the dialog are checkboxes that control which tables associated with the selected datasets should be analyzed (base, delta and/or archive). The base table (sometimes referred to as the business table) is essentially what you see when you open the feature class's attribute table. The delta table stores changes made to the base data in a versioned feature class, while the archive table stores data enabling database users to retrieve historical states of a feature class. We'll look at these topics in the next lesson. For now, you can just leave all three boxes checked. No harm is done if the feature classes don't have delta or archive tables.
Click Run to execute the tool. Keep in mind that running Analyze may have no perceptible effect on small datasets like the ones we're dealing with here, but it might result in significant performance gains on larger datasets.
Attribute indexes are another mechanism used in relational databases to improve performance, particularly in the execution of queries. Developing better indexing algorithms is one of the more popular research topics in the computer science field. A comprehensive review of indexing schemes is outside the scope of this course. But at the very least, you should understand that one of the more common schemes works much like the index of a book.
If you're looking for discussion of a particular topic in a book, you don't skim through each page of the book beginning with page one. You look up the topic in the index, which tells you the pages where you can conduct a much narrower search for your topic. A database index often works in much the same way. Given a WHERE clause like "WHERE city = 'Philadelphia'", the index helps the DBMS begin its search at a particular row of the table rather than at row one.
Some points to keep in mind regarding indexes:
To see how attribute indexes are built in ArcGIS, let's create one on the name column in the us_cities feature class.
I won't bother to have you do a test query before and after because I doubt we'd see much difference in performance with such a small table. Just keep this capability in mind if you find that your queries are taking a long time to execute.
While attribute indexes improve the performance of attribute queries, spatial indexes are used to improve the performance of spatial queries. Esri geodatabases support three different methods of spatial indexing: grid, R-tree, and B-tree. The grid method is analogous to the map index found in road atlases. A grid of equal-sized cells is laid over the feature class, and each row and column of the grid is assigned an identifier. Geometries in the feature class are compared to this grid, and a list of grid cells intersected by each geometry is produced. These geometry-grid cell intersections are stored in a table. In the example below, feature 101 intersects three grid cells, while feature 102 is completely within a single cell.
| FID | GX | GY |
| --- | --- | --- |
| 101 | 5 | 9 |
| 101 | 5 | 10 |
| 101 | 6 | 9 |
| 102 | 4 | 8 |
Index tables like this are used to enable GIS software to answer spatial questions without having to look at each geometry in the feature class. For example, imagine selecting features from our us_cities feature class that are within the state of Pennsylvania. The software will first look up the grid cells intersected by Pennsylvania. It can then throw out all of the us_cities points that don't intersect those same grid cells. It only needs to test for containment on points that share grid cells with the Pennsylvania polygon. This testing of only the close features is much more efficient than testing all features.
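The cell-assignment step can be sketched in Python: given a feature's bounding box and a cell size, compute the (GX, GY) pairs of every grid cell the box touches, in the spirit of the FID/GX/GY table above. The coordinates and cell size here are made up for illustration.

```python
# A sketch of the grid method: which equal-sized cells does a feature's
# bounding box touch? Coordinates and the cell size are hypothetical.
def grid_cells(xmin, ymin, xmax, ymax, cell=10.0):
    """Return (gx, gy) pairs for all cells a bounding box intersects."""
    cells = []
    gx = int(xmin // cell)
    while gx * cell <= xmax:
        gy = int(ymin // cell)
        while gy * cell <= ymax:
            cells.append((gx, gy))
            gy += 1
        gx += 1
    return cells

# A feature spanning two grid columns and two grid rows:
print(grid_cells(55, 92, 64, 101))  # [(5, 9), (5, 10), (6, 9), (6, 10)]

# A small feature wholly within one cell, like feature 102 above:
print(grid_cells(41, 82, 44, 88))   # [(4, 8)]
```

One row per (feature, cell) pair is what lands in the index table, which is why large features that cross many cells generate many rows.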
It is possible to define up to three of these spatial grids per feature class. Multiple grids with different resolutions can capture the extent of features more efficiently, especially when the feature class contains features that vary greatly in their extent (i.e., some small features and some large).
The grid method is employed by Esri file geodatabases and Oracle-based ArcSDE geodatabases that store geometry using the Esri ST_Geometry type. ArcGIS calculates a default grid that typically provides a high level of performance. This page in the Esri documentation (An overview of spatial indexes in the geodatabase) [4] provides further information on spatial indexes, including when you might want to rebuild one.
SQL Server geodatabase administrators have two options available for storing geometries: the Microsoft geometry and Microsoft geography data types, which are similar in concept to the geometry and geography spatial data types we saw in PostGIS. The default storage method when using SQL Server is Microsoft geometry. (More on how spatial indexing works for geometry and geography types can be found below.) This can be changed when creating a feature class by selecting Use configuration keyword on the last panel of the New Feature Class wizard. For example, if you have data covering a large spatial extent and want to use SQL Server's spatial functions to calculate spherical lengths and areas on the SQL command line, then storing the data using the geography type might be the way to go. Further information on these storage options can be found in the documentation (Configuration keywords for enterprise geodatabases) [5].
Another spatial indexing method employed in ArcGIS is the R-tree, which uses a set of irregularly sized rectangles (R stands for rectangle) to group together nearby objects. This (File: R-tree.svg) figure [6] helps to illustrate how an R-tree works. The red rectangles (labeled R8-R19) are the bounding boxes around some set of features (lines or polygons). The blue rectangles (R3-R7) are an aggregation of those features into groups, and the black rectangles (R1-R2) are a higher level of aggregation.
The basic idea of a search is the same: if the search geometry falls within R1, then the software knows it can disregard the features within bounding boxes R15-R19 and instead focus on R8-R14. After that first check is completed, the blue level of the tree might be used to narrow the search further.
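The pruning idea can be sketched with a simplified two-level hierarchy of rectangles: check the top-level boxes first, and descend only into subtrees whose bounding box contains the query point. The rectangle names and coordinates below are hypothetical, not taken from the figure.

```python
# A simplified two-level search in the spirit of an R-tree: top-level
# rectangles are tested first, and whole subtrees are pruned when their
# bounding box misses the query point. All boxes are hypothetical.
def contains(box, pt):
    (xmin, ymin, xmax, ymax), (x, y) = box, pt
    return xmin <= x <= xmax and ymin <= y <= ymax

# Each top-level node covers a list of child bounding boxes.
tree = {
    "R1": {"box": (0, 0, 50, 50), "children": {"R3": (5, 5, 20, 20),
                                               "R4": (25, 30, 45, 48)}},
    "R2": {"box": (60, 0, 100, 50), "children": {"R6": (65, 5, 80, 20)}},
}

def search(tree, pt):
    hits = []
    for name, node in tree.items():
        if not contains(node["box"], pt):
            continue  # prune: this entire subtree can be skipped
        for child, box in node["children"].items():
            if contains(box, pt):
                hits.append(child)
    return hits

print(search(tree, (10, 10)))  # ['R3']
```

A real R-tree may be many levels deep and must also handle overlapping node rectangles, but the prune-then-descend logic is the same.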
R-tree indexes are used in Oracle geodatabases that utilize the SDO_Geometry type. They are automatically created and managed by ArcGIS, and while it is possible to delete and re-create an R-tree index, it's not clear that doing so would improve performance. If you're having performance issues in a geodatabase that uses R-tree indexing, you may want to dig further into the documentation and/or contact Esri customer support.
SQL Server-based geodatabases that implement the geometry or geography spatial type are spatially indexed using a B-tree method [7]. (As noted above, the geometry spatial type is the default in SQL Server.) This is an indexing method commonly used for non-spatial data, but in this context modified by Microsoft to handle spatial indexing as well. Like the R-tree method, this modified B-tree method employs a rectangular grid for locating features.
Finally, Postgres-based geodatabases are spatially indexed using a Generalized Search Tree (GiST) approach [8]. This indexing method was developed as an alternative to the older B-tree and R-tree methods for irregular data structures (such as GIS data). It realizes performance gains by breaking data up into groupings, like objects that are within, objects that overlap, objects to one side, etc.
Now that you've learned about some of the settings to consider when loading data into an enterprise geodatabase, let's look in SQL Server Management Studio to see how feature classes are stored. Reminder: Stop your Instances when you finish or when you take a long break.
Let's take a look at how the data we've been working with in ArcGIS Pro is stored in SQL Server.
ArcSDE relies on a number of tables behind the scenes. Many of these so-called repository tables are owned by the dbo superuser.
It's not really important that you remember much about these repository tables. However, hopefully, you now have a bit of an appreciation for what's going on behind the scenes and will see the tables as a bit less of a mystery.
Always remember to Stop your Instance when you finish or when you take a long break.
Esri offers a number of different options for managing raster data in an enterprise geodatabase. One area that receives a good deal of attention is managing collections of adjacent raster data sets (e.g., aerial photos). The options for dealing with such collections range from working with the raster data sets individually to merging them together into one large data set. In between is something Esri calls a mosaic dataset, which attempts to provide the best of both worlds: the ability to work with multiple raster data sets as one combined layer or to break them up into their individual components. We'll talk about a couple of these approaches in this section of the lesson.
Let's see how to bring a raster data set into a geodatabase.
The "_1" part of these table names comes from the rastercolumn_id value assigned to the raster, found in the SDE_raster_columns repository table. If the earthatnight raster instead had a rastercolumn_id of 2, its pixel data would be stored in SDE_blk_2, for example.
Raster datasets can hold either integer or floating-point data. They can also be composed of multiple bands. If you have a single-band integer raster dataset, a value attribute table (VAT) can be built that stores the number of cells associated with each integer value. The earthatnight raster holds integer values, but it is composed of three bands. If it were a single-band integer raster dataset, we would see an SDE_vat_1 table in addition to the other tables.
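What a VAT stores is easy to illustrate: for a single-band integer raster, it is essentially a count of cells per value. The tiny raster below is made up for the sake of the sketch.

```python
# A sketch of what a value attribute table (VAT) records for a
# single-band integer raster: the cell count for each distinct value.
from collections import Counter

band = [
    [1, 1, 2],
    [2, 2, 3],
]  # a tiny hypothetical single-band integer raster

vat = Counter(v for row in band for v in row)
print(sorted(vat.items()))  # [(1, 2), (2, 3), (3, 1)]
```

With the VAT in hand, a question like "how many cells are land-use class 2?" is a table lookup rather than a scan of every pixel.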
As mentioned at the beginning of this section, it is common in GIS to deal with collections of adjacent raster datasets. Esri's mosaic dataset can be used to treat such collections as a single unit while still having the ability to work with the individual files as needed. Let's create a mosaic dataset to manage some elevation raster data sets for the State College, PA, area (found in the doqs folder from the Lesson7_data download).
The nearest neighbor method is best for discrete raster datasets (like a land use or soils grid) and for scanned maps. Bilinear interpolation and cubic convolution are better suited for continuous raster datasets (like an elevation grid) and for satellite imagery and aerial photography. Bilinear interpolation is faster at pyramid creation time than cubic convolution, but on the flip side, cubic convolution typically produces the most visually pleasing output.
The pixel data created by the pyramid building process can be compressed to reduce storage requirements and improve performance. Higher levels of compression can be achieved with different methods, though care should be taken to match the data's use to an appropriate compression method.
The LZ77 method is referred to as a loss-less compression method because it results in no degradation of the input data. It should be used when the highest accuracy possible is required. The other method, JPEG, can produce a significantly higher level of compression, though at the expense of some degradation of the input data. Thus, it is referred to as a lossy compression method. The JPEG method can be used in situations when the highest level of spatial accuracy is not really necessary.
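"Lossless" can be demonstrated concretely with Python's zlib module, whose DEFLATE algorithm is built on LZ77: compressing and then decompressing returns the input byte for byte, unlike a lossy scheme such as JPEG. The byte values below stand in for repetitive raster data and are made up.

```python
# Demonstrating lossless compression with zlib (DEFLATE is LZ77-based):
# the round trip restores the input exactly, byte for byte. The "pixel"
# bytes are hypothetical, chosen to be repetitive like real raster data.
import zlib

pixels = bytes([10, 10, 10, 10, 42, 42, 42, 7] * 100)

compressed = zlib.compress(pixels)
restored = zlib.decompress(compressed)

assert restored == pixels  # no degradation whatsoever: lossless
print(len(pixels), "->", len(compressed))
```

A JPEG round trip, by contrast, would give back pixel values close to, but not identical to, the originals, which is the trade it makes for its higher compression ratios.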
A bit of degradation in the raster quality in the pyramid data is acceptable, so let's go with the JPEG compression option.
Choose JPEG from the Compression type dropdown list, and accept the default Quality value of 75.
In the Raster Statistics part of the dialog, you'll see a couple of "skip factor" options, one for the x dimension and one for the y. These values specify how many rows/columns to skip when computing statistics on the raster. The default skip factor value is 1 for each dimension, which means that cell values are retrieved for every other row and every other column. This decreases the time required to calculate the statistics, though it also decreases the accuracy of the statistics. In most cases, a skip value of 1 should produce statistics that are "good enough".
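Skip-factor sampling can be sketched as follows, using the lesson's interpretation that a skip factor of n skips n rows (or columns) between sampled cells; the 4x4 raster values are made up.

```python
# A sketch of skip-factor sampling for raster statistics, following the
# lesson's description: skip factor n skips n rows/columns between
# sampled cells, trading accuracy for speed. Raster values are made up.
def sampled_mean(raster, skip_x=1, skip_y=1):
    """Mean of the sampled cells; skip of 0 means sample every cell."""
    step_x, step_y = skip_x + 1, skip_y + 1
    values = [row[j]
              for row in raster[::step_y]
              for j in range(0, len(row), step_x)]
    return sum(values) / len(values)

raster = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]

# Skip factor 1: every other row and column (cells 1, 3, 9, 11).
print(sampled_mean(raster))        # 6.0
# Skip factor 0: every cell.
print(sampled_mean(raster, 0, 0))  # 8.5
```

The gap between the two means (6.0 vs. 8.5 here) is the accuracy cost of sampling; on large, spatially smooth rasters, that gap is usually small enough to be worth the time savings.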
The Statistics ignore value is a value or list of values that should not be included in the statistics calculations. For example, if you used a value like -9999 for missing data, you would want to specify that as the ignore value to avoid generating distorted statistics.
Accept all of the defaults in the Raster Statistics part of the dialog. Click OK to close the Environment Settings window.
Always remember to Stop your Instance when you finish or when you take a long break.
For Project 7, you are going to revisit the historical maps of Charlottesville, VA, that you may have worked with at the end of GEOG 484. In the data download for Lesson 7, you were given a folder containing 4 scanned maps of Charlottesville, circa 1920. You were also given a shapefile of buildings digitized from those scanned maps. Your task for this project is to:
This project is one week in length. Please refer to the Canvas Calendar for the due date.
Links
[1] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/DataFromLessons3and4.zip
[2] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/Lesson7_data.zip
[3] https://www.e-education.psu.edu/geog865/node/247
[4] https://pro.arcgis.com/en/pro-app/help/data/geodatabases/overview/an-overview-of-spatial-indexes-in-the-geodatabase.htm
[5] https://pro.arcgis.com/en/pro-app/help/data/geodatabases/overview/configuration-keywords-for-enterprise-geodatabases.htm#GUID-B588092F-E687-40A8-8661-E7E3AA0DC13C
[6] http://en.wikipedia.org/wiki/File:R-tree.svg
[7] https://docs.microsoft.com/en-us/sql/relational-databases/spatial/spatial-indexes-overview?view=sql-server-2017
[8] https://access.crunchydata.com/documentation/postgis/2.2.7/using_postgis_dbmanagement.html#gist_indexes
[9] https://msdn.microsoft.com/en-us/library/bb933790.aspx
[10] https://www.e-education.psu.edu/spatialdb/sites/www.e-education.psu.edu.spatialdb/files/earthatnight.zip