After Loading Data
Esri recommends the following after loading data into a geodatabase feature class:
- Calculate database statistics using the Analyze tool.
- Grant privileges for other users to access the data.
- Update metadata.
- Add geodatabase behavior (domains, sub-types, topology rules, etc.).
The second item above, which we covered last lesson, is the only one that is absolutely critical. The first item, which we'll discuss in a moment, can greatly improve performance, especially as the size of the feature class increases. Metadata, covered in GEOG 484 is often overlooked, but can save a lot of headaches for anyone who has questions about the data. Geodatabase behavior functionality, covered in GEOG 484 and in Lesson 5 in this course, offers useful ways to improve the efficiency and accuracy of data maintenance workflows.
To this list, I would add the implementation of attribute and spatial indexes to improve performance. This page of the lesson will focus on database statistics and indexes.
A. Calculating DBMS statistics using the Analyze tool
Relational database packages like SQL Server provide users with the ability to calculate basic statistics on their tables, such as the common values and data distribution in each column. These statistics are stored in system tables that are utilized by the DBMS to determine the best way to carry out queries, thereby improving performance. As a table's data changes over time, the statistics will become out of date and less helpful in optimizing performance. This is why Esri recommends running the Analyze tool after major edits are made to a feature class. Let's run the tool on the states feature class we imported earlier.
- In the Geoprocessing pane, open the Analyze Datasets tool.
- For the Input Database Connection, browse to Databases > census_egdb.sde and click OK. Note that tables can only be analyzed by their owner.
- The Datasets to Analyze box should fill with a list of datasets (in this case all feature classes) owned by the user associated with the connection.
Click Select All.
At the bottom of the dialog are checkboxes that control which tables associated with the selected datasets should be analyzed (base, delta and/or archive). The base table (sometimes referred to as the business table) is essentially what you see when you open the feature class's attribute table. The delta table stores changes made to the base data in a versioned feature class, while the archive table stores data enabling database users to retrieve historical states of a feature class. We'll look at these topics in the next lesson. For now, you can just leave all three boxes checked. No harm is done if the feature classes don't have delta or archive tables.
Click Run to execute the tool.
After a couple of seconds, the process will be complete. Close the Analyze Datasets processing window.
Keep in mind that running Analyze may have no perceptible effect for small datasets like we're dealing with here, but might result in significant performance gains in larger datasets.
B. Attribute indexes
Attribute indexes are another mechanism used in relational databases to improve performance, particularly in the execution of queries. Developing better indexing algorithms is one of the more popular research topics in the computer science field. A comprehensive review of indexing schemes is outside the scope of this course. But at the very least, you should understand that one of the more common schemes works much like the index of a book.
If you're looking for discussion of a particular topic in a book, you don't skim through each page of the book beginning with page one. You look up the topic in the index, which tells you the pages where you can conduct a much narrower search for your topic. A database index often works in much the same way. Given a WHERE clause like "WHERE city = 'Philadelphia'", the index helps the DBMS begin its search at a particular row of the table rather than at row one.
Some points to keep in mind regarding indexes:
- They can be based on one or more columns.
- They must be created by the table owner.
- The degree to which an index will help performance depends on the degree of uniqueness in the values being indexed. Highly unique column content will benefit most from an index, less unique column content will benefit less.
- They require a greater amount of disk space since they are essentially alternate representations of the data that must be stored and consulted by the DBMS.
- They can increase the processing time required for table edits since the DBMS needs to not only perform the edit but also update the index. For this reason, Esri recommends dropping indexes prior to performing bulk edits, then re-creating the indexes after the edits are complete.
To see how attribute indexes are built in ArcGIS, let's create one on the name column in the us_cities feature class.
- In ArcGIS Pro's Catalog pane, expand the usa_L7 feature dataset.
- Right-click on the us_cities feature class, and select Properties.
(Recall that the import process may have added a “_1” to the feature class name.) - In the Feature Class Properties dialog, click on the Indexes tab. Note that an index already exists on the OBJECTID field.
- In the Attribute Indexes section of the dialog, click the Add button.
- In the Fields available list, of the Add Attribute Index dialog, highlight the NAME column, and click the right arrow to copy it over to the Fields Selected list.
Leave the Unique checkbox unchecked. Checking this box specifies that the database can stop searching after the first match is found. Thus, you'd only want to check this box if each value in the index column appears only once. That would be a bad idea in this case, since some of the city names are duplicated.
But, do check the Ascending box. This will create an index in which the city names are sorted in ascending order.
Assign a Name of us_cities_name_idx.
Click OK to create the index. - Click OK again to dismiss the Properties dialog.
I won't bother to have you do a test query before and after because I doubt we'd see much difference in performance with such a small table. Just keep this capability in mind if you find that your queries are taking a long time to execute.
C. Spatial indexes
While attribute indexes improve the performance of attribute queries, spatial indexes are used to improve the performance of spatial queries. Esri geodatabases support three different methods of spatial indexing, grid, R-tree, and B-tree. The grid method is analogous to the map index found in road atlases. A grid of equal-sized cells is laid over the feature class, and each row and column of the grid is assigned an identifier. Geometries in the feature class are compared to this grid and a list of grid cells intersected by each geometry is produced. These geometry-grid cell intersections are stored in a table. In the example below, feature 101 intersects three grid cells, while feature 102 is completely within a single cell.
FID | GX | GY |
---|---|---|
101 | 5 | 9 |
101 | 5 | 10 |
101 | 6 | 9 |
102 | 4 | 8 |
Index tables like this are used to enable GIS software to answer spatial questions without having to look at each geometry in the feature class. For example, imagine selecting features from our us_cities feature class that are within the state of Pennsylvania. The software will first look up the grid cells intersected by Pennsylvania. It can then throw out all of the us_cities points that don't intersect those same grid cells. It only needs to test for containment on points that share grid cells with the Pennsylvania polygon. This testing of only the close features is much more efficient than testing all features.
It is possible to define up to three of these spatial grids per feature class. Multiple grids with different resolutions can capture the extent of features more efficiently especially when the feature class contains features that vary greatly in their extent (i.e., some small features and some large).
The grid method is employed by Esri file geodatabases and Oracle-based ArcSDE geodatabases that store geometry using the Esri ST_Geometry type. ArcGIS calculates a default grid that typically provides a high level of performance. This page in the Esri documentation (An overview of spatial indexes in the geodatabase) provides further information on spatial indexes, including when you might want to rebuild one.
SQL Server geodatabase adminstrators have two options available for storing geometries: the Microsoft geometry and Microsoft geography data types, which are similar in concept to the geometry and geography spatial data types we saw in PostGIS. The default storage method when using SQL Server is Microsoft geometry. (More on how spatial indexing works for geometry and geography types can be found below.) This can be changed when creating a feature class by selecting Use configuration keyword on the last panel of the New Feature Class wizard. For example, if you have data covering a large spatial extent and want to use SQL Server's spatial functions to calculate spherical lengths and areas on the SQL command line, then storing the data using the geography type might be the way to go. Further information on these storage options can be found in the documentation (Configuration keywords for enterprise geodatabases).
Another spatial indexing method employed in ArcGIS is the R-tree, which uses a set of irregularly sized rectangles (R stands for rectangle) to group together nearby objects. This (File: R-tree.svg) figure helps to illustrate how an R-tree works. The red rectangles (labeled R8-R19) are the bounding boxes around some set of features (lines or polygons). The blue rectangles (R3-R7) are an aggregation of those features into groups, and the black rectangles (R1-R2) are a higher level of aggregation.
The basic idea of a search is the same, if the search geometry falls within R1 then the software knows it can disregard the features within the bounding boxes R15-R19 and instead focus on R8-R14. After that first check is completed, the blue level of the tree might be used to further narrow the search.
R-tree indexes are used in Oracle geodatabases that utilize the SDO_Geometry type. They are automatically created and managed by ArcGIS and while it is possible to delete and re-create an R-tree index, it's not clear that doing so would improve performance. If you're having performance issues in a geodatabase that uses R-tree indexing, you may want to dig further into the documentation and/or contact Esri customer support.
SQL Server-based geodatabases that implement the geometry or geography spatial type are spatially indexed using a B-tree method. (As noted above, the geometry spatial type is the default in SQL Server.) This is an indexing method commonly used for non-spatial data, but in this context modified by Microsoft to handle spatial indexing as well. Like the R-tree method, this modified B-tree method employs a rectangular grid for locating features.
Finally, Postgres-based geodatabases are spatially indexed using a Generalized Search Tree (GiST) approach. This indexing method was developed as an alternative to the older B-tree and R-tree methods for irregular data structures (such as GIS data). It realizes performance gains by breaking data up into groupings, like objects that are within, objects that overlap, objects to one side, etc.
Now that you've learned about some of the settings to consider when loading data into an enterprise geodatabase, let's look in SQL Server Management Studio to see how feature classes are stored. Reminder: remember to Stop your Instances when you finish or when you take a long break.