GEOG 868
Spatial Database Management

Database Design Concepts

PrintPrint

Database Design Concepts

A. Database design concepts

When building a relational database from scratch, it is important that you put a good deal of thought into the process. A poorly designed database can cause a number of headaches for its users, including:

  • loss of data integrity over time
  • inability to support needed queries
  • slow performance

Entire courses can be spent on database design concepts, but we don't have that kind of time, so let's just focus on some basic design rules that should serve you well. A well-designed table is one that:

  • seeks to minimize redundant data
  • represents a single subject
  • has a primary key (a field or set of fields whose values will uniquely identify each record in the table)
  • does not contain multi-part fields (e.g., "302 Walker Bldg, University Park, PA 16802")
  • does not contain multi-valued fields (e.g., an Author field shouldn't hold values of the form "Jones, Martin, Williams")
  • does not contain unnecessary duplicate fields (e.g., avoid using Author1, Author2, Author3)
  • does not contain fields that rely on other fields (e.g., don't create a Wage field in a table that has PayRate and HrsWorked fields)

B. Normalization

The process of designing a database according to the rules described above is formally referred to as normalization. All database designers carry out normalization, whether they use that term to describe the process or not. Hardcore database designers not only use the term normalization, they're also able to express the extent to which a database has been normalized:

  • First normal form (1NF) describes a database whose tables represent distinct entities, have no duplicative columns (e.g., no Author1, Author2, Author3), and have a column or columns that uniquely identify each row (i.e., a primary key). Databases meeting these requirements are said to be in first normal form.
  • Second normal form (2NF) describes a database that is in 1NF and also avoids having non-key columns that are dependent on a subset of the primary key. It's understandable if that seems confusing, have a look at this simple example [www.1keydata.com/database-normalization/second-normal-form-2nf.php]
    In the example, CustomerID and StoreID form a composite key - that is, the combination of the values from those columns uniquely identifies the rows in the table. In other words, only one row in the table will have a CustomerID of 1 together with a StoreID of 1, only one row will have a CustomerID of 1 together with a StoreID of 3, etc. The PurchaseLocation column depends on the StoreID column, which is only part of the primary key. As shown, the solution to putting the table in 2NF is to move the StoreID-PurchaseLocation relationship into a separate table. This should make intuitive sense as it spells out the PurchaseLocation values just once rather than spelling them out repeatedly.
  • Third normal form (3NF) describes a database that is in 2NF and also avoids having columns that derive their values from columns other than the primary key. The wage field example mentioned above is a clear violation of the 3NF rule.

In most cases, normalizing a database so that it is in 3NF is sufficient. However, it is worth pointing out that there are other normal forms including Boyce-Codman normal form (BCNF, or 3.5NF), fourth normal form (4NF) and fifth normal form (5NF). Rather than spend time going through examples of these other forms, I encourage you to simply keep in mind the basic characteristics of a well-designed table listed above. If you follow those guidelines carefully, in particular, constantly being on the lookout for redundant data, you should be able to reap the benefits of normalization.

Generally speaking, a higher level of normalization results in a higher number of tables. And as the number of tables increases, the costs of bringing together data through joins increases as well, both in terms of the expertise required in writing the queries and in the performance of the database. In other words, the normalization process can sometimes yield a design that is too difficult to implement or that performs too slowly. Thus, it is important to bear in mind that database design is often a balancing of concerns related to data integrity and storage efficiency (why we normalize) versus concerns related to its usability (getting data into and out of the database).

Earlier, we talked about city/state combinations being redundant with zip code. That is a great example of a situation in which de-normalizing the data might make sense. I have no hard data on this, but I would venture to say that the vast majority of relational databases that store these three attributes keep them all together in the same table. Yes, there is a benefit to storing the city and state names once in the zip code table (less chance of a misspelling, less disk space used). However, my guess is that the added complexity of joining the city/state together with the rest of the address elements outweighs that benefit to most database designers.

C. Example scenario

Let's work through an example design scenario to demonstrate how these rules might be applied to produce an efficient database. Ice cream entrepreneurs Jen and Barry have opened their business and now need a database to track orders. When taking an order, they record the customer's name, the details of the order such as the flavors and quantities of ice cream needed, the date the order is needed, and the delivery address. Their database needs to help them answer two important questions:

  1. Which orders are due to be shipped within the next two days?
  2. Which flavors must be produced in greater quantities?

A first crack at storing the order information might look like this:

Table 2.1: Order information
Customer Order DeliveryDate DeliveryAdd
Eric Cartman 1 vanilla, 2 chocolate 12/1/11 101 Main St
Bart Simpson 10 chocolate, 10 vanilla, 5 strawberry 12/3/11 202 School Ln
Stewie Griffin 1 rocky road 12/3/11 303 Chestnut St
Bart Simpson 3 mint chocolate chip, 2 strawberry 12/5/11 202 School Ln
Hank Hill 2 coffee, 3 vanilla 12/8/11 404 Canary Dr
Stewie Griffin 5 rocky road 12/10/11 303 Chestnut St

The problem with this design becomes clear when you imagine trying to write a query that calculates the number of gallons of vanilla that have been ordered. The quantities are mixed with the names of the flavors, and any one flavor could be listed anywhere within the order field (i.e., it won't be consistently listed first or second).

A design like the following would be slightly better:

Table 2.2: Customer order by flavors
Customer Flavor1 Qty1 Flavor2 Qty2 Flavor3 Qty3 DeliveryDate DeliveryAdd
Eric Cartman vanilla 1 chocolate 2 12/1/11 101
Main St
Bart Simpson chocolate 10 vanilla 10 strawberry 5 12/3/11 202
School Ln
Stewie Griffin rocky road 1 12/3/11 303
Chestnut St
Bart Simpson mint chocolate chip 3 strawberry 2 12/5/11 202
School Ln
Hank Hill coffee 2 vanilla 3 12/8/11 404
Canary Dr
Stewie Griffin rocky road 5 12/10/11 303
Chestnut St

This is an improvement because it enables querying on flavors and summing quantities. However, to calculate the gallons of vanilla ordered you would need to sum the values from three fields. Also, the design would break down if a customer ordered more than three flavors.

Slightly better still is this design:

Table 2.3: Customer flavors and quantities
Customer Flavor Qty DeliveryDate DeliveryAdd
Eric Cartman vanilla 1 12/1/11 101
Main St
Eric Cartman chocolate 2 12/1/11 101
Main St
Bart Simpson chocolate 10 12/3/11 202
School Ln
Bart Simpson vanilla 10 12/3/11 202
School Ln
Bart Simpson strawberry 5 12/3/11 202
School Ln
Stewie Griffin rocky road 1 12/3/11 303
Chestnut St
Hank Hill coffee 2 12/8/11 404
Canary Dr
Hank Hill vanilla 3 12/8/11 404
Canary Dr
Stewie Griffin rocky road 5 12/10/11 303
Chestnut St

This design makes calculating the gallons of vanilla ordered much easier. Unfortunately, it also produces a lot of redundant data and spreads a complete order from a single customer across multiple rows.

Better than all of these approaches would be to separate the data into four entities (Customers, Flavors, Orders and Order Items):

Table 2.4: Customers
CustID NameLast NameFirst DeliveryAdd
1 Cartman Eric 101
Main St
2 Simpson Bart 202
School Ln
3 Griffin Stewie 303
Chestnut St
4 Hill Hank 404
Canary Dr
Table 2.5: Flavors
FlavorID Name
1 vanilla
2 chocolate
3 strawberry
4 rocky road
5 mint chocolate chip
6 coffee
Table 2.6: Orders
OrderID CustID DeliveryDate
1 1 12/1/11
2 2 12/3/11
3 3 12/3/11
4 2 12/5/11
5 4 12/8/11
6 3 12/10/11
Table 2.7: OrderItems
OrderItemID OrderID FlavorID Qty
1 1 1 1
2 1 2 2
3 2 2 10
4 2 1 10
5 2 3 5
6 3 4 1
7 4 5 3
8 4 3 2
9 5 6 2
10 5 1 3
11 6 4 5

If one were to implement a design like this in MS-Access, the query needed to display orders that must be delivered in the next 2 days would look like this in the GUI:

Screen capture of MS Access Design View portion of programming GUI. See caption.
Figure 2.1: Screen capture of MS Access Design View portion of programming GUI. Shows the built query that will select orders that must be delivered in the next 2 days. Join connections between Flavors, Order Items, Orders, and Customers tables are shown in top panel. Bottom panel shows the display status of fields, as well as criteria that control how the output is displayed, and what criteria are used to sort/select records.

It would produce the following SQL:

SELECT Orders.OrderID, Customers.NameLast, Customers.NameFirst, Flavors.Name, OrderItems.Qty, Customers.DeliveryAdd, Orders.DeliveryDate FROM ((Orders INNER JOIN OrderItems ON Orders.OrderID = OrderItems.OrderID) INNER JOIN Customers ON Orders.CustID = Customers.CustID) INNER JOIN Flavors ON OrderItems.FlavorID = Flavors.FlavorID WHERE (((Orders.DeliveryDate) < Now()+2)) ORDER BY Orders.OrderID;

If you aren't experienced with relational databases, then such a table join may seem intimidating. You probably won't need to do anything quite so complex in this course. The purpose of this example is two-fold:

  1. To illustrate that multi-table designs are often preferable to a "one-big-spreadsheet" approach.
  2. To emphasize the importance of SQL in pulling together data spread across multiple tables to perform valuable database queries.

D. Data modeling

Whether it's just a quick sketch on a napkin or a months-long process involving many stakeholders, the life cycle of any effective database begins with data modeling. Data modeling itself begins with a requirements analysis, which can be more or less formal, depending on the scale of the project. One of the common products of the data modeling process is an entity-relationship (ER) diagram. This sort of diagram depicts the categories of data that must be stored (the entities) along with the associations (or relationships) between them. The Wikipedia entry on ER diagrams is quite good, so I'm going to point you there to learn more:

Entity-relationship model article at Wikipedia [http://en.wikipedia.org/wiki/Entity-relationship_model]

An ER diagram is essentially a blueprint for a database structure. Some RDBMSs provide diagramming tools (e.g., Oracle Designer, MySQL Workbench) and often include the capability of automatically creating the table structure conceptualized in the diagram.

In a GIS context, Esri makes it possible to create new geodatabases based on diagrams authored using CASE (Computer-Aided Software Engineering) tools. This blog post, Using Case tools in Arc GIS 10, [http://blogs.esri.com/esri/arcgis/2010/08/05/using-case-tools-in-arcgis -10/] provides details if you are interested in learning more.

E. Design practice

To help drive these concepts home, here is a scenario for you to consider. You work for a group with an idea for a fun website: to provide a place for music lovers to share their reviews of albums on the 1001 Albums You Must Hear Before You Die list. (All of these albums had been streamable from Radio Romania 3Net, but sadly it appears that's no longer the case.)

Spend 15-30 minutes designing a database (on paper, no need to implement it in Access) for this scenario. Your database should be capable of efficiently storing all of the following data:

  • Artist
  • Album Genre
  • Record Label
  • Album Comments
  • Album
  • Track Name
  • Reviewer ID
  • Track Rating
  • Year
  • Track Length
  • Album Rating
  • Track Comments

When you're satisfied with your design, move to the next page and compare yours to mine.