Database Design Concepts
A. Database design concepts
When building a relational database from scratch, it is important that you put a good deal of thought into the process. A poorly designed database can cause a number of headaches for its users, including:
- loss of data integrity over time
- inability to support needed queries
- slow performance
Entire courses can be spent on database design concepts, but we don't have that kind of time, so let's just focus on some basic design rules that should serve you well. A well-designed table is one that:
- seeks to minimize redundant data
- represents a single subject
- has a primary key (a field or set of fields whose values will uniquely identify each record in the table)
- does not contain multi-part fields (e.g., "302 Walker Bldg, University Park, PA 16802")
- does not contain multi-valued fields (e.g., an Author field shouldn't hold values of the form "Jones, Martin, Williams")
- does not contain unnecessary duplicate fields (e.g., avoid using Author1, Author2, Author3)
- does not contain fields that rely on other fields (e.g., don't create a Wage field in a table that has PayRate and HrsWorked fields)
B. Normalization
The process of designing a database according to the rules described above is formally referred to as normalization. All database designers carry out normalization, whether they use that term to describe the process or not. Hardcore database designers not only use the term normalization, they're also able to express the extent to which a database has been normalized:
- First normal form (1NF) describes a database whose tables represent distinct entities, have no duplicative columns (e.g., no Author1, Author2, Author3), and have a column or columns that uniquely identify each row (i.e., a primary key). Databases meeting these requirements are said to be in first normal form.
- Second normal form (2NF) describes a database that is in 1NF and also avoids having non-key columns that are dependent on a subset of the primary key. It's understandable if that seems confusing, have a look at this simple example [www.1keydata.com/database-normalization/second-normal-form-2nf.php]
In the example, CustomerID and StoreID form a composite key - that is, the combination of the values from those columns uniquely identifies the rows in the table. In other words, only one row in the table will have a CustomerID of 1 together with a StoreID of 1, only one row will have a CustomerID of 1 together with a StoreID of 3, etc. The PurchaseLocation column depends on the StoreID column, which is only part of the primary key. As shown, the solution to putting the table in 2NF is to move the StoreID-PurchaseLocation relationship into a separate table. This should make intuitive sense as it spells out the PurchaseLocation values just once rather than spelling them out repeatedly. - Third normal form (3NF) describes a database that is in 2NF and also avoids having columns that derive their values from columns other than the primary key. The wage field example mentioned above is a clear violation of the 3NF rule.
In most cases, normalizing a database so that it is in 3NF is sufficient. However, it is worth pointing out that there are other normal forms including Boyce-Codman normal form (BCNF, or 3.5NF), fourth normal form (4NF) and fifth normal form (5NF). Rather than spend time going through examples of these other forms, I encourage you to simply keep in mind the basic characteristics of a well-designed table listed above. If you follow those guidelines carefully, in particular, constantly being on the lookout for redundant data, you should be able to reap the benefits of normalization.
Generally speaking, a higher level of normalization results in a higher number of tables. And as the number of tables increases, the costs of bringing together data through joins increases as well, both in terms of the expertise required in writing the queries and in the performance of the database. In other words, the normalization process can sometimes yield a design that is too difficult to implement or that performs too slowly. Thus, it is important to bear in mind that database design is often a balancing of concerns related to data integrity and storage efficiency (why we normalize) versus concerns related to its usability (getting data into and out of the database).
Earlier, we talked about city/state combinations being redundant with zip code. That is a great example of a situation in which de-normalizing the data might make sense. I have no hard data on this, but I would venture to say that the vast majority of relational databases that store these three attributes keep them all together in the same table. Yes, there is a benefit to storing the city and state names once in the zip code table (less chance of a misspelling, less disk space used). However, my guess is that the added complexity of joining the city/state together with the rest of the address elements outweighs that benefit to most database designers.
C. Example scenario
Let's work through an example design scenario to demonstrate how these rules might be applied to produce an efficient database. Ice cream entrepreneurs Jen and Barry have opened their business and now need a database to track orders. When taking an order, they record the customer's name, the details of the order such as the flavors and quantities of ice cream needed, the date the order is needed, and the delivery address. Their database needs to help them answer two important questions:
- Which orders are due to be shipped within the next two days?
- Which flavors must be produced in greater quantities?
A first crack at storing the order information might look like this:
Customer | Order | DeliveryDate | DeliveryAdd |
---|---|---|---|
Eric Cartman | 1 vanilla, 2 chocolate | 12/1/11 | 101 Main St |
Bart Simpson | 10 chocolate, 10 vanilla, 5 strawberry | 12/3/11 | 202 School Ln |
Stewie Griffin | 1 rocky road | 12/3/11 | 303 Chestnut St |
Bart Simpson | 3 mint chocolate chip, 2 strawberry | 12/5/11 | 202 School Ln |
Hank Hill | 2 coffee, 3 vanilla | 12/8/11 | 404 Canary Dr |
Stewie Griffin | 5 rocky road | 12/10/11 | 303 Chestnut St |
The problem with this design becomes clear when you imagine trying to write a query that calculates the number of gallons of vanilla that have been ordered. The quantities are mixed with the names of the flavors, and any one flavor could be listed anywhere within the order field (i.e., it won't be consistently listed first or second).
A design like the following would be slightly better:
Customer | Flavor1 | Qty1 | Flavor2 | Qty2 | Flavor3 | Qty3 | DeliveryDate | DeliveryAdd |
---|---|---|---|---|---|---|---|---|
Eric Cartman | vanilla | 1 | chocolate | 2 | 12/1/11 | 101 Main St |
||
Bart Simpson | chocolate | 10 | vanilla | 10 | strawberry | 5 | 12/3/11 | 202 School Ln |
Stewie Griffin | rocky road | 1 | 12/3/11 | 303 Chestnut St |
||||
Bart Simpson | mint chocolate chip | 3 | strawberry | 2 | 12/5/11 | 202 School Ln |
||
Hank Hill | coffee | 2 | vanilla | 3 | 12/8/11 | 404 Canary Dr |
||
Stewie Griffin | rocky road | 5 | 12/10/11 | 303 Chestnut St |
This is an improvement because it enables querying on flavors and summing quantities. However, to calculate the gallons of vanilla ordered you would need to sum the values from three fields. Also, the design would break down if a customer ordered more than three flavors.
Slightly better still is this design:
Customer | Flavor | Qty | DeliveryDate | DeliveryAdd |
---|---|---|---|---|
Eric Cartman | vanilla | 1 | 12/1/11 | 101 Main St |
Eric Cartman | chocolate | 2 | 12/1/11 | 101 Main St |
Bart Simpson | chocolate | 10 | 12/3/11 | 202 School Ln |
Bart Simpson | vanilla | 10 | 12/3/11 | 202 School Ln |
Bart Simpson | strawberry | 5 | 12/3/11 | 202 School Ln |
Stewie Griffin | rocky road | 1 | 12/3/11 | 303 Chestnut St |
Hank Hill | coffee | 2 | 12/8/11 | 404 Canary Dr |
Hank Hill | vanilla | 3 | 12/8/11 | 404 Canary Dr |
Stewie Griffin | rocky road | 5 | 12/10/11 | 303 Chestnut St |
This design makes calculating the gallons of vanilla ordered much easier. Unfortunately, it also produces a lot of redundant data and spreads a complete order from a single customer across multiple rows.
Better than all of these approaches would be to separate the data into four entities (Customers, Flavors, Orders and Order Items):
CustID | NameLast | NameFirst | DeliveryAdd |
---|---|---|---|
1 | Cartman | Eric | 101 Main St |
2 | Simpson | Bart | 202 School Ln |
3 | Griffin | Stewie | 303 Chestnut St |
4 | Hill | Hank | 404 Canary Dr |
FlavorID | Name |
---|---|
1 | vanilla |
2 | chocolate |
3 | strawberry |
4 | rocky road |
5 | mint chocolate chip |
6 | coffee |
OrderID | CustID | DeliveryDate |
---|---|---|
1 | 1 | 12/1/11 |
2 | 2 | 12/3/11 |
3 | 3 | 12/3/11 |
4 | 2 | 12/5/11 |
5 | 4 | 12/8/11 |
6 | 3 | 12/10/11 |
OrderItemID | OrderID | FlavorID | Qty |
---|---|---|---|
1 | 1 | 1 | 1 |
2 | 1 | 2 | 2 |
3 | 2 | 2 | 10 |
4 | 2 | 1 | 10 |
5 | 2 | 3 | 5 |
6 | 3 | 4 | 1 |
7 | 4 | 5 | 3 |
8 | 4 | 3 | 2 |
9 | 5 | 6 | 2 |
10 | 5 | 1 | 3 |
11 | 6 | 4 | 5 |
If one were to implement a design like this in MS-Access, the query needed to display orders that must be delivered in the next 2 days would look like this in the GUI:
It would produce the following SQL:
If you aren't experienced with relational databases, then such a table join may seem intimidating. You probably won't need to do anything quite so complex in this course. The purpose of this example is two-fold:
- To illustrate that multi-table designs are often preferable to a "one-big-spreadsheet" approach.
- To emphasize the importance of SQL in pulling together data spread across multiple tables to perform valuable database queries.
D. Data modeling
Whether it's just a quick sketch on a napkin or a months-long process involving many stakeholders, the life cycle of any effective database begins with data modeling. Data modeling itself begins with a requirements analysis, which can be more or less formal, depending on the scale of the project. One of the common products of the data modeling process is an entity-relationship (ER) diagram. This sort of diagram depicts the categories of data that must be stored (the entities) along with the associations (or relationships) between them. The Wikipedia entry on ER diagrams is quite good, so I'm going to point you there to learn more:
Entity-relationship model article at Wikipedia [http://en.wikipedia.org/wiki/Entity-relationship_model]
An ER diagram is essentially a blueprint for a database structure. Some RDBMSs provide diagramming tools (e.g., Oracle Designer, MySQL Workbench) and often include the capability of automatically creating the table structure conceptualized in the diagram.
In a GIS context, Esri makes it possible to create new geodatabases based on diagrams authored using CASE (Computer-Aided Software Engineering) tools. This blog post, Using Case tools in Arc GIS 10, [http://blogs.esri.com/esri/arcgis/2010/08/05/using-case-tools-in-arcgis -10/] provides details if you are interested in learning more.
E. Design practice
To help drive these concepts home, here is a scenario for you to consider. You work for a group with an idea for a fun website: to provide a place for music lovers to share their reviews of albums on the 1001 Albums You Must Hear Before You Die list. (All of these albums had been streamable from Radio Romania 3Net, but sadly it appears that's no longer the case.)
Spend 15-30 minutes designing a database (on paper, no need to implement it in Access) for this scenario. Your database should be capable of efficiently storing all of the following data:
- Artist
- Album Genre
- Record Label
- Album Comments
- Album
- Track Name
- Reviewer ID
- Track Rating
- Year
- Track Length
- Album Rating
- Track Comments
When you're satisfied with your design, move to the next page and compare yours to mine.