Find out what normalization is and how your database can benefit from it (or suffer from it). Learn the advantages, disadvantages, and some techniques and guidelines for doing it yourself.
In this hour, you learn the process of taking a raw database and breaking it into logical units called tables. This process is referred to as normalization. The normalization process is used by database developers to design databases in which it is easy to organize and manage data while ensuring the accuracy of data throughout the database.
The advantages and disadvantages of both normalization and denormalization of a database are discussed, as well as data integrity versus performance issues that pertain to normalization.
The highlights of this hour include
What normalization is
Benefits of normalization
Advantages of denormalization
Guidelines of normalization
The three normal forms
Normalizing a Database
Normalization is the process of reducing data redundancy in a database. It is a technique used when designing and redesigning a database, and it amounts to a set of guidelines for optimally structuring a database so that redundant data is minimized. The actual guidelines of normalization, called normal forms, are discussed later in this hour. It was a difficult decision to cover normalization in this book because of the complexity involved in understanding the rules of the normal forms this early in your SQL journey. However, normalization is an important process that, if understood, will increase your understanding of SQL. We have attempted to simplify the process of normalization as much as possible in this hour. At this point, don't be overly concerned with all the specifics of normalization; it is most important to understand the basic concepts.
The Raw Database
A database that is not normalized may include data that is repeated across one or more tables for no apparent reason. This can be bad for security, disk space usage, query speed, efficiency of database updates, and, maybe most importantly, data integrity. A database before normalization is one that has not been broken down logically into smaller, more manageable tables. Figure 4.1 illustrates the database used for this book before it was normalized.
Figure 4.1 The raw database.
Logical Database Design
Any database should be designed with the end user in mind. Logical database design, also referred to as the logical model, is the process of arranging data into logical, organized groups of objects that can easily be maintained. The logical design of a database should reduce data repetition or go so far as to completely eliminate it. After all, why store the same data twice? Naming conventions used in a database should also be standard and logical.
What Are the End User's Needs?
The needs of the end user should be one of the top considerations when designing a database. Remember that the end user is the person who ultimately uses the database. There should be ease of use through the user's front-end tool (a client program that allows a user access to a database), but this, along with optimal performance, cannot be achieved if the user's needs are not taken into consideration.
Some user-related design considerations include the following:
What data should be stored in the database?
How will the user access the database?
What privileges does the user require?
How should the data be grouped in the database?
What data is the most commonly accessed?
How is all data related in the database?
What measures should be taken to ensure accurate data?
Data should not be redundant, which means that the duplication of data should be kept to a minimum for several reasons. For example, it is unnecessary to store an employee's home address in more than one table. With duplicate data, unnecessary space is used. Confusion is always a threat when, for instance, an address for an employee in one table does not match the address of the same employee in another table. Which table is correct? Do you have documentation to verify the employee's current address? As if data management were not difficult enough, redundancy of data could prove to be a disaster.
The Normal Forms
The next sections discuss the normal forms, an integral concept involved in the process of database normalization.
Normal form is a way of measuring the levels, or depth, to which a database has been normalized. A database's level of normalization is determined by the normal form.
The following are the three most common normal forms in the normalization process:
The first normal form
The second normal form
The third normal form
Each normal form builds on the normalization steps taken in the previous one. For example, to normalize a database using the second normal form, the database must first be in the first normal form.
The First Normal Form
The objective of the first normal form is to divide the base data into logical units called tables. When each table has been designed, a primary key is assigned to most or all tables. Examine Figure 4.2, which illustrates how the raw database shown in the previous figure has been redeveloped using the first normal form.
You can see that to achieve the first normal form, data had to be broken into logical units of related information, each unit having a primary key and no repeating groups in any of its tables. Instead of one large table, there are now three smaller, more manageable tables, each with its own primary key. The primary keys are normally the first columns listed in a table.
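The idea of removing repeating groups can be sketched in code. The following is a minimal, hypothetical example using Python's built-in sqlite3 module; all table and column names are illustrative assumptions, not the book's actual schema. The unnormalized design stores a repeating group (three phone columns), while the first-normal-form design moves each phone into its own row in a related table.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Unnormalized: a repeating group of phone columns limits each
# employee to exactly three phone numbers and wastes space.
con.execute("""CREATE TABLE emp_raw
               (emp_id INTEGER PRIMARY KEY, name TEXT,
                phone1 TEXT, phone2 TEXT, phone3 TEXT)""")

# First normal form: one row per phone number, no repeating groups.
con.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE employee_phones
               (emp_id INTEGER, phone TEXT,
                PRIMARY KEY (emp_id, phone))""")
con.execute("INSERT INTO employees VALUES (1, 'Kelly')")
con.executemany("INSERT INTO employee_phones VALUES (?, ?)",
                [(1, '555-0100'), (1, '555-0101')])

# Any number of phones can now be stored without adding columns.
count = con.execute(
    "SELECT COUNT(*) FROM employee_phones WHERE emp_id = 1").fetchone()[0]
print(count)  # 2
```

Note that the phone table's primary key is the combination of employee ID and phone number, so the same employee can appear on many rows without duplicate entries.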
The Second Normal Form
The objective of the second normal form is to take data that is only partly dependent on the primary key and enter that data into another table. Figure 4.3 illustrates the second normal form.
According to the figure, the second normal form is derived from the first normal form by further breaking two tables down into more specific units.
The employee table split into two tables: one holding personal employee information and one holding pay information. The personal information is fully dependent on the primary key, the employee's ID, so it remained in the first table. The pay information, which is only partly dependent on each individual employee, was used to populate the second table. Notice that both tables contain the employee ID column. This column is the primary key of each table and is used to match corresponding data between the two tables.
The customer table likewise split into two tables: a customer table and an orders table. What took place is similar to what occurred with the employee tables: columns that were only partly dependent on the primary key were moved to another table. A customer's order information depends on each individual customer's ID, but it does not directly depend on the general customer information in the original table.
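The second-normal-form split described above can be sketched as follows. This is a minimal illustration using sqlite3 with made-up table and column names (not the book's exact schema): personal data stays in the employee table, pay data moves to a second table, and the shared employee ID column ties matching rows together.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Personal information: fully dependent on the employee's ID.
con.execute("""CREATE TABLE employees
               (emp_id INTEGER PRIMARY KEY, last_name TEXT, city TEXT)""")

# Pay information: moved to its own table, sharing the same key.
con.execute("""CREATE TABLE employee_pay
               (emp_id INTEGER PRIMARY KEY, position TEXT, pay_rate REAL,
                FOREIGN KEY (emp_id) REFERENCES employees (emp_id))""")

con.execute("INSERT INTO employees    VALUES (1, 'Stephens', 'Indianapolis')")
con.execute("INSERT INTO employee_pay VALUES (1, 'Programmer', 32.50)")

# The shared emp_id column matches corresponding rows between tables.
row = con.execute("""SELECT e.last_name, p.position
                     FROM employees e
                     JOIN employee_pay p ON e.emp_id = p.emp_id""").fetchone()
print(row)  # ('Stephens', 'Programmer')
```

The join on the shared key is exactly how the two halves of the original wide table are reassembled when a query needs both kinds of information.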
The Third Normal Form
The third normal form's objective is to remove data in a table that is not dependent on the primary key. Figure 4.4 illustrates the third normal form.
Another table was created to illustrate the third normal form. The employee pay table is split into two tables: one containing the actual employee pay information and the other containing the position descriptions, which really do not need to reside in the pay table. The position description column is totally independent of the primary key, the employee's ID.
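Here is a small sketch of that third-normal-form step, again using sqlite3 with illustrative (assumed) names. The position title depends on the position, not on any particular employee, so it lives in its own table; the pay table keeps only a position ID. The payoff is that renaming a position touches exactly one row.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# The position description now lives in its own table.
con.execute("CREATE TABLE positions (position_id INTEGER PRIMARY KEY, title TEXT)")
con.execute("""CREATE TABLE employee_pay
               (emp_id INTEGER PRIMARY KEY, position_id INTEGER, pay_rate REAL,
                FOREIGN KEY (position_id) REFERENCES positions (position_id))""")

con.execute("INSERT INTO positions VALUES (10, 'Programmer')")
con.executemany("INSERT INTO employee_pay VALUES (?, ?, ?)",
                [(1, 10, 32.50), (2, 10, 28.00)])

# Renaming the position updates one row, yet every employee sees it.
con.execute("UPDATE positions SET title = 'Software Engineer' "
            "WHERE position_id = 10")
titles = con.execute("""SELECT DISTINCT p.title
                        FROM employee_pay e
                        JOIN positions p ON e.position_id = p.position_id""").fetchall()
print(titles)  # [('Software Engineer',)]
```

Had the title been stored on every pay row, the rename would have required updating every employee with that position, risking the inconsistency problems discussed earlier.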
Naming conventions are one of the foremost considerations when you're normalizing a database. Names are how you will refer to objects in the database. You want to give your tables names that are descriptive of the type of information they contain so that the data you are looking for is easy to find. Descriptive table names are especially important for users querying the database that had no part in the database design. A company-wide naming convention should be set, providing guidance in the naming of not only tables within the database, but users, filenames, and other related objects. Designing and enforcing naming conventions is one of a company's first steps toward a successful database implementation.
Benefits of Normalization
Normalization provides numerous benefits to a database. Some of the major benefits include the following:
Greater overall database organization
Reduction of redundant data
Data consistency within the database
A much more flexible database design
A better handle on database security
Organization is brought about by the normalization process, making everyone's job easier, from the user who accesses tables to the database administrator (DBA) who is responsible for the overall management of every object in the database. Data redundancy is reduced, which simplifies data structures and conserves disk space. Because duplicate data is minimized, the possibility of inconsistent data is greatly reduced; for example, without normalization, an individual's name could be spelled one way in one table and a slightly different way in another. Because the database has been normalized and broken into smaller tables, you are also provided with more flexibility for modifying existing structures. It is much easier to modify a small table with little data than to modify one big table that holds all the vital data in the database. Lastly, security is improved in the sense that the DBA can grant access to limited tables to certain users. Security is easier to control when normalization has occurred.
Data integrity is the assurance of consistent and accurate data within a database.
Referential integrity simply means that the values of one column in a table depend on the values of a column in another table. For instance, for a customer to have a record in the orders table, there must first be a record for that customer in the customer table. Integrity constraints can also control values by restricting the range of values allowed for a column. Integrity constraints should be created when the table is created. Referential integrity is typically controlled through the use of primary and foreign keys.
In a table, a foreign key, normally a single field, directly references a primary key in another table to enforce referential integrity. In the preceding example, the customer ID in the orders table is a foreign key that references the customer ID, the primary key, in the customer table.
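Foreign-key enforcement can be demonstrated concretely. The sketch below uses SQLite (which requires opting in to foreign-key checking via a PRAGMA) with hypothetical customer and order tables: an order for a nonexistent customer is rejected by the database itself.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

con.execute("CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE orders
               (ord_num INTEGER PRIMARY KEY, cust_id INTEGER,
                FOREIGN KEY (cust_id) REFERENCES customers (cust_id))""")

con.execute("INSERT INTO customers VALUES (1, 'Acme')")
con.execute("INSERT INTO orders VALUES (100, 1)")  # fine: customer 1 exists

# An order referencing a customer that does not exist is rejected.
try:
    con.execute("INSERT INTO orders VALUES (101, 99)")  # no customer 99
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

This is referential integrity in action: the constraint, declared once at table creation, guarantees that every order row points at a real customer row.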
Drawbacks of Normalization
Although most successful databases are normalized to some degree, there is one substantial drawback of a normalized database: reduced database performance. Accepting this reduced performance requires understanding that when a query or transaction request is sent to the database, factors such as CPU usage, memory usage, and input/output (I/O) come into play. To make a long story short, a normalized database requires much more CPU, memory, and I/O to process transactions and queries than a denormalized database does: it must locate the requested tables and then join their data to retrieve the requested information or to process the desired data. A more in-depth discussion concerning database performance occurs in Hour 18, "Managing Database Users."
Denormalizing a Database
Denormalization is the process of taking a normalized database and modifying table structures to allow controlled redundancy for increased database performance. Attempting to improve performance is the only reason to ever denormalize a database. A denormalized database is not the same as a database that has not been normalized. Denormalizing a database is the process of taking the level of normalization within the database down a notch or two. Remember, normalization can actually slow performance with its frequently occurring table join operations. (Table joins are discussed during Hour 13, "Joining Tables in Queries.") Denormalization may involve recombining separate tables or creating duplicate data within tables to reduce the number of tables that need to be joined to retrieve the requested data, which results in less I/O and CPU time.
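Controlled redundancy can be sketched as follows. In this hypothetical sqlite3 example (names are illustrative), the customer's name is deliberately copied into the orders table so a report can read it without a join; the trade-off is that every copy must be maintained when the name changes.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, name TEXT)")

# Denormalized: cust_name is deliberately duplicated in orders.
con.execute("""CREATE TABLE orders
               (ord_num INTEGER PRIMARY KEY, cust_id INTEGER,
                cust_name TEXT)""")

con.execute("INSERT INTO customers VALUES (1, 'Acme')")
con.execute("INSERT INTO orders VALUES (100, 1, 'Acme')")

# Faster read path: the report needs no join at all.
name = con.execute(
    "SELECT cust_name FROM orders WHERE ord_num = 100").fetchone()[0]
print(name)  # 'Acme'

# The cost: a rename must now be applied everywhere the name is stored,
# or the copies will drift out of sync.
con.execute("UPDATE customers SET name = 'Acme Corp' WHERE cust_id = 1")
con.execute("UPDATE orders    SET cust_name = 'Acme Corp' WHERE cust_id = 1")
```

The single-table read illustrates the performance win; the paired UPDATE statements illustrate the extra maintenance burden that denormalization imposes.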
There are costs to denormalization, however. Data redundancy is increased in a denormalized database, which can improve performance but requires more effort to keep related data synchronized. Application coding becomes more complicated, because duplicated data is spread across various tables and may be more difficult to locate. In addition, maintaining referential integrity is more of a chore, because related data is divided among a number of tables. There is a happy medium between normalization and denormalization, but finding it requires a thorough knowledge of the actual data and of the specific business requirements of the pertinent company.
Learn how to enhance consistency and reduce redundancy by understanding how to normalize your data model and apply it to your design.
What Does It Mean to Normalize a Database?
Steps to Normalize Your Data Model
Denormalize Data: When Does It Make Sense to Break the Rules?
Normalization Applied: Review the TEB Database and Refine the Design
What You Have Learned
In the last chapter, you were introduced to basic database design concepts. In this chapter, you will build on those skills. Recall one of the basic mantras of this book: it takes several iterations of work to reach an optimal database design. After your initial complement of tables has been created, the next step involves outfitting the tables with columns. At this point, the only columns that exist are the primary and foreign keys necessary to support the relationships between the tables. The question now is, "What columns do you include in a table?" Some of the columns to include can be determined through the application of common sense; others are not as intuitive. As you will see, whether to include a column can determine whether new tables have to be created. If you think all the tables for the Time Entry and Billing (TEB) Database have been created, you are in for a surprise! The process of determining which columns go in a table is called normalization. The normalization process is the focus of this chapter, and upon completing it, the TEB Database design will be complete.
What Does It Mean to Normalize a Database?
Database normalization can best be characterized as the process of organizing a database. With the question of what out of the way, let's turn to the question of why. The goal of normalization is to reduce problems with data consistency by reducing redundancy. Sound confusing? The concept of normalization is probably more easily understood by way of a simple example. To illustrate, let's again turn to the Northwind Traders Database that ships with Access.
Figure 4.1 shows the ERD (Entity Relationship Diagram) for the Northwind Traders Database that was introduced in Chapter 2, "The Anatomy of a Real Database." Notice the relationship between the Customers and Orders tables. As a quick review of how relationships work in a relational database, the primary key of the parent table is carried in the child table as the foreign key. In this case, the CustomerID field is carried in the Orders table. This is how order records for a specific customer can be associated with that customer.
Why then not carry other fields from the Customers table? As you will see later, sometimes you might elect to carry other fields from the parent table to the child table. As a general rule, however, you will not want to do this. For example, what if you decide you are going to carry the CompanyName field in the Orders table and the company name changes? Not only would you have to update the Customers table, you would also have to update the Orders table and any other table in which CompanyName exists. Continuing with this example, what if the various locations of CompanyName were not consistently updated? Which version would reflect the current, most accurate version of CompanyName? If you were asked to prepare a report that needed to include CompanyName, which table should you use as the source of CompanyName?
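The inconsistency scenario described above can be made concrete. The sketch below uses sqlite3 and Northwind-style Customers and Orders tables (columns abbreviated; carrying CompanyName in Orders is the deliberately bad design under discussion, not Northwind's actual schema). After the company renames itself and only one table is updated, the database holds two conflicting versions of the "same" fact.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Customers "
            "(CustomerID TEXT PRIMARY KEY, CompanyName TEXT)")

# The bad design: CompanyName is carried redundantly into Orders.
con.execute("""CREATE TABLE Orders
               (OrderID INTEGER PRIMARY KEY, CustomerID TEXT,
                CompanyName TEXT)""")

con.execute("INSERT INTO Customers VALUES ('ALFKI', 'Alfreds Futterkiste')")
con.execute("INSERT INTO Orders VALUES (10643, 'ALFKI', 'Alfreds Futterkiste')")

# The company changes its name, but only Customers is updated.
con.execute("UPDATE Customers SET CompanyName = 'Alfreds GmbH' "
            "WHERE CustomerID = 'ALFKI'")

versions = {con.execute("SELECT CompanyName FROM Customers").fetchone()[0],
            con.execute("SELECT CompanyName FROM Orders").fetchone()[0]}
print(len(versions))  # 2 -- two conflicting values; which one is correct?
```

With CompanyName stored only in Customers and Orders carrying just the CustomerID foreign key, this situation cannot arise: there is exactly one copy to update.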
Figure 4.1 The Entity Relationship Diagram for the Northwind Traders Database shows a relationship between customers and orders.
It seems clear that if you have only one instance of a given data element in a database, it leaves nothing to interpretation and guesswork. When you update CompanyName in the Customers table, you can be sure that every report that relies on CompanyName will be accurate and up to date. Why? Because CompanyName is not carried redundantly in the database, and as a result, consistency is ensured. The same concept can be seen throughout the data model in Figure 4.1. Greater consistency through the elimination of redundancy: this is the goal of normalization!