Relational Database Normalization

If database migration involves a schema redesign, we need to follow certain forms. I have good instinct what these forms are. However, I got some questions are particular normal forms and I realized that I cannot spell them all out clearly, except that I learned it all in university (10+ years ago). I decided to spend some time writing up this little post to give myself a refresh on database normalization.

These normal forms (NF) are general patterns to allow developers to organize data better, with less redundancy. They are heuristics to prevent run-time exceptions, and bring more clarity and more efficient usage of disk space. It also ensure that dependencies between tables are properly enforced by database integrity constraints. Wikipedia states that there are more than then NFs. The normalization process is also progressive. So a higher level of normalization cannot be achieved unless the previous levels have been all satisfied. The most fundamental NFs are 1NF to 3NF. If 3NF is satisfied, the database is unlikely to run into issues with data redundancy, insert/update/delete exception, etc. In practice, database in immature products may have only achieved 2NF or even 1NF due to limitation in scalability or performance. However, every database designer must know 1NF to 3NF, which is what this post will cover. So let’s have an unnormalized table to begin with. Unnormalized form is the lowest form:

unnormalized table

1st Normal Form – the values in each column of a table must be atomic (indivisible). So what to do:

  • Eliminate duplicative (related) columns in a table
  • Create a separate table for each set of related data
  • Identify each set of related data with a primary

An example of table satisfying 1NF is:

1NF satisfaction example

Before moving to 2NF, let’s introduce three concepts:

  • Candidate key – the minimal set of attributes such that the relation does not have two distinct tuples, with the same values for these attributes; Wikipedia has more mathematical description, but the notion of candidate key is just similar to composite primary key.
  • Functional dependency – For any relation R, attribute Y is functionally dependent on attribute X (X—->Y), if for every valid instance of X, that value of X uniquely determines the value of Y.
    • Full functional dependency (or full dependency): all non-key attributes are fully functionally dependent on the candidate key;
    • Transitive dependency: an indirect relationship between attributes in the same table that causes a functional dependency (X—->Y; Y—->Z; so X—->Z);

2nd Normal Form – No partial dependencies. Values depend on the whole of every candidate key. So what to do:

  • Create separate tables for sets of values that apply to multiple records
  • Relate these tables with a foreign key

An example of satisfying 2NF is:

Table 1 of 2NF satisfaction example, MEMBERSHIP_ID is primary key
Table 2 of 2NF satisfaction example, MEMBERSHIP_ID is foreign key

3rd Normal Form – No transitive dependencies. Values depend only on candidate keys. So what to do

  • Eliminate fields that do not depend on the primary key

An example of satisfying 3NF is:

Table 1 of 3NF satisfaction example
Table 2 of 3NF satisfaction example
Table 3 of 3NF satisfaction example

Further Normal Forms – The first three NFs are brought up before 1972. Then eight more NFs are brought up. For someone with many years experience on database, 1NF to 3NF should already become an instinct. However, their language expression can be fairly abstract. So I hope this little article makes it simple.