An Introduction to Database Normalization


What is Database normalization?

Database normalization is the process of structuring and organizing relational databases in relational database theory. It involves breaking down large tables into smaller and simpler tables while maintaining relationships among the smaller tables.

Why this is useful?

Database normalization is useful for reducing anomalies in database tables. It removes data redundancy, which in turn reduces storage costs. By doing so, the database becomes more manageable, and helps to obtain results through simple and efficient queries.

If the same data is stored in multiple columns, updating the record requires changes in multiple places. Breaking down such cases by creating a single column and linking it through relationships reduces redundancy and removes anomalies during updates.

For example, if we store the same customer details, such as name and address, in every invoice record, it results in duplication. By using normalization, we can eliminate redundancy by separating the customer details into a dedicated table and linking it to the invoices table.

Database normalization enhances data integrity, making it more consistent and accurate.

Let’s examine the different normal forms in database normalization with examples:

First Normal Form (1 NF)

This normalization ensures that each column contains only atomic values, meaning each column holds indivisible values. Each column must have a unique name, and each row in the column should contain a single value. Additionally, all attributes must depend on the primary key.

Let’s look at an example books table.

Books

titleauthorpublication_dateauthor_countrygenres
DuneFrank HerbertAugust 1, 1965United StatesScience Fiction, Adventure, Political Thriller
The Da Vinci CodeDan BrownMarch 18, 2003United StatesMystery, Thriller, Historical Fiction

This table violates the 1NF as the genres column contains multiple values. To normalize it to 1NF, we define the primary key in the books table and create a Genre_Book table to store atomic genre records.

Books

book_idtitleauthorpublication_dateauthor_country
1DuneFrank HerbertAugust 1, 1965United States
2The Da Vinci CodeDan BrownMarch 18, 2003United States

Genre_Book

genre_book_idbook_idgenre_name
11Science Fiction
21Adventure
31Political Thriller
42Mystery
52Thriller
62Historical Fiction

Second Normal Form (2 NF)

For a database to be in the second normal form (2NF), it must first be in the first normal form (1NF). Additionally, each non-primary key must depend on the primary key itself. If changing the value of a non-primary key also changes the value of another non-primary key, then it violates the second normal form.

So far, in the Books table, the author_country column depends on the author column. If the author value changes and the author is from a different country as defined in the row, that also needs to be changed.

Books

book_idtitleauthorpublication_dateauthor_country
1DuneFrank HerbertAugust 1, 1965United States
2The Da Vinci CodeDan BrownMarch 18, 2003United States

To achieve 2NF, let’s break the books table into two tables: books and authors.

Books

book_idtitleauthor_idpublication_date
1Dune1August 1, 1965
2The Da Vinci Code2March 18, 2003

Authors

author_idauthorauthor_country
1Frank HerbertUnited States
2Dan BrownUnited States

Third Normal Form (3 NF)

To be in the third normal form (3NF), a database must first be in the second normal form (2NF). Additionally, it removes transitive dependencies by ensuring that each non-key attribute depends only on the primary key.

So far, we have the following tables that are in 2NF:

Books

book_idtitleauthor_idpublication_date
1Dune1August 1, 1965
2The Da Vinci Code2March 18, 2003

Authors

author_idauthorauthor_country
1Frank HerbertUnited States
2Dan BrownUnited States

Genre_Book

genre_book_idbook_idgenre_name
11Science Fiction
21Adventure
31Political Thriller
42Mystery
52Thriller
62Historical Fiction

Issue in 3NF:

Let’s say there is a hidden column author_dob in the books table, which is a non-prime attribute (not part of the key) that depends on author_id, another non-prime attribute. This creates a transitive dependency: book_idauthor_idauthor_dob.

This violates 3NF because author_dob depends on author_id, not directly on the primary key (book_id).

Solution: We should remove author_dob from the books table and store it in the authors table, as it directly describes the author. The tables will be as follows:

Books

book_idtitleauthor_idpublication_date
1Dune1August 1, 1965
2The Da Vinci Code2March 18, 2003

Authors

author_idauthorauthor_countryauthor_dob
1Frank HerbertUnited StatesOctober 8, 1920
2Dan BrownUnited StatesJune 22, 1964

Genre_Book

genre_book_idbook_idgenre_name
11Science Fiction
21Adventure
31Political Thriller
42Mystery
52Thriller
62Historical Fiction

Boyce-Codd Normal Form (BCNF)

This is a stricter version of 3NF that addresses additional anomalies. At this normalization level, every determinant is a candidate key.

Let’s look at the following tables:

Books

book_idtitleauthor_idpublication_date
1Dune1August 1, 1965
2The Da Vinci Code2March 18, 2003

Authors

author_idauthorauthor_countryauthor_dob
1Frank HerbertUnited StatesOctober 8, 1920
2Dan BrownUnited StatesJune 22, 1964

Genre_Book

To give an example, the genre_description column is added to this table.

genre_book_idbook_idgenre_namegenre_description
11Science FictionFiction dealing with futuristic concepts
21AdventureFiction involving exciting experiences
31Political ThrillerFiction with political intrigue
42MysteryFiction involving solving a crime
52ThrillerFiction with suspense and excitement
62Historical FictionFiction set in a historical period

This Genre_Book table violates BCNF for the following reasons.

The issue lies with the functional dependency:

  • genre_namegenre_description
  • genre_name is not a superkey of the table.
  • This violates the BCNF rule because all determinants (left-hand side of a functional dependency) must be superkeys in BCNF.

To comply with BCNF, we can break down the Genre_Book table into two tables: Genre_Book and Genres.

Books

book_idtitleauthor_idpublication_date
1Dune1August 1, 1965
2The Da Vinci Code2March 18, 2003

Authors

author_idauthorauthor_countryauthor_dob
1Frank HerbertUnited StatesOctober 8, 1920
2Dan BrownUnited StatesJune 22, 1964

Genres

genre_idgenre_namegenre_description
1Science FictionFiction dealing with futuristic concepts
2AdventureFiction involving exciting experiences
3Political ThrillerFiction with political intrigue
4MysteryFiction involving solving a crime
5ThrillerFiction with suspense and excitement
6Historical FictionFiction set in a historical period

Genre_Book

genre_book_idbook_idgenre_id
111
212
313
424
525
626

In the Genres table:

  • genre_name determines genre_description, and genre_id is the superkey for the table. Therefore, it satisfies BCNF.

In the Genre_Book table:

  • The primary key (genre_book_id) determines all other attributes, satisfying BCNF.

Fourth Normal Form (4NF)

This normal form removes multivalued dependencies and should be in BCNF. For example, let’s construct the following table.

Books_Genres_Authors

book_idgenre_idauthor_id
111
121
131
242
252
262

This table stores

  • book_id (FK books)
  • genre_id (FK genres)
  • author_id (FK authors)

Why This Violates 4NF?

  • Genres and Authors have independent relationships with the book.
  • For a single book_id, the genre_id and author_id are not dependent on each other, resulting in multivalued dependencies.
  • For book_id = 1, the genres (1, 2, 3) are independent of the author (1), but they are stored together unnecessarily.

To fix this, let’s split the table into the following two independent tables:

Book_Author

book_author_idbook_idauthor_id
111
222

Genre_Book

genre_book_idbook_idgenre_id
111
212
313
424
525
626

Now, there are no multivalued dependencies, and the tables are in 4NF.


Fifth Normal Form (5NF)

This is the highest normalization level, and it reduces the join dependencies of queries.

Suppose we extend the Genre_Book table to include publishers for each book-genre combination. This could lead to a table like this:

Genre_Book_Publishers

genre_book_idbook_idgenre_idpublisher_id
1111
2121
3131
4242
5252
6262
7112
8122

Why This Violates 5NF?

The relationships between book_id, genre_id, and publisher_id are independent:

  • A book can belong to multiple genres.
  • A book can have multiple publishers.
  • A publisher may publish books in multiple genres.

This results in redundancy in the Genre_Book_Publishers table. For example:

For book_id = 1, genre_id = 1 is associated with both publisher_id = 1 and publisher_id = 2. These relationships are independent, but they are stored in a single table, creating redundancy.

Solution: Decompose the Table

To resolve the 5NF violation, decompose the Genre_Book_Publishers table into smaller tables that represent independent relationships:

Book_Genre

book_idgenre_id
11
12
13
24
25
26

Book_Publisher

book_idpublisher_id
11
12
22

Genre_Publisher

genre_idpublisher_id
11
12
21
22
31
42
52
62

How This Resolves the 5NF Violation

Each table now represents a single, independent relationship:

  • Book_Genre tracks which books belong to which genres.
  • Book_Publisher tracks which books are published by which publishers.
  • Genre_Publisher tracks which genres are published by which publishers.

By joining these tables, you can still reconstruct the original data without redundancy.

Conclusion

The concept of database normalization is useful for eliminating redundancy, maintaining data integrity, scalability, and enabling flexible queries. The degree of normalization in a database depends on how you design your schema and relationships.

References