In a relational database – i.e. Microsoft Access, SQL Server, DB2, MySQL – the data exist in tables. In the traditional, academic world, these tables were structured so that they were in “Normal Form”.*
In the original paper on relational database design, Egar Codd defines how tables should be created so that it reduces update anomalies – having mismatch information across various sources.
Relational databases and normal form focused on separating programmers from people pulling data out of the database.
It provides a means of describing data
with its natural structure only-that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation and organization of data on the other.
You’ll have to keep in mind that databases were very different before this paper. It required programmers to build data bases, write it to a specific place on the computer’s disk and then had to write another program to pull data out.
Requirements for Normal Form
1NF: All attributes (columns) must be atomic
- All cells must contain one and only one fact – e.g. one phone number.
2NF: There must be no partial dependencies
- Partial Dependency means you can take information out of a table since it’s repeated.
Here’s an example, let’s say there’s a table that has a record for every time a cab is being used. If we store CabNumber, CabColor and DateofRide we’ll store the cab’s color every time. That’s a partial dependency.
The solution is to remove CabColor from the table and create a second table with only CanNumber and CabColor. That way if you ever change the Cab’s color you just make one update – instead of an update for every time the cab was used.
3NF: No transitive dependencies
- Transitive Dependency means there is repeated information in the table that is dependent on another column that is not the primary key.
Take a table of marketing campaigns with fields: ProjectID, ProjectName, ProjectLeader, and LeaderDept. So ProjectID is our key – if you know the ID you can pull a specific project. Now one project has one name and one leader. That works perfectly!
The problem we have now is that LeaderDept is dependent on ProjectLeader and NOT ProjectID. The best solution is to pull out LeaderDept into its own table (something like Employee, EmployeeDept) so that if you have to update the Leader’s department, you’re only updating one record rather than a record for every time they lead a project.
*The modern business intelligence point of view is that your database should be normalized (verb- meaning to put your database into normal form) unless it’s more convenient for the user to have certain columns together. It’s okay not to be in a perfect normal form but it just makes it harder on DBAs and keeping data in-sync.
Hat Tip
- Database Design Using Entity-Relationship Diagrams (2nd ed., Bagui, Earp)
- CSC-451 Database Design at Depaul University