Data Warehouses and Data Marts

In general a data warehouse is a centrally stored source of business analysis data.  It is essentially an aggregated (not necessarily summed or counted) copy of data entered through traditional OLTP systems.  There are two schools of thought on how a Data Warehouse is designed.

  • Bill Inmon: A central database with data provided by many sources.  The warehosue then feeds several subject-oriented data marts.
  • Ralph Kimball: Many, subject-oriented data marts built straight from the source.  The collection of these data marts are considered a data warehouse.

A subject-oriented data mart is a database that is built around one topic – such as keeping track of every email or performance of marketing campaigns at the prospect level.

What Sort of Data Warehouse Should I Build?

The default book you’ll find on most DBA’s desk is Ralph Kimball’s Data Warehouse Lifecycle Toolkit.  However, Kimball’s approach has its pros and cons.

Pros to Many Data Marts = Data Warehouse

  • Less Complicated – You build as you go and there’s only one ETL program for each Data Mart.
  • Faster to Implement – Just build one data mart at a time and you’ll be good to go.
  • You don’t need to know all of the data – Since each Data Mart is built for one subject, one business expert can guide any programmer and that same programmer doesn’t have to be involved in every Data Mart.

Cons to Kimball’s Model

  • Risking Bad Data – When you copy data, transform it and then store it, you risk a bad ETL program.  If the programmer makes a mistake when designing the program (like misses some data or truncates some string) and you don’t have history (like you do in Inmon’s model) then you cannot correct / reload the data.

Things to Watch When Building a Data Warehouse

  • Dimensional Modeling – Data Warehouses store data differently than the relational, normalized model.
  • Extraction, Transformation, Load (ETL) data into the warehouse
  • Level of Aggregation – Since the data is historical, you want to keep your data as granular as possible.  Imagine if you’re looking at sales, and you’ve aggregated data at a monthly level instead of a customer level.  You can’t detect trends without granular data.

Hat Tip