Data Warehouse Design Methodologies

Requirements for a Successful Data Warehouse Project
November 2, 2016
Designing a Relational Data Warehouse
November 17, 2016
There are two traditional data warehouse design methodologies came of age in the 1990’s, that of Bill Inmon’s Top-Down Atomic Data Warehouse and that of Ralph Kimball’s Bottom-Up Dimensional Data Warehouse. These methodologies have been used over the past 20 years to create informational data stores for organizations seeking to leverage their data for corporate gain. In this article, we will compare and contrast these two methodologies.

 

Atomic Data Warehouse – Bill Inmon

Bill Inmon’s Atomic Data Warehouse approach is strategic in nature and seeks to capture all of the enterprise data in 3rd Normal Form and store all of this atomic data in the data warehouse. Inmon defines a data warehouse as a subject-oriented, non-volatile, time-variant and integrated data source. Let’s break down each of these descriptors of the Inmon’s Data Warehouse.

  • Subject-Oriented – the data is organized so that the data, related by subject area, is linked together.
  • Non-Volatile – once data is entered it is never updated or deleted; all data is retained for future reporting needs.
  • Time-Variant – because of the non-volatile nature of the data and the need for time-based reporting, once data is entered into the warehouse it cannot be modified, new records must be added to reflect the changes in data over time.
  • Integrated – data is sourced from most to all of the enterprise’s information systems and organized in a consistent and unified manner.

 

Inmon then creates data marts, subject or department focused subset of the data warehouse, which is designed to address the data and reporting needs of the targeted subset of business users.

 

There are several benefits of this model. The first is that all of the corporate data is completely documented. This process provides the organization with a complete view of their processes, products/services, customers, vendors, etc. This documentation is invaluable to the organization as, in most cases, up to this point, every system has been launched in isolation and is often the first time the organization truly defines the different processes, products or parties with whom they interact with on a consistent basis.

 

Second, the data is efficiently stored in 3rd Normal Form in a single repository. This storage methodology makes the retrieval and storage of the data from transactional systems already defined in 3NF a little easier.

 

Finally, the data is readily available for extraction into data marts for the business users. Now that the data is fully defined and efficiently stored the warehousing team can build the data mart foe the business unit. These data marts will be created to allow the business unit quickly and efficiently answer their questions. It will also provide the user with the detail data supporting the data mart as well as the lineage of the data. This supporting information and data lineage is often critical in the acceptance and functional usage of the data mart by the business user.

 

There are also several challenges which this framework poses to the organization. First is the time-consuming task of documenting and defining the complete repository for the entire organization. This design methodology is a long, time-consuming process that, although invaluable, requires the data warehousing IT team to work closely with the business users to ensure the authoritative data is captured and stored in the correct data structure. Therefore a great deal of time will elapse between project kick-off and the initial data mart deliverable. This “deliverable gap” has been the demise of many data warehousing projects, as, depending on the size of the organization and the experience of the data warehousing team, the initial discovery and documentation step may never be completed before the project is canceled.

 

A second challenge is the lack of flexibility this model provides. If one adds a new business unit, a new application or offers a significantly different product or service the organization will need to go back and modify the existing data model to accurately document and define the new state of the business while maintaining the subject-oriented, non-volatile, time-variant and integrated aspects of the existing data stored in the warehouse. This too has often called into question in the value of a data warehouse.

 

Finally, there is substantial ETL processing necessary to transform the data warehouse data into a data mart to be used for business consumption. This protracted processing can cause delays in the delivery of the data to the business user. Sometimes these delays in transforming the data from the source system to the data warehouse and finally into the data mart for business consumption does not meet the needs of the business user and alternative solutions, or shortcuts are often researched and invoked.

 

Dimensional Data Warehouse – Ralph Kimball

Ralph Kimball’s methodology is more tactical in nature and is the antithesis of the Inmon’s methodology. Kimball’s definition of a data warehouse is “a copy of transaction data specifically structured for query and analysis.” He believes that you should start at the tactical level by focusing on the data mart first, thereby providing immediate value to the business users. Kimball’s data warehouse is to simply leverage the collection of the data marts as a whole. Kimball’s approach only worries about the data needed for the data marts.

 

Kimball’s data marts consist of source data converted from 3NF to a dimensional model. This dimensional model consists of facts and dimensions. Facts are calculated measures about entities at a specified point in time. Dimensions are the containers for the clarifying elements of the entities about which measures are grouped.

 

The major benefit of Kimball’s approach and the use of dimensional modeling is the speed upon which the business user can derive value from the data mart and the flexibility this modeling offers. Data marts can usually be defined, designed and delivered in less than 120 days and in a majority of cases in less than 90 days from the availability of the data. Next, this model also allows the facts or dimensions to easily be expanded to add new measures or additional information describing the entity to be added. Finally, this modeling technique calls for the preprocessing and storage of the data in such a way that aggregations of the fact data can be easily sliced and diced by the dimensional columns by the business users with little to no IT help. This combination of speed, agility, scalability, and understandability is needed in today’s rapidly changing business environment.

 

The challenges of the Kimball methodology is the lack of enterprise focus of the data warehouse. For example, each data mart may have similar but not conforming (consistent) dimensions across the different data marts as each may be derived from different sources. This has the potential of having each data mart provide a different answer to a standard enterprise question, such as “How many customers do we have?”, based on which source system the data mart has derived the customers.

 

Conclusion

In the end, both of these data warehousing methodologies provide intrinsic value to the enterprise. Some organizations want to focus on the strategic and therefore choose the Inmon methodology. While others need the speed and agility of the Kimball method. While still others want the best of both worlds and create a hybrid of both methodologies.

 

I hope you feel that you have a solid, high-level understanding of these methodologies to make an informed choice on your data warehousing methodology. I will provide more detailed information about how to implement these methodologies in future blog posts.

1 Comment

  1. Nice explaination JIM!!!