Snowflaking Dimension Tables
According to Dr. Kimball, dimensions should be designed as wide, denormalized records. The purpose of denormalization is to maximize the performance of DW/BI queries. Snowflaking is the normalization of the dimension tables: redundant attributes are moved out into separate dimension tables. The motivation for normalizing is often a desire to “enhance” the design and reduce the storage costs of the warehouse.
Although this normalized design appears to optimize the storage and maintenance of the warehouse, it actually adds maintenance complexity and degrades query performance for DW/BI queries.
Reporting Using Snowflakes
As I demonstrated in a previous blog, Type 2 Slowly Changing Dimensions (SCDs) provide the most flexibility for BI reporting because they allow reporting on historical data, including point-in-time reports. Snowflaking can hide these changes inside the snowflaked dimensions: normalization, by definition, does not capture the transactional detail of modifications to the data. This lack of transactional detail causes BI reports to “drift,” moving with the changing data in the snowflaked dimensions.
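The drift can be sketched with an in-memory SQLite database. The table and column names below are illustrative, not taken from the models shown later: a normalized department lookup is updated in place, and a report rerun for a past period silently changes.

```python
import sqlite3

# Illustrative snowflaked design: employee references a normalized
# department lookup, and the fact table references the employee.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_department (dept_key INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE dim_employee   (emp_key INTEGER PRIMARY KEY, emp_name TEXT, dept_key INTEGER);
CREATE TABLE fact_sales     (emp_key INTEGER, sale_date TEXT, amount REAL);

INSERT INTO dim_department VALUES (1, 'Home Electronics');
INSERT INTO dim_employee   VALUES (1, 'Pat', 1);
INSERT INTO fact_sales     VALUES (1, '2024-01-15', 500.0);
""")

report = """
SELECT d.dept_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_employee e   ON f.emp_key = e.emp_key
JOIN dim_department d ON e.dept_key = d.dept_key
GROUP BY d.dept_name
"""
print(conn.execute(report).fetchall())  # [('Home Electronics', 500.0)]

# The department is renamed in place. Normalization keeps exactly one
# copy of the value, so no history of the old name survives.
conn.execute("UPDATE dim_department SET dept_name = 'Consumer Tech' WHERE dept_key = 1")

# Rerunning the January report now attributes the same historical sale
# to the new name: the point-in-time number has drifted.
print(conn.execute(report).fetchall())  # [('Consumer Tech', 500.0)]
```

Nothing in this design records that the January sale was originally booked under “Home Electronics” — that is the transactional detail a Type 2 dimension would have preserved.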
Here is an example of a snowflaked model.
Let’s look at some detailed examples of the challenges this design can cause. First, the date dimension requires several joins to show the pay period information alongside the month, quarter, and year hierarchy. To find sales per region, you must traverse the hierarchy from store to city to state to region. If you are looking for information about an employee, the data may change underneath you because the designer reused the id from the source application and therefore does not capture changes in the employee dimension. In all of these cases, when the snowflaked data changes, the referencing dimension is not updated and has no way of knowing that the referenced data changed; the detailed history of the change is lost, and it becomes impossible to accurately regenerate metrics from a specific point in time.
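The store-to-region traversal can be sketched as follows. This is a minimal example with assumed table and column names, not the exact schema from the diagram; the point is the join chain every rollup must pay for:

```python
import sqlite3

# Minimal snowflaked geography hierarchy (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_region (region_key INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE dim_state  (state_key  INTEGER PRIMARY KEY, state_name TEXT, region_key INTEGER);
CREATE TABLE dim_city   (city_key   INTEGER PRIMARY KEY, city_name TEXT,  state_key INTEGER);
CREATE TABLE dim_store  (store_key  INTEGER PRIMARY KEY, store_name TEXT, city_key INTEGER);
CREATE TABLE fact_sales (store_key INTEGER, amount REAL);

INSERT INTO dim_region VALUES (1, 'West');
INSERT INTO dim_state  VALUES (1, 'California', 1);
INSERT INTO dim_city   VALUES (1, 'San Jose', 1);
INSERT INTO dim_store  VALUES (1, 'Store #42', 1);
INSERT INTO fact_sales VALUES (1, 100.0), (1, 250.0);
""")

# Sales per region: the query must walk store -> city -> state -> region,
# three extra joins compared with a region column flattened onto the store.
rows = conn.execute("""
SELECT r.region_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_store  s  ON f.store_key  = s.store_key
JOIN dim_city   c  ON s.city_key   = c.city_key
JOIN dim_state  st ON c.state_key  = st.state_key
JOIN dim_region r  ON st.region_key = r.region_key
GROUP BY r.region_name
""").fetchall()
print(rows)  # [('West', 350.0)]
```

In the star schema version below, the same rollup needs a single join from the fact table to the store dimension, because city, state, and region are columns on the store record.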
Here are the same attributes captured in a denormalized star schema.
Star Schema Benefits
Looking at the denormalized star schema with no snowflaking, we can quickly see that this model is much easier to understand. Because of the limited number of joins, it will also perform better for reporting than the normalized design. In this design we capture the effective dating of each record using the active record start date and active record end date, which allows us to regenerate any metric exactly as it was presented to the user on any given date in the past. Flags identify the currently active records and are used mainly when loading the dimensions and building the fact records.
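The effective-dating mechanics can be sketched with a flattened store dimension. Again the names are illustrative assumptions: each Type 2 change end-dates the old row and inserts a new active one, so both point-in-time and current views are simple filters:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (
    store_key         INTEGER PRIMARY KEY,  -- surrogate key
    store_id          TEXT,                 -- natural key from the source system
    store_name        TEXT,
    city TEXT, state TEXT, region TEXT,     -- hierarchy flattened in place
    active_start_date TEXT,
    active_end_date   TEXT,
    active_flag       INTEGER
);
-- Store S1 moved regions on 2024-06-01: the old row is end-dated and
-- flagged inactive, and a new active row is inserted (a Type 2 change).
INSERT INTO dim_store VALUES
 (1, 'S1', 'Store #42', 'San Jose', 'CA', 'West',     '2023-01-01', '2024-05-31', 0),
 (2, 'S1', 'Store #42', 'Reno',     'NV', 'Mountain', '2024-06-01', '9999-12-31', 1);
""")

# Point-in-time lookup: which region was S1 in on 2024-03-15?
(region,) = conn.execute("""
SELECT region FROM dim_store
WHERE store_id = 'S1'
  AND '2024-03-15' BETWEEN active_start_date AND active_end_date
""").fetchone()
print(region)  # West -- the history survives the change

# The current view uses the flag instead of date arithmetic.
(current,) = conn.execute(
    "SELECT region FROM dim_store WHERE store_id = 'S1' AND active_flag = 1"
).fetchone()
print(current)  # Mountain
```

Note the contrast with the snowflaked design: the same store move, applied there as an in-place update to a shared lookup row, would have rewritten history for every report.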
Snowflaked designs are often produced by novice data warehouse designers. The reasons vary, but I believe they center on the decreased storage and increased performance that normalization provides in online transaction processing (OLTP) systems. Dimensional models, by contrast, are denormalized: updates are limited to deactivating an active record (switching the active record flag and setting its end date), and new active records are inserted in bulk. Remember, dimensional models focus on giving the user the ability to consume both summarized and detailed information — data they can drill into or roll up to learn about the metrics they are researching. Finally, snowflaked designs should be used sparingly to ensure an optimal data warehouse design. You may come across cases where snowflaking is required, but I strongly recommend that you try to flatten the dimensions and see whether a simple star schema will work before settling on a snowflake design.