
Data Warehouse Design Techniques – Snowflaking

We will continue our deep dive through advanced dimensional data warehouse design techniques by discussing snowflaking.

Snowflaking Dimension Tables

According to Dr. Kimball, dimensions should be designed as long, denormalized records. Denormalization ensures maximum performance of the DW/BI queries. Snowflaking is the normalization of the dimension tables: redundant attributes are moved out into separate dimension tables. The motivation for normalizing is often a desire to “enhance” the design and reduce the storage costs of the warehouse.

Although this normalized design appears to optimize the storage and maintenance of the warehouse, it actually adds maintenance complexity and degrades the performance of DW/BI queries.
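To make the join overhead concrete, here is a minimal sketch of a snowflaked date dimension in SQLite. The table and column names are illustrative, not taken from the original post; the point is that the month, quarter, and year attributes have been normalized into their own tables, so even labeling a single date requires chaining joins back through every level.

```python
import sqlite3

# Hypothetical snowflaked date dimension: month, quarter, and year are
# normalized out into separate tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_year    (year_key INTEGER PRIMARY KEY, year_num INTEGER);
CREATE TABLE dim_quarter (quarter_key INTEGER PRIMARY KEY, quarter_name TEXT,
                          year_key INTEGER REFERENCES dim_year);
CREATE TABLE dim_month   (month_key INTEGER PRIMARY KEY, month_name TEXT,
                          quarter_key INTEGER REFERENCES dim_quarter);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT,
                          month_key INTEGER REFERENCES dim_month);

INSERT INTO dim_year    VALUES (1, 2024);
INSERT INTO dim_quarter VALUES (1, 'Q1', 1);
INSERT INTO dim_month   VALUES (1, 'January', 1);
INSERT INTO dim_date    VALUES (20240115, '2024-01-15', 1);
""")

# Three joins just to label a single date with its month, quarter, and year:
row = con.execute("""
    SELECT d.full_date, m.month_name, q.quarter_name, y.year_num
    FROM dim_date d
    JOIN dim_month m   ON m.month_key   = d.month_key
    JOIN dim_quarter q ON q.quarter_key = m.quarter_key
    JOIN dim_year y    ON y.year_key    = q.year_key
""").fetchone()
print(row)  # ('2024-01-15', 'January', 'Q1', 2024)
```

In a flattened star-schema date dimension, the same attributes would all sit on one row of `dim_date` and no joins would be needed.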

Reporting Using Snowflakes

As I demonstrated in a previous blog, Type 2 Slowly Changing Dimensions (SCD) provide the most flexibility for BI reporting because of the ability to report on historical data, including point-in-time reports. Snowflaking can hide the ability to see these changes in the snowflaked dimensions. Normalization, by definition, does not capture the transactional details of modifications to the data. This lack of transactional detail will cause BI reports to “drift,” moving with the changing data in the snowflaked dimensions.
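This “drift” can be demonstrated in a few lines. The following sketch (illustrative names, SQLite for brevity) updates a snowflaked department table in place; because no history is kept, re-running the same historical report afterwards silently attributes old sales to the new value.

```python
import sqlite3

# Snowflaked employee -> department design with an in-place update.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_department (dept_key INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE dim_employee   (emp_key INTEGER PRIMARY KEY, emp_name TEXT,
                             dept_key INTEGER REFERENCES dim_department);
CREATE TABLE fact_sales     (emp_key INTEGER, sale_date TEXT, amount REAL);
INSERT INTO dim_department VALUES (1, 'Hardware');
INSERT INTO dim_employee   VALUES (10, 'Pat', 1);
INSERT INTO fact_sales     VALUES (10, '2023-06-01', 500.0);
""")

report = """SELECT d.dept_name, SUM(f.amount)
            FROM fact_sales f
            JOIN dim_employee e   ON e.emp_key  = f.emp_key
            JOIN dim_department d ON d.dept_key = e.dept_key
            GROUP BY d.dept_name"""

before = con.execute(report).fetchall()
print(before)  # [('Hardware', 500.0)]

# The department is later renamed in place; no transactional detail is kept.
con.execute("UPDATE dim_department SET dept_name = 'Hardware & Tools' "
            "WHERE dept_key = 1")

# Re-running the *historical* report now shows the 2023 sale under the
# new name -- the report has drifted with the dimension data.
after = con.execute(report).fetchall()
print(after)  # [('Hardware & Tools', 500.0)]
```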

Here is an example of a snowflaked model.

Snowflaking Challenges

Let’s look at some detailed examples of the challenges this design can cause. First, the date dimension will require several joins to show the pay period information along with the month, quarter, and year hierarchy. To find sales per region, you will need to traverse the hierarchy from store to city to state to region. If you are looking for information about an employee, the data may change because the designer has reused the id from the source application and therefore will not capture changes in the employee dimension. In all of these cases, when the snowflaked data changes, the referencing dimension is not updated and does not know that the referenced data has changed. The detailed history of the change is lost, and it becomes impossible to accurately regenerate metrics from a specific point in time.
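The store-to-region traversal described above can be sketched as follows (again in SQLite, with illustrative names): every level of the snowflake adds another join to the sales query.

```python
import sqlite3

# Snowflaked geography hierarchy: store -> city -> state -> region.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_region (region_key INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE dim_state  (state_key INTEGER PRIMARY KEY, state_name TEXT,
                         region_key INTEGER REFERENCES dim_region);
CREATE TABLE dim_city   (city_key INTEGER PRIMARY KEY, city_name TEXT,
                         state_key INTEGER REFERENCES dim_state);
CREATE TABLE dim_store  (store_key INTEGER PRIMARY KEY, store_name TEXT,
                         city_key INTEGER REFERENCES dim_city);
CREATE TABLE fact_sales (store_key INTEGER, amount REAL);
INSERT INTO dim_region VALUES (1, 'Northeast');
INSERT INTO dim_state  VALUES (1, 'New York', 1);
INSERT INTO dim_city   VALUES (1, 'Albany', 1);
INSERT INTO dim_store  VALUES (1, 'Store 42', 1);
INSERT INTO fact_sales VALUES (1, 250.0), (1, 125.0);
""")

# Four joins just to roll sales up to the region level:
rows = con.execute("""
    SELECT r.region_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_store s  ON s.store_key  = f.store_key
    JOIN dim_city c   ON c.city_key   = s.city_key
    JOIN dim_state st ON st.state_key = c.state_key
    JOIN dim_region r ON r.region_key = st.region_key
    GROUP BY r.region_name
""").fetchall()
print(rows)  # [('Northeast', 375.0)]
```

With a flattened store dimension carrying city, state, and region as columns, the same rollup needs a single join from the fact table.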

Here are the same attributes captured in a denormalized star schema.

Star Schema Benefits

Looking at the denormalized star schema with no snowflaking, we can quickly see that this model is much easier to understand. Because of the limited number of joins, this model will perform better for reporting than the normalized design. In this design, we also see how the effective dating of each record is captured using the active record start date and active record end date. This allows us to regenerate any metric as it was presented to the user on any given date in the past. Flags are used to quickly identify the active records and are mainly used when loading the dimensions and building the fact records.
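A minimal Type 2 sketch of the effective-dating idea, with illustrative table and column names: each version of an employee row carries a start date, an end date, and an active flag, so any past metric can be regenerated as of a given date.

```python
import sqlite3

# Type 2 SCD dimension: one row per version, with effective dates and a flag.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_employee (
    emp_key INTEGER PRIMARY KEY,   -- surrogate key, new per version
    emp_id  INTEGER,               -- natural id from the source system
    dept_name TEXT,
    row_start_date TEXT, row_end_date TEXT, active_flag INTEGER
);
INSERT INTO dim_employee VALUES
    (1, 10, 'Hardware', '2023-01-01', '2023-12-31', 0),
    (2, 10, 'Garden',   '2024-01-01', '9999-12-31', 1);
""")

def dept_as_of(emp_id, as_of):
    """Return the employee's department as it was on the given date."""
    return con.execute("""
        SELECT dept_name FROM dim_employee
        WHERE emp_id = ? AND row_start_date <= ? AND row_end_date >= ?
    """, (emp_id, as_of, as_of)).fetchone()[0]

print(dept_as_of(10, '2023-06-15'))  # Hardware (point-in-time version)
print(dept_as_of(10, '2024-06-15'))  # Garden   (current version)
```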


Snowflaked designs are often developed by novice data warehouse designers. The reasons vary, but I believe they center on the decreased storage and increased performance that normalization provides to online transaction processing (OLTP) systems. Dimensional modeling requires denormalized tables because updates are limited to deactivating an active record (switching the active record flag and setting the end date), while inserts of new, active records are done in bulk. Remember, dimensional models are focused on providing the user with the ability to consume both summarized and detailed information, data which they can drill into or roll up to provide valuable information about the metrics they are researching. Finally, snowflaked designs should be used sparingly to ensure an optimal data warehouse design. You may come across times when snowflaking is required, but I strongly recommend that you try to flatten the dimensions and see if a simple star schema will work before settling on a snowflake design.
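The deactivate-then-insert load step described above can be sketched as a small routine (illustrative names, SQLite for brevity): a change never updates attributes in place; it closes the active row and inserts the replacement.

```python
import sqlite3

# Type 2 dimension maintained by close-and-insert rather than in-place update.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_employee (
    emp_key INTEGER PRIMARY KEY AUTOINCREMENT,
    emp_id INTEGER, dept_name TEXT,
    row_start_date TEXT, row_end_date TEXT, active_flag INTEGER)""")
con.execute("""INSERT INTO dim_employee
    (emp_id, dept_name, row_start_date, row_end_date, active_flag)
    VALUES (10, 'Hardware', '2023-01-01', '9999-12-31', 1)""")

def apply_change(emp_id, new_dept, change_date):
    # Deactivate the current version: flip the flag and set the end date...
    con.execute("""UPDATE dim_employee
                   SET active_flag = 0, row_end_date = ?
                   WHERE emp_id = ? AND active_flag = 1""",
                (change_date, emp_id))
    # ...then insert the new active version, preserving the old row's history.
    con.execute("""INSERT INTO dim_employee
                   (emp_id, dept_name, row_start_date, row_end_date, active_flag)
                   VALUES (?, ?, ?, '9999-12-31', 1)""",
                (emp_id, new_dept, change_date))

apply_change(10, 'Garden', '2024-01-01')
versions = con.execute("SELECT dept_name, active_flag FROM dim_employee "
                       "ORDER BY emp_key").fetchall()
print(versions)  # [('Hardware', 0), ('Garden', 1)]
```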



  1. Jay Quincy Allen

    “enhance” the design and reduce the storage costs?

This is not the reason for normalization. Normalization is completely about data integrity and avoiding DML anomalies. Yes, fully normalized data has an exponentially smaller footprint on disk. Also, there are performance benefits gained in writes with normalized structures. However, these are not the reasons for normalizing data.

    • Jim McHugh

Jay, I do agree that normalization ensures data integrity and prevents anomalies within the data, and these features of normalization are imperative in OLTP applications. Unfortunately, normalization in the form of snowflaking in data warehouses can cause severe degradation in performance due to the overhead costs of the extra joins.

      • John Owen

I always normalize first because you can’t denormalize without starting in normal form. It’s also not automatically true that denormalization means better performance; denormalization causes a data explosion and can increase the amount of processing for any unindexed column.

        Most modern RDBMS can construct materialized views to support denormalized perspectives. Column oriented databases can also take the pain out of non-redundant forms and un-indexed columns. Also many query optimizers support join-table elimination.

        • Jim McHugh

          John, Thank you for your comment.

          Designing a data warehouse is a complex process and there is no “golden model” that will work for every scenario because each set of requirements is different.

It’s not automatically true that denormalization means better performance. You cite processing time for an unindexed column. It is true that new CRUD statements against a recently modified data model can cause performance issues. These issues can be easily overcome by using partitioning and proper indexing to improve the performance of these CRUD statements.

          I also agree that many modern RDBMS can construct materialized views (MV) to support snowflaked architectures, but the creation of these MVs still incurs costs of time, CPU, and memory.

Finally, here is a post (Columnar Database and Star Schema) that is focused on a Kimball star schema data modeling solution and column-oriented data warehouses.
