Updating Type II Slowly Changing Dimensions

Introduction to KNIME
March 6, 2019
Beyond Recommender Systems: Alternative Applications of Matrix Factorization (Part 1)
March 27, 2019
Show all

In this blog post I will provide an example of how I implement Type II dimensions in my data warehouses.

The process I go through when refreshing data into a Type II dimension is to first land the data from the source system. I always create a separate schema to hold the data, as it is presented from the source system with no transformations. The tables and columns are named exactly as they are in the source system. The basic idea here is to quickly off-load the data from the source system, thereby eliminating any long running or resource intensive processes from impacting the source application while providing a link back to the source system.

Once all of the data is landed, I can now proceed to the transformation and staging step. In this phase, I create staging tables for all of my dimensions. These tables look similar to the dimension table but they do not have a primary key or a majority of audit columns. Here is an example of a regular dimension table and its corresponding staging dimension.

If the staging environment is a full pull of all active customers, I will retrieve the information from the source landing tables and make the appropriate transformations to the data before landing it in the staging table. As you can see from the example above there is a data hash audit column in both the staging and data warehouse dimension. This column is a numeric value based on the hashing of all of the non-audit columns within the dimension. This allows me to use this column to help me quickly find the dimensional changes happening in the source system that I care about in the data warehouse.

Again, assuming a full retrieval of all active customers from the source system, I can quickly find my adds and (soft) deletes by comparing the natural key (CSTMR_ID) and the data hash found in the staging dimension against the natural key (CSTMR_ID) and the data hash of the active record (CSTMR_ACTV_RCRD_FL = 1) in the DW dimension via two simple SQL statements. Below are the examples of the SQL statements for a full refresh.

If I receive only the data that changes from the source system then I would follow the steps for the full list of active customers listed above but now I only need to worry about the records that are currently in the staging dimension. Below is the updated SQL statement for soft deleting a dimension record when the staging tables which contain only the data which has changed. There is no change in the SQL statement for the inserting of records the staging tables which contain only change data captured data.

I hope that you found this article useful and continue to use it as a reference as you implement your data warehouse.

Jim McHugh
Jim McHugh
An accomplished Sr. IT leader with over 27 years of professional experience applying technology solutions to business challenges. Mr. McHugh has expertise in data modeling, data governance, business intelligence, predictive analytics and data science. His responsibilities include establishing and executing a strategy that ensures the application of data management & analytics to enable an organization to strategically leverage and fully realize the value of their data.

2 Comments

  1. mingchau says:

    Hi Jim,

    Thanks for sharing these articles. They are very clear to understand 🙂
    Is there a rss feed to subscribe your articles?
    Also, may I know which database GUI tool you are using?

    Thanks,
    Mingchau

    • Jim McHugh says:

      Thank you for your kind words. I do not use an RSS feed at this time. The GUI tools I use are Oracle SQL Developer for the SQL and Oracle SQL Developer Data Modeler and ERWin for the data modeling.

Leave a Reply to mingchau Cancel reply

Your email address will not be published. Required fields are marked *

Contact