
In this blog post I will provide an example of how I implement Type II dimensions in my data warehouses.

The process I go through when refreshing data into a Type II dimension is to first land the data from the source system. I always create a separate schema to hold the data as it is presented by the source system, with no transformations. The tables and columns are named exactly as they are in the source system. The basic idea here is to quickly off-load the data from the source system, which prevents any long-running or resource-intensive processes from impacting the source application while still providing a link back to the source system.
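
A minimal sketch of the landing step, assuming Oracle syntax; the LND schema, the CUSTOMER table, and the SRC_DB_LINK database link are hypothetical placeholders for whatever the source system exposes:

CREATE TABLE LND.CUSTOMER AS
SELECT *
  FROM CUSTOMER@SRC_DB_LINK   -- hypothetical database link to the source system
 WHERE 1 = 0;                 -- copy the structure only

INSERT /*+ APPEND */ INTO LND.CUSTOMER
SELECT *
  FROM CUSTOMER@SRC_DB_LINK;  -- quick off-load, no transformations applied
COMMIT;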

Once all of the data is landed, I can proceed to the transformation and staging step. In this phase, I create staging tables for all of my dimensions. These tables look similar to the dimension tables, but they do not have a primary key or most of the audit columns. Here is an example of a regular dimension table and its corresponding staging dimension.
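
A rough sketch of such a pair might look like this (Oracle syntax; CSTMR_ID and CSTMR_ACTV_RCRD_FL are the column names used later in this post, while the surrogate key CSTMR_KEY, the sample attribute CSTMR_NM, the hash column CSTMR_DATA_HASH, and the date audit columns are hypothetical names used only for illustration):

CREATE TABLE DIM_CUSTOMER (
    CSTMR_KEY            NUMBER         NOT NULL,   -- surrogate key (hypothetical name)
    CSTMR_ID             NUMBER         NOT NULL,   -- natural key from the source system
    CSTMR_NM             VARCHAR2(100),             -- example business attribute
    CSTMR_DATA_HASH      NUMBER         NOT NULL,   -- hash of the non-audit columns
    CSTMR_ACTV_RCRD_FL   NUMBER(1)      NOT NULL,   -- 1 = active record, 0 = expired/soft deleted
    CSTMR_EFF_DT         DATE           NOT NULL,   -- audit: when the record became active
    CSTMR_XPR_DT         DATE,                      -- audit: when the record was expired
    CONSTRAINT DIM_CUSTOMER_PK PRIMARY KEY (CSTMR_KEY)
);

-- The staging dimension carries the business columns and the hash,
-- but has no surrogate primary key and few of the audit columns.
CREATE TABLE STG_DIM_CUSTOMER (
    CSTMR_ID             NUMBER         NOT NULL,
    CSTMR_NM             VARCHAR2(100),
    CSTMR_DATA_HASH      NUMBER         NOT NULL
);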

If the staging load is a full pull of all active customers, I retrieve the information from the source landing tables and apply the appropriate transformations to the data before landing it in the staging table. As you can see from the example above, there is a data hash audit column in both the staging and data warehouse dimensions. This column is a numeric value produced by hashing all of the non-audit columns within the dimension, which lets me quickly find the dimensional changes happening in the source system that I care about in the data warehouse.
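
As an illustration of populating the staging dimension and computing the hash, a sketch like the following could be used (Oracle syntax assumed; ORA_HASH is just one possible numeric hash function, and the landing table and column names LND.CUSTOMER, CUSTOMER_ID, CUSTOMER_NAME, and ACTIVE_FLAG are hypothetical):

INSERT INTO STG_DIM_CUSTOMER (CSTMR_ID, CSTMR_NM, CSTMR_DATA_HASH)
SELECT C.CUSTOMER_ID,
       UPPER(TRIM(C.CUSTOMER_NAME)),                 -- example transformation
       ORA_HASH(C.CUSTOMER_ID || '|' ||
                UPPER(TRIM(C.CUSTOMER_NAME)))        -- numeric hash of the non-audit columns
  FROM LND.CUSTOMER C
 WHERE C.ACTIVE_FLAG = 'Y';                          -- full pull of active customers
COMMIT;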

Again, assuming a full retrieval of all active customers from the source system, I can quickly find my adds and (soft) deletes by comparing the natural key (CSTMR_ID) and data hash in the staging dimension against the natural key (CSTMR_ID) and data hash of the active record (CSTMR_ACTV_RCRD_FL = 1) in the DW dimension via two simple SQL statements. Below are example SQL statements for a full refresh.
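
A sketch of what those two statements might look like, reusing the hypothetical names from the earlier sketches (CSTMR_DATA_HASH, CSTMR_XPR_DT, CSTMR_EFF_DT, CSTMR_KEY, and the DIM_CUSTOMER_SEQ sequence are illustrative; CSTMR_ID and CSTMR_ACTV_RCRD_FL come from the post):

-- Soft delete: expire any active DW record whose natural key / data hash
-- combination is no longer present in the staging dimension.
UPDATE DIM_CUSTOMER DC
   SET DC.CSTMR_ACTV_RCRD_FL = 0,
       DC.CSTMR_XPR_DT       = SYSDATE
 WHERE DC.CSTMR_ACTV_RCRD_FL = 1
   AND NOT EXISTS (SELECT 1
                     FROM STG_DIM_CUSTOMER SDC
                    WHERE SDC.CSTMR_ID        = DC.CSTMR_ID
                      AND SDC.CSTMR_DATA_HASH = DC.CSTMR_DATA_HASH);

-- Insert: add a new active record for any staging row whose natural key /
-- data hash combination does not match an active DW record.
INSERT INTO DIM_CUSTOMER
       (CSTMR_KEY, CSTMR_ID, CSTMR_NM, CSTMR_DATA_HASH,
        CSTMR_ACTV_RCRD_FL, CSTMR_EFF_DT)
SELECT DIM_CUSTOMER_SEQ.NEXTVAL,                     -- hypothetical surrogate key sequence
       SDC.CSTMR_ID,
       SDC.CSTMR_NM,
       SDC.CSTMR_DATA_HASH,
       1,
       SYSDATE
  FROM STG_DIM_CUSTOMER SDC
 WHERE NOT EXISTS (SELECT 1
                     FROM DIM_CUSTOMER DC
                    WHERE DC.CSTMR_ACTV_RCRD_FL = 1
                      AND DC.CSTMR_ID           = SDC.CSTMR_ID
                      AND DC.CSTMR_DATA_HASH    = SDC.CSTMR_DATA_HASH);
COMMIT;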

If I receive only the data that changes from the source system, I follow the same steps as for the full list of active customers above, but now I only need to worry about the records that are currently in the staging dimension. Below is the updated SQL statement for soft deleting a dimension record when the staging tables contain only the data that has changed. There is no change to the SQL statement for inserting records when the staging tables contain only change-data-capture data.
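
A sketch of the updated soft delete, again using the hypothetical names from above; the only difference from the full-refresh version is the extra IN clause, which limits the comparison to customers that are actually present in the staging dimension:

UPDATE DIM_CUSTOMER DC
   SET DC.CSTMR_ACTV_RCRD_FL = 0,
       DC.CSTMR_XPR_DT       = SYSDATE
 WHERE DC.CSTMR_ACTV_RCRD_FL = 1
   AND DC.CSTMR_ID IN (SELECT CSTMR_ID FROM STG_DIM_CUSTOMER)   -- only customers in this change batch
   AND NOT EXISTS (SELECT 1
                     FROM STG_DIM_CUSTOMER SDC
                    WHERE SDC.CSTMR_ID        = DC.CSTMR_ID
                      AND SDC.CSTMR_DATA_HASH = DC.CSTMR_DATA_HASH);
COMMIT;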

I hope that you found this article useful and continue to use it as a reference as you implement your data warehouse.

Posted in Blog

4 Comments

  1. mingchau

    Hi Jim,

    Thanks for sharing these articles. They are very clear and easy to understand 🙂
    Is there an RSS feed to subscribe to your articles?
    Also, may I know which database GUI tool you are using?

    Thanks,
    Mingchau

    • Jim McHugh

      Thank you for your kind words. I do not use an RSS feed at this time. The GUI tools I use are Oracle SQL Developer for the SQL and Oracle SQL Developer Data Modeler and ERWin for the data modeling.

  2. MAHESH KUMAR

    Hi Jim,
    Only recently, I came across your blog. I have started reading your articles and find them to be very useful and crisp. Thanks a lot for the good work! Please keep continuing…

    Regarding this blog, in the 2nd scenario, where the staging table load is based on “the extraction of only the changed records”, is the load strategy for the staging table “Truncate and Load”? If yes, I am not clear as to why the below condition is used:

    …AND DC.CSTMR_ID IN (SELECT CSTMR_ID FROM STG_DIM_CUSTOMER)…

    Let me explain with the scenario below:

    1) On the first day, as part of the initial load, 20 records are extracted into the staging table and then into the DW table.
    2) On the 2nd day, 5 records are updated at the source system and 3 records are deleted. During the extraction process, only the 5 records will be extracted into the staging table. In this case, how will the 3 records that got deleted at the source be identified?

    Please advise.

    • Jim McHugh

      In the example, I described how to soft delete records when working only with data that has changed. The query that I provided will soft delete the records that you have retrieved, since they do not have the same hash value as the records in the production DW table. Your question is how to handle records that were deleted from the source system when the deleted records are not provided via the CDC process.

      I was assuming that you know which records were updated or deleted in the source system as a part of the change data capture process. Therefore, I would expect that there is some indicator that states whether the record is new/updated or deleted. For the records that are updated or deleted, I run those records through the delete process as stated in the example. If you do not know which records were deleted, you can run the full compare process, or you could keep your staging table synchronized with the source table via a log-based replication tool (Data Guard, GoldenGate, etc.). When a record is deleted from the source system, the replication tool will also delete it from your staging table, and the deletion will propagate to the DW on the next ETL run.

      I hope this helps. Please let me know if you need additional clarification.
