In my previous post about Slowly Changing Dimensions (SCDs), I provided you an overview of the four most common SCDs types and how they are used. I have noticed quite a bit of traffic to this post over the past several years, and I thought now would be an excellent time to come back and take a closer look at SCDs.
Dimensions are tables containing data about an object. I like to think of dimensions as nouns (person, place, thing, or concept) The attributes of the dimension are the adjectives that provide descriptive detail about the dimension (name, height, color, price, etc.). Dimensions often contain attributes defined by the customer to group the dimensions into additional subsets. For example, you may have a dimension of products that will contain all of the pertinent detailed attributes of the products you sell. It may also contain additional information like product type, sale type, or vendor/brand. These attributes are used to group several products together, providing the user with additional ways to analyze the fact data.
Dimension tables can vary in size from a handful of rows to millions of rows. It is unusual for a dimension table to have hundreds of millions of rows. However, even dimensions with over a hundred million rows are relatively small when compared to the number of fact records associated with that dimension. Dimensions can be very wide, containing many attribute columns. I have created dimensions with over 150 attributes on more than one occasion. Although it is unusual to have so many attributes in a dimension, it does happen from time to time. If you run into situations where you have a large number of attributes, you may want to consider creating mini-dimensions, Snowflaking, or normalizing the dimension or using Outriggers to address repeating attributes.
There are eight (8) types of SCDs, and I will walk you through all of them here and provide you with scenarios of when to use one over another.
Type 0 Dimensions do not allow changes to the dimensional attributes. Although this sounds strange, we use a Type 0 SCD in every data model we create. The Date and Time dimensions are examples of Type 0 Dimensions. May 6, 2020, will always be a Wednesday, the 127th day of 2020, etc. We can add additional attributes to extend the dimension, but the attributes initially inserted to the dimension will NEVER change.
Type 1 Dimensions allow for direct, in-line updating of the dimensional attributes. This means that the record in the dimension is always representing the current state of the dimension with no history stored.
Let’s look at an example. Here in Figure 1. you have a store dimension that has an attribute of Region.
If there is a change in the regional boundaries, some stores may be moved from one region to another. In Figure 2., we see that store ID VA-13 has moved from the Northeast district to the Mid-Atlantic district.
Because the record has been updated, store ID VA-13, which had been reporting results in the Northeast district before the change, will now report all of their current AND historical results to the Mid-Atlantic region. This change will impact all of the historical reports because all of the revenue from the VA-13 store will be credited to the Mid-Atlantic region and subtracted from the Northeast region. All reports run before the update no longer matches the reports run after the update for the same timeframe.
In this example, the use of SCD Type 1 has a significant impact on the reporting of the facts because something that was grouped by the Region attribute type is suddenly moved to another Region with no explanation of the change in reporting numbers. I have run into many systems that were designed with Type 1 dimensions, and the customers have always questioned the validity of the data and complained that the reports they ran last week don’t match the numbers they are getting today. Worse, without an enormous amount of reviewing the ELT process, researching a database changelog, or reviewing the implementation of a change data capture architecture in the warehouse the customer cannot easily see and understand the changes that were made and from which data source (application) the data changes originated.
So, when would you create a Type 1 SCD? When the customer knows and understands the impact of a Type 1 SCD, and the impact on reporting is limited. I have never made a full warehouse with Type 1 SCDs, but I usually include a few of these dimensions in my models because some dimensions just do not require history to be maintained.
Again, I stress that you keep the customer informed of this design decision and document if and when this design will impact their reporting. Not only will this documentation keep you out of trouble in the eyes of your customer, but it will make it easy to find the source of the changes when the customer eventually asks why a report “unexpectedly” changed. Remember, it is your job as a Data Architect and/or Data Modeler to make sure your customer makes informed decisions and fully understands the impact of those decisions. You should never be looking for the easy way out, but instead, understand the customer’s immediate requirements and make sure they will meet the customer’s future expectations.
Type 2 Dimensions allow you to capture the history of the changes to the dimension by creating a new record with a surrogate key for each version of a natural key provided by the source system. When a difference in the record is detected, the latest version is inserted into the dimension, and the old record is inactivated from use during the Change Data Capture (CDC) and use by current fact records. A well documented and tested CDC is critical here because you could quickly create millions of dimensional records because of an error in the CDC process.
Let’s walk through an example using the same data as SCD 1. In Figure 3. You will notice I have added some additional columns (ACTV_RCRD, ACTV_STRT, ACTV_END). The active record flag is the only column that is required, but I add the active record start and end date to provide the ability for my customers to do point-in-time reporting.
Not included in this example is a record hash (RCRD_HASH). I create a hash value of the concatenated values in the columns that constitute a change in the record. I do not include columns that I expect to change on a regular basis and that the customer does not want to consider a change in the record. This hash column, along with the natural key, makes it easy to identify the changed records by comparing the natural key and the hash value of today’s record against the natural key and the hash value of the current record in the dimension.
In Figure 4. you will see that I have added a new record for the VA-13 store ID (Key 789). I then updated the original version of the record (Key 123) to no longer be active (ACTV_RCRD = 0) and set the end date to the day before the new record was inserted into the dimension (ACTV_END = 20160728). Now I have a full history of the changes to the dimensional attributes. If the customer asks what changed and when I can quickly bring up the records for the natural key and sort them by the ACTV_STRT column, now they will be able to see how the dimensional attributes have changed over time.
I use Type 2 SCDs most of the time for my modeling. I feel this dimensional type that provides the most flexibility for reporting and the greatest understanding of dimensional changes over time by the customer.
Type 3 Dimensions has columns for the previous version of the attributes. These additional columns allow one to report on the previous value while retaining a single record for each natural key, as we do in Type 1 SCDs. When a change is detected, the current value is “moved” to the previous value column, and the new value is “updated” in the original reporting column.
Looking at Figure 5. we see that we only care about capturing the previous Region for the records. If the name changes, we just update the name and make no change to the rest of the record. If the region changes as we see in Figure 6. we move the current value in Region to the Previous_Region column and add in the new Region into the Region column.
When reporting, we can use both columns so that people can see the record was changed at some previous time. I recommend adding a column to capture the date the change was made, as this additional information is often valuable in understanding the impact this change has on the report/business.
This type is not often used because it doesn’t capture all of the changes, lacks history if changes past the most recent change and is hard to use by modern reporting applications. Like Type 1 SCDs, I highly recommend making your customers aware of the full impact of this design decision and document this decision.
Type 4 Dimensions are mini dimensions that are derived from other dimensions based on relatively frequently changing data. For example, looking at Figure 7., we see a customer dimension where you have attributes for data of birth and income.
We know the customer’s age and income will change at least once per year, but we don’t necessarily want to add a new dimensional record every time one of these attributes change. So we create a mini dimension (Figure 8.) that contains age and income bands that the customer may use for reporting. Age is a calculated field using the data of birth, and the income can be simply updated as a type 1 attribute in the dimension. Now that we have addressed how to capture the changing attributes, we can now create an association with the mini-dimension and add that key to the fact table.
The above five SCD types are used with some relative frequency in large data models. The remaining three SCD types are hybrids, and although they are rarely used, they provide excellent options to challenging situations you may come across. Therefore, I include these SCDs in this deep dive, so you have a full understanding of all of your options when designing your dimensional data warehouse.
Type 5 Dimensions are a combination of Type 4 SCD and Type 1 SCDs. Here you create your model as you would for a Type 4 SCD, but now you add a Type 1 Outrigger to the dimension. Looking at Figure 9. you will notice the similarity to Figure 8. but notice the outrigger associated with the customer dimension. It contains the same information as the mini-dimension, but you can access this information without having to go through a fact table.
I don’t like to store information twice, so when I come across situations like this, I will choose the mini-dimension or the outrigger based upon the requirements and what is in the best interest of my customer now and in the future.
Type 6 Dimensions are a hybrid combination of Type 1, Type 2, and Type 3 SCDs that supports as-is and as-was reporting. In this SCD type, we will change all of the current value records for a specific natural key. The historical value will match the current record, and therefore shows the changing historical value of the record over time (see example in Figure 10.)
Here we can see the Type 2 SCD as we add new active records as the values in the dimension change. We capture the Type 3 by adding the historical column, where we capture the changes from one record to the next. Finally, we see a type 1 SCD change in the current column, as this will change for all versions of the natural key.
Although this is rarely used, it is a powerful modeling option to use when the situation arises.
Type 7 Dimensions provide the same functionality as Type 6 SCDs, reporting on the current version of the SCD, but you do not want to incur the cost of a table scan on a very large dimension. Unlike in Type 6 SCDs, Type 7 SCDs keep the current version of the dimension in a separate data structure, and the keys of this current dimension and the large dimension table are associated with the fact record. Imagine you are working as a Data Modeler for a large corporation, and you have several million customer records. The majority of your reports only require the current version of the customer.
Looking at Figure 11. we see a DIM_CUSTOMER and DIM_CUSTOMER_CURRENT, and both are associated with FACT_SALE. If my report only cares about the current version of the customer, I can use the key associated with DIM_CUSTOMER_CURRENT to generate the report. Since this is a list of current active customers with no historical changes, I will select and process fewer records and therefore use less CPU and memory and be able to present the results faster than if I had to go through the DIM_CUSTOMER dimension.
I have used this SCD type in the past when I come across large dimensional tables, and the reports are primarily focused on the current value of the dimension.
I hope you found this deep dive into Slowly Changing Dimensions informative. If you have questions about dimensional modeling or data warehousing best practices, please feel free to ask your question in the comment box below, and I will respond as quickly as possible.