Late Arriving Facts

Using COPY command to Load CSV files from S3 to RDS Postgress Database via KNIME Workflow
August 14, 2019
Plotly.js: Getting Accurate Coordinates from Click Event
September 4, 2019
Show all

Late arriving facts are those transactions which are delayed in arriving to the warehouse. For Type I data warehouses there is no real impact but for Type II warehouse models there is a minor challenge that the ETL developer must remember to handle to ensure the late arriving transaction corresponds with dimensional attributes at the time of the transaction.

If we have a Type I warehouse or are always assured that we only receive fact data at the same time as the dimensional data then we can use the following simple query where we find the current active record for the dimension’s natural ID which is found in the fact record:

SELECT dimensional_key
   FROM dimension_table
WHERE dimension_natural_id = {natural_id from fact}
     AND dimension_actv_rcrd_fl = 1;

Taking a look at the Type II data warehouse challenge, we cannot assume that the active dimensional record is the correct record for the fact. Therefore, we need to modify the ETL workflow process (Figure 1) to address the challenge of the possible changing of dimensional data since the “old” fact occurred.

Figure 1

To address this issue, we need to add an additional check when associating the dimensional keys to the fact table. We must find the dimensional key value where the transaction date key is between the dimensional active record start and end dates (Figure 2 & Figure 3) to ensure the accuracy of the data at the time of the transaction.

Figure 2
Figure 3

The query needed to find this record is slightly different since we need to find the record between two dates instead of by finding the most active record:

SELECT dim_key
  FROM dimension_table
WHERE dim_natural_id = {natural_id from fact}
    AND {trnsctn_dt from fact} BETWEEN dim_actv_rcrd_strt_dt AND dim_actv_rcrd_end_dt;

Indexing Tip: Only index the natural identifier on the dimension for best performance. Including the date columns in the index will not improve performance. It will only make your index larger and thereby reduce the performance of the index. Remember dimensions are supposed to be wide and shallow. If you have a rapidly changing dimension you will need to find a way to eliminate the attributes causing the dimensional change (see my Rapidly Changing Dimensions post).

In conclusion, make sure you know your data. Profiling of your data and full knowledge of your customer’s business process is critical to a successful implementation of your data warehouse.

In my next blog, I will discuss the challenge of late arriving dimensional records and the impact it has on the accuracy of the data in the data warehouse.

Jim McHugh
Jim McHugh
An accomplished Sr. IT leader with over 27 years of professional experience applying technology solutions to business challenges. Mr. McHugh has expertise in data modeling, data governance, business intelligence, predictive analytics and data science. His responsibilities include establishing and executing a strategy that ensures the application of data management & analytics to enable an organization to strategically leverage and fully realize the value of their data.

Leave a Reply

Your email address will not be published. Required fields are marked *

Contact