The three keys to warehouse project success are to quickly, efficiently and consistently refresh the data in the warehouse. Here are four ways to design the process with some being generally more successful than others.
Refresh the Dimensions, then the Facts
This approach attempts to make sure that the process is completed with minimal negative impact should an error arise in the ETL workflow. We accomplish this by inserting/updating (upserting) the Dimension Data, then move on to the Fact Data. Our reasoning is that modifying dimensions has no impact on the reports. This isolates the fact loading workflow as the only place where a report impacting error could occur. In most cases this only will impact a single fact table if incremental commits are needed during the loading of the fact. That said, there remains a potential for the customer to have an inconsistent view of the data because the workflow still is running. Quick and efficient with the potential for inconsistency in data and reports.
Refresh the Partition
This approach is similar to the above process except the Fact Data will be partitioned. It allows only the current (and sometimes previous) partition to refresh at any one time by truncating the partition and running an insert select. Because this approach limits the amount of data being transferred it should be fast but there still is the possibility of inconsistency while the truncate / insert process is running.
Rename the Tables
Renaming tables can be an effective way to minimize the risk of data inconsistency. In this process one would create a second copy of the data and make changes to these tables. Once the refresh process is complete a process is kicked-off to rename the tables by changing the current table to a temporary name (perhaps with a dash (-) at the end), then change the refreshed table to be the name of the table (remove the underscore) to be used by the reporting system and finally changing the name of the original table to have an underscore so it can be used by the refresh process during the next instantiation. Because you’ll be using at least twice the disk space and increasing the time to refresh data, this consistency has a high server cost.
I try to ensure that my blogs are vendor neutral but occasionally one product offers something that no one else does. Partition Exchange is an Oracle enterprise database offering and hits all three pillars (quick, efficient and consistent) of an effective data refresh solution.
For a partition exchange process, one creates a partitioned fact table (required by any large data warehouse fact table). I recommend that the table be partitioned by (date) range but that is not required. Next set up a swap partition table. This table will not be partitioned but will be an exact replica of the partitioned table. The ETL workflow will load the non-partitioned table(s) to include all of the information necessary for that time frame. Once all of the fact tables are loaded the exchange partition process is kicked-off. This process will essentially swap the pointers from the partition with the data in the stand-alone table. This process takes a fraction of a second no matter the size of the data in the table and partition. There is an added bonus, if a query is run against these tables while the exchange process is taking place: the query will be held until the process is complete, thereby ensuring that the data is consistent. The final benefit of this process is that there is an easy fallback should there be an error in the data. To recover from a processing error, just run the exchange partition commands again and the data will be back to the way it was before the process started.
This is my go-to solution when Oracle DB is the engine for the warehouse. When I use other tools, I will use some combination of the other processes based upon the functionality provided by the database and ELT software.
I hope you found this interesting and I look forward to your feedback concerning your experiences loading large data warehouses.