Jim McHugh

January 8, 2020

What Is an Operational Data Store?

The definition of an Operational Data Store (ODS) is a subject-oriented, integrated, volatile collection of current data designed to support the tactical decision-making of an organization. An ODS can also serve as a data hub, synchronizing data between disparate and unconnected applications within the enterprise. Let’s break down the definition so we can fully understand the ODS. First, we state that an ODS is subject oriented; therefore, it should be centered around a single business concept, such as a customer or product. Next, we state that an ODS is integrated. Here we are stating that we can pull […]
December 26, 2019

Data Refresh Techniques

The key to warehouse project success is refreshing the data in the warehouse quickly, efficiently, and consistently. Here are four ways to design the process, some generally more successful than others. Refresh the Dimensions, then the Facts This approach attempts to ensure that the process completes with minimal negative impact should an error arise in the ETL workflow. We accomplish this by inserting/updating (upserting) the dimension data, then moving on to the fact data. Our reasoning is that modifying dimensions has no impact on the reports. This isolates the fact loading workflow as the […]
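
A minimal sketch of the dimension upsert step, using a standard MERGE (the table and column names here are illustrative, not from the post):

-- Upsert the customer dimension before touching any facts.
-- DIM_CUSTOMER and STG_CUSTOMER are hypothetical warehouse/staging tables.
MERGE INTO DIM_CUSTOMER d
USING STG_CUSTOMER s
   ON (d.CUSTOMER_ID = s.CUSTOMER_ID)
WHEN MATCHED THEN UPDATE SET
     d.CUSTOMER_NAME = s.CUSTOMER_NAME
    ,d.CUSTOMER_CITY = s.CUSTOMER_CITY
WHEN NOT MATCHED THEN INSERT
     (CUSTOMER_ID, CUSTOMER_NAME, CUSTOMER_CITY)
     VALUES (s.CUSTOMER_ID, s.CUSTOMER_NAME, s.CUSTOMER_CITY);

Only after the dimension merges succeed does the fact load run, so a failure here leaves the facts, and therefore the reports, untouched.
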
November 26, 2019

Drilling Down / Drilling Through (Across)

You often hear “drilling down” and “drilling across” from business intelligence professionals, but what do these terms mean, and how do you, as a data modeler, design for this requirement? This blog post will provide you with insights on how to do this. To drill down means to decompose fact data via hierarchical dimensional attributes. Time and geography (location) are good examples of hierarchical dimensions. In a complete time dimension, you will have the day, month, quarter, and year hierarchy. In a geography dimension you may have a city, district, state, and region hierarchy. Consider you’re shown a dashboard with year […]
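
To make the idea concrete, a drill-down is simply the same measure grouped at the next level of the hierarchy. A sketch against illustrative names (FACT_SALE, DIM_DATE, and their columns are mine, not from the post):

-- Year-level view of sales.
SELECT d.CALENDAR_YEAR, SUM(f.SALE_AMOUNT) AS TOTAL_SALES
FROM FACT_SALE f
JOIN DIM_DATE d ON f.DATE_KEY = d.DATE_KEY
GROUP BY d.CALENDAR_YEAR;

-- Drilling down to quarter just adds the next attribute in the hierarchy.
SELECT d.CALENDAR_YEAR, d.CALENDAR_QUARTER, SUM(f.SALE_AMOUNT) AS TOTAL_SALES
FROM FACT_SALE f
JOIN DIM_DATE d ON f.DATE_KEY = d.DATE_KEY
GROUP BY d.CALENDAR_YEAR, d.CALENDAR_QUARTER;
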
November 18, 2019

Load Multiple JSON Files into a Database Table Using KNIME

I was recently tasked with loading a lot of historical JSON files into a database. I wanted to use KNIME to create a repeatable workflow to address this issue, since this would need to be an ongoing process after the initial loading of data was complete. Here is the workflow I created (Figure 1). Let’s take a look at the component steps for a better understanding of each node and why they are used. I started by reading a list of JSON files contained in a specified directory (Figure 2). It is obvious that this is needed for the […]
October 30, 2019

Dynamic Looping in KNIME

I recently came across a situation where I needed to load a couple of table partitions. I wanted to automate this process and have a table drive it. To accomplish this task, I started by creating a table of partitions that would drive the dynamic refresh process. My requirement was to be able to refresh the current month’s and the previous month’s data. (See the DDL for the partitioned fact table below.)

CREATE TABLE FACT_BOOK_SALE
( FBS_BOOK_KEY NUMBER NOT NULL ENABLE
 ,FBS_CSTMR_KEY NUMBER NOT NULL ENABLE
 ,FBS_STORE_KEY NUMBER NOT NULL ENABLE
 ,FBS_DATE_KEY NUMBER NOT NULL ENABLE
 ,FBS_BOOK_ISBN_13 NUMBER
 ,FBS_CSTMR_ID […]
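
A sketch of the driver-table idea (the table and column names are mine, not from the post): a small control table lists the partitions to refresh, and the workflow loops over its rows.

-- Hypothetical driver table: one row per partition to refresh.
CREATE TABLE ETL_PARTITION_DRIVER
( PRTN_NAME VARCHAR2(30) NOT NULL
);

-- Point it at the current month and the previous month.
INSERT INTO ETL_PARTITION_DRIVER (PRTN_NAME)
SELECT 'P_' || TO_CHAR(ADD_MONTHS(SYSDATE, 1 - LEVEL), 'YYYYMM')
FROM DUAL
CONNECT BY LEVEL <= 2;
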
October 23, 2019

Updating / Replacing Fact Records

You will inevitably come across a time when a source system allows the updating of transaction records. This drives data modelers, ETL developers, analytics designers, and analytics users crazy. After all, the source system stated that this information was a “fact,” leading everyone to believe it would not change; now you must design for and resolve the changing fact data. In this blog post we will discuss the options and the impact these decisions will have on your warehouse and BI solution. Let’s start with the easy and move to the complex. The easiest thing to do when […]
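
One of the common options (sketched here under assumed names; the post's own recommendation is elided above) is to reverse and re-book the transaction rather than update it in place, which preserves the audit trail:

-- Insert a negating row so the net effect of the original is zero.
-- FACT_SALE, TRANSACTION_ID, and SALE_AMOUNT are illustrative names.
INSERT INTO FACT_SALE (CSTMR_KEY, PRODUCT_KEY, DATE_KEY, TRANSACTION_ID, SALE_AMOUNT)
SELECT CSTMR_KEY, PRODUCT_KEY, DATE_KEY, TRANSACTION_ID, -SALE_AMOUNT
FROM FACT_SALE
WHERE TRANSACTION_ID = :changed_txn_id;

-- ...then insert the corrected version of the transaction from staging.
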
September 25, 2019

Late Arriving Dimensions

In my last post I discussed late arriving facts, and although it is slightly annoying to receive delayed transaction data, the impact on the ETL and on the accuracy of the data in the data warehouse is minimal. As we will discuss in this blog post, this is not the case for late arriving dimension (early arriving fact) data. As I have discussed throughout this blog series on designing and loading dimensional data warehouses, we always load data in the following order: lookup tables, then dimensions, and finally facts. But what happens when you get fact data before the dimension data arrives, or […]
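
The standard remedy, sketched here with assumed names, is to insert an inferred (placeholder) dimension row when a fact arrives before its dimension member, so the fact has a surrogate key to land on; a later dimension load then fills in the real attributes.

-- DIM_CSTMR, its columns, and DIM_CSTMR_SEQ are illustrative.
INSERT INTO DIM_CSTMR (CSTMR_KEY, CSTMR_ID, CSTMR_NAME, INFERRED_FLAG)
VALUES (DIM_CSTMR_SEQ.NEXTVAL, :source_customer_id, 'UNKNOWN', 'Y');
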
September 18, 2019

XPATH & KNIME

Yesterday I had to import an XML file into a series of relational database tables. I started by thinking of the best way to import this data. I could write a Python parser by hand and load the data that way, but I am a lazy programmer. I’m looking for something that is quick to deploy and easy to maintain. That is when I thought of my trusty, open-source ETL tool, KNIME. For this example, I will be using an XML version of the Shakespeare play Hamlet. Everything started off fine. I grabbed the XML Reader node and placed it […]
August 21, 2019

Late Arriving Facts

Late arriving facts are those transactions which are delayed in arriving at the warehouse. For Type I data warehouses there is no real impact, but for Type II warehouse models there is a minor challenge the ETL developer must remember to handle: ensuring the late arriving transaction corresponds with the dimensional attributes in effect at the time of the transaction. If we have a Type I warehouse, or are always assured that we only receive fact data at the same time as the dimensional data, then we can use the following simple query where we find the current active record for […]
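
The post's query is truncated above, but a current-active-record lookup against a Type II dimension typically looks something like this (table and column names assumed):

-- Simple case: grab the open-ended (current) row.
SELECT d.CSTMR_KEY
FROM DIM_CSTMR d
WHERE d.CSTMR_ID = :source_customer_id
  AND d.ROW_EXPIRATION_DATE IS NULL;

-- For a late arriving fact, the lookup must instead bracket the transaction date:
SELECT d.CSTMR_KEY
FROM DIM_CSTMR d
WHERE d.CSTMR_ID = :source_customer_id
  AND :transaction_date >= d.ROW_EFFECTIVE_DATE
  AND :transaction_date <  NVL(d.ROW_EXPIRATION_DATE, DATE '9999-12-31');
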
August 14, 2019

Using the COPY Command to Load CSV Files from S3 to an RDS Postgres Database via a KNIME Workflow

The Postgres COPY command is the most efficient way to load CSV data into a Postgres database. RDS presents a small challenge to the use of this functionality, since you cannot use a local filesystem in RDS. In this blog post I will walk you through the steps to connect your Postgres RDS database to an S3 filesystem and load CSV files into this database via a KNIME workflow. First, we need to add the aws_s3 extension to the Postgres database by executing the following command from the PG Admin tool: CREATE EXTENSION aws_s3 CASCADE; Next, we need to create […]
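
Once the extension is in place, the import itself runs through aws_s3.table_import_from_s3. A sketch with an illustrative bucket, path, and target table (only the function names come from the aws_s3 extension):

SELECT aws_s3.table_import_from_s3(
  'stg_sale',                    -- target table
  '',                            -- column list ('' means all columns)
  '(format csv, header true)',   -- standard COPY options
  aws_commons.create_s3_uri('my-bucket', 'exports/sale.csv', 'us-east-1')
);
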
August 5, 2019

Bulletproof Your KNIME Workflows with System Property Variables and External Config Files

KNIME workflows can contain a lot of code and, like all good programmers, I always want to eliminate as many code changes as possible. To accomplish this optimization, I rely on the use of table-driven processes, external property files, and system property variables. The use of these techniques can reduce code changes and ensure that workflows will work on different computers that are not set up identically. I’d like to begin by discussing table-driven solutions. For this type of solution, I will use the data in the table to make the change. This can be done by simply changing the value […]
July 15, 2019

Factless Fact Tables

Factless facts are those fact tables that have no measures associated with the transaction. A factless fact is a simple collection of dimensional keys which define a transaction or describe a condition for the time period of the fact. You may question the need for a factless fact, but they are important dimensional data structures that capture information which can be leveraged into rollup measures or presented to a user. The most common example used for factless facts is student attendance in a class. As you can see from the dimensional diagram below, FACT_ATTENDANCE is an amalgamation of […]
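
A minimal sketch of such a table and a rollup over it (FACT_ATTENDANCE is named in the post; the columns are my assumptions, not its diagram):

-- A factless fact: nothing but dimensional keys.
CREATE TABLE FACT_ATTENDANCE
( FA_STUDENT_KEY NUMBER NOT NULL
 ,FA_CLASS_KEY   NUMBER NOT NULL
 ,FA_DATE_KEY    NUMBER NOT NULL
);

-- The "measure" is derived by counting rows.
SELECT FA_CLASS_KEY, FA_DATE_KEY, COUNT(*) AS STUDENTS_PRESENT
FROM FACT_ATTENDANCE
GROUP BY FA_CLASS_KEY, FA_DATE_KEY;
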
June 20, 2019

Period-To-Date Facts

We have all come across times when our customer wants to know how the organization is currently doing. They often want to know how they are measuring up against this time last year or against a projected measure. The most common examples of these requests are a year-to-date calculation and a budget vs. actual analysis. In this blog post I will describe how to efficiently address these common business requests. Year-To-Date (YTD) Calculations Our customer has stated that they wish to show a YTD metric for sales which can be broken down into quarterly and monthly metrics as well. In […]
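
A hedged sketch of a YTD calculation over an illustrative sales fact, using a window function (all names assumed):

-- Running year-to-date sales by month.
SELECT d.CALENDAR_YEAR
      ,d.CALENDAR_MONTH          -- assumed numeric (1-12) so it sorts correctly
      ,SUM(f.SALE_AMOUNT) AS MONTH_SALES
      ,SUM(SUM(f.SALE_AMOUNT)) OVER (
         PARTITION BY d.CALENDAR_YEAR
         ORDER BY d.CALENDAR_MONTH) AS YTD_SALES
FROM FACT_SALE f
JOIN DIM_DATE d ON f.DATE_KEY = d.DATE_KEY
GROUP BY d.CALENDAR_YEAR, d.CALENDAR_MONTH;
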
May 28, 2019

Loading Accumulating Snapshot Fact Tables

Often management looks for bottlenecks in corporate processes so that they can be streamlined or used as a measurement of success for the organization. In order to achieve these goals, we need to measure the time between two or more related events. The easiest way to report on this time-series process is to use accumulating snapshot facts. Accumulating snapshot facts are updatable fact records used to measure the time between two or more related events. The most common example of this type of fact can be seen in order processing. Let’s take a look! Order processing consists of many serialized processes. […]
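
A sketch of the shape, under assumed names: one row per order, one date key per milestone, updated in place as each milestone occurs.

CREATE TABLE FACT_ORDER_PIPELINE
( FOP_ORDER_KEY        NUMBER NOT NULL
 ,FOP_ORDER_DATE_KEY   NUMBER NOT NULL
 ,FOP_SHIP_DATE_KEY    NUMBER    -- NULL until the order ships
 ,FOP_DELIVER_DATE_KEY NUMBER    -- NULL until delivery
 ,FOP_DAYS_TO_SHIP     NUMBER
);

-- When the ship event arrives, revisit and update the open row.
UPDATE FACT_ORDER_PIPELINE
SET FOP_SHIP_DATE_KEY = :ship_date_key
   ,FOP_DAYS_TO_SHIP  = :days_to_ship   -- lag computed upstream from the two dates
WHERE FOP_ORDER_KEY = :order_key;
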
April 24, 2019

Loading Transaction Fact Tables

This blog post will focus on loading transaction fact tables; subsequent posts for periodic and accumulating snapshots will follow in the coming weeks. Loading fact tables is very different from loading dimensions. First, we need to know the type of fact we are using. The major types of facts are transaction, periodic snapshot, accumulating snapshot, and time-span accumulating snapshot. We also need to know the grain of the fact, the dimensional keys which are associated with the measurement event. Let’s say we want to measure product sales by customer, product, and date attributes. The source transaction system may provide […]
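
The core of a transaction fact load is an INSERT...SELECT that swaps the source's natural keys for surrogate keys; a sketch with assumed names:

-- Look up each dimension's surrogate key, then land the measure.
INSERT INTO FACT_SALE (CSTMR_KEY, PRODUCT_KEY, DATE_KEY, SALE_AMOUNT)
SELECT c.CSTMR_KEY
      ,p.PRODUCT_KEY
      ,d.DATE_KEY
      ,s.SALE_AMOUNT
FROM STG_SALE s
JOIN DIM_CSTMR   c ON c.CSTMR_ID      = s.CUSTOMER_ID
JOIN DIM_PRODUCT p ON p.PRODUCT_ID    = s.PRODUCT_ID
JOIN DIM_DATE    d ON d.CALENDAR_DATE = s.SALE_DATE;
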
March 20, 2019

Updating Type II Slowly Changing Dimensions

In this blog post I will provide an example of how I implement Type II dimensions in my data warehouses. The process I go through when refreshing data into a Type II dimension is to first land the data from the source system. I always create a separate schema to hold the data as it is presented from the source system, with no transformations. The tables and columns are named exactly as they are in the source system. The basic idea here is to quickly off-load the data from the source system, thereby eliminating any long-running or resource-intensive […]
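
Downstream of the landing step, a Type II refresh usually ends in an expire-and-insert pair; a minimal sketch under assumed names (LND_CUSTOMER standing in for a landed source table):

-- Expire the current row for any customer whose tracked attribute changed.
UPDATE DIM_CSTMR d
SET d.ROW_EXPIRATION_DATE = TRUNC(SYSDATE) - 1
WHERE d.ROW_EXPIRATION_DATE IS NULL
  AND EXISTS (SELECT 1 FROM LND_CUSTOMER l
              WHERE l.CUSTOMER_ID = d.CSTMR_ID
                AND l.CUSTOMER_CITY <> d.CSTMR_CITY);

-- Insert a fresh active row for new customers and for the just-expired ones.
INSERT INTO DIM_CSTMR (CSTMR_KEY, CSTMR_ID, CSTMR_CITY, ROW_EFFECTIVE_DATE)
SELECT DIM_CSTMR_SEQ.NEXTVAL, l.CUSTOMER_ID, l.CUSTOMER_CITY, TRUNC(SYSDATE)
FROM LND_CUSTOMER l
WHERE NOT EXISTS (SELECT 1 FROM DIM_CSTMR d
                  WHERE d.CSTMR_ID = l.CUSTOMER_ID
                    AND d.ROW_EXPIRATION_DATE IS NULL);
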
February 20, 2019

Joining Fact Tables

Joining fact tables can be done, but there are some inherent risks, so you need to be careful when joining fact tables is required. In the following simple scenario, we have a fact of authors to articles and a separate fact of articles to pageviews. Our customer has asked for the ability to 1) find the authors who have provided the most content in a given time period, 2) find the articles which have the greatest number of pageviews for a given time period, and 3) find the authors with the highest number of pageviews in a given time period. Our […]
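
The safe pattern, sketched here with assumed names, is to aggregate each fact to the shared grain first and only then join the two result sets (drilling across), rather than joining raw fact rows and double-counting:

SELECT a.ARTICLE_KEY, a.AUTHOR_COUNT, p.PAGEVIEW_COUNT
FROM (SELECT ARTICLE_KEY, COUNT(*) AS AUTHOR_COUNT
      FROM FACT_AUTHOR_ARTICLE
      GROUP BY ARTICLE_KEY) a
JOIN (SELECT ARTICLE_KEY, SUM(PAGEVIEWS) AS PAGEVIEW_COUNT
      FROM FACT_ARTICLE_PAGEVIEW
      GROUP BY ARTICLE_KEY) p
  ON a.ARTICLE_KEY = p.ARTICLE_KEY;
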
January 23, 2019

Types of Data Models

A conceptual model is a representation of a system, made of the composition of concepts which are used to help people know, understand, or simulate the subject the model represents.[i] In dimensional modeling this is the most abstract level of modeling. Here we are only worried about the entities (dimensions) and their relationships to each other (facts). No attributes are required in the conceptual model, as we are trying to work with the customer, who is not an expert in databases or data warehousing but does understand how the entities are related in their business. The conceptual model will help us […]
July 28, 2017

Data Warehouse Design Techniques – Constraints and Indexes

In this week’s blog, we will discuss constraints and indexes. In data warehousing, like in life, constraints are things we love to hate. Constraints keep us from making mistakes, which in most cases is a good thing, until we come across an anomaly which needs to be addressed but the constraints prevent this from happening. Most of the time indexes help us to find data faster, but that is not always the case. What are Indexes? Indexes are data structures which hold field values from the indexed column(s) and pointers to the related record(s). This data structure is then […]
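
Two quick, Oracle-style examples of the kinds of indexes a warehouse tends to use (all names illustrative):

-- A bitmap index on a low-cardinality fact foreign key;
-- bitmap indexes suit read-mostly warehouse tables.
CREATE BITMAP INDEX FACT_SALE_DATE_IX ON FACT_SALE (DATE_KEY);

-- A plain B-tree index on a dimension's natural key, used during ETL lookups.
CREATE INDEX DIM_CSTMR_ID_IX ON DIM_CSTMR (CSTMR_ID);
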
July 5, 2017

Data Warehouse Design Techniques – Aggregates

In this week’s blog, we will discuss how to optimize the performance of your data warehouse by using aggregates. What are Aggregates? Aggregates are the summarization of fact-related data for the purpose of improved performance. There are many occasions when the customer wants to be able to quickly answer a question where the data is at a higher grain than we collect. To avoid slow responses due to the system summing data up to higher levels, we can pre-calculate these requests, storing the summarized data in a separate star. Aggregates can be considered conformed fact tables since they […]
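
A sketch of building such an aggregate as a summary table at the monthly grain (names assumed):

-- Pre-summed monthly sales, stored at a higher grain than the base fact.
CREATE TABLE AGG_SALE_MONTH AS
SELECT d.CALENDAR_YEAR
      ,d.CALENDAR_MONTH
      ,f.PRODUCT_KEY
      ,SUM(f.SALE_AMOUNT) AS SALE_AMOUNT
FROM FACT_SALE f
JOIN DIM_DATE d ON f.DATE_KEY = d.DATE_KEY
GROUP BY d.CALENDAR_YEAR, d.CALENDAR_MONTH, f.PRODUCT_KEY;
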