January 15, 2020

iTunes Library Cleanup: XML and String Distances in KNIME

I have been a full-time telecommuter over 18 years so when I am not on a conference call, I am usually listening to music while I work. Over this span of my career I have accumulated a large music library which I manage in iTunes and stream to various devices around my house. As my collection has grown, I’ve tried to be organized and diligent about keeping the meta-data of my music cleaned up. But it has inevitably gotten a messy over time, particularly since my children have hit their teenage years and started to add their own music to […]
January 8, 2020

What Is an Operational Data Store?

The definition of an Operational Data Store (ODS) is a subject-oriented, integrated, volatile collection of current data designed to support the tactical decision-making of an organization. An ODS can also be used as a data hub, used to synchronize data between disparate and unconnected applications within the enterprise. Let’s breakdown the definition so we can fully understand the ODS. First, we state that an ODS is subject oriented therefore it should be centered around a single business concept, such as a customer or product. Next, we state that an ODS is integrated. Here we are stating that we can pull […]
January 2, 2020

Legos for Grownups: Rationale for a Low-Code Environment

I believe that parents fall into two categories during the holiday season. Those who are buying Legos by the pound with the belief they are inspiring their children to be imaginative and develop engineering skills, and those who have banned Legos because the piles from previous years are taking up more space than bedroom furniture. The genius of Legos is undeniable. The most complex creations can be built by just about anyone, and since the process of creation is so satisfying everyone gets hooked. I am sure I went through a period when all I wanted to do “when I […]
December 26, 2019

Data Refresh Techniques

The three keys to warehouse project success are to quickly, efficiently and consistently refresh the data in the warehouse. Here are four ways to design the process with some being generally more successful than others. Refresh the Dimensions, then the Facts This approach attempts to make sure that the process is completed with minimal negative impact should an error arise in the ETL workflow. We accomplish this by inserting/updating (upserting) the Dimension Data, then move on to the Fact Data. Our reasoning is that modifying dimensions has no impact on the reports. This isolates the fact loading workflow as the […]
December 12, 2019

Let It Snow: Generating Fractal Snowflakes with KNIME

Continuing the holiday theme of my previous blog entry, today I decided to write a quick post on how to generate fractal snowflakes using KNIME, the industry leading open source data science platform.   While I’d be comfortable coding this in any number of programming languages, the very nature of KNIME’s desktop application makes it simple to rapidly explore and visualize data in a way that lends itself to fun mathematical explorations for even non-programmers.   However, as with many data science tasks, in the midst of my explorations today I found myself confronted with anomalous results that I initially couldn’t explain.  […]
November 26, 2019

Drilling Down / Drilling Through (Across)

You often hear “drilling down” and “drilling across” from business intelligence professionals, but what does it mean and how do you, as a data modeler, design for this requirement? This blog post will provide you with insights on how to do this. To drill down means to decompose fact data via hierarchical dimensional attributes. Time and geography (location) are good examples of hierarchical dimensions. In a complete time dimension, you will have the day, month, quarter and year hierarchy. In a geography dimension you may have a city, district, state, and region hierarchy. Consider your shown a dashboard with year […]
November 25, 2019

Baking an Approximate Pi with KNIME Using a Monte Carlo Recipe

With the holiday season in the United States rapidly approaching, I thought I’d have some fun with this blog posting.  So rather than giving you a typical blog entry highlighting some specific cool pieces of the awesome open-source KNIME end to end data science platform, today I am going to teach you how to bake an Approximate Pi with KNIME using a Monte Carlo recipe.   And along the way I will show you some interesting cooking tools and techniques to add to your KNIME kitchen. What is an Approximate Pi you ask?  It is simply an estimate of the irrational […]
November 18, 2019

Load Multiple json Files into a Database Table Using KNIME

I was recently tasked with loading a lot of historical json files into a database. I wanted to use KNIME to create a repeatable workflow to address this issue since this would be need to be an ongoing process after the initial loading of data was complete. Here is the workflow I created (Figure 1). Let’s take a look at the component steps for a better understanding of each node and why they are used. I started by reading a list of json files contained in a specified directory (Figure 2.). It is obvious that this is needed for the […]
November 13, 2019

Qlik Core: The Next Generation of Self-Serve Analytics

Since its inception, we have all watched Qlik Sense grow. From desktop to cloud, from custom extensions to embedded visualizations to mashups using Enigma.js. Qlik Core is the next phase of this growth, and it is exactly what developers ordered. There’s no debating that Qlik Sense Desktop is an incredibly powerful tool, giving business analysts and decision makers everything they could want: easy ETL, data governance, security, self-service analytics, and more. Over the most recent years, Qlik Sense has upped the ante with new features, unlocking powerful new capabilities while maintaining the ease-of-use that makes Qlik so appealing. But. We’ve […]
November 6, 2019

Leveraging Athena with KNIME in a Robust Manner, Part 2

In my previous blog posting, I introduced an issue we were having with seemingly random intermittent failures using Amazon Web Services’ Athena backed by a large number of data files in S3.   The issue was arising because S3 is eventually consistent and occasionally queries were being executed before their underlying data files were fully materialized in S3. Our solution was to introduce try/catch Knime nodes with a loop to retry failed queries a few times in case of intermittent failures.   To do this we had to do our own flow variable resolution in the Athena SQL queries since the standard […]
October 30, 2019

Dynamic Looping in KNIME

I recently came across a situation where I needed to load a couple of table partitions. I wanted to automate this process and have a table drive the process. To accomplish this task, I stated by creating a table of partitions that would drive the dynamic refresh process. My requirement was to be able to refresh the current month and the previous month’s data. (see the DDL for the partitioned fact table below) CREATE TABLE FACT_BOOK_SALE ( FBS_BOOK_KEY NUMBER NOT NULL ENABLE ,FBS_CSTMR_KEY NUMBER NOT NULL ENABLE ,FBS_STORE_KEY NUMBER NOT NULL ENABLE ,FBS_DATE_KEY NUMBER NOT NULL ENABLE ,FBS_BOOK_ISBN_13 NUMBER ,FBS_CSTMR_ID […]
October 24, 2019

Leveraging Athena with KNIME in a Robust Manner, Part One

Recently, we began experiencing seemingly random intermittent failures in one of our modeling workflows used in VANE, an advanced predictive analytics project we are building for the US Army.  These failures were occurring with varying frequency inside of any one of several Database SQL Executor nodes.  These nodes were performing a large number of SQL queries against Amazon Web Services’ Athena backed by a large number of data files in S3.   The workflow was randomly failing in any of these nodes with a HIVE SPLIT error due to a missing partition file in S3.  When we investigated the failure, we […]
October 23, 2019

Updating / Replacing Fact Records

You will inevitably come across a time when a source system will allow the updating of transaction records. This drives data modelers, ETL developers, analytics designers, and analytics users crazy. After all the source system stated that this information was a “fact”, leading all to believe it would not change, now you must design for and resolve the changing fact data. In this blog post we will discuss the options and the impact on these decisions will have on your warehouse and BI solution. Let’s start with the easy and move to the complex. The easiest thing to do when […]
October 18, 2019

Hierarchies in Qlik Sense

Understanding how Qlik’s Associative Engine handles hierarchies will improve your ability to render meaningful and flexible rollups that users can wield to draw insights at all levels of granularity. This article covers how Qlik hierarchies do not follow the same conventions as a normal database (SQL) and discusses how hierarchy and ancestry load functions can be used in tandem to allow easy access to data at any granularity. Database Hierarchies Perhaps the most common way to store hierarchies in databases is through a Parent-Child Table. In such a table, there exists a row for each node in the hierarchy containing […]
October 9, 2019

AngularJS Basics: $scope

AngularJS, for the uninitiated, is an open-source front-end framework that is maintained by Google and an active community. It has been around for some time, but the learning curve can be pretty steep as AngularJS encompasses not one language, but two; HTML and Javascript. AngularJS puts the two of these together and exponentially increases your front-end capabilities as opposed to plain old HTML. If you’re familiar with HTML then that half of the equation is rather straightforward, but the Javascript is where things will take a turn with the inclusion of “$scope”. Scope as a concept typically defines the level […]
September 26, 2019

Optimizing Qlik Load-Time

I have recently been brought on to a project involving a Qlik implementation for a client. Among the requirements for the project, the client needs to have the data refreshed every hour. As with many Qlik implementations, the Qlik segment is the final step in the ETL process with the data arriving to an Oracle DB first. By the time the ETL arrived to Qlik, only 30 minutes remained of the allotted hour. Here is the process for optimization that I followed. Status on arrival: The number of tasks in the QMC were about 40. There were about 30 apps […]
September 25, 2019

Late Arriving Dimensions

In my last post I discussed late arriving facts, and although it is slightly annoying to receive delayed transaction data, the impact on the ETL and accuracy of the data in data warehouse is minimal. As we will discuss in this blog post, this is not the case for late arriving dimension (early arriving fact) data. As I have discussed throughout this blog series on designing and loading dimensional data warehouses, we always load data in the following order: lookup tables, then dimensions, and finally facts. But what happens when you get fact data before the dimension data arrives or […]
September 18, 2019


Yesterday I had to import an XML file into a series of relational database tables. I started by thinking of the best way to import this data. I could write a Python parser by hand and load the data that way, but I am a lazy programmer. I’m looking for something that is quick to deploy and easy to maintain. That is when I thought of my trusty, opensource ETL tool, KNIME. For this example, I will be using an XML version of the Shakespeare play Hamlet. Everything started off fine. I grabbed the XML Reader node and placed it […]
September 4, 2019

Plotly.js: Getting Accurate Coordinates from Click Event

Across StackExchange and the Plotly.js community forums, there seems to be one frequently asked question that has not yet been answered: how can you get accurate coordinates from a click event on a Plotly chart? Take your standard scatter plot: If I hover my mouse over (4, 14), and click, the current Plotly.js onClick event data returns data for each trace at that same x value. So, you would get (4, 17) for trace 0, (4, 11) for trace 1, and (4, 12) for trace 2. This is great if I’m trying to find out information about the traces, […]
August 21, 2019

Late Arriving Facts

Late arriving facts are those transactions which are delayed in arriving to the warehouse. For Type I data warehouses there is no real impact but for Type II warehouse models there is a minor challenge that the ETL developer must remember to handle to ensure the late arriving transaction corresponds with dimensional attributes at the time of the transaction. If we have a Type I warehouse or are always assured that we only receive fact data at the same time as the dimensional data then we can use the following simple query where we find the current active record for […]