January 15, 2020

iTunes Library Cleanup: XML and String Distances in KNIME

I have been a full-time telecommuter for over 18 years, so when I am not on a conference call, I am usually listening to music while I work. Over this span of my career I have accumulated a large music library which I manage in iTunes and stream to various devices around my house. As my collection has grown, I’ve tried to be organized and diligent about keeping the meta-data of my music cleaned up. But it has inevitably gotten messy over time, particularly since my children have hit their teenage years and started to add their own music to […]
January 2, 2020

Legos for Grownups: Rationale for a Low-Code Environment

I believe that parents fall into two categories during the holiday season: those who are buying Legos by the pound with the belief they are inspiring their children to be imaginative and develop engineering skills, and those who have banned Legos because the piles from previous years are taking up more space than bedroom furniture. The genius of Legos is undeniable. The most complex creations can be built by just about anyone, and since the process of creation is so satisfying everyone gets hooked. I am sure I went through a period when all I wanted to do “when I […]
December 12, 2019

Let It Snow: Generating Fractal Snowflakes with KNIME

Continuing the holiday theme of my previous blog entry, today I decided to write a quick post on how to generate fractal snowflakes using KNIME, the industry-leading open source data science platform. While I’d be comfortable coding this in any number of programming languages, the very nature of KNIME’s desktop application makes it simple to rapidly explore and visualize data in a way that lends itself to fun mathematical explorations for even non-programmers. However, as with many data science tasks, in the midst of my explorations today I found myself confronted with anomalous results that I initially couldn’t explain. […]
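The classic fractal snowflake is the Koch snowflake, which can be generated with a short recursion. A minimal Python sketch of the idea (the post itself builds this in KNIME; the function names here are illustrative):

```python
# Koch snowflake vertices via complex-number arithmetic.
# Each line segment is replaced by four shorter segments with an
# equilateral "bump" in the middle; repeating this yields the fractal.
import math

def koch_segment(a, b, depth):
    """Return the points replacing segment a->b, excluding the endpoint b."""
    if depth == 0:
        return [a]
    ab = b - a
    p1 = a + ab / 3                       # one third along the segment
    p2 = a + ab * 2 / 3                   # two thirds along the segment
    # Peak of the bump: rotate the middle third by 60 degrees.
    rot60 = complex(math.cos(math.pi / 3), math.sin(math.pi / 3))
    peak = p1 + (p2 - p1) * rot60
    pts = []
    for s, e in ((a, p1), (p1, peak), (peak, p2), (p2, b)):
        pts.extend(koch_segment(s, e, depth - 1))
    return pts

def koch_snowflake(depth):
    """Closed list of (x, y) vertices for a snowflake of the given depth."""
    tri = [complex(0, 0), complex(1, 0), complex(0.5, -math.sqrt(3) / 2)]
    pts = []
    for a, b in zip(tri, tri[1:] + tri[:1]):
        pts.extend(koch_segment(a, b, depth))
    return [(p.real, p.imag) for p in pts]
```

Each iteration turns every segment into four, so a depth-n snowflake has 3 × 4ⁿ vertices; plotting the point list as a closed polygon draws the snowflake.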
November 25, 2019

Baking an Approximate Pi with KNIME Using a Monte Carlo Recipe

With the holiday season in the United States rapidly approaching, I thought I’d have some fun with this blog posting. So rather than giving you a typical blog entry highlighting some specific cool pieces of the awesome open-source KNIME end-to-end data science platform, today I am going to teach you how to bake an Approximate Pi with KNIME using a Monte Carlo recipe. And along the way I will show you some interesting cooking tools and techniques to add to your KNIME kitchen. What is an Approximate Pi, you ask? It is simply an estimate of the irrational […]
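The Monte Carlo recipe itself is short: sample random points in the unit square and count how many fall inside the quarter circle of radius 1; that fraction approaches π/4. A minimal Python sketch (the post builds the same idea in KNIME):

```python
# Estimate pi by random sampling: the fraction of points in the unit
# square that land inside the quarter circle of radius 1 approaches pi/4.
import random

def approximate_pi(samples, seed=0):
    rng = random.Random(seed)            # fixed seed for reproducibility
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:         # point is inside the quarter circle
            inside += 1
    return 4.0 * inside / samples
```

With 100,000 samples the estimate typically lands within a few hundredths of π; accuracy improves only with the square root of the sample count, which is part of the fun of the recipe.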
November 18, 2019

Load Multiple JSON Files into a Database Table Using KNIME

I was recently tasked with loading a lot of historical JSON files into a database. I wanted to use KNIME to create a repeatable workflow to address this issue since this would need to be an ongoing process after the initial loading of data was complete. Here is the workflow I created (Figure 1). Let’s take a look at the component steps for a better understanding of each node and why they are used. I started by reading a list of JSON files contained in a specified directory (Figure 2). It is obvious that this is needed for the […]
November 6, 2019

Leveraging Athena with KNIME in a Robust Manner, Part 2

In my previous blog posting, I introduced an issue we were having with seemingly random intermittent failures using Amazon Web Services’ Athena backed by a large number of data files in S3. The issue was arising because S3 is eventually consistent, and occasionally queries were being executed before their underlying data files were fully materialized in S3. Our solution was to introduce try/catch KNIME nodes with a loop to retry failed queries a few times in case of intermittent failures. To do this we had to do our own flow variable resolution in the Athena SQL queries since the standard […]
October 30, 2019

Dynamic Looping in KNIME

I recently came across a situation where I needed to load a couple of table partitions. I wanted to automate this process and have a table drive it. To accomplish this task, I started by creating a table of partitions that would drive the dynamic refresh process. My requirement was to be able to refresh the current month’s and the previous month’s data. (See the DDL for the partitioned fact table below.) CREATE TABLE FACT_BOOK_SALE ( FBS_BOOK_KEY NUMBER NOT NULL ENABLE ,FBS_CSTMR_KEY NUMBER NOT NULL ENABLE ,FBS_STORE_KEY NUMBER NOT NULL ENABLE ,FBS_DATE_KEY NUMBER NOT NULL ENABLE ,FBS_BOOK_ISBN_13 NUMBER ,FBS_CSTMR_ID […]
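The driving logic — "refresh the current month and the previous month" — boils down to computing two month keys from today's date. A small Python sketch of that calculation (the YYYYMM key format is an assumption, not taken from the original table):

```python
# Compute the month keys for the current and previous month,
# the values that would drive a dynamic partition-refresh loop.
from datetime import date, timedelta

def months_to_refresh(today):
    first_of_month = today.replace(day=1)
    last_of_prev = first_of_month - timedelta(days=1)   # handles January wrap
    return [last_of_prev.strftime("%Y%m"), first_of_month.strftime("%Y%m")]
```

Subtracting one day from the first of the month is a simple way to land in the previous month without special-casing the year boundary.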
October 24, 2019

Leveraging Athena with KNIME in a Robust Manner, Part One

Recently, we began experiencing seemingly random intermittent failures in one of our modeling workflows used in VANE, an advanced predictive analytics project we are building for the US Army.  These failures were occurring with varying frequency inside of any one of several Database SQL Executor nodes.  These nodes were performing a large number of SQL queries against Amazon Web Services’ Athena backed by a large number of data files in S3.   The workflow was randomly failing in any of these nodes with a HIVE SPLIT error due to a missing partition file in S3.  When we investigated the failure, we […]
September 18, 2019


Yesterday I had to import an XML file into a series of relational database tables. I started by thinking of the best way to import this data. I could write a Python parser by hand and load the data that way, but I am a lazy programmer. I’m looking for something that is quick to deploy and easy to maintain. That is when I thought of my trusty, open-source ETL tool, KNIME. For this example, I will be using an XML version of the Shakespeare play Hamlet. Everything started off fine. I grabbed the XML Reader node and placed it […]
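For comparison, the "Python parser by hand" that the post avoids is not long: the standard library can flatten a Hamlet-style XML document into relational rows. This is a hedged sketch; the SPEECH/SPEAKER/LINE tag names follow the common XML edition of Shakespeare's plays and may differ from the file the post actually used:

```python
# Flatten speaker/line pairs from a play's XML into (speaker, line) rows,
# ready to bulk-insert into a relational table.
import xml.etree.ElementTree as ET

def speeches_to_rows(xml_text):
    root = ET.fromstring(xml_text)
    rows = []
    for speech in root.iter("SPEECH"):
        speaker = speech.findtext("SPEAKER", default="")
        for line in speech.iter("LINE"):
            rows.append((speaker, line.text or ""))
    return rows
```

The appeal of the KNIME XML Reader node is that this parsing, plus the database load, becomes configuration rather than code to maintain.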
August 14, 2019

Using the COPY Command to Load CSV Files from S3 to an RDS Postgres Database via a KNIME Workflow

The Postgres COPY command is the most efficient way to load CSV data into a Postgres database. RDS presents a small challenge to the use of this functionality since you cannot use a local filesystem in RDS. In this blog post I will walk you through the steps to connect your Postgres RDS database to an S3 filesystem and load CSV files to this database via a KNIME workflow. First, we need to add the aws_s3 extension to the Postgres database by executing the following command from the pgAdmin tool: CREATE EXTENSION aws_s3 CASCADE; Next, we need to Create […]
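Once the extension is installed, the load itself is a single SQL call to the extension's `aws_s3.table_import_from_s3` function, with the S3 location built by `aws_commons.create_s3_uri`. A rough sketch that composes the statement (the table, bucket, key, and region values are placeholders, and the CSV options shown are an assumption about the file format):

```python
# Compose the aws_s3 import statement that a KNIME Database SQL Executor
# node (or any Postgres client) would run against the RDS instance.
def s3_copy_sql(table, bucket, key, region="us-east-1"):
    return (
        f"SELECT aws_s3.table_import_from_s3("
        f"'{table}', '', '(format csv, header true)', "
        f"aws_commons.create_s3_uri('{bucket}', '{key}', '{region}'))"
    )
```

In a production workflow the values would come from flow variables and be properly escaped rather than interpolated as shown here.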
August 7, 2019

URL Tips and Tricks

URL manipulation is one of those things you may not even think about until it becomes the biggest headache of your week. Thankfully KNIME has some things we can use to make life easier. Here are some of the common situations you could encounter when working with URLs in KNIME and how to handle them. The Workflow Needs to Run in Any System’s KNIME Directory Perhaps the most common issue you’ll run into: most of the time you won’t want a KNIME workflow to run only on your machine or server. You still want it to use a specific […]
August 6, 2019

Batch Execution of KNIME Workflows, Part 2

In an earlier blog posting I walked through the steps required to install KNIME in a Docker container so that it could be used to run workflows in a batch mode.  This is a very useful technique for automating workflow execution and leveraging cloud resources to do so.  However, most workflows are not self-contained: they need access to configuration, external data, and local storage.   I did not cover those aspects in my first posting so this blog entry will introduce ways to pass in configuration and support the batch execution of KNIME workflows in a container. Setting Flow Variables The […]
August 5, 2019

Bulletproof Your KNIME Workflows with System Property Variables and External Config Files

KNIME workflows can contain a lot of code and, like all good programmers, I always want to eliminate as many code changes as possible. To accomplish this optimization, I rely on the use of table-driven processes, external property files, and system property variables. The use of these techniques can reduce code changes and ensure that workflows will work on different computers that are not set up identically. I’d like to begin by discussing table-driven solutions. For this type of solution, I will use the data in the table to make the change. This can be done by simply changing the value […]
August 1, 2019

Automated Execution of Multiple KNIME Workflows

When using the open source KNIME Analytics Platform to build sophisticated data processing and analysis pipelines, I often find myself building workflows that consist of so many nodes they become difficult to manage.  This complexity can be managed by grouping nodes into processing stages and then bundling those stages into meta-nodes so that the overall workflow layout is easier to follow. However, I’ve found that this approach still leaves workflows unwieldy to work with as you still have to open the meta-nodes to explore and troubleshoot their processing.  Over the years I’ve worked with KNIME, I’ve developed a habit of […]
July 31, 2019

Serverless Analysis of Data in Amazon S3 Using Amazon Athena through KNIME

This blog describes how to perform serverless analysis of data in Amazon S3 using Amazon Athena through KNIME. Let’s start with a quick introduction to Amazon Athena. What is Athena? Athena is a serverless query service for querying data in S3 using standard SQL, with no infrastructure to manage. It supports ANSI SQL queries with support for joins, JSON, and window functions. How do you connect to Athena and execute SQL from KNIME? KNIME can interface with Athena using the following nodes. How do you analyze data in S3 using Athena through KNIME? A traditional approach is to download entire files from […]
July 30, 2019

Batch Execution of KNIME Workflows Using Docker

In my previous blog posting I introduced our service management platform Dex, which is at the core of many of our advanced analytics solutions. Dex seamlessly manages the coordination of the data acquisition, ingestion, transformation and modeling for our system via processing steps that are often deployed as Docker containers for execution on either dedicated servers, EC2 instances, or as AWS Fargate tasks. One of the core applications we leverage for many of these steps is the open source KNIME analytics platform. KNIME’s easy-to-use GUI allows us to quickly develop and test ETL and modeling workflows. Once […]
July 29, 2019

Using the Python Node in KNIME to Perform Granger Causality Tests

KNIME includes a wide variety of data manipulation and data science nodes, but there are a few cases where you may need to embed specific Python routines inside a workflow. Or, more often, I test algorithms in Python but do not want to put the effort into orchestrating the data access, iterations, and parallel processing in Python. KNIME has several Python nodes, but first you will need to set up your Python environments. Setting up Python Environments KNIME requires you to have Python 2 and Python 3 environments running locally. If you have Anaconda 3.x, you will need to create […]