August 14, 2019

Using the COPY Command to Load CSV Files from S3 to an RDS Postgres Database via a KNIME Workflow

The Postgres COPY command is the most efficient way to load CSV data into a Postgres database. RDS poses a small challenge here, since you cannot use a local filesystem in RDS. In this blog post I will walk you through the steps to connect your Postgres RDS database to an S3 filesystem and load CSV files into this database via a KNIME workflow. First, we need to add the aws_s3 extension to the Postgres database by executing the following command from the pgAdmin tool: CREATE EXTENSION aws_s3 CASCADE; Next, we need to Create […]
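The excerpt stops before the import step itself. As a rough sketch of what such a step can look like, the Python below only builds the SQL text for the aws_s3 extension's table_import_from_s3 function, which could then be pasted into a database node or pgAdmin. The table, bucket, key, and region values are hypothetical placeholders, and the helper names are illustrative; this is not necessarily how the full post does it.

```python
def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_s3_import_sql(table: str, bucket: str, key: str, region: str,
                        columns: str = "",
                        options: str = "(format csv, header true)") -> str:
    """Build a call to aws_s3.table_import_from_s3 for one CSV object in S3.

    An empty column list means "all columns, in table order".
    """
    return (
        "SELECT aws_s3.table_import_from_s3(\n"
        f"    {quote_literal(table)}, {quote_literal(columns)}, {quote_literal(options)},\n"
        f"    aws_commons.create_s3_uri({quote_literal(bucket)}, "
        f"{quote_literal(key)}, {quote_literal(region)})\n"
        ");"
    )


# Hypothetical names, for illustration only.
example_sql = build_s3_import_sql("staging_orders", "my-example-bucket",
                                  "incoming/orders.csv", "us-east-1")
print(example_sql)
```

The RDS instance also needs an IAM role that allows it to read the bucket; that setup is outside this sketch.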
August 7, 2019

URL Tips and Tricks

URL manipulation is one of those things you may not even think about until it becomes the biggest headache of your week. Thankfully, KNIME has some things we can use to make life easier. Here are some of the common situations you could encounter when working with URLs in KNIME and how to handle them. The Workflow Needs to Run in Any System’s KNIME Directory This is perhaps the most common issue you’ll run into: most of the time you won’t want a KNIME workflow to run only on your machine or server. You still want it to use a specific […]
August 6, 2019

Batch Execution of KNIME Workflows, Part 2

In an earlier blog posting I walked through the steps required to install KNIME in a Docker container so that it could be used to run workflows in batch mode. This is a very useful technique for automating workflow execution and leveraging cloud resources to do so. However, most workflows are not self-contained: they need access to configuration, external data, and local storage. I did not cover those aspects in my first posting, so this blog entry will introduce ways to pass in configuration and support the batch execution of KNIME workflows in a container. Setting Flow Variables The […]
August 5, 2019

Bulletproof Your KNIME Workflows with System Property Variables and External Config Files

KNIME workflows can contain a lot of code and, like all good programmers, I always want to eliminate as many code changes as possible. To accomplish this optimization, I rely on table-driven processes, external property files and system property variables. These techniques can reduce code changes and ensure that workflows will work on different computers that are not set up identically. I’d like to begin by discussing table-driven solutions. For this type of solution, I will use the data in the table to make the change. This can be done by simply changing the value […]
August 1, 2019

Automated Execution of Multiple KNIME Workflows

When using the open source KNIME Analytics Platform to build sophisticated data processing and analysis pipelines, I often find myself building workflows that consist of so many nodes they become difficult to manage.  This complexity can be managed by grouping nodes into processing stages and then bundling those stages into meta-nodes so that the overall workflow layout is easier to follow. However, I’ve found that this approach still leaves workflows unwieldy to work with as you still have to open the meta-nodes to explore and troubleshoot their processing.  Over the years I’ve worked with KNIME, I’ve developed a habit of […]
July 31, 2019

Serverless Analysis of data in Amazon S3 using Amazon Athena through KNIME.

This blog describes how to perform serverless analysis of data in Amazon S3 using Amazon Athena through KNIME. Let’s start with a quick introduction to Amazon Athena. What is Athena? Athena is a serverless query service for querying data in S3 using standard SQL, with no infrastructure to manage. It supports ANSI SQL queries with support for joins, JSON and window functions. How to connect Athena and execute SQL from KNIME? KNIME can interface with Athena using the following nodes. How to do Analysis of data in S3 using Athena through KNIME? A traditional approach is to download the entire files from […]
July 30, 2019

Batch Execution of KNIME Workflows Using Docker

In my previous blog posting I introduced our service management platform Dex, which is at the core of many of our advanced analytics solutions. Dex seamlessly manages the coordination of the data acquisition, ingestion, transformation and modeling for our system via processing steps that are often deployed as Docker containers for execution on either dedicated servers, EC2 instances, or as AWS Fargate tasks. One of the core applications we leverage for many of these steps is the open source KNIME Analytics Platform. KNIME’s easy-to-use GUI allows us to quickly develop and test ETL and modeling workflows. Once […]
July 29, 2019

Using the Python Node in KNIME to Perform Granger Causality Tests

KNIME includes a wide variety of data manipulation and data science nodes, but there are a few cases where you may need to embed specific Python routines inside a workflow. Or, more often, I test algorithms in Python but do not want to put the effort into orchestrating the data access, iterations, and parallel processing in Python. KNIME has several Python nodes, but first you will need to set up your Python environments. Setting up Python Environments KNIME requires you to have Python 2 and Python 3 environments running locally. If you have Anaconda 3.x, you will need to create […]
July 22, 2019

5 Tips and Workarounds for Angular/JS Extension Development

This article provides a few small tips and tricks to ease some of the headaches of developing Angular/JS extensions for Qlik. 1. When using jQuery, make sure you are utilizing unique selectors for a given instance of the extension. Multiple instances of an extension can exist on a single sheet. If using jQuery, it is easy to select (and accidentally edit) all instances of an extension on a sheet. To prevent this, each instance of the extension should have its own id. More advanced users may want to consider using the objectId that Qlik assigns to visualizations, but those can […]
July 15, 2019

Factless Fact Tables

Factless facts are those fact tables that have no measures associated with the transaction. Factless facts are a simple collection of dimensional keys which define the transaction or describe the condition for the time period of the fact. You may question the need for a factless fact, but they are important dimensional data structures that capture information which can be leveraged into rollup measures or presented to a user. The most common example used for factless facts is student attendance in a class. As you can see from the dimensional diagram below the FACT_ATTENDANCE is an amalgamation of […]
July 11, 2019

PostgreSQL – Why You’d Want to Use It

If you’ve been in this industry for a few years, then you probably know what SQL is. It’s the gold standard for working with databases and nearly every modern coding language interfaces with it. Though simple, it’s very flexible, and variants of it such as Oracle SQL and MySQL allow for some more robust functionality. PGSQL, or PostgreSQL, is another one of these variants, but it takes things one step further and even one step after that. Let’s start with what exactly PGSQL is and is not. PGSQL is a variant of SQL and uses most of the same syntax. […]
July 1, 2019

Structured Query Language Basics

Structured Query Language (SQL) is the standard language used for retrieving and manipulating database information. In today’s blog, I will be focusing on how we can use this language to simply get an organized result set back that answers a specific question. This will be accomplished by using the Select, From, Where, Group By, Having, and Order By clauses. We will be using the EMPLOYEES table in the HR schema from Oracle Live as our practice table. You can find the structure of the table in the image below. This can be accessed for free. All you need is an Oracle […]
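To make the six clauses concrete, here is a minimal sketch that runs one query using all of them. It uses Python's built-in sqlite3 module rather than Oracle, and the table holds only a few made-up rows with a subset of the EMPLOYEES columns, so both the data and the question being asked are assumptions for illustration.

```python
import sqlite3

# In-memory stand-in for the practice table; rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER, last_name TEXT, department_id INTEGER, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Adams", 10, 4000),
        (2, "Baker", 10, 6000),
        (3, "Clark", 20, 9000),
        (4, "Davis", 20, 11000),
        (5, "Evans", 30, 2500),
        (6, "Frank", 30, 15000),
    ],
)

# Question: among employees earning more than 3000, which departments have
# at least two such employees, and what is their average salary?
query = """
    SELECT department_id, COUNT(*) AS headcount, AVG(salary) AS avg_salary
    FROM employees
    WHERE salary > 3000            -- filter rows before grouping
    GROUP BY department_id         -- one output row per department
    HAVING COUNT(*) >= 2           -- filter groups after aggregation
    ORDER BY avg_salary DESC       -- highest average first
"""
result = conn.execute(query).fetchall()
print(result)
```

Where removes individual rows before aggregation, while Having removes whole groups afterwards, which is why department 30 drops out here even though one of its employees passes the salary filter.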
June 20, 2019

Period-To-Date Facts

We have all come across times when our customer wants to know how the organization is currently doing. They often want to know how they are measuring up against this time last year or against the projected measure. The most common examples of these requests are a year-to-date calculation and a budget vs. actual analysis. In this blog post I will describe how to efficiently address these common business requests. Year-To-Date (YTD) Calculations Our customer has stated that they wish to show a YTD metric for sales which can be broken down into quarterly and monthly metrics as well. In […]
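As a plain illustration of the arithmetic behind a YTD metric with monthly and quarterly breakdowns, the sketch below computes a running total over monthly sales. The figures are invented, and this is not the dimensional-model approach the post goes on to describe; it only shows what the numbers mean.

```python
from itertools import accumulate

# Hypothetical monthly sales for one year, keyed by month number.
monthly_sales = {1: 100.0, 2: 150.0, 3: 120.0, 4: 130.0}


def year_to_date(monthly):
    """Return the cumulative total through each month, in calendar order."""
    months = sorted(monthly)
    running = accumulate(monthly[m] for m in months)
    return dict(zip(months, running))


def quarterly_totals(monthly):
    """Sum monthly values into quarters (months 1-3 are quarter 1, and so on)."""
    totals = {}
    for month, amount in monthly.items():
        quarter = (month - 1) // 3 + 1
        totals[quarter] = totals.get(quarter, 0.0) + amount
    return totals


ytd = year_to_date(monthly_sales)
quarters = quarterly_totals(monthly_sales)
print(ytd)
print(quarters)
```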
May 30, 2019

Angular vs React

If you’ve done any sort of front-end programming, you know there are two front-end development frameworks that have emerged as the front runners in the market: React and Angular. Whether you’re just starting in on a development effort or are interested in which may be best to learn to improve your marketability, this article will outline the key differences between the two frameworks and identify recent market trends to help you make a decision. Angular 7, or more familiarly Angular, is a full-fledged MVC framework championed by Google. It includes many OOTB features including routing, unit testing, forms, and AJAX requests. […]
May 28, 2019

Loading Accumulating Snapshot Fact Tables

Often management looks for bottlenecks in corporate processes so that they can be streamlined or used as a measurement of success for the organization. In order to achieve these goals we need to measure time between two or more related events. The easiest way to report on this time-series process is to use accumulating snapshot facts. Accumulating snapshot facts are updatable fact records used to measure time between two or more related events. The most common example of this type of fact can be seen in order processing. Let’s take a look! Order processing consists of many serialized processes. […]
May 15, 2019

Building Qlik Sense Extensions that can Export and Snapshot

The Fluff If you’ve ever built an extension in Qlik Sense focused on data visualization, you know how cool it is to harness the power of the Qlik associative engine. It is important for such visualizations to integrate with the user experience and really feel like Qlik—both in style and functionality. Yet many complex extensions suffer functionality drawbacks from failing to overcome two major hurdles: the ability to export to PDF/Image through the right-click menu, and the ability to function as a snapshot within a Qlik Data Story. Though these issues have the same root cause, it proved incredibly difficult […]
May 1, 2019

Client Identification Using Custom Request Headers in Spring Boot

One of the key building blocks of NuWave’s advanced predictive analytics solutions is a service management platform we built called Dex, which is short for Deus Ex Machina. Dex was built as a collection of microservices using Spring Boot and is responsible for coordinating the execution of complex workflows which take data through acquisition, ingestion, transformation, normalization, and modeling with many different advanced machine learning algorithms. These processing steps are performed with a large number of technologies (Java, Python, R, KNIME, AWS SageMaker, etc.) and are often deployed as Docker containers for execution on either dedicated servers, EC2 instances, or as […]
April 24, 2019

Loading Transaction Fact Tables

This blog post will focus on loading transaction fact tables; subsequent posts on periodic and accumulating snapshots will follow in the coming weeks. Loading fact tables is very different from loading dimensions. First, we need to know the type of fact we are using. The major types of facts are transaction, periodic snapshot, accumulating snapshot and time-span accumulating snapshots. We also need to know the grain of the fact, that is, the dimensional keys which are associated with the measurement event. Let’s say we want to measure product sales by customer, product, and date attributes. The source transaction system may provide […]
April 12, 2019

Data Preparation for Machine Learning: Vector Spaces

Machine learning algorithms often rely on certain assumptions about the data being mined. Ensuring data meet these assumptions can be a significant portion of the preparation work before model training and predicting begins. When I began my data science journey, I was blissfully unaware of this and thought my preparation was done just because I had stuffed everything into a table. Feeding these naïvely compiled tables into learners, I wondered why some algorithms never seemed to perform well for me. As I began digging into the algorithms themselves, I found that many referred to vector space operations as being fundamental to their function. […]
April 4, 2019

Oracle Certified Professional Exam Prep

How I prepared for (and passed) the Oracle Certified Professional exam.