• ##### Drilling Down / Drilling Through (Across)
You often hear “drilling down” and “drilling across” from business intelligence professionals, but what do these terms mean and how do you, as a data modeler, design for this requirement? This blog post will provide you with insights on how to do this. To drill down means to decompose fact data via hierarchical dimensional attributes. Time and geography (location) are good examples of hierarchical dimensions. In a complete time dimension, you will have the day, month, quarter and year hierarchy. In a geography dimension you may have a city, district, state, and region hierarchy. Consider that you are shown a dashboard with year-to-date (YTD) sales. You may want to decompose that data to look into the sales by quarter or by month to see if there is a trend in the data. Using that same fact data, you may be presented with the overall sales in your company but want to research the sales by a smaller geographic area like state or district to see if there are any anomalies or trends happening that you may […]
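To make the idea concrete, here is a small Python sketch (with invented sales rows, not data from the post): drilling down or across simply means re-aggregating the same fact data at a different level of the time or geography hierarchy.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, month, state, sales)
facts = [
    (2019, "Q1", "Jan", "VA", 100.0),
    (2019, "Q1", "Feb", "VA", 150.0),
    (2019, "Q2", "Apr", "MD", 200.0),
    (2019, "Q2", "May", "VA", 125.0),
]

def rollup(rows, key_fn):
    """Aggregate the same fact rows at whatever grain key_fn defines."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_fn(row)] += row[4]
    return dict(totals)

ytd_sales = rollup(facts, lambda r: r[0])            # by year
by_quarter = rollup(facts, lambda r: (r[0], r[1]))   # drill down to quarter
by_state = rollup(facts, lambda r: r[3])             # drill across geography

print(ytd_sales)   # {2019: 575.0}
print(by_quarter)  # {(2019, 'Q1'): 250.0, (2019, 'Q2'): 325.0}
```

The point of the dimensional design is that all three views come from the same fact records; only the grouping keys change.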
• ##### Baking an Approximate Pi with KNIME Using a Monte Carlo Recipe
With the holiday season in the United States rapidly approaching, I thought I’d have some fun with this blog posting.  So rather than giving you a typical blog entry highlighting some specific cool pieces of the awesome open-source KNIME end to end data science platform, today I am going to teach you how to bake an Approximate Pi with KNIME using a Monte Carlo recipe.   And along the way I will show you some interesting cooking tools and techniques to add to your KNIME kitchen. What is an Approximate Pi you ask?  It is simply an estimate of the irrational value of pi, the ratio of a circle’s circumference to its diameter.  And our Monte Carlo recipe?  Roughly speaking, Monte Carlo methods are a broad class of computational algorithms that use random sampling to obtain numerical results.  So, in the style of the Muppets’ Swedish Chef, we are going to start cooking randomly and by doing so enough times, we will eventually bake a decent Approximate Pi.  Ok, so I’m fudging a little bit in […]
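The recipe itself fits in a few lines of plain Python — a sketch of the general Monte Carlo method, not the KNIME workflow from the post: sample random points in the unit square and count how many fall inside the quarter circle.

```python
import random

def approximate_pi(samples, seed=42):
    """Estimate pi by sampling random points in the unit square and
    counting how many land inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # (pi/4) = area of quarter circle / area of unit square
    return 4.0 * inside / samples

print(approximate_pi(100_000))  # close to 3.14
```

More samples yield a better pi; the error shrinks roughly with the square root of the sample count.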
• ##### Qlik Core: The Next Generation of Self-Serve Analytics
Since its inception, we have all watched Qlik Sense grow. From desktop to cloud, from custom extensions to embedded visualizations to mashups using Enigma.js. Qlik Core is the next phase of this growth, and it is exactly what developers ordered. There’s no debating that Qlik Sense Desktop is an incredibly powerful tool, giving business analysts and decision makers everything they could want: easy ETL, data governance, security, self-service analytics, and more. In recent years, Qlik Sense has upped the ante with new features, unlocking powerful new capabilities while maintaining the ease-of-use that makes Qlik so appealing. But. We’ve all had that one client who wants the power of Qlik with their branding, their look-and-feel, in their custom web application. Of course, there are mashups and embedded visualizations, but you still have to work within the parameters of the Qlik Sense Enterprise Edition. While you get the power of Qlik, you also get the heavy architecture in the background, even if you don’t want it. It can also be hard to get the client […]
• ##### Leveraging Athena with KNIME in a Robust Manner, Part 2
In my previous blog posting, I introduced an issue we were having with seemingly random intermittent failures using Amazon Web Services’ Athena backed by a large number of data files in S3. The issue was arising because S3 is eventually consistent, and occasionally queries were being executed before their underlying data files were fully materialized in S3. Our solution was to introduce try/catch KNIME nodes with a loop to retry failed queries a few times in case of intermittent failures. To do this we had to do our own flow variable resolution in the Athena SQL queries, since the standard KNIME database SQL executor node does not do variable substitution when the query is configured via a flow variable. My previous blog posting (link to part 1 here) covered how we resolved embedded flow variable references in a Java snippet node. In this blog posting I am going to cover the workflow we use for the actual query retry logic so that we retry each query a given number of times until it either succeeds […]
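Outside of KNIME, the same retry pattern looks roughly like this in Python. The function names and the simulated failure are invented for illustration; the actual workflow uses KNIME try/catch nodes inside a loop.

```python
import time

def run_with_retries(execute_query, sql, max_tries=3, delay_seconds=5):
    """Retry a query a few times to ride out eventually-consistent reads.

    `execute_query` is a stand-in for whatever actually runs the SQL
    (an Athena client call, a KNIME DB node, etc.)."""
    for attempt in range(1, max_tries + 1):
        try:
            return execute_query(sql)
        except Exception:
            if attempt == max_tries:
                raise  # out of retries: surface the real failure
            time.sleep(delay_seconds)

# Simulated flaky backend: fails twice, then succeeds.
calls = {"n": 0}
def flaky(sql):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("HIVE_SPLIT error: missing partition file")
    return "ok"

print(run_with_retries(flaky, "SELECT 1", delay_seconds=0))  # ok
```

A short, growing delay between attempts gives S3 time to reach consistency before the next try.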
• ##### Dynamic Looping in KNIME
I recently came across a situation where I needed to load a couple of table partitions. I wanted to automate this process and have a table drive it. To accomplish this task, I started by creating a table of partitions that would drive the dynamic refresh process. My requirement was to be able to refresh the current month and the previous month’s data (see the DDL for the partitioned fact table below).

```sql
CREATE TABLE FACT_BOOK_SALE (
   FBS_BOOK_KEY     NUMBER NOT NULL ENABLE
  ,FBS_CSTMR_KEY    NUMBER NOT NULL ENABLE
  ,FBS_STORE_KEY    NUMBER NOT NULL ENABLE
  ,FBS_DATE_KEY     NUMBER NOT NULL ENABLE
  ,FBS_BOOK_ISBN_13 NUMBER
  ,FBS_CSTMR_ID     VARCHAR2(50 CHAR)
  ,FBS_STORE_ID     NUMBER
  ,FBS_QTY          NUMBER
  ,FBS_DSCNT_PCT    NUMBER(6,3)
  ,FBS_SALE_PRICE   NUMBER(19,4)
  ,PRIMARY KEY (FBS_BOOK_KEY, FBS_CSTMR_KEY, FBS_STORE_KEY, FBS_DATE_KEY)
)
PARTITION BY RANGE (FBS_DATE_KEY)
(
  PARTITION FACT_BOOK_SALE_201901 VALUES LESS THAN (20190200),
  PARTITION FACT_BOOK_SALE_201902 VALUES LESS THAN (20190300),
  PARTITION FACT_BOOK_SALE_201903 VALUES LESS THAN (20190400),
  PARTITION FACT_BOOK_SALE_201904 VALUES LESS THAN (20190500),
  PARTITION FACT_BOOK_SALE_201905 VALUES LESS THAN (20190600),
  PARTITION FACT_BOOK_SALE_201906 VALUES LESS THAN (20190700),
  PARTITION FACT_BOOK_SALE_201907 VALUES LESS THAN (20190800),
  PARTITION FACT_BOOK_SALE_201908 VALUES LESS THAN (20190900),
  PARTITION FACT_BOOK_SALE_201909 VALUES LESS THAN (20191000),
```

[…]
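The “current month plus previous month” requirement boils down to deriving two partition names from the run date. A minimal Python sketch, assuming the FACT_BOOK_SALE_YYYYMM naming convention used above:

```python
from datetime import date

def partitions_to_refresh(today):
    """Return partition names for the previous and current month,
    following the FACT_BOOK_SALE_YYYYMM naming scheme."""
    if today.month == 1:
        prev_year, prev_month = today.year - 1, 12  # roll back across the year boundary
    else:
        prev_year, prev_month = today.year, today.month - 1
    return [
        f"FACT_BOOK_SALE_{prev_year:04d}{prev_month:02d}",
        f"FACT_BOOK_SALE_{today.year:04d}{today.month:02d}",
    ]

print(partitions_to_refresh(date(2019, 9, 15)))
# ['FACT_BOOK_SALE_201908', 'FACT_BOOK_SALE_201909']
```

In the workflow, a driver table holding these names can feed a loop that refreshes each partition in turn.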
• ##### Leveraging Athena with KNIME in a Robust Manner, Part One
Recently, we began experiencing seemingly random intermittent failures in one of our modeling workflows used in VANE, an advanced predictive analytics project we are building for the US Army. These failures were occurring with varying frequency inside of any one of several Database SQL Executor nodes. These nodes were performing a large number of SQL queries against Amazon Web Services’ Athena backed by a large number of data files in S3. The workflow was randomly failing in any of these nodes with a HIVE SPLIT error due to a missing partition file in S3. When we investigated the failure, we would invariably find the supposedly missing partition file, and restarting the workflow from the failed step would succeed. With a little bit of research, we learned that this is a known problem with Athena and S3 storage due to the nature of S3 being “eventually consistent”. Simply put, some of the Athena queries were being executed before their required data files were fully materialized in S3. With this newfound knowledge, we began to consider ways […]
• ##### Updating / Replacing Fact Records
You will inevitably come across a time when a source system will allow the updating of transaction records. This drives data modelers, ETL developers, analytics designers, and analytics users crazy. After all, the source system stated that this information was a “fact,” leading all to believe it would not change; now you must design for and resolve the changing fact data. In this blog post we will discuss the options and the impact these decisions will have on your warehouse and BI solution. Let’s start with the easy and move to the complex. The easiest thing to do when receiving data that causes an existing fact record to be updated is to simply update the measures in the existing record. Some people will argue that “you cannot just update the fact record. I have already had people use the data, and now they will see a change that cannot readily be explained since the measure was overwritten.” Although this statement is true, most data warehouse project owners, sponsors and users understand that there can be […]
• ##### Hierarchies in Qlik Sense
Understanding how Qlik’s Associative Engine handles hierarchies will improve your ability to render meaningful and flexible rollups that users can wield to draw insights at all levels of granularity. This article covers how Qlik hierarchies do not follow the same conventions as a normal (SQL) database and discusses how the hierarchy and ancestry load functions can be used in tandem to allow easy access to data at any granularity. Database Hierarchies Perhaps the most common way to store hierarchies in databases is through a Parent-Child Table. In such a table, there exists a row for each node in the hierarchy containing its ID and the ID of its parent. This format is widespread across most database technologies as it is simple, flexible, and updates quickly. Depending on the specific data, other formats may be used for better read/write performance. Perhaps because of its ubiquity, Qlik is built to utilize Parent-Child Tables. But it cannot provide a true hierarchical user experience without processing those tables through special functions. The Qlik Associative Engine Qlik cannot handle all […]
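A parent-child table can be expanded into full ancestry paths with a simple upward walk. Here is a hedged Python sketch with toy data — only an approximation of what Qlik's Hierarchy load prefix produces, to show the underlying idea:

```python
# Hypothetical parent-child table: node_id -> parent_id (None = root)
parent_of = {
    "Company": None,
    "Sales":   "Company",
    "East":    "Sales",
    "West":    "Sales",
    "IT":      "Company",
}

def ancestry(node):
    """Walk up the parent-child table to build the node's full path,
    root first -- similar in spirit to a Qlik ancestry expansion."""
    path = []
    while node is not None:
        path.append(node)
        node = parent_of[node]
    return list(reversed(path))

print(ancestry("East"))  # ['Company', 'Sales', 'East']
```

With every node's full path materialized, a front end can filter or aggregate at any depth without recursive queries.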
• ##### AngularJS Basics: \$scope
AngularJS, for the uninitiated, is an open-source front-end framework that is maintained by Google and an active community. It has been around for some time, but the learning curve can be pretty steep as AngularJS encompasses not one language, but two; HTML and Javascript. AngularJS puts the two of these together and exponentially increases your front-end capabilities as opposed to plain old HTML. If you’re familiar with HTML then that half of the equation is rather straightforward, but the Javascript is where things will take a turn with the inclusion of “\$scope”. Scope as a concept typically defines the level of visibility of a variable or object of which there are two, local or global, but in normal Javascript this is not explicitly called upon. In AngularJS, \$scope is essentially the middleman between your HTML view and your Javascript Controller. It is viewable by both the HTML and the Javascript and as such can be interacted with by both. In this sense we can say that \$scope in Angular is always visible to its controller […]
• ##### Late Arriving Dimensions
In my last post I discussed late arriving facts, and although it is slightly annoying to receive delayed transaction data, the impact on the ETL and on the accuracy of the data in the data warehouse is minimal. As we will discuss in this blog post, this is not the case for late arriving dimension (early arriving fact) data. As I have discussed throughout this blog series on designing and loading dimensional data warehouses, we always load data in the following order: lookup tables, then dimensions, and finally facts. But what happens when you get fact data before the dimension data arrives, or there is a backdated change in the dimensional data? Late arriving dimensions may be generated by a known business process, but they can cause a significant challenge to the accuracy of the data and test the skill and patience of the ETL developers and testers. Let’s take a look at the challenges that delayed dimensional data presents and the options we have to resolve them. The most […]
• ##### XPATH & KNIME
Yesterday I had to import an XML file into a series of relational database tables. I started by thinking of the best way to import this data. I could write a Python parser by hand and load the data that way, but I am a lazy programmer. I’m looking for something that is quick to deploy and easy to maintain. That is when I thought of my trusty, open-source ETL tool, KNIME. For this example, I will be using an XML version of the Shakespeare play Hamlet. Everything started off fine. I grabbed the XML Reader node and placed it on my canvas. I opened the node configuration tool and entered the location of the XML file. So far, so good. My task was to flatten the XML document into a denormalized table. I started out by finding all of the relevant tags I would need to make this output meet the requirements. OK, the play title tag occurs once, so I grab an XPath node and configure it to grab the title of […]
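For readers who do want the hand-rolled Python route, the same flattening can be sketched with the standard library's limited XPath support. The document below is a toy stand-in; the real Shakespeare XML is much larger but uses the same PLAY/TITLE/ACT/SCENE tags:

```python
import xml.etree.ElementTree as ET

# A toy stand-in for the Shakespeare XML (the real file has many acts/scenes).
xml_doc = """
<PLAY>
  <TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE>
  <ACT>
    <TITLE>ACT I</TITLE>
    <SCENE><TITLE>SCENE I</TITLE></SCENE>
  </ACT>
</PLAY>
"""

root = ET.fromstring(xml_doc)
play_title = root.findtext("TITLE")  # the play title tag occurs once
act_titles = [a.findtext("TITLE") for a in root.findall(".//ACT")]

print(play_title)  # The Tragedy of Hamlet, Prince of Denmark
print(act_titles)  # ['ACT I']
```

The once-per-document title gets repeated onto every row when the nested tags are joined into a denormalized table, which is exactly what the XPath nodes in KNIME do.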
• ##### Plotly.js: Getting Accurate Coordinates from Click Event
Across StackExchange and the Plotly.js community forums, there seems to be one frequently asked question that has not yet been answered: how can you get accurate coordinates from a click event on a Plotly chart? Take your standard scatter plot: https://codepen.io/plotly/pen/LVMRMO If I hover my mouse over (4, 14), and click, the current Plotly.js onClick event data returns data for each trace at that same x value. So, you would get (4, 17) for trace 0, (4, 11) for trace 1, and (4, 12) for trace 2. This is great if I’m trying to find out information about the traces, but this is not useful at all if I’m trying to get the point on the grid on which I clicked. Of course, with the event data you get all kinds of other somewhat useful information and functions. etpinard gets close to a solution in this codepen (https://codepen.io/etpinard/pen/EyydEj) by using the xaxis and yaxis built-in p2c function. Unfortunately, this doesn’t exactly work, and there is little to no documentation on these conversion functions. There are […]
• ##### Late Arriving Facts
Late arriving facts are those transactions which are delayed in arriving at the warehouse. For Type I data warehouses there is no real impact, but for Type II warehouse models there is a minor challenge the ETL developer must remember to handle to ensure the late arriving transaction corresponds with the dimensional attributes in effect at the time of the transaction. If we have a Type I warehouse, or are always assured that we only receive fact data at the same time as the dimensional data, then we can use the following simple query, where we find the current active record for the dimension’s natural ID found in the fact record:

```sql
SELECT dimensional_key
  FROM dimension_table
 WHERE dimension_natural_id = {natural_id from fact}
   AND dimension_actv_rcrd_fl = 1;
```

Taking a look at the Type II data warehouse challenge, we cannot assume that the active dimensional record is the correct record for the fact. Therefore, we need to modify the ETL workflow process (Figure 1) to address the challenge of the possible changing of dimensional data since the “old” […]
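For the Type II case, the lookup has to match the fact's transaction date against the dimension record's effective date range instead of the active flag. A runnable sqlite3 sketch — the effective-date column names here are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimension_table (
    dimensional_key        INTEGER,
    dimension_natural_id   TEXT,
    dimension_actv_rcrd_fl INTEGER,
    eff_start_dt           TEXT,
    eff_end_dt             TEXT
);
-- Two versions of the same customer: the historical one and the current one.
INSERT INTO dimension_table VALUES (1, 'C100', 0, '2019-01-01', '2019-06-30');
INSERT INTO dimension_table VALUES (2, 'C100', 1, '2019-07-01', '9999-12-31');
""")

# A late arriving fact dated in May must join to version 1, not the active row.
row = conn.execute("""
    SELECT dimensional_key
      FROM dimension_table
     WHERE dimension_natural_id = ?
       AND ? BETWEEN eff_start_dt AND eff_end_dt
""", ("C100", "2019-05-15")).fetchone()

print(row[0])  # 1
```

Using the active-flag query instead would have returned key 2 and silently attached the transaction to attributes that were not in effect at the time.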
• ##### Using COPY command to Load CSV files from S3 to RDS Postgres Database via KNIME Workflow
The Postgres COPY command is the most efficient way to load CSV data into a Postgres database. RDS presents a small challenge to the use of this functionality since you cannot use a local filesystem in RDS. In this blog post I will walk you through the steps to connect your Postgres RDS database to an S3 filesystem and load CSV files into this database via a KNIME workflow. First, we need to add the aws_s3 extension to the Postgres database by executing the following command from the PG Admin tool: CREATE EXTENSION aws_s3 CASCADE; Next, we need to create an IAM policy that grants access to Amazon S3 resources and an IAM role that allows access to AWS services. Now that we have a role associated with the IAM access policy, we need to grant it to the database. To do this, log on to the AWS console, go to the RDS page, and select the database you wish to have this functionality. On that page scroll down to Manage IAM Roles (see Figure 1) and […]
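Once the extension and role are in place, the load itself is a call to the `aws_s3.table_import_from_s3` function. The sketch below only composes the SQL; the table, bucket, file, and region values are made up, and the option string is illustrative — check the RDS documentation for the full signature:

```python
def build_s3_import_sql(table, bucket, key, region):
    """Compose the aws_s3.table_import_from_s3 call used to COPY a CSV
    from S3 into an RDS Postgres table. Values here are illustrative."""
    return (
        "SELECT aws_s3.table_import_from_s3("
        f"'{table}', '', '(format csv, header true)', "
        f"aws_commons.create_s3_uri('{bucket}', '{key}', '{region}'));"
    )

sql = build_s3_import_sql("fact_book_sale", "my-bucket", "loads/sales.csv", "us-east-1")
print(sql)
```

A KNIME Database SQL Executor node can then run this statement, so the data never has to pass through the workflow host.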
• ##### URL Tips and Tricks
URL manipulation is one of those things you may not even think about until it becomes the biggest headache of your week. Thankfully KNIME has some things we can use to make life easier. Here are some of the common situations you could encounter when working with URLs in KNIME and how to handle them. The Workflow Needs to Run in Any System’s KNIME Directory Perhaps the most common issue you’ll run into: most of the time you won’t want a KNIME workflow to run only on your machine or server. You still want it to use a specific file structure beyond a certain point, but you don’t want to hard-code everything before that. KNIME has a remarkably simple way of letting you deal with this in the form of standard variable keywords. Take this Create Directory node as an example: This syntax uses two such variables, “knime://” and “knime.workflow”. The former, “knime://”, is the system’s KNIME mount point, specified when you open your work space in the editor. This mount point almost entirely […]
• ##### Batch Execution of KNIME Workflows, Part 2
In an earlier blog posting I walked through the steps required to install KNIME in a Docker container so that it could be used to run workflows in a batch mode.  This is a very useful technique for automating workflow execution and leveraging cloud resources to do so.  However, most workflows are not self-contained: they need access to configuration, external data, and local storage.   I did not cover those aspects in my first posting so this blog entry will introduce ways to pass in configuration and support the batch execution of KNIME workflows in a container. Setting Flow Variables The most common way to supply external configuration to a KNIME workflow is to set flow variables from the command line when KNIME is run in batch mode.    This is done using the -workflow.variable option.  Each instance of this option sets the value for a single variable so it may be included any number of times in the command line.   The format for the option value is as follows: -workflow.variable=[Variable Name],[Value],[Type] The [Variable Name] is simply […]
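Putting this together, a launcher script might assemble the batch command like this. The workflow path, variable names, and values are illustrative, not from the post; only the application ID and the `-workflow.variable` format follow KNIME's documented batch mode:

```python
def knime_batch_command(workflow_dir, variables):
    """Build a headless KNIME invocation with one -workflow.variable
    option per (name, value, type) entry."""
    cmd = [
        "knime", "-nosplash", "-application",
        "org.knime.product.KNIME_BATCH_APPLICATION",
        f"-workflowDir={workflow_dir}",
    ]
    for name, value, vtype in variables:
        # Format: -workflow.variable=[Variable Name],[Value],[Type]
        cmd.append(f"-workflow.variable={name},{value},{vtype}")
    return cmd

cmd = knime_batch_command(
    "/data/workflows/etl",
    [("input_file", "/data/in.csv", "String"), ("batch_size", 500, "int")],
)
print(cmd)
```

The resulting list can be handed to `subprocess.run` inside a container entrypoint, which is how this pattern typically gets wired into Docker-based execution.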
• ##### Bulletproof Your KNIME Workflows with System Property Variables and External Config Files
KNIME workflows can contain a lot of code and, like all good programmers, I always want to eliminate as many code changes as possible. To accomplish this optimization, I rely on the use of table-driven processes, external property files and system property variables. The use of these techniques can reduce code changes and ensure that workflows will work on different computers that are not set up identically. I’d like to begin by discussing table-driven solutions. For this type of solution, I will use the data in the table to drive the change. This can be done by simply changing the value of an attribute. For example, if I wanted to ensure a document was processed, I would add a flag (doc_processed_fl) to the document table with a default value of FALSE and have the program set the value to TRUE upon the completion of the process. This way I am assured the record will be processed. In the event of a catastrophic failure (network error, database error, etc.) causing the workflow to be halted, the flag […]
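The flag-driven pattern can be sketched in a few lines with sqlite3. The table and column names mirror the example above; the "real work" per document is stubbed out:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE document (
        doc_id           INTEGER PRIMARY KEY,
        doc_processed_fl INTEGER NOT NULL DEFAULT 0  -- FALSE until processed
    )
""")
conn.executemany("INSERT INTO document (doc_id) VALUES (?)", [(1,), (2,), (3,)])

def process_pending(conn):
    """Process only unprocessed documents; flip the flag on completion.
    After a crash, rerunning simply picks up where the flags left off."""
    pending = [r[0] for r in conn.execute(
        "SELECT doc_id FROM document WHERE doc_processed_fl = 0 ORDER BY doc_id")]
    for doc_id in pending:
        # ... real per-document work would happen here ...
        conn.execute("UPDATE document SET doc_processed_fl = 1 WHERE doc_id = ?",
                     (doc_id,))
    return pending

print(process_pending(conn))  # [1, 2, 3]
print(process_pending(conn))  # [] -- nothing left to redo
```

Because state lives in the table rather than in the workflow, a halted run can be restarted without reprocessing or skipping records.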
• ##### Automated Execution of Multiple KNIME Workflows
When using the open source KNIME Analytics Platform to build sophisticated data processing and analysis pipelines, I often find myself building workflows that consist of so many nodes they become difficult to manage.  This complexity can be managed by grouping nodes into processing stages and then bundling those stages into meta-nodes so that the overall workflow layout is easier to follow. However, I’ve found that this approach still leaves workflows unwieldy to work with as you still have to open the meta-nodes to explore and troubleshoot their processing.  Over the years I’ve worked with KNIME, I’ve developed a habit of breaking larger workflows up into smaller individual workflows representing each processing stage in the overall pipeline.  This serves to make building and debugging each processing step much more tractable at the cost of requiring more storage for persisting the outputs of one stage so that they may be used as the inputs of the next stage. Another small drawback of separate workflows is that they all need to be executed in order for the overall […]
• ##### Serverless Analysis of data in Amazon S3 using Amazon Athena through KNIME.
This blog describes how to perform serverless analysis of data in Amazon S3 using Amazon Athena through KNIME. Let’s start with a quick introduction to Amazon Athena. What is Athena? Athena is a serverless query service for querying data in S3 using standard SQL, with no infrastructure to manage. It supports ANSI SQL queries with support for joins, JSON and window functions. How to connect to Athena and execute SQL from KNIME? KNIME can interface with Athena using the following nodes. How to do analysis of data in S3 using Athena through KNIME? A traditional approach is to download entire files from S3 into KNIME using a node such as the Parquet Reader. This approach works fine if the file is relatively manageable in size. But if you have large files in S3, then this download approach will consume lots of time and local memory, and processing will be slow because of the data volume. Athena can help with these issues. Since Athena supports SQL processing on top of S3, all heavy-lift data processing such as joins, […]
• ##### Batch Execution of KNIME Workflows Using Docker
In my previous blog posting I introduced our service management platform Dex, which is at the core of many of our advanced analytics solutions. Dex seamlessly manages the coordination of the data acquisition, ingestion, transformation and modeling for our system via processing steps that are often deployed as Docker containers for execution on either dedicated servers, EC2 instances, or as AWS Fargate tasks. One of the core applications we leverage for many of these steps is the open source KNIME analytics platform (www.knime.org). KNIME’s easy-to-use GUI allows us to quickly develop and test ETL and modeling workflows. Once these workflows are done, we then deploy them inside of Dex to be run inside Docker containers using KNIME’s batch execution mode. The purpose of this article is to provide a guide to building a Docker container to execute KNIME workflows in a headless environment. This article details a very basic set up for the Docker container and does not cover things like user accounts and security credentials which may be required for a more […]
• ##### Using the Python Node in KNIME to Perform Granger Causality Tests
KNIME includes a wide variety of data manipulation and data science nodes, but there are a few cases where you may need to embed specific Python routines inside a workflow. Or, more often, I test algorithms in Python but do not want to put the effort into orchestrating the data access, iterations, and parallel processing in Python. KNIME has several Python nodes, but first you will need to set up your Python environments. Setting up Python Environments KNIME requires you to have Python 2 and Python 3 environments running locally. If you have Anaconda 3.x, you will need to create a Python 2 environment with numpy and pandas installed. From the Anaconda prompt, run the following:

```
conda install nb_conda_kernels
conda create -n py27 python=2.7 ipykernel
activate py27
pip install numpy
pip install pandas
```

You cannot point to the Python executables directly if you are running on Windows and will need to create batch files to run the environments. The .bat files for the Python3 (base) and Python2 (py27) environments: @REM Adapt the folder in the PATH to […]
• ##### 5 Tips and Workarounds for Angular/JS Extension Development
This article provides a few small tips and tricks to ease some of the headaches of developing Angular/JS extensions for Qlik. 1. When using jQuery, make sure you are utilizing unique selectors for a given instance of the extension. Multiple instances of an extension can exist on a single sheet. If using jQuery, it is easy to select (and accidentally edit) all instances of an extension on a sheet. To prevent this, each instance of the extension should have its own id. More advanced users may want to consider using the objectId that Qlik assigns to visualizations, but those can be difficult to track down. We at NuWave have found an easy method to utilize unique selectors: generate a random number, place it in a wrapper div in the html, then use that number for all selections. 2. Use Validated and a broadcast() (with a timeout) for re-drawing in Angular extensions. When creating an extension that needs to redraw (i.e., on selection or property changes), Qlik recommends using the Validated event. Qlik’s documentation describes this […]
• ##### Factless Fact Tables
Factless facts are those fact tables that have no measures associated with the transaction. A factless fact is a simple collection of dimensional keys which define the transaction or describe a condition for the time period of the fact. You may question the need for a factless fact, but they are important dimensional data structures, capturing information which can be leveraged into rollup measures or presented directly to a user. The most common example used for factless facts is student attendance in a class. As you can see from the dimensional diagram below, the FACT_ATTENDANCE is an amalgamation of the DATE_KEY, the STUDENT_KEY, and the CLASS_KEY. There is nothing we can measure about a student’s attendance at a class. The student was there and the attendance was recorded, or the student was not there and no record was recorded. It is a fact, plain and simple. There is a derivation of this fact where you can always load the full roster of individuals registered for the class and add a […]
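Because there is no measure to sum, `COUNT(*)` becomes the rollup measure. A runnable sqlite3 sketch with invented attendance rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- A factless fact: just dimensional keys, no measures (sample rows invented).
CREATE TABLE FACT_ATTENDANCE (DATE_KEY INT, STUDENT_KEY INT, CLASS_KEY INT);
INSERT INTO FACT_ATTENDANCE VALUES
    (20190401, 1, 10), (20190401, 2, 10), (20190402, 1, 10);
""")

# With nothing to SUM, counting the key rows yields the attendance measure.
rows = conn.execute("""
    SELECT CLASS_KEY, DATE_KEY, COUNT(*) AS students_attending
      FROM FACT_ATTENDANCE
     GROUP BY CLASS_KEY, DATE_KEY
     ORDER BY DATE_KEY
""").fetchall()

print(rows)  # [(10, 20190401, 2), (10, 20190402, 1)]
```

The row's mere existence is the fact; aggregation turns that existence into a number.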
• ##### PostgreSQL – Why You’d Want to Use It
If you’ve been in this industry for a few years, then you probably know what SQL is. It’s the gold standard for working with databases, and nearly every modern coding language interfaces with it. Though simple, it’s very flexible, and variants such as Oracle SQL and MySQL allow for more robust functionality. PGSQL, or PostgreSQL, is another one of these variants, but it takes things one step further, and even one step after that. Let’s start with what exactly PGSQL is and is not. PGSQL is a variant of SQL and uses most of the same syntax. It’s open source and has been in development for decades now; if you’ve been around the block, then you may even know it as Ingres. It can do all the same things that traditional SQL databases can do, following the same rules and structure. It is not a reinvention of the wheel, more so a different take on it. It is not specifically NoSQL, though it can be used for it, and it doesn’t require […]
• ##### Structured Query Language Basics
Structured Query Language (SQL) is the standard language used for retrieving and manipulating database information. In today’s blog, I will focus on how we can use this language to get an organized result set back that answers a specific question. This will be accomplished by using the Select, From, Where, Group By, Having, and Order By clauses. We will be using the EMPLOYEES table in the HR schema from Oracle Live as our practice table. You can find the structure of the table in the image below. This can be accessed for free; all you need is an Oracle account, and you log in with those credentials. Go to this link (Oracle Live) and select “Start Coding Now”. SELECT: All SQL queries start with SELECT. This is how the query knows what information you want to return. The SELECT clause is where you place the column names, column aliases, and any aggregations you want to see in your result. If you don’t know the structure of the table in advance you can place an […]
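For readers without an Oracle account, the same clauses can be tried against a tiny stand-in table using Python's built-in sqlite3. The columns are simplified and the rows invented; this is not the actual HR schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- A tiny stand-in for the HR EMPLOYEES table (columns simplified).
CREATE TABLE EMPLOYEES (EMPLOYEE_ID INT, DEPARTMENT_ID INT, SALARY REAL);
INSERT INTO EMPLOYEES VALUES
    (1, 10, 5000), (2, 10, 7000), (3, 20, 4000), (4, 20, 4500), (5, 30, 9000);
""")

# Every clause from the post, in the order the database expects them:
rows = conn.execute("""
    SELECT DEPARTMENT_ID, AVG(SALARY) AS avg_salary
      FROM EMPLOYEES
     WHERE SALARY > 3500              -- filter rows before grouping
     GROUP BY DEPARTMENT_ID           -- one result row per department
    HAVING AVG(SALARY) >= 4250        -- filter groups after aggregation
     ORDER BY avg_salary DESC         -- sort the final result set
""").fetchall()

print(rows)  # [(30, 9000.0), (10, 6000.0), (20, 4250.0)]
```

Note the split in responsibilities: WHERE filters individual rows, while HAVING filters whole groups produced by GROUP BY.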
• ##### Period-To-Date Facts
We have all come across times when our customer wants to know how the organization is currently doing. They often want to know how they are measuring up against this time last year or against a projected measure. The most common examples of these requests are a year-to-date calculation and a budget vs. actual analysis. In this blog post I will describe how to efficiently address these common business requests. Year-To-Date (YTD) Calculations Our customer has stated that they wish to show a YTD metric for sales which can be broken down into quarterly and monthly metrics as well. In all of my other blog posts I have modeled some dimension or fact to show you how to address the requirement at hand; in this scenario I will not. The reason is that we have already done the work to address this requirement in previous blogs. Let’s look at how we already addressed this issue with an example. In the star schema shown above we can see that we are storing purchase transactions by customer, product, […]
• ##### Angular vs React
If you’ve done any sort of front-end programming, you know there are two front-end development frameworks that have emerged as the front runners in the market: React and Angular. Whether you’re just starting in on a development effort or are interested in which may be best to learn to improve your marketability, this article will outline the key differences between the two frameworks and identify recent market trends to help you make a decision. Angular7, or more familiarly Angular, is a full-fledged MVC framework championed by Google. It includes many OOTB features including routing, unit testing, forms, and AJAX requests. With these built-in features comes the strong suggestion on how to structure your application. When building an Angular app, you [generally] use their router, their forms, their requests, and you [generally] do it in their style. React.js is technically a library (not a framework) that is built and maintained by Facebook. It is more lightweight than Angular, but lacks the built-in features listed above. If you want to incorporate form validation into your React app, […]
Often management looks for bottlenecks in corporate processes so that they can be streamlined or used as a measurement of success for the organization. In order to achieve these goals we need to measure the time between two or more related events. The easiest way to report on this time-series process is to use accumulating snapshot facts. Accumulating snapshot facts are updatable fact records used to measure the time between two or more related events. The most common example of this type of fact can be seen in order processing. Let’s take a look! Order processing consists of many serialized processes. First there is the order, followed by the product creation or packaging of the items contained within the order. Once the items are packed, they are sent to QA for a double check of the order before sending the package to shipping, where the carrier will pick up and deliver the product to the customer. In this process there are many areas where the process could get held up (abandoned cart, missing inventory, slow QA process, […]
• ##### Building Qlik Sense Extensions that can Export and Snapshot
The Fluff If you’ve ever built an extension in Qlik Sense focused on data visualization, you know how cool it is to harness the power of the Qlik associative engine. It is important for such visualizations to integrate with the user experience and really feel like Qlik—both in style and functionality. Yet many complex extensions suffer functionality drawbacks from failing to overcome two major hurdles: the ability to export to PDF/Image through the right-click menu, and the ability to function as a snapshot within a Qlik Data Story. Though these issues share the same root cause, that cause proved incredibly difficult for me to track down. It boils down to some undocumented changes in a couple of key Qlik JavaScript objects. To see the result of my research and for instructions on how to handle these issues correctly, skip down to The Stuff below. My path started by trying to “turn on” the export functionality for a table-like extension I was developing. The Qlik help pages helpfully told me to add a property named “export” to […]
• ##### Client Identification Using Custom Request Headers in Spring Boot
One of the key building blocks of NuWave’s advanced predictive analytics solutions is a service management platform we built called Dex, which is short for Deus Ex Machina. Dex was built as a collection of microservices using Spring Boot and is responsible for coordinating the execution of complex workflows which take data through acquisition, ingestion, transformation, normalization, and modeling with many different advanced machine learning algorithms. These processing steps are performed with a large number of technologies (Java, Python, R, KNIME, AWS SageMaker, etc.) and are often deployed as Docker containers for execution on either dedicated servers, EC2 instances, or as AWS Fargate tasks. Dex seamlessly manages the coordination of the execution of all these ephemeral tasks across these various technologies while also providing runtime configuration, secure credential management, state management, and storage provisioning for each executing task. This has led to the affectionate nickname of “Cat Herder” for this critical technology at the heart of our advanced analytical solutions. With such a complex processing workflow spread across multiple underlying technologies, we faced the challenge of […]
This blog post will focus on loading transaction fact tables; subsequent posts covering periodic and accumulating snapshots will follow in the coming weeks. Loading fact tables is very different from loading dimensions. First, we need to know the type of fact we are using. The major types of facts are transaction, periodic snapshot, accumulating snapshot, and time-span accumulating snapshot. We also need to know the grain of the fact: the dimensional keys associated with the measurement event. Let’s say we want to measure product sales by customer, product, and date attributes. The source transaction system may provide the following associated dimensions: DIM_CUSTOMER, DIM_PRODUCT, and DIM_DATE. Here is an example of a set of product sale transaction records:

| Product_ID | Customer_ID | Transaction_Date | QTY | EXT_Price | Discount | Total |
|---|---|---|---|---|---|---|
| 55 | 1415 | 4/10/2019 10:01 | 3 | 38.97 | 5 | 37.02 |
| 72 | 1415 | 4/10/2019 10:01 | 5 | 39.95 | 5 | 37.95 |
| 120 | 1415 | 4/10/2019 10:01 | 1 | 12.99 | 5 | 12.34 |
| 17 | 3457 | 4/10/2019 10:09 | 3 | 44.97 | 0 | 44.97 |
| 57 | 8322 | 4/10/2019 10:10 | 6 | 35.94 | 10 | 32.35 |
| 106 | 1194 | 4/10/2019 10:17 | 3 | 23.97 | 0 | 23.97 |

[…]
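Reading the sample records, the Total measure appears to be the extended price less a percentage discount. A minimal sketch of that derivation (assuming Discount is a percentage, which the post does not state explicitly):

```python
# Sketch: derive the Total measure from EXT_Price and a percentage Discount.
# Column names mirror the sample transaction records; rows are from the table.
records = [
    {"Product_ID": 55, "Customer_ID": 1415, "QTY": 3, "EXT_Price": 38.97, "Discount": 5},
    {"Product_ID": 57, "Customer_ID": 8322, "QTY": 6, "EXT_Price": 35.94, "Discount": 10},
]

def total(ext_price: float, discount_pct: float) -> float:
    """Apply the percentage discount to the extended price, rounded to cents."""
    return round(ext_price * (1 - discount_pct / 100), 2)

for r in records:
    r["Total"] = total(r["EXT_Price"], r["Discount"])
    print(r["Product_ID"], r["Total"])
```

Run against the sample rows, this reproduces the Total column above (e.g. 38.97 at a 5% discount yields 37.02).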
• ##### Data Preparation for Machine Learning: Vector Spaces
Machine learning algorithms often rely on certain assumptions about the data being mined. Ensuring data meet these assumptions can be a significant portion of the preparation work before model training and predicting begin. When I began my data science journey, I was blissfully unaware of this and thought my preparation was done simply because I had stuffed everything into a table. Feeding these naïvely compiled tables into learners, I wondered why some algorithms never seemed to perform well for me. As I began digging into the algorithms themselves, many referred to vector space operations as fundamental to their function. Answering the simple question of “what is a vector space” took me on quite an enlightening journey. Vector spaces are really a neat tool for organizing data, and preparing for them requires more than getting all your data into a table. Perhaps most mind-bending is that a vector space can be constructed from anything, even if mortals like myself tend to just use numbers and categorical values. I believe part of the reason I was late to understanding […]
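One common step in getting tabular data into a numeric vector space is one-hot encoding of categorical values, so that distance-based operations make sense. A minimal sketch (the column names and records below are hypothetical illustrations, not from the post):

```python
# Sketch: map categorical records into a numeric vector space by replacing a
# categorical column with one binary dimension per observed category.
def one_hot_encode(records, column):
    """Return new rows where `column` becomes one 0/1 dimension per category."""
    categories = sorted({r[column] for r in records})
    encoded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != column}
        row.update({f"{column}={c}": (1.0 if r[column] == c else 0.0)
                    for c in categories})
        encoded.append(row)
    return encoded

rows = [{"qty": 3, "color": "red"}, {"qty": 5, "color": "blue"}]
print(one_hot_encode(rows, "color"))
```

Each row now lives in a purely numeric space (qty, color=blue, color=red), which is what vector-space-based learners implicitly assume.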
• ##### Oracle Certified Professional Exam Prep
How I prepared for (and passed) the Oracle Certified Professional exam.
• ##### Beyond Recommender Systems: Alternative Applications of Matrix Factorization (Part 1)
For their utility in retail and business systems, Recommender Systems are one of the most popular applications of predictive analytics. A Recommender System is a process that leverages data from the behaviors of individuals and attempts to make personalized predictions of a targeted individual’s future behaviors. Well-known and now obvious applications of a Recommender System include predicting a customer’s preferences for products they may buy so that they can be recommended for purchase. Another well-known and studied application is the prediction of movies that are of interest to an individual based upon their own movie viewing habits and the viewing habits of other individuals that may share some common attitude or trait that indicates some shared interest in specific movies. Though there are several methods for implementing a Recommender System, Collaborative Filtering has become the de facto standard. Collaborative Filtering is an approach to Recommender Systems that is based on a machine learning technique called Matrix Factorization and was made famous within the data science community through its application in the one-million-dollar winning solution for […]
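The core of the Matrix Factorization approach mentioned above is approximating a sparse ratings matrix R as a product of two low-rank factor matrices, so that the missing cells get filled in with predictions. A minimal sketch by gradient descent (the matrix, rank, and hyperparameters below are illustrative, not from the post):

```python
import numpy as np

# Sketch: factor a small ratings matrix R ~= U @ V.T, training only on the
# observed (non-zero) entries; the zeros are then filled in as predictions.
rng = np.random.default_rng(0)
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0]])
mask = R > 0                     # only observed ratings contribute to the loss
k = 2                            # number of latent factors (illustrative)
U = rng.normal(scale=0.1, size=(R.shape[0], k))
V = rng.normal(scale=0.1, size=(R.shape[1], k))
lr, reg = 0.01, 0.01             # learning rate and L2 regularization

for _ in range(10000):
    E = mask * (R - U @ V.T)     # error on observed entries only
    U += lr * (E @ V - reg * U)
    V += lr * (E.T @ U - reg * V)

print(np.round(U @ V.T, 1))      # predictions; unobserved cells now estimated
```

The same idea scales to user-by-item matrices with millions of cells, where stochastic updates or alternating least squares replace the full-matrix loop used here.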
• ##### Updating Type II Slowly Changing Dimensions
In this blog post I will provide an example of how I implement Type II dimensions in my data warehouses. The process I go through when refreshing data into a Type II dimension is to first land the data from the source system. I always create a separate schema to hold the data, as it is presented from the source system with no transformations. The tables and columns are named exactly as they are in the source system. The basic idea here is to quickly off-load the data from the source system, thereby eliminating any long running or resource intensive processes from impacting the source application while providing a link back to the source system. Once all of the data is landed, I can now proceed to the transformation and staging step. In this phase, I create staging tables for all of my dimensions. These tables look similar to the dimension table but they do not have a primary key or a majority of audit columns. Here is an example of a regular dimension table […]
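After landing and staging, the Type II refresh itself boils down to expiring the current version of a changed row and inserting a new current version. A minimal sketch of that versioning step (the column names and values below are hypothetical, not the post's actual table design):

```python
from datetime import date

# Sketch of a Type II slowly changing dimension update: when a tracked
# attribute changes, expire the current row and append a new version.
dim_customer = [
    {"customer_sk": 1, "customer_id": 1415, "city": "Reston",
     "effective_date": date(2018, 1, 1), "end_date": None, "is_current": True},
]

def apply_type2_change(dim, customer_id, new_city, change_date):
    """Expire the current version and append a new current row."""
    current = next(r for r in dim
                   if r["customer_id"] == customer_id and r["is_current"])
    if current["city"] == new_city:
        return  # no change detected; nothing to version
    current["end_date"] = change_date
    current["is_current"] = False
    dim.append({
        "customer_sk": max(r["customer_sk"] for r in dim) + 1,
        "customer_id": customer_id,
        "city": new_city,
        "effective_date": change_date,
        "end_date": None,
        "is_current": True,
    })

apply_type2_change(dim_customer, 1415, "McLean", date(2019, 4, 10))
print(len(dim_customer))  # 2 rows: one expired version, one current
```

In a warehouse this logic runs as set-based SQL against the staging tables rather than row by row, but the expire-and-insert pattern is the same.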
• ##### Introduction to KNIME
Let’s say you’ve been tasked to pull data from a variety of data sources on a monthly basis, manipulating and processing it to create insights. At first one may look to a powerhouse software like Informatica, but what if you don’t really need something that heavy and your budget is extremely limited? You need an ETL tool to get you from point A to point B accurately and reasonably quickly, something that’s easy to learn but has the depth and flexibility for more advanced processing and customization. Enter KNIME. Up-front KNIME presents itself as a tool designed for data analysts, emphasizing visual clarity and usability over raw power; where Informatica will perform multiple transformation types in a single Expression node, KNIME has a node for each step. This may be a turnoff for some coming from enterprise-level software, but the enhanced control offered by KNIME is extremely valuable and can shave hours off your development and debugging. A major offering of KNIME is that it allows you to see how your data changes at every […]
• ##### Joining Fact Tables
Joining fact tables can be done, but there are some inherent risks, so you need to be careful when joining fact tables is required. In the following simple scenario, we have a fact of authors to articles and a separate fact of articles to pageviews. Our customer has asked for the ability to 1) find the authors who have provided the most content in a given time period, 2) find the articles which have the greatest number of pageviews in a given time period, and 3) find the authors with the highest number of pageviews in a given time period. Our data model can be seen here: The first two questions can be answered with a simple SELECT statement joining a single fact table with the date dimension and the other associated dimensions. TOP 10 AUTHORS BY ARTICLE COUNT TOP 10 ARTICLES BY MONTH The third requirement will compel us to join two fact tables to get the answer. Remember, to join a fact table you need at least […]
• ##### Types of Data Models
A conceptual model is a representation of a system, made of the composition of concepts which are used to help people know, understand, or simulate the subject the model represents.[i] In dimensional modeling this is the most abstract level of modeling. Here we are only worried about the entities (dimensions) and their relationships to each other (facts). No attributes are required in the conceptual model, as we are trying to work with the customer, who is not an expert in databases or data warehousing but does understand how the entities are related in their business. The conceptual model will help us ensure we have captured the full list of dimensions and facts. If we are missing an entity or a relationship, the customer will be able to provide us with that information. Now that we have completed a successful conceptual data model, we need to create a logical model. A logical model is a fully attributed conceptual model. The attributes are fully spelled out with no abbreviations. High-level data types (string, number, date, etc.) are […]

January 19, 2018
McLean, Virginia — 1-18-2018 — NuWave Solutions, a leader and innovator in Analytics and Data Management solutions, was selected to present on Self-Service Data Science at the Qlik Federal Tech Tuesday webcast on 23 January 2018. Brian Frutchey, Vice President at NuWave, will speak about how powerful predictive analytics can be made accessible to the “common man” by using technologies from our partners Celect and Qlik together. “NuWave has deep experience in self-service machine learning from Celect and self-service data exploration platform from Qlik, and has begun using them together for a variety of US Public Sector customers to democratize enterprise analytics, including the Army and Military Intelligence Community,” says Frutchey. “We are helping organizations accelerate their ability to apply machine learning to understand and predict increasingly complex behaviors, detect anomalies, and optimize outcomes. By working with our partners, Celect and Qlik, achieving these results can be accelerated even when an enterprise’s data is challenging and their staff lacks statisticians. This upcoming presentation will reveal some of the reasons why.” For more information […]
December 5, 2017

## NuWave Solutions Certified as a Great Workplace

McLean, Virginia — 11-13-2017 — NuWave was certified as a great workplace today by the independent analysts at Great Place to Work®. NuWave earned this credential based on extensive ratings provided by its employees in anonymous surveys. A summary of these ratings can be found at http://reviews.greatplacetowork.com/nuwave-solutions-llc. “We are honored to have received this recognition,” says Rob Castle, Chief Technology Officer at NuWave. “It’s gratifying to know how highly our employees feel about their workplace. Our workforce is the lifeblood of our company, and the foundation for success for ourselves and our customers.” “We are thrilled by this certification,” says Nora Claudias, Vice President of Corporate Resources at NuWave.  “One of our goals is to maintain a family-oriented corporate culture that nurtures trust, respect, growth and encouragement. Knowing we’re on track with that goal is a significant achievement.” “We applaud NuWave for seeking certification and releasing its employees’ feedback,” said Kim Peters, Executive Vice President of Great Place to Work’s Certification Program. “These ratings measure its capacity to earn its own employees’ trust and create […]