Data Modeling

  • Qlik Core: The Next Generation of Self-Serve Analytics
    Since its inception, we have all watched Qlik Sense grow. From desktop to cloud, from custom extensions to embedded visualizations to mashups using Enigma.js. Qlik Core is the next phase of this growth, and it is exactly what developers ordered. There’s no debating that Qlik Sense Desktop is an incredibly powerful tool, giving business analysts and decision makers everything they could want: easy ETL, data governance, security, self-service analytics, and more. Over the most recent years, Qlik Sense has upped the ante with new features, unlocking powerful new capabilities while maintaining the ease-of-use that makes Qlik so appealing. But. We’ve all had that one client who wants the power of Qlik with their branding, their look-and-feel, in their custom web application. Of course, there are mashups, and embedded visualizations, but you still have to work within the parameters of the Qlik Sense Enterprise Edition. While you get the power of Qlik, you also get the heavy architecture in the background, even if you don’t want it. It can also be hard to get the client […]
  • Leveraging Athena with KNIME in a Robust Manner, Part 2
    In my previous blog posting, I introduced an issue we were having with seemingly random intermittent failures using Amazon Web Services’ Athena backed by a large number of data files in S3. The issue was arising because S3 is eventually consistent, and occasionally queries were being executed before their underlying data files were fully materialized in S3. Our solution was to introduce try/catch KNIME nodes with a loop to retry failed queries a few times in case of intermittent failures. To do this, we had to do our own flow variable resolution in the Athena SQL queries, since the standard KNIME database SQL executor node does not do variable substitution when the query is configured via a flow variable. My previous blog posting (link to part 1 here) covered how we resolved embedded flow variable references in a Java snippet node. In this blog posting I am going to cover the workflow we use for the actual query retry logic, so that we retry each query a given number of times until it either succeeds […]
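    The retry pattern described here is built from KNIME try/catch and loop nodes, but the logic itself is easy to sketch in plain Python; below is a minimal, hypothetical stand-in (run_query and the failure message are illustrative, not the actual Athena node):

```python
import time

def run_with_retries(run_query, sql, max_attempts=3, delay_seconds=5):
    """Retry a query a fixed number of times, mirroring the
    try/catch-plus-loop pattern built from KNIME nodes."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query(sql)          # success: exit immediately
        except Exception as err:           # e.g. an S3 consistency failure
            last_error = err
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # brief pause before retrying
    raise last_error                       # all attempts failed

# Example: a hypothetical query that fails twice before succeeding.
calls = {"n": 0}
def flaky_query(sql):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("HIVE_SPLIT error: missing partition file")
    return [("row", 1)]
```

    The delay between attempts gives S3 time to reach consistency, which is usually all these intermittent failures need.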
  • Dynamic Looping in KNIME
    I recently came across a situation where I needed to load a couple of table partitions. I wanted to automate this process and have a table drive it. To accomplish this task, I started by creating a table of partitions that would drive the dynamic refresh process. My requirement was to be able to refresh the current month’s and the previous month’s data. (See the DDL for the partitioned fact table below.)

    CREATE TABLE FACT_BOOK_SALE (
        FBS_BOOK_KEY NUMBER NOT NULL ENABLE
       ,FBS_CSTMR_KEY NUMBER NOT NULL ENABLE
       ,FBS_STORE_KEY NUMBER NOT NULL ENABLE
       ,FBS_DATE_KEY NUMBER NOT NULL ENABLE
       ,FBS_BOOK_ISBN_13 NUMBER
       ,FBS_CSTMR_ID VARCHAR2(50 CHAR)
       ,FBS_STORE_ID NUMBER
       ,FBS_QTY NUMBER
       ,FBS_DSCNT_PCT NUMBER(6,3)
       ,FBS_SALE_PRICE NUMBER(19,4)
       ,PRIMARY KEY (FBS_BOOK_KEY, FBS_CSTMR_KEY, FBS_STORE_KEY, FBS_DATE_KEY)
    )
    PARTITION BY RANGE(FBS_DATE_KEY) (
        PARTITION FACT_BOOK_SALE_201901 VALUES LESS THAN (20190200),
        PARTITION FACT_BOOK_SALE_201902 VALUES LESS THAN (20190300),
        PARTITION FACT_BOOK_SALE_201903 VALUES LESS THAN (20190400),
        PARTITION FACT_BOOK_SALE_201904 VALUES LESS THAN (20190500),
        PARTITION FACT_BOOK_SALE_201905 VALUES LESS THAN (20190600),
        PARTITION FACT_BOOK_SALE_201906 VALUES LESS THAN (20190700),
        PARTITION FACT_BOOK_SALE_201907 VALUES LESS THAN (20190800),
        PARTITION FACT_BOOK_SALE_201908 VALUES LESS THAN (20190900),
        PARTITION FACT_BOOK_SALE_201909 VALUES LESS THAN (20191000),
        […]
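    The driving table of partitions can be computed rather than maintained by hand; here is a minimal Python sketch of the "current plus previous month" requirement, assuming the FACT_BOOK_SALE_YYYYMM naming scheme from the DDL above (the function name is my own):

```python
from datetime import date

def partitions_to_refresh(today):
    """Return the partition names for the previous and current month,
    matching the FACT_BOOK_SALE_YYYYMM naming scheme in the DDL."""
    year, month = today.year, today.month
    # Roll back to December of the prior year when the current month is January.
    prev_year, prev_month = (year, month - 1) if month > 1 else (year - 1, 12)
    return [
        f"FACT_BOOK_SALE_{prev_year}{prev_month:02d}",
        f"FACT_BOOK_SALE_{year}{month:02d}",
    ]
```

    The output of a function like this is exactly what the driving table holds, so the refresh loop never needs a code change as months roll over.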
  • Leveraging Athena with KNIME in a Robust Manner, Part One
    Recently, we began experiencing seemingly random intermittent failures in one of our modeling workflows used in VANE, an advanced predictive analytics project we are building for the US Army. These failures were occurring with varying frequency inside any one of several Database SQL Executor nodes. These nodes were performing a large number of SQL queries against Amazon Web Services’ Athena backed by a large number of data files in S3. The workflow was randomly failing in any of these nodes with a HIVE SPLIT error due to a missing partition file in S3. When we investigated the failure, we would invariably find the supposedly missing partition file, and restarting the workflow from the failed step would succeed. With a little bit of research, we learned that this is a known problem with Athena and S3 storage due to the nature of S3 being “eventually consistent”. Simply put, some of the Athena queries were being executed before their required data files were fully materialized in S3. With this newfound knowledge, we began to consider ways […]
  • Updating / Replacing Fact Records
    You will inevitably come across a time when a source system allows the updating of transaction records. This drives data modelers, ETL developers, analytics designers, and analytics users crazy. After all, the source system stated that this information was a “fact”, leading all to believe it would not change; now you must design for and resolve the changing fact data. In this blog post we will discuss the options and the impact these decisions will have on your warehouse and BI solution. Let’s start with the easy and move to the complex. The easiest thing to do when receiving data that causes an existing fact record to be updated is to simply update the measures in the existing record. Some people will argue that “you cannot just update the fact record. I have already had people use the data and now they will see a change that cannot readily be explained since the measure was overwritten.” Although this statement is true, most data warehouse project owners, sponsors and users understand that there can be […]
  • Hierarchies in Qlik Sense
    Understanding how Qlik’s Associative Engine handles hierarchies will improve your ability to render meaningful and flexible rollups that users can wield to draw insights at all levels of granularity. This article covers how Qlik hierarchies do not follow the same conventions as a normal (SQL) database and discusses how the hierarchy and ancestry load functions can be used in tandem to allow easy access to data at any granularity. Database Hierarchies Perhaps the most common way to store hierarchies in databases is through a Parent-Child Table. In such a table, there exists a row for each node in the hierarchy containing its ID and the ID of its parent. This format is widespread across most database technologies as it is simple, flexible, and updates quickly. Depending on the specific data, other formats may be used for better read/write performance. Perhaps because of its ubiquity, Qlik is built to utilize Parent-Child Tables. But it cannot provide a true hierarchical user experience without processing those tables through special functions. The Qlik Associative Engine Qlik cannot handle all […]
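    To make the parent-child idea concrete, here is a small Python sketch of the kind of expansion Qlik's hierarchy load functions perform, turning a parent-child table into full ancestry paths (the function and column layout are illustrative, not Qlik's implementation):

```python
def expand_hierarchy(rows):
    """Expand a parent-child table into ancestry paths, similar in spirit
    to what a hierarchy load function produces.
    rows: list of (node_id, parent_id, name); parent_id is None for roots."""
    parents = {nid: pid for nid, pid, _ in rows}
    names = {nid: name for nid, _, name in rows}
    expanded = []
    for nid in names:
        path, cur = [], nid
        while cur is not None:            # walk up to the root
            path.append(names[cur])
            cur = parents[cur]
        # Emit (node, full path from root, depth) for each node.
        expanded.append((nid, "/".join(reversed(path)), len(path)))
    return expanded
```

    Each output row carries the node's full lineage, which is what lets a front end roll data up or drill down at any level of granularity.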
  • AngularJS Basics: $scope
    AngularJS, for the uninitiated, is an open-source front-end framework that is maintained by Google and an active community. It has been around for some time, but the learning curve can be pretty steep, as AngularJS encompasses not one language but two: HTML and JavaScript. AngularJS puts the two of these together and exponentially increases your front-end capabilities compared to plain old HTML. If you’re familiar with HTML then that half of the equation is rather straightforward, but the JavaScript is where things take a turn with the inclusion of “$scope”. Scope as a concept typically defines the level of visibility of a variable or object, of which there are two kinds, local and global, but in plain JavaScript this is not explicitly called upon. In AngularJS, $scope is essentially the middleman between your HTML view and your JavaScript controller. It is viewable by both the HTML and the JavaScript and as such can be interacted with by both. In this sense we can say that $scope in Angular is always visible to its controller […]
  • Optimizing Qlik Load-Time
    I have recently been brought on to a project involving a Qlik implementation for a client. Among the requirements for the project, the client needs to have the data refreshed every hour. As with many Qlik implementations, the Qlik segment is the final step in the ETL process, with the data arriving in an Oracle DB first. By the time the ETL arrived at Qlik, only 30 minutes remained of the allotted hour. Here is the process for optimization that I followed. Status on arrival: The number of tasks in the QMC was about 40. There were about 30 apps in the front end that needed reloading, most of which would reload in under a minute. However, there were several apps loading tables of over 250 million records with dozens of fields, and those apps would take well over an hour. The total time to reload everything was about 6.5 hours. Load time needed drastic improvement. Below are some techniques for Extraction, Transformation & Loading improvement; I’ll touch on more Loading techniques in another post. […]
  • Late Arriving Dimensions
    In my last post I discussed late arriving facts, and although it is slightly annoying to receive delayed transaction data, the impact on the ETL and the accuracy of the data in the data warehouse is minimal. As we will discuss in this blog post, this is not the case for late arriving dimension (early arriving fact) data. As I have discussed throughout this blog series on designing and loading dimensional data warehouses, we always load data in the following order: lookup tables, then dimensions, and finally facts. But what happens when you get fact data before the dimension data arrives, or there is a backdated change in the dimensional data? These late arriving dimensions may be generated by a known business process, but they can cause a significant challenge to the accuracy of the data and test the skill and patience of the ETL developers and testers. Let’s take a look at the challenge the delay in receiving the dimensional data presents and the options we have to resolve these challenges. The most […]
    Yesterday I had to import an XML file into a series of relational database tables. I started by thinking of the best way to import this data. I could write a Python parser by hand and load the data that way, but I am a lazy programmer. I’m looking for something that is quick to deploy and easy to maintain. That is when I thought of my trusty, open-source ETL tool, KNIME. For this example, I will be using an XML version of the Shakespeare play Hamlet. Everything started off fine. I grabbed the XML Reader node and placed it on my canvas. I opened the node configuration tool and entered the location of the XML file. So far, so good. My task was to flatten the XML document into a denormalized table. I started out by finding all of the relevant tags I would need to make this output meet the requirements. OK, the play title tag occurs once, so I grab an XPATH node and configure it to grab the title of […]
  • Plotly.js: Getting Accurate Coordinates from Click Event
    Across StackExchange and the Plotly.js community forums, there seems to be one frequently asked question that has not yet been answered: how can you get accurate coordinates from a click event on a Plotly chart? Take your standard scatter plot: If I hover my mouse over (4, 14) and click, the current Plotly.js onClick event data returns data for each trace at that same x value. So, you would get (4, 17) for trace 0, (4, 11) for trace 1, and (4, 12) for trace 2. This is great if I’m trying to find out information about the traces, but it is not useful at all if I’m trying to get the point on the grid on which I clicked. Of course, with the event data you get all kinds of other somewhat useful information and functions. etpinard gets close to a solution in a codepen by using the xaxis and yaxis built-in p2c function. Unfortunately, this doesn’t exactly work, and there is little to no documentation on these conversion functions. There are […]
  • Late Arriving Facts
    Late arriving facts are those transactions which are delayed in arriving to the warehouse. For Type I data warehouses there is no real impact, but for Type II warehouse models there is a minor challenge that the ETL developer must remember to handle to ensure the late arriving transaction corresponds with the dimensional attributes at the time of the transaction. If we have a Type I warehouse, or are always assured that we only receive fact data at the same time as the dimensional data, then we can use the following simple query, where we find the current active record for the dimension’s natural ID found in the fact record:

    SELECT dimensional_key
      FROM dimension_table
     WHERE dimension_natural_id = {natural_id from fact}
       AND dimension_actv_rcrd_fl = 1;

    Taking a look at the Type II data warehouse challenge, we cannot assume that the active dimensional record is the correct record for the fact. Therefore, we need to modify the ETL workflow process (Figure 1) to address the challenge of the possible changing of dimensional data since the “old” […]
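    The Type II variation replaces the active-record flag with a date-range match against the fact's transaction date; here is a small runnable sketch using SQLite (table and column names are illustrative):

```python
import sqlite3

# A minimal Type II dimension: two versions of the same natural ID,
# each covering a date range (column names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimension_table (
    dimensional_key INTEGER, dimension_natural_id TEXT,
    rcrd_actv_dt TEXT, rcrd_inactv_dt TEXT);
INSERT INTO dimension_table VALUES
    (1, 'C100', '2019-01-01', '2019-03-31'),
    (2, 'C100', '2019-04-01', '9999-12-31');
""")

def lookup_key(natural_id, transaction_date):
    """Find the dimension record in effect at the transaction date,
    rather than the currently active record."""
    row = conn.execute("""
        SELECT dimensional_key FROM dimension_table
         WHERE dimension_natural_id = ?
           AND ? BETWEEN rcrd_actv_dt AND rcrd_inactv_dt""",
        (natural_id, transaction_date)).fetchone()
    return row[0] if row else None
```

    A late arriving February transaction picks up key 1 here even though key 2 is the active record, which is exactly the correction the Type II lookup exists to make.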
  • Using COPY command to Load CSV files from S3 to RDS Postgres Database via KNIME Workflow
    The Postgres COPY command is the most efficient way to load CSV data into a Postgres database. RDS presents a small challenge to the use of this functionality, since you cannot use a local filesystem in RDS. In this blog post I will walk you through the steps to connect your Postgres RDS database to an S3 filesystem and load CSV files to this database via a KNIME workflow. First, we need to add the aws_s3 extension to the Postgres database by executing the following command from the PG Admin tool: CREATE EXTENSION aws_s3 CASCADE; Next, we need to create an IAM Policy to access Amazon S3 resources and create an IAM Role to allow access to AWS services. Now that we have a role associated with the IAM access policy, we need to grant it to the database. To do this, log on to the AWS console, go to the RDS page, and select the database you wish to have this functionality. On that page scroll down to Manage IAM Roles (see Figure 1) and […]
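    Once the extension and role are in place, the import itself is a SQL call; the sketch below builds that statement in Python (the aws_s3.table_import_from_s3 signature shown reflects my reading of the AWS documentation and should be verified against your engine version):

```python
def build_s3_import_sql(table, bucket, key, region, options="(format csv)"):
    """Build the aws_s3.table_import_from_s3 call used to pull a CSV
    object from S3 into an RDS Postgres table. The empty second argument
    means 'all columns'; options are passed through to COPY."""
    return (
        "SELECT aws_s3.table_import_from_s3("
        f"'{table}', '', '{options}', "
        f"aws_commons.create_s3_uri('{bucket}', '{key}', '{region}'));"
    )
```

    In the KNIME workflow, a statement like this can be issued from a database SQL executor node once per CSV file, with the bucket and key supplied by flow variables.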
  • URL Tips and Tricks
    URL manipulation is one of those things you may not even think about until it becomes the biggest headache of your week. Thankfully, KNIME has some things we can use to make life easier. Here are some of the common situations you could encounter when working with URLs in KNIME and how to handle them. The Workflow Needs to Run in Any System’s KNIME Directory Perhaps the most common issue you’ll run into: most of the time you won’t want a KNIME workflow to run only on your machine or server. You still want it to use a specific file structure beyond a certain point, but you don’t want to hard-code everything before that. KNIME has a remarkably simple way of letting you deal with this in the form of standard variable keywords. Take this Create Directory node as an example: This syntax uses two such variables, “knime://” and “knime.workflow”. The former, “knime://”, is the system’s KNIME mount point, specified when you open your workspace in the editor. This mount point almost entirely […]
  • Batch Execution of KNIME Workflows, Part 2
    In an earlier blog posting I walked through the steps required to install KNIME in a Docker container so that it could be used to run workflows in a batch mode. This is a very useful technique for automating workflow execution and leveraging cloud resources to do so. However, most workflows are not self-contained: they need access to configuration, external data, and local storage. I did not cover those aspects in my first posting, so this blog entry will introduce ways to pass in configuration and support the batch execution of KNIME workflows in a container. Setting Flow Variables The most common way to supply external configuration to a KNIME workflow is to set flow variables from the command line when KNIME is run in batch mode. This is done using the -workflow.variable option. Each instance of this option sets the value for a single variable, so it may be included any number of times on the command line. The format for the option value is as follows: -workflow.variable=[Variable Name]:[Value]:[Type] The [Variable Name] is simply […]
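    Assembling these options by hand is error-prone, so a small helper can format them; here is a Python sketch following the [Variable Name]:[Value]:[Type] format above (the helper name and the sample variables are my own):

```python
def workflow_variable_args(variables):
    """Format KNIME batch-mode -workflow.variable options from a dict
    mapping variable name -> (value, KNIME type), one option per variable."""
    return [
        f"-workflow.variable={name}:{value}:{vtype}"
        for name, (value, vtype) in variables.items()
    ]

# Hypothetical configuration for one container run.
args = workflow_variable_args({
    "input_dir": ("/data/in", "String"),
    "max_rows": (1000, "int"),
})
```

    The resulting list can be appended to the KNIME batch command line (or a docker run invocation) without worrying about the delimiter format.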
  • Bulletproof Your KNIME Workflows with System Property Variables and External Config Files
    KNIME workflows can contain a lot of code and, like all good programmers, I always want to eliminate as many code changes as possible. To accomplish this optimization, I rely on the use of table-driven processes, external property files and system property variables. The use of these techniques can reduce code changes and ensure that workflows will work on different computers that are not set up identically. I’d like to begin by discussing table-driven solutions. For this type of solution, I use the data in the table to drive the change. This can be done by simply changing the value of an attribute. For example, if I wanted to ensure a document was processed, I would add a flag (doc_processed_fl) to the document table with a default value of FALSE and have the program set the value to TRUE upon the completion of the process. This way I am assured the record will be processed. In the event of a catastrophic failure (network error, database error, etc.) causing the workflow to be halted, the flag […]
  • Automated Execution of Multiple KNIME Workflows
    When using the open source KNIME Analytics Platform to build sophisticated data processing and analysis pipelines, I often find myself building workflows that consist of so many nodes they become difficult to manage.  This complexity can be managed by grouping nodes into processing stages and then bundling those stages into meta-nodes so that the overall workflow layout is easier to follow. However, I’ve found that this approach still leaves workflows unwieldy to work with as you still have to open the meta-nodes to explore and troubleshoot their processing.  Over the years I’ve worked with KNIME, I’ve developed a habit of breaking larger workflows up into smaller individual workflows representing each processing stage in the overall pipeline.  This serves to make building and debugging each processing step much more tractable at the cost of requiring more storage for persisting the outputs of one stage so that they may be used as the inputs of the next stage. Another small drawback of separate workflows is that they all need to be executed in order for the overall […]
  • Serverless Analysis of data in Amazon S3 using Amazon Athena through KNIME.
    This blog describes how to perform serverless analysis of data in Amazon S3 using Amazon Athena through KNIME. Let’s start with a quick introduction to Amazon Athena. What is Athena? Athena is a serverless query service for querying data in S3 using standard SQL, with no infrastructure to manage. It supports ANSI SQL queries with support for joins, JSON and window functions. How to connect to Athena and execute SQL from KNIME? KNIME can interface with Athena using the following nodes. How to do analysis of data in S3 using Athena through KNIME? A traditional approach is to download entire files from S3 to KNIME using a node such as the Parquet Reader. This approach works fine if the file is relatively manageable in size. But if you have large files in S3, then this download approach will consume lots of time and local memory, and processing will be slow because of the data volume. Athena can help with these issues. Since Athena supports SQL processing on top of S3, all heavy-lift data processing such as joins, […]
  • Batch Execution of KNIME Workflows Using Docker
    In my previous blog posting I introduced our service management platform Dex, which is at the core of many of our advanced analytics solutions. Dex seamlessly manages the coordination of the data acquisition, ingestion, transformation and modeling for our system via processing steps that are often deployed as Docker containers for execution on either dedicated servers, EC2 instances, or as AWS Fargate tasks. One of the core applications we leverage for many of these steps is the open source KNIME Analytics Platform. KNIME’s easy-to-use GUI allows us to quickly develop and test ETL and modeling workflows. Once these workflows are done, we then deploy them inside of Dex to be run inside Docker containers using KNIME’s batch execution mode. The purpose of this article is to provide a guide to building a Docker container to execute KNIME workflows in a headless environment. This article details a very basic setup for the Docker container and does not cover things like user accounts and security credentials which may be required for a more […]
  • Using the Python Node in KNIME to Perform Granger Causality Tests
    KNIME includes a wide variety of data manipulation and data science nodes, but there are a few cases where you may need to embed specific Python routines inside a workflow. More often, I test algorithms in Python but do not want to put the effort into orchestrating the data access, iterations, and parallel processing in Python. KNIME has several Python nodes, but first you will need to set up your Python environments. Setting up Python Environments KNIME requires you to have Python 2 and Python 3 environments running locally. If you have Anaconda 3.x, you will need to create a Python 2 environment with numpy and pandas installed. From the Anaconda prompt, run the following:

    conda install nb_conda_kernels
    conda create -n py27 python=2.7 ipykernel
    activate py27
    pip install numpy
    pip install pandas

    You cannot point to the Python executables directly if you are running on Windows, and will need to create batch files to run the environments. The .bat files for the Python 3 (base) and Python 2 (py27) environments: @REM Adapt the folder in the PATH to […]
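    For the Granger test itself, statsmodels' grangercausalitytests is the usual tool; as a rough illustration of what the test computes, here is a from-scratch numpy sketch comparing a restricted model (y's own lags) against an unrestricted one that adds lags of x:

```python
import numpy as np

def granger_f(y, x, lags=2):
    """Minimal Granger-causality F-statistic: do lags of x improve the
    prediction of y beyond y's own lags? (A from-scratch sketch; in
    practice use statsmodels' grangercausalitytests.)"""
    y, x = np.asarray(y, float), np.asarray(x, float)
    n = len(y) - lags
    Y = y[lags:]
    # Build the lag matrices: column i holds the series shifted by i steps.
    lag_y = np.column_stack([y[lags - i:len(y) - i] for i in range(1, lags + 1)])
    lag_x = np.column_stack([x[lags - i:len(x) - i] for i in range(1, lags + 1)])
    ones = np.ones((n, 1))

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        resid = Y - X @ beta
        return resid @ resid

    rss_r = rss(np.hstack([ones, lag_y]))          # restricted: own lags only
    rss_u = rss(np.hstack([ones, lag_y, lag_x]))   # unrestricted: + x lags
    k = 1 + 2 * lags                               # params in unrestricted model
    return ((rss_r - rss_u) / lags) / (rss_u / (n - k))
```

    A large F in one direction and a small F in the other is the pattern that suggests x Granger-causes y and not the reverse; the same comparison is what the Python node would run inside the workflow.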
  • 5 Tips and Workarounds for Angular/JS Extension Development
    This article provides a few small tips and tricks to ease some of the headaches of developing Angular/JS extensions for Qlik. 1. When using jQuery, make sure you are utilizing unique selectors for a given instance of the extension. Multiple instances of an extension can exist on a single sheet. If using jQuery, it is easy to select (and accidentally edit) all instances of an extension on a sheet. To prevent this, each instance of the extension should have its own id. More advanced users may want to consider using the objectId that Qlik assigns to visualizations, but those can be difficult to track down. We at NuWave have found an easy method to utilize unique selectors: generate a random number, place it in a wrapper div in the HTML, then use that number for all selections. 2. Use Validated and a $broadcast() (with a timeout) for re-drawing in Angular extensions. When creating an extension that needs to redraw (i.e., on selection or property changes), Qlik recommends using the Validated event. Qlik’s documentation describes this […]
  • Factless Fact Tables
    Factless facts are those fact tables that have no measures associated with the transaction. Factless facts are a simple collection of dimensional keys which define the transaction or describe a condition for the time period of the fact. You may question the need for a factless fact, but they are important dimensional data structures which capture important information that can be leveraged into rollup measures or presented to a user. The most common example used for factless facts is student attendance in a class. As you can see from the dimensional diagram below, the FACT_ATTENDANCE is an amalgamation of the DATE_KEY, the STUDENT_KEY, and the CLASS_KEY. As you can see, there is nothing we can measure about a student’s attendance at a class. The student was there and the attendance was recorded, or the student was not there and no record was recorded. It is a fact, plain and simple. There is a derivation of this fact where you can always load the full roster of individuals registered for the class and add a […]
  • PostgreSQL – Why You’d Want to Use It
    If you’ve been in this industry for a few years, then you probably know what SQL is. It’s the golden standard for working with databases, and nearly every modern coding language interfaces with it. Though simple, it’s very flexible, and different variants of it, like Oracle SQL and MySQL, allow for more robust functionality. PGSQL, or PostgreSQL, is another one of these variants, but it takes things one step further and even one step after that. Let’s start with what exactly PGSQL is and is not. PGSQL is a variant of SQL and uses most of the same syntax. It’s open source and has been in development for decades now; if you’ve been around the block, you may even remember its predecessor, Ingres. It can do all the same things that traditional SQL databases can do, following the same rules and structure. It is not a reinvention of the wheel, more a different take on it. It is not specifically NoSQL, though it can be used for it, and it doesn’t require […]
  • Structured Query Language Basics
    Structured Query Language (SQL) is the standard language used for retrieving and manipulating database information. In today’s blog, I will be focusing on how we can use this language to simply get an organized result set back that answers a specific question. This will be accomplished by using the Select, From, Where, Group By, Having, and Order By clauses. We will be using the EMPLOYEES table in the HR schema from Oracle Live as our practice table. You can find the structure of the table in the image below. This can be accessed for free. All you need is an Oracle account; log in with those credentials, go to this link (Oracle Live) and select “Start Coding Now”. SELECT: All SQL statements start with the SELECT. This is how the query knows what information you want to return. The SELECT is where you place the column names, how many columns, column aliases, and aggregations you want to see in your result. If you don’t know the structure of the table in advance you can place an […]
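    The clause order described above can be tried without an Oracle account; here is a small SQLite stand-in (the table is a simplified, hypothetical version of HR.EMPLOYEES):

```python
import sqlite3

# A tiny stand-in for the HR EMPLOYEES table (columns abbreviated) to
# demonstrate SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY together.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (employee_id INT, department_id INT, salary NUMERIC);
INSERT INTO employees VALUES
    (100, 90, 24000), (101, 90, 17000),
    (102, 60, 9000),  (103, 60, 6000), (104, 50, 2600);
""")

rows = conn.execute("""
    SELECT department_id, COUNT(*) AS headcount, AVG(salary) AS avg_salary
      FROM employees
     WHERE salary > 2000            -- filters rows before grouping
     GROUP BY department_id
    HAVING COUNT(*) > 1             -- filters groups after aggregation
     ORDER BY avg_salary DESC
""").fetchall()
```

    Note the division of labor: WHERE removes individual rows, HAVING removes whole groups, and ORDER BY sorts the final result.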
  • Period-To-Date Facts
    We have all come across times when our customer wants to know how the organization is currently doing. They often want to know how they are measuring up against this time last year or against a projected measure. The most common examples of these requests are a year-to-date calculation and a budget vs. actual analysis. In this blog post I will describe how to efficiently address these common business requests. Year-To-Date (YTD) Calculations Our customer has stated that they wish to show a YTD metric for sales which can be broken down into quarterly and monthly metrics as well. In all of my other blog posts I have modeled some dimension or fact to show you how to address the requirement at hand; in this scenario I will not. The reason is that we have already done the work to address this requirement in previous blogs. Let’s look at how we already addressed this issue with an example. In the star schema shown above we can see that we are storing purchase transactions by customer, product, […]
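    Because the fact is stored at transaction grain, a YTD metric is just a date-constrained rollup; here is a minimal SQLite sketch (the schema and names are illustrative, not the post's star schema):

```python
import sqlite3

# A toy transaction-grain fact keyed by a YYYYMMDD date key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sale (date_key INT, sale_amount NUMERIC);
INSERT INTO fact_sale VALUES
    (20180915, 100), (20190110, 200), (20190305, 300), (20190802, 400);
""")

def ytd_sales(as_of_key):
    """Sum sales from January 1 of the as-of year through the as-of date;
    no extra modeling needed, just a range constraint on the date key."""
    year_start = (as_of_key // 10000) * 10000 + 101   # e.g. 20190101
    row = conn.execute(
        "SELECT COALESCE(SUM(sale_amount), 0) FROM fact_sale "
        "WHERE date_key BETWEEN ? AND ?", (year_start, as_of_key)).fetchone()
    return row[0]
```

    Quarterly or monthly breakdowns follow the same pattern with a tighter date range, which is why the existing star schema already satisfies the requirement.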
  • Angular vs React
    If you’ve done any sort of front-end programming, you know there are two front-end development frameworks that have emerged as the front runners in the market: React and Angular. Whether you’re just starting in on a development effort or are interested in which may be best to learn to improve your marketability, this article will outline the key differences between the two frameworks and identify recent market trends to help you make a decision. Angular7, or more familiarly Angular, is a full-fledged MVC framework championed by Google. It includes many OOTB features including routing, unit testing, forms, and AJAX requests. With these built-in features comes the strong suggestion on how to structure your application. When building an Angular app, you [generally] use their router, their forms, their requests, and you [generally] do it in their style. React.js is technically a library (not a framework) that is built and maintained by Facebook. It is more lightweight than Angular, but lacks the built-in features listed above. If you want to incorporate form validation into your React app, […]
  • Loading Accumulating Snapshot Fact Tables
    Often management looks for bottlenecks in corporate processes so that they can be streamlined or used as a measurement of success for the organization. In order to achieve these goals we need to measure the time between two or more related events. The easiest way to report on this time-series process is to use accumulating snapshot facts. Accumulating snapshot facts are updatable fact records used to measure the time between two or more related events. The most common example of this type of fact can be seen in order processing. Let’s take a look! Order processing consists of many serialized processes. First there is the order, followed by the product creation or packaging of the items contained within the order. Once the items are packed, they are sent to QA for a double check of the order before sending the package to shipping, where the carrier will pick up and deliver the product to the customer. In this process we have many areas where the process could get held up (abandoned cart, missing inventory, slow QA process, […]
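    The lag measures such a fact supports amount to date arithmetic between milestone columns; here is a small Python sketch (the milestone names are illustrative):

```python
from datetime import date

def milestone_lags(order):
    """Compute day counts between consecutive milestones of one
    accumulating-snapshot record; a None date means 'not yet reached'.
    order: dict of milestone name -> date (or None), in process order."""
    names = list(order)
    lags = {}
    for earlier, later in zip(names, names[1:]):
        d0, d1 = order[earlier], order[later]
        # Only compute the lag once both milestones have occurred.
        lags[f"{earlier}_to_{later}"] = (d1 - d0).days if d0 and d1 else None
    return lags
```

    As each milestone date arrives, the fact record is updated in place and the lag columns are recomputed, which is what makes the bottleneck reporting trivial downstream.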
  • Building Qlik Sense Extensions that can Export and Snapshot
    The Fluff If you’ve ever built an extension in Qlik Sense focused on data visualization, you know how cool it is to harness the power of the Qlik associative engine. It is important for such visualizations to integrate with the user experience and really feel like Qlik, both in style and functionality. Yet many complex extensions suffer functionality drawbacks from failing to overcome two major hurdles: the ability to export to PDF/Image through the right-click menu, and the ability to function as a snapshot within a Qlik Data Story. Though these issues have the same root cause, that cause proved incredibly difficult for me to track down. It boils down to some undocumented changes in a couple of key Qlik JavaScript objects. To see the result of my research and for instructions on how to handle these issues correctly, skip down to The Stuff below. My path started by trying to “turn on” the export functionality for a table-like extension I was developing. The Qlik help pages helpfully told me to add a property named “export” to […]
  • Client Identification Using Custom Request Headers in Spring Boot
    One of the key building blocks of NuWave’s advanced predictive analytics solutions is a service management platform we built called Dex, which is short for Deus Ex Machina. Dex was built as a collection of microservices using Spring Boot and is responsible for coordinating the execution of complex workflows which take data through acquisition, ingestion, transformation, normalization, and modeling with many different advanced machine learning algorithms. These processing steps are performed with a large number of technologies (Java, Python, R, KNIME, AWS SageMaker, etc.) and are often deployed as Docker containers for execution on either dedicated servers, EC2 instances, or as AWS Fargate tasks. Dex seamlessly manages the coordination of the execution of all these ephemeral tasks across these various technologies while also providing runtime configuration, secure credential management, state management, and storage provisioning for each executing task. This has led to the affectionate nickname of “Cat Herder” for this critical technology at the heart of our advanced analytical solutions. With such a complex processing workflow spread across multiple underlying technologies, we faced the challenge of […]
  • Loading Transaction Fact Tables
    This blog post will focus on loading transaction fact tables; subsequent posts for periodic and accumulating snapshots will follow in the coming weeks. Loading fact tables is very different than loading dimensions. First, we need to know the type of fact we are using. The major types of facts are transaction, periodic snapshot, accumulating snapshot and time-span accumulating snapshots. We also need to know the grain of the fact: the dimensional keys which are associated with the measurement event. Let’s say we want to measure product sales by customer, product, and date attributes. The source transaction system may provide the following associated dimensions: DIM_CUSTOMER, DIM_PRODUCT, and DIM_DATE. Here is an example of a set of product sale transaction records:

    Product_ID  Customer_ID  Transaction_Date  QTY  EXT_Price  Discount  Total
    55          1415         4/10/2019 10:01     3      38.97         5  37.02
    72          1415         4/10/2019 10:01     5      39.95         5  37.95
    120         1415         4/10/2019 10:01     1      12.99         5  12.34
    17          3457         4/10/2019 10:09     3      44.97         0  44.97
    57          8322         4/10/2019 10:10     6      35.94        10  32.35
    106         1194         4/10/2019 10:17     3      23.97         0  23.97
    […]
  • Data Preparation for Machine Learning: Vector Spaces
    Machine learning algorithms often rely on certain assumptions about the data being mined. Ensuring the data meet these assumptions can be a significant portion of the preparation work before model training and prediction begin. When I began my data science journey, I was blissfully unaware of this and thought my preparation was done simply because I had stuffed everything into a table. Feeding these naïvely compiled tables into learners, I wondered why some algorithms never seemed to perform well for me. As I began digging into the algorithms themselves, many referred to vector space operations as being fundamental to their function. Answering the simple question of “what is a vector space” took me on quite an enlightening journey. Vector spaces are really a neat tool for organizing data, and using them requires more than getting all your data into a table. Perhaps most mind-bending is that a vector space can be constructed from anything, even if mortals like myself tend to just use numbers and categorical values. I believe part of the reason I was late to understanding […]
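As a tiny illustration of moving data into a vector space, here is a sketch (my own toy example, not from any particular library) of one-hot encoding a categorical value into a numeric vector that distance-based learners can work with:

```python
# Minimal sketch: mapping categorical values into a vector space via
# one-hot encoding. The category list is an assumed example.
categories = ["red", "green", "blue"]
index = {c: i for i, c in enumerate(categories)}

def one_hot(value):
    """Return a vector with a 1.0 in the position for this category."""
    vec = [0.0] * len(categories)
    vec[index[value]] = 1.0
    return vec

print(one_hot("green"))  # [0.0, 1.0, 0.0]
```

Each category becomes its own axis, so "red" and "blue" are equally distant from each other rather than accidentally ordered.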
  • Oracle Certified Professional Exam Prep
    How I prepared for (and passed) the Oracle Certified Professional exam.
  • Beyond Recommender Systems: Alternative Applications of Matrix Factorization (Part 1)
    For their utility in retail and business systems, Recommender Systems are one of the most popular applications of predictive analytics. A Recommender System is a process that leverages data from the behaviors of individuals and attempts to make personalized predictions of a targeted individual’s future behaviors. Well-known and now obvious applications of a Recommender System include predicting a customer’s preferences for products they may buy so that they can be recommended for purchase. Another well-known and studied application is the prediction of movies that are of interest to an individual based upon their own movie viewing habits and the viewing habits of other individuals that may share some common attitude or trait that indicates some shared interest in specific movies. Though there are several methods for implementing a Recommender System, Collaborative Filtering has become the de facto standard. Collaborative Filtering is an approach to Recommender Systems that is based on a machine learning technique called Matrix Factorization and was made famous within the data science community through its application in the one-million-dollar winning solution for […]
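To give a feel for the technique, here is a toy sketch of matrix factorization trained with stochastic gradient descent on a handful of observed ratings. The matrix sizes, learning rate, and regularization are arbitrary illustrative choices, not the approach used in the Netflix-winning solution:

```python
import random

# Toy matrix factorization for collaborative filtering: approximate a
# ratings matrix R (users x items) as U @ V^T, learned by stochastic
# gradient descent over the observed entries only.
random.seed(0)
R = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (2, 1): 1.0}  # observed ratings
n_users, n_items, k = 3, 2, 2

U = [[random.uniform(0, 0.1) for _ in range(k)] for _ in range(n_users)]
V = [[random.uniform(0, 0.1) for _ in range(k)] for _ in range(n_items)]

lr, reg = 0.05, 0.02
for _ in range(2000):
    for (u, i), r in R.items():
        pred = sum(U[u][f] * V[i][f] for f in range(k))
        err = r - pred
        for f in range(k):
            uf, vf = U[u][f], V[i][f]          # use old values for both updates
            U[u][f] += lr * (err * vf - reg * uf)
            V[i][f] += lr * (err * uf - reg * vf)

pred = sum(U[0][f] * V[0][f] for f in range(k))
print(pred)  # close to the observed rating of 5.0
```

Once trained, the same dot product over U and V produces predictions for the user/item pairs that were never observed, which is where the recommendations come from.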
  • Updating Type II Slowly Changing Dimensions
    In this blog post I will provide an example of how I implement Type II dimensions in my data warehouses. The process I go through when refreshing data into a Type II dimension is to first land the data from the source system. I always create a separate schema to hold the data as it is presented from the source system, with no transformations. The tables and columns are named exactly as they are in the source system. The basic idea here is to quickly off-load the data from the source system, preventing any long-running or resource-intensive processes from impacting the source application while providing a link back to the source system. Once all of the data is landed, I can proceed to the transformation and staging step. In this phase, I create staging tables for all of my dimensions. These tables look similar to the dimension table, but they do not have a primary key or most of the audit columns. Here is an example of a regular dimension table […]
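As a hedged sketch of the Type II refresh logic: when a tracked attribute changes, the current dimension row is expired and a new version is inserted. Column names such as effective_date, end_date, and is_current are illustrative assumptions, and the "dimension" here is just an in-memory list:

```python
from datetime import date

# Illustrative Type II update: expire the current row, insert a new version.
dimension = [
    {"customer_key": 1, "customer_id": 1415, "region": "East",
     "effective_date": date(2018, 1, 1), "end_date": None, "is_current": True},
]

def apply_type2(staged_row, today):
    current = next(r for r in dimension
                   if r["customer_id"] == staged_row["customer_id"]
                   and r["is_current"])
    if current["region"] != staged_row["region"]:
        current["end_date"] = today            # expire the old version
        current["is_current"] = False
        dimension.append({                     # insert the new version
            "customer_key": max(r["customer_key"] for r in dimension) + 1,
            "customer_id": staged_row["customer_id"],
            "region": staged_row["region"],
            "effective_date": today, "end_date": None, "is_current": True,
        })

apply_type2({"customer_id": 1415, "region": "West"}, date(2019, 4, 10))
print(len(dimension))  # 2: the expired East version and the current West version
```

Facts loaded before the change keep pointing at the old surrogate key, which is exactly how Type II preserves history.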
  • Introduction to KNIME
    Let’s say you’ve been tasked to pull data from a variety of data sources on a monthly basis, manipulating and processing it to create insights. At first one may look to powerhouse software like Informatica, but what if you don’t really need something that heavy and your budget is extremely limited? You need an ETL tool to get you from point A to point B accurately and reasonably quickly, something that’s easy to learn but has the depth and flexibility for more advanced processing and customization. Enter KNIME. Up front, KNIME presents itself as a tool designed for data analysts, emphasizing visual clarity and usability over raw power; where Informatica will perform multiple transformation types in a single Expression node, KNIME has a node for each step. This may be a turnoff for some coming from enterprise-level software, but the enhanced control offered by KNIME is extremely valuable and can shave hours off your development and debugging. A major offering of KNIME is that it allows you to see how your data changes at every […]
  • Joining Fact Tables
    Joining fact tables can be done, but there are some inherent risks, so you need to be careful when joining fact tables is required. In the following simple scenario, we have a fact of authors to articles and a separate fact of articles to pageviews. Our customer has asked for the ability to 1) find the authors who have provided the most content in a given time period, 2) find the articles which have the greatest number of pageviews for a given time period, and 3) find the authors with the highest number of pageviews in a given time period. Our data model can be seen here: The first two questions can be answered with a simple select statement joining a single fact table with the date dimension and the other associated dimensions to retrieve the answer. TOP 10 AUTHORS BY ARTICLE COUNT TOP 10 ARTICLES BY MONTH The third requirement will compel us to join two fact tables to get the answer. Remember, to join a fact table you need at least […]
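The fan-out risk can be illustrated with a small SQLite example (the table and column names are my own inventions, not the actual model above): each fact is aggregated to the shared article grain before the join, so pageviews are not double-counted when an author has several articles:

```python
import sqlite3

# Illustrative two-fact join: summarize pageviews per article first,
# then join to the authorship fact and roll up to the author.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_authorship (author_id INT, article_id INT);
CREATE TABLE fact_pageviews (article_id INT, views INT);
INSERT INTO fact_authorship VALUES (1, 10), (1, 11), (2, 12);
INSERT INTO fact_pageviews VALUES (10, 100), (10, 50), (11, 30), (12, 40);
""")
rows = con.execute("""
SELECT a.author_id, SUM(v.total_views) AS views
FROM fact_authorship a
JOIN (SELECT article_id, SUM(views) AS total_views
      FROM fact_pageviews GROUP BY article_id) v
  ON v.article_id = a.article_id
GROUP BY a.author_id
ORDER BY views DESC
""").fetchall()
print(rows)  # [(1, 180), (2, 40)]
```

Joining the two facts row-for-row without the inner aggregation would multiply measures wherever the grains differ; aggregating to a common grain first avoids that.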
  • Types of Data Models
    A conceptual model is a representation of a system, made of the composition of concepts which are used to help people know, understand, or simulate the subject the model represents.[i] In dimensional modeling this is the most abstract level of modeling. Here we are only worried about the entities (dimensions) and their relationships to each other (facts). No attributes are required in the conceptual model, as we are working with customers who are not experts in databases or data warehousing but who do understand how the entities are related in their business. The conceptual model will help us ensure we have captured the full list of dimensions and facts. If we are missing an entity or a relationship, the customer will be able to provide us with that information. Now that we have completed a successful conceptual data model, we need to create a logical model. A logical model is a fully attributed conceptual model. The attributes are fully spelled out with no abbreviations. High-level data types (string, number, date, etc.) are […]
  • Data Warehouse Design Techniques – Constraints and Indexes
    In this week’s blog, we will discuss constraints and indexes. In data warehousing, like in life, constraints are things we love to hate. Constraints keep us from making mistakes, which in most cases is a good thing, until we come across an anomaly which needs to be addressed but the constraints prevent this from happening. Most of the time indexes help us to find data faster, but that is not always the case.

    What are Indexes?

    Indexes are data structures which hold field values from the indexed column(s) and pointers to the related record(s). This data structure is then sorted, and binary searches are performed to quickly find the record.

    When should I Use Indexes?

    In general, you use indexes when you have a large data set. Small lookup or reference tables are bad candidates for indexes. In most cases the database can more efficiently select all of the information from the table without ever using the index, making the index useless. As a rule of thumb, let the SQL be your guide. […]
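To see an index change the access path, here is a small sketch using SQLite as a stand-in for a warehouse database (the table and index names are made up for illustration):

```python
import sqlite3

# Show the query plan before and after adding an index on a filter column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, customer_id INT, total REAL)")
con.executemany("INSERT INTO sales (customer_id, total) VALUES (?, ?)",
                [(i % 100, i * 1.0) for i in range(1000)])

plan_before = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE customer_id = 42").fetchall()

con.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")

plan_after = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE customer_id = 42").fetchall()

print("SCAN" in str(plan_before).upper())               # full table scan before
print("IDX_SALES_CUSTOMER" in str(plan_after).upper())  # index used after
```

The same query goes from scanning every row to searching the sorted index structure, which is the binary-search behavior described above.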
  • Data Warehouse Design Techniques – Aggregates
    In this week’s blog, we will discuss how to optimize the performance of your data warehouse by using aggregates.

    What are Aggregates?

    Aggregates are the summarization of fact-related data for the purpose of improved performance. There are many occasions when the customer wants to be able to quickly answer a question where the data is at a higher grain than we collect. To avoid slow responses caused by summing data up to higher levels at query time, we can pre-calculate these requests, storing the summarized data in a separate star. Aggregates can be considered to be conformed fact tables, since they must provide the same query results as the detailed fact table.

    When should I use Aggregates?

    Aggregates should be considered when the number of detailed records to be processed is large and/or the processing of the customer queries begins to impact the performance of the BI application. Customers of BI applications expect an application which is very responsive to their queries. Aggregates are the way to meet this implied requirement.

    Aggregate Example […]
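A minimal sketch of the idea (the schema is an assumption, using SQLite for illustration): pre-compute a monthly aggregate from a daily-grain fact, then confirm it returns the same total as summing the detail, which is the "conformed" requirement above:

```python
import sqlite3

# Build a monthly aggregate table from a daily-grain fact and verify
# it answers the same question as the detail.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_sales_daily (sale_date TEXT, product_id INT, amount REAL);
INSERT INTO fact_sales_daily VALUES
  ('2019-04-01', 1, 10.0), ('2019-04-02', 1, 5.0), ('2019-05-01', 1, 7.0);

CREATE TABLE agg_sales_monthly AS
  SELECT substr(sale_date, 1, 7) AS sale_month, product_id,
         SUM(amount) AS amount
  FROM fact_sales_daily
  GROUP BY sale_month, product_id;
""")
detail = con.execute("SELECT SUM(amount) FROM fact_sales_daily "
                     "WHERE sale_date LIKE '2019-04%'").fetchone()[0]
agg = con.execute("SELECT amount FROM agg_sales_monthly "
                  "WHERE sale_month = '2019-04'").fetchone()[0]
print(detail == agg, agg)  # True 15.0
```

Queries at monthly grain can now read one pre-summed row instead of scanning the daily detail.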
  • Data Warehouse Design Techniques – Derived Schemas
    Getting the correct answer is one of the most important requirements of a data warehouse, but this is only part of the requirement. The full requirement is to provide the correct information to the user at the right time. The information is no good to the user if they need the answer today but you need to write a custom report which will take a week to develop, test and deploy. Although you will be able to provide the correct answer, it will be late and therefore of limited to no value to the decision-making process. How can we address these issues of flexibility and timeliness while building our data warehouse? In this blog article, I will demonstrate how derived schemas can address these challenges. Snapshot Fact & Accumulating Snapshot Fact Tables We have previously discussed these data structures in my blog articles titled “Snapshot Fact Tables” and “Accumulating Snapshot Fact Tables”. In those articles, I showed how we improve performance by moving the aggregation and computation of the metrics from the query to […]

October 26, 2016

Data Warehousing – The Basics

What Are Data Warehouses?

A data warehouse is a large, central store of data from disparate source systems across the enterprise, optimized for analysis and reporting. The data is extracted from the source transaction-based application databases, transformed into new data structures by applying a set of rules and functions against the extracted data, and then loaded into the data warehouse data structure, which is optimized for reporting. This central data store provides the business user with a snapshot of the business at a point in time. The data warehouse is used in making both strategic and tactical business decisions.

Why Are Data Warehouses Important?

Data warehouses hold large quantities of source system data over long periods of time. This time-variant data provides the foundation upon which historical analytical and reporting applications are built. These applications provide the organization’s senior leadership with a quick and holistic view of the success of the organization over time. Data warehouses can also be used and the source data used for […]
November 2, 2016

Requirements for a Successful Data Warehouse Project

In order to have a successful data warehousing project, you will need to address the following five challenges:

#1. Executive Sponsorship

All projects need executive sponsorship. Without senior executive leadership, most corporate projects will likely fail. Data warehouse projects are no different; in fact, these projects are even more in need of executive sponsorship than most business initiatives. It is imperative that the organization’s corporate leaders infuse the enterprise with excitement concerning the success of a data warehousing project. These executives must also tie the success of the project to each and every one of their direct reports. This will ensure that all of the business units of the organization are linked to the success of the data warehousing project. Organizational commitment from the top down will eliminate data fiefdoms: those individuals/departments who refuse to share their data or information with the larger organization for fear of loss of power or influence. These fiefdoms often work subtly to sabotage the success of the warehouse project. They do this out of fear, afraid that sharing their information […]
November 9, 2016

Data Warehouse Design Methodologies

Two traditional data warehouse design methodologies came of age in the 1990s: Bill Inmon’s Top-Down Atomic Data Warehouse and Ralph Kimball’s Bottom-Up Dimensional Data Warehouse. These methodologies have been used over the past 20 years to create informational data stores for organizations seeking to leverage their data for corporate gain. In this article, we will compare and contrast these two methodologies.

Atomic Data Warehouse – Bill Inmon

Bill Inmon’s Atomic Data Warehouse approach is strategic in nature and seeks to capture all of the enterprise data in 3rd Normal Form and store all of this atomic data in the data warehouse. Inmon defines a data warehouse as a subject-oriented, non-volatile, time-variant and integrated data source. Let’s break down each of these descriptors of Inmon’s Data Warehouse.

Subject-Oriented – the data is organized so that the data, related by subject area, is linked together.
Non-Volatile – once data is entered it is never updated or deleted; all data is retained for future reporting needs.
Time-Variant – because of […]
November 17, 2016

Designing a Relational Data Warehouse

The use of relational modeling is prevalent in today’s business landscape, as most applications are built using relational databases. Relational modeling is based on mathematical set theory and was first described in 1969 by E.F. Codd. He stated that a grouping of elements (tuples) could be related to other tuples in such a way as to limit the amount of data stored for a specific tuple and therefore optimize the storage of such data. This sounds difficult! However, it is easy. Let’s look at a real-world example. Take a look at the information collected about John Smith below.

John Quincy Smith
DOB: 12/31/1969
National ID: 123-45-6789
1234 Main Street, Anytown, VA 20001 (h)
8000 Corporate Drive, Suite 1022, Sombertown, VA 20101 USA (w)
(703) 555-1212 (w)
(703) 212-1555 (h)
(571) 234-9876 (c)
1234-5678-9012-3456 (Visa)

Could all of this information be stored in a single table? Yes, absolutely. However, storing data in a single table is inefficient. In this example, we added a second personal email address column. […]
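As a rough sketch of the normalization being described (table and column names are my own, and only the phone numbers are modeled), the repeated contact details can be split out of a single wide row into a separate table keyed back to the person:

```python
import sqlite3

# Normalize repeating phone numbers into their own table instead of
# storing phone1, phone2, phone3... columns on the person row.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT, dob TEXT);
CREATE TABLE phone (person_id INT, phone_type TEXT, number TEXT);
INSERT INTO person VALUES (1, 'John Quincy Smith', '1969-12-31');
INSERT INTO phone VALUES
  (1, 'work', '(703) 555-1212'),
  (1, 'home', '(703) 212-1555'),
  (1, 'cell', '(571) 234-9876');
""")
rows = con.execute("""
SELECT p.name, ph.phone_type, ph.number
FROM person p JOIN phone ph ON ph.person_id = p.person_id
ORDER BY ph.phone_type
""").fetchall()
print(len(rows))  # 3 phone rows related to one person row
```

A person with one phone or ten phones now takes the same person-table storage, which is the tuple-storage optimization Codd described.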
November 30, 2016

Designing a Dimensional Data Warehouse – The Basics

Last week I wrote about Relational (Atomic) Data Warehouses and how to create these data structures. This week we will look at Dimensional Data Warehouses and how they differ from the Relational Data Warehouse. Creating a dimensional data warehouse is very different from creating a relational data warehouse. In the relational data warehouse, you are looking to store every atomic element which the organization has deemed to be important from the source applications, storing this data in a relational structure. Relational data structures are optimized for transactional speed and for limiting the use of disk space, which is why this methodology is so prevalent in transaction-based systems. Dimensional models are specifically optimized for the delivery of reports and analytics.

Dimensions – What are they?

Dimensional Data Warehouses must have dimensions, right? So what are dimensions? Think of dimensions as containers of adjectives which describe the subject and will help us limit the fact information we wish to see. Dimension tables can contain a lot of columns if the dimension is robust […]
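Here is a minimal star-schema sketch (all table, column, and key names are assumptions for illustration): one fact table joined to two dimensions, with dimension attributes used to filter and group the measures:

```python
import sqlite3

# A tiny star: fact_sales surrounded by dim_product and dim_date.
# The dimensions supply the "adjectives" used to slice the measures.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INT PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_key INT PRIMARY KEY, month TEXT);
CREATE TABLE fact_sales (product_key INT, date_key INT, amount REAL);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Toys');
INSERT INTO dim_date VALUES (20190401, '2019-04'), (20190501, '2019-05');
INSERT INTO fact_sales VALUES (1, 20190401, 10.0), (2, 20190401, 4.0),
                              (1, 20190501, 6.0);
""")
rows = con.execute("""
SELECT d.month, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date d ON d.date_key = f.date_key
WHERE p.category = 'Books'
GROUP BY d.month ORDER BY d.month
""").fetchall()
print(rows)  # [('2019-04', 10.0), ('2019-05', 6.0)]
```

The fact table holds only keys and measures; everything used to filter or label the result comes from the dimensions.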
December 14, 2016

Data Warehousing Best Practices

There are many times when you have completed a task only to say, “I wish I would have known that before I started this project.” Whether it is fixing the brakes on your car, completing a woodworking project, or building a data warehouse, best practices should always be observed to ensure the success of the project. These are seven of the best practices I have observed and implemented over the years when delivering a data warehouse/business intelligence solution. As you will see, most of these are not technical solutions but focus more on the soft skills needed to ensure the success of these long-duration and expensive solutions.

Define Standards Before Beginning Design

I liken this practice to the “measure twice, cut once” adage. It is important that all of the documentation and physical deliverables of the project be defined at the outset of the project. Naming standards, documentation standards, coding standards, weekly status reports, release deliverables, etc. should all be defined before the kick-off meeting. Examples of these documents should be a part […]
December 21, 2016

Data Warehousing Documentation Review

One of the most important artifacts delivered as a part of any project is documentation. Documentation is the bane of IT development teams. Most developers feel “if it was hard to write, it should be even harder to understand and modify.” When developing and delivering a data warehouse, documentation is critical to the success of the project. This documentation will help both the business users and the technical teams understand the source, the transformation, and the storage of the data they need to consume. These documents are the foundation upon which the warehouse will be built.

Business Requirements Document (BRD)

This is a formal document which requires review and approval. The BRD captures the problem the organization is trying to solve and any restrictions that need to be considered. Essentially, the BRD defines the business user’s high-level specifications (scope) for what they are hoping to achieve from the data warehouse, delivered through a BI reporting or analytic tool, along with any assumptions, constraints and risks which impact or may impact the project. This document […]
January 4, 2017

Advanced Dimensional Data Warehouse Design Techniques

In an earlier blog post, I walked you through the basics of dimensional data warehouse design by introducing you to dimension tables, fact tables and star schema design. For this blog post, I will introduce some of the more widely used advanced dimensional design techniques, which I will describe in detail in the coming weeks.

DIMENSIONS

Conformed Dimensions

Conformed dimensions are those dimensions which have been designed in such a way that the dimension can be used across many fact tables in different subject areas of the warehouse.

Slowly Changing Dimensions

Slowly changing dimensions are the second most commonly used advanced dimensional technique used in dimensional data warehouses. Slowly changing dimensions are used when you wish to capture the changing data within the dimension over time.

Hierarchical Dimensions

Hierarchical dimensions are those dimensions which have an associated hierarchy. A date (day, month, quarter, year) is an example of a hierarchical dimension and one that is often used by many data modelers and data warehouses.

Snowflake Dimension Tables

Removing all redundant dimensional […]
January 11, 2017

Data Warehouse Design Techniques – Conformed Dimensions

In my last blog post, I presented some of the more widely used advanced dimensional design techniques. Over the next several weeks, I will provide detail on each of the design techniques.

Conformed Dimensions

Conformed dimensions are those dimensions which have been designed in such a way that the dimension can be used across many fact tables in different subject areas of the warehouse. It is imperative that the designer plan for these dimensions, as they will provide reporting consistency across subject areas and reduce the development costs of those subject areas via reuse of existing dimensions. The date dimension is an excellent example of a conformed dimension. Most warehouses have only a single date dimension used throughout the warehouse.

Conformed Dimension – Challenges

Structural Differences – in some cases you will have source tables which contain different table structures. These may include:
  • additional or missing columns
  • columns with different data types
  • columns named differently containing the same or similar data
These differences can usually be resolved easily, but occasionally this […]
January 18, 2017

Data Warehouse Design Techniques – Slowly Changing Dimensions

In my last blog post, I demonstrated the importance of conformed dimensions to the flexibility and scalability of the warehouse. This week we will discuss the importance of capturing dimensional change in slowly changing dimensions.

Slowly Changing Dimensions

Slowly Changing Dimensions (SCD) are the most commonly used advanced technique in dimensional data warehouses. Slowly changing dimensions are used when you wish to capture the changing data within the dimension over time. There are three methodologies for slowly changing dimensions.

Type 1 – For this type of slowly changing dimension, you simply overwrite the existing data values with new data values. This makes the updating of the dimension easy and limits the growth of the dimension table to only new records. The drawback is that you lose the historical value of the data, because the dimension will always contain the current values for each attribute. For instance, you have a store dimension which has an attribute for a geographic region. If there is a redesign of the regional boundaries, some stores may move […]
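The Type 1 overwrite can be sketched in a few lines (the attribute names are illustrative, and the dimension is just an in-memory dict standing in for a table):

```python
# Type 1 slowly changing dimension: the new value overwrites the old,
# so no history is kept for the changed attribute.
dim_store = {101: {"store_name": "Main St", "region": "East"}}

def apply_type1(store_key, new_region):
    dim_store[store_key]["region"] = new_region  # overwrite in place

apply_type1(101, "Northeast")
print(dim_store[101]["region"])  # Northeast
```

After the update, every report, including ones over past periods, will show the store in its new region, which is exactly the loss of history Type 1 trades away for simplicity.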