Paul Pirolli

  • Hierarchies in Qlik Sense
    Understanding how Qlik’s Associative Engine handles hierarchies will improve your ability to render meaningful and flexible rollups that users can wield to draw insights at all levels of granularity. This article covers how Qlik hierarchies do not follow the same conventions as a typical SQL database and discusses how hierarchy and ancestry load functions can be used in tandem to allow easy access to data at any granularity. Database Hierarchies Perhaps the most common way to store hierarchies in databases is through a Parent-Child Table. In such a table, there exists a row for each node in the hierarchy containing its ID and the ID of its parent. This format is widespread across most database technologies as it is simple, flexible, and updates quickly. Depending on the specific data, other formats may be used for better read/write performance. Perhaps because of its ubiquity, Qlik is built to utilize Parent-Child Tables. But it cannot provide a true hierarchical user experience without processing those tables through special functions. The Qlik Associative Engine Qlik cannot handle all […]
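    A minimal sketch of the parent-child format described above (the table and column names are illustrative, not taken from the article):
        -- One row per node; each row points at its parent (NULL for the root).
        CREATE TABLE org_unit (
            node_id    INTEGER PRIMARY KEY,
            parent_id  INTEGER REFERENCES org_unit(node_id),
            node_name  VARCHAR(100)
        );
        INSERT INTO org_unit VALUES (1, NULL, 'Company');
        INSERT INTO org_unit VALUES (2, 1, 'Sales');
        INSERT INTO org_unit VALUES (3, 2, 'East Region');
    Qlik’s Hierarchy and HierarchyBelongsTo load prefixes expect exactly this node-ID/parent-ID shape as their input.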
  • AngularJS Basics: $scope
    AngularJS, for the uninitiated, is an open-source front-end framework that is maintained by Google and an active community. It has been around for some time, but the learning curve can be pretty steep as AngularJS encompasses not one language, but two: HTML and JavaScript. AngularJS puts the two together and dramatically expands your front-end capabilities compared with plain old HTML. If you’re familiar with HTML then that half of the equation is rather straightforward, but the JavaScript is where things take a turn with the introduction of “$scope”. Scope as a concept defines the level of visibility of a variable or object (typically local or global), but in plain JavaScript it is not something you reference explicitly. In AngularJS, $scope is essentially the middleman between your HTML view and your JavaScript controller. It is viewable by both the HTML and the JavaScript and as such can be interacted with by both. In this sense we can say that $scope in Angular is always visible to its controller […]
  • Optimizing Qlik Load-Time
    I have recently been brought on to a project involving a Qlik implementation for a client. Among the requirements for the project, the client needs to have the data refreshed every hour. As with many Qlik implementations, the Qlik segment is the final step in the ETL process, with the data arriving in an Oracle DB first. By the time the ETL reached Qlik, only 30 minutes remained of the allotted hour. Here is the process for optimization that I followed. Status on arrival: The number of tasks in the QMC was about 40. There were about 30 apps in the front end that needed reloading, most of which would reload in under a minute. However, there were several apps loading tables of over 250 million records with dozens of fields, and those apps would take well over an hour. The total time to reload everything was about 6.5 hours. Load time needed drastic improvement. Below are some techniques for Extraction, Transformation & Loading improvement; I’ll touch on more Loading techniques in another post. […]
  • Late Arriving Dimensions
    In my last post I discussed late arriving facts, and although it is slightly annoying to receive delayed transaction data, the impact on the ETL and the accuracy of the data in the data warehouse is minimal. As we will discuss in this blog post, this is not the case for late arriving dimension (early arriving fact) data. As I have discussed throughout this blog series on designing and loading dimensional data warehouses, we always load data in the following order: lookup tables, then dimensions, and finally facts. But what happens when you get fact data before the dimension data arrives, or there is a backdated change in the dimensional data? These late arriving dimensions may be generated by a known business process, but they can pose a significant challenge to the accuracy of the data and test the skill and patience of the ETL developers and testers. Let’s take a look at the challenge the delay in receiving the dimensional data presents and the options we have to resolve these challenges. The most […]
  • XPATH & KNIME
    Yesterday I had to import an XML file into a series of relational database tables. I started by thinking of the best way to import this data. I could write a Python parser by hand and load the data that way, but I am a lazy programmer. I’m looking for something that is quick to deploy and easy to maintain. That is when I thought of my trusty, opensource ETL tool, KNIME. For this example, I will be using an XML version of the Shakespeare play Hamlet. Everything started off fine. I grabbed the XML Reader node and placed it on my canvas. I opened the node configuration tool and entered in the location of the xml file. So far, so good. My task was to flatten the XML document into a denormalized table. I started out by finding all of the relevant tags I would need to make this output meet the requirements. OK, the play title tag occurs once, so I grab an XPATH node and configure it to grab the title of […]
  • Plotly.js: Getting Accurate Coordinates from Click Event
    Across StackExchange and the Plotly.js community forums, there seems to be one frequently asked question that has not yet been answered: how can you get accurate coordinates from a click event on a Plotly chart? Take your standard scatter plot: https://codepen.io/plotly/pen/LVMRMO If I hover my mouse over (4, 14), and click, the current Plotly.js onClick event data returns data for each trace at that same x value. So, you would get (4, 17) for trace 0, (4, 11) for trace 1, and (4, 12) for trace 2. This is great if I’m trying to find out information about the traces, but this is not useful at all if I’m trying to get the point on the grid on which I clicked. Of course, with the event data you get all kinds of other somewhat useful information and functions. etpinard gets close to a solution in this codepen (https://codepen.io/etpinard/pen/EyydEj) by using the xaxis and yaxis built-in p2c function. Unfortunately, this doesn’t exactly work, and there is little to no documentation on these conversion functions. There are […]
  • Late Arriving Facts
    Late arriving facts are those transactions which are delayed in arriving to the warehouse. For Type I data warehouses there is no real impact but for Type II warehouse models there is a minor challenge that the ETL developer must remember to handle to ensure the late arriving transaction corresponds with dimensional attributes at the time of the transaction. If we have a Type I warehouse or are always assured that we only receive fact data at the same time as the dimensional data then we can use the following simple query where we find the current active record for the dimension’s natural ID which is found in the fact record: SELECT dimensional_key    FROM dimension_table WHERE dimension_natural_id = {natural_id from fact}      AND dimension_actv_rcrd_fl = 1; Taking a look at the Type II data warehouse challenge, we cannot assume that the active dimensional record is the correct record for the fact. Therefore, we need to modify the ETL workflow process (Figure 1) to address the challenge of the possible changing of dimensional data since the “old” […]
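    For the Type II case, one common adjustment (a sketch under assumed effective-date column names, not necessarily the exact query from the full post) is to select the dimension version whose date range covers the transaction date instead of the currently active record:
        -- Find the dimension record that was in effect when the late-arriving
        -- transaction actually occurred.
        SELECT dimensional_key
          FROM dimension_table
         WHERE dimension_natural_id = {natural_id from fact}
           AND {transaction_date from fact} >= dimension_eff_start_dt
           AND {transaction_date from fact} <  dimension_eff_end_dt;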
  • Using the COPY Command to Load CSV Files from S3 to an RDS Postgres Database via a KNIME Workflow
    The Postgres COPY command is the most efficient way to load CSV data into a Postgres database. RDS provides a small challenge to the use of this functionality since you cannot use a local filesystem in RDS. In this blog post I will walk you through the steps to connect your Postgres RDS database to an S3 filesystem and load CSV files to this database via a KNIME workflow. First, we need to add the aws_s3 extension to the Postgres database by executing the following command from the PG Admin tool: CREATE EXTENSION aws_s3 CASCADE; Next, we need to Create an IAM Policy to Access Amazon S3 Resources and Create an IAM Role to Allow Access to AWS Services. Now that we have a role associated with the IAM access policy, we need to grant it to the database. To do this, log on to the AWS console, go to the RDS page, and select the database you wish to have this functionality. On that page scroll down to Manage IAM Roles (see Figure 1) and […]
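    Once the extension is created and the IAM role is attached, the import itself is a single call to aws_s3.table_import_from_s3; the bucket, key, region, and target table below are placeholders:
        -- Server-side load of a CSV object from S3 into an RDS Postgres table.
        SELECT aws_s3.table_import_from_s3(
            'staging.my_table',                  -- target table (placeholder)
            '',                                  -- column list; empty means all columns
            '(FORMAT csv, HEADER true)',         -- COPY options
            aws_commons.create_s3_uri('my-bucket', 'exports/my_file.csv', 'us-east-1')
        );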
  • URL Tips and Tricks
    URL manipulation is one of those things you may not even think about until it becomes the biggest headache of your week. Thankfully KNIME has some things we can use to make life easier. Here are some of the common situations you could encounter when working with URLs in KNIME and how to handle them. The Workflow Needs to Run in Any System’s KNIME Directory Perhaps the most common issue you’ll run into: most of the time you won’t want a KNIME workflow to run only on your machine or server. You still want it to use a specific file structure beyond a certain point, but you don’t want to hard-code everything before that. KNIME has a remarkably simple way of letting you deal with this in the form of standard variable keywords. Take this Create Directory node as an example: This syntax uses two such variables, “knime://” and “knime.workflow”. The former, “knime://”, is the system’s KNIME mount point, specified when you open your workspace in the editor. This mount point almost entirely […]
  • Batch Execution of KNIME Workflows, Part 2
    In an earlier blog posting I walked through the steps required to install KNIME in a Docker container so that it could be used to run workflows in a batch mode.  This is a very useful technique for automating workflow execution and leveraging cloud resources to do so.  However, most workflows are not self-contained: they need access to configuration, external data, and local storage.   I did not cover those aspects in my first posting so this blog entry will introduce ways to pass in configuration and support the batch execution of KNIME workflows in a container. Setting Flow Variables The most common way to supply external configuration to a KNIME workflow is to set flow variables from the command line when KNIME is run in batch mode.    This is done using the -workflow.variable option.  Each instance of this option sets the value for a single variable so it may be included any number of times in the command line.   The format for the option value is as follows: -workflow.variable=[Variable Name]:[Value]:[Type] The [Variable Name] is simply […]
  • Bulletproof Your KNIME Workflows with System Property Variables and External Config Files
    KNIME workflows can contain a lot of code and, like all good programmers, I always want to eliminate as many code changes as possible. To accomplish this optimization, I rely on the use of table-driven processes, external property files and system property variables. The use of these techniques can reduce code changes and ensure that workflows will work on different computers that are not setup identically. I’d like to begin by discussing table-driven solutions. For this type of solution, I will use the data in the table to make the change. This can be done by simply changing the value of an attribute. For example, if I wanted to ensure a document was processed, I would add a flag (doc_processed_fl) to the document table with a default value of FALSE and have the program set the value to TRUE upon the completion of the process. This way I am assured the record will be processed. In the event of a catastrophic failure (network error, database error, etc.) causing the workflow to be halted, the flag […]
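    A minimal sketch of that flag-driven pattern (the table and column names are illustrative):
        -- Add the processing flag with a default of FALSE...
        ALTER TABLE document ADD COLUMN doc_processed_fl BOOLEAN DEFAULT FALSE;
        -- ...the workflow picks up only unprocessed rows and marks each one on completion.
        SELECT doc_id FROM document WHERE doc_processed_fl = FALSE;
        UPDATE document SET doc_processed_fl = TRUE
         WHERE doc_id = :completed_doc_id;   -- :completed_doc_id is a bind placeholder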
  • Automated Execution of Multiple KNIME Workflows
    When using the open source KNIME Analytics Platform to build sophisticated data processing and analysis pipelines, I often find myself building workflows that consist of so many nodes they become difficult to manage.  This complexity can be managed by grouping nodes into processing stages and then bundling those stages into meta-nodes so that the overall workflow layout is easier to follow. However, I’ve found that this approach still leaves workflows unwieldy to work with as you still have to open the meta-nodes to explore and troubleshoot their processing.  Over the years I’ve worked with KNIME, I’ve developed a habit of breaking larger workflows up into smaller individual workflows representing each processing stage in the overall pipeline.  This serves to make building and debugging each processing step much more tractable at the cost of requiring more storage for persisting the outputs of one stage so that they may be used as the inputs of the next stage. Another small drawback of separate workflows is that they all need to be executed in order for the overall […]
  • Serverless Analysis of data in Amazon S3 using Amazon Athena through KNIME.
    This blog describes how to perform Serverless Analysis of data in Amazon S3 using Amazon Athena through KNIME. Let’s start with a quick introduction to Amazon Athena. What is Athena? Athena is a serverless query service for querying data in S3 using standard SQL, with no infrastructure to manage. It supports ANSI SQL queries with support for joins, JSON and window functions. How to connect Athena and execute SQL from KNIME? KNIME can interface with Athena using the following nodes. How to do Analysis of data in S3 using Athena through KNIME? A traditional approach is to download entire files from S3 to KNIME using a node such as the Parquet Reader. This approach works fine if the file is relatively manageable in size. But if you have large files in S3, then this download approach will consume lots of time and local memory, and processing will be slow because of the data volume. Athena can help with these issues. Since Athena supports SQL processing on top of S3, all heavy-lift data processing such as joins, […]
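    A sketch of the kind of query Athena can run directly against files in S3 so that only the aggregated result flows back to KNIME (the table and columns are hypothetical):
        -- Rank customers by total sales over data stored as Parquet in S3.
        SELECT customer_id,
               SUM(total)                              AS total_sales,
               RANK() OVER (ORDER BY SUM(total) DESC)  AS sales_rank
          FROM sales_parquet
         WHERE transaction_date BETWEEN DATE '2019-01-01' AND DATE '2019-12-31'
         GROUP BY customer_id;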
  • Batch Execution of KNIME Workflows Using Docker
    In my previous blog posting I introduced our service management platform Dex which is at the core of many of our advanced analytics solutions.  Dex seamlessly manages the coordination of the data acquisition, ingestion, transformation and modeling for our system via processing steps that are often deployed as Docker containers for execution on either dedicated servers, EC2 instances, or as AWS Fargate tasks.  One of the core applications we leverage for many of these steps is the open source KNIME analytics platform (www.knime.org).   KNIME’s easy to use GUI allows us to quickly develop and test ETL and modeling workflows.  Once these workflows are done, we then deploy them inside of Dex to be run inside Docker containers using KNIME’s batch execution mode.  The purpose of this article is to provide a guide to building a Docker container to execute KNIME workflows in a headless environment.  This article details a very basic set up for the Docker container and does not cover things like user accounts and security credentials which may be required for a more […]
  • Using the Python Node in KNIME to Perform Granger Causality Tests
    KNIME includes a wide variety of data manipulation and data science nodes, but there are a few cases where you may need to embed specific Python routines inside a workflow. Or more often, I test algorithms in Python but do not want to put the effort into orchestrating the data access, iterations, and parallel processing in Python. KNIME has several Python nodes, but first you will need to set up your Python environments. Setting up Python Environments KNIME requires you to have Python 2 and Python 3 environments running locally. If you have Anaconda 3.x, you will need to create a Python 2 environment with numpy and pandas installed. From the Anaconda prompt, run the following:
        conda install nb_conda_kernels
        conda create -n py27 python=2.7 ipykernel
        activate py27
        pip install numpy
        pip install pandas
    You cannot point to the Python executables directly if you are running on Windows and will need to create batch files to run the environments. The .bat files for Python3 (base) and Python2 (py27) environments: @REM Adapt the folder in the PATH to […]
  • 5 Tips and Workarounds for Angular/JS Extension Development
    This article provides a few small tips and tricks to ease some of the headaches of developing Angular/JS extensions for Qlik. 1. When using jQuery, make sure you are utilizing unique selectors for a given instance of the extension. Multiple instances of an extension can exist on a single sheet. If using jQuery, it is easy to select (and accidentally edit) all instances of an extension on a sheet. To prevent this, each instance of the extension should have its own id. More advanced users may want to consider using the objectId that Qlik assigns to visualizations, but those can be difficult to track down. We at NuWave have found an easy method to utilize unique selectors: Generate a random number, place it in a wrapper div in the html, then use that number for all selections. 2. Use Validated and a broadcast() (with a Timeout) for re-drawing in Angular extensions. When creating an extension that needs to redraw (ie: on selection or property changes), Qlik recommends using the Validated event. Qlik’s documentation describes this […]
  • Factless Fact Tables
    Factless facts are those fact tables that have no measures associated with the transaction. Factless facts are a simple collection of dimensional keys which define the transaction or describe a condition for the time period of the fact. You may question the need for a factless fact, but they are important dimensional data structures which capture information that can be leveraged into rollup measures or presented to a user. The most common example used for factless facts is student attendance in a class. As you can see from the dimensional diagram below, the FACT_ATTENDANCE is an amalgamation of the DATE_KEY, the STUDENT_KEY, and the CLASS_KEY. As you can see, there is nothing we can measure about a student’s attendance at a class. The student was there and the attendance was recorded, or the student was not there and no record was created. It is a fact, plain and simple. There is a variation of this fact where you can always load the full roster of individuals registered for the class and add a […]
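    A sketch of the attendance example (the key names follow the diagram described above; the table definitions are otherwise assumed):
        -- The factless fact: nothing to measure, only the keys that record the event.
        CREATE TABLE fact_attendance (
            date_key     INTEGER NOT NULL,
            student_key  INTEGER NOT NULL,
            class_key    INTEGER NOT NULL
        );
        -- A rollup measure is derived simply by counting rows.
        SELECT date_key, class_key, COUNT(*) AS students_attending
          FROM fact_attendance
         GROUP BY date_key, class_key;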
  • PostgreSQL – Why You’d Want to Use It
    If you’ve been in this industry for a few years, then you probably know what SQL is. It’s the gold standard for working with databases, and nearly every modern coding language interfaces with it. Though simple, it’s very flexible, and different variants such as Oracle SQL and MySQL add more robust functionality. PGSQL, or PostgreSQL, is another one of these variants, but it takes things one step further and even one step after that. Let’s start with what exactly PGSQL is and is not. PGSQL is a variant of SQL and uses most of the same syntax. It’s open source and has been in development for decades now; if you’ve been around the block, you may even remember its predecessor, Ingres. It can do all the same things that traditional SQL databases can do, following the same rules and structure. It is not a reinvention of the wheel, more a different take on it. It is not specifically NoSQL, though it can be used for it, and it doesn’t require […]
  • Structured Query Language Basics
    Structured Query Language (SQL) is the standard language used for retrieving and manipulating database information. In today’s blog, I will focus on how we can use this language to get an organized result set back that answers a specific question. This will be accomplished by using the Select, From, Where, Group By, Having, and Order By clauses. We will be using the EMPLOYEES table in the HR schema from Oracle Live as our practice table. You can find the structure of the table in the image below. This can be accessed for free. All you need is an Oracle account; log in with those credentials, go to this link (Oracle Live), and select “Start Coding Now”. SELECT: All SQL queries start with the SELECT. This is how the query knows what information you want to return. The SELECT is where you place the column names, column aliases, and aggregations you want to see in your result. If you don’t know the structure of the table in advance you can place an […]
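    As a preview, here is a small query against the EMPLOYEES table that uses every clause listed above (the salary threshold is arbitrary):
        -- Average salary per department for employees earning over 5000, keeping only
        -- departments with more than two such employees, highest average first.
        SELECT department_id,
               COUNT(*)              AS employee_count,
               ROUND(AVG(salary), 2) AS avg_salary
          FROM employees
         WHERE salary > 5000
         GROUP BY department_id
        HAVING COUNT(*) > 2
         ORDER BY avg_salary DESC;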
  • Period-To-Date Facts
    We have all come across times when our customer wants to know how the organization is currently doing. They often want to know how they are measuring up against this time last year or against the projected measure. The most common examples of these requests are a year-to-date calculation and a budget vs. actual analysis. In this blog post I will describe how to efficiently address these common business requests. Year-To-Date (YTD) Calculations Our customer has stated that they wish to show a YTD metric for sales which can be broken down into quarterly and monthly metrics as well. In all of my other blog posts I have modeled some dimension or fact to show you how to address the requirement at hand; in this scenario I will not. The reason is that we have already done the work to address this requirement in previous blogs. Let’s look at how we already addressed this issue with an example. In the star schema shown above we can see that we are storing purchase transactions by customer, product, […]
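    A sketch of the YTD rollup against a typical sales star (the table and column names are assumptions, not the schema from the post):
        -- Year-to-date sales by customer, driven by date-dimension attributes.
        SELECT c.customer_name,
               SUM(f.total) AS ytd_sales
          FROM fact_sales   f
          JOIN dim_date     d ON d.date_key     = f.date_key
          JOIN dim_customer c ON c.customer_key = f.customer_key
         WHERE d.year_number = 2019
           AND d.full_date  <= CURRENT_DATE
         GROUP BY c.customer_name;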
  • Angular vs React
    If you’ve done any sort of front-end programming, you know there are two front-end development frameworks that have emerged as the front runners in the market: React and Angular. Whether you’re just starting in on a development effort or are interested in which may be best to learn to improve your marketability, this article will outline the key differences between the two frameworks and identify recent market trends to help you make a decision. Angular7, or more familiarly Angular, is a full-fledged MVC framework championed by Google. It includes many OOTB features including routing, unit testing, forms, and AJAX requests. With these built-in features comes the strong suggestion on how to structure your application. When building an Angular app, you [generally] use their router, their forms, their requests, and you [generally] do it in their style. React.js is technically a library (not a framework) that is built and maintained by Facebook. It is more lightweight than Angular, but lacks the built-in features listed above. If you want to incorporate form validation into your React app, […]
  • Loading Accumulating Snapshot Fact Tables
    Often management looks for bottlenecks in corporate processes so that they can be streamlined or used as a measurement of success for the organization. In order to achieve these goals we need to measure the time between two or more related events. The easiest way to report on this time-series process is to use accumulating snapshot facts. Accumulating snapshot facts are updatable fact records used to measure time between two or more related events. The most common example of this type of fact can be seen in order processing. Let’s take a look! Order processing consists of many serialized processes. First there is the order, followed by the product creation or packaging of the items contained within the order. Once the items are packed, they are sent to QA for a double check of the order before sending the package to shipping, where the carrier will pick up and deliver the product to the customer. In this process we have many areas where the process could get held up (abandoned cart, missing inventory, slow QA process, […]
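    A sketch of how one milestone update might look as an order moves through that pipeline (the fact and column names are illustrative, and :ship_date / :order_id are bind placeholders):
        -- When the shipping milestone completes, stamp its date and derive the lag
        -- from the prior milestone on the same accumulating row.
        UPDATE fact_order_fulfillment
           SET ship_date       = :ship_date,
               qa_to_ship_days = :ship_date - qa_complete_date
         WHERE order_id = :order_id
           AND ship_date IS NULL;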
  • Building Qlik Sense Extensions that can Export and Snapshot
    The Fluff If you’ve ever built an extension in Qlik Sense focused on data visualization, you know how cool it is to harness the power of the Qlik associative engine. It is important for such visualizations to integrate with the user experience and really feel like Qlik—both in style and functionality. Yet many complex extensions suffer functionality drawbacks from failing to overcome two major hurdles: the ability to export to PDF/Image through the right-click menu, and the ability to function as a snapshot within a Qlik Data Story. Though these issues have the same root cause, it proved incredibly difficult for me to track down. It boils down to some undocumented changes in a couple key Qlik JavaScript objects. To see the result of my research and to see instructions on how to handle these issues correctly, skip down to The Stuff below. My path started by trying to “turn on” the export functionality for a table-like extension I was developing. The Qlik help pages helpfully told me to add a property named “export” to […]
  • Client Identification Using Custom Request Headers in Spring Boot
    One of the key building blocks of NuWave’s advanced predictive analytics solutions is a service management platform we built called Dex, which is short for Deus Ex Machina. Dex was built as a collection of microservices using Spring Boot and is responsible for coordinating the execution of complex workflows which take data through acquisition, ingestion, transformation, normalization, and modeling with many different advanced machine learning algorithms. These processing steps are performed with a large number of technologies (Java, Python, R, KNIME, AWS SageMaker, etc.) and are often deployed as Docker containers for execution on either dedicated servers, EC2 instances, or as AWS Fargate tasks. Dex seamlessly manages the coordination of the execution of all these ephemeral tasks across these various technologies while also providing runtime configuration, secure credential management, state management, and storage provisioning for each executing task. This has led to the affectionate nickname of “Cat Herder” for this critical technology at the heart of our advanced analytical solutions. With such a complex processing workflow spread across multiple underlying technologies, we faced the challenge of […]
  • Loading Transaction Fact Tables
    This blog post will focus on loading transaction fact tables; subsequent posts for periodic and accumulating snapshots will follow in the coming weeks. Loading fact tables is very different from loading dimensions. First, we need to know the type of fact we are using. The major types of facts are transaction, periodic snapshot, accumulating snapshot, and time-span accumulating snapshot. We also need to know the grain of the fact, the dimensional keys which are associated with the measurement event. Let’s say we want to measure product sales by customer, product, and date attributes. The source transaction system may provide the following associated dimensions: DIM_CUSTOMER, DIM_PRODUCT, and DIM_DATE. Here is an example of a set of product sale transaction records:
    Product_ID  Customer_ID  Transaction_Date  QTY  EXT_Price  Discount  Total
    55          1415         4/10/2019 10:01   3    38.97      5         37.02
    72          1415         4/10/2019 10:01   5    39.95      5         37.95
    120         1415         4/10/2019 10:01   1    12.99      5         12.34
    17          3457         4/10/2019 10:09   3    44.97      0         44.97
    57          8322         4/10/2019 10:10   6    35.94      10        32.35
    106         1194         4/10/2019 10:17   3    23.97      0         23.97
    […]
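    A sketch of the surrogate-key lookup step when loading these transactions into the fact table (staging and dimension names are assumptions):
        -- Resolve each natural ID to its surrogate key, then insert the measures
        -- at the transaction grain (customer, product, date).
        INSERT INTO fact_product_sales
            (customer_key, product_key, date_key, qty, ext_price, discount, total)
        SELECT dc.customer_key,
               dp.product_key,
               dd.date_key,
               s.qty, s.ext_price, s.discount, s.total
          FROM stg_product_sales s
          JOIN dim_customer dc ON dc.customer_id = s.customer_id AND dc.actv_rcrd_fl = 1
          JOIN dim_product  dp ON dp.product_id  = s.product_id  AND dp.actv_rcrd_fl = 1
          JOIN dim_date     dd ON dd.full_date   = CAST(s.transaction_date AS DATE);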
  • Data Preparation for Machine Learning: Vector Spaces
    Machine learning algorithms often rely on certain assumptions about the data being mined. Ensuring data meet these assumptions can be a significant portion of the preparation work before model training and predicting begins. When I began my data science journey, I was blissfully unaware of this and thought my preparation was done just because I had stuffed everything into a table. Feeding these naïvely compiled tables into learners I wondered why some algorithms never seemed to perform well for me. As I began digging into the algorithms themselves, many referred to vector space operations as being fundamental to their function. Answering the simple question of “what is a vector space” took me on quite an enlightening journey. Vector spaces are really a neat tool for organizing data and require more than getting all your data into a table. Perhaps most mind-bending is that a vector space can be constructed from anything, even if mortals like myself tend to just use numbers and categorical values. I believe part of the reason I was late to understanding […]
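    For reference, the destination of that journey can be stated compactly; this is the standard textbook definition rather than wording from the article:
        % A vector space over a field F is a set V closed under vector addition
        % and scalar multiplication:
        \forall\, \mathbf{u}, \mathbf{v} \in V:\ \mathbf{u} + \mathbf{v} \in V
        \qquad
        \forall\, \alpha \in F,\ \mathbf{v} \in V:\ \alpha\,\mathbf{v} \in V
        % subject to the usual axioms: associativity and commutativity of addition,
        % a zero vector, additive inverses, and distributivity of scalar
        % multiplication over addition.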
  • Oracle Certified Professional Exam Prep
    How I prepared for (and passed) the Oracle Certified Professional exam.
  • Beyond Recommender Systems: Alternative Applications of Matrix Factorization (Part 1)
    Owing to their utility in retail and business systems, Recommender Systems are one of the most popular applications of predictive analytics. A Recommender System is a process that leverages data from the behaviors of individuals and attempts to make personalized predictions of a targeted individual’s future behaviors. Well-known and now obvious applications of a Recommender System include predicting a customer’s preferences for products they may buy so that they can be recommended for purchase. Another well-known and studied application is the prediction of movies that are of interest to an individual based upon their own movie viewing habits and the viewing habits of other individuals that may share some common attitude or trait that indicates some shared interest in specific movies. Though there are several methods for implementing a Recommender System, Collaborative Filtering has become the de facto standard. Collaborative Filtering is an approach to Recommender Systems that is based on a machine learning technique called Matrix Factorization and was made famous within the data science community through its application in the one-million-dollar winning solution for […]
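    For context, the Matrix Factorization at the heart of Collaborative Filtering is usually posed as the following low-rank approximation (a standard formulation, not notation taken from the post):
        % Approximate the sparse ratings matrix R with factors U and V, fitting only
        % the observed entries and regularizing the factor vectors:
        R \approx U V^{\top}, \qquad
        \min_{U,V} \sum_{(i,j) \in \Omega}
            \bigl( r_{ij} - \mathbf{u}_i^{\top} \mathbf{v}_j \bigr)^2
            + \lambda \bigl( \lVert \mathbf{u}_i \rVert^2 + \lVert \mathbf{v}_j \rVert^2 \bigr)
        % where \Omega is the set of observed (user, item) pairs.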
  • Updating Type II Slowly Changing Dimensions
    In this blog post I will provide an example of how I implement Type II dimensions in my data warehouses. The process I go through when refreshing data into a Type II dimension is to first land the data from the source system. I always create a separate schema to hold the data, as it is presented from the source system with no transformations. The tables and columns are named exactly as they are in the source system. The basic idea here is to quickly off-load the data from the source system, thereby eliminating any long running or resource intensive processes from impacting the source application while providing a link back to the source system. Once all of the data is landed, I can now proceed to the transformation and staging step. In this phase, I create staging tables for all of my dimensions. These tables look similar to the dimension table but they do not have a primary key or a majority of audit columns. Here is an example of a regular dimension table […]
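    A sketch of the expire-and-insert step once staging detects a changed attribute (column names are assumed, and the surrogate key is assumed to be populated by a sequence or identity column):
        -- Close out the currently active version of the changed record...
        UPDATE dim_customer
           SET actv_rcrd_fl = 0,
               eff_end_dt   = CURRENT_DATE
         WHERE customer_id  = :changed_customer_id
           AND actv_rcrd_fl = 1;
        -- ...then insert the new version as the active record.
        INSERT INTO dim_customer
            (customer_id, customer_name, customer_tier, eff_start_dt, eff_end_dt, actv_rcrd_fl)
        SELECT customer_id, customer_name, customer_tier, CURRENT_DATE, DATE '9999-12-31', 1
          FROM stg_customer
         WHERE customer_id = :changed_customer_id;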
  • Introduction to KNIME
    Let’s say you’ve been tasked to pull data from a variety of data sources on a monthly basis, manipulating and processing it to create insights. At first one may look to a powerhouse software like Informatica, but what if you don’t really need something that heavy and your budget is extremely limited? You need an ETL tool to get you from point A to point B accurately and reasonably quickly, something that’s easy to learn but has the depth and flexibility for more advanced processing and customization. Enter KNIME. Up-front KNIME presents itself as a tool designed for data analysts, emphasizing visual clarity and usability over raw power; where Informatica will perform multiple transformation types in a single Expression node, KNIME has a node for each step. This may be a turnoff for some coming from enterprise-level software, but the enhanced control offered by KNIME is extremely valuable and can shave hours off your development and debugging. A major offering of KNIME is that it allows you to see how your data changes at every […]
  • Joining Fact Tables
    Joining fact tables can be done, but there are some inherent risks, so you need to be careful when it is required. In the following simple scenario, we have a fact of authors to articles and a separate fact of articles to pageviews. Our customer has asked for the ability to 1) find the authors who have provided the most content in a given time period, 2) find the articles which have the greatest number of pageviews for a given time period, and 3) find the authors with the highest number of pageviews in a given time period. Our data model can be seen here: The first two questions can be answered by using a simple select statement joining a single fact table with the date dimension and the other associated dimensions to retrieve the answer. TOP 10 AUTHORS BY ARTICLE COUNT TOP 10 ARTICLES BY MONTH The third requirement will compel us to join two fact tables to get the answer. Remember, to join a fact table you need at least […]
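    For the third requirement, a sketch of the drill-across pattern: aggregate each fact to a shared grain first, then join the aggregated result sets (the names are illustrative):
        -- Aggregate each fact to the (author, month) grain before joining them.
        WITH articles_by_author AS (
            SELECT fa.author_key, d.month_key, COUNT(*) AS article_count
              FROM fact_author_article fa
              JOIN dim_date d ON d.date_key = fa.date_key
             GROUP BY fa.author_key, d.month_key
        ),
        views_by_author AS (
            SELECT fa.author_key, d.month_key, SUM(fp.pageviews) AS pageviews
              FROM fact_article_pageviews fp
              JOIN fact_author_article fa ON fa.article_key = fp.article_key
              JOIN dim_date d ON d.date_key = fp.date_key
             GROUP BY fa.author_key, d.month_key
        )
        SELECT a.author_key, a.month_key, a.article_count, v.pageviews
          FROM articles_by_author a
          JOIN views_by_author    v
            ON v.author_key = a.author_key AND v.month_key = a.month_key
         ORDER BY v.pageviews DESC;
    Note that an article with multiple authors will credit its pageviews to each of them, which is one of the inherent risks mentioned above.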
  • Types of Data Models
    A conceptual model is a representation of a system, made of the composition of concepts which are used to help people know, understand, or simulate a subject the model represents.[i] In dimensional modeling this is the most abstract level of modeling. Here we are only worried about the entities (dimensions) and their relationship to each other (facts). No attributes are required in the conceptual model as we are trying to work with the customer who is not an expert in databases or data warehousing but they do understand how the entities are related in their business. The conceptual model will help us to ensure we have captured the full list of dimensions and facts. If we are missing an entity or a relationship the customer will be able to provide us with that information. Now that we have completed a successful conceptual data model, we need to create a logical model. A logical model is a fully attributed conceptual model. The attributes are fully spelled out with no abbreviations. High-level data types (string, number, date, etc.) are […]
  • Data Warehouse Design Techniques – Constraints and Indexes
    In this week’s blog, we will discuss constraints and indexes. In data warehousing, like in life, constraints are things we love to hate. Constraints keep us from making mistakes, which in most cases is a good thing, until we come across an anomaly which needs to be addressed but the constraints prevent this from happening. Most of the time indexes help us to find data faster, but that is not always the case.   What are Indexes? Indexes are data structures which hold field values from the indexed column(s) and pointers to the related record(s). This data structure is then sorted and binary searches are performed to quickly find the record.   When should I Use Indexes? In general, you use indexes when you have a large data set. Small lookup or reference tables are bad candidates for indexes. In most cases the database can more efficiently select all of the information from the table without ever using the index, making the index useless. As a rule of thumb, let the SQL be your guide. […]
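    A couple of representative statements (the table and column names are illustrative):
        -- Index the dimension keys most often used to filter or join the large fact table.
        CREATE INDEX idx_fact_sales_date_key     ON fact_sales (date_key);
        CREATE INDEX idx_fact_sales_customer_key ON fact_sales (customer_key);
        -- Small lookup/reference tables are usually left unindexed; a full scan of a
        -- handful of rows is cheaper than maintaining and traversing an index.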
  • Data Warehouse Design Techniques – Aggregates
    In this week’s blog, we will discuss how to optimize the performance of your data warehouse by using aggregates. What are Aggregates? Aggregates are the summarization of fact related data for the purpose of improved performance. There are many occasions when the customer wants to be able to quickly answer a question where the data is at a higher grain than we collect. To avoid slow responses due to system summing data up to higher levels we can pre-calculate these requests, storing the summarized data in a separate star. Aggregates can be considered to be conformed fact tables since they must provide the same query results as the detailed fact table. When should I use Aggregates? Aggregates are to be considered for use when the number of detailed records to be processed is large and/or the processing of the customer queries begins to impact the performance of the BI application. Customers of BI applications expect an application which is very responsive to their queries. Aggregates are the way to meet this implied requirement. Aggregate Example […]
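    A sketch of building such an aggregate star from the detailed fact (the names are assumptions):
        -- Pre-summarize detail-grain sales to a monthly grain to serve rollup queries.
        CREATE TABLE agg_sales_month AS
        SELECT d.year_number,
               d.month_number,
               f.product_key,
               SUM(f.qty)   AS qty,
               SUM(f.total) AS total_sales
          FROM fact_sales f
          JOIN dim_date   d ON d.date_key = f.date_key
         GROUP BY d.year_number, d.month_number, f.product_key;
    Because the aggregate must return the same answers as the detailed fact, it is typically rebuilt or incrementally maintained as part of the same load that refreshes the detail.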
  • Data Warehouse Design Techniques – Derived Schemas
    Getting the correct answer is one of the most important requirements of a data warehouse, but this is only part of the requirement. The full requirement is to provide the correct information to the user at the right time. The information is no good to the user if they need the answer today but you need to write a custom report which will take a week to develop, test, and deploy. Although you will be able to provide the correct answer, it will be late and therefore of limited to no value to the decision-making process. How can we address these issues of flexibility and timeliness while building our data warehouse? In this blog article, I will demonstrate how derived schemas can address these challenges. Snapshot Fact & Accumulating Snapshot Fact Tables These data structures were discussed in my previous blog articles titled “Snapshot Fact Tables” and “Accumulating Snapshot Fact Tables”. In these articles, I showed how we improve performance by moving the aggregation and computation of the metrics from the query to […]
  • Loading Hierarchical Bridge Tables
    This blog article is a follow-up to the Ragged Hierarchical Dimensions article I posted a few weeks ago. In this article, I spoke of using a hierarchical bridge table at a high level; today I will discuss the nuances of this method in more detail. Let’s begin at the dimension table. In order to capture the data with the appropriate level of detail you will need to design your dimension with recursive fields. I have designed this dimension table as a type 2 dimension table and I include the recursive (parent) field as an attribute. This allows me to capture the point-in-time changes of the dimensional attributes while still being able to display the hierarchy. Most reporting tools cannot handle this recursive field, so we need to find a way to transform the recursive join into a relational join. To do this we will use a recursive query as part of the ETL process to create this relational join (bridge) table. The query will create a relational hierarchy consisting of each record in the hierarchy associated with […]
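    A sketch of that recursive query, producing one bridge row per ancestor/descendant pair (written as a standard SQL recursive CTE; table and column names are assumed):
        -- Explode the parent-child dimension into an ancestor-descendant bridge table.
        WITH RECURSIVE hierarchy_bridge AS (
            SELECT d.dim_key AS ancestor_key,
                   d.dim_key AS descendant_key,
                   0         AS depth_from_ancestor      -- every node bridges to itself
              FROM dim_org d
             UNION ALL
            SELECT b.ancestor_key,
                   c.dim_key,
                   b.depth_from_ancestor + 1
              FROM hierarchy_bridge b
              JOIN dim_org c ON c.parent_dim_key = b.descendant_key
        )
        SELECT ancestor_key, descendant_key, depth_from_ancestor
          FROM hierarchy_bridge;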
  • Data Warehouse Design Techniques – Accumulating Snapshot Fact Tables
    Today we will look at a different type of fact table, the accumulating fact. Accumulating Fact Tables Accumulating fact tables are used to show progress through a well-defined business process and are most often used to research the time between milestones. These fact tables are updated as the business process unfolds and each milestone is completed. The most common examples can be found in order and insurance processing. Here a single fact table may contain the important business process milestones upon which the organization wishes to measure performance. A different version of this same accumulating fact table may include the calculated lag time between the milestone dates as shown below. The lag (number of days) can be calculated from the claim date (total time to complete the milestone) or the lag can be calculated as the number of days since the last milestone was completed (days spent in this stage). The important part is to be consistent in calculating the lag within the snapshot fact. Capturing this lag data can make […]
  • Data Warehouse Design Techniques – Snapshot Fact Tables
    Last week we defined the three types of fact tables and have established a solid foundation in the definition and use of transactional fact tables. This week we will focus on periodic snapshot fact tables. Snapshot Fact Tables (Periodic) Snapshot fact tables capture the state of the measures based on the occurrence of a status event or at a specified point-in-time or over specified time intervals (week, month, quarter, year, etc.). The Snapshot Fact table and the Transactional fact table can be very similar in design. In the design above the sales are rolled up to the month in the Periodic Snapshot fact table. In this table, you are unable to drill down past the month, as the month is the lowest grain of this Periodic Snapshot fact table. Periodic Snapshot tables are also important when measuring those metrics which may not have a corresponding transaction, such as air temperature. We cannot capture transactions causing the air temperature to rise or fall but we can measure the temperature at predetermined points in time. As you […]
  • Data Warehouse Design Techniques – Fact Tables
    Now that we have established a solid foundation with dimension tables, we will turn our focus to the fact tables. Fact tables are data structures which capture the measurements of a particular business process. The measurements (quantity, amount, etc.) are defined by the collection of related dimensions. This collection of dimensional keys is called the grain of the fact. Types of Fact Tables The three basic fact table grains are the transactional, the periodic snapshot, and the accumulating snapshot. Transactional fact tables are the most common fact in dimensional modeling. Transactional fact tables capture the measurement at its most atomic dimensional level, at the point-in-time of the transaction. This allows the fact table to provide robust dimensional grouping and roll-up and drill-down reporting capabilities to the business user. Periodic Snapshot fact tables capture the state of the metrics at a specified point-in-time. Periodic Snapshot fact tables are used to quickly assess and review the performance of the measures over specified time intervals (week, month, quarter, year, etc.). As you can see the Periodic Snapshot […]
  • Data Warehouse Design Techniques – Bridge Tables
    There will be times when you arrive at a many-to-many relationship between a fact and a dimension. This many-to-many relationship is in violation of the dimensional modeling rules, so how do you handle this multi-valued relationship? Bridge tables. In this blog post I will take you through the importance and usefulness of bridge tables and how and when to use these data structures. Multi-Valued Dimensions Multi-valued dimensions are more common than most people think and are often “refactored” to a single-valued dimension. Coming from a publishing background, sales of articles come to mind as a multi-valued dimension. An article must have at least one author but could very easily have two or more. How do you handle this in the warehouse? You could define one author as the primary author or you could create a really long field and place all of the authors of the article in this field. But how can you find the total sales for all of the articles written by a single individual who […]
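    A sketch of the bridge that resolves the article-to-authors relationship, together with the query it enables (the names and the optional weighting factor are illustrative):
        -- One row per (author group, author); the fact row carries only the group key.
        CREATE TABLE bridge_author_group (
            author_group_key  INTEGER       NOT NULL,
            author_key        INTEGER       NOT NULL,
            weighting_factor  NUMERIC(5,4)  NOT NULL   -- e.g. 0.5 for each of two co-authors
        );
        -- Total (weighted) sales credited to each individual author.
        SELECT a.author_name,
               SUM(f.sale_amount * b.weighting_factor) AS credited_sales
          FROM fact_article_sales  f
          JOIN bridge_author_group b ON b.author_group_key = f.author_group_key
          JOIN dim_author          a ON a.author_key       = b.author_key
         GROUP BY a.author_name;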
