• Data Warehouse Design Techniques – Constraints and Indexes
    In this week’s blog, we will discuss constraints and indexes. In data warehousing, like in life, constraints are things we love to hate. Constraints keep us from making mistakes, which in most cases is a good thing, until we come across an anomaly which needs to be addressed but the constraints prevent this from happening. Most of the time indexes help us to find data faster, but that is not always the case.   What are Indexes? Indexes are data structures which hold field values from the indexed column(s) and pointers to the related record(s). This data structure is then sorted and binary searches are performed to quickly find the record.   When should I Use Indexes? In general, you use indexes when you have a large data set. Small lookup or reference tables are bad candidates for indexes. In most cases the database can more efficiently select all of the information from the table without ever using the index, making the index useless. As a rule of thumb, let the SQL be your guide. […]
  • Data Warehouse Design Techniques – Aggregates
    In this week’s blog, we will discuss how to optimize the performance of your data warehouse by using aggregates. What are Aggregates? Aggregates are the summarization of fact related data for the purpose of improved performance. There are many occasions when the customer wants to be able to quickly answer a question where the data is at a higher grain than we collect. To avoid slow responses due to system summing data up to higher levels we can pre-calculate these requests, storing the summarized data in a separate star. Aggregates can be considered to be conformed fact tables since they must provide the same query results as the detailed fact table. When should I use Aggregates? Aggregates are to be considered for use when the number of detailed records to be processed is large and/or the processing of the customer queries begins to impact the performance of the BI application. Customers of BI applications expect an application which is very responsive to their queries. Aggregates are the way to meet this implied requirement. Aggregate Example […]
  • Data Warehouse Design Techniques – Derived Schemas
    Getting the correct answer is one of the most important requirements of a data warehouse, but this is only part of the requirement. The full requirement is to provide the correct information, to the user at the right time. The information is no good to the user if they need the answer today but you need to write a custom report which will take a week to develop, test and deploy. Although you will be able to provide the correct answer, it will be late and therefore of limited to no value to the decision-making process. How can we address this issues of flexibility and timeliness while building our data warehouse? In this blog article, I will demonstrate how derived schemas can address these challenges. Snapshot Fact & Accumulating Snapshot Fact Tables We have previously discussed these data structures in my previous blog articles titled “Snapshot Fact Tables” and “Accumulating Snapshot Fact Tables”. In these articles, I showed how we improve performance by moving the aggregation and computation of the metrics from the query to […]
  • Loading Hierarchical Bridge Tables
    This blog article is a follow-up to the Ragged Hierarchical Dimensions article I posted a few weeks ago.  In this article, I spoke of using a hierarchical bridge table at a high level, today I will discuss the nuances of this method in more detail.Let’s begin at the dimension table. In order to capture the data with the appropriate level of detail you will need to design your dimension with recursive fields.I have designed this dimension table as a type 2 dimension table and I include the recursive (parent) field as an attribute. This allows me to capture the point-in-time changes of the dimensional attributes while still being able to display the hierarchy. Most reporting tools cannot handle this recursive field, so we need to find a way to transform the recursive join into a relational join. To do this we will use recursive query as a part of the ETL process to create this relational join (bridge) table. The query will create a relational hierarchy consisting of each record in the hierarchy associated with […]
  • Data Warehouse Design Techniques – Accumulating Snapshot Fact Tables
    Today we will look at a different type of fact table, the accumulating fact. Accumulating Fact Tables Accumulating Fact tables are used to show the activity of progress through a well-defined business process and are most often used to research the time between milestones.  These fact tables are updated as the business process unfolds and each milestone is completed.  The most common examples can be found in order and insurance processing. Here a single fact table may contain the important business processes milestones upon which the organization wishes to measure performance. A different version of this same accumulating fact table may include the calculated lag time between the milestone dates as shown below. The lag (number of days) can be calculated from the claim date (total time to complete the milestone) or the lag can be calculated as the number of days since the last milestone was completed (days spent in this stage). The important part is to be consistent in calculating the lag within the snapshot fact. Capturing of this lag data can make […]
  • Data Warehouse Design Techniques – Snapshot Fact Tables
    Last week we defined the three types of fact tables and have established a solid foundation in the definition and use of transactional fact tables. This week we will focus on periodic snapshot fact tables. Snapshot Fact Tables (Periodic) Snapshot fact tables capture the state of the measures based on the occurrence of a status event or at a specified point-in-time or over specified time intervals (week, month, quarter, year, etc.). The Snapshot Fact table and the Transactional fact table can be very similar in design. In the design above the sales are rolled up to the month in the Periodic Snapshot fact table. In this table, you are unable to drill down past the month, as the month is the lowest grain of this Periodic Snapshot fact table. Periodic Snapshot tables are also important when measuring those metrics which may not have a corresponding transaction, such as air temperature. We cannot capture transactions causing the air temperature to rise or fall but we can measure the temperature at predetermined points in time. As you […]
  • Data Warehouse Design Techniques – Fact Tables
    Now that we have established a solid foundation with dimension tables we will now turn our focus to the fact tables. Fact tables are data structures which capture the measurements of a particular business process. The measurements (quantity, amount, etc.) are defined by the collection of related dimensions. This collection of dimensional keys is called the grain of the fact. Types of Fact Tables The three basic fact table grains are the transactional, the periodic snapshot and the accumulating snapshot. Transactional fact tables are the most common fact in dimensional modeling. Transactional fact tables capture the measurement at its most atomic dimensional level, at the point-in-time of the transaction. This allows the fact table to provide robust dimensional grouping and roll-up and drill-down reporting capabilities to the business user. Periodic Snapshot fact tables capture the state of the metrics at a specified point-in-time. Periodic Snapshot Fact tables are used to quickly assess and review the performance of the measures over specified time intervals (week, month, quarter, year, etc.). As you can see the Periodic Snapshot […]
  • Data Warehouse Design Techniques – Bridge Tables
    There will be times when you arrive at a many to many relationship between a fact and a dimension. This many to many relationship is in violation of the dimensional modeling rules, so how do you handle this multi-valued relationship?  Bridge tables. In this blog post I will take you through the importance and usefulness of bridge tables and how and when to use these data structures. Multi-Valued Dimensions Multi-valued dimensions are more common than most people think and are often “refactored” to a single valued dimension. Coming from a publishing background, sales of articles come to mind as a multi-valued dimension. The sale of articles must have one author but could very easily have two or more authors. How do you handle this in the warehouse? You could define one author as the primary author or you could create a really long field and place all of the authors of the article in this field. But how can you find the total sales for all of the articles written by a single individual who […]
  • Data Warehouse Design Techniques – Ragged Hierarchical Dimensions
    In my last post, we discussed the creation of simple hierarchical dimensions. In this post, I will discuss how to handle those ragged hierarchies, those which can skip levels in the hierarchy. Snowflaking There may be instances where you wish to normalize the hierarchy into a snowflake. Below is a city, state/province, country hierarchy which can easily be converted into a snowflake model. Here we would simply take the list of all of the countries and place them into a country dimension, the state/provinces into a state/province table and finally a city into a city table. While loading these dimensions, you must remember to carry the parent and grandparent IDs along. Therefore, the city dimension will reference the state table where appropriate and the country table (always) while the state/province table will only reference the country dimension. Flatten Hierarchy One way to handle the ragged hierarchy is to flatten the hierarchy just as we did in the simple dimensional hierarchy and copy the grandparent level data down to the parent level. This copying of this […]
  • Data Warehouse Design Techniques – Simple Hierarchical Dimensions
    Today, continuing through our look at the advanced modeling techniques of dimensional data warehouses, we will take a look at simple hierarchical dimensions. Simple Hierarchical Dimensions Hierarchical dimensions are those dimensions which have a parent/child relationship. In simple hierarchies there every child has a parent at the level above with no skipping of levels. A date (day, month, quarter, year) is the most common example of a simple hierarchical dimension and one that is used by all dimensional data warehouses. The question for the data modeler is how to model the hierarchy? Let’s use the date dimension for some initial examples. The first and most common way to model the date hierarchy is to flatten the date into a single dimension where you have every piece of information you need concerning the date in a single location. In this flattened dimensional hierarchy, you can see all of the parent, grandparent, great-grandparent, etc. information of the date in a single table. The selecting of a date will provide you with the detailed information about the date […]
  • Data Warehouse Design Techniques – Snowflaking
    We will continue our deep dive through advanced dimensional data warehouse design techniques by discussing snowflaking. Snowflaking Dimension Tables According to Dr. Kimball, dimensions should be designed as long, denormalized records. The reason for denormalization is to ensure maximum performance of the DW/BI queries. Snowflaking is the normalization of the dimension tables by removing all redundant attributes into separate dimension tables. The reason for normalizing is often the desire to “enhance” the design and reduce the storage costs of the warehouse. Although this normalized design appears to optimize the storage and maintenance of the warehouse it actually introduces maintenance complexities while introducing sub-optimal query performance issues for DW/BI queries. Reporting Using Snowflakes As I demonstrated in a previous blog, Type 2 Slowly Changing Dimensions (SCD) provide the most flexibility for BI reporting because of the ability to report on historical data, including point-in-time reports. Snowflaking can sometimes hide the ability to see these changes the snowflake dimensions. Normalization, by definition, does not capture the transactional details of the modifications of the data. This lack of […]
  • Data Warehouse Design Techniques – Degenerate & Junk Dimensions
    Today we will look at the advanced dimensional data warehouse design techniques of Degenerate and Junk Dimensions. Degenerate Dimension Tables As I stated in an earlier blog post, degenerate dimensions are not physically implemented data structures. Degenerate dimension attributes exist in the fact table as a part of the primary key but have no corresponding dimension. Let’s look at the most common example of a degenerate dimension, the invoice. Invoices contain a lot of information which can be complex and sometimes difficult to model. All invoices contain information about products and customers, but some contain additional information like shipping information and product location (warehouse) information. Starting with a conceptual the model for an invoice, our model may look like this: Some may want to try to create an invoice dimension to capture the invoice information in a single dimension as can be seen in this logical model: While this model accurately depicts the invoice transaction there are several issues with this model. The greatest among these issues would be that the invoice dimension can be […]
  • Data Warehouse Design Techniques – Rapidly Changing Dimensions
    In my last blog post, I showed how to capture changes for slowly changing dimensions. Today we will see how to handle rapidly changing dimensions. Rapidly Changing Dimensions Rapidly changing dimensions are dimensions where the attribute values of the dimension change frequently causing the dimension grow rapidly if you a have designed the dimension to capture the changes as a Type 2 dimension. The rapid growth of this dimension will impact maintenance and performance as the dimension grows. Let’s take a look at solutions to this challenge. Mini-Dimension (SCD Type 4) Mini-dimensions contain the rapidly changing attributes of the original dimension and are treated as a stand-alone dimension. This dimension will band the attributes together instead of treating them as individual pieces of data. The fact table can then connect to the mini-dimension to based upon the data provided or this mini-dimension could be connected to the “parent” dimension using the surrogate key as a Type 1 or Type 2 SCD attribute. An example of this can be seen in a customer table where the […]
  • Data Warehouse Design Techniques – Slowly Changing Dimensions
    In my last blog post, I demonstrated the importance of conformed dimensions to the flexibility and scalability of the warehouse. This week we will discuss the importance of capturing the dimensional change in slowly changing dimensions. Slowly Changing Dimensions Slowly Changing Dimensions (SCD) are the most commonly used advanced dimensional technique used in dimensional data warehouses. Slowly changing dimensions are used when you wish to capture the changing data within the dimension over time. There are three methodologies for slowly changing dimensions.   Type 1 – For this type of slowly changing dimension you simply overwrite the existing data values with new data values. This makes the updating of the dimension easy and limits the growth of the dimension table to only new records. The drawback of this is you lose the historical value of the data because the dimension will always contain the current values for each attribute. For instance, you have a store dimension which has an attribute for a geographic region. If there is a redesign in the regional boundaries some stores may move […]
  • Data Warehouse Design Techniques – Conformed Dimensions
    In my last blog post, I presented to you some of the more widely used advanced technical dimensional design techniques. Over the next several of weeks, I will provide detail to each of the design techniques. Conformed Dimensions Conformed dimensions are those dimensions which have been designed in such a way that the dimension can be used across many fact tables in different subject areas of the warehouse. It is imperative that the designer plan for these dimensions as they will provide reporting consistency across subject areas and reduce the development costs of those subject areas via reuse of existing dimensions. The date dimension is an excellent example of a conformed dimension. Most warehouses only have a single date dimension used throughout the warehouse. Conformed Dimension – Challenges Structural Differences – in some cases you will have source tables which contain different table structures These may include: additional or missing columns columns with different data types columns named differently containing the same or similar data These differences can usually be resolved easily but occasionally this […]
  • Advanced Dimensional Data Warehouse Design Techniques
    In an earlier blog post, I walked you through the basics of dimensional data warehouse design by introducing you to dimension tables, fact tables and star schema design. For this blog post, I will introduce some of the more widely used advanced dimensional design techniques which I will describe in detail in the coming weeks. DIMENSIONS Conformed Dimensions Conformed dimensions are those dimensions which have been designed in such a way that the dimension can be used across many fact tables in different subject areas of the warehouse.   Slowly Changing Dimensions Slowly changing dimensions are the second most commonly used advanced dimensional technique used in dimensional data warehouses. Slowly changing dimensions are used when you wish to capture the changing data within the dimension over time.   Hierarchical Dimensions Hierarchical dimensions are those dimensions which have an associated hierarchy. A date (day, month, quarter, year) is an example of a hierarchical dimension and one that is often used by many data modelers and data warehouses.   Snowflake Dimension Tables The removing all redundant dimensional […]
  • Data Warehousing Documentation Review
    One of the most important artifacts delivered as a part of any project is documentation. Documentation is the bane of IT development teams. Most developers feel “if it was hard to write, it should be even harder to understand and modify.” When developing and delivering a data warehouse documentation is critical to the success of the project. This documentation will help both the business users and the technical teams understand the source, the transformation and storage of the data they need to consume. These documents are the foundation upon which the warehouse will be built.   Business Requirements Document (BRD) This is a formal document which requires review and approval. The BRD captures problem the organization is trying to solve and any restrictions on that need to be considered. Essentially, the BRD defines the business user’s high-level specifications (scope) on what they are hoping to achieve from the data warehouse and delivered through a BI reporting or analytic tool along with any assumptions, constraints and risks which impact or may impact the project. This document […]
  • Data Warehousing Best Practices
    There are many times when you completed a task only to say “I wish I would have known that before I started this project” Whether it is fixing the breaks on your car, completing a woodworking project or building a data warehouse, best practices should always be observed to ensure the success of the project. These are seven of the best practices I have observed and implemented over the years when delivering a data warehouse/business intelligence solution. As you will see, most of these are not technical solutions but focus more on the soft skills needed to ensure the success of these long in duration and expensive solutions.   Define Standards Before Beginning Design I liken this practice to the “measure twice, cut once” adage. It is important that all of the documentation and physical deliverables of the project be defined at the outset of the project. Naming standards, documentation standards, coding standards, weekly status reports, release deliverables, etc. should all be defined before the kick-off meeting. Examples of these documents should be a part […]
  • Designing a Dimensional Data Warehouse – The Basics
    Last week I wrote about Relational (Atomic) Data Warehouses and how to create these data structures. This week we will look at Dimensional Data Warehouses and how they differ from the Relational Data Warehouse. Creating a dimensional data warehouse is very different from creating a relational data warehouse. In the relational data warehouse, you are looking to store every atomic element which the organization has deemed to be important from the source applications and storing this data in a relational structure. Relational data structures are optimized for transactional speed and the limiting the use of disk space which is why this methodology is so prevalent in transaction-based systems. Dimensional models are specifically designed for optimized for the delivery of reports and analytics.   Dimensions – What are they? Dimensional Data Warehouses must have dimensions, right? So what are dimensions? Think of dimensions as containers of adjectives which describe the dimension and will help us limit the fact information we wish to see. Dimension tables can contain a lot of columns if the dimension is robust […]
  • Designing a Relational Data Warehouse
    The use of relational modeling is prevalent in today’s business landscape as most applications are built using relational databases. Relational modeling is based on mathematical set theory and was first described in 1969 by E.F. Codd. He stated that a grouping of elements (tuples) could be related to other tuples in such a way so as to limit the amount of data stored for a specific tuple and therefore optimize the storage of such data. This sounds difficult! However, it is easy. Let’s look at a real world example. Take a look at the information collected about John Smith below. John Quincy Smith DOB: 12/31/1969 National ID: 123-45-6789 1234 Main Street, Anytown, VA 20001 (h) 8000 Corporate Drive, Suite 1022, Sombertown, VA 20101 USA (w) (w) (p) (p) (703) 555-1212 (w) (703) 212-1555 (h) (571) 234-9876 (c) 1234-5678-9012-3456 (Visa) Could all of this information be stored in a single table? Yes, absolutely. However, storing data in a single table is inefficient. In this example, we added a second personal email address column. […]
  • Data Warehouse Design Methodologies
    There are two traditional data warehouse design methodologies came of age in the 1990’s, that of Bill Inmon’s Top-Down Atomic Data Warehouse and that of Ralph Kimball’s Bottom-Up Dimensional Data Warehouse. These methodologies have been used over the past 20 years to create informational data stores for organizations seeking to leverage their data for corporate gain. In this article, we will compare and contrast these two methodologies.   Atomic Data Warehouse – Bill Inmon Bill Inmon’s Atomic Data Warehouse approach is strategic in nature and seeks to capture all of the enterprise data in 3rd Normal Form and store all of this atomic data in the data warehouse. Inmon defines a data warehouse as a subject-oriented, non-volatile, time-variant and integrated data source. Let’s break down each of these descriptors of the Inmon’s Data Warehouse. Subject-Oriented – the data is organized so that the data, related by subject area, is linked together. Non-Volatile – once data is entered it is never updated or deleted; all data is retained for future reporting needs. Time-Variant – because of […]
  • Requirements for a Successful Data Warehouse Project
    In order to have a successful data warehousing project you will need the following five challenges addressed: #1. Executive Sponsorship All projects need executive sponsorship. Without senior executive leadership, most corporate projects will likely fail. Data warehouse projects are no different, in fact, these projects are even more in need of executive sponsorship than most business initiatives. It is imperative that the organization’s corporate leaders infuse the enterprise with excitement concerning the success of a data warehousing project. These executives must also tie the success of the project to each and every one of their direct reports. This will ensure that all of the business units of the organization are linked to the success of the Data Warehousing project. Organizational commitment from the top down will eliminate data fiefdoms, those individuals/departments who refuse to share their data or information with the larger organization for fear of loss of power or influence. These fiefdoms often work subtly to sabotage the success of the warehouse project. They do this out of fear, afraid that sharing their information […]
  • Data Warehousing – The Basics
    What Are Data Warehouses? A data warehouse is a large, central store of data from disparate source systems across the enterprise optimized for analysis and reporting. The data is extracted from the source transaction based application databases, transformed into new data structures through the applying a set of rules and functions against the extracted data before loading the transformed data into the data warehouse data structure which is optimized for reporting. This central data store provides the business user with a snapshot of the business at a point-in-time. The data warehouse is used in making both strategic and tactical business decisions.   Why Are Data Warehouses Important? Data warehouses hold large quantities of source system data over long periods of time. This time-variant data contained in the data warehouses provide the foundation upon which historical analytical and reporting applications are built upon. These applications provide the organization’s senior leadership with a quick and holistic view of the success of the organization over time. Data warehouses can also be used and the source data used for […]
  • Introducing Vilocity 2.0….
    By: Ryan Legge For those of you who missed our release on Vilocity 2.0, you can read it here.  As I am sure you know from having read about this revolutionary solution, Vilocity uses an open source framework to combine two powerful COTS BI/Data Discovery products for data analysis and reporting: Oracle Business Intelligence Enterprise Edition (OBIEE) and Oracle Endeca Information Discovery (OEID). This framework combines these two powerful tools into a single user experience, allowing users to seamlessly navigate from Business Intelligence to Information Discovery. Several key features of Vilocity include: A common user interface across both OBIEE and OEID The ability to filter and navigate both OBIEE and OEID components seamlessly utilizing OEID’s Guided Navigation A true personalized experience that allows users to combine both OBIEE and OEID components into a single dashboard We could talk all day about the benefits of utilizing the Vilocity framework to provide a truly unified Analytics capability. But for this post, we want to talk about one of the new features of Vilocity 2.0, the unique Portlet […]
  • Introduction to Endeca Studio 3.1
    By: Matthew T. Farrington The Oracle Endeca Information Discovery (OEID) platform provides an incredibly powerful, intuitive, and fast means of digging through complex data-sets of varied structures. All of us at NuWave Solutions celebrated when we received our Oracle Partner Network (OPN) status specializing in Oracle Endeca Information Discovery (OEID) on December 2013. The best way to see the capabilities of software is to build a simple application and start analyzing some data. Fortunately, you don’t need a team of developers, database exerts, or front-end designers to construct a robust data discovery application. Endeca Studio 3.1 comes with a simple application builder tool that takes a given data-set and produces a simple application which you can subsequently customize to your desire. This tutorial will walk you through application creation and component configuration, starting you down the path of data discovery. To quickly become familiar with Endeca Studio 3.1 we are going to create a new application that leverages Chicago crime statistics to examine why so many police reports do not result in an arrest. For the […]