In this week’s blog, we will discuss constraints and indexes. In data warehousing, like in life, constraints are things we love to hate. Constraints keep us from making mistakes, which in most cases is a good thing, until we come across an anomaly which needs to be addressed but the constraints prevent this from happening. Most of the time indexes help us to find data faster, but that is not always the case.
What are Indexes?
Indexes are data structures which hold field values from the indexed column(s) and pointers to the related record(s). This data structure is then sorted and binary searches are performed to quickly find the record.
When should I Use Indexes?
In general, you use indexes when you have a large data set. Small lookup or reference tables are bad candidates for indexes. In most cases the database can more efficiently select all of the information from the table without ever using the index, making the index useless. As a rule of thumb, let the SQL be your guide. If you notice that a particular statement is taking a long time then take a look at the explain plan and determine if adding an index will help. In the early stages of data warehouses some tables will not need an index, but as time goes by and more data is added to the table you may find the need to add one or more indexes to the table to ensure optimal performance. Remember that you can also drop indexes before a large ETL load and rebuild them upon completion to speed up the loading of large data sets.
What are Constraints?
Constraints are data structures used to ensure the data being entered into the table meets certain requirements before it can be committed to the table. Constraints come in many forms:
- Check – data in the column is verified against an expression before it is accepted as a valid entry. Validating that a column only contains a “Y” or “N” character for a binary column or validating that a column is not null.
- Unique – validates that the data in the column(s) contains a unique value for all of the records in the table. This constraint is implemented via an index.
- Primary Key – special type of unique constraint which can be used as the primary way to retrieve a unique record from the table. Tables can have multiple unique indexes, but it can have only one Primary Key Constraint. This constraint is implemented via a unique index and is available to be referenced by a Foreign Key.
- Foreign Key – a set of one or more columns used to reference a Primary Key in another table to ensure data integrity.
When should I use Constraints?
Because constraints are used to ensure the integrity and consistency of the data they should be used in almost all cases. But, as is often the case, you may come across some cases you may wish to forgo the constraints and rely on the ETL process instead.
Constraints come with some overhead which you must keep in the forefront of your mind when implementing your physical data structures. Unique and Primary Key constraints are implemented using indexes. In most cases, Foreign Key constraints can also be implemented with associated indexes. Please note this is not required. Implementing indexes for all foreign keys may have a detrimental impact on the performance of your data warehouse loading process. Of course, not having indexes on foreign keys may also have a negative effect on the ETL performance. So how do you decide when to add an index to a foreign key? I only implement Primary Keys when initially implementing a new fact table to the data warehouse. I always wait until we move code to QA before deciding which Foreign Key needs an index to improve the performance of the ETL or the application. This way I ensure that only those indexes which are absolutely necessary are implemented. In some cases, I create the index before executing the ETL and then remove the index immediately after the data loading is complete or vice versa so as to optimize the performance of both the ETL and the application. Please keep this in mind as you design and implement your data warehouse.
In conclusion, Constraints and Indexes are important data structures used to ensure the quality of the data and the speed in which the data is retrieved. When designing your data warehouse, verify that all appropriate constraints are defined and implemented. As for indexes, please take time to review your indexes on a regular basis to ensure that they are still being used and or find those indexes which are missing and keep your ETL and application running at optimal performance.