August 16, 2011 Leave a comment
Employing surrogate keys is an essential strategy to protect the integrity of data assets in the EDW. For those looking to defend such a position against those who would prefer human-readable business keys (a.k.a. natural keys), this article lays out some arguments. I hope by the end the non-believers will be converted, and those who already know the value of surrogates can move ahead without opposition.
Michelle A. Poolet has provided four criteria to determine whether a surrogate key is required:
1. Is the primary key unique?
Where a natural key is not unique, such as a person’s full name, a surrogate key must be used. However, a unique composite alternate key will be needed to match on the record for updates, to attach attribution or establish relationships.
2. Does it apply to all rows?
Take an example such as Customer. Customers may all come from the same source and be loaded into a Party table. However, parties that play different roles and that are loaded from different source systems, may have natural keys of different formats (e.g., company business number, vendor identifier)
3. Is it minimal?
Composite keys and long descriptive fields are obvious candidates for transformation to surrogate keys.(e.g., Postal Address)
4. Is it stable over time?
Identifiers generated by a Master Data Management (MDM) solution are designed not to alter over time. However, if the MDM solution was to change technology, or some sources were to be passed into the EDW through a different channel, the MDM key would be problematic. It is preferable for surrogate keys to be generated internally to the Information Warehousing layer.
Rationale for preference of Surrogate Keys over Natural Keys:
|#||Issue||Natural Key||Surrogate Key|
|1||Primary and Foreign Key Size||Natural keys and their indexes can be large, depending on the format and components involved in making a record unique.||Surrogate keys are usually a single column of type integer, making them as small as possible.|
|2||Optionality and Applicability||The population of natural keys may not be enforced in the source data.||Surrogate keys can be attached to any record.|
|3||Uniqueness||Natural keys always have the possibility of contention. Even where a single source has protected against it, contention may exist across sources.||Surrogate values are guaranteed to be unique.|
|4||Privacy||Natural keys may be prone to privacy or security concerns.||Surrogate keys are cryptic values with no intrinsic meaning.|
|5||Cascading Updates||If the business or source changes the format or datatype of the key, it will need to be updated everywhere it is used.||Surrogate values never change.|
|6||Load Efficiencies||With the exception of very short text values, VARCHAR values are less efficient to load than INTEGER.||Surrogate values can be loaded more efficiently.|
|7||Join Efficiencies||With the exception of very short text values, joins using VARCHAR values are less efficient than INTEGER.||Surrogate values can be joined more efficiently.|
|8||Common key strategy||A lack of a common key strategy will lead to longer decision-making process and more patterns for code development.||A common key strategy based on the use of surrogates will streamline design and development.|
|9||Relationships||Natural keys with different data types will lead to the creation of more objects in the database to accommodate the associations between entities.||The use of surrogates promotes reuse of associative entities for multiple types of relationships.|
|10||History||Related to points above, the use of natural keys inhibits the flexibility and reuse of objects and patterns to support change capture.||Surrogate keys enable a strategic design to store history using associatives.|
The list above is drawn in part from an article by Lee Richardson, discussing the pros and cons of surrogate keys.
In summary, the reasons to employ surrogate keys as a design policy are related to two factors
In the majority of cases the surrogate key will perform more efficiently for both loading and joining tables.
Even in the minority of cases where a natural key will perform as well as a surrogate, by adopting a consistent surrogate key policy there are benefits to efficiencies gained in design, development and maintenance.