Data Design Principles

Obey the principles without being bound by them.

– Bruce Lee

Taking a practical approach to developing a well-formed enterprise data warehouse – and by that, I mean one that is accurate, efficient and productive – involves basing it on sound design principles. These principles are specific to each sector of the reference architecture; each of which enables specific capabilities and serves specific functions. Here, I would like to lay out the principles of the Information Warehousing layer’s normalized central repository – the system of record.

The Information Warehousing layer is designed as a normalized repository for the data that has been processed “upstream”. It arrives cleansed, transformed and mastered; and is consolidated here into a single “System of Record”.  The discipline of normalization restructures the data, removing it from the confines of the single perspective of the source system, into the multiple perspectives across the enterprise. The data is modelled according to its “essence” rather than its “use”.

This process of redesign imposes a strict order on the data that promotes data integrity and retains a high degree of flexibility. The way the data is broken apart into separate tables makes it challenging to query, but this is not its main purpose. Data integrity and flexibility are the primary goals, and tuning is geared towards load performance rather than data access.

  1. Data Integrity
    Take a proactive stance to protect referential integrity and reduce instances of redundancy or potential for inconsistency.
  2. Scalability
    Allow for increases in volumes or additional sources of existing information, both within subject areas (e.g., issuers, counterparties) and core concepts (e.g., parties, locations).
  3. Flexibility
    Allow for additional sources or changes in existing sources, so that design is not tied to a given source or mirrors the source. Design will give primary consideration to reuse, then extension and finally modification of existing structures.
  4. Consistency
    Apply standard patterns for data design to promote efficiencies of data and ETL design. The decision-making process will be expedited as will data modelling work and ETL development.
  5. Efficiency
    Focus efficiency on three aspects:

    1. Implementation
      Use of repeatable patterns for data design will minimize data modelling and ETL work effort.
    2. Operation
      Ease ongoing maintenance by keeping the number of data objects to a minimum; maintain consistent standards; and apply logical structures for ease of navigation and use.
    3. Load Performance
      Priority given to performance of ETL load processes; including those that use the System of Record as a source to load the Data Mart sector.
  6. Enterprise Perspective
    For all data objects, retain and remain open to, a full range of existing and potential relationships between entities to ensure that data reflects an enterprise perspective that is not limited to the perspective of any given project’s requirements.

These foundational principles are implemented through strategies that impact storage of history, hierarchical structures, degree of normalization, classifications, surrogate keys and a number of other aspects of design. The principles form the criteria to judge the best approach to take in a given situation. It’s not always straightforward, even with the principles in place – at times one has to favour one principle over another (e.g., flexibility over load performance), but this list provides the guidance to frame the debate and take a considered approach.

As suggested by the opening quote, I’m not advocating a blind conformance to a set of rules; but, in my experience, one of the greatest obstacles to efficient and effective development is the decision-making process. Limiting the parameters of debate with intelligent guidelines can facilitate decisions being made quickly and correctly.

Feel free to suggest additions or amendments to this list of design principles for the System of Record.


Reference Domains Part II: Modelling Classifications

This is the second article in the series on working with reference domains, also commonly referred to as classifications. In Part I, we looked at the nature of classifications. Here we will discuss designing structures to accommodate them in the EDW’s Information Warehousing layer, within the normalized System of Record.

Two potential approaches to designing the Reference Domains are to:

  1. Put each reference domain of values into its own table.
  2. Collect all reference domains into two tables; one for a master set of domains and the other for all the values, with a column linking each to the relevant domain.

The second approach is the recommended one, and the remainder of this article will present the design and the rationale. I have seen the consolidated reference data approach implemented many times within the System of Record of the EDW; and it has proved highly effective.

Given this preferred approach, every Reference Domain and Value is to be placed in the following entities:

  • Reference Domain
  • Reference Value

As mentioned in Reference Domains Part I, reference domains are commonly made up of pairs of code and description attributes. However, it is possible that some domains may be comprised of additional attribution, such as short names, long names, multi-language names etc. In such cases, consideration should be given to extending the model to accommodate additional fields. (This is not the same as related  reference values that may be stored together in the source, but should be separated in the EDW.)

The domain values may be organized hierarchically. To support many-to-many relationships between domain values, in terms of hierarchies or other relationship types, the associative entity, Reference Value to Reference Value Relationship, should be used. As with all associatives, it will accommodate multiple concurrent as well as historical relationships.

To facilitate changes in values within a given scheme, a Reference Domain to Reference Value Relationship entity can be deployed.

Maintaining History

The history of values for a given scheme in relationship to other core concepts is to be retained within relevant associative structures.

For example, in the diagram below, the current value of Party Status Id, with its effective date, called Party Status Change Date, are held on the Party entity. Historical values are held on the Party to Domain Relationship entity, with the Domain Scheme Id holding the identifier for the Party Status domain, and the Domain Value Id holding the identifier for the previous value.

To illustrate the example more fully, consider the following:

The Domain Scheme and Domain Value tables contain rows for the Party Status domain scheme.

Day 1

On day one, the party enters the system with a party status of “Active”. There are no entries in the Party to Domain Relationship table.

Day 2

On day two, the party status changes to “Closed”. The current value in Party is overwritten with the new value; the change date is updated; and the old value is inserted into the associative.

Design Benefits

The following points outline the benefits of employing a consolidated set of domain schemes and values within a small set of tables, versus creating a separate table for each domain scheme’s set of values.

  1. Data Integrity: The integrity of the allowable set of values within a given domain are maintained through the application logic defined and implemented through the ETL. Maintaining a master list reduces the chance of inconsistencies or the appearance of anomalies.
  2. Flexibility: There is only one table to access for all domain value codes and descriptions within the SOR.
  3. Implementation Efficiency: No additional database entities or tables to be created, tested or maintained within the System of Record (SOR) logical or physical model. Fewer objects means less chance for error.
  4. Operational Efficiency: A single utility can be created to interface with this table. It is true that even with multiple tables a single utility could be created with minimal logic. However, each new object would require some changes to the application, whereas the consolidated version can be extended seamlessly.
  5. Consistency: History is stored in the classification associative tables (e.g., Party to Domain Relationship). The retention of the domain schemes and values as surrogate keys in the Domain Value entity facilitates these structures. This allows history to be gathered using a common mechanism, explicitly identifying both the scheme and the value.

In part three, we will go on to look at collecting and documenting classifications.

In Defense of Surrogate Keys

Employing surrogate keys is an essential strategy to protect the integrity of data assets in the EDW. For those looking to defend such a position against those who would prefer human-readable business keys (a.k.a. natural keys), this article lays out some arguments. I hope by the end the non-believers will be converted, and those who already know the value of surrogates can move ahead without opposition.

Michelle A. Poolet has provided four criteria to determine whether a surrogate key is required:

1. Is the primary key unique?

Where a natural key is not unique, such as a person’s full name, a surrogate key must be used. However, a unique composite alternate key will be needed to match on the record for updates, to attach attribution or establish relationships.

2. Does it apply to all rows?

Take an example such as Customer. Customers may all come from the same source and be loaded into a Party table. However, parties that play different roles and that are loaded from different source systems, may have natural keys of different formats (e.g., company business number, vendor identifier)

3. Is it minimal?

Composite keys and long descriptive fields are obvious candidates for transformation to surrogate keys.(e.g., Postal Address)

4. Is it stable over time?

Identifiers generated by a Master Data Management (MDM) solution are designed not to alter over time. However, if the MDM solution was to change technology, or some sources were to be passed into the EDW through a different channel, the MDM key would be problematic. It is preferable for surrogate keys to be generated internally to the Information Warehousing layer.

Rationale for preference of Surrogate Keys over Natural Keys: 

# Issue Natural Key Surrogate Key
1 Primary and Foreign Key Size Natural keys and their indexes can be large, depending on the format and components involved in making a record unique. Surrogate keys are usually a single column of type integer, making them as small as possible.
2 Optionality and Applicability The population of natural keys may not be enforced in the source data. Surrogate keys can be attached to any record.
3 Uniqueness Natural keys always have the possibility of contention. Even where a single source has protected against it, contention may exist across sources. Surrogate values are guaranteed to be unique.
4 Privacy Natural keys may be prone to privacy or security concerns. Surrogate keys are cryptic values with no intrinsic meaning.
5 Cascading Updates If the business or source changes the format or datatype of the key, it will need to be updated everywhere it is used. Surrogate values never change.
6 Load Efficiencies With the exception of very short text values, VARCHAR values are less efficient to load than INTEGER. Surrogate values can be loaded more efficiently.
7 Join Efficiencies With the exception of very short text values, joins using VARCHAR values are less efficient than INTEGER. Surrogate values can be joined more efficiently.
8 Common key strategy A lack of a common key strategy will lead to longer decision-making process and more patterns for code development. A common key strategy based on the use of surrogates will streamline design and development.
9 Relationships Natural keys with different data types will lead to the creation of more objects in the database to accommodate the associations between entities. The use of surrogates promotes reuse of associative entities for multiple types of relationships.
10 History Related to points above, the use of natural keys inhibits the flexibility and reuse of objects and patterns to support change capture. Surrogate keys enable a strategic design to store history using associatives.

The list above is drawn in part from an article by Lee Richardson, discussing the pros and cons of surrogate keys.

In summary, the reasons to employ surrogate keys as a design policy are related to two factors

1. Performance

In the majority of cases the surrogate key will perform more efficiently for both loading and joining tables.

2. Productivity

Even in the minority of cases where a natural key will perform as well as a surrogate, by adopting a consistent surrogate key policy there are benefits to efficiencies gained in design, development and maintenance.

EDW Reference Architecture: Information Warehousing Layer

EDW Reference Architecture: Information Warehousing

Information Warehousing

This is the fourth article in a series introduced in EDW Reference Architecture: Why Bother?. Other articles have looked at the Data Acquisition and Data Integration layers.

The Information Warehousing layer is designed as a normalized repository for all the integrated and mastered data. The primary purpose of this area is to organize the data at an atomic level in a state that is cleansed, mastered and well-defined. The discipline of normalization imposes a strict order on the data that promotes data integrity and retains a high degree of flexibility. The relatively complex structures are not conducive to ease of navigation, and are tuned for load performance rather than data access.

System of Record

Purpose: Storage of Single Trusted Source

As the name suggests, the system of record is designed to provide a single trusted source for all downstream consumption. Its focus is on flexibility, storage and load performance.

Content: Integrated / Atomic

This area contains the lowest level of detail available, with minimal redundancy, including the repetition of information within derived or computed values. The intention is to retain here all component fields that would be used to derive other values. Ideally, each piece of information exists only once within the system of record, although some controlled level of redundancy for convenience may be included (e.g., primary identifier for an Involved Party included on both the fundamental Involved Party table as well as the Involved Party Identifier attributive table).

Structure: Normalized

The principles of normalization isolate each distinct group of information to minimum repetition and ensure that it belongs logically together. The modelling directive is to model according to the “essence” of the entity rather than its “use”. Practically, this ends up meaning that tables contain columns closely related to a single fundamental concept. The business function of a given piece of data is revealed through its relationship to other entities, including role-based entities. Through the use of relationship and attributive table, new data elements can be added to the System of Record without requiring additional structures.

Cleansing: Mastered and Cleansed

The system of record is the target of the data that has passed through the Data Integration Layer. No data should be loaded into the System of Record that has not passed through the process of being mastered and cleansed. It is critical to user confidence that this area remain closely governed and that its integrity not be compromised by co-locating external data without applying the discipline of the acquisition and integration layers.

Retention: Medium-term / Selective / Regulation-determined

The normalized structures permit the storage of all data elements. Most data elements can have history applied at the attribute level, recording the change in a single attribute without needing to repeat the other values contained on the same row. This means that history can be applied more selectively, with some fields being overwritten without storing the original value. For long-term storage, beyond seven years, data will be moved to a separate archival storage platform. The retention period may be driven by regulatory requirements, so the data remains accessible for auditing; although regulatory needs may be met through archival storage.

Consumption: Limited

As has been stated, the primary function of the system of record is to remain optimized for storage and open to receiving new data sources. However, access to the system of record should not be barred to all users. The normalized structures offer a flexible path through the data, with multiple perspectives retained. The selective history approach provides greater visibility to changes in data values than denormalized tables. These structural advantages should be exploited.

It is important that this access not be allowed to compromise the integrity of the system. Design decisions should not be influenced by considerations of query performance. Should it become clear that usage of the normalized structures for data analysis is of sufficient value to users, consideration should be given to creating a Deep Analytics Sector.

Interactive Sector

Purpose: User-maintained Information

The Interactive Sector provides an area in which users can create and maintain authoritative data, directly feeding the System of Record. The content of this sector can be viewed as an extension of the System of Record as well as a source to enhance the Information Provisioning Layer, either through joins or through ETL migration to tables within that layer.

There are many circumstances in which users need to be able to set up and control sets of values, such as banding (e.g., age demographic ranges) or domains of values for lookup tables that may be relatively dynamic. In many organization, these may have been managed through spreadsheets or local databases; a mature solution will see these implemented through a centralized and technologically advanced approach.

Content: Integrated / Atomic and Derived

Although the data here is user-generated, it resides within the Information Warehousing layer, and is designed and quality-controlled in such a way to be integrated. This means that the applications created to insert and update data in its tables will enforce referential integrity from data within the System of Record, and make use of its domain schemes and values. The content is to be primarily atomic, although some data may be derived. The intent is to enable and govern user-maintained data, rather than constrain it.

Structure: Normalized / Denormalized

The data structures of the Interactive Sector will mirror the System of Record wherever possible. In cases where user data will be linked directly to reporting tables, a denormalized design may be adopted. The goal is to be agile and accommodating without compromising the reliability of the information being provisioned. The guiding principle should be that the users are creating the “golden copy” of the data they are generating, and therefore it should conform as much as possible to the design of the System of Record.

Cleansing: Cleansed

It is essential that the application underlying the user interface enforce integrity controls on all data being entered and maintained. The Interactive Sector is being defined as sitting within the Information Warehousing layer, a zone in which all data must be scrubbed clean, rationalized and consolidated into a single trusted source. If the user interface lacks such controls, it should be treated as a standard source and be passed through the Staging area and subject to the processes of analysis, profiling and ETL transformations.

Retention: Medium-term

As per the System of Record, this user-maintained sector tracks the history of changes as required, and stores it for one to seven years, as appropriate. In DW2.0, Mr. Inmon suggests that the interactive sector store data for the short-term only, and be retained for a longer time in the Integrated Sector (System of Record). This might make sense for your organization, but the industry is moving towards migrating data as little as possible, and it may be more efficient to hold onto this data and access it where it is generated.  

Consumption: Open

The Interactive Sector data is intended to be combined with the Information Provisioning layer for business intelligence and extracts for downstream use. The data may be subject to security restrictions, but is otherwise open to all consumers from the Information Delivery layer.

I welcome your comments.

The next article will look at the Information Provisioning layer.

EDW Reference Architecture: Why Bother?

EDW Reference Architecture

In the following series of articles I would like to define a possible reference architecture for an enterprise data warehouse. This represents an amalgam of real and imagined client sites, addressing all-too-real issues I’ve encountered. It is presented as a straw man; for discussion, for dressing up to fit the needs of your organization, or for burning on the bonfire of rejected ideas. I hope that if the last course be chosen, that it at least provides fuel for better ideas – and that you’ll share them with me.

In preparing a reference architecture the immediate question arises: why bother? As an artefact it can appear arcane, ivory-tower, and unnecessary. Certainly, in many cases, it is presented in a way that has little meaning to the consumers of business intelligence, or even the practitioners involved in the EDW design and development. So how should it be presented and who is it for?

The architecture provides a reference point for all the stakeholders of the EDW. If it is to have meaning and value it must directly address specific issues, resolving them and fitting in to a larger corporate landscape.

For sponsors of the EDW, it must represent the vision of the system; which will presumably be along the lines of “The EDW will provide a single source of the truth throughout the organization, in a timely, reliable and accessible fashion”. The totality of the vision should be encompassed by, and evident in, the description of the reference architecture. What is the purpose of each EDW component? How does this meet what is required of the system?

For managers of the EDW, it draws a picture of the scope of work that will be required. It will not define the processes, but the definition of those processes will be dependent on understanding the end-to-end architecture. What demands do each of the EDW areas make on resources, in terms of data migration, design and retention?

For those involved in its design, it sets out the boundaries of the components within the EDW, defining their structure and relationship to one another. What types of database objects inhabit each component? What processing is necessary as those objects are populated?

For business, the reference architecture should present a clear and concise picture of how raw data will be transformed into information. It shows where information will be delivered for consumption, and how the processes along the way will contribute to its quality, reliability and ease of access. The message must be your information is in good hands – you’ll have what you need when you need it; and it will be correct.

You can judge for yourself whether the way I’ve laid out the reference architecture meets the conditions set out for the various audiences. The next 4 articles will define each of the layers of the EDW. There are 2 components, or sectors, within each layer, and the following aspects of the components will be described:

  • Purpose – The general function of each component within the context of the EDW.
  • Content – The nature of the data within the component.
  • Structure – The type of data objects that will hold the sector’s data; normalized, denormalized, inherited from the source or requiring data design.
  • Transformations – The level of processing the data will undergo as it populates the sector.
  • Retention – How long the data will likely reside in the sector.
  • Consumers – Who will have access to the data and for what purpose.

As I suggested, this is offered in the spirit of discussion, and I welcome your comments and feedback from your own experiences.

The next article in the series is on the Data Acquisition layer.

Picking the Team, Part III: ETL Designer

In this series of articles, we’re looking at the top qualities required for candidates to fill three key positions in the data warehousing team. In part one we looked at the analyst, in part two, the role of data designer. Here we’ll explore the five top skills to look for in the ETL designer / developer. In larger operations, these will be two distinct roles, with perhaps multiple resources. There may be a designer, with a lead developer and a team of ETL developers. For the purposes of this article, we’ll roll ETL into one position: with ETL design and development taken together.

ETL Design/Development

The process of Extract Transform Load (ETL) involves the migration of data from one system to another. In the case of the data warehouse, this occurs in multiple stages: source to staging, staging to integration layer, and integration layer to data mart.

We’ll take the base technical skills as a given. The ETL resource must be proficient in the toolset being applied (DataStage, Informatica, Ab Initio etc.) While the functionality is essentially the same across the board, the features of each are quite distinct. Beyond this fundamental compentency, look for the candidate’s expertise in the following areas:

  1. Experience with Delta Loads to Staging
  2.  Experience with Normalized Structures (Integration Layer)
    • Change Capture / History
    • Reference Tables (consolidated vs separate)
    • Surrogate Keys
  3. Experience with Denormalized Structures (Provisioning Layer)
    • Slowly Changing Dimensions
    • Bridging and Relationship Tables
  4. Approach to Performance Tuning
  5. Approach to Rapid Development

It’s important to find someone with knowledge of the end-to-end process. They may have been involved in one aspect of the process in the past, or been responsible for maintenance of existing jobs. This will not be sufficient. The ETL needs to be designed with an understanding of the specific challenges involved at each stage, with a corresponding toolkit of solutions at their disposal to address them.

The staging area is loaded from the source systems, and requires strategies to extract only the latest data. The ETL candidate needs to be familiar with different strategies to manage this, such as: source system logs, datestamps on source tables, set comparison operations within staging.

It’s quite common for an ETL resource to have experience with loading data marts, but have no knowledge of normalization. This can colour their thinking regarding the population of the Integration Layer, and present challenges when using it as a source for the Provisioning Layer. You should specifically ensure that the candidate is aware of the different ways change capture is applied in the normalized environment, in particular the use of associative tables; that they have experience of consolidated reference tables; and that they understand that surrogate keys in the Integration Layer are static.

Dimensions and fact tables are standard structures in the industry and every ETL designer should be well-versed in their use. It would be a good idea to check that they know the three types of slowly changing dimension:

  1. Update the existing value with the new value
  2. Create a new row for the new value
  3. Design multiple columns for new and previous value

Another data mart structure that is not always applied is the bridging table, for navigating unbalanced hierchies, which require a separate ETL pattern to process. The candidate should have experience with both bridging and relationship tables, which can handle many-to-many relationships.

There is usually a window of time within which the nightly, weekly or monthly ETL processes must be completed. This puts a premium on the ability of the ETL designer to optimize the efficiency of the code. Make sure the candidate has practical experience with this and can list a number of strategies that they have employed. Don’t expect them to answer questions like: “How long should it take to load a million rows”. There are too many variables for them to give a meaningful response. What is important is that they can articulate a number of avenues of approach.

The final point relates to the candidate’s overall attitude to development. Most IT shops want agility, while the data warehouse life cycle contains inherent dependencies. The ETL designer should always be looking for repeatable patterns, building on previous experience as well as the blocks of code developed through the project. Find out what strategies they employ to create processes quickly and accurately. ETL can be the most time-intensive part of the project. An efficient ETL designer can be key to the warehouse ROI.