Data Design Principles

Obey the principles without being bound by them.

– Bruce Lee

Taking a practical approach to developing a well-formed enterprise data warehouse – and by that, I mean one that is accurate, efficient and productive – involves basing it on sound design principles. These principles are specific to each sector of the reference architecture; each of which enables specific capabilities and serves specific functions. Here, I would like to lay out the principles of the Information Warehousing layer’s normalized central repository – the system of record.

The Information Warehousing layer is designed as a normalized repository for the data that has been processed “upstream”. It arrives cleansed, transformed and mastered; and is consolidated here into a single “System of Record”.  The discipline of normalization restructures the data, removing it from the confines of the single perspective of the source system, into the multiple perspectives across the enterprise. The data is modelled according to its “essence” rather than its “use”.

This process of redesign imposes a strict order on the data that promotes data integrity and retains a high degree of flexibility. The way the data is broken apart into separate tables makes it challenging to query, but this is not its main purpose. Data integrity and flexibility are the primary goals, and tuning is geared towards load performance rather than data access.

  1. Data Integrity
    Take a proactive stance to protect referential integrity and reduce instances of redundancy or potential for inconsistency.
  2. Scalability
    Allow for increases in volumes or additional sources of existing information, both within subject areas (e.g., issuers, counterparties) and core concepts (e.g., parties, locations).
  3. Flexibility
    Allow for additional sources or changes in existing sources, so that design is not tied to a given source or mirrors the source. Design will give primary consideration to reuse, then extension and finally modification of existing structures.
  4. Consistency
    Apply standard patterns for data design to promote efficiencies of data and ETL design. The decision-making process will be expedited as will data modelling work and ETL development.
  5. Efficiency
    Focus efficiency on three aspects:

    1. Implementation
      Use of repeatable patterns for data design will minimize data modelling and ETL work effort.
    2. Operation
      Ease ongoing maintenance by keeping the number of data objects to a minimum; maintain consistent standards; and apply logical structures for ease of navigation and use.
    3. Load Performance
      Priority given to performance of ETL load processes; including those that use the System of Record as a source to load the Data Mart sector.
  6. Enterprise Perspective
    For all data objects, retain and remain open to, a full range of existing and potential relationships between entities to ensure that data reflects an enterprise perspective that is not limited to the perspective of any given project’s requirements.

These foundational principles are implemented through strategies that impact storage of history, hierarchical structures, degree of normalization, classifications, surrogate keys and a number of other aspects of design. The principles form the criteria to judge the best approach to take in a given situation. It’s not always straightforward, even with the principles in place – at times one has to favour one principle over another (e.g., flexibility over load performance), but this list provides the guidance to frame the debate and take a considered approach.

As suggested by the opening quote, I’m not advocating a blind conformance to a set of rules; but, in my experience, one of the greatest obstacles to efficient and effective development is the decision-making process. Limiting the parameters of debate with intelligent guidelines can facilitate decisions being made quickly and correctly.

Feel free to suggest additions or amendments to this list of design principles for the System of Record.


In Defense of Surrogate Keys

Employing surrogate keys is an essential strategy to protect the integrity of data assets in the EDW. For those looking to defend such a position against those who would prefer human-readable business keys (a.k.a. natural keys), this article lays out some arguments. I hope by the end the non-believers will be converted, and those who already know the value of surrogates can move ahead without opposition.

Michelle A. Poolet has provided four criteria to determine whether a surrogate key is required:

1. Is the primary key unique?

Where a natural key is not unique, such as a person’s full name, a surrogate key must be used. However, a unique composite alternate key will be needed to match on the record for updates, to attach attribution or establish relationships.

2. Does it apply to all rows?

Take an example such as Customer. Customers may all come from the same source and be loaded into a Party table. However, parties that play different roles and that are loaded from different source systems, may have natural keys of different formats (e.g., company business number, vendor identifier)

3. Is it minimal?

Composite keys and long descriptive fields are obvious candidates for transformation to surrogate keys.(e.g., Postal Address)

4. Is it stable over time?

Identifiers generated by a Master Data Management (MDM) solution are designed not to alter over time. However, if the MDM solution was to change technology, or some sources were to be passed into the EDW through a different channel, the MDM key would be problematic. It is preferable for surrogate keys to be generated internally to the Information Warehousing layer.

Rationale for preference of Surrogate Keys over Natural Keys: 

# Issue Natural Key Surrogate Key
1 Primary and Foreign Key Size Natural keys and their indexes can be large, depending on the format and components involved in making a record unique. Surrogate keys are usually a single column of type integer, making them as small as possible.
2 Optionality and Applicability The population of natural keys may not be enforced in the source data. Surrogate keys can be attached to any record.
3 Uniqueness Natural keys always have the possibility of contention. Even where a single source has protected against it, contention may exist across sources. Surrogate values are guaranteed to be unique.
4 Privacy Natural keys may be prone to privacy or security concerns. Surrogate keys are cryptic values with no intrinsic meaning.
5 Cascading Updates If the business or source changes the format or datatype of the key, it will need to be updated everywhere it is used. Surrogate values never change.
6 Load Efficiencies With the exception of very short text values, VARCHAR values are less efficient to load than INTEGER. Surrogate values can be loaded more efficiently.
7 Join Efficiencies With the exception of very short text values, joins using VARCHAR values are less efficient than INTEGER. Surrogate values can be joined more efficiently.
8 Common key strategy A lack of a common key strategy will lead to longer decision-making process and more patterns for code development. A common key strategy based on the use of surrogates will streamline design and development.
9 Relationships Natural keys with different data types will lead to the creation of more objects in the database to accommodate the associations between entities. The use of surrogates promotes reuse of associative entities for multiple types of relationships.
10 History Related to points above, the use of natural keys inhibits the flexibility and reuse of objects and patterns to support change capture. Surrogate keys enable a strategic design to store history using associatives.

The list above is drawn in part from an article by Lee Richardson, discussing the pros and cons of surrogate keys.

In summary, the reasons to employ surrogate keys as a design policy are related to two factors

1. Performance

In the majority of cases the surrogate key will perform more efficiently for both loading and joining tables.

2. Productivity

Even in the minority of cases where a natural key will perform as well as a surrogate, by adopting a consistent surrogate key policy there are benefits to efficiencies gained in design, development and maintenance.

EDW Reference Architecture: Information Provisioning Layer

EDW Reference Architecture: Information Provisioning

Information Provisioning Layer

This is the fifth in a series introduced in EDW Reference Architecture: Why Bother?. Others in the series have looked at the Data Acquisition, Data Integration, and Information Warehousing layers.

The Information Provisioning Layer is designed for ease of navigation and optimized for retrieval performance for business intelligence and other consumers.

Data Mart

Purpose: Meeting Specific Needs

This sector is intended as a platform from which to meet specific information requirements. Where possible, the objects will be designed for re-use. However, the guiding principle is the effective servicing of business needs, and to this end, the solution’s primary focus is on the given project’s requirement.

Content: Golden Copies Plus

The primary source of content in this sector is the Information Warehousing layer; trusted and atomic data that can be used to populate the tables in the Data Marts as well as generate aggregated and computed measures, and derived dimensional attributes. In addition, it may be expedient to combine this information with data from other sources, that have not passed through the process of integration into the EDW. This should be permitted on an exception basis, with governance controls and a clear demarcation of the data’s lineage.

Structure: Denormalized

A common approach to data architecture in Data Marts is the “Star Schema”. This lends itself to ease of use and efficient performance. It also supports a high degree of re-use of objects, with “conformed” dimensions, to borrow a term from Mr. Ralph Kimball, being shared by a variety of users. Applying views on top of the tables provides a mechanism for security and a way to filter on rows and columns for specific use-cases. This sector is not limited to the “Star Schema” design, and should employ structures that are “fit for purpose” wherever greater efficiencies might be attained. This should include the option to apply views over table from the Information Warehousing layer, in cases where performance is not a concern, or the information resides within the Interactive Sector.

Cleansing: Cleansed

The Data Mart draws information primarily from a single trusted source; the ETL processes loading the Data Mart are not cleansing the data, although it will perform transformations to create derived values and aggregations. External sources being combined with Data Mart tables should not require cleansing. If they do, they should pass through a process of analysis and profiling to identify and remediate any anomalies. If the data is atomic, consideration should be given to making it a standard source input, subject to the full process of integration through the Information Warehousing layer.

Retention: Medium-term / Business-determined

The length of the retention period will depend on the business need rather than any regulatory requirement. If only current views of dimensions are required, the dimensions may be type one (i.e., changes are applied as updates to existing rows).

Consumption: Open

The information held in the Data Marts is intended for consumption, subject to privacy concerns and security constraints. Information will be accessed through business intelligence applications and other downstream systems. Appropriate levels of governance will be enforced for all information retrieval.

I welcome your comments.

As a final note on EDW Reference Architecture, the next article will discuss the Metadata Repository.

EDW Reference Architecture: Information Warehousing Layer

EDW Reference Architecture: Information Warehousing

Information Warehousing

This is the fourth article in a series introduced in EDW Reference Architecture: Why Bother?. Other articles have looked at the Data Acquisition and Data Integration layers.

The Information Warehousing layer is designed as a normalized repository for all the integrated and mastered data. The primary purpose of this area is to organize the data at an atomic level in a state that is cleansed, mastered and well-defined. The discipline of normalization imposes a strict order on the data that promotes data integrity and retains a high degree of flexibility. The relatively complex structures are not conducive to ease of navigation, and are tuned for load performance rather than data access.

System of Record

Purpose: Storage of Single Trusted Source

As the name suggests, the system of record is designed to provide a single trusted source for all downstream consumption. Its focus is on flexibility, storage and load performance.

Content: Integrated / Atomic

This area contains the lowest level of detail available, with minimal redundancy, including the repetition of information within derived or computed values. The intention is to retain here all component fields that would be used to derive other values. Ideally, each piece of information exists only once within the system of record, although some controlled level of redundancy for convenience may be included (e.g., primary identifier for an Involved Party included on both the fundamental Involved Party table as well as the Involved Party Identifier attributive table).

Structure: Normalized

The principles of normalization isolate each distinct group of information to minimum repetition and ensure that it belongs logically together. The modelling directive is to model according to the “essence” of the entity rather than its “use”. Practically, this ends up meaning that tables contain columns closely related to a single fundamental concept. The business function of a given piece of data is revealed through its relationship to other entities, including role-based entities. Through the use of relationship and attributive table, new data elements can be added to the System of Record without requiring additional structures.

Cleansing: Mastered and Cleansed

The system of record is the target of the data that has passed through the Data Integration Layer. No data should be loaded into the System of Record that has not passed through the process of being mastered and cleansed. It is critical to user confidence that this area remain closely governed and that its integrity not be compromised by co-locating external data without applying the discipline of the acquisition and integration layers.

Retention: Medium-term / Selective / Regulation-determined

The normalized structures permit the storage of all data elements. Most data elements can have history applied at the attribute level, recording the change in a single attribute without needing to repeat the other values contained on the same row. This means that history can be applied more selectively, with some fields being overwritten without storing the original value. For long-term storage, beyond seven years, data will be moved to a separate archival storage platform. The retention period may be driven by regulatory requirements, so the data remains accessible for auditing; although regulatory needs may be met through archival storage.

Consumption: Limited

As has been stated, the primary function of the system of record is to remain optimized for storage and open to receiving new data sources. However, access to the system of record should not be barred to all users. The normalized structures offer a flexible path through the data, with multiple perspectives retained. The selective history approach provides greater visibility to changes in data values than denormalized tables. These structural advantages should be exploited.

It is important that this access not be allowed to compromise the integrity of the system. Design decisions should not be influenced by considerations of query performance. Should it become clear that usage of the normalized structures for data analysis is of sufficient value to users, consideration should be given to creating a Deep Analytics Sector.

Interactive Sector

Purpose: User-maintained Information

The Interactive Sector provides an area in which users can create and maintain authoritative data, directly feeding the System of Record. The content of this sector can be viewed as an extension of the System of Record as well as a source to enhance the Information Provisioning Layer, either through joins or through ETL migration to tables within that layer.

There are many circumstances in which users need to be able to set up and control sets of values, such as banding (e.g., age demographic ranges) or domains of values for lookup tables that may be relatively dynamic. In many organization, these may have been managed through spreadsheets or local databases; a mature solution will see these implemented through a centralized and technologically advanced approach.

Content: Integrated / Atomic and Derived

Although the data here is user-generated, it resides within the Information Warehousing layer, and is designed and quality-controlled in such a way to be integrated. This means that the applications created to insert and update data in its tables will enforce referential integrity from data within the System of Record, and make use of its domain schemes and values. The content is to be primarily atomic, although some data may be derived. The intent is to enable and govern user-maintained data, rather than constrain it.

Structure: Normalized / Denormalized

The data structures of the Interactive Sector will mirror the System of Record wherever possible. In cases where user data will be linked directly to reporting tables, a denormalized design may be adopted. The goal is to be agile and accommodating without compromising the reliability of the information being provisioned. The guiding principle should be that the users are creating the “golden copy” of the data they are generating, and therefore it should conform as much as possible to the design of the System of Record.

Cleansing: Cleansed

It is essential that the application underlying the user interface enforce integrity controls on all data being entered and maintained. The Interactive Sector is being defined as sitting within the Information Warehousing layer, a zone in which all data must be scrubbed clean, rationalized and consolidated into a single trusted source. If the user interface lacks such controls, it should be treated as a standard source and be passed through the Staging area and subject to the processes of analysis, profiling and ETL transformations.

Retention: Medium-term

As per the System of Record, this user-maintained sector tracks the history of changes as required, and stores it for one to seven years, as appropriate. In DW2.0, Mr. Inmon suggests that the interactive sector store data for the short-term only, and be retained for a longer time in the Integrated Sector (System of Record). This might make sense for your organization, but the industry is moving towards migrating data as little as possible, and it may be more efficient to hold onto this data and access it where it is generated.  

Consumption: Open

The Interactive Sector data is intended to be combined with the Information Provisioning layer for business intelligence and extracts for downstream use. The data may be subject to security restrictions, but is otherwise open to all consumers from the Information Delivery layer.

I welcome your comments.

The next article will look at the Information Provisioning layer.

EDW Reference Architecture: Why Bother?

EDW Reference Architecture

In the following series of articles I would like to define a possible reference architecture for an enterprise data warehouse. This represents an amalgam of real and imagined client sites, addressing all-too-real issues I’ve encountered. It is presented as a straw man; for discussion, for dressing up to fit the needs of your organization, or for burning on the bonfire of rejected ideas. I hope that if the last course be chosen, that it at least provides fuel for better ideas – and that you’ll share them with me.

In preparing a reference architecture the immediate question arises: why bother? As an artefact it can appear arcane, ivory-tower, and unnecessary. Certainly, in many cases, it is presented in a way that has little meaning to the consumers of business intelligence, or even the practitioners involved in the EDW design and development. So how should it be presented and who is it for?

The architecture provides a reference point for all the stakeholders of the EDW. If it is to have meaning and value it must directly address specific issues, resolving them and fitting in to a larger corporate landscape.

For sponsors of the EDW, it must represent the vision of the system; which will presumably be along the lines of “The EDW will provide a single source of the truth throughout the organization, in a timely, reliable and accessible fashion”. The totality of the vision should be encompassed by, and evident in, the description of the reference architecture. What is the purpose of each EDW component? How does this meet what is required of the system?

For managers of the EDW, it draws a picture of the scope of work that will be required. It will not define the processes, but the definition of those processes will be dependent on understanding the end-to-end architecture. What demands do each of the EDW areas make on resources, in terms of data migration, design and retention?

For those involved in its design, it sets out the boundaries of the components within the EDW, defining their structure and relationship to one another. What types of database objects inhabit each component? What processing is necessary as those objects are populated?

For business, the reference architecture should present a clear and concise picture of how raw data will be transformed into information. It shows where information will be delivered for consumption, and how the processes along the way will contribute to its quality, reliability and ease of access. The message must be your information is in good hands – you’ll have what you need when you need it; and it will be correct.

You can judge for yourself whether the way I’ve laid out the reference architecture meets the conditions set out for the various audiences. The next 4 articles will define each of the layers of the EDW. There are 2 components, or sectors, within each layer, and the following aspects of the components will be described:

  • Purpose – The general function of each component within the context of the EDW.
  • Content – The nature of the data within the component.
  • Structure – The type of data objects that will hold the sector’s data; normalized, denormalized, inherited from the source or requiring data design.
  • Transformations – The level of processing the data will undergo as it populates the sector.
  • Retention – How long the data will likely reside in the sector.
  • Consumers – Who will have access to the data and for what purpose.

As I suggested, this is offered in the spirit of discussion, and I welcome your comments and feedback from your own experiences.

The next article in the series is on the Data Acquisition layer.

Effective Communication: Team Meetings

There is a game played with groups of 10 to 20 people, where the object is to collectively count to 30. There are some rules to the game: if two people speak at once, the count returns to zero; no-one can say more than one number consecutively; and the count must be random, not simply moving from person to person around the circle in order, or controlled by anyone in any way.

This is extremely difficult to accomplish, and for the first 20 minutes may appear impossible. For a while, the failures are amusing, as people inevitably try to count at the same time. The amusement turns to irritation, particularly with those that are more outspoken, and repeatedly return the count to zero. And then something happens. The group becomes quiet and focused. The count is passed from one voice to another. No signals are made, no special order is followed. In the end, everyone has been part of the count; and the goal is achieved.

There is no doubt that any successful project will necessitate clear verbal communication among the group. As I have written elsewhere, documentation is the cornerstone of building a sound data warehouse; I would like to add to that it must be accompanied by a steady and cordial flow of human contact.

I’m not suggesting that everybody needs to be in the same room. I have been involved in highly successful projects where the participants were scattered around the globe. In one case, with part of the team in India, the client in the mid-western United States, and me in Toronto, Canada, we met on the phone twice each day, at the start and end. We leveraged the full 24 hours and were able to address issues as soon as they arose. The conference calls supplemented and clarified the documentation that was carefully version-controlled and our time was skilfully project-managed.

By contrast, I’ve seen projects where the team was all present on the client site, but the communication was less than effective. In one instance, little emphasis was placed on ensuring that everyone was on the same page. Instead, information about progress was held closely by the leadership. Issues and related decisions were never recorded or tracked. Project management methodology was in place, but the dialogue between the team leads, and the leads and their teams, was so fundamentally dysfunctional, the project fell behind its targets and never recovered.

Most remarkable is a development team I have witnessed that epitomizes effective communication. The team meets on an almost daily basis. Although there are more than a dozen participants at each meeting, representing the whole life cycle of development, the discussion is always focused and involves the active engagement of most of the participants. Each meeting is facilitated rather than led, walking through a structured agenda. Issues are never framed as blame to be apportioned; but rather problems to be solved. The atmosphere of the meetings I attended was always respectful and cordial. Jokes were welcome; though each item on the agenda was given appropriate time and weight. I had the sense that this was a group that would have easily counted to 30.

In conclusion, here are 5 factors towards effective communication:

5 Success Factors

It may sound trite, but one of the critical success factors for functional communication is respect. When people are treated like fools, they feel foolish and are less likely to think creatively. I’ve seen smart people shut down under harsh treatment. It stands to reason that the skills of the group are maximized when everyone feels comfortable to contribute.

No-one wants to be in constant meetings, with nothing being accomplished; but it’s equally important to establish an easy rapport among the group. There’s nothing wrong with one-to-one direct communication either. However, this shouldn’t replace the formalized meetings which allow issues to be aired across the teams (analysts, designers, ETL etc.).

The active involvement of the group is essential. If each meeting is being dominated by only a few voices, important perspectives may be overlooked. In my experience, people become disengaged when they feel that their contributions are never acknowledged or their suggestions considered. If the right people have been brought on board, you’ll want to hear from them.

Impending deadlines can ratchet up the tension in the room; as can personalities that are abrasive or fractious. Everyone will function much better if the stress is kept to a minimum and is never allowed to be mixed with heightened emotions.

The group is meeting to get a collective sense of the project’s status and air current issues. The discussion should be directed to move swiftly. If a given problem is taking too much time, take it up after the meeting. It’s also important that only a single discussion is taking place. Side conversations can develop easily, but they should be brought back to the group immediately.

Even as I write these out, they seem self-evident. But keeping them in balance is an act of will on the part of the whole group. A facilitator is key to finding that balance, while it is up to the group to maintain it. In the case of the group with such impressive status meetings, I understand that it took some time to achieve their level of success. It was dependent on the format, the personalities and the rapport of the individuals. I hope they continue to metaphorically count to 30 each day. It’s a pleasant and effective way to work.