Making the Case for EDW

An EDW enables strategic capabilities for organizations; resolving issues that plague complex technology environments, while creating a solid foundation on which to build new opportunities.

Here are 7 key benefits of a well-implemented EDW:

1.     Provide an environment with suitable governance/controls to ensure integrity

Consolidating the data from multiple systems requires active governance by the business in cooperation with technology; and ultimately leads to more valuable information driving business decisions. The EDW provides the process to either make the changes to the data as it enters the system, or expose the issues so they can be resolved at their source.

2.     Consolidate information from disparate source systems

Reporting within the context of a given source system can lead to stovepipes of information, without the perspective of the rest of the enterprise. By bringing the information from across the systems into a centralized repository, multiple perspectives can be gained – and with them, new insights.

3.     Accumulate historical records over time – for trend analysis, forecasting and “as was” reporting

Frequently, source systems are limited to current views of products, customers and vendors, and may contain a limited set of transactions. This restricts the ability to report on trends over time, variances between current and historical periods and historical views, such as sales against previous versions of the organization’s hierarchy.

4.     Perform queries/reports on a dedicated environment, rather than on operational systems

Some organizations make do with drawing business intelligence from the operational systems themselves. The problem is that those systems may contain data integrity issues that go either unresolved or unknown. Also, the operations being performed need those resources; much better to have an environment dedicated to analytics, isolated from other business processes.

5.     Provide a dedicated environment tuned to performance of business intelligence

Each sector of the reference architecture provides a specific set of capabilities and is designed to optimize the efficiency of the overall system. The Information Provisioning sector is devoted to delivering information to end users and applications with clarity, ease of use, and speed.

6.     Provide a reporting environment that is easy to navigate and use

Simplicity can be an obstacle to user adoption of any business intelligence solution. That simplicity starts with the design of the data structures in the Information Provisioning sector.  The reference architecture is uncompromising in ensuring data integrity upstream; which allows for users to be presented with recognizable business objects and intuitive links between them.

7.     To have a secure environment, without risking operational system access

The ability to secure sensitive and private information for discreet sets of users remains a widely shared concern. In many cases, securing operational systems is not an option or is problematic. A centralized EDW provides the ability to enact a comprehensive security policy against a single system.

In Praise of Industry Models

If you want to make an apple pie from scratch, you must first create the universe.—Carl Sagan

Making an enterprise data warehouse from scratch may not necessitate recreating the universe, but it’s also not easy as pie. It’s a major undertaking, which must be handled incrementally if a workable result is to be achieved. Along the way are a multitude of design decisions to be made, each of which will have ramifications downstream, such as ETL processing, business intelligence development and enterprise services.

I have spent some years working with the set of Industry Models from IBM, and while I have avoided speaking directly about them on this blog, so as not to reveal privileged information, I would like to take this opportunity to point out some of the real benefits of incorporating them into an EDW initiative. The models are not without their frustrations and limitations. But having experienced the alternative – of having to develop models from scratch – I can attest to their value.

Here are some of the major benefits of having a template data model:

  1. Standards
    • Data Architecture – The industry models provide a cohesive infrastructure, with repeating patterns, following a strict, disciplined approach. The design is based on a classification model that presents business information according to “what it is” rather than “how it’s used”. This opens up information to be stored in many perspectives, rather than for a single line of business.
    • Lowers Risk of redundancy – Because of its structure and its evolution over numerous implementations, redundancy of objects and conflicts between “versions” of information is eliminated. Every piece of incoming information has a single placeholder within the data model. The EDW System of Record becomes the “Single Source of Truth”.
    • Definitions and Naming Standards – The models come with a full set of logical names and physical abbreviations, reducing the risk of the misidentification of terms and facilitating the adoption of a single set of enterprise names.
  2. Enterprise Perspective
    • Core Concepts – Template model incorporates all aspects of each industry through its structured hierarchies of core concepts.
    • Iterative – Facilitates incremental adoption while minimizing risk of being “painted into a corner” through limiting design decisions.
    • Multi-Industry Integration – There are some occasions where an enterprise may be involved in multiple lines of business, such as retail and banking, or retail and insurance. Because of the common design principles on which the models are based, they are well-positioned for such integration.
  3. Flexibility
    • History – Accommodates history at “attribute-level”, minimizing redundancy and providing an efficient structure for historical analysis.
    • Growth – Accommodates growth with minimal structural changes; particularly with regard to metrics and relationships between business entities.
    • Perspectives – Highly-normalized structures are well-positioned for creating different perspectives of dimensions and metrics for reporting.
  4. Process Acceleration
    • Data Design Patterns – While the models are not intended to be used “out-of-the-box” without customization, the template offers a complete picture of the enterprise and a significant head start on development.
    • ETL Patterns – Repeating patterns in Data Architecture translate to repeatable ETL patterns, with a modular approach to each set of recurring data structures (e.g., writing history of a classification value, handling single values in an attributive table, approach to normalized hierarchies).
    • Testing Patterns – Acceleration of effort extends to unit and system integration testing. With repeating data structures come the ability to craft SQL code that follows repeatable patterns, reducing the effort to prepare test scripts.

In the absence of a set of Industry Models, from IBM or another vendor, all of these points must be addressed afresh; or left unconsidered, with the risk that mistakes will be made. The structures will follow inconsistent patterns, the names will conform to no specific set of conventions, and the effort to create a flexible repository to service the whole business will end up a silo of information being used by a small corner of the organization. Until the next time the executive decides to try again to create an enterprise asset.

If you want to make an apple pie from scratch, follow the recipe.

If you have experiences with or without industry models, please feel free to share them…

Data Design Principles

Obey the principles without being bound by them.

- Bruce Lee

Taking a practical approach to developing a well-formed enterprise data warehouse – and by that, I mean one that is accurate, efficient and productive – involves basing it on sound design principles. These principles are specific to each sector of the reference architecture; each of which enables specific capabilities and serves specific functions. Here, I would like to lay out the principles of the Information Warehousing layer’s normalized central repository – the system of record.

The Information Warehousing layer is designed as a normalized repository for the data that has been processed “upstream”. It arrives cleansed, transformed and mastered; and is consolidated here into a single “System of Record”.  The discipline of normalization restructures the data, removing it from the confines of the single perspective of the source system, into the multiple perspectives across the enterprise. The data is modelled according to its “essence” rather than its “use”.

This process of redesign imposes a strict order on the data that promotes data integrity and retains a high degree of flexibility. The way the data is broken apart into separate tables makes it challenging to query, but this is not its main purpose. Data integrity and flexibility are the primary goals, and tuning is geared towards load performance rather than data access.

  1. Data Integrity
    Take a proactive stance to protect referential integrity and reduce instances of redundancy or potential for inconsistency.
  2. Scalability
    Allow for increases in volumes or additional sources of existing information, both within subject areas (e.g., issuers, counterparties) and core concepts (e.g., parties, locations).
  3. Flexibility
    Allow for additional sources or changes in existing sources, so that design is not tied to a given source or mirrors the source. Design will give primary consideration to reuse, then extension and finally modification of existing structures.
  4. Consistency
    Apply standard patterns for data design to promote efficiencies of data and ETL design. The decision-making process will be expedited as will data modelling work and ETL development.
  5. Efficiency
    Focus efficiency on three aspects:

    1. Implementation
      Use of repeatable patterns for data design will minimize data modelling and ETL work effort.
    2. Operation
      Ease ongoing maintenance by keeping the number of data objects to a minimum; maintain consistent standards; and apply logical structures for ease of navigation and use.
    3. Load Performance
      Priority given to performance of ETL load processes; including those that use the System of Record as a source to load the Data Mart sector.
  6. Enterprise Perspective
    For all data objects, retain and remain open to, a full range of existing and potential relationships between entities to ensure that data reflects an enterprise perspective that is not limited to the perspective of any given project’s requirements.

These foundational principles are implemented through strategies that impact storage of history, hierarchical structures, degree of normalization, classifications, surrogate keys and a number of other aspects of design. The principles form the criteria to judge the best approach to take in a given situation. It’s not always straightforward, even with the principles in place – at times one has to favour one principle over another (e.g., flexibility over load performance), but this list provides the guidance to frame the debate and take a considered approach.

As suggested by the opening quote, I’m not advocating a blind conformance to a set of rules; but, in my experience, one of the greatest obstacles to efficient and effective development is the decision-making process. Limiting the parameters of debate with intelligent guidelines can facilitate decisions being made quickly and correctly.

Feel free to suggest additions or amendments to this list of design principles for the System of Record.

A Question of Scope

Don’t bite off more than you can chew because nobody looks attractive spitting it back out.

- Carroll Bryant

I apologize for the image – but the sentiment is apt. My mother would have said that “my eyes were bigger than my stomach” and admonished me for letting good food go to waste. The problem is that when the size of the project is beyond the budget constraints of the business, or the capacity of the development team, work goes unfinished. And so it’s essential that the scope of the effort be understood before any commitment is made to complete it.

The scope of a data management initiative is influenced by 3 factors:

  1. Reference Architecture
  2. Source Data Requirements
  3. Target Dimensional Breakdown

Reference Architecture

The reference architecture determines the number of areas that need to be modeled and the number of times the data will be migrated from one sector to another. Each sector also has its own degree of complexity – some considerably less than others.

  • What is the reference architecture of the target solution?
  • How many sectors of the reference architecture are targeted for this initiative?
  • Are we building on an existing structure or is this the initial project?
  • Do you have existing architecture, naming and content standards?
EDW Reference Architecture

EDW Reference Architecture

Source Data Requirements

The source data requirements are the list of fields that are to be brought into the target system. It is important to understand that a cost attaches to each one of these fields; and that if one field is being drawn from a source table, it will not be the same level of effort to bring in additional fields from the same table. Therefore, the scoping needs to be based on the number of individual pieces of information, not just the number of source tables.

  • How many distinct fields are required? (e.g., Customer First Name, Date of Birth, Industry Classification Code)
  • How many different source systems are involved?
  • How many different source tables are involved?
  • How many fields are being drawn from multiple sources? (e.g., Customer First Name coming from Marketing database and Point of Sale system)

Target Dimensional Breakdown

The target dimensional breakdown determines the breadth of subject areas being modeled, and the potential complexity of the processes involved.

The measures are numeric “counts and amounts” that either come directly from source systems, and so are relatively straightforward, or need to be calculated or derived from component values.

Obviously, these are more complex, and will involve the storage of both components and resulting calculations.

The dimensions give the measures context (e.g, sales by product, balances by branch). The number of different dimensions that need to be modeled and potentially mastered can have a significant impact on development time, particularly when they involve hierarchies.

  • What are the measures?
  • How many require calculations/derivations vs. coming directly from the sources?
  • What dimensions are being requested (customer, employee, store, branch etc.)?
  • What hierarchies are being requested?

Once you’ve established the scope of the project, you can estimate the time it will take to develop it. These are important inputs into understanding the order of magnitude of the task at hand; and an essential set of questions to be answered when communicating with the internal development team or a potential vendor.

We all know that scope can creep once the project gets going, and that becomes a management issue; but it’s important to start with a baseline understanding of the size of the task at hand. It helps set appropriate expectations for everyone involved. And it’s good for the digestion.

Nine ETL Design Principles

The principles of ETL design define the guidelines by which data migration will be constructed. Below are 9 principles to guide architectural decisions.

1. Performance
In almost all cases, the prime concern of any ETL implementation is to migrate the data from source to target as quickly as possible. There is usually a load “window” specified as part of the non-functional requirements; a duration of time that is available to complete the process. The constraints are based on either the availability of the source system data, the need for the business to have access to the information, or a combination of both.

2. Simplicity
As with all programming, a premium is placed on simplicity of design. This is in the interests of productivity of development time, consideration of ongoing maintenance, and a likely improvement in performance. The fewer steps involved, the less chance of mistakes being made, or places for things to go wrong. When changes need to be made, or fixes applied, the fewer touch points, the better. During the life of the processes, ownership will likely change hands. The system’s simplicity will aid clarity for those who need to take it on.

3. Repeatability
One needs to be able to re-run jobs to achieve consistent and predictable results each time. This means it needs to be applicable to all relevant incoming sources, and in no way dependent on specific time parameters. If sources change, the process needs to handle those changes gracefully and consistently.

4. Modularity
Units of work should be designed with an eye to repeatable patterns, interdependencies and discreet operations that function in isolation. The goal is to have as few modules as possible to be applied as templates to all future development work. This principle assists with clarity and efficiency of design, as well as reusability.

5. Reusability
As a principle, the goal should be not only to repeat modular patterns, but where possible re-use existing jobs and apply parameterization. This optimizes the efficiency of development and reduces the cycles required for testing.

6. Extensibility
Rather than “bring everything” from a given source when a data migration process is first built, it should be possible to include only that which is identified as valuable to the business in the context of a given project or release cycle. Over time, additional data elements from the sources can be added to the ETL jobs, with potentially, new targets. The ETL job should take this iterative approach into account.

7. Revocability
This refers to the ability to reset the database after a run and to return to the state it was in prior to running the process. This will be important for testing cycles during the development process, but also in production, in the event the database becomes corrupted or requires rolling back to a previous day.

8. Subject-orientation
Workloads are to be divided into units based on business subject areas rather than source-system groupings or strictly target table structures. This recognizes that a given source table may contain information about more than one subject area (e.g., Customers and Accounts) In addition, a given subject area may be composed of multiple source tables, which may populate multiple targets. Simply limiting the ETL jobs to a single source and a single target may compromise the other principles, particularly performance and simplicity. Similarly, orienting the ETL jobs to either the source or target layouts may degrade the efficiency of the design.

9. Auditability
It is essential to be able to trace the path that data takes from source to target and be able to identify any transformations that are applied on values along the way.

With these guiding principles, specific strategies can be employed with a set of criteria to judge their applicability.

I would be interested to hear feedback on this list; and am open to any additions.

Reference Domains Part II: Modelling Classifications

This is the second article in the series on working with reference domains, also commonly referred to as classifications. In Part I, we looked at the nature of classifications. Here we will discuss designing structures to accommodate them in the EDW’s Information Warehousing layer, within the normalized System of Record.

Two potential approaches to designing the Reference Domains are to:

  1. Put each reference domain of values into its own table.
  2. Collect all reference domains into two tables; one for a master set of domains and the other for all the values, with a column linking each to the relevant domain.

The second approach is the recommended one, and the remainder of this article will present the design and the rationale. I have seen the consolidated reference data approach implemented many times within the System of Record of the EDW; and it has proved highly effective.

Given this preferred approach, every Reference Domain and Value is to be placed in the following entities:

  • Reference Domain
  • Reference Value

As mentioned in Reference Domains Part I, reference domains are commonly made up of pairs of code and description attributes. However, it is possible that some domains may be comprised of additional attribution, such as short names, long names, multi-language names etc. In such cases, consideration should be given to extending the model to accommodate additional fields. (This is not the same as related  reference values that may be stored together in the source, but should be separated in the EDW.)

The domain values may be organized hierarchically. To support many-to-many relationships between domain values, in terms of hierarchies or other relationship types, the associative entity, Reference Value to Reference Value Relationship, should be used. As with all associatives, it will accommodate multiple concurrent as well as historical relationships.

To facilitate changes in values within a given scheme, a Reference Domain to Reference Value Relationship entity can be deployed.

Maintaining History

The history of values for a given scheme in relationship to other core concepts is to be retained within relevant associative structures.

For example, in the diagram below, the current value of Party Status Id, with its effective date, called Party Status Change Date, are held on the Party entity. Historical values are held on the Party to Domain Relationship entity, with the Domain Scheme Id holding the identifier for the Party Status domain, and the Domain Value Id holding the identifier for the previous value.

To illustrate the example more fully, consider the following:

The Domain Scheme and Domain Value tables contain rows for the Party Status domain scheme.

Day 1

On day one, the party enters the system with a party status of “Active”. There are no entries in the Party to Domain Relationship table.

Day 2

On day two, the party status changes to “Closed”. The current value in Party is overwritten with the new value; the change date is updated; and the old value is inserted into the associative.

Design Benefits

The following points outline the benefits of employing a consolidated set of domain schemes and values within a small set of tables, versus creating a separate table for each domain scheme’s set of values.

  1. Data Integrity: The integrity of the allowable set of values within a given domain are maintained through the application logic defined and implemented through the ETL. Maintaining a master list reduces the chance of inconsistencies or the appearance of anomalies.
  2. Flexibility: There is only one table to access for all domain value codes and descriptions within the SOR.
  3. Implementation Efficiency: No additional database entities or tables to be created, tested or maintained within the System of Record (SOR) logical or physical model. Fewer objects means less chance for error.
  4. Operational Efficiency: A single utility can be created to interface with this table. It is true that even with multiple tables a single utility could be created with minimal logic. However, each new object would require some changes to the application, whereas the consolidated version can be extended seamlessly.
  5. Consistency: History is stored in the classification associative tables (e.g., Party to Domain Relationship). The retention of the domain schemes and values as surrogate keys in the Domain Value entity facilitates these structures. This allows history to be gathered using a common mechanism, explicitly identifying both the scheme and the value.

In part three, we will go on to look at collecting and documenting classifications.

In Defense of Surrogate Keys

Employing surrogate keys is an essential strategy to protect the integrity of data assets in the EDW. For those looking to defend such a position against those who would prefer human-readable business keys (a.k.a. natural keys), this article lays out some arguments. I hope by the end the non-believers will be converted, and those who already know the value of surrogates can move ahead without opposition.

Michelle A. Poolet has provided four criteria to determine whether a surrogate key is required:

1. Is the primary key unique?

Where a natural key is not unique, such as a person’s full name, a surrogate key must be used. However, a unique composite alternate key will be needed to match on the record for updates, to attach attribution or establish relationships.

2. Does it apply to all rows?

Take an example such as Customer. Customers may all come from the same source and be loaded into a Party table. However, parties that play different roles and that are loaded from different source systems, may have natural keys of different formats (e.g., company business number, vendor identifier)

3. Is it minimal?

Composite keys and long descriptive fields are obvious candidates for transformation to surrogate keys.(e.g., Postal Address)

4. Is it stable over time?

Identifiers generated by a Master Data Management (MDM) solution are designed not to alter over time. However, if the MDM solution was to change technology, or some sources were to be passed into the EDW through a different channel, the MDM key would be problematic. It is preferable for surrogate keys to be generated internally to the Information Warehousing layer.

Rationale for preference of Surrogate Keys over Natural Keys: 

# Issue Natural Key Surrogate Key
1 Primary and Foreign Key Size Natural keys and their indexes can be large, depending on the format and components involved in making a record unique. Surrogate keys are usually a single column of type integer, making them as small as possible.
2 Optionality and Applicability The population of natural keys may not be enforced in the source data. Surrogate keys can be attached to any record.
3 Uniqueness Natural keys always have the possibility of contention. Even where a single source has protected against it, contention may exist across sources. Surrogate values are guaranteed to be unique.
4 Privacy Natural keys may be prone to privacy or security concerns. Surrogate keys are cryptic values with no intrinsic meaning.
5 Cascading Updates If the business or source changes the format or datatype of the key, it will need to be updated everywhere it is used. Surrogate values never change.
6 Load Efficiencies With the exception of very short text values, VARCHAR values are less efficient to load than INTEGER. Surrogate values can be loaded more efficiently.
7 Join Efficiencies With the exception of very short text values, joins using VARCHAR values are less efficient than INTEGER. Surrogate values can be joined more efficiently.
8 Common key strategy A lack of a common key strategy will lead to longer decision-making process and more patterns for code development. A common key strategy based on the use of surrogates will streamline design and development.
9 Relationships Natural keys with different data types will lead to the creation of more objects in the database to accommodate the associations between entities. The use of surrogates promotes reuse of associative entities for multiple types of relationships.
10 History Related to points above, the use of natural keys inhibits the flexibility and reuse of objects and patterns to support change capture. Surrogate keys enable a strategic design to store history using associatives.

The list above is drawn in part from an article by Lee Richardson, discussing the pros and cons of surrogate keys.

In summary, the reasons to employ surrogate keys as a design policy are related to two factors

1. Performance

In the majority of cases the surrogate key will perform more efficiently for both loading and joining tables.

2. Productivity

Even in the minority of cases where a natural key will perform as well as a surrogate, by adopting a consistent surrogate key policy there are benefits to efficiencies gained in design, development and maintenance.

Follow

Get every new post delivered to your Inbox.