Stop It: EDW Worst Practices

There’s a great comedy sketch with Bob Newhart as a therapist who dispenses one key piece of advice to all his patients: Stop It! Whatever is holding them back or causing them distress, his counsel is to simply stop doing it. Afraid of heights? Stop it. Biting your nails? Stop it. You get the idea.

In a similar vein, I would like to present some practical advice to organizations struggling with the development of their data warehouse. These behaviours are going to seem obviously wrong-headed to many of you. But I assure you, they are very real to some – and you know who you are. For those of you currently experiencing this frustration, I hope you can take some comfort in knowing that you’re not alone.

  1. Work effort and timelines are being defined with no regard to the scope of the work to be delivered.
    • While it’s fair to estimate work effort and then allow for revision of scope as work proceeds, it is not reasonable to set completion dates based on a list of requirement headings. Any guesswork has to be educated. Some initial scoping exercise should be done at the outset of the project and development team leads should be consulted for their input on the complexity of the work. Promising delivery dates without this preliminary discovery is like providing a definitive answer to the proverbial question: How long is a piece of string? Stop it!
  2. Development team members aren’t limited to a single task at a given time.
    • Some people become so enamoured of parallel development that they apply the concept at the individual level. The idea is that the developer takes on more than one task concurrently, each task given equal priority, and development time will be reduced significantly. Perhaps there are some uniquely gifted people who can multi-task to this degree, but for most of us it leads to nothing being done with our full attention, and the end result is inferior output. Prioritize work to be done, respect dependencies between tasks, and give each item the requisite focus. Efficiencies cannot be gained through concurrent assignments. Stop it!
  3. No analysis or profiling of data sources is done, and if it is, it’s kept separate from target data design decisions.
    • When a project plan is a set of fundamentally unrelated tasks, it’s a matter of checking each one off, and the order in which they’re done is of no consequence. With this perspective on development, the profiling and analysis of data sources has no bearing on the design of the EDW and can comfortably be done in parallel with, or even after, modelling. If only this were so. The inconvenient truth is that data profiling is an essential input to requirements assessment, data modelling, ETL, and testing. It can even provide insight into business intelligence. Bypassing or isolating source analysis? Stop it!
  4. Source systems’ development work is happening in parallel with EDW development.
    • This might be related to a source system that is in the midst of an upgrade, or one being migrated to a new platform. Whatever the reason, the result is that the source keeps changing, which is certain to lead to significant rework. There will be some who argue it just has to be done, but it’s an expensive call to make. It would be far more cost-effective to wait until the source is in a (relatively) stable state, at least enough that change can be managed in periodic releases. Stop it!
  5. The EDW target data models are being designed from reporting requirements only, without reference to actual sources.
    • It seems like a reasonable thing to do, because it’s related to the end state, after all. We know that fields with the names given in the report requirements will ultimately be needed. The problem is that there’s no context for that information. What data elements are related to what dimensions? Do we understand all the primary keys that will be involved? Even if the report requirements suggest field lengths, the data types may not match the sources when they do become available. This is a bad idea. Stop it!
  6. Change requests are being delivered directly to the data modellers from any other team member.
    • This may look like the ultimate in democratized agility. Anyone can ask for changes in the database at any time: business analysts doing requirements, ETL developers trying to load the tables, or even the reporting team, ensuring that everything they need is available. Having a process to manage model changes through a central project manager can be time-consuming and frustrating. To some people it smacks too much of old-school “waterfall” methodologies. The reality is that this degree of fluidity isn’t democratization, it’s anarchy. Stop it!
  7. EDW data models are created without reference to source data types.
    • This is part of a “do whatever you can for now” philosophy that is impatient with waiting for actual source systems to be identified. Instead, modelling work is to be done directly from requirements documents. It seems smart at first blush, because at least the work of building the target can get started; when the source comes in we’ll be that much further ahead. Except it doesn’t work that way. Only when the source shows up are we able to see exactly what needs to be done: what’s really available, what’s populated, and how it’s structured. Don’t waste your time. Stop it!
  8. No model reviews are conducted.
    • It’s easy to see how a model review can get in the way of delivering swiftly, especially when the scope of what is being designed is of such a size that it challenges anyone’s ability to absorb it. There has to be some level of governance around data modelling, with architecture principles, naming standards and input from team leads, particularly around requirements being met, ETL issues being addressed and reporting needs being considered. Just publishing the model with no checks or balances is a recipe for dissatisfaction and confusion. Stop it!
  9. No effort is made to communicate data design decisions or changes to other team members (ETL, Testing, Reporting).
    • Of course, the intention isn’t to keep team members in the dark about changes that are being made, but rather this is a result of leaving communication to chance. When delivering quickly takes primacy over every other consideration, the impact of one piece of work on the whole system drops in importance. Communication becomes a distraction. It’s up to project management to ensure that everyone understands what each team is doing, and that potential risks get identified before changes are made. Good development team leads understand how upstream changes will affect them; but first they need to know that the changes are occurring. Keeping them out of the loop? Stop it!
  10. Design policy decisions are being overturned repeatedly during development.
    • This isn’t related to legitimate changes in your sources or even requirements. This is a matter of changing course with regard to fundamental design decisions. For example, starting with surrogate keys, and then changing to natural keys; or using domain models to manage data types, but then deciding to assign them to individual columns; or switching between atomic and dimensional structures within the same set of tables. Having design principles and standards defined prior to development, and then following through with them, can help enormously to accelerate the process. Conversely, second-guessing design choices during development will inevitably sow confusion and obstruct progress. Stop it!
  11. Reports are being built from database tables that have not been finalized.
    • Even as I write this it strains my credulity, but I have seen it happen. The drive to deliver quickly leads to situations where the report development team is given what is essentially an early draft of data designs. This would never happen in a disciplined project, but is almost inevitable in an environment where many of the points in this list are routine. Once development has extended this far, any change in the data, even the spelling correction of a column name, will have costly ripples throughout the team (logical and physical data models, mapping models, design documentation, ETL jobs being revised and retested, test case documentation etc.). The recommendation isn’t so much to slow down as to take care. More haste, less speed? Stop it!
  12. Documentation is avoided, particularly tracking of design decisions.
    • I have met very few people who actually enjoy writing documentation; and frequently it sits on the virtual shelf gathering digital dust. However, like a toque in January, when it’s required, its absence is keenly felt. Relying on verbal communication and email to give instructions and track decisions can be effective in the short-term, but it soon becomes unwieldy and unreliable. Structured documentation, which is kept up-to-date diligently, is an efficient method of capturing essential design information. The key is to endow it with importance. If the team uses documentation as a central reference, making additions and corrections when needed, it becomes a reliable touch point for the project team, and a resource for ongoing maintenance. Relying on the team’s collective memory doesn’t cut it. Stop it!

Any one of these practices will certainly have a negative impact on the project, but taken together they will make a dysfunctional mess. The good news is that they aren’t beyond therapy. There are ways to approach EDW development that are efficient, accurate and productive. Go for it!


Making the Case for EDW

An EDW enables strategic capabilities for organizations, resolving issues that plague complex technology environments while creating a solid foundation on which to build new opportunities.

Here are 7 key benefits of a well-implemented EDW:

1.     Provide an environment with suitable governance/controls to ensure integrity

Consolidating the data from multiple systems requires active governance by the business in cooperation with technology, and ultimately leads to more valuable information driving business decisions. The EDW provides the process to either make changes to the data as it enters the system, or expose the issues so they can be resolved at their source.

2.     Consolidate information from disparate source systems

Reporting within the context of a given source system can lead to stovepipes of information, without the perspective of the rest of the enterprise. By bringing the information from across the systems into a centralized repository, multiple perspectives can be gained – and with them, new insights.

3.     Accumulate historical records over time – for trend analysis, forecasting and “as was” reporting

Frequently, source systems are limited to current views of products, customers and vendors, and may contain a limited set of transactions. This restricts the ability to report on trends over time, on variances between current and historical periods, and on historical views, such as sales against previous versions of the organization’s hierarchy.

4.     Perform queries/reports on a dedicated environment, rather than on operational systems

Some organizations make do with drawing business intelligence from the operational systems themselves. The problem is that those systems may contain data integrity issues that go either unresolved or undetected. Also, operational processes need those resources; it is much better to have an environment dedicated to analytics, isolated from other business processes.

5.     Provide a dedicated environment tuned to performance of business intelligence

Each sector of the reference architecture provides a specific set of capabilities and is designed to optimize the efficiency of the overall system. The Information Provisioning sector is devoted to delivering information to end users and applications with clarity, ease of use, and speed.

6.     Provide a reporting environment that is easy to navigate and use

Complexity can be an obstacle to user adoption of any business intelligence solution. Simplicity starts with the design of the data structures in the Information Provisioning sector. The reference architecture is uncompromising in ensuring data integrity upstream, which allows users to be presented with recognizable business objects and intuitive links between them.

7.     Provide a secure environment, without risking operational system access

The ability to secure sensitive and private information for discrete sets of users remains a widely shared concern. In many cases, securing operational systems is not an option or is problematic. A centralized EDW provides the ability to apply a comprehensive security policy to a single system.

Data Design Principles

Obey the principles without being bound by them.

- Bruce Lee

Taking a practical approach to developing a well-formed enterprise data warehouse – and by that, I mean one that is accurate, efficient and productive – involves basing it on sound design principles. These principles are specific to each sector of the reference architecture, each of which enables specific capabilities and serves specific functions. Here, I would like to lay out the principles of the Information Warehousing layer’s normalized central repository – the system of record.

The Information Warehousing layer is designed as a normalized repository for the data that has been processed “upstream”. It arrives cleansed, transformed and mastered, and is consolidated here into a single “System of Record”. The discipline of normalization restructures the data, lifting it out of the confines of the source system’s single perspective and into the multiple perspectives of the enterprise. The data is modelled according to its “essence” rather than its “use”.

This process of redesign imposes a strict order on the data that promotes data integrity and retains a high degree of flexibility. The way the data is broken apart into separate tables makes it challenging to query, but direct querying is not this layer’s main purpose. Data integrity and flexibility are the primary goals, and tuning is geared towards load performance rather than data access.
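
To make the idea of modelling by “essence” rather than “use” concrete, here is a minimal sketch, with hypothetical table and column names, of a single source-shaped record being restructured into normalized System of Record entities:

    # Illustration (hypothetical names): a denormalized source "customer" row is
    # split into normalized entities -- Party, Location, and an associative
    # Party-to-Location relationship -- modelled by essence, not by source use.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE party (
            party_id     INTEGER PRIMARY KEY,
            party_name   TEXT NOT NULL
        );
        CREATE TABLE location (
            location_id  INTEGER PRIMARY KEY,
            address_line TEXT NOT NULL,
            city         TEXT NOT NULL
        );
        CREATE TABLE party_location_rel (          -- associative entity
            party_id     INTEGER REFERENCES party(party_id),
            location_id  INTEGER REFERENCES location(location_id),
            role_code    TEXT,                     -- e.g. 'MAILING', 'BILLING'
            effective_dt TEXT,
            PRIMARY KEY (party_id, location_id, role_code, effective_dt)
        );
    """)

    # A denormalized source row, shaped by how the source system uses the data.
    source_row = {"cust_name": "Acme Ltd", "mail_addr": "1 Main St", "mail_city": "Toronto"}

    cur.execute("INSERT INTO party (party_name) VALUES (?)", (source_row["cust_name"],))
    party_id = cur.lastrowid
    cur.execute("INSERT INTO location (address_line, city) VALUES (?, ?)",
                (source_row["mail_addr"], source_row["mail_city"]))
    location_id = cur.lastrowid
    cur.execute("INSERT INTO party_location_rel VALUES (?, ?, 'MAILING', '2024-01-01')",
                (party_id, location_id))
    conn.commit()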

  1. Data Integrity
    Take a proactive stance to protect referential integrity and reduce instances of redundancy or potential for inconsistency.
  2. Scalability
    Allow for increases in volumes or additional sources of existing information, both within subject areas (e.g., issuers, counterparties) and core concepts (e.g., parties, locations).
  3. Flexibility
    Allow for additional sources or changes in existing sources, so that the design is neither tied to nor a mirror of any given source. Design will give primary consideration to reuse, then extension, and finally modification of existing structures.
  4. Consistency
    Apply standard patterns for data design to promote efficiencies of data and ETL design. The decision-making process will be expedited, as will data modelling work and ETL development.
  5. Efficiency
    Focus efficiency on three aspects:

    1. Implementation
      Use of repeatable patterns for data design will minimize data modelling and ETL work effort.
    2. Operation
      Ease ongoing maintenance by keeping the number of data objects to a minimum, maintaining consistent standards, and applying logical structures for ease of navigation and use.
    3. Load Performance
      Give priority to the performance of ETL load processes, including those that use the System of Record as a source to load the Data Mart sector.
  6. Enterprise Perspective
    For all data objects, retain, and remain open to, a full range of existing and potential relationships between entities, to ensure that the data reflects an enterprise perspective that is not limited to any given project’s requirements.

These foundational principles are implemented through strategies that impact storage of history, hierarchical structures, degree of normalization, classifications, surrogate keys and a number of other aspects of design. The principles form the criteria to judge the best approach to take in a given situation. It’s not always straightforward, even with the principles in place – at times one has to favour one principle over another (e.g., flexibility over load performance), but this list provides the guidance to frame the debate and take a considered approach.

As suggested by the opening quote, I’m not advocating a blind conformance to a set of rules; but, in my experience, one of the greatest obstacles to efficient and effective development is the decision-making process. Limiting the parameters of debate with intelligent guidelines can facilitate decisions being made quickly and correctly.

Feel free to suggest additions or amendments to this list of design principles for the System of Record.

Nine ETL Design Principles

The principles of ETL design define the guidelines by which data migration processes are constructed. Below are nine principles to guide architectural decisions.

1. Performance
In almost all cases, the prime concern of any ETL implementation is to migrate the data from source to target as quickly as possible. There is usually a load “window” specified as part of the non-functional requirements; a duration of time that is available to complete the process. The constraints are based on either the availability of the source system data, the need for the business to have access to the information, or a combination of both.

2. Simplicity
As with all programming, a premium is placed on simplicity of design. This is in the interests of productivity of development time, consideration of ongoing maintenance, and a likely improvement in performance. The fewer steps involved, the less chance of mistakes being made, or places for things to go wrong. When changes need to be made, or fixes applied, the fewer touch points, the better. During the life of the processes, ownership will likely change hands. The system’s simplicity will aid clarity for those who need to take it on.

3. Repeatability
One needs to be able to re-run jobs and achieve consistent and predictable results each time. This means a job needs to be applicable to all relevant incoming sources, and in no way dependent on specific time parameters. If sources change, the process needs to handle those changes gracefully and consistently.
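
As one possible way to achieve this, the sketch below (hypothetical staging table and batch identifiers) makes a load step idempotent by removing anything previously loaded for the same batch before inserting it again:

    # Sketch of an idempotent load step: re-running the same batch produces the
    # same end state, because the job first removes what it loaded previously.
    import sqlite3

    def load_batch(conn, batch_id, rows):
        cur = conn.cursor()
        # Remove rows from any earlier run of this batch (makes the job re-runnable).
        cur.execute("DELETE FROM stg_account WHERE batch_id = ?", (batch_id,))
        cur.executemany(
            "INSERT INTO stg_account (batch_id, account_no, balance) VALUES (?, ?, ?)",
            [(batch_id, r["account_no"], r["balance"]) for r in rows],
        )
        conn.commit()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stg_account (batch_id TEXT, account_no TEXT, balance REAL)")
    rows = [{"account_no": "A-100", "balance": 250.0}, {"account_no": "A-101", "balance": 75.5}]
    load_batch(conn, "2024-06-30", rows)
    load_batch(conn, "2024-06-30", rows)   # second run leaves the table unchanged
    print(conn.execute("SELECT COUNT(*) FROM stg_account").fetchone()[0])  # -> 2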

4. Modularity
Units of work should be designed with an eye to repeatable patterns, interdependencies, and discrete operations that function in isolation. The goal is to have as few modules as possible to be applied as templates to all future development work. This principle assists with clarity and efficiency of design, as well as reusability.

5. Reusability
As a principle, the goal should be not only to repeat modular patterns, but where possible re-use existing jobs and apply parameterization. This optimizes the efficiency of development and reduces the cycles required for testing.
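
A minimal sketch of the idea, assuming a generic copy step whose source query, target table, and column list are supplied as parameters rather than hard-coded into each job:

    # Sketch of a reusable, parameterized load step: one tested function serves
    # many source/target pairs instead of each feed getting its own job.
    import sqlite3

    def copy_step(conn, source_sql, target_table, target_columns):
        # target_table and target_columns are assumed to come from trusted
        # configuration (e.g., a mapping table), not from user input.
        placeholders = ", ".join("?" for _ in target_columns)
        insert_sql = f"INSERT INTO {target_table} ({', '.join(target_columns)}) VALUES ({placeholders})"
        rows = conn.execute(source_sql).fetchall()
        conn.executemany(insert_sql, rows)
        conn.commit()

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE src_customer (cust_no TEXT, cust_name TEXT);
        CREATE TABLE sor_party (party_code TEXT, party_name TEXT);
        INSERT INTO src_customer VALUES ('C1', 'Acme Ltd'), ('C2', 'Globex');
    """)
    copy_step(conn, "SELECT cust_no, cust_name FROM src_customer",
              "sor_party", ["party_code", "party_name"])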

6. Extensibility
Rather than “bring everything” from a given source when a data migration process is first built, it should be possible to include only that which is identified as valuable to the business in the context of a given project or release cycle. Over time, additional data elements from the sources can be added to the ETL jobs, potentially with new targets. ETL jobs should take this iterative approach into account.

7. Revocability
This refers to the ability to reset the database after a run and return it to the state it was in prior to running the process. This will be important for testing cycles during development, but also in production, in the event the database becomes corrupted or requires rolling back to a previous day.
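
One way to support this, sketched below with hypothetical names, is to stamp every inserted row with a load identifier so an insert-only run can be backed out with a single operation (runs that update or delete existing rows would need additional handling, such as a before-image table):

    # Sketch of revocability: every row carries the load_id that created it,
    # so one delete returns the target to its state before that run.
    import sqlite3

    def revoke_load(conn, target_table, load_id):
        # target_table is assumed to come from trusted job configuration.
        conn.execute(f"DELETE FROM {target_table} WHERE load_id = ?", (load_id,))
        conn.commit()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sor_transaction (load_id TEXT, txn_no TEXT, amount REAL)")
    conn.executemany("INSERT INTO sor_transaction VALUES (?, ?, ?)",
                     [("L0042", "T-1", 100.0), ("L0042", "T-2", 40.0)])
    revoke_load(conn, "sor_transaction", "L0042")   # table is back to its prior state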

8. Subject-orientation
Workloads are to be divided into units based on business subject areas rather than source-system groupings or strictly target table structures. This recognizes that a given source table may contain information about more than one subject area (e.g., Customers and Accounts). In addition, a given subject area may be composed of multiple source tables, which may populate multiple targets. Simply limiting the ETL jobs to a single source and a single target may compromise the other principles, particularly performance and simplicity. Similarly, orienting the ETL jobs to either the source or target layouts may degrade the efficiency of the design.

9. Auditability
It is essential to be able to trace the path that data takes from source to target and be able to identify any transformations that are applied on values along the way.
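
As a simple illustration (the structure shown is an assumption, not a prescribed standard), an audit row can be written for each load, recording the source, the target, row counts, and the transformation rules applied, so values can be traced back from target to source:

    # Sketch of an audit trail: each load writes a row describing where the data
    # came from, where it went, and which transformation rules were applied.
    import json
    import sqlite3
    from datetime import datetime, timezone

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE etl_audit (
            load_id       TEXT,
            source_object TEXT,
            target_object TEXT,
            rows_read     INTEGER,
            rows_written  INTEGER,
            transform_log TEXT,      -- JSON list of rules applied
            loaded_at     TEXT
        )
    """)

    def write_audit(conn, load_id, source_object, target_object, rows_read, rows_written, rules):
        conn.execute(
            "INSERT INTO etl_audit VALUES (?, ?, ?, ?, ?, ?, ?)",
            (load_id, source_object, target_object, rows_read, rows_written,
             json.dumps(rules), datetime.now(timezone.utc).isoformat()),
        )
        conn.commit()

    write_audit(conn, "L0042", "src_customer", "sor_party", 1250, 1248,
                ["trim cust_name", "map country code ISO2 -> ISO3"])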

With these guiding principles, specific strategies can be employed with a set of criteria to judge their applicability.

I would be interested to hear feedback on this list, and am open to any additions.

Reference Domains Part II: Modelling Classifications

This is the second article in the series on working with reference domains, also commonly referred to as classifications. In Part I, we looked at the nature of classifications. Here we will discuss designing structures to accommodate them in the EDW’s Information Warehousing layer, within the normalized System of Record.

Two potential approaches to designing the Reference Domains are:

  1. Put each reference domain’s set of values into its own table.
  2. Collect all reference domains into two tables: one for a master set of domains and the other for all the values, with a column linking each value to the relevant domain.

The second approach is the recommended one, and the remainder of this article will present the design and the rationale. I have seen the consolidated reference data approach implemented many times within the System of Record of the EDW, and it has proved highly effective.

Given this preferred approach, every Reference Domain and Value is to be placed in the following two entities, sketched below:

  • Reference Domain
  • Reference Value
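
A minimal physical sketch of the consolidated design, with illustrative column names (the actual attribution and keys would follow the project’s modelling standards):

    # Sketch of the consolidated design: one table of domains, one table of
    # values, with each value linked to its owning domain.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE reference_domain (
            reference_domain_id INTEGER PRIMARY KEY,
            domain_name         TEXT NOT NULL UNIQUE      -- e.g. 'Party Status'
        );
        CREATE TABLE reference_value (
            reference_value_id  INTEGER PRIMARY KEY,
            reference_domain_id INTEGER NOT NULL REFERENCES reference_domain(reference_domain_id),
            value_code          TEXT NOT NULL,            -- e.g. 'ACT'
            value_description   TEXT,                     -- e.g. 'Active'
            UNIQUE (reference_domain_id, value_code)
        );
    """)

    conn.execute("INSERT INTO reference_domain (domain_name) VALUES ('Party Status')")
    domain_id = conn.execute(
        "SELECT reference_domain_id FROM reference_domain WHERE domain_name = 'Party Status'"
    ).fetchone()[0]
    conn.executemany(
        "INSERT INTO reference_value (reference_domain_id, value_code, value_description) VALUES (?, ?, ?)",
        [(domain_id, "ACT", "Active"), (domain_id, "CLS", "Closed")],
    )
    conn.commit()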

As mentioned in Reference Domains Part I, reference domains are commonly made up of pairs of code and description attributes. However, some domains may include additional attributes, such as short names, long names, or multi-language names. In such cases, consideration should be given to extending the model to accommodate the additional fields. (This is not the same as related reference values that may be stored together in the source, but should be separated in the EDW.)

The domain values may be organized hierarchically. To support many-to-many relationships between domain values, whether hierarchies or other relationship types, the associative entity Reference Value to Reference Value Relationship should be used. As with all associatives, it will accommodate multiple concurrent as well as historical relationships.

To facilitate changes in values within a given scheme, a Reference Domain to Reference Value Relationship entity can be deployed.

Maintaining History

The history of values for a given scheme, in relation to other core concepts, is to be retained within the relevant associative structures.

For example, the current value of Party Status Id, together with its effective date, Party Status Change Date, is held on the Party entity. Historical values are held on the Party to Domain Relationship entity, with the Domain Scheme Id holding the identifier for the Party Status domain and the Domain Value Id holding the identifier for the previous value.

To illustrate the example more fully, consider the following:

The Domain Scheme and Domain Value tables contain rows for the Party Status domain scheme.

Day 1

On day one, the party enters the system with a party status of “Active”. There are no entries in the Party to Domain Relationship table.

Day 2

On day two, the party status changes to “Closed”. The current value in Party is overwritten with the new value, the change date is updated, and the old value is inserted into the associative table.
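
A sketch of the two days’ processing, using illustrative physical names for the entities described above (the effective and end date columns on the associative are assumptions added for the example):

    # Sketch of the Day 1 / Day 2 example: the current status lives on party;
    # superseded statuses accumulate in party_to_domain_rel.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE party (
            party_id                 INTEGER PRIMARY KEY,
            party_status_id          INTEGER,        -- current value
            party_status_change_date TEXT
        );
        CREATE TABLE party_to_domain_rel (           -- associative: holds history
            party_id         INTEGER,
            domain_scheme_id INTEGER,                -- identifies the Party Status domain
            domain_value_id  INTEGER,                -- the superseded value
            effective_date   TEXT,
            end_date         TEXT
        );
    """)

    PARTY_STATUS_SCHEME, ACTIVE, CLOSED = 10, 101, 102

    # Day 1: the party arrives as 'Active'; nothing is written to the associative.
    conn.execute("INSERT INTO party VALUES (1, ?, '2024-06-01')", (ACTIVE,))

    # Day 2: the status changes to 'Closed' -- overwrite the current value, update
    # the change date, and insert the old value into the associative.
    old_status, old_date = conn.execute(
        "SELECT party_status_id, party_status_change_date FROM party WHERE party_id = 1").fetchone()
    conn.execute("UPDATE party SET party_status_id = ?, party_status_change_date = '2024-06-02' "
                 "WHERE party_id = 1", (CLOSED,))
    conn.execute("INSERT INTO party_to_domain_rel VALUES (1, ?, ?, ?, '2024-06-02')",
                 (PARTY_STATUS_SCHEME, old_status, old_date))
    conn.commit()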

Design Benefits

The following points outline the benefits of employing a consolidated set of domain schemes and values within a small set of tables, versus creating a separate table for each domain scheme’s set of values.

  1. Data Integrity: The integrity of the allowable set of values within a given domain is maintained through the application logic defined and implemented in the ETL. Maintaining a master list reduces the chance of inconsistencies or the appearance of anomalies.
  2. Flexibility: There is only one table to access for all domain value codes and descriptions within the System of Record (SOR).
  3. Implementation Efficiency: No additional database entities or tables need to be created, tested, or maintained within the SOR logical or physical model. Fewer objects mean less chance of error.
  4. Operational Efficiency: A single utility can be created to interface with this table. It is true that even with multiple tables a single utility could be created with minimal logic. However, each new object would require some changes to the application, whereas the consolidated version can be extended seamlessly.
  5. Consistency: History is stored in the classification associative tables (e.g., Party to Domain Relationship). Retaining the domain schemes and values as surrogate keys in the Domain Value entity facilitates these structures, allowing history to be gathered using a common mechanism that explicitly identifies both the scheme and the value.

In part three, we will go on to look at collecting and documenting classifications.
