Stop It: EDW Worst Practices

th-2There’s a great comedy sketch with Bob Newhart as a therapist who dispenses one key piece of advice to all his patients: Stop It! Whatever is holding them back or causing them distress, his counsel is to simply stop doing it. Afraid of heights? Stop it. Biting your nails? Stop it. You get the idea.

In a similar vein, I would like to present some practical advice to organizations struggling with the development of their data warehouse. These behaviours are going to seem obviously wrong-headed to many of you. But I assure you, they are very real to some – and you know who you are. For those of you experiencing the frustration currently, I hope you can take some comfort in the recognition that you’re not alone.

  1. Work effort and timelines are being defined with no regard to the scope of the work to be delivered
    • While it’s fair to estimate work effort and then allow for revision of scope as work proceeds, it is not reasonable to set completion dates based on a list of requirement headings. Any guesswork has to be educated. Some initial scoping exercise should be done at the outset of the project and development team leads should be consulted for their input on the complexity of the work. Promising delivery dates without this preliminary discovery is like providing a definitive answer to the proverbial question: How long is a piece of string? Stop it!
  2. Development team members aren’t limited to a single task at a given time
    • Some people become so enamoured of parallel development that they apply the concept at the individual level. The idea is that the developer takes on more than one task concurrently, each task given equal priority, and development time will be reduced significantly. Perhaps there are some uniquely gifted people who can multi-task to this degree, but for most of us it leads to nothing being done with our full attention, and the end result is inferior output. Prioritize work to be done, respect dependencies between tasks, and give each item the requisite focus. Efficiencies cannot be gained through concurrent assignments. Stop it!
  3. No analysis or profiling of data sources is done, and if it is, it’s kept separate from target data design decisions.
    • When a project plan is a set of fundamentally unrelated tasks, it’s a matter of checking each one off, and the order in which they’re done is of no consequence. With this perspective on development, the profiling and analysis of data sources has no bearing on the design of the EDW and can comfortably be done in parallel, or even after, modelling. If only this were so. The inconvenient truth is that data profiling is an essential input to requirements assessment, data modelling, ETL, and testing. It can even provide insight to business intelligence. Bypassing or isolating source analysis? Stop it!
  4. Source systems’ development work is happening in parallel with EDW development.
    • This might be related to a source system that is in the midst of an upgrade, or being migrated to a new platform. Whatever the reason, the result is that it keeps changing. This is certain to lead to significant rework. There will be some that argue it just has to be done, but it’s an expensive call to make. It would be far more cost-effective to wait until the source is in a (relatively) stable state, at least enough that change can be managed in periodic releases. Stop it!
  5. The EDW target data models are being designed from reporting requirements only, without reference to actual sources.
    • It seems like a reasonable thing to do, because it’s related to the end state, after all. We know that fields with the names from the report requirements will ultimately be needed. The problem is that there’s no context for that information. What data elements are related to what dimensions? Do we understand all the primary keys that will be involved? Even if the report requirements suggest field lengths, the data types may not match the sources when they do come available. This is a bad idea. Stop it!
  6. Change requests are being delivered directly to the data modellers from any other team member.
    • This may look like the ultimate in democratized agility. Anyone can ask for changes in the database at any time: business analysts doing requirements, ETL developers trying to load the tables, or even the reporting team, ensuring that everything they need is available. Having a process to manage model changes through a central project manager can be time-consuming and frustrating. To some people it smacks too much of old-school “waterfall” methodologies. The reality is that this degree of fluidity isn’t democratization, it’s anarchy. Stop it!
  7. EDW data models are created without reference to source data types.
    • This is part of a “do whatever you can for now” philosophy, that is impatient with waiting for actual source systems to be identified. Instead, modeling work is to be done directly from requirements documents. It seems smart at first blush, because at least the work of building the target can get started; when the source comes in we’ll be that much further ahead. Except it doesn’t work that way. Only when the source shows up are we able to see exactly what needs to be done; what’s really available, what’s populated, how it’s structured. Don’t waste your time. Stop it!
  8. No model reviews are conducted.
    • It’s easy to see how a model review can get in the way of delivering swiftly, especially when the scope of what is being designed is of such a size that it challenges anyone’s ability to absorb it. There has to be some level of governance around data modeling, with architecture principles, naming standards and input from team leads, particularly around requirements being met, ETL issues being addressed and reporting needs being considered. Just publishing the model with no checks or balances is a recipe for dissatisfaction and confusion. Stop it!
  9. No effort is made to communicate data design decisions or changes to other team members (ETL, Testing, Reporting).
    • Of course, the intention isn’t to keep team members in the dark about changes that are being made, but rather this is a result of leaving communication to chance. When delivering quickly takes primacy over every other consideration, the impact of one piece of work on the whole system drops in importance. Communication becomes a distraction. It’s up to project management to ensure that everyone understands what each team is doing, and that potential risks get identified before changes are made. Good development team leads understand how upstream changes will affect them; but first they need to know that the changes are occurring. Keeping them out of the loop? Stop it!
  10. Design policy decisions are being overturned repeatedly during development
    • This isn’t related to legitimate changes in your sources or even requirements. This is a matter of changing course with regard to fundamental design decisions. For example, starting with surrogate keys, and then changing to natural keys; or using domain models to manage data types, but then deciding to assign them to individual columns; or switching between atomic and dimensional structures within the same set of tables. Having design principles and standards defined prior to development, and then following through with them, can help enormously to accelerate the process. Conversely, second-guessing design choices during development will inevitably sow confusion and obstruct progress. Stop it!
  11. Reports are being built from database tables that have not been finalized
    • Even as I write this it strains my credulity, but I have seen it happen. The drive to deliver quickly leads to situations where the report development team is given what is essentially an early draft of data designs. This would never happen in a disciplined project, but is almost inevitable in an environment where many of the points in this list are routine. Once development has extended this far, any change in the data, even the spelling correction of a column name, will have costly ripples throughout the team (logical and physical data models, mapping models, design documentation, ETL jobs being revised and retested, test case documentation etc.). The recommendation isn’t so much to slow down, so much as to take care. More haste, less speed? Stop it!
  12. Documentation is avoided, particularly tracking of design decisions
    • I have met very few people who actually enjoy writing documentation; and frequently it sits on the virtual shelf gathering digital dust. However, like a toque in January, when it’s required, its absence is keenly felt. Relying on verbal communication and email to give instructions and track decisions can be effective in the short-term, but it soon becomes unwieldy and unreliable. Structured documentation, which is kept up-to-date diligently, is an efficient method of capturing essential design information. The key is to endow it with importance. If the team uses documentation as a central reference, making additions and corrections when needed, it becomes a reliable touch point for the project team, and a resource for ongoing maintenance. Relying on the team’s collective memory doesn’t cut it. Stop it!

Any one of these practices will certainly have a negative impact on the project, but taken together they will make a dysfunctional mess. The good news is that they aren’t beyond therapy. There are ways to approach EDW development that are efficient, accurate and productive. Go for it!



Nine ETL Design Principles

The principles of ETL design define the guidelines by which data migration will be constructed. Below are 9 principles to guide architectural decisions.

1. Performance
In almost all cases, the prime concern of any ETL implementation is to migrate the data from source to target as quickly as possible. There is usually a load “window” specified as part of the non-functional requirements; a duration of time that is available to complete the process. The constraints are based on either the availability of the source system data, the need for the business to have access to the information, or a combination of both.

2. Simplicity
As with all programming, a premium is placed on simplicity of design. This is in the interests of productivity of development time, consideration of ongoing maintenance, and a likely improvement in performance. The fewer steps involved, the less chance of mistakes being made, or places for things to go wrong. When changes need to be made, or fixes applied, the fewer touch points, the better. During the life of the processes, ownership will likely change hands. The system’s simplicity will aid clarity for those who need to take it on.

3. Repeatability
One needs to be able to re-run jobs to achieve consistent and predictable results each time. This means it needs to be applicable to all relevant incoming sources, and in no way dependent on specific time parameters. If sources change, the process needs to handle those changes gracefully and consistently.

4. Modularity
Units of work should be designed with an eye to repeatable patterns, interdependencies and discreet operations that function in isolation. The goal is to have as few modules as possible to be applied as templates to all future development work. This principle assists with clarity and efficiency of design, as well as reusability.

5. Reusability
As a principle, the goal should be not only to repeat modular patterns, but where possible re-use existing jobs and apply parameterization. This optimizes the efficiency of development and reduces the cycles required for testing.

6. Extensibility
Rather than “bring everything” from a given source when a data migration process is first built, it should be possible to include only that which is identified as valuable to the business in the context of a given project or release cycle. Over time, additional data elements from the sources can be added to the ETL jobs, with potentially, new targets. The ETL job should take this iterative approach into account.

7. Revocability
This refers to the ability to reset the database after a run and to return to the state it was in prior to running the process. This will be important for testing cycles during the development process, but also in production, in the event the database becomes corrupted or requires rolling back to a previous day.

8. Subject-orientation
Workloads are to be divided into units based on business subject areas rather than source-system groupings or strictly target table structures. This recognizes that a given source table may contain information about more than one subject area (e.g., Customers and Accounts) In addition, a given subject area may be composed of multiple source tables, which may populate multiple targets. Simply limiting the ETL jobs to a single source and a single target may compromise the other principles, particularly performance and simplicity. Similarly, orienting the ETL jobs to either the source or target layouts may degrade the efficiency of the design.

9. Auditability
It is essential to be able to trace the path that data takes from source to target and be able to identify any transformations that are applied on values along the way.

With these guiding principles, specific strategies can be employed with a set of criteria to judge their applicability.

I would be interested to hear feedback on this list; and am open to any additions.

In Defense of Surrogate Keys

Employing surrogate keys is an essential strategy to protect the integrity of data assets in the EDW. For those looking to defend such a position against those who would prefer human-readable business keys (a.k.a. natural keys), this article lays out some arguments. I hope by the end the non-believers will be converted, and those who already know the value of surrogates can move ahead without opposition.

Michelle A. Poolet has provided four criteria to determine whether a surrogate key is required:

1. Is the primary key unique?

Where a natural key is not unique, such as a person’s full name, a surrogate key must be used. However, a unique composite alternate key will be needed to match on the record for updates, to attach attribution or establish relationships.

2. Does it apply to all rows?

Take an example such as Customer. Customers may all come from the same source and be loaded into a Party table. However, parties that play different roles and that are loaded from different source systems, may have natural keys of different formats (e.g., company business number, vendor identifier)

3. Is it minimal?

Composite keys and long descriptive fields are obvious candidates for transformation to surrogate keys.(e.g., Postal Address)

4. Is it stable over time?

Identifiers generated by a Master Data Management (MDM) solution are designed not to alter over time. However, if the MDM solution was to change technology, or some sources were to be passed into the EDW through a different channel, the MDM key would be problematic. It is preferable for surrogate keys to be generated internally to the Information Warehousing layer.

Rationale for preference of Surrogate Keys over Natural Keys: 

# Issue Natural Key Surrogate Key
1 Primary and Foreign Key Size Natural keys and their indexes can be large, depending on the format and components involved in making a record unique. Surrogate keys are usually a single column of type integer, making them as small as possible.
2 Optionality and Applicability The population of natural keys may not be enforced in the source data. Surrogate keys can be attached to any record.
3 Uniqueness Natural keys always have the possibility of contention. Even where a single source has protected against it, contention may exist across sources. Surrogate values are guaranteed to be unique.
4 Privacy Natural keys may be prone to privacy or security concerns. Surrogate keys are cryptic values with no intrinsic meaning.
5 Cascading Updates If the business or source changes the format or datatype of the key, it will need to be updated everywhere it is used. Surrogate values never change.
6 Load Efficiencies With the exception of very short text values, VARCHAR values are less efficient to load than INTEGER. Surrogate values can be loaded more efficiently.
7 Join Efficiencies With the exception of very short text values, joins using VARCHAR values are less efficient than INTEGER. Surrogate values can be joined more efficiently.
8 Common key strategy A lack of a common key strategy will lead to longer decision-making process and more patterns for code development. A common key strategy based on the use of surrogates will streamline design and development.
9 Relationships Natural keys with different data types will lead to the creation of more objects in the database to accommodate the associations between entities. The use of surrogates promotes reuse of associative entities for multiple types of relationships.
10 History Related to points above, the use of natural keys inhibits the flexibility and reuse of objects and patterns to support change capture. Surrogate keys enable a strategic design to store history using associatives.

The list above is drawn in part from an article by Lee Richardson, discussing the pros and cons of surrogate keys.

In summary, the reasons to employ surrogate keys as a design policy are related to two factors

1. Performance

In the majority of cases the surrogate key will perform more efficiently for both loading and joining tables.

2. Productivity

Even in the minority of cases where a natural key will perform as well as a surrogate, by adopting a consistent surrogate key policy there are benefits to efficiencies gained in design, development and maintenance.