A Question of Scope

Don’t bite off more than you can chew because nobody looks attractive spitting it back out.

- Carroll Bryant

I apologize for the image – but the sentiment is apt. My mother would have said that “my eyes were bigger than my stomach” and admonished me for letting good food go to waste. The problem is that when the size of the project is beyond the budget constraints of the business, or the capacity of the development team, work goes unfinished. And so it’s essential that the scope of the effort be understood before any commitment is made to complete it.

The scope of a data management initiative is influenced by 3 factors:

  1. Reference Architecture
  2. Source Data Requirements
  3. Target Dimensional Breakdown

Reference Architecture

The reference architecture determines the number of areas that need to be modeled and the number of times the data will be migrated from one sector to another. Each sector also has its own degree of complexity – some considerably less than others.

  • What is the reference architecture of the target solution?
  • How many sectors of the reference architecture are targeted for this initiative?
  • Are we building on an existing structure or is this the initial project?
  • Do you have existing architecture, naming and content standards?
EDW Reference Architecture

EDW Reference Architecture

Source Data Requirements

The source data requirements are the list of fields that are to be brought into the target system. It is important to understand that a cost attaches to each one of these fields; and that if one field is being drawn from a source table, it will not be the same level of effort to bring in additional fields from the same table. Therefore, the scoping needs to be based on the number of individual pieces of information, not just the number of source tables.

  • How many distinct fields are required? (e.g., Customer First Name, Date of Birth, Industry Classification Code)
  • How many different source systems are involved?
  • How many different source tables are involved?
  • How many fields are being drawn from multiple sources? (e.g., Customer First Name coming from Marketing database and Point of Sale system)

Target Dimensional Breakdown

The target dimensional breakdown determines the breadth of subject areas being modeled, and the potential complexity of the processes involved.

The measures are numeric “counts and amounts” that either come directly from source systems, and so are relatively straightforward, or need to be calculated or derived from component values.

Obviously, these are more complex, and will involve the storage of both components and resulting calculations.

The dimensions give the measures context (e.g, sales by product, balances by branch). The number of different dimensions that need to be modeled and potentially mastered can have a significant impact on development time, particularly when they involve hierarchies.

  • What are the measures?
  • How many require calculations/derivations vs. coming directly from the sources?
  • What dimensions are being requested (customer, employee, store, branch etc.)?
  • What hierarchies are being requested?

Once you’ve established the scope of the project, you can estimate the time it will take to develop it. These are important inputs into understanding the order of magnitude of the task at hand; and an essential set of questions to be answered when communicating with the internal development team or a potential vendor.

We all know that scope can creep once the project gets going, and that becomes a management issue; but it’s important to start with a baseline understanding of the size of the task at hand. It helps set appropriate expectations for everyone involved. And it’s good for the digestion.

Nine ETL Design Principles

The principles of ETL design define the guidelines by which data migration will be constructed. Below are 9 principles to guide architectural decisions.

1. Performance
In almost all cases, the prime concern of any ETL implementation is to migrate the data from source to target as quickly as possible. There is usually a load “window” specified as part of the non-functional requirements; a duration of time that is available to complete the process. The constraints are based on either the availability of the source system data, the need for the business to have access to the information, or a combination of both.

2. Simplicity
As with all programming, a premium is placed on simplicity of design. This is in the interests of productivity of development time, consideration of ongoing maintenance, and a likely improvement in performance. The fewer steps involved, the less chance of mistakes being made, or places for things to go wrong. When changes need to be made, or fixes applied, the fewer touch points, the better. During the life of the processes, ownership will likely change hands. The system’s simplicity will aid clarity for those who need to take it on.

3. Repeatability
One needs to be able to re-run jobs to achieve consistent and predictable results each time. This means it needs to be applicable to all relevant incoming sources, and in no way dependent on specific time parameters. If sources change, the process needs to handle those changes gracefully and consistently.

4. Modularity
Units of work should be designed with an eye to repeatable patterns, interdependencies and discreet operations that function in isolation. The goal is to have as few modules as possible to be applied as templates to all future development work. This principle assists with clarity and efficiency of design, as well as reusability.

5. Reusability
As a principle, the goal should be not only to repeat modular patterns, but where possible re-use existing jobs and apply parameterization. This optimizes the efficiency of development and reduces the cycles required for testing.

6. Extensibility
Rather than “bring everything” from a given source when a data migration process is first built, it should be possible to include only that which is identified as valuable to the business in the context of a given project or release cycle. Over time, additional data elements from the sources can be added to the ETL jobs, with potentially, new targets. The ETL job should take this iterative approach into account.

7. Revocability
This refers to the ability to reset the database after a run and to return to the state it was in prior to running the process. This will be important for testing cycles during the development process, but also in production, in the event the database becomes corrupted or requires rolling back to a previous day.

8. Subject-orientation
Workloads are to be divided into units based on business subject areas rather than source-system groupings or strictly target table structures. This recognizes that a given source table may contain information about more than one subject area (e.g., Customers and Accounts) In addition, a given subject area may be composed of multiple source tables, which may populate multiple targets. Simply limiting the ETL jobs to a single source and a single target may compromise the other principles, particularly performance and simplicity. Similarly, orienting the ETL jobs to either the source or target layouts may degrade the efficiency of the design.

9. Auditability
It is essential to be able to trace the path that data takes from source to target and be able to identify any transformations that are applied on values along the way.

With these guiding principles, specific strategies can be employed with a set of criteria to judge their applicability.

I would be interested to hear feedback on this list; and am open to any additions.

In Defense of Surrogate Keys

Employing surrogate keys is an essential strategy to protect the integrity of data assets in the EDW. For those looking to defend such a position against those who would prefer human-readable business keys (a.k.a. natural keys), this article lays out some arguments. I hope by the end the non-believers will be converted, and those who already know the value of surrogates can move ahead without opposition.

Michelle A. Poolet has provided four criteria to determine whether a surrogate key is required:

1. Is the primary key unique?

Where a natural key is not unique, such as a person’s full name, a surrogate key must be used. However, a unique composite alternate key will be needed to match on the record for updates, to attach attribution or establish relationships.

2. Does it apply to all rows?

Take an example such as Customer. Customers may all come from the same source and be loaded into a Party table. However, parties that play different roles and that are loaded from different source systems, may have natural keys of different formats (e.g., company business number, vendor identifier)

3. Is it minimal?

Composite keys and long descriptive fields are obvious candidates for transformation to surrogate keys.(e.g., Postal Address)

4. Is it stable over time?

Identifiers generated by a Master Data Management (MDM) solution are designed not to alter over time. However, if the MDM solution was to change technology, or some sources were to be passed into the EDW through a different channel, the MDM key would be problematic. It is preferable for surrogate keys to be generated internally to the Information Warehousing layer.

Rationale for preference of Surrogate Keys over Natural Keys: 

# Issue Natural Key Surrogate Key
1 Primary and Foreign Key Size Natural keys and their indexes can be large, depending on the format and components involved in making a record unique. Surrogate keys are usually a single column of type integer, making them as small as possible.
2 Optionality and Applicability The population of natural keys may not be enforced in the source data. Surrogate keys can be attached to any record.
3 Uniqueness Natural keys always have the possibility of contention. Even where a single source has protected against it, contention may exist across sources. Surrogate values are guaranteed to be unique.
4 Privacy Natural keys may be prone to privacy or security concerns. Surrogate keys are cryptic values with no intrinsic meaning.
5 Cascading Updates If the business or source changes the format or datatype of the key, it will need to be updated everywhere it is used. Surrogate values never change.
6 Load Efficiencies With the exception of very short text values, VARCHAR values are less efficient to load than INTEGER. Surrogate values can be loaded more efficiently.
7 Join Efficiencies With the exception of very short text values, joins using VARCHAR values are less efficient than INTEGER. Surrogate values can be joined more efficiently.
8 Common key strategy A lack of a common key strategy will lead to longer decision-making process and more patterns for code development. A common key strategy based on the use of surrogates will streamline design and development.
9 Relationships Natural keys with different data types will lead to the creation of more objects in the database to accommodate the associations between entities. The use of surrogates promotes reuse of associative entities for multiple types of relationships.
10 History Related to points above, the use of natural keys inhibits the flexibility and reuse of objects and patterns to support change capture. Surrogate keys enable a strategic design to store history using associatives.

The list above is drawn in part from an article by Lee Richardson, discussing the pros and cons of surrogate keys.

In summary, the reasons to employ surrogate keys as a design policy are related to two factors

1. Performance

In the majority of cases the surrogate key will perform more efficiently for both loading and joining tables.

2. Productivity

Even in the minority of cases where a natural key will perform as well as a surrogate, by adopting a consistent surrogate key policy there are benefits to efficiencies gained in design, development and maintenance.

EDW Roles and Responsibilities: Part I

In the next couple of articles, we look at the organizational structure of the EDW development team. This needs to be considered in the context of the development life cycle and areas of governance that I have discussed elsewhere, with particular reference to Release Management.

A few things to note while reading this list:

  • Technical infrastructure is not included – this is taken as a given. In my experience, the hardware and software is set up and maintained by a different group, who are not usually involved directly in any design or development decisions. Of course, they are involved when it comes to the underlying technical architecture of the overall solution, and the infrastructure needs to support the flow of data through the warehouse layers.
  • The organization tree structure diagram above is conceptual, and intended to represent a generic organization. Some organizations are larger and have many resources, internal and external within any branch. Others have a small core team that does all the development, with a single person handling many roles.
  • The main thing to take away from the diagram is that the EDW Program team analyse, data model, ETL and QA their work, while project streams gather requirements, spec and build reports. The EDW Council is a body of higher authority who can arbitrate design decisions and provide governance of the process.

Program/Release Manager

The program manager is responsible for the overall management of the EDW release. This includes the fulfillment of all data requirements, adjusted after identification and analysis of the sources; as well as delivery of the validated, repeatable processes to populate the target databases.

Throughout the release life cycle, the program manager is accountable for all the major milestone deliverables that sit within the responsibility of the EDW development team. This excludes only the detailing of requirements for each project stream and the development of the business intelligence solution that is built on top of the data mart structures.

The program manager must be part of the intake process to determine the scope of each release. To this end, he/she is consulted on the definition of each project’s requirements.

While the program manager is not directly involved in, or responsible for, business intelligence, he/she is kept informed of BI progress to remain responsive to needs that may arise.

Project Manager

The project manager of each stream is directly accountable for the detailed definition of requirements feeding into the EDW, as well as the documentation of BI specifications and the delivery of BI reporting that will draw from the data mart.

The project manager will not be involved in the development of the EDW, but will remain informed of its progress. Early in the process, it is particularly important that the project manager be apprised of data quality issues that arise through identification and analysis of sources. The system of record is not relevant to an individual project, while the data mart will be a critical aspect of the project’s success; and so the project manager is informed of major milestones in the development of the data mart.

Business Analyst

The business analyst is responsible for the compiling the detailed definition of requirements. It is also the business analyst’s responsibility to identify the sources that will fulfill those requirements. It is to be expected that not all analysts will have the business knowledge to complete the first task as well as the technical knowledge to complete the second. This is a case where a single role could be performed by different people.

The business analyst supports processes that involve, or require input from, analysis. This means that may be actively involved in the analysis of profile results and writing SQL queries to interrogate the source data. He/she will find, amend and compose business definitions of fields. When it comes time to map sources to targets, the analyst will be engaged to communicate his/her understanding to the person writing the document. It may even be the case that the analyst’s skills in documentation mean that he/she writes does the writing.

Data content governance involves insight into the nature and quality of the data. To this end, the business analyst supports this governance process. The analyst is also consulted on the information and reporting requirements to determine suitable structures for the data mart.

Data Steward

The data steward is responsible for governing what data will be brought into the EDW, the governance of definitions for each of the data elements, and the identification and definition of business rules to ensure data quality. This means, that while the business analyst will see that definitions are attached to each field, the data steward must ensure that the definitions are complete and accurate.

The data steward supports processes that depend on an understanding of the business meaning of data. This includes the definition of requirements, the identification and analysis of sources, and the mapping of those sources to the system of record.

The data steward also supports naming standards, although, due to each steward being attached to a given subject area domain, or set of domains, they are not responsible for setting or maintaining these standards.

Data Quality Analyst

The data quality analyst is responsible for assessing the quality of data at various layers. Initially, the quality analyst is engaged in the activity of running the data profiling functions and any interrogation queries. While the data analyst may draft queries, it is ultimately the responsible of the quality analyst to ensure that the analyst’s questions get answered. Later in the development life cycle, the data quality analyst is responsible for testing the data populating the system of record and the data mart.

The quality analyst enables data analysis, and so supports the governance of data content in assessing the integrity and suitability of the sources.

The data quality analyst may be consulted on questions arising from mapping sources to the system of record. They may offer technical insights or be able to perform additional quality tests to support the mapping exercise.

The requirements and source fields are an input to the quality analyst’s activities, and so he/she is informed of the results of these processes.

Enterprise Data Architect

The enterprise data architect is responsible for the governance of data architecture, including the establishment of principles and guidelines, as well as reviews and final sign-off of the data models. He/she is also responsible for establishing guidelines for logical and physical naming standards.

The enterprise data architect is consulted on all data modelling activities, relating to both the system of record and the data mart. Data models for any other component of the architect will also pass through enterprise architecture for consultation.

The governance of data content relates to the placement of relevant data within the context of the reference architecture. The governance process involves the consultation of the enterprise data architect.

The requirements and source fields are an input to the enterprise data architect’s activities, and so he/she is informed of the results of these processes.

Data Modeller

The data modeller is responsible for all processes that are directly related to data model artifacts, including the logical mapping of the source to the system of record, logical modelling and physical modelling.

The data modeller supports processes that modelling depends on, including the analysis of the sources through profiling and interrogation; or are dependent on the models, which includes the documentation of the physical mappings to populate both the system of record and the data mart.

The governance of business definitions, naming standards and data architecture all pertain directly to the data models; and so the data modeller supports these processes, by ensuring that all fields contain complete, well-formed and accurate definitions and conform to standard naming conventions, and that the structure aligns with the design standards of the given component within the reference architecture.

The data modeller is informed of requirements and source fields, as these are inputs to the process of designing data structures.

Part II will complete the list of EDW Roles and Responsibilities.

Follow

Get every new post delivered to your Inbox.