Data Design Principles

Obey the principles without being bound by them.

- Bruce Lee

Taking a practical approach to developing a well-formed enterprise data warehouse – and by that, I mean one that is accurate, efficient and productive – involves basing it on sound design principles. These principles are specific to each sector of the reference architecture, each of which enables specific capabilities and serves specific functions. Here, I would like to lay out the principles of the Information Warehousing layer’s normalized central repository – the system of record.

The Information Warehousing layer is designed as a normalized repository for the data that has been processed “upstream”. It arrives cleansed, transformed and mastered, and is consolidated here into a single “System of Record”. The discipline of normalization restructures the data, freeing it from the confines of the source system’s single perspective and opening it to the multiple perspectives of the enterprise. The data is modelled according to its “essence” rather than its “use”.

This process of redesign imposes a strict order on the data that promotes data integrity and retains a high degree of flexibility. The way the data is broken apart into separate tables makes it challenging to query, but ease of querying is not this layer’s purpose. Data integrity and flexibility are the primary goals, and tuning is geared towards load performance rather than data access.

  1. Data Integrity
    Take a proactive stance to protect referential integrity and reduce instances of redundancy or potential for inconsistency.
  2. Scalability
    Allow for increases in volumes or additional sources of existing information, both within subject areas (e.g., issuers, counterparties) and core concepts (e.g., parties, locations).
  3. Flexibility
    Allow for additional sources or changes in existing sources, so that the design is neither tied to a given source nor mirrors it. Design will give primary consideration to reuse, then extension and finally modification of existing structures.
  4. Consistency
    Apply standard patterns for data design to promote efficiencies of data and ETL design. The decision-making process will be expedited as will data modelling work and ETL development.
  5. Efficiency
    Focus efficiency on three aspects:

    1. Implementation
      Use of repeatable patterns for data design will minimize data modelling and ETL work effort.
    2. Operation
      Ease ongoing maintenance by keeping the number of data objects to a minimum; maintain consistent standards; and apply logical structures for ease of navigation and use.
    3. Load Performance
      Priority is given to the performance of ETL load processes, including those that use the System of Record as a source to load the Data Mart sector.
  6. Enterprise Perspective
    For all data objects, retain, and remain open to, a full range of existing and potential relationships between entities to ensure that data reflects an enterprise perspective that is not limited to the perspective of any given project’s requirements.

These foundational principles are implemented through strategies that impact storage of history, hierarchical structures, degree of normalization, classifications, surrogate keys and a number of other aspects of design. The principles form the criteria to judge the best approach to take in a given situation. It’s not always straightforward, even with the principles in place – at times one has to favour one principle over another (e.g., flexibility over load performance), but this list provides the guidance to frame the debate and take a considered approach.

As suggested by the opening quote, I’m not advocating a blind conformance to a set of rules; but, in my experience, one of the greatest obstacles to efficient and effective development is the decision-making process. Limiting the parameters of debate with intelligent guidelines can facilitate decisions being made quickly and correctly.

Feel free to suggest additions or amendments to this list of design principles for the System of Record.

Reference Domains Part II: Modelling Classifications

This is the second article in the series on working with reference domains, also commonly referred to as classifications. In Part I, we looked at the nature of classifications. Here we will discuss designing structures to accommodate them in the EDW’s Information Warehousing layer, within the normalized System of Record.

Two potential approaches to designing the Reference Domains are to:

  1. Put each reference domain of values into its own table.
  2. Collect all reference domains into two tables: one for a master set of domains and the other for all the values, with a column linking each value to the relevant domain.

The second approach is the recommended one, and the remainder of this article will present the design and the rationale. I have seen the consolidated reference data approach implemented many times within the System of Record of the EDW; and it has proved highly effective.

Given this preferred approach, every Reference Domain and Value is to be placed in the following entities:

  • Reference Domain
  • Reference Value

As mentioned in Reference Domains Part I, reference domains are commonly made up of pairs of code and description attributes. However, it is possible that some domains may comprise additional attribution, such as short names, long names, multi-language names, etc. In such cases, consideration should be given to extending the model to accommodate additional fields. (This is not the same as related reference values that may be stored together in the source, but should be separated in the EDW.)

The domain values may be organized hierarchically. To support many-to-many relationships between domain values, in terms of hierarchies or other relationship types, the associative entity, Reference Value to Reference Value Relationship, should be used. As with all associatives, it will accommodate multiple concurrent as well as historical relationships.

To facilitate changes in values within a given scheme, a Reference Domain to Reference Value Relationship entity can be deployed.
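
To make the recommended design concrete, the following is a minimal DDL sketch of the two core entities plus the value-to-value associative (table and column names are illustrative, not prescriptive, and syntax will vary by platform); the Reference Domain to Reference Value Relationship entity would follow the same associative pattern.

  -- Master list of reference domains (schemes)
  CREATE TABLE reference_domain (
      reference_domain_id  INTEGER      NOT NULL PRIMARY KEY,  -- surrogate key
      domain_name          VARCHAR(100) NOT NULL,              -- e.g. 'Party Status'
      domain_description   VARCHAR(255)
  );

  -- All values for all domains, each linked back to its domain
  CREATE TABLE reference_value (
      reference_value_id   INTEGER      NOT NULL PRIMARY KEY,  -- surrogate key
      reference_domain_id  INTEGER      NOT NULL
          REFERENCES reference_domain (reference_domain_id),
      value_code           VARCHAR(20)  NOT NULL,              -- code as received from the source
      value_description    VARCHAR(255) NOT NULL
  );

  -- Associative: many-to-many relationships (hierarchies and others) between values
  CREATE TABLE reference_value_relationship (
      parent_reference_value_id  INTEGER NOT NULL REFERENCES reference_value (reference_value_id),
      child_reference_value_id   INTEGER NOT NULL REFERENCES reference_value (reference_value_id),
      relationship_type_id       INTEGER NOT NULL,  -- itself a reference value
      effective_date             DATE    NOT NULL,
      end_date                   DATE,
      PRIMARY KEY (parent_reference_value_id, child_reference_value_id,
                   relationship_type_id, effective_date)
  );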

Maintaining History

The history of values for a given scheme in relationship to other core concepts is to be retained within relevant associative structures.

For example, the current value of Party Status Id, together with its effective date, Party Status Change Date, is held on the Party entity. Historical values are held on the Party to Domain Relationship entity, with the Domain Scheme Id holding the identifier for the Party Status domain, and the Domain Value Id holding the identifier for the previous value.

To illustrate the example more fully, consider the following:

The Domain Scheme and Domain Value tables contain rows for the Party Status domain scheme.

Day 1

On day one, the party enters the system with a party status of “Active”. There are no entries in the Party to Domain Relationship table.

Day 2

On day two, the party status changes to “Closed”. The current value in Party is overwritten with the new value; the change date is updated; and the old value is inserted into the associative.
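
In SQL terms, the Day 2 processing might look like the sketch below. The identifiers (party 42, scheme 123, value 456) and the end date column on the associative are invented for illustration; the table and column names follow the example above.

  -- Day 2: the party's status changes from 'Active' to 'Closed'.
  -- Step 1: preserve the outgoing value in the associative table.
  INSERT INTO party_to_domain_relationship
      (party_id, domain_scheme_id, domain_value_id, effective_date, end_date)
  SELECT p.party_id,
         123,                         -- identifier of the Party Status domain scheme
         p.party_status_id,           -- the old value ('Active')
         p.party_status_change_date,  -- when the old value took effect
         CURRENT_DATE                 -- when it ceased to apply
  FROM   party p
  WHERE  p.party_id = 42;

  -- Step 2: overwrite the current value and its change date on Party.
  UPDATE party
  SET    party_status_id          = 456,  -- identifier of the 'Closed' value
         party_status_change_date = CURRENT_DATE
  WHERE  party_id = 42;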

Design Benefits

The following points outline the benefits of employing a consolidated set of domain schemes and values within a small set of tables, versus creating a separate table for each domain scheme’s set of values.

  1. Data Integrity: The integrity of the allowable set of values within a given domain is maintained through the application logic defined and implemented through the ETL. Maintaining a master list reduces the chance of inconsistencies or the appearance of anomalies.
  2. Flexibility: There is only one table to access for all domain value codes and descriptions within the SOR.
  3. Implementation Efficiency: No additional database entities or tables need to be created, tested or maintained within the System of Record (SOR) logical or physical model. Fewer objects means less chance of error.
  4. Operational Efficiency: A single utility can be created to interface with this table. It is true that even with multiple tables a single utility could be created with minimal logic. However, each new object would require some changes to the application, whereas the consolidated version can be extended seamlessly.
  5. Consistency: History is stored in the classification associative tables (e.g., Party to Domain Relationship). Retaining the domain schemes and values under surrogate keys in the Domain Value entity facilitates these structures. This allows history to be gathered using a common mechanism that explicitly identifies both the scheme and the value.

In part three, we will go on to look at collecting and documenting classifications.

In Defense of Surrogate Keys

Employing surrogate keys is an essential strategy to protect the integrity of data assets in the EDW. For those looking to defend such a position against those who would prefer human-readable business keys (a.k.a. natural keys), this article lays out some arguments. I hope by the end the non-believers will be converted, and those who already know the value of surrogates can move ahead without opposition.

Michelle A. Poolet has provided four criteria to determine whether a surrogate key is required:

1. Is the primary key unique?

Where a natural key is not unique, such as a person’s full name, a surrogate key must be used. However, a unique composite alternate key will be needed to match records for updates, attach attribution or establish relationships.

2. Does it apply to all rows?

Take an example such as Customer. Customers may all come from the same source and be loaded into a Party table. However, parties that play different roles and are loaded from different source systems may have natural keys of different formats (e.g., company business number, vendor identifier).

3. Is it minimal?

Composite keys and long descriptive fields (e.g., Postal Address) are obvious candidates for transformation to surrogate keys.

4. Is it stable over time?

Identifiers generated by a Master Data Management (MDM) solution are designed not to alter over time. However, if the MDM solution were to change technology, or some sources were to be passed into the EDW through a different channel, the MDM key would be problematic. It is preferable for surrogate keys to be generated internally within the Information Warehousing layer.
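
Taken together, these criteria tend to lead to a pattern like the following sketch, in which the surrogate is generated inside the warehouse and the natural key is preserved as a unique alternate key for matching (names and the identity syntax are illustrative and platform-dependent):

  CREATE TABLE party (
      party_id          INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
      source_system_cd  VARCHAR(10) NOT NULL,   -- which source supplied the record
      source_party_key  VARCHAR(50) NOT NULL,   -- natural/business key as received
      party_name        VARCHAR(200),
      CONSTRAINT party_ak UNIQUE (source_system_cd, source_party_key)  -- alternate key for matching updates
  );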

Rationale for preference of Surrogate Keys over Natural Keys: 

  1. Primary and Foreign Key Size
    • Natural key: Natural keys and their indexes can be large, depending on the format and components involved in making a record unique.
    • Surrogate key: Surrogate keys are usually a single column of type integer, making them as small as possible.
  2. Optionality and Applicability
    • Natural key: The population of natural keys may not be enforced in the source data.
    • Surrogate key: Surrogate keys can be attached to any record.
  3. Uniqueness
    • Natural key: Natural keys always have the possibility of contention. Even where a single source has protected against it, contention may exist across sources.
    • Surrogate key: Surrogate values are guaranteed to be unique.
  4. Privacy
    • Natural key: Natural keys may be prone to privacy or security concerns.
    • Surrogate key: Surrogate keys are cryptic values with no intrinsic meaning.
  5. Cascading Updates
    • Natural key: If the business or source changes the format or datatype of the key, it will need to be updated everywhere it is used.
    • Surrogate key: Surrogate values never change.
  6. Load Efficiencies
    • Natural key: With the exception of very short text values, VARCHAR values are less efficient to load than INTEGER.
    • Surrogate key: Surrogate values can be loaded more efficiently.
  7. Join Efficiencies
    • Natural key: With the exception of very short text values, joins using VARCHAR values are less efficient than INTEGER.
    • Surrogate key: Surrogate values can be joined more efficiently.
  8. Common Key Strategy
    • Natural key: A lack of a common key strategy will lead to a longer decision-making process and more patterns for code development.
    • Surrogate key: A common key strategy based on the use of surrogates will streamline design and development.
  9. Relationships
    • Natural key: Natural keys with different data types will lead to the creation of more objects in the database to accommodate the associations between entities.
    • Surrogate key: The use of surrogates promotes reuse of associative entities for multiple types of relationships.
  10. History
    • Natural key: Related to the points above, the use of natural keys inhibits the flexibility and reuse of objects and patterns to support change capture.
    • Surrogate key: Surrogate keys enable a strategic design to store history using associatives.

The list above is drawn in part from an article by Lee Richardson, discussing the pros and cons of surrogate keys.

In summary, the reasons to employ surrogate keys as a design policy relate to two factors:

1. Performance

In the majority of cases the surrogate key will perform more efficiently for both loading and joining tables.

2. Productivity

Even in the minority of cases where a natural key will perform as well as a surrogate, adopting a consistent surrogate key policy yields efficiencies in design, development and maintenance.

EDW Reference Architecture: Information Warehousing Layer

Information Warehousing

This is the fourth article in a series introduced in EDW Reference Architecture: Why Bother?. Other articles have looked at the Data Acquisition and Data Integration layers.

The Information Warehousing layer is designed as a normalized repository for all the integrated and mastered data. The primary purpose of this area is to organize the data at an atomic level in a state that is cleansed, mastered and well-defined. The discipline of normalization imposes a strict order on the data that promotes data integrity and retains a high degree of flexibility. The relatively complex structures are not conducive to ease of navigation, and are tuned for load performance rather than data access.

System of Record

Purpose: Storage of Single Trusted Source

As the name suggests, the system of record is designed to provide a single trusted source for all downstream consumption. Its focus is on flexibility, storage and load performance.

Content: Integrated / Atomic

This area contains the lowest level of detail available, with minimal redundancy; this includes avoiding the repetition of information in derived or computed values. The intention is to retain here all component fields that would be used to derive other values. Ideally, each piece of information exists only once within the system of record, although some controlled level of redundancy may be included for convenience (e.g., the primary identifier for an Involved Party included on both the fundamental Involved Party table and the Involved Party Identifier attributive table).

Structure: Normalized

The principles of normalization isolate each distinct group of information, minimizing repetition and ensuring that what is stored together belongs logically together. The modelling directive is to model according to the “essence” of the entity rather than its “use”. Practically, this means that tables contain columns closely related to a single fundamental concept. The business function of a given piece of data is revealed through its relationship to other entities, including role-based entities. Through the use of relationship and attributive tables, new data elements can be added to the System of Record without requiring additional structures.
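
As a hedged illustration of this pattern, a generic attributive table lets a new identifier type from a new source arrive as new rows rather than new columns (names are illustrative; the controlled redundancy of the primary identifier mirrors the example above):

  CREATE TABLE involved_party (
      involved_party_id   INTEGER NOT NULL PRIMARY KEY,  -- surrogate key
      primary_identifier  VARCHAR(50)                    -- controlled redundancy, as noted above
  );

  -- Attributive table: one row per identifier per party; a new identifier
  -- type requires new rows and a new reference value, not a new structure.
  CREATE TABLE involved_party_identifier (
      involved_party_id   INTEGER     NOT NULL REFERENCES involved_party (involved_party_id),
      identifier_type_id  INTEGER     NOT NULL,  -- reference value, e.g. 'Business Number'
      identifier_value    VARCHAR(50) NOT NULL,
      effective_date      DATE        NOT NULL,
      PRIMARY KEY (involved_party_id, identifier_type_id, effective_date)
  );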

Cleansing: Mastered and Cleansed

The system of record is the target of the data that has passed through the Data Integration Layer. No data should be loaded into the System of Record that has not passed through the process of being mastered and cleansed. It is critical to user confidence that this area remain closely governed and that its integrity not be compromised by co-locating external data without applying the discipline of the acquisition and integration layers.

Retention: Medium-term / Selective / Regulation-determined

The normalized structures permit the storage of all data elements. Most data elements can have history applied at the attribute level, recording the change in a single attribute without needing to repeat the other values contained on the same row. This means that history can be applied more selectively, with some fields being overwritten without storing the original value. For long-term storage, beyond seven years, data will be moved to a separate archival storage platform. The retention period may be driven by regulatory requirements, so the data remains accessible for auditing; although regulatory needs may be met through archival storage.

Consumption: Limited

As has been stated, the primary function of the system of record is to remain optimized for storage and open to receiving new data sources. However, access to the system of record should not be barred to all users. The normalized structures offer a flexible path through the data, with multiple perspectives retained. The selective history approach provides greater visibility to changes in data values than denormalized tables. These structural advantages should be exploited.

It is important that this access not be allowed to compromise the integrity of the system. Design decisions should not be influenced by considerations of query performance. Should it become clear that usage of the normalized structures for data analysis is of sufficient value to users, consideration should be given to creating a Deep Analytics Sector.

Interactive Sector

Purpose: User-maintained Information

The Interactive Sector provides an area in which users can create and maintain authoritative data, directly feeding the System of Record. The content of this sector can be viewed as an extension of the System of Record as well as a source to enhance the Information Provisioning Layer, either through joins or through ETL migration to tables within that layer.

There are many circumstances in which users need to be able to set up and control sets of values, such as banding (e.g., age demographic ranges) or domains of values for lookup tables that may be relatively dynamic. In many organizations, these may have been managed through spreadsheets or local databases; a mature solution will see them implemented through a centralized and technologically advanced approach.

Content: Integrated / Atomic and Derived

Although the data here is user-generated, it resides within the Information Warehousing layer, and is designed and quality-controlled in such a way as to be integrated. This means that the applications created to insert and update data in its tables will enforce referential integrity against data within the System of Record, and make use of its domain schemes and values. The content is to be primarily atomic, although some data may be derived. The intent is to enable and govern user-maintained data, rather than constrain it.
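
As a hypothetical sketch of such a user-maintained structure, an age-band table might reference the System of Record’s consolidated reference values so that referential integrity is preserved (all names are invented for illustration):

  -- User-maintained banding table in the Interactive Sector
  CREATE TABLE user_age_band (
      age_band_id       INTEGER     NOT NULL PRIMARY KEY,
      band_name         VARCHAR(50) NOT NULL,   -- e.g. '25-34'
      lower_age         SMALLINT    NOT NULL,
      upper_age         SMALLINT,               -- NULL for an open-ended top band
      segment_value_id  INTEGER     REFERENCES reference_value (reference_value_id),
                                                -- ties the band to an SoR domain value
      maintained_by     VARCHAR(50) NOT NULL,   -- accountability for user-entered data
      last_updated      TIMESTAMP   NOT NULL
  );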

Structure: Normalized / Denormalized

The data structures of the Interactive Sector will mirror the System of Record wherever possible. In cases where user data will be linked directly to reporting tables, a denormalized design may be adopted. The goal is to be agile and accommodating without compromising the reliability of the information being provisioned. The guiding principle should be that the users are creating the “golden copy” of the data they are generating, and therefore it should conform as much as possible to the design of the System of Record.

Cleansing: Cleansed

It is essential that the application underlying the user interface enforce integrity controls on all data being entered and maintained. The Interactive Sector is being defined as sitting within the Information Warehousing layer, a zone in which all data must be scrubbed clean, rationalized and consolidated into a single trusted source. If the user interface lacks such controls, it should be treated as a standard source and be passed through the Staging area and subject to the processes of analysis, profiling and ETL transformations.

Retention: Medium-term

As per the System of Record, this user-maintained sector tracks the history of changes as required, and stores it for one to seven years, as appropriate. In DW2.0, Mr. Inmon suggests that the interactive sector store data for the short term only, and that it be retained for a longer time in the Integrated Sector (System of Record). This might make sense for your organization, but the industry is moving towards migrating data as little as possible, and it may be more efficient to hold onto this data and access it where it is generated.

Consumption: Open

The Interactive Sector data is intended to be combined with the Information Provisioning layer for business intelligence and extracts for downstream use. The data may be subject to security restrictions, but is otherwise open to all consumers from the Information Delivery layer.

I welcome your comments.

The next article will look at the Information Provisioning layer.

EDW Reference Architecture: Why Bother?

In the following series of articles I would like to define a possible reference architecture for an enterprise data warehouse. This represents an amalgam of real and imagined client sites, addressing all-too-real issues I’ve encountered. It is presented as a straw man: for discussion, for dressing up to fit the needs of your organization, or for burning on the bonfire of rejected ideas. I hope that if the last course is chosen, it at least provides fuel for better ideas – and that you’ll share them with me.

In preparing a reference architecture the immediate question arises: why bother? As an artefact it can appear arcane, ivory-tower, and unnecessary. Certainly, in many cases, it is presented in a way that has little meaning to the consumers of business intelligence, or even the practitioners involved in the EDW design and development. So how should it be presented and who is it for?

The architecture provides a reference point for all the stakeholders of the EDW. If it is to have meaning and value, it must directly address specific issues, resolving them and fitting into a larger corporate landscape.

For sponsors of the EDW, it must represent the vision of the system, which will presumably be along the lines of “The EDW will provide a single source of the truth throughout the organization, in a timely, reliable and accessible fashion”. The totality of the vision should be encompassed by, and evident in, the description of the reference architecture. What is the purpose of each EDW component? How does this meet what is required of the system?

For managers of the EDW, it draws a picture of the scope of work that will be required. It will not define the processes, but the definition of those processes will be dependent on understanding the end-to-end architecture. What demands does each of the EDW areas make on resources, in terms of data migration, design and retention?

For those involved in its design, it sets out the boundaries of the components within the EDW, defining their structure and relationship to one another. What types of database objects inhabit each component? What processing is necessary as those objects are populated?

For business, the reference architecture should present a clear and concise picture of how raw data will be transformed into information. It shows where information will be delivered for consumption, and how the processes along the way will contribute to its quality, reliability and ease of access. The message must be: your information is in good hands - you’ll have what you need when you need it, and it will be correct.

You can judge for yourself whether the way I’ve laid out the reference architecture meets the conditions set out for the various audiences. The next four articles will define each of the layers of the EDW. There are two components, or sectors, within each layer, and the following aspects of each component will be described:

  • Purpose – The general function of each component within the context of the EDW.
  • Content – The nature of the data within the component.
  • Structure – The type of data objects that will hold the sector’s data; normalized, denormalized, inherited from the source or requiring data design.
  • Transformations – The level of processing the data will undergo as it populates the sector.
  • Retention – How long the data will likely reside in the sector.
  • Consumers – Who will have access to the data and for what purpose.

As I suggested, this is offered in the spirit of discussion, and I welcome your comments and feedback from your own experiences.

The next article in the series is on the Data Acquisition layer.

Picking the Team, Part III: ETL Designer

In this series of articles, we’re looking at the top qualities required for candidates to fill three key positions in the data warehousing team. In part one we looked at the analyst, in part two, the role of data designer. Here we’ll explore the five top skills to look for in the ETL designer / developer. In larger operations, these will be two distinct roles, with perhaps multiple resources. There may be a designer, with a lead developer and a team of ETL developers. For the purposes of this article, we’ll roll ETL into one position: with ETL design and development taken together.

ETL Design/Development

The process of Extract Transform Load (ETL) involves the migration of data from one system to another. In the case of the data warehouse, this occurs in multiple stages: source to staging, staging to integration layer, and integration layer to data mart.

We’ll take the base technical skills as a given. The ETL resource must be proficient in the toolset being applied (DataStage, Informatica, Ab Initio, etc.). While the functionality is essentially the same across the board, the features of each are quite distinct. Beyond this fundamental competency, look for the candidate’s expertise in the following areas:

  1. Experience with Delta Loads to Staging
  2. Experience with Normalized Structures (Integration Layer)
    • Change Capture / History
    • Reference Tables (consolidated vs separate)
    • Surrogate Keys
  3. Experience with Denormalized Structures (Provisioning Layer)
    • Slowly Changing Dimensions
    • Bridging and Relationship Tables
  4. Approach to Performance Tuning
  5. Approach to Rapid Development

It’s important to find someone with knowledge of the end-to-end process. They may have been involved in one aspect of the process in the past, or been responsible for maintenance of existing jobs. This will not be sufficient. The ETL needs to be designed with an understanding of the specific challenges involved at each stage, and the designer needs a corresponding toolkit of solutions to address them.

The staging area is loaded from the source systems, and requires strategies to extract only the latest data. The ETL candidate needs to be familiar with different strategies to manage this, such as source system logs, datestamps on source tables, and set comparison operations within staging.
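
For instance, assuming the source carries a reliable last-update timestamp, a delta extract can filter on a high-water mark; where no such column exists, a set comparison against the previous staging snapshot can identify the changes (a sketch only; table names and the bind variable are illustrative):

  -- Strategy 1: timestamp-based delta, assuming a reliable last_update_ts on the source
  SELECT *
  FROM   src_customer
  WHERE  last_update_ts > :previous_extract_ts;  -- high-water mark recorded after the last run

  -- Strategy 2: set comparison within staging, when no reliable timestamp exists
  SELECT * FROM stg_customer_today
  EXCEPT
  SELECT * FROM stg_customer_yesterday;          -- returns new or changed rows only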

It’s quite common for an ETL resource to have experience with loading data marts but no knowledge of normalization. This can colour their thinking regarding the population of the Integration Layer, and present challenges when using it as a source for the Provisioning Layer. You should specifically ensure that the candidate is aware of the different ways change capture is applied in the normalized environment, in particular the use of associative tables; that they have experience with consolidated reference tables; and that they understand that surrogate keys in the Integration Layer are static.

Dimensions and fact tables are standard structures in the industry and every ETL designer should be well-versed in their use. It would be a good idea to check that they know the three types of slowly changing dimension (a sketch follows the list):

  1. Update the existing value with the new value
  2. Create a new row for the new value
  3. Design multiple columns for new and previous value
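
In SQL terms, the three approaches might be sketched as follows (dimension and column names are invented; a type 2 dimension commonly also carries effective/expiry dates and a current-row flag, as assumed here):

  -- Type 1: overwrite the existing value in place
  UPDATE dim_customer
  SET    city = 'Toronto'
  WHERE  customer_id = 'CUST-42';

  -- Type 2: expire the current row and insert a new version
  UPDATE dim_customer
  SET    current_flag = 'N',
         expiry_date  = CURRENT_DATE
  WHERE  customer_id  = 'CUST-42'
  AND    current_flag = 'Y';

  INSERT INTO dim_customer (customer_id, city, current_flag, effective_date)
  VALUES ('CUST-42', 'Toronto', 'Y', CURRENT_DATE);  -- new surrogate key assigned; other attributes carried forward

  -- Type 3: hold the previous and current values in separate columns
  UPDATE dim_customer
  SET    previous_city = city,
         city          = 'Toronto'
  WHERE  customer_id = 'CUST-42';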

Another data mart structure that is not always applied is the bridging table, used for navigating unbalanced hierarchies, which requires a separate ETL pattern to process. The candidate should have experience with both bridging and relationship tables, which can handle many-to-many relationships.
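
A minimal sketch of a bridge for an unbalanced hierarchy stores every ancestor-descendant pair, so that facts roll up correctly to any node regardless of depth (names are illustrative):

  CREATE TABLE org_hierarchy_bridge (
      ancestor_org_key    INTEGER  NOT NULL,  -- any node in the hierarchy
      descendant_org_key  INTEGER  NOT NULL,  -- that node itself or any node beneath it
      depth_from_ancestor SMALLINT NOT NULL,  -- 0 for the self-referencing row
      PRIMARY KEY (ancestor_org_key, descendant_org_key)
  );

  -- Roll sales up to any chosen node, however deep its descendants sit
  SELECT b.ancestor_org_key,
         SUM(f.sales_amount) AS total_sales
  FROM   fact_sales f
  JOIN   org_hierarchy_bridge b
         ON b.descendant_org_key = f.org_key
  GROUP  BY b.ancestor_org_key;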

There is usually a window of time within which the nightly, weekly or monthly ETL processes must be completed. This puts a premium on the ability of the ETL designer to optimize the efficiency of the code. Make sure the candidate has practical experience with this and can list a number of strategies that they have employed. Don’t expect them to answer questions like “How long should it take to load a million rows?” There are too many variables for them to give a meaningful response. What is important is that they can articulate a number of avenues of approach.

The final point relates to the candidate’s overall attitude to development. Most IT shops want agility, while the data warehouse life cycle contains inherent dependencies. The ETL designer should always be looking for repeatable patterns, building on previous experience as well as the blocks of code developed through the project. Find out what strategies they employ to create processes quickly and accurately. ETL can be the most time-intensive part of the project. An efficient ETL designer can be key to the warehouse ROI.

Picking the Team, Part II: Data Designer

In part one of picking a team, we discussed the role of the analyst. Here, we continue to look at skills required for the development team in the role of data designer. In large organizations, these will be the separate positions of architect and modeller, while in smaller organizations, architecture and modelling may be handled by the same individual.

Data Designer

The role of the data designer is two-fold, reflecting the two data layers of the warehouse to be addressed: the normalized Integration Layer and the denormalized Provisioning Layer. Data design is performed in a data modelling tool. Knowledge of a modelling tool is assumed, for the purposes of this article; although it is important to ensure the candidate knows how to use the same tool being used on the project. The data designer is responsible for creating a logical and physical data model and needs to communicate how the sources map to the target model to the ETL designer.

For the Integration Layer

  1. Mastery of the principles of normalization, with an understanding of their wider design implications
  2. Understanding of Logical to Physical transformations
  3. Ability to make a distinction between a concept’s nature and its usage

Having a grasp of modelling to the third normal form and beyond (“the key, the whole key and nothing but the key, so help me Codd”) is a good and necessary grounding, but not all that is required of the data designer. This is not a one-off exercise. The data structures that are created must fit in with the existing warehouse (should one exist) and accommodate future additions. (One hopes that each project is only a piece of the whole; it certainly is discouraged to try to model the enterprise in one go!) Retaining the flexibility of the Integration Layer is essential. A well-designed structure should be able to accept new sources that, in many cases, will either find targets in existing database objects or require new tables rather than changes to existing tables. The other points listed above must also be taken into consideration by the designer. When finding this resource, it is important to ensure that the candidate has developed enough expertise to weigh a spectrum of choices against their effect on these aspects of design.

The transformations from logical to physical are included as a separate skillset because it is common in the market for data modellers to identify themselves as either “logical” or “physical” modellers. If you do face such a scenario, and the modelling tasks must be divided between two modellers, it is important that, as the logical model is transformed, it not lose the careful attention to the considerations listed in the first point. It’s no use if the logical model retains flexibility while the physical denormalization paints the model into a corner.

Normalization can be seen as a process of distinguishing a data element’s essential nature from the context in which it’s used. The ability to discern this from what can be “a bunch of tables” in the source system is invaluable. (For example, a condition included on a transaction could be seen as related to the account, but actually applies to the customer.) This is one of the areas that requires communication with the analyst and a shared understanding of the problem being examined.

For the Provisioning Layer

  1. Mastery of the principles of denormalization with an understanding of the implications on:
    • Slowly changing dimensions
    • Hierarchies
    • Relationships
    • Additive, semi-additive and non-additive metrics
  2. Knowledge of the applicability of views, including horizontal and vertical filtering
  3. Ability to translate reporting requirements into optimized data structures

On the face of it, dimensional modelling may seem a simpler discipline than normalization. However, it contains hidden complexities which are closely associated with the ability of users to navigate the structures intuitively and retrieve data on demand with great efficiency. This ground has been well-covered by Ralph Kimball and his associates, and any project should ensure that the modelling candidate is well-versed in Mr. Kimball’s teachings. I would go so far as to say that if the candidate doesn’t know Ralph Kimball’s name, you should look elsewhere.

Views are second on the list, because the designer needs to have some flexibility to cope with security concerns, the needs of specific lines of business, providing meaningful sets of data to given groups etc. The designer should have an extensive “bag of tricks” to apply to any situation so that they can reach beyond “Star Schemas” to provide whatever structure will best serve the situation.

This brings me to the third point, which is that the Provisioning Layer is all about optimizing data retrieval. To do this well, the person needs to be able to look at a report specification, be it a sample of the report output or a set of dimensional attributes and metrics, and see how the data needs to be structured so that users will be able to grasp what each table contains, how the tables relate to one another and how calculations can be performed quickly and accurately.

For both disciplines, I include the following two essential traits:

  1. Vision of the “Big Picture” for the data warehouse
  2. Systematic attention to detail

With the data warehouse being built project by project, it is critical that the design maintain a complete vision of where it is going. Tactical choices, particularly in the Integration Layer, are extremely difficult to adjust later to align with the strategic goals. In the Provisioning Layer, it is the maintenance of conformed dimensions and a system to guide the creation of fact tables that will minimize the number of objects created. In both cases, meeting the needs of the project while retaining an architectural vision is imperative.

The task of creating a data model is necessarily a detailed one, with many sub-tasks and a long checklist of validations to be applied. The best designers will systematically cross every datatype and dot every standardized name; and in the diagrams, entity relationship lines will never cross (ok, maybe sometimes)! Documentation must include a source to target mapping to communicate how the structures are to be loaded. Again, attention to detail is crucial, as is the clarity of the material being presented.

If you can run the development life cycle to produce the Integration and Provisioning Layers serially, these designs can be created by the same person. If time efficiencies are important, some parallelism is possible, and two resources will be required. In large organizations, it will be essential to have an architect overseeing multiple modellers designing for concurrent projects. In all cases, a full complement of these capabilities will be needed to create a solid foundation.

Next: ETL Designer

Picking the Team, Part I: Analyst

Finding the right team to deliver the data warehouse successfully can be challenging. With an unworkable methodology the best people in the world won’t be able to bring things together; and conversely, the best methodology will fall apart with a team that lacks the requisite skills. A sound methodology will support those with adequate skills; and work at optimal efficiency with resources that really know how to do their jobs. 

I’m fond of the notion that the most productive people are smart and lazy, while the least are stupid and hard-working. At the end of the day, the most effective teams will be made up of a mixture of people who are bright, engaged and co-operative, bringing a variety of strengths and perspectives. 

Over the next few postings, we’ll explore five things to look for in each of the three core competency roles of the data warehouse team: Analyst, Data Designer and ETL Designer.
 
Analyst:
 
The responsibilities of the analyst are to compile and document the catalogue of source data elements that have been scoped into the project requirements. This will involve reviewing source documentation and interviewing relevant data stewards. It is also part of the analysis phase to verify the availability and integrity of data through profiling and interrogation. The analyst then communicates the results to the data architect or modeller responsible for designing the target system. (For a more complete discussion, see Measure Twice Part I and Part II).
 
With this brief definition of the role of the analyst, these are the key skills they should possess:
  1. An understanding of the business the data represents AND/OR the ability to grasp the link between data elements and business concepts and their relationships to one another.
  2. Ability to ask probing questions and be ready to pose new questions based on the answers given.
  3. Ability to articulate the information gathered in both written and verbal form.
  4. Technical skills to profile and interrogate the data.
  5. Knowledge of data architecture to understand the structure of sources and potential impact on target data designs.

Obviously, knowledge of the business is an asset to any analysis; but far more important is the incisive curiosity to explore the nature of each data element as it relates to business concepts as well as other data elements. In many cases, knowledgeable analysts will encounter new areas of the business, or specific source data that is unknown to them. It is critical that they be able to gain an understanding of these new pieces of information. If they are slow to pick up new knowledge, this could present an impediment to the project. I would rather deal with an inquisitive analyst who comes with fresh eyes to the subject and is able to compile a comprehensive picture of the data. Better yet, someone who is knowledgeable and inquisitive.

In order to find answers, the analyst needs to be able to formulate questions, and recognize when an answer has led to new questions. It isn’t enough that he/she goes into the interview with a checklist of queries. The person needs to be engaging and attentive, with a manner that will draw people out and allow the analyst to ask for multiple interviews. Singing and dancing isn’t required, but this is not a role for an introvert.

While they need to be able to both talk and listen, it is also important that the analyst document findings and be able to articulate them to the data architect/modeller. The analyst is the conduit of information from the data steward, be that the business user or IT subject matter expert, to the person responsible for data design. The more comprehensive, coherent and relevant the analysis, the more efficient the overall development process will be.
 
It is hugely beneficial for the analyst to be able to run and interpret the data profiles. These are the automated standard reports that look at the fundamental state of the columns within a table (e.g., nulls, value lengths etc.). The findings through interviews and/or the investigation of source documentation must be confirmed through data profiling. In addition, the analyst should have the ability to interrogate the source data directly, crafting SQL queries or using some other query tool, to validate their understanding of the content and relationships. If these roles are separated out into technical and non-technical responsibilities, the gap can lead to mis-communication, time lags and errors.
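
Where a profiling tool is not available, even hand-written queries along these lines can confirm the basics (a sketch; names are placeholders and function names such as LENGTH vary by platform):

  -- Null rate, distinct values and value lengths for a single column
  SELECT COUNT(*)                           AS row_count,
         COUNT(*) - COUNT(customer_type_cd) AS null_count,
         COUNT(DISTINCT customer_type_cd)   AS distinct_values,
         MIN(LENGTH(customer_type_cd))      AS min_length,
         MAX(LENGTH(customer_type_cd))      AS max_length
  FROM   src_customer;

  -- Candidate-key check: does the supposed key actually identify a single row?
  SELECT customer_no, COUNT(*) AS occurrences
  FROM   src_customer
  GROUP  BY customer_no
  HAVING COUNT(*) > 1;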
 
It may seem strange to recommend knowledge of data design for an analyst, but it can help navigate the source data model, inform the line of questioning and assist communication with the target data designer. The analyst needs to be able to identify such aspects as the cardinality of relationships, the impact of redefined fields, the potential for composite and overloaded fields, issues with data retention and the ramifications of reference data. At minimum the analyst needs to be able to read a data model. It wouldn’t go amiss if they had some experience with designing data themselves.
 
While finding all of these qualities in a single analyst may prove difficult, it is essential that the group of analysts you engage on the project contain a full set of skills. As I mentioned, I wouldn’t make a distinction between technical and non-technical, but it is helpful to conduct the interview process in pairs, with one person asking the questions and the other keeping notes. The important thing to remember is that the analysis must be cohesive; the understanding of the sources needs to be gained, verified and communicated.  Breaking those steps across different parties is risky; unless they work very closely together and function as a unit.
 

Measure Twice: Analysis Part II

Continuing on from the previous posting on approaching analysis of source systems, here we will look at other aspects of the data sources to be explored by the analyst.

Relationships

For legacy mainframe systems that use copy books, ask about tables within the segment. The “occurs” clause in the copybook may contain the “table” keyword, indicating that the file contains what is essentially an inner table. The “occurs” will likely signal a one-to-many relationship between the subject of the segment and the set of multi-occurring fields. There could also be a many-to-many relationship that needs to be identified by examining the relationship from the other perspective (e.g., customers to accounts).
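
For example, if an account segment carried an “OCCURS 12 TIMES” group of monthly balances, the staging design might flatten it into a child table with one row per occurrence (a sketch; the copybook layout is hypothetical):

  -- Parent: one row per account segment
  CREATE TABLE stg_account (
      account_no  CHAR(10) NOT NULL PRIMARY KEY
      -- ...plus the singly-occurring fields of the segment
  );

  -- Child: one row per occurrence of the multi-occurring group
  CREATE TABLE stg_account_monthly_balance (
      account_no     CHAR(10)      NOT NULL REFERENCES stg_account (account_no),
      occurrence_no  SMALLINT      NOT NULL,  -- 1 to 12, position within the OCCURS group
      balance_amt    DECIMAL(15,2),
      PRIMARY KEY (account_no, occurrence_no)
  );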

For all relationships between tables, a number of questions need to be addressed:

  • Does this relationship change over time? If so, how frequently?
  • Is there enough information to uniquely identify the object of the relationship?
  • Is there a set of attributes dependent on the relationship? (e.g., marketing indicators flagging the preferred mailing address related to a Customer Location)

As a data design consideration, relationships may lead to associative tables, attributive tables or foreign keys in the normalized Integration Layer of the data warehouse, and may require denormalization or relationship tables in the Data Mart. Having a clear understanding of relationships is essential to making informed modelling decisions.

Reference Data

Reference data is anything that involves sets of codes with corresponding descriptions. These are sometimes referred to as classifications, and are frequently found in subject-specific lookup tables containing distinct domains of values. The following are points to consider when encountering these types of data in tables, or when suspecting they may be present.

  • Note fields that use a list of values
  • Ask about fields that might be codes. These may be fields with a small number of characters or digits (1-6).
  • Ask if there is a related reference table or known domain of values.
  • How frequently do these values (either codes or descriptions) change over time, if at all? It is important to know if the codes are ever recycled. I encountered a situation where a code that had meant a sale one month was changed to mean a return the next. The change in meaning hadn’t been communicated to IT.
  • Confirm that any documented values are accurate in the source data; numerous times, profiling has revealed an unexpected domain of values, at odds with the documentation. (A verification query is sketched after this list.)
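
A hedged version of that verification is a simple frequency query whose results can be compared against the documented list (names are placeholders):

  -- Actual domain of values in the source, with frequencies
  SELECT status_cd,
         COUNT(*) AS occurrences
  FROM   src_account
  GROUP  BY status_cd
  ORDER  BY occurrences DESC;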

Redefines

Again, for legacy mainframe sources, ask how redefines are to be used. These are columns marked with the REDEFINES keyword in the copy book, indicating that the field will have different meanings or even different formats depending on the context. For any of the following, we need to know how to handle the redefined field (e.g., the redefine’s meaning is dependent on another field’s value, such as individual vs. company, or on the appearance of an alpha character within a set of bytes).

Redefines could be:

  1. A different data type (e.g., TELEPHON vs. TELEPHONX = numeric vs. alpha)
  2. Because the row is a different type, the bytes are redefined for a completely different purpose
  3. The field has a different purpose depending on the value of another field in the row.

The redefine is specific to copy books, but a similar concept may exist in relational tables. When this type of situation does appear in the source, the Staging area should reflect multiple fields, to accommodate each of the different types of data held within the same source field.
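
A hypothetical example of how the staging area might split a redefined field: one column per interpretation, populated according to the discriminator, mirroring the TELEPHON vs. TELEPHONX case above (all names are invented):

  CREATE TABLE stg_party_contact (
      party_no        CHAR(10) NOT NULL,
      party_type_cd   CHAR(1)  NOT NULL,  -- discriminator: e.g. 'I' individual, 'C' company
      telephone_num   DECIMAL(10,0),      -- populated when the redefined bytes are numeric
      telephone_text  VARCHAR(15)         -- populated when they are alphanumeric
  );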

Amounts

For numeric fields, particular attention needs to be paid to fully defining their nature.

If it is a quantity, ask what time period it represents, such as:

  1. A snapshot of that day
  2. A cumulative amount from the beginning of a period. That period can be either:
    • Calendar
    • Fiscal

For all numeric fields, consider the following:

  1. What type of number?
    • Rate
    • Balance
    • Average
  2. Calculations: is this derived from atomic components? What are they? There may be additional sources to hunt down.
  3. Rates: what is the precision to be applied?
  4. Currency amounts:
    • Precision
    • Scaling: Is the number in units, thousands or millions?
    • Currency

Consider if the number is a metric or an attribute. For example, the number of dependants of an individual may not be stored as a metric, to be used in calculations, but as a descriptive attribute to be applied to demographic groupings.

History

These points have been made in the context of specific columns and relationships. I make them again here, to stress the importance of probing for an understanding of both the availability of historical information and the requirement for storing history moving forward, in the data warehouse. The requirement to store history can have a profound impact on each layer of the data warehouse – both in terms of capacity planning and fundamental design decisions.

  • How frequently does the value change over time?
  • How much history currently exists?
  • How often do these relationships change over time?
  • Even if the value changes, is it necessary to store history? (e.g., change in Date of Birth)

This list has been accumulated over many engagements, although I’m sure it’s not exhaustive. There always seems to be some special hidden surprise within each source system. One thing is certain, however: failure to closely examine each data element being fed into the warehouse will lead to a gotcha! moment – and, invariably, costly rework.

Any additions that you would like to add to the list would be most welcome.
