Data Design Principles

Obey the principles without being bound by them.

- Bruce Lee

Taking a practical approach to developing a well-formed enterprise data warehouse – and by that, I mean one that is accurate, efficient and productive – involves basing it on sound design principles. These principles are specific to each sector of the reference architecture, each of which enables specific capabilities and serves specific functions. Here, I would like to lay out the principles of the Information Warehousing layer’s normalized central repository – the system of record.

The Information Warehousing layer is designed as a normalized repository for the data that has been processed “upstream”. It arrives cleansed, transformed and mastered, and is consolidated here into a single “System of Record”. The discipline of normalization restructures the data, freeing it from the confines of the source system’s single perspective and opening it to the multiple perspectives held across the enterprise. The data is modelled according to its “essence” rather than its “use”.

This process of redesign imposes a strict order on the data that promotes data integrity and retains a high degree of flexibility. The way the data is broken apart into separate tables makes it challenging to query, but querying is not its main purpose. Data integrity and flexibility are the primary goals, and tuning is geared towards load performance rather than data access.

  1. Data Integrity
    Take a proactive stance to protect referential integrity and reduce instances of redundancy or potential for inconsistency.
  2. Scalability
    Allow for increases in volumes or additional sources of existing information, both within subject areas (e.g., issuers, counterparties) and core concepts (e.g., parties, locations).
  3. Flexibility
    Allow for additional sources or changes in existing sources, so that the design is not tied to, or a mirror of, any given source. Design will give primary consideration to reuse, then extension and finally modification of existing structures.
  4. Consistency
    Apply standard patterns for data design to promote efficiencies of data and ETL design. The decision-making process will be expedited as will data modelling work and ETL development.
  5. Efficiency
    Focus efficiency on three aspects:

    1. Implementation
      Use of repeatable patterns for data design will minimize data modelling and ETL work effort.
    2. Operation
      Ease ongoing maintenance by keeping the number of data objects to a minimum; maintain consistent standards; and apply logical structures for ease of navigation and use.
    3. Load Performance
      Priority is given to the performance of ETL load processes, including those that use the System of Record as a source to load the Data Mart sector.
  6. Enterprise Perspective
    For all data objects, retain, and remain open to, a full range of existing and potential relationships between entities to ensure that the data reflects an enterprise perspective that is not limited to the perspective of any given project’s requirements.

These foundational principles are implemented through strategies that impact storage of history, hierarchical structures, degree of normalization, classifications, surrogate keys and a number of other aspects of design. The principles form the criteria to judge the best approach to take in a given situation. It’s not always straightforward, even with the principles in place – at times one has to favour one principle over another (e.g., flexibility over load performance), but this list provides the guidance to frame the debate and take a considered approach.

As suggested by the opening quote, I’m not advocating a blind conformance to a set of rules; but, in my experience, one of the greatest obstacles to efficient and effective development is the decision-making process. Limiting the parameters of debate with intelligent guidelines can facilitate decisions being made quickly and correctly.

Feel free to suggest additions or amendments to this list of design principles for the System of Record.

In Defense of Surrogate Keys

Employing surrogate keys is an essential strategy to protect the integrity of data assets in the EDW. For those looking to defend such a position against those who would prefer human-readable business keys (a.k.a. natural keys), this article lays out some arguments. I hope by the end the non-believers will be converted, and those who already know the value of surrogates can move ahead without opposition.

Michelle A. Poolet has provided four criteria to determine whether a surrogate key is required:

1. Is the primary key unique?

Where a natural key is not unique, such as a person’s full name, a surrogate key must be used. However, a unique composite alternate key will still be needed to match records for updates, to attach attribution or to establish relationships.

2. Does it apply to all rows?

Take an example such as Customer. Customers may all come from the same source and be loaded into a Party table. However, parties that play different roles and that are loaded from different source systems may have natural keys of different formats (e.g., company business number, vendor identifier).

3. Is it minimal?

Composite keys and long descriptive fields (e.g., Postal Address) are obvious candidates for transformation to surrogate keys.

4. Is it stable over time?

Identifiers generated by a Master Data Management (MDM) solution are designed not to alter over time. However, if the MDM solution were to change technology, or some sources were to be passed into the EDW through a different channel, the MDM key would be problematic. It is preferable for surrogate keys to be generated internally within the Information Warehousing layer.
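
To make the alternate-key point concrete, here is a minimal DDL sketch (table and column names are illustrative, not drawn from any particular model): the surrogate serves as the primary key, while the natural key is retained as a unique alternate key for matching records during loads.

    CREATE TABLE party (
        party_sk        INTEGER      NOT NULL,  -- surrogate key, generated internally
        source_system   VARCHAR(10)  NOT NULL,  -- component of the natural (business) key
        source_party_id VARCHAR(30)  NOT NULL,  -- component of the natural (business) key
        full_name       VARCHAR(100),
        CONSTRAINT pk_party PRIMARY KEY (party_sk),
        CONSTRAINT ak_party UNIQUE (source_system, source_party_id)  -- alternate key used to match records for updates
    );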

Rationale for preference of Surrogate Keys over Natural Keys: 

  1. Primary and Foreign Key Size
    • Natural key: Natural keys and their indexes can be large, depending on the format and components involved in making a record unique.
    • Surrogate key: Surrogate keys are usually a single column of type integer, making them as small as possible.
  2. Optionality and Applicability
    • Natural key: The population of natural keys may not be enforced in the source data.
    • Surrogate key: Surrogate keys can be attached to any record.
  3. Uniqueness
    • Natural key: Natural keys always have the possibility of contention. Even where a single source has protected against it, contention may exist across sources.
    • Surrogate key: Surrogate values are guaranteed to be unique.
  4. Privacy
    • Natural key: Natural keys may be prone to privacy or security concerns.
    • Surrogate key: Surrogate keys are cryptic values with no intrinsic meaning.
  5. Cascading Updates
    • Natural key: If the business or source changes the format or datatype of the key, it will need to be updated everywhere it is used.
    • Surrogate key: Surrogate values never change.
  6. Load Efficiencies
    • Natural key: With the exception of very short text values, VARCHAR values are less efficient to load than INTEGER values.
    • Surrogate key: Surrogate values can be loaded more efficiently.
  7. Join Efficiencies
    • Natural key: With the exception of very short text values, joins using VARCHAR values are less efficient than joins using INTEGER values.
    • Surrogate key: Surrogate values can be joined more efficiently.
  8. Common Key Strategy
    • Natural key: The lack of a common key strategy will lead to a longer decision-making process and more patterns for code development.
    • Surrogate key: A common key strategy based on the use of surrogates will streamline design and development.
  9. Relationships
    • Natural key: Natural keys with different data types will lead to the creation of more objects in the database to accommodate the associations between entities.
    • Surrogate key: The use of surrogates promotes reuse of associative entities for multiple types of relationships.
  10. History
    • Natural key: Related to the points above, the use of natural keys inhibits the flexibility and reuse of objects and patterns to support change capture.
    • Surrogate key: Surrogate keys enable a strategic design to store history using associative entities.

The list above is drawn in part from an article by Lee Richardson, discussing the pros and cons of surrogate keys.

In summary, the reasons to employ surrogate keys as a design policy come down to two factors:

1. Performance

In the majority of cases the surrogate key will perform more efficiently for both loading and joining tables.

2. Productivity

Even in the minority of cases where a natural key will perform as well as a surrogate, adopting a consistent surrogate key policy yields efficiencies in design, development and maintenance.

EDW Reference Architecture: Information Provisioning Layer

Information Provisioning Layer

This is the fifth in a series introduced in EDW Reference Architecture: Why Bother?. Others in the series have looked at the Data Acquisition, Data Integration, and Information Warehousing layers.

The Information Provisioning Layer is designed for ease of navigation and optimized for retrieval performance for business intelligence and other consumers.

Data Mart

Purpose: Meeting Specific Needs

This sector is intended as a platform from which to meet specific information requirements. Where possible, the objects will be designed for re-use. However, the guiding principle is the effective servicing of business needs, and to this end, the solution’s primary focus is on the given project’s requirements.

Content: Golden Copies Plus

The primary source of content in this sector is the Information Warehousing layer: trusted and atomic data that can be used to populate the tables in the Data Marts as well as to generate aggregated and computed measures and derived dimensional attributes. In addition, it may be expedient to combine this information with data from other sources that have not passed through the process of integration into the EDW. This should be permitted on an exception basis, with governance controls and a clear demarcation of the data’s lineage.

Structure: Denormalized

A common approach to data architecture in Data Marts is the “Star Schema”. This lends itself to ease of use and efficient performance. It also supports a high degree of re-use of objects, with “conformed” dimensions, to borrow a term from Mr. Ralph Kimball, being shared by a variety of users. Applying views on top of the tables provides a mechanism for security and a way to filter rows and columns for specific use-cases. This sector is not limited to the “Star Schema” design, and should employ structures that are “fit for purpose” wherever greater efficiencies might be attained. This includes the option of applying views over tables from the Information Warehousing layer, in cases where performance is not a concern or the information resides within the Interactive Sector.
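
As a minimal sketch of the view mechanism described above (the schema, table and column names are purely illustrative), a view can apply both vertical filtering, by exposing only selected columns, and horizontal filtering, by restricting rows to a given use-case:

    -- Expose only the columns a given line of business needs (vertical filtering)
    -- and only its own rows (horizontal filtering).
    CREATE VIEW v_customer_dim_retail AS
    SELECT customer_sk,
           customer_name,
           customer_segment
    FROM   customer_dim
    WHERE  line_of_business_cd = 'RETAIL';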

Cleansing: Cleansed

The Data Mart draws information primarily from a single trusted source; the ETL processes loading the Data Mart do not cleanse the data, although they will perform transformations to create derived values and aggregations. External sources being combined with Data Mart tables should not require cleansing. If they do, they should pass through a process of analysis and profiling to identify and remediate any anomalies. If the data is atomic, consideration should be given to making it a standard source input, subject to the full process of integration through the Information Warehousing layer.

Retention: Medium-term / Business-determined

The length of the retention period will depend on the business need rather than any regulatory requirement. If only current views of dimensions are required, the dimensions may be type one (i.e., changes are applied as updates to existing rows).

Consumption: Open

The information held in the Data Marts is intended for consumption, subject to privacy concerns and security constraints. Information will be accessed through business intelligence applications and other downstream systems. Appropriate levels of governance will be enforced for all information retrieval.

I welcome your comments.

As a final note on EDW Reference Architecture, the next article will discuss the Metadata Repository.

EDW Reference Architecture: Information Warehousing Layer

Information Warehousing

This is the fourth article in a series introduced in EDW Reference Architecture: Why Bother?. Other articles have looked at the Data Acquisition and Data Integration layers.

The Information Warehousing layer is designed as a normalized repository for all the integrated and mastered data. The primary purpose of this area is to organize the data at an atomic level in a state that is cleansed, mastered and well-defined. The discipline of normalization imposes a strict order on the data that promotes data integrity and retains a high degree of flexibility. The relatively complex structures are not conducive to ease of navigation, and are tuned for load performance rather than data access.

System of Record

Purpose: Storage of Single Trusted Source

As the name suggests, the system of record is designed to provide a single trusted source for all downstream consumption. Its focus is on flexibility, storage and load performance.

Content: Integrated / Atomic

This area contains the lowest level of detail available, with minimal redundancy; even the repetition of information implied by derived or computed values is avoided. The intention is to retain here all the component fields that would be used to derive other values. Ideally, each piece of information exists only once within the system of record, although some controlled level of redundancy for convenience may be included (e.g., the primary identifier for an Involved Party included on both the fundamental Involved Party table and the Involved Party Identifier attributive table).

Structure: Normalized

The principles of normalization isolate each distinct group of information, minimizing repetition and ensuring that what is stored together belongs together logically. The modelling directive is to model according to the “essence” of the entity rather than its “use”. Practically, this means that tables contain columns closely related to a single fundamental concept. The business function of a given piece of data is revealed through its relationship to other entities, including role-based entities. Through the use of relationship and attributive tables, new data elements can be added to the System of Record without requiring additional structures.
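
As a minimal sketch of the fundamental/attributive pattern described above, using the Involved Party example (the DDL and names are illustrative, not a prescribed design):

    -- Fundamental entity: modelled on the essence of the concept.
    CREATE TABLE involved_party (
        involved_party_sk  INTEGER     NOT NULL,
        party_type_cd      CHAR(2),                -- e.g., individual vs. organization
        primary_identifier VARCHAR(30),            -- controlled redundancy for convenience
        CONSTRAINT pk_involved_party PRIMARY KEY (involved_party_sk)
    );

    -- Attributive entity: each identifier a party holds, from any source, is a row here,
    -- so new identifier schemes can be added without changing the structure.
    CREATE TABLE involved_party_identifier (
        involved_party_sk  INTEGER     NOT NULL REFERENCES involved_party (involved_party_sk),
        identifier_type_cd CHAR(3)     NOT NULL,   -- e.g., business number, vendor identifier
        identifier_value   VARCHAR(30) NOT NULL,
        effective_from     DATE        NOT NULL,
        effective_to       DATE,                   -- NULL while the value is current
        CONSTRAINT pk_ip_identifier PRIMARY KEY (involved_party_sk, identifier_type_cd, effective_from)
    );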

Cleansing: Mastered and Cleansed

The system of record is the target of the data that has passed through the Data Integration Layer. No data should be loaded into the System of Record that has not passed through the process of being mastered and cleansed. It is critical to user confidence that this area remain closely governed and that its integrity not be compromised by co-locating external data without applying the discipline of the acquisition and integration layers.

Retention: Medium-term / Selective / Regulation-determined

The normalized structures permit the storage of all data elements. Most data elements can have history applied at the attribute level, recording the change in a single attribute without needing to repeat the other values contained on the same row. This means that history can be applied more selectively, with some fields being overwritten without storing the original value. For long-term storage, beyond seven years, data will be moved to a separate archival storage platform. The retention period may be driven by regulatory requirements, so the data remains accessible for auditing; although regulatory needs may be met through archival storage.
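
Continuing the illustrative Involved Party sketch from the previous section, attribute-level history can be captured by closing and re-opening a single attributive row, leaving the rest of the party’s data untouched (the key values shown are made up for the example):

    -- Close the current row for the changed identifier...
    UPDATE involved_party_identifier
    SET    effective_to = CURRENT_DATE
    WHERE  involved_party_sk = 1001
    AND    identifier_type_cd = 'BN'
    AND    effective_to IS NULL;

    -- ...and record the new value; no other attribute of the party is repeated.
    INSERT INTO involved_party_identifier
           (involved_party_sk, identifier_type_cd, identifier_value, effective_from, effective_to)
    VALUES (1001, 'BN', '987654321', CURRENT_DATE, NULL);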

Consumption: Limited

As has been stated, the primary function of the system of record is to remain optimized for storage and open to receiving new data sources. However, access to the system of record should not be barred to all users. The normalized structures offer a flexible path through the data, with multiple perspectives retained. The selective history approach provides greater visibility to changes in data values than denormalized tables. These structural advantages should be exploited.

It is important that this access not be allowed to compromise the integrity of the system. Design decisions should not be influenced by considerations of query performance. Should it become clear that usage of the normalized structures for data analysis is of sufficient value to users, consideration should be given to creating a Deep Analytics Sector.

Interactive Sector

Purpose: User-maintained Information

The Interactive Sector provides an area in which users can create and maintain authoritative data, directly feeding the System of Record. The content of this sector can be viewed as an extension of the System of Record as well as a source to enhance the Information Provisioning Layer, either through joins or through ETL migration to tables within that layer.

There are many circumstances in which users need to be able to set up and control sets of values, such as banding (e.g., age demographic ranges) or domains of values for lookup tables that may be relatively dynamic. In many organizations, these have been managed through spreadsheets or local databases; a mature solution will see them implemented through a centralized and technologically advanced approach.
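
As a sketch of the kind of user-maintained structure this sector might hold, using the banding example above (all names and values are illustrative):

    CREATE TABLE age_band (
        age_band_cd   CHAR(3)     NOT NULL,   -- maintained by business users through the interactive application
        age_band_desc VARCHAR(50) NOT NULL,   -- e.g., '25 to 34'
        min_age       SMALLINT    NOT NULL,
        max_age       SMALLINT    NOT NULL,
        CONSTRAINT pk_age_band PRIMARY KEY (age_band_cd)
    );

Downstream loads can then join an individual’s age to this table to derive the demographic grouping, rather than hard-coding the ranges in ETL.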

Content: Integrated / Atomic and Derived

Although the data here is user-generated, it resides within the Information Warehousing layer, and is designed and quality-controlled in such a way as to be integrated. This means that the applications created to insert and update data in its tables will enforce referential integrity against data within the System of Record, and will make use of its domain schemes and values. The content is to be primarily atomic, although some data may be derived. The intent is to enable and govern user-maintained data, rather than constrain it.

Structure: Normalized / Denormalized

The data structures of the Interactive Sector will mirror the System of Record wherever possible. In cases where user data will be linked directly to reporting tables, a denormalized design may be adopted. The goal is to be agile and accommodating without compromising the reliability of the information being provisioned. The guiding principle should be that the users are creating the “golden copy” of the data they are generating, and therefore it should conform as much as possible to the design of the System of Record.

Cleansing: Cleansed

It is essential that the application underlying the user interface enforce integrity controls on all data being entered and maintained. The Interactive Sector is being defined as sitting within the Information Warehousing layer, a zone in which all data must be scrubbed clean, rationalized and consolidated into a single trusted source. If the user interface lacks such controls, it should be treated as a standard source and be passed through the Staging area and subject to the processes of analysis, profiling and ETL transformations.

Retention: Medium-term

As per the System of Record, this user-maintained sector tracks the history of changes as required, and stores it for one to seven years, as appropriate. In DW2.0, Mr. Inmon suggests that the interactive sector store data for the short-term only, and be retained for a longer time in the Integrated Sector (System of Record). This might make sense for your organization, but the industry is moving towards migrating data as little as possible, and it may be more efficient to hold onto this data and access it where it is generated.  

Consumption: Open

The Interactive Sector data is intended to be combined with the Information Provisioning layer for business intelligence and extracts for downstream use. The data may be subject to security restrictions, but is otherwise open to all consumers from the Information Delivery layer.

I welcome your comments.

The next article will look at the Information Provisioning layer.

EDW Reference Architecture: Why Bother?

In the following series of articles I would like to define a possible reference architecture for an enterprise data warehouse. This represents an amalgam of real and imagined client sites, addressing all-too-real issues I’ve encountered. It is presented as a straw man: for discussion, for dressing up to fit the needs of your organization, or for burning on the bonfire of rejected ideas. I hope that if the last course is chosen, it at least provides fuel for better ideas – and that you’ll share them with me.

In preparing a reference architecture the immediate question arises: why bother? As an artefact it can appear arcane, ivory-tower, and unnecessary. Certainly, in many cases, it is presented in a way that has little meaning to the consumers of business intelligence, or even the practitioners involved in the EDW design and development. So how should it be presented and who is it for?

The architecture provides a reference point for all the stakeholders of the EDW. If it is to have meaning and value it must directly address specific issues, resolving them and fitting in to a larger corporate landscape.

For sponsors of the EDW, it must represent the vision of the system, which will presumably be along the lines of “The EDW will provide a single source of the truth throughout the organization, in a timely, reliable and accessible fashion”. The totality of the vision should be encompassed by, and evident in, the description of the reference architecture. What is the purpose of each EDW component? How does this meet what is required of the system?

For managers of the EDW, it draws a picture of the scope of work that will be required. It will not define the processes, but the definition of those processes will depend on understanding the end-to-end architecture. What demands does each of the EDW areas make on resources, in terms of data migration, design and retention?

For those involved in its design, it sets out the boundaries of the components within the EDW, defining their structure and relationship to one another. What types of database objects inhabit each component? What processing is necessary as those objects are populated?

For business, the reference architecture should present a clear and concise picture of how raw data will be transformed into information. It shows where information will be delivered for consumption, and how the processes along the way will contribute to its quality, reliability and ease of access. The message must be: your information is in good hands – you’ll have what you need when you need it, and it will be correct.

You can judge for yourself whether the way I’ve laid out the reference architecture meets the conditions set out for the various audiences. The next 4 articles will define each of the layers of the EDW. There are 2 components, or sectors, within each layer, and the following aspects of the components will be described:

  • Purpose – The general function of each component within the context of the EDW.
  • Content – The nature of the data within the component.
  • Structure – The type of data objects that will hold the sector’s data; normalized, denormalized, inherited from the source or requiring data design.
  • Transformations – The level of processing the data will undergo as it populates the sector.
  • Retention – How long the data will likely reside in the sector.
  • Consumers – Who will have access to the data and for what purpose.

As I suggested, this is offered in the spirit of discussion, and I welcome your comments and feedback from your own experiences.

The next article in the series is on the Data Acquisition layer.

Effective Communication: Team Meetings

There is a game played with groups of 10 to 20 people, where the object is to collectively count to 30. There are some rules to the game: if two people speak at once, the count returns to zero; no-one can say more than one number consecutively; and the count must be random, not simply moving from person to person around the circle in order, or controlled by anyone in any way.

This is extremely difficult to accomplish, and for the first 20 minutes may appear impossible. For a while, the failures are amusing, as people inevitably try to count at the same time. The amusement turns to irritation, particularly with those that are more outspoken, and repeatedly return the count to zero. And then something happens. The group becomes quiet and focused. The count is passed from one voice to another. No signals are made, no special order is followed. In the end, everyone has been part of the count; and the goal is achieved.

There is no doubt that any successful project will necessitate clear verbal communication among the group. As I have written elsewhere, documentation is the cornerstone of building a sound data warehouse; I would like to add to that it must be accompanied by a steady and cordial flow of human contact.

I’m not suggesting that everybody needs to be in the same room. I have been involved in highly successful projects where the participants were scattered around the globe. In one case, with part of the team in India, the client in the mid-western United States, and me in Toronto, Canada, we met on the phone twice each day, at the start and end. We leveraged the full 24 hours and were able to address issues as soon as they arose. The conference calls supplemented and clarified the documentation that was carefully version-controlled and our time was skilfully project-managed.

By contrast, I’ve seen projects where the team was all present on the client site, but the communication was less than effective. In one instance, little emphasis was placed on ensuring that everyone was on the same page. Instead, information about progress was held closely by the leadership. Issues and related decisions were never recorded or tracked. Project management methodology was in place, but the dialogue between the team leads, and between the leads and their teams, was so fundamentally dysfunctional that the project fell behind its targets and never recovered.

Most remarkable is a development team I have witnessed that epitomizes effective communication. The team meets on an almost daily basis. Although there are more than a dozen participants at each meeting, representing the whole life cycle of development, the discussion is always focused and involves the active engagement of most of the participants. Each meeting is facilitated rather than led, walking through a structured agenda. Issues are never framed as blame to be apportioned; but rather problems to be solved. The atmosphere of the meetings I attended was always respectful and cordial. Jokes were welcome; though each item on the agenda was given appropriate time and weight. I had the sense that this was a group that would have easily counted to 30.

In conclusion, here are 5 factors for effective communication:

5 Success Factors

Respect
It may sound trite, but one of the critical success factors for functional communication is respect. When people are treated like fools, they feel foolish and are less likely to think creatively. I’ve seen smart people shut down under harsh treatment. It stands to reason that the skills of the group are maximized when everyone feels comfortable to contribute.

Frequency
No-one wants to be in constant meetings, with nothing being accomplished; but it’s equally important to establish an easy rapport among the group. There’s nothing wrong with one-to-one direct communication either. However, this shouldn’t replace the formalized meetings which allow issues to be aired across the teams (analysts, designers, ETL etc.).

Engagement
The active involvement of the group is essential. If each meeting is being dominated by only a few voices, important perspectives may be overlooked. In my experience, people become disengaged when they feel that their contributions are never acknowledged or their suggestions considered. If the right people have been brought on board, you’ll want to hear from them.

Calm
Impending deadlines can ratchet up the tension in the room; as can personalities that are abrasive or fractious. Everyone will function much better if the stress is kept to a minimum and is never allowed to be mixed with heightened emotions.

Focus
The group is meeting to get a collective sense of the project’s status and air current issues. The discussion should be directed to move swiftly. If a given problem is taking too much time, take it up after the meeting. It’s also important that only a single discussion is taking place. Side conversations can develop easily, but they should be brought back to the group immediately.

Even as I write these out, they seem self-evident. But keeping them in balance is an act of will on the part of the whole group. A facilitator is key to finding that balance, while it is up to the group to maintain it. In the case of the group with such impressive status meetings, I understand that it took some time to achieve their level of success. It was dependent on the format, the personalities and the rapport of the individuals. I hope they continue to metaphorically count to 30 each day. It’s a pleasant and effective way to work.

Picking the Team, Part III: ETL Designer

In this series of articles, we’re looking at the top qualities required for candidates to fill three key positions in the data warehousing team. In part one we looked at the analyst, in part two, the role of data designer. Here we’ll explore the five top skills to look for in the ETL designer / developer. In larger operations, these will be two distinct roles, with perhaps multiple resources. There may be a designer, with a lead developer and a team of ETL developers. For the purposes of this article, we’ll roll ETL into one position: with ETL design and development taken together.

ETL Design/Development

The process of Extract Transform Load (ETL) involves the migration of data from one system to another. In the case of the data warehouse, this occurs in multiple stages: source to staging, staging to integration layer, and integration layer to data mart.

We’ll take the base technical skills as a given. The ETL resource must be proficient in the toolset being applied (DataStage, Informatica, Ab Initio etc.). While the functionality is essentially the same across the board, the features of each are quite distinct. Beyond this fundamental competency, look for the candidate’s expertise in the following areas:

  1. Experience with Delta Loads to Staging
  2. Experience with Normalized Structures (Integration Layer)
    • Change Capture / History
    • Reference Tables (consolidated vs separate)
    • Surrogate Keys
  3. Experience with Denormalized Structures (Provisioning Layer)
    • Slowly Changing Dimensions
    • Bridging and Relationship Tables
  4. Approach to Performance Tuning
  5. Approach to Rapid Development

It’s important to find someone with knowledge of the end-to-end process. They may have been involved in one aspect of the process in the past, or been responsible for maintenance of existing jobs. This will not be sufficient. The ETL needs to be designed by someone who understands the specific challenges involved at each stage, with a corresponding toolkit of solutions at their disposal to address them.

The staging area is loaded from the source systems, and requires strategies to extract only the latest data. The ETL candidate needs to be familiar with different strategies to manage this, such as source system logs, datestamps on source tables, or set comparison operations within staging.
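
As a sketch of the datestamp strategy (assuming the source table carries a reliable last-update timestamp and the ETL keeps a high-water mark in a control table; all names are illustrative):

    -- Pull only rows changed since the last successful extract.
    SELECT src.*
    FROM   customer_source src
    WHERE  src.last_update_ts >
           ( SELECT MAX(extract_high_water_ts)
             FROM   etl_extract_control
             WHERE  source_table_name = 'customer_source' );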

It’s quite common for an ETL resource to have experience with loading data marts, but have no knowledge of normalization. This can colour their thinking regarding the population of the Integration Layer, and present challenges when using it as a source for the Provisioning Layer. You should specifically ensure that the candidate is aware of the different ways change capture is applied in the normalized environment, in particular the use of associative tables; that they have experience of consolidated reference tables; and that they understand that surrogate keys in the Integration Layer are static.

Dimensions and fact tables are standard structures in the industry and every ETL designer should be well-versed in their use. It would be a good idea to check that they know the three types of slowly changing dimension:

  1. Update the existing value with the new value
  2. Create a new row for the new value
  3. Design multiple columns for new and previous value
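
The second pattern (a new row per change) is the one most worth probing in an interview. A minimal sketch, with illustrative names and values, expires the current row and inserts a new version under a fresh surrogate key:

    -- Expire the current version of the changed dimension row...
    UPDATE customer_dim
    SET    current_flag = 'N',
           effective_to = CURRENT_DATE
    WHERE  customer_bk  = 'C-1001'
    AND    current_flag = 'Y';

    -- ...and insert the new version with a new surrogate key.
    INSERT INTO customer_dim
           (customer_sk, customer_bk, customer_segment, effective_from, effective_to, current_flag)
    VALUES (98765, 'C-1001', 'PREMIUM', CURRENT_DATE, NULL, 'Y');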

Another data mart structure that is not always applied is the bridging table, used for navigating unbalanced hierarchies, which requires a separate ETL pattern to populate. The candidate should have experience with both bridging tables and relationship tables, which can handle many-to-many relationships.
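
One common shape for such a bridging table (a sketch only; names are illustrative) stores every ancestor-descendant pair, so that facts can be rolled up from any level of a ragged hierarchy:

    CREATE TABLE org_hierarchy_bridge (
        ancestor_org_sk     INTEGER  NOT NULL,
        descendant_org_sk   INTEGER  NOT NULL,
        depth_from_ancestor SMALLINT NOT NULL,   -- 0 for the self-referencing row
        CONSTRAINT pk_org_bridge PRIMARY KEY (ancestor_org_sk, descendant_org_sk)
    );

    -- Roll a fact up to any node, however unbalanced the hierarchy beneath it.
    SELECT b.ancestor_org_sk,
           SUM(f.sales_amount) AS total_sales
    FROM   sales_fact f
    JOIN   org_hierarchy_bridge b ON b.descendant_org_sk = f.org_sk
    GROUP  BY b.ancestor_org_sk;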

There is usually a window of time within which the nightly, weekly or monthly ETL processes must be completed. This puts a premium on the ability of the ETL designer to optimize the efficiency of the code. Make sure the candidate has practical experience with this and can list a number of strategies that they have employed. Don’t expect them to answer questions like “How long should it take to load a million rows?” – there are too many variables for them to give a meaningful response. What is important is that they can articulate a number of avenues of approach.

The final point relates to the candidate’s overall attitude to development. Most IT shops want agility, while the data warehouse life cycle contains inherent dependencies. The ETL designer should always be looking for repeatable patterns, building on previous experience as well as the blocks of code developed through the project. Find out what strategies they employ to create processes quickly and accurately. ETL can be the most time-intensive part of the project. An efficient ETL designer can be key to the warehouse ROI.

Picking the Team, Part II: Data Designer

In part one of picking a team, we discussed the role of the analyst. Here, we continue to look at skills required for the development team in the role of data designer. In large organizations, these will be the separate positions of architect and modeller, while in smaller organizations, architecture and modelling may be handled by the same individual.

Data Designer

The role of the data designer is two-fold, reflecting the two data layers of the warehouse to be addressed: the normalized Integration Layer and the denormalized Provisioning Layer. Data design is performed in a data modelling tool. Knowledge of a modelling tool is assumed for the purposes of this article, although it is important to ensure the candidate knows how to use the same tool being used on the project. The data designer is responsible for creating a logical and physical data model, and needs to communicate to the ETL designer how the sources map to the target model.

For the Integration Layer

  1. Mastery of the principles of normalization, with an understanding of their design implications
  2. Understanding of Logical to Physical transformations
  3. Ability to make a distinction between a concept’s nature and its usage

Having a grasp of modelling to the third normal form and beyond (“the key, the whole key and nothing but the key, so help me Codd”) is a good and necessary grounding, but not all that is required of the data designer. This is not a one-off exercise. The data structures that are created must fit in with the existing warehouse (should one exist) and accommodate future additions. (One hopes that each project is only a piece of the whole; trying to model the enterprise in one go is certainly discouraged!) Retaining the flexibility of the Integration Layer is essential. A well-designed structure should be able to accept new sources that, in many cases, will either find targets in existing database objects or require new tables rather than changes to existing tables. The other points listed above must also be taken into consideration by the designer. When finding this resource, it is important to ensure that the candidate has developed enough expertise to be able to weigh a spectrum of choices against their effect on these aspects of design.

The transformations from logical to physical are included as a separate skillset because it is common in the market for data modellers to identify themselves as either “logical” or “physical” modellers. If you do face such a scenario, and the modelling tasks must be divided between two modellers, it is important that, as the logical model is transformed, it not lose the careful attention to the considerations listed in the first point. It’s no use if the logical model retains flexibility while the physical denormalization paints the model into a corner.

Normalization can be seen as a process of distinguishing a data element’s essential nature from the context in which it’s used. The ability to discern this from what can be “a bunch of tables” in the source system is invaluable. (For example, a condition included on a transaction could be seen as related to the account, but actually applies to the customer.) This is one of the areas that requires communication with the analyst and a shared understanding of the problem being examined.
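
A sketch of how that distinction might land in the model (names are illustrative): the condition is attached to the party it actually describes, and reaches the transaction only through relationships, rather than being repeated on every transaction row.

    -- The condition belongs to the customer (its essence), not to the transaction (its use).
    CREATE TABLE customer_condition (
        customer_sk    INTEGER NOT NULL,
        condition_cd   CHAR(4) NOT NULL,
        effective_from DATE    NOT NULL,
        CONSTRAINT pk_customer_condition PRIMARY KEY (customer_sk, condition_cd, effective_from)
    );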

For the Provisioning Layer

  1. Mastery of the principles of denormalization with an understanding of the implications for:
    • Slowly changing dimensions
    • Hierarchies
    • Relationships
    • Additive, semi-additive and non-additive metrics
  2. Knowledge of applicability of Views including horizontal and vertical filtering
  3. Ability to translate reporting requirements into optimized data structures

On the face of it, dimensional modelling may seem a simpler discipline than normalization. However, it contains hidden complexities which are closely associated with the ability of users to navigate the structures intuitively and retrieve data on demand with great efficiency. This ground has been well-covered by Ralph Kimball and his associates, and any project should ensure that the modelling candidate is well-versed in Mr. Kimball’s teachings. I would go so far as to say that if the candidate doesn’t know Ralph Kimball’s name, you should look elsewhere.

Views are second on the list, because the designer needs to have some flexibility to cope with security concerns, the needs of specific lines of business, providing meaningful sets of data to given groups etc. The designer should have an extensive “bag of tricks” to apply to any situation so that they can reach beyond “Star Schemas” to provide whatever structure will best serve the situation.

This brings me to the third point, which is that the Provisioning Layer is all about optimizing data retrieval. In order to do this well, the person needs to be able to look at a report specification, be it a sample of the report output or a set of dimensional attributes and metrics, and see how to structure the data so that users can grasp what each table contains, how the tables relate to one another, and how calculations can be performed quickly and accurately.

For both disciplines, I include the following 2 essential traits:

  1. Vision of the “Big Picture” for the data warehouse
  2. Systematic attention to detail

With the data warehouse being built project by project, it is critical that the design maintain a complete vision of where it is going. Tactical choices, particularly in the Integration Layer, are extremely difficult to adjust later to align with the strategic goals. In the Provisioning Layer, it is the maintenance of conformed dimensions and a system to guide the creation of fact tables that will minimize the number of objects created. In both cases, meeting the needs of the project while retaining an architectural vision is imperative.

The task of creating a data model is necessarily a detailed one, with many sub-tasks and a long checklist of validations to be applied. The best designers will systematically cross every datatype and dot every standardized name; and in the diagrams, entity relationship lines will never cross (ok, maybe sometimes)! Documentation must include a source to target mapping to communicate how the structures are to be loaded. Again, attention to detail is crucial, as is the clarity of the material being presented.

If you can run the development life cycle to produce the Integration and Provisioning Layers serially, these designs can be created by the same person. If time efficiencies are important, some parallelism is possible, and two resources will be required. In large organizations, it will be essential to have an architect overseeing multiple modellers designing for concurrent projects. In all cases, a full complement of these capabilities will be needed to create a solid foundation.

Next: ETL Designer

Measure Twice: Analysis Part II

Continuing on from the previous posting on approaching analysis of source systems, here we will look at other aspects of the data sources to be explored by the analyst.

Relationships

For legacy mainframe systems that use Copy Books, ask about tables within the segment. The “occurs” clause in the copybook may contain the “table” keyword, indicating that the file contains a one-to-many relationship with what is essentially an inner table. The “occurs” will likely signal a one-to-many relationship between the subject of the segment and the set of multi-occurring fields. There could be a many-to-many relationship that needs to be identified by examining the relationship from the other perspective (e.g., customers to accounts).

For all relationships between tables, a number of questions need to be addressed:

  • Does this relationship change over time? If so, how frequently?
  • Is there enough information to uniquely identify the object of the relationship?
  • Is there a set of attributes dependent on the relationship? (e.g., marketing indicators flagging the preferred mailing address related to a Customer Location)

As a data design consideration, relationships may lead to associative tables, attributive tables or foreign keys in the normalized Integration Layer of the data warehouse, and may require denormalization or relationship tables in the Data Mart. Having a clear understanding of relationships is essential to making informed modelling decisions.

Reference Data

Reference data is anything that involves sets of codes with corresponding descriptions. These are sometimes referred to as classifications, and are frequently found in subject-specific lookup tables containing distinct domains of values. The following are points to consider when encountering these types of data in tables, or when suspecting they may be present.

  • Note fields that use a list of values
  • Ask about fields that might be codes. These may be fields with a small number of characters or digits (1-6)
  • Ask if there is a related reference table or known domain of values.
  • How frequently do these values (either codes or descriptions) change over time, if at all? It is important to know if the codes are ever recycled. I encountered a situation where a code that had meant a sale one month, was changed to mean a return the next. The change in meaning hadn’t been communicated to IT.
  • Confirm that any documented values are accurate in the source data. Numerous times profiling has revealed an unexpected domain of values, at odds with the documentation.

Redefines

Again, for legacy mainframe sources, ask how redefines are to be used. These are columns that are marked with the REDEFINES keyword in the Copy Book, indicating that the field will have different meanings or even different formats depending on the context. For any of the following, we need to know how to handle the redefined field (e.g., the redefine’s meaning is dependent on another field’s value, such as individual vs. company, or on the appearance of an alpha character within a set of bytes).

Redefines could be:

  1. A different data type (e.g., TELEPHON vs. TELEPHONX = numeric vs. alpha)
  2. Because the row is a different type, the bytes are redefined for a completely different purpose
  3. The field has a different purpose depending on the value of another field in the row.

The redefine is specific to copy books, but a similar concept may exist in relational tables. When this type of situation does appear in the source, the Staging area should reflect multiple fields, to accommodate each of the different types of data held within the same source field.
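
As a sketch of how the staging structure could reflect a redefined field, using the telephone example above (a staging table with one column per redefinition; names are illustrative):

    CREATE TABLE stg_party_contact (
        party_type_cd CHAR(1)     NOT NULL,   -- determines which redefinition applies
        telephon_num  DECIMAL(10),            -- populated when the source bytes hold the numeric layout
        telephonx_txt VARCHAR(10)             -- populated when the source bytes hold the alphanumeric layout
    );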

Amounts

For numeric fields, particular attention needs to be paid to fully defining their nature.

If it is a quantity, ask what time period it represents, such as:

  1. A snapshot of that day
  2. A cumulative amount from the beginning of a period. That period can be either:
    • Calendar
    • Fiscal

For all numeric fields, consider the following:

  1. What type of number?
    • Rate
    • Balance
    • Average
  2. Calculations: is this derived from atomic components? What are they? There may be additional sources to hunt down.
  3. Rates: what is the precision to be applied?
  4. Currency amounts:
    • Precision
    • Scaling: is the number in units, thousands or millions?
    • Currency

Consider if the number is a metric or an attribute. For example, the number of dependants of an individual may not be stored as a metric, to be used in calculations, but as a descriptive attribute to be applied to demographic groupings.

History

These points have been made in the context of specific columns and relationships. I make them again here, to stress the importance of probing for an understanding of both the availability of historical information and the requirement for storing history moving forward, in the data warehouse. The requirement to store history can have a profound impact on each layer of the data warehouse – both in terms of capacity planning and fundamental design decisions.

  • How frequently does the value change over time?
  • How much history currently exists?
  • How often do these relationships change over time?
  • Even if the value changes, is it necessary to store history? (e.g., change in Date of Birth)

This list has been accumulated over many engagements, although I’m sure it’s not exhaustive. There always seems to be some special hidden surprise within each source system. One thing is certain, however: failure to closely examine each data element being fed into the warehouse will lead to a gotcha! moment – and invariably, costly rework.

Any additions you would like to make to the list would be most welcome.
