Reference Domains Part II: Modelling Classifications

This is the second article in the series on working with reference domains, also commonly referred to as classifications. In Part I, we looked at the nature of classifications. Here we will discuss designing structures to accommodate them in the EDW’s Information Warehousing layer, within the normalized System of Record.

Two potential approaches to designing the Reference Domains are to:

  1. Put each reference domain of values into its own table.
  2. Collect all reference domains into two tables: one for a master set of domains, and the other for all the values, with a column linking each value to its domain.

The second approach is the recommended one, and the remainder of this article presents the design and the rationale. I have seen the consolidated reference data approach implemented many times within the System of Record of the EDW, and it has proved highly effective.

Given this preferred approach, every Reference Domain and Value is to be placed in the following entities:

  • Reference Domain
  • Reference Value

As mentioned in Reference Domains Part I, reference domains are commonly made up of pairs of code and description attributes. However, some domains may carry additional attribution, such as short names, long names, multi-language names, etc. In such cases, consideration should be given to extending the model to accommodate additional fields. (This is not the same as related reference values that may be stored together in the source but should be separated in the EDW.)
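
The consolidated design can be sketched in SQLite as follows; the table and column names (REF_DOMAIN, REF_VALUE, and so on) are illustrative assumptions, not a prescribed physical model.

```python
# Minimal sketch of the two consolidated entities; names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE REF_DOMAIN (
    DOMAIN_ID   INTEGER PRIMARY KEY,
    DOMAIN_NAME TEXT NOT NULL UNIQUE      -- e.g. 'Party Status'
);
CREATE TABLE REF_VALUE (
    VALUE_ID    INTEGER PRIMARY KEY,
    DOMAIN_ID   INTEGER NOT NULL REFERENCES REF_DOMAIN(DOMAIN_ID),
    VALUE_CODE  TEXT NOT NULL,            -- the code...
    VALUE_DESC  TEXT NOT NULL,            -- ...and its description
    UNIQUE (DOMAIN_ID, VALUE_CODE)        -- codes unique within a domain
);
""")
conn.execute("INSERT INTO REF_DOMAIN VALUES (1, 'Party Status')")
conn.executemany("INSERT INTO REF_VALUE VALUES (?, ?, ?, ?)",
                 [(10, 1, 'A', 'Active'), (11, 1, 'C', 'Closed')])
rows = conn.execute(
    "SELECT v.VALUE_CODE, v.VALUE_DESC FROM REF_VALUE v "
    "JOIN REF_DOMAIN d ON d.DOMAIN_ID = v.DOMAIN_ID "
    "WHERE d.DOMAIN_NAME = 'Party Status' ORDER BY v.VALUE_CODE").fetchall()
print(rows)  # [('A', 'Active'), ('C', 'Closed')]
```

Every new domain is simply a new row in REF_DOMAIN plus rows in REF_VALUE, with no new objects to create.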

The domain values may be organized hierarchically. To support many-to-many relationships between domain values, in terms of hierarchies or other relationship types, the associative entity, Reference Value to Reference Value Relationship, should be used. As with all associatives, it will accommodate multiple concurrent as well as historical relationships.

To facilitate changes in values within a given scheme, a Reference Domain to Reference Value Relationship entity can be deployed.
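
A sketch of the Reference Value to Reference Value Relationship associative, with an assumed effective-dating pattern to hold concurrent and historical relationships; the names and columns are illustrative only.

```python
# Illustrative associative for value-to-value relationships (hierarchies etc.).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE REF_VALUE_TO_REF_VALUE_REL (
    PARENT_VALUE_ID INTEGER NOT NULL,
    CHILD_VALUE_ID  INTEGER NOT NULL,
    REL_TYPE        TEXT NOT NULL,        -- e.g. 'HIERARCHY'
    EFFECTIVE_DATE  TEXT NOT NULL,
    END_DATE        TEXT                  -- NULL = currently in effect
);
""")
# Value 20 rolls up to value 10 in the hierarchy from 2024-01-01.
conn.execute("INSERT INTO REF_VALUE_TO_REF_VALUE_REL "
             "VALUES (10, 20, 'HIERARCHY', '2024-01-01', NULL)")
current = conn.execute(
    "SELECT PARENT_VALUE_ID FROM REF_VALUE_TO_REF_VALUE_REL "
    "WHERE CHILD_VALUE_ID = 20 AND END_DATE IS NULL").fetchone()
print(current)  # (10,)
```

Because the relationship type is carried on the row, hierarchies and other relationship types can coexist in the one associative.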

Maintaining History

The history of values for a given scheme in relationship to other core concepts is to be retained within relevant associative structures.

For example, in the diagram below, the current value of Party Status Id and its effective date, called Party Status Change Date, are held on the Party entity. Historical values are held on the Party to Domain Relationship entity, with the Domain Scheme Id holding the identifier for the Party Status domain, and the Domain Value Id holding the identifier for the previous value.

To illustrate the example more fully, consider the following:

The Domain Scheme and Domain Value tables contain rows for the Party Status domain scheme.

Day 1

On day one, the party enters the system with a party status of “Active”. There are no entries in the Party to Domain Relationship table.

Day 2

On day two, the party status changes to “Closed”. The current value in Party is overwritten with the new value; the change date is updated; and the old value is inserted into the associative.
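
The day-two change can be sketched as follows; the in-memory structures and field names are illustrative stand-ins for the Party table and the Party to Domain Relationship associative.

```python
# Sketch of the history mechanism: overwrite the current value on Party,
# and insert the superseded value into the associative. Names are illustrative.
party = {"party_id": 1, "party_status_id": "ACTIVE",
         "party_status_change_date": "2024-01-01"}
party_to_domain_rel = []  # stands in for Party to Domain Relationship

def change_status(party, history, new_status, change_date,
                  domain_scheme_id="PARTY_STATUS"):
    # Move the old value into the associative before overwriting it.
    history.append({"party_id": party["party_id"],
                    "domain_scheme_id": domain_scheme_id,
                    "domain_value_id": party["party_status_id"],
                    "effective_date": party["party_status_change_date"],
                    "end_date": change_date})
    party["party_status_id"] = new_status
    party["party_status_change_date"] = change_date

change_status(party, party_to_domain_rel, "CLOSED", "2024-01-02")
print(party["party_status_id"])                   # CLOSED
print(party_to_domain_rel[0]["domain_value_id"])  # ACTIVE
```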

Design Benefits

The following points outline the benefits of employing a consolidated set of domain schemes and values within a small set of tables, versus creating a separate table for each domain scheme’s set of values.

  1. Data Integrity: The integrity of the allowable set of values within a given domain is maintained through the application logic defined and implemented in the ETL. Maintaining a master list reduces the chance of inconsistencies or the appearance of anomalies.
  2. Flexibility: There is only one table to access for all domain value codes and descriptions within the SOR.
  3. Implementation Efficiency: No additional database entities or tables need to be created, tested or maintained within the System of Record (SOR) logical or physical model. Fewer objects mean fewer chances for error.
  4. Operational Efficiency: A single utility can be created to interface with this table. It is true that even with multiple tables a single utility could be created with minimal logic. However, each new object would require some changes to the application, whereas the consolidated version can be extended seamlessly.
  5. Consistency: History is stored in the classification associative tables (e.g., Party to Domain Relationship). Storing the domain schemes and values as surrogate keys in the Domain Value entity supports these structures, allowing history to be gathered through a common mechanism that explicitly identifies both the scheme and the value.

In part three, we will go on to look at collecting and documenting classifications.

EDW Reference Architecture: Information Provisioning Layer

This is the fifth in a series introduced in EDW Reference Architecture: Why Bother?. Others in the series have looked at the Data Acquisition, Data Integration, and Information Warehousing layers.

The Information Provisioning Layer is designed for ease of navigation and optimized for retrieval performance for business intelligence and other consumers.

Data Mart

Purpose: Meeting Specific Needs

This sector is intended as a platform from which to meet specific information requirements. Where possible, the objects will be designed for re-use. However, the guiding principle is the effective servicing of business needs, and to this end, the solution’s primary focus is on the given project’s requirement.

Content: Golden Copies Plus

The primary source of content in this sector is the Information Warehousing layer: trusted and atomic data that can be used to populate the tables in the Data Marts, as well as to generate aggregated and computed measures and derived dimensional attributes. In addition, it may be expedient to combine this information with data from other sources that has not passed through the process of integration into the EDW. This should be permitted on an exception basis, with governance controls and a clear demarcation of the data’s lineage.

Structure: Denormalized

A common approach to data architecture in Data Marts is the “Star Schema”. This lends itself to ease of use and efficient performance. It also supports a high degree of re-use of objects, with “conformed” dimensions, to borrow a term from Mr. Ralph Kimball, being shared by a variety of users. Applying views on top of the tables provides a mechanism for security and a way to filter on rows and columns for specific use cases. This sector is not limited to the “Star Schema” design, and should employ structures that are “fit for purpose” wherever greater efficiencies might be attained. This includes the option to apply views over tables from the Information Warehousing layer, in cases where performance is not a concern or the information resides within the Interactive Sector.
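
A minimal SQLite sketch of using a view to restrict rows and columns over a mart table; all object names here are assumptions for illustration.

```python
# A view that exposes only one region's rows and hides a sensitive column.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SALES_FACT (
    REGION TEXT, PRODUCT TEXT, AMOUNT REAL, COST REAL
);
-- Expose only the EAST region, and omit the COST column.
CREATE VIEW SALES_EAST_V AS
    SELECT PRODUCT, AMOUNT FROM SALES_FACT WHERE REGION = 'EAST';
""")
conn.executemany("INSERT INTO SALES_FACT VALUES (?, ?, ?, ?)",
                 [('EAST', 'Widget', 100.0, 60.0),
                  ('WEST', 'Widget', 200.0, 120.0)])
rows = conn.execute("SELECT * FROM SALES_EAST_V").fetchall()
print(rows)  # [('Widget', 100.0)]
```

Granting consumers access to the view rather than the underlying table enforces both the row filter and the column restriction.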

Cleansing: Cleansed

The Data Mart draws information primarily from a single trusted source; the ETL processes loading the Data Mart do not cleanse the data, although they will perform transformations to create derived values and aggregations. External sources being combined with Data Mart tables should not require cleansing. If they do, they should pass through a process of analysis and profiling to identify and remediate any anomalies. If the data is atomic, consideration should be given to making it a standard source input, subject to the full process of integration through the Information Warehousing layer.

Retention: Medium-term / Business-determined

The length of the retention period will depend on the business need rather than any regulatory requirement. If only current views of dimensions are required, the dimensions may be type one (i.e., changes are applied as updates to existing rows).
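
A type-one change can be sketched as a simple in-place update, with no history retained; names are illustrative and SQLite is used for brevity.

```python
# Type-one dimension change: the new value overwrites the old row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMER_DIM "
             "(CUST_KEY INTEGER PRIMARY KEY, SEGMENT TEXT)")
conn.execute("INSERT INTO CUSTOMER_DIM VALUES (1, 'Retail')")
# The customer's segment changes: apply it as an update, losing the old value.
conn.execute("UPDATE CUSTOMER_DIM SET SEGMENT = 'Commercial' WHERE CUST_KEY = 1")
segment = conn.execute(
    "SELECT SEGMENT FROM CUSTOMER_DIM WHERE CUST_KEY = 1").fetchone()[0]
print(segment)  # Commercial
```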

Consumption: Open

The information held in the Data Marts is intended for consumption, subject to privacy concerns and security constraints. Information will be accessed through business intelligence applications and other downstream systems. Appropriate levels of governance will be enforced for all information retrieval.

I welcome your comments.

As a final note on EDW Reference Architecture, the next article will discuss the Metadata Repository.

EDW Reference Architecture: Information Warehousing Layer

This is the fourth article in a series introduced in EDW Reference Architecture: Why Bother?. Other articles have looked at the Data Acquisition and Data Integration layers.

The Information Warehousing layer is designed as a normalized repository for all the integrated and mastered data. The primary purpose of this area is to organize the data at an atomic level in a state that is cleansed, mastered and well-defined. The discipline of normalization imposes a strict order on the data that promotes data integrity and retains a high degree of flexibility. The relatively complex structures are not conducive to ease of navigation, and are tuned for load performance rather than data access.

System of Record

Purpose: Storage of Single Trusted Source

As the name suggests, the system of record is designed to provide a single trusted source for all downstream consumption. Its focus is on flexibility, storage and load performance.

Content: Integrated / Atomic

This area contains the lowest level of detail available, with minimal redundancy; this includes avoiding the repetition of information in derived or computed values. The intention is to retain here all component fields that would be used to derive other values. Ideally, each piece of information exists only once within the system of record, although some controlled level of redundancy may be included for convenience (e.g., the primary identifier for an Involved Party included on both the fundamental Involved Party table and the Involved Party Identifier attributive table).
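
The controlled-redundancy example can be sketched as follows, with illustrative names: the primary identifier sits on the fundamental table for convenience and appears again among the rows of the attributive table.

```python
# Fundamental table plus attributive identifier table; names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE INVOLVED_PARTY (
    PARTY_ID      INTEGER PRIMARY KEY,
    PRIMARY_IDENT TEXT NOT NULL          -- convenience copy of one identifier
);
CREATE TABLE INVOLVED_PARTY_IDENTIFIER (
    PARTY_ID    INTEGER REFERENCES INVOLVED_PARTY(PARTY_ID),
    IDENT_TYPE  TEXT NOT NULL,           -- e.g. 'CUST_NO', 'SSN'
    IDENT_VALUE TEXT NOT NULL
);
""")
conn.execute("INSERT INTO INVOLVED_PARTY VALUES (1, 'CUST-001')")
conn.executemany("INSERT INTO INVOLVED_PARTY_IDENTIFIER VALUES (?, ?, ?)",
                 [(1, 'CUST_NO', 'CUST-001'), (1, 'SSN', '999-99-9999')])
idents = conn.execute(
    "SELECT IDENT_TYPE, IDENT_VALUE FROM INVOLVED_PARTY_IDENTIFIER "
    "WHERE PARTY_ID = 1 ORDER BY IDENT_TYPE").fetchall()
print(idents)  # [('CUST_NO', 'CUST-001'), ('SSN', '999-99-9999')]
```

New identifier types become new rows in the attributive table, not new columns.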

Structure: Normalized

The principles of normalization isolate each distinct group of information, minimizing repetition and ensuring that what is stored together belongs logically together. The modelling directive is to model according to the “essence” of the entity rather than its “use”. Practically, this means that tables contain columns closely related to a single fundamental concept. The business function of a given piece of data is revealed through its relationship to other entities, including role-based entities. Through the use of relationship and attributive tables, new data elements can be added to the System of Record without requiring additional structures.

Cleansing: Mastered and Cleansed

The system of record is the target of the data that has passed through the Data Integration Layer. No data should be loaded into the System of Record that has not passed through the process of being mastered and cleansed. It is critical to user confidence that this area remain closely governed and that its integrity not be compromised by co-locating external data without applying the discipline of the acquisition and integration layers.

Retention: Medium-term / Selective / Regulation-determined

The normalized structures permit the storage of all data elements. Most data elements can have history applied at the attribute level, recording the change in a single attribute without needing to repeat the other values contained on the same row. This means that history can be applied more selectively, with some fields being overwritten without storing the original value. For long-term storage, beyond seven years, data will be moved to a separate archival storage platform. The retention period may be driven by regulatory requirements, so the data remains accessible for auditing; although regulatory needs may be met through archival storage.

Consumption: Limited

As has been stated, the primary function of the system of record is to remain optimized for storage and open to receiving new data sources. However, access to the system of record should not be barred to all users. The normalized structures offer a flexible path through the data, with multiple perspectives retained. The selective history approach provides greater visibility to changes in data values than denormalized tables. These structural advantages should be exploited.

It is important that this access not be allowed to compromise the integrity of the system. Design decisions should not be influenced by considerations of query performance. Should it become clear that usage of the normalized structures for data analysis is of sufficient value to users, consideration should be given to creating a Deep Analytics Sector.

Interactive Sector

Purpose: User-maintained Information

The Interactive Sector provides an area in which users can create and maintain authoritative data, directly feeding the System of Record. The content of this sector can be viewed as an extension of the System of Record as well as a source to enhance the Information Provisioning Layer, either through joins or through ETL migration to tables within that layer.

There are many circumstances in which users need to be able to set up and control sets of values, such as banding (e.g., age demographic ranges) or domains of values for lookup tables that may be relatively dynamic. In many organizations, these may have been managed through spreadsheets or local databases; a mature solution will see them implemented through a centralized and technologically advanced approach.

Content: Integrated / Atomic and Derived

Although the data here is user-generated, it resides within the Information Warehousing layer and is designed and quality-controlled in such a way as to be integrated. This means that the applications created to insert and update data in its tables will enforce referential integrity against data within the System of Record, and make use of its domain schemes and values. The content is to be primarily atomic, although some data may be derived. The intent is to enable and govern user-maintained data, rather than constrain it.

Structure: Normalized / Denormalized

The data structures of the Interactive Sector will mirror the System of Record wherever possible. In cases where user data will be linked directly to reporting tables, a denormalized design may be adopted. The goal is to be agile and accommodating without compromising the reliability of the information being provisioned. The guiding principle should be that the users are creating the “golden copy” of the data they are generating, and therefore it should conform as much as possible to the design of the System of Record.

Cleansing: Cleansed

It is essential that the application underlying the user interface enforce integrity controls on all data being entered and maintained. The Interactive Sector is being defined as sitting within the Information Warehousing layer, a zone in which all data must be scrubbed clean, rationalized and consolidated into a single trusted source. If the user interface lacks such controls, it should be treated as a standard source and be passed through the Staging area and subject to the processes of analysis, profiling and ETL transformations.

Retention: Medium-term

As per the System of Record, this user-maintained sector tracks the history of changes as required, and stores it for one to seven years, as appropriate. In DW2.0, Mr. Inmon suggests that the interactive sector store data for the short-term only, and be retained for a longer time in the Integrated Sector (System of Record). This might make sense for your organization, but the industry is moving towards migrating data as little as possible, and it may be more efficient to hold onto this data and access it where it is generated.  

Consumption: Open

The Interactive Sector data is intended to be combined with the Information Provisioning layer for business intelligence and extracts for downstream use. The data may be subject to security restrictions, but is otherwise open to all consumers from the Information Delivery layer.

I welcome your comments.

The next article will look at the Information Provisioning layer.

EDW Reference Architecture: Data Integration Layer

This is the third part of the series, introduced in EDW Reference Architecture: Why Bother? and continued in a piece about the Data Acquisition Layer.

The Integration Layer marks the transition from raw data to integrated data; that is, data that has been consolidated and rationalized, with duplicate records and values removed and disparate sources combined into a single version. This layer represents the passage of the data through the process of integration, rather than a storage area for the data. For data with multiple sources, a mastering process is required; this is termed Master Data Management. Data from a single source may still require significant processing for de-duplication and cleansing, and this will be performed within the ETL application or processes.

Where the process of “mastering” disparate sources is not required, the Master Data Management process is bypassed. Instead the data is processed by ETL, undergoing systematic de-duplication, error-detection and cleansing routines.

Master Data Management

Purpose: Processing from Multiple Sources

Master Data Management is the process by which data from different sources is matched and processed to compile a single “golden copy” of a given entity.
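
A toy sketch of compiling a “golden copy” from matched records follows; the source-priority survivorship rule and the field names are assumptions for illustration, and real MDM products apply far richer matching and survivorship logic.

```python
# Survivorship sketch: for each attribute, keep the highest-priority
# non-null value across the matched source records.
SOURCE_PRIORITY = ["CRM", "BILLING"]  # assumed rule: CRM wins conflicts

def golden_copy(records):
    # Order records so higher-priority sources contribute first.
    records = sorted(records, key=lambda r: SOURCE_PRIORITY.index(r["source"]))
    merged = {}
    for rec in records:
        for field, value in rec.items():
            if field != "source" and value is not None:
                merged.setdefault(field, value)  # first (best) non-null wins
    return merged

matched = [
    {"source": "BILLING", "party_id": 42, "name": "J. Smith", "phone": "555-0100"},
    {"source": "CRM",     "party_id": 42, "name": "Jane Smith", "phone": None},
]
print(golden_copy(matched))
# {'party_id': 42, 'name': 'Jane Smith', 'phone': '555-0100'}
```

The name survives from the preferred CRM source, while the phone number is filled in from the lower-priority billing system.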

Content: Core Attributes of Master Data

The MDM process should attach all relevant attribution related to the core entity. It is possible to have MDM process only those attributes that are sourced from multiple systems. For those with a single source, no conflict exists to be resolved. Therefore, some attribution could flow directly through ETL processing to be attached to the related entity within the System of Record.

Structure: Source or Target Oriented

The MDM system will have its own internal data structures. These structures may be oriented towards the structure of the source files or the target System of Record.

Cleansing: Matching and Cleansing Processes

The matching process resolves records for the same entity across sources, while cleansing routines standardize and correct the attribute values as part of compiling the mastered copy.

Retention: Transitory / (History dependent on solution)

Only the current version of the data need be retained. There are MDM solutions available for purchase and some come with the capacity to store history. In some cases, this ability is essential to maintain a mastered copy.

Consumption: None / (Operational Systems if required)

There is no direct consumption of MDM, beyond the migration of the data into the System of Record through ETL. However, in some cases, the MDM component is a direct supplier of operational systems. This is acceptable, given that the process produces a “golden copy”. The need for this may be driven by a timing issue, or reducing the load on other sectors of the EDW.

Cleansing Process

Purpose: Processing from Single Sources

Here the data must be prepared for downstream consumption, applying any necessary de-duplication, matching, cleansing, rationalization of values, and other transformations.
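
These steps can be sketched as follows; the natural key, the code mapping, and the record layout are illustrative assumptions.

```python
# Single-source cleansing sketch: de-duplicate on a natural key and
# rationalize free-form values to standard domain codes.
STATUS_CODES = {"active": "A", "act": "A", "closed": "C", "cl": "C"}

def cleanse(rows):
    seen, out = set(), []
    for row in rows:
        key = row["account_no"]
        if key in seen:
            continue                      # drop duplicate records by key
        seen.add(key)
        row = dict(row)
        row["status"] = STATUS_CODES[row["status"].strip().lower()]
        out.append(row)
    return out

raw = [{"account_no": "001", "status": " Active"},
       {"account_no": "001", "status": "ACT"},      # duplicate record
       {"account_no": "002", "status": "closed"}]
print(cleanse(raw))
# [{'account_no': '001', 'status': 'A'}, {'account_no': '002', 'status': 'C'}]
```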

Content: Non-MDM attributes deemed of value for System of Record

All data that is to be migrated to the System of Record and that is not to be mastered through MDM must pass through this process.

Structure: Source / Target / Special Purpose

The ETL application will use the source and target tables from other layers. The ETL may also create temporary tables and files, along with logging and error-handling objects.

Cleansing: Cleansing Processes

This being the sole purpose of this component, the data will enter raw and exit transformed, ready for storage in the System of Record.

Retention: Transitory

Data retention is not applicable, except that tables holding ETL processing logic may be required, along with objects to handle surrogate key-cutting and the determination of delta records, which update existing records or insert new ones.
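
Delta determination with surrogate key-cutting can be sketched as follows; the natural key and record layout are illustrative assumptions.

```python
# Compare incoming records with the target by natural key, producing
# inserts (with a newly cut surrogate key) and updates (reusing the key).
def detect_deltas(incoming, target, next_key):
    inserts, updates = [], []
    by_nat_key = {row["account_no"]: row for row in target}
    for rec in incoming:
        existing = by_nat_key.get(rec["account_no"])
        if existing is None:
            inserts.append({**rec, "sk": next_key})  # cut a new surrogate key
            next_key += 1
        elif existing["balance"] != rec["balance"]:
            updates.append({**rec, "sk": existing["sk"]})
    return inserts, updates

target = [{"sk": 100, "account_no": "001", "balance": 50}]
incoming = [{"account_no": "001", "balance": 75},   # changed -> update
            {"account_no": "002", "balance": 10}]   # new -> insert
ins, upd = detect_deltas(incoming, target, next_key=101)
print(ins)  # [{'account_no': '002', 'balance': 10, 'sk': 101}]
print(upd)  # [{'account_no': '001', 'balance': 75, 'sk': 100}]
```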

Consumption: None

Only unit testers of the ETL processes will have any access to this area.

I welcome your comments.

The next article will look at the Information Warehousing Layer.