Reference Domains Part II: Modelling Classifications

This is the second article in the series on working with reference domains, also commonly referred to as classifications. In Part I, we looked at the nature of classifications. Here we will discuss designing structures to accommodate them in the EDW’s Information Warehousing layer, within the normalized System of Record.

Two potential approaches to designing the Reference Domains are to:

  1. Put each reference domain of values into its own table.
  2. Collect all reference domains into two tables: one for a master set of domains, and the other for all the values, with a column linking each value to its domain.

The second approach is the recommended one, and the remainder of this article presents the design and the rationale. I have seen the consolidated reference data approach implemented many times within the System of Record of the EDW, and it has proved highly effective.

Given this preferred approach, every Reference Domain and Value is to be placed in the following entities:

  • Reference Domain
  • Reference Value

As mentioned in Reference Domains Part I, reference domains are commonly made up of pairs of code and description attributes. However, some domains may carry additional attribution, such as short names, long names, multi-language names, etc. In such cases, consideration should be given to extending the model to accommodate additional fields. (This is not the same as related reference values that may be stored together in the source but should be separated in the EDW.)
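
The consolidated design can be sketched in SQLite as follows; the table and column names (REF_DOMAIN, REF_VALUE, and so on) are illustrative assumptions, not a prescribed physical model.

```python
# Minimal sketch of the two consolidated entities; names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE REF_DOMAIN (
    DOMAIN_ID   INTEGER PRIMARY KEY,
    DOMAIN_NAME TEXT NOT NULL UNIQUE      -- e.g. 'Party Status'
);
CREATE TABLE REF_VALUE (
    VALUE_ID    INTEGER PRIMARY KEY,
    DOMAIN_ID   INTEGER NOT NULL REFERENCES REF_DOMAIN(DOMAIN_ID),
    VALUE_CODE  TEXT NOT NULL,            -- the code...
    VALUE_DESC  TEXT NOT NULL,            -- ...and its description
    UNIQUE (DOMAIN_ID, VALUE_CODE)        -- codes unique within a domain
);
""")
conn.execute("INSERT INTO REF_DOMAIN VALUES (1, 'Party Status')")
conn.executemany("INSERT INTO REF_VALUE VALUES (?, ?, ?, ?)",
                 [(10, 1, 'A', 'Active'), (11, 1, 'C', 'Closed')])
rows = conn.execute(
    "SELECT v.VALUE_CODE, v.VALUE_DESC FROM REF_VALUE v "
    "JOIN REF_DOMAIN d ON d.DOMAIN_ID = v.DOMAIN_ID "
    "WHERE d.DOMAIN_NAME = 'Party Status' ORDER BY v.VALUE_CODE").fetchall()
print(rows)  # [('A', 'Active'), ('C', 'Closed')]
```

Every new domain is simply a new row in REF_DOMAIN plus rows in REF_VALUE, with no new objects to create.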

The domain values may be organized hierarchically. To support many-to-many relationships between domain values, in terms of hierarchies or other relationship types, the associative entity, Reference Value to Reference Value Relationship, should be used. As with all associatives, it will accommodate multiple concurrent as well as historical relationships.

To facilitate changes in values within a given scheme, a Reference Domain to Reference Value Relationship entity can be deployed.
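
A sketch of the Reference Value to Reference Value Relationship associative, with an assumed effective-dating pattern to hold concurrent and historical relationships; the names and columns are illustrative only.

```python
# Illustrative associative for value-to-value relationships (hierarchies etc.).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE REF_VALUE_TO_REF_VALUE_REL (
    PARENT_VALUE_ID INTEGER NOT NULL,
    CHILD_VALUE_ID  INTEGER NOT NULL,
    REL_TYPE        TEXT NOT NULL,        -- e.g. 'HIERARCHY'
    EFFECTIVE_DATE  TEXT NOT NULL,
    END_DATE        TEXT                  -- NULL = currently in effect
);
""")
# Value 20 rolls up to value 10 in the hierarchy from 2024-01-01.
conn.execute("INSERT INTO REF_VALUE_TO_REF_VALUE_REL "
             "VALUES (10, 20, 'HIERARCHY', '2024-01-01', NULL)")
current = conn.execute(
    "SELECT PARENT_VALUE_ID FROM REF_VALUE_TO_REF_VALUE_REL "
    "WHERE CHILD_VALUE_ID = 20 AND END_DATE IS NULL").fetchone()
print(current)  # (10,)
```

Because the relationship type is carried on the row, hierarchies and other relationship types can coexist in the one associative.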

Maintaining History

The history of values for a given scheme in relationship to other core concepts is to be retained within relevant associative structures.

For example, in the diagram below, the current value of Party Status Id and its effective date, called Party Status Change Date, are held on the Party entity. Historical values are held on the Party to Domain Relationship entity, with the Domain Scheme Id holding the identifier for the Party Status domain, and the Domain Value Id holding the identifier for the previous value.

To illustrate the example more fully, consider the following:

The Domain Scheme and Domain Value tables contain rows for the Party Status domain scheme.

Day 1

On day one, the party enters the system with a party status of “Active”. There are no entries in the Party to Domain Relationship table.

Day 2

On day two, the party status changes to “Closed”. The current value in Party is overwritten with the new value; the change date is updated; and the old value is inserted into the associative.
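
The day-two change can be sketched as follows; the in-memory structures and field names are illustrative stand-ins for the Party table and the Party to Domain Relationship associative.

```python
# Sketch of the history mechanism: overwrite the current value on Party,
# and insert the superseded value into the associative. Names are illustrative.
party = {"party_id": 1, "party_status_id": "ACTIVE",
         "party_status_change_date": "2024-01-01"}
party_to_domain_rel = []  # stands in for Party to Domain Relationship

def change_status(party, history, new_status, change_date,
                  domain_scheme_id="PARTY_STATUS"):
    # Move the old value into the associative before overwriting it.
    history.append({"party_id": party["party_id"],
                    "domain_scheme_id": domain_scheme_id,
                    "domain_value_id": party["party_status_id"],
                    "effective_date": party["party_status_change_date"],
                    "end_date": change_date})
    party["party_status_id"] = new_status
    party["party_status_change_date"] = change_date

change_status(party, party_to_domain_rel, "CLOSED", "2024-01-02")
print(party["party_status_id"])                   # CLOSED
print(party_to_domain_rel[0]["domain_value_id"])  # ACTIVE
```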

Design Benefits

The following points outline the benefits of employing a consolidated set of domain schemes and values within a small set of tables, versus creating a separate table for each domain scheme’s set of values.

  1. Data Integrity: The integrity of the allowable set of values within a given domain is maintained through the application logic defined and implemented in the ETL. Maintaining a master list reduces the chance of inconsistencies or the appearance of anomalies.
  2. Flexibility: There is only one table to access for all domain value codes and descriptions within the SOR.
  3. Implementation Efficiency: No additional database entities or tables need to be created, tested or maintained within the System of Record (SOR) logical or physical model. Fewer objects mean fewer chances for error.
  4. Operational Efficiency: A single utility can be created to interface with this table. It is true that even with multiple tables a single utility could be created with minimal logic. However, each new object would require some changes to the application, whereas the consolidated version can be extended seamlessly.
  5. Consistency: History is stored in the classification associative tables (e.g., Party to Domain Relationship). Storing the domain schemes and values as surrogate keys in the Domain Value entity supports these structures, allowing history to be gathered through a common mechanism that explicitly identifies both the scheme and the value.

In part three, we will go on to look at collecting and documenting classifications.

EDW Reference Architecture: Information Provisioning Layer

This is the fifth in a series introduced in EDW Reference Architecture: Why Bother?. Others in the series have looked at the Data Acquisition, Data Integration, and Information Warehousing layers.

The Information Provisioning Layer is designed for ease of navigation and optimized for retrieval performance for business intelligence and other consumers.

Data Mart

Purpose: Meeting Specific Needs

This sector is intended as a platform from which to meet specific information requirements. Where possible, the objects will be designed for re-use. However, the guiding principle is the effective servicing of business needs, and to this end, the solution’s primary focus is on the given project’s requirement.

Content: Golden Copies Plus

The primary source of content in this sector is the Information Warehousing layer: trusted and atomic data that can be used to populate the tables in the Data Marts, as well as to generate aggregated and computed measures and derived dimensional attributes. In addition, it may be expedient to combine this information with data from other sources that has not passed through the process of integration into the EDW. This should be permitted on an exception basis, with governance controls and a clear demarcation of the data’s lineage.

Structure: Denormalized

A common approach to data architecture in Data Marts is the “Star Schema”. This lends itself to ease of use and efficient performance. It also supports a high degree of re-use of objects, with “conformed” dimensions, to borrow a term from Mr. Ralph Kimball, being shared by a variety of users. Applying views on top of the tables provides a mechanism for security and a way to filter on rows and columns for specific use cases. This sector is not limited to the “Star Schema” design, and should employ structures that are “fit for purpose” wherever greater efficiencies might be attained. This includes the option to apply views over tables from the Information Warehousing layer, in cases where performance is not a concern or the information resides within the Interactive Sector.
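
A minimal SQLite sketch of using a view to restrict rows and columns over a mart table; all object names here are assumptions for illustration.

```python
# A view that exposes only one region's rows and hides a sensitive column.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SALES_FACT (
    REGION TEXT, PRODUCT TEXT, AMOUNT REAL, COST REAL
);
-- Expose only the EAST region, and omit the COST column.
CREATE VIEW SALES_EAST_V AS
    SELECT PRODUCT, AMOUNT FROM SALES_FACT WHERE REGION = 'EAST';
""")
conn.executemany("INSERT INTO SALES_FACT VALUES (?, ?, ?, ?)",
                 [('EAST', 'Widget', 100.0, 60.0),
                  ('WEST', 'Widget', 200.0, 120.0)])
rows = conn.execute("SELECT * FROM SALES_EAST_V").fetchall()
print(rows)  # [('Widget', 100.0)]
```

Granting consumers access to the view rather than the underlying table enforces both the row filter and the column restriction.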

Cleansing: Cleansed

The Data Mart draws information primarily from a single trusted source; the ETL processes loading the Data Mart do not cleanse the data, although they will perform transformations to create derived values and aggregations. External sources being combined with Data Mart tables should not require cleansing. If they do, they should pass through a process of analysis and profiling to identify and remediate any anomalies. If the data is atomic, consideration should be given to making it a standard source input, subject to the full process of integration through the Information Warehousing layer.

Retention: Medium-term / Business-determined

The length of the retention period will depend on the business need rather than any regulatory requirement. If only current views of dimensions are required, the dimensions may be type one (i.e., changes are applied as updates to existing rows).
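
A type-one change can be sketched as a simple in-place update, with no history retained; names are illustrative and SQLite is used for brevity.

```python
# Type-one dimension change: the new value overwrites the old row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMER_DIM "
             "(CUST_KEY INTEGER PRIMARY KEY, SEGMENT TEXT)")
conn.execute("INSERT INTO CUSTOMER_DIM VALUES (1, 'Retail')")
# The customer's segment changes: apply it as an update, losing the old value.
conn.execute("UPDATE CUSTOMER_DIM SET SEGMENT = 'Commercial' WHERE CUST_KEY = 1")
segment = conn.execute(
    "SELECT SEGMENT FROM CUSTOMER_DIM WHERE CUST_KEY = 1").fetchone()[0]
print(segment)  # Commercial
```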

Consumption: Open

The information held in the Data Marts is intended for consumption, subject to privacy concerns and security constraints. Information will be accessed through business intelligence applications and other downstream systems. Appropriate levels of governance will be enforced for all information retrieval.

I welcome your comments.

As a final note on EDW Reference Architecture, the next article will discuss the Metadata Repository.

EDW Reference Architecture: Information Warehousing Layer

This is the fourth article in a series introduced in EDW Reference Architecture: Why Bother?. Other articles have looked at the Data Acquisition and Data Integration layers.

The Information Warehousing layer is designed as a normalized repository for all the integrated and mastered data. The primary purpose of this area is to organize the data at an atomic level in a state that is cleansed, mastered and well-defined. The discipline of normalization imposes a strict order on the data that promotes data integrity and retains a high degree of flexibility. The relatively complex structures are not conducive to ease of navigation, and are tuned for load performance rather than data access.

System of Record

Purpose: Storage of Single Trusted Source

As the name suggests, the system of record is designed to provide a single trusted source for all downstream consumption. Its focus is on flexibility, storage and load performance.

Content: Integrated / Atomic

This area contains the lowest level of detail available, with minimal redundancy; this includes avoiding the repetition of information in derived or computed values. The intention is to retain here all component fields that would be used to derive other values. Ideally, each piece of information exists only once within the system of record, although some controlled level of redundancy may be included for convenience (e.g., the primary identifier for an Involved Party included on both the fundamental Involved Party table and the Involved Party Identifier attributive table).
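
The controlled-redundancy example can be sketched as follows, with illustrative names: the primary identifier sits on the fundamental table for convenience and appears again among the rows of the attributive table.

```python
# Fundamental table plus attributive identifier table; names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE INVOLVED_PARTY (
    PARTY_ID      INTEGER PRIMARY KEY,
    PRIMARY_IDENT TEXT NOT NULL          -- convenience copy of one identifier
);
CREATE TABLE INVOLVED_PARTY_IDENTIFIER (
    PARTY_ID    INTEGER REFERENCES INVOLVED_PARTY(PARTY_ID),
    IDENT_TYPE  TEXT NOT NULL,           -- e.g. 'CUST_NO', 'SSN'
    IDENT_VALUE TEXT NOT NULL
);
""")
conn.execute("INSERT INTO INVOLVED_PARTY VALUES (1, 'CUST-001')")
conn.executemany("INSERT INTO INVOLVED_PARTY_IDENTIFIER VALUES (?, ?, ?)",
                 [(1, 'CUST_NO', 'CUST-001'), (1, 'SSN', '999-99-9999')])
idents = conn.execute(
    "SELECT IDENT_TYPE, IDENT_VALUE FROM INVOLVED_PARTY_IDENTIFIER "
    "WHERE PARTY_ID = 1 ORDER BY IDENT_TYPE").fetchall()
print(idents)  # [('CUST_NO', 'CUST-001'), ('SSN', '999-99-9999')]
```

New identifier types become new rows in the attributive table, not new columns.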

Structure: Normalized

The principles of normalization isolate each distinct group of information, minimizing repetition and ensuring that what is stored together belongs logically together. The modelling directive is to model according to the “essence” of the entity rather than its “use”. Practically, this means that tables contain columns closely related to a single fundamental concept. The business function of a given piece of data is revealed through its relationship to other entities, including role-based entities. Through the use of relationship and attributive tables, new data elements can be added to the System of Record without requiring additional structures.

Cleansing: Mastered and Cleansed

The system of record is the target of the data that has passed through the Data Integration Layer. No data should be loaded into the System of Record that has not passed through the process of being mastered and cleansed. It is critical to user confidence that this area remain closely governed and that its integrity not be compromised by co-locating external data without applying the discipline of the acquisition and integration layers.

Retention: Medium-term / Selective / Regulation-determined

The normalized structures permit the storage of all data elements. Most data elements can have history applied at the attribute level, recording the change in a single attribute without needing to repeat the other values contained on the same row. This means that history can be applied more selectively, with some fields being overwritten without storing the original value. For long-term storage, beyond seven years, data will be moved to a separate archival storage platform. The retention period may be driven by regulatory requirements, so the data remains accessible for auditing; although regulatory needs may be met through archival storage.

Consumption: Limited

As has been stated, the primary function of the system of record is to remain optimized for storage and open to receiving new data sources. However, access to the system of record should not be barred to all users. The normalized structures offer a flexible path through the data, with multiple perspectives retained. The selective history approach provides greater visibility to changes in data values than denormalized tables. These structural advantages should be exploited.

It is important that this access not be allowed to compromise the integrity of the system. Design decisions should not be influenced by considerations of query performance. Should it become clear that usage of the normalized structures for data analysis is of sufficient value to users, consideration should be given to creating a Deep Analytics Sector.

Interactive Sector

Purpose: User-maintained Information

The Interactive Sector provides an area in which users can create and maintain authoritative data, directly feeding the System of Record. The content of this sector can be viewed as an extension of the System of Record as well as a source to enhance the Information Provisioning Layer, either through joins or through ETL migration to tables within that layer.

There are many circumstances in which users need to be able to set up and control sets of values, such as banding (e.g., age demographic ranges) or domains of values for lookup tables that may be relatively dynamic. In many organizations, these may have been managed through spreadsheets or local databases; a mature solution will see them implemented through a centralized and technologically advanced approach.

Content: Integrated / Atomic and Derived

Although the data here is user-generated, it resides within the Information Warehousing layer and is designed and quality-controlled in such a way as to be integrated. This means that the applications created to insert and update data in its tables will enforce referential integrity against data within the System of Record, and make use of its domain schemes and values. The content is to be primarily atomic, although some data may be derived. The intent is to enable and govern user-maintained data, rather than constrain it.

Structure: Normalized / Denormalized

The data structures of the Interactive Sector will mirror the System of Record wherever possible. In cases where user data will be linked directly to reporting tables, a denormalized design may be adopted. The goal is to be agile and accommodating without compromising the reliability of the information being provisioned. The guiding principle should be that the users are creating the “golden copy” of the data they are generating, and therefore it should conform as much as possible to the design of the System of Record.

Cleansing: Cleansed

It is essential that the application underlying the user interface enforce integrity controls on all data being entered and maintained. The Interactive Sector is being defined as sitting within the Information Warehousing layer, a zone in which all data must be scrubbed clean, rationalized and consolidated into a single trusted source. If the user interface lacks such controls, it should be treated as a standard source and be passed through the Staging area and subject to the processes of analysis, profiling and ETL transformations.

Retention: Medium-term

As per the System of Record, this user-maintained sector tracks the history of changes as required, and stores it for one to seven years, as appropriate. In DW2.0, Mr. Inmon suggests that the interactive sector store data for the short-term only, and be retained for a longer time in the Integrated Sector (System of Record). This might make sense for your organization, but the industry is moving towards migrating data as little as possible, and it may be more efficient to hold onto this data and access it where it is generated.  

Consumption: Open

The Interactive Sector data is intended to be combined with the Information Provisioning layer for business intelligence and extracts for downstream use. The data may be subject to security restrictions, but is otherwise open to all consumers from the Information Delivery layer.

I welcome your comments.

The next article will look at the Information Provisioning layer.

EDW Reference Architecture: Data Integration Layer

This is the third part of the series, introduced in EDW Reference Architecture: Why Bother? and continued in a piece about the Data Acquisition Layer.

The Integration Layer marks the transition from raw data to integrated data; that is, data that has been consolidated and rationalized, with duplicate records and values removed and disparate sources combined into a single version. This layer represents the passage of the data through the process of integration, rather than a storage area for the data. For data with multiple sources, a mastering process is required; this is termed Master Data Management. Data from a single source may still require significant processing for de-duplication and cleansing, and this will be performed within the ETL application or processes.

Where the process of “mastering” disparate sources is not required, the Master Data Management process is bypassed. Instead the data is processed by ETL, undergoing systematic de-duplication, error-detection and cleansing routines.

Master Data Management

Purpose: Processing from Multiple Sources

Master Data Management is the process by which data from different sources is matched and processed to compile a single “golden copy” of a given entity.
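
A toy sketch of compiling a “golden copy” from matched records follows; the source-priority survivorship rule and the field names are assumptions for illustration, and real MDM products apply far richer matching and survivorship logic.

```python
# Survivorship sketch: for each attribute, keep the highest-priority
# non-null value across the matched source records.
SOURCE_PRIORITY = ["CRM", "BILLING"]  # assumed rule: CRM wins conflicts

def golden_copy(records):
    # Order records so higher-priority sources contribute first.
    records = sorted(records, key=lambda r: SOURCE_PRIORITY.index(r["source"]))
    merged = {}
    for rec in records:
        for field, value in rec.items():
            if field != "source" and value is not None:
                merged.setdefault(field, value)  # first (best) non-null wins
    return merged

matched = [
    {"source": "BILLING", "party_id": 42, "name": "J. Smith", "phone": "555-0100"},
    {"source": "CRM",     "party_id": 42, "name": "Jane Smith", "phone": None},
]
print(golden_copy(matched))
# {'party_id': 42, 'name': 'Jane Smith', 'phone': '555-0100'}
```

The name survives from the preferred CRM source, while the phone number is filled in from the lower-priority billing system.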

Content: Core Attributes of Master Data

The MDM process should attach all relevant attribution related to the core entity. It is possible to have MDM process only those attributes that are sourced from multiple systems. For those with a single source, no conflict exists to be resolved. Therefore, some attribution could flow directly through ETL processing to be attached to the related entity within the System of Record.

Structure: Source or Target Oriented

The MDM system will have its own internal data structures. These structures may be oriented towards the structure of the source files or the target System of Record.

Cleansing: Matching and Cleansing Processes

The matching process resolves records for the same entity across sources, while cleansing routines standardize and correct the attribute values as part of compiling the mastered copy.

Retention: Transitory / (History dependent on solution)

Only the current version of the data need be retained. There are MDM solutions available for purchase and some come with the capacity to store history. In some cases, this ability is essential to maintain a mastered copy.

Consumption: None / (Operational Systems if required)

There is no direct consumption of MDM, beyond the migration of the data into the System of Record through ETL. However, in some cases, the MDM component is a direct supplier of operational systems. This is acceptable, given that the process produces a “golden copy”. The need for this may be driven by a timing issue, or reducing the load on other sectors of the EDW.

Cleansing Process

Purpose: Processing from Single Sources

Here the data must be prepared for downstream consumption, applying any necessary de-duplication, matching, cleansing, rationalization of values, and other transformations.
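
These steps can be sketched as follows; the natural key, the code mapping, and the record layout are illustrative assumptions.

```python
# Single-source cleansing sketch: de-duplicate on a natural key and
# rationalize free-form values to standard domain codes.
STATUS_CODES = {"active": "A", "act": "A", "closed": "C", "cl": "C"}

def cleanse(rows):
    seen, out = set(), []
    for row in rows:
        key = row["account_no"]
        if key in seen:
            continue                      # drop duplicate records by key
        seen.add(key)
        row = dict(row)
        row["status"] = STATUS_CODES[row["status"].strip().lower()]
        out.append(row)
    return out

raw = [{"account_no": "001", "status": " Active"},
       {"account_no": "001", "status": "ACT"},      # duplicate record
       {"account_no": "002", "status": "closed"}]
print(cleanse(raw))
# [{'account_no': '001', 'status': 'A'}, {'account_no': '002', 'status': 'C'}]
```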

Content: Non-MDM attributes deemed of value for System of Record

All data that is to be migrated to the System of Record and that is not to be mastered through MDM must pass through this process.

Structure: Source / Target / Special Purpose

The ETL application will use the source and target tables from other layers. The ETL may also create temporary tables and files, along with logging and error-handling objects.

Cleansing: Cleansing Processes

This being the sole purpose of this component, the data will enter raw and exit transformed, ready for storage in the System of Record.

Retention: Transitory

Data retention is not applicable, except that tables holding ETL processing logic may be required, along with objects to handle surrogate key-cutting and the determination of delta records, which update existing records or insert new ones.
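
Delta determination with surrogate key-cutting can be sketched as follows; the natural key and record layout are illustrative assumptions.

```python
# Compare incoming records with the target by natural key, producing
# inserts (with a newly cut surrogate key) and updates (reusing the key).
def detect_deltas(incoming, target, next_key):
    inserts, updates = [], []
    by_nat_key = {row["account_no"]: row for row in target}
    for rec in incoming:
        existing = by_nat_key.get(rec["account_no"])
        if existing is None:
            inserts.append({**rec, "sk": next_key})  # cut a new surrogate key
            next_key += 1
        elif existing["balance"] != rec["balance"]:
            updates.append({**rec, "sk": existing["sk"]})
    return inserts, updates

target = [{"sk": 100, "account_no": "001", "balance": 50}]
incoming = [{"account_no": "001", "balance": 75},   # changed -> update
            {"account_no": "002", "balance": 10}]   # new -> insert
ins, upd = detect_deltas(incoming, target, next_key=101)
print(ins)  # [{'account_no': '002', 'balance': 10, 'sk': 101}]
print(upd)  # [{'account_no': '001', 'balance': 75, 'sk': 100}]
```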

Consumption: None

Only unit testers of the ETL processes will have any access to this area.

I welcome your comments.

The next article will look at the Information Warehousing Layer.