In Defense of Surrogate Keys

Employing surrogate keys is an essential strategy to protect the integrity of data assets in the EDW. For those looking to defend such a position against those who would prefer human-readable business keys (a.k.a. natural keys), this article lays out some arguments. I hope by the end the non-believers will be converted, and those who already know the value of surrogates can move ahead without opposition.

Michelle A. Poolet has provided four criteria to determine whether a surrogate key is required:

1. Is the primary key unique?

Where a natural key is not unique, such as a person’s full name, a surrogate key must be used. However, a unique (often composite) alternate key will still be needed to match records for updates, to attach attribution or to establish relationships (a sketch of this pattern appears after this list).

2. Does it apply to all rows?

Take Customer as an example. Customers may all come from the same source and be loaded into a Party table. However, parties that play different roles, and that are loaded from different source systems, may have natural keys of different formats (e.g., company business number, vendor identifier).

3. Is it minimal?

Composite keys and long descriptive fields (e.g., a postal address) are obvious candidates for replacement with surrogate keys.

4. Is it stable over time?

Identifiers generated by a Master Data Management (MDM) solution are designed not to change over time. However, if the MDM solution were to change technology, or some sources were to be passed into the EDW through a different channel, the MDM key would become problematic. It is preferable for surrogate keys to be generated internally within the Information Warehousing layer.
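
As a minimal sketch of the pattern described under the first criterion, the SQL below shows a table keyed by an internally generated surrogate, with the natural (business) key retained as a unique alternate key for matching during loads. The table and column names are illustrative only, and the key-generation mechanism (identity column, sequence or ETL-assigned value) varies by platform.

    -- Illustrative only: surrogate primary key, with the natural key
    -- retained as a unique alternate key for matching and updates.
    CREATE TABLE party (
        party_key        INTEGER GENERATED ALWAYS AS IDENTITY,  -- surrogate key
        source_system_cd VARCHAR(10)  NOT NULL,                 -- part of the natural key
        source_party_id  VARCHAR(50)  NOT NULL,                 -- identifier from the source
        party_nm         VARCHAR(200),
        CONSTRAINT pk_party PRIMARY KEY (party_key),
        CONSTRAINT ak_party UNIQUE (source_system_cd, source_party_id)
    );

    -- Downstream tables carry the small integer surrogate,
    -- not the composite natural key.
    CREATE TABLE party_address (
        party_key   INTEGER NOT NULL REFERENCES party (party_key),
        address_txt VARCHAR(500)
    );

Loads match incoming records on the alternate key (source_system_cd, source_party_id) and then carry only party_key into related tables.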

Rationale for preferring Surrogate Keys over Natural Keys:

  1. Primary and Foreign Key Size
     Natural key: keys and their indexes can be large, depending on the format and components involved in making a record unique.
     Surrogate key: usually a single column of type integer, making it as small as possible.
  2. Optionality and Applicability
     Natural key: population of the key may not be enforced in the source data.
     Surrogate key: can be attached to any record.
  3. Uniqueness
     Natural key: always carries the possibility of contention; even where a single source has protected against it, contention may exist across sources.
     Surrogate key: values are guaranteed to be unique.
  4. Privacy
     Natural key: may be prone to privacy or security concerns.
     Surrogate key: a cryptic value with no intrinsic meaning.
  5. Cascading Updates
     Natural key: if the business or source changes the format or datatype of the key, it will need to be updated everywhere it is used.
     Surrogate key: values never change.
  6. Load Efficiencies
     Natural key: with the exception of very short text values, VARCHAR values are less efficient to load than INTEGER.
     Surrogate key: values can be loaded more efficiently.
  7. Join Efficiencies
     Natural key: with the exception of very short text values, joins using VARCHAR values are less efficient than INTEGER.
     Surrogate key: values can be joined more efficiently.
  8. Common Key Strategy
     Natural key: the lack of a common key strategy leads to a longer decision-making process and more patterns for code development.
     Surrogate key: a common key strategy based on surrogates streamlines design and development.
  9. Relationships
     Natural key: keys with different data types lead to the creation of more objects in the database to accommodate the associations between entities.
     Surrogate key: promotes reuse of associative entities for multiple types of relationships.
  10. History
     Natural key: related to the points above, the use of natural keys inhibits the flexibility and reuse of objects and patterns to support change capture.
     Surrogate key: enables a strategic design to store history using associative entities (sketched below).

The list above is drawn in part from an article by Lee Richardson, discussing the pros and cons of surrogate keys.
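
To illustrate points 9 and 10, here is a minimal sketch, with hypothetical names, of a single associative entity that can hold many types of relationships between parties, with effective dating to capture history. It assumes a Party table keyed by an integer surrogate, such as the one sketched earlier.

    -- Illustrative associative entity: because every party is identified by the
    -- same integer surrogate, one structure can record customer-of, supplier-to,
    -- subsidiary-of and other relationship types, with history via effective dates.
    CREATE TABLE party_relationship (
        party_relationship_key INTEGER GENERATED ALWAYS AS IDENTITY,
        from_party_key         INTEGER NOT NULL,       -- surrogate key of one party
        to_party_key           INTEGER NOT NULL,       -- surrogate key of the related party
        relationship_type_cd   VARCHAR(20) NOT NULL,   -- e.g., 'CUSTOMER_OF', 'SUPPLIER_TO'
        effective_from_dt      DATE NOT NULL,
        effective_to_dt        DATE,                   -- null while the relationship is current
        CONSTRAINT pk_party_relationship PRIMARY KEY (party_relationship_key)
        -- foreign keys to party (party_key) would normally be declared on both party columns
    );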

In summary, the reasons to employ surrogate keys as a design policy relate to two factors:

1. Performance

In the majority of cases the surrogate key will perform more efficiently for both loading and joining tables.

2. Productivity

Even in the minority of cases where a natural key would perform as well as a surrogate, a consistent surrogate key policy yields efficiencies in design, development and maintenance.

EDW Roles and Responsibilities: Part I

In the next couple of articles, we look at the organizational structure of the EDW development team. This needs to be considered in the context of the development life cycle and areas of governance that I have discussed elsewhere, with particular reference to Release Management.

A few things to note while reading this list:

  • Technical infrastructure is not included – this is taken as a given. In my experience, the hardware and software are set up and maintained by a different group, who are not usually involved directly in any design or development decisions. Of course, they are involved when it comes to the underlying technical architecture of the overall solution, and the infrastructure needs to support the flow of data through the warehouse layers.
  • The organization tree structure diagram above is conceptual, and intended to represent a generic organization. Some organizations are larger and have many resources, internal and external, within any branch. Others have a small core team that does all the development, with a single person handling many roles.
  • The main thing to take away from the diagram is that the EDW Program team perform the analysis, data modelling, ETL and QA of their work, while the project streams gather requirements, and specify and build reports. The EDW Council is a body of higher authority that can arbitrate design decisions and provide governance of the process.

Program/Release Manager

The program manager is responsible for the overall management of the EDW release. This includes the fulfillment of all data requirements, adjusted after identification and analysis of the sources; as well as delivery of the validated, repeatable processes to populate the target databases.

Throughout the release life cycle, the program manager is accountable for all the major milestone deliverables that sit within the responsibility of the EDW development team. This excludes only the detailing of requirements for each project stream and the development of the business intelligence solution that is built on top of the data mart structures.

The program manager must be part of the intake process to determine the scope of each release. To this end, he/she is consulted on the definition of each project’s requirements.

While the program manager is not directly involved in, or responsible for, business intelligence, he/she is kept informed of BI progress to remain responsive to needs that may arise.

Project Manager

The project manager of each stream is directly accountable for the detailed definition of requirements feeding into the EDW, as well as the documentation of BI specifications and the delivery of BI reporting that will draw from the data mart.

The project manager will not be involved in the development of the EDW, but will remain informed of its progress. Early in the process, it is particularly important that the project manager be apprised of data quality issues that arise through identification and analysis of sources. The system of record is not relevant to an individual project, while the data mart will be a critical aspect of the project’s success; and so the project manager is informed of major milestones in the development of the data mart.

Business Analyst

The business analyst is responsible for compiling the detailed definition of requirements. It is also the business analyst’s responsibility to identify the sources that will fulfill those requirements. It is to be expected that not all analysts will have the business knowledge to complete the first task as well as the technical knowledge to complete the second. This is a case where a single role could be performed by different people.

The business analyst supports processes that involve, or require input from, analysis. This means that he/she may be actively involved in the analysis of profile results and in writing SQL queries to interrogate the source data. He/she will find, amend and compose business definitions of fields. When it comes time to map sources to targets, the analyst will be engaged to communicate his/her understanding to the person writing the document. It may even be the case that the analyst’s skills in documentation mean that he/she does the writing.

Data content governance involves insight into the nature and quality of the data. To this end, the business analyst supports this governance process. The analyst is also consulted on the information and reporting requirements to determine suitable structures for the data mart.

Data Steward

The data steward is responsible for governing what data will be brought into the EDW, the governance of definitions for each of the data elements, and the identification and definition of business rules to ensure data quality. This means that, while the business analyst will see that definitions are attached to each field, the data steward must ensure that the definitions are complete and accurate.

The data steward supports processes that depend on an understanding of the business meaning of data. This includes the definition of requirements, the identification and analysis of sources, and the mapping of those sources to the system of record.

The data steward also supports naming standards, although, because each steward is attached to a given subject area domain, or set of domains, they are not responsible for setting or maintaining those standards.

Data Quality Analyst

The data quality analyst is responsible for assessing the quality of data at various layers. Initially, the quality analyst is engaged in running the data profiling functions and any interrogation queries. While the data analyst may draft queries, it is ultimately the responsibility of the quality analyst to ensure that the analyst’s questions get answered. Later in the development life cycle, the data quality analyst is responsible for testing the data populating the system of record and the data mart.

The quality analyst enables data analysis, and so supports the governance of data content in assessing the integrity and suitability of the sources.

The data quality analyst may be consulted on questions arising from mapping sources to the system of record. They may offer technical insights or be able to perform additional quality tests to support the mapping exercise.

The requirements and source fields are an input to the quality analyst’s activities, and so he/she is informed of the results of these processes.

Enterprise Data Architect

The enterprise data architect is responsible for the governance of data architecture, including the establishment of principles and guidelines, as well as reviews and final sign-off of the data models. He/she is also responsible for establishing guidelines for logical and physical naming standards.

The enterprise data architect is consulted on all data modelling activities, relating to both the system of record and the data mart. Data models for any other component of the architecture will also pass through enterprise architecture for consultation.

The governance of data content relates to the placement of relevant data within the context of the reference architecture. The governance process involves the consultation of the enterprise data architect.

The requirements and source fields are an input to the enterprise data architect’s activities, and so he/she is informed of the results of these processes.

Data Modeller

The data modeller is responsible for all processes that are directly related to data model artifacts, including the logical mapping of the source to the system of record, logical modelling and physical modelling.

The data modeller supports processes that modelling depends on, including the analysis of the sources through profiling and interrogation, as well as processes that depend on the models, such as the documentation of the physical mappings to populate both the system of record and the data mart.

The governance of business definitions, naming standards and data architecture all pertain directly to the data models; and so the data modeller supports these processes, by ensuring that all fields contain complete, well-formed and accurate definitions and conform to standard naming conventions, and that the structure aligns with the design standards of the given component within the reference architecture.

The data modeller is informed of requirements and source fields, as these are inputs to the process of designing data structures.

Part II will complete the list of EDW Roles and Responsibilities.

Measuring Success

When measuring success, the notion of what value is being offered to the business is always raised. This is all well and good, and should be assessed. However, this is done as part of the business rationale given in the requirements for a project, or even earlier, as part of the project charter. If the requirements of a given project are met, then it follows that the value of that project’s goals is realized. But that doesn’t give the EDW the ability to measure itself. How long is it taking to deliver projects? How do you quantify how much is being delivered? Is the work being done with sufficient accuracy and efficiency? Once it is delivered, is it being used?

EDW management needs a set of measures to determine its own performance, in terms of its qualitative capabilities as well as quantitative measures of data quality, process quality and utilization of the system.

Here is a list of potential measures of the EDW; measures that can continue to be applied beyond initial delivery to monitor ongoing performance. Success criteria are broken into capabilities and measures, with the measures further divided into the following sections:

  • Data Quality – measuring accuracy
  • Process Quality – measuring efficiency
  • Utilization – measuring productivity

Capabilities

  • Data Quality Assessment
    The ability to assess the accuracy and integrity of data. Data profiling and interrogation capabilities are acquired by setting up a suitable staging area with profiling and SQL query tools.
  • Data Lineage
    The ability to track the source of data elements used in reports and found in the various layers of the EDW; as well as make visible the transformations that may be applied along the way.
  • Impact Analysis
    The ability to assess the impact of a change in a source data element to EDW objects and processes such as target fields, ETL processes and reports.
  • Model Validation
    The ability to assess the validity and completeness of data models with regards to information and reporting requirements.
  • Release Management
    The ability to assess the extent to which requirements have been met and at what stage of development each required data element sits; as well as identify overlapping requirements across projects.

Data Quality Measures

  • Missing Values Count
    The number of fields in which the proportion of null values exceeds an acceptable threshold.
  • Single Values Count
    The number of fields that contain only a single non-null value where more than one distinct value is expected.
  • Zero Values Count
    The number of numeric fields in which the proportion of zero or null values exceeds an acceptable threshold.
  • Special Characters Count
    The number of alphanumeric fields containing characters from a defined set of special characters (e.g., #, $, &).
  • Invalid Values Count
    The number of fields with values that are outside the expected range. For classifications, this can be automated with a cross-reference of expected and actual values.
  • Invalid Dates Count
    The number of date fields with invalid dates, either malformed, inconsistently formatted or outside the expected range.
  • Multiple Domains Count
    The number of fields that contain multiple domains of values; either due to lack of rationalization or overloading.
  • Aberrant Records Count
    The number of records that are rejected, for any reason, from the source as they are processed.
  • Misidentified Fields Count
    The number of fields whose name and/or definition does not accurately describe their content.
  • Orphan Records Count
    The number of records that lack expected relationships to other entities (e.g., Positions without Securities, Securities without Issuers, Transactions without Stores). An example query for this measure appears after the list.
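
By way of illustration, here is a hedged sketch of how two of these measures might be computed in SQL. The table and column names (position, security, customer, marital_status_ref) are assumptions for the example only, and in practice the results would be captured per load cycle in a metrics table.

    -- Orphan Records Count: positions whose security is missing from the security table.
    SELECT COUNT(*) AS orphan_position_cnt
    FROM   position p
           LEFT JOIN security s ON s.security_id = p.security_id
    WHERE  s.security_id IS NULL;

    -- Invalid Values Count: rows whose classification code is not in the expected domain,
    -- using a cross-reference of expected values.
    SELECT COUNT(*) AS invalid_marital_status_cnt
    FROM   customer c
    WHERE  c.marital_status_cd IS NOT NULL
      AND  c.marital_status_cd NOT IN (SELECT status_cd FROM marital_status_ref);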

Process Quality Measures

  • Malformed Definitions Count
    The number of fields whose definition is blank or contains only the logical name as a description.
  • Non-standard Names Count
    The number of tables and fields with names that contain components not found in the list of standard abbreviations.
  • Source Mappings Count
    The number of required source fields with no mapping to target tables.
  • Target Mappings Count
    The number of target fields with no mapping to source fields where a source is expected.
  • Productivity Count
    The number of fields added to, and populated in, the System of Record and Data Mart each month. The other EDW component sectors can be tracked in this way as well.
  • ETL Performance Measure
    The length of time it takes to load the System of Record and Data Mart. (Use number of records, tables and columns being processed to determine a relative measure.)
  • Rework Count
    The number of issue tickets generated. These may be further broken down by type and severity.

Utilization Measures

  • Reports Count
    The number of reports in active use each period.
  • Users Count
    The number of users running reports each period.
  • Access Count
    The number of access requests (through business intelligence tools) by table each period.

Data Content Governance

In the overview of EDW development, data content governance is identified as one of the core processes.

The governance of data content should be included early in the development life cycle. Data stewards, responsible for defining and maintaining the data, should support the gathering of requirements. As a checkpoint for data selection, governance must be applied to the identification of sources. The analysis of those sources through profiling and interrogation must be accountable to governance of data quality.

While the Data Steward is responsible for the governance of data content, he/she is accountable to the EDW Program Manager, who is ultimately responsible for release management. The Data Steward receives support from Business Analysts and Data Quality Analysts and consults with the Enterprise Data Architect for compliance with the reference architecture.

In order to decide which source data will be brought into the EDW, a number of criteria apply:

  1. Authority
  2. Relevance
  3. Consistency
  4. Accuracy

Authority

Only the most authoritative sources should be brought into the EDW, and redundancy should be minimized. If the vision for the EDW is to create a single version of the truth, that version needs to represent the master copy of the data. For example, where banking transactions have been sourced, the transaction file is not an appropriate place to draw account information, beyond identifying which account is involved. The transaction file is an authoritative source of the transaction, but not for branch, account or customer details.

There are two main risks in drawing data from a non-authoritative source. (1) A better source may be identified later, which would lead to rework of mappings and ETL, and potentially redesign of data structures. (2) Inconsistencies or gaps may appear if data is sourced at an inappropriate granularity. For example, branch information taken from a transaction file will not supply information for branches not involved in any transactions, and may introduce multiple values where a single one is expected, such as branch type. Conversely, it will not be possible to check the integrity of the transactions against the master list of branches.

I have seen situations where due diligence was not done to identify master sources at the beginning of the project; and this caused significant revisions to the system when a new source was introduced. Likewise, expediency sometimes leads to designing and processing transactional sources before dimensional ones; such as sales before stores, or trades before securities. This is possible but, as I’ve suggested, problematic.

Relevance

Whether the project requirements are driven by specific reports or a given subject area, there is a tendency to want to reach beyond what is immediately requested and pull in all the data elements from the source. At first blush, this seems a reasonable course of action. Obtaining data extracts from sources, internal and external, can involve negotiating agreements that can be time-consuming and costly. However, what is often not appreciated is the expense, and additional time, that is incurred in passing each of those data elements through the development process. It takes time to do the analysis, design a structure, build the ETL and then test the result of each and every column. (Some of these efforts are detailed in Providing Data Modelling Time Estimates).

Given this concern, it is imperative that EDW intake be governed by what will add value to the business. Where the requirements are based on reports, it is not necessary to be restricted only to the reporting components; but the ad hoc reporting needs of users should be compiled to identify additional high value attributes to include. Where it is a subject area being sourced, input from the business is essential for the business intelligence solution to be effective. The adoption of a Data Retention Sector, or a similar holding area for raw data, will act as insurance against leaving out something important.

Also to be considered is that different types of content belong in different components of the reference architecture. The warehousing layer should hold atomic data, while the provisioning layer can hold derived values, including concatenations and calculations, and aggregates. System maintenance data and fields used in the source’s internal processes have no easy home in the EDW. If any such data is seen as essential, consider leaving it, too, in the Data Retention Sector.

Consistency

Where entities play multiple roles, matching logic should be determined so they can be uniquely identified. For example, if a company is both a customer and a supplier, it would be valuable to have a single unique identifier to view all of its activities in relation to the organization. In a normalized model, this company will occupy a single entry in a Party table, rather than one for each role. To accomplish this, it first needs to be recognized that the situation might occur; secondly, that the organization wishes to track these roles as a single entity; thirdly, it must be confirmed that matching is possible; and lastly, it must be done. The need to match customers and suppliers may not come from the distinct projects looking at invoicing and sales, but should be identified as part of the governance of the EDW’s content management.

Consistency also applies to domains of values. Different sources may have distinct codes for the same descriptions; such as marital status being numeric (1=single, 2=married, 3=divorced) versus character-based (s=single, m=married, d=divorced). Governance needs to determine how sets of values should be consolidated.
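
One way this consolidation is often handled is with a cross-reference table that maps each source-specific code to a standard EDW value. The sketch below is illustrative only; the table and column names are hypothetical, and the actual mapping rules are an output of the governance process.

    -- Illustrative cross-reference of source-specific codes to a single EDW domain.
    CREATE TABLE marital_status_xref (
        source_system_cd VARCHAR(10) NOT NULL,   -- which source the code comes from
        source_status_cd VARCHAR(10) NOT NULL,   -- '1', '2', '3' or 's', 'm', 'd'
        edw_status_cd    VARCHAR(10) NOT NULL,   -- standardized value, e.g., 'SINGLE'
        CONSTRAINT pk_marital_status_xref PRIMARY KEY (source_system_cd, source_status_cd)
    );

    -- During the load, incoming codes are translated to the standard domain.
    SELECT c.customer_id,
           x.edw_status_cd AS marital_status_cd
    FROM   stg_customer c
           JOIN marital_status_xref x
             ON  x.source_system_cd = c.source_system_cd
             AND x.source_status_cd = c.marital_status_cd;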

Accuracy

Data Quality thresholds and criteria must be defined and enforced. For example, governance could stipulate that more than 90% of a column be populated for it to be included in the EDW. These threshold rules can be general or specific; and will frequently need to be adjudicated on a case by case basis. Setting measurable standards for accuracy is a practical way to approach governance and be seen to be enforcing it.

Without these controls the risks are that the EDW will take on too much work; that rework will be needed with each discovery of “better” sources; that the data won’t be fully integrated; and that ultimately users won’t have confidence in it. Data content governance is an integral part of ensuring the EDW will be a valuable asset to the business.

EDW Process – Part IV: Common Processes

In this series, we are looking at the processes involved in EDW development.

Common Processes

Physical Model:

The data modeler generates the Physical Data Model (PDM) from the completed Logical Data Model (LDM). The PDM retains descriptions at both the table and column level. The PDM is fed directly to the database administrator to implement the tables in the database.

Physical Target-Source Mapping:

Each data element needs to be mapped from its target back to its source. This applies both to mapping the Information Warehousing layer back to the staging area, and to mapping the Provisioning layer back to the warehouse. The basic mappings from one physical structure to another are included together with any transformation logic to be applied by the ETL process.

I would suggest the following areas be addressed as part of a physical target-source mapping:

  1. Load dependency diagram
  2. For each subject area:
    • Summary mapping diagram showing links between source and target entities.
    • Change capture logic expressed in a diagram.
    • Detail mapping table with target orientation; all target columns are listed with the corresponding source field or expression that populates them.

This is an essential piece of documentation, and is not simply a restatement of the logical source-target mapping. It provides instructions for ETL development and a point of reference for QA testing.
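
To make the target orientation concrete, the fragment below sketches how such a mapping typically translates into load logic: every target column appears, each annotated with its source column, an expression, or a constant. The table and column names are hypothetical, and surrogate key assignment is assumed to be handled by an identity column, a sequence or the ETL tool.

    -- Illustrative only: each target column is listed with its source or expression,
    -- mirroring the target-oriented detail mapping.
    INSERT INTO sor_customer (
        source_system_cd,      -- constant for this feed
        source_customer_id,    -- stg_customer.cust_no
        customer_nm,           -- TRIM(stg_customer.cust_name)
        effective_from_dt      -- load date
    )
    SELECT 'CRM',
           s.cust_no,
           TRIM(s.cust_name),
           CURRENT_DATE
    FROM   stg_customer s;
    -- customer_key (the surrogate) is assumed to be populated automatically
    -- and so does not appear in the column list.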

General suggestions on documentation can be found in Memento: The Importance of Documentation.

Physical Implementation:

The database administrator (DBA) receives the physical model and applies database configuration settings (e.g., indexing, partitioning, volumetrics). I have now encountered a variety of clients whose DBA groups run the gamut of engagement with projects. Active DBAs will be involved much earlier in the design process, and may be part of a core leadership team that guides development of the whole system. A passive DBA’s sole participation may be to receive scripts from the data modeller to generate and revise tables in the database.

Database administrators can add tremendous value collaborating on data design and helping tune performance, both for ETL loading and reporting access.

ETL Specification:

The ETL architect/designer documents the technical specifications for the ETL developer, with reference to the implementation of ETL logic specific to the application being used (e.g., Informatica, Ab Initio, DataStage). The specification document communicates ETL processes beyond the logic of the mappings, including scheduling dependencies, specific jobs and change capture logic.

This is not the same artifact as the physical target-source mapping. Here, the purpose is to detail the actual solution to be implemented.

ETL Build:

The ETL developer creates jobs based on the ETL specifications. This process is dependent on the database being implemented and the specifications being prepared. I realize that this brief statement doesn’t do justice to the size of the task. Of course, this encompasses a world of data validations, transformations, cleansing routines, de-duplications, key assignment, exception-handling and logging. What I would like to highlight is that ETL can’t be done until the database has been designed and tables created.
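
As a rough sketch of the change-capture portion only, the SQL below assumes a type 2 approach with effective dating on a hypothetical sor_customer table, matched on its natural key from staging. Real implementations are usually built in the ETL tool and include the validations, exception handling and logging mentioned above; the names and the single compared attribute are illustrative.

    -- 1. Expire the current row for customers whose attributes have changed in the feed.
    UPDATE sor_customer t
    SET    effective_to_dt = CURRENT_DATE
    WHERE  t.effective_to_dt IS NULL
      AND  EXISTS (SELECT 1
                   FROM   stg_customer s
                   WHERE  s.cust_no = t.source_customer_id
                     AND  TRIM(s.cust_name) <> t.customer_nm);

    -- 2. Insert a new current row for every changed or previously unseen customer.
    INSERT INTO sor_customer (source_customer_id, customer_nm, effective_from_dt, effective_to_dt)
    SELECT s.cust_no, TRIM(s.cust_name), CURRENT_DATE, NULL
    FROM   stg_customer s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   sor_customer t
                       WHERE  t.source_customer_id = s.cust_no
                         AND  t.effective_to_dt IS NULL
                         AND  t.customer_nm = TRIM(s.cust_name));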

QA Testing:

The QA developer designs and conducts a series of unit and system tests to determine if the system meets such criteria as accuracy, performance and data quality. The topic of testing the warehouse during development is addressed more fully in Finding Faults: Testing the EDW.

This completes the series on EDW development processes. Please feel free to comment or share your experiences.

EDW Process – Part III: Information Provisioning

In this series, we are looking at the processes involved in EDW development.

Information Provisioning Layer Specific Processes

Determination of Structure:

At this point of development there needs to be coordination between the BI architect and the data architect/modeller. A suitable set of tables needs to be identified, with consideration given to the reporting requirements and the solution architecture. The granularity and additive nature of facts need to be ascertained, as well as their dimensional context. It is helpful to draw up a matrix consisting of a list of measures cross-referenced with the applicable dimensions. What is the lowest level at which the data will be accessed? What level of summarization would be best? How will the data be filtered?

This sounds like we’re already thinking in terms of Star Schemas, with dimension and fact tables. However, this is the chance to consider alternatives, based on the business need. Perhaps users would be best served by OLAP cubes; or the consumer is actually an application that requires the data to be served up in a specific format. A major strength of the reference architecture is that while the warehousing layer stores data to serve the needs of the many, the provisioning layer makes it available to serve the needs of the few (or the one).

I’m not suggesting that users access the warehousing layer; I am saying that the provisioning layer’s design should be fit for purpose. Where structures can serve the needs of multiple sets of users – great! But we shouldn’t be limited to one size fits all.

Having said that, the next two processes assume that dimensions and fact tables are what’s required.

Logical Model of Dimensions:

In ETL processing, any central fact table is dependent on its dimensions being populated first; and so dimensions can usefully be designed and built first. The process should take into consideration an enterprise perspective for each dimension to maximize re-use. Over time this will be more a matter of maintenance of existing tables than the creation of new structures.

Logical Model of Facts:

Fact tables may be additive, semi-additive or non-additive. In some cases only an atomic level table need be created and aggregates can be automatically generated. There may be cases where the Information Warehousing layer may supply a fact table without the need to design one separately (e.g., sales transactions, bank transactions, financial market trades).
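
Assuming the conventional outcome of these two processes, a dimension table keyed by a surrogate and a fact table carrying those surrogates plus additive measures, a minimal sketch might look like the following. The grain, names and attributes are illustrative only.

    -- Illustrative store dimension keyed by a surrogate.
    CREATE TABLE dim_store (
        store_key  INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        store_cd   VARCHAR(20) NOT NULL,   -- natural key from the source
        store_nm   VARCHAR(200),
        region_nm  VARCHAR(100)
    );

    -- Illustrative sales fact at transaction-line grain, with fully additive measures.
    CREATE TABLE fact_sales (
        date_key    INTEGER NOT NULL,      -- reference to a date dimension
        store_key   INTEGER NOT NULL REFERENCES dim_store (store_key),
        product_key INTEGER NOT NULL,      -- reference to a product dimension
        sales_qty   DECIMAL(12,2) NOT NULL,
        sales_amt   DECIMAL(14,2) NOT NULL
    );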

BI Specification:

The data mart tables in the provisioning layer, against which reports will run, must be designed before the specifications for report development can be completed. These specifications supply instructions to report developers to build the reporting solution. They will inherit details from the original reporting requirements, and will include references to physical tables and columns in the data mart, and logic to transform and/or aggregate the data.

Specifications should also include additional filtering to be applied, user interface parameters, headings, column titles and other labels. A sample of the report output is often attached, and should be visually indexed to the listings of the report content. Styling standards should also be included or referenced. Drill-down and drill-across functionality should be identified, and any related reports that may be part of a business user workflow.

On a personal note, it seems that usable report specifications are rarely created anymore. Subjectively, this is part of a general trend within business intelligence, that sees delivery of information as an afterthought. Comments?

In the fourth and final installment of this series, we’ll look at the processes common to both the warehousing and provisioning layers.

EDW Process – Part II: Analysis and Information Warehousing

In this series, we are looking at the processes involved in EDW development.

Here, we take a closer look at the analysis phase together with the processes specific to the development of the warehousing layer.

Analysis Processes

Identification of Data Sources:

This is the first step towards confirming that sources can be found to meet the requirements. Where requirements have been gathered without reference to sources, these sources must be investigated and identified. At this time, definitions for each data element need to either be garnered from existing source documentation or determined through interviews with data stewards and direct investigation of the source data metadata, structures and content.

Theoretically, this could be a very simple process. With information requirements specifying precisely which data elements are needed, documentation of the source systems providing clear and concise definitions, and knowledgeable data stewards with time to answer a few outstanding questions over coffee and donuts, the task almost performs itself.

In most cases though, it would be exceptional if even one of those three circumstances existed. I’ve often heard it expressed that the migration of the data between the layers is the most time-intensive aspect of warehouse development. I would say that it’s analysis. Where analysis is glossed over, both data design and ETL development will get bogged down in problem-solving.

A thorough analysis of the data can ensure requirements are met, facilitate data design, and identify potential issues to be addressed through ETL. This will expedite the entire development process.

Data Profiling:

This is the second step towards confirmation of the availability of source data to meet the requirements. The results of analysis should be fed back through the requirements process to re-establish scope. Data Profiling refers to the automatically generated reports that analyse data according to a pre-defined set of criteria (e.g., % null, % distinct, max and min values, max and min length of values). Once the reports on a given source file or table have been generated, the results must be analysed to determine if there are data elements that should be removed from scope or quality issues that need to be addressed either at the source or during data migration.

While it is essential to conduct profiling early in development, and enable the output to influence requirements, the profiles continue to have value throughout the development life cycle, and even beyond, to help monitor data quality across the EDW layers. Profiling can be used to determine data design, model validation, ETL cleansing routines, and QA balance and control reports.
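
For a single column, the kind of statistics a profiling tool produces can also be gathered with plain SQL. The sketch below assumes a hypothetical staging table stg_customer and profiles one column; in practice it would be generated for every column of every source table, and the exact functions (e.g., CHAR_LENGTH) vary by platform.

    -- Basic profile of one source column: row count, % null, % distinct,
    -- minimum/maximum value and minimum/maximum length.
    SELECT COUNT(*)                                                               AS row_cnt,
           100.0 * SUM(CASE WHEN cust_name IS NULL THEN 1 ELSE 0 END) / COUNT(*)  AS pct_null,
           100.0 * COUNT(DISTINCT cust_name) / COUNT(*)                           AS pct_distinct,
           MIN(cust_name)                                                         AS min_value,
           MAX(cust_name)                                                         AS max_value,
           MIN(CHAR_LENGTH(cust_name))                                            AS min_length,
           MAX(CHAR_LENGTH(cust_name))                                            AS max_length
    FROM   stg_customer;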

A deeper discussion of this topic can be found in Effective Data Profiling.

Data Interrogation:

Data Interrogation refers to the manual process of querying the source data for such purposes as investigating anomalies or confirming relationships between source entities. It may well be the case that queries developed during the development phase will be used to monitor the quality of source data on an ongoing basis.

Similar to data profiling, this activity can be performed by data modellers, ETL designers and QA analysts. It is critical to implement data interrogation as part of the up front analysis; but the capability should be retained for the full development process.

Information Warehousing Layer Specific Processes

Source-Target Logical Mapping:

The Data Modeller maps the source data elements to their logical target entity and attribute; this work is commonly done in a spreadsheet. Note that this mapping has a source orientation. That is, all the source columns are listed, but some target columns will not appear, namely those that are not populated directly from a source field.

To be able to understand where the source fields should be mapped, a knowledge transfer must occur between the analyst, who has attained insights into the sources through the analysis phase, and the data modeller, who is responsible for designing the target structures. If the data modeller can be engaged in the analysis process, all the better. It’s a balance between consuming the data modeller’s time in discussions with users or data stewards, and risking having all source analysis communicated through the filter of the analyst. Partly, the concern is that things get lost in the re-telling, and partly, the modeller is going to ask questions geared towards making design decisions.

This process is often simply a scratch-pad for the modeller, who then throws away the mapping once the model is complete. However, the mapping presents an opportunity to gather notes on the sources, comments on target design decisions, input to governance of naming standards, and links between source and target fields. All in all, a nice bundle of metadata.

Logical Model:

The Data Modeller creates a Logical Data Model (LDM), with descriptive entity and attribute names as well as data types and definitions. Physical abbreviations are entered in a glossary used to generate the physical data model. A domain model can also be created to standardize data types. This logical/physical split is not unique to the warehousing layer; however, it is more significant here than in the provisioning layer. The modeller can “break out” the model into smaller, more specific entities as part of the normalized data design. This allows for more clarity and precision in the presentation of business entities and the relationships between them.

Because the logical model applies fully descriptive names, the diagrams from this model can be used to communicate the design to others. A caveat with this is that the Information Warehousing layer is not for general consumption, even as a data model diagram; just as it’s a limited set of clients who would be taken on a tour of a retailer’s actual warehouse. How much better to take them to the showroom!

As a reflection of the core within an EDW, the System of Record logical model is an enterprise data model; a master model of the organization’s mastered data. This gives it special significance.

In part III, we’ll look at the processes specific to the information provisioning layer.

EDW Process – Part I: Overview

Over the next few articles, I would like to map out the processes involved to develop the EDW. These processes focus on the analysis, design and build of the data structures within the components of the reference architecture previously defined. These processes are not dependent on a specific physical platform; and assume that the technical infrastructure has been established outside of this process flow.

The intent is to identify each of the processes involved in development, and show the inherent dependencies between them. I would argue that every one of these processes is essential to development, and the flow from one to another represents a critical path for release management. I’m not suggesting that this set of processes is comprehensive; there are many relationships between processes that I don’t attempt to depict, and sub-processes that don’t appear.

Development

The diagram above represents the development life cycle of the EDW at the highest level. What follows is a brief overview of each summary process.

Definition of Requirements:

Requirements are defined in terms of (1) Information and (2) Reporting. The information requirements will determine what data elements will need to be loaded into the EDW from internal and external source systems; and the reporting requirements will specify fields and values that will need to be derived and calculated. In my experience, this reflects the two drivers of warehousing projects.

In some cases, the mandate is to make available a specific subject area of information; such as deposit accounts for a bank or inventory levels for retail. This is sometimes referred to as the “field of dreams” approach – if you build it, they will come. What this requirements process needs to do is to frame the scope of the project. There must be some criteria by which to judge what data will be of value.

The granularity of the requirements is also important. Sometimes “customer information” is requested, or the slightly more helpful “customer address information”. Even the address can have multiple component fields; and then there are phone numbers, email addresses and web sites. Is history required moving forward and are there historical records to be loaded initially? The more specific the requirement, the more efficient the development process.

Report requirements take a different perspective. They define the functionality and use cases of the information. This is not the report specification, which must reference the Information Provisioning Layer for any business intelligence application, but it is the input to the design of those structures.

Analysis:

During the analysis phase, the source data needed to support the requirements will be identified and checked to see if it is of sufficient quality. If it cannot be found, is not populated or is of insufficient quality, these findings are communicated back to inform and potentially revise the requirements.

Integration/Warehousing:

The System of Record within the Information Warehousing Layer is where the “golden copy” of atomic reference data resides. Data to be fully integrated in the EDW will first have data structures modelled and data migration processes built and tested. Some data may pass through a Master Data Management process.

Information Provisioning:

The Data Mart within the Information Provisioning Layer is intended as a platform from which to meet specific information and reporting requirements. Where possible, the objects will be designed for re-use. However, the guiding principle is the effective servicing of business needs, and to this end, the solution’s primary focus is on the given project’s requirements.

These layers are discussed as part of the series on EDW Reference Architecture.

Governance

Data Content:

Data Content governance includes establishing and enforcing criteria for determining which data will be brought into each layer of the data warehouse with consideration of a number of factors:

  1. Redundancy
  2. Level of atomic detail
  3. Volumes
  4. Value to the enterprise
  5. Quality

Business Definitions:

Controls must be in place to ensure that all incoming data elements are sufficiently and accurately defined to communicate understanding to information workers and business users.

Naming Standards:

All entity and attribute names applied to the Information Warehousing and Provisioning layers must conform to an enterprise standard, both in logical form and their physical abbreviations.

Data Architecture:

The architectural principles underlying the Information Warehousing and Provisioning layers are distinct; however, each should conform to enterprise standards to ensure flexibility, integrity and efficiency of the development process, and ultimately, the performance of the BI solution.

Release Management:

All the layers of the EDW will evolve, at times with multiple streams of development acting simultaneously. The interdependencies between and within the layers need to be considered, as well as the impact of any changes throughout the system. This topic has been discussed in more detail in Getting with the Program.

In the next article, we’ll take a closer look at analysis and the processes specific to the development of the Integration and Warehousing layers.
