Picking the Team, Part III: ETL Designer

In this series of articles, we’re looking at the top qualities required for candidates to fill three key positions in the data warehousing team. In part one we looked at the analyst, in part two, the role of data designer. Here we’ll explore the five top skills to look for in the ETL designer / developer. In larger operations, these will be two distinct roles, with perhaps multiple resources. There may be a designer, with a lead developer and a team of ETL developers. For the purposes of this article, we’ll roll ETL into one position: with ETL design and development taken together.

ETL Design/Development

The process of Extract Transform Load (ETL) involves the migration of data from one system to another. In the case of the data warehouse, this occurs in multiple stages: source to staging, staging to integration layer, and integration layer to data mart.

We’ll take the base technical skills as a given. The ETL resource must be proficient in the toolset being applied (DataStage, Informatica, Ab Initio etc.) While the functionality is essentially the same across the board, the features of each are quite distinct. Beyond this fundamental compentency, look for the candidate’s expertise in the following areas:

  1. Experience with Delta Loads to Staging
  2.  Experience with Normalized Structures (Integration Layer)
    • Change Capture / History
    • Reference Tables (consolidated vs separate)
    • Surrogate Keys
  3. Experience with Denormalized Structures (Provisioning Layer)
    • Slowly Changing Dimensions
    • Bridging and Relationship Tables
  4. Approach to Performance Tuning
  5. Approach to Rapid Development

It’s important to find someone with knowledge of the end-to-end process. They may have been involved in one aspect of the process in the past, or been responsible for maintenance of existing jobs. This will not be sufficient. The ETL needs to be designed with an understanding of the specific challenges involved at each stage, with a corresponding toolkit of solutions at their disposal to address them.

The staging area is loaded from the source systems, and requires strategies to extract only the latest data. The ETL candidate needs to be familiar with different strategies to manage this, such as: source system logs, datestamps on source tables, set comparison operations within staging.

It’s quite common for an ETL resource to have experience with loading data marts, but have no knowledge of normalization. This can colour their thinking regarding the population of the Integration Layer, and present challenges when using it as a source for the Provisioning Layer. You should specifically ensure that the candidate is aware of the different ways change capture is applied in the normalized environment, in particular the use of associative tables; that they have experience of consolidated reference tables; and that they understand that surrogate keys in the Integration Layer are static.

Dimensions and fact tables are standard structures in the industry and every ETL designer should be well-versed in their use. It would be a good idea to check that they know the three types of slowly changing dimension:

  1. Update the existing value with the new value
  2. Create a new row for the new value
  3. Design multiple columns for new and previous value

Another data mart structure that is not always applied is the bridging table, for navigating unbalanced hierchies, which require a separate ETL pattern to process. The candidate should have experience with both bridging and relationship tables, which can handle many-to-many relationships.

There is usually a window of time within which the nightly, weekly or monthly ETL processes must be completed. This puts a premium on the ability of the ETL designer to optimize the efficiency of the code. Make sure the candidate has practical experience with this and can list a number of strategies that they have employed. Don’t expect them to answer questions like: “How long should it take to load a million rows”. There are too many variables for them to give a meaningful response. What is important is that they can articulate a number of avenues of approach.

The final point relates to the candidate’s overall attitude to development. Most IT shops want agility, while the data warehouse life cycle contains inherent dependencies. The ETL designer should always be looking for repeatable patterns, building on previous experience as well as the blocks of code developed through the project. Find out what strategies they employ to create processes quickly and accurately. ETL can be the most time-intensive part of the project. An efficient ETL designer can be key to the warehouse ROI.