vol 1, no 1
October 15, 2002

Brought to you by Anthem Consulting, LLC

Join our Mailing List


Structural Inference Techniques Part 1: Identifying Candidate Primary Keys


Introduction

Structural Inference, also known as Dependency Profiling, has been relatively ignored in the data profiling industry, mainly because of the perception that it is not useful or too difficult to understand. However, understanding and evaluating the dependencies inherent in source data is crucial to the “structural integrity” of the data. Structural issues are the hardest and most costly to find and fix. Structural inference reduces the time and effort needed to find and solve structural integrity issues, making migration of non-relational data much easier and far more accurate.

Over the next few issues of “The Data Profiler,” each issue will investigate an analysis technique using structural inference results. In this article, we’ll define what a functional dependency is, discuss the breakthrough technology that launched the data profiling industry, and discuss how to identify potential candidate keys for a relation. Readers should be familiar with basis data profiling techniques and methodologies, including the steps and tasks involved.

Functional Dependencies Defined

Let’s first discuss some terminology and reading conventions. All of the following concepts are based upon the Relational Model as defined by Ted Codd in 1968. Though, this article is not as rigorous in its definitions, the basic ideas are the same.

A relation is analogous to, though not exactly, a “table,” “file,” or “record type” with a finite set of attributes and a finite number of rows. An attribute is analogous to, though not exactly, a “column” or “field” and represents a “named domain” associated with the relation. Throughout the article, we will mainly use the terms relation and attribute, though occasionally, “table” is used for “relation” and “column” is used for “attribute.”

All relations used in the examples are defined using standard SQL. When an example requires data, a table is presented below the relation definition. Each table of data contains a header row of attribute names corresponding to the relation definition.

Where applicable, empty values or NULLs are represented by “<NULL>” without quotes.

All inference of the inherent structure of data rests upon the functional dependency, defined as follows.

Given a relation R and attributes X and Y in R, Y is functionally dependent on X if for every unique value of X there is one and only one value of Y. The functional dependency is represented as,

X ® Y

and read as “X determines Y.”

In a functional dependency, the left hand side (“X”) is called the determinant and the right hand side (“Y”) is called the dependent.

For example, think about your organization. Every employee has an employee number. In fact, every employee should have a different employee number. So, for every unique value of employee number, there should be one, and only one, employee assigned that number (if not, think of the payroll, benefit, and tax implications). So, you could also say that the employee number functionally determines the employee name, address, phone number, etc.

The dependencies that illustrate this example are as follows.

employee number ® employee name
employee number ® employee address
employee number ® phone number

Though the definition does not explicitly state it, the determinant can contain more than one attribute (e.g., order_num + order_line_id ® ordered_prod_cd, given each unique value combination of order_num and order_line_id, there is one and only one value for ordered_prod_cd).

Embedded relations exist in non-normalized files and, through the use of dependencies, can be successfully normalized out of the file without changing the underlying data or meaning of the structure. For example, an insurance policy source file may contain customer data that can be normalized out into its own relation. The customer relation is embedded in the policy data.

When modeling new systems, data analysts rarely investigate existing source data to identify the structures that are needed. The business requirements, file layouts, and documentation are the main sources of information that an analyst uses.

Unfortunately, when the existing data is migrated to the new application, the inherent structure of the data may prevent a clean and easy conversion. An analyst that manually investigates the structure of source data finds that the process is slow and takes too much time. This problem was solved with the development of dependency inference technology.

Breakthrough Technology

With the advent of efficient and fast mathematical algorithms that infer functional dependencies inherent within data, profiling the structure of large quantities of data became much easier. Automated software products were developed around the “dependency inference engine.” These products test all possible dependencies for a relation, and return those that are true within the data.

An advantage to testing all possible dependencies is completeness of results. All possibilities are tested in a comprehensive, structured task. Nothing is left to chance. Human error is also removed from the process.

The importance of this analysis is in its ability to infer the third normal form structure of a specified source file. The knowledge of the business, combined with the known dependencies, defines the optimal normalized structure necessary to satisfy the business requirements.

This normalized model can be used as the basis for a staging area of data, a critique of a pre-defined database or application, or a data quality effort. There are, though, some other techniques that can aid in identifying business rules and validating existing business rules that may not be enforced by the data structure.

Techniques for Profiling

In reviewing the inferred dependencies, the analyst uses existing business knowledge to identify important structures of data. Sometimes, these structures are readily apparent, say with “customer_id” as the data’s inferred primary key. Other times, the structures are not so apparent or may require more or different data to see them. Regardless of the efficiency or speed of the algorithms, or the speed of the hardware and software, very large source files or tables must be sampled for effective and timely results. Finally, the analyst may also investigate interesting dependencies that may signal completely new structures or rules not explicitly documented or defined.

Primary Key Inference

One of the first, and most used techniques is identifying the primary key of the source file. During Attribute Analysis, a set of candidate keys may have been identified. Attributes that are 100% unique or a set of attributes that the business or DBA specified are candidates.

To identify a true candidate key, investigate dependencies for a determinant that appears most often in the list. For example, let’s examine the following table.

CREATE TABLE EMPL_JOBS
          (empl_id char(5) NOT NULL,
          job_id char(7) NOT NULL,
          job_start_dt datetime NOT NULL,
          job_end_dt datetime,
          sched_work_hours number(5,0) NOT NULL DEFAULT 0,
          actual_work_hours number(5,0) NOT NULL DEFAULT 0)

Under normal circumstances, a primary key would be defined. In this case, though, either the DBA has turned off unique indexing (to quickly load data or for performance reasons), or a key was never defined or is programmatically validated (pre-relational databases). The data analyst reviews the inferred dependencies obtained from the data and finds the following.

Table 1: EMPL_JOBS Inferred Dependencies

1.

empl_id+job_id ® actual_work_hours

2.

empl_id+job_id ® job_end_dt

3.

empl_id+job_id ® job_start_dt

4.

empl_id+job_id ® sched_work_hours

5.

empl_id+job_start_dt ® job_id

6.

job_start_dt ® actual_work_hours

7.

job_start_dt ® job_end_dt

8.

sched_work_hours ® actual_work_hours


The analyst determines that dependencies 6 through 8 are noise and meaningless to the business. However, dependencies 1-4 imply that empl_id+job_id is a valid primary key for the table since for each unique combination of empl_id along with the job_id found in the data, there is one and only one value each for actual_work_hours, job_end_dt, job_start_dt, and sched_work_hours.

If there is no determinant that appears more often than any other determinant, this may imply one of two things.

  1. A combination of determinants must be investigated.
  2. There are many embedded relations in the source file and one primary key is not required.

The rest of this article investigates the first possibility of determinant combinations. Embedded relations will be investigated in a future article.

Investigating Determinants

To illustrate possibility #1, let’s modify the example above by changing the inferred dependencies to the following.

Table 2: EMPL_JOBS Inferred Dependencies - REVISED

1.

<NULL> ® job_end_dt

2.

empl_id ® actual_work_hours

3.

empl_id+job_id ® job_start_dt

4.

job_id ® sched_work_hours

5.

empl_id+job_start_dt ® job_id

6.

job_start_dt ® actual_work_hours

7.

sched_work_hours ® actual_work_hours

We are going to rely on the following property.

    Given attributes A and B, if A ® B, then any other attribute or set of attributes X can be added to the determinant and the dependency is still true (A+X ® B). A ® B is known as the Covering Dependency.

Inference engines only show covering dependencies; otherwise the number of true dependencies is prohibitively large. However, we are not limited to the covering set. We can safely add any attribute(s) to any determinant without issue, in order to identify a candidate key.

Again, dependency 7 is deemed noise and discarded from consideration (dependency 6 takes on more meaning, as we shall see). Dependency 1 shows that job_end_dt is a constant or contains only 1 distinct value.

    NOTE: When the determinant is <NULL>, this implies that every possible dependency involving the dependent is true. Instead of listing every dependency, this convention is used.

Notice what has happened. Investigate individual attributes and see where they exist in determinants. Empl_id is found in three determinants (2, 3 and 5), job_id in two determinants (3 and 4), and job_start_dt in two determinants (5 and 6).

The candidate keys must uniquely identify all other attributes in the relation. The analyst has to investigate three possibilities, empl_id+job_id, empl_id+job_start_dt, and job_id+job_start_dt. Lists of dependents for each candidate are shown below.

empl_id+job_id
          actual_work_hours (dependency 2)
          job_end_dt (dep 1)
          job_start_dt (dep 3)
          sched_work_hours (dep 4)

empl_id+job_start_dt
          actual_work_hours (dep 2 or 6)
          job_end_dt (dep 1)
          job_id (dep 5)

job_id+job_start_dt
          actual_work_hours (dep 6)
          job_end_dt (dep 1)
          sched_work_hours (dep 4)

The only possible determinant combination that identifies all other attributes is empl_id+job_id. This is the smallest (in terms of attribute count) candidate key for the relation EMPL_JOBS. Of course, any other attribute can be added to create an additional candidate, but the data do not need other attributes.

Given that a typical inference can generate hundreds of dependencies, this process can get quite complex. Bear in mind, however, that this technique minimizes the reliance on any documentation, user input, or business knowledge. Eliminating possible avenues of investigation, based on such input, can significantly decrease the amount of time and effort required to identify candidate keys. The core process, though, remains unchanged, regardless of the data source environment.

Conclusion

Structural Inference is a highly effective tool that can accurately determine the structural integrity of a data source. Discovering a list of potential candidate keys for a source file is a useful method of accurately describing its structure. If the primary key is not documented or validated by the database or application environment, structural inference can quickly and concisely identify the candidate keys. Even if the primary key is well defined and maintained, structural inference can help the analyst determine if any other attribute(s) can be candidate keys.



Solutions Supporting Structural Inference


Evoke Software's Axio

Avellino's Discovery

Ascential Software's MetaRecon

KnowledgeDriver's Profiler Suitcase

Similarity System's Athanor



Trademark Notice All products or company names are used for identification purposes only, and may be trademarks of their respective owners.




Copyright © 2002 Anthem Consulting, LLC. All Rights Reserved.