vol 2, no 1
January 15, 2003

Brought to you by Anthem Consulting, LLC

Join our Mailing List


Structural Inference Techniques Part 2: Embedded Relations

 

Introduction

 

In the last issue of The Data Profiler, we discussed how inferred functional dependencies help in identifying candidate keys of a particular source file.  For non-relational legacy sources, though, it is crucial that the files are normalized accurately for easier migration to relational databases, implying that each source file can be transformed into many normalized relations.  Part 2 of Structural Inference Techniques outlines how to identify and normalize these “embedded relations.”

 

Part 1 Review

 

Let’s first review some definitions and conventions introduced in Part 1.

 

A relation is analogous to, though not exactly, a “table,” “file,” or “record type” with a finite set of attributes and a finite number of rows.  An attribute is analogous to, though not exactly, a “column” or “field” and represents a “named domain” associated with the relation.  Throughout the article, we will mainly use the terms relation and attribute, though occasionally, “table” is used for “relation” and “column” is used for “attribute.”

 

All relations used in the examples are defined using standard SQL.  When an example requires data, a table is presented below the relation definition.  Each table of data contains a header row of attribute names corresponding to the relation definition.

 

Empty values or NULLs are represented by “<NULL>” without quotes.

 

All inference of the inherent structure of data rests upon the functional dependency, defined as follows.

 

Given a relation R and attributes X and Y in R, Y is functionally dependent on X if for every unique value of X there is one and only one value of Y.  The functional dependency is represented as,

 

                        X->Y

 

and read as “X determines Y.”

 

In a functional dependency, the left hand side (“X”) is called the determinant and the right hand side (“Y”) is called the dependent.  The determinant can contain many attributes (e.g., order_num+order_line_num->ordered_prod_id).

 

An embedded relation is a relation that can be normalized out of existing source data without changing the structure or content of the data.  For example, code or reference tables may be considered embedded within a larger, de-normalized table.

 

Objectives of Normalization

 

There are many reasons an organization may normalize (or re-normalize) source data.  A company may want to change the focus of the data.  For example, insurance companies traditionally capture data in a policy-centric manner.  To develop CRM applications, these companies need to transform and capture this data into a customer-centric manner, requiring finding the customer information embedded within their policy files.

 

Another reason for normalization may be to determine whether or not a purchased application can capture the same or similar information captured within legacy systems.  This can be accomplished by a model-to-model comparison.  Of course, it is easier to compare normalized models than de-normalized structures.

 

Finally, a third reason is the development of custom databases, such as data warehouses or data marts.  A properly normalized data structure can form the basis of either an Operational Data Store (ODS) or a staging area database.  Also, the development of the data warehouse database, itself, is made easier by completely understanding the inherent structure of the source data.

 

Functional Dependencies and Normalization

 

The basis for all normalization rests in the functional dependency.  When data modelers define entities and attributes, they are implicitly defining functional dependencies that must be enforced by the database management system.

 

For example, consider the following entity, stated in standard SQL.

 

CREATE TABLE Employee

     (empl_id            char(6)     NOT NULL,

      empl_first_nm      varchar(30) NOT NULL,

      empl_last_nm       varchar(40) NOT NULL,

      empl_curr_sal_amt  number(10,2));

 

ALTER TABLE Employee

     ADD PRIMARY KEY (empl_id);

 

When the primary key of the entity is defined, in this case empl_id, the modeler is, in effect, creating three functional dependencies.

 

empl_id->empl_first_nm

empl_id->empl_last_nm

empl_id->empl_curr_sal_amt

 

The determinant becomes the primary key of the Employee entity.

 

There is a difference between the traditional data modeling exercise and the data profiling process.  A traditional data model is developed from business requirements, existing documentation, and possible file layouts.  The data profiling specialist develops data models and structures from inferred dependencies that exist within the existing source data, and then incorporates new business requirements into this model.

 

As seen in Part 1, a candidate key for a relation is defined as a common, minimal determinant that identifies all other attributes in the relation.  We’ll use this, in conjunction with other techniques, to identify embedded relations.

 

Patterns in the Determinant

 

The first step in finding embedded relations involves sets of dependencies with the same or similar determinant.  Dependencies with the same determinant possibly signal that the determinant is a candidate key for a relation with the dependents as non-key attributes.  If a particular determinant appears in the list of dependencies examine the dependent attributes to see if they (the dependent attributes) can be logically normalized into their own relation.

 

For example, consider the following list of dependencies for the relation specified.

 

CREATE TABLE order

     (order_id         char(8)     NOT NULL,

      order_dt         date        NOT NULL,

      terms            varchar(15) NOT NULL,

      cust_bill_id     char(10)    NOT NULL,

      cust_bill_nm     varchar(35) NOT NULL,

      bill_cd          char(3),

      bill_desc        char(30),

      cust_ship_id     char(10),

      tot_ord_amt      money,

      tot_sls_tax_amt  money)

 

Table 1: ORDER Inferred Dependencies

1.       

order_id->order_dt

2.       

order_id->terms

3.       

order_id->cust_bill_id

4.       

order_id->tot_ord_amt

5.       

order_id->tot_sls_tax_amt

6.       

cust_bill_id->cust_bill_nm

7.       

cust_bill_id->bill_cd

8.       

cust_bill_id->terms

9.       

order_dt->bill_cd

10.   

cust_ship_id->bill_cd

11.   

bill_cd->bill_desc

12.   

bill_desc->bill_cd

 

While investigating patterns in the determinants, the practitioner should first evaluate the suitability of the determinant as a primary key of a data structure.  For example, dependency 9 (order_dt->bill_cd), though true in the data, may not be considered important to the business.  Nor is a data structure with only “order_dt” as the primary key well defined or suitable as a table.

 

There are three sets of dependencies we need to investigate.

 

  • Dependencies 1-5 with “order_id” as the common determinant
  • Dependencies 6-8 with “cust_bill_id” as the common determinant
  • Dependencies 10-11 with "cust_ship_id" as the common determinant

 

Dependencies 1-5 indicate a potential “Order” relation because of the common “order_id” determinant.  We also see that each dependent in this set can reasonably be a non-key attribute of the Order relation.  Of course, this must be verified by the business, but having these inferred dependencies gives us a good data analysis foundation.  If dependencies 1-5 have “business value,” or if they contribute to the development of a model, then the practitioner can mark them as model dependencies.

 

Dependencies 6-8 indicate a possible “Customer Billing Location” relation.  However, since the “terms” attribute is already a part of the Order relation, and there may not be any added business value in associating it with the Customer Billing Location relation, we do not need to mark dependency 8 as a model dependency.  Once an attribute acts as a dependent for a model dependency, you may be able to ignore any other dependency with the same dependent.  Techniques involving multiple dependencies having the same dependent are the subject of a future article.

 

For dependency 10, “bill_cd” is already a dependent for a model dependency (cust_bill_id->bill_cd).

 

The last two dependencies (#11, #12) signal a domain or code table.  Typically, these tables contain two attributes, a code and a description.  When you see two dependencies like this, two attributes determining each other, especially with code attributes, a domain table is possible.  However, in order to construct the code relation, specify the dependency with the code attribute in the determinant as a model dependency (in this case, #11).  Do not mark the other dependency as part of the model, since that would create a relation with the description attribute as the primary key.

 

After this review, the following dependencies are now marked as model dependencies.

 

1.       

order_id->order_dt

2.       

order_id->terms

3.       

order_id->cust_bill_id

4.       

order_id->tot_ord_amt

5.       

order_id->tot_sls_tax_amt

6.       

cust_bill_id->cust_bill_nm

7.       

cust_bill_id->bill_cd

8.       

bill_cd->bill_desc

 

Though there is still some more analysis that must be performed, we have identified the embedded relations in the Order source file.

 

·        Order (dependencies 1-5)

·        Customer Billing Location (dependencies 6-7)

·        Billing Type (dependency 8)

 

Other Considerations

 

The intent of this technique is not to replace the data modeler.  Nor should it be suggested that the model is complete with the inference of dependencies from the data.  If the source data is of low quality, the dependencies that are inferred may not be suitable to the project needs and may never become model dependencies.  The data modelers and subject matter experts must make the decisions regarding the data structures, just as in traditional analysis methodologies.

 

Also, the automated data profiling solutions only infer covering dependencies.  The data profiling specialist may need to add attributes to the determinant to develop the data structure needed, especially if there are constants in the source data.  Again, the specialist, with the SMEs, must evaluate the business value of the dependencies.

 

Conclusion

 

The ability to normalize a source file is a crucial aspect in determining the structural integrity of data.  Without examining inferred dependencies and only relying on documentation or metadata, the analyst runs the risk of developing a model that does not support the actual data.  Though business requirements, both new and existing, are crucial to the process, the data analyst should examine the source data in an effort to fully understand and validate their models.

 


Structural Inference Vendors

Ascential Software

Avellino

Evoke Software

KnowledgeDriver

Similarity Systems, Inc.


Related Articles

Part 1 - Candidate Keys



Copyright © 2003 Anthem Consulting, LLC. All Rights Reserved.