vol 2, no 2
April 15, 2003

Brought to you by Anthem Consulting, LLC

Join our Mailing List


Structural Inference Part 3: Inferring and Validating Data Rules

 

by Joseph Novella

 

Introduction

 

The preceding parts to this series on Structural Inference have focused on developing and maintaining data structures from inferred functional dependencies.  The inference of functional dependencies from source data, though not sufficient, is a necessary step in implementing logically consistent target data structures that can be easily populated from source systems.

 

In order to successfully use functional dependencies, the practitioner must make a judgment on whether or not specific dependencies have business value.  In other words, does a functional dependency enforce or satisfy a particular business requirement?  If so, then the analyst marks the dependency as a “model” dependency.

 

However, inferring all of the functional dependencies in source data may not be enough to satisfy the business rules or requirements.  The analyst must create new functional dependencies and validate them against the source data, knowing that the dependencies created are not supported by the data.  In this article, we investigate how to convert certain types of data rules and business requirements into functional dependencies that can be validated against source data.

 

Dependency-Based Data Rules

 

There has been increasing business interest in business rules and rule management systems in the last few years.  Business Rule methodologies, if used correctly, can significantly decrease project costs and implementation times.  As part of these methodologies, it is necessary to determine whether existing data support newly identified business rules.  This validation step ensures correct migration of existing data into newer applications.

 

Let’s start with some concepts.  A data rule is a business rule that expresses a constraint on the data captured or maintained by the business.  For example, the following statement expresses a data rule.

 

“A customer with more than $20000 in sales in the previous twelve months is classified as a ‘favored’ customer.”

 

This business rule is enforced and maintained solely by the data captured by the organization.  The rule can even be enforced without human intervention.

 

Unlike other profiling techniques that use inferred dependencies, techniques validating existing data rules may require the analyst to create new dependencies that do not exist in the source data.  For this reason, data that conflicts with the dependencies should be recorded and evaluated.

 

For this article and the next in this series, we’ll focus on the following four classes of data rules with the first two covered in this article.

 

  • One-to-One Correspondences
  • Duplication Constraints
  • Hierarchies (Part 4)
  • Code or Reference Tables (Part 4)

 

Granted, these data rules can be captured and validated by developing custom-built test programs.  In some cases, standard SQL is sufficient.  However, functional dependencies offer a potentially more productive way of identifying, classifying, and verifying these data rules, without the need to develop custom code.

 

One-to-One Correspondence

 

Data rules of this type specify that for each unique value of a specific attribute “A”, there is one and only one unique value of another attribute “B.”  This creates a one-to-one correspondence between the respective values.  For example, consider the following data rule, required for address standardization, and relation.

 

“A postal code is contained within one region/state.”

 

CREATE TABLE customer

      (cust_id              char(10)     NOT NULL,

       cust_type_cd         char(5)      NOT NULL,

       cust_first_nm        varchar(30)  NOT NULL,

       cust_last_nm         varchar(40)  NOT NULL,

       cust_addr_1          varchar(35),

       cust_addr_2          varchar(35),

       cust_city_nm         varchar(35),

       cust_post_region_cd  char(2)      NOT NULL,

       cust_postal_cd       char(11)     NOT NULL,

       cust_phone_area_cd   char(3),

       cust_phone_num       char(7));

 

Using the relation above, the rule can be equivalently restated as follows.

 

“Given each unique value of ‘cust_postal_cd’, there is one and only one value for ‘cust_post_region_cd’.”

 

Stated this way, the data rule can be represented as a functional dependency between cust_postal_cd and cust_post_region_cd.  Specifically,

 

cust_postal_cd->cust_post_region_cd

 

Notice that the analyst may not want to develop a new data structure from the dependency with ‘cust_postal_cd’ as the primary key and ‘cust_post_region_cd’ as the non-key attribute.  The dependency validates whether or not this data rule governing addresses exists within the data.

 

The one-to-one correspondence may also involve a complex determinant, or a determinant with more than one attribute.  For example, consider the following data rule and corresponding relation.

 

“A credit card cannot be used on more than one transaction at a given time.”

 

CREATE TABLE transaction

      (transaction_id    char(15)       NOT NULL,

       transaction_dt    date           NOT NULL,

       transaction_tm    timestamp      NOT NULL,

       card_num          char(16)       NOT NULL,

       transaction_amt   number(15,2)   NOT NULL,

       terminal_num      varchar(10),

       merchant_num      char(16)       NOT NULL);

 

Restating the data rule as follows, we can see the dependency that must be enforced.

 

“Given a card_num, transaction_dt and transaction_tm, there is only one transaction_id.”

 

card_num+transaction_dt+transaction_tm -> transaction_id

 

In general, data rules that state a relationship that restricts the number of values of one of the attributes can be expressed as one or more functional dependencies.  Understand that the analyst may need to create these dependencies and validate them against the data, as opposed to using data profiling software to infer them from the data.

 

Duplicate Instances

 

It is relatively easy to identify duplicate values for primary keys.  Whether the key is single or multi-attribute, domain studies can quickly determine its uniqueness.

 

The problem occurs when the analyst must check for duplicate instances of business objects, regardless of the status of the key.  For example, how can the analyst find duplicate customer addresses even when the customer identifier is unique?  Though not an explicit data rule, duplicate customer (or product, region, employee, etc.) issues should be identified and quantified.

 

In order to test for uniqueness, create a dependency where the set of attributes you are checking is the determinant and the primary key of the table/file is the dependent.  For example, consider the following customer table.

 

CREATE TABLE customer

      (cust_id              char(10)     NOT NULL,

       cust_type_cd         char(5)      NOT NULL,

       cust_first_nm        varchar(30)  NOT NULL,

       cust_last_nm         varchar(40)  NOT NULL,

       cust_addr_1          varchar(35),

       cust_addr_2          varchar(35),

       cust_city_nm         varchar(35),

       cust_post_region_cd  char(2)      NOT NULL,

       cust_postal_cd       char(11)     NOT NULL,

       cust_phone_area_cd   char(3),

       cust_phone_num       char(7));

 

ALTER TABLE customer

     ADD PRIMARY KEY (cust_id);

 

Note: Even though the address columns are US-specific, the technique can be applied to any set of address columns.

 

The objective of this example is to determine whether or not there are duplicate customer instances, even though cust_id is unique.  In other words, are there any customers with more than one cust_id?

 

Create a dependency with the address columns in the determinant and the primary key (cust_id) as the dependent, and validate against the source data.

 

Cust_first_nm + cust_last_nm + cust_addr_1 + cust_addr_2 + cust_city_nm + cust_state_cd + cust_postal_cd -> cust_id.

 

If this dependency is not true, then the data contains duplicate customer instances.

 

Notice, we did not need to include all of the non-key attributes in the determinant to find duplicates.  Customer address columns were enough.  Of course, some data rules, say warehouse inventory rules, may require all of the non-key attributes as the determinant.  Though potentially more complex, the technique remains unchanged.

 

There are limitations to this technique, however.  The first involves the amount of data.  Validating functional dependencies, using today’s technology, against data sources larger than 100,000 records may take a significant amount of time.  This technique is not meant to identify ALL of the duplicate instances of a business object.  Instead, it is meant to quantify the extent of a duplication problem.  If there are 2,000 duplicate instances in a representative sample of 20,000 rows, the analyst does not need to check the full set of data to infer that 10% of the instances are duplicates.

 

The second limitation is more technical.  This technique assumes that no two records in a source file are exact duplicates of each other.  If there are duplicate records in the source data, structural inference techniques won’t find them.

 

For example, consider the following table.

 

Table: Customer

cust_id

cust_first_nm

cust_last_nm

cust_type_cd

1123

John

Smith

SVCS

1345

Jim

Smith

PTNR

6534

Barbara

Doe

PREF

1123

John

Smith

SVCS

 

The first and fourth rows in this table are duplicate rows.  To find duplicate instances of customers, validate the following dependency.

 

cust_first_nm + cust_last_nm + cust_type_cd -> cust_id

 

We find that the dependency is true even with the duplicate rows.  For every unique combined value of cust_first_nm, cust_last_nm, and cust_type_cd, there is always one and only one value of cust_id.  So, since there are no duplicate instances of a business object, do not assume that there are no duplicate rows or records in the data.

 

Note: It’s not wise to assume that since the source data resides in a RDBMS, there are no duplicate rows, or even duplicate keys.  Conversion and integration of data from older legacy environments may result in duplicate rows, especially if “bulk loads” are performed without unique index constraints enforced for performance.

 

What to capture

 

When validating dependencies against source data, the analyst is interested, not just in the dependency, but also in the data that does not support the dependency, called conflicting data.  The percentage and row counts of conflicting data are important.  These statistics identify the extent of the problem and whether the data rule successfully describes the business.

 

The analyst should also examine the conflicting rows for patterns or similar characteristics.  Perhaps the conflicting rows appear because one or more attributes have the same value.  Maybe NULLs exist in one of the attributes.  Do the conflicting rows represent “empty” transactions, placeholders, or even test data accidentally converted into production?

 

The techniques for investigating conflicting data are the subject for another article.  The point is that the conflicting data should be investigated to determine not just the extent of the problem, but also its potential source and solution.

 

Conclusion

 

Data rules must be identified and verified before conversion into new databases to ensure cleaner, more consistent information.  Functional dependencies offer an interesting way of validating some existing data rules in source data.  Though there are some limitations on using structural inference techniques, the minimizing of custom-built testing routines, as well as the natural integration of these techniques into data profiling methodologies can increase quality and productivity of system development.

 

The next article in this series focuses on two more types of data rules, code/reference table verification and hierarchies.

 



Structural Inference Vendors

Ascential Software

Avellino

Evoke Software

KnowledgeDriver

Similarity Systems, Inc.


Trademark Notice All products or company names are used for identification purposes only, and may be trademarks of their respective owners.



Related Articles

Part 1 - Candidate Keys

Part 2 - Embedded Relations



Copyright © 2003 Anthem Consulting, LLC. All Rights Reserved.