vol 2, no 3
July 15, 2003

Brought to you by Anthem Consulting, LLC

Join our Mailing List


Structural Inference Techniques Part 4: Data Rules

 

Introduction

 

In the previous article, we discussed some techniques where functional dependencies are used to discover or validate certain types of data rules.  When validating data rules, the interpretation of dependencies change, as well as the information captured.  We’ll now continue the discussion and use functional dependencies to

 

  • Validate simple and complex hierarchies
  • Identify and verify code or reference tables

 

Hierarchies

 

Definition of Hierarchy

 

In order to restrict the discussion to hierarchies, let’s define the concept.  According to www.dictionary.com, a hierarchy is “an organisation with few things, or one thing, at the top and with several things below each other thing.  An inverted tree structure.  Examples in computing include a directory hierarchy where each directory may contain files or other directories; a hierarchical network.”

 

This implies that the lower-level branches of a hierarchy are completely “contained” or “owned” by only one higher-level root.  No one branch, or its sub-branches are related to multiple roots.

 

An example of a simple hierarchy involves geographic regions.  Consider the following hierarchy (with associated columns).

 

 

A Country is subdivided into regions.  Each region is further subdivided into postal codes (the fact that some countries’ postal codes are, themselves, hierarchies doesn’t change the basic structure).

 

A table may look as follows.

 

Table 1: Geography Table

Country_cd

Region_cd

Postal_cd

US

NY

10101

US

NY

10001

US

NJ

08003

AU

NSW

1001

AU

NSW

1003

 

A more complex hierarchy may involve product classifications, where, for example, the same product class value is found in multiple product types.

 

 

Table 2: Product Table

Prod_family_cd

Prod_type_cd

Prod_class_cd

Prod_cd

A

123

A34

3740

A

234

A34

9312

A

123

B86

7629

B

490

C91

3445

 

Notice, the prod_class_cd ‘A34’ seems to belong to two different prod_type_cd values.  The objective is to obtain a set of dependencies that can validate the defined hierarchy against the current source data.

 

Simple Hierarchies

 

Because the branches and sub-branches of a hierarchy relate to only one root, functional dependencies can be used to verify if the source data supports the hierarchy.  To construct the needed dependencies, the determinant, or left-hand side, consists of the attributes of the lower level branches and the dependent, or right-hand side, consists of the root attribute.

 

Our interest lies in tables/files that contain all of the attributes in the hierarchy (the same process applies to subsets of attributes).  A hierarchy of this type, where each root contains many branches can be expressed as a set of functional dependencies by working “up” the tree.  In other words, branches determine roots.

 

 

So, in order to validate whether or not a particular simple hierarchy exists, validate dependencies where the branch attribute is the determinant and the root attribute is the dependent.

 

Complex Hierarchies

 

In the instance where values of branches can be shared across roots, multiple levels of branch attributes are needed to construct the dependency.  In the instance of the product hierarchy (refer back to Table 2), the dependency prod_class_cd -> prod_type_cd is insufficient to validate the hierarchy.  There are many product types that are related to the product class ‘A34’.  An effective dependency involving the prod_cd attribute is needed as part of the determinant (left-hand side).

 

prod_class_cd + prod_cd -> prod_type_cd

 

This dependency gives a unique product type for every combination of product class and product.  So the dependencies that are needed for this hierarchy are

 

 

Keep in mind, this does not mean that product class ‘A34’ has the same meaning in both product types.  Merely that the same product class value exists in both types.  Conceivably, if the meaning of ‘A34’ changes, then the product class code and the product class description, together, should determine the product type.

 

Of course, if the common value of a hierarchy applies to multiple roots AND has the same meaning across branches, then you do not have a hierarchy and this technique won’t work.

 

Code/Reference Tables

 

For this discussion, code tables are structures that store valid values or codes for a specific domain used throughout the organization.  Code tables, generally, have only two columns, a code and a name/description.  An example of a code table is an “Invoice Status Type” table, which captures all possible statuses for an invoice.

 

This technique involves finding or validating code tables within source data.  Code tables can exist within other tables (a special case of embedded relations (see Part 1), or are already normalized as separate tables.  In both cases, the analyst must validate two functional dependencies.

 

code -> description

description -> code

 

Both dependencies must be true for the code table to be consistent.  Each code value must have one meaning (code -> description).  This ensures that the code is unique and is the key for the resulting table.  No two codes should have the same meaning (description -> code).  This implies that, even though there is no duplication of code values, there are no excess codes in the table.

 

Conclusion

 

The last two articles in this series have investigated different data rule verification techniques involving functional dependencies.  Are there other methods of validating one-to-one correspondences, duplicate records, hierarchies, and code tables?  Sure.  However, all of these other techniques are unique to a specific problem and involve custom development.  Using functional dependencies, in conjunction with data profiling technology, helps the analyst very quickly identify and resolve potential data rules issues using the same process and without requiring a developer to custom code queries.  Of course, this also lets the developer perform other coding tasks.

 

Functional dependencies are very powerful as a concept in data analysis and assessment.  Many types of data quality issues, which may be difficult or impossible to find using other methods, can be identified very easily.  Data profiling solutions can infer dependencies quickly and efficiently, and also provide a consistent framework for interpreting the results.  Structural inference is an important addition to the collection of analysis techniques.

 

Trademark Notice All products or company names are used for identification purposes only, and may be trademarks of their respective owners.






Copyright © 2003 Anthem Consulting, LLC. All Rights Reserved.