![]() |
|
vol 2, no 3 July 15, 2003 |
|
Brought to you by Anthem Consulting, LLC |
Join our Mailing List
|
|
|
|
|
Structural Inference Techniques Part 4: Data Rules Introduction In the previous article, we discussed some techniques where functional dependencies are used to discover or validate certain types of data rules. When validating data rules, the interpretation of dependencies change, as well as the information captured. We’ll now continue the discussion and use functional dependencies to
Hierarchies Definition of Hierarchy In order to restrict the discussion to hierarchies, let’s define the concept. According to www.dictionary.com, a hierarchy is “an organisation with few things, or one thing, at the top and with several things below each other thing. An inverted tree structure. Examples in computing include a directory hierarchy where each directory may contain files or other directories; a hierarchical network.” This implies that the lower-level branches of a hierarchy are completely “contained” or “owned” by only one higher-level root. No one branch, or its sub-branches are related to multiple roots. An example of a simple hierarchy involves geographic regions. Consider the following hierarchy (with associated columns).
A Country is subdivided into regions. Each region is further subdivided into postal codes (the fact that some countries’ postal codes are, themselves, hierarchies doesn’t change the basic structure). A table may look as follows. Table 1: Geography Table
A more complex hierarchy may involve product classifications, where, for example, the same product class value is found in multiple product types.
Table 2: Product Table
Notice, the prod_class_cd ‘A34’ seems to belong to two different prod_type_cd values. The objective is to obtain a set of dependencies that can validate the defined hierarchy against the current source data. Simple Hierarchies Because the branches and sub-branches of a hierarchy relate to only one root, functional dependencies can be used to verify if the source data supports the hierarchy. To construct the needed dependencies, the determinant, or left-hand side, consists of the attributes of the lower level branches and the dependent, or right-hand side, consists of the root attribute. Our interest lies in tables/files that contain all of the attributes in the hierarchy (the same process applies to subsets of attributes). A hierarchy of this type, where each root contains many branches can be expressed as a set of functional dependencies by working “up” the tree. In other words, branches determine roots.
So, in order to validate whether or not a particular simple hierarchy exists, validate dependencies where the branch attribute is the determinant and the root attribute is the dependent. Complex Hierarchies In the instance where values of branches can be shared across roots, multiple levels of branch attributes are needed to construct the dependency. In the instance of the product hierarchy (refer back to Table 2), the dependency prod_class_cd -> prod_type_cd is insufficient to validate the hierarchy. There are many product types that are related to the product class ‘A34’. An effective dependency involving the prod_cd attribute is needed as part of the determinant (left-hand side). prod_class_cd + prod_cd -> prod_type_cd This dependency gives a unique product type for every combination of product class and product. So the dependencies that are needed for this hierarchy are
Keep in mind, this does not mean that product class ‘A34’ has the same meaning in both product types. Merely that the same product class value exists in both types. Conceivably, if the meaning of ‘A34’ changes, then the product class code and the product class description, together, should determine the product type. Of course, if the common value of a hierarchy applies to multiple roots AND has the same meaning across branches, then you do not have a hierarchy and this technique won’t work. Code/Reference Tables For this discussion, code tables are structures that store valid values or codes for a specific domain used throughout the organization. Code tables, generally, have only two columns, a code and a name/description. An example of a code table is an “Invoice Status Type” table, which captures all possible statuses for an invoice. This technique involves finding or validating code tables within source data. Code tables can exist within other tables (a special case of embedded relations (see Part 1), or are already normalized as separate tables. In both cases, the analyst must validate two functional dependencies. code -> description description -> code Both dependencies must be true for the code table to be consistent. Each code value must have one meaning (code -> description). This ensures that the code is unique and is the key for the resulting table. No two codes should have the same meaning (description -> code). This implies that, even though there is no duplication of code values, there are no excess codes in the table. Conclusion The last two articles in this series have investigated different data rule verification techniques involving functional dependencies. Are there other methods of validating one-to-one correspondences, duplicate records, hierarchies, and code tables? Sure. However, all of these other techniques are unique to a specific problem and involve custom development. Using functional dependencies, in conjunction with data profiling technology, helps the analyst very quickly identify and resolve potential data rules issues using the same process and without requiring a developer to custom code queries. Of course, this also lets the developer perform other coding tasks. Functional dependencies are very powerful as a concept in data analysis and assessment. Many types of data quality issues, which may be difficult or impossible to find using other methods, can be identified very easily. Data profiling solutions can infer dependencies quickly and efficiently, and also provide a consistent framework for interpreting the results. Structural inference is an important addition to the collection of analysis techniques. Trademark Notice All products or company names are used for identification purposes only, and may be trademarks of their respective owners. |
|
|
|
|
|
Copyright © 2003 Anthem Consulting, LLC. All Rights Reserved. |