![]() |
|
vol 2, no 1 January 15, 2003 |
|
Brought to you by Anthem Consulting, LLC | Join our Mailing List
|
|
|
|
|
Structural Inference Techniques Part 2: Embedded Relations Introduction In the last issue of The Data Profiler, we discussed how inferred functional dependencies help in identifying candidate keys of a particular source file. For non-relational legacy sources, though, it is crucial that the files are normalized accurately for easier migration to relational databases, implying that each source file can be transformed into many normalized relations. Part 2 of Structural Inference Techniques outlines how to identify and normalize these “embedded relations.” Part 1 Review Let’s first review some definitions and conventions introduced in Part 1. A relation is analogous to, though not exactly, a “table,” “file,” or “record type” with a finite set of attributes and a finite number of rows. An attribute is analogous to, though not exactly, a “column” or “field” and represents a “named domain” associated with the relation. Throughout the article, we will mainly use the terms relation and attribute, though occasionally, “table” is used for “relation” and “column” is used for “attribute.” All relations used in the examples are defined using standard SQL. When an example requires data, a table is presented below the relation definition. Each table of data contains a header row of attribute names corresponding to the relation definition. Empty values or NULLs are represented by “<NULL>” without quotes. All inference of the inherent structure of data rests upon the functional dependency, defined as follows. Given a relation R and attributes X and Y in R, Y is functionally dependent on X if for every unique value of X there is one and only one value of Y. The functional dependency is represented as, X->Y and read as “X determines Y.” In a functional dependency, the left hand side (“X”) is called the determinant and the right hand side (“Y”) is called the dependent. The determinant can contain many attributes (e.g., order_num+order_line_num->ordered_prod_id). An embedded relation is a relation that can be normalized out of existing source data without changing the structure or content of the data. For example, code or reference tables may be considered embedded within a larger, de-normalized table. Objectives of Normalization There are many reasons an organization may normalize (or re-normalize) source data. A company may want to change the focus of the data. For example, insurance companies traditionally capture data in a policy-centric manner. To develop CRM applications, these companies need to transform and capture this data into a customer-centric manner, requiring finding the customer information embedded within their policy files. Another reason for normalization may be to determine whether or not a purchased application can capture the same or similar information captured within legacy systems. This can be accomplished by a model-to-model comparison. Of course, it is easier to compare normalized models than de-normalized structures. Finally, a third reason is the development of custom databases, such as data warehouses or data marts. A properly normalized data structure can form the basis of either an Operational Data Store (ODS) or a staging area database. Also, the development of the data warehouse database, itself, is made easier by completely understanding the inherent structure of the source data. Functional Dependencies and Normalization The basis for all normalization rests in the functional dependency. When data modelers define entities and attributes, they are implicitly defining functional dependencies that must be enforced by the database management system. For example, consider the following entity, stated in standard SQL. CREATE
TABLE Employee (empl_id char(6) NOT
NULL, empl_first_nm
varchar(30) NOT NULL, empl_last_nm
varchar(40) NOT NULL, empl_curr_sal_amt
number(10,2)); ALTER
TABLE Employee ADD PRIMARY KEY (empl_id); When the primary key of the entity is defined, in this case empl_id, the modeler is, in effect, creating three functional dependencies. empl_id->empl_first_nm empl_id->empl_last_nm empl_id->empl_curr_sal_amt The determinant becomes the primary key of the Employee entity. There is a difference between the traditional data modeling exercise and the data profiling process. A traditional data model is developed from business requirements, existing documentation, and possible file layouts. The data profiling specialist develops data models and structures from inferred dependencies that exist within the existing source data, and then incorporates new business requirements into this model. As seen in Part 1, a candidate key for a relation is defined as a common, minimal determinant that identifies all other attributes in the relation. We’ll use this, in conjunction with other techniques, to identify embedded relations. Patterns in the Determinant The first step in finding embedded relations involves sets of dependencies with the same or similar determinant. Dependencies with the same determinant possibly signal that the determinant is a candidate key for a relation with the dependents as non-key attributes. If a particular determinant appears in the list of dependencies examine the dependent attributes to see if they (the dependent attributes) can be logically normalized into their own relation. For example, consider the following list of dependencies for the relation specified. CREATE
TABLE order (order_id char(8) NOT
NULL, order_dt date NOT NULL, terms varchar(15) NOT NULL, cust_bill_id
char(10) NOT NULL, cust_bill_nm
varchar(35) NOT NULL, bill_cd char(3), bill_desc char(30), cust_ship_id char(10), tot_ord_amt money, tot_sls_tax_amt money) Table 1: ORDER Inferred Dependencies
While investigating patterns in the determinants, the practitioner should first evaluate the suitability of the determinant as a primary key of a data structure. For example, dependency 9 (order_dt->bill_cd), though true in the data, may not be considered important to the business. Nor is a data structure with only “order_dt” as the primary key well defined or suitable as a table. There are three sets of dependencies we need to investigate.
Dependencies 1-5 indicate a potential “Order” relation because of the common “order_id” determinant. We also see that each dependent in this set can reasonably be a non-key attribute of the Order relation. Of course, this must be verified by the business, but having these inferred dependencies gives us a good data analysis foundation. If dependencies 1-5 have “business value,” or if they contribute to the development of a model, then the practitioner can mark them as model dependencies. Dependencies 6-8 indicate a possible “Customer Billing Location” relation. However, since the “terms” attribute is already a part of the Order relation, and there may not be any added business value in associating it with the Customer Billing Location relation, we do not need to mark dependency 8 as a model dependency. Once an attribute acts as a dependent for a model dependency, you may be able to ignore any other dependency with the same dependent. Techniques involving multiple dependencies having the same dependent are the subject of a future article. For dependency 10, “bill_cd” is already a dependent for a model dependency (cust_bill_id->bill_cd). The last two dependencies (#11, #12) signal a domain or code table. Typically, these tables contain two attributes, a code and a description. When you see two dependencies like this, two attributes determining each other, especially with code attributes, a domain table is possible. However, in order to construct the code relation, specify the dependency with the code attribute in the determinant as a model dependency (in this case, #11). Do not mark the other dependency as part of the model, since that would create a relation with the description attribute as the primary key. After this review, the following dependencies are now marked as model dependencies.
Though there is still some more analysis that must be performed, we have identified the embedded relations in the Order source file. · Order (dependencies 1-5) · Customer Billing Location (dependencies 6-7) · Billing Type (dependency 8) Other Considerations The intent of this technique is not to replace the data modeler. Nor should it be suggested that the model is complete with the inference of dependencies from the data. If the source data is of low quality, the dependencies that are inferred may not be suitable to the project needs and may never become model dependencies. The data modelers and subject matter experts must make the decisions regarding the data structures, just as in traditional analysis methodologies. Also, the automated data profiling solutions only infer covering dependencies. The data profiling specialist may need to add attributes to the determinant to develop the data structure needed, especially if there are constants in the source data. Again, the specialist, with the SMEs, must evaluate the business value of the dependencies. Conclusion The ability to normalize a source file is a crucial aspect in determining the structural integrity of data. Without examining inferred dependencies and only relying on documentation or metadata, the analyst runs the risk of developing a model that does not support the actual data. Though business requirements, both new and existing, are crucial to the process, the data analyst should examine the source data in an effort to fully understand and validate their models. |
Structural Inference Vendors Ascential Software Avellino Evoke Software KnowledgeDriver Similarity Systems, Inc. Related Articles Part 1 - Candidate Keys |
|
|
|
|
|
Copyright © 2003 Anthem Consulting, LLC. All Rights Reserved. |