![]() |
|
vol 1, no 1 October 15, 2002 |
|
Brought to you by Anthem Consulting, LLC | Join our Mailing List
|
|
|
|
|
Structural Inference Techniques Part 1: Identifying Candidate Primary KeysIntroduction Structural Inference, also known as Dependency Profiling, has been relatively ignored in the data profiling industry, mainly because of the perception that it is not useful or too difficult to understand. However, understanding and evaluating the dependencies inherent in source data is crucial to the “structural integrity” of the data. Structural issues are the hardest and most costly to find and fix. Structural inference reduces the time and effort needed to find and solve structural integrity issues, making migration of non-relational data much easier and far more accurate. Over the next few issues of “The Data Profiler,” each issue will investigate an analysis technique using structural inference results. In this article, we’ll define what a functional dependency is, discuss the breakthrough technology that launched the data profiling industry, and discuss how to identify potential candidate keys for a relation. Readers should be familiar with basis data profiling techniques and methodologies, including the steps and tasks involved. Functional Dependencies DefinedLet’s first discuss some terminology and reading conventions. All of the following concepts are based upon the Relational Model as defined by Ted Codd in 1968. Though, this article is not as rigorous in its definitions, the basic ideas are the same. A relation is analogous to, though not exactly, a “table,” “file,” or “record type” with a finite set of attributes and a finite number of rows. An attribute is analogous to, though not exactly, a “column” or “field” and represents a “named domain” associated with the relation. Throughout the article, we will mainly use the terms relation and attribute, though occasionally, “table” is used for “relation” and “column” is used for “attribute.” All relations used in the examples are defined using standard SQL. When an example requires data, a table is presented below the relation definition. Each table of data contains a header row of attribute names corresponding to the relation definition. Where applicable, empty values or NULLs are represented by “<NULL>” without quotes. All inference of the inherent structure of data rests upon the functional dependency, defined as follows.
For example, think about your organization. Every employee has an employee number. In fact, every employee should have a different employee number. So, for every unique value of employee number, there should be one, and only one, employee assigned that number (if not, think of the payroll, benefit, and tax implications). So, you could also say that the employee number functionally determines the employee name, address, phone number, etc. The dependencies that illustrate this example are as follows. employee number ® employee nameemployee number ® employee address employee number ® phone number Though the definition does not explicitly state it, the determinant can contain more than one attribute (e.g., order_num + order_line_id ® ordered_prod_cd, given each unique value combination of order_num and order_line_id, there is one and only one value for ordered_prod_cd). Embedded relations exist in non-normalized files and, through the use of dependencies, can be successfully normalized out of the file without changing the underlying data or meaning of the structure. For example, an insurance policy source file may contain customer data that can be normalized out into its own relation. The customer relation is embedded in the policy data. When modeling new systems, data analysts rarely investigate existing source data to identify the structures that are needed. The business requirements, file layouts, and documentation are the main sources of information that an analyst uses. Unfortunately, when the existing data is migrated to the new application, the inherent structure of the data may prevent a clean and easy conversion. An analyst that manually investigates the structure of source data finds that the process is slow and takes too much time. This problem was solved with the development of dependency inference technology. Breakthrough TechnologyWith the advent of efficient and fast mathematical algorithms that infer functional dependencies inherent within data, profiling the structure of large quantities of data became much easier. Automated software products were developed around the “dependency inference engine.” These products test all possible dependencies for a relation, and return those that are true within the data. An advantage to testing all possible dependencies is completeness of results. All possibilities are tested in a comprehensive, structured task. Nothing is left to chance. Human error is also removed from the process. The importance of this analysis is in its ability to infer the third normal form structure of a specified source file. The knowledge of the business, combined with the known dependencies, defines the optimal normalized structure necessary to satisfy the business requirements. This normalized model can be used as the basis for a staging area of data, a critique of a pre-defined database or application, or a data quality effort. There are, though, some other techniques that can aid in identifying business rules and validating existing business rules that may not be enforced by the data structure. Techniques for ProfilingIn reviewing the inferred dependencies, the analyst uses existing business knowledge to identify important structures of data. Sometimes, these structures are readily apparent, say with “customer_id” as the data’s inferred primary key. Other times, the structures are not so apparent or may require more or different data to see them. Regardless of the efficiency or speed of the algorithms, or the speed of the hardware and software, very large source files or tables must be sampled for effective and timely results. Finally, the analyst may also investigate interesting dependencies that may signal completely new structures or rules not explicitly documented or defined. Primary Key InferenceOne of the first, and most used techniques is identifying the primary key of the source file. During Attribute Analysis, a set of candidate keys may have been identified. Attributes that are 100% unique or a set of attributes that the business or DBA specified are candidates. To identify a true candidate key, investigate dependencies for a determinant that appears most often in the list. For example, let’s examine the following table. CREATE TABLE EMPL_JOBS Under normal circumstances, a primary key would be defined. In this case, though, either the DBA has turned off unique indexing (to quickly load data or for performance reasons), or a key was never defined or is programmatically validated (pre-relational databases). The data analyst reviews the inferred dependencies obtained from the data and finds the following.
The analyst determines that dependencies 6 through 8 are noise and meaningless to the business. However, dependencies 1-4 imply that empl_id+job_id is a valid primary key for the table since for each unique combination of empl_id along with the job_id found in the data, there is one and only one value each for actual_work_hours, job_end_dt, job_start_dt, and sched_work_hours. If there is no determinant that appears more often than any other determinant, this may imply one of two things.
The rest of this article investigates the first possibility of determinant combinations. Embedded relations will be investigated in a future article. Investigating DeterminantsTo illustrate possibility #1, let’s modify the example above by changing the inferred dependencies to the following.
We are going to rely on the following property.
Inference engines only show covering dependencies; otherwise the number of true dependencies is prohibitively large. However, we are not limited to the covering set. We can safely add any attribute(s) to any determinant without issue, in order to identify a candidate key. Again, dependency 7 is deemed noise and discarded from consideration (dependency 6 takes on more meaning, as we shall see). Dependency 1 shows that job_end_dt is a constant or contains only 1 distinct value.
Notice what has happened. Investigate individual attributes and see where they exist in determinants. Empl_id is found in three determinants (2, 3 and 5), job_id in two determinants (3 and 4), and job_start_dt in two determinants (5 and 6). The candidate keys must uniquely identify all other attributes in the relation. The analyst has to investigate three possibilities, empl_id+job_id, empl_id+job_start_dt, and job_id+job_start_dt. Lists of dependents for each candidate are shown below. empl_id+job_id empl_id+job_start_dt
job_id+job_start_dt The only possible determinant combination that identifies all other attributes is empl_id+job_id. This is the smallest (in terms of attribute count) candidate key for the relation EMPL_JOBS. Of course, any other attribute can be added to create an additional candidate, but the data do not need other attributes. Given that a typical inference can generate hundreds of dependencies, this process can get quite complex. Bear in mind, however, that this technique minimizes the reliance on any documentation, user input, or business knowledge. Eliminating possible avenues of investigation, based on such input, can significantly decrease the amount of time and effort required to identify candidate keys. The core process, though, remains unchanged, regardless of the data source environment. ConclusionStructural Inference is a highly effective tool that can accurately determine the structural integrity of a data source. Discovering a list of potential candidate keys for a source file is a useful method of accurately describing its structure. If the primary key is not documented or validated by the database or application environment, structural inference can quickly and concisely identify the candidate keys. Even if the primary key is well defined and maintained, structural inference can help the analyst determine if any other attribute(s) can be candidate keys. |
Solutions Supporting Structural Inference Evoke Software's Axio Avellino's Discovery Ascential Software's MetaRecon KnowledgeDriver's Profiler Suitcase Similarity System's Athanor Trademark Notice All products or company names are used for identification purposes only, and may be trademarks of their respective owners. |
|
|
|
|
|
Copyright © 2002 Anthem Consulting, LLC. All Rights Reserved. |