![]() |
|
vol 2, no 4 October 15, 2003 |
|
Brought to you by Anthem Consulting, LLC |
Join our Mailing List
|
|
|
|
|
Standardized Domains and Code SetsJoseph Novella Introduction Few projects so successfully illustrate the need for data quality solutions like data warehousing. Departmental databases or applications, generally, hide data quality problems by their general nature. Databases tailored and exclusively used by individual departments or organizations satisfy specific problems and are not really meant for generalized use. When other departments need data from these systems, a tailored “cleansed” point-to-point interface is developed. One of the problems encountered while integrating these data sources involves the many different valid value sets, or code sets, for any particular business concept, e.g., US States. Though each department may justify its version of the codes in question, the organization as a whole incurs increased costs when information is shared. We’ll investigate methods for standardizing on one (or at least fewer) code sets as a first step towards conforming dimensions. Four Considerations In order to design a code set that both satisfies business requirements and simplifies systems development, there are four areas on which attention must be focused. Prevalence Also called acceptance, prevalence involves quantifying the number of users, organizations, systems, and tables/files that reference and utilize the code set. The prevalence of a code set is, obviously, the most visible measurement used in standardizing a domain. However, some organizations only examine the number of users or departments that reference a code set. The number of systems and tables/files can now be a consideration since the advent of automated data profiling solutions. A simple way of measuring the prevalence of a code set is to use Redundancy Inference. Create a code or reference table of the code set and treat it as a source file in the data profiling engine. The number of overlap pairs, as well as percentage of overlap, gives a good indication of the prevalence of a code set in a source system. Table 1: Example of a Set of Overlap Candidates
For each code set, determine the number of potentially related source files or tables that reference the code set. Each pair should be examined for a minimum amount of overlap. For example, row 4 in Table 1 may be rejected as a legitimate reference to post_state_cd because of the low percentage overlap (12%). Some measurements that can be used to quantify the prevalence of a code set are listed below.
In effect, the more organizations, users, etc. that utilize and reference a code set, the more likely a candidate for a standardized domain. Consistency Consistency is related to Prevalence in that similar techniques are used. The purpose is to understand how well each organization or system has implemented the code set. Though an organization may indicate that the use a code set, certain problems may exist that indicates the code set is not well maintained.
One way of determining consistency is to use Redundancy Inference techniques to identify referential integrity (RI) problems between the code set and the source tables that use the code set. The lower the percentage overlap between each pair of attributes indicates more severe RI issues. If the profiling engine has the capability, drill down to actual source rows that do not have valid codes for the attribute in question to further investigate the problem. Some measurements to use to determine the consistency of implementation are
As a side benefit, capturing these measurements identifies any cleansing needed in order to successfully migrate the data, even if the domains aren’t standardized. If the code set is well maintained and is consistently implemented, it’s a good candidate for the standard. Completeness The root cause of many invalid value problems can be traced to the ability of the code set to accurately and completely describe the business concept it represents. For example, an organization that originally did business only in the US will find its state code table insufficient for Canadian provinces. Instead of extending an existing code set or introducing a new one, the organization either removes edits to the existing state columns (to make entry simpler), or may just “live” with the new, nonstandard values. The standardization of a domain requires that the domain completely describe the underlying concept. Examine the code set for any missing values or to see if the concept it’s describing should be broadened or expanded. One way to partially quantify the completeness of a code set is to examine the invalid values of related attributes. Invalid values may consist of genuinely bad data (e.g., ‘?C’ in a state column, or potential new values that can be added to a new domain (e.g., ‘ON’ for Ontario as a state code). If the invalid values of a column mostly consist of potential new values, the existing code set can be considered less complete than necessary. Otherwise the code set is fairly complete with the values it has. A final consideration involves the amount of information captured for each code set. A code table that only captures codes and descriptions is not as complete as a similar code table capturing more information. For example, consider a state code table capturing
This table is far more descriptive than one with only the state code and name. Of course, the data in the code table must also be examined for quality issues as well (e.g., is the sales region updated yearly?). Source As with every other type of data, the quality of a code set is also affected by its source. The interest in the source is not in the exact department or organization that maintains the code set. Rather, it’s the type of organization that is the concern. Generally, there are four types of organizations that maintain code sets.
In many cases, the scope of the code set and subsequent domain is affected by its maintaining source. For example, an internal department may maintain its own industry codes, while a standard maintained by a 3rd party company, such as the Standard Industry Classification (SIC) codes may have more complete coverage. Finally, the governments subject to the North American Free Trade Agreement (NAFTA) maintain the North American Industry Classification System (NAICS), which may be needed for international companies. The selection of a standard code set or domain is influenced by its originating and maintaining source. Generally, government-maintained codes have greater scope. However, they are not modified or extended very easily or often. Conclusion The acceptance of a data warehousing environment by the user community is directly influenced by the quality of the data. A component of this is how well data administration constructs standardized domains. These domains must not only be generally accepted by the organization, they must also be of high quality and easily maintained. Select and modify domains based on their prevalence, consistency, completeness, and source to minimize the users’ learning curves and maximize the domain’s applicability. |
|
|
|
|
|
Copyright © 2003 Anthem Consulting, LLC. All Rights Reserved. |