vol 2, no 1
January 15, 2003

Brought to you by Anthem Consulting, LLC

Join our Mailing List


Introduction to Domain Studies

 

Introduction

 

The vast majority of existing data profiling efforts focus primarily on the column or column level.  Commonly known as domain studies, this level of analysis provides a solid foundation for future tasks such as data rule validation and redundancy inference.  Unfortunately, the current efforts end with reviewing value frequencies, and possibly, NULL rule analysis.  This article presents some newer techniques that reveal interesting and important knowledge about the behavior of domains.

 

Scope of Domain Studies

 

Before discussing techniques, let’s define the goals, objectives, and scope of domain studies.  Here’s a list of general objectives of domain studies.  Specific types of projects may add to this list, but this is a good starting point.

 

Objectives:

  1. Identify the content of each individual column
  2. Determine the scope or extent of the information captured in the column
  3. Understand the quality and suitability of columns for migration or transformation

 

Before we continue, let’s clarify some terminology.  This article uses the terms column, field and attribute interchangeably to mean an individual characteristic or fact about a business concept physically stored in a table, file, or document.  A domain is a set of valid values, or range of values, coupled with a user-defined data type.  This data type can be technically-defined (number(5,2)), or abstract (newspaper article).  A domain can be associated with many different columns, but a column has one and only one domain.

 

Domain studies involves the analysis of individual columns as independent units.  We minimize investigating relationships between columns to decrease analysis complexity at this early stage.  Of course, potential or documented relationships are identified as future avenues of investigation, and data rules may be discovered and validated.  However, practitioners should resist the temptation to continually verify relationships between columns.  There is a tendency to get lost verifying these rules and not truly understanding the content of the column under consideration.

 

Discovering Column Characteristics

 

Column characteristics or metadata can either be documented in business requirements, metadata, or other documentation, or discovered from the actual source data.  Though the automated data profiling solutions have different methods of acquiring and storing such information, we focus on what types of information should be captured.

 

Here are two tables, one of documented metadata, the other of discovered or actual metadata.

 

Table 1: Documented Column Metadata

Characteristic

Description

Business Definition

A description of the column developed by the business Subject Matter Expert (SME).  This may also include the business domain of the column.

Technical Definition

The system or application description of the column, usually specified by the DBA or system administrator.

Documented NULL Rule

A column rule specifying whether or not NULLs are allowed in the column.

Documented Data Type

The physical data type expected or defined for the column.

Declared Domain

The values, range of values, or characteristics of values that are valid for the particular column.

 

 

Table 2: Discovered/Inferred Column Metadata

Characteristic

Description

Actual NULL Rule

An indicator stating that the column contains NULL rows.  Include the number of NULL rows that exist.

Actual Data Type

The physical data type(s) that best fits all of the values found in the column.  The analyst may also capture all data types that fit at least some values in the column.

Value Frequencies

The list of all values found within the specific column.  For columns whose values are 100% unique, the analyst may want to capture a sample of the values, a range of the values, and/or an input mask.  Variations on this may also include the top 50 and bottom 50 values sorted by frequency or alphabetically.

Mean

The arithmetic average of the column.  Divide the sum of all rows by the number of rows.

Median

The “middle” value that appears in the column.  For example, if five values appear in a column, sort by value (alphabetic or numeric) and the third value is the median.  For a column with an even number of values, take the two middle values, add them and divide by two.  For a non-numeric column with and even number of values, specify no median.

Mode

The value that appears most often in the column. This measurement is applicable to both character and numeric data.

Standard Deviation

A statistical measurement for numeric data to determine the rate of clustering of all values to the mean.  This is used to find the affect of outliers or extremely large or small values on the distribution of the values.

Format/Pattern

The actual patterns or formats found within the column.  For example, a phone number column may have some values that correspond to the US format for phone numbers and other values corresponding to the UK format.  Invalid patterns are also recorded.

Soundex

An index that converts how a word sounds into a standard code.  Used to compare similarly defined words or phrases that may be spelled differently.  Applied mostly for proper names (e.g., surnames) and can help standardize and household names and addresses.

 

All of the discovered column metadata is just that, discovered.  The data profiling specialist shouldn’t need, for example, to specify the formats, values, or data types for which the profiling software should search or test.  The software should be able to interrogate each column and discover this information.

 

Once this metadata is captured, there are other possible problems or issues that must be documented by the data profiling specialist.  Investigating the value frequencies is one way of discovering these issues.

 

Table 3: Possible Data Quality/Action Item Issues

Characteristic

Description

Zeros Present

Indicates that the related column contains various forms of a zero value.  For example, 0, 0.000, 0000, etc. are different forms of a zero value.

Differing Case

An indicator stating that the column contains some values of all uppercase characters and other values with mixed upper and lowercase characters.  For example, “123 MAIN STREET” and “456 Maple way”.  This type of problem may indicate previous conversions from Mainframe systems, and could cause another type of problem, duplicate values.

Unused

Indicates if a column either is 100% NULL or 100% a single value.  Useful in removing column from migration consideration

Duplicates found

Indicates the existence of similar values that may mean the same thing.  For example, state values of “NJ” and “New Jersey”.  This may also be a result of Differing Case.  For example, “NJ” and “nj”.

Possible Default Value

Indicates that one value in a column “acts” like a default.  There are many ways to define “acts”.  A common definition is where the most frequently occurring value appears in 50% of the rows and appears twice as often as the next most frequently occurring value.  For example, a state code column has “NY”, the most frequently occurring value, in 55% of the rows while “CA”, the second most frequently occurring value, occurs 20% of the time.  “NY” acts like a default.  Defaults may not be documented for Yes/No, Y/N, 0/1 indicators.

Bad Data

Indicates that there is a relatively large amount of bad or inaccurate values in the column.  May also indicate that packed data exists where it wasn’t known or expected.

Date Issue

Indicates the presence of invalid dates, incorrectly formatted dates, two-digit years, or multiple formats for dates.

Code Table

Indicates that the column contains a finite set of values and that the meanings of those values are needed.  Any unusual values can also be highlighted.

 

Conclusion

 

Domain studies encompasses more than just value frequency analysis.  Though the techniques stated in this article are merely a start to domain studies, they help the analyst compile a strong base of knowledge about the data.  Combining standard documented metadata discovery with actual domain characteristics and business knowledge can reveal potential problems not otherwise noticed by data analysts.  In future articles, we may investigate more detailed and complex techniques of domain studies.


Domain Studies Vendors

AMB Dataminers

Ascential Software

Avellino

DQ Now

Evoke Software

KnowledgeDriver

SAS Corporation

Similarity Systems, Inc.

Trademark Notice All products or company names are used for identification purposes only, and may be trademarks of their respective owners.



Copyright © 2003 Anthem Consulting, LLC. All Rights Reserved.