Introduction to Domain Studies
Introduction
The vast majority of existing data profiling efforts focus
primarily on the column or column level.
Commonly known as domain studies, this level of analysis provides
a solid foundation for future tasks such as data rule validation and redundancy
inference. Unfortunately, the current
efforts end with reviewing value frequencies, and possibly, NULL rule
analysis. This article presents some
newer techniques that reveal interesting and important knowledge about the
behavior of domains.
Scope of Domain Studies
Before discussing techniques, let’s define the goals,
objectives, and scope of domain studies.
Here’s a list of general objectives of domain studies. Specific types of projects may add to this
list, but this is a good starting point.
Objectives:
- Identify
the content of each individual column
- Determine
the scope or extent of the information captured in the column
- Understand
the quality and suitability of columns for migration or transformation
Before we continue, let’s clarify some terminology. This article uses the terms column, field
and attribute interchangeably to mean an individual characteristic or
fact about a business concept physically stored in a table, file, or
document. A domain is a set of valid
values, or range of values, coupled with a user-defined data type. This data type can be technically-defined
(number(5,2)), or abstract (newspaper article). A domain can be associated with many different columns, but a
column has one and only one domain.
Domain studies involves the analysis of individual columns
as independent units. We minimize
investigating relationships between columns to decrease analysis complexity at
this early stage. Of course, potential
or documented relationships are identified as future avenues of investigation,
and data rules may be discovered and validated. However, practitioners should resist the temptation to
continually verify relationships between columns. There is a tendency to get lost verifying these rules and not
truly understanding the content of the column under consideration.
Discovering Column Characteristics
Column characteristics or metadata can either be
documented in business requirements, metadata, or other documentation, or
discovered from the actual source data.
Though the automated data profiling solutions have different methods of
acquiring and storing such information, we focus on what types of information
should be captured.
Here are two tables, one of documented metadata, the other
of discovered or actual metadata.
Table 1: Documented Column Metadata
|
Characteristic
|
Description
|
|
Business
Definition
|
A
description of the column developed by the business Subject Matter Expert
(SME). This may also include the
business domain of the column.
|
|
Technical
Definition
|
The
system or application description of the column, usually specified by the DBA
or system administrator.
|
|
Documented
NULL Rule
|
A
column rule specifying whether or not NULLs are allowed in the column.
|
|
Documented
Data Type
|
The
physical data type expected or defined for the column.
|
|
Declared
Domain
|
The
values, range of values, or characteristics of values that are valid for the
particular column.
|
Table 2: Discovered/Inferred Column Metadata
|
Characteristic
|
Description
|
|
Actual
NULL Rule
|
An indicator
stating that the column contains NULL rows.
Include the number of NULL rows that exist.
|
|
Actual
Data Type
|
The
physical data type(s) that best fits all of the values found in the
column. The analyst may also capture
all data types that fit at least some values in the column.
|
|
Value
Frequencies
|
The
list of all values found within the specific column. For columns whose values are 100% unique,
the analyst may want to capture a sample of the values, a range of the
values, and/or an input mask. Variations
on this may also include the top 50 and bottom 50 values sorted by frequency
or alphabetically.
|
|
Mean
|
The
arithmetic average of the column.
Divide the sum of all rows by the number of rows.
|
|
Median
|
The
“middle” value that appears in the column.
For example, if five values appear in a column, sort by value
(alphabetic or numeric) and the third value is the median. For a column with an even number of
values, take the two middle values, add them and divide by two. For a non-numeric column with and even
number of values, specify no median.
|
|
Mode
|
The
value that appears most often in the column. This measurement is applicable to both character and numeric data.
|
|
Standard
Deviation
|
A
statistical measurement for numeric data to determine the rate of clustering
of all values to the mean. This is
used to find the affect of outliers or extremely large or small values on the
distribution of the values.
|
|
Format/Pattern
|
The
actual patterns or formats found within the column. For example, a phone number column may have some values that
correspond to the US format for phone numbers and other values corresponding
to the UK format. Invalid patterns
are also recorded.
|
|
Soundex
|
An
index that converts how a word sounds into a standard code. Used to compare similarly defined words or
phrases that may be spelled differently.
Applied mostly for proper names (e.g., surnames) and can help
standardize and household names and addresses.
|
All of the discovered column metadata is just that, discovered. The data profiling specialist shouldn’t
need, for example, to specify the formats, values, or data types for which the
profiling software should search or test.
The software should be able to interrogate each column and discover this
information.
Once this metadata is captured, there are other possible
problems or issues that must be documented by the data profiling
specialist. Investigating the value
frequencies is one way of discovering these issues.
Table 3: Possible Data Quality/Action Item Issues
|
Characteristic
|
Description
|
|
Zeros
Present
|
Indicates
that the related column contains various forms of a zero value. For example, 0, 0.000, 0000, etc. are
different forms of a zero value.
|
|
Differing
Case
|
An
indicator stating that the column contains some values of all uppercase
characters and other values with mixed upper and lowercase characters. For example, “123 MAIN STREET” and “456
Maple way”. This type of problem may
indicate previous conversions from Mainframe systems, and could cause another
type of problem, duplicate values.
|
|
Unused
|
Indicates
if a column either is 100% NULL or 100% a single value. Useful in removing column from migration
consideration
|
|
Duplicates
found
|
Indicates
the existence of similar values that may mean the same thing. For example, state values of “NJ” and “New
Jersey”. This may also be a result of
Differing Case. For example, “NJ” and
“nj”.
|
|
Possible
Default Value
|
Indicates
that one value in a column “acts” like a default. There are many ways to define “acts”. A common definition is where the most frequently occurring
value appears in 50% of the rows and appears twice as often as the next most
frequently occurring value. For
example, a state code column has “NY”, the most frequently occurring value,
in 55% of the rows while “CA”, the second most frequently occurring value,
occurs 20% of the time. “NY” acts
like a default. Defaults may not be
documented for Yes/No, Y/N, 0/1 indicators.
|
|
Bad
Data
|
Indicates
that there is a relatively large amount of bad or inaccurate values in the
column. May also indicate that packed
data exists where it wasn’t known or expected.
|
|
Date
Issue
|
Indicates
the presence of invalid dates, incorrectly formatted dates, two-digit years,
or multiple formats for dates.
|
|
Code
Table
|
Indicates
that the column contains a finite set of values and that the meanings of
those values are needed. Any unusual
values can also be highlighted.
|
Conclusion
Domain studies encompasses more than just value frequency
analysis. Though the techniques stated
in this article are merely a start to domain studies, they help the analyst
compile a strong base of knowledge about the data. Combining standard documented metadata discovery with actual
domain characteristics and business knowledge can reveal potential problems not
otherwise noticed by data analysts. In
future articles, we may investigate more detailed and complex techniques of
domain studies.