vol 2, no 3
July 15, 2003

Brought to you by Anthem Consulting, LLC

Join our Mailing List


Dynamic Meta Data

 

Introduction

 

As data profiling products mature, their capabilities regarding the statistics they capture will improve.  Currently, the major products (Evoke Axio™, Avellino Discovery™, and Ascential ProfileStage™) all capture data assessment statistics on static data sets.  Repeated profiling of an existing source table/file, however, either overwrites the existing set of statistics, or writes the statistics to a new internal structure entirely, making trending of the results over time difficult with custom queries or snapshots.

 

Meta data repository products excel in capturing the existing documented structure of data from a variety of platforms.  They have developed processes to scan catalogs and data dictionaries, and almost seamlessly update the current meta data cleanly and efficiently.

 

However, documenting existing data structures without understanding what truly comprises the actual data is analogous to recording the labels on the bins, shelves, and rows in a finished-goods warehouse without performing a detailed and periodic inventory of what’s contained in the warehouse.

 

The business cannot make accurate and consistent decisions based on the warehouse “meta-inventory”.  The inventories are required.  Similarly, periodic assessments tell far more about the actual content of the required business data.

 

We’ll call the information gleaned from existing documentation, whether from a repository or data dictionary, “static meta data” and statistics obtained from a data assessment, whether one-time or periodic, “dynamic meta data”.  The discussions focus on new and existing types of dynamic meta data captured, how to capture the statistics, and the management and use of dynamic meta data.

 

Types of Dynamic Meta Data

 

Many software solutions for ETL and query management already capture significant amounts of statistics needed to measure the content and quality of existing source data.  The focus of these products is to improve the ETL process, more efficiently manage the CPU and I/O resources needed, and to create optimized code.  They can indirectly measure the data quality if the ETL statistics reveal unknown problems in the data converted, or if query statistics reveal unknown conditions that affect the performance or accuracy of the queries.

 

For our purposes, we’ll focus on the data profiling statistics that directly measure the conditions of the existing source data.  Understand that our measurements are quantitative, not qualitative.  Like a finished-goods inventory, we’ll develop metrics and measurements that identify what is in the data.  Whether or not the statistics reveal positive or negative results is up to the data specialist.  Dynamic meta data can help the analyst make business decisions.  It cannot, however, make those decisions automatically.

 

There are three types of dynamic meta data that directly measure the data itself, one of which is derived from the other two.

 

  • Content – Statistics measuring what is contained in individual attributes (e.g., value frequencies) and rows (e.g., duplication and redundancy statistics)
  • Structure – Details about the structural integrity of the data, including dependencies, data rules, primary, and foreign keys
  • Trends – Answers how the content and structure of data changes over time

 

Notice, each type of dynamic meta data can be said to focus, though not exclusively, on a specific dimension of static meta data.  Content focuses on attributes.  Structure focuses on relations.  Trends focus on time.

 

Content

 

Prior articles on “domain studies” have discussed what can be captured at the attribute level during a data profiling session.  Row level statistics also include value redundancy across attributes, enumerated domain comparisons, and duplicate rows across relations.

 

Value redundancy statistics measure the amount of overlap between two attributes, based on the value sets.  For example, consider two attributes with the following values and row counts.  “Row Frequency” represents the number of rows that contain the value, and “Attribute Rows” represents the total number of rows.

 

Attribute 1

Value

Row Frequency

Attribute Rows

A

50

1000

B

100

1000

C

500

1000

D

350

1000

 

Attribute 2

Value

Row Frequency

Attribute Rows

<Null>

1000

10000

A

300

10000

C

3000

10000

E

4000

10000

X

700

10000

Z

1000

10000

 

Some basic statistics include

 

  • Percent Overlapping Values – Attribute 1: Number of overlapping values/Value Count (50%)
  • Percent Overlapping Values – Attribute 2 (40%)
  • Percent Rows Overlapping – Attribute 1: Sum of Overlapping Row Frequency/Attribute Rows (55%)
  • Percent Rows Overlapping – Attribute 2 (33%)
  • Completeness  – Attribute 1: Percent of NULL, blank, and zero rows (0%)
  • Completeness – Attribute 2 (90%)

 

Notice that attributes 1 and 2 need not be in the same relation/table.

 

These kinds of statistics can also be applied to enumerated domains and row-level duplication.  Attribute 1 could be an enumerated domain of valid values.  If so, then these statistics can be interpreted to show that 60% of the values and 57% of the rows in Attribute 2 are invalid (assuming NULLs are allowed).

 

Substituting attribute values for row values, the same types of statistics can be used and interpreted with row-level duplication in mind.

 

Structure

 

Measuring the structural integrity of data is mainly accomplished with evaluating functional dependencies.  Though various techniques have been investigated (see the “Structural Inference” series), what has not been covered are some statistics.

 

Once dependencies are marked as “model” or “rule”, and data rules are captured, we can compile the following measures.

 

 

The characteristics of the samples, as well as the sets of conflicting data, may also be needed.

 

Trends

 

The truly interesting dynamic meta data type is the trending possibilities of the base set of statistics.  As repeated assessments are performed, compliance and non-compliance trends can be constructed, on any level.  These trends are important in measuring the effectiveness of data quality solutions or in identifying new problems that may appear.  Of course, trending in nothing new.  What is new is the profiling statistics we can capture and the uses for such statistics.

 

Consider the trending of one statistic, percentage of rows with invalid patterns/formats.  A hypothetical trend is graphed below.

 

 

This chart shows the number of rows with invalid patterns/formats found in new data over time.  The chart tells a story that a one-time assessment cannot.  The general trend is down, perhaps in response to new training in entering the data.  However, though not nearly as bad as before, the percentage (and count) of invalid rows is increasing.  This information can be used by a variety of IT specialists.

 

  • The data quality analyst can measure the effectiveness of quality resolutions or facilitate root-cause analysis
  • The data modeler can investigate the valid patterns/formats for better domain definition
  • The ETL analyst/developer can change the transformations to accommodate the errors
  • The business analyst can determine if more training is needed or if the structure of the data is changing
  • The business user can understand the validity of the data in the attribute and can quantify the risk, if any, of using the data

 

Infrastructure

 

The underlying infrastructure to capture and maintain these types of dynamic meta data over time does not really exist in current data profiling software.  The emphasis of these products is to assess data at a point in time.  Organizations that want to use these statistics may need to develop their own dynamic meta data repositories.  They can leverage the static meta data maintained in traditional repositories and the newer data profiling repositories.  The trick is to capture the statistics over time.

 

One way is to associate each set of statistics with a “Project/Effort” and a “Profiling Task”.  This way, every time a data profiling task is executed on new or changed data, new statistics are recorded without overwriting previous sets.  The user should retain the option of committing the new statistics to the repository, since the analyst may be performing multiple test runs and not want each run permanently recorded.

 

Finally, a Business Intelligence platform and integrated ETL development application can be implemented over the repository to accommodate trending analysis and ETL analysis, respectively.  In fact, as statistics are created and populated, the meta data repository starts looking more and more like a “data warehouse” for IT.  One can even imagine a “meta data mart” for each subject area or business process within a source system.  What makes this possible is the existence of dynamic meta data over time.

 

Conclusion

 

To complete this article, here is an initial list of statistics that may be useful in data assessment efforts.  Each statistic can be trended over time, if needed.  This is not a comprehensive list, but merely a starting point.  If you have any other statistics that you have used in the past, let us know.

 

Content

 

·        Count and percentage of valid (and invalid) values for each coded source attribute

·        Count and percentage of rows with valid or invalid values for each coded source attribute

·        Count and percentage of rows with valid or invalid patterns/formats

·        Count and percentage of rows with NULLs, blanks, and zeros in attribute (Completeness)

·        Count of values in attribute found in only one row

·        Uniqueness percentage of attribute (Number of unique values/Number of rows)

·        Count and percentage of values in one attribute that overlap with another attribute

·        Count of potentially similarly meaning values in attribute (e.g. New York, NY, nyc)

 

Structure

 

·        Count and percentage of true dependencies in source sample

·        Count and percentage of rows conflicting (or supporting) per dependency

·        Count and percentage of rows conflicting (or supporting) per data rule

·        Count and percentage of parent values in foreign key relationship without children

·        Count and percentage of child values in foreign key relationship without parents

 

Trademark Notice All products or company names are used for identification purposes only, and may be trademarks of their respective owners.






Copyright © 2003 Anthem Consulting, LLC. All Rights Reserved.