![]() |
|
vol 2, no 3 July 15, 2003 |
|
Brought to you by Anthem Consulting, LLC |
Join our Mailing List
|
|
|
|
|
Dynamic Meta Data Introduction As data profiling products mature, their capabilities regarding the statistics they capture will improve. Currently, the major products (Evoke Axio™, Avellino Discovery™, and Ascential ProfileStage™) all capture data assessment statistics on static data sets. Repeated profiling of an existing source table/file, however, either overwrites the existing set of statistics, or writes the statistics to a new internal structure entirely, making trending of the results over time difficult with custom queries or snapshots. Meta data repository products excel in capturing the existing documented structure of data from a variety of platforms. They have developed processes to scan catalogs and data dictionaries, and almost seamlessly update the current meta data cleanly and efficiently. However, documenting existing data structures without understanding what truly comprises the actual data is analogous to recording the labels on the bins, shelves, and rows in a finished-goods warehouse without performing a detailed and periodic inventory of what’s contained in the warehouse. The business cannot make accurate and consistent decisions based on the warehouse “meta-inventory”. The inventories are required. Similarly, periodic assessments tell far more about the actual content of the required business data. We’ll call the information gleaned from existing documentation, whether from a repository or data dictionary, “static meta data” and statistics obtained from a data assessment, whether one-time or periodic, “dynamic meta data”. The discussions focus on new and existing types of dynamic meta data captured, how to capture the statistics, and the management and use of dynamic meta data. Types of Dynamic Meta Data Many software solutions for ETL and query management already capture significant amounts of statistics needed to measure the content and quality of existing source data. The focus of these products is to improve the ETL process, more efficiently manage the CPU and I/O resources needed, and to create optimized code. They can indirectly measure the data quality if the ETL statistics reveal unknown problems in the data converted, or if query statistics reveal unknown conditions that affect the performance or accuracy of the queries. For our purposes, we’ll focus on the data profiling statistics that directly measure the conditions of the existing source data. Understand that our measurements are quantitative, not qualitative. Like a finished-goods inventory, we’ll develop metrics and measurements that identify what is in the data. Whether or not the statistics reveal positive or negative results is up to the data specialist. Dynamic meta data can help the analyst make business decisions. It cannot, however, make those decisions automatically. There are three types of dynamic meta data that directly measure the data itself, one of which is derived from the other two.
Notice, each type of dynamic meta data can be said to focus, though not exclusively, on a specific dimension of static meta data. Content focuses on attributes. Structure focuses on relations. Trends focus on time. Content Prior articles on “domain studies” have discussed what can be captured at the attribute level during a data profiling session. Row level statistics also include value redundancy across attributes, enumerated domain comparisons, and duplicate rows across relations. Value redundancy statistics measure the amount of overlap between two attributes, based on the value sets. For example, consider two attributes with the following values and row counts. “Row Frequency” represents the number of rows that contain the value, and “Attribute Rows” represents the total number of rows. Attribute 1
Attribute 2
Some basic statistics include
Notice that attributes 1 and 2 need not be in the same relation/table. These kinds of statistics can also be applied to enumerated domains and row-level duplication. Attribute 1 could be an enumerated domain of valid values. If so, then these statistics can be interpreted to show that 60% of the values and 57% of the rows in Attribute 2 are invalid (assuming NULLs are allowed). Substituting attribute values for row values, the same types of statistics can be used and interpreted with row-level duplication in mind. Structure Measuring the structural integrity of data is mainly accomplished with evaluating functional dependencies. Though various techniques have been investigated (see the “Structural Inference” series), what has not been covered are some statistics. Once dependencies are marked as “model” or “rule”, and data rules are captured, we can compile the following measures.
The characteristics of the samples, as well as the sets of conflicting data, may also be needed. Trends
The truly interesting dynamic meta data type is the trending possibilities of the base set of statistics. As repeated assessments are performed, compliance and non-compliance trends can be constructed, on any level. These trends are important in measuring the effectiveness of data quality solutions or in identifying new problems that may appear. Of course, trending in nothing new. What is new is the profiling statistics we can capture and the uses for such statistics. Consider the trending of one statistic, percentage of rows with invalid patterns/formats. A hypothetical trend is graphed below.
This chart shows the number of rows with invalid patterns/formats found in new data over time. The chart tells a story that a one-time assessment cannot. The general trend is down, perhaps in response to new training in entering the data. However, though not nearly as bad as before, the percentage (and count) of invalid rows is increasing. This information can be used by a variety of IT specialists.
Infrastructure The underlying infrastructure to capture and maintain these types of dynamic meta data over time does not really exist in current data profiling software. The emphasis of these products is to assess data at a point in time. Organizations that want to use these statistics may need to develop their own dynamic meta data repositories. They can leverage the static meta data maintained in traditional repositories and the newer data profiling repositories. The trick is to capture the statistics over time. One way is to associate each set of statistics with a “Project/Effort” and a “Profiling Task”. This way, every time a data profiling task is executed on new or changed data, new statistics are recorded without overwriting previous sets. The user should retain the option of committing the new statistics to the repository, since the analyst may be performing multiple test runs and not want each run permanently recorded. Finally, a Business Intelligence platform and integrated ETL development application can be implemented over the repository to accommodate trending analysis and ETL analysis, respectively. In fact, as statistics are created and populated, the meta data repository starts looking more and more like a “data warehouse” for IT. One can even imagine a “meta data mart” for each subject area or business process within a source system. What makes this possible is the existence of dynamic meta data over time. Conclusion To complete this article, here is an initial list of statistics that may be useful in data assessment efforts. Each statistic can be trended over time, if needed. This is not a comprehensive list, but merely a starting point. If you have any other statistics that you have used in the past, let us know. Content · Count and percentage of valid (and invalid) values for each coded source attribute · Count and percentage of rows with valid or invalid values for each coded source attribute · Count and percentage of rows with valid or invalid patterns/formats · Count and percentage of rows with NULLs, blanks, and zeros in attribute (Completeness) · Count of values in attribute found in only one row · Uniqueness percentage of attribute (Number of unique values/Number of rows) · Count and percentage of values in one attribute that overlap with another attribute · Count of potentially similarly meaning values in attribute (e.g. New York, NY, nyc) Structure
· Count and percentage of true dependencies in source sample · Count and percentage of rows conflicting (or supporting) per dependency · Count and percentage of rows conflicting (or supporting) per data rule · Count and percentage of parent values in foreign key relationship without children · Count and percentage of child values in foreign key relationship without parents Trademark Notice All products or company names are used for identification purposes only, and may be trademarks of their respective owners. |
|
|
|
|
|
Copyright © 2003 Anthem Consulting, LLC. All Rights Reserved. |