vol 2, no 2
April 15, 2003

Brought to you by Anthem Consulting, LLC

Join our Mailing List


Trends in Data Profiling Technology

 

by Joseph Novella

 

Introduction

 

From simple domain analysis to complex data rule and specialized analysis, today’s data profiling technology form the basis for any data quality or assessment task.  Though only a few years old, data profiling has made great strides in capabilities and acceptance.  The next couple of years promise exciting improvements in usability, presentation, and functionality.

 

This article reviews some of the potential trends and new features that may appear in the data profiling market.  Each new feature will be explored extensively in future articles.  Some of these features are, even now, beginning to appear.

 

Continuous Data Assessment

 

Today’s data profiling solutions do an exceptional job of profiling and analyzing static data sets of almost any size.  They give a very good picture of existing quality problems.  Some products even develop and code cleansing routines that can be incorporated into new applications or system interfaces.

 

Profiling static data sets is only part of the solution.  Continuous data assessment is the other half of the solution.  As records are added or changed in existing data sources, they must be profiled to measure the effectiveness of newer quality processes.  Before data extracts are added to a data warehouse, they must be assessed for completeness and consistency.  An automated, scheduled process can improve this process tremendously.

 

There are two areas where data profiling solutions can help with ongoing or repeated data assessment tasks.

 

Unattended or Productionalized Profiling

 

All data profiling solutions require a relatively large amount of human interaction, and for static data sets, this interaction is necessary.  A practitioner must create the extract file or table, manually invoke each profiling step, manually review the results, and manually record potential or actual problems and solutions.

 

Productivity and ease of use would be greatly improved if the user could schedule or productionalize repeated analysis of existing sources.  In some cases, this can be done using system audit trails.  Or, in the event of periodic data warehouse extracts, files can be profiled automatically on construction.  The assessment can also be automatically sent to the analyst upon completion.

 

Some products have already added some scheduling capabilities.  Avellino has announced that Discovery 3.1 has this functionality.

 

Checking against prior results

 

Business decisions made from profiling initial data sets can be applied to newer data over time.  Valid values, patterns, model dependencies, and foreign keys are verified against new data in existing tables or extracts.

 

Conflicting values and rows are automatically sent to the analyst for further study.  Trending of exceptions can also be performed without having to constantly invoke the profiling software manually.

 

Currently, ETL and data cleansing products can implement the rules discovered during data profiling.  Though effective for conversions, error rates may not be captured or seen by the data quality team.

 

Dynamic Metadata

 

The emphasis for metadata repository technology has been on capturing the current structure of the data systems.  For the purposes of this discussion, we’ll call that “static metadata.”  As data profiling technology adds the ability to perform scheduled or continuous assessment tasks, “dynamic metadata” is captured.  Dynamic metadata captures statistics on the data’s conformance to desired static metadata characteristics.

 

For example, every time a customer source table is profiled, the invalid values of “state_cd,” number of rows with the invalid values, percentage of rows, and possible corrections based on zip code are captured and recorded as dynamic metadata.  This information can be combined with prior assessments to arrive at trends in quality resolution.

 

Dynamic metadata is found primarily in ETL products, but the results are limited to load statistics (e.g., records processed, records accepted, records rejected, exception codes).  While necessary to a data integration project, these statistics are a partial solution and do not apply to data quality assessments.

 

Data profiling technology offers new ways to measure how data changes over time.  Currently, data analysts can use data profiling software to capture statistics as

 

  • Domain or valid value violations
  • Conflicting rows and percentage of rows for individual functional dependencies
  • Percentage of rows and row counts of foreign key violations
  • Percentage of rows and row counts of data rule violations
  • Growth rates in rows, domain values and foreign key connections

 

Only two data profiling vendors offer a subset of dynamic metadata statistics.  Avellino Discovery recently announced a new scheduling feature, though its dynamic metadata capture capability was not mentioned.  A promising new entrant in the data profiling space, Data Quality Suite from DQS Software, captures file and field characteristics over time.  However, this solution is currently limited to certain domain statistics for mainframe sources only.

 

Profiling XML Data

 

Extensible Markup Language (XML) is fast becoming the standard for data interchange between systems and organizations.  It is also being used more and more as a database technology.  Profiling this data natively is getting more attention as well.

 

XML is the primary method of capturing Internet-sourced data.  New standards, such as the Association for Cooperative Operations Research & Development (ACORD) standards for the insurance industry, and the Common Warehouse Metamodel (CWM) have also defined standard XML documents.  Quite often, trained personnel do not enter this data.  Rather, customers, suppliers, or other sources that may not be as rigorous in providing quality information supply this data.  Before converting XML documents into existing sources, this data must be assessed for content, structure and quality.

 

Some data profiling products, like Evoke Axio, generate XML document type definitions (DTDs), and eXtensible Stylesheet Language Transformations (XSLT) documents from profiling and mapping results.  Unfortunately, no data profiling software can natively profile XML documents.  This data must be loaded into a relational database or flat file (delimited or fixed length) before profiling.

 

Assessing and profiling XML data natively is more important as XML becomes the standard for capturing all of an organization’s new data.

 

Inferred vs. User-Defined Investigation

 

There are two schools of thought regarding data profiling methodology.  One is the “user-directed” approach, where the analyst identifies data rules, domains, primary and foreign keys, etc., and the profiling technology verifies or validates what is specified.  Most data quality solutions take this approach.

 

The other approach is the “inference-based” approach, where the technology infers or discovers what is contain within the data, with the analyst determining what is important.  All of the major data profiling products use this approach.

 

What is happening, though, is a convergence of these approaches.  User-defined approaches are targeted to known, highly prioritized issues, but may miss problems.  Inference-based approaches are very efficient at discovering data issues, but tend to provide too much information irrelevant to the business or analyst.  Combining the approaches promises greater effectiveness and productivity.

 

The profiling software would infer the content, structure and quality of the data, as done today.  Then, the software could apply certain types of knowledge to filter out irrelevant results or highlight more important information.

 

For example DQNow infers attribute values initially.  It then has the ability to “recognize” certain types of data, like state codes, and returns values that aren’t valid.  AMB DataMiners’ eCartography product allows the analyst to create complex filters, rules, and statistics dynamically as new repository metadata, in addition to standard inferred statistics.

 

Legacy Renewal

 

The ability of organizations to migrate from and retire older mainframe systems depends on a new type of software called “legacy renewal” software.  These solutions scan existing legacy application code, capture system and business rules, and convert the code into newer-generation languages.  A large percentage of the rules captured by legacy renewal software are data rules.  The source data can be validated against these rules using data profiling technology.

 

Because the legacy code has changed over time, the source data may not always support current data rules.  Data profiling software could be used to determine conformity of data to those rules.  This ensures that the new system not only contains all of the business rules embedded in legacy code, it also ensures consistent, rule-compliant data.

 

As legacy renewal software proliferates, data profiling solutions providers will add the ability to integrate rule information from legacy renewal engines.  This integration may consist of either repository integration, or the ability to automatically accept and verify data rules.  Further integration of data profiling software with business rule engines, used to construct new rule-based applications, may also happen.

 

Conclusion

 

Data profiling technology has made significant strides in the past seven years.  Organizations now have the ability to analyze large amounts of legacy data quickly and discover extensive information not previously known.  However, as with any new technology, enhancements in usability, features, and productivity are still needed.  In future articles, we’ll examine what the features could look like, as well as what the data profiling vendors see in the marketplace regarding new trends and features.

 

The Data Profiler encourages suggestions and ideas from our readers.  Let us know what you want in data profiling technology.

 



Companies Referenced in this Article

AMB DataMiners, Inc.

Ascential Software, Inc.

Avellino, Inc.

DQ Now, Inc.

DQS Software

Evoke Software Corp.


Trademark Notice All products or company names are used for identification purposes only, and may be trademarks of their respective owners.





Copyright © 2002 Anthem Consulting, LLC. All Rights Reserved.