vol 3, no 1
July 15, 2004

Brought to you by Anthem Consulting, LLC

Join our Mailing List


Business Rule Validation

 

Joseph Novella

 

Introduction

 

With the new focus on business rules and business rule engines, interest in documenting and validating business rules against the data is increasing.  Corporations are learning that developing new applications using business rule techniques, though potentially more efficient and productive, is not enough to guarantee acceptable quality in the data.

 

In response, data profiling and data quality vendors have added or planning to add business rule validation capabilities, in some form, to their products.  The solutions incorporate the ability to define a specific type of business rule, called a data rule, and return the number of source rows/records satisfying the rule.

 

This article explores where data rule validation fits within the data profiling methodology, what information should be captured, and how that information can be used to measure the quality of data.

 

Where Data Rules Fit in the Methodology

 

First, let’s define data rule.  Data rules are business rules that can be or are enforced or maintained by the data captured by the organization.  For example, consider the following business rule.

 

“Payments cannot be issued to vendors for research projects not completed by vendors.”

 

Though this rule can be coded in the application, querying the source data identifies whether or not the rule has been enforced in the past.  Also, the data directs what further action, if any, takes place.

 

Data profiling and quality products cannot “infer” most types of data rules.  The analyst must document the rules discovered, classify or categorize them, and enter them into the profiling product (or query tool) for validation.  The only types of data rules that can be inferred are

 

  • Functional Dependencies – which include primary keys, code tables, etc.  (NOTE: Rules enforced by functional dependencies were covered in previous articles in The Data Profiler and are not discussed here.)
  • Foreign Keys – discovered through identifying significant overlap of values between sets of attributes.

 

Data rules are discovered within each profiling dimension.  During Domain Studies, columns are tested against defined valid value sets or value ranges, with the invalid or outlier values documented.  Analysts may also document valid patterns for data values, NULL rules, and rules that embed logic into the column values (e.g., concatenated keys or parsed strings with “meaning”).

 

Other data rules, such as the above example, describe relationships between sets of attributes.  These rules can be declarative or conditional, and can be found in policy manuals, business requirements sessions, and data profiling sessions.  The objective is to document as many rules as possible and determine the best way to represent the rules as queries or coded statements for later validation against source data.

 

Managing the Validation Statistics

 

Most profiling and quality engines use standard languages, such as SQL.  What makes these products far more productive than simpler query tools is that the data profiling products capture and manage not just the documented rule, but the SQL or coded version of the rule, as well as the validation results, in the same logical environment as the rest of the profiling metadata.

 

Once the rule is expressed as a coded or SQL statement, the following statistics should be captured.

 

Statistic

Description

Total Rows in Source Data

A count of rows that exist in the table (or tables) at time of validation.

Count of Applicable Rows

The number of rows that are subject to the data rule itself.  Always a subset of the total rows.  In the example above, only research projects are subject to the rule, not all projects.

Percentage of Applicable Rows

Applicable rows expressed as a percentage of the Total Rows in Source Data.

Count of Valid Rows

The number of applicable rows that satisfy all conditions of the data rule.

Percentage of Valid Rows

Valid rows expressed as a percentage of Applicable Rows (NOT a percentage of total rows).

Valid Row Key Values

The list of primary key values for the valid rows.

Count of Invalid Rows

The number of applicable rows that do not satisfy all conditions of the data rule.

Percentage of Invalid Rows

Invalid rows expressed as a percentage of Applicable rows.

Invalid Row Key Values

The list of primary key values for the invalid rows.

Date Validated

The date that the specific data rule was validated against the source data.

 

The validation statistics serve three purposes.  The first, and most cited, is to identify quantify any data quality issues that currently exist.  The count and percentage of invalid rows is most suited for this.  One way to think of this is the validation of the data against the rule.

 

A second purpose is the opposite, to validate the rule against the data.  Instead of thinking that the rule is correct and the data is wrong, perhaps the rule is wrong and the data is correct.  Is the rule properly expressed?  Could the rule be outdated?  If the invalid row count/percentage is consistently high, or if the count/percentage of applicable rows is consistently low (meaning few rows satisfy the criteria), the rule may no longer apply or another rule has supplanted it.

 

The third purpose of the statistics is to measure the effectiveness of the data quality solution(s) implemented.  If a cleansing routine or new business process is instituted, the data steward/analyst can measure how effective the implementation(s) were over time.  Also, implementing a data quality improvement solution for a specific set of data rules may inadvertently reduce the percentage of valid rows for other data rules.  Validating the rules over time allows the analyst to catch those problems earlier.

 

Trademark Notice All products or company names are used for identification purposes only, and may be trademarks of their respective owners.






Copyright © 2004 Anthem Consulting, LLC. All Rights Reserved.