![]() |
|
vol 3, no 1 July 15, 2004 |
|
Brought to you by Anthem Consulting, LLC |
Join our Mailing List
|
|
|
|
|
Business Rule Validation
Joseph Novella Introduction With the new focus on business rules and business rule engines, interest in documenting and validating business rules against the data is increasing. Corporations are learning that developing new applications using business rule techniques, though potentially more efficient and productive, is not enough to guarantee acceptable quality in the data. In response, data profiling and data quality vendors have added or planning to add business rule validation capabilities, in some form, to their products. The solutions incorporate the ability to define a specific type of business rule, called a data rule, and return the number of source rows/records satisfying the rule. This article explores where data rule validation fits within the data profiling methodology, what information should be captured, and how that information can be used to measure the quality of data. Where Data Rules Fit in the Methodology First, let’s define data rule. Data rules are business rules that can be or are enforced or maintained by the data captured by the organization. For example, consider the following business rule. “Payments cannot be issued to vendors for research projects not completed by vendors.” Though this rule can be coded in the application, querying the source data identifies whether or not the rule has been enforced in the past. Also, the data directs what further action, if any, takes place. Data profiling and quality products cannot “infer” most types of data rules. The analyst must document the rules discovered, classify or categorize them, and enter them into the profiling product (or query tool) for validation. The only types of data rules that can be inferred are
Data rules are discovered within each profiling dimension. During Domain Studies, columns are tested against defined valid value sets or value ranges, with the invalid or outlier values documented. Analysts may also document valid patterns for data values, NULL rules, and rules that embed logic into the column values (e.g., concatenated keys or parsed strings with “meaning”). Other data rules, such as the above example, describe relationships between sets of attributes. These rules can be declarative or conditional, and can be found in policy manuals, business requirements sessions, and data profiling sessions. The objective is to document as many rules as possible and determine the best way to represent the rules as queries or coded statements for later validation against source data. Managing the Validation Statistics Most profiling and quality engines use standard languages, such as SQL. What makes these products far more productive than simpler query tools is that the data profiling products capture and manage not just the documented rule, but the SQL or coded version of the rule, as well as the validation results, in the same logical environment as the rest of the profiling metadata. Once the rule is expressed as a coded or SQL statement, the following statistics should be captured.
The validation statistics serve three purposes. The first, and most cited, is to identify quantify any data quality issues that currently exist. The count and percentage of invalid rows is most suited for this. One way to think of this is the validation of the data against the rule. A second purpose is the opposite, to validate the rule against the data. Instead of thinking that the rule is correct and the data is wrong, perhaps the rule is wrong and the data is correct. Is the rule properly expressed? Could the rule be outdated? If the invalid row count/percentage is consistently high, or if the count/percentage of applicable rows is consistently low (meaning few rows satisfy the criteria), the rule may no longer apply or another rule has supplanted it. The third purpose of the statistics is to measure the effectiveness of the data quality solution(s) implemented. If a cleansing routine or new business process is instituted, the data steward/analyst can measure how effective the implementation(s) were over time. Also, implementing a data quality improvement solution for a specific set of data rules may inadvertently reduce the percentage of valid rows for other data rules. Validating the rules over time allows the analyst to catch those problems earlier. Trademark Notice All products or company names are used for identification purposes only, and may be trademarks of their respective owners. |
|
|
|
|
|
Copyright © 2004 Anthem Consulting, LLC. All Rights Reserved. |