![]() |
|
vol 1, no 1 October 15, 2002 |
|
Brought to you by Anthem Consulting, LLC | Join our Mailing List
|
|
|
|
|
Data Preparation StrategyIntroductionWhen planning a data profiling effort, regardless of the type of project, the process should always begin with preparation of the data sources. The goal of data preparation is to make the data consistent and accessible for accurate and timely, manual or automated profiling. Data preparation also is a good way to scope and plan the project. The project team may also use the data preparation results for future ETL processes.We’ll review how data preparation can assist the project planning effort and discuss developing a data preparation strategy. Source File InventoryTraditionally, project planning has not focused on the business information needed to satisfy the objectives of the project. Business processes performed by the relevant business area were the focus. An innovative approach may be to start with an understanding of the data subject area in which you are working. Identify major information concepts needed and determine initial source data for conversion, migration or integration. This approach will minimize “scope-creep” and may even be a second check on the business processes in and out of scope. A good deliverable for the planning process is an inventory of relevant source systems and source data files. There are two sections to the inventory, source systems and source files. As a starting point, here are two tables that show the minimum needed for a successful inventory. Table 1: System Inventory
Table 2: Source File Inventory
Obviously, there can be far more information included in the inventory, depending on the detail wanted. You may want to specify the file layouts, expected requirements for extraction, and peak/off-peak use of the systems or files. Eventually, all of this information is captured during the data profiling process, but may not be needed during planning. The inventory also helps with planning the time needed to profile the data. Automated profiling engines allow analysts to profile each column or field in an average of about 20-25 minutes, at most, machine time included. Multiplying the number of fields in a file by 25 gives a good estimate as to the elapsed data profiling time needed for a particular source file. Finally, a person can keep track of five concepts at a time, plus or minus two. This fact, plus the amount of time scheduled for analysis, provides the approximate number of analysts needed. Data Preparation StrategyOnce the source inventory is completed, the actual data preparation begins. Though many aspects presented in this section apply exclusively to non-relational source data (VSAM, IMS, flat files, etc.), the strategy also includes relational data. This strategy not only minimizes errors in the data profiling process, it also provides a solid foundation of information for the data analyst at the beginning of the profiling process. The goals of the data preparation strategy include the following.
The first step is determining groups of source files that have strong relationships with each other. For example, “orders” and “order lines.” The extraction process must preserve these relationships for accurate profiling. This would also be the time where full file layouts or DDL are provided. Of course, you may discover that not all source files were identified, based on documented foreign keys. Add those files to the scope as appropriate. The next step in the data preparation process is determining, for each file, how much data to extract, and for all files, in what sequence to extract them. Extracting full files eliminates many considerations, such as which records to extract and when. However, though data profiling solutions can process extremely large amounts of data, considerations such as available disk space, source system access and security, peak/off-peak runtimes, and even project objectives preclude full extracts. The following discussion assumes that subsets of source files are required. Prior experience shows that certain categories of files help determine a sequence and a strategy of preparation. Let’s first define the categories and then show how a sequence can be developed. · Resource or Asset A particular item, object, or asset that is tracked by the organization. For example, customers, products, sales offices, and contracts. May also include descriptive files, as well (customer demographics, product specifications, etc.) · Transaction A file that captures "events" performed on or by Resources that move the resource through its lifecycle. For example, orders, invoices, journal entries, and payments may be classified as Transactions. · Code or Reference File Structures or files that convert Resource classifications to "codes," such as "customer types" and "product families." Generally, a code table contains a code attribute and a description attribute at a minimum. Hierarchies may also be part of this category. · Cross Reference File Files containing two or more similar Resource "identifiers" relating them. For example, a company may have multiple customer IDs. A cross-reference table relating the different IDs may be used. · System Application-specific files used to govern the system, itself. Usually System files do not contain business data, so they may not be profiled. System files are generally not profiled, so we shall ignore them, here. Code tables are profiled and should be extracted in full since they are the basis for domain checking during data profiling. That leaves Resources, Transactions, and Cross-Reference sources. Preservation of relationships between these categories of files is essential. Random extracts of, say, customers (Resource) and invoices (Transaction) won’t preserve the relationship between the files. You’ll have customers without invoices and vice versa. This causes inaccurate results in data profiling (during Redundancy Profiling or Cross-File Dependency Profiling). A more consistent extraction starts with customers and then proceeds with invoices for those customers extracted. Similarly with Cross-Reference files. In general, Resources are extracted before Transactions and Cross-References.
Figure 1: Proposed Source File Extract Sequence This extraction sequence preserves relationships and provides a potentially cleaner way to extract representative samples. Sometimes, however, you may need to create multiple samples that together are representative. These samples may be based on the following. 1. Time frames (peak vs. off-peak transaction times) 2. Classifications (customer types) 3. Geography (countries) 4. Age of data (30 years vs. 5 years) Finally, projects may require similar source files from different systems. For example, a CRM project will integrate multiple customer lists, each from a different source. One of the objectives of profiling these files may be to determine the most accurate, consistent, or “best” source. To do this objectively, the data analyst may want only “common customers” or extract only certain date ranges for added or updated customer records. Regardless, synchronization of disparate files, whether by “id”, date, or selection criteria is necessary. ConclusionDeveloping a strategy for data preparation, like other planning processes, is an essential step in the data profiling process. It streamlines the data extraction process, minimizes errors and development time, and provides an important first look at the data for the analyst. |
Solutions Supporting Data Preparation Most ETL products (too many to name) Evoke Software's VSAM/IMS Importer AMB Dataminer's Flat File Migrator |
|
|
|
|
|
Copyright © 2002 Anthem Consulting, LLC. All Rights Reserved. |