vol 2, no 1
January 15, 2003

Brought to you by Anthem Consulting, LLC



Enterprise Data Forum Special Report

 

From November 4-7, 2002, Wilshire Conferences presented the first annual Enterprise Data Forum (EDF) in Pittsburgh, PA.  Over 300 data specialists attended the different seminars, talks, and sessions.

 

The conference sessions and tutorials were organized into four tracks:

 

  • XML Data
  • Integration
  • Modeling
  • Analytics

 

The speakers included popular and distinguished industry experts presenting a variety of compelling topics and techniques.  Everyone learned something new.  The Keynote speech, given by John Ladley, was entertaining and informative.

 

Of course, some topics were not without controversy.  A session devoted to a Data Management Capability Maturity Model discussed whether such a model could be developed, what it would include, and who or what would administer it.  A panel discussion on data modeling controversies lived up to its name.

 

The next few sections describe some of the sessions I attended.  Most of my time was spent in the Integration track sessions, though I certainly enjoyed talks from the other tracks.

 

Avoiding Catastrophe in Data Integration and ETL

Developing a Master Plan for Managing Your Ever-Growing Data

CMM for Data Management

The Semantic Web

 

Avoiding Catastrophe in Data Integration and ETL

 

The first two days were devoted to full-day tutorials.  On Monday, I attended the tutorial titled “Avoiding Catastrophe in Data Integration and ETL” given by Michael Scofield.  Devoted mainly to domain studies, this session reviewed the importance of data profiling to ETL projects and the techniques a data specialist can use to discover issues and problems with the data.

 

Given Mr. Scofield’s extensive experience in data quality and data profiling as former Director of Data Quality at Experian, I expected that he would present new data profiling techniques.  I was not disappointed.

 

Mr. Scofield justified data profiling by first presenting how a source system’s data architecture evolves and fragments over time.  Since each source system’s architecture evolves differently, bringing this data into a consolidated environment is increasingly complex and difficult.  The technical integration is not the issue.  It’s the semantic integration of the data that causes the most concern.  This semantic integration of data occurs at both the table/file and column/field levels and can be quite complex.

 

Having presented the problem, Mr. Scofield outlined the facets of data quality and how understanding each facet of the data contributes to successful integration.  Each facet was defined, reviewed, and contrasted with the others:

 

  • Completeness
  • Validity
  • Reasonability
  • Accuracy

 

The data profiling specialist must understand the differences between these facets, as well as their implications for data behavior.
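This report does not reproduce Mr. Scofield's definitions, so the sketch below uses common working definitions of three of the facets, all of which are assumptions: completeness as the share of populated values, validity as membership in an allowed domain, and reasonability as falling within a plausible range.  The records, field names, allowed codes, and age range are hypothetical.  Accuracy is left out because it generally requires comparison against an external source of truth.

```python
# Hypothetical records; field names, allowed codes, and the age range are assumptions.
records = [
    {"status": "A", "age": 34},
    {"status": "I", "age": 212},    # populated and numeric, but not a plausible age
    {"status": "Q", "age": 51},     # status code outside the allowed domain
    {"status": None, "age": None},  # missing values
]

ALLOWED_STATUS = {"A", "I"}
AGE_LOW, AGE_HIGH = 0, 120


def populated(records, field):
    """Return the non-missing values of a field."""
    return [r[field] for r in records if r[field] is not None]


def completeness(records, field):
    """Fraction of records in which the field is populated."""
    return len(populated(records, field)) / len(records)


def validity(records, field, allowed):
    """Fraction of populated values that belong to the allowed domain."""
    values = populated(records, field)
    return sum(v in allowed for v in values) / len(values) if values else 0.0


def reasonability(records, field, low, high):
    """Fraction of populated values that fall inside a plausible range."""
    values = populated(records, field)
    return sum(low <= v <= high for v in values) / len(values) if values else 0.0


status_completeness = completeness(records, "status")
status_validity = validity(records, "status", ALLOWED_STATUS)
age_reasonability = reasonability(records, "age", AGE_LOW, AGE_HIGH)

print(f"status completeness: {status_completeness:.2f}")
print(f"status validity:     {status_validity:.2f}")
print(f"age reasonability:   {age_reasonability:.2f}")
```

Keeping the three measures separate shows why the facets should not be confused: the second record is complete and valid in type, yet still unreasonable.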

 

These data quality facets contribute to the techniques used to evaluate the source data.  A data inventory is the first essential step and helps the analyst compile the needed metadata and documentation.  Once the data is inventoried, a variety of techniques are used to evaluate the quality and scope of the source data.

 

From simple value frequency and distribution analysis to complex cross-tabulation and even byte-level investigation, the analyst develops a highly detailed and comprehensive knowledge of the source data.  An analyst can even infer the scope of the data source over time.  Though somewhat manual in nature, the techniques presented could easily be performed with a data mining or reporting product.
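The two simplest techniques named above, value frequency analysis and cross-tabulation, can be sketched in a few lines of Python.  This is a minimal illustration and not the tooling used in the tutorial; the rows and field names are invented for the example.

```python
from collections import Counter

# Hypothetical source rows; field names and values are invented for illustration.
rows = [
    {"state": "PA", "status": "A"},
    {"state": "PA", "status": "A"},
    {"state": "OH", "status": "I"},
    {"state": "pa", "status": "A"},  # inconsistent casing
    {"state": "",   "status": "X"},  # empty state, unexpected status code
]


def value_frequency(rows, field):
    """Count how often each distinct value of a field occurs."""
    return Counter(row[field] for row in rows)


def cross_tab(rows, field_a, field_b):
    """Count how often each pair of values occurs together."""
    return Counter((row[field_a], row[field_b]) for row in rows)


state_counts = value_frequency(rows, "state")
pairs = cross_tab(rows, "state", "status")

# Print each distinct value with its count and share of all rows.
for value, count in state_counts.most_common():
    print(repr(value), count, f"{count / len(rows):.0%}")
```

Even on five rows, the frequency table surfaces the kind of issue a domain study looks for: "PA" and "pa" appear as separate values, and one row has no state at all.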

 

The point of domain studies is to acquire as much knowledge as possible about the underlying source data in order to make fact-based integration, cleansing, and migration decisions.  Michael Scofield did a great job presenting these techniques and processes.

 

 

Developing a Master Plan for Managing Your Ever-Growing Data

 

Dan Linstedt, Principal Consultant for Core Integration Partners, Inc., presented the tutorial discussing management, analytics and reporting of Very/Extremely Large Databases (VLDB/ELDB).  These databases start at about 500 GB and have exceeded 300 TB (terabytes) of data.  The challenges of such large databases are unique and numerous.  Mr. Linstedt presented clear and concise planning solutions for VLDB and ELDB projects.

 

Without detailed up-front planning of a VLDB/ELDB warehouse, organizations run the risk of significant cost overruns, performance degradation, data quality problems, and exponential maintenance cost increases.  The thinking behind these huge databases is quite different from traditional approaches.  Though the ROI can be significant, organizations must develop new skills and approaches if they are to be successful.

 

Such topics as change data capture (CDC), audit trails, bandwidth, Extract Load Transform (ELT), data mining, and Near-Real Time (NRT) feeds were covered in detail during the session.  Query times are not the only thing affected; update windows, performance tuning, backup and recovery, and fault tolerance processes all have to change.  Finally, there are a few available hardware and software platforms for VLDB systems, but again, detailed and constant planning is required.
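Of the topics listed above, change data capture lends itself to a short illustration.  The sketch below shows snapshot comparison, one simple form of CDC in which two extracts keyed by primary key are compared to find inserts, updates, and deletes.  It is not taken from the session, and the customer data is hypothetical; at VLDB scale, comparing full snapshots is usually impractical, and systems more often rely on database logs or timestamps.

```python
def capture_changes(previous, current):
    """Compare two snapshots keyed by primary key.

    Returns three dicts: rows that are new, rows whose contents changed,
    and rows that disappeared since the previous snapshot.
    """
    inserts = {k: current[k] for k in current.keys() - previous.keys()}
    deletes = {k: previous[k] for k in previous.keys() - current.keys()}
    updates = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return inserts, updates, deletes


# Hypothetical customer snapshots: key -> (name, city).
yesterday = {
    101: ("Acme", "Pittsburgh"),
    102: ("Bolt", "Erie"),
    103: ("Crane", "Akron"),
}
today = {
    101: ("Acme", "Pittsburgh"),
    102: ("Bolt", "Cleveland"),  # city changed
    104: ("Dyne", "Toledo"),     # new customer
}

inserts, updates, deletes = capture_changes(yesterday, today)
print("inserts:", inserts)
print("updates:", updates)
print("deletes:", deletes)
```

Only the three changed rows would need to move to the warehouse, which is the reason CDC matters when update windows are tight.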

 

What I found interesting is that the definition of a VLDB system has changed dramatically in the last few years.  The last time I attended a seminar on VLDB was in 1994, and the upper limit on size was 1 TB.  Today, that is a small warehouse, and vendors are investigating and developing multi-petabyte database solutions.

 

Another interesting point is that data quality becomes even more of an issue than before.  There is no opportunity to fix existing quality problems in VLDB systems, and downtime is not acceptable.  Correcting data quality issues must happen before the data is loaded and transformed.

 

CMM for Data Management

 

The Software Engineering Institute (SEI) developed a Capability Maturity Model (CMM) as a mechanism to measure consultants’ abilities to develop software for the US Air Force and Department of Defense.  This effort now includes multiple CMMs for corporations and non-profits.  The objective of this discussion was to see if there was interest in developing a Data Management Maturity Model (DM3).

 

Moderators:     Hal Davis

                        Brett Champlin

                        Peter Aiken

 

The CMM includes a scoring of software development and process performance.  These scoring levels are:

 

1.      Initial – Process is ad-hoc.  Few defined processes.

2.      Repeatable – Basic project management (PM) processes defined.

3.      Defined – Documented PM and development processes followed.

4.      Managed – Detailed, on-going measurement of processes.

5.      Optimizing – Continuous process improvement.

 

The standards needed to maintain a high level of maturity are quite rigorous.  Current scoring of DoD contractors indicates that 27% are below Level 2.  Most contractors are at the entry point of the process.  Only a few companies have any department consistently above Level 3.

 

There is some movement to develop a Data Management Maturity Model.  Some of the components or practice areas under consideration for scoring are:

 

1.      Data Program Coordination

2.      Enterprise Data Integration – Organization-wide vs. ad-hoc approach

3.      Data Stewardship

4.      Data Development – New data sources

5.      Data Support Operations – Backup and Recovery

 

Generally, there didn’t seem to be that much enthusiasm for a DM3.  It was suggested that DAMA should construct and manage the process of scoring.  However, even that didn’t get much support.  The Institute for Data Research (IDR) is surveying some organizations to see what kinds of data management practices are being implemented.  This may contribute to developing an acceptable DM3.  Progress will be slow, however.

 

 

The Semantic Web

 

This panel discussion focused more on the future than most other sessions at EDF.  The Semantic Web is a further evolution of the Web to encompass human-to-human and computer-to-computer interaction.  The types of information and data that the Semantic Web will use are broader than currently used by applications.

 

Moderators:     Brett Champlin

                        William Ruh

                        Dave McComb

 

This next evolution in web technology aims to enable computers to interact and make decisions as much like humans as possible.  For example, one small application of the Semantic Web would be software development.  By finding the proper components and incorporating them successfully, a computer could develop software much as a person would.

 

In order to accomplish this, the nature of the information captured has to change.  Some of the frameworks available for the Semantic Web are:

 

  • XML
  • Resource Description Framework
  • Ontologies (Taxonomies and Inferences)
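To make the frameworks listed above a little more concrete: the Resource Description Framework expresses information as subject, predicate, object statements, and a taxonomy lets software infer facts that were never stated directly.  The sketch below imitates both ideas in plain Python rather than a real RDF library; the class names, predicates, and the invoice identifier are all hypothetical.

```python
# Statements as (subject, predicate, object) triples; all names are hypothetical.
triples = {
    ("Invoice", "subClassOf", "Document"),
    ("Document", "subClassOf", "Record"),
    ("inv-1001", "type", "Invoice"),
    ("inv-1001", "issuedBy", "ExampleCorp"),
}


def superclasses(cls, triples):
    """Return every class reachable from cls by following subClassOf links."""
    found = set()
    frontier = [cls]
    while frontier:
        node = frontier.pop()
        for subject, predicate, obj in triples:
            if subject == node and predicate == "subClassOf" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


def inferred_types(item, triples):
    """Return the stated types of an item plus every superclass of those types."""
    direct = {o for s, p, o in triples if s == item and p == "type"}
    result = set(direct)
    for cls in direct:
        result |= superclasses(cls, triples)
    return result


types = inferred_types("inv-1001", triples)
print(sorted(types))
```

The data states only that inv-1001 is an Invoice; that it is also a Document and a Record is inferred from the taxonomy, which is the kind of reasoning the ontology layer is meant to support.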

 

What would a document or piece of information look like on the Semantic Web?  Think of what happens when you read a document, and which characteristics of that document allow you to reason and make decisions.

 

  • Self describing
  • Hard to forge
  • Issued by a “Trusted Authority”
  • World-wide standard
  • Convertible
  • Easy to understand
  • Machine readable

 

An example of such a document is the US dollar bill.  Though we don’t consciously realize it, each one of the characteristics of a Semantic Web document allows us to use the US dollar bill successfully.  It is this goal that drives the development of the Semantic Web.

 

Now, think of the types of metadata that must be captured for this to work.  Also, how would we manage such metadata?  Traditional approaches may not work.  These and other questions must be answered before the Semantic Web can be successfully implemented.

 




Copyright © 2003 Anthem Consulting, LLC. All Rights Reserved.