Title: METADATA for Data Stewards: DQI and the Customer Intelligence Life Cycle
1METADATA for Data StewardsDQI and the Customer
Intelligence Life Cycle
- DAMA
- 9 January 2002
- Columbus, Ohio
- Derek J Strauss
- (with input from Alan Kolber and Larry English)
2Levels of Data
Meta Schema Object, Property, Constraint,
... Schema Entity, Attribute, Relationship,
Screen, Report, Structure, Field, Application,
... Metadata CUSTOMER EMPLOYEE RESOURCE
Cust-Name Emp-Day-Sal-Amt Res-Hm-St
r-Addr Character-32 Integer Variable
Character Not Null 100ltvaluelt3,000 Data Joe
Jones 300 4 Elm Street Real World
3Customer Intelligence Lifecycle
ANALYZE
1. Data Integration Tools
Data Audit
Data Quality
Data Warehouse
ETL
Operational Data Store
Data Discovery
Exploration Warehouse
CAPTURE
PLAN
Data Mart
Data Mart
Data Mining Warehouse
Data Mart
Data Mart
INTERACT
4Customer Intelligence Lifecycle
ANALYZE
2. Analytical Tools
1. Data Integration Tools
Data Audit
Query Reports
Data Quality
OLAP
Data Warehouse
ETL
Data Mining
Operational Data Store
Data Discovery
Exploration Warehouse
CAPTURE
PLAN
Portals
Data Mart
Data Mart
Data Mining Warehouse
Data Mart
Data Mart
INTERACT
5Customer Intelligence Lifecycle
ANALYZE
2. Analytical Tools
1. Data Integration Tools
Data Audit
Query Reports
Data Quality
OLAP
Data Warehouse
ETL
Data Mining
Operational Data Store
Data Discovery
Exploration Warehouse
CAPTURE
PLAN
Portals
Data Mart
Data Mart
Data Mining Warehouse
Data Mart
Data Mart
Campaign Mgt
Segment Analysis
Profitability
Loyalty Analysis
3. Analytical Applications
INTERACT
6Customer Intelligence Lifecycle
ANALYZE
2. Analytical Tools
1. Data Integration Tools
Data Audit
Query Reports
Data Quality
OLAP
Data Warehouse
ETL
Data Mining
Operational Data Store
Data Discovery
Exploration Warehouse
CAPTURE
PLAN
Portals
Data Mart
Data Mart
Data Mining Warehouse
Data Mart
Sales/Stores
Data Mart
Web/ Email
Campaign Mgt
Call Center
Segment Analysis
Wireless
Direct Mail
Profitability
Loyalty Analysis
3. Analytical Applications
4. Interaction Systems
INTERACT
7Customer Intelligence Lifecycle
ANALYZE
2. Analytical Tools
1. Data Integration Tools
Data Audit
Query Reports
Data Quality
OLAP
ETL
Data Mining
Data Discovery
CAPTURE
PLAN
Portals
?
Sales/Stores
Web/ Email
Campaign Mgt
Call Center
Segment Analysis
Wireless
Direct Mail
Profitability
Loyalty Analysis
3. Analytical Applications
4. Interaction Systems
INTERACT
86 Kinds of Data Quality
- Accuracy
- Of Value
- Of Meaning
- Precision
- Level of detail
- Level of granularity
- Completeness
- Consistency
- With other instances
- With other information
- Timeliness
- Consistent at a point in time
- Responsive delivery
- Accessibility
- Security
- Technology
9Metadata for Data Quality
- Applications Inventory
- Data Models
- Logical Data Modeling
- Domain Analysis
- Transformations
10Applications Inventory
- Applications
- People
- Documents
- File Structures
- Platforms - HW,SW (OS,DBMS,Language)
- Software Code
- Screens, Forms, Reports
11Application Inventory
12Applications Inventory Issues
- Useful for many other purposes
- Application Development
- Application Maintenance
- Disaster Recovery
- Other information, including more detail, needed
for other purposes - Platform information tough to model to meet
different needs
13Model Management
14Model Management Issues
- No one tool available to manage same component
across different model types supported by
different CASE tools - Must manage versioning over time
- Must manage versioning across environments
(Requirements Analysis, Development, Test,
Production)
15Sources of Information
- Business Units/Users
- Technical Support
- Forms, Screens, Reports
- Documents
- The Data
16Data Quality Sources of Information
17Data Models
- Source (Legacy) System Physical
- Enterprise/Warehouse Logical
- Target Physical
- Physical to Logical Mapping
18Data Model Types
WHAT (DATA)
Scope
Lists of subjects, containing major
entities Subject Diagrams
(Planner)
Conceptual Data Model (ERM)
TECHNOLOGY INDEPENDENT
(Owner)
Logical Data Model (Normalised to 5NF)
(Designer)
multidimensional relational hierarchical, etc..
(Builder)
Physical Model
DBMS-specific models Data Definition Language
(DDL)
eg Sybase or UDB (UNIX) eg DB2 (Mainframe)
TECHNOLOGY DEPENDENT
(SubContractor)
Functioning System
Files/Tables Fields
19Data Models Needed
Corporate Logical Data Model
Reverse Engineered Physical Data Model
Data Warehouse Logical Data Model
Data Mart Logical Data Model
Data Warehouse Physical Data Model
Data Mart Physical Data Model
20Logical Data Modeling
- Entities
- Relationships
- Attributes
- Rules
- About Entities
- About Relationships
- About Attributes (Domains)
21Logical Data Modeling
22Attribute Purposes
- Identifier
- Names
- ID Numbers
- Descriptor
- Locator
- Geographic (real)
- Electronic (virtual)
- Date/Time
- Classifier
- Codes/Types
- Indicators
- Measure
- Amounts (money)
- Counts (discrete)
- Quantities (continuous)
23Attribute Purposes
24Quality Issues for Attributes
- Classifier
- Accuracy
- Precision
- Consistency
- Measure
- Accuracy
- Precision
- Consistency (UOM)
- Timeliness
- Identifier
- Accuracy
- Precision
- Completeness
- Consistency
- Accessibility
- Descriptor
- Accuracy
- Locator
- Accuracy
- Precision
25Domain Rule Types
- Note All Domains constrained by format
- Other (for IDs, descriptors, and locators)
- Value Expressions (for measures)
- Range Limits
- Derivations
- Value Sets (for classifiers)
- Multiple Symbol Sets
- One Official Symbol Set
26Domain Rules
27Domain Analysis - Value Expressions
28Domain Analysis - Value Sets
29Value Sets Versus Symbol Sets
- Value Sets composed of Values
- U.S. State Alabama, Arkansas
- U.S. State, Territory, Possession Alabama,
District of Columbia, Puerto Rico... - Symbol Sets composed of Symbols
- 4 Character U.S. State ALAB, ARK
- 2 Digit U.S. State 01, 02
- 2 Character U.S. State AL, AKofficial
- 2 Character U.S. State, Territory, Possession
AL, AK, DC, PRofficial
30Logical to Physical Mapping
31Data Modeling Issues
- Tools dont support enough rows of ZFEA
- N-ary versus binary relationships
- Multiple type hierarchies (super/sub-types)
- Domain information not supported well
- Physical to Logical mapping
- Many to one
- Many to many
325 Cleansing Mechanisms
- Fix the data at the source
- Fix the code at the source
- Fix the process at the source
- Fix the metadata
- Transform the data on the way into a data
warehouse
33Source to Target Transformations
- Transformations
- Transformation Components
- Physical Data Elements involved in the various
Transformation Components - Source Elements
- Target Elements
- Reference Elements
- Metrics on the whole transformation process
34Source to Target Transformations
35Data Quality Assessment Audit Process
Implement Data Audit
After Larry English
- Number of Records
- Hash Totals
- Financial Amount Balances
- Reasonableness Checks
- Row counts
- Referential Integrity Checks
Analyze Data Definition Quality
Analyze Data Content Quality
- Conformance to Meaningful
- Enterprise Standards
- Meaningful Business Names
- Clear, Precise, Complete
- Consensus Definition
- Singularity of Definition
- Complete, Exact Definition of
- Domain
- Completeness, Accuracy and
- Usefulness of Integrity Policies
- and Business Rules
- Measure Completeness
- Measure Validity (domain values,
- ranges,derived data)
- Identify Potential Duplicate Records
- Verify Consistency across all Copies
- Verify Dependency Rules are Valid
- Verify Derived Data is Correctly Calculated
- from its Base Data
- Verify Format is Consistent/Complete
- Interpret and Report Results
- Track Over Time to Communicate Progress
- Satisfies Stakeholder Expectations
Implement Data Cleanup Actions
- Produce Data Assessment Report
- Produce Assessment Procedure Report
- Summary Reports
- Exception Reports
- Control Charts
- Pareto Diagrams
- Weight Errors (not all errors are equally bad)
- Recommend Corrective Quality Improvement Actions
364 Key Questions from Users
- Where did this data come from?
- Where did my data go to (so I can find it)?
- Nobody seems to need this data. What would
happen if we get rid of it? - Is this new data clean? (Did the clean-up work
and can I use this data now?)
37(No Transcript)
38Key Assumptions
- ? The framework is a classification schema
- ? The rows represent roles that use information
- ? The columns represent aspects of information
- -- must be understood
- -- must be controlled
- -- can be optimized
- ? The cells represent unique types of
components
39The Repository Model - A Starter Set