Rule Based Data Cleansing for Data Warehousing - PowerPoint PPT Presentation

1 / 37
About This Presentation
Title:

Rule Based Data Cleansing for Data Warehousing

Description:

Infocentric - Business Intelligence and Knowledge Systems Consulting ... Trust in the warehouse as a reliable source of business information is diminished. ... – PowerPoint PPT presentation

Number of Views:21
Avg rating:3.0/5.0
Slides: 38
Provided by: davidlw150
Category:

less

Transcript and Presenter's Notes

Title: Rule Based Data Cleansing for Data Warehousing


1
Rule Based Data Cleansingfor Data Warehousing
David L. Wells Infocentric - Business
Intelligence and Knowledge Systems
Consulting (425) 503-4352 / dlwells_at_sprynet.com
2
Role of Data Cleansing in a Warehouse Environment
Data Warehouse
3
A Real Life Example
2 million claims per month. 377 data items per
claim. 9 items with cleansing rules.
Warehouse users sometimes receive incomplete or
inaccurate information.
Business decisions may be adversely affected.
368 data items with no rules for validation and
cleansing
An error rate of of .001 means 736,000 errors per
month -- more than 8.8 million per year!
DW
Trust in the warehouse as a reliable source of
business information is diminished.
And those errors are compounded and propagated in
data marts!
4
Rule Based Data Cleansing Options
Audit, filter, and correct data using rules for
cleansing of source data
  • Find rules in source data models
  • package as executable components
  • embed in data acquisition processes

Audit, filter, and correct data using rules for
cleansing of warehouse data
DW
  • Find rules in warehouse data models
  • package as executable components
  • embed in warehouse population processes

5
Recent Statistics on Data Quality Plans
  • In a recent TDWI study, 175 data warehousing
    professionals responded to the question Do you
    have a data quality plan for your data
    warehouse? Survey results showed that
  • 7 have no current plan
  • 24 recognize the importance but had no current
    plan
  • 30 recognize the importance, have a plan, but
    not yet implemented
  • 22 have a plan and have started implementation
  • 17 have implemented a solution
  • 31 of respondents have no quality plan!
  • 83 do not have an implemented solution!

Source TDWI Flashpoint July 21, 1999
6
Data Cleansing Issues
What are data cleansing rules?
Business-based rules for data integrity, combined
with known actions to be taken when integrity
violations are detected.
Extract
Transform
Where do I find them?
Load
Integrity rules can be found in the data models
both source warehouse data models contain
rules. Analysis is necessary to determine
actions upon violation.
Secondary Extracts, Transforms, Loads
And how do I use them?
Data auditing, data filtering, and data
correction are all possible once the set of data
cleansing rules is known.
Dependent Data Marts
7
A Classification of Data Integrity Rules
Data Integrity Rules
8
An Example Data Model
9
Identity Rules
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-ssn - - -
position-number posn-description posn-percent-of-f
ulltime - - -
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
10
Reference Rules
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-ssn - - -
position-number posn-description posn-percent-of-f
ulltime - - -
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
11
Cardinal Rules
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-ssn - - -
position-number posn-description posn-percent-of-f
ulltime - - -
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
12
Value Set Rules
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-ssn - - -
position-number posn-description posn-percent-of-f
ulltime - - -
position filled by
employee administers
position administered by
  • list of allowable values
  • range of allowable values
  • allowable character set
  • pattern mask
  • combination of above

account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
13
Inheritance Rules
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-ssn - - -
position-number posn-description posn-percent-of-f
ulltime - - -
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
14
Relationship Dependency
State Dependent
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-ssn - - -
position-number posn-description posn-percent-of-f
ulltime posn-status-code
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
15
Relationship Dependency
Mutually Dependent
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-ssn - - -
position-number posn-description posn-percent-of-f
ulltime posn-status-code
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
16
Relationship Dependency
Mutually Exclusive
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-ssn - - -
position-number posn-description posn-percent-of-f
ulltime posn-status-code
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
17
Attribute Dependency
State Dependent
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-status-code empl-separa
tion-date
position-number posn-description posn-percent-of-f
ulltime posn-status-code
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
18
Attribute Dependency
Mutually Exclusive
EMPLOYEE
POSITION
employee holds
employee-id empl-percent-of-fulltime empl-hourly-p
ay-rate empl-monthly-pay-rate
position-number posn-description posn-percent-of-f
ulltime posn-status-code
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
19
Attribute Dependency
Mutually Dependent - Derived
EMPLOYEE
POSITION
employee holds
employee-id empl-percent-of-fulltime empl-fulltime
-salary empl-monthly-pay-rate
position-number posn-description posn-percent-of-f
ulltime posn-status-code
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
20
Attribute Dependency
Mutually Dependent - Constrained
EMPLOYEE
POSITION
employee holds
employee-id empl-percent-of-fulltime empl-initial-
hire-date empl-separation-date
position-number posn-description posn-percent-of-f
ulltime posn-status-code
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
21
Attribute Dependency
Mutually Dependent - Constrained (across
relationship)
EMPLOYEE
POSITION
employee holds
employee-id empl-percent-of-fulltime empl-initial-
hire-date empl-separation-date
position-number posn-description posn-percent-of-f
ulltime posn-status-code
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
22
The Number of Possible Rules
This small data model yields . . .
  • 3 identity rules - 1 for each non-subtype
    entity
  • 6 reference rules - 1 per 1many relationship
    and 2 per manymany relationship
  • 4 cardinal rules - 1 per cardinality (except
    zero-many)
  • 13 value set rules - at least 1 per attribute
  • 6 inheritance rules - 2 per supertype and 2 per
    subtype
  • 32 explicit rules (and we saw 8 dependency rules
    in the examples)
  • How many rules are possible in your warehouse?
    How many are applied?

23
Applying Rules for Data Cleansing
Extract
SOURCE CLEANSING
Transform
Load
TARGET CLEANSING
DW
24
Choosing Data Cleansing Techniques
Audit Tests data against integrity rules and
individually and/or statistically reports rule
violations. Helps to know size and scope of
integrity problems, but does not fix the
problems. Filter Tests data against integrity
rules and removes data that violates the rules
from the set of data used to populate the data
warehouse. May result in a warehouse with
incomplete information. Correct Tests data
against integrity rules and repairs data found
to be in violation. Repairs may involve finding
an alternate data source, deriving a value from
other data, using a default value, etc.
25
Data Auditing - A Continuous Process
  • Auditing the data is a continuous process
    expected to occur in many places

Implementation
26
Data Auditing Techniques - Noting Integrity
Violations
Documentation or Metadata
Fields added to DW tables
new DW tables
Location
View quality info together with data
Simple Approach
Support for extensive detail
Pros
Quality info separated from the data
Static with limited detail
Cons
Complexity
27
Data Auditing - Summary
  • Auditing is an essential first step to data
    quality
  • Problem scope cannot be assessed until the
    problem is understood
  • Solutions cannot be appropriately designed until
    the problem is understood
  • Auditing validates data models used to define
    integrity and cleansing rules

28
Data Filtering - Techniques
  • Filtering can be performed at multiple levels
  • Element Level - Setting null values for
    individual elements
  • Row Level - Removing an entire row from a target
    load set
  • Data Group - Removing rows from multiple related
    target load sets

29
Data Filtering - Summary
CAUTION!
  • Filtering must be used with CAUTION!
  • Removing data from a load set may decrease data
    quality
  • Generally, an ineffective technique when used
    with staging areas or atomic warehouses
  • More applicable for data marts

30
Data Correction - Techniques
  • Correction alters the specific value of
    individual fields
  • Replacement value techniques include
  • Inserting a default value
  • Translating based on error patterns
  • Deriving a value based on surrounding contents
  • Searching for alternative sources

31
Data Correction - Summary
  • Corrections are often forced into the data
    warehousing space
  • Corrections may be made close to the point of
    entry for global changes
  • Other corrections may be better suited for
    specific data marts
  • Always consider corrections to the source systems

32
Choosing Data Cleansing Techniques
Source Data Cleansing requires source data
models typically large number of rules implicit
rules difficult to determine fixes problems at
the source one fix applies to many targets may
require operational system changes
Target Data Cleansing rules from target data
models typically smaller number of rules implicit
rules readily identified may only fix
symptoms multiple targets redundant
cleansing avoids operational systems issues
33
Choosing Data Cleansing Techniques
near source
Completeness of Solution
Cost and Complexity
near target
filter
audit
correct
34
Recalling the Example
2 million claims per month. 377 data items per
claim. 9 items with cleansing rules.
368 data items with no rules for validation and
cleansing
An error rate of of .001 means 736,000 errors per
month -- more than 8.8 million per year!
DW
And those errors are compounded and propagated in
data marts!
35
Applying Rule Based Cleansing to the Example
Identify data integrity rules (IRs) and package
them as modules or objects.
Audit after extract and begin to monitor source
data quality.
DW
Correct and filter at transformation. Using
results of monitoring to address most severe
problems first.
Finally, consider using IRs to correct quality
problems at the source.
36
Summary of Key Points
  • Data quality directly affects data warehouse
    acceptance.
  • Source data is typically dirty and must be
    cleaned for warehousing.
  • Data models are a good source of data integrity
    rules.
  • Seven kinds of integrity rules are either
    expressed or implied by entity- relationship data
    models.
  • Both source and target data models yield
    integrity rules.
  • When combined with actions to be taken upon
    violation, data integrity rules become data
    cleansing rules.
  • Both source data and target data may be cleansed.
  • Auditing reports data integrity violations but
    does not fix them.
  • Filtering discards data that violates integrity
    rules.
  • Correction repairs data that violates integrity
    rules.
  • Choosing a technique depends on many factors
    including severity of the problem, available data
    models, and commitment of time resources.
  • Whatever techniques you choose, a systematic
    rule-based approach yields better results than an
    unstructured approach.

37

Thank You !
Please feel free to contact me with questions and
comments, or to share your experiences with this
or other data cleansing techniques
David L. Wells Infocentric - Business
Intelligence and Knowledge Systems
Consulting (425) 503-4352 / dlwells_at_sprynet.com
Write a Comment
User Comments (0)
About PowerShow.com