RealWorld Data Is Dirty - PowerPoint PPT Presentation

1 / 49
About This Presentation
Title:

RealWorld Data Is Dirty

Description:

... over the data set. rmk 06.April.2004. Selection of Keys ... Equational Theory - Example. Two names are spelled nearly identically and have the same address ... – PowerPoint PPT presentation

Number of Views:63
Avg rating:3.0/5.0
Slides: 50
Provided by: davidg70
Category:
Tags: realworld | data | dirty | keys

less

Transcript and Presenter's Notes

Title: RealWorld Data Is Dirty


1
Real-World Data Is Dirty
  • Data Cleansing and the Merge/Purge Problem
  • Hernandez Stolfo Columbia University - 1998

Class Presentation by Rhonda Kost, 06.April 2004
2
TOPICS
  • Introduction
  • A Basic Data Cleansing Solution
  • Test Real World Results
  • Incremental Merge Purge w/ New Data
  • Conclusion
  • Recap

3
Introduction
4
The problem
  • Some corporations acquire large amounts of
    information every month
  • The data is stored in many large databases (DB)
  • These databases may be heterogeneous
  • Variations in schema
  • The data may be represented differently across
    the various datasets
  • Data in these DB may simply be inaccurate

5
Requirement of the analysis
  • The data mining needs to be done
  • Quickly
  • Efficiently
  • Accurately

6
Examples of real-world applications
  • Credit card companies
  • Assess risk of potential new customers
  • Find false identities
  • Match disparate records concerning a customer
  • Mass Marketing companies
  • Government agencies

7
A Basic Data Cleansing Solution
8
Duplicate Elimination
  • Sorted-Neighborhood Method (SNM)
  • This is done in three phases
  • Create a Key for each record
  • Sort records on this key
  • Merge/Purge records

9
SNM Create key
  • Compute a key for each record by extracting
    relevant fields or portions of fields
  • Example

10
SNM Sort Data
  • Sort the records in the data list using the key
    in step 1
  • This can be very time consuming
  • O(NlogN) for a good algorithm,
  • O(N2) for a bad algorithm

11
SNM Merge records
  • Move a fixed size window through the sequential
    list of records.
  • This limits the comparisons to the records in the
    window

12
SNM Considerations
  • What is the optimal window size while
  • Maximizing accuracy
  • Minimizing computational cost
  • Execution time for large DB will be bound by
  • Disk I/O
  • Number of passes over the data set

13
Selection of Keys
  • The effectiveness of the SNM highly depends on
    the key selected to sort the records
  • A key is defined to be a sequence of a subset of
    attributes
  • Keys must provide sufficient discriminating power

14
Example of Records and Keys
15
Equational Theory
  • The comparison during the merge phase is an
    inferential process
  • Compares much more information than simply the
    key
  • The more information there is, the better
    inferences can be made

16
Equational Theory - Example
  • Two names are spelled nearly identically and have
    the same address
  • It may be inferred that they are the same person
  • Two social security numbers are the same but the
    names and addresses are totally different
  • Could be the same person who moved
  • Could be two different people and there is an
    error in the social security number

17
A simplified rule in English
  • Given two records, r1 and r2
  • IF the last name of r1 equals the last name of
    r2,
  • AND the first names differ slightly,
  • AND the address of r1 equals the address of r2
  • THEN
  • r1 is equivalent to r2

18
The distance function
  • A distance function is used to compare pieces
    of data (usually text)
  • Apply distance function to data that differ
    slightly
  • Select a threshold to capture obvious
    typographical errors.
  • Impacts number of successful matches and number
    of false positives

19
Examples of matched records
20
Building an equational theory
  • The process of creating a good equational theory
    is similar to the process of creating a good
    knowledge-base for an expert system
  • In complex problems, an experts assistance is
    needed to write the equational theory

21
Transitive Closure
  • In general, no single pass (i.e. no single key)
    will be sufficient to catch all matching records
  • An attribute that appears first in the key has
    higher discriminating power than those appearing
    after them
  • If an employee has two records in a DB with SSN
    193456782 and 913456782, its unlikely they will
    fall under the same window

22
Transitive Closure
  • To increase the number of similar records merged
  • Widen the scanning window size, w
  • Execute several independent runs of the SNM
  • Use a different key each time
  • Use a relatively small window
  • Call this the Multi-Pass approach

23
Transitive Closure
  • Each independent run of the Multi-Pass approach
    will produce a set of pairs of records
  • Although one field in a record may be in error,
    another field may not
  • Transitive closure can be applied to those pairs
    to be merged

24
Multi-pass Matches
  • Pass 1 (Lastname discriminates)
  • KSNKAT48NRTH789 (Kathi Kason 789912345 )
  • KSNKAT48NRTH879 (Kathy Kason 879912345 )
  • Pass 2 (Firstname discriminates)
  • KATKSN48NRTH789 (Kathi Kason 789912345 )
  • KATKSN48NRTH879 (Kathy Kason 879912345 )
  • Pass 3 (Address discriminates)
  • 48NRTH879KSNKAT (Kathy Kason 879912345 )
  • 48NRTH879SMTKAT (Kathy Smith 879912345 )

25
Transitive Equality Example
  • IF A implies B
  • AND B implies C
  • THEN A implies C
  • From example
  • 789912345 Kathi Kason 48 North St. (A)
  • 879912345 Kathy Kason 48 North St. (B)
  • 879912345 Kathy Smith 48 North St. (C)

26
Test Results
27
Test Environment
  • Test data was created by a database generator
  • Names are randomly chosen from a list of 63000
    real names
  • The database generator provides a large number of
    parameters
  • size of the DB,
  • percentage of duplicates,
  • amount of error

28
Correct Duplicate Detection
29
Time for each run
30
Accuracy for each run
31
Real-World Test
  • Data was obtained from the Office of Children
    Administrative Research (OCAR) of the Department
    of Social and Health Services (State of
    Washington)
  • OCARs goals
  • How long do children stay in foster care?
  • How many different homes do children typically
    stay in?

32
OCARs Database
  • Most of OCARs data is stored in one relation
  • The DB contains 6,000,000 total records
  • The DB grows by about 50,000 records per month

33
Typical Problems in the DB
  • Names are frequently misspelled
  • SSN or birthdays are either missing or clearly
    wrong
  • Case number often changes when the childs family
    moves to another part of the state
  • Some records use service provider names instead
    of the childs
  • No reliable unique identifier

34
OCAR Equational Theory
  • Keys for the independent runs
  • Last Name, First Name, SSN, Case Number
  • First Name, Last Name, SSN, Case Number
  • Case Number, First Name, Last Name, SSN

35
OCAR Results
36
Incremental Merge/Purge w/ New Data
37
Incremental Merge/Purge
  • Lists are concatenated for first time processing
  • Concatenating new data before reapplying the
    merge/purge process may be very expensive in both
    time and space
  • An incremental merge/purge approach is needed
    Prime Representatives method

38
Prime-Representative Definition
  • A set of records extracted from each cluster of
    records used to represent the information in the
    cluster
  • The Cluster Centroid or base element of
    equivalence class

39
Prime-Representative creation
  • Initially, no PR exists
  • After the execution of the first merge/purge
    create clusters of similiar records
  • Correct selection of PR from cluster impacts
    accuracy of results
  • No PR can be the best selection for some clusters

40
3 Strategies for Choosing PR
  • Random Sample
  • Select a sample of records at random from each
    cluster
  • N-Latest
  • Most recent elements entered in DB
  • Syntactic
  • Choose the largest or more complete record

41
Important Assumption
  • No data previously used to select each clusters
    PR will be deleted
  • Deleted records could require restructuring of
    clusters (expensive)
  • No changes in the rule-set will occur after the
    first increment of data is processed
  • Substantial rule change could invalidate
    clusters.

42
Results
  • Cumulative running time for the Incremental
    Merge/Purge algorithm is higher than the classic
    algorithm
  • PR selection methodology could improve cumulative
    running time
  • Total running time of the Incremental Merge/Purge
    algorithm is always smaller

43
Conclusion
44
Cleansing of Data
  • Sorted-Neighborhood Method is expensive due to
  • the sorting phase
  • the need for large windows for high accuracy
  • Multiple passes with small windows followed by
    transitive closure improves accuracy and
    performance for level of accuracy
  • increasing number of successful matches
  • decreasing number of false positives

45
Recap
46
  • 2 major reasons merging large databases becomes
    a difficult problem
  • The databases are heterogeneous
  • The identifiers or strings differ in how they
    are represented within each DB

47
  • The 3 steps in SNM are
  • Creation of key(s)
  • Sorting records on this key
  • Merge/Purge records

48
  • Prime representative - set of records from
    cluster considered to be representative of data
    contained in cluster
  • 3 strategies for selecting a PR
  • Random Sample
  • N-Latest
  • Syntactic

49
Questions
Write a Comment
User Comments (0)
About PowerShow.com